123456789101112131415161718192021222324252627282930313233343536373839404142434445464748495051525354555657585960616263646566676869707172737475767778798081828384858687888990919293949596979899100101102103104105106107108109110111112113114115116117118119120121122123124125126127128129130131132133134135136137138139140141142143144145146147148149150151152153154155156157158159160161162163164165166167168169170171172173174175176177178179180181182183184185186187188189190191192193194195196197198199200 |
- =====================================
- RFC for a persistence layer for bcolz
- =====================================
- :Author: Francesc Alted
- :Contact: francesc@blosc.org
- :Version: 0.1 (August 19, 2012)
- The original bcolz container (up to version 0.4) consisted on
- basically a list of compressed in-memory blocks. This document
- explains how to extend it to allow to store the data blocks on disk
- too.
- The goals of this proposal are:
- 1. Allow to work with data directly on disk, exactly on the same way
- than data in memory.
- 2. Must support the same access capabilities than bcolz objects
- including: append data, modifying data and direct access to data.
- 3. Transparent data compression must be possible.
- 4. User metadata addition must be possible too.
- 5. The data should be easily 'shardeable' for optimal behaviour in a
- distributed storage environment.
- This, in combination with a distributed filesystem, and combined with
- a system that is aware of the physical topology of the
- underlying storage media would allow to almost replace the need for
- a distributed infrastructure for data (e.g. Disco/Hadoop).
- The layout
- ==========
- For every dataset, it will be created a directory, with a
- user-provided name that, for generality, we will call it `root` here.
- The root will have another couple of subdirectories, named data and
- meta::
- root (the name of the dataset)
- / \
- data meta
- The `data` directory will contain the actual data of the dataset,
- while the `meta` will contain the metainformation (dtype, shape,
- chunkshape, compression level, filters...).
- The `data` layout
- -----------------
- Data will be stored by what is called a `superchunk`, and each
- superchunk will use exactly one file. The size of each superchunk
- will be decided automatically by default, but it could be specified by
- the user too.
- Each of these directories will contain one or more superchunks for
- storing the actual data. Every data superchunk will be named after
- its sequential number. For example::
- $ ls data
- __1__.bin __2__.bin __3__.bin __4__.bin ... __1030__.bin
- This structure of separate superchunk files allows for two things:
- 1. Datasets can be enlarged and shrinked very easily
- 2. Horizontal sharding in a distributed system is possible (and cheap!)
- At its time, the `data` directory might contain other subdirectories
- that are meant for storing components for a 'nested' dtype (i.e. an
- structured array, stored in column-wise order)::
- data (the root for a nested datatype)
- / \ \
- col1 col2 col3
- / \
- sc1 sc3
- This structure allows for quick access to specific chunks of columns
- without a need to load the complete data in memory.
- The `superchunk` layout
- ~~~~~~~~~~~~~~~~~~~~~~~
- The superchunk is made of a series of data chunks put together using
- the Blosc metacompressor by default. Blosc being a metacompressor,
- means that it can use different compressors and filters, while
- leveraging its blocking and multithreading capabilities.
- The layout of binary superchunk data files looks like this::
- |-0-|-1-|-2-|-3-|-4-|-5-|-6-|-7-|-8-|-9-|-A-|-B-|-C-|-D-|-E-|-F-|
- | b l p k | ^ | ^ | ^ | ^ | chunk-size | last-chunk |
- | | | |
- version ----+ | | |
- options --------+ | |
- checksum ------------+ |
- typesize ----------------+
- |-0-|-1-|-2-|-3-|-4-|-5-|-6-|-7-|-8-|-9-|-A-|-B-|-C-|-D-|-E-|-F-|
- | nchunks | RESERVED |
- The magic 'blpk' signature is the same than the bloscpack_ format.
- The new version (2) of the format will allow to include indexes
- (offsets to where the data chunks begin) and checksums (probably using
- the adler32 algorithm or similar).
- .. _blosckpack: https://github.com/esc/bloscpack/blob/feature/new_format/header_rfc.rst
- After the above header, it will follow index data and the actual data
- in blosc chunks::
- |-bloscpack-header-|-offset-|-offset-|...|-chunk-|-chunk-|...|
- The index part above stores the offsets where each chunk starts, so it
- is is easy to access the different chunks in the superchunk file.
- CAVEAT: The bloscpack format is still evolving, so don't trust on
- forward compatibility of the format, at least until 1.0, where the
- internal format will be declared frozen.
- And each blosc chunk has this format (Blosc 1.0 on)::
- |-0-|-1-|-2-|-3-|-4-|-5-|-6-|-7-|-8-|-9-|-A-|-B-|-C-|-D-|-E-|-F-|
- ^ ^ ^ ^ | nbytes | blocksize | ctbytes |
- | | | |
- | | | +--typesize
- | | +------flags
- | +----------blosclz version
- +--------------blosc version
- At the end of each blosc chunk some empty space could be added in
- order to allow the modification of some data elements inside each
- block. The reason for the additional space is that, as these chunks
- will be typically compressed, when modifying some element of the chunk
- it is not guaranteed that it will fit in the same space than the old
- data chunk. Having this provision of small empty space at the end of
- each chunk will allow for storing the modifyed chunks in many cases,
- without a need to save the entire superchunk on a different part of
- the disk.
- The `meta` files
- ----------------
- Here there can be as many files as necessary. The format for every
- file will tentatively be YAML (although initial implementations are
- using JSON). There should be (at least) three files:
- The `sizes` file
- ~~~~~~~~~~~~~~~~
- This contains the shape and compressed and uncompressed sizes of the
- dataset. For example::
- $ cat meta/sizes
- shape: (5000000000,)
- nbytes: 5000000000
- cbytes: 24328038
- The `storage` file
- ~~~~~~~~~~~~~~~~~~
- Here comes the information about how data has to be stored and its
- meaning. Example::
- dtype:
- col1: int8
- col2: float32
- chunkshape: (30, 20)
- superchunksize: 10 # max. number of chunks in a single file
- endianness: big # default: little
- order: C # default: C
- compression:
- library: blosclz # could be zlib, fastlz or others
- level: 5
- filters: [shuffle, truncate] # order matters
- The `attributes` file
- ~~~~~~~~~~~~~~~~~~~~~
- In this file it comes additional user information. Example::
- temperature:
- value: 23.5
- type: scalar
- dtype: float32
- pressure:
- value: 225.5
- type: scalar
- dtype: float32
- ids:
- value: [1,3,6,10]
- type: array
- dtype: int32
- More files could be added for providing other kind of meta-information
- about data (read indexes, masks...).
|