=====================================
RFC for a persistence layer for bcolz
=====================================

:Author: Francesc Alted
:Contact: francesc@blosc.org
:Version: 0.1 (August 19, 2012)

The original bcolz container (up to version 0.4) consisted basically
of a list of compressed in-memory blocks.  This document explains how
to extend it so that the data blocks can be stored on disk too.

The goals of this proposal are:

1. Allow working with data directly on disk, in exactly the same way
   as with data in memory.

2. Support the same access capabilities as in-memory bcolz objects,
   including appending data, modifying data and direct access to data.

3. Transparent data compression must be possible.

4. Adding user metadata must be possible too.

5. The data should be easily 'shardable' for optimal behaviour in a
   distributed storage environment.

This, in combination with a distributed filesystem and a system that
is aware of the physical topology of the underlying storage media,
would almost remove the need for a separate distributed
infrastructure for data (e.g. Disco/Hadoop).

The layout
==========

For every dataset, a directory will be created with a user-provided
name that, for generality, we will call `root` here.  The root will
have two subdirectories, named `data` and `meta`::

        root  (the name of the dataset)
        /  \
     data  meta

The `data` directory will contain the actual data of the dataset,
while the `meta` directory will contain the metainformation (dtype,
shape, chunkshape, compression level, filters...).

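As a minimal sketch of the layout above, the following creates the two
subdirectories for a new dataset.  The function name
`create_dataset_dirs` and the dataset name `mydataset` are hypothetical
and only serve the illustration; this RFC does not prescribe any API.

```python
import os
import tempfile


def create_dataset_dirs(root):
    """Create the ``data`` and ``meta`` subdirectories under ``root``.

    Hypothetical helper: the RFC only fixes the directory names.
    """
    for sub in ("data", "meta"):
        os.makedirs(os.path.join(root, sub), exist_ok=True)
    return root


# Usage: build the skeleton of a dataset inside a temporary directory.
base = tempfile.mkdtemp()
root = create_dataset_dirs(os.path.join(base, "mydataset"))
print(sorted(os.listdir(root)))
```
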
The `data` layout
-----------------

Data will be stored in units called `superchunks`, and each
superchunk will use exactly one file.  The size of each superchunk
will be decided automatically by default, but it could be specified by
the user too.

Each `data` directory will contain one or more superchunks for
storing the actual data.  Every data superchunk will be named after
its sequential number.  For example::

    $ ls data
    __1__.bin  __2__.bin  __3__.bin  __4__.bin ... __1030__.bin

This structure of separate superchunk files allows for two things:

1. Datasets can be enlarged and shrunk very easily.

2. Horizontal sharding in a distributed system is possible (and cheap!).

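Because the files are named after a sequence number, a plain
lexicographic sort would place `__10__.bin` before `__2__.bin`.  The
sketch below builds and parses names following the pattern shown in
the listing above and sorts them numerically.  The helper names are
hypothetical, not part of any existing bcolz API.

```python
import re

# Pattern taken from the listing above: __<sequence number>__.bin
_SUPERCHUNK_RE = re.compile(r"^__(\d+)__\.bin$")


def superchunk_name(seq):
    """Return the file name for the superchunk with sequence number seq."""
    return "__%d__.bin" % seq


def superchunk_seq(filename):
    """Return the sequence number encoded in a superchunk file name."""
    match = _SUPERCHUNK_RE.match(filename)
    if match is None:
        raise ValueError("not a superchunk file: %r" % filename)
    return int(match.group(1))


def sorted_superchunks(filenames):
    """Keep only superchunk files and sort them by sequence number."""
    names = [f for f in filenames if _SUPERCHUNK_RE.match(f)]
    return sorted(names, key=superchunk_seq)


names = ["__10__.bin", "__2__.bin", "__1__.bin", "notes.txt"]
print(sorted_superchunks(names))
```

Enlarging a dataset then amounts to writing the file returned by
`superchunk_name(last + 1)`, and shrinking it to removing the file
with the highest sequence number.
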
In turn, the `data` directory might contain other subdirectories
that are meant for storing the components of a 'nested' dtype (i.e. a
structured array, stored in column-wise order)::

        data  (the root for a nested datatype)
        /  \     \
     col1  col2  col3
          /  \
        sc1  sc3

This structure allows for quick access to specific chunks of columns
without a need to load the complete data in memory.

The `superchunk` layout
~~~~~~~~~~~~~~~~~~~~~~~

The superchunk is made of a series of data chunks put together using
the Blosc metacompressor by default.  Being a metacompressor, Blosc
can use different compressors and filters while leveraging its
blocking and multithreading capabilities.

The layout of binary superchunk data files looks like this::

    |-0-|-1-|-2-|-3-|-4-|-5-|-6-|-7-|-8-|-9-|-A-|-B-|-C-|-D-|-E-|-F-|
    | b   l   p   k | ^ | ^ | ^ | ^ |   chunk-size  |  last-chunk   |
                      |   |   |   |
          version ----+   |   |   |
          options --------+   |   |
         checksum ------------+   |
         typesize ----------------+

    |-0-|-1-|-2-|-3-|-4-|-5-|-6-|-7-|-8-|-9-|-A-|-B-|-C-|-D-|-E-|-F-|
    |            nchunks            |            RESERVED           |

The magic 'blpk' signature is the same as in the bloscpack_ format.
The new version (2) of the format will allow including indexes
(offsets to where the data chunks begin) and checksums (probably using
the adler32 algorithm or similar).

.. _bloscpack: https://github.com/esc/bloscpack/blob/feature/new_format/header_rfc.rst

After the above header come the index data and the actual data in
blosc chunks::

    |-bloscpack-header-|-offset-|-offset-|...|-chunk-|-chunk-|...|

The index part above stores the offsets where each chunk starts, so it
is easy to access the different chunks in the superchunk file.

CAVEAT: The bloscpack format is still evolving, so do not rely on
forward compatibility of the format, at least until 1.0, when the
internal format will be declared frozen.

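To make the 32-byte superchunk header more concrete, here is a sketch
that packs and unpacks it with the standard `struct` module.  The
field order and the magic string come from the diagram above; the
little-endian byte order, the unsigned integer types and the 8-byte
width of `nchunks` are assumptions made for the illustration, since
the format is still evolving.

```python
import struct
from collections import namedtuple

# Assumed encoding: little-endian, unsigned fields, 8-byte nchunks and
# 8 reserved bytes.  Only the field order comes from the diagram.
HEADER_FMT = "<4sBBBBIIQ8s"
HEADER_SIZE = struct.calcsize(HEADER_FMT)  # 32 bytes
MAGIC = b"blpk"

Header = namedtuple(
    "Header",
    "version options checksum typesize chunk_size last_chunk nchunks",
)


def pack_header(header):
    """Serialize a Header into its 32-byte on-disk form."""
    return struct.pack(
        HEADER_FMT, MAGIC, header.version, header.options,
        header.checksum, header.typesize, header.chunk_size,
        header.last_chunk, header.nchunks, b"\x00" * 8,
    )


def unpack_header(buf):
    """Parse the first 32 bytes of a superchunk file."""
    fields = struct.unpack(HEADER_FMT, buf[:HEADER_SIZE])
    if fields[0] != MAGIC:
        raise ValueError("not a blpk file: bad magic %r" % fields[0])
    # Drop the magic (first field) and the reserved bytes (last field).
    return Header(*fields[1:8])


# Usage: round-trip a header describing 10 chunks of 1 MiB each.
original = Header(version=2, options=0, checksum=1, typesize=4,
                  chunk_size=1 << 20, last_chunk=1 << 20, nchunks=10)
raw = pack_header(original)
print(len(raw), unpack_header(raw) == original)
```
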
Each blosc chunk has this format (Blosc 1.0 onwards)::

    |-0-|-1-|-2-|-3-|-4-|-5-|-6-|-7-|-8-|-9-|-A-|-B-|-C-|-D-|-E-|-F-|
      ^   ^   ^   ^ |     nbytes    |   blocksize   |    ctbytes    |
      |   |   |   |
      |   |   |   +--typesize
      |   |   +------flags
      |   +----------blosclz version
      +--------------blosc version

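The 16-byte chunk header above can be decoded with a few lines of
Python.  This is a sketch using only the standard library; the three
32-bit sizes are stored in little-endian order in Blosc 1.x.  The
sample bytes below are synthetic, not taken from a real chunk.

```python
import struct
from collections import namedtuple

BloscHeader = namedtuple(
    "BloscHeader",
    "version versionlz flags typesize nbytes blocksize ctbytes",
)


def parse_blosc_header(chunk):
    """Decode the first 16 bytes of a Blosc 1.x chunk."""
    # Four single-byte fields followed by three little-endian uint32.
    return BloscHeader(*struct.unpack("<BBBBIII", chunk[:16]))


# Usage with a synthetic header: 4000 uncompressed bytes stored in
# 1016 compressed bytes (header included), typesize 4.
sample = struct.pack("<BBBBIII", 2, 1, 1, 4, 4000, 4000, 1016)
header = parse_blosc_header(sample)
print(header.nbytes, header.ctbytes, header.typesize)
```
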
At the end of each blosc chunk some empty space could be added in
order to allow the modification of some data elements inside each
block.  The reason for the additional space is that, as these chunks
will typically be compressed, a chunk in which some element has been
modified is not guaranteed to fit in the same space as the old data
chunk.  Having this provision of a small empty space at the end of
each chunk will allow for storing the modified chunks in place in many
cases, without a need to save the entire superchunk on a different
part of the disk.

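The reasoning above boils down to a simple size check, sketched below.
The 10% slack and the 16-byte alignment are purely hypothetical
values chosen for the example; this RFC does not fix how much empty
space is reserved.

```python
def slot_size(compressed_len, slack_percent=10, align=16):
    """Bytes reserved on disk for a chunk of compressed_len bytes.

    Hypothetical policy: add slack_percent of extra room, then round
    up to a multiple of align.
    """
    padded = compressed_len + compressed_len * slack_percent // 100
    return -(-padded // align) * align  # ceiling division


def fits_in_place(slot, new_compressed_len):
    """True if the recompressed chunk can overwrite the old one."""
    return new_compressed_len <= slot


# Usage: a chunk that compressed to 1000 bytes gets a 1104-byte slot.
slot = slot_size(1000)
print(slot, fits_in_place(slot, 1050), fits_in_place(slot, 1200))
```

When `fits_in_place` returns false, the chunk has to be stored
elsewhere and the offset index updated accordingly, which is the
expensive case the padding tries to make rare.
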
The `meta` files
----------------

The `meta` directory can hold as many files as necessary.  The format
for every file will tentatively be YAML (although initial
implementations are using JSON).  There should be (at least) three
files:

The `sizes` file
~~~~~~~~~~~~~~~~

This contains the shape and compressed and uncompressed sizes of the
dataset.  For example::

    $ cat meta/sizes
    shape: (5000000000,)
    nbytes: 5000000000
    cbytes: 24328038

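As an illustration of how these values can be used, the sketch below
parses the example above and derives the compression ratio.  It reads
the simple `key: value` lines by hand with the standard library
instead of a YAML parser, which is an assumption made to keep the
example self-contained; `parse_sizes` is a hypothetical helper.

```python
import ast


def parse_sizes(text):
    """Parse 'key: value' lines whose values are Python literals."""
    sizes = {}
    for line in text.splitlines():
        line = line.strip()
        if not line:
            continue
        key, _, value = line.partition(":")
        sizes[key.strip()] = ast.literal_eval(value.strip())
    return sizes


SIZES = """\
shape: (5000000000,)
nbytes: 5000000000
cbytes: 24328038
"""

sizes = parse_sizes(SIZES)
ratio = sizes["nbytes"] / sizes["cbytes"]
print("compression ratio: %.1fx" % ratio)
```
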
The `storage` file
~~~~~~~~~~~~~~~~~~

This file describes how the data is stored and how it has to be
interpreted.  Example::

    dtype:
      col1: int8
      col2: float32
    chunkshape: (30, 20)
    superchunksize: 10         # max. number of chunks in a single file
    endianness: big            # default: little
    order: C                   # default: C
    compression:
      library: blosclz         # could be zlib, fastlz or others
      level: 5
      filters: [shuffle, truncate]  # order matters

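The `superchunksize` entry is what lets a reader map a chunk number to
a file.  The sketch below assumes that chunks fill superchunks
sequentially and that files are numbered from 1, as in the `ls data`
listing shown earlier; neither point is stated explicitly in this RFC,
and `locate_chunk` is a hypothetical helper.

```python
def locate_chunk(chunk_index, superchunksize):
    """Map a 0-based chunk index to (file name, position in the file).

    Assumes sequential filling and 1-based superchunk numbering.
    """
    seq, position = divmod(chunk_index, superchunksize)
    return "__%d__.bin" % (seq + 1), position


# Usage with superchunksize: 10, as in the example above.
print(locate_chunk(0, 10))
print(locate_chunk(25, 10))
```
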
The `attributes` file
~~~~~~~~~~~~~~~~~~~~~

This file holds additional user information.  Example::

    temperature:
      value: 23.5
      type: scalar
      dtype: float32
    pressure:
      value: 225.5
      type: scalar
      dtype: float32
    ids:
      value: [1,3,6,10]
      type: array
      dtype: int32

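Since initial implementations use JSON for the `meta` files, the same
attributes could be serialized as sketched below.  The nesting mirrors
the example above; the choice of `indent` and `sort_keys` is only a
suggestion to keep the file readable and stable across rewrites.

```python
import json

attributes = {
    "temperature": {"value": 23.5, "type": "scalar", "dtype": "float32"},
    "pressure": {"value": 225.5, "type": "scalar", "dtype": "float32"},
    "ids": {"value": [1, 3, 6, 10], "type": "array", "dtype": "int32"},
}

# Serialize to text (what would be written to meta/attributes) and
# read it back to check that nothing is lost in the round trip.
text = json.dumps(attributes, indent=2, sort_keys=True)
restored = json.loads(text)
print(restored == attributes)
```
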
More files could be added to provide other kinds of meta-information
about the data (read indexes, masks...).