==========================================
The persistence layer for bcolz 1.x series
==========================================

:Author: Francesc Alted
:Contact: francesc@blosc.org
:Version: 1.0 (March 8, 2016)

This document explains how data is stored in the bcolz format.

The goals of this format are:

1. Allow working with datasets (carray/ctable) directly on disk,
   in exactly the same way as with data in memory.

2. Support the same access capabilities as carray/ctable
   objects, including appending data, modifying data and direct access
   to data.

3. Transparent data compression must be possible.

4. The data should be easily 'shardable' for optimal behaviour in a
   distributed storage environment.

5. User metadata addition must be possible too.

The layout
==========

For every dataset, a directory is created with a user-provided name
that, for generality, we will call `root` here. The root has two
subdirectories, named `data` and `meta`::

        root  (the name of the dataset)
        /  \
     data  meta

The `data` directory contains the actual data of the dataset, while
the `meta` directory contains the meta-information (dtype, shape,
chunkshape, compression level, filters...).

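As a minimal sketch of the layout above, the following Python snippet creates an empty `root` with its `data` and `meta` subdirectories in a temporary location. The helper name `make_skeleton` and the dataset name `mydataset` are hypothetical and only serve the illustration; this is not how bcolz itself creates datasets.

```python
import tempfile
from pathlib import Path


def make_skeleton(root):
    """Create the empty root/data and root/meta directories (hypothetical helper)."""
    root = Path(root)
    for sub in ("data", "meta"):
        (root / sub).mkdir(parents=True, exist_ok=True)
    return root


with tempfile.TemporaryDirectory() as tmp:
    root = make_skeleton(Path(tmp) / "mydataset")
    # List the two subdirectories that every dataset root is expected to have.
    print(sorted(p.name for p in root.iterdir()))
```
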
The `data` layout
-----------------

Data is stored in data blocks called `superchunks`, and each
superchunk uses exactly one file. The size of each superchunk
is decided automatically by default, but it can also be specified by
the user.

Each `data` directory contains one or more chunks for storing
the actual data. Every data chunk is named after its sequential
number. For example::

    $ ls data
    __0.blp  __1.blp  __2.blp  __3.blp  ...  __1030.blp

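Because the files are named after their sequential number, a plain lexicographic sort would place `__10.blp` before `__2.blp`. The sketch below, which assumes the `__<number>.blp` naming pattern seen in the listing above, recovers the numeric order. The function `chunk_index` is a hypothetical helper, not part of bcolz.

```python
import re

# Assumed naming pattern, taken from the directory listing: __<number>.blp
CHUNK_NAME = re.compile(r"__(\d+)\.blp")


def chunk_index(filename):
    """Return the sequential number encoded in a chunk file name."""
    match = CHUNK_NAME.fullmatch(filename)
    if match is None:
        raise ValueError(f"not a chunk file name: {filename!r}")
    return int(match.group(1))


names = ["__10.blp", "__2.blp", "__0.blp", "__1.blp"]
ordered = sorted(names, key=chunk_index)
print(ordered)
```
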
This structure of separate superchunk files allows for two things:

1. Datasets can be enlarged and shrunk very easily.

2. Horizontal sharding in a distributed system is possible (and cheap!).

In turn, the `data` directory might contain other subdirectories
that are meant for storing the components of a ctable (i.e. something
like a structured array, but stored in column-wise order)::

        data  (the root for a nested datatype)
        /  \     \
     col1  col2  col3  (first-level columns)
          /  \
        sc1  sc3  (-> nested columns, if they exist)

This structure allows for quick access to specific superchunks of
columns without needing to load the complete dataset in memory.

The `superchunk` layout
~~~~~~~~~~~~~~~~~~~~~~~

The superchunk is made of a series of data blocks put together using
the C-Blosc 1.x metacompressor by default. Being a metacompressor,
Blosc can use different compressors and filters while leveraging its
blocking and multithreading capabilities.

The layout of binary superchunk data files looks like this::

    |-0-|-1-|-2-|-3-|-4-|-5-|-6-|-7-|-8-|-9-|-A-|-B-|-C-|-D-|-E-|-F-|
    | b   l   p   k | ^ | RESERVED  |            nchunks            |
                   version

The first four bytes are the magic string 'blpk'. The next byte is an
8-bit unsigned integer that encodes the format version. The next three
bytes are reserved, and the last eight hold a signed 64-bit
little-endian integer that encodes the number of Blosc chunks inside
the superchunk.

Currently (bcolz 1.x), version is 1 and nchunks always has a
value of 1 (this might change in bcolz 2.0).

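The 16-byte header described above maps directly onto a `struct` format string. The following is a sketch based only on that description; `parse_superchunk_header` is a hypothetical helper and the sample bytes are built in place rather than read from a real bcolz file.

```python
import struct

# 4-byte magic, uint8 version, 3 reserved (padding) bytes and a signed
# 64-bit nchunks, packed little-endian without alignment: 16 bytes in total.
SUPERCHUNK_HEADER = struct.Struct("<4sB3xq")


def parse_superchunk_header(buf):
    """Return (version, nchunks) from the first 16 bytes of a superchunk."""
    magic, version, nchunks = SUPERCHUNK_HEADER.unpack_from(buf, 0)
    if magic != b"blpk":
        raise ValueError(f"bad magic string: {magic!r}")
    return version, nchunks


# Build a sample header with version 1 and nchunks 1, as used by bcolz 1.x.
sample = SUPERCHUNK_HEADER.pack(b"blpk", 1, 1)
print(len(sample), parse_superchunk_header(sample))
```
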
After the above header comes the actual data, stored in Blosc chunks. In
turn, each chunk has this format (Blosc 1.x)::

    |-0-|-1-|-2-|-3-|-4-|-5-|-6-|-7-|-8-|-9-|-A-|-B-|-C-|-D-|-E-|-F-|
      ^   ^   ^   ^ |     nbytes    |   blocksize   |    ctbytes    |
      |   |   |   |
      |   |   |   +--typesize
      |   |   +------flags
      |   +----------blosclz version
      +--------------blosc version

For more details on this, see the `C-Blosc header description
<https://github.com/Blosc/c-blosc/blob/master/README_HEADER.rst>`_.

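The chunk header can be decoded in the same way. This sketch assumes, following the diagram and the C-Blosc header description, four single-byte fields followed by three unsigned 32-bit little-endian integers. The names `BloscHeader` and `parse_blosc_header` are hypothetical, and the sample values are made up for illustration.

```python
import struct
from collections import namedtuple

# Four uint8 fields followed by three little-endian uint32 fields (16 bytes).
BLOSC_HEADER = struct.Struct("<BBBBIII")

BloscHeader = namedtuple(
    "BloscHeader",
    "version versionlz flags typesize nbytes blocksize ctbytes",
)


def parse_blosc_header(buf, offset=0):
    """Decode the 16-byte Blosc 1.x chunk header found at `offset`."""
    return BloscHeader(*BLOSC_HEADER.unpack_from(buf, offset))


# Made-up sample: a 262144-byte buffer of 4-byte items compressed to 1000 bytes.
sample = BLOSC_HEADER.pack(2, 1, 1, 4, 262144, 32768, 1000)
header = parse_blosc_header(sample)
print(header.typesize, header.nbytes, header.ctbytes)
```

When reading a `.blp` file, the chunk header would be expected at offset 16, right after the superchunk header, so `parse_blosc_header(data, 16)` would be the corresponding call under that assumption.
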
The `meta` files
----------------

Here there can be as many files as necessary. The format of every
file is JSON. There should be (at least) two files:

The `sizes` file
~~~~~~~~~~~~~~~~

This contains the shape and the compressed and uncompressed sizes of the
dataset. For example::

    $ cat meta/sizes
    {"shape": [100000], "nbytes": 400000, "cbytes": 266904}

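Since the file is plain JSON, the standard library is enough to derive a couple of useful figures from it. The sketch below assumes that `nbytes` is the uncompressed size and `cbytes` the compressed size, and uses the example content above as an inline string instead of reading it from disk.

```python
import json
import math

sizes_text = '{"shape": [100000], "nbytes": 400000, "cbytes": 266904}'
sizes = json.loads(sizes_text)

# Number of elements is the product of all the dimensions in the shape.
nitems = math.prod(sizes["shape"])

# Assumption: nbytes is the uncompressed size, cbytes the compressed one.
itemsize = sizes["nbytes"] // nitems
ratio = sizes["nbytes"] / sizes["cbytes"]

print(f"items={nitems} itemsize={itemsize} ratio={ratio:.2f}")
```
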
The `storage` file
~~~~~~~~~~~~~~~~~~

This contains the information about how the data has to be stored and
its meaning. Example::

    $ cat meta/storage
    {"dtype": "int32", "cparams": {"shuffle": true, "clevel": 5}, "chunklen": 65536, "dflt": 0, "expectedlen": 100000}

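As an illustration of how these fields relate to each other, the sketch below parses the example above and works out how many full chunks of `chunklen` items the expected length would fill. The small `ITEMSIZE` table is a stand-in introduced for this example (with NumPy available, `numpy.dtype(name).itemsize` would be the general way), and how bcolz stores a trailing partial chunk is not covered by this document, so the remainder is only reported.

```python
import json

storage_text = (
    '{"dtype": "int32", "cparams": {"shuffle": true, "clevel": 5}, '
    '"chunklen": 65536, "dflt": 0, "expectedlen": 100000}'
)
storage = json.loads(storage_text)

# Hypothetical lookup table covering only a few dtypes, for illustration.
ITEMSIZE = {"int32": 4, "int64": 8, "float32": 4, "float64": 8}

chunk_nbytes = storage["chunklen"] * ITEMSIZE[storage["dtype"]]
full_chunks, leftover = divmod(storage["expectedlen"], storage["chunklen"])

print(chunk_nbytes, full_chunks, leftover)
```
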
The `__attrs__` file
---------------------

Finally, this file (placed at the root directory of each dataset)
holds additional, optional user information serialized in
JSON format. Example::

    $ cat __attrs__
    {"temp": 22.5, "pressure": 999.2, "timestamp": "2016030915"}

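A round trip through such a file needs nothing beyond the `json` module. This sketch writes the example attributes to an `__attrs__` file inside a temporary directory standing in for a dataset root, then reads them back. It only illustrates the file format and is not the bcolz API for handling attributes.

```python
import json
import tempfile
from pathlib import Path

attrs = {"temp": 22.5, "pressure": 999.2, "timestamp": "2016030915"}

with tempfile.TemporaryDirectory() as tmp:
    # The temporary directory plays the role of the dataset root here.
    attrs_path = Path(tmp) / "__attrs__"
    attrs_path.write_text(json.dumps(attrs))
    loaded = json.loads(attrs_path.read_text())

print(loaded == attrs)
```
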