------------
Introduction
------------

bcolz at a glance
=================

bcolz provides columnar, chunked data containers that can be
compressed both in memory and on disk.  Column storage allows for
efficient table queries, as well as for cheap column addition and
removal.  It is based on `NumPy <http://www.numpy.org>`_, and uses it
as the standard data container to communicate with bcolz objects, but
it also comes with import/export facilities to/from
`HDF5/PyTables tables <http://www.pytables.org>`_ and `pandas
dataframes <http://pandas.pydata.org>`_.

The building blocks of bcolz objects are the so-called ``chunks``:
pieces of data that are compressed as a whole, but that can be
(partially) decompressed in order to improve the fetching of small
parts of the array.  This ``chunked`` nature of bcolz objects,
together with buffered I/O, makes appends very cheap and fetches
reasonably fast (although modifying values can be an expensive
operation).
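The chunked layout described above can be illustrated with a toy container.
This is only a minimal sketch under stated assumptions, not how bcolz is
implemented: the ``ChunkedArray`` class and its ``chunklen`` parameter are
hypothetical names, and the standard-library ``zlib`` stands in for the real
compressor. Unlike the partial decompression mentioned above, this toy version
decompresses a whole chunk to fetch one value.

```python
import struct
import zlib


class ChunkedArray:
    """Toy append-only container of float64 values kept as compressed chunks."""

    ITEM = struct.Struct("<d")  # one little-endian float64

    def __init__(self, chunklen=1024):
        self.chunklen = chunklen  # number of items per compressed chunk
        self.chunks = []          # zlib-compressed byte strings, one per full chunk
        self.buffer = []          # uncompressed tail that is not yet a full chunk

    def append(self, values):
        # Only the tail buffer is touched; existing chunks are never rewritten.
        self.buffer.extend(float(v) for v in values)
        while len(self.buffer) >= self.chunklen:
            head = self.buffer[:self.chunklen]
            self.buffer = self.buffer[self.chunklen:]
            raw = struct.pack("<%dd" % len(head), *head)
            self.chunks.append(zlib.compress(raw))

    def __len__(self):
        return len(self.chunks) * self.chunklen + len(self.buffer)

    def __getitem__(self, index):
        if not 0 <= index < len(self):
            raise IndexError(index)
        nchunk, offset = divmod(index, self.chunklen)
        if nchunk == len(self.chunks):
            return self.buffer[offset]  # value still lives in the tail buffer
        raw = zlib.decompress(self.chunks[nchunk])  # decompress a single chunk
        return self.ITEM.unpack_from(raw, offset * self.ITEM.size)[0]


arr = ChunkedArray(chunklen=1000)
arr.append(range(2500))
print(len(arr), len(arr.chunks), arr[1234], arr[2400])
```

Appending only ever compresses newly filled chunks, which is why appends stay
cheap. Changing a value inside an existing chunk would mean decompressing and
recompressing that chunk, which is the cost noted for modifications.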

The compression/decompression process is carried out internally by
Blosc, a high-performance compressor that is optimized for binary
data.  Because Blosc splits chunks internally into so-called blocks,
only the interesting part of the chunk will be decompressed
(typically in L1 or L2 caches).  That helps maximize performance for
I/O operations (`either on-disk or in-memory
<https://github.com/FrancescAlted/DataContainersTutorials>`_).

bcolz can use numexpr or dask internally to accelerate many internal
vector and query operations: numexpr is used by default if installed,
then dask, and, if neither is found, the pure Python interpreter.
Plain NumPy can be used for these operations too.  numexpr can
optimize memory (cache) usage and uses multithreading for doing the
computations, so it is very fast.  This, in combination with
carray/ctable disk-based, compressed containers, can be used for
performing out-of-core computations efficiently, but most importantly
*transparently*.

carray and ctable objects
-------------------------

The main data container objects in the bcolz package are:

* `carray`: container for homogeneous & heterogeneous (row-wise) data
* `ctable`: container for heterogeneous (column-wise) data

`carray` is very similar to a NumPy `ndarray` in that it supports the
same types and basic data access interface.  The main difference
between them is that a `carray` can keep data compressed (both
in-memory and on-disk), which lets you work with larger datasets
using the same amount of memory/disk.  Another important difference
is the chunked nature of the `carray`, which allows data to be
appended much more efficiently.

For its part, a `ctable` is similar to a NumPy ``structured array``,
and it shares the same properties as its `carray` brother, namely
compression and chunking.  The difference is that data is stored in
column-wise order (and not row-wise, as in the ``structured array``),
allowing for very cheap column handling.  This is of paramount
importance when you need to add and remove columns in wide (and
possibly large) in-memory and on-disk tables; doing this with regular
``structured arrays`` in NumPy is exceedingly slow.

Furthermore, columnar means that tabular datasets are stored in
column-wise order, and this turns out to offer better opportunities
to improve the compression ratio.  This is because data tends to
expose more similarity among elements that sit in the same column
than among those in the same row, so compressors generally do a much
better job when data is laid out in column-wise order.
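The effect of layout on compression ratio can be checked with a small
experiment. The sketch below is an illustration under assumptions, not a bcolz
benchmark: the three columns (``sensor_id``, ``reading``, ``status``) are
hypothetical, and the standard-library ``zlib`` stands in for the compressor.
The same values are serialized once row by row and once column by column, and
both byte strings are then compressed.

```python
import random
import struct
import zlib

random.seed(0)
nrows = 10_000

# Three hypothetical columns with very different statistics.
sensor_id = [i // 1000 for i in range(nrows)]      # int32, long runs of equal values
reading = [random.random() for _ in range(nrows)]  # float64, noisy measurements
status = [i % 2 for i in range(nrows)]             # int8, alternating flag

# Row-wise layout: the fields of each row are stored next to each other.
row = struct.Struct("<idb")
rowwise = b"".join(row.pack(s, r, f) for s, r, f in zip(sensor_id, reading, status))

# Column-wise layout: all values of a column are stored contiguously.
colwise = (struct.pack("<%di" % nrows, *sensor_id)
           + struct.pack("<%dd" % nrows, *reading)
           + struct.pack("<%db" % nrows, *status))

row_size = len(zlib.compress(rowwise))
col_size = len(zlib.compress(colwise))

print("uncompressed bytes    :", len(rowwise))
print("row-wise compressed   :", row_size)
print("column-wise compressed:", col_size)
```

Both layouts hold exactly the same bytes in a different order. In the
column-wise layout the two regular columns collapse to almost nothing, while in
the row-wise layout the noisy ``reading`` bytes interrupt every repetition, so
the compressor has to encode the regular fields again for each row.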

bcolz main features
-------------------

bcolz objects bring several advantages over plain NumPy objects:

* Data is compressed: they take less storage space.

* Efficient shrinks and appends: you can shrink or append more data
  at the end of the objects very efficiently (i.e. copies of the
  whole array are not needed).

* Persistence comes seamlessly integrated, so you can work with
  on-disk arrays almost in the same way as with in-memory ones
  (although some special attention to flushing data is required).

* `ctable` objects have the data arranged column-wise.  This allows
  for much better performance when working with big tables, as well
  as for improving the compression ratio.

* Can leverage Numexpr and Dask as virtual machines for fast
  operations with bcolz objects.  Blosc ensures that the additional
  overhead of handling compressed data natively is very low.

* Advanced query capabilities.  A `ctable` object can iterate over
  the rows whose fields fulfill some conditions (evaluated via the
  numexpr, dask or pure Python virtual machine), which allows
  queries to be performed very efficiently.
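The query capability in the last item can be pictured with a small pure-Python
sketch. It is a conceptual illustration only: the ``where`` function, its
``needs`` and ``outcols`` parameters and the sample columns are hypothetical,
and a plain Python callable stands in for the condition that bcolz would
evaluate through numexpr, dask or Python. The point is that, with column-wise
storage, only the columns involved in the condition are scanned, and the other
columns are read only for the rows that match.

```python
from collections import namedtuple

# Hypothetical table stored column-wise: one Python list per column.
columns = {
    "a": list(range(10)),
    "b": [x * 2.0 for x in range(10)],
    "c": [chr(ord("a") + x) for x in range(10)],
}


def where(columns, predicate, needs, outcols=None):
    """Yield the rows whose `needs` columns satisfy `predicate`.

    Only the columns named in `needs` are scanned for the test; the
    remaining columns are accessed only for the rows that match.
    """
    outcols = list(outcols or columns)
    Row = namedtuple("Row", outcols)
    scanned = [columns[name] for name in needs]
    for i, values in enumerate(zip(*scanned)):
        if predicate(*values):
            yield Row(*(columns[name][i] for name in outcols))


hits = list(where(columns, lambda a, b: a > 3 and b < 16, needs=("a", "b")))
for hit in hits:
    print(hit)
```

Because the result is produced by a generator, matching rows are delivered one
at a time and the full result never has to be materialized in memory.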

bcolz limitations
-----------------

bcolz does not currently come with good support in the following
areas:

* Limited number of operations, at least when compared with NumPy.
  The supported operations are basically vectorized ones (i.e. those
  that are made element-by-element).  But this is changing with the
  adoption of additional kernels like `Dask
  <https://github.com/dask/dask>`_ (and more to come).

* Limited broadcast support.  For example, NumPy lets you operate
  seamlessly with arrays of different shapes (as long as they are
  compatible), but you cannot do that with bcolz.  The only objects
  that can currently be broadcast are scalars
  (e.g. ``bcolz.eval("x+3")``).

* Some methods (namely `carray.where()` and `carray.wheretrue()`)
  do not have support for multidimensional arrays.

* Multidimensional `ctable` objects are not supported.  However, as
  the columns of these objects can be fully multidimensional, this
  is not regarded as an important limitation.