bcolz: columnar and compressed data containers
==============================================

.. image:: https://badges.gitter.im/Blosc/bcolz.svg
   :alt: Join the chat at https://gitter.im/Blosc/bcolz
   :target: https://gitter.im/Blosc/bcolz?utm_source=badge&utm_medium=badge&utm_campaign=pr-badge&utm_content=badge

:Version: |version|
:Travis CI: |travis|
:Appveyor: |appveyor|
:Coveralls: |coveralls|
:And...: |powered|

.. |version| image:: https://img.shields.io/pypi/v/bcolz.png
   :target: https://pypi.python.org/pypi/bcolz

.. |travis| image:: https://img.shields.io/travis/Blosc/bcolz.png
   :target: https://travis-ci.org/Blosc/bcolz

.. |appveyor| image:: https://img.shields.io/appveyor/ci/FrancescAlted/bcolz.png
   :target: https://ci.appveyor.com/project/FrancescAlted/bcolz/branch/master

.. |powered| image:: http://b.repl.ca/v1/Powered--By-Blosc-blue.png
   :target: http://blosc.org

.. |coveralls| image:: https://coveralls.io/repos/Blosc/bcolz/badge.png
   :target: https://coveralls.io/r/Blosc/bcolz

.. image:: docs/bcolz.png

bcolz provides columnar, chunked data containers that can be
compressed either in memory or on disk. Column storage allows for
efficient table queries, as well as for cheap column addition and
removal. It is based on `NumPy <http://www.numpy.org>`_ and uses it
as the standard data container to communicate with bcolz objects, but
it also supports import/export to and from
`HDF5/PyTables tables <http://www.pytables.org>`_ and `pandas
dataframes <http://pandas.pydata.org>`_.

bcolz objects are compressed by default, not only to reduce
memory/disk storage, but also to improve I/O speed. Compression is
carried out internally by `Blosc <http://blosc.org>`_, a
high-performance, multithreaded meta-compressor that is optimized for
binary data (although it works just fine with text data too).

bcolz can also use `numexpr <https://github.com/pydata/numexpr>`_
internally (it does so by default if it detects that numexpr is
installed) or `dask <https://github.com/dask/dask>`_ to accelerate many
vector and query operations (although it can fall back to pure NumPy
for these too). numexpr and dask can optimize memory usage and use
multithreading for the computations, which makes these operations
very fast. This, in combination with the disk-based, compressed
carray/ctable containers, can be used for performing out-of-core
computations efficiently and, most importantly, *transparently*.

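
To make the idea of a chunked, compressed container more concrete,
here is a minimal sketch that uses only the Python standard library.
It is not how bcolz is implemented: ``zlib`` stands in for Blosc, and
``compress_chunks``, ``chunked_sum`` and ``CHUNK_LEN`` are hypothetical
names chosen for this illustration. The point is only that a reduction
can run while holding a single decompressed chunk in memory at a time.

.. code-block:: python

    import array
    import zlib

    CHUNK_LEN = 4096  # items per chunk; an arbitrary choice for this sketch


    def compress_chunks(values, chunk_len=CHUNK_LEN):
        """Pack ints as 64-bit items and compress them chunk by chunk."""
        chunks = []
        buf = array.array("q")
        for value in values:
            buf.append(value)
            if len(buf) == chunk_len:
                chunks.append(zlib.compress(buf.tobytes()))
                buf = array.array("q")
        if len(buf):
            # Last, possibly shorter, chunk.
            chunks.append(zlib.compress(buf.tobytes()))
        return chunks


    def chunked_sum(chunks):
        """Sum all items, decompressing only one chunk at a time."""
        total = 0
        for blob in chunks:
            part = array.array("q")
            part.frombytes(zlib.decompress(blob))
            total += sum(part)
        return total


    chunks = compress_chunks(range(100000))
    print(len(chunks), "chunks,", sum(len(c) for c in chunks), "compressed bytes")
    print("sum:", chunked_sum(chunks))

The same pattern extends to chunks stored as files on disk, which is
the essence of an out-of-core computation.
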
Just to whet your appetite, here is an example with real data, where
bcolz is already fulfilling the promise of accelerating memory I/O by
using compression:

http://nbviewer.ipython.org/github/Blosc/movielens-bench/blob/master/querying-ep14.ipynb

Rationale
---------

By using compression, you can deal with more data using the same
amount of memory, which is very good in itself. But in case you are
wondering about the price to pay in terms of performance, you should
know that nowadays memory access is the most common bottleneck in many
computational scenarios, and that CPUs spend most of their time waiting
for data. Hence, having data compressed in memory can reduce the
stress on the memory subsystem as well.

Furthermore, columnar means that tabular datasets are stored in
column-wise order, and this turns out to offer better opportunities to
improve the compression ratio. This is because data tends to expose more
similarity among elements that sit in the same column than among those
in the same row, so compressors generally do a much better job when
data is laid out in column-wise order. In addition, when you have
to deal with tables with a large number of columns and your operations
only involve some of them, columnar storage tends to be much
more effective because it minimizes the amount of data that travels to
the CPU caches.

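
The effect of layout on compression can be checked with a small,
self-contained experiment. This is only a sketch under stated
assumptions: the two-column table (``stations``, ``readings``) is
synthetic and hypothetical, and ``zlib`` from the standard library
stands in for Blosc, so the sizes it prints say nothing about bcolz
itself. It compresses the same table once with rows interleaved and
once with each column stored on its own.

.. code-block:: python

    import random
    import struct
    import zlib

    rng = random.Random(42)  # fixed seed so runs are repeatable
    n_rows = 20000

    # Hypothetical table: a slowly changing station id and a noisy reading.
    stations = [i // 5000 for i in range(n_rows)]           # long runs of equal ids
    readings = [rng.randrange(256) for _ in range(n_rows)]  # one noisy byte per row

    # Row-wise layout: both fields interleaved, 5 bytes per row.
    row_wise = b"".join(
        struct.pack("<IB", s, r) for s, r in zip(stations, readings)
    )

    # Column-wise layout: each column is stored and compressed separately.
    station_col = struct.pack("<%dI" % n_rows, *stations)
    reading_col = bytes(readings)

    row_size = len(zlib.compress(row_wise))
    col_blobs = [zlib.compress(station_col), zlib.compress(reading_col)]
    col_size = sum(len(blob) for blob in col_blobs)

    print("raw bytes:       ", len(row_wise))
    print("row-wise zlib:   ", row_size)
    print("column-wise zlib:", col_size)

    # A query that touches only `reading` decompresses just that column.
    mean_reading = sum(zlib.decompress(col_blobs[1])) / n_rows
    print("mean reading:    ", mean_reading)

In the row-wise buffer the noisy byte interrupts the repetitive station
id in every row, whereas in the column-wise layout the station column
is almost pure repetition. The last lines show the second benefit
mentioned above: computing over one column never touches the other.
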
So, the ultimate goal for bcolz is not only to reduce the memory needs
of large arrays/tables, but also to make bcolz operations faster
than those on a traditional data container like the ones in NumPy or
pandas. That is already the case in some real-life scenarios (see the
notebook above), and it should become more noticeable with
forthcoming, faster CPUs integrating more cores and wider vector units.

Requisites
----------

- Python >= 2.6
- NumPy >= 1.8
- Cython >= 0.22 (just for compiling the beast)
- C-Blosc >= 1.8.0 (optional, as the internal Blosc will be used by default)
- unittest2 (optional, only in the case you are running Python 2.6)

Optional:

- numexpr >= 2.5.2
- dask >= 0.9.0
- pandas
- tables (pytables)

Building
--------

There are different ways to compile bcolz, depending on whether you
want to link with an already installed Blosc library or not.

Compiling with an installed Blosc library (recommended)
.......................................................

Python and Blosc-powered extensions have a difficult relationship when
compiled using GCC, which is why linking against an external C-Blosc
library is recommended for maximum performance (for details, see
https://github.com/Blosc/python-blosc/issues/110).

Go to https://github.com/Blosc/c-blosc/releases, then download and
install the C-Blosc library. After that, you can tell bcolz where the
C-Blosc library is in a couple of ways:

Using an environment variable:

.. code-block:: console

    $ BLOSC_DIR=/usr/local     # or "set BLOSC_DIR=\blosc" on Win
    $ export BLOSC_DIR         # not needed on Win
    $ python setup.py build_ext --inplace

Using a flag:

.. code-block:: console

    $ python setup.py build_ext --inplace --blosc=/usr/local

Compiling without an installed Blosc library
............................................

bcolz also ships with the Blosc sources so, assuming that you
have a C++ compiler installed, do:

.. code-block:: console

    $ python setup.py build_ext --inplace

That's all. You can now proceed to the testing section.

Note: The C++ compiler is required only for the Snappy dependency.
The other components of Blosc are pure C (including the LZ4 and Zlib
libraries).

Testing
-------

After compiling, you can quickly check that the package is sane by
running::

    $ PYTHONPATH=.     # or "set PYTHONPATH=." on Windows
    $ export PYTHONPATH    # not needed on Windows
    $ python -c"import bcolz; bcolz.test()"  # add `heavy=True` if desired

Installing
----------

Install it as a typical Python package::

    $ pip install -U .

Optionally, install the additional dependencies::

    $ pip install .[optional]

Documentation
-------------

You can find the online manual at:

http://bcolz.blosc.org

but of course, you can always access docstrings from the console
(e.g. ``help(bcolz.ctable)``).

Also, you may want to look at the bench/ directory for some usage
examples.

Resources
---------

Visit the main bcolz site repository at:
http://github.com/Blosc/bcolz

Home of Blosc compressor:
http://blosc.org

User's mail list:
http://groups.google.com/group/bcolz (bcolz@googlegroups.com)

An `introductory talk (20 min)
<https://www.youtube.com/watch?v=-lKV4zC1gss>`_ about bcolz at
EuroPython 2014. `Slides here
<http://blosc.org/docs/bcolz-EuroPython-2014.pdf>`_.

License
-------

Please see ``BCOLZ.txt`` in the ``LICENSES/`` directory.

Share your experience
---------------------

Let us know of any bugs, suggestions, gripes, kudos, etc. you may
have.

**Enjoy Data!**