.. _opt-tips:

-----------------
Optimization tips
-----------------

Changing explicitly the length of chunks
========================================

You may want to use the `chunklen` parameter explicitly to fine-tune
your compression levels::

  >>> a = np.arange(1e7)
  >>> bcolz.carray(a)
  carray((10000000,), float64)  nbytes: 76.29 MB; cbytes: 2.57 MB; ratio: 29.72
    cparams := cparams(clevel=5, shuffle=1)
  [0.0, 1.0, 2.0, ..., 9999997.0, 9999998.0, 9999999.0]
  >>> bcolz.carray(a).chunklen
  16384    # 128 KB = 16384 * 8 is the default chunk size for this carray
  >>> bcolz.carray(a, chunklen=512)
  carray((10000000,), float64)  nbytes: 76.29 MB; cbytes: 10.20 MB; ratio: 7.48
    cparams := cparams(clevel=5, shuffle=1)
  [0.0, 1.0, 2.0, ..., 9999997.0, 9999998.0, 9999999.0]
  >>> bcolz.carray(a, chunklen=8*1024)
  carray((10000000,), float64)  nbytes: 76.29 MB; cbytes: 1.50 MB; ratio: 50.88
    cparams := cparams(clevel=5, shuffle=1)
  [0.0, 1.0, 2.0, ..., 9999997.0, 9999998.0, 9999999.0]

As you can see, the chunk length has a large effect on both the
compression ratio and the I/O performance of carrays.

In general, however, it is safer (and quicker!) to use the
`expectedlen` parameter (see next section).
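The "128 KB = 16384 * 8" relation quoted above is just chunklen times
the element size. A minimal sketch of that arithmetic (plain NumPy, no
bcolz required; the variable names are illustrative):

```python
import numpy as np

# A carray's chunk size in bytes is chunklen * itemsize.
chunklen = 16384                          # default chunklen seen above
itemsize = np.dtype('float64').itemsize   # 8 bytes per element
chunk_kb = chunklen * itemsize // 1024
print(chunk_kb)   # -> 128 (KB)
```

The same relation lets you pick a `chunklen` that hits a target chunk
size for any dtype.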
Informing about the length of your carrays
==========================================

If you are going to add a lot of rows to your carrays, be sure to use
the `expectedlen` parameter at creation time to inform the constructor
about the expected length of your final carray; this allows bcolz to
fine-tune the length of its chunks more easily. For example::

  >>> a = np.arange(1e7)
  >>> bcolz.carray(a, expectedlen=10).chunklen
  512
  >>> bcolz.carray(a, expectedlen=10*1000).chunklen
  4096
  >>> bcolz.carray(a, expectedlen=10*1000*1000).chunklen
  16384
  >>> bcolz.carray(a, expectedlen=10*1000*1000*1000).chunklen
  131072
Lossy compression via the quantize filter
=========================================

You can use the `quantize` filter to allow lossy compression of
floating point data. Data is quantized using
``np.around(scale*data)/scale``, where scale is 2**bits, and bits is
determined from the quantize value. For example, if quantize=1, bits
will be 4. A quantize value of 0 means that quantization is disabled.

Here is an example of what you can get from the quantize filter::
  In [9]: a = np.cumsum(np.random.random_sample(1000*1000)-0.5)

  In [10]: bcolz.carray(a, cparams=bcolz.cparams(quantize=0))  # no quantize
  Out[10]:
  carray((1000000,), float64)
    nbytes: 7.63 MB; cbytes: 6.05 MB; ratio: 1.26
    cparams := cparams(clevel=5, shuffle=1, cname='blosclz', quantize=0)
  [ -2.80946077e-01  -7.63925274e-01  -5.65575047e-01 ...,  3.59036158e+02
     3.58546624e+02   3.58258860e+02]

  In [11]: bcolz.carray(a, cparams=bcolz.cparams(quantize=1))
  Out[11]:
  carray((1000000,), float64)
    nbytes: 7.63 MB; cbytes: 1.41 MB; ratio: 5.40
    cparams := cparams(clevel=5, shuffle=1, cname='blosclz', quantize=1)
  [ -2.50000000e-01  -7.50000000e-01  -5.62500000e-01 ...,  3.59036158e+02
     3.58546624e+02   3.58258860e+02]

  In [12]: bcolz.carray(a, cparams=bcolz.cparams(quantize=2))
  Out[12]:
  carray((1000000,), float64)
    nbytes: 7.63 MB; cbytes: 2.20 MB; ratio: 3.47
    cparams := cparams(clevel=5, shuffle=1, cname='blosclz', quantize=2)
  [ -2.81250000e-01  -7.65625000e-01  -5.62500000e-01 ...,  3.59036158e+02
     3.58546624e+02   3.58258860e+02]

  In [13]: bcolz.carray(a, cparams=bcolz.cparams(quantize=3))
  Out[13]:
  carray((1000000,), float64)
    nbytes: 7.63 MB; cbytes: 2.30 MB; ratio: 3.31
    cparams := cparams(clevel=5, shuffle=1, cname='blosclz', quantize=3)
  [ -2.81250000e-01  -7.63671875e-01  -5.65429688e-01 ...,  3.59036158e+02
     3.58546624e+02   3.58258860e+02]
As you can see, the compression ratio can improve quite significantly
when using the quantize filter. It is important to note that by using
quantize you are losing precision in your floating point data.

Also note how the first elements in the quantized arrays have fewer
significant digits, but the last ones do not. This is a side effect of
how bcolz stores the trailing data that do not fit in a whole chunk.
But in general you should expect a loss in precision.
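The quantization step itself is easy to reproduce with plain NumPy. A
small sketch, assuming quantize=1 maps to bits=4 as stated above (so
scale = 2**4 = 16), applied to the first three values of the example
output:

```python
import numpy as np

# Reproduce the quantize=1 rounding by hand: scale = 2**bits = 16,
# so each value is snapped to the nearest multiple of 1/16.
scale = 2.0 ** 4
data = np.array([-0.280946077, -0.763925274, -0.565575047])
quantized = np.around(scale * data) / scale
print(quantized)   # values become -0.25, -0.75, -0.5625
```

These are exactly the leading values shown in the quantize=1 output
above, which is why compressing the rounded data is so much more
effective: the low-order mantissa bits become zeros.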