cjcpp
/
bcolz


			
				
					
						
						
							123456789101112131415161718192021222324252627282930313233343536373839404142434445464748495051525354555657585960616263646566676869707172737475767778798081828384858687888990919293949596979899100101102103104105
							.. _opt-tips:

-----------------
Optimization tips
-----------------

Changing explicitly the length of chunks
========================================

You may want to use explicitly the `chunklen` parameter to fine-tune
your compression levels::

  >>> a = np.arange(1e7)
  >>> bcolz.carray(a)
  carray((10000000,), float64)  nbytes: 76.29 MB; cbytes: 2.57 MB; ratio: 29.72
    cparams := cparams(clevel=5, shuffle=1)
  [0.0, 1.0, 2.0, ..., 9999997.0, 9999998.0, 9999999.0]
  >>> bcolz.carray(a).chunklen
  16384   # 128 KB = 16384 * 8 is the default chunk size for this carray
  >>> bcolz.carray(a, chunklen=512)
  carray((10000000,), float64)  nbytes: 76.29 MB; cbytes: 10.20 MB; ratio: 7.48
    cparams := cparams(clevel=5, shuffle=1)
  [0.0, 1.0, 2.0, ..., 9999997.0, 9999998.0, 9999999.0]
  >>> bcolz.carray(a, chunklen=8*1024)
  carray((10000000,), float64)  nbytes: 76.29 MB; cbytes: 1.50 MB; ratio: 50.88
    cparams := cparams(clevel=5, shuffle=1)
  [0.0, 1.0, 2.0, ..., 9999997.0, 9999998.0, 9999999.0]

You see, the length of the chunk affects very much compression levels
and the performance of I/O to carrays too.

In general, however, it is safer (and quicker!) to use the
`expectedlen` parameter (see next section).

Informing about the length of your carrays
==========================================

If you are going to add a lot of rows to your carrays, be sure to use
the `expectedlen` parameter in creating time to inform the constructor
about the expected length of your final carray; this allows bcolz to
fine-tune the length of its chunks more easily.  For example::

  >>> a = np.arange(1e7)
  >>> bcolz.carray(a, expectedlen=10).chunklen
  512
  >>> bcolz.carray(a, expectedlen=10*1000).chunklen
  4096
  >>> bcolz.carray(a, expectedlen=10*1000*1000).chunklen
  16384
  >>> bcolz.carray(a, expectedlen=10*1000*1000*1000).chunklen
  131072

Lossy compression via the quantize filter
=========================================

Using the `quantize` filter for allowing lossy compression on floating
point data.  Data is quantized using ``np.around(scale*data)/scale``,
where scale is 2**bits, and bits is determined from the quantize
value.  For example, if quantize=1, bits will be 4.  0 means that the
quantization is disabled.

Here it is an example of what you can get from the quantize filter::

  In [9]: a = np.cumsum(np.random.random_sample(1000*1000)-0.5)

  In [10]: bcolz.carray(a, cparams=bcolz.cparams(quantize=0))  # no quantize
  Out[10]:
  carray((1000000,), float64)
    nbytes: 7.63 MB; cbytes: 6.05 MB; ratio: 1.26
    cparams := cparams(clevel=5, shuffle=1, cname='blosclz', quantize=0)
  [ -2.80946077e-01  -7.63925274e-01  -5.65575047e-01 ...,   3.59036158e+02
     3.58546624e+02   3.58258860e+02]

  In [11]: bcolz.carray(a, cparams=bcolz.cparams(quantize=1))
  Out[11]:
  carray((1000000,), float64)
    nbytes: 7.63 MB; cbytes: 1.41 MB; ratio: 5.40
    cparams := cparams(clevel=5, shuffle=1, cname='blosclz', quantize=1)
  [ -2.50000000e-01  -7.50000000e-01  -5.62500000e-01 ...,   3.59036158e+02
     3.58546624e+02   3.58258860e+02]

  In [12]: bcolz.carray(a, cparams=bcolz.cparams(quantize=2))
  Out[12]:
  carray((1000000,), float64)
    nbytes: 7.63 MB; cbytes: 2.20 MB; ratio: 3.47
    cparams := cparams(clevel=5, shuffle=1, cname='blosclz', quantize=2)
  [ -2.81250000e-01  -7.65625000e-01  -5.62500000e-01 ...,   3.59036158e+02
     3.58546624e+02   3.58258860e+02]

  In [13]: bcolz.carray(a, cparams=bcolz.cparams(quantize=3))
  Out[13]:
  carray((1000000,), float64)
    nbytes: 7.63 MB; cbytes: 2.30 MB; ratio: 3.31
    cparams := cparams(clevel=5, shuffle=1, cname='blosclz', quantize=3)
  [ -2.81250000e-01  -7.63671875e-01  -5.65429688e-01 ...,   3.59036158e+02
     3.58546624e+02   3.58258860e+02]

As you can see, the compression ratio can improve pretty significantly
when using the quantize filter.  It is important to note that by using
quantize you are loosing precision on your floating point data.

Also note how the first elements in the quantized arrays have less
significant digits, but not the last ones.  This is a side effect due
to how bcolz stores the trainling data that do not fit in a whole
chunk.  But in general you should expect a loss in precision.