# Tutorial on carray objects
[Go to the tutorials index](tutorials.ipynb)


Index:
 - Creating carrays
 - Enlarging your carray
 - Compression level and shuffle filter
 - Accessing carray data
 - Querying carrays
 - Modifying carrays
 - Multidimensional carrays
 - Operating with carrays
 - carray metadata
 - carray user attrs
 - Memory profiling

This tutorial focuses on how to use carray objects: which options they
support, and when and how they should be used.

We will also see how this container compares to NumPy arrays and
highlight some of its strengths.

In [1]:
from __future__ import print_function

# Let's import the packages we need for this tutorial
import bcolz
import numpy as np
import sys

# Timing measurements will be saved here
bcolz_vs_numpy = {}

bcolz.print_versions()

-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=
bcolz version: 1.0.1.dev58+dirty
bcolz git info: 1.0.0-77-ga34d0b1
NumPy version: 1.11.0
Blosc version: 1.9.1.dev ($Date:: 2016-04-29 #$)
Blosc compressors: ['blosclz', 'lz4', 'lz4hc', 'snappy', 'zlib']
Numexpr version: 2.5.3.dev0
Dask version: 0.9.0
Python version: 2.7.11 |Continuum Analytics, Inc.| (default, Dec 6 2015, 18:08:32) 
[GCC 4.4.7 20120313 (Red Hat 4.4.7-1)]
Platform: linux2-x86_64
Byte-ordering: little
Detected cores: 4
-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=



## Creating carrays

A carray can be created from any NumPy ndarray by using the `bcolz.carray` constructor.

In [2]:
# Clear mydir, needed in case you run this tutorial multiple times
!rm -rf mydir

In [3]:
# create an in-memory numpy container
a = np.arange(10)

In [4]:
# create an in-memory carray container
b = bcolz.carray(a)

In [5]:
# create an on-disk carray container
c = bcolz.carray(a, rootdir='mydir')
c.flush()

**NOTE:** To avoid forgetting to flush your data to disk, you are encouraged to use the `with` statement for on-disk carrays (we will see an example later on). 

You can also create a carray with one of the top-level constructor functions. Note that write mode (`mode='w'`) overwrites the contents of the folder where the carray is created.


In [6]:
d = bcolz.arange(10, rootdir='mydir', mode='w')

Please note that carray lets you create disk-based arrays by just specifying the `rootdir` parameter in any of the constructors. Disk-based arrays fully support all the operations of their in-memory counterparts, so depending on your needs, you may want to use one or the other (or even a combination of both).

Now, `b` is a carray object. Just check the following:

In [7]:
type(b)

bcolz.carray_ext.carray

You can have a peek at it by using its string form:

In [8]:
print(b)

[0 1 2 3 4 5 6 7 8 9]


You can get more info about the uncompressed size (nbytes), the compressed size (cbytes) and the compression ratio (ratio = nbytes/cbytes) by using its representation form:

In [9]:
b # <==> print(repr(b))

carray((10,), int64)
 nbytes := 80; cbytes := 16.00 KB; ratio: 0.00
 cparams := cparams(clevel=5, shuffle=1, cname='lz4', quantize=0)
 chunklen := 2048; chunksize: 16384; blocksize: 0
[0 1 2 3 4 5 6 7 8 9]

As you can see, the compressed size is much larger than the uncompressed one. How can this be? Well, it turns out that a carray carries an I/O buffer for accelerating some internal operations. So, for small arrays (typically those taking less than 1 MB), there is little point in using a carray.
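The ratio printed in the representation can be reproduced by hand. The sketch below uses only the figures shown above (10 `int64` items and a 16.00 KB buffer); the variable names are illustrative and are not part of the bcolz API.

```python
# Figures copied from the repr above: 10 int64 items and a 16.00 KB buffer.
nbytes = 10 * 8        # uncompressed size in bytes
cbytes = 16 * 1024     # size reported as cbytes, in bytes

# ratio = nbytes / cbytes, as defined in the text
ratio = float(nbytes) / cbytes
print("ratio: {0:.2f}".format(ratio))   # ratio: 0.00
```

With so few items the fixed buffer dominates, which is why the ratio rounds down to zero.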

However, when creating carrays larger than 1 MB (its natural scenario), the size of the I/O buffer is generally negligible in comparison to its total size:

In [10]:
b = bcolz.arange(1e8)
b

carray((100000000,), float64)
 nbytes := 762.94 MB; cbytes := 25.01 MB; ratio: 30.50
 cparams := cparams(clevel=5, shuffle=1, cname='lz4', quantize=0)
 chunklen := 131072; chunksize: 1048576; blocksize: 32768
[ 0.00000000e+00 1.00000000e+00 2.00000000e+00 ..., 9.99999970e+07
 9.99999980e+07 9.99999990e+07]

The carray consumes about 25 MB, while the original data would have taken more than 760 MB; that's a huge gain. You can always get a hint of how much space your carray takes by using `sys.getsizeof()`:

In [11]:
sys.getsizeof(b)

26228365

The takeaway message is: you can create very large arrays without the need to create a NumPy array first (that may not fit in memory).

Finally, you can get a copy of your created carrays by using the `copy()` method:

In [12]:
c = b.copy()
c

carray((100000000,), float64)
 nbytes := 762.94 MB; cbytes := 25.01 MB; ratio: 30.50
 cparams := cparams(clevel=5, shuffle=1, cname='lz4', quantize=0)
 chunklen := 131072; chunksize: 1048576; blocksize: 32768
[ 0.00000000e+00 1.00000000e+00 2.00000000e+00 ..., 9.99999970e+07
 9.99999980e+07 9.99999990e+07]

Also, you can set desired parameters for the newly created copy too:

In [13]:
%time b.copy(cparams=bcolz.cparams(clevel=9))

CPU times: user 656 ms, sys: 92 ms, total: 748 ms
Wall time: 421 ms


carray((100000000,), float64)
 nbytes := 762.94 MB; cbytes := 7.56 MB; ratio: 100.97
 cparams := cparams(clevel=9, shuffle=1, cname='lz4', quantize=0)
 chunklen := 131072; chunksize: 1048576; blocksize: 524288
[ 0.00000000e+00 1.00000000e+00 2.00000000e+00 ..., 9.99999970e+07
 9.99999980e+07 9.99999990e+07]

In [14]:
%time b.copy(cparams=bcolz.cparams(clevel=9, cname='lz4'))

CPU times: user 656 ms, sys: 40 ms, total: 696 ms
Wall time: 393 ms


carray((100000000,), float64)
 nbytes := 762.94 MB; cbytes := 7.56 MB; ratio: 100.97
 cparams := cparams(clevel=9, shuffle=1, cname='lz4', quantize=0)
 chunklen := 131072; chunksize: 1048576; blocksize: 524288
[ 0.00000000e+00 1.00000000e+00 2.00000000e+00 ..., 9.99999970e+07
 9.99999980e+07 9.99999990e+07]

In [15]:
%time b.copy(cparams=bcolz.cparams(clevel=9, cname='zlib', shuffle=bcolz.BITSHUFFLE))

CPU times: user 7.42 s, sys: 484 ms, total: 7.9 s
Wall time: 7.46 s


carray((100000000,), float64)
 nbytes := 762.94 MB; cbytes := 2.30 MB; ratio: 331.96
 cparams := cparams(clevel=9, shuffle=2, cname='zlib', quantize=0)
 chunklen := 131072; chunksize: 1048576; blocksize: 1048576
[ 0.00000000e+00 1.00000000e+00 2.00000000e+00 ..., 9.99999970e+07
 9.99999980e+07 9.99999990e+07]


## Enlarging your carray

One of the nicest features of carray objects is that they can be
enlarged very efficiently. This can be done via the `carray.append()`
method.

For example, if `b` is a carray with 10 million elements:

In [16]:
b = bcolz.arange(10*1e6)
b

carray((10000000,), float64)
 nbytes := 76.29 MB; cbytes := 3.32 MB; ratio: 23.01
 cparams := cparams(clevel=5, shuffle=1, cname='lz4', quantize=0)
 chunklen := 65536; chunksize: 524288; blocksize: 32768
[ 0.00000000e+00 1.00000000e+00 2.00000000e+00 ..., 9.99999700e+06
 9.99999800e+06 9.99999900e+06]

It could be enlarged by 10 elements as follows:

In [17]:
b.append(np.arange(10.))
b

carray((10000010,), float64)
 nbytes := 76.29 MB; cbytes := 3.32 MB; ratio: 23.01
 cparams := cparams(clevel=5, shuffle=1, cname='lz4', quantize=0)
 chunklen := 65536; chunksize: 524288; blocksize: 32768
[ 0. 1. 2. ..., 7. 8. 9.]

Let's check how fast appending can be:

In [18]:
a = np.arange(1e7)
b = bcolz.arange(1e7)

t_bcolz = %timeit -o b.append(a)
t_numpy = %timeit -o np.concatenate((a, a))
ratio = t_numpy.best / t_bcolz.best
bcolz_vs_numpy["append: array"] = ratio

print('\n* In this case appending to a carray was {0}x times faster than numpy'.format(round(ratio, 2)))

100 loops, best of 3: 17.2 ms per loop
10 loops, best of 3: 26.8 ms per loop

* In this case appending to a carray was 1.56x times faster than numpy


And this is especially the case when appending small bits to large arrays:

In [19]:
a = np.arange(1e7)
b = bcolz.carray(a)
c = np.arange(1e1)

t_bcolz = %timeit -o b.append(c)
t_numpy = %timeit -o np.concatenate((a, c))
ratio = t_numpy.best / t_bcolz.best
bcolz_vs_numpy["append: small array to large array"] = ratio

print('\n* Appending to a large carray can be around {0}x times faster than numpy'.format(round(ratio, 2)))

The slowest run took 6.49 times longer than the fastest. This could mean that an intermediate result is being cached.
100000 loops, best of 3: 2.94 µs per loop
100 loops, best of 3: 8.72 ms per loop

* Appending to a large carray can be around 2966.07x times faster than numpy
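One plausible way to picture why appending a few items is so cheap is a chunked layout, as hinted by the `chunklen` field in the representations above. The toy class below is only a sketch under that assumption: `ChunkedArraySketch` is a hypothetical name, `zlib` stands in for Blosc, and none of this is bcolz's actual code.

```python
import zlib
from array import array

class ChunkedArraySketch:
    """Toy append-only container: full chunks are compressed, the tail is not."""

    def __init__(self, chunklen=4):
        self.chunklen = chunklen
        self.chunks = []          # compressed full chunks, never rewritten
        self.tail = array("d")    # small uncompressed buffer for the newest items

    def append(self, values):
        # Only the tail is touched; existing chunks are never copied.
        self.tail.extend(values)
        while len(self.tail) >= self.chunklen:
            full = self.tail[:self.chunklen]
            self.tail = self.tail[self.chunklen:]
            self.chunks.append(zlib.compress(full.tobytes()))

    def __len__(self):
        return len(self.chunks) * self.chunklen + len(self.tail)

c = ChunkedArraySketch(chunklen=4)
c.append([float(i) for i in range(10)])  # fills two chunks, leaves two items
c.append([10.0])                         # touches only the tail
print(len(c), len(c.chunks), len(c.tail))  # 11 2 3
```

By contrast, `np.concatenate` has to allocate a new array and copy every existing element, which matches the large gap measured above.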


You can also enlarge your arrays by using the `resize()` method:

In [20]:
b = bcolz.arange(10)
b.resize(20)
b

carray((20,), int64)
 nbytes := 160; cbytes := 16.00 KB; ratio: 0.01
 cparams := cparams(clevel=5, shuffle=1, cname='lz4', quantize=0)
 chunklen := 2048; chunksize: 16384; blocksize: 0
[0 1 2 3 4 5 6 7 8 9 0 0 0 0 0 0 0 0 0 0]

Note how the appended values are filled with zeros. This is because the
default fill value is 0, but you can choose a different value
too:

In [21]:
b = bcolz.arange(10, dflt=1)
b.resize(20)
b

carray((20,), int64)
 nbytes := 160; cbytes := 16.00 KB; ratio: 0.01
 cparams := cparams(clevel=5, shuffle=1, cname='lz4', quantize=0)
 chunklen := 2048; chunksize: 16384; blocksize: 0
[0 1 2 3 4 5 6 7 8 9 1 1 1 1 1 1 1 1 1 1]

Another operation supported by carrays is trimming:

In [22]:
b = bcolz.arange(10)
b.resize(5)
b

carray((5,), int64)
 nbytes := 40; cbytes := 16.00 KB; ratio: 0.00
 cparams := cparams(clevel=5, shuffle=1, cname='lz4', quantize=0)
 chunklen := 2048; chunksize: 16384; blocksize: 0
[0 1 2 3 4]

If you wish you could even set the size to 0:

In [23]:
b.resize(0)
len(b)

0
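The growing and trimming behaviour shown in the cells above can be summarised with a plain-Python sketch. `resize_sketch` is a hypothetical helper that works on lists and returns a copy, whereas `carray.resize()` works in place; it is meant only to illustrate the semantics seen in the outputs.

```python
def resize_sketch(values, nitems, dflt=0):
    """Return a copy of `values` resized to `nitems` (hypothetical helper)."""
    if nitems <= len(values):
        return values[:nitems]                       # trimming drops trailing items
    return values + [dflt] * (nitems - len(values))  # growing pads with dflt

data = list(range(10))
print(resize_sketch(data, 20, dflt=1))  # ten original items followed by ten 1s
print(resize_sketch(data, 5))           # [0, 1, 2, 3, 4]
print(len(resize_sketch(data, 0)))      # 0
```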

Do not be afraid to resize extensively, as it is one of the strongest points of carray
objects.

Let's see how it compares to NumPy when large arrays are involved.

In [24]:
a = np.arange(1e7)
b = bcolz.carray(a)
desired_size = int(1e4)

t_bcolz = %timeit -o b.resize(desired_size)
t_numpy = %timeit -o np.resize(a, desired_size)
ratio = t_numpy.best / t_bcolz.best
bcolz_vs_numpy["resize large array"] = ratio

print('\n* Resizing a large carray is around {0}x times faster than numpy'.format(round(ratio, 2)))

The slowest run took 206.52 times longer than the fastest. This could mean that an intermediate result is being cached.
1000000 loops, best of 3: 547 ns per loop
100 loops, best of 3: 8.76 ms per loop

* Resizing a large carray is around 16005.88x times faster than numpy





## Compression level and shuffle filter

Bcolz carray objects use Blosc as the internal compressor. Blosc can be
directed to use different compression levels, and you can enable or disable
its internal shuffle filter at will. The shuffle filter is a way to improve
compression when using items that have type sizes > 1 byte, although
it might be counter-productive (very rarely) for some data distributions.
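To get an intuition for why shuffling helps with multi-byte items, here is a minimal standard-library sketch. It is not Blosc's implementation: `byte_shuffle` and `byte_unshuffle` are hypothetical helpers that perform a plain byte transposition, and `zlib` stands in for the Blosc codecs.

```python
import struct
import zlib

def byte_shuffle(raw, itemsize):
    """Group the i-th byte of every item together (a plain byte transposition)."""
    return b"".join(raw[i::itemsize] for i in range(itemsize))

def byte_unshuffle(shuffled, itemsize):
    """Invert byte_shuffle, restoring the original item layout."""
    nitems = len(shuffled) // itemsize
    out = bytearray(len(shuffled))
    for i in range(itemsize):
        out[i::itemsize] = shuffled[i * nitems:(i + 1) * nitems]
    return bytes(out)

# 100000 consecutive 8-byte integers, similar in spirit to an arange
n = 100000
raw = struct.pack("<%dq" % n, *range(n))

plain = zlib.compress(raw)
shuffled = zlib.compress(byte_shuffle(raw, 8))

# Sizes in bytes: original, compressed as is, compressed after shuffling
print(len(raw), len(plain), len(shuffled))
```

After shuffling, the mostly-zero high bytes of all items sit next to each other, so the compressor sees long repetitive runs instead of short ones interrupted by the changing low bytes.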

By default carrays are compressed using Blosc with compression level 5
with shuffle active. Depending on your needs, it can be a good idea to
use other compression levels too.

Let's see some examples:



In [25]:
a = np.arange(1e7)
bcolz.carray(a, bcolz.cparams(clevel=1))

carray((10000000,), float64)
 nbytes := 76.29 MB; cbytes := 4.95 MB; ratio: 15.40
 cparams := cparams(clevel=1, shuffle=1, cname='lz4', quantize=0)
 chunklen := 65536; chunksize: 524288; blocksize: 16384
[ 0.00000000e+00 1.00000000e+00 2.00000000e+00 ..., 9.99999700e+06
 9.99999800e+06 9.99999900e+06]

Let's set the maximum compression level:

In [26]:
bcolz.carray(a, bcolz.cparams(clevel=9))

carray((10000000,), float64)
 nbytes := 76.29 MB; cbytes := 1.03 MB; ratio: 73.81
 cparams := cparams(clevel=9, shuffle=1, cname='lz4', quantize=0)
 chunklen := 65536; chunksize: 524288; blocksize: 524288
[ 0.00000000e+00 1.00000000e+00 2.00000000e+00 ..., 9.99999700e+06
 9.99999800e+06 9.99999900e+06]

And we can use different compressors too:

In [27]:
bcolz.carray(a, bcolz.cparams(clevel=9, cname="zlib"))

carray((10000000,), float64)
 nbytes := 76.29 MB; cbytes := 759.87 KB; ratio: 102.81
 cparams := cparams(clevel=9, shuffle=1, cname='zlib', quantize=0)
 chunklen := 65536; chunksize: 524288; blocksize: 524288
[ 0.00000000e+00 1.00000000e+00 2.00000000e+00 ..., 9.99999700e+06
 9.99999800e+06 9.99999900e+06]

And using other filters as well. Let's use BITSHUFFLE here:

In [28]:
bcolz.carray(a, bcolz.cparams(clevel=9, cname="zlib", shuffle=bcolz.BITSHUFFLE))

carray((10000000,), float64)
 nbytes := 76.29 MB; cbytes := 667.42 KB; ratio: 117.06
 cparams := cparams(clevel=9, shuffle=2, cname='zlib', quantize=0)
 chunklen := 65536; chunksize: 524288; blocksize: 524288
[ 0.00000000e+00 1.00000000e+00 2.00000000e+00 ..., 9.99999700e+06
 9.99999800e+06 9.99999900e+06]

As mentioned above, you could decide to disable the shuffle filter that
comes with Blosc:

In [29]:
bcolz.carray(a, bcolz.cparams(shuffle=bcolz.NOSHUFFLE))

carray((10000000,), float64)
 nbytes := 76.29 MB; cbytes := 39.01 MB; ratio: 1.96
 cparams := cparams(clevel=5, shuffle=0, cname='lz4', quantize=0)
 chunklen := 65536; chunksize: 524288; blocksize: 32768
[ 0.00000000e+00 1.00000000e+00 2.00000000e+00 ..., 9.99999700e+06
 9.99999800e+06 9.99999900e+06]

As can be seen, the compression ratio is much worse in this case.
In general, it is recommended to leave the SHUFFLE filter active (unless you are
fine-tuning the performance for a specific carray).


See the Optimization tips chapter for more info on how you can change other internal parameters like the chunk size.

Also, if you would like to set your own global defaults for the compression parameters, please see the Defaults chapter.





## Accessing carray data

The way to access carray data is very similar to the NumPy indexing
scheme, and in fact, supports all the indexing methods supported by
NumPy. First, start by specifying an index or slice:

In [30]:
a = np.arange(10)
b = bcolz.carray(a)
b[0]

0

In [31]:
b[-1]

9

In [32]:
b[2:4]

array([2, 3])

In [33]:
b[::2]

array([0, 2, 4, 6, 8])

In [34]:
b[3:9:3]

array([3, 6])

Note that NumPy objects are returned as the result of an indexing
operation. This is by design, because NumPy objects are normally
more featureful and flexible (especially if they are small).

In fact, a handy way to get a NumPy array out of a carray object is 
asking for the complete range:

In [35]:
b[:]

array([0, 1, 2, 3, 4, 5, 6, 7, 8, 9])

Fancy indexing is supported too. For example, indexing with boolean arrays gives:

In [36]:
list_barr = [True]*5+[False]*5
barr = np.array(list_barr)
b[barr]

array([0, 1, 2, 3, 4])

This also works with carray objects acting as the boolean index:

In [37]:
b[bcolz.carray(barr)]

array([0, 1, 2, 3, 4])

Be aware that if you provide a list of booleans, it will be interpreted as a list of indices to extract (`True` as 1 and `False` as 0), so you won't obtain what you are looking for:

In [38]:
b[list_barr]

array([1, 1, 1, 1, 1, 0, 0, 0, 0, 0])
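The result above follows from Python treating `True` as 1 and `False` as 0. The plain-Python sketch below, which uses a list instead of a carray, contrasts the two interpretations; the variable names are illustrative only.

```python
data = list(range(10))
list_barr = [True] * 5 + [False] * 5

# Treating each boolean as an integer index (True == 1, False == 0)
as_indices = [data[int(flag)] for flag in list_barr]

# Treating the list as a mask, which is what a boolean array does
as_mask = [value for value, flag in zip(data, list_barr) if flag]

print(as_indices)  # [1, 1, 1, 1, 1, 0, 0, 0, 0, 0]
print(as_mask)     # [0, 1, 2, 3, 4]
```

Converting the list to a NumPy boolean array (or a boolean carray) first, as in the earlier cells, selects the mask interpretation.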

As we just saw, you could also give a list of indices you are interested in:

In [39]:
b[[2,3,0,2]]

array([2, 3, 0, 2])

In [40]:
b[bcolz.carray([2,3,0,2])]

array([2, 3, 0, 2])


## Querying carrays

Carrays can be queried in different ways. The easiest and most powerful one is by using their rich set of iterators:

In [41]:
a = np.arange(1e7)
b = bcolz.carray(a)
t_bcolz = %timeit -o sum(v for v in b if v < 10)
t_numpy = %timeit -o sum(v for v in a if v < 10)
ratio = t_numpy.best / t_bcolz.best
bcolz_vs_numpy["query sum large array"] = ratio

print('\n* In this case summing up the desired items of our carray is '
 '{0}x times faster than numpy'.format(round(ratio, 2)))

1 loop, best of 3: 649 ms per loop
1 loop, best of 3: 1.83 s per loop

* In this case summing up the desired items of our carray is 2.83x times faster than numpy


The iterator also supports looking into slices of the array. The time taken in this case is much shorter because the slice being scanned is much shorter. Look at this:

In [42]:
%timeit sum(v for v in b.iter(start=2, stop=20, step=3) if v < 10)

1000 loops, best of 3: 301 µs per loop
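The sliced query above visits only a handful of positions. As a rough plain-Python analogy (an assumption about the semantics, not bcolz code), `itertools.islice` over a lazy range behaves like `iter(start=2, stop=20, step=3)` on an array holding 0, 1, 2, and so on.

```python
from itertools import islice

values = range(10**7)   # stand-in for the carray contents (0, 1, 2, ...)

# Only positions 2, 5, 8, 11, 14 and 17 are visited
selected = islice(values, 2, 20, 3)
total = sum(v for v in selected if v < 10)
print(total)  # 2 + 5 + 8 = 15
```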


Also, you can quickly retrieve the indices of a boolean carray that
are true:


In [43]:
barr = bcolz.eval("b < 10") # see 'Operating with carrays' section below
[i for i in barr.wheretrue()]

[0, 1, 2, 3, 4, 5, 6, 7, 8, 9]

In [44]:
%timeit [i for i in barr.wheretrue()]

1000 loops, best of 3: 1.2 ms per loop


You can then use the boolean array, as we saw previously, to get the desired values. This returns all the values from our carray where the boolean array is true:


In [45]:
[i for i in b.where(barr)]

[0.0, 1.0, 2.0, 3.0, 4.0, 5.0, 6.0, 7.0, 8.0, 9.0]

In [46]:
%timeit [i for i in b.where(barr)]

1000 loops, best of 3: 2.38 ms per loop


Note how `wheretrue` and `where` iterators are really fast. They are
also very powerful. For example, they support `limit` and `skip`
parameters for limiting the number of elements returned and skipping
the leading elements respectively:

In [47]:
[i for i in barr.wheretrue(limit=5)]

[0, 1, 2, 3, 4]

In [48]:
[i for i in barr.wheretrue(skip=3)]

[3, 4, 5, 6, 7, 8, 9]

In [49]:
[i for i in barr.wheretrue(limit=5, skip=3)]

[3, 4, 5, 6, 7]
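The `limit` and `skip` semantics seen in the three outputs above can be mimicked with a small generator pipeline. `wheretrue_sketch` is a hypothetical stand-in that works on any iterable of booleans; it only illustrates the observed behaviour and is not how bcolz implements it.

```python
from itertools import islice

def wheretrue_sketch(flags, limit=None, skip=0):
    """Return an iterator over the indices of true items.

    The first `skip` hits are dropped, then at most `limit` hits are returned.
    """
    hits = (i for i, flag in enumerate(flags) if flag)
    stop = None if limit is None else skip + limit
    return islice(hits, skip, stop)

flags = [v < 10 for v in range(1000)]
print(list(wheretrue_sketch(flags, limit=5)))          # [0, 1, 2, 3, 4]
print(list(wheretrue_sketch(flags, skip=3)))           # [3, 4, 5, 6, 7, 8, 9]
print(list(wheretrue_sketch(flags, limit=5, skip=3)))  # [3, 4, 5, 6, 7]
```

Because everything is lazy, no intermediate list of indices is built before the limit is applied.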

The advantage of the carray iterators is that you can use them in
generator contexts and hence, you don't need to waste memory
creating temporaries, which is an important consideration when
dealing with large arrays.

Again, the iterator toolset in bcolz is very fast, so try to
express your problems in a way that you can leverage it extensively.


## Modifying carrays

Although it is a somewhat slow operation, carrays can be modified too.
You can do it by specifying scalar or slice indices:

In [50]:
a = np.arange(10)
b = bcolz.arange(10)
b

carray((10,), int64)
 nbytes := 80; cbytes := 16.00 KB; ratio: 0.00
 cparams := cparams(clevel=5, shuffle=1, cname='lz4', quantize=0)
 chunklen := 2048; chunksize: 16384; blocksize: 0
[0 1 2 3 4 5 6 7 8 9]

In [51]:
b[1:7] = 10
b

carray((10,), int64)
 nbytes := 80; cbytes := 16.00 KB; ratio: 0.00
 cparams := cparams(clevel=5, shuffle=1, cname='lz4', quantize=0)
 chunklen := 2048; chunksize: 16384; blocksize: 0
[ 0 10 10 10 10 10 10 7 8 9]

In [52]:
b[1::3] = -10
b

carray((10,), int64)
 nbytes := 80; cbytes := 16.00 KB; ratio: 0.00
 cparams := cparams(clevel=5, shuffle=1, cname='lz4', quantize=0)
 chunklen := 2048; chunksize: 16384; blocksize: 0
[ 0 -10 10 10 -10 10 10 -10 8 9]

Modification by using fancy indexing is supported too:

In [53]:
barr = np.array([True]*5+[False]*5)
b[barr] = -5
b

carray((10,), int64)
 nbytes := 80; cbytes := 16.00 KB; ratio: 0.00
 cparams := cparams(clevel=5, shuffle=1, cname='lz4', quantize=0)
 chunklen := 2048; chunksize: 16384; blocksize: 0
[ -5 -5 -5 -5 -5 10 10 -10 8 9]

In [54]:
b[[1,2,4,1]] = -10
b

carray((10,), int64)
 nbytes := 80; cbytes := 16.00 KB; ratio: 0.00
 cparams := cparams(clevel=5, shuffle=1, cname='lz4', quantize=0)
 chunklen := 2048; chunksize: 16384; blocksize: 0
[ -5 -10 -10 -5 -10 10 10 -10 8 9]

However, you must be aware that modifying a carray is expensive:

In [55]:
a = np.arange(1e7)
b = bcolz.carray(a)

t_numpy = %timeit -o a[2] = 3
t_bcolz = %timeit -o b[2] = 3
ratio = t_numpy.best / t_bcolz.best
bcolz_vs_numpy["modify an array"] = ratio

print('\n* Modifying a carray is around {0}x times slower than modifying a numpy one'.format(round(1/ratio, 2)))

The slowest run took 51.52 times longer than the fastest. This could mean that an intermediate result is being cached.
10000000 loops, best of 3: 78.7 ns per loop
1000 loops, best of 3: 211 µs per loop

* Modifying a carray is around 2688.01x times slower than modifying a numpy one


However, modifying values inside the last chunk is much cheaper:

In [56]:
t_numpy = %timeit -o a[-1] = 3
t_bcolz = %timeit -o b[-1] = 3
ratio = t_numpy.best / t_bcolz.best
bcolz_vs_numpy["modify array's last chunk"] = ratio

print('\n* Modifying data in the last chunk of a carray is around {0}x times slower than modifying a numpy one'.format(round(1/ratio, 2)))

The slowest run took 50.81 times longer than the fastest. This could mean that an intermediate result is being cached.
10000000 loops, best of 3: 79.8 ns per loop
The slowest run took 5.25 times longer than the fastest. This could mean that an intermediate result is being cached.
100000 loops, best of 3: 7.81 µs per loop

* Modifying data in the last chunk of a carray is around 97.95x times slower than modifying a numpy one


So as you see, you should avoid modifications as much as possible (if you can) when using
carrays.
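The gap between the two timings above is consistent with a layout where older data lives in compressed chunks and only the newest items sit in a plain buffer. The sketch below models that assumption with the standard library; `set_item`, the chunk length of 4 and the use of `zlib` instead of Blosc are all illustrative choices, not bcolz internals.

```python
import zlib
from array import array

# Two "compressed chunks" plus an uncompressed tail (assumed layout)
chunklen = 4
chunks = [zlib.compress(array("d", [0.0, 1.0, 2.0, 3.0]).tobytes()),
          zlib.compress(array("d", [4.0, 5.0, 6.0, 7.0]).tobytes())]
tail = array("d", [8.0, 9.0])

def set_item(index, value):
    """Assign one value; compressed chunks need a full decompress/recompress."""
    nchunk, offset = divmod(index, chunklen)
    if nchunk == len(chunks):
        tail[offset] = value          # cheap: plain in-memory write
        return
    items = array("d")
    items.frombytes(zlib.decompress(chunks[nchunk]))   # decompress whole chunk
    items[offset] = value
    chunks[nchunk] = zlib.compress(items.tobytes())    # recompress whole chunk

set_item(2, 3.0)   # lands in a compressed chunk: expensive path
set_item(9, 3.0)   # lands in the tail: cheap path
```

A single-item write into a compressed chunk therefore pays for processing the whole chunk, which is why batching modifications, or avoiding them, pays off.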


## Multidimensional carrays

You can create multidimensional carrays too. Look at this example:

In [57]:
a = bcolz.zeros((2,3))
a

carray((2, 3), float64)
 nbytes := 48; cbytes := 15.98 KB; ratio: 0.00
 cparams := cparams(clevel=5, shuffle=1, cname='lz4', quantize=0)
 chunklen := 682; chunksize: 16368; blocksize: 0
[[ 0. 0. 0.]
 [ 0. 0. 0.]]

So, you can access any element in any dimension:

In [58]:
a[1]

array([ 0., 0., 0.])

In [59]:
a[1,::2]

array([ 0., 0.])

In [60]:
a[:,1]

array([ 0., 0.])

In [61]:
a[0,1] = 0

As you see, multidimensional carrays support the same multidimensional
indexes as their NumPy counterparts.

Also, you can use the `reshape()` method to set your desired shape to
an existing carray:

In [62]:
b = bcolz.arange(12).reshape((3,4))
b

carray((3, 4), int64)
 nbytes := 96; cbytes := 16.00 KB; ratio: 0.01
 cparams := cparams(clevel=5, shuffle=1, cname='lz4', quantize=0)
 chunklen := 512; chunksize: 16384; blocksize: 0
[[ 0 1 2 3]
 [ 4 5 6 7]
 [ 8 9 10 11]]

Iterators loop over the leading dimension:

In [63]:
[r for r in b]

[array([0, 1, 2, 3]), array([4, 5, 6, 7]), array([ 8, 9, 10, 11])]

And you can select columns there by using another indirection level:

In [64]:
[r[2] for r in b]

[2, 6, 10]

Above, the third column has been selected. Although for this case the
indexing is easier:

In [65]:
b[:,2]

array([ 2, 6, 10])


## Operating with carrays

Right now, you cannot operate with carrays directly (although that
might be implemented in the future):

In [66]:
x = bcolz.arange(1e7)

In [67]:
# Running the operation below will raise an error
# x + x

Instead, you should use the `eval` function:

In [68]:
y = bcolz.eval("x + x")
y

carray((10000000,), float64)
 nbytes := 76.29 MB; cbytes := 3.32 MB; ratio: 23.01
 cparams := cparams(clevel=5, shuffle=1, cname='lz4', quantize=0)
 chunklen := 65536; chunksize: 524288; blocksize: 32768
[ 0.00000000e+00 2.00000000e+00 4.00000000e+00 ..., 1.99999940e+07
 1.99999960e+07 1.99999980e+07]

You can also compute arbitrarily complex expressions in one shot:

In [69]:
y = bcolz.eval(".5*x**3 + 2.1*x**2")
y

carray((10000000,), float64)
 nbytes := 76.29 MB; cbytes := 41.16 MB; ratio: 1.85
 cparams := cparams(clevel=5, shuffle=1, cname='lz4', quantize=0)
 chunklen := 65536; chunksize: 524288; blocksize: 32768
[ 0.00000000e+00 2.60000000e+00 1.24000000e+01 ..., 4.99999760e+20
 4.99999910e+20 5.00000060e+20]

Note how the output of `eval()` is also a carray object. You can pass
other parameters of the carray constructor too. Let's force maximum
compression for the output:

In [70]:
y = bcolz.eval(".5*x**3 + 2.1*x**2", cparams=bcolz.cparams(9, shuffle=bcolz.BITSHUFFLE, cname="zlib"))
y

carray((10000000,), float64)
 nbytes := 76.29 MB; cbytes := 24.50 MB; ratio: 3.11
 cparams := cparams(clevel=9, shuffle=2, cname='zlib', quantize=0)
 chunklen := 65536; chunksize: 524288; blocksize: 524288
[ 0.00000000e+00 2.60000000e+00 1.24000000e+01 ..., 4.99999760e+20
 4.99999910e+20 5.00000060e+20]

Also, we can get a native numpy array out of the computation:

In [71]:
y = bcolz.eval(".5*x**3 + 2.1*x**2", out_flavor="numpy")
y

array([ 0.00000000e+00, 2.60000000e+00, 1.24000000e+01, ...,
 4.99999760e+20, 4.99999910e+20, 5.00000060e+20])

By default, `eval` will use the "numexpr" virtual machine if it is installed. If not, "dask" is used if installed. And if neither of these can be found, then the "python" interpreter is used (via NumPy). You can use the `vm` parameter to select the desired virtual machine ("numexpr", "dask" or "python"):

In [72]:
%timeit bcolz.eval(".5*x**3 + 2.1*x**2", vm="numexpr")

10 loops, best of 3: 71.4 ms per loop


In [73]:
%timeit bcolz.eval(".5*x**3 + 2.1*x**2", vm="dask")

1 loop, best of 3: 464 ms per loop


In [74]:
%timeit bcolz.eval(".5*x**3 + 2.1*x**2", vm="python")

1 loop, best of 3: 865 ms per loop
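The default selection order described above (numexpr, then dask, then plain Python) can be pictured with the sketch below. `pick_vm_sketch` is a hypothetical function written for illustration; it only checks which packages are importable and is not the logic bcolz itself runs.

```python
import importlib.util

def pick_vm_sketch():
    """Return the first available engine, in the order the text describes."""
    for name in ("numexpr", "dask"):
        if importlib.util.find_spec(name) is not None:
            return name
    return "python"   # fallback that needs no extra package

print(pick_vm_sketch())
```

The printed value depends on what is installed in your environment.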


Finally, `eval` lets you store the result directly on-disk in an efficient way (i.e. without temporaries):

In [75]:
bcolz.eval("x**3", out_flavor="carray", rootdir="mydir/eval_result")

carray((10000000,), float64)
 nbytes := 76.29 MB; cbytes := 41.02 MB; ratio: 1.86
 cparams := cparams(clevel=5, shuffle=1, cname='lz4', quantize=0)
 chunklen := 65536; chunksize: 524288; blocksize: 32768
 rootdir := 'mydir/eval_result'
 mode := 'a'
[ 0.00000000e+00 1.00000000e+00 8.00000000e+00 ..., 9.99999100e+20
 9.99999400e+20 9.99999700e+20]

To set your own defaults for `vm` and `out_flavor` globally or permanently,
see the Defaults chapter.



## `carray` metadata

carray implements several attributes, like `dtype`, `shape` and `ndim`,
that make it 'quack' like a NumPy array:

In [76]:
a = np.arange(1e7)
b = bcolz.carray(a)

In [77]:
b.dtype

dtype('float64')

In [78]:
b.shape

(10000000,)

In addition, it implements the `cbytes` attribute that tells how many
bytes in memory (or on-disk) the carray object is using:

In [79]:
b.cbytes

3476685

This figure is approximate and generally smaller than the original
(uncompressed) datasize, which can be accessed by retrieving the `nbytes` attribute:

In [80]:
b.nbytes

80000000

which is the same as for its equivalent NumPy array:

In [81]:
a.size*a.dtype.itemsize

80000000
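These byte counts tie back to the human-readable sizes that the representation prints. The sketch below redoes the arithmetic with the numbers shown above; the variable names are illustrative, and the conversion assumes that "MB" in the repr means 2**20 bytes, which is what the printed figures suggest.

```python
from array import array

nitems = 10000000                  # b.shape[0]
itemsize = array("d").itemsize     # 8 bytes, the size of a float64
nbytes = nitems * itemsize
cbytes = 3476685                   # value of b.cbytes printed above

MB = float(2 ** 20)                # assumed meaning of "MB" in the repr
print(nbytes)                      # 80000000
print(round(nbytes / MB, 2))       # 76.29
print(round(cbytes / MB, 2))       # 3.32
```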

If you would like to know the compression level used and other optional filters used by a particular object, you can read this information from the `cparams` read-only attribute:

In [82]:
b.cparams

cparams(clevel=5, shuffle=1, cname='lz4', quantize=0)

The default value of a carray is another attribute you would likely want to check before resizing a carray:

In [83]:
b.dflt

array(0.0)

You can access the `chunklen` (the length for each chunk) for this
carray:

In [84]:
b.chunklen

65536

For a complete list of public attributes of carray, see the section on
carray attributes.


## `carray` user attrs

Besides the regular attributes like `shape`, `dtype` or `chunklen`,
there is another set of attributes that can be added (and removed) by
the user in another name space. This space is accessible via the
special `attrs` attribute. In the following example we will trigger flushing
data to disk manually:

In [85]:
a = bcolz.carray([1,2], rootdir='mydir/my_carray', mode='w')
a.attrs

*no attrs*

As you see, by default there are no attributes attached to `attrs`.
Also, notice that the carray that we have created is persistent and
stored in the 'mydir/my_carray' directory. Let's add one attribute here:

In [86]:
a.attrs['myattr'] = 234
a.attrs

myattr : 234

We have now attached the 'myattr' attribute with the value 234. Let's add a couple more attributes:

In [87]:
a.attrs['temp'] = 23 
a.attrs['unit'] = 'Celsius'
a.attrs

unit : 'Celsius'
myattr : 234
temp : 23

Good, we have three of them now. You can attach as many as you want,
and the only current limitation is that they have to be serializable
via JSON.
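A quick way to check beforehand whether a value meets the JSON requirement is to try serializing it with the standard `json` module. `is_json_serializable` is a hypothetical helper written for this tutorial, not a bcolz function.

```python
import json

def is_json_serializable(value):
    """Return True if `value` survives json.dumps (hypothetical helper)."""
    try:
        json.dumps(value)
        return True
    except (TypeError, ValueError):
        return False

print(is_json_serializable(234))            # True
print(is_json_serializable("Celsius"))      # True
print(is_json_serializable({"a": [1, 2]}))  # True
print(is_json_serializable({1, 2, 3}))      # False: sets are not JSON types
```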

As the 'a' carray is persistent, it can be re-opened in another Python session once its data has been flushed to disk:

In [88]:
a.flush()

We could get our data back as follows:

In [89]:
a2 = bcolz.open(rootdir="mydir/my_carray")

Now, let's remove a couple of user attrs:

In [90]:
del a2.attrs['myattr']
del a2.attrs['unit']
a2.attrs

temp : 23

So, it is really easy to make use of this feature and complement
your data with (potentially persistent) metadata of your choice. Of
course, the `ctable` object also has this capability.


## Memory profiling

`carrays` normally consume less memory than their `numpy arrays`
counterparts but, as we said before, this depends heavily on the dataset
you are trying to store. For small arrays the overhead of a `carray`
becomes noticeable and it might even be bigger than the equivalent
`numpy array`, but keep in mind that `carrays` were designed with large
amounts of data in mind.

Please see the following notebook for more details about this topic.
- [carray memory profiling](tutorial_carray_memory_profile.ipynb)