# Tutorial on ctable objects
[Go to tutorials' index](tutorials.ipynb)

<a id='go to index'></a>
Index:
  - <a href='#Creating a ctable'>Creating a ctable</a>
  - <a href='#Accessing and setting rows'>Accessing and setting rows</a>
  - <a href='#Adding and deleting columns'>Adding and deleting columns</a>
  - <a href='#Iterating over ctable data'>Iterating over ctable data</a>
  - <a href='#Iterating over the output of conditions along columns'>Iterating over the output of conditions along columns</a>
  - <a href='#Performing operations on ctable columns'>Performing operations on ctable columns</a>

The bcolz package comes with a handy object that arranges data by
column (and not by row, as in NumPy's structured arrays).  This allows
for much better performance for walking tabular data by column and
also for adding and deleting columns.

In [1]:
from __future__ import print_function
import sys
if sys.version_info[0] == 2:
    range = xrange

import numpy as np
import bcolz

bcolz.print_versions()

-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=
bcolz version:     1.1.1.dev13+dirty
bcolz git info:    1.1.0-15-g6565371
NumPy version:     1.11.0
Blosc version:     1.9.2 ($Date:: 2016-06-08 #$)
Blosc compressors: ['blosclz', 'lz4', 'lz4hc', 'snappy', 'zlib']
Numexpr version:   2.6.1.dev0
Dask version:      0.9.0
Python version:    2.7.12 |Continuum Analytics, Inc.| (default, Jun 29 2016, 11:08:50) 
[GCC 4.4.7 20120313 (Red Hat 4.4.7-1)]
Platform:          linux2-x86_64
Byte-ordering:     little
Detected cores:    4
-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=


In [2]:
# Clear mydir, needed in case you run this tutorial multiple times
!rm -rf mydir
!mkdir mydir

<a id='Creating a ctable'></a>
## Creating a ctable
<a href='#go to index'>Go to index</a>

You can build ctable objects in many different ways, but perhaps the
easiest one is using the `fromiter` constructor:

In [3]:
N = int(1e6)
ct = bcolz.fromiter(((i,i*i) for i in range(N)), dtype="i4,f8", count=N)
ct

ctable((1000000,), [('f0', '<i4'), ('f1', '<f8')])
  nbytes: 11.44 MB; cbytes: 2.30 MB; ratio: 4.97
  cparams := cparams(clevel=5, shuffle=1, cname='lz4', quantize=0)
[(0, 0.0) (1, 1.0) (2, 4.0) ..., (999997, 999994000009.0)
 (999998, 999996000004.0) (999999, 999998000001.0)]

Just like a regular carray, a ctable can be stored on disk as well:

In [4]:
ct_disk = bcolz.fromiter(((i,i*i) for i in range(N)), dtype="i4,f8", count=N, rootdir="mydir/ct_disk")
ct_disk

ctable((1000000,), [('f0', '<i4'), ('f1', '<f8')])
  nbytes: 11.44 MB; cbytes: 2.30 MB; ratio: 4.97
  cparams := cparams(clevel=5, shuffle=1, cname='lz4', quantize=0)
  rootdir := 'mydir/ct_disk'
[(0, 0.0) (1, 1.0) (2, 4.0) ..., (999997, 999994000009.0)
 (999998, 999996000004.0) (999999, 999998000001.0)]

**NOTE:** If you wish to create an empty ctable and append data afterwards, this is possible using `bcolz.zeros` with a length of zero (although this is significantly slower).  If you prefer to do that, we encourage you to use the `with` statement, which takes care of flushing data to disk once you are done appending:

In [5]:
with bcolz.zeros(0, dtype="i4,f8", rootdir="mydir/ct_disk2") as ct_disk2:
    for i in range(20000):
        ct_disk2.append((i, i**2))
ct_disk2

ctable((20000,), [('f0', '<i4'), ('f1', '<f8')])
  nbytes: 234.38 KB; cbytes: 68.68 KB; ratio: 3.41
  cparams := cparams(clevel=5, shuffle=1, cname='lz4', quantize=0)
  rootdir := 'mydir/ct_disk2'
[(0, 0.0) (1, 1.0) (2, 4.0) ..., (19997, 399880009.0) (19998, 399920004.0)
 (19999, 399960001.0)]

<a id='Accessing and setting rows'></a>
## Accessing and setting rows
<a href='#go to index'>Go to index</a>

The ctable object supports the most common indexing operations in
NumPy:

In [6]:
ct[1]

(1, 1.0)

In [7]:
type(ct[1])

numpy.void

In [8]:
ct[1:6]

array([(1, 1.0), (2, 4.0), (3, 9.0), (4, 16.0), (5, 25.0)], 
      dtype=[('f0', '<i4'), ('f1', '<f8')])

The first thing to keep in mind is that, as with `carray`
objects, the result of an indexing operation is a native NumPy object
(in the cases above, a scalar and a structured array).

Fancy indexing is also supported:

In [9]:
ct[[1,6,13]]

array([(1, 1.0), (6, 36.0), (13, 169.0)], 
      dtype=[('f0', '<i4'), ('f1', '<f8')])

You can even pass complex boolean expressions as an index:

In [10]:
ct["(f0>0) & (f1<10)"]

array([(1, 1.0), (2, 4.0), (3, 9.0)], 
      dtype=(numpy.record, [('f0', '<i4'), ('f1', '<f8')]))

Note that conditions over columns are expressed as string expressions
(in order to use either Numexpr or NumPy under the hood), and that the column names
are understood correctly.
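
To make the idea of a string condition more concrete, here is a minimal pure-Python sketch that evaluates the same expression row by row, with each column name bound to the value it takes in the current row. It only illustrates the semantics: bcolz hands the expression to Numexpr or NumPy rather than looping in Python, and `select_rows` is a hypothetical helper, not part of bcolz.

```python
# Toy illustration only: bcolz does not evaluate conditions this way.
columns = {
    "f0": list(range(10)),
    "f1": [float(i * i) for i in range(10)],
}

def select_rows(columns, expression):
    """Return the rows (as tuples) for which `expression` is true."""
    names = list(columns)
    code = compile(expression, "<condition>", "eval")
    selected = []
    for values in zip(*(columns[name] for name in names)):
        # Bind each column name to the value it has in the current row.
        namespace = dict(zip(names, values))
        if eval(code, {"__builtins__": {}}, namespace):
            selected.append(values)
    return selected

result = select_rows(columns, "(f0>0) & (f1<10)")
print(result)
```

The selected rows are the same three rows that the indexing example above returns.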

Setting rows is also supported:

In [11]:
ct[1] = (0,0)
ct

ctable((1000000,), [('f0', '<i4'), ('f1', '<f8')])
  nbytes: 11.44 MB; cbytes: 2.30 MB; ratio: 4.97
  cparams := cparams(clevel=5, shuffle=1, cname='lz4', quantize=0)
[(0, 0.0) (0, 0.0) (2, 4.0) ..., (999997, 999994000009.0)
 (999998, 999996000004.0) (999999, 999998000001.0)]

In [12]:
ct[1:6]

array([(0, 0.0), (2, 4.0), (3, 9.0), (4, 16.0), (5, 25.0)], 
      dtype=[('f0', '<i4'), ('f1', '<f8')])

And in combination with fancy indexing too:

In [13]:
ct[[1,6,13]] = (1,1)
ct[[1,6,13]]

array([(1, 1.0), (1, 1.0), (1, 1.0)], 
      dtype=[('f0', '<i4'), ('f1', '<f8')])

In [14]:
ct["(f0>=0) & (f1<10)"] = (2,2)
ct[:7]

array([(2, 2.0), (2, 2.0), (2, 2.0), (2, 2.0), (4, 16.0), (5, 25.0),
       (2, 2.0)], 
      dtype=[('f0', '<i4'), ('f1', '<f8')])

As you may have noticed, fancy indexing in combination with conditions
is a very powerful feature.

<a id='Adding and deleting columns'></a>
## Adding and deleting columns
<a href='#go to index'>Go to index</a>

Adding and deleting columns is easy and, due to the column-wise data
arrangement, very efficient.  Let's add a new column on an existing
ctable:

In [15]:
ct = bcolz.fromiter(((i,i*i) for i in range(N)), dtype="i4,f8", count=N)
new_col = np.linspace(0, 1, N)
ct.addcol(new_col)

Now, remove the already existing 'f1' column:

In [16]:
ct.delcol('f1')
ct

ctable((1000000,), [('f0', '<i4'), ('f2', '<f8')])
  nbytes: 11.44 MB; cbytes: 2.29 MB; ratio: 4.99
  cparams := cparams(clevel=5, shuffle=1, cname='lz4', quantize=0)
[(0, 0.0) (1, 1.000001000001e-06) (2, 2.000002000002e-06) ...,
 (999997, 0.9999979999979999) (999998, 0.9999989999989999) (999999, 1.0)]

As said, adding and deleting columns is very cheap (just adding or deleting keys in a Python dict), so don't be afraid of using this feature as much as you like.
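
The following toy sketch shows why this is cheap when columns live in a dict: adding or deleting a column touches a single key and leaves the data of the other columns alone. `ToyColumnTable` is a hypothetical class written for illustration, assuming plain Python lists as columns; it is not how bcolz stores its compressed data.

```python
class ToyColumnTable:
    """Hypothetical column store: one Python list per column, kept in a dict."""

    def __init__(self, **columns):
        lengths = {len(values) for values in columns.values()}
        if len(lengths) > 1:
            raise ValueError("all columns must have the same length")
        self.cols = dict(columns)

    def __len__(self):
        return len(next(iter(self.cols.values()))) if self.cols else 0

    def addcol(self, name, values):
        # Only a new dict entry is created; existing columns are untouched.
        if self.cols and len(values) != len(self):
            raise ValueError("column length does not match the table")
        self.cols[name] = values

    def delcol(self, name):
        # Dropping a column removes one key; no row data is rewritten.
        del self.cols[name]

    def row(self, i):
        # A row is assembled on demand from the i-th item of every column.
        return tuple(values[i] for values in self.cols.values())


table = ToyColumnTable(f0=[0, 1, 2], f1=[0.0, 1.0, 4.0])
table.addcol("f2", [0.0, 0.5, 1.0])
table.delcol("f1")
print(list(table.cols), table.row(2))
```

A row-wise layout, such as a NumPy structured array, would instead have to build a new array with a different record layout for each of these operations.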

<a id='Iterating over ctable data'></a>
## Iterating over ctable data
<a href='#go to index'>Go to index</a>

You can make use of the `iter()` method in order to easily iterate
over the values of a ctable.  `iter()` has support for start, stop and
step parameters:

In [17]:
# ct still holds the 'f0' and 'f2' columns from the previous section
[row for row in ct.iter(1,10,3)]

[row(f0=1, f2=1.000001000001e-06),
 row(f0=4, f2=4.000004000004e-06),
 row(f0=7, f2=7.000007000007e-06)]

Note how the data is returned as `namedtuple` objects of type
``row``.  This allows you to iterate the fields more easily by using
field names:

In [18]:
[(f0,f2) for f0,f2 in ct.iter(1,10,3)]

[(1, 1.000001000001e-06), (4, 4.000004000004e-06), (7, 7.000007000007e-06)]

You can also use the ``[:]`` accessor to get rid of the ``row``
namedtuple, and return just bare tuples:

In [19]:
[row[:] for row in ct.iter(1,10,3)]

[(1, 1.000001000001e-06), (4, 4.000004000004e-06), (7, 7.000007000007e-06)]

Also, you can select specific fields to be read via the `outcols`
parameter:

In [20]:
[row for row in ct.iter(1,10,3, outcols='f0')]

[row(f0=1), row(f0=4), row(f0=7)]

In [21]:
[(nr,f0) for nr,f0 in ct.iter(1,10,3, outcols='nrow__, f0')]

[(1, 1), (4, 4), (7, 7)]

Please note the use of the special 'nrow__' label for referring to
the current row.
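
As a rough mental model of the behaviour described in this section, the sketch below rebuilds it in pure Python with `namedtuple` and `itertools.islice`. The function `iter_rows` and the dict-of-lists table are assumptions made for illustration; only the parameter names (start, stop, step, `outcols`) and the 'nrow__' label come from the examples above.

```python
from collections import namedtuple
from itertools import islice

def iter_rows(columns, start=0, stop=None, step=1, outcols=None):
    """Yield namedtuple rows from a dict of equal-length columns."""
    names = outcols.replace(",", " ").split() if outcols else list(columns)
    Row = namedtuple("row", names)
    nrows = len(next(iter(columns.values())))
    # The special 'nrow__' label stands for the row number itself.
    sources = [range(nrows) if name == "nrow__" else columns[name]
               for name in names]
    for values in islice(zip(*sources), start, stop, step):
        yield Row(*values)

columns = {"f0": list(range(20)), "f1": [float(i * i) for i in range(20)]}
rows = list(iter_rows(columns, 1, 10, 3))
numbered = list(iter_rows(columns, 1, 10, 3, outcols="nrow__, f0"))
print(rows)
print(numbered)
```

Each yielded row can be read by field name (`rows[0].f0`), unpacked, or turned into a bare tuple with `[:]`.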

<a id='Iterating over the output of conditions along columns'></a>
## Iterating over the output of conditions along columns
<a href='#go to index'>Go to index</a>

One of the most powerful capabilities of the ctable is the ability to
iterate over the rows whose fields fulfill certain conditions (without
the need to put the results in a NumPy container, as described in the
previous section).  This can be very useful for performing operations
on very large ctables without consuming lots of memory.

Here is an example:

In [22]:
ct = bcolz.fromiter(((i,i*i) for i in range(N)), dtype="i4,f8", count=N)
[row for row in ct.where("(f0>0) & (f1<10)")]

[row(f0=1, f1=1.0), row(f0=2, f1=4.0), row(f0=3, f1=9.0)]

And by using the `outcols` parameter, you can specify the fields that
you want to be returned:

In [23]:
[row for row in ct.where("(f0>0) & (f1<10)", outcols="f1")]

[row(f1=1.0), row(f1=4.0), row(f1=9.0)]

You can even specify the row number fulfilling the condition:

In [24]:
[(f1,nr) for f1,nr in ct.where("(f0>0) & (f1<10)", outcols="f1, nrow__")]

[(1.0, 1), (4.0, 2), (9.0, 3)]

You can also iterate so that you get blocks of results:

In [25]:
[br for br in ct.whereblocks("(f0>0) & (f1<5e9)")]

[array([(1, 1.0), (2, 4.0), (3, 9.0), ..., (32766, 1073610756.0),
        (32767, 1073676289.0), (32768, 1073741824.0)], 
       dtype=[('f0', '<i4'), ('f1', '<f8')]),
 array([(32769, 1073807361.0), (32770, 1073872900.0), (32771, 1073938441.0),
        ..., (65534, 4294705156.0), (65535, 4294836225.0),
        (65536, 4294967296.0)], 
       dtype=[('f0', '<i4'), ('f1', '<f8')]),
 array([(65537, 4295098369.0), (65538, 4295229444.0), (65539, 4295360521.0),
        ..., (70708, 4999621264.0), (70709, 4999762681.0),
        (70710, 4999904100.0)], 
       dtype=[('f0', '<i4'), ('f1', '<f8')])]

In this case, three blocks of a maximum length of 32768 have been returned.  You can also specify your own block length via the `blen` parameter:

In [26]:
[br for br in ct.whereblocks("(f0>0) & (f1<5e9)", blen=15000)]

[array([(1, 1.0), (2, 4.0), (3, 9.0), ..., (14998, 224940004.0),
        (14999, 224970001.0), (15000, 225000000.0)], 
       dtype=[('f0', '<i4'), ('f1', '<f8')]),
 array([(15001, 225030001.0), (15002, 225060004.0), (15003, 225090009.0),
        ..., (29998, 899880004.0), (29999, 899940001.0),
        (30000, 900000000.0)], 
       dtype=[('f0', '<i4'), ('f1', '<f8')]),
 array([(30001, 900060001.0), (30002, 900120004.0), (30003, 900180009.0),
        ..., (44998, 2024820004.0), (44999, 2024910001.0),
        (45000, 2025000000.0)], 
       dtype=[('f0', '<i4'), ('f1', '<f8')]),
 array([(45001, 2025090001.0), (45002, 2025180004.0), (45003, 2025270009.0),
        ..., (59998, 3599760004.0), (59999, 3599880001.0),
        (60000, 3600000000.0)], 
       dtype=[('f0', '<i4'), ('f1', '<f8')]),
 array([(60001, 3600120001.0), (60002, 3600240004.0), (60003, 3600360009.0),
        ..., (70708, 4999621264.0), (70709, 4999762681.0),
        (70710, 4999904100.0)], 
       dtype=[('f0', '<i4'), ('f1', '<f8')])]
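
The block sizes seen in the two outputs above follow from simple chunking of the matching rows. Here is a minimal pure-Python sketch of that idea; `where_blocks` is a hypothetical generator written for illustration (it yields plain lists, whereas `whereblocks` yields NumPy structured arrays), and the default of 32768 is taken from the block length observed above.

```python
def where_blocks(rows, predicate, blen=32768):
    """Yield lists of at most `blen` rows that satisfy `predicate`."""
    block = []
    for row in rows:
        if predicate(row):
            block.append(row)
            if len(block) == blen:
                yield block
                block = []
    if block:
        # Emit the final, possibly shorter, block.
        yield block

def make_rows(n=100000):
    # Same (i, i*i) layout as the tutorial table, on fewer rows.
    return ((i, float(i * i)) for i in range(n))

def condition(row):
    return row[0] > 0 and row[1] < 5e9

default_sizes = [len(b) for b in where_blocks(make_rows(), condition)]
custom_sizes = [len(b) for b in where_blocks(make_rows(), condition, blen=15000)]
print(default_sizes)
print(custom_sizes)
```

Only one block is held in memory at a time, which is what makes this style of iteration attractive for large tables.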

<a id='Performing operations on ctable columns'></a>
## Performing operations on ctable columns
<a href='#go to index'>Go to index</a>

The ctable object also has an `eval()` method, which is
handy for carrying out operations among columns.

The best way to illustrate the point is with an example:

In [27]:
ct.eval("cos((3+f0)/sqrt(2*f1))")

array(-0.7076921035197548)

Note an exception to the usual behaviour of ctable methods here: by
default the result is a compressed bcolz object, and not a NumPy array.
This is by design: the output of `eval()` has the same length as the
ctable, so it can be pretty large, and compression may help to reduce
its storage needs.

In fact, if you are already dealing with large ctables and expect the output to be large too, you can store the result directly on disk by passing `rootdir`:

In [28]:
ct.eval("cos((3+f0)/sqrt(2*f1))", rootdir="mydir/ct_disk3")

array(-0.7076921035197548)

However, if you want a NumPy array as output, you can always request it via the `out_flavor` parameter:

In [29]:
ct.eval("cos((3+f0)/sqrt(2*f1))", out_flavor="numpy")

array(-0.7076921035197548)
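
To spell out what the expression computes, the sketch below applies the same formula element by element in plain Python, producing one output value per row. It is only an illustration on five hand-made rows, assuming the usual (i, i*i) column layout; row 0 is left out because `f1` is zero there and plain Python would raise a division error.

```python
import math

# Rows 1..5 of a table with f0 = i and f1 = i*i.
f0 = list(range(1, 6))
f1 = [float(i * i) for i in f0]

# Same formula as the string expression, applied to each (f0, f1) pair.
out = [math.cos((3 + a) / math.sqrt(2 * b)) for a, b in zip(f0, f1)]
print(out)
```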

## Fetching data based on conditions

Finally, there is a powerful way to get data that you are interested in while using conditions:

In [30]:
ct.fetchwhere("(f0>0) & (f1<5e9)")

ctable((70710,), [('f0', '<i4'), ('f1', '<f8')])
  nbytes: 828.63 KB; cbytes: 184.79 KB; ratio: 4.48
  cparams := cparams(clevel=5, shuffle=1, cname='lz4', quantize=0)
[(1, 1.0) (2, 4.0) (3, 9.0) ..., (70708, 4999621264.0)
 (70709, 4999762681.0) (70710, 4999904100.0)]

You can also skip the first rows fulfilling the condition and limit the total number of rows returned:

In [31]:
ct.fetchwhere("(f0>0) & (f1<5e9)", skip=10000, limit=2000)

ctable((2000,), [('f0', '<i4'), ('f1', '<f8')])
  nbytes: 23.44 KB; cbytes: 32.00 KB; ratio: 0.73
  cparams := cparams(clevel=5, shuffle=1, cname='lz4', quantize=0)
[(10001, 100020001.0) (10002, 100040004.0) (10003, 100060009.0) ...,
 (11998, 143952004.0) (11999, 143976001.0) (12000, 144000000.0)]
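
The effect of `skip` and `limit` can be pictured as slicing the stream of matching rows. The sketch below reproduces the row range shown above with `itertools.islice`; `fetch_where` is a hypothetical pure-Python helper written for illustration and returns a plain list rather than a ctable.

```python
from itertools import islice

def fetch_where(rows, predicate, skip=0, limit=None):
    """Return matching rows, skipping the first `skip` and keeping `limit`."""
    matches = (row for row in rows if predicate(row))
    stop = None if limit is None else skip + limit
    return list(islice(matches, skip, stop))

# Same (i, i*i) layout as the tutorial table, on fewer rows.
rows = ((i, float(i * i)) for i in range(100000))
out = fetch_where(rows, lambda r: r[0] > 0 and r[1] < 5e9,
                  skip=10000, limit=2000)
print(len(out), out[0], out[-1])
```

Note that `skip` counts matching rows, not table rows: row 0 does not fulfill the condition, so skipping 10000 matches lands on row 10001.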

Or get a NumPy array too:

In [32]:
ct.fetchwhere("(f0>0) & (f1<5e9)", skip=10000, limit=2000, out_flavor="numpy")

array([(10001, 100020001.0), (10002, 100040004.0), (10003, 100060009.0),
       ..., (11998, 143952004.0), (11999, 143976001.0),
       (12000, 144000000.0)], 
      dtype=[('f0', '<i4'), ('f1', '<f8')])

Although perhaps using default contexts is a more elegant way to do the same thing:

In [33]:
with bcolz.defaults_ctx(out_flavor="numpy"):
    out = ct.fetchwhere("(f0>0) & (f1<5e9)", skip=10000, limit=2000)
out

array([(10001, 100020001.0), (10002, 100040004.0), (10003, 100060009.0),
       ..., (11998, 143952004.0), (11999, 143976001.0),
       (12000, 144000000.0)], 
      dtype=[('f0', '<i4'), ('f1', '<f8')])

## That's all folks!

That's all for this tutorial section.  Now have a look at the [reference section](http://bcolz.blosc.org/reference.html) to get a grasp of all the functionality that bcolz offers.  In general, ctable objects inherit most of the properties of carrays, so make sure that you master all the weaponry in carrays before getting too deep into ctables.