{ "cells": [ { "cell_type": "markdown", "metadata": {}, "source": [ "# Tutorial on ctable objects\n", "[Go to tutorials' index](tutorials.ipynb)" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "\n", "Index:\n", " 1. Creating a ctable\n", " - Accessing and setting rows\n", " - Adding and deleting columns\n", " - Iterating over ctable data\n", " - Iterating over the output of conditions along columns\n", " - Performing operations on ctable columns" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "The bcolz package comes with a handy object that arranges data by\n", "column (and not by row, as in NumPy's structured arrays). This allows\n", "for much better performance for walking tabular data by column and\n", "also for adding and deleting columns." ] }, { "cell_type": "code", "execution_count": 1, "metadata": { "collapsed": false }, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=\n", "bcolz version: 1.1.1.dev13+dirty\n", "bcolz git info: 1.1.0-15-g6565371\n", "NumPy version: 1.11.0\n", "Blosc version: 1.9.2 ($Date:: 2016-06-08 #$)\n", "Blosc compressors: ['blosclz', 'lz4', 'lz4hc', 'snappy', 'zlib']\n", "Numexpr version: 2.6.1.dev0\n", "Dask version: 0.9.0\n", "Python version: 2.7.12 |Continuum Analytics, Inc.| (default, Jun 29 2016, 11:08:50) \n", "[GCC 4.4.7 20120313 (Red Hat 4.4.7-1)]\n", "Platform: linux2-x86_64\n", "Byte-ordering: little\n", "Detected cores: 4\n", "-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=\n" ] } ], "source": [ "from __future__ import print_function\n", "\n", "import numpy as np\n", "import bcolz\n", "\n", "bcolz.print_versions()" ] }, { "cell_type": "code", "execution_count": 2, "metadata": { "collapsed": false }, "outputs": [], "source": [ "# Clear mydir, needed in case you run this tutorial multiple times\n", "!rm -rf mydir\n", "!mkdir mydir" ] }, { "cell_type": "markdown", "metadata": 
{}, "source": [ "\n", "## Creating a ctable\n", "Go to index" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "You can build ctable objects in many different ways, but perhaps the\n", "easiest one is using the `fromiter` constructor:" ] }, { "cell_type": "code", "execution_count": 3, "metadata": { "collapsed": false, "scrolled": true }, "outputs": [ { "data": { "text/plain": [ "ctable((1000000,), [('f0', '\n", "## Accessing and setting rows\n", "Go to index" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "The ctable object supports the most common indexing operations in\n", "NumPy:" ] }, { "cell_type": "code", "execution_count": 6, "metadata": { "collapsed": false }, "outputs": [ { "data": { "text/plain": [ "(1, 1.0)" ] }, "execution_count": 6, "metadata": {}, "output_type": "execute_result" } ], "source": [ "ct[1]" ] }, { "cell_type": "code", "execution_count": 7, "metadata": { "collapsed": false }, "outputs": [ { "data": { "text/plain": [ "numpy.void" ] }, "execution_count": 7, "metadata": {}, "output_type": "execute_result" } ], "source": [ "type(ct[1])" ] }, { "cell_type": "code", "execution_count": 8, "metadata": { "collapsed": false }, "outputs": [ { "data": { "text/plain": [ "array([(1, 1.0), (2, 4.0), (3, 9.0), (4, 16.0), (5, 25.0)], \n", " dtype=[('f0', '0) & (f1<10)\"]" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Note that conditions over columns are expressed as string expressions\n", "(in order to use either Numexpr or NumPy under the hood), and that the column names\n", "are understood correctly.\n", "\n", "Setting rows is also supported:" ] }, { "cell_type": "code", "execution_count": 11, "metadata": { "collapsed": false }, "outputs": [ { "data": { "text/plain": [ "ctable((1000000,), [('f0', '=0) & (f1<10)\"] = (2,2)\n", "ct[:7]" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "As you may have noticed, fancy indexing in combination with conditions\n", "is a very powerful feature." 
] }, { "cell_type": "markdown", "metadata": {}, "source": [ "\n", "## Adding and deleting columns\n", "Go to index" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Adding and deleting columns is easy and, due to the column-wise data\n", "arrangement, very efficient. Let's add a new column to an existing\n", "ctable:" ] }, { "cell_type": "code", "execution_count": 15, "metadata": { "collapsed": false }, "outputs": [], "source": [ "ct = bcolz.fromiter(((i,i*i) for i in xrange(N)), dtype=\"i4,f8\", count=N)\n", "new_col = np.linspace(0, 1, N)\n", "ct.addcol(new_col)" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Now, remove the already existing 'f1' column:" ] }, { "cell_type": "code", "execution_count": 16, "metadata": { "collapsed": false }, "outputs": [ { "data": { "text/plain": [ "ctable((1000000,), [('f0', '\n", "## Iterating over ctable data\n", "Go to index" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "You can use the `iter()` method to easily iterate\n", "over the values of a ctable. `iter()` has support for start, stop and\n", "step parameters:" ] }, { "cell_type": "code", "execution_count": 17, "metadata": { "collapsed": false }, "outputs": [ { "data": { "text/plain": [ "[row(f0=1, f2=1.000001000001e-06),\n", " row(f0=4, f2=4.000004000004e-06),\n", " row(f0=7, f2=7.000007000007e-06)]" ] }, "execution_count": 17, "metadata": {}, "output_type": "execute_result" } ], "source": [ "[row for row in ct.iter(1,10,3)]" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Note how the data is returned as `namedtuple` objects of type\n", "``row``. 
This allows you to iterate the fields more easily by using\n", "field names:" ] }, { "cell_type": "code", "execution_count": 18, "metadata": { "collapsed": false }, "outputs": [ { "data": { "text/plain": [ "[(1, 1.000001000001e-06), (4, 4.000004000004e-06), (7, 7.000007000007e-06)]" ] }, "execution_count": 18, "metadata": {}, "output_type": "execute_result" } ], "source": [ "[(f0,f2) for f0,f2 in ct.iter(1,10,3)]" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "You can also use the ``[:]`` accessor to get rid of the ``row``\n", "namedtuple, and return just bare tuples:" ] }, { "cell_type": "code", "execution_count": 19, "metadata": { "collapsed": false }, "outputs": [ { "data": { "text/plain": [ "[(1, 1.000001000001e-06), (4, 4.000004000004e-06), (7, 7.000007000007e-06)]" ] }, "execution_count": 19, "metadata": {}, "output_type": "execute_result" } ], "source": [ "[row[:] for row in ct.iter(1,10,3)]" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Also, you can select specific fields to be read via the `outcols`\n", "parameter:" ] }, { "cell_type": "code", "execution_count": 20, "metadata": { "collapsed": false }, "outputs": [ { "data": { "text/plain": [ "[row(f0=1), row(f0=4), row(f0=7)]" ] }, "execution_count": 20, "metadata": {}, "output_type": "execute_result" } ], "source": [ "[row for row in ct.iter(1,10,3, outcols='f0')]" ] }, { "cell_type": "code", "execution_count": 21, "metadata": { "collapsed": false }, "outputs": [ { "data": { "text/plain": [ "[(1, 1), (4, 4), (7, 7)]" ] }, "execution_count": 21, "metadata": {}, "output_type": "execute_result" } ], "source": [ "[(nr,f0) for nr,f0 in ct.iter(1,10,3, outcols='nrow__, f0')]" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Please note the use of the special 'nrow__' label for referring to\n", "the current row." 
] }, { "cell_type": "markdown", "metadata": {}, "source": [ "\n", "## Iterating over the output of conditions along columns\n", "Go to index" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "One of the most powerful capabilities of the ctable is the ability to\n", "iterate over the rows whose fields fulfill certain conditions (without\n", "the need to put the results in a NumPy container, as described in the\n", "previous section). This can be very useful for performing operations \n", "on very large ctables without consuming lots of storage space.\n", "\n", "Here is an example:" ] }, { "cell_type": "code", "execution_count": 22, "metadata": { "collapsed": false }, "outputs": [ { "data": { "text/plain": [ "[row(f0=1, f1=1.0), row(f0=2, f1=4.0), row(f0=3, f1=9.0)]" ] }, "execution_count": 22, "metadata": {}, "output_type": "execute_result" } ], "source": [ "ct = bcolz.fromiter(((i,i*i) for i in xrange(N)), dtype=\"i4,f8\", count=N)\n", "[row for row in ct.where(\"(f0>0) & (f1<10)\")]" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "And by using the `outcols` parameter, you can specify the fields that\n", "you want to be returned:" ] }, { "cell_type": "code", "execution_count": 23, "metadata": { "collapsed": false }, "outputs": [ { "data": { "text/plain": [ "[row(f1=1.0), row(f1=4.0), row(f1=9.0)]" ] }, "execution_count": 23, "metadata": {}, "output_type": "execute_result" } ], "source": [ "[row for row in ct.where(\"(f0>0) & (f1<10)\", outcols=\"f1\")]" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "You can even request the number of each row fulfilling the condition:" ] }, { "cell_type": "code", "execution_count": 24, "metadata": { "collapsed": false }, "outputs": [ { "data": { "text/plain": [ "[(1.0, 1), (4.0, 2), (9.0, 3)]" ] }, "execution_count": 24, "metadata": {}, "output_type": "execute_result" } ], "source": [ "[(f1,nr) for f1,nr in ct.where(\"(f0>0) & (f1<10)\", outcols=\"f1, nrow__\")]" ] }, { "cell_type": 
"markdown", "metadata": {}, "source": [ "You can also iterate so that you get blocks of results:" ] }, { "cell_type": "code", "execution_count": 25, "metadata": { "collapsed": false }, "outputs": [ { "data": { "text/plain": [ "[array([(1, 1.0), (2, 4.0), (3, 9.0), ..., (32766, 1073610756.0),\n", " (32767, 1073676289.0), (32768, 1073741824.0)], \n", " dtype=[('f0', '0) & (f1<5e9)\")]" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "In this case, three blocks of a maximum length of 32768 have been returned. You can also specify your own block length via the `blen` parameter:" ] }, { "cell_type": "code", "execution_count": 26, "metadata": { "collapsed": false }, "outputs": [ { "data": { "text/plain": [ "[array([(1, 1.0), (2, 4.0), (3, 9.0), ..., (14998, 224940004.0),\n", " (14999, 224970001.0), (15000, 225000000.0)], \n", " dtype=[('f0', '0) & (f1<5e9)\", blen=15000)]" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "\n", "## Performing operations on ctable columns\n", "Go to index" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "The ctable object also has an `eval()` method, which is \n", "handy for carrying out operations among columns.\n", "\n", "The best way to illustrate this is with an example:" ] }, { "cell_type": "code", "execution_count": 27, "metadata": { "collapsed": false }, "outputs": [ { "data": { "text/plain": [ "array(-0.7076921035197548)" ] }, "execution_count": 27, "metadata": {}, "output_type": "execute_result" } ], "source": [ "ct.eval(\"cos((3+f0)/sqrt(2*f1))\")" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Here, one can see an exception in the behaviour of ctable methods: the\n", "resulting output is a ctable, and not a NumPy structured array. \n", "It was designed this way because the output of `eval()` has \n", "the same length as the ctable, and thus it can be pretty large, \n", "so compression may help to reduce its storage needs." 
] }, { "cell_type": "markdown", "metadata": {}, "source": [ "In fact, if you are already dealing with large ctables, and you expect the output to be large too, it is always possible to store the result in a ctable that lives on disk:" ] }, { "cell_type": "code", "execution_count": 28, "metadata": { "collapsed": false }, "outputs": [ { "data": { "text/plain": [ "array(-0.7076921035197548)" ] }, "execution_count": 28, "metadata": {}, "output_type": "execute_result" } ], "source": [ "ct.eval(\"cos((3+f0)/sqrt(2*f1))\", rootdir=\"mydir/ct_disk3\")" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "However, if you want a NumPy structured array as output, you can specify that via the `out_flavor` parameter:" ] }, { "cell_type": "code", "execution_count": 29, "metadata": { "collapsed": false }, "outputs": [ { "data": { "text/plain": [ "array(-0.7076921035197548)" ] }, "execution_count": 29, "metadata": {}, "output_type": "execute_result" } ], "source": [ "ct.eval(\"cos((3+f0)/sqrt(2*f1))\", out_flavor=\"numpy\")" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "## Fetching data based on conditions" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Finally, there is a powerful way to fetch the data that you are interested in by using conditions:" ] }, { "cell_type": "code", "execution_count": 30, "metadata": { "collapsed": false }, "outputs": [ { "data": { "text/plain": [ "ctable((70710,), [('f0', '0) & (f1<5e9)\")" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "You can also skip the first rows fulfilling the condition and limit the total number of rows returned:" ] }, { "cell_type": "code", "execution_count": 31, "metadata": { "collapsed": false }, "outputs": [ { "data": { "text/plain": [ "ctable((2000,), [('f0', '0) & (f1<5e9)\", skip=10000, limit=2000)" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Or get a NumPy array instead:" ] }, { "cell_type": "code", "execution_count": 32, "metadata": { 
"collapsed": false }, "outputs": [ { "data": { "text/plain": [ "array([(10001, 100020001.0), (10002, 100040004.0), (10003, 100060009.0),\n", " ..., (11998, 143952004.0), (11999, 143976001.0),\n", " (12000, 144000000.0)], \n", " dtype=[('f0', '0) & (f1<5e9)\", skip=10000, limit=2000, out_flavor=\"numpy\")" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Although perhaps using default contexts is a more elegant way to do the same thing:" ] }, { "cell_type": "code", "execution_count": 33, "metadata": { "collapsed": false }, "outputs": [ { "data": { "text/plain": [ "array([(10001, 100020001.0), (10002, 100040004.0), (10003, 100060009.0),\n", " ..., (11998, 143952004.0), (11999, 143976001.0),\n", " (12000, 144000000.0)], \n", " dtype=[('f0', '0) & (f1<5e9)\", skip=10000, limit=2000)\n", "out" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "## That's all folks!" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "That's all for this tutorial section. Now you should have a look at the [reference section](http://bcolz.blosc.org/reference.html) to grasp all the functionality that bcolz offers. In general, ctable objects inherit most of the properties of carrays, so make sure that you master all the weaponry in carrays before getting too deep into ctables." ] } ], "metadata": { "kernelspec": { "display_name": "Python 2", "language": "python", "name": "python2" }, "language_info": { "codemirror_mode": { "name": "ipython", "version": 2 }, "file_extension": ".py", "mimetype": "text/x-python", "name": "python", "nbconvert_exporter": "python", "pygments_lexer": "ipython2", "version": "2.7.12" } }, "nbformat": 4, "nbformat_minor": 0 }