{ "cells": [ { "cell_type": "markdown", "metadata": {}, "source": [ "# Tutorial on carray objects\n", "[Go to tutorials' index](tutorials.ipynb)" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "\n", "Index:\n", " 1. Creating carrays\n", " - Enlarging your carray\n", " - Compression level and shuffle filter\n", " - Accessing carray data\n", " - Querying carrays\n", " - Modifying carrays\n", " - Multidimensional carrays\n", " - Operating with carrays\n", " - carray metadata\n", " - carray user attrs\n", " - Memory profiling" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "This tutorial focuses on how to use carray objects: \n", "which options they support, \n", "and when and how they should be used. \n", "\n", "We will also see how this container compares to \n", "NumPy arrays and highlight some of its strengths." ] }, { "cell_type": "code", "execution_count": 1, "metadata": { "collapsed": false }, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=\n", "bcolz version: 1.0.1.dev58+dirty\n", "bcolz git info: 1.0.0-77-ga34d0b1\n", "NumPy version: 1.11.0\n", "Blosc version: 1.9.1.dev ($Date:: 2016-04-29 #$)\n", "Blosc compressors: ['blosclz', 'lz4', 'lz4hc', 'snappy', 'zlib']\n", "Numexpr version: 2.5.3.dev0\n", "Dask version: 0.9.0\n", "Python version: 2.7.11 |Continuum Analytics, Inc.| (default, Dec 6 2015, 18:08:32) \n", "[GCC 4.4.7 20120313 (Red Hat 4.4.7-1)]\n", "Platform: linux2-x86_64\n", "Byte-ordering: little\n", "Detected cores: 4\n", "-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=\n" ] } ], "source": [ "from __future__ import print_function\n", "\n", "# Let's import the packages we need for this tutorial\n", "import bcolz\n", "import numpy as np\n", "import sys\n", "\n", "# Timing measurements will be saved here\n", "bcolz_vs_numpy = {}\n", "\n", "bcolz.print_versions()" ] }, { 
"cell_type": "markdown", "metadata": {}, "source": [ "\n", "## Creating carrays\n", "Go to index" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "A carray can be created from any NumPy ndarray by using the `carray` constructor." ] }, { "cell_type": "code", "execution_count": 2, "metadata": { "collapsed": true }, "outputs": [], "source": [ "# Clear mydir, needed in case you run this tutorial multiple times\n", "!rm -rf mydir" ] }, { "cell_type": "code", "execution_count": 3, "metadata": { "collapsed": false }, "outputs": [], "source": [ "# create an in-memory numpy container\n", "a = np.arange(10)" ] }, { "cell_type": "code", "execution_count": 4, "metadata": { "collapsed": true }, "outputs": [], "source": [ "# create an in-memory carray container\n", "b = bcolz.carray(a)" ] }, { "cell_type": "code", "execution_count": 5, "metadata": { "collapsed": true }, "outputs": [], "source": [ "# create an on-disk carray container\n", "c = bcolz.carray(a, rootdir='mydir')\n", "c.flush()" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "**NOTE:** To avoid forgetting to flush your data to disk, you are encouraged to use the `with` statement for on-disk carrays (we will see an example later on). " ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "You can also create a carray with one of the top-level constructor functions. Note that write mode (`mode='w'`) overwrites the contents of the folder where the carray is created.\n" ] }, { "cell_type": "code", "execution_count": 6, "metadata": { "collapsed": false }, "outputs": [], "source": [ "d = bcolz.arange(10, rootdir='mydir', mode='w')" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Please note that you can create disk-based carrays just by specifying the `rootdir` parameter, which all the constructors accept. 
Disk-based arrays fully support all the operations of their in-memory counterparts, so depending on your needs, you may want to use one or the other (or even a combination of both).\n", "\n", "Now, `b` is a carray object. Just check the following:" ] }, { "cell_type": "code", "execution_count": 7, "metadata": { "collapsed": false }, "outputs": [ { "data": { "text/plain": [ "bcolz.carray_ext.carray" ] }, "execution_count": 7, "metadata": {}, "output_type": "execute_result" } ], "source": [ "type(b)" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "You can have a peek at it by using its string form:" ] }, { "cell_type": "code", "execution_count": 8, "metadata": { "collapsed": false }, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "[0 1 2 3 4 5 6 7 8 9]\n" ] } ], "source": [ "print(b)" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "You can also get more info about the uncompressed size (nbytes), the compressed size (cbytes) and the compression ratio (ratio = nbytes/cbytes) by using its representation form:" ] }, { "cell_type": "code", "execution_count": 9, "metadata": { "collapsed": false }, "outputs": [ { "data": { "text/plain": [ "carray((10,), int64)\n", " nbytes := 80; cbytes := 16.00 KB; ratio: 0.00\n", " cparams := cparams(clevel=5, shuffle=1, cname='lz4', quantize=0)\n", " chunklen := 2048; chunksize: 16384; blocksize: 0\n", "[0 1 2 3 4 5 6 7 8 9]" ] }, "execution_count": 9, "metadata": {}, "output_type": "execute_result" } ], "source": [ "b # <==> print(repr(b))" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "As you can see, the compressed size is much larger than the uncompressed one. How can this be? Well, it turns out that a carray carries an I/O buffer for accelerating some internal operations. 
So, for small arrays (typically those taking less than 1 MB), there is little point in using a carray.\n", "\n", "However, when creating carrays larger than 1 MB (its natural scenario), the size of the I/O buffer is generally negligible in comparison to its total size:" ] }, { "cell_type": "code", "execution_count": 10, "metadata": { "collapsed": false }, "outputs": [ { "data": { "text/plain": [ "carray((100000000,), float64)\n", " nbytes := 762.94 MB; cbytes := 25.01 MB; ratio: 30.50\n", " cparams := cparams(clevel=5, shuffle=1, cname='lz4', quantize=0)\n", " chunklen := 131072; chunksize: 1048576; blocksize: 32768\n", "[ 0.00000000e+00 1.00000000e+00 2.00000000e+00 ..., 9.99999970e+07\n", " 9.99999980e+07 9.99999990e+07]" ] }, "execution_count": 10, "metadata": {}, "output_type": "execute_result" } ], "source": [ "b = bcolz.arange(1e8)\n", "b" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "The carray consumes about 25 MB, while the original data would have taken more than 760 MB; that's a huge gain. You can always get a hint of how much space your carray takes by using `sys.getsizeof()`:" ] }, { "cell_type": "code", "execution_count": 11, "metadata": { "collapsed": false }, "outputs": [ { "data": { "text/plain": [ "26228365" ] }, "execution_count": 11, "metadata": {}, "output_type": "execute_result" } ], "source": [ "sys.getsizeof(b)" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "The takeaway message is: you can create very large arrays without the need to create a NumPy array first (that may not fit in memory)." 
] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Finally, you can get a copy of your created carrays by using the `copy()` method:" ] }, { "cell_type": "code", "execution_count": 12, "metadata": { "collapsed": false }, "outputs": [ { "data": { "text/plain": [ "carray((100000000,), float64)\n", " nbytes := 762.94 MB; cbytes := 25.01 MB; ratio: 30.50\n", " cparams := cparams(clevel=5, shuffle=1, cname='lz4', quantize=0)\n", " chunklen := 131072; chunksize: 1048576; blocksize: 32768\n", "[ 0.00000000e+00 1.00000000e+00 2.00000000e+00 ..., 9.99999970e+07\n", " 9.99999980e+07 9.99999990e+07]" ] }, "execution_count": 12, "metadata": {}, "output_type": "execute_result" } ], "source": [ "c = b.copy()\n", "c" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Also, you can set desired parameters for the newly created copy too:" ] }, { "cell_type": "code", "execution_count": 13, "metadata": { "collapsed": false }, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "CPU times: user 656 ms, sys: 92 ms, total: 748 ms\n", "Wall time: 421 ms\n" ] }, { "data": { "text/plain": [ "carray((100000000,), float64)\n", " nbytes := 762.94 MB; cbytes := 7.56 MB; ratio: 100.97\n", " cparams := cparams(clevel=9, shuffle=1, cname='lz4', quantize=0)\n", " chunklen := 131072; chunksize: 1048576; blocksize: 524288\n", "[ 0.00000000e+00 1.00000000e+00 2.00000000e+00 ..., 9.99999970e+07\n", " 9.99999980e+07 9.99999990e+07]" ] }, "execution_count": 13, "metadata": {}, "output_type": "execute_result" } ], "source": [ "%time b.copy(cparams=bcolz.cparams(clevel=9))" ] }, { "cell_type": "code", "execution_count": 14, "metadata": { "collapsed": false, "scrolled": true }, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "CPU times: user 656 ms, sys: 40 ms, total: 696 ms\n", "Wall time: 393 ms\n" ] }, { "data": { "text/plain": [ "carray((100000000,), float64)\n", " nbytes := 762.94 MB; cbytes := 7.56 MB; ratio: 100.97\n", " cparams := 
cparams(clevel=9, shuffle=1, cname='lz4', quantize=0)\n", " chunklen := 131072; chunksize: 1048576; blocksize: 524288\n", "[ 0.00000000e+00 1.00000000e+00 2.00000000e+00 ..., 9.99999970e+07\n", " 9.99999980e+07 9.99999990e+07]" ] }, "execution_count": 14, "metadata": {}, "output_type": "execute_result" } ], "source": [ "%time b.copy(cparams=bcolz.cparams(clevel=9, cname='lz4'))" ] }, { "cell_type": "code", "execution_count": 15, "metadata": { "collapsed": false }, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "CPU times: user 7.42 s, sys: 484 ms, total: 7.9 s\n", "Wall time: 7.46 s\n" ] }, { "data": { "text/plain": [ "carray((100000000,), float64)\n", " nbytes := 762.94 MB; cbytes := 2.30 MB; ratio: 331.96\n", " cparams := cparams(clevel=9, shuffle=2, cname='zlib', quantize=0)\n", " chunklen := 131072; chunksize: 1048576; blocksize: 1048576\n", "[ 0.00000000e+00 1.00000000e+00 2.00000000e+00 ..., 9.99999970e+07\n", " 9.99999980e+07 9.99999990e+07]" ] }, "execution_count": 15, "metadata": {}, "output_type": "execute_result" } ], "source": [ "%time b.copy(cparams=bcolz.cparams(clevel=9, cname='zlib', shuffle=bcolz.BITSHUFFLE))" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "\n", "## Enlarging your carray\n", "Go to index" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "One of the nicest features of carray objects is that they can be\n", "enlarged very efficiently. 
This can be done via the `carray.append()`\n", "method.\n", "\n", "For example, if `b` is a carray with 10 million elements:" ] }, { "cell_type": "code", "execution_count": 16, "metadata": { "collapsed": false }, "outputs": [ { "data": { "text/plain": [ "carray((10000000,), float64)\n", " nbytes := 76.29 MB; cbytes := 3.32 MB; ratio: 23.01\n", " cparams := cparams(clevel=5, shuffle=1, cname='lz4', quantize=0)\n", " chunklen := 65536; chunksize: 524288; blocksize: 32768\n", "[ 0.00000000e+00 1.00000000e+00 2.00000000e+00 ..., 9.99999700e+06\n", " 9.99999800e+06 9.99999900e+06]" ] }, "execution_count": 16, "metadata": {}, "output_type": "execute_result" } ], "source": [ "b = bcolz.arange(10*1e6)\n", "b" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "It could be enlarged by 10 elements as follows:" ] }, { "cell_type": "code", "execution_count": 17, "metadata": { "collapsed": false }, "outputs": [ { "data": { "text/plain": [ "carray((10000010,), float64)\n", " nbytes := 76.29 MB; cbytes := 3.32 MB; ratio: 23.01\n", " cparams := cparams(clevel=5, shuffle=1, cname='lz4', quantize=0)\n", " chunklen := 65536; chunksize: 524288; blocksize: 32768\n", "[ 0. 1. 2. ..., 7. 8. 
9.]" ] }, "execution_count": 17, "metadata": {}, "output_type": "execute_result" } ], "source": [ "b.append(np.arange(10.))\n", "b" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Let's check how fast appending can be:" ] }, { "cell_type": "code", "execution_count": 18, "metadata": { "collapsed": false }, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "100 loops, best of 3: 17.2 ms per loop\n", "10 loops, best of 3: 26.8 ms per loop\n", "\n", "* In this case appending to a carray was 1.56x times faster than numpy\n" ] } ], "source": [ "a = np.arange(1e7)\n", "b = bcolz.arange(1e7)\n", "\n", "t_bcolz = %timeit -o b.append(a)\n", "t_numpy = %timeit -o np.concatenate((a, a))\n", "ratio = t_numpy.best / t_bcolz.best\n", "bcolz_vs_numpy[\"append: array\"] = ratio\n", "\n", "print('\\n* In this case appending to a carray was {0}x times faster than numpy'.format(round(ratio, 2)))" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "This is especially true when appending small pieces to large arrays:" ] }, { "cell_type": "code", "execution_count": 19, "metadata": { "collapsed": false }, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "The slowest run took 6.49 times longer than the fastest. 
This could mean that an intermediate result is being cached.\n", "100000 loops, best of 3: 2.94 µs per loop\n", "100 loops, best of 3: 8.72 ms per loop\n", "\n", "* Appending to a large carray can be around 2966.07x times faster than numpy\n" ] } ], "source": [ "a = np.arange(1e7)\n", "b = bcolz.carray(a)\n", "c = np.arange(1e1)\n", "\n", "t_bcolz = %timeit -o b.append(c)\n", "t_numpy = %timeit -o np.concatenate((a, c))\n", "ratio = t_numpy.best / t_bcolz.best\n", "bcolz_vs_numpy[\"append: small array to large array\"] = ratio\n", "\n", "print('\\n* Appending to a large carray can be around {0}x times faster than numpy'.format(round(ratio, 2)))" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "You can also enlarge your arrays by using the `resize()` method:" ] }, { "cell_type": "code", "execution_count": 20, "metadata": { "collapsed": false }, "outputs": [ { "data": { "text/plain": [ "carray((20,), int64)\n", " nbytes := 160; cbytes := 16.00 KB; ratio: 0.01\n", " cparams := cparams(clevel=5, shuffle=1, cname='lz4', quantize=0)\n", " chunklen := 2048; chunksize: 16384; blocksize: 0\n", "[0 1 2 3 4 5 6 7 8 9 0 0 0 0 0 0 0 0 0 0]" ] }, "execution_count": 20, "metadata": {}, "output_type": "execute_result" } ], "source": [ "b = bcolz.arange(10)\n", "b.resize(20)\n", "b" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Note how the appended values are filled with zeros. This is because the\n", "default value for filling is 0. 
But you can choose a different value\n", "too:" ] }, { "cell_type": "code", "execution_count": 21, "metadata": { "collapsed": false }, "outputs": [ { "data": { "text/plain": [ "carray((20,), int64)\n", " nbytes := 160; cbytes := 16.00 KB; ratio: 0.01\n", " cparams := cparams(clevel=5, shuffle=1, cname='lz4', quantize=0)\n", " chunklen := 2048; chunksize: 16384; blocksize: 0\n", "[0 1 2 3 4 5 6 7 8 9 1 1 1 1 1 1 1 1 1 1]" ] }, "execution_count": 21, "metadata": {}, "output_type": "execute_result" } ], "source": [ "b = bcolz.arange(10, dflt=1)\n", "b.resize(20)\n", "b" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Another operation supported by carrays is trimming:" ] }, { "cell_type": "code", "execution_count": 22, "metadata": { "collapsed": false }, "outputs": [ { "data": { "text/plain": [ "carray((5,), int64)\n", " nbytes := 40; cbytes := 16.00 KB; ratio: 0.00\n", " cparams := cparams(clevel=5, shuffle=1, cname='lz4', quantize=0)\n", " chunklen := 2048; chunksize: 16384; blocksize: 0\n", "[0 1 2 3 4]" ] }, "execution_count": 22, "metadata": {}, "output_type": "execute_result" } ], "source": [ "b = bcolz.arange(10)\n", "b.resize(5)\n", "b" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "If you wish, you can even set the size to 0:" ] }, { "cell_type": "code", "execution_count": 23, "metadata": { "collapsed": false }, "outputs": [ { "data": { "text/plain": [ "0" ] }, "execution_count": 23, "metadata": {}, "output_type": "execute_result" } ], "source": [ "b.resize(0)\n", "len(b)" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Do not be afraid to resize extensively, as it is one of the strongest points of carray\n", "objects.\n", "\n", "Let's see below how it compares to NumPy when dealing with big arrays." 
] }, { "cell_type": "code", "execution_count": 24, "metadata": { "collapsed": false }, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "The slowest run took 206.52 times longer than the fastest. This could mean that an intermediate result is being cached.\n", "1000000 loops, best of 3: 547 ns per loop\n", "100 loops, best of 3: 8.76 ms per loop\n", "\n", "* Resizing a large carray is around 16005.88x times faster than numpy\n" ] }, { "name": "stderr", "output_type": "stream", "text": [ "/home/faltet/.local/lib/python2.7/site-packages/numpy/core/fromnumeric.py:1157: VisibleDeprecationWarning: using a non-integer number instead of an integer will result in an error in the future\n", " a = a[:-extra]\n", "/home/faltet/.local/lib/python2.7/site-packages/numpy/core/fromnumeric.py:225: VisibleDeprecationWarning: using a non-integer number instead of an integer will result in an error in the future\n", " return reshape(newshape, order=order)\n" ] } ], "source": [ "a = np.arange(1e7)\n", "b = bcolz.carray(a)\n", "desired_size = int(1e4)\n", "\n", "t_bcolz = %timeit -o b.resize(desired_size)\n", "t_numpy = %timeit -o np.resize(a, desired_size)\n", "ratio = t_numpy.best / t_bcolz.best\n", "bcolz_vs_numpy[\"resize large array\"] = ratio\n", "\n", "print('\\n* Resizing a large carray is around {0}x times faster than numpy'.format(round(ratio, 2)))" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "\n", "## Compression level and shuffle filter\n", "Go to index" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Bcolz carray objects use Blosc as the internal compressor. Blosc can be\n", "directed to use different compression levels, and you can enable or disable \n", "the internal shuffle filter at will. 
The shuffle filter is a way to improve\n", "compression when using items that have type sizes > 1 byte, although\n", "it might be counter-productive (very rarely) for some data distributions.\n", "\n", "By default, carrays are compressed using Blosc with compression level 5\n", "and the shuffle filter active. Depending on your needs, though, it can \n", "be a good idea to use other compression levels too.\n", "\n", "Let's see some examples:\n", "\n" ] }, { "cell_type": "code", "execution_count": 25, "metadata": { "collapsed": false }, "outputs": [ { "data": { "text/plain": [ "carray((10000000,), float64)\n", " nbytes := 76.29 MB; cbytes := 4.95 MB; ratio: 15.40\n", " cparams := cparams(clevel=1, shuffle=1, cname='lz4', quantize=0)\n", " chunklen := 65536; chunksize: 524288; blocksize: 16384\n", "[ 0.00000000e+00 1.00000000e+00 2.00000000e+00 ..., 9.99999700e+06\n", " 9.99999800e+06 9.99999900e+06]" ] }, "execution_count": 25, "metadata": {}, "output_type": "execute_result" } ], "source": [ "a = np.arange(1e7)\n", "bcolz.carray(a, bcolz.cparams(clevel=1))" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Let's set the maximum compression level:" ] }, { "cell_type": "code", "execution_count": 26, "metadata": { "collapsed": false }, "outputs": [ { "data": { "text/plain": [ "carray((10000000,), float64)\n", " nbytes := 76.29 MB; cbytes := 1.03 MB; ratio: 73.81\n", " cparams := cparams(clevel=9, shuffle=1, cname='lz4', quantize=0)\n", " chunklen := 65536; chunksize: 524288; blocksize: 524288\n", "[ 0.00000000e+00 1.00000000e+00 2.00000000e+00 ..., 9.99999700e+06\n", " 9.99999800e+06 9.99999900e+06]" ] }, "execution_count": 26, "metadata": {}, "output_type": "execute_result" } ], "source": [ "bcolz.carray(a, bcolz.cparams(clevel=9))" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "And we can use different compressors too:" ] }, { "cell_type": "code", "execution_count": 27, "metadata": { "collapsed": false }, "outputs": [ { "data": { "text/plain": [ 
"carray((10000000,), float64)\n", " nbytes := 76.29 MB; cbytes := 759.87 KB; ratio: 102.81\n", " cparams := cparams(clevel=9, shuffle=1, cname='zlib', quantize=0)\n", " chunklen := 65536; chunksize: 524288; blocksize: 524288\n", "[ 0.00000000e+00 1.00000000e+00 2.00000000e+00 ..., 9.99999700e+06\n", " 9.99999800e+06 9.99999900e+06]" ] }, "execution_count": 27, "metadata": {}, "output_type": "execute_result" } ], "source": [ "bcolz.carray(a, bcolz.cparams(clevel=9, cname=\"zlib\"))" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "And using other filters as well. Let's use BITSHUFFLE here:" ] }, { "cell_type": "code", "execution_count": 28, "metadata": { "collapsed": false }, "outputs": [ { "data": { "text/plain": [ "carray((10000000,), float64)\n", " nbytes := 76.29 MB; cbytes := 667.42 KB; ratio: 117.06\n", " cparams := cparams(clevel=9, shuffle=2, cname='zlib', quantize=0)\n", " chunklen := 65536; chunksize: 524288; blocksize: 524288\n", "[ 0.00000000e+00 1.00000000e+00 2.00000000e+00 ..., 9.99999700e+06\n", " 9.99999800e+06 9.99999900e+06]" ] }, "execution_count": 28, "metadata": {}, "output_type": "execute_result" } ], "source": [ "bcolz.carray(a, bcolz.cparams(clevel=9, cname=\"zlib\", shuffle=bcolz.BITSHUFFLE))" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "As mentioned above, you could decide to disable the shuffle filter that\n", "comes with Blosc:" ] }, { "cell_type": "code", "execution_count": 29, "metadata": { "collapsed": false }, "outputs": [ { "data": { "text/plain": [ "carray((10000000,), float64)\n", " nbytes := 76.29 MB; cbytes := 39.01 MB; ratio: 1.96\n", " cparams := cparams(clevel=5, shuffle=0, cname='lz4', quantize=0)\n", " chunklen := 65536; chunksize: 524288; blocksize: 32768\n", "[ 0.00000000e+00 1.00000000e+00 2.00000000e+00 ..., 9.99999700e+06\n", " 9.99999800e+06 9.99999900e+06]" ] }, "execution_count": 29, "metadata": {}, "output_type": "execute_result" } ], "source": [ "bcolz.carray(a, 
bcolz.cparams(shuffle=bcolz.NOSHUFFLE))" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "As can be seen, the compression ratio is much worse in this case.\n", "In general, it is recommended to leave the SHUFFLE filter active (unless you are\n", "fine-tuning the performance for a specific carray).\n" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "See the Optimization tips chapter for more info on how you can change other internal parameters like the chunk size.\n", "\n", "Also, if you would like to set your own default compression parameters globally, please see the Defaults chapter.\n", "\n", "\n" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "\n", "## Accessing carray data\n", "Go to index" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "The way to access carray data is very similar to the NumPy indexing\n", "scheme, and in fact, supports all the indexing methods supported by\n", "NumPy. First, start by specifying an index or slice:" ] }, { "cell_type": "code", "execution_count": 30, "metadata": { "collapsed": false }, "outputs": [ { "data": { "text/plain": [ "0" ] }, "execution_count": 30, "metadata": {}, "output_type": "execute_result" } ], "source": [ "a = np.arange(10)\n", "b = bcolz.carray(a)\n", "b[0]" ] }, { "cell_type": "code", "execution_count": 31, "metadata": { "collapsed": false }, "outputs": [ { "data": { "text/plain": [ "9" ] }, "execution_count": 31, "metadata": {}, "output_type": "execute_result" } ], "source": [ "b[-1]" ] }, { "cell_type": "code", "execution_count": 32, "metadata": { "collapsed": false }, "outputs": [ { "data": { "text/plain": [ "array([2, 3])" ] }, "execution_count": 32, "metadata": {}, "output_type": "execute_result" } ], "source": [ "b[2:4]" ] }, { "cell_type": "code", "execution_count": 33, "metadata": { "collapsed": false }, "outputs": [ { "data": { "text/plain": [ "array([0, 2, 4, 6, 8])" ] }, "execution_count": 33, "metadata": {}, "output_type": "execute_result" } ], "source": 
[ "b[::2]" ] }, { "cell_type": "code", "execution_count": 34, "metadata": { "collapsed": false }, "outputs": [ { "data": { "text/plain": [ "array([3, 6])" ] }, "execution_count": 34, "metadata": {}, "output_type": "execute_result" } ], "source": [ "b[3:9:3]" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Note that NumPy objects are returned as the result of an indexing\n", "operation. This is by design, because NumPy objects are normally\n", "more featureful and flexible (especially if they are small). \n", "\n", "In fact, a handy way to get a NumPy array out of a carray object is \n", "to ask for the complete range:" ] }, { "cell_type": "code", "execution_count": 35, "metadata": { "collapsed": false }, "outputs": [ { "data": { "text/plain": [ "array([0, 1, 2, 3, 4, 5, 6, 7, 8, 9])" ] }, "execution_count": 35, "metadata": {}, "output_type": "execute_result" } ], "source": [ "b[:]" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Fancy indexing is supported too. 
For example, indexing with boolean arrays gives:" ] }, { "cell_type": "code", "execution_count": 36, "metadata": { "collapsed": false }, "outputs": [ { "data": { "text/plain": [ "array([0, 1, 2, 3, 4])" ] }, "execution_count": 36, "metadata": {}, "output_type": "execute_result" } ], "source": [ "list_barr = [True]*5+[False]*5\n", "barr = np.array(list_barr)\n", "b[barr]" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "This also works with carray objects acting as the boolean index:" ] }, { "cell_type": "code", "execution_count": 37, "metadata": { "collapsed": false }, "outputs": [ { "data": { "text/plain": [ "array([0, 1, 2, 3, 4])" ] }, "execution_count": 37, "metadata": {}, "output_type": "execute_result" } ], "source": [ "b[bcolz.carray(barr)]" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Be aware that if you provide a plain list of booleans, it will be interpreted as a list of indices to extract, and therefore you won't obtain what you are looking for:" ] }, { "cell_type": "code", "execution_count": 38, "metadata": { "collapsed": false }, "outputs": [ { "data": { "text/plain": [ "array([1, 1, 1, 1, 1, 0, 0, 0, 0, 0])" ] }, "execution_count": 38, "metadata": {}, "output_type": "execute_result" } ], "source": [ "b[list_barr]" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "As we just saw, you could also give a list of indices you are interested in:" ] }, { "cell_type": "code", "execution_count": 39, "metadata": { "collapsed": false }, "outputs": [ { "data": { "text/plain": [ "array([2, 3, 0, 2])" ] }, "execution_count": 39, "metadata": {}, "output_type": "execute_result" } ], "source": [ "b[[2,3,0,2]]" ] }, { "cell_type": "code", "execution_count": 40, "metadata": { "collapsed": false }, "outputs": [ { "data": { "text/plain": [ "array([2, 3, 0, 2])" ] }, "execution_count": 40, "metadata": {}, "output_type": "execute_result" } ], "source": [ "b[bcolz.carray([2,3,0,2])]" ] }, { "cell_type": "markdown", "metadata": {}, 
"source": [ "\n", "## Querying carrays\n", "Go to index" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Carrays can be queried in different ways. The easiest and most powerful one is to use their rich set of iterators:" ] }, { "cell_type": "code", "execution_count": 41, "metadata": { "collapsed": false }, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "1 loop, best of 3: 649 ms per loop\n", "1 loop, best of 3: 1.83 s per loop\n", "\n", "* In this case summing up the desired items of our carray is 2.83x times faster than numpy\n" ] } ], "source": [ "a = np.arange(1e7)\n", "b = bcolz.carray(a)\n", "t_bcolz = %timeit -o sum(v for v in b if v < 10)\n", "t_numpy = %timeit -o sum(v for v in a if v < 10)\n", "ratio = t_numpy.best / t_bcolz.best\n", "bcolz_vs_numpy[\"query sum large array\"] = ratio\n", "\n", "print('\\n* In this case summing up the desired items of our carray is '\n", " '{0}x times faster than numpy'.format(round(ratio, 2)))" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "The iterator also has support for looking into slices of the array. The time taken in this case will be much shorter because the slice we look into is much shorter. 
Look at this:" ] }, { "cell_type": "code", "execution_count": 42, "metadata": { "collapsed": false }, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "1000 loops, best of 3: 301 µs per loop\n" ] } ], "source": [ "%timeit sum(v for v in b.iter(start=2, stop=20, step=3) if v < 10)" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Also, you can quickly retrieve the indices of a boolean carray that \n", "are true:\n" ] }, { "cell_type": "code", "execution_count": 43, "metadata": { "collapsed": false }, "outputs": [ { "data": { "text/plain": [ "[0, 1, 2, 3, 4, 5, 6, 7, 8, 9]" ] }, "execution_count": 43, "metadata": {}, "output_type": "execute_result" } ], "source": [ "barr = bcolz.eval(\"b < 10\") # see 'Operating with carrays' section below\n", "[i for i in barr.wheretrue()]" ] }, { "cell_type": "code", "execution_count": 44, "metadata": { "collapsed": false }, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "1000 loops, best of 3: 1.2 ms per loop\n" ] } ], "source": [ "%timeit [i for i in barr.wheretrue()]" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "And, as we saw previously, you can get the desired values by using a boolean array, which returns all the values from our carray where the boolean array is true:\n" ] }, { "cell_type": "code", "execution_count": 45, "metadata": { "collapsed": false }, "outputs": [ { "data": { "text/plain": [ "[0.0, 1.0, 2.0, 3.0, 4.0, 5.0, 6.0, 7.0, 8.0, 9.0]" ] }, "execution_count": 45, "metadata": {}, "output_type": "execute_result" } ], "source": [ "[i for i in b.where(barr)]" ] }, { "cell_type": "code", "execution_count": 46, "metadata": { "collapsed": false }, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "1000 loops, best of 3: 2.38 ms per loop\n" ] } ], "source": [ "%timeit [i for i in b.where(barr)]" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Note how `wheretrue` and `where` iterators are really fast. 
They are\n", "also very powerful. For example, they support `limit` and `skip`\n", "parameters for limiting the number of elements returned and skipping\n", "the leading elements respectively:" ] }, { "cell_type": "code", "execution_count": 47, "metadata": { "collapsed": false }, "outputs": [ { "data": { "text/plain": [ "[0, 1, 2, 3, 4]" ] }, "execution_count": 47, "metadata": {}, "output_type": "execute_result" } ], "source": [ "[i for i in barr.wheretrue(limit=5)]" ] }, { "cell_type": "code", "execution_count": 48, "metadata": { "collapsed": false }, "outputs": [ { "data": { "text/plain": [ "[3, 4, 5, 6, 7, 8, 9]" ] }, "execution_count": 48, "metadata": {}, "output_type": "execute_result" } ], "source": [ "[i for i in barr.wheretrue(skip=3)]" ] }, { "cell_type": "code", "execution_count": 49, "metadata": { "collapsed": false }, "outputs": [ { "data": { "text/plain": [ "[3, 4, 5, 6, 7]" ] }, "execution_count": 49, "metadata": {}, "output_type": "execute_result" } ], "source": [ "[i for i in barr.wheretrue(limit=5, skip=3)]" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "The advantage of the carray iterators is that you can use them in\n", "generator contexts and hence you don't need to waste memory\n", "creating temporaries, which is an important consideration when\n", "dealing with large arrays.\n", "\n", "Again, the iterator toolset in bcolz is very fast, so try to\n", "express your problems in a way that you can leverage it extensively." 
] }, { "cell_type": "markdown", "metadata": {}, "source": [ "\n", "## Modifying carrays\n", "Go to index" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Although it is a somewhat slow operation, carrays can be modified too.\n", "You can do it by specifying scalar or slice indices:" ] }, { "cell_type": "code", "execution_count": 50, "metadata": { "collapsed": false }, "outputs": [ { "data": { "text/plain": [ "carray((10,), int64)\n", " nbytes := 80; cbytes := 16.00 KB; ratio: 0.00\n", " cparams := cparams(clevel=5, shuffle=1, cname='lz4', quantize=0)\n", " chunklen := 2048; chunksize: 16384; blocksize: 0\n", "[0 1 2 3 4 5 6 7 8 9]" ] }, "execution_count": 50, "metadata": {}, "output_type": "execute_result" } ], "source": [ "a = np.arange(10)\n", "b = bcolz.arange(10)\n", "b" ] }, { "cell_type": "code", "execution_count": 51, "metadata": { "collapsed": false }, "outputs": [ { "data": { "text/plain": [ "carray((10,), int64)\n", " nbytes := 80; cbytes := 16.00 KB; ratio: 0.00\n", " cparams := cparams(clevel=5, shuffle=1, cname='lz4', quantize=0)\n", " chunklen := 2048; chunksize: 16384; blocksize: 0\n", "[ 0 10 10 10 10 10 10 7 8 9]" ] }, "execution_count": 51, "metadata": {}, "output_type": "execute_result" } ], "source": [ "b[1:7] = 10\n", "b" ] }, { "cell_type": "code", "execution_count": 52, "metadata": { "collapsed": false }, "outputs": [ { "data": { "text/plain": [ "carray((10,), int64)\n", " nbytes := 80; cbytes := 16.00 KB; ratio: 0.00\n", " cparams := cparams(clevel=5, shuffle=1, cname='lz4', quantize=0)\n", " chunklen := 2048; chunksize: 16384; blocksize: 0\n", "[ 0 -10 10 10 -10 10 10 -10 8 9]" ] }, "execution_count": 52, "metadata": {}, "output_type": "execute_result" } ], "source": [ "b[1::3] = -10\n", "b" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Modification by using fancy indexing is supported too:" ] }, { "cell_type": "code", "execution_count": 53, "metadata": { "collapsed": false }, "outputs": [ { "data": { 
"text/plain": [ "carray((10,), int64)\n", " nbytes := 80; cbytes := 16.00 KB; ratio: 0.00\n", " cparams := cparams(clevel=5, shuffle=1, cname='lz4', quantize=0)\n", " chunklen := 2048; chunksize: 16384; blocksize: 0\n", "[ -5 -5 -5 -5 -5 10 10 -10 8 9]" ] }, "execution_count": 53, "metadata": {}, "output_type": "execute_result" } ], "source": [ "barr = np.array([True]*5+[False]*5)\n", "b[barr] = -5\n", "b" ] }, { "cell_type": "code", "execution_count": 54, "metadata": { "collapsed": false }, "outputs": [ { "data": { "text/plain": [ "carray((10,), int64)\n", " nbytes := 80; cbytes := 16.00 KB; ratio: 0.00\n", " cparams := cparams(clevel=5, shuffle=1, cname='lz4', quantize=0)\n", " chunklen := 2048; chunksize: 16384; blocksize: 0\n", "[ -5 -10 -10 -5 -10 10 10 -10 8 9]" ] }, "execution_count": 54, "metadata": {}, "output_type": "execute_result" } ], "source": [ "b[[1,2,4,1]] = -10\n", "b" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "However, you must be aware that modifying a carray is expensive:" ] }, { "cell_type": "code", "execution_count": 55, "metadata": { "collapsed": false }, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "The slowest run took 51.52 times longer than the fastest. 
This could mean that an intermediate result is being cached.\n", "10000000 loops, best of 3: 78.7 ns per loop\n", "1000 loops, best of 3: 211 µs per loop\n", "\n", "* Modifying a carray is around 2688.01x times slower than modifying a numpy one\n" ] } ], "source": [ "a = np.arange(1e7)\n", "b = bcolz.carray(a)\n", "\n", "t_numpy = %timeit -o a[2] = 3\n", "t_bcolz = %timeit -o b[2] = 3\n", "ratio = t_numpy.best / t_bcolz.best\n", "bcolz_vs_numpy[\"modify an array\"] = ratio\n", "\n", "print('\\n* Modifying a carray is around {0}x times slower than modifying a numpy one'.format(round(1/ratio, 2)))" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Modifying values inside the last chunk, however, is much cheaper:" ] }, { "cell_type": "code", "execution_count": 56, "metadata": { "collapsed": false, "scrolled": true }, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "The slowest run took 50.81 times longer than the fastest. This could mean that an intermediate result is being cached.\n", "10000000 loops, best of 3: 79.8 ns per loop\n", "The slowest run took 5.25 times longer than the fastest. This could mean that an intermediate result is being cached.\n", "100000 loops, best of 3: 7.81 µs per loop\n", "\n", "* Modifying data in the last chunk of a carray is around 97.95x times slower than modifying a numpy one\n" ] } ], "source": [ "t_numpy = %timeit -o a[-1] = 3\n", "t_bcolz = %timeit -o b[-1] = 3\n", "ratio = t_numpy.best / t_bcolz.best\n", "bcolz_vs_numpy[\"modify array's last chunk\"] = ratio\n", "\n", "print('\\n* Modifying data in the last chunk of a carray is around {0}x times slower than modifying a numpy one'.format(round(1/ratio, 2)))" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "So, as you can see, you should avoid modifications as much as possible when using\n", "carrays." 
] }, { "cell_type": "markdown", "metadata": {}, "source": [ "\n", "## Multidimensional carrays\n", "Go to index" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "You can create multidimensional carrays too. Look at this example:" ] }, { "cell_type": "code", "execution_count": 57, "metadata": { "collapsed": false }, "outputs": [ { "data": { "text/plain": [ "carray((2, 3), float64)\n", " nbytes := 48; cbytes := 15.98 KB; ratio: 0.00\n", " cparams := cparams(clevel=5, shuffle=1, cname='lz4', quantize=0)\n", " chunklen := 682; chunksize: 16368; blocksize: 0\n", "[[ 0. 0. 0.]\n", " [ 0. 0. 0.]]" ] }, "execution_count": 57, "metadata": {}, "output_type": "execute_result" } ], "source": [ "a = bcolz.zeros((2,3))\n", "a" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "So, you can access any element in any dimension:" ] }, { "cell_type": "code", "execution_count": 58, "metadata": { "collapsed": false }, "outputs": [ { "data": { "text/plain": [ "array([ 0., 0., 0.])" ] }, "execution_count": 58, "metadata": {}, "output_type": "execute_result" } ], "source": [ "a[1]" ] }, { "cell_type": "code", "execution_count": 59, "metadata": { "collapsed": false }, "outputs": [ { "data": { "text/plain": [ "array([ 0., 0.])" ] }, "execution_count": 59, "metadata": {}, "output_type": "execute_result" } ], "source": [ "a[1,::2]" ] }, { "cell_type": "code", "execution_count": 60, "metadata": { "collapsed": false }, "outputs": [ { "data": { "text/plain": [ "array([ 0., 0.])" ] }, "execution_count": 60, "metadata": {}, "output_type": "execute_result" } ], "source": [ "a[:,1]" ] }, { "cell_type": "code", "execution_count": 61, "metadata": { "collapsed": false }, "outputs": [], "source": [ "a[0,1] = 0" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "As you can see, multidimensional carrays support the same multidimensional\n", "indexing as their NumPy counterparts.\n", "\n", "Also, you can use the `reshape()` method to set your desired shape to\n", "an existing 
carray:" ] }, { "cell_type": "code", "execution_count": 62, "metadata": { "collapsed": false }, "outputs": [ { "data": { "text/plain": [ "carray((3, 4), int64)\n", " nbytes := 96; cbytes := 16.00 KB; ratio: 0.01\n", " cparams := cparams(clevel=5, shuffle=1, cname='lz4', quantize=0)\n", " chunklen := 512; chunksize: 16384; blocksize: 0\n", "[[ 0 1 2 3]\n", " [ 4 5 6 7]\n", " [ 8 9 10 11]]" ] }, "execution_count": 62, "metadata": {}, "output_type": "execute_result" } ], "source": [ "b = bcolz.arange(12).reshape((3,4))\n", "b" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Iterators loop over the leading dimension:" ] }, { "cell_type": "code", "execution_count": 63, "metadata": { "collapsed": false }, "outputs": [ { "data": { "text/plain": [ "[array([0, 1, 2, 3]), array([4, 5, 6, 7]), array([ 8, 9, 10, 11])]" ] }, "execution_count": 63, "metadata": {}, "output_type": "execute_result" } ], "source": [ "[r for r in b]" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "And you can select columns there by using another indirection level:" ] }, { "cell_type": "code", "execution_count": 64, "metadata": { "collapsed": false }, "outputs": [ { "data": { "text/plain": [ "[2, 6, 10]" ] }, "execution_count": 64, "metadata": {}, "output_type": "execute_result" } ], "source": [ "[r[2] for r in b]" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Above, the third column has been selected. 
Although for this case the\n", "indexing is easier:" ] }, { "cell_type": "code", "execution_count": 65, "metadata": { "collapsed": false }, "outputs": [ { "data": { "text/plain": [ "array([ 2, 6, 10])" ] }, "execution_count": 65, "metadata": {}, "output_type": "execute_result" } ], "source": [ "b[:,2]" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "\n", "## Operating with carrays\n", "Go to index" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Right now, you cannot operate with carrays directly (although that\n", "might be implemented in the future):" ] }, { "cell_type": "code", "execution_count": 66, "metadata": { "collapsed": false }, "outputs": [], "source": [ "x = bcolz.arange(1e7)" ] }, { "cell_type": "code", "execution_count": 67, "metadata": { "collapsed": false }, "outputs": [], "source": [ "# Running the operation below will raise an error\n", "# x + x" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Instead, you should use the `eval` function:" ] }, { "cell_type": "code", "execution_count": 68, "metadata": { "collapsed": false }, "outputs": [ { "data": { "text/plain": [ "carray((10000000,), float64)\n", " nbytes := 76.29 MB; cbytes := 3.32 MB; ratio: 23.01\n", " cparams := cparams(clevel=5, shuffle=1, cname='lz4', quantize=0)\n", " chunklen := 65536; chunksize: 524288; blocksize: 32768\n", "[ 0.00000000e+00 2.00000000e+00 4.00000000e+00 ..., 1.99999940e+07\n", " 1.99999960e+07 1.99999980e+07]" ] }, "execution_count": 68, "metadata": {}, "output_type": "execute_result" } ], "source": [ "y = bcolz.eval(\"x + x\")\n", "y" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "You can also compute arbitrarily complex expressions in one shot:" ] }, { "cell_type": "code", "execution_count": 69, "metadata": { "collapsed": false }, "outputs": [ { "data": { "text/plain": [ "carray((10000000,), float64)\n", " nbytes := 76.29 MB; cbytes := 41.16 MB; ratio: 1.85\n", " cparams := cparams(clevel=5, shuffle=1, cname='lz4', 
quantize=0)\n", " chunklen := 65536; chunksize: 524288; blocksize: 32768\n", "[ 0.00000000e+00 2.60000000e+00 1.24000000e+01 ..., 4.99999760e+20\n", " 4.99999910e+20 5.00000060e+20]" ] }, "execution_count": 69, "metadata": {}, "output_type": "execute_result" } ], "source": [ "y = bcolz.eval(\".5*x**3 + 2.1*x**2\")\n", "y" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Note how the output of `eval()` is also a carray object. You can pass\n", "other parameters of the carray constructor too. Let's force maximum\n", "compression for the output:" ] }, { "cell_type": "code", "execution_count": 70, "metadata": { "collapsed": false }, "outputs": [ { "data": { "text/plain": [ "carray((10000000,), float64)\n", " nbytes := 76.29 MB; cbytes := 24.50 MB; ratio: 3.11\n", " cparams := cparams(clevel=9, shuffle=2, cname='zlib', quantize=0)\n", " chunklen := 65536; chunksize: 524288; blocksize: 524288\n", "[ 0.00000000e+00 2.60000000e+00 1.24000000e+01 ..., 4.99999760e+20\n", " 4.99999910e+20 5.00000060e+20]" ] }, "execution_count": 70, "metadata": {}, "output_type": "execute_result" } ], "source": [ "y = bcolz.eval(\".5*x**3 + 2.1*x**2\", cparams=bcolz.cparams(9, shuffle=bcolz.BITSHUFFLE, cname=\"zlib\"))\n", "y" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Also, we can get a native numpy array out of the computation:" ] }, { "cell_type": "code", "execution_count": 71, "metadata": { "collapsed": false }, "outputs": [ { "data": { "text/plain": [ "array([ 0.00000000e+00, 2.60000000e+00, 1.24000000e+01, ...,\n", " 4.99999760e+20, 4.99999910e+20, 5.00000060e+20])" ] }, "execution_count": 71, "metadata": {}, "output_type": "execute_result" } ], "source": [ "y = bcolz.eval(\".5*x**3 + 2.1*x**2\", out_flavor=\"numpy\")\n", "y" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "By default, `eval` will use the \"numexpr\" virtual machine if it is installed. If not, \"dask\" is used if installed. 
And if neither of these can be found, then the \"python\" interpreter is used (via NumPy). You can use the `vm` parameter to select the desired virtual machine (\"numexpr\", \"dask\" or \"python\"):" ] }, { "cell_type": "code", "execution_count": 72, "metadata": { "collapsed": false }, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "10 loops, best of 3: 71.4 ms per loop\n" ] } ], "source": [ "%timeit bcolz.eval(\".5*x**3 + 2.1*x**2\", vm=\"numexpr\")" ] }, { "cell_type": "code", "execution_count": 73, "metadata": { "collapsed": false }, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "1 loop, best of 3: 464 ms per loop\n" ] } ], "source": [ "%timeit bcolz.eval(\".5*x**3 + 2.1*x**2\", vm=\"dask\")" ] }, { "cell_type": "code", "execution_count": 74, "metadata": { "collapsed": false }, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "1 loop, best of 3: 865 ms per loop\n" ] } ], "source": [ "%timeit bcolz.eval(\".5*x**3 + 2.1*x**2\", vm=\"python\")" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Finally, `eval` lets you store the result directly on-disk in an efficient way (i.e. 
without temporaries):" ] }, { "cell_type": "code", "execution_count": 75, "metadata": { "collapsed": false }, "outputs": [ { "data": { "text/plain": [ "carray((10000000,), float64)\n", " nbytes := 76.29 MB; cbytes := 41.02 MB; ratio: 1.86\n", " cparams := cparams(clevel=5, shuffle=1, cname='lz4', quantize=0)\n", " chunklen := 65536; chunksize: 524288; blocksize: 32768\n", " rootdir := 'mydir/eval_result'\n", " mode := 'a'\n", "[ 0.00000000e+00 1.00000000e+00 8.00000000e+00 ..., 9.99999100e+20\n", " 9.99999400e+20 9.99999700e+20]" ] }, "execution_count": 75, "metadata": {}, "output_type": "execute_result" } ], "source": [ "bcolz.eval(\"x**3\", out_flavor=\"carray\", rootdir=\"mydir/eval_result\")" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "To set your own defaults for `vm` and\n", "`out_flavor` globally or permanently, see the defaults chapter.\n" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "\n", "## `carray` metadata\n", "Go to index" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "carray implements several attributes, like `dtype`, `shape` and `ndim`,\n", "that make it 'quack' like a NumPy array:" ] }, { "cell_type": "code", "execution_count": 76, "metadata": { "collapsed": true }, "outputs": [], "source": [ "a = np.arange(1e7)\n", "b = bcolz.carray(a)" ] }, { "cell_type": "code", "execution_count": 77, "metadata": { "collapsed": false }, "outputs": [ { "data": { "text/plain": [ "dtype('float64')" ] }, "execution_count": 77, "metadata": {}, "output_type": "execute_result" } ], "source": [ "b.dtype" ] }, { "cell_type": "code", "execution_count": 78, "metadata": { "collapsed": false }, "outputs": [ { "data": { "text/plain": [ "(10000000,)" ] }, "execution_count": 78, "metadata": {}, "output_type": "execute_result" } ], "source": [ "b.shape" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "In addition, it implements the `cbytes` attribute that tells how many\n", "bytes in memory (or on-disk) the 
carray object is using:" ] }, { "cell_type": "code", "execution_count": 79, "metadata": { "collapsed": false }, "outputs": [ { "data": { "text/plain": [ "3476685" ] }, "execution_count": 79, "metadata": {}, "output_type": "execute_result" } ], "source": [ "b.cbytes" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "This figure is approximate and generally smaller than the original\n", "(uncompressed) data size, which can be accessed by retrieving the `nbytes` attribute:" ] }, { "cell_type": "code", "execution_count": 80, "metadata": { "collapsed": false }, "outputs": [ { "data": { "text/plain": [ "80000000" ] }, "execution_count": 80, "metadata": {}, "output_type": "execute_result" } ], "source": [ "b.nbytes" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "which is the same as that of its equivalent NumPy array:" ] }, { "cell_type": "code", "execution_count": 81, "metadata": { "collapsed": false }, "outputs": [ { "data": { "text/plain": [ "80000000" ] }, "execution_count": 81, "metadata": {}, "output_type": "execute_result" } ], "source": [ "a.size*a.dtype.itemsize" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "If you would like to know the compression level and other optional filters used by a particular object, you can read this information from the `cparams` read-only attribute:" ] }, { "cell_type": "code", "execution_count": 82, "metadata": { "collapsed": false }, "outputs": [ { "data": { "text/plain": [ "cparams(clevel=5, shuffle=1, cname='lz4', quantize=0)" ] }, "execution_count": 82, "metadata": {}, "output_type": "execute_result" } ], "source": [ "b.cparams" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "The default value of a carray is another attribute you would likely want to check before resizing a carray:" ] }, { "cell_type": "code", "execution_count": 83, "metadata": { "collapsed": false }, "outputs": [ { "data": { "text/plain": [ "array(0.0)" ] }, "execution_count": 83, "metadata": {}, "output_type": 
"execute_result" } ], "source": [ "b.dflt" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "You can access the `chunklen` (the length for each chunk) for this\n", "carray:" ] }, { "cell_type": "code", "execution_count": 84, "metadata": { "collapsed": false }, "outputs": [ { "data": { "text/plain": [ "65536" ] }, "execution_count": 84, "metadata": {}, "output_type": "execute_result" } ], "source": [ "b.chunklen" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "For a complete list of public attributes of carray, see the section on\n", "carray attributes\n", "." ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "\n", "## `carray` user attrs\n", "Go to index" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Besides the regular attributes like `shape`, `dtype` or `chunklen`,\n", "there is another set of attributes that can be added (and removed) by\n", "the user in another name space. This space is accessible via the\n", "special `attrs` attribute. In the following example we will trigger flushing\n", "data to disk manually:" ] }, { "cell_type": "code", "execution_count": 85, "metadata": { "collapsed": false }, "outputs": [ { "data": { "text/plain": [ "*no attrs*" ] }, "execution_count": 85, "metadata": {}, "output_type": "execute_result" } ], "source": [ "a = bcolz.carray([1,2], rootdir='mydir/my_carray', mode='w')\n", "a.attrs" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "As you see, by default there are no attributes attached to `attrs`.\n", "Also, notice that the carray that we have created is persistent and\n", "stored in the 'mydir/my_carray' directory. 
Let's add one attribute here:" ] }, { "cell_type": "code", "execution_count": 86, "metadata": { "collapsed": false }, "outputs": [ { "data": { "text/plain": [ "myattr : 234" ] }, "execution_count": 86, "metadata": {}, "output_type": "execute_result" } ], "source": [ "a.attrs['myattr'] = 234\n", "a.attrs" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "We have just attached the 'myattr' attribute with the value 234. Let's add a couple more attributes:" ] }, { "cell_type": "code", "execution_count": 87, "metadata": { "collapsed": false }, "outputs": [ { "data": { "text/plain": [ "unit : 'Celsius'\n", "myattr : 234\n", "temp : 23" ] }, "execution_count": 87, "metadata": {}, "output_type": "execute_result" } ], "source": [ "a.attrs['temp'] = 23 \n", "a.attrs['unit'] = 'Celsius'\n", "a.attrs" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Good, we have three of them now. You can attach as many as you want,\n", "and the only current limitation is that they have to be serializable\n", "via JSON.\n", "\n", "As the 'a' carray is persistent, it can be re-opened in another Python session. First, flush it to disk:" ] }, { "cell_type": "code", "execution_count": 88, "metadata": { "collapsed": true }, "outputs": [], "source": [ "a.flush()" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "We could get our data back as follows:" ] }, { "cell_type": "code", "execution_count": 89, "metadata": { "collapsed": true }, "outputs": [], "source": [ "a2 = bcolz.open(rootdir=\"mydir/my_carray\")" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Now, let's remove a couple of user attrs:" ] }, { "cell_type": "code", "execution_count": 90, "metadata": { "collapsed": false }, "outputs": [ { "data": { "text/plain": [ "temp : 23" ] }, "execution_count": 90, "metadata": {}, "output_type": "execute_result" } ], "source": [ "del a2.attrs['myattr']\n", "del a2.attrs['unit']\n", "a2.attrs" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "So, it is really easy to 
make use of this feature and complement\n", "your data with (potentially persistent) metadata of your choice. Of\n", "course, the `ctable` object also has this capability." ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "\n", "## Memory profiling\n", "Go to index" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "We could say that `carrays` normally consume less memory \n", "than their `numpy arrays` counterparts, but as we said \n", "before, this depends heavily on the dataset you are \n", "trying to store: for small arrays the overhead of \n", "`carrays` becomes noticeable and they might even be bigger \n", "than `numpy arrays`, but keep in mind that `carrays` \n", "were designed with large amounts of data in mind.\n", "\n", "Please see the following notebook for more details about this topic.\n", "- [carray memory profiling](tutorial_carray_memory_profile.ipynb)" ] } ], "metadata": { "kernelspec": { "display_name": "Python 2", "language": "python", "name": "python2" }, "language_info": { "codemirror_mode": { "name": "ipython", "version": 2 }, "file_extension": ".py", "mimetype": "text/x-python", "name": "python", "nbconvert_exporter": "python", "pygments_lexer": "ipython2", "version": "2.7.11" } }, "nbformat": 4, "nbformat_minor": 0 }