fastchunking (HEAD) - Codebase list

Tree @HEAD (Download .tar.gz)

===========================
fastchunking Python library
===========================

.. image:: https://travis-ci.org/netleibi/fastchunking.svg?branch=master
    :target: https://travis-ci.org/netleibi/fastchunking

.. image:: https://badge.fury.io/py/fastchunking.svg
    :target: https://badge.fury.io/py/fastchunking

.. image:: https://readthedocs.org/projects/fastchunking/badge/?version=latest
    :target: http://fastchunking.readthedocs.io/en/latest/?badge=latest
    :alt: Documentation Status

What it is
----------

`fastchunking` is a Python library that contains efficient and easy-to-use
implementations of string chunking algorithms.

It has been developed as part of the work [LS16]_ at CISPA, Saarland University.

Installation
------------

::

    $ pip install fastchunking

If you are using Python >=3.5, you have to install pybindgen which is required
by fastchunking beforehand as the latest version currently available via pip is
outdated:

::

    $ pip uninstall pybindgen
    $ pip install git+https://github.com/gjcarneiro/pybindgen.git

.. note:: For performance reasons, parts of this library are implemented in C++.
	Installation from a source distribution, thus, requires availability of a
	correctly configured C++ compiler.

Usage and Overview
------------------

`fastchunking` provides efficient implementations for different string chunking
algorithms, e.g., static chunking (SC) and content-defined chunking (CDC).

Static Chunking (SC)
^^^^^^^^^^^^^^^^^^^^

Static chunking splits a message into fixed-size chunks.

Let us consider a random example message that shall be chunked:
    >>> import os
    >>> message = os.urandom(1024*1024)

Static chunking is trivial when chunking a single message:
    >>> import fastchunking
    >>> sc = fastchunking.SC()
    >>> chunker = sc.create_chunker(chunk_size=4096)
    >>> chunker.next_chunk_boundaries(message)
    [4096, 8192, 12288, ...]

A large message can also be chunked in fragments, though:
    >>> chunker = sc.create_chunker(chunk_size=4096)
    >>> chunker.next_chunk_boundaries(message[:10240])
    [4096, 8192]
    >>> chunker.next_chunk_boundaries(message[10240:])
    [2048, 6144, 10240, ...]

Content-Defined Chunking (CDC)
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^

`fastchunking` supports content-defined chunking, i.e., chunking of messages
into fragments of variable lengths.

Currently, a chunking strategy based on Rabin-Karp rolling hashes is supported.

As a rolling hash computation on plain-Python strings is incredibly slow with
any interpreter, most of the computation is performed by a C++ extension which
is based on the `ngramhashing` library by Daniel Lemire, see:
https://github.com/lemire/rollinghashcpp

Let us consider a random message that should be chunked:
    >>> import os
    >>> message = os.urandom(1024*1024)

When using static chunking, we have to specify a rolling hash window size (here:
48 bytes) and an optional seed value that affects the pseudo-random distribution
of the generated chunk boundaries.

Despite that, usage is similar to static chunking:
    >>> import fastchunking
    >>> cdc = fastchunking.RabinKarpCDC(window_size=48, seed=0)
    >>> chunker = cdc.create_chunker(chunk_size=4096)
    >>> chunker.next_chunk_boundaries(message)
    [7475, 10451, 12253, 13880, 15329, 19808, ...]
    
Chunking in fragments is straightforward:
    >>> chunker = cdc.create_chunker(chunk_size=4096)
    >>> chunker.next_chunk_boundaries(message[:10240])
    [7475]
    >>> chunker.next_chunk_boundaries(message[10240:])
    [211, 2013, 3640, 5089, 9568, ...]

Multi-Level Chunking (ML-\*)
^^^^^^^^^^^^^^^^^^^^^^^^^^^^

Multiple chunkers of the same type (but with different chunk sizes) can be
efficiently used in parallel, e.g., to perform multi-level chunking [LS16]_.

Again, let us consider a random message that should be chunked:
    >>> import os
    >>> message = os.urandom(1024*1024)

Usage of multi-level-chunking, e.g., ML-CDC, is easy:
    >>> import fastchunking
    >>> cdc = fastchunking.RabinKarpCDC(window_size=48, seed=0)
    >>> chunk_sizes = [1024, 2048, 4096]
    >>> chunker = cdc.create_multilevel_chunker(chunk_sizes)
    >>> chunker.next_chunk_boundaries_with_levels(message)
    [(1049, 2), (1511, 1), (1893, 2), (2880, 1), (2886, 0),
    (3701, 0), (4617, 0), (5809, 2), (5843, 0), ...]

The second value in each tuple indicates the highest chunk size that leads to
a boundary. Here, the first boundary is a boundary created by the chunker with
index 2, i.e., the chunker with 4096 bytes target chunk size.

.. note::
   Only the highest index is output if multiple chunkers yield the same
   boundary.
    
.. warning::
   Chunk sizes have to be passed in correct order, i.e., from lowest to highest
   value.

Performance
-----------

Computation costs for `static chunking` are barely measurable: As chunking does
not depend on the actual message but only its length, computation costs are
essentially limited to a single :code:`xrange` call.

`Content-defined chunking`, however, is expensive: The algorithm has to compute
hash values for rolling hash window contents at `every` byte position of the
message that is to be chunked. To minimize costs, fastchunking works as follows:
    
    1. The message (fragment) is passed in its entirety to the C++ extension.
    2. Chunking is performed within the C++ extension.
    3. The resulting list of chunk boundaries is communicated back to Python and
       converted into a Python list.

Based on a 100 MiB random content, the author measured the following throughput
on an Intel Core i7-4770K in a single, non-representative test run using
Python 3.5 (Windows x86-64):

    =========== ==========
    chunk size  throughput
    =========== ==========
    64 bytes    118 MiB/s
    128 bytes   153 MiB/s
    256 bytes   187 MiB/s
    512 bytes   206 MiB/s
    1024 bytes  221 MiB/s
    2048 bytes  226 MiB/s
    4096 bytes  231 MiB/s
    8192 bytes  234 MiB/s
    16384 bytes 233 MiB/s
    32768 bytes 234 MiB/s
    =========== ==========

Testing
-------

`fastchunking` uses tox for testing, so simply run:

::

	$ tox

References:
    .. [LS16] Dominik Leibenger and Christoph Sorge (2016). sec-cs: Getting the
       Most out of Untrusted Cloud Storage.
       `arXiv:1606.03368 <http://arxiv.org/abs/1606.03368>`_
Commit History @HEAD

»»