Codebase list libdivsufsort / HEAD
HEAD

Tree @HEAD (Download .tar.gz)

libdivsufsort - A lightweight suffix-sorting library.
-----------------------------------------------------

Introduction:
-------------

The libdivsufsort project provides a fast, lightweight, and robust
C API library to construct the suffix array and the Burrows-Wheeler
transformed string for any input string of a constant-size alphabet.

The suffix-sorting algorithm runs in O(n log n) worst-case time
using only 5n+O(1) bytes of memory space, where n is the length of
the input string.

The latest version of libdivsufsort is available at:
  http://libdivsufsort.googlecode.com/


License:
--------

libdivsufsort is released under the MIT/X11 license. See the file
COPYING for more details.


APIs:
-----

  * Data types
  typedef int32_t saint_t;
  typedef int32_t saidx_t;
  typedef uint8_t sauchar_t;

  * Constructs the suffix array of a given string.
  * @param T[0..n-1] The input string.
  * @param SA[0..n-1] The output array or suffixes.
  * @param n The length of the given string.
  * @return 0 if no error occurred, -1 or -2 otherwise.
  saint_t
  divsufsort(const sauchar_t *T, saidx_t *SA, saidx_t n);

  * Constructs the burrows-wheeler transformed string of a given string.
  * @param T[0..n-1] The input string.
  * @param U[0..n-1] The output string. (can be T)
  * @param A[0..n-1] The temporary array. (can be NULL)
  * @param n The length of the given string.
  * @return The primary index if no error occurred, -1 or -2 otherwise.
  saidx_t
  divbwt(const sauchar_t *T, sauchar_t *U, saidx_t *A, saidx_t n);

  * Constructs the burrows-wheeler transformed string of a given string and suffix array.
  * @param T[0..n-1] The input string.
  * @param U[0..n-1] The output string. (can be T)
  * @param SA[0..n-1] The suffix array. (can be NULL)
  * @param n The length of the given string.
  * @param idx The output primary index.
  * @return 0 if no error occurred, -1 or -2 otherwise.
  saint_t
  bw_transform(const sauchar_t *T, sauchar_t *U, saidx_t *SA,
               saidx_t n, saidx_t *idx);

  * Inverse BW-transforms a given BWTed string.
  * @param T[0..n-1] The input string.
  * @param U[0..n-1] The output string. (can be T)
  * @param A[0..n-1] The temporary array. (can be NULL)
  * @param n The length of the given string.
  * @param idx The primary index.
  * @return 0 if no error occurred, -1 or -2 otherwise.
  saint_t
  inverse_bw_transform(const sauchar_t *T, sauchar_t *U,
                       saidx_t *A, saidx_t n, saidx_t idx);

  * Checks the correctness of a given suffix array.
  * @param T[0..n-1] The input string.
  * @param SA[0..n-1] The input suffix array.
  * @param n The length of the given string.
  * @param verbose The verbose mode.
  * @return 0 if no error occurred.
  saint_t
  sufcheck(const sauchar_t *T, const saidx_t *SA,
           saidx_t n, saint_t verbose);

  * Search for the pattern P in the string T.
  * @param T[0..Tsize-1] The input string.
  * @param Tsize The length of the given string.
  * @param P[0..Psize-1] The input pattern string.
  * @param Psize The length of the given pattern string.
  * @param SA[0..SAsize-1] The input suffix array.
  * @param SAsize The length of the given suffix array.
  * @param idx The output index.
  * @return The count of matches if no error occurred, -1 otherwise.
  saidx_t
  sa_search(const sauchar_t *T, saidx_t Tsize,
            const sauchar_t *P, saidx_t Psize,
            const saidx_t *SA, saidx_t SAsize,
            saidx_t *idx);

  * Search for the character c in the string T.
  * @param T[0..Tsize-1] The input string.
  * @param Tsize The length of the given string.
  * @param SA[0..SAsize-1] The input suffix array.
  * @param SAsize The length of the given suffix array.
  * @param c The input character.
  * @param idx The output index.
  * @return The count of matches if no error occurred, -1 otherwise.
  saidx_t
  sa_simplesearch(const sauchar_t *T, saidx_t Tsize,
                  const saidx_t *SA, saidx_t SAsize,
                  saint_t c, saidx_t *idx);

  * Returns the version string of libdivsufsort.
  * @return the version string.
  const char *
  divsufsort_version(void);


Benchmark:
------------------

= Specifications =
Processor:        2.66 GHz Intel Core 2 Duo E6750
L1 Cache:         (32 Kb + 32 Kb) x 2
L2 Cache:         4 Mb
RAM:              2 Gb main memory
Operating system: Windows XP Home SP 3 (with Cygwin)
Compiler:         GCC version 4.3.1

= Programs =
Archon4r0    kvark's sorting algorithm            http://forum.compression.ru/viewtopic.php?t=352
BPR          Bucket-Pointer Refinement algorithm  http://bibiserv.techfak.uni-bielefeld.de/bpr/
DC           Difference-Cover algorithm (v = 32)  http://www.cs.helsinki.fi/juha.karkkainen/publications/cpm03.tar.gz
DS           Deep-Shallow sorting algorithm       http://www.mfn.unipmn.it/~manzini/lightweight/
divsufsort1  libdivsufsort version 1.2.3          http://libdivsufsort.googlecode.com/
divsufsort2  libdivsufsort version 2.0.0          http://libdivsufsort.googlecode.com/
KA           Ko-Aluru algorithm                   http://ko.pang.cn.googlepages.com/software2
KS           Kärkkäinen-Sanders algorithm         http://www.mpi-inf.mpg.de/~sanders/programs/suffix/
MSufSort3    MSufSort version 3.1.1 beta          http://www.michael-maniscalco.com/msufsort.htm
qsufsort     Larsson-Sadakane algorithm           http://www.larsson.dogma.net/research.html
sais         Induced Sorting algorithm            http://yuta.256.googlepages.com/sais

All programs were compiled with gcc/g++ using '-O3 -fomit-frame-pointer -DNDEBUG'
optimization options. The times are the average of five runs, in seconds, and were
measured using the standard Unix/Cygwin 'time' command. (user + system) The spaces
were measured using the 'memusage' command.

= Testfiles =
Manzini's Large Corpus  http://www.mfn.unipmn.it/~manzini/lightweight/corpus/
The Gauntlet            http://www.michael-maniscalco.com/testset/gauntlet/

= Running times =

== Manzini's Corpus ==
Files                 Size  Archon4r0      BPR       DC       DS  divsufsort1  divsufsort2       KA        KS  MSufSort3  qsufsort     sais
chr22.dna         34553758      6.030    6.196   22.694    7.514        5.404        5.362   16.980    50.006      7.132    10.642   10.796
etext99          105277340     22.160   32.582   79.872   34.264       18.758       18.064   73.236   202.684     24.106    56.612   38.748
gcc-3.0.tar       86630400     13.856   20.692   61.690   35.822       10.382       10.084   40.908   135.174     14.952    40.766   20.990
howto             39422105      5.806    8.326   25.432    8.288        5.472        5.320   20.694    64.834      5.672    16.366   11.388
jdk13c            69728899     18.106   22.252   61.234   32.182        9.260        9.010   34.172   101.096     11.314    39.792   16.396
linux-2.4.5.tar  116254720     18.174   26.226   82.830   25.912       14.672       14.290   58.586   194.412     19.890    54.054   29.614
rctail96         114711151     32.490   55.826  119.026   62.502       18.500       17.914   70.072   190.562     21.060    70.456   33.248
rfc              116421901     20.736   35.404   91.284   29.666       16.116       15.658   64.390   196.500     17.936    61.436   32.224
sprot34.dat      109617186     22.832   36.720   93.122   32.096       17.894       17.404   68.084   187.594     23.352    56.946   34.092
w3c2             104201579     27.264   29.384   89.352   54.682       13.866       13.486   52.660   162.582     17.090    77.804   25.498
totals           896819039    187.454  273.608  726.536  322.928      130.324      126.592  499.782  1485.444    162.504   484.874  252.994

== The Gauntlet ==
Files              Size  Archon4r0      BPR      DC       DS  divsufsort1  divsufsort2      KA      KS  MSufSort3  qsufsort    sais
abac             200000      0.044    0.064   0.104   27.914        0.042        0.036   0.058   0.048      0.050     0.062   0.044
abba           10500600      3.270    5.124  10.766   30.702        1.714        1.602   2.570   7.952      3.514    15.272   1.460
book1x20       15375420      4.392    3.530  13.872   97.468        2.312        2.154   7.442  15.756      3.542    22.376   3.912
fib_s14930352  14930352     12.728   10.830  18.524  179.040        3.638        3.588   3.544  10.232      6.700    18.224   2.542
fss10          12078908     11.390    8.974  15.130   85.328        2.828        2.824   3.344   8.646      4.618    14.754   2.076
fss9            2851443      1.002    1.210   1.644    5.256        0.410        0.416   0.618   1.290      0.554     2.836   0.336
houston         3840000      0.344    0.708   2.226  118.960        0.118        0.128   0.520   0.744      0.242     1.230   0.238
paper5x80        981924      0.110    0.154   0.454    0.806        0.092        0.090   0.210   0.256      0.144     0.448   0.110
test1           2097152      0.332    2.132   1.108    8.680        0.268        0.280   0.376   1.066      1.302     2.762   0.202
test2           2097152      0.710    0.616   1.110    8.682        0.180        0.176   0.374   1.076      3.354     2.768   0.206
test3           2097152      0.488  213.154   1.164    1.772        0.220        0.226   0.388   1.082      0.922     3.246   0.212
totals         67050103     34.810  246.496  66.102  564.608       11.822       11.520  19.444  48.148     24.942    83.978  11.338

= Space (in MiBytes) =

== Manzini's Corpus ==
Files                 Size  Archon4r0      BPR       DC       DS  divsufsort1  divsufsort2       KA        KS  MSufSort3  qsufsort     sais
chr22.dna         34553758     174.66   296.88   193.60   165.18       165.02       165.02   289.97    428.39     199.72    263.62   164.77
etext99          105277340     531.13   915.48   589.85   503.23       502.25       502.25   907.34   1305.20     604.45    803.20   502.00
gcc-3.0.tar       86630400     437.14   756.43   485.38   415.87       413.34       413.34   709.50   1074.01     497.79    660.94   413.09
howto             39422105     199.20   367.53   220.88   188.45       188.23       188.23   331.54    488.75     227.67    300.77   187.98
jdk13c            69728899     351.96   603.99   390.68   333.40       332.74       332.74   609.71    864.48     401.04    531.99   332.49
linux-2.4.5.tar  116254720     586.46  1061.83   651.36   555.76       554.60       554.60   977.81   1441.30     667.39    886.95   554.35
rctail96         114711151     578.68   987.64   642.71   548.32       547.24       547.24  1004.98   1422.16     658.43    875.18   546.99
rfc              116421901     587.30  1005.85   652.29   556.53       555.39       555.39   956.52   1443.37     668.26    888.23   555.14
sprot34.dat      109617186     553.01   941.95   614.17   524.03       522.95       522.95   930.06   1359.01     629.26    836.31   522.70
w3c2             104201579     525.71   958.37   583.82   498.09       497.12       497.12   912.00   1291.87     598.82    795.00   496.87
totals           896819039    4525.25  7895.95  5024.74  4288.86      4278.88      4278.88  7629.43  11118.54    5152.83   6842.19  4276.38
mean                     -       5.29     9.23     5.88     5.01         5.00         5.00     8.92     13.00       6.02      8.00     5.00

== The Gauntlet ==
Files              Size  Archon4r0     BPR      DC      DS  divsufsort1  divsufsort2      KA      KS  MSufSort3  qsufsort    sais
abac             200000       1.51    1.73    1.12    0.98         1.21         1.20    1.75    2.48       3.15      1.53    0.95
abba           10500600      53.43   90.19   58.83   50.21        50.32        50.32   86.20  130.18      62.09     80.11   50.07
book1x20       15375420      78.00  134.00   86.15   73.52        73.57        73.57  132.42  190.62      89.99    117.31   73.32
fib_s14930352  14930352      75.75  128.15   83.65   71.71        71.44        71.44  117.16  185.10      87.43    113.91   71.19
fss10          12078908      61.38  103.68   67.68   58.05        57.85        57.85  107.05  149.75      71.12     92.16   57.60
fss9            2851443      14.87   24.48   15.98   13.71        13.85        13.85   25.27   35.35      18.32     21.76   13.60
houston         3840000      19.85   36.96   21.52   18.46        18.56        18.56   28.79   47.58      23.98     29.30   18.31
paper5x80        981924       5.45   11.40    5.50    4.72         4.93         4.93    8.59   12.17       7.63      7.49    4.68
test1           2097152      11.07   82.00   11.75   10.10        10.25        10.25   18.34   25.99      14.01     16.00   10.00
test2           2097152      11.07   82.00   11.75   10.10        10.25        10.25   18.34   25.99      14.01     16.00   10.00
test3           2097152      11.07   82.00   11.75   10.05        10.25        10.25   18.34   26.00      14.63     16.00   10.12
totals         67050103     343.45  776.59  375.68  321.61       322.48       322.47  562.25  831.21     406.36    511.57  319.84
mean                  -       5.37   12.14    5.88    5.03         5.04         5.04    8.79   13.00       6.35      8.00    5.00


Algorithm:
----------

libdivsufsort uses the following algorithms for suffix sorting.
  - The improved version of Itho-Tanaka two-stage sorting algorithm. [2][6]
  - A substring sorting/encoding technique. [1][3]
  - Maniscalco's tandem repeat sorting algorithm. [5]
  - Larsson-Sadakane sorting algorithm. [4]


References:
-----------

  1. Stefan Burkhardt and Juha K"arkk"ainen. Fast lightweight suffix
     array construction and checking. Proceedings of the 14th Annual
     Symposium on Combinatorial Pattern Matching, LNCS 2676,
     Springer, pp. 55-69, 2003.

  2. Hideo Itoh and Hozumi Tanaka, An Efficient Method for in Memory
     Construction of Suffix Arrays, Proceedings of the IEEE String
     Processing and Information Retrieval Symposium, pp. 81-88, 1999.

  3. Pang Ko and Srinivas Aluru, Space-efficient linear time
     construction of suffix arrays, Proceedings of the 14th Annual
     Symposium on Combinatorial Pattern Matching, pp. 200-210, 2003.

  4. Jesper Larsson and Kunihiko Sadakane, Faster suffix sorting.
     Technical report LU-CS-TR:99-214, Department of Computer
     Science, Lund University, Sweden, 1999.

  5. Michael Maniscalco, MSufSort.
     http://www.michael-maniscalco.com/msufsort.htm

  6. Yuta Mori, Short description of improved two-stage suffix sorting
     algorithm, 2005.
     http://homepage3.nifty.com/wpage/software/itssort.txt