Codebase list staden-io-lib / HEAD
HEAD

Tree @HEAD (Download .tar.gz)

  1
  2
  3
  4
  5
  6
  7
  8
  9
 10
 11
 12
 13
 14
 15
 16
 17
 18
 19
 20
 21
 22
 23
 24
 25
 26
 27
 28
 29
 30
 31
 32
 33
 34
 35
 36
 37
 38
 39
 40
 41
 42
 43
 44
 45
 46
 47
 48
 49
 50
 51
 52
 53
 54
 55
 56
 57
 58
 59
 60
 61
 62
 63
 64
 65
 66
 67
 68
 69
 70
 71
 72
 73
 74
 75
 76
 77
 78
 79
 80
 81
 82
 83
 84
 85
 86
 87
 88
 89
 90
 91
 92
 93
 94
 95
 96
 97
 98
 99
100
101
102
103
104
105
106
107
108
109
110
111
112
113
114
115
116
117
118
119
120
121
122
123
124
125
126
127
128
129
130
131
132
133
134
135
136
137
138
139
140
141
142
143
144
145
146
147
148
149
150
151
152
153
154
155
156
157
158
159
160
161
162
163
164
165
166
167
168
169
170
171
172
173
174
175
176
177
178
179
180
181
182
183
184
185
186
187
188
189
190
191
192
193
194
195
196
197
198
199
200
201
202
203
204
205
206
207
208
209
210
211
212
213
214
215
216
217
218
219
220
221
222
223
224
225
226
227
228
229
230
231
232
233
234
235
236
237
238
239
240
241
242
243
244
245
246
247
248
249
250
251
252
253
254
255
256
257
258
259
260
261
262
263
264
265
266
267
268
269
270
271
272
273
274
275
276
277
278
279
280
281
282
283
284
285
286
287
288
289
290
291
292
293
294
295
296
297
298
299
300
301
302
303
304
305
306
307
308
309
310
311
312
313
314
315
316
317
318
319
320
321
322
323
324
325
326
327
328
329
330
331
332
333
334
335
336
337
338
339
340
341
342
343
344
345
346
347
348
349
350
351
352
353
354
355
356
357
358
359
360
361
362
363
364
365
366
367
368
369
370
371
372
373
374
375
376
377
378
379
380
381
382
383
384
385
386
387
388
389
390
391
392
393
394
395
396
397
398
399
400
401
402
403
404
405
406
407
408
409
410
411
412
413
414
415
416
417
418
419
420
421
422
423
424
425
426
427
428
429
Io_lib:  Version 1.14.15
========================

Io_lib is a library of file reading and writing code to provide a general
purpose SAM/BAM/CRAM, trace file (and Experiment File) reading
interface.  Programmatically {S,B,CR}AM can be manipulated using the
scram_*() API functions while DNA Chromatogram ("trace") files  can be
read using the read_reading() function.

It has been compiled and tested on a variety of unix systems, MacOS X
and MS Windows.

The directories below here contain the io_lib code. These support the
following file formats:

	SAM/BAM sequence files
	CRAM sequence files
	SCF trace files
	ABI trace files
	ALF trace files
	ZTR trace files
	SFF trace archives
	SRF trace archives
	Experiment files
	Plain text files

These link together to form a single "libstaden-read" library supporting
all the file formats via a single read_reading (or fread_reading or
mfread_reading) function call and analogous write_reading functions
too. See the file include/Read.h for the generic 'Read' structure.

See the CHANGES for a summary of older updates or git logs for the
full details.


Version 1.14.15 (6th December 2022)
---------------

This is primarily a bug fix release.


Version 1.14.14 (17th March 2021)
---------------

This is simply a bug fix release.  It also updates to the latest
htscodecs submodule, now at an official 1.0 release.

Version 1.14.13 (3rd July 2020)
---------------

This release has a mixture of on-going CRAM 4 work (not compatible
with previous CRAM 4) and some more general quality of life
improvements for all CRAM versions including speed-ups and better
multi-threading.

Note both CRAM 3.1 and 4.0 are still to be considered an unofficial
CRAM extensions.

Updates:

* Scramble can now filter-in or filter-out aux tags during
  transcoding.  This is done using -d and -D options.  For example:

      scramble -D OQ,BI,BD in.bam out.cram

  removes the GATK added OQ, BI and BD aux tags.
  Requested by @jhaezebrouck in issue #24.

* The Scramble -X <profile> options are now implemented using a
  CRAM_OPT_PROFILE option.  This simplifies the scramble code and
  makes it easier to call from a library.  This also fixes a number of
  bugs in the order of argument parsing.

* Improved CRAM writing speeds.

  The bam_copy function now only copies the number of used bytes
  rather than the number of allocated bytes, which can sometimes be
  substantially smaller.  As this was done in the main thread it may
  have a significant benefit when multi-threading.

* Added libdeflate support into CRAM too (in addition to the existing
  support in BAM).  This isn't a huge change to CRAM speeds except at
  high levels (-8 and -9) which are now slower, but also better
  compression ratio.  A modest 2-3% speed gain is visible are low and
  mid levels, and at -1/-2 to -4 the compression ratio is also
  improved.

* CRAM 3.1 compression level -1 is now 25% faster, but 4% larger.
  This is achieved by difference choice of compression codecs, most
  notably disabling the name tokeniser for level 1.  Use level 2 for
  something comparable to the old behaviour.

* Added an io_lib/version.h to make it easier to detect the version
  being compiled against using IOLIB_VERSION macros.
  Requested by German Tischler in issue #25.

* Refactored the cram encoding interface used by biobambam.
  Implemented by German Tischler in PR#27.

* CRAM 4 now uses E_CONST instead of a uni-value version of
  E_HUFFMAN.  Also added offset field to VARINT_SIGNED and
  VARINT_UNSIGNED which helps for data series that have values from -1
  to MAXINT.

* CRAM 4 container structure has changed so that all values are
  variable sized integers instead of fixed size.

* Further improvements with CRAM 4's use of signed values.
  - Ref_seq_id is container and slice headers are now signed.
  - RI (ref ID) data series and NS (mate ref ID) are also now signed
    as -1 is a valid value.
  - Embedded ref id is now 0 for unusued instead of -1.

* Reversed the use of CRAM 4 delta encoding for the B array.  It only
  helps at the moment for ONT signal data, so it needs more work to
  make it auto-detect when delta makes sense. (Enabling it globally
  for CRAM4 B aux tags was accidental.)

* Htscodecs submodule has gained support for big-endian platforms
  Other big-endian improvements to parts of CRAM4 too.

Bug fixes:

* Fixed CRAM MD tag generatin when using the "b" feature code
  (NB: unused by known CRAM encoders).
  Also see https://github.com/samtools/htslib/pull/1086 for more details.

* Fixed CRAM quality string when using "q" feature code (unused by
  encoders?) and in lossy-quality mode (maybe utilised in old
  Cramtools).
  Also see https://github.com/samtools/htslib/pull/1094 for more details.

* Fixed some minor memory leaks.

* "Scramble -X archive -1" enabled lzma, which should only have
  arrived at level 7 and above. (It compared integer 7 vs ASCII '1'.)

* Removed minor compilation warning in printf debugging.

* Fixed a 7 year old bug in scram_pileup which couldn't cope with
  soft-clips being followed by hard-clips.


Technology Demo: CRAM 3.1 and 4.0
=================================

The current official GA4GH CRAM version is 3.0.

For purposes of *EVALUATION ONLY* this release of io_lib includes CRAM
version 3.1, with new compression codecs (but is otherwise identical
file layout to 3.0), and 4.0 with a few additional format
modifications, such as 64-bit sizes, deduplication of read names,
orientation changes of quality strings and a revised variable sized
integer encoding.

They can be turned on using e.g. scramble -V3.1 or scramble -V4.0.
It is likely CRAM v4.0 will be official significantly later, but we
plan on v3.1 being a recognised GA4GH standard this year.

By default enabling either of these will also enable the new codecs.
Which codecs are used also depends on the profile specified (eg via
"-X small").  Some of the new codecs are considerably slower,
especially at decompression, but by default CRAM 3.1 aims to be
comparable speed to 3.0.  Note the profiles also change the
granularity of random access (1k, 10k, 25k, 100k for fast, normal,
small and archive respectively).

Here are some example file sizes and timings with different codecs and
levels on 10 million 150bp NovaSeq reads, single threaded.  Decode
timing is checked using "scram_flagstat -b".  Tests were performed
on an Intel i5-4570 processor at 3.2GHz.

|Scramble opts.      |Size(MB) |Enc(s)|Dec(s)|Codecs used                |
|--------------------|--------:|-----:|-----:|---------------------------|
|-O bam              |    531.9|  92.3|   7.5|bgzf(zlib)                 |
|-O bam -1           |    611.4|  26.4|   5.4|bgzf(libdeflate)           |
|-O bam (default)    |    539.5|  45.0|   4.9|bgzf(libdeflate)           |
|-O bam -9           |    499.5| 920.2|   4.9|bgzf(libdeflate)           |
||||||
|-V2.0 -X fast       |    317.7|  38.8|  11.8|(default, level 1)         |
|-V2.0 (default)     |    267.6|  47.0|  10.5|(default)                  |
|-V2.0 -X small      |    218.0| 124.6|  33.1|bzip2                      |
||||||
|-V3.0 -X fast       |    264.9|  31.3|  10.8|(default, level 1)         |
|-V3.0 (default)     |    223.7|  34.7|  10.3|(default)                  |
|-V3.0 -X small      |    212.3|  88.3|  18.2|bzip2                      |
|-V3.0 -X archive    |    209.4|  98.7|  18.2|bzip2                      |
||||||
|-V3.1 -X fast       |    262.4|  29.1|   9.3|rANS++                     |
|-V3.1 (default)     |    186.4|  33.7|   8.3|rANS++,tok3                |
|-V3.1 -X small      |    176.8|  74.0|  35.2|rANS++,tok3,fqz            |
|-V3.1 -X archive    |    171.9| 127.9|  34.9|rANS++,tok3,fqz,bzip2,arith|
||||||
|-V4.0 -X fast       |    251.2|  28.9|   9.6|rANS++                     |
|-V4.0 (default)     |    182.1|  32.9|   8.2|rANS++,tok3                |
|-V4.0 -X small      |    170.9|  70.9|  35.0|rANS++,tok3,fqz            |
|-V4.0 -X archive    |    166.9| 116.4|  34.2|rANS++,tok3,fqz,bzip2,arith|

We also tested on a small human aligned HiSeq run (ERR317482)
representing older Illumina data with pre-binning era quality values.
This dataset shows less impressive gains with 4.0 over 3.0 in the
default profile, but major gains in small profile once fqzcomp quality
encoding is enabled.

Note for this file, the file sizes are larger meaning less disk
caching is possible (the test machine wasn't a memory stressed
desktop).  Threading was also enabled, albeit with just 4 threads,
which further exacerbates I/O bottlenecks.  The previous test
demonstrated BAM being faster to read than CRAM, but with large files
in a more I/O stressed situation this test demonstrates the default
profile of CRAM is faster to read than BAM, due to the smaller I/O
footprint.

NB: the table below was produced with 1.14.12.

|Scramble opts.         |Size(MB) |Enc(s)|Dec(s)|Codecs used                     |
|--------------------   |--------:|-----:|-----:|--------------------------------|
|-t4 -O bam (default)   |    6526 | 115.4|  44.7|bgzf(libdeflate)                |
||||||
|-t4 -V2.0 -X fast      |    3674 |  87.4|  31.4|(default, level 1)              |
|-t4 -V2.0 (default)    |    3435 |  91.4|  30.7|(default)                       |
|-t4 -V2.0 -X small     |    3373 | 145.5|  47.8|bzip2                           |
|-t4 -V2.0 -X archive   |    3377 | 166.3|  49.7|bzip2                           |
|-t4 -V2.0 -X archive -9|    3125 |1900.6|  76.9|bzip2                           |
||||||
|-t4 -V3.0 -X fast      |    3620 |  88.3|  29.3|(default, level 1)              |
|-t4 -V3.0 (default)    |    3287 |  90.5|  29.5|(default)                       |
|-t4 -V3.0 -X small     |    3238 | 128.5|  40.3|bzip2                           |
|-t4 -V3.0 -X archive   |    3220 | 164.9|  50.0|bzip2                           |
|-t4 -V3.0 -X archive -9|    3115 |1866.6|  75.2|bzip2, lzma                     |
||||||
|-t4 -V3.1 -X fast      |    3611 |  87.9|  29.2|rANS++                          |
|-t4 -V3.1 (default)    |    3161 |  88.8|  29.7|rANS++,tok3                     |
|-t4 -V3.1 -X small     |    2249 | 192.2| 146.1|rANS++,tok3,fqz                 |
|-t4 -V3.1 -X archive   |    2157 | 235.2| 127.5|rANS++,tok3,fqz,bzip2,arith     |
|-t4 -V3.1 -X archive   |    2145 | 480.3| 128.9|rANS++,tok3,fqz,bzip2,arith,lzma|
||||||
|-t4 -V4.0 -X fast      |    3551 |  87.8|  29.5|rANS++                          |
|-t4 -V4.0 (default)    |    3148 |  88.9|  30.0|rANS++,tok3                     |
|-t4 -V4.0 -X small     |    2236 | 189.7| 142.6|rANS++,tok3,fqz                 |
|-t4 -V4.0 -X archive   |    2139 | 226.7| 127.5|rANS++,tok3,fqz,bzip2,arith     |
|-t4 -V4.0 -X archive -9|    2132 | 453.5| 128.2|rANS++,tok3,fqz,bzip2,arith,lzma|


Building
========

Prerequisites
-------------

You will need a C compiler, a Unix "make" program plus zlib, bzip2 and
lzma libraries and associated development packages (including C header
files).  The appropriate operating system package names and comands
differ per system.  On Debian Linux derived systems use the command
below (or build and install your own copies from source):

  sudo apt-get install make zlib1g-dev libbz2-dev liblzma-dev

On RedHat derived systems the package names differ:

  sudo yum install make zlib-devel bzip2-devel xz-devel


Zlib
----

This code makes heavy use of the Deflate algorithm, assuming a Zlib
interface.  The native Zlib bundled with most systems is now rather
old and better optimised versions exist for certain platforms
(e.g. using the SSE instructions on Intel and AMD CPUs).

Therefore the --with-zlib=/path/to/zlib configure option may be used
to point to a different Zlib.  I have tested it with the vanilla zlib,
Intel's zlib and CloudFlare's Zlib.  Of the three it appears the
CloudFlare one has the quickest implementation, but mileage may vary
depending on OS and CPU.  

CloudFlare: https://github.com/cloudflare/zlib
Intel:      https://github.com/jtkukunas/zlib
Zlib-ng:    https://github.com/Dead2/zlib-ng

The Zlib-ng one needs configuring with --zlib-compat and when you
build Io_lib you will need to define -DWITH_GZFILEOP too.  It also
doesn't work well when used in conjunction with LD_PRELOAD. Therefore
I wouldn't recommend it for now.

If you are using the CloudFlare implementation, you may also want to
disable the CRC implementation in this code if your CloudFlare zlib
was built with PCLMUL support as their implementation is faster.
Otherwise the CRC here is quicker than Zlib's own version.
Building io_lib with the internal CRC code disabled is done
with ./configure --disable-own-crc (or CFLAGS=-UIOLIB_CRC).


Libdeflate
----------

The BAM reading and writing also has optional support for the
libdeflate library (https://github.com/ebiggers/libdeflate).  This can
be used instead of an optimised zlib (see above), and generally is
slightly faster.  Build using:

    ./configure --with-libdeflate=/path


Git clone
---------

We recommend building from a release tarball, which has the configure
script already created for you.  However if you wish to build from the
latest code then use "git clone -r" to clone recursively to get the
htscodecs submodule (or follow up a normal "git clone" with
"git submodule update --init --recursive").  You will then need to
create the configure script using autoreconf in both io_lib and
htscodecs directories.  This is easiest achieved using the supply
bootstrap script.

  ./bootstrap

The autotools programs may not be on your system.  If it fails, then
install autoconf, automake and libtool packages; see above for example
OS-specific installation commands.


Linux
-----

We use the GNU autoconf build mechanism.

To build:

1. ./configure

"./configure --help" will give a list of the options for GNU autoconf. For
modifying the compiler options or flags you may wish to redefine the CC or
CFLAGS variable.

Eg (in sh or bash):
   CC=cc CFLAGS=-g ./configure

2. make (or gmake)

This will build the sources.

CFLAGS may also be changed a build time using (eg):
    make 'CFLAGS=-g ...'

3. make install

The default installation location is /usr/local/bin and /usr/local/lib. These
can be changed with the --prefix option to "configure".


Windows
-------

Under Microsoft Windows we recommend the use of MSYS and MINGW as a
build environment.

These contain enough tools to build using the configure script as per
Linux. The latest msys can be downloaded here:

   http://repo.msys2.org/distrib/msys2-x86_64-latest.exe

Once installed and setup ("pacman -Syu"; close window & relaunch msys;
"pacman -Syu" again), install mingw64 compilers via "pacman -S
--needed man base-devel git mingw-w64-x86_64-toolchain".

This should then be sufficient to configure and compile.  However note
that you may need to use "./configure --disable-shared" for the test
harness to work due to deficiences in the libtool wrapper script.

If you wish to use Microsoft Visual Studio you may need to add the
MSVC_includes subdirectory to your C include search path.  This
adds several missing header files (eg unistd.h and sys/time.h) needed
to build this software.  We do not have a MSVC project file available
and have not tested the build under this environment for a number of
years.

In this case you will also need to copy io_lib/os.h.in to io_lib/os.h
and either remove the @SET_ENDIAN@ and adjacent @ lines (as these are
normally filled out for you by autoconf) or add -DNO_AUTOCONF to your
compiler options.

The code should also build cleanly under a cross-compiler.  This has
not been tested recently, but a past successful invocation was:

    ./configure \
            --host=x86_64-w64-mingw32 \
            --prefix=$DIST \
            --with-io_lib=$DIST \
            --with-tcl=$DIST/lib \
            --with-tk=$DIST/lib \
            --with-tklib=$DIST/lib/tklib0.5 \
            --with-zlib=$DIST \
            LDFLAGS=-L$DIST/lib

with $DIST being pre-populated with already built and installed 3rd
party dependencies, some from MSYS mentioned above.


Libbsc
------

This is experimental, just to see what we can get with a high quality
compression engine in CRAM.  It's hard to build right now, especially
given it's a C++ library and our code is C.  The hacky solution now
is (linux) e.g.:

  ../configure \
    CPPFLAGS=-I$HOME/ftp/compression/libbsc \
    LDFLAGS="-L$HOME/ftp/compression/libbsc -fopenmp" \
    LIBS=-lstdc++

Enable it using scramble -J, but note this requires experimental CRAM
versions 3.1 or 4.0.

** Neither of these should be used for production data. **


MacOS X
-------

The configure script should work by default, but if you are attempting
to build FAT binaries to work on both i386 and ppc targets you'll need
to disable dependency tracking. Ie:

    CFLAGS="-arch i386 -arch ppc" LDFLAGS="-arch i386 -arch ppc" \
      ../configure --disable-dependency-tracking