% LZMA2 format

The LZMA2 format supports flushing as well as parallel encoding and
decoding. Chunks of data that cannot be compressed are stored uncompressed.

## Dictionary Size

LZMA2 requires information about the size of the dictionary. This is
provided by a single byte. 

Bits | Mask | Description
----:|-----:|:------------------------------------------------
 0-5 | 0x3F | Dictionary Size
 6-7 | 0xC0 | Reserved for future use; Must be zero

The dictionary size is encoded with a one-bit mantissa and five-bit
exponent. The smallest dictionary size is 4 KiB and the biggest is
4 GiB - 1 B.

|Raw Value | Mantissa | Exponent | Dictionary size|
|---------:|---------:|---------:|---------------:|
|        0 |        2 |       11 |          4 KiB |
|        1 |        3 |       11 |          6 KiB |
|        2 |        2 |       12 |          8 KiB |
|        3 |        3 |       12 |         12 KiB |
|      ... |      ... |      ... |            ... |
|       36 |        2 |       29 |       1024 MiB |
|       37 |        3 |       29 |       1536 MiB |
|       38 |        2 |       30 |       2048 MiB |
|       39 |        3 |       30 |       3072 MiB |
|       40 |        2 |       31 | 4096 MiB - 1 B |

For test purposes we add the dictionary size byte as the first byte of an
LZMA2 stream.
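
The mapping in the table above can be sketched in Go as follows. This is a
minimal illustration, not code from the xz package: the function name
`decodeDictSize` and its error messages are assumptions made for this
example. It computes the size as `mantissa << exponent` with mantissa
`2 | (raw & 1)` and exponent `11 + raw/2`, and treats raw value 40 as the
special case 4 GiB - 1 B.

```go
package main

import (
	"errors"
	"fmt"
)

// decodeDictSize converts the dictionary size byte into a size in bytes.
// Hypothetical helper for illustration; it is not part of the xz package.
func decodeDictSize(b byte) (uint32, error) {
	// Bits 6-7 are reserved and must be zero.
	if b&0xC0 != 0 {
		return 0, errors.New("lzma2: reserved bits set in dictionary size byte")
	}
	raw := uint32(b & 0x3F)
	if raw > 40 {
		return 0, fmt.Errorf("lzma2: dictionary size value %d out of range", raw)
	}
	// Raw value 40 would be 2 << 31 = 4 GiB, which does not fit into
	// 32 bits; the format defines it as 4 GiB - 1 B.
	if raw == 40 {
		return 0xFFFFFFFF, nil
	}
	mantissa := 2 | (raw & 1) // one-bit mantissa: 2 or 3
	exponent := 11 + raw/2    // exponent: 11 to 30
	return mantissa << exponent, nil
}
```

Under these assumptions `decodeDictSize(0x02)` yields 8192 (8 KiB), matching
the third row of the table.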

## Chunks

An LZMA2 stream is a sequence of chunks. Each chunk is preceded by a
header consisting of a control byte and, depending on the chunk type,
size and properties bytes.

Following the C implementation in the LZMA SDK, the chunk header can be
described as follows:

Chunk header         | Description
:------------------- | :--------------------------------------------------
`00000000`           | End of LZMA2 stream
`00000001 U U`       | Uncompressed chunk, reset dictionary
`00000010 U U`       | Uncompressed chunk, no reset of dictionary
`100uuuuu U U C C`   | LZMA, no reset
`101uuuuu U U C C`   | LZMA, reset state
`110uuuuu U U C C S` | LZMA, reset state, new properties
`111uuuuu U U C C S` | LZMA, reset state, new properties, reset dictionary

The symbols used are described by the following table.

Symbol | Description
:----- | :--------------------
u      | uncompressed size bit
U      | uncompressed size byte
C      | compressed size byte
S      | properties byte

A dictionary reset always requires new properties. If the dictionary is
reset by an uncompressed chunk, the properties must be provided in the
next compressed chunk. New properties require a reset of the state.

A dictionary reset sets the current position to zero. The data of
uncompressed chunks is also written into the dictionary.

The uncompressed size and compressed size are stored in big-endian byte
order. Each stored value is one less than the actual size, so a chunk with
1 byte of uncompressed data stores 0 in the uncompressed size bits and bytes.
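
As an illustration of the header layout and size encoding described above,
the following Go sketch parses a chunk header from a byte slice. The type
`chunkHeader`, the function `parseChunkHeader` and the error texts are
hypothetical names chosen for this example, not the API of the xz package.
The sketch treats control bytes `0x03` to `0x7F` as invalid, since none of
the patterns in the chunk header table matches them.

```go
package main

import (
	"errors"
	"fmt"
)

// chunkHeader holds the decoded fields of a chunk header.
// Illustrative type; not taken from the xz package.
type chunkHeader struct {
	endOfStream      bool
	compressed       bool // LZMA chunk (true) or uncompressed chunk (false)
	resetDict        bool
	resetState       bool
	newProps         bool
	uncompressedSize int  // actual size, already incremented by one
	compressedSize   int  // actual size, already incremented by one
	props            byte // properties byte S, valid only if newProps is set
	headerLen        int  // number of header bytes consumed
}

var errShortHeader = errors.New("lzma2: chunk header too short")

// parseChunkHeader decodes the chunk header at the start of p.
func parseChunkHeader(p []byte) (chunkHeader, error) {
	if len(p) < 1 {
		return chunkHeader{}, errShortHeader
	}
	c := p[0]
	h := chunkHeader{headerLen: 1}
	switch {
	case c == 0x00: // end of LZMA2 stream
		h.endOfStream = true
		return h, nil
	case c == 0x01 || c == 0x02: // uncompressed chunk: control byte, U U
		if len(p) < 3 {
			return chunkHeader{}, errShortHeader
		}
		h.resetDict = c == 0x01
		h.uncompressedSize = (int(p[1])<<8 | int(p[2])) + 1
		h.headerLen = 3
		return h, nil
	case c&0x80 == 0:
		return chunkHeader{}, fmt.Errorf("lzma2: invalid control byte %#02x", c)
	}
	// LZMA chunk: bits 5-6 select the reset mode, bits 0-4 are the
	// most significant bits of the uncompressed size.
	mode := (c >> 5) & 0x03
	h.compressed = true
	h.resetState = mode >= 1
	h.newProps = mode >= 2
	h.resetDict = mode == 3
	h.headerLen = 5
	if h.newProps {
		h.headerLen = 6
	}
	if len(p) < h.headerLen {
		return chunkHeader{}, errShortHeader
	}
	h.uncompressedSize = (int(c&0x1F)<<16 | int(p[1])<<8 | int(p[2])) + 1
	h.compressedSize = (int(p[3])<<8 | int(p[4])) + 1
	if h.newProps {
		h.props = p[5]
	}
	return h, nil
}
```

With this layout the largest chunk carries 2 MiB of uncompressed data
(21 size bits) and 64 KiB of compressed data (16 size bits).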

The properties byte provides the parameters pb, lc, lp using the following
formula:

    S = (pb * 5 + lp) * 9 + lc

This is the same encoding used for LZMA. For LZMA2 the following condition
has been introduced:

    lc + lp <= 4

The parameters are defined as follows:

Name  | Range  | Description
:---- | :----- | :------------------------------
lc    | [0,8]  | number of literal context bits
lp    | [0,4]  | number of literal pos bits
pb    | [0,4]  | the number of pos bits
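
The formula for the properties byte can be inverted with integer division
and remainder. The following Go sketch shows one way to do this, including
the LZMA2 restriction on `lc + lp`. The names `properties`, `decodeProps`
and `encodeProps` are assumptions for this example and are not taken from
the xz package.

```go
package main

import "fmt"

// properties holds the LZMA parameters carried by the properties byte S.
// Illustrative type; not taken from the xz package.
type properties struct {
	lc, lp, pb int
}

// decodeProps inverts S = (pb * 5 + lp) * 9 + lc and applies the LZMA2
// condition lc + lp <= 4.
func decodeProps(s byte) (properties, error) {
	// The largest valid value is (4*5+4)*9+8 = 224.
	if s >= 9*5*5 {
		return properties{}, fmt.Errorf("lzma2: properties byte %d out of range", s)
	}
	v := int(s)
	p := properties{lc: v % 9, lp: (v / 9) % 5, pb: v / 45}
	if p.lc+p.lp > 4 {
		return properties{}, fmt.Errorf("lzma2: lc + lp = %d exceeds 4", p.lc+p.lp)
	}
	return p, nil
}

// encodeProps computes the properties byte after checking the ranges
// from the parameter table and the LZMA2 condition.
func encodeProps(p properties) (byte, error) {
	if p.lc < 0 || p.lc > 8 || p.lp < 0 || p.lp > 4 || p.pb < 0 || p.pb > 4 {
		return 0, fmt.Errorf("lzma2: parameter out of range: %+v", p)
	}
	if p.lc+p.lp > 4 {
		return 0, fmt.Errorf("lzma2: lc + lp = %d exceeds 4", p.lc+p.lp)
	}
	return byte((p.pb*5+p.lp)*9 + p.lc), nil
}
```

For example, the properties byte `0x5D` (93) decodes to lc = 3, lp = 0 and
pb = 2, and encoding these values gives 93 again.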