Codebase list golang-github-ulikunitz-xz / dc928beb-17f5-4514-94d6-f845010e7d09/main doc / xz-issues.md
dc928beb-17f5-4514-94d6-f845010e7d09/main

Tree @dc928beb-17f5-4514-94d6-f845010e7d09/main (Download .tar.gz)

xz-issues.md @dc928beb-17f5-4514-94d6-f845010e7d09/mainraw · history · blame

# Issues in the XZ file format

During the development of the xz package for Go a number of issues with
the xz file format were observed. They are documented here to help a
later development of an improved format.

# xz file format

## Consistency

## General

Packets should either be constant size or should have encoded the size
in the header. File header and footer are constant size and the block
header has the size encoded. The index doesn't fulfill the criteria even
when its size is included in the footer.

## Index

The index doesn't have the size in the header. So in a stream you are
forced to read the whole index to identify its length.

The index should have made optional. This would require to remove the
index size from the footer and include its own footer in the index.

## Padding

The padding should allow direct mapping of the CRC values into memory, but it
wastes bytes bearing no information. This is certainly not optimal for a
compression format. It is argued alignment makes it faster to read and
write the checksum values, but the time spent there is much less than on
encoding and decoding itself.

## Filters for each block

Filters should have been defined in front of blocks. This way they
would not need to be repeated.

# LZMA2 

## Consistent header byte.

LZMA2 consists of a series of chunks with a header byte. The header byte
has a different format depending on whether it is an uncompressed or
compressed chunk. This has the consequence a complete reset of state,
properties and dictionary is not possible with an uncompressed chunk.
The encoder has to keep a state variable tracking a dictionary reset in
an uncompressed chunk to ensure that the flags are added in the first
compressed chunk to follow. This complicates the implementation of the
encoder and decoder.

## Dictionary capacity is not encoded

LZMA2 doesn't encode the dictionary capacity, so LZMA2 doesn't work
standalone.