TODO for Page Layout Detection Tools

At this point, this is merely an unordered list of ideas.

1) organize the deskewer code as a library with the applications as its clients;
add an application interface that reads from STDIN instead of taking filenames
(reimplement the factory to accept an opened stream rather than a file name)

2) a utility to detect "almost empty pages", i.e. pages with only dirt and no text

3) make man pages - DONE

4) bitmap rotation routine (ask our anonymous contributor to donate it too)

5) splitting of landscape pages with text bounding box detection

7) page numbers, margin notes, etc. should not confuse the text bounding box (BB) detection too much

8) detect the position of each line of text and put a small BB around each reliably detected text line
(based on the shape of the Radon profile); this should also work for paragraphs typeset in petite and for margin paragraphs

9) detect the text orientation based on the length of the lines at the beginning and at the end of each paragraph
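
A possible starting point for item 2, written in Python for brevity rather
than in the project's own language. This is only a sketch under assumptions:
the page is already a bilevel bitmap given as rows of 0 (white) and 1 (black),
and "almost empty" is approximated by a very low share of black pixels. The
names ink_ratio and is_almost_empty and the max_ink threshold are hypothetical
and would need tuning on real scans.

```python
def ink_ratio(bitmap):
    """Return the fraction of black pixels in a bilevel bitmap.

    bitmap is a list of rows, each row a list of 0 (white) or 1 (black).
    """
    total = sum(len(row) for row in bitmap)
    if total == 0:
        return 0.0
    black = sum(sum(row) for row in bitmap)
    return black / total


def is_almost_empty(bitmap, max_ink=0.002):
    """Guess whether a page holds only dirt.

    max_ink is an assumed threshold, not a measured one: a page counts as
    almost empty when at most this share of its pixels is black.
    """
    return ink_ratio(bitmap) <= max_ink
```

A plain ink ratio cannot tell a few large blots from a short line of text, so
a real utility would probably combine it with the line detection of item 8.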


About the bounding box determination: first, determine where the lines of
text are. They will be parallel and have a characteristic profile in the Y
direction. Then perform some kind of wavelet transform in the X direction
to determine where each line of text starts and where it ends. Text has a
characteristic scale of variation between white and black, and the
vertical size of each line will be clearly visible from the Radon profile.
From that size we can infer the typical horizontal wavelet profile, so that
text lines can be distinguished from anything else (graphics and dirt). All
this can be performed by a "data-gathering" version of the Radon transformer.
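
The two-step idea above can be sketched roughly as follows, in Python and
under simplifying assumptions. The page is taken to be already deskewed, so
the Radon profile reduces to a plain sum of black pixels per row. The wavelet
step is not implemented; the leftmost and rightmost black pixel of each band
stand in for it, which means graphics and dirt are not filtered out. The names
row_profile, find_bands and line_boxes are hypothetical.

```python
def row_profile(bitmap):
    """Black-pixel count per row (the Radon profile at zero skew)."""
    return [sum(row) for row in bitmap]


def find_bands(profile, threshold=1):
    """Return (top, bottom) row ranges where the profile reaches threshold."""
    bands = []
    start = None
    for y, value in enumerate(profile):
        if value >= threshold and start is None:
            start = y                      # a band of text rows begins
        elif value < threshold and start is not None:
            bands.append((start, y - 1))   # the band ended on the previous row
            start = None
    if start is not None:                  # band runs to the bottom edge
        bands.append((start, len(profile) - 1))
    return bands


def line_boxes(bitmap, threshold=1):
    """Return one (left, top, right, bottom) box per detected band."""
    boxes = []
    for top, bottom in find_bands(row_profile(bitmap), threshold):
        cols = [x
                for row in bitmap[top:bottom + 1]
                for x, pixel in enumerate(row)
                if pixel]
        if cols:                           # skip bands without black pixels
            boxes.append((min(cols), top, max(cols), bottom))
    return boxes
```

The height of each band (bottom - top + 1) is the vertical line size mentioned
above, and could later set the scale for a real wavelet pass in the X direction.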