NAME
    HTML::GenToc - Generate a Table of Contents for HTML documents.

VERSION
    version 3.20

SYNOPSIS
      use HTML::GenToc;

      # create a new object with the default settings
      my $toc = new HTML::GenToc();

      # or create one with explicit settings
      $toc = new HTML::GenToc(title=>"Table of Contents",
                              toc_entry=>{
                                H1=>1,
                                H2=>2
                              },
                              toc_end=>{
                                H1=>'/H1',
                                H2=>'/H2'
                              }
        );

      # generate a ToC from a file
      $toc->generate_toc(input=>$html_file,
                         footer=>$footer_file,
                         header=>$header_file
        );

DESCRIPTION
    HTML::GenToc generates anchors and a table of contents for HTML
    documents. Depending on the arguments, it will insert the information it
    generates, or output to a string, a separate file or STDOUT.

    While it defaults to taking H1 and H2 elements as the significant
    elements to put into the table of contents, any tag can be defined as a
    significant element. Also, it doesn't matter whether the input HTML
    code is complete, pure HTML; one can input pseudo-HTML or page
    fragments, which makes it suitable for use on templates and HTML
    meta-languages such as WML.

    Also included in the distribution is hypertoc, a script which uses the
    module so that one can process files on the command-line in a
    user-friendly manner.

DETAILS
    The ToC generated is a multi-level list containing links to the
    significant elements. HTML::GenToc places each link in the ToC at the
    level the user specified for that significant element.

    Example:

    If H1s are specified as level 1, then they appear in the first level
    list of the ToC. If H2s are specified as level 2, then they appear in
    a second level list in the ToC.
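
    To illustrate how a tag-to-level mapping shapes the ToC, here is a
    minimal sketch in plain Perl. It does not use HTML::GenToc and is not
    the module's implementation; %toc_entry mirrors the documented
    toc_entry option, while the sample HTML and the outline() helper are
    hypothetical.

```perl
use strict;
use warnings;

# Mirrors the documented toc_entry option: tag => level.
my %toc_entry = (H1 => 1, H2 => 2);

# Hypothetical helper: return one indented line per significant element.
sub outline {
    my ($html, $levels) = @_;
    my @lines;
    # Match <tag ...>content</tag>, ignoring the case of the tag name.
    while ($html =~ m{<(\w+)[^>]*>(.*?)</\1\s*>}gis) {
        my ($tag, $text) = (uc $1, $2);
        next unless exists $levels->{$tag};      # not a significant element
        my $indent = '  ' x ($levels->{$tag} - 1);
        push @lines, "$indent* $text";
    }
    return @lines;
}

my $html = '<h1>Guide</h1><p>Intro</p><h2>Setup</h2><h2>Usage</h2>';
print "$_\n" for outline($html, \%toc_entry);
```

    The H1 lands in the first-level list and both H2s in the second-level
    list; the P element is skipped because it has no level assigned.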

    Information on the significant elements and the levels at which they
    should occur is passed in to the methods used by this object, or one
    can use the defaults.

    There are two phases to the ToC generation. The first phase is to put
    suitable anchors into the HTML documents, and the second phase is to
    generate the ToC from HTML documents which have anchors in them for the
    ToC to link to.
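
    The two phases can be pictured with the following sketch. It is plain
    Perl written for illustration only, under the assumption that H1 and
    H2 are the significant elements; the add_anchors() and toc_links()
    helpers and the "secN" naming scheme are hypothetical and do not
    reflect how HTML::GenToc names its anchors.

```perl
use strict;
use warnings;

# Phase 1 (sketch): wrap the content of each <h1>/<h2> in a named anchor.
sub add_anchors {
    my ($html) = @_;
    my $count = 0;
    $html =~ s{<(h[12])>(.*?)</\1>}{
        my $name = 'sec' . ++$count;            # hypothetical naming scheme
        qq{<$1><a name="$name">$2</a></$1>}
    }gise;
    return $html;
}

# Phase 2 (sketch): collect the anchors and emit one link per heading.
sub toc_links {
    my ($html, $file) = @_;
    my @links;
    while ($html =~ m{<h[12]><a name="([^"]+)">(.*?)</a></h[12]>}gis) {
        push @links, qq{<a href="$file#$1">$2</a>};
    }
    return @links;
}

my $anchored = add_anchors('<h1>Intro</h1><h2>Details</h2>');
print "$anchored\n";
print "$_\n" for toc_links($anchored, 'doc.html');
```

    Phase 2 only works on input that already carries anchors, which is why
    the anchors are generated first.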

    For more information on controlling the contents of the created ToC, see
    "Formatting the ToC".

    HTML::GenToc also supports the ability to incorporate the ToC into the
    HTML document itself via the inline option. See "Inlining the ToC" for
    more information.

    In order for HTML::GenToc to support linking to significant elements,
    HTML::GenToc inserts anchors into the significant elements. One can use
    HTML::GenToc as a filter, outputting the result to another file, or one
    can overwrite the original file, with the original backed up with a
    suffix (default: "org") appended to the filename. One can also output
    the result to a string.

METHODS
    Default arguments can be set when the object is created, and overridden
    by setting arguments when the generate_toc method is called. Arguments
    are given as a hash of arguments.

  Method -- new
        $toc = new HTML::GenToc();

        $toc = new HTML::GenToc(toc_entry=>\%my_toc_entry,
            toc_end=>\%my_toc_end,
            bak=>'bak',
            ...
            );

    Creates a new HTML::GenToc object.

    These arguments will be used as defaults in invocations of other
    methods.

    See generate_toc for possible arguments.

  generate_toc
        $toc->generate_toc(outfile=>"index2.html");

        my $result_str = $toc->generate_toc(to_string=>1);

    Generates a table of contents for the significant elements in the HTML
    documents, optionally generating anchors for them first.

    Options

    bak
        bak => *string*

        If the input files are being overwritten (overwrite is on), copy
        each original file to "*filename*.*string*". If the value is
        empty, no backup file will be created. (default:org)
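
        The backup rule can be sketched as follows. This is an
        illustration using the core File::Copy and File::Temp modules, not
        the module's own code; the backup_file() helper and the temporary
        file are hypothetical.

```perl
use strict;
use warnings;
use File::Copy qw(copy);
use File::Temp qw(tempdir);

# Hypothetical helper: copy $file to "$file.$bak" unless $bak is empty.
sub backup_file {
    my ($file, $bak) = @_;
    return unless defined $bak && length $bak;    # empty value: no backup
    my $backup = "$file.$bak";
    copy($file, $backup) or die "Cannot copy $file to $backup: $!";
    return $backup;
}

# Create a throwaway input file so the sketch can run anywhere.
my $dir  = tempdir(CLEANUP => 1);
my $file = "$dir/index.html";
open my $fh, '>', $file or die "Cannot write $file: $!";
print {$fh} "<h1>Title</h1>\n";
close $fh or die "Cannot close $file: $!";

my $backup = backup_file($file, 'org');   # 'org' is the documented default suffix
print "backup written to $backup\n";
```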

    debug
        debug => 1

        Enable verbose debugging output. Used for debugging this module; in
        other words, don't bother. (default:off)

    entrysep
        entrysep => *string*

        Separator string for non-<li> item entries (default: ", ")

    filenames
        filenames => \@filenames

        The filenames to use when creating table-of-contents links. This
        overrides the filenames given in the input option, and is expected
        to have exactly the same number of elements. This can also be used
        when passing in string-content to the input option, to give a (fake)
        filename to use for the links relating to that content.

    footer
        footer => *file_or_string*

        Either the filename of the file containing footer text for ToC; or a
        string containing the footer text.

    header
        header => *file_or_string*

        Either the filename of the file containing header text for ToC; or a
        string containing the header text.

    ignore_only_one
        ignore_only_one => 1

        If there would be only one item in the ToC, don't make a ToC.

    ignore_sole_first
        ignore_sole_first => 1

        If the first item in the ToC is of the highest level, AND it is the
        only one of that level, ignore it. This is useful in web-pages where
        there is only one H1 header but one doesn't know beforehand whether
        there will be only one.
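
        A minimal sketch of this rule in plain Perl, assuming ToC entries
        are held as [level, text] pairs with level 1 as the highest. The
        drop_sole_first() helper is hypothetical and is not part of
        HTML::GenToc.

```perl
use strict;
use warnings;

# Hypothetical helper: drop the first entry if it is of the highest level
# present AND it is the only entry of that level.
sub drop_sole_first {
    my @entries = @_;    # each entry is [level, text]; level 1 is the highest
    return @entries unless @entries;
    my $top        = $entries[0][0];
    my $is_highest = !grep { $_->[0] < $top } @entries;
    my $is_sole    = 1 == grep { $_->[0] == $top } @entries;
    shift @entries if $is_highest && $is_sole;
    return @entries;
}

my @toc = ([1, 'Page Title'], [2, 'Install'], [2, 'Usage']);
print "$_->[0]: $_->[1]\n" for drop_sole_first(@toc);
```

        With a second level-1 entry in the list, nothing would be dropped.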

    inline
        inline => 1

        Put ToC in document at a given point. See "Inlining the ToC" for
        more information.

    input
        input => \@filenames

        input => $content

        This is expected to be either a reference to an array of filenames,
        or a string containing content to process.

        The three main uses would be:

        (a) you have more than one file to process, so pass in multiple
            filenames

        (b) you have one file to process, so pass in its filename as the
            only array item

        (c) you have HTML content to process, so pass in just the content as
            a string

        (default:undefined)
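
        The three forms might look like this in practice. The filenames
        and the HTML string are hypothetical; only form (c) is actually
        run, since it needs no files on disk, and the call is skipped if
        HTML::GenToc is not installed. The quiet and to_string options are
        the ones documented in this section.

```perl
use strict;
use warnings;

my @many_files = ('chapter1.html', 'chapter2.html');   # (a) hypothetical filenames
my @one_file   = ('index.html');                        # (b) hypothetical filename
my $content    = '<h1>Intro</h1><p>Text</p><h2>Details</h2>';   # (c) HTML string

# One argument set per documented form of the input option.
my %call_for = (
    a => { input => \@many_files },
    b => { input => \@one_file },
    c => { input => $content, to_string => 1 },
);

# Load the module at run time so the sketch still runs without it.
if (eval { require HTML::GenToc; 1 }) {
    my $toc    = HTML::GenToc->new(quiet => 1);
    my $result = $toc->generate_toc(%{ $call_for{c} });
    print $result;
}
else {
    print "HTML::GenToc is not installed; nothing generated\n";
}
```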

    notoc_match
        notoc_match => *string*

        If there are certain individual tags you don't wish to include in
        the table of contents, even though they match the "significant
        elements", then if this pattern matches the contents of the tag
        itself (not the body), that tag will not be included, either in
        generating anchors or in generating the ToC.
        (default: 'class="notoc"')
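
        As a sketch of the idea, the pattern is tested against the text
        of the start tag rather than the element body. The code below is
        plain Perl for illustration only; the wanted_headings() helper and
        the sample HTML are hypothetical.

```perl
use strict;
use warnings;

my $notoc_match = 'class="notoc"';   # the documented default pattern

# Hypothetical helper: keep only headings whose start tag does not match.
sub wanted_headings {
    my ($html, $pattern) = @_;
    my @kept;
    while ($html =~ m{<(h[1-6])([^>]*)>(.*?)</\1>}gis) {
        my ($attrs, $text) = ($2, $3);
        next if $attrs =~ /$pattern/;   # tested against the tag, not the body
        push @kept, $text;
    }
    return @kept;
}

my $html = '<h1>Manual</h1><h2 class="notoc">Legal notice</h2><h2>Install</h2>';
print "$_\n" for wanted_headings($html, $notoc_match);
```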

    ol
        ol => 1

        Use an ordered list for level 1 ToC entries.

    ol_num_levels
        ol_num_levels => 2

        The number of levels deep the OL listing will go if ol is true. If
        set to zero, will use an ordered list for all levels. (default:1)
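
        The interaction between ol and ol_num_levels can be summarised in
        a few lines of Perl. The list_tag() helper is hypothetical and
        only restates the rule described above.

```perl
use strict;
use warnings;

# Hypothetical helper: which list element to use at a given ToC depth.
sub list_tag {
    my ($depth, $ol, $ol_num_levels) = @_;
    return 'ul' unless $ol;                      # ol is off: always unordered
    return 'ol' if $ol_num_levels == 0;          # zero means ordered at every level
    return $depth <= $ol_num_levels ? 'ol' : 'ul';
}

for my $depth (1 .. 3) {
    printf "depth %d: ol_num_levels=1 -> %s, ol_num_levels=0 -> %s\n",
        $depth, list_tag($depth, 1, 1), list_tag($depth, 1, 0);
}
```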

    overwrite
        overwrite => 1

        Overwrite the input file with the output. (default:off)

    outfile
        outfile => *file*

        File to write the output to. This is where the modified HTML output
        goes to. Note that it doesn't make sense to use this option if you
        are processing more than one file. If you give '-' as the filename,
        then output will go to STDOUT. (default: STDOUT)

    quiet
        quiet => 1

        Suppress informative messages. (default: off)

    textonly
        textonly => 1

        Use only text content in significant elements.

    title
        title => *string*

        Title for ToC page (if not using header or inline or toc_only)
        (default: "Table of Contents")

    toc_after
        toc_after => \%toc_after_data

        %toc_after_data = ( *tag1* => *suffix1*, *tag2* => *suffix2* );

        toc_after => { H2=>'</em>' }

        For defining layout of significant elements in the ToC.

        This expects a reference to a hash of tag=>suffix pairs.

        The *tag* is the HTML tag which marks the start of the element. The
        *suffix* is what is required to be appended to the Table of Contents
        entry generated for that tag.

        (default: undefined)

    toc_before
        toc_before => \%toc_before_data

        %toc_before_data = ( *tag1* => *prefix1*, *tag2* => *prefix2* );

        toc_before=>{ H2=>'<em>' }

        For defining the layout of significant elements in the ToC. The
        *tag* is the HTML tag which marks the start of the element. The
        *prefix* is what is required to be prepended to the Table of
        Contents entry generated for that tag.

        (default: undefined)

    toc_end
        toc_end => \%toc_end_data

        %toc_end_data = ( *tag1* => *endtag1*, *tag2* => *endtag2* );

        toc_end => { H1 => '/H1', H2 => '/H2' }

        For defining significant elements. The *tag* is the HTML tag which
        marks the start of the element. The *endtag* is the HTML tag which
        marks the end of the element. When matching in the input file, case
        is ignored (but make sure that all your *tag* options referring to
        the same tag are exactly the same!).

    toc_entry
        toc_entry => \%toc_entry_data

        %toc_entry_data = ( *tag1* => *level1*, *tag2* => *level2* );

        toc_entry => { H1 => 1, H2 => 2 }

        For defining significant elements. The *tag* is the HTML tag which
        marks the start of the element. The *level* is what level the tag is
        considered to be. The value of *level* must be numeric, and
        non-zero. If the value is negative, consecutive entries for that
        significant element will be separated by the value set by the
        entrysep option.

    toclabel
        toclabel => *string*

        HTML text that labels the ToC. Always used. (default: "<h1>Table of
        Contents</h1>")

    toc_tag
        toc_tag => *string*

        If a ToC is to be included inline, this is the pattern which is used
        to match the tag where the ToC should be put. This can be a
        start-tag, an end-tag or a comment, but the < should be left out;
        that is, if you want the ToC to be placed after the BODY tag, then
        give "BODY". If you want a special comment tag to mark where the ToC
        should go, then include the comment marks, for example: "!--toc--".
        (default:BODY)

    toc_tag_replace
        toc_tag_replace => 1

        In conjunction with toc_tag, this is a flag to say whether the given
        tag should be replaced, or if the ToC should be put after the tag.
        This can be useful if your toc_tag is a comment and you don't need
        it after you have the ToC in place. (default:false)

    toc_only
        toc_only => 1

        Output only the Table of Contents, that is, the Table of Contents
        plus the toclabel. If there is a header or a footer, these will also
        be output.

        If toc_only is false, then if there is no header and inline is not
        true, a suitable HTML page header will be output, and if there is
        no footer and inline is not true, an HTML page footer will be
        output.

        (default:false)

    to_string
        to_string => 1

        Return the modified HTML output as a string. This *does* override
        other methods of output (unlike version 3.00). If *to_string* is
        false, the method will return 1 rather than a string.

    use_id
        use_id => 1

        Use id="*name*" for anchors rather than <a name="*name*"/> anchors.
        However if an anchor already exists for a Significant Element, this
        won't make an id for that particular element.

    useorg
        useorg => 1

        Use pre-existing backup files as the input source; that is, files of
        the form *infile*.*bak* (see input and bak).

INTERNAL METHODS
    These methods are documented for developer purposes and aren't intended
    to be used externally.

  make_anchor_name
        $toc->make_anchor_name(content=>$content,
            anchors=>\%anchors);

    Makes the anchor-name for one anchor. Bases the anchor on the content of
    the significant element. Ensures that anchors are unique.
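
    One way to derive a unique name from element content is sketched
    below. This is an assumption-laden illustration in plain Perl, not the
    algorithm HTML::GenToc uses; the unique_anchor() helper, the
    underscore-based naming and the numeric suffix are all hypothetical.

```perl
use strict;
use warnings;

# Hypothetical helper: derive an anchor name from heading text and make it
# unique by consulting (and updating) a hash of names already used.
sub unique_anchor {
    my ($content, $seen) = @_;
    my $name = $content;
    $name =~ s/<[^>]+>//g;          # drop any inline markup
    $name =~ s/[^A-Za-z0-9]+/_/g;   # collapse everything else to underscores
    $name =~ s/^_+|_+$//g;          # trim leading and trailing underscores
    $name = 'section' if $name eq '';
    my ($candidate, $n) = ($name, 1);
    $candidate = $name . '_' . ++$n while $seen->{$candidate};
    $seen->{$candidate} = 1;
    return $candidate;
}

my %anchors;
print unique_anchor($_, \%anchors), "\n"
    for 'The FOO command', 'Usage', 'The <em>FOO</em> command';
```

    The third heading reduces to the same base name as the first, so it
    receives a numeric suffix to keep the anchors distinct.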

  make_anchors
        my $new_html = $toc->make_anchors(input=>$html,
            notoc_match=>$notoc_match,
            use_id=>$use_id,
            toc_entry=>\%toc_entries,
            toc_end=>\%toc_ends,
            );

    Makes the anchors for the given input string. Returns a string.

  make_toc_list
        my @toc_list = $toc->make_toc_list(input=>$html,
            labels=>\%labels,
            notoc_match=>$notoc_match,
            toc_entry=>\%toc_entry,
            toc_end=>\%toc_end,
            filename=>$filename);

    Makes a list of lists which represents the structure and content of (a
    portion of) the ToC from one file. Also updates a list of labels for the
    ToC entries.

  build_lol
    Build a list of lists of paths, given a list of hashes with info about
    paths.

  output_toc
        $self->output_toc(toc=>$toc_str,
            input=>\@input,
            filenames=>\@filenames);

    Put the output (whether to file, STDOUT or string). The "output" in this
    case could be the ToC, the modified (anchors added) HTML, or both.

  put_toc_inline
        my $newhtml = $toc->put_toc_inline(toc_str=>$toc_str,
            filename=>$filename, in_string=>$in_string);

    Puts the given toc_str into the given input string; returns a string.

  cp
        cp($src, $dst);

    Copies file $src to $dst. Used for making backups of files.

FILE FORMATS
  Formatting the ToC
    The toc_entry and other related options give you control over how the
    ToC entries look, but there are other options to affect the final
    appearance of the ToC file created.

    With the header option, the contents of the given file (or string) will
    be prepended before the generated ToC. This allows you to have
    introductory text, or any other text, before the ToC.

    Note:
        If you use the header option, make sure the file specified contains
        the opening HTML tag, the HEAD element (containing the TITLE
        element), and the opening BODY tag. However, these tags/elements
        should not be in the header file if the inline option is used. See
        "Inlining the ToC" for information on what the header file should
        contain for inlining the ToC.

    With the toclabel option, the contents of the given string will be
    prepended before the generated ToC (but after any text taken from a
    header file).

    With the footer option, the contents of the file will be appended after
    the generated ToC.

    Note:
        If you use the footer, make sure it includes the closing BODY and
        HTML tags (unless, of course, you are using the inline option).

    If the header option is not specified, the appropriate starting HTML
    markup will be added, unless the toc_only option is specified. If the
    footer option is not specified, the appropriate closing HTML markup will
    be added, unless the toc_only option is specified.

    If you do not want or need to deal with header and footer files, you
    can set the title of the ToC file with the title option, and the
    heading, or label, placed before the list of ToC entries with the
    toclabel option. Both options have default values.

    If you do not want HTML page tags to be supplied, and just want the ToC
    itself, then specify the toc_only option. If there are no header or
    footer files, then this will simply output the contents of toclabel and
    the ToC itself.
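
    The ordering rules above can be summarised in a short sketch. It is
    plain Perl for illustration, not the module's code: the
    assemble_toc_page() helper is hypothetical, header and footer are
    taken as strings only, and the wrapper markup is a stand-in for
    whatever HTML::GenToc actually emits.

```perl
use strict;
use warnings;

# Hypothetical helper: header (or default page start), toclabel, ToC,
# footer (or default page end). toc_only suppresses the default wrappers.
sub assemble_toc_page {
    my (%arg) = @_;
    my $title    = $arg{title}    // 'Table of Contents';
    my $toclabel = $arg{toclabel} // '<h1>Table of Contents</h1>';
    my @parts;
    if    (defined $arg{header}) { push @parts, $arg{header} }
    elsif (!$arg{toc_only})      { push @parts, "<html><head><title>$title</title></head><body>" }
    push @parts, $toclabel, $arg{toc};
    if    (defined $arg{footer}) { push @parts, $arg{footer} }
    elsif (!$arg{toc_only})      { push @parts, '</body></html>' }
    return join "\n", @parts;
}

my $toc = '<ul><li><a href="#intro">Intro</a></li></ul>';
print assemble_toc_page(toc => $toc), "\n\n";
print assemble_toc_page(toc => $toc, toc_only => 1), "\n";
```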

  Inlining the ToC
    The ability to incorporate the ToC directly into an HTML document is
    supported via the inline option.

    Inlining will be done on the first file in the list of files processed,
    and will only be done if that file contains an opening tag matching the
    toc_tag value.

    If overwrite is true, then the first file in the list will be
    overwritten, with the generated ToC inserted at the appropriate spot.
    Otherwise a modified version of the first file is output to either
    STDOUT or to the output file defined by the outfile option.

    The options toc_tag and toc_tag_replace are used to determine where and
    how the ToC is inserted into the output.

    Example 1

        $toc->generate_toc(inline=>1,
                           toc_tag => 'BODY',
                           toc_tag_replace => 0,
                           ...
                           );

    This will put the generated ToC after the BODY tag of the first file. If
    the header option is specified, then the contents of the specified file
    are inserted after the BODY tag. If the toclabel option is not empty,
    then the text specified by the toclabel option is inserted. Then the ToC
    is inserted, and finally, if the footer option is specified, it inserts
    the footer. Then the rest of the input file follows as it was before.

    Example 2

        $toc->generate_toc(inline=>1,
                           toc_tag => '!--toc--',
                           toc_tag_replace => 1,
                           ...
                           );

    This will put the generated ToC at the first comment of the form
    <!--toc-->; that comment will be replaced by the ToC (in the order
    header toclabel ToC footer) followed by the rest of the input file.

    Note:
        The header file should not contain the beginning HTML tag and HEAD
        element since the HTML file being processed should already contain
        these tags/elements.
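
    The placement rules in Examples 1 and 2 can be sketched in plain Perl
    as follows. The place_toc() helper is hypothetical and is not the
    module's put_toc_inline method; it only shows the difference between
    inserting after the matched tag and replacing it.

```perl
use strict;
use warnings;

# Hypothetical helper mirroring the toc_tag and toc_tag_replace options.
sub place_toc {
    my ($html, $toc, $toc_tag, $replace) = @_;
    # toc_tag is given without the leading "<"; only the first match is used.
    my $tag_re = qr/<(?:$toc_tag)[^>]*>/i;
    if ($replace) {
        $html =~ s/$tag_re/$toc/;          # the tag itself is dropped
    }
    else {
        $html =~ s/($tag_re)/$1$toc/;      # the tag is kept, the ToC follows it
    }
    return $html;
}

my $toc = '<ul><li>...</li></ul>';
print place_toc('<html><body><h1>A</h1></body></html>', $toc, 'BODY', 0), "\n";
print place_toc('<p>Intro</p><!--toc--><h1>A</h1>', $toc, '!--toc--', 1), "\n";
```

    If the input contains no tag matching toc_tag, the sketch returns the
    input unchanged, which corresponds to inlining not being done.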

NOTES
    *   HTML::GenToc is smart enough to detect anchors inside significant
        elements. If the anchor defines the NAME attribute, HTML::GenToc
        uses the value. Else, it adds its own NAME attribute to the anchor.
        If use_id is true, then it likewise checks for and uses IDs.

    *   The TITLE element is treated specially if specified in the toc_entry
        option. It is illegal to insert anchors (A) into TITLE elements.
        Therefore, HTML::GenToc will actually link to the filename itself
        instead of the TITLE element of the document.

    *   HTML::GenToc will ignore a significant element if it does not
        contain any non-whitespace characters. A warning message is
        generated if such a condition exists.

    *   If you have a sequence of significant elements that change in a
        slightly disordered fashion, such as H1 -> H3 -> H2 or even H2 ->
        H1, HTML::GenToc still creates a list which is good HTML. However,
        if you are using an ordered list to that depth, you will get
        strange numbering, because an extra list element is inserted to
        nest the elements at the correct level.

        For example (H2 -> H1 with ol_num_levels=1):

            1. 
                * My H2 Header
            2. My H1 Header

        For example (H1 -> H3 -> H2 with ol_num_levels=0 and H3 also being
        significant):

            1. My H1 Header
                1. 
                    1. My H3 Header
                2. My H2 Header
            2. My Second H1 Header

        In cases such as this it may be better not to use the ol option.
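
        The numbering effect can be reproduced with a small sketch. It is
        plain Perl for illustration and not the module's list-building
        code; the numbered_outline() helper is hypothetical and assumes
        ordered lists at every level.

```perl
use strict;
use warnings;

# Hypothetical helper: number entries per nesting level, adding an empty
# filler item whenever a level is skipped on the way down.
sub numbered_outline {
    my @entries = @_;            # each entry is [level, text]
    my (@counter, @lines);
    my $depth = 0;
    my $emit = sub {
        my ($text) = @_;
        my $n = ++$counter[$depth];
        push @lines, ('    ' x ($depth - 1)) . "$n." . (length $text ? " $text" : '');
    };
    for my $entry (@entries) {
        my ($level, $text) = @$entry;
        while ($depth < $level) {
            $depth++;
            $counter[$depth] = 0;              # a new nested list starts from 1
            $emit->('') if $depth < $level;    # filler item that only carries nesting
        }
        $depth = $level;                       # moving back up needs no output
        $emit->($text);
    }
    return @lines;
}

print "$_\n" for numbered_outline(
    [1, 'My H1 Header'], [3, 'My H3 Header'],
    [2, 'My H2 Header'], [1, 'My Second H1 Header'],
);
```

        The filler item takes number 1 at the second level, which is why
        the H2 entry ends up numbered 2.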

CAVEATS
    *   Version 3.10 (and above) generates more verbose (SEO-friendly)
        anchors than prior versions. Thus anchors generated with earlier
        versions will not match version 3.10 anchors.

    *   Version 3.00 (and above) of HTML::GenToc is not compatible with
        Version 2.x of HTML::GenToc. It is now designed to do everything in
        one pass, and has dropped certain options: the infile option is no
        longer used (it has been replaced with the input option); the
        toc_file option no longer exists; use the outfile option instead;
        the tocmap option is no longer supported. Also the old array-parsing
        of arguments is no longer supported. There is no longer a
        generate_anchors method; everything is done with generate_toc.

        It now generates lower-case tags rather than upper-case ones.

    *   HTML::GenToc is not very efficient (memory and speed), and can be
        slow for large documents.

    *   Now that generation of anchors and of the ToC are done in one pass,
        even more memory is used than was the case before. This is more
        notable when processing multiple files, since all files are read
        into memory before processing them.

    *   Invalid markup will be generated if a significant element is
        contained inside of an anchor. For example:

            <a name="foo"><h1>The FOO command</h1></a>

        will be converted to (if H1 is a significant element),

            <a name="foo"><h1><a name="The">The</a> FOO command</h1></a>

        which is illegal since anchors cannot be nested.

        It is better style to put anchor statements within the element to be
        anchored. For example, the following is preferred:

            <h1><a name="foo">The FOO command</a></h1>

        HTML::GenToc will detect the "foo" name and use it.

    *   name attributes without quotes are not recognized.

BUGS
    Tell me about them.

REQUIRES
    The installation of this module requires "Module::Build". The module
    depends on "HTML::SimpleParse", "HTML::Entities" and "HTML::LinkList"
    and uses "Data::Dumper" for debugging purposes. The hypertoc script
    depends on "Getopt::Long", "Getopt::ArgvFile" and "Pod::Usage". Testing
    of this distribution depends on "Test::More".

INSTALLATION
    To install this module, run the following commands:

        perl Build.PL
        ./Build
        ./Build test
        ./Build install

    Or, if you're on a platform (like DOS or Windows) that doesn't like the
    "./" notation, you can do this:

       perl Build.PL
       perl Build
       perl Build test
       perl Build install

    In order to install somewhere other than the default, such as in a
    directory under your home directory, like "/home/fred/perl", run

       perl Build.PL --install_base /home/fred/perl

    as the first step instead.

    This will install the files underneath /home/fred/perl.

    You will then need to make sure that you alter the PERL5LIB variable to
    find the modules, and the PATH variable to find the script.

    Therefore you will need to change your path to include
    /home/fred/perl/script (where the script will be):

            PATH=/home/fred/perl/script:${PATH}

    and the PERL5LIB variable to add /home/fred/perl/lib:

            PERL5LIB=/home/fred/perl/lib:${PERL5LIB}

SEE ALSO
    perl(1) htmltoc(1) hypertoc(1)

AUTHOR
    Kathryn Andersen (RUBYKAT) http://www.katspace.org/tools/hypertoc/

    Based on htmltoc by Earl Hood ehood AT medusa.acs.uci.edu

    Contributions by Dan Dascalescu, <http://dandascalescu.com>

COPYRIGHT
    Copyright (C) 1994-1997 Earl Hood, ehood AT medusa.acs.uci.edu
    Copyright (C) 2002-2008 Kathryn Andersen

    This program is free software; you can redistribute it and/or modify it
    under the terms of the GNU General Public License as published by the
    Free Software Foundation; either version 2 of the License, or (at your
    option) any later version.

    This program is distributed in the hope that it will be useful, but
    WITHOUT ANY WARRANTY; without even the implied warranty of
    MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE. See the GNU General
    Public License for more details.

    You should have received a copy of the GNU General Public License along
    with this program; if not, write to the Free Software Foundation, Inc.,
    675 Mass Ave, Cambridge, MA 02139, USA.