liblatex-tom-perl / 4b962ac
Import Upstream version 0.8
gregor herrmann, 6 years ago
6 changed file(s) with 524 addition(s) and 331 deletion(s).
00 Revision history for Perl extension LaTeX::TOM.
1
2 0.8 Mon Oct 8 10:23:01 CEST 2007
3
4 - Fixed failing tests pod.t & pod-coverage.t (adjusted plans).
5
6 0.7 Tue Aug 28 00:12:03 CEST 2007
7
8 - Added formatting tags to the documentation where appropriate
9 and enlisted all methods within the documentation index.
110
211 0.6 Wed Mar 14 01:05:09 CET 2007
312
00 ---
11 name: LaTeX-TOM
2 version: 0.6
2 version: 0.8
33 author:
44 - 'Aaron Krowne <akrowne@vt.edu.org>'
5 abstract: 'A module for parsing, analyzing, and manipulating LaTeX documents.'
5 abstract: A module for parsing, analyzing, and manipulating LaTeX documents.
66 license: perl
7 resources:
8 license: http://dev.perl.org/licenses/
97 build_requires:
108 Test::More: 0
9 generated_by: Module::Build version 0.2808
10 meta-spec:
11 url: http://module-build.sourceforge.net/META-spec-v1.2.html
12 version: 1.2
1113 provides:
1214 LaTeX::TOM:
1315 file: lib/LaTeX/TOM.pm
14 version: 0.6
16 version: 0.8
1517 LaTeX::TOM::Node:
1618 file: lib/LaTeX/TOM/Node.pm
1719 LaTeX::TOM::Parser:
1820 file: lib/LaTeX/TOM/Parser.pm
1921 LaTeX::TOM::Tree:
2022 file: lib/LaTeX/TOM/Tree.pm
21 generated_by: Module::Build version 0.2805
22 meta-spec:
23 url: http://module-build.sourceforge.net/META-spec-v1.2.html
24 version: 1.2
23 resources:
24 license: http://dev.perl.org/licenses/
README (+165, -159)
22 documents.
33
44 SYNOPSIS
5 use LaTeX::TOM;
6
7 my $parser = LaTeX::TOM->new;
8
9 my $document = $parser->parseFile('mypaper.tex');
10
11 my $latex = $document->toLaTeX;
12
13 my $specialnodes = $document->getNodesByCondition(
14 '$node->getNodeType eq \'TEXT\' &&
15 $node->getNodeText =~ /magic string/');
16
17 my $sections = $document->getNodesByCondition(
18 '$node->getNodeType eq \'COMMAND\' &&
19 $node->getCommandName =~ /section$/');
20
21 my $indexme = $document->getIndexableText;
22
23 $document->print;
5 use LaTeX::TOM;
6
7 $parser = LaTeX::TOM->new;
8
9 $document = $parser->parseFile('mypaper.tex');
10
11 $latex = $document->toLaTeX;
12
13 $specialnodes = $document->getNodesByCondition(
14 '$node->getNodeType eq \'TEXT\' &&
15 $node->getNodeText =~ /magic string/'
16 );
17
18 $sections = $document->getNodesByCondition(
19 '$node->getNodeType eq \'COMMAND\' &&
20 $node->getCommandName =~ /section$/'
21 );
22
23 $indexme = $document->getIndexableText;
24
25 $document->print;
2426
2527 DESCRIPTION
2628 This module provides a parser which parses and interprets (though not
2729 fully) LaTeX documents and returns a tree-based representation of what
28 it finds. This tree is a LaTeX::TOM::Tree. The tree contains
29 LaTeX::TOM:Node nodes.
30 it finds. This tree is a "LaTeX::TOM::Tree". The tree contains
31 "LaTeX::TOM::Node" nodes.
3032
3133 This module should be especially useful to anyone who wants to do
3234 processing of LaTeX documents that requires extraction of plain-text
4648 parameter to be 0 or 2 to completely parse the document.
4749
4850 read inputs flag (= 0 || 1)
49 This flag determines whether a scan for \input and \input-like
51 This flag determines whether a scan for "\input" and "\input-like"
5052 commands is performed, and the resulting called files parsed and
5153 added to the parent parse tree. 0 means no, 1 means do it. Note that
5254 this will happen recursively if it is turned on. Also,
5456
5557 apply mappings flag (= 0 || 1)
5658 This flag determines whether (most) user-defined mappings are
57 applied. This means \defs, \newcommands, and \newenvironments. This
58 is critical for properly analyzing the content of the document, as
59 this must be phrased in terms of the semantics of the original TeX
60 and LaTeX commands, not ad hoc user macros. So, for instance, do not
61 expect plain-text extraction to work properly with this option off.
62
63 The parser returns a LaTeX::TOM::Tree ($document in the SYNOPSIS).
59 applied. This means "\defs", "\newcommands", and "\newenvironments".
60 This is critical for properly analyzing the content of the document,
61 as this must be phrased in terms of the semantics of the original
62 TeX and LaTeX commands, not ad hoc user macros. So, for instance, do
63 not expect plain-text extraction to work properly with this option
64 off.
65
66 The parser returns a "LaTeX::TOM::Tree" ($document in the SYNOPSIS).
6467
6568 LaTeX::TOM::Node
6669 Nodes may be of the following types:
6770
6871 TEXT
69 TEXT nodes can be thought of as representing the plain-text portions
70 of the LaTeX document. This includes math and anything else that is
71 not a recognized TeX or LaTeX command, or user-defined command. In
72 reality, TEXT nodes contain commands that this parser does not yet
73 recognize the semantics of.
72 "TEXT" nodes can be thought of as representing the plain-text
73 portions of the LaTeX document. This includes math and anything else
74 that is not a recognized TeX or LaTeX command, or user-defined
75 command. In reality, "TEXT" nodes contain commands that this parser
76 does not yet recognize the semantics of.
7477
7578 COMMAND
76 A COMMAND node represents a TeX command. It always has child nodes
79 A "COMMAND" node represents a TeX command. It always has child nodes
7780 in a tree, though the tree might be empty if the command operates on
7881 zero parameters. An example of a command is
7982
80 \textbf{blah}
81
82 This would parse into a COMMAND node for *textbf*, which would have
83 a subtree containing the TEXT node with text ``blah.''
83 \textbf{blah}
84
85 This would parse into a "COMMAND" node for "textbf", which would
86 have a subtree containing the "TEXT" node with text ``blah.''
8487
8588 ENVIRONMENT
86 Similarly, TeX environments parse into ENVIRONMENT nodes, which have
87 metadata about the environment, along with a subtree representing
88 what is contained in the environment. For example,
89
90 \begin{equation}
91 r = \frac{-b \pm \sqrt{b^2 - 4ac}}{2a}
92 \end{equation}
93
94 Would parse into an ENVIRONMENT node of the class ``equation'' with
95 a child tree containing the result of parsing ``r = \frac{-b \pm
96 \sqrt{b^2 - 4ac}}{2a}.''
89 Similarly, TeX environments parse into "ENVIRONMENT" nodes, which
90 have metadata about the environment, along with a subtree
91 representing what is contained in the environment. For example,
92
93 \begin{equation}
94 r = \frac{-b \pm \sqrt{b^2 - 4ac}}{2a}
95 \end{equation}
96
97 Would parse into an "ENVIRONMENT" node of the class ``equation''
98 with a child tree containing the result of parsing ``r = \frac{-b
99 \pm \sqrt{b^2 - 4ac}}{2a}.''
97100
98101 GROUP
99 A GROUP is like an anonymous COMMAND. Since you can put whatever you
100 want in curly-braces ({}) in TeX in order to make semantically
102 A "GROUP" is like an anonymous "COMMAND". Since you can put whatever
103 you want in curly-braces ("{}") in TeX in order to make semantically
101104 isolated regions, this separation is preserved by the parser. A
102 GROUP is just the subtree of the parsed contents of plain
105 "GROUP" is just the subtree of the parsed contents of plain
103106 curly-braces.
104107
105 It is important to note that currently only the first GROUP in a
106 series of GROUPs following a LaTeX command will actually be parsed
107 into a COMMAND node. The reason is that, for the initial purposes of
108 this module, it was not necessary to recognize additional GROUPs as
109 additional parameters to the COMMAND. However, this is something
110 that this module really should do eventually. Currently if you want
111 all the parameters to a multi-parametered command, you'll need to
112 pick out all the following GROUP nodes yourself.
108 It is important to note that currently only the first "GROUP" in a
109 series of "GROUP"s following a LaTeX command will actually be parsed
110 into a "COMMAND" node. The reason is that, for the initial purposes
111 of this module, it was not necessary to recognize additional
112 "GROUP"s as additional parameters to the "COMMAND". However, this is
113 something that this module really should do eventually. Currently if
114 you want all the parameters to a multi-parametered command, you'll
115 need to pick out all the following "GROUP" nodes yourself.
113116
114117 Eventually this will become something like a list which is stored in
115 the COMMAND node, much like XML::DOM's treatment of attributes.
118 the "COMMAND" node, much like XML::DOM's treatment of attributes.
116119 These are, in a sense, apart from the rest of the document tree.
117 Then GROUP nodes will become much more rare.
120 Then "GROUP" nodes will become much more rare.
118121
119122 COMMENT
120 A COMMENT node is very similar to a TEXT node, except it is
123 A "COMMENT" node is very similar to a "TEXT" node, except it is
121124 specifically for lines beginning with ``%'' (the TeX comment
122125 delimeter) or the right-hand portion of a line that has ``%'' at
123126 some internal point.
133136
134137 METHODS
135138 LaTeX::TOM
136 new Instantiate a new parser object.
139 new
140 Instantiate a new parser object.
137141
138142 In this section all of the methods for each of the components are listed
139143 and described.
142146 The methods for the parser (aside from the constructor, discussed above)
143147 are :
144148
145 parseFile (filename)
149 parseFile (filename)
146150 Read in the contents of *filename* and parse them, returning a
147 LaTeX::TOM:Tree.
148
149 parse (string)
150 Parse the string *string* and return a LaTeX::TOM::Tree.
151 "LaTeX::TOM::Tree".
152
153 parse (string)
154 Parse the string *string* and return a "LaTeX::TOM::Tree".
151155
152156 LaTeX::TOM::Tree
153157 This section contains methods for the Trees returned by the parser.
154158
155 copy
159 copy
156160 Duplicate a tree into new memory.
157161
158 print
162 print
159163 A debug print of the structure of the tree.
160164
161 plainText
165 plainText
162166 Returns an arrayref which is a list of strings representing the text
163 of all getNodePlainTextFlag = 1 TEXT nodes, in an inorder traversal.
164
165 indexableText
167 of all "getNodePlainTextFlag = 1" "TEXT" nodes, in an inorder
168 traversal.
169
170 indexableText
166171 A method like the above but which goes one step further; it cleans
167172 all of the returned text and concatenates it into a single string
168173 which one could consider having all of the standard information
169174 retrieval value for the document, making it useful for indexing.
170175
171 toLaTeX
176 toLaTeX
172177 Return a string representing the LaTeX encoded by the tree. This is
173178 especially useful to get a normal document again, after modifying
174179 nodes of the tree.
175180
176 getTopLevelNodes
177 Return an arrayref which is a list of LaTeX::TOM::Nodes at the top
181 getTopLevelNodes
182 Return an arrayref which is a list of "LaTeX::TOM::Nodes" at the top
178183 level of the Tree.
179184
180 getAllNodes
185 getAllNodes
181186 Return an arrayref with all nodes of the tree. This "flattens" the
182187 tree.
183188
184 getCommandNodesByName (name)
185 Return an arrayref with all COMMAND nodes in the tree which have a
189 getCommandNodesByName (name)
190 Return an arrayref with all "COMMAND" nodes in the tree which have a
186191 name matching *name*.
187192
188 getEnvironmentsByName (name)
189 Return an arrayref with all ENVIRONMENT nodes in the tree which have
190 a class matching *name*.
191
192 getNodesByCondition (expression)
193 getEnvironmentsByName (name)
194 Return an arrayref with all "ENVIRONMENT" nodes in the tree which
195 have a class matching *name*.
196
197 getNodesByCondition (expression)
193198 This is a catch-all search method which can be used to pull out
194199 nodes that match pretty much any perl expression, without manually
195200 having to traverse the tree. *expression* is a valid perl expression
200205 LaTeX::TOM::Node
201206 This section contains the methods for nodes of the parsed Trees.
202207
203 getNodeType
204 Returns the type, one of 'TEXT', 'COMMAND', 'ENVIRONMENT', 'GROUP',
205 or 'COMMENT', as described above.
206
207 getNodeText
208 Applicable for TEXT or COMMENT nodes; this returns the document text
209 they contain. This is undef for other node types.
210
211 setNodeText
212 Set the node text, also for TEXT and COMMENT nodes.
213
214 getNodeStartingPosition
208 getNodeType
209 Returns the type, one of "TEXT", "COMMAND", "ENVIRONMENT", "GROUP",
210 or "COMMENT", as described above.
211
212 getNodeText
213 Applicable for "TEXT" or "COMMENT" nodes; this returns the document
214 text they contain. This is undef for other node types.
215
216 setNodeText
217 Set the node text, also for "TEXT" and "COMMENT" nodes.
218
219 getNodeStartingPosition
215220 Get the starting character position in the document of this node.
216 For TEXT and COMMENT nodes, this will be where the text begins. For
217 ENVIRONMENT, COMMAND, or GROUP nodes, this will be the position of
218 the *last* character of the opening identifier.
219
220 getNodeEndingPosition
221 Same as above, but for last character. For GROUP, ENVIRONMENT, or
222 COMMAND nodes, this will be the *first* character of the closing
221 For "TEXT" and "COMMENT" nodes, this will be where the text begins.
222 For "ENVIRONMENT", "COMMAND", or "GROUP" nodes, this will be the
223 position of the *last* character of the opening identifier.
224
225 getNodeEndingPosition
226 Same as above, but for last character. For "GROUP", "ENVIRONMENT",
227 or "COMMAND" nodes, this will be the *first* character of the
228 closing identifier.
229
230 getNodeOuterStartingPosition
231 Same as getNodeStartingPosition, but for "GROUP", "ENVIRONMENT", or
232 "COMMAND" nodes, this returns the *first* character of the opening
223233 identifier.
224234
225 getNodeOuterStartingPosition
226 Same as getNodeStartingPosition, but for GROUP, ENVIRONMENT, or
227 COMMAND nodes, this returns the *first* character of the opening
235 getNodeOuterEndingPosition
236 Same as getNodeEndingPosition, but for "GROUP", "ENVIRONMENT", or
237 "COMMAND" nodes, this returns the *last* character of the closing
228238 identifier.
229239
230 getNodeOuterEndingPosition
231 Same as getNodeEndingPosition, but for GROUP, ENVIRONMENT, or
232 COMMAND nodes, this returns the *last* character of the closing
233 identifier.
234
235 getNodeMathFlag
240 getNodeMathFlag
236241 This applies to any node type. It is 1 if the node sets, or is
237 contained within, a math mode region. 0 otherwise. TEXT nodes which
238 have this flag as 1 can be assumed to be the actual mathematics
239 contained in the document.
240
241 getNodePlainTextFlag
242 This applies only to TEXT nodes. It is 1 if the node is non-math and
243 is visible (in other words, will end up being a part of the output
244 document). One would only want to index TEXT nodes with this
245 property, for information retrieval purposes.
246
247 getEnvironmentClass
248 This applies only to ENVIRONMENT nodes. Returns what class of
249 environment the node represents (the X in \begin{X} and \end{X}).
250
251 getCommandName
252 This applies only to COMMAND nodes. Returns the name of the command
253 (the X in \X{...}).
254
255 getChildTree
256 This applies only to COMMAND, ENVIRONMENT, and GROUP nodes: it
257 returns the LaTeX::TOM::Tree which is ``under'' the calling node.
258
259 getFirstChild
260 This applies only to COMMAND, ENVIRONMENT, and GROUP nodes: it
242 contained within, a math mode region. 0 otherwise. "TEXT" nodes
243 which have this flag as 1 can be assumed to be the actual
244 mathematics contained in the document.
245
246 getNodePlainTextFlag
247 This applies only to "TEXT" nodes. It is 1 if the node is non-math
248 and is visible (in other words, will end up being a part of the
249 output document). One would only want to index "TEXT" nodes with
250 this property, for information retrieval purposes.
251
252 getEnvironmentClass
253 This applies only to "ENVIRONMENT" nodes. Returns what class of
254 environment the node represents (the "X" in "\begin{X}" and
255 "\end{X}").
256
257 getCommandName
258 This applies only to "COMMAND" nodes. Returns the name of the
259 command (the "X" in "\X{...}").
260
261 getChildTree
262 This applies only to "COMMAND", "ENVIRONMENT", and "GROUP" nodes: it
263 returns the "LaTeX::TOM::Tree" which is ``under'' the calling node.
264
265 getFirstChild
266 This applies only to "COMMAND", "ENVIRONMENT", and "GROUP" nodes: it
261267 returns the first node from the first level of the child subtree.
262268
263 getLastChild
269 getLastChild
264270 Same as above, but for the last node of the first level.
265271
266 getPreviousSibling
272 getPreviousSibling
267273 Return the prior node on the same level of the tree.
268274
269 getNextSibling
275 getNextSibling
270276 Same as above, but for following node.
271277
272 getParent
278 getParent
273279 Get the parent node of this node in the tree.
274280
275 getNextGroupNode
281 getNextGroupNode
276282 This is an interesting function, and kind of a hack because of the
277283 way the parser makes the current tree. Basically it will give you
278 the next sibling that is a GROUP node, until it either hits the end
279 of the tree level, a TEXT node which doesn't match /^\s*$/, or a
280 COMMAND node.
281
282 This is useful for finding all GROUPed parameters after a COMMAND
283 node (see comments for 'GROUP' in the 'COMPONENTS' /
284 'LaTeX::TOM::Node' section). You can just have a while loop that
285 calls this method until it gets 'undef', and you'll know you've
284 the next sibling that is a "GROUP" node, until it either hits the
285 end of the tree level, a "TEXT" node which doesn't match "/^\s*$/",
286 or a "COMMAND" node.
287
288 This is useful for finding all "GROUP"ed parameters after a
289 "COMMAND" node (see comments for "GROUP" in the "COMPONENTS" /
290 "LaTeX::TOM::Node" section). You can just have a while loop that
291 calls this method until it gets "undef", and you'll know you've
286292 found all the parameters to a command.
287293
288 Note: this may be bad, but TEXT Nodes matching /^\s*\[[0-9]+\]$/
294 Note: this may be bad, but "TEXT" Nodes matching "/^\s*\[[0-9]+\]$/"
289295 (optional parameter groups) are treated as if they were 'blank'.
290296
291297 CAVEATS
292298 Due to the lack of tree-modification methods, currently this module is
293299 mostly useful for minor modifications to the parsed document, for
294 instance, altering the text of TEXT nodes but not deleting the nodes. Of
295 course, the user can still do this by breaking abstraction and directly
296 modifying the Tree.
300 instance, altering the text of "TEXT" nodes but not deleting the nodes.
301 Of course, the user can still do this by breaking abstraction and
302 directly modifying the Tree.
297303
298304 Also note that the parsing is not complete. This module was not written
299305 with the intention of being able to produce output documents the way
309315 of ~1000 research publications from the Computing Research Repository,
310316 so I deemed it ``good enough'' to use for purposes similar to mine.
311317
312 Please let me know of parser errors if you discover any.
313
314 AUTHOR
318 Please let the authors know of parser errors if you discover any.
319
320 AUTHORS
315321 Written by Aaron Krowne <akrowne@vt.edu>
316322
317323 Maintained by Steven Schubiger <schubiger@cpan.org>
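
Reviewer note (not part of the upstream diff): the README text for getNextGroupNode says you can loop on that method until it returns undef to find the GROUP nodes that follow a COMMAND node. A minimal sketch of that loop in Perl follows. The helper name collect_group_params is hypothetical and does not exist in LaTeX::TOM. The sketch assumes only that the node passed in responds to getNextGroupNode as the README describes, and that the method is called on each returned GROUP node in turn.

```perl
use strict;
use warnings;

# Hypothetical helper (not part of LaTeX::TOM): gather the GROUP nodes that
# follow a node by calling getNextGroupNode until it returns undef.
# Assumption: each returned GROUP node also responds to getNextGroupNode.
sub collect_group_params {
    my ($node) = @_;

    my @params;
    my $group = $node->getNextGroupNode;
    while (defined $group) {
        push @params, $group;                 # keep groups in document order
        $group = $group->getNextGroupNode;    # advance along the same tree level
    }

    return @params;
}
```

According to the README, the first parameter of a command is already parsed into the COMMAND node's own child tree, so a helper like this would only return the second and later parameters. One plausible use is to call it on each node returned by getCommandNodesByName and then read each group's contents through getChildTree and toLaTeX; that usage has not been run against the module here.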
11 #
22 # LaTeX::TOM (TeX Object Model)
33 #
4 # Version 0.6
4 # Version 0.8
55 #
66 # ----------------------------------------------------------------------------
77 #
3131
3232 use base qw(LaTeX::TOM::Parser);
3333
34 our $VERSION = '0.6';
34 our $VERSION = '0.8';
3535
3636 # BEGIN CONFIG SECTION ########################################################
3737
206206
207207 =head1 SYNOPSIS
208208
209 use LaTeX::TOM;
210
211 my $parser = LaTeX::TOM->new;
212
213 my $document = $parser->parseFile('mypaper.tex');
214
215 my $latex = $document->toLaTeX;
216
217 my $specialnodes = $document->getNodesByCondition(
218 '$node->getNodeType eq \'TEXT\' &&
219 $node->getNodeText =~ /magic string/');
220
221 my $sections = $document->getNodesByCondition(
222 '$node->getNodeType eq \'COMMAND\' &&
223 $node->getCommandName =~ /section$/');
224
225 my $indexme = $document->getIndexableText;
226
227 $document->print;
209 use LaTeX::TOM;
210
211 $parser = LaTeX::TOM->new;
212
213 $document = $parser->parseFile('mypaper.tex');
214
215 $latex = $document->toLaTeX;
216
217 $specialnodes = $document->getNodesByCondition(
218 '$node->getNodeType eq \'TEXT\' &&
219 $node->getNodeText =~ /magic string/'
220 );
221
222 $sections = $document->getNodesByCondition(
223 '$node->getNodeType eq \'COMMAND\' &&
224 $node->getCommandName =~ /section$/'
225 );
226
227 $indexme = $document->getIndexableText;
228
229 $document->print;
228230
229231 =head1 DESCRIPTION
230232
231233 This module provides a parser which parses and interprets (though not fully)
232234 LaTeX documents and returns a tree-based representation of what it finds.
233 This tree is a LaTeX::TOM::Tree. The tree contains LaTeX::TOM:Node nodes.
235 This tree is a C<LaTeX::TOM::Tree>. The tree contains C<LaTeX::TOM::Node> nodes.
234236
235237 This module should be especially useful to anyone who wants to do processing
236238 of LaTeX documents that requires extraction of plain-text information, or
247249
248250 =item parse error handling (= B<0> || 1 || 2)
249251
250 Determines what happens when a parse error is encountered. 0 results in a
251 warning. 1 results in a die. 2 results in silence. Note that particular
252 Determines what happens when a parse error is encountered. C<0> results in a
253 warning. C<1> results in a die. C<2> results in silence. Note that particular
252254 groupings in LaTeX (i.e. newcommands and the like) contain invalid TeX or
253 LaTeX, so you nearly always need this parameter to be 0 or 2 to completely
255 LaTeX, so you nearly always need this parameter to be C<0> or C<2> to completely
254256 parse the document.
255257
256258 =item read inputs flag (= 0 || B<1>)
257259
258 This flag determines whether a scan for \input and \input-like commands is
260 This flag determines whether a scan for C<\input> and C<\input-like> commands is
259261 performed, and the resulting called files parsed and added to the parent
260 parse tree. 0 means no, 1 means do it. Note that this will happen recursively
261 if it is turned on. Also, bibliographies (.bbl files) are detected and
262 parse tree. C<0> means no, C<1> means do it. Note that this will happen recursively
263 if it is turned on. Also, bibliographies (F<.bbl> files) are detected and
262264 included.
263265
264266 =item apply mappings flag (= 0 || B<1>)
265267
266268 This flag determines whether (most) user-defined mappings are applied. This
267 means \defs, \newcommands, and \newenvironments. This is critical for properly
268 analyzing the content of the document, as this must be phrased in terms of the
269 semantics of the original TeX and LaTeX commands, not ad hoc user macros. So,
270 for instance, do not expect plain-text extraction to work properly with this
269 means C<\defs>, C<\newcommands>, and C<\newenvironments>. This is critical for
270 properly analyzing the content of the document, as this must be phrased in terms
271 of the semantics of the original TeX and LaTeX commands, not ad hoc user macros.
272 So, for instance, do not expect plain-text extraction to work properly with this
271273 option off.
272274
273275 =back
274276
275 The parser returns a LaTeX::TOM::Tree ($document in the SYNOPSIS).
277 The parser returns a C<LaTeX::TOM::Tree> ($document in the SYNOPSIS).
276278
277279 =head2 LaTeX::TOM::Node
278280
282284
283285 =item TEXT
284286
285 TEXT nodes can be thought of as representing the plain-text portions of the
287 C<TEXT> nodes can be thought of as representing the plain-text portions of the
286288 LaTeX document. This includes math and anything else that is not a recognized
287 TeX or LaTeX command, or user-defined command. In reality, TEXT nodes contain
289 TeX or LaTeX command, or user-defined command. In reality, C<TEXT> nodes contain
288290 commands that this parser does not yet recognize the semantics of.
289291
290292 =item COMMAND
291293
292 A COMMAND node represents a TeX command. It always has child nodes in a tree,
294 A C<COMMAND> node represents a TeX command. It always has child nodes in a tree,
293295 though the tree might be empty if the command operates on zero parameters. An
294296 example of a command is
295297
296 \textbf{blah}
297
298 This would parse into a COMMAND node for I<textbf>, which would have a subtree
299 containing the TEXT node with text ``blah.''
298 \textbf{blah}
299
300 This would parse into a C<COMMAND> node for C<textbf>, which would have a subtree
301 containing the C<TEXT> node with text ``blah.''
300302
301303 =item ENVIRONMENT
302304
303 Similarly, TeX environments parse into ENVIRONMENT nodes, which have metadata
305 Similarly, TeX environments parse into C<ENVIRONMENT> nodes, which have metadata
304306 about the environment, along with a subtree representing what is contained in
305307 the environment. For example,
306308
307 \begin{equation}
308 r = \frac{-b \pm \sqrt{b^2 - 4ac}}{2a}
309 \end{equation}
310
311 Would parse into an ENVIRONMENT node of the class ``equation'' with a child
312 tree containing the result of parsing ``r = \frac{-b \pm \sqrt{b^2 - 4ac}}{2a}.''
309 \begin{equation}
310 r = \frac{-b \pm \sqrt{b^2 - 4ac}}{2a}
311 \end{equation}
312
313 Would parse into an C<ENVIRONMENT> node of the class ``equation'' with a child
314 tree containing the result of parsing C<``r = \frac{-b \pm \sqrt{b^2 - 4ac}}{2a}.''>
313315
314316 =item GROUP
315317
316 A GROUP is like an anonymous COMMAND. Since you can put whatever you want in
317 curly-braces ({}) in TeX in order to make semantically isolated regions, this
318 separation is preserved by the parser. A GROUP is just the subtree of the
318 A C<GROUP> is like an anonymous C<COMMAND>. Since you can put whatever you want in
319 curly-braces (C<{}>) in TeX in order to make semantically isolated regions, this
320 separation is preserved by the parser. A C<GROUP> is just the subtree of the
319321 parsed contents of plain curly-braces.
320322
321 It is important to note that currently only the first GROUP in a series of
322 GROUPs following a LaTeX command will actually be parsed into a COMMAND node.
323 It is important to note that currently only the first C<GROUP> in a series of
324 C<GROUP>s following a LaTeX command will actually be parsed into a C<COMMAND> node.
323325 The reason is that, for the initial purposes of this module, it was not
324 necessary to recognize additional GROUPs as additional parameters to the
325 COMMAND. However, this is something that this module really should do
326 necessary to recognize additional C<GROUP>s as additional parameters to the
327 C<COMMAND>. However, this is something that this module really should do
326328 eventually. Currently if you want all the parameters to a multi-parametered
327 command, you'll need to pick out all the following GROUP nodes yourself.
329 command, you'll need to pick out all the following C<GROUP> nodes yourself.
328330
329331 Eventually this will become something like a list which is stored in the
330 COMMAND node, much like XML::DOM's treatment of attributes. These are, in a
331 sense, apart from the rest of the document tree. Then GROUP nodes will become
332 C<COMMAND> node, much like L<XML::DOM>'s treatment of attributes. These are, in a
333 sense, apart from the rest of the document tree. Then C<GROUP> nodes will become
332334 much more rare.
333335
334336 =item COMMENT
335337
336 A COMMENT node is very similar to a TEXT node, except it is specifically for
337 lines beginning with ``%'' (the TeX comment delimeter) or the right-hand
338 portion of a line that has ``%'' at some internal point.
338 A C<COMMENT> node is very similar to a C<TEXT> node, except it is specifically for
339 lines beginning with C<``%''> (the TeX comment delimeter) or the right-hand
340 portion of a line that has C<``%''> at some internal point.
339341
340342 =back
341343
353355
354356 =head2 LaTeX::TOM
355357
356 =over 4
357
358 =item new
358 =head3 new
359
360 =over 4
361
362 =item C<>
359363
360364 Instantiate a new parser object.
361365
368372
369373 The methods for the parser (aside from the constructor, discussed above) are :
370374
371 =over 4
372
373 =item parseFile (filename)
374
375 Read in the contents of I<filename> and parse them, returning a LaTeX::TOM:Tree.
376
377 =item parse (string)
378
379 Parse the string I<string> and return a LaTeX::TOM::Tree.
375 =head3 parseFile (filename)
376
377 =over 4
378
379 =item C<>
380
381 Read in the contents of I<filename> and parse them, returning a C<LaTeX::TOM::Tree>.
382
383 =back
384
385 =head3 parse (string)
386
387 =over 4
388
389 =item C<>
390
391 Parse the string I<string> and return a C<LaTeX::TOM::Tree>.
380392
381393 =back
382394
384396
385397 This section contains methods for the Trees returned by the parser.
386398
387 =over 4
388
389 =item copy
399 =head3 copy
400
401 =over 4
402
403 =item C<>
390404
391405 Duplicate a tree into new memory.
392406
393 =item print
407 =back
408
409 =head3 print
410
411 =over 4
412
413 =item C<>
394414
395415 A debug print of the structure of the tree.
396416
397 =item plainText
417 =back
418
419 =head3 plainText
420
421 =over 4
422
423 =item C<>
398424
399425 Returns an arrayref which is a list of strings representing the text of all
400 getNodePlainTextFlag = 1 TEXT nodes, in an inorder traversal.
401
402 =item indexableText
426 C<getNodePlainTextFlag = 1> C<TEXT> nodes, in an inorder traversal.
427
428 =back
429
430 =head3 indexableText
431
432 =over 4
433
434 =item C<>
403435
404436 A method like the above but which goes one step further; it cleans all of the
405437 returned text and concatenates it into a single string which one could consider
406438 having all of the standard information retrieval value for the document,
407439 making it useful for indexing.
408440
409 =item toLaTeX
441 =back
442
443 =head3 toLaTeX
444
445 =over 4
446
447 =item C<>
410448
411449 Return a string representing the LaTeX encoded by the tree. This is especially
412450 useful to get a normal document again, after modifying nodes of the tree.
413451
414 =item getTopLevelNodes
415
416 Return an arrayref which is a list of LaTeX::TOM::Nodes at the top level of
452 =back
453
454 =head3 getTopLevelNodes
455
456 =over 4
457
458 =item C<>
459
460 Return an arrayref which is a list of C<LaTeX::TOM::Nodes> at the top level of
417461 the Tree.
418462
419 =item getAllNodes
463 =back
464
465 =head3 getAllNodes
466
467 =over 4
468
469 =item C<>
420470
421471 Return an arrayref with B<all> nodes of the tree. This "flattens" the tree.
422472
423 =item getCommandNodesByName (name)
424
425 Return an arrayref with all COMMAND nodes in the tree which have a name
473 =back
474
475 =head3 getCommandNodesByName (name)
476
477 =over 4
478
479 =item C<>
480
481 Return an arrayref with all C<COMMAND> nodes in the tree which have a name
426482 matching I<name>.
427483
428 =item getEnvironmentsByName (name)
429
430 Return an arrayref with all ENVIRONMENT nodes in the tree which have a class
484 =back
485
486 =head3 getEnvironmentsByName (name)
487
488 =over 4
489
490 =item C<>
491
492 Return an arrayref with all C<ENVIRONMENT> nodes in the tree which have a class
431493 matching I<name>.
432494
433 =item getNodesByCondition (expression)
495 =back
496
497 =head3 getNodesByCondition (expression)
498
499 =over 4
500
501 =item C<>
434502
435503 This is a catch-all search method which can be used to pull out nodes that
436504 match pretty much any perl expression, without manually having to traverse the
444512
445513 This section contains the methods for nodes of the parsed Trees.
446514
447 =over 4
448
449 =item getNodeType
450
451 Returns the type, one of 'TEXT', 'COMMAND', 'ENVIRONMENT', 'GROUP', or 'COMMENT',
515 =head3 getNodeType
516
517 =over 4
518
519 =item C<>
520
521 Returns the type, one of C<TEXT>, C<COMMAND>, C<ENVIRONMENT>, C<GROUP>, or C<COMMENT>,
452522 as described above.
453523
454 =item getNodeText
455
456 Applicable for TEXT or COMMENT nodes; this returns the document text they contain.
524 =back
525
526 =head3 getNodeText
527
528 =over 4
529
530 =item C<>
531
532 Applicable for C<TEXT> or C<COMMENT> nodes; this returns the document text they contain.
457533 This is undef for other node types.
458534
459 =item setNodeText
460
461 Set the node text, also for TEXT and COMMENT nodes.
462
463 =item getNodeStartingPosition
464
465 Get the starting character position in the document of this node. For TEXT
466 and COMMENT nodes, this will be where the text begins. For ENVIRONMENT,
467 COMMAND, or GROUP nodes, this will be the position of the I<last> character of
535 =back
536
537 =head3 setNodeText
538
539 =over 4
540
541 =item C<>
542
543 Set the node text, also for C<TEXT> and C<COMMENT> nodes.
544
545 =back
546
547 =head3 getNodeStartingPosition
548
549 =over 4
550
551 =item C<>
552
553 Get the starting character position in the document of this node. For C<TEXT>
554 and C<COMMENT> nodes, this will be where the text begins. For C<ENVIRONMENT>,
555 C<COMMAND>, or C<GROUP> nodes, this will be the position of the I<last> character of
468556 the opening identifier.
469557
470 =item getNodeEndingPosition
471
472 Same as above, but for last character. For GROUP, ENVIRONMENT, or COMMAND
558 =back
559
560 =head3 getNodeEndingPosition
561
562 =over 4
563
564 =item C<>
565
566 Same as above, but for last character. For C<GROUP>, C<ENVIRONMENT>, or C<COMMAND>
473567 nodes, this will be the I<first> character of the closing identifier.
474568
475 =item getNodeOuterStartingPosition
476
477 Same as getNodeStartingPosition, but for GROUP, ENVIRONMENT, or COMMAND nodes,
569 =back
570
571 =head3 getNodeOuterStartingPosition
572
573 =over 4
574
575 =item C<>
576
577 Same as getNodeStartingPosition, but for C<GROUP>, C<ENVIRONMENT>, or C<COMMAND> nodes,
478578 this returns the I<first> character of the opening identifier.
479579
480 =item getNodeOuterEndingPosition
481
482 Same as getNodeEndingPosition, but for GROUP, ENVIRONMENT, or COMMAND nodes,
580 =back
581
582 =head3 getNodeOuterEndingPosition
583
584 =over 4
585
586 =item C<>
587
588 Same as getNodeEndingPosition, but for C<GROUP>, C<ENVIRONMENT>, or C<COMMAND> nodes,
483589 this returns the I<last> character of the closing identifier.
484590
485 =item getNodeMathFlag
486
487 This applies to any node type. It is 1 if the node sets, or is contained
488 within, a math mode region. 0 otherwise. TEXT nodes which have this flag as 1
591 =back
592
593 =head3 getNodeMathFlag
594
595 =over 4
596
597 =item C<>
598
599 This applies to any node type. It is C<1> if the node sets, or is contained
600 within, a math mode region. C<0> otherwise. C<TEXT> nodes which have this flag as C<1>
489601 can be assumed to be the actual mathematics contained in the document.
490602
491 =item getNodePlainTextFlag
492
493 This applies only to TEXT nodes. It is 1 if the node is non-math B<and> is
603 =back
604
605 =head3 getNodePlainTextFlag
606
607 =over 4
608
609 =item C<>
610
611 This applies only to C<TEXT> nodes. It is C<1> if the node is non-math B<and> is
494612 visible (in other words, will end up being a part of the output document). One
495 would only want to index TEXT nodes with this property, for information
613 would only want to index C<TEXT> nodes with this property, for information
496614 retrieval purposes.
497615
498 =item getEnvironmentClass
499
500 This applies only to ENVIRONMENT nodes. Returns what class of environment the
501 node represents (the X in \begin{X} and \end{X}).
502
503 =item getCommandName
504
505 This applies only to COMMAND nodes. Returns the name of the command (the X in
506 \X{...}).
507
508 =item getChildTree
509
510 This applies only to COMMAND, ENVIRONMENT, and GROUP nodes: it returns the
511 LaTeX::TOM::Tree which is ``under'' the calling node.
512
513 =item getFirstChild
514
515 This applies only to COMMAND, ENVIRONMENT, and GROUP nodes: it returns the
616 =back
617
618 =head3 getEnvironmentClass
619
620 =over 4
621
622 =item C<>
623
624 This applies only to C<ENVIRONMENT> nodes. Returns what class of environment the
625 node represents (the C<X> in C<\begin{X}> and C<\end{X}>).
626
627 =back
628
629 =head3 getCommandName
630
631 =over 4
632
633 =item C<>
634
635 This applies only to C<COMMAND> nodes. Returns the name of the command (the C<X> in
636 C<\X{...}>).
637
638 =back
639
640 =head3 getChildTree
641
642 =over 4
643
644 =item C<>
645
646 This applies only to C<COMMAND>, C<ENVIRONMENT>, and C<GROUP> nodes: it returns the
647 C<LaTeX::TOM::Tree> which is ``under'' the calling node.
648
649 =back
650
651 =head3 getFirstChild
652
653 =over 4
654
655 =item C<>
656
657 This applies only to C<COMMAND>, C<ENVIRONMENT>, and C<GROUP> nodes: it returns the
516658 first node from the first level of the child subtree.
517659
518 =item getLastChild
660 =back
661
662 =head3 getLastChild
663
664 =over 4
665
666 =item C<>
519667
520668 Same as above, but for the last node of the first level.
521669
522 =item getPreviousSibling
670 =back
671
672 =head3 getPreviousSibling
673
674 =over 4
675
676 =item C<>
523677
524678 Return the prior node on the same level of the tree.
525679
526 =item getNextSibling
680 =back
681
682 =head3 getNextSibling
683
684 =over 4
685
686 =item C<>
527687
528688 Same as above, but for following node.
529689
530 =item getParent
690 =back
691
692 =head3 getParent
693
694 =over 4
695
696 =item C<>
531697
532698 Get the parent node of this node in the tree.
533699
534 =item getNextGroupNode
700 =back
701
702 =head3 getNextGroupNode
703
704 =over 4
705
706 =item C<>
535707
536708 This is an interesting function, and kind of a hack because of the way the
537709 parser makes the current tree. Basically it will give you the next sibling
538 that is a GROUP node, until it either hits the end of the tree level, a TEXT
539 node which doesn't match /^\s*$/, or a COMMAND node.
540
541 This is useful for finding all GROUPed parameters after a COMMAND node (see
542 comments for 'GROUP' in the 'COMPONENTS' / 'LaTeX::TOM::Node' section). You
543 can just have a while loop that calls this method until it gets 'undef', and
710 that is a C<GROUP> node, until it either hits the end of the tree level, a C<TEXT>
711 node which doesn't match C</^\s*$/>, or a C<COMMAND> node.
712
713 This is useful for finding all C<GROUP>ed parameters after a C<COMMAND> node (see
714 comments for C<GROUP> in the C<COMPONENTS> / C<LaTeX::TOM::Node> section). You
715 can just have a while loop that calls this method until it gets C<undef>, and
544716 you'll know you've found all the parameters to a command.
545717
546 Note: this may be bad, but TEXT Nodes matching /^\s*\[[0-9]+\]$/ (optional
718 Note: this may be bad, but C<TEXT> Nodes matching C</^\s*\[[0-9]+\]$/> (optional
547719 parameter groups) are treated as if they were 'blank'.
548720
549721 =back
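
The while loop described for getNextGroupNode can be sketched as below. This is a minimal illustration, not code from the module: `collect_group_params` is a hypothetical helper name, and it assumes that calling getNextGroupNode on a returned GROUP node continues the scan along the same tree level, as the documentation's "call until undef" wording suggests.

```perl
use strict;
use warnings;

# Hypothetical helper (not part of LaTeX::TOM): gather the GROUP nodes that
# follow a COMMAND node by calling getNextGroupNode until it returns undef.
sub collect_group_params {
    my ($command) = @_;
    my @groups;
    my $node = $command->getNextGroupNode;
    while (defined $node) {
        push @groups, $node;
        # Assumption: the same method, called on a GROUP node, keeps scanning.
        $node = $node->getNextGroupNode;
    }
    return \@groups;
}
```

With a parsed document, one might call it on each node returned by `getCommandNodesByName` and then inspect every collected group through `getChildTree`.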
552724
553725 Due to the lack of tree-modification methods, currently this module is
554726 mostly useful for minor modifications to the parsed document, for instance,
555 altering the text of TEXT nodes but not deleting the nodes. Of course, the
727 altering the text of C<TEXT> nodes but not deleting the nodes. Of course, the
556728 user can still do this by breaking abstraction and directly modifying the Tree.
557729
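The kind of minor modification mentioned above (changing the text of TEXT nodes without removing any nodes) might look like the following sketch. It uses only methods documented in this file (getAllNodes, getNodeType, getNodeText, setNodeText); the helper name `replace_in_text_nodes` is hypothetical and not part of LaTeX::TOM.

```perl
use strict;
use warnings;

# Hypothetical helper (not part of LaTeX::TOM): apply a substitution to the
# text of every TEXT node in place and return the number of nodes changed.
sub replace_in_text_nodes {
    my ($tree, $pattern, $replacement) = @_;
    my $changed = 0;
    for my $node (@{ $tree->getAllNodes }) {
        next unless $node->getNodeType eq 'TEXT';
        my $text = $node->getNodeText;
        next unless defined $text;
        # Work on a copy so untouched nodes are never rewritten.
        (my $new = $text) =~ s/$pattern/$replacement/g;
        next if $new eq $text;
        $node->setNodeText($new);
        $changed++;
    }
    return $changed;
}
```

After such a pass, calling `toLaTeX` on the tree should yield the modified document as a string.
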
558730 Also note that the parsing is not complete. This module was not written with
561733 logical level with regards to the content; it doesn't care about the document
562734 formatting and outputting side of TeX/LaTeX.
563735
564 There is much work still to be done. See the TODO list in the TOM.pm source.
736 There is much work still to be done. See the F<TODO> list in the F<TOM.pm> source.
565737
566738 =head1 BUGS
567739
569741 ~1000 research publications from the Computing Research Repository, so I
570742 deemed it ``good enough'' to use for purposes similar to mine.
571743
572 Please let me know of parser errors if you discover any.
573
574 =head1 AUTHOR
744 Please let the authors know of parser errors if you discover any.
745
746 =head1 AUTHORS
575747
576748 Written by Aaron Krowne <akrowne@vt.edu>
577749
22 use strict;
33 use warnings;
44
5 use Test::More tests => 1;
5 use Test::More;
6
67 eval "use Test::Pod::Coverage 1.04";
78 plan skip_all => "Test::Pod::Coverage 1.04 required for testing POD coverage" if $@;
9
10 plan tests => 1;
811 pod_coverage_ok('LaTeX::TOM');
22 use strict;
33 use warnings;
44
5 use Test::More tests => 1;
5 use Test::More;
6
67 eval "use Test::Pod 1.14";
78 plan skip_all => "Test::Pod 1.14 required for testing POD" if $@;
9
10 plan tests => 1;
811 pod_file_ok('lib/LaTeX/TOM.pm');
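
Both test hunks make the same change: the plan is no longer declared on the `use Test::More` line, so that `plan skip_all` can still be issued when the optional module is missing (Test::More rejects a second plan once one has been declared, which is presumably why the old tests failed). A minimal sketch of that check, factored into a hypothetical helper `skip_reason_for` that does not appear in the diff:

```perl
use strict;
use warnings;
use Test::More;

# Hypothetical helper (not in the diff): returns a skip reason when the
# optional module cannot be loaded at the requested minimum version,
# and undef otherwise.
sub skip_reason_for {
    my ($module, $min_version) = @_;
    return if eval "use $module $min_version; 1";
    return "$module $min_version required for this test";
}

# Intended use at the top of a test file, before any plan is declared:
#   my $reason = skip_reason_for('Test::Pod', '1.14');
#   plan skip_all => $reason if defined $reason;
#   plan tests => 1;
```

The point of the design is the ordering: decide whether to skip first, and only then commit to a test count.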