Import Upstream version 0.8
gregor herrmann
6 years ago
0 | 0 | Revision history for Perl extension LaTeX::TOM. |
1 | ||
2 | 0.8 Mon Oct 8 10:23:01 CEST 2007 | |
3 | ||
4 | - Fixed failing tests pod.t & pod-coverage.t (adjusted plans). | |
5 | ||
6 | 0.7 Tue Aug 28 00:12:03 CEST 2007 | |
7 | ||
8 | - Added formatting tags to the documentation where appropriate | |
9 | and enlisted all methods within the documentation index. | |
1 | 10 | |
2 | 11 | 0.6 Wed Mar 14 01:05:09 CET 2007 |
3 | 12 |
0 | 0 | --- |
1 | 1 | name: LaTeX-TOM |
2 | version: 0.6 | |
2 | version: 0.8 | |
3 | 3 | author: |
4 | 4 | - 'Aaron Krowne <akrowne@vt.edu.org>' |
5 | abstract: 'A module for parsing, analyzing, and manipulating LaTeX documents.' | |
5 | abstract: A module for parsing, analyzing, and manipulating LaTeX documents. | |
6 | 6 | license: perl |
7 | resources: | |
8 | license: http://dev.perl.org/licenses/ | |
9 | 7 | build_requires: |
10 | 8 | Test::More: 0 |
9 | generated_by: Module::Build version 0.2808 | |
10 | meta-spec: | |
11 | url: http://module-build.sourceforge.net/META-spec-v1.2.html | |
12 | version: 1.2 | |
11 | 13 | provides: |
12 | 14 | LaTeX::TOM: |
13 | 15 | file: lib/LaTeX/TOM.pm |
14 | version: 0.6 | |
16 | version: 0.8 | |
15 | 17 | LaTeX::TOM::Node: |
16 | 18 | file: lib/LaTeX/TOM/Node.pm |
17 | 19 | LaTeX::TOM::Parser: |
18 | 20 | file: lib/LaTeX/TOM/Parser.pm |
19 | 21 | LaTeX::TOM::Tree: |
20 | 22 | file: lib/LaTeX/TOM/Tree.pm |
21 | generated_by: Module::Build version 0.2805 | |
22 | meta-spec: | |
23 | url: http://module-build.sourceforge.net/META-spec-v1.2.html | |
24 | version: 1.2 | |
23 | resources: | |
24 | license: http://dev.perl.org/licenses/ |
2 | 2 | documents. |
3 | 3 | |
4 | 4 | SYNOPSIS |
5 | use LaTeX::TOM; | |
6 | ||
7 | my $parser = LaTeX::TOM->new; | |
8 | ||
9 | my $document = $parser->parseFile('mypaper.tex'); | |
10 | ||
11 | my $latex = $document->toLaTeX; | |
12 | ||
13 | my $specialnodes = $document->getNodesByCondition( | |
14 | '$node->getNodeType eq \'TEXT\' && | |
15 | $node->getNodeText =~ /magic string/'); | |
16 | ||
17 | my $sections = $document->getNodesByCondition( | |
18 | '$node->getNodeType eq \'COMMAND\' && | |
19 | $node->getCommandName =~ /section$/'); | |
20 | ||
21 | my $indexme = $document->getIndexableText; | |
22 | ||
23 | $document->print; | |
5 | use LaTeX::TOM; | |
6 | ||
7 | $parser = LaTeX::TOM->new; | |
8 | ||
9 | $document = $parser->parseFile('mypaper.tex'); | |
10 | ||
11 | $latex = $document->toLaTeX; | |
12 | ||
13 | $specialnodes = $document->getNodesByCondition( | |
14 | '$node->getNodeType eq \'TEXT\' && | |
15 | $node->getNodeText =~ /magic string/' | |
16 | ); | |
17 | ||
18 | $sections = $document->getNodesByCondition( | |
19 | '$node->getNodeType eq \'COMMAND\' && | |
20 | $node->getCommandName =~ /section$/' | |
21 | ); | |
22 | ||
23 | $indexme = $document->getIndexableText; | |
24 | ||
25 | $document->print; | |
24 | 26 | |
25 | 27 | DESCRIPTION |
26 | 28 | This module provides a parser which parses and interprets (though not |
27 | 29 | fully) LaTeX documents and returns a tree-based representation of what |
28 | it finds. This tree is a LaTeX::TOM::Tree. The tree contains | |
29 | LaTeX::TOM:Node nodes. | |
30 | it finds. This tree is a "LaTeX::TOM::Tree". The tree contains | |
31 | "LaTeX::TOM::Node" nodes. | |
30 | 32 | |
31 | 33 | This module should be especially useful to anyone who wants to do |
32 | 34 | processing of LaTeX documents that requires extraction of plain-text |
46 | 48 | parameter to be 0 or 2 to completely parse the document. |
47 | 49 | |
48 | 50 | read inputs flag (= 0 || 1) |
49 | This flag determines whether a scan for \input and \input-like | |
51 | This flag determines whether a scan for "\input" and "\input-like" | |
50 | 52 | commands is performed, and the resulting called files parsed and |
51 | 53 | added to the parent parse tree. 0 means no, 1 means do it. Note that |
52 | 54 | this will happen recursively if it is turned on. Also, |
54 | 56 | |
55 | 57 | apply mappings flag (= 0 || 1) |
56 | 58 | This flag determines whether (most) user-defined mappings are |
57 | applied. This means \defs, \newcommands, and \newenvironments. This | |
58 | is critical for properly analyzing the content of the document, as | |
59 | this must be phrased in terms of the semantics of the original TeX | |
60 | and LaTeX commands, not ad hoc user macros. So, for instance, do not | |
61 | expect plain-text extraction to work properly with this option off. | |
62 | ||
63 | The parser returns a LaTeX::TOM::Tree ($document in the SYNOPSIS). | |
59 | applied. This means "\defs", "\newcommands", and "\newenvironments". | |
60 | This is critical for properly analyzing the content of the document, | |
61 | as this must be phrased in terms of the semantics of the original | |
62 | TeX and LaTeX commands, not ad hoc user macros. So, for instance, do | |
63 | not expect plain-text extraction to work properly with this option | |
64 | off. | |
65 | ||
66 | The parser returns a "LaTeX::TOM::Tree" ($document in the SYNOPSIS). | |
64 | 67 | |
65 | 68 | LaTeX::TOM::Node |
66 | 69 | Nodes may be of the following types: |
67 | 70 | |
68 | 71 | TEXT |
69 | TEXT nodes can be thought of as representing the plain-text portions | |
70 | of the LaTeX document. This includes math and anything else that is | |
71 | not a recognized TeX or LaTeX command, or user-defined command. In | |
72 | reality, TEXT nodes contain commands that this parser does not yet | |
73 | recognize the semantics of. | |
72 | "TEXT" nodes can be thought of as representing the plain-text | |
73 | portions of the LaTeX document. This includes math and anything else | |
74 | that is not a recognized TeX or LaTeX command, or user-defined | |
75 | command. In reality, "TEXT" nodes contain commands that this parser | |
76 | does not yet recognize the semantics of. | |
74 | 77 | |
75 | 78 | COMMAND |
76 | A COMMAND node represents a TeX command. It always has child nodes | |
79 | A "COMMAND" node represents a TeX command. It always has child nodes | |
77 | 80 | in a tree, though the tree might be empty if the command operates on |
78 | 81 | zero parameters. An example of a command is |
79 | 82 | |
80 | \textbf{blah} | |
81 | ||
82 | This would parse into a COMMAND node for *textbf*, which would have | |
83 | a subtree containing the TEXT node with text ``blah.'' | |
83 | \textbf{blah} | |
84 | ||
85 | This would parse into a "COMMAND" node for "textbf", which would | |
86 | have a subtree containing the "TEXT" node with text ``blah.'' | |
84 | 87 | |
85 | 88 | ENVIRONMENT |
86 | Similarly, TeX environments parse into ENVIRONMENT nodes, which have | |
87 | metadata about the environment, along with a subtree representing | |
88 | what is contained in the environment. For example, | |
89 | ||
90 | \begin{equation} | |
91 | r = \frac{-b \pm \sqrt{b^2 - 4ac}}{2a} | |
92 | \end{equation} | |
93 | ||
94 | Would parse into an ENVIRONMENT node of the class ``equation'' with | |
95 | a child tree containing the result of parsing ``r = \frac{-b \pm | |
96 | \sqrt{b^2 - 4ac}}{2a}.'' | |
89 | Similarly, TeX environments parse into "ENVIRONMENT" nodes, which | |
90 | have metadata about the environment, along with a subtree | |
91 | representing what is contained in the environment. For example, | |
92 | ||
93 | \begin{equation} | |
94 | r = \frac{-b \pm \sqrt{b^2 - 4ac}}{2a} | |
95 | \end{equation} | |
96 | ||
97 | Would parse into an "ENVIRONMENT" node of the class ``equation'' | |
98 | with a child tree containing the result of parsing ``r = \frac{-b | |
99 | \pm \sqrt{b^2 - 4ac}}{2a}.'' | |
97 | 100 | |
98 | 101 | GROUP |
99 | A GROUP is like an anonymous COMMAND. Since you can put whatever you | |
100 | want in curly-braces ({}) in TeX in order to make semantically | |
102 | A "GROUP" is like an anonymous "COMMAND". Since you can put whatever | |
103 | you want in curly-braces ("{}") in TeX in order to make semantically | |
101 | 104 | isolated regions, this separation is preserved by the parser. A |
102 | GROUP is just the subtree of the parsed contents of plain | |
105 | "GROUP" is just the subtree of the parsed contents of plain | |
103 | 106 | curly-braces. |
104 | 107 | |
105 | It is important to note that currently only the first GROUP in a | |
106 | series of GROUPs following a LaTeX command will actually be parsed | |
107 | into a COMMAND node. The reason is that, for the initial purposes of | |
108 | this module, it was not necessary to recognize additional GROUPs as | |
109 | additional parameters to the COMMAND. However, this is something | |
110 | that this module really should do eventually. Currently if you want | |
111 | all the parameters to a multi-parametered command, you'll need to | |
112 | pick out all the following GROUP nodes yourself. | |
108 | It is important to note that currently only the first "GROUP" in a | |
109 | series of "GROUP"s following a LaTeX command will actually be parsed | |
110 | into a "COMMAND" node. The reason is that, for the initial purposes | |
111 | of this module, it was not necessary to recognize additional | |
112 | "GROUP"s as additional parameters to the "COMMAND". However, this is | |
113 | something that this module really should do eventually. Currently if | |
114 | you want all the parameters to a multi-parametered command, you'll | |
115 | need to pick out all the following "GROUP" nodes yourself. | |
113 | 116 | |
114 | 117 | Eventually this will become something like a list which is stored in |
115 | the COMMAND node, much like XML::DOM's treatment of attributes. | |
118 | the "COMMAND" node, much like XML::DOM's treatment of attributes. | |
116 | 119 | These are, in a sense, apart from the rest of the document tree. |
117 | Then GROUP nodes will become much more rare. | |
120 | Then "GROUP" nodes will become much more rare. | |
118 | 121 | |
119 | 122 | COMMENT |
120 | A COMMENT node is very similar to a TEXT node, except it is | |
123 | A "COMMENT" node is very similar to a "TEXT" node, except it is | |
121 | 124 | specifically for lines beginning with ``%'' (the TeX comment |
122 | 125 | delimeter) or the right-hand portion of a line that has ``%'' at |
123 | 126 | some internal point. |
133 | 136 | |
134 | 137 | METHODS |
135 | 138 | LaTeX::TOM |
136 | new Instantiate a new parser object. | |
139 | new | |
140 | Instantiate a new parser object. | |
137 | 141 | |
138 | 142 | In this section all of the methods for each of the components are listed |
139 | 143 | and described. |
142 | 146 | The methods for the parser (aside from the constructor, discussed above) |
143 | 147 | are : |
144 | 148 | |
145 | parseFile (filename) | |
149 | parseFile (filename) | |
146 | 150 | Read in the contents of *filename* and parse them, returning a |
147 | LaTeX::TOM:Tree. | |
148 | ||
149 | parse (string) | |
150 | Parse the string *string* and return a LaTeX::TOM::Tree. | |
151 | "LaTeX::TOM::Tree". | |
152 | ||
153 | parse (string) | |
154 | Parse the string *string* and return a "LaTeX::TOM::Tree". | |
151 | 155 | |
152 | 156 | LaTeX::TOM::Tree |
153 | 157 | This section contains methods for the Trees returned by the parser. |
154 | 158 | |
155 | copy | |
159 | copy | |
156 | 160 | Duplicate a tree into new memory. |
157 | 161 | |
158 | print | |
162 | print | |
159 | 163 | A debug print of the structure of the tree. |
160 | 164 | |
161 | plainText | |
165 | plainText | |
162 | 166 | Returns an arrayref which is a list of strings representing the text |
163 | of all getNodePlainTextFlag = 1 TEXT nodes, in an inorder traversal. | |
164 | ||
165 | indexableText | |
167 | of all "getNodePlainTextFlag = 1" "TEXT" nodes, in an inorder | |
168 | traversal. | |
169 | ||
170 | indexableText | |
166 | 171 | A method like the above but which goes one step further; it cleans |
167 | 172 | all of the returned text and concatenates it into a single string |
168 | 173 | which one could consider having all of the standard information |
169 | 174 | retrieval value for the document, making it useful for indexing. |
170 | 175 | |
171 | toLaTeX | |
176 | toLaTeX | |
172 | 177 | Return a string representing the LaTeX encoded by the tree. This is |
173 | 178 | especially useful to get a normal document again, after modifying |
174 | 179 | nodes of the tree. |
175 | 180 | |
176 | getTopLevelNodes | |
177 | Return an arrayref which is a list of LaTeX::TOM::Nodes at the top | |
181 | getTopLevelNodes | |
182 | Return an arrayref which is a list of "LaTeX::TOM::Nodes" at the top | |
178 | 183 | level of the Tree. |
179 | 184 | |
180 | getAllNodes | |
185 | getAllNodes | |
181 | 186 | Return an arrayref with all nodes of the tree. This "flattens" the |
182 | 187 | tree. |
183 | 188 | |
184 | getCommandNodesByName (name) | |
185 | Return an arrayref with all COMMAND nodes in the tree which have a | |
189 | getCommandNodesByName (name) | |
190 | Return an arrayref with all "COMMAND" nodes in the tree which have a | |
186 | 191 | name matching *name*. |
187 | 192 | |
188 | getEnvironmentsByName (name) | |
189 | Return an arrayref with all ENVIRONMENT nodes in the tree which have | |
190 | a class matching *name*. | |
191 | ||
192 | getNodesByCondition (expression) | |
193 | getEnvironmentsByName (name) | |
194 | Return an arrayref with all "ENVIRONMENT" nodes in the tree which | |
195 | have a class matching *name*. | |
196 | ||
197 | getNodesByCondition (expression) | |
193 | 198 | This is a catch-all search method which can be used to pull out |
194 | 199 | nodes that match pretty much any perl expression, without manually |
195 | 200 | having to traverse the tree. *expression* is a valid perl expression |
200 | 205 | LaTeX::TOM::Node |
201 | 206 | This section contains the methods for nodes of the parsed Trees. |
202 | 207 | |
203 | getNodeType | |
204 | Returns the type, one of 'TEXT', 'COMMAND', 'ENVIRONMENT', 'GROUP', | |
205 | or 'COMMENT', as described above. | |
206 | ||
207 | getNodeText | |
208 | Applicable for TEXT or COMMENT nodes; this returns the document text | |
209 | they contain. This is undef for other node types. | |
210 | ||
211 | setNodeText | |
212 | Set the node text, also for TEXT and COMMENT nodes. | |
213 | ||
214 | getNodeStartingPosition | |
208 | getNodeType | |
209 | Returns the type, one of "TEXT", "COMMAND", "ENVIRONMENT", "GROUP", | |
210 | or "COMMENT", as described above. | |
211 | ||
212 | getNodeText | |
213 | Applicable for "TEXT" or "COMMENT" nodes; this returns the document | |
214 | text they contain. This is undef for other node types. | |
215 | ||
216 | setNodeText | |
217 | Set the node text, also for "TEXT" and "COMMENT" nodes. | |
218 | ||
219 | getNodeStartingPosition | |
215 | 220 | Get the starting character position in the document of this node. |
216 | For TEXT and COMMENT nodes, this will be where the text begins. For | |
217 | ENVIRONMENT, COMMAND, or GROUP nodes, this will be the position of | |
218 | the *last* character of the opening identifier. | |
219 | ||
220 | getNodeEndingPosition | |
221 | Same as above, but for last character. For GROUP, ENVIRONMENT, or | |
222 | COMMAND nodes, this will be the *first* character of the closing | |
221 | For "TEXT" and "COMMENT" nodes, this will be where the text begins. | |
222 | For "ENVIRONMENT", "COMMAND", or "GROUP" nodes, this will be the | |
223 | position of the *last* character of the opening identifier. | |
224 | ||
225 | getNodeEndingPosition | |
226 | Same as above, but for last character. For "GROUP", "ENVIRONMENT", | |
227 | or "COMMAND" nodes, this will be the *first* character of the | |
228 | closing identifier. | |
229 | ||
230 | getNodeOuterStartingPosition | |
231 | Same as getNodeStartingPosition, but for "GROUP", "ENVIRONMENT", or | |
232 | "COMMAND" nodes, this returns the *first* character of the opening | |
223 | 233 | identifier. |
224 | 234 | |
225 | getNodeOuterStartingPosition | |
226 | Same as getNodeStartingPosition, but for GROUP, ENVIRONMENT, or | |
227 | COMMAND nodes, this returns the *first* character of the opening | |
235 | getNodeOuterEndingPosition | |
236 | Same as getNodeEndingPosition, but for "GROUP", "ENVIRONMENT", or | |
237 | "COMMAND" nodes, this returns the *last* character of the closing | |
228 | 238 | identifier. |
229 | 239 | |
230 | getNodeOuterEndingPosition | |
231 | Same as getNodeEndingPosition, but for GROUP, ENVIRONMENT, or | |
232 | COMMAND nodes, this returns the *last* character of the closing | |
233 | identifier. | |
234 | ||
235 | getNodeMathFlag | |
240 | getNodeMathFlag | |
236 | 241 | This applies to any node type. It is 1 if the node sets, or is |
237 | contained within, a math mode region. 0 otherwise. TEXT nodes which | |
238 | have this flag as 1 can be assumed to be the actual mathematics | |
239 | contained in the document. | |
240 | ||
241 | getNodePlainTextFlag | |
242 | This applies only to TEXT nodes. It is 1 if the node is non-math and | |
243 | is visible (in other words, will end up being a part of the output | |
244 | document). One would only want to index TEXT nodes with this | |
245 | property, for information retrieval purposes. | |
246 | ||
247 | getEnvironmentClass | |
248 | This applies only to ENVIRONMENT nodes. Returns what class of | |
249 | environment the node represents (the X in \begin{X} and \end{X}). | |
250 | ||
251 | getCommandName | |
252 | This applies only to COMMAND nodes. Returns the name of the command | |
253 | (the X in \X{...}). | |
254 | ||
255 | getChildTree | |
256 | This applies only to COMMAND, ENVIRONMENT, and GROUP nodes: it | |
257 | returns the LaTeX::TOM::Tree which is ``under'' the calling node. | |
258 | ||
259 | getFirstChild | |
260 | This applies only to COMMAND, ENVIRONMENT, and GROUP nodes: it | |
242 | contained within, a math mode region. 0 otherwise. "TEXT" nodes | |
243 | which have this flag as 1 can be assumed to be the actual | |
244 | mathematics contained in the document. | |
245 | ||
246 | getNodePlainTextFlag | |
247 | This applies only to "TEXT" nodes. It is 1 if the node is non-math | |
248 | and is visible (in other words, will end up being a part of the | |
249 | output document). One would only want to index "TEXT" nodes with | |
250 | this property, for information retrieval purposes. | |
251 | ||
252 | getEnvironmentClass | |
253 | This applies only to "ENVIRONMENT" nodes. Returns what class of | |
254 | environment the node represents (the "X" in "\begin{X}" and | |
255 | "\end{X}"). | |
256 | ||
257 | getCommandName | |
258 | This applies only to "COMMAND" nodes. Returns the name of the | |
259 | command (the "X" in "\X{...}"). | |
260 | ||
261 | getChildTree | |
262 | This applies only to "COMMAND", "ENVIRONMENT", and "GROUP" nodes: it | |
263 | returns the "LaTeX::TOM::Tree" which is ``under'' the calling node. | |
264 | ||
265 | getFirstChild | |
266 | This applies only to "COMMAND", "ENVIRONMENT", and "GROUP" nodes: it | |
261 | 267 | returns the first node from the first level of the child subtree. |
262 | 268 | |
263 | getLastChild | |
269 | getLastChild | |
264 | 270 | Same as above, but for the last node of the first level. |
265 | 271 | |
266 | getPreviousSibling | |
272 | getPreviousSibling | |
267 | 273 | Return the prior node on the same level of the tree. |
268 | 274 | |
269 | getNextSibling | |
275 | getNextSibling | |
270 | 276 | Same as above, but for following node. |
271 | 277 | |
272 | getParent | |
278 | getParent | |
273 | 279 | Get the parent node of this node in the tree. |
274 | 280 | |
275 | getNextGroupNode | |
281 | getNextGroupNode | |
276 | 282 | This is an interesting function, and kind of a hack because of the |
277 | 283 | way the parser makes the current tree. Basically it will give you |
278 | the next sibling that is a GROUP node, until it either hits the end | |
279 | of the tree level, a TEXT node which doesn't match /^\s*$/, or a | |
280 | COMMAND node. | |
281 | ||
282 | This is useful for finding all GROUPed parameters after a COMMAND | |
283 | node (see comments for 'GROUP' in the 'COMPONENTS' / | |
284 | 'LaTeX::TOM::Node' section). You can just have a while loop that | |
285 | calls this method until it gets 'undef', and you'll know you've | |
284 | the next sibling that is a "GROUP" node, until it either hits the | |
285 | end of the tree level, a "TEXT" node which doesn't match "/^\s*$/", | |
286 | or a "COMMAND" node. | |
287 | ||
288 | This is useful for finding all "GROUP"ed parameters after a | |
289 | "COMMAND" node (see comments for "GROUP" in the "COMPONENTS" / | |
290 | "LaTeX::TOM::Node" section). You can just have a while loop that | |
291 | calls this method until it gets "undef", and you'll know you've | |
286 | 292 | found all the parameters to a command. |
287 | 293 | |
288 | Note: this may be bad, but TEXT Nodes matching /^\s*\[[0-9]+\]$/ | |
294 | Note: this may be bad, but "TEXT" Nodes matching "/^\s*\[[0-9]+\]$/" | |
289 | 295 | (optional parameter groups) are treated as if they were 'blank'. |
290 | 296 | |
291 | 297 | CAVEATS |
292 | 298 | Due to the lack of tree-modification methods, currently this module is |
293 | 299 | mostly useful for minor modifications to the parsed document, for |
294 | instance, altering the text of TEXT nodes but not deleting the nodes. Of | |
295 | course, the user can still do this by breaking abstraction and directly | |
296 | modifying the Tree. | |
300 | instance, altering the text of "TEXT" nodes but not deleting the nodes. | |
301 | Of course, the user can still do this by breaking abstraction and | |
302 | directly modifying the Tree. | |
297 | 303 | |
298 | 304 | Also note that the parsing is not complete. This module was not written |
299 | 305 | with the intention of being able to produce output documents the way |
309 | 315 | of ~1000 research publications from the Computing Research Repository, |
310 | 316 | so I deemed it ``good enough'' to use for purposes similar to mine. |
311 | 317 | |
312 | Please let me know of parser errors if you discover any. | |
313 | ||
314 | AUTHOR | |
318 | Please let the authors know of parser errors if you discover any. | |
319 | ||
320 | AUTHORS | |
315 | 321 | Written by Aaron Krowne <akrowne@vt.edu> |
316 | 322 | |
317 | 323 | Maintained by Steven Schubiger <schubiger@cpan.org> |
1 | 1 | # |
2 | 2 | # LaTeX::TOM (TeX Object Model) |
3 | 3 | # |
4 | # Version 0.6 | |
4 | # Version 0.8 | |
5 | 5 | # |
6 | 6 | # ---------------------------------------------------------------------------- |
7 | 7 | # |
31 | 31 | |
32 | 32 | use base qw(LaTeX::TOM::Parser); |
33 | 33 | |
34 | our $VERSION = '0.6'; | |
34 | our $VERSION = '0.8'; | |
35 | 35 | |
36 | 36 | # BEGIN CONFIG SECTION ######################################################## |
37 | 37 | |
206 | 206 | |
207 | 207 | =head1 SYNOPSIS |
208 | 208 | |
209 | use LaTeX::TOM; | |
210 | ||
211 | my $parser = LaTeX::TOM->new; | |
212 | ||
213 | my $document = $parser->parseFile('mypaper.tex'); | |
214 | ||
215 | my $latex = $document->toLaTeX; | |
216 | ||
217 | my $specialnodes = $document->getNodesByCondition( | |
218 | '$node->getNodeType eq \'TEXT\' && | |
219 | $node->getNodeText =~ /magic string/'); | |
220 | ||
221 | my $sections = $document->getNodesByCondition( | |
222 | '$node->getNodeType eq \'COMMAND\' && | |
223 | $node->getCommandName =~ /section$/'); | |
224 | ||
225 | my $indexme = $document->getIndexableText; | |
226 | ||
227 | $document->print; | |
209 | use LaTeX::TOM; | |
210 | ||
211 | $parser = LaTeX::TOM->new; | |
212 | ||
213 | $document = $parser->parseFile('mypaper.tex'); | |
214 | ||
215 | $latex = $document->toLaTeX; | |
216 | ||
217 | $specialnodes = $document->getNodesByCondition( | |
218 | '$node->getNodeType eq \'TEXT\' && | |
219 | $node->getNodeText =~ /magic string/' | |
220 | ); | |
221 | ||
222 | $sections = $document->getNodesByCondition( | |
223 | '$node->getNodeType eq \'COMMAND\' && | |
224 | $node->getCommandName =~ /section$/' | |
225 | ); | |
226 | ||
227 | $indexme = $document->getIndexableText; | |
228 | ||
229 | $document->print; | |
228 | 230 | |
229 | 231 | =head1 DESCRIPTION |
230 | 232 | |
231 | 233 | This module provides a parser which parses and interprets (though not fully) |
232 | 234 | LaTeX documents and returns a tree-based representation of what it finds. |
233 | This tree is a LaTeX::TOM::Tree. The tree contains LaTeX::TOM:Node nodes. | |
235 | This tree is a C<LaTeX::TOM::Tree>. The tree contains C<LaTeX::TOM::Node> nodes. | |
234 | 236 | |
235 | 237 | This module should be especially useful to anyone who wants to do processing |
236 | 238 | of LaTeX documents that requires extraction of plain-text information, or |
247 | 249 | |
248 | 250 | =item parse error handling (= B<0> || 1 || 2) |
249 | 251 | |
250 | Determines what happens when a parse error is encountered. 0 results in a | |
251 | warning. 1 results in a die. 2 results in silence. Note that particular | |
252 | Determines what happens when a parse error is encountered. C<0> results in a | |
253 | warning. C<1> results in a die. C<2> results in silence. Note that particular | |
252 | 254 | groupings in LaTeX (i.e. newcommands and the like) contain invalid TeX or |
253 | LaTeX, so you nearly always need this parameter to be 0 or 2 to completely | |
255 | LaTeX, so you nearly always need this parameter to be C<0> or C<2> to completely | |
254 | 256 | parse the document. |
255 | 257 | |
256 | 258 | =item read inputs flag (= 0 || B<1>) |
257 | 259 | |
258 | This flag determines whether a scan for \input and \input-like commands is | |
260 | This flag determines whether a scan for C<\input> and C<\input-like> commands is | |
259 | 261 | performed, and the resulting called files parsed and added to the parent |
260 | parse tree. 0 means no, 1 means do it. Note that this will happen recursively | |
261 | if it is turned on. Also, bibliographies (.bbl files) are detected and | |
262 | parse tree. C<0> means no, C<1> means do it. Note that this will happen recursively | |
263 | if it is turned on. Also, bibliographies (F<.bbl> files) are detected and | |
262 | 264 | included. |
263 | 265 | |
264 | 266 | =item apply mappings flag (= 0 || B<1>) |
265 | 267 | |
266 | 268 | This flag determines whether (most) user-defined mappings are applied. This |
267 | means \defs, \newcommands, and \newenvironments. This is critical for properly | |
268 | analyzing the content of the document, as this must be phrased in terms of the | |
269 | semantics of the original TeX and LaTeX commands, not ad hoc user macros. So, | |
270 | for instance, do not expect plain-text extraction to work properly with this | |
269 | means C<\defs>, C<\newcommands>, and C<\newenvironments>. This is critical for | |
270 | properly analyzing the content of the document, as this must be phrased in terms | |
271 | of the semantics of the original TeX and LaTeX commands, not ad hoc user macros. | |
272 | So, for instance, do not expect plain-text extraction to work properly with this | |
271 | 273 | option off. |
272 | 274 | |
273 | 275 | =back |
274 | 276 | |
275 | The parser returns a LaTeX::TOM::Tree ($document in the SYNOPSIS). | |
277 | The parser returns a C<LaTeX::TOM::Tree> ($document in the SYNOPSIS). | |
276 | 278 | |
277 | 279 | =head2 LaTeX::TOM::Node |
278 | 280 | |
282 | 284 | |
283 | 285 | =item TEXT |
284 | 286 | |
285 | TEXT nodes can be thought of as representing the plain-text portions of the | |
287 | C<TEXT> nodes can be thought of as representing the plain-text portions of the | |
286 | 288 | LaTeX document. This includes math and anything else that is not a recognized |
287 | TeX or LaTeX command, or user-defined command. In reality, TEXT nodes contain | |
289 | TeX or LaTeX command, or user-defined command. In reality, C<TEXT> nodes contain | |
288 | 290 | commands that this parser does not yet recognize the semantics of. |
289 | 291 | |
290 | 292 | =item COMMAND |
291 | 293 | |
292 | A COMMAND node represents a TeX command. It always has child nodes in a tree, | |
294 | A C<COMMAND> node represents a TeX command. It always has child nodes in a tree, | |
293 | 295 | though the tree might be empty if the command operates on zero parameters. An |
294 | 296 | example of a command is |
295 | 297 | |
296 | \textbf{blah} | |
297 | ||
298 | This would parse into a COMMAND node for I<textbf>, which would have a subtree | |
299 | containing the TEXT node with text ``blah.'' | |
298 | \textbf{blah} | |
299 | ||
300 | This would parse into a C<COMMAND> node for C<textbf>, which would have a subtree | |
301 | containing the C<TEXT> node with text ``blah.'' | |
300 | 302 | |
301 | 303 | =item ENVIRONMENT |
302 | 304 | |
303 | Similarly, TeX environments parse into ENVIRONMENT nodes, which have metadata | |
305 | Similarly, TeX environments parse into C<ENVIRONMENT> nodes, which have metadata | |
304 | 306 | about the environment, along with a subtree representing what is contained in |
305 | 307 | the environment. For example, |
306 | 308 | |
307 | \begin{equation} | |
308 | r = \frac{-b \pm \sqrt{b^2 - 4ac}}{2a} | |
309 | \end{equation} | |
310 | ||
311 | Would parse into an ENVIRONMENT node of the class ``equation'' with a child | |
312 | tree containing the result of parsing ``r = \frac{-b \pm \sqrt{b^2 - 4ac}}{2a}.'' | |
309 | \begin{equation} | |
310 | r = \frac{-b \pm \sqrt{b^2 - 4ac}}{2a} | |
311 | \end{equation} | |
312 | ||
313 | Would parse into an C<ENVIRONMENT> node of the class ``equation'' with a child | |
314 | tree containing the result of parsing C<``r = \frac{-b \pm \sqrt{b^2 - 4ac}}{2a}.''> | |
313 | 315 | |
314 | 316 | =item GROUP |
315 | 317 | |
316 | A GROUP is like an anonymous COMMAND. Since you can put whatever you want in | |
317 | curly-braces ({}) in TeX in order to make semantically isolated regions, this | |
318 | separation is preserved by the parser. A GROUP is just the subtree of the | |
318 | A C<GROUP> is like an anonymous C<COMMAND>. Since you can put whatever you want in | |
319 | curly-braces (C<{}>) in TeX in order to make semantically isolated regions, this | |
320 | separation is preserved by the parser. A C<GROUP> is just the subtree of the | |
319 | 321 | parsed contents of plain curly-braces. |
320 | 322 | |
321 | It is important to note that currently only the first GROUP in a series of | |
322 | GROUPs following a LaTeX command will actually be parsed into a COMMAND node. | |
323 | It is important to note that currently only the first C<GROUP> in a series of | |
324 | C<GROUP>s following a LaTeX command will actually be parsed into a C<COMMAND> node. | |
323 | 325 | The reason is that, for the initial purposes of this module, it was not |
324 | necessary to recognize additional GROUPs as additional parameters to the | |
325 | COMMAND. However, this is something that this module really should do | |
326 | necessary to recognize additional C<GROUP>s as additional parameters to the | |
327 | C<COMMAND>. However, this is something that this module really should do | |
326 | 328 | eventually. Currently if you want all the parameters to a multi-parametered |
327 | command, you'll need to pick out all the following GROUP nodes yourself. | |
329 | command, you'll need to pick out all the following C<GROUP> nodes yourself. | |
328 | 330 | |
329 | 331 | Eventually this will become something like a list which is stored in the |
330 | COMMAND node, much like XML::DOM's treatment of attributes. These are, in a | |
331 | sense, apart from the rest of the document tree. Then GROUP nodes will become | |
332 | C<COMMAND> node, much like L<XML::DOM>'s treatment of attributes. These are, in a | |
333 | sense, apart from the rest of the document tree. Then C<GROUP> nodes will become | |
332 | 334 | much more rare. |
333 | 335 | |
334 | 336 | =item COMMENT |
335 | 337 | |
336 | A COMMENT node is very similar to a TEXT node, except it is specifically for | |
337 | lines beginning with ``%'' (the TeX comment delimeter) or the right-hand | |
338 | portion of a line that has ``%'' at some internal point. | |
338 | A C<COMMENT> node is very similar to a C<TEXT> node, except it is specifically for | |
339 | lines beginning with C<``%''> (the TeX comment delimiter) or the right-hand | |
340 | portion of a line that has C<``%''> at some internal point. | |
339 | 341 | |
340 | 342 | =back |
341 | 343 | |
353 | 355 | |
354 | 356 | =head2 LaTeX::TOM |
355 | 357 | |
356 | =over 4 | |
357 | ||
358 | =item new | |
358 | =head3 new | |
359 | ||
360 | =over 4 | |
361 | ||
362 | =item C<> | |
359 | 363 | |
360 | 364 | Instantiate a new parser object. |
361 | 365 | |
368 | 372 | |
369 | 373 | The methods for the parser (aside from the constructor, discussed above) are: | |
370 | 374 | |
371 | =over 4 | |
372 | ||
373 | =item parseFile (filename) | |
374 | ||
375 | Read in the contents of I<filename> and parse them, returning a LaTeX::TOM:Tree. | |
376 | ||
377 | =item parse (string) | |
378 | ||
379 | Parse the string I<string> and return a LaTeX::TOM::Tree. | |
375 | =head3 parseFile (filename) | |
376 | ||
377 | =over 4 | |
378 | ||
379 | =item C<> | |
380 | ||
381 | Read in the contents of I<filename> and parse them, returning a C<LaTeX::TOM::Tree>. | |
382 | ||
383 | =back | |
384 | ||
385 | =head3 parse (string) | |
386 | ||
387 | =over 4 | |
388 | ||
389 | =item C<> | |
390 | ||
391 | Parse the string I<string> and return a C<LaTeX::TOM::Tree>. | |
380 | 392 | |
381 | 393 | =back |
382 | 394 | |
384 | 396 | |
385 | 397 | This section contains methods for the Trees returned by the parser. |
386 | 398 | |
387 | =over 4 | |
388 | ||
389 | =item copy | |
399 | =head3 copy | |
400 | ||
401 | =over 4 | |
402 | ||
403 | =item C<> | |
390 | 404 | |
391 | 405 | Duplicate a tree into new memory. |
392 | 406 | |
393 | =item print | |
407 | =back | |
408 | ||
409 | =head3 print | |
410 | ||
411 | =over 4 | |
412 | ||
413 | =item C<> | |
394 | 414 | |
395 | 415 | A debug print of the structure of the tree. |
396 | 416 | |
397 | =item plainText | |
417 | =back | |
418 | ||
419 | =head3 plainText | |
420 | ||
421 | =over 4 | |
422 | ||
423 | =item C<> | |
398 | 424 | |
399 | 425 | Returns an arrayref which is a list of strings representing the text of all |
400 | getNodePlainTextFlag = 1 TEXT nodes, in an inorder traversal. | |
401 | ||
402 | =item indexableText | |
426 | C<getNodePlainTextFlag = 1> C<TEXT> nodes, in an inorder traversal. | |
427 | ||
428 | =back | |
429 | ||
430 | =head3 indexableText | |
431 | ||
432 | =over 4 | |
433 | ||
434 | =item C<> | |
403 | 435 | |
404 | 436 | A method like the above but which goes one step further; it cleans all of the |
405 | 437 | returned text and concatenates it into a single string which one could consider |
406 | 438 | having all of the standard information retrieval value for the document, |
407 | 439 | making it useful for indexing. |
408 | 440 | |
409 | =item toLaTeX | |
441 | =back | |
442 | ||
443 | =head3 toLaTeX | |
444 | ||
445 | =over 4 | |
446 | ||
447 | =item C<> | |
410 | 448 | |
411 | 449 | Return a string representing the LaTeX encoded by the tree. This is especially |
412 | 450 | useful to get a normal document again, after modifying nodes of the tree. |
413 | 451 | |
414 | =item getTopLevelNodes | |
415 | ||
416 | Return an arrayref which is a list of LaTeX::TOM::Nodes at the top level of | |
452 | =back | |
453 | ||
454 | =head3 getTopLevelNodes | |
455 | ||
456 | =over 4 | |
457 | ||
458 | =item C<> | |
459 | ||
460 | Return an arrayref which is a list of C<LaTeX::TOM::Nodes> at the top level of | |
417 | 461 | the Tree. |
418 | 462 | |
419 | =item getAllNodes | |
463 | =back | |
464 | ||
465 | =head3 getAllNodes | |
466 | ||
467 | =over 4 | |
468 | ||
469 | =item C<> | |
420 | 470 | |
421 | 471 | Return an arrayref with B<all> nodes of the tree. This "flattens" the tree. |
422 | 472 | |
423 | =item getCommandNodesByName (name) | |
424 | ||
425 | Return an arrayref with all COMMAND nodes in the tree which have a name | |
473 | =back | |
474 | ||
475 | =head3 getCommandNodesByName (name) | |
476 | ||
477 | =over 4 | |
478 | ||
479 | =item C<> | |
480 | ||
481 | Return an arrayref with all C<COMMAND> nodes in the tree which have a name | |
426 | 482 | matching I<name>. |
427 | 483 | |
428 | =item getEnvironmentsByName (name) | |
429 | ||
430 | Return an arrayref with all ENVIRONMENT nodes in the tree which have a class | |
484 | =back | |
485 | ||
486 | =head3 getEnvironmentsByName (name) | |
487 | ||
488 | =over 4 | |
489 | ||
490 | =item C<> | |
491 | ||
492 | Return an arrayref with all C<ENVIRONMENT> nodes in the tree which have a class | |
431 | 493 | matching I<name>. |
432 | 494 | |
433 | =item getNodesByCondition (expression) | |
495 | =back | |
496 | ||
497 | =head3 getNodesByCondition (expression) | |
498 | ||
499 | =over 4 | |
500 | ||
501 | =item C<> | |
434 | 502 | |
435 | 503 | This is a catch-all search method which can be used to pull out nodes that |
436 | 504 | match pretty much any perl expression, without manually having to traverse the |
444 | 512 | |
445 | 513 | This section contains the methods for nodes of the parsed Trees. |
446 | 514 | |
447 | =over 4 | |
448 | ||
449 | =item getNodeType | |
450 | ||
451 | Returns the type, one of 'TEXT', 'COMMAND', 'ENVIRONMENT', 'GROUP', or 'COMMENT', | |
515 | =head3 getNodeType | |
516 | ||
517 | =over 4 | |
518 | ||
519 | =item C<> | |
520 | ||
521 | Returns the type, one of C<TEXT>, C<COMMAND>, C<ENVIRONMENT>, C<GROUP>, or C<COMMENT>, | |
452 | 522 | as described above. |
453 | 523 | |
454 | =item getNodeText | |
455 | ||
456 | Applicable for TEXT or COMMENT nodes; this returns the document text they contain. | |
524 | =back | |
525 | ||
526 | =head3 getNodeText | |
527 | ||
528 | =over 4 | |
529 | ||
530 | =item C<> | |
531 | ||
532 | Applicable for C<TEXT> or C<COMMENT> nodes; this returns the document text they contain. | |
457 | 533 | This is undef for other node types. |
458 | 534 | |
459 | =item setNodeText | |
460 | ||
461 | Set the node text, also for TEXT and COMMENT nodes. | |
462 | ||
463 | =item getNodeStartingPosition | |
464 | ||
465 | Get the starting character position in the document of this node. For TEXT | |
466 | and COMMENT nodes, this will be where the text begins. For ENVIRONMENT, | |
467 | COMMAND, or GROUP nodes, this will be the position of the I<last> character of | |
535 | =back | |
536 | ||
537 | =head3 setNodeText | |
538 | ||
539 | =over 4 | |
540 | ||
541 | =item C<> | |
542 | ||
543 | Set the node text, also for C<TEXT> and C<COMMENT> nodes. | |
544 | ||
545 | =back | |
546 | ||
547 | =head3 getNodeStartingPosition | |
548 | ||
549 | =over 4 | |
550 | ||
551 | =item C<> | |
552 | ||
553 | Get the starting character position in the document of this node. For C<TEXT> | |
554 | and C<COMMENT> nodes, this will be where the text begins. For C<ENVIRONMENT>, | |
555 | C<COMMAND>, or C<GROUP> nodes, this will be the position of the I<last> character of | |
468 | 556 | the opening identifier. |
469 | 557 | |
470 | =item getNodeEndingPosition | |
471 | ||
472 | Same as above, but for last character. For GROUP, ENVIRONMENT, or COMMAND | |
558 | =back | |
559 | ||
560 | =head3 getNodeEndingPosition | |
561 | ||
562 | =over 4 | |
563 | ||
564 | =item C<> | |
565 | ||
566 | Same as above, but for last character. For C<GROUP>, C<ENVIRONMENT>, or C<COMMAND> | |
473 | 567 | nodes, this will be the I<first> character of the closing identifier. |
474 | 568 | |
475 | =item getNodeOuterStartingPosition | |
476 | ||
477 | Same as getNodeStartingPosition, but for GROUP, ENVIRONMENT, or COMMAND nodes, | |
569 | =back | |
570 | ||
571 | =head3 getNodeOuterStartingPosition | |
572 | ||
573 | =over 4 | |
574 | ||
575 | =item C<> | |
576 | ||
577 | Same as getNodeStartingPosition, but for C<GROUP>, C<ENVIRONMENT>, or C<COMMAND> nodes, | |
478 | 578 | this returns the I<first> character of the opening identifier. |
479 | 579 | |
480 | =item getNodeOuterEndingPosition | |
481 | ||
482 | Same as getNodeEndingPosition, but for GROUP, ENVIRONMENT, or COMMAND nodes, | |
580 | =back | |
581 | ||
582 | =head3 getNodeOuterEndingPosition | |
583 | ||
584 | =over 4 | |
585 | ||
586 | =item C<> | |
587 | ||
588 | Same as getNodeEndingPosition, but for C<GROUP>, C<ENVIRONMENT>, or C<COMMAND> nodes, | |
483 | 589 | this returns the I<last> character of the closing identifier. |
484 | 590 | |
485 | =item getNodeMathFlag | |
486 | ||
487 | This applies to any node type. It is 1 if the node sets, or is contained | |
488 | within, a math mode region. 0 otherwise. TEXT nodes which have this flag as 1 | |
591 | =back | |
592 | ||
593 | =head3 getNodeMathFlag | |
594 | ||
595 | =over 4 | |
596 | ||
597 | =item C<> | |
598 | ||
599 | This applies to any node type. It is C<1> if the node sets, or is contained | |
600 | within, a math mode region. C<0> otherwise. C<TEXT> nodes which have this flag as C<1> | |
489 | 601 | can be assumed to be the actual mathematics contained in the document. |
490 | 602 | |
491 | =item getNodePlainTextFlag | |
492 | ||
493 | This applies only to TEXT nodes. It is 1 if the node is non-math B<and> is | |
603 | =back | |
604 | ||
605 | =head3 getNodePlainTextFlag | |
606 | ||
607 | =over 4 | |
608 | ||
609 | =item C<> | |
610 | ||
611 | This applies only to C<TEXT> nodes. It is C<1> if the node is non-math B<and> is | |
494 | 612 | visible (in other words, will end up being a part of the output document). One |
495 | would only want to index TEXT nodes with this property, for information | |
613 | would only want to index C<TEXT> nodes with this property, for information | |
496 | 614 | retrieval purposes. |
497 | 615 | |
498 | =item getEnvironmentClass | |
499 | ||
500 | This applies only to ENVIRONMENT nodes. Returns what class of environment the | |
501 | node represents (the X in \begin{X} and \end{X}). | |
502 | ||
503 | =item getCommandName | |
504 | ||
505 | This applies only to COMMAND nodes. Returns the name of the command (the X in | |
506 | \X{...}). | |
507 | ||
508 | =item getChildTree | |
509 | ||
510 | This applies only to COMMAND, ENVIRONMENT, and GROUP nodes: it returns the | |
511 | LaTeX::TOM::Tree which is ``under'' the calling node. | |
512 | ||
513 | =item getFirstChild | |
514 | ||
515 | This applies only to COMMAND, ENVIRONMENT, and GROUP nodes: it returns the | |
616 | =back | |
617 | ||
618 | =head3 getEnvironmentClass | |
619 | ||
620 | =over 4 | |
621 | ||
622 | =item C<> | |
623 | ||
624 | This applies only to C<ENVIRONMENT> nodes. Returns what class of environment the | |
625 | node represents (the C<X> in C<\begin{X}> and C<\end{X}>). | |
626 | ||
627 | =back | |
628 | ||
629 | =head3 getCommandName | |
630 | ||
631 | =over 4 | |
632 | ||
633 | =item C<> | |
634 | ||
635 | This applies only to C<COMMAND> nodes. Returns the name of the command (the C<X> in | |
636 | C<\X{...}>). | |
637 | ||
638 | =back | |
639 | ||
640 | =head3 getChildTree | |
641 | ||
642 | =over 4 | |
643 | ||
644 | =item C<> | |
645 | ||
646 | This applies only to C<COMMAND>, C<ENVIRONMENT>, and C<GROUP> nodes: it returns the | |
647 | C<LaTeX::TOM::Tree> which is ``under'' the calling node. | |
648 | ||
649 | =back | |
650 | ||
651 | =head3 getFirstChild | |
652 | ||
653 | =over 4 | |
654 | ||
655 | =item C<> | |
656 | ||
657 | This applies only to C<COMMAND>, C<ENVIRONMENT>, and C<GROUP> nodes: it returns the | |
516 | 658 | first node from the first level of the child subtree. |
517 | 659 | |
518 | =item getLastChild | |
660 | =back | |
661 | ||
662 | =head3 getLastChild | |
663 | ||
664 | =over 4 | |
665 | ||
666 | =item C<> | |
519 | 667 | |
520 | 668 | Same as above, but for the last node of the first level. |
521 | 669 | |
522 | =item getPreviousSibling | |
670 | =back | |
671 | ||
672 | =head3 getPreviousSibling | |
673 | ||
674 | =over 4 | |
675 | ||
676 | =item C<> | |
523 | 677 | |
524 | 678 | Return the prior node on the same level of the tree. |
525 | 679 | |
526 | =item getNextSibling | |
680 | =back | |
681 | ||
682 | =head3 getNextSibling | |
683 | ||
684 | =over 4 | |
685 | ||
686 | =item C<> | |
527 | 687 | |
528 | 688 | Same as above, but for following node. |
529 | 689 | |
530 | =item getParent | |
690 | =back | |
691 | ||
692 | =head3 getParent | |
693 | ||
694 | =over 4 | |
695 | ||
696 | =item C<> | |
531 | 697 | |
532 | 698 | Get the parent node of this node in the tree. |
533 | 699 | |
534 | =item getNextGroupNode | |
700 | =back | |
701 | ||
702 | =head3 getNextGroupNode | |
703 | ||
704 | =over 4 | |
705 | ||
706 | =item C<> | |
535 | 707 | |
536 | 708 | This is an interesting function, and kind of a hack because of the way the |
537 | 709 | parser makes the current tree. Basically it will give you the next sibling |
538 | that is a GROUP node, until it either hits the end of the tree level, a TEXT | |
539 | node which doesn't match /^\s*$/, or a COMMAND node. | |
540 | ||
541 | This is useful for finding all GROUPed parameters after a COMMAND node (see | |
542 | comments for 'GROUP' in the 'COMPONENTS' / 'LaTeX::TOM::Node' section). You | |
543 | can just have a while loop that calls this method until it gets 'undef', and | |
710 | that is a C<GROUP> node, until it either hits the end of the tree level, a C<TEXT> | |
711 | node which doesn't match C</^\s*$/>, or a C<COMMAND> node. | |
712 | ||
713 | This is useful for finding all C<GROUP>ed parameters after a C<COMMAND> node (see | |
714 | comments for C<GROUP> in the C<COMPONENTS> / C<LaTeX::TOM::Node> section). You | |
715 | can just have a while loop that calls this method until it gets C<undef>, and | |
544 | 716 | you'll know you've found all the parameters to a command. |
545 | 717 | |
546 | Note: this may be bad, but TEXT Nodes matching /^\s*\[[0-9]+\]$/ (optional | |
718 | Note: this may be bad, but C<TEXT> Nodes matching C</^\s*\[[0-9]+\]$/> (optional | |
547 | 719 | parameter groups) are treated as if they were 'blank'. |
548 | 720 | |
549 | 721 | =back |
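
The C<getNextGroupNode> description above suggests a while loop for collecting every parameter of a multi-parameter command. A minimal sketch of that loop follows, using only methods named in this documentation. The input string and the C<setcounter> command are illustrative assumptions, and the exact text each node yields depends on how the parser tokenizes the input, so no output is shown.

```perl
use strict;
use warnings;
use LaTeX::TOM;

# Illustrative input (an assumption): a command followed by two brace groups.
my $parser = LaTeX::TOM->new;
my $tree   = $parser->parse('\setcounter{page}{3}');

for my $command (@{ $tree->getCommandNodesByName('setcounter') }) {
    # Per the GROUP notes, only the first brace group becomes part of the
    # COMMAND node itself; read it through the command's child tree.
    my @params = ($command->getChildTree->toLaTeX);

    # Remaining parameters are sibling GROUP nodes. Keep asking for the
    # next GROUP node until the method returns undef.
    my $node = $command;
    while (defined($node = $node->getNextGroupNode)) {
        push @params, $node->getChildTree->toLaTeX;
    }

    print join(' | ', @params), "\n";
}
```

Calling C<getNextGroupNode> on the most recently found node, rather than on the command each time, is what advances the walk along the sibling level.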
552 | 724 | |
553 | 725 | Due to the lack of tree-modification methods, currently this module is |
554 | 726 | mostly useful for minor modifications to the parsed document, for instance, |
555 | altering the text of TEXT nodes but not deleting the nodes. Of course, the | |
727 | altering the text of C<TEXT> nodes but not deleting the nodes. Of course, the | |
556 | 728 | user can still do this by breaking abstraction and directly modifying the Tree. |
557 | 729 | |
558 | 730 | Also note that the parsing is not complete. This module was not written with |
561 | 733 | logical level with regards to the content; it doesn't care about the document |
562 | 734 | formatting and outputting side of TeX/LaTeX. |
563 | 735 | |
564 | There is much work still to be done. See the TODO list in the TOM.pm source. | |
736 | There is much work still to be done. See the F<TODO> list in the F<TOM.pm> source. | |
565 | 737 | |
566 | 738 | =head1 BUGS |
567 | 739 | |
569 | 741 | ~1000 research publications from the Computing Research Repository, so I |
570 | 742 | deemed it ``good enough'' to use for purposes similar to mine. |
571 | 743 | |
572 | Please let me know of parser errors if you discover any. | |
573 | ||
574 | =head1 AUTHOR | |
744 | Please let the authors know of parser errors if you discover any. | |
745 | ||
746 | =head1 AUTHORS | |
575 | 747 | |
576 | 748 | Written by Aaron Krowne <akrowne@vt.edu> |
577 | 749 |