Commit 4b962aceb61794b3ac46b6f80a9010b4b4c0217f - liblatex-tom-perl

Import Upstream version 0.8

gregor herrmann, 6 years ago

6 changed file(s) with 524 addition(s) and 331 deletion(s).

-0

Changes

0	0	Revision history for Perl extension LaTeX::TOM.
	1
	2	0.8 Mon Oct 8 10:23:01 CEST 2007
	3
	4	- Fixed failing tests pod.t & pod-coverage.t (adjusted plans).
	5
	6	0.7 Tue Aug 28 00:12:03 CEST 2007
	7
	8	- Added formatting tags to the documentation where appropriate
	9	and enlisted all methods within the documentation index.
1	10
2	11	0.6 Wed Mar 14 01:05:09 CET 2007
3	12

-9

META.yml

0	0	---
1	1	name: LaTeX-TOM
2		version: 0.6
	2	version: 0.8
3	3	author:
4	4	- 'Aaron Krowne <akrowne@vt.edu.org>'
5		abstract: 'A module for parsing, analyzing, and manipulating LaTeX documents.'
	5	abstract: A module for parsing, analyzing, and manipulating LaTeX documents.
6	6	license: perl
7		resources:
8		license: http://dev.perl.org/licenses/
9	7	build_requires:
10	8	Test::More: 0
	9	generated_by: Module::Build version 0.2808
	10	meta-spec:
	11	url: http://module-build.sourceforge.net/META-spec-v1.2.html
	12	version: 1.2
11	13	provides:
12	14	LaTeX::TOM:
13	15	file: lib/LaTeX/TOM.pm
14		version: 0.6
	16	version: 0.8
15	17	LaTeX::TOM::Node:
16	18	file: lib/LaTeX/TOM/Node.pm
17	19	LaTeX::TOM::Parser:
18	20	file: lib/LaTeX/TOM/Parser.pm
19	21	LaTeX::TOM::Tree:
20	22	file: lib/LaTeX/TOM/Tree.pm
21		generated_by: Module::Build version 0.2805
22		meta-spec:
23		url: http://module-build.sourceforge.net/META-spec-v1.2.html
24		version: 1.2
	23	resources:
	24	license: http://dev.perl.org/licenses/

+165

-159

README

2	2	documents.
3	3
4	4	SYNOPSIS
5		use LaTeX::TOM;
6
7		my $parser = LaTeX::TOM->new;
8
9		my $document = $parser->parseFile('mypaper.tex');
10
11		my $latex = $document->toLaTeX;
12
13		my $specialnodes = $document->getNodesByCondition(
14		'$node->getNodeType eq \'TEXT\' &&
15		$node->getNodeText =~ /magic string/');
16
17		my $sections = $document->getNodesByCondition(
18		'$node->getNodeType eq \'COMMAND\' &&
19		$node->getCommandName =~ /section$/');
20
21		my $indexme = $document->getIndexableText;
22
23		$document->print;
	5	use LaTeX::TOM;
	6
	7	$parser = LaTeX::TOM->new;
	8
	9	$document = $parser->parseFile('mypaper.tex');
	10
	11	$latex = $document->toLaTeX;
	12
	13	$specialnodes = $document->getNodesByCondition(
	14	'$node->getNodeType eq \'TEXT\' &&
	15	$node->getNodeText =~ /magic string/'
	16	);
	17
	18	$sections = $document->getNodesByCondition(
	19	'$node->getNodeType eq \'COMMAND\' &&
	20	$node->getCommandName =~ /section$/'
	21	);
	22
	23	$indexme = $document->getIndexableText;
	24
	25	$document->print;
24	26
25	27	DESCRIPTION
26	28	This module provides a parser which parses and interprets (though not
27	29	fully) LaTeX documents and returns a tree-based representation of what
28		it finds. This tree is a LaTeX::TOM::Tree. The tree contains
29		LaTeX::TOM:Node nodes.
	30	it finds. This tree is a "LaTeX::TOM::Tree". The tree contains
	31	"LaTeX::TOM::Node" nodes.
30	32
31	33	This module should be especially useful to anyone who wants to do
32	34	processing of LaTeX documents that requires extraction of plain-text

46	48	parameter to be 0 or 2 to completely parse the document.
47	49
48	50	read inputs flag (= 0 \|\| 1)
49		This flag determines whether a scan for \input and \input-like
	51	This flag determines whether a scan for "\input" and "\input-like"
50	52	commands is performed, and the resulting called files parsed and
51	53	added to the parent parse tree. 0 means no, 1 means do it. Note that
52	54	this will happen recursively if it is turned on. Also,

54	56
55	57	apply mappings flag (= 0 \|\| 1)
56	58	This flag determines whether (most) user-defined mappings are
57		applied. This means \defs, \newcommands, and \newenvironments. This
58		is critical for properly analyzing the content of the document, as
59		this must be phrased in terms of the semantics of the original TeX
60		and LaTeX commands, not ad hoc user macros. So, for instance, do not
61		expect plain-text extraction to work properly with this option off.
62
63		The parser returns a LaTeX::TOM::Tree ($document in the SYNOPSIS).
	59	applied. This means "\defs", "\newcommands", and "\newenvironments".
	60	This is critical for properly analyzing the content of the document,
	61	as this must be phrased in terms of the semantics of the original
	62	TeX and LaTeX commands, not ad hoc user macros. So, for instance, do
	63	not expect plain-text extraction to work properly with this option
	64	off.
	65
	66	The parser returns a "LaTeX::TOM::Tree" ($document in the SYNOPSIS).
64	67
65	68	LaTeX::TOM::Node
66	69	Nodes may be of the following types:
67	70
68	71	TEXT
69		TEXT nodes can be thought of as representing the plain-text portions
70		of the LaTeX document. This includes math and anything else that is
71		not a recognized TeX or LaTeX command, or user-defined command. In
72		reality, TEXT nodes contain commands that this parser does not yet
73		recognize the semantics of.
	72	"TEXT" nodes can be thought of as representing the plain-text
	73	portions of the LaTeX document. This includes math and anything else
	74	that is not a recognized TeX or LaTeX command, or user-defined
	75	command. In reality, "TEXT" nodes contain commands that this parser
	76	does not yet recognize the semantics of.
74	77
75	78	COMMAND
76		A COMMAND node represents a TeX command. It always has child nodes
	79	A "COMMAND" node represents a TeX command. It always has child nodes
77	80	in a tree, though the tree might be empty if the command operates on
78	81	zero parameters. An example of a command is
79	82
80		\textbf{blah}
81
82		This would parse into a COMMAND node for textbf, which would have
83		a subtree containing the TEXT node with text ``blah.''
	83	\textbf{blah}
	84
	85	This would parse into a "COMMAND" node for "textbf", which would
	86	have a subtree containing the "TEXT" node with text ``blah.''
84	87
85	88	ENVIRONMENT
86		Similarly, TeX environments parse into ENVIRONMENT nodes, which have
87		metadata about the environment, along with a subtree representing
88		what is contained in the environment. For example,
89
90		\begin{equation}
91		r = \frac{-b \pm \sqrt{b^2 - 4ac}}{2a}
92		\end{equation}
93
94		Would parse into an ENVIRONMENT node of the class ``equation'' with
95		a child tree containing the result of parsing ``r = \frac{-b \pm
96		\sqrt{b^2 - 4ac}}{2a}.''
	89	Similarly, TeX environments parse into "ENVIRONMENT" nodes, which
	90	have metadata about the environment, along with a subtree
	91	representing what is contained in the environment. For example,
	92
	93	\begin{equation}
	94	r = \frac{-b \pm \sqrt{b^2 - 4ac}}{2a}
	95	\end{equation}
	96
	97	Would parse into an "ENVIRONMENT" node of the class ``equation''
	98	with a child tree containing the result of parsing ``r = \frac{-b
	99	\pm \sqrt{b^2 - 4ac}}{2a}.''
97	100
98	101	GROUP
99		A GROUP is like an anonymous COMMAND. Since you can put whatever you
100		want in curly-braces ({}) in TeX in order to make semantically
	102	A "GROUP" is like an anonymous "COMMAND". Since you can put whatever
	103	you want in curly-braces ("{}") in TeX in order to make semantically
101	104	isolated regions, this separation is preserved by the parser. A
102		GROUP is just the subtree of the parsed contents of plain
	105	"GROUP" is just the subtree of the parsed contents of plain
103	106	curly-braces.
104	107
105		It is important to note that currently only the first GROUP in a
106		series of GROUPs following a LaTeX command will actually be parsed
107		into a COMMAND node. The reason is that, for the initial purposes of
108		this module, it was not necessary to recognize additional GROUPs as
109		additional parameters to the COMMAND. However, this is something
110		that this module really should do eventually. Currently if you want
111		all the parameters to a multi-parametered command, you'll need to
112		pick out all the following GROUP nodes yourself.
	108	It is important to note that currently only the first "GROUP" in a
	109	series of "GROUP"s following a LaTeX command will actually be parsed
	110	into a "COMMAND" node. The reason is that, for the initial purposes
	111	of this module, it was not necessary to recognize additional
	112	"GROUP"s as additional parameters to the "COMMAND". However, this is
	113	something that this module really should do eventually. Currently if
	114	you want all the parameters to a multi-parametered command, you'll
	115	need to pick out all the following "GROUP" nodes yourself.
113	116
114	117	Eventually this will become something like a list which is stored in
115		the COMMAND node, much like XML::DOM's treatment of attributes.
	118	the "COMMAND" node, much like XML::DOM's treatment of attributes.
116	119	These are, in a sense, apart from the rest of the document tree.
117		Then GROUP nodes will become much more rare.
	120	Then "GROUP" nodes will become much more rare.
118	121
119	122	COMMENT
120		A COMMENT node is very similar to a TEXT node, except it is
	123	A "COMMENT" node is very similar to a "TEXT" node, except it is
121	124	specifically for lines beginning with ``%'' (the TeX comment
122	125	delimeter) or the right-hand portion of a line that has ``%'' at
123	126	some internal point.

133	136
134	137	METHODS
135	138	LaTeX::TOM
136		new Instantiate a new parser object.
	139	new
	140	Instantiate a new parser object.
137	141
138	142	In this section all of the methods for each of the components are listed
139	143	and described.

142	146	The methods for the parser (aside from the constructor, discussed above)
143	147	are :
144	148
145		parseFile (filename)
	149	parseFile (filename)
146	150	Read in the contents of filename and parse them, returning a
147		LaTeX::TOM:Tree.
148
149		parse (string)
150		Parse the string string and return a LaTeX::TOM::Tree.
	151	"LaTeX::TOM::Tree".
	152
	153	parse (string)
	154	Parse the string string and return a "LaTeX::TOM::Tree".
151	155
152	156	LaTeX::TOM::Tree
153	157	This section contains methods for the Trees returned by the parser.
154	158
155		copy
	159	copy
156	160	Duplicate a tree into new memory.
157	161
158		print
	162	print
159	163	A debug print of the structure of the tree.
160	164
161		plainText
	165	plainText
162	166	Returns an arrayref which is a list of strings representing the text
163		of all getNodePlainTextFlag = 1 TEXT nodes, in an inorder traversal.
164
165		indexableText
	167	of all "getNodePlainTextFlag = 1" "TEXT" nodes, in an inorder
	168	traversal.
	169
	170	indexableText
166	171	A method like the above but which goes one step further; it cleans
167	172	all of the returned text and concatenates it into a single string
168	173	which one could consider having all of the standard information
169	174	retrieval value for the document, making it useful for indexing.
170	175
171		toLaTeX
	176	toLaTeX
172	177	Return a string representing the LaTeX encoded by the tree. This is
173	178	especially useful to get a normal document again, after modifying
174	179	nodes of the tree.
175	180
176		getTopLevelNodes
177		Return an arrayref which is a list of LaTeX::TOM::Nodes at the top
	181	getTopLevelNodes
	182	Return an arrayref which is a list of "LaTeX::TOM::Nodes" at the top
178	183	level of the Tree.
179	184
180		getAllNodes
	185	getAllNodes
181	186	Return an arrayref with all nodes of the tree. This "flattens" the
182	187	tree.
183	188
184		getCommandNodesByName (name)
185		Return an arrayref with all COMMAND nodes in the tree which have a
	189	getCommandNodesByName (name)
	190	Return an arrayref with all "COMMAND" nodes in the tree which have a
186	191	name matching name.
187	192
188		getEnvironmentsByName (name)
189		Return an arrayref with all ENVIRONMENT nodes in the tree which have
190		a class matching name.
191
192		getNodesByCondition (expression)
	193	getEnvironmentsByName (name)
	194	Return an arrayref with all "ENVIRONMENT" nodes in the tree which
	195	have a class matching name.
	196
	197	getNodesByCondition (expression)
193	198	This is a catch-all search method which can be used to pull out
194	199	nodes that match pretty much any perl expression, without manually
195	200	having to traverse the tree. expression is a valid perl expression

200	205	LaTeX::TOM::Node
201	206	This section contains the methods for nodes of the parsed Trees.
202	207
203		getNodeType
204		Returns the type, one of 'TEXT', 'COMMAND', 'ENVIRONMENT', 'GROUP',
205		or 'COMMENT', as described above.
206
207		getNodeText
208		Applicable for TEXT or COMMENT nodes; this returns the document text
209		they contain. This is undef for other node types.
210
211		setNodeText
212		Set the node text, also for TEXT and COMMENT nodes.
213
214		getNodeStartingPosition
	208	getNodeType
	209	Returns the type, one of "TEXT", "COMMAND", "ENVIRONMENT", "GROUP",
	210	or "COMMENT", as described above.
	211
	212	getNodeText
	213	Applicable for "TEXT" or "COMMENT" nodes; this returns the document
	214	text they contain. This is undef for other node types.
	215
	216	setNodeText
	217	Set the node text, also for "TEXT" and "COMMENT" nodes.
	218
	219	getNodeStartingPosition
215	220	Get the starting character position in the document of this node.
216		For TEXT and COMMENT nodes, this will be where the text begins. For
217		ENVIRONMENT, COMMAND, or GROUP nodes, this will be the position of
218		the last character of the opening identifier.
219
220		getNodeEndingPosition
221		Same as above, but for last character. For GROUP, ENVIRONMENT, or
222		COMMAND nodes, this will be the first character of the closing
	221	For "TEXT" and "COMMENT" nodes, this will be where the text begins.
	222	For "ENVIRONMENT", "COMMAND", or "GROUP" nodes, this will be the
	223	position of the last character of the opening identifier.
	224
	225	getNodeEndingPosition
	226	Same as above, but for last character. For "GROUP", "ENVIRONMENT",
	227	or "COMMAND" nodes, this will be the first character of the
	228	closing identifier.
	229
	230	getNodeOuterStartingPosition
	231	Same as getNodeStartingPosition, but for "GROUP", "ENVIRONMENT", or
	232	"COMMAND" nodes, this returns the first character of the opening
223	233	identifier.
224	234
225		getNodeOuterStartingPosition
226		Same as getNodeStartingPosition, but for GROUP, ENVIRONMENT, or
227		COMMAND nodes, this returns the first character of the opening
	235	getNodeOuterEndingPosition
	236	Same as getNodeEndingPosition, but for "GROUP", "ENVIRONMENT", or
	237	"COMMAND" nodes, this returns the last character of the closing
228	238	identifier.
229	239
230		getNodeOuterEndingPosition
231		Same as getNodeEndingPosition, but for GROUP, ENVIRONMENT, or
232		COMMAND nodes, this returns the last character of the closing
233		identifier.
234
235		getNodeMathFlag
	240	getNodeMathFlag
236	241	This applies to any node type. It is 1 if the node sets, or is
237		contained within, a math mode region. 0 otherwise. TEXT nodes which
238		have this flag as 1 can be assumed to be the actual mathematics
239		contained in the document.
240
241		getNodePlainTextFlag
242		This applies only to TEXT nodes. It is 1 if the node is non-math and
243		is visible (in other words, will end up being a part of the output
244		document). One would only want to index TEXT nodes with this
245		property, for information retrieval purposes.
246
247		getEnvironmentClass
248		This applies only to ENVIRONMENT nodes. Returns what class of
249		environment the node represents (the X in \begin{X} and \end{X}).
250
251		getCommandName
252		This applies only to COMMAND nodes. Returns the name of the command
253		(the X in \X{...}).
254
255		getChildTree
256		This applies only to COMMAND, ENVIRONMENT, and GROUP nodes: it
257		returns the LaTeX::TOM::Tree which is ``under'' the calling node.
258
259		getFirstChild
260		This applies only to COMMAND, ENVIRONMENT, and GROUP nodes: it
	242	contained within, a math mode region. 0 otherwise. "TEXT" nodes
	243	which have this flag as 1 can be assumed to be the actual
	244	mathematics contained in the document.
	245
	246	getNodePlainTextFlag
	247	This applies only to "TEXT" nodes. It is 1 if the node is non-math
	248	and is visible (in other words, will end up being a part of the
	249	output document). One would only want to index "TEXT" nodes with
	250	this property, for information retrieval purposes.
	251
	252	getEnvironmentClass
	253	This applies only to "ENVIRONMENT" nodes. Returns what class of
	254	environment the node represents (the "X" in "\begin{X}" and
	255	"\end{X}").
	256
	257	getCommandName
	258	This applies only to "COMMAND" nodes. Returns the name of the
	259	command (the "X" in "\X{...}").
	260
	261	getChildTree
	262	This applies only to "COMMAND", "ENVIRONMENT", and "GROUP" nodes: it
	263	returns the "LaTeX::TOM::Tree" which is ``under'' the calling node.
	264
	265	getFirstChild
	266	This applies only to "COMMAND", "ENVIRONMENT", and "GROUP" nodes: it
261	267	returns the first node from the first level of the child subtree.
262	268
263		getLastChild
	269	getLastChild
264	270	Same as above, but for the last node of the first level.
265	271
266		getPreviousSibling
	272	getPreviousSibling
267	273	Return the prior node on the same level of the tree.
268	274
269		getNextSibling
	275	getNextSibling
270	276	Same as above, but for following node.
271	277
272		getParent
	278	getParent
273	279	Get the parent node of this node in the tree.
274	280
275		getNextGroupNode
	281	getNextGroupNode
276	282	This is an interesting function, and kind of a hack because of the
277	283	way the parser makes the current tree. Basically it will give you
278		the next sibling that is a GROUP node, until it either hits the end
279		of the tree level, a TEXT node which doesn't match /^\s*$/, or a
280		COMMAND node.
281
282		This is useful for finding all GROUPed parameters after a COMMAND
283		node (see comments for 'GROUP' in the 'COMPONENTS' /
284		'LaTeX::TOM::Node' section). You can just have a while loop that
285		calls this method until it gets 'undef', and you'll know you've
	284	the next sibling that is a "GROUP" node, until it either hits the
	285	end of the tree level, a "TEXT" node which doesn't match "/^\s*$/",
	286	or a "COMMAND" node.
	287
	288	This is useful for finding all "GROUP"ed parameters after a
	289	"COMMAND" node (see comments for "GROUP" in the "COMPONENTS" /
	290	"LaTeX::TOM::Node" section). You can just have a while loop that
	291	calls this method until it gets "undef", and you'll know you've
286	292	found all the parameters to a command.
287	293
288		Note: this may be bad, but TEXT Nodes matching /^\s*\[[0-9]+\]$/
	294	Note: this may be bad, but "TEXT" Nodes matching "/^\s*\[[0-9]+\]$/"
289	295	(optional parameter groups) are treated as if they were 'blank'.
290	296
291	297	CAVEATS
292	298	Due to the lack of tree-modification methods, currently this module is
293	299	mostly useful for minor modifications to the parsed document, for
294		instance, altering the text of TEXT nodes but not deleting the nodes. Of
295		course, the user can still do this by breaking abstraction and directly
296		modifying the Tree.
	300	instance, altering the text of "TEXT" nodes but not deleting the nodes.
	301	Of course, the user can still do this by breaking abstraction and
	302	directly modifying the Tree.
297	303
298	304	Also note that the parsing is not complete. This module was not written
299	305	with the intention of being able to produce output documents the way

309	315	of ~1000 research publications from the Computing Research Repository,
310	316	so I deemed it ``good enough'' to use for purposes similar to mine.
311	317
312		Please let me know of parser errors if you discover any.
313
314		AUTHOR
	318	Please let the authors know of parser errors if you discover any.
	319
	320	AUTHORS
315	321	Written by Aaron Krowne <akrowne@vt.edu>
316	322
317	323	Maintained by Steven Schubiger <schubiger@cpan.org>

+333

-161

lib/LaTeX/TOM.pm

1	1	#
2	2	# LaTeX::TOM (TeX Object Model)
3	3	#
4		# Version 0.6
	4	# Version 0.8
5	5	#
6	6	# ----------------------------------------------------------------------------
7	7	#

31	31
32	32	use base qw(LaTeX::TOM::Parser);
33	33
34		our $VERSION = '0.6';
	34	our $VERSION = '0.8';
35	35
36	36	# BEGIN CONFIG SECTION ########################################################
37	37

206	206
207	207	=head1 SYNOPSIS
208	208
209		use LaTeX::TOM;
210
211		my $parser = LaTeX::TOM->new;
212
213		my $document = $parser->parseFile('mypaper.tex');
214
215		my $latex = $document->toLaTeX;
216
217		my $specialnodes = $document->getNodesByCondition(
218		'$node->getNodeType eq \'TEXT\' &&
219		$node->getNodeText =~ /magic string/');
220
221		my $sections = $document->getNodesByCondition(
222		'$node->getNodeType eq \'COMMAND\' &&
223		$node->getCommandName =~ /section$/');
224
225		my $indexme = $document->getIndexableText;
226
227		$document->print;
	209	use LaTeX::TOM;
	210
	211	$parser = LaTeX::TOM->new;
	212
	213	$document = $parser->parseFile('mypaper.tex');
	214
	215	$latex = $document->toLaTeX;
	216
	217	$specialnodes = $document->getNodesByCondition(
	218	'$node->getNodeType eq \'TEXT\' &&
	219	$node->getNodeText =~ /magic string/'
	220	);
	221
	222	$sections = $document->getNodesByCondition(
	223	'$node->getNodeType eq \'COMMAND\' &&
	224	$node->getCommandName =~ /section$/'
	225	);
	226
	227	$indexme = $document->getIndexableText;
	228
	229	$document->print;
228	230
229	231	=head1 DESCRIPTION
230	232
231	233	This module provides a parser which parses and interprets (though not fully)
232	234	LaTeX documents and returns a tree-based representation of what it finds.
233		This tree is a LaTeX::TOM::Tree. The tree contains LaTeX::TOM:Node nodes.
	235	This tree is a C<LaTeX::TOM::Tree>. The tree contains C<LaTeX::TOM::Node> nodes.
234	236
235	237	This module should be especially useful to anyone who wants to do processing
236	238	of LaTeX documents that requires extraction of plain-text information, or

247	249
248	250	=item parse error handling (= B<0> \|\| 1 \|\| 2)
249	251
250		Determines what happens when a parse error is encountered. 0 results in a
251		warning. 1 results in a die. 2 results in silence. Note that particular
	252	Determines what happens when a parse error is encountered. C<0> results in a
	253	warning. C<1> results in a die. C<2> results in silence. Note that particular
252	254	groupings in LaTeX (i.e. newcommands and the like) contain invalid TeX or
253		LaTeX, so you nearly always need this parameter to be 0 or 2 to completely
	255	LaTeX, so you nearly always need this parameter to be C<0> or C<2> to completely
254	256	parse the document.
255	257
256	258	=item read inputs flag (= 0 \|\| B<1>)
257	259
258		This flag determines whether a scan for \input and \input-like commands is
	260	This flag determines whether a scan for C<\input> and C<\input-like> commands is
259	261	performed, and the resulting called files parsed and added to the parent
260		parse tree. 0 means no, 1 means do it. Note that this will happen recursively
261		if it is turned on. Also, bibliographies (.bbl files) are detected and
	262	parse tree. C<0> means no, C<1> means do it. Note that this will happen recursively
	263	if it is turned on. Also, bibliographies (F<.bbl> files) are detected and
262	264	included.
263	265
264	266	=item apply mappings flag (= 0 \|\| B<1>)
265	267
266	268	This flag determines whether (most) user-defined mappings are applied. This
267		means \defs, \newcommands, and \newenvironments. This is critical for properly
268		analyzing the content of the document, as this must be phrased in terms of the
269		semantics of the original TeX and LaTeX commands, not ad hoc user macros. So,
270		for instance, do not expect plain-text extraction to work properly with this
	269	means C<\defs>, C<\newcommands>, and C<\newenvironments>. This is critical for
	270	properly analyzing the content of the document, as this must be phrased in terms
	271	of the semantics of the original TeX and LaTeX commands, not ad hoc user macros.
	272	So, for instance, do not expect plain-text extraction to work properly with this
271	273	option off.
272	274
273	275	=back
274	276
275		The parser returns a LaTeX::TOM::Tree ($document in the SYNOPSIS).
	277	The parser returns a C<LaTeX::TOM::Tree> ($document in the SYNOPSIS).
276	278
277	279	=head2 LaTeX::TOM::Node
278	280

282	284
283	285	=item TEXT
284	286
285		TEXT nodes can be thought of as representing the plain-text portions of the
	287	C<TEXT> nodes can be thought of as representing the plain-text portions of the
286	288	LaTeX document. This includes math and anything else that is not a recognized
287		TeX or LaTeX command, or user-defined command. In reality, TEXT nodes contain
	289	TeX or LaTeX command, or user-defined command. In reality, C<TEXT> nodes contain
288	290	commands that this parser does not yet recognize the semantics of.
289	291
290	292	=item COMMAND
291	293
292		A COMMAND node represents a TeX command. It always has child nodes in a tree,
	294	A C<COMMAND> node represents a TeX command. It always has child nodes in a tree,
293	295	though the tree might be empty if the command operates on zero parameters. An
294	296	example of a command is
295	297
296		\textbf{blah}
297
298		This would parse into a COMMAND node for I<textbf>, which would have a subtree
299		containing the TEXT node with text ``blah.''
	298	\textbf{blah}
	299
	300	This would parse into a C<COMMAND> node for C<textbf>, which would have a subtree
	301	containing the C<TEXT> node with text ``blah.''
300	302
301	303	=item ENVIRONMENT
302	304
303		Similarly, TeX environments parse into ENVIRONMENT nodes, which have metadata
	305	Similarly, TeX environments parse into C<ENVIRONMENT> nodes, which have metadata
304	306	about the environment, along with a subtree representing what is contained in
305	307	the environment. For example,
306	308
307		\begin{equation}
308		r = \frac{-b \pm \sqrt{b^2 - 4ac}}{2a}
309		\end{equation}
310
311		Would parse into an ENVIRONMENT node of the class ``equation'' with a child
312		tree containing the result of parsing ``r = \frac{-b \pm \sqrt{b^2 - 4ac}}{2a}.''
	309	\begin{equation}
	310	r = \frac{-b \pm \sqrt{b^2 - 4ac}}{2a}
	311	\end{equation}
	312
	313	Would parse into an C<ENVIRONMENT> node of the class ``equation'' with a child
	314	tree containing the result of parsing C<``r = \frac{-b \pm \sqrt{b^2 - 4ac}}{2a}.''>
313	315
314	316	=item GROUP
315	317
316		A GROUP is like an anonymous COMMAND. Since you can put whatever you want in
317		curly-braces ({}) in TeX in order to make semantically isolated regions, this
318		separation is preserved by the parser. A GROUP is just the subtree of the
	318	A C<GROUP> is like an anonymous C<COMMAND>. Since you can put whatever you want in
	319	curly-braces (C<{}>) in TeX in order to make semantically isolated regions, this
	320	separation is preserved by the parser. A C<GROUP> is just the subtree of the
319	321	parsed contents of plain curly-braces.
320	322
321		It is important to note that currently only the first GROUP in a series of
322		GROUPs following a LaTeX command will actually be parsed into a COMMAND node.
	323	It is important to note that currently only the first C<GROUP> in a series of
	324	C<GROUP>s following a LaTeX command will actually be parsed into a C<COMMAND> node.
323	325	The reason is that, for the initial purposes of this module, it was not
324		necessary to recognize additional GROUPs as additional parameters to the
325		COMMAND. However, this is something that this module really should do
	326	necessary to recognize additional C<GROUP>s as additional parameters to the
	327	C<COMMAND>. However, this is something that this module really should do
326	328	eventually. Currently if you want all the parameters to a multi-parametered
327		command, you'll need to pick out all the following GROUP nodes yourself.
	329	command, you'll need to pick out all the following C<GROUP> nodes yourself.
328	330
329	331	Eventually this will become something like a list which is stored in the
330		COMMAND node, much like XML::DOM's treatment of attributes. These are, in a
331		sense, apart from the rest of the document tree. Then GROUP nodes will become
	332	C<COMMAND> node, much like L<XML::DOM>'s treatment of attributes. These are, in a
	333	sense, apart from the rest of the document tree. Then C<GROUP> nodes will become
332	334	much more rare.
333	335
334	336	=item COMMENT
335	337
336		A COMMENT node is very similar to a TEXT node, except it is specifically for
337		lines beginning with ``%'' (the TeX comment delimeter) or the right-hand
338		portion of a line that has ``%'' at some internal point.
	338	A C<COMMENT> node is very similar to a C<TEXT> node, except it is specifically for
	339	lines beginning with C<``%''> (the TeX comment delimeter) or the right-hand
	340	portion of a line that has C<``%''> at some internal point.
339	341
340	342	=back
341	343

353	355
354	356	=head2 LaTeX::TOM
355	357
356		=over 4
357
358		=item new
	358	=head3 new
	359
	360	=over 4
	361
	362	=item C<>
359	363
360	364	Instantiate a new parser object.
361	365

368	372
369	373	The methods for the parser (aside from the constructor, discussed above) are :
370	374
371		=over 4
372
373		=item parseFile (filename)
374
375		Read in the contents of I<filename> and parse them, returning a LaTeX::TOM:Tree.
376
377		=item parse (string)
378
379		Parse the string I<string> and return a LaTeX::TOM::Tree.
	375	=head3 parseFile (filename)
	376
	377	=over 4
	378
	379	=item C<>
	380
	381	Read in the contents of I<filename> and parse them, returning a C<LaTeX::TOM::Tree>.
	382
	383	=back
	384
	385	=head3 parse (string)
	386
	387	=over 4
	388
	389	=item C<>
	390
	391	Parse the string I<string> and return a C<LaTeX::TOM::Tree>.
380	392
381	393	=back
382	394

384	396
385	397	This section contains methods for the Trees returned by the parser.
386	398
387		=over 4
388
389		=item copy
	399	=head3 copy
	400
	401	=over 4
	402
	403	=item C<>
390	404
391	405	Duplicate a tree into new memory.
392	406
393		=item print
	407	=back
	408
	409	=head3 print
	410
	411	=over 4
	412
	413	=item C<>
394	414
395	415	A debug print of the structure of the tree.
396	416
397		=item plainText
	417	=back
	418
	419	=head3 plainText
	420
	421	=over 4
	422
	423	=item C<>
398	424
399	425	Returns an arrayref which is a list of strings representing the text of all
400		getNodePlainTextFlag = 1 TEXT nodes, in an inorder traversal.
401
402		=item indexableText
	426	C<getNodePlainTextFlag = 1> C<TEXT> nodes, in an inorder traversal.
	427
	428	=back
	429
	430	=head3 indexableText
	431
	432	=over 4
	433
	434	=item C<>
403	435
404	436	A method like the above but which goes one step further; it cleans all of the
405	437	returned text and concatenates it into a single string which one could consider
406	438	having all of the standard information retrieval value for the document,
407	439	making it useful for indexing.
408	440
409		=item toLaTeX
	441	=back
	442
	443	=head3 toLaTeX
	444
	445	=over 4
	446
	447	=item C<>
410	448
411	449	Return a string representing the LaTeX encoded by the tree. This is especially
412	450	useful to get a normal document again, after modifying nodes of the tree.
413	451
414		=item getTopLevelNodes
415
416		Return an arrayref which is a list of LaTeX::TOM::Nodes at the top level of
	452	=back
	453
	454	=head3 getTopLevelNodes
	455
	456	=over 4
	457
	458	=item C<>
	459
	460	Return an arrayref which is a list of C<LaTeX::TOM::Nodes> at the top level of
417	461	the Tree.
418	462
419		=item getAllNodes
	463	=back
	464
	465	=head3 getAllNodes
	466
	467	=over 4
	468
	469	=item C<>
420	470
421	471	Return an arrayref with B<all> nodes of the tree. This "flattens" the tree.
422	472
423		=item getCommandNodesByName (name)
424
425		Return an arrayref with all COMMAND nodes in the tree which have a name
	473	=back
	474
	475	=head3 getCommandNodesByName (name)
	476
	477	=over 4
	478
	479	=item C<>
	480
	481	Return an arrayref with all C<COMMAND> nodes in the tree which have a name
426	482	matching I<name>.
427	483
428		=item getEnvironmentsByName (name)
429
430		Return an arrayref with all ENVIRONMENT nodes in the tree which have a class
	484	=back
	485
	486	=head3 getEnvironmentsByName (name)
	487
	488	=over 4
	489
	490	=item C<>
	491
	492	Return an arrayref with all C<ENVIRONMENT> nodes in the tree which have a class
431	493	matching I<name>.
432	494
433		=item getNodesByCondition (expression)
	495	=back
	496
	497	=head3 getNodesByCondition (expression)
	498
	499	=over 4
	500
	501	=item C<>
434	502
435	503	This is a catch-all search method which can be used to pull out nodes that
436	504	match pretty much any perl expression, without manually having to traverse the

444	512
445	513	This section contains the methods for nodes of the parsed Trees.
446	514
447		=over 4
448
449		=item getNodeType
450
451		Returns the type, one of 'TEXT', 'COMMAND', 'ENVIRONMENT', 'GROUP', or 'COMMENT',
	515	=head3 getNodeType
	516
	517	=over 4
	518
	519	=item C<>
	520
	521	Returns the type, one of C<TEXT>, C<COMMAND>, C<ENVIRONMENT>, C<GROUP>, or C<COMMENT>,
452	522	as described above.
453	523
454		=item getNodeText
455
456		Applicable for TEXT or COMMENT nodes; this returns the document text they contain.
	524	=back
	525
	526	=head3 getNodeText
	527
	528	=over 4
	529
	530	=item C<>
	531
	532	Applicable for C<TEXT> or C<COMMENT> nodes; this returns the document text they contain.
457	533	This is undef for other node types.
458	534
459		=item setNodeText
460
461		Set the node text, also for TEXT and COMMENT nodes.
462
463		=item getNodeStartingPosition
464
465		Get the starting character position in the document of this node. For TEXT
466		and COMMENT nodes, this will be where the text begins. For ENVIRONMENT,
467		COMMAND, or GROUP nodes, this will be the position of the I<last> character of
	535	=back
	536
	537	=head3 setNodeText
	538
	539	=over 4
	540
	541	=item C<>
	542
	543	Set the node text, also for C<TEXT> and C<COMMENT> nodes.
	544
	545	=back
	546
	547	=head3 getNodeStartingPosition
	548
	549	=over 4
	550
	551	=item C<>
	552
	553	Get the starting character position in the document of this node. For C<TEXT>
	554	and C<COMMENT> nodes, this will be where the text begins. For C<ENVIRONMENT>,
	555	C<COMMAND>, or C<GROUP> nodes, this will be the position of the I<last> character of
468	556	the opening identifier.
469	557
470		=item getNodeEndingPosition
471
472		Same as above, but for last character. For GROUP, ENVIRONMENT, or COMMAND
	558	=back
	559
	560	=head3 getNodeEndingPosition
	561
	562	=over 4
	563
	564	=item C<>
	565
	566	Same as above, but for last character. For C<GROUP>, C<ENVIRONMENT>, or C<COMMAND>
473	567	nodes, this will be the I<first> character of the closing identifier.
474	568
475		=item getNodeOuterStartingPosition
476
477		Same as getNodeStartingPosition, but for GROUP, ENVIRONMENT, or COMMAND nodes,
	569	=back
	570
	571	=head3 getNodeOuterStartingPosition
	572
	573	=over 4
	574
	575	=item C<>
	576
	577	Same as getNodeStartingPosition, but for C<GROUP>, C<ENVIRONMENT>, or C<COMMAND> nodes,
478	578	this returns the I<first> character of the opening identifier.
479	579
480		=item getNodeOuterEndingPosition
481
482		Same as getNodeEndingPosition, but for GROUP, ENVIRONMENT, or COMMAND nodes,
	580	=back
	581
	582	=head3 getNodeOuterEndingPosition
	583
	584	=over 4
	585
	586	=item C<>
	587
	588	Same as getNodeEndingPosition, but for C<GROUP>, C<ENVIRONMENT>, or C<COMMAND> nodes,
483	589	this returns the I<last> character of the closing identifier.
484	590
485		=item getNodeMathFlag
486
487		This applies to any node type. It is 1 if the node sets, or is contained
488		within, a math mode region. 0 otherwise. TEXT nodes which have this flag as 1
	591	=back
	592
	593	=head3 getNodeMathFlag
	594
	595	=over 4
	596
	597	=item C<>
	598
	599	This applies to any node type. It is C<1> if the node sets, or is contained
	600	within, a math mode region. C<0> otherwise. C<TEXT> nodes which have this flag as C<1>
489	601	can be assumed to be the actual mathematics contained in the document.
490	602
491		=item getNodePlainTextFlag
492
493		This applies only to TEXT nodes. It is 1 if the node is non-math B<and> is
	603	=back
	604
	605	=head3 getNodePlainTextFlag
	606
	607	=over 4
	608
	609	=item C<>
	610
	611	This applies only to C<TEXT> nodes. It is C<1> if the node is non-math B<and> is
494	612	visible (in other words, will end up being a part of the output document). One
495		would only want to index TEXT nodes with this property, for information
	613	would only want to index C<TEXT> nodes with this property, for information
496	614	retrieval purposes.
497	615
498		=item getEnvironmentClass
499
500		This applies only to ENVIRONMENT nodes. Returns what class of environment the
501		node represents (the X in \begin{X} and \end{X}).
502
503		=item getCommandName
504
505		This applies only to COMMAND nodes. Returns the name of the command (the X in
506		\X{...}).
507
508		=item getChildTree
509
510		This applies only to COMMAND, ENVIRONMENT, and GROUP nodes: it returns the
511		LaTeX::TOM::Tree which is ``under'' the calling node.
512
513		=item getFirstChild
514
515		This applies only to COMMAND, ENVIRONMENT, and GROUP nodes: it returns the
	616	=back
	617
	618	=head3 getEnvironmentClass
	619
	620	=over 4
	621
	622	=item C<>
	623
	624	This applies only to C<ENVIRONMENT> nodes. Returns what class of environment the
	625	node represents (the C<X> in C<\begin{X}> and C<\end{X}>).
	626
	627	=back
	628
	629	=head3 getCommandName
	630
	631	=over 4
	632
	633	=item C<>
	634
	635	This applies only to C<COMMAND> nodes. Returns the name of the command (the C<X> in
	636	C<\X{...}>).
	637
	638	=back
	639
	640	=head3 getChildTree
	641
	642	=over 4
	643
	644	=item C<>
	645
	646	This applies only to C<COMMAND>, C<ENVIRONMENT>, and C<GROUP> nodes: it returns the
	647	C<LaTeX::TOM::Tree> which is ``under'' the calling node.
	648
	649	=back
	650
	651	=head3 getFirstChild
	652
	653	=over 4
	654
	655	=item C<>
	656
	657	This applies only to C<COMMAND>, C<ENVIRONMENT>, and C<GROUP> nodes: it returns the
516	658	first node from the first level of the child subtree.
517	659
518		=item getLastChild
	660	=back
	661
	662	=head3 getLastChild
	663
	664	=over 4
	665
	666	=item C<>
519	667
520	668	Same as above, but for the last node of the first level.
521	669
522		=item getPreviousSibling
	670	=back
	671
	672	=head3 getPreviousSibling
	673
	674	=over 4
	675
	676	=item C<>
523	677
524	678	Return the prior node on the same level of the tree.
525	679
526		=item getNextSibling
	680	=back
	681
	682	=head3 getNextSibling
	683
	684	=over 4
	685
	686	=item C<>
527	687
528	688	Same as above, but for following node.
529	689
530		=item getParent
	690	=back
	691
	692	=head3 getParent
	693
	694	=over 4
	695
	696	=item C<>
531	697
532	698	Get the parent node of this node in the tree.
533	699
534		=item getNextGroupNode
	700	=back
	701
	702	=head3 getNextGroupNode
	703
	704	=over 4
	705
	706	=item C<>
535	707
536	708	This is an interesting function, and kind of a hack because of the way the
537	709	parser makes the current tree. Basically it will give you the next sibling
538		that is a GROUP node, until it either hits the end of the tree level, a TEXT
539		node which doesn't match /^\s*$/, or a COMMAND node.
540
541		This is useful for finding all GROUPed parameters after a COMMAND node (see
542		comments for 'GROUP' in the 'COMPONENTS' / 'LaTeX::TOM::Node' section). You
543		can just have a while loop that calls this method until it gets 'undef', and
	710	that is a C<GROUP> node, until it either hits the end of the tree level, a C<TEXT>
	711	node which doesn't match C</^\s*$/>, or a C<COMMAND> node.
	712
	713	This is useful for finding all C<GROUP>ed parameters after a C<COMMAND> node (see
	714	comments for C<GROUP> in the C<COMPONENTS> / C<LaTeX::TOM::Node> section). You
	715	can just have a while loop that calls this method until it gets C<undef>, and
544	716	you'll know you've found all the parameters to a command.
545	717
546		Note: this may be bad, but TEXT Nodes matching /^\s*\[[0-9]+\]$/ (optional
	718	Note: this may be bad, but C<TEXT> Nodes matching C</^\s*\[[0-9]+\]$/> (optional
547	719	parameter groups) are treated as if they were 'blank'.
548	720
549	721	=back

552	724
553	725	Due to the lack of tree-modification methods, currently this module is
554	726	mostly useful for minor modifications to the parsed document, for instance,
555		altering the text of TEXT nodes but not deleting the nodes. Of course, the
	727	altering the text of C<TEXT> nodes but not deleting the nodes. Of course, the
556	728	user can still do this by breaking abstraction and directly modifying the Tree.
557	729
558	730	Also note that the parsing is not complete. This module was not written with

561	733	logical level with regards to the content; it doesn't care about the document
562	734	formatting and outputting side of TeX/LaTeX.
563	735
564		There is much work still to be done. See the TODO list in the TOM.pm source.
	736	There is much work still to be done. See the F<TODO> list in the F<TOM.pm> source.
565	737
566	738	=head1 BUGS
567	739

569	741	~1000 research publications from the Computing Research Repository, so I
570	742	deemed it ``good enough'' to use for purposes similar to mine.
571	743
572		Please let me know of parser errors if you discover any.
573
574		=head1 AUTHOR
	744	Please let the authors know of parser errors if you discover any.
	745
	746	=head1 AUTHORS
575	747
576	748	Written by Aaron Krowne <akrowne@vt.edu>
577	749

-1

t/pod-coverage.t

2	2	use strict;
3	3	use warnings;
4	4
5		use Test::More tests => 1;
	5	use Test::More;
	6
6	7	eval "use Test::Pod::Coverage 1.04";
7	8	plan skip_all => "Test::Pod::Coverage 1.04 required for testing POD coverage" if $@;
	9
	10	plan tests => 1;
8	11	pod_coverage_ok('LaTeX::TOM');

-1

t/pod.t

2	2	use strict;
3	3	use warnings;
4	4
5		use Test::More tests => 1;
	5	use Test::More;
	6
6	7	eval "use Test::Pod 1.14";
7	8	plan skip_all => "Test::Pod 1.14 required for testing POD" if $@;
	9
	10	plan tests => 1;
8	11	pod_file_ok('lib/LaTeX/TOM.pm');