Codebase list tika / efa85c9
New upstream version 1.8 Emmanuel Bourg 7 years ago
423 changed file(s) with 30451 addition(s) and 7179 deletion(s).
.svn
target
.idea
.classpath
*.iml
*.ipr
*.iws
nbactions.xml
nb-configuration.xml
Release 1.8 - 4/13/2015

* Fix null pointer when processing ODT footer styles (TIKA-1600).

* Upgrade com.drewnoakes' metadata-extractor to 2.0 and
  add parser for webp metadata (TIKA-1594).

* Duration extracted from MP3s with no ID3 tags (TIKA-1589).

* Upgraded to PDFBox 1.8.9 (TIKA-1575).

* Tika now supports the IsaTab data standard for bioinformatics,
  both in terms of MIME identification and in terms of parsing
  (TIKA-1580).

* Tika server can now enable CORS requests with the command line
  "--cors" or "-C" option (TIKA-1586).

* Update jhighlight dependency to avoid using LGPL license. Thanks
  to @kkrugler for his great contribution (TIKA-1581).

* Updated HDF and NetCDF parsers to output file version in
  metadata (TIKA-1578 and TIKA-1579).

* Upgraded to POI 3.12-beta1 (TIKA-1531).

* Added tika-batch module for directory to directory batch
  processing. This is a new, experimental capability, and the API will
  likely change in future releases (TIKA-1330).

* Translator.translate() exceptions are now restricted to
  TikaException and IOException (TIKA-1416).

* Tika now supports MIME detection for Microsoft Extended
  Makefiles (EMF) (TIKA-1554).

* Tika has improved delineation in XML and HTML MIME detection
  (TIKA-1365).

* Upgraded the Drew Noakes metadata-extractor to version 2.7.2
  (TIKA-1576).

* Added basic style support for ODF documents, contributed by
  Axel Dörfler (TIKA-1063).

* Move Tika server resources and writers to separate
  org.apache.tika.server.resource and writer packages (TIKA-1564).

* Upgrade UCAR dependencies to 4.5.5 (TIKA-1571).

* Fix paths in Tika server welcome page (TIKA-1567).

* Fixed infinite recursion while parsing some PDFs (TIKA-1038).

* XHTMLContentHandler now properly passes along body attributes,
  contributed by Markus Jelsma (TIKA-995).

* TikaCLI option --compare-file-magic reports MIME types known to
  the file(1) tool but not known / fully known to Tika.

* MediaTypeRegistry support for returning known child types.

* Support for excluding (blacklisting) certain parsers from being
  used by DefaultParser via the Tika config file, using the new
  parser-exclude tag (TIKA-1558).
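
  As an illustration, a Tika config file using the new parser-exclude tag
  could look roughly like the sketch below. This is a hedged example, not
  the authoritative syntax: the excluded parser class is an arbitrary
  illustrative choice, and the exact element names should be checked
  against TIKA-1558 and the Tika configuration documentation.

  ```xml
  <?xml version="1.0" encoding="UTF-8"?>
  <properties>
    <parsers>
      <!-- DefaultParser picks up every parser on the classpath... -->
      <parser class="org.apache.tika.parser.DefaultParser">
        <!-- ...except the ones excluded here -->
        <parser-exclude class="org.apache.tika.parser.executable.ExecutableParser"/>
      </parser>
    </parsers>
  </properties>
  ```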

* Detect Global Change Master Directory (GCMD) Directory
  Interchange Format (DIF) files (TIKA-1561).

* Tika's JAX-RS server can now return stacktraces for
  parse exceptions (TIKA-1323).

* Added MockParser for testing handling of exceptions, errors
  and hangs in code that uses parsers (TIKA-1553).

* The ForkParser service was removed from Activator; rollback of TIKA-1354.

* Increased the speed of language identification by
  a factor of two -- contributed by Toke Eskildsen (TIKA-1549).

* Added parser for Sqlite3 db files. BEWARE: the org.xerial
  dependency includes native libs. Some users may need to
  exclude this dependency or configure it specially for
  their environment (TIKA-1511).

* Use POST instead of PUT for tika-server form methods
  (TIKA-1547).

* A basic wrapper around the UNIX file command was
  added to extract strings. In addition, a parser to
  handle strings parsing from octet-streams using Latin1
  charsets was added (TIKA-1541, TIKA-1483).

* Add test files and detection mechanism for Gridded
  Binary (GRIB) files (TIKA-1539).

* The RAR parser was updated to handle Chinese characters
  using the functionality provided by allowing encoding to
  be used within ZipArchiveInputStream (TIKA-936).

* Fix out of memory error in surefire plugin (TIKA-1537).

* Build a parser to extract data from GRIB formats (TIKA-1423).

* Upgrade to Commons Compress 1.9 (TIKA-1534).

* Include media duration in metadata parsed by MP4Parser (TIKA-1530).

* Support password protected 7zip files (using a PasswordProvider,
  in keeping with the other password supporting formats) (TIKA-1521).

* Password protected Zip files should not trigger an exception (TIKA-1028).

Release 1.7 - 1/9/2015

* Fixed resource leak in OutlookPSTParser that caused TikaException
  when invoked via AutoDetectParser on Windows (TIKA-1506).

* HTML tags are properly stripped from content by FeedParser
  (TIKA-1500).

* Tika Server support for selecting a single metadata key;
  wrapped MetadataEP into MetadataResource (TIKA-1499).

* Tika Server support for JSON and XMP views of metadata (TIKA-1497).

* Tika Parent uses dependency management to keep duplicate
  dependencies in different modules the same version (TIKA-1384).

* Upgraded slf4j to version 1.7.7 (TIKA-1496).

* Tika Server support for RecursiveParserWrapper's JSON output
  (endpoint=rmeta), equivalent to TIKA-1451's -J option
  in tika-app (TIKA-1498).

* Tika Server support for providing the password for files on a
  per-request basis through the Password http header (TIKA-1494).

* Simple support for the BPG (Better Portable Graphics) image format
  (TIKA-1491, TIKA-1495).

* Prevent exceptions from being thrown for some malformed
  mp3 files (TIKA-1218).

* Reformat pom.xml files to use two spaces per indent (TIKA-1475).

* Fix warning of slf4j logger on Tika Server startup (TIKA-1472).

* Tika CLI and GUI now have option to view JSON rendering of output
  of RecursiveParserWrapper (TIKA-1451).

* Tika now integrates the Geospatial Data Abstraction Library
  (GDAL) for parsing hundreds of geospatial formats (TIKA-605,
  TIKA-1503).

* ExternalParsers can now use regexes to specify dynamic keys
  (TIKA-1441).

* Thread safety issues in ImageMetadataExtractor were resolved
  (TIKA-1369).

* The ForkParser service is now registered in Activator
  (TIKA-1354).

* The Rome Library was upgraded to version 1.5 (TIKA-1435).

* Add markup for files embedded in PDFs (TIKA-1427).

* Extract files embedded in annotations in PDFs (TIKA-1433).

* Upgrade to PDFBox 1.8.8 (TIKA-1419, TIKA-1442).

* Add RecursiveParserWrapper (aka Jukka's and Nick's
  RecursiveMetadataParser) (TIKA-1329).

* Add example for how to dump TikaConfig to XML (TIKA-1418).

* Allow users to specify a tika config file for tika-app (TIKA-1426).

* PackageParser includes the last-modified date from the archive
  in the metadata, when handling embedded entries (TIKA-1246).

* Created a new Tesseract OCR Parser to extract text from images.
  Requires installation of Tesseract before use (TIKA-93).

* Basic parser for older Excel formats, such as Excel 4, 5 and 95,
  which can get simple text, and metadata for Excel 5+95 (TIKA-1490).


Release 1.6 - 08/31/2014

* Parse output should indicate which Parser was actually used
  (TIKA-674).

* Use the forbidden-apis Maven plugin to check for unsafe Java
  operations (TIKA-1387).

* Created an ExternalTranslator class to interface with command
  line Translators (TIKA-1385).

* Created a MosesTranslator as a subclass of ExternalTranslator
  that calls the Moses Decoder machine translation program (TIKA-1385).

* Created the tika-example module. It will have examples of how to
  use the main Tika interfaces (TIKA-1390).

* Upgraded to Commons Compress 1.8.1 (TIKA-1275).

* Upgraded to POI 3.11-beta1 (TIKA-1380).

* Tika now extracts SDTCell content from tables in .docx files (TIKA-1317).

* Tika now supports detection of the Persian/Farsi language
  (TIKA-1337).
v5oOrlFoVc31np7pDFg1nZ8o8Rb//arjsRzPe+oZIA==
=wanG
-----END PGP PUBLIC KEY BLOCK-----
pub 2048R/D4F10117 2015-01-01
uid Tyler Palsulich <tpalsulich@apache.org>
sig 3 D4F10117 2015-01-01 Tyler Palsulich <tpalsulich@apache.org>
sub 2048R/6137D1E6 2015-01-01
sig D4F10117 2015-01-01 Tyler Palsulich <tpalsulich@apache.org>

-----BEGIN PGP PUBLIC KEY BLOCK-----
Version: GnuPG v1

mQENBFSlspUBCADJfADZ0ep3o/wo5sUSHDcFvmcuTRsHZDgsoHrdk83oqsQtHBZK
EQ4KeTbPTONgyNSU13kQDT6BYX3CA4AB9rqSBCI/Gghi56+I4d8mjZODY5bpnILC
vU9FyLsJNdbV8J48+oDF/5LToo5VB8QYslZ8ZZ7DJZvNmh4EovlnP9bVVS4Txk7d
mywSr1MTy5u6lb71oczK95pxO2dRwvJzLcQNTAgh3nrqk1JCLMxJoGGaKKLiGZgF
psn5nusGzOoRHeUa33V3/ms3ZYM6mS/9MmyU5P1zOUZ2Exc9C6Tps0bYbB/oztgM
4bx9NFwpeuILi4OJ/wEIJNp809CXXoYFuWlNABEBAAG0J1R5bGVyIFBhbHN1bGlj
aCA8dHBhbHN1bGljaEBhcGFjaGUub3JnPokBOAQTAQIAIgUCVKWylQIbAwYLCQgH
AwIGFQgCCQoLBBYCAwECHgECF4AACgkQiBC7GdTxARd0nQf/S2yLJ8U7P/Hix5zR
3idwrAmfDtYhUJXuEedKCw9RFnq9Q45hs1zIHVsOtnYaPvyQqSF8rY/E5LR6KJ1W
I1reFc5wKJLfmCWPAJ0Og8U4N1DOwwxESesugUT16iAXQL58xbSAzGJ1/v4L8eTj
P7maZcEdW7FLLTqJFuSfJsu8VowU8pD+v2DGHehARhDyJhhQxrX1Zb1t8vffspXw
bND1CbdB87VZJOj1apRL47nG6Qev7On+XKEXR9tHz/MWdJ/0kyNju6OLcjPJ2QFb
Q/Dwj6VYblvKq5eIYuhSNzbaI2AayZGpC9/PpFSPPWPhqa+eukUoPd3rGEG2PGBh
1shjYLkBDQRUpbKVAQgAsHL1+04Um1nOQJyeBhZ6tIa5VBPvhwk+Gccy3rWFZ66W
4byZ16Hc4tM9mU2CcPpdLYITPJaAEi+T7frXuiJwmVeAe1o9LElVAOGwbDlybv6s
wJvQqnrbwRBQLmblXeSqffAE4bpz4iU4haD2LpyjKNs5D/YS9QfhjuTKh9gGu+uP
DhXmD1hGn0UvDy9GuX6PgWijeOIUlvuZaiN8cZjsG87MLXcLLxbvCZIfrmyheF22
zSYMEvNB3r8dLTnCIt7SqbdGGyyV0kBMQWic2Epk7WzQWNsshCVPhZNkJ4oQN4Yo
AMdGyLHTJ8HvH6L8trDFQEdJrt1lIcLn43lv1AzF9QARAQABiQEfBBgBAgAJBQJU
pbKVAhsMAAoJEIgQuxnU8QEX4+oIALw2qD3KyAKKwHGK8X93woHY19tDH4zCKsQa
r2qXy7aoAsNhERkg24OUkJu0T/c/HzAQPs0RbEZUxqhzsezmJKwey+9TmNsmTcM6
52nVMa5fl7+38A54dqLOtK965ZggSroM6Qyk9lrfsJRQ/4BbNfagsXPP7Fvs1DDe
JcWAy7md7XR9MiVgSQuw040wqSzcSA5M6RCFZ9gN+G0kP1CNZ5vDz+JktV4nJZzh
/i/wH25qTePHz6Clp6mye68cqtCTKX2RF5cTlFCWIqyFYFCfrKCi3LF0bhpWqq7S
JF8xV9E4P/Msl8hqmOOocZ4LDJdw/nt1UWlUmattMLBVWdSeuu0=
=pYQ7
-----END PGP PUBLIC KEY BLOCK-----
14. This Specifications License Agreement reflects the entire agreement of the parties regarding the subject matter hereof and supersedes all prior agreements or representations regarding such matters, whether written or oral. To the extent any portion or provision of this Specifications License Agreement is found to be illegal or unenforceable, then the remaining provisions of this Specifications License Agreement will remain in full force and effect and the illegal or unenforceable provision will be construed to give it such effect as it may properly have that is consistent with the intentions of the parties.
15. This Specifications License Agreement may only be modified in writing signed by an authorized representative of the IPTC.
16. This Specifications License Agreement is governed by the law of United Kingdom, as such law is applied to contracts made and fully performed in the United Kingdom. Any disputes arising from or relating to this Specifications License Agreement will be resolved in the courts of the United Kingdom. You consent to the jurisdiction of such courts over you and covenant not to assert before such courts any objection to proceeding in such forums.


JUnRAR (https://github.com/edmund-wagner/junrar/)

JUnRAR is based on the UnRAR tool, and covered by the same license.
It was formerly available from http://java-unrar.svn.sourceforge.net/

****** ***** ****** UnRAR - free utility for RAR archives
** ** ** ** ** ** ~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
****** ******* ****** License for use and distribution of
** ** ** ** ** ** ~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
** ** ** ** ** ** FREE portable version
~~~~~~~~~~~~~~~~~~~~~

The source code of UnRAR utility is freeware. This means:

1. All copyrights to RAR and the utility UnRAR are exclusively
owned by the author - Alexander Roshal.

2. The UnRAR sources may be used in any software to handle RAR
archives without limitations free of charge, but cannot be used
to re-create the RAR compression algorithm, which is proprietary.
Distribution of modified UnRAR sources in separate form or as a
part of other software is permitted, provided that it is clearly
stated in the documentation and source comments that the code may
not be used to develop a RAR (WinRAR) compatible archiver.

3. The UnRAR utility may be freely distributed. It is allowed
to distribute UnRAR inside of other software packages.

4. THE RAR ARCHIVER AND THE UnRAR UTILITY ARE DISTRIBUTED "AS IS".
NO WARRANTY OF ANY KIND IS EXPRESSED OR IMPLIED. YOU USE AT
YOUR OWN RISK. THE AUTHOR WILL NOT BE LIABLE FOR DATA LOSS,
DAMAGES, LOSS OF PROFITS OR ANY OTHER KIND OF LOSS WHILE USING
OR MISUSING THIS SOFTWARE.

5. Installing and using the UnRAR utility signifies acceptance of
these terms and conditions of the license.

6. If you don't agree with terms of the license you must remove
UnRAR files from your storage devices and cease to use the
utility.

Thank you for your interest in RAR and UnRAR. Alexander L. Roshal

Sqlite (bundled in org.xerial's sqlite-jdbc)
This product bundles Sqlite, which is in the Public Domain. For details
see: https://www.sqlite.org/copyright.html
Apache Tika
Copyright 2015 The Apache Software Foundation

This product includes software developed at
The Apache Software Foundation (http://www.apache.org/).

Copyright 1993-2010 University Corporation for Atmospheric Research/Unidata
This software contains code derived from UCAR/Unidata's NetCDF library.

Tika-server component uses CDDL-licensed dependencies: jersey (http://jersey.java.net/) and
Grizzly (http://grizzly.java.net/)

Tika-parsers component uses CDDL/LGPL dual-licensed dependency: jhighlight (https://github.com/codelibs/jhighlight)

OpenCSV: Copyright 2005 Bytecode Pty Ltd. Licensed under the Apache License, Version 2.0
Getting Started
---------------

Tika is based on Java 6 and uses the [Maven 3](http://maven.apache.org) build system. To build Tika, use the following command in this directory:

    mvn clean install

The build consists of a number of components, including a standalone runnable jar that you can use to try out Tika features. You can run it like this:

    java -jar tika-app/target/tika-app-*.jar --help

Contributing via Github
=======================
To contribute a patch, follow these instructions (note that installing
[Hub](http://hub.github.com) is not strictly required, but is recommended).

```
0. Download and install hub.github.com
1. File JIRA issue for your fix at https://issues.apache.org/jira/browse/TIKA
   - you will get issue id TIKA-xxx where xxx is the issue ID.
2. git clone http://github.com/apache/tika.git
3. cd tika
4. git checkout -b TIKA-xxx
5. edit files
6. git status (make sure it shows what files you expected to edit)
7. git add <files>
8. git commit -m "fix for TIKA-xxx contributed by <your username>"
9. git fork
10. git push -u <your git username> TIKA-xxx
11. git pull-request
```

License (see also LICENSE.txt)
------------------------------
   <parent>
     <groupId>org.apache.tika</groupId>
     <artifactId>tika-parent</artifactId>
-    <version>1.6</version>
+    <version>1.8</version>
     <relativePath>tika-parent/pom.xml</relativePath>
   </parent>

   <scm>
     <connection>
-      scm:svn:http://svn.apache.org/repos/asf/tika/tags/1.6
+      scm:svn:http://svn.apache.org/repos/asf/tika/tags/1.8-rc2
     </connection>
     <developerConnection>
-      scm:svn:https://svn.apache.org/repos/asf/tika/tags/1.6
+      scm:svn:https://svn.apache.org/repos/asf/tika/tags/1.8-rc2
     </developerConnection>
-    <url>http://svn.apache.org/viewvc/tika/tags/1.6</url>
+    <url>http://svn.apache.org/viewvc/tika/tags/1.8-rc2</url>
   </scm>

   <modules>
     <module>tika-parsers</module>
     <module>tika-xmp</module>
     <module>tika-serialization</module>
+    <module>tika-batch</module>
     <module>tika-app</module>
     <module>tika-bundle</module>
     <module>tika-server</module>
     <module>tika-translate</module>
+    <module>tika-example</module>
   </modules>

-  <build>
-    <plugins>
-      <plugin>
-        <artifactId>maven-deploy-plugin</artifactId>
-        <configuration>
-          <skip>true</skip> <!-- No need to deploy the reactor -->
-        </configuration>
-      </plugin>
-      <plugin>
-        <artifactId>maven-site-plugin</artifactId>
-        <configuration>
-          <templateDirectory>src/site</templateDirectory>
-          <template>site.vm</template>
-        </configuration>
-      </plugin>
-      <plugin>
-        <groupId>org.apache.rat</groupId>
-        <artifactId>apache-rat-plugin</artifactId>
-        <configuration>
-          <excludes>
-            <exclude>.*/**</exclude>
-            <exclude>CHANGES.txt</exclude>
-            <exclude>tika-dotnet/AssemblyInfo.cs</exclude>
-            <exclude>tika-dotnet/Tika.csproj</exclude>
-            <exclude>tika-dotnet/Tika.sln</exclude>
-            <exclude>tika-dotnet/Tika.sln.cache</exclude>
-            <exclude>tika-dotnet/obj/**</exclude>
-            <exclude>tika-dotnet/target/**</exclude>
-          </excludes>
-        </configuration>
-      </plugin>
-    </plugins>
-  </build>

   <profiles>
     <profile>

       <fileset dir="${basedir}">
         <include name="CHANGES.txt" />
         <include name="target/*-src.zip*" />
-        <include name="tika-app/target/*-${project.version}.jar*" />
+        <include name="tika-app/target/tika-app-${project.version}.jar*" />
+        <include name="tika-server/target/tika-server-${project.version}.jar*" />
       </fileset>
     </copy>
     <checksum algorithm="MD5" fileext=".md5">

     <echo file="${basedir}/target/vote.txt">
 From: ${username}@apache.org
 To: dev@tika.apache.org
-Subject: [VOTE] Release Apache Tika ${project.version}
+user@tika.apache.org
+Subject: [VOTE] Release Apache Tika ${project.version} Candidate #N

 A candidate for the Tika ${project.version} release is available at:
-
-  http://people.apache.org/~${username}/tika/${project.version}/
+  https://dist.apache.org/repos/dist/dev/tika/

 The release candidate is a zip archive of the sources in:
-
-  http://svn.apache.org/repos/asf/tika/tags/${project.version}/
-
-The SHA1 checksum of the archive is ${checksum}.
+  http://svn.apache.org/repos/asf/tika/tags/${project.version}-rcN/
+
+The SHA1 checksum of the archive is
+${checksum}.
+
+In addition, a staged maven repository is available here:
+  https://repository.apache.org/content/repositories/orgapachetika-.../org/apache/tika

 Please vote on releasing this package as Apache Tika ${project.version}.
 The vote is open for the next 72 hours and passes if a majority of at
 least three +1 Tika PMC votes are cast.

  [ ] +1 Release this package as Apache Tika ${project.version}
  [ ] -1 Do not release this package because...${line.separator}
     </echo>
     <echo />
     <echo>
 The release candidate has been prepared in:

   ${basedir}/target/${project.version}

 Please deploy it to people.apache.org like this:

   scp -r ${basedir}/target/${project.version} people.apache.org:public_html/tika/

 A release vote template has been generated for you:

   file://${basedir}/target/vote.txt
     </echo>
     <echo />
   </tasks>

   </executions>
   <dependencies>
     <dependency>
       <groupId>org.apache.ant</groupId>
       <artifactId>ant-nodeps</artifactId>
       <version>1.8.1</version>
     </dependency>
   </dependencies>
   </plugin>
   </plugins>

   </profile>
   </profiles>

-  <description>The Apache Tika™ toolkit detects and extracts metadata and structured text content from various documents using existing parser libraries. </description>
+  <description>The Apache Tika™ toolkit detects and extracts metadata and structured text content from various documents
+    using existing parser libraries.
+  </description>
   <organization>
     <name>The Apache Software Foundation</name>
     <url>http://www.apache.org</url>
   </organization>
   <issueManagement>
     <system>JIRA</system>
     <url>https://issues.apache.org/jira/browse/TIKA</url>
   </issueManagement>
   <ciManagement>
     <system>Jenkins</system>
     <url>https://builds.apache.org/job/Tika-trunk/</url>
   </ciManagement>
 </project>
   <parent>
     <groupId>org.apache.tika</groupId>
     <artifactId>tika-parent</artifactId>
-    <version>1.6</version>
+    <version>1.8</version>
     <relativePath>../tika-parent/pom.xml</relativePath>
   </parent>

     <dependency>
       <groupId>${project.groupId}</groupId>
       <artifactId>tika-parsers</artifactId>
       <version>${project.version}</version>
+      <exclusions>
+        <exclusion>
+          <groupId>commons-logging</groupId>
+          <artifactId>commons-logging</artifactId>
+        </exclusion>
+        <exclusion>
+          <groupId>commons-logging</groupId>
+          <artifactId>commons-logging-api</artifactId>
+        </exclusion>
+      </exclusions>
     </dependency>
     <dependency>
       <groupId>${project.groupId}</groupId>

       <artifactId>tika-xmp</artifactId>
       <version>${project.version}</version>
     </dependency>
+    <dependency>
+      <groupId>${project.groupId}</groupId>
+      <artifactId>tika-batch</artifactId>
+      <version>${project.version}</version>
+    </dependency>
+
     <dependency>
       <groupId>org.slf4j</groupId>
       <artifactId>slf4j-log4j12</artifactId>
-      <version>1.5.6</version>
-    </dependency>
-    <dependency>
-      <groupId>com.google.code.gson</groupId>
-      <artifactId>gson</artifactId>
-      <version>1.7.1</version>
     </dependency>
+    <dependency>
+      <groupId>org.slf4j</groupId>
+      <artifactId>jul-to-slf4j</artifactId>
+    </dependency>
+    <dependency>
+      <groupId>org.slf4j</groupId>
+      <artifactId>jcl-over-slf4j</artifactId>
+    </dependency>
+    <dependency>
+      <groupId>log4j</groupId>
+      <artifactId>log4j</artifactId>
+      <version>1.2.17</version>
+    </dependency>
+
     <dependency>
       <groupId>junit</groupId>
       <artifactId>junit</artifactId>
-      <scope>test</scope>
-      <version>4.11</version>
     </dependency>
     <dependency>
       <artifactId>commons-io</artifactId>

       <exclude>CHANGES</exclude>
       <exclude>README</exclude>
       <exclude>builddef.lst</exclude>
+      <!-- clutter not needed in jar -->
+      <exclude>resources/grib1/nasa/README*.pdf</exclude>
+      <exclude>resources/grib1/**/readme*.txt</exclude>
+      <exclude>resources/grib2/**/readme*.txt</exclude>
       <!-- TIKA-763: Workaround to avoid including LGPL classes -->
       <exclude>ucar/nc2/iosp/fysat/Fysat*.class</exclude>
       <exclude>ucar/nc2/dataset/transform/VOceanSG1*class</exclude>
       <exclude>ucar/unidata/geoloc/vertical/OceanSG*.class</exclude>
+
     </excludes>
   </filter>
   </filters>

   </profiles>

   <organization>
     <name>The Apache Software Foundation</name>
     <url>http://www.apache.org</url>
   </organization>
   <scm>
-    <url>http://svn.apache.org/viewvc/tika/tags/1.6/tika-app</url>
-    <connection>scm:svn:http://svn.apache.org/repos/asf/tika/tags/1.6/tika-app</connection>
-    <developerConnection>scm:svn:https://svn.apache.org/repos/asf/tika/tags/1.6/tika-app</developerConnection>
+    <url>http://svn.apache.org/viewvc/tika/tags/1.8-rc2/tika-app</url>
+    <connection>scm:svn:http://svn.apache.org/repos/asf/tika/tags/1.8-rc2/tika-app</connection>
+    <developerConnection>scm:svn:https://svn.apache.org/repos/asf/tika/tags/1.8-rc2/tika-app</developerConnection>
   </scm>
   <issueManagement>
     <system>JIRA</system>
     <url>https://issues.apache.org/jira/browse/TIKA</url>
   </issueManagement>
   <ciManagement>
     <system>Jenkins</system>
     <url>https://builds.apache.org/job/Tika-trunk/</url>
   </ciManagement>
 </project>
0 /*
1 * Licensed to the Apache Software Foundation (ASF) under one or more
2 * contributor license agreements. See the NOTICE file distributed with
3 * this work for additional information regarding copyright ownership.
4 * The ASF licenses this file to You under the Apache License, Version 2.0
5 * (the "License"); you may not use this file except in compliance with
6 * the License. You may obtain a copy of the License at
7 *
8 * http://www.apache.org/licenses/LICENSE-2.0
9 *
10 * Unless required by applicable law or agreed to in writing, software
11 * distributed under the License is distributed on an "AS IS" BASIS,
12 * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
13 * See the License for the specific language governing permissions and
14 * limitations under the License.
15 */
16
17 package org.apache.tika.cli;
18
19 import java.io.File;
20 import java.io.IOException;
21 import java.util.ArrayList;
22 import java.util.LinkedHashMap;
23 import java.util.List;
24 import java.util.Map;
25 import java.util.regex.Matcher;
26 import java.util.regex.Pattern;
27
28 /**
29 * This takes a TikaCLI commandline and builds the full commandline for
30 * org.apache.tika.batch.fs.FSBatchProcessCLI.
31 * <p>
32 * The "default" batch config file that this relies on
33 * if no batch config file is specified on the commandline
34 * is: tika-batch/src/main/resources/.../default-tika-batch-config.xml
35 */
36 class BatchCommandLineBuilder {
37
38 static Pattern JVM_OPTS_PATTERN = Pattern.compile("^(--?)J(.+)");
39
40 protected static String[] build(String[] args) throws IOException {
41 Map<String, String> processArgs = new LinkedHashMap<String, String>();
42 Map<String, String> jvmOpts = new LinkedHashMap<String,String>();
43 //take the args, and divide them into process args and options for
44 //the child jvm process (i.e. log files, etc)
45 mapifyArgs(args, processArgs, jvmOpts);
46
47 //now modify processArgs in place
48 translateCommandLine(args, processArgs);
49
50 //maybe the user specified a different classpath?!
51 if (! jvmOpts.containsKey("-cp") && ! jvmOpts.containsKey("--classpath")) {
52 String cp = System.getProperty("java.class.path");
53 //need to test for " " on *nix, can't just add double quotes
54 //across platforms.
55 if (cp.contains(" ")){
56 cp = "\""+cp+"\"";
57 }
58 jvmOpts.put("-cp", cp);
59 }
60
61 boolean hasLog4j = false;
62 for (String k : jvmOpts.keySet()) {
63 if (k.startsWith("-Dlog4j.configuration=")) {
64 hasLog4j = true;
65 break;
66 }
67 }
68 //use the log4j config file inside the app /resources/log4j_batch_process.properties
69 if (! hasLog4j) {
70 jvmOpts.put("-Dlog4j.configuration=\"log4j_batch_process.properties\"", "");
71 }
72 //now build the full command line
73 List<String> fullCommand = new ArrayList<String>();
74 fullCommand.add("java");
75 for (Map.Entry<String, String> e : jvmOpts.entrySet()) {
76 fullCommand.add(e.getKey());
77 if (e.getValue().length() > 0) {
78 fullCommand.add(e.getValue());
79 }
80 }
81 fullCommand.add("org.apache.tika.batch.fs.FSBatchProcessCLI");
82 //now add the process commands
83 for (Map.Entry<String, String> e : processArgs.entrySet()) {
84 fullCommand.add(e.getKey());
85 if (e.getValue().length() > 0) {
86 fullCommand.add(e.getValue());
87 }
88 }
89 return fullCommand.toArray(new String[fullCommand.size()]);
90 }
91
92
93 /**
94 * Take the input args and separate them into args that belong on the commandline
95 * and those that belong as jvm args for the child process.
96 * @param args -- literal args from TikaCLI commandline
97 * @param commandLine args that should be part of the batch commandline
98 * @param jvmArgs args that belong as jvm arguments for the child process
99 */
100 private static void mapifyArgs(final String[] args,
101 final Map<String, String> commandLine,
102 final Map<String, String> jvmArgs) {
103
104 if (args.length == 0) {
105 return;
106 }
107
108 Matcher matcher = JVM_OPTS_PATTERN.matcher("");
109 for (int i = 0; i < args.length; i++) {
110 if (matcher.reset(args[i]).find()) {
111 String jvmArg = matcher.group(1)+matcher.group(2);
112 String v = "";
113 if (i < args.length-1 && ! args[i+1].startsWith("-")){
114 v = args[i+1];
115 i++;
116 }
117 jvmArgs.put(jvmArg, v);
118 } else if (args[i].startsWith("-")) {
119 String k = args[i];
120 String v = "";
121 if (i < args.length-1 && ! args[i+1].startsWith("-")){
122 v = args[i+1];
123 i++;
124 }
125 commandLine.put(k, v);
126 }
127 }
128 }
129
130 private static void translateCommandLine(String[] args, Map<String, String> map) throws IOException {
131 //if there are only two args and they are both directories, treat the first
132 //as input and the second as output.
133 if (args.length == 2 && !args[0].startsWith("-") && ! args[1].startsWith("-")) {
134 File candInput = new File(args[0]);
135 File candOutput = new File(args[1]);
136 if (candOutput.isFile()) {
137 throw new IllegalArgumentException("Can't specify an existing file as the "+
138 "second argument for the output directory of a batch process");
139 }
140
141 if (candInput.isDirectory()){
142 map.put("-inputDir", args[0]);
143 map.put("-outputDir", args[1]);
144 }
145 }
146 //look for tikaConfig
147 for (String arg : args) {
148 if (arg.startsWith("--config=")) {
149 String configPath = arg.substring("--config=".length());
150 map.put("-c", configPath);
151 break;
152 }
153 }
154 //now translate output types
155 if (map.containsKey("-h") || map.containsKey("--html")) {
156 map.remove("-h");
157 map.remove("--html");
158 map.put("-basicHandlerType", "html");
159 map.put("-outputSuffix", "html");
160 } else if (map.containsKey("-x") || map.containsKey("--xml")) {
161 map.remove("-x");
162 map.remove("--xml");
163 map.put("-basicHandlerType", "xml");
164 map.put("-outputSuffix", "xml");
165 } else if (map.containsKey("-t") || map.containsKey("--text")) {
166 map.remove("-t");
167 map.remove("--text");
168 map.put("-basicHandlerType", "text");
169 map.put("-outputSuffix", "txt");
170 } else if (map.containsKey("-m") || map.containsKey("--metadata")) {
171 map.remove("-m");
172 map.remove("--metadata");
173 map.put("-basicHandlerType", "ignore");
174 map.put("-outputSuffix", "json");
175 } else if (map.containsKey("-T") || map.containsKey("--text-main")) {
176 map.remove("-T");
177 map.remove("--text-main");
178 map.put("-basicHandlerType", "body");
179 map.put("-outputSuffix", "txt");
180 }
181
182 if (map.containsKey("-J") || map.containsKey("--jsonRecursive")) {
183 map.remove("-J");
184 map.remove("--jsonRecursive");
185 map.put("-recursiveParserWrapper", "true");
186 //overwrite outputSuffix
187 map.put("-outputSuffix", "json");
188 }
189
190 if (map.containsKey("--inputDir") || map.containsKey("-i")) {
191 String v1 = map.remove("--inputDir");
192 String v2 = map.remove("-i");
193 String v = (v1 == null) ? v2 : v1;
194 map.put("-inputDir", v);
195 }
196
197 if (map.containsKey("--outputDir") || map.containsKey("-o")) {
198 String v1 = map.remove("--outputDir");
199 String v2 = map.remove("-o");
200 String v = (v1 == null) ? v2 : v1;
201 map.put("-outputDir", v);
202 }
203
204 }
205 }
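The output-type translation in translateCommandLine above can be exercised as a standalone sketch. This is a hypothetical, trimmed re-implementation (class and method names are not from Tika; only the -h/--html and -t/--text cases are shown) illustrating how each short/long flag pair collapses into a -basicHandlerType and -outputSuffix entry:

```java
import java.util.HashMap;
import java.util.Map;

public class FlagTranslationDemo {

    // Trimmed sketch of the flag translation: remove both spellings of the
    // flag, then record the handler type and output suffix it implies.
    static void translate(Map<String, String> map) {
        if (map.containsKey("-h") || map.containsKey("--html")) {
            map.remove("-h");
            map.remove("--html");
            map.put("-basicHandlerType", "html");
            map.put("-outputSuffix", "html");
        } else if (map.containsKey("-t") || map.containsKey("--text")) {
            map.remove("-t");
            map.remove("--text");
            map.put("-basicHandlerType", "text");
            map.put("-outputSuffix", "txt");
        }
    }

    public static void main(String[] args) {
        Map<String, String> map = new HashMap<String, String>();
        map.put("-t", null); // flags parsed earlier carry no value
        translate(map);
        System.out.println(map.get("-basicHandlerType") + " " + map.get("-outputSuffix"));
    }
}
```

Running this prints `text txt`, matching the pair the batch driver would receive for `-t`.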
1515 */
1616 package org.apache.tika.cli;
1717
18 import javax.xml.transform.OutputKeys;
19 import javax.xml.transform.TransformerConfigurationException;
20 import javax.xml.transform.sax.SAXTransformerFactory;
21 import javax.xml.transform.sax.TransformerHandler;
22 import javax.xml.transform.stream.StreamResult;
23 import java.io.BufferedReader;
1824 import java.io.File;
25 import java.io.FileInputStream;
1926 import java.io.FileOutputStream;
2027 import java.io.IOException;
2128 import java.io.InputStream;
29 import java.io.InputStreamReader;
2230 import java.io.OutputStream;
2331 import java.io.OutputStreamWriter;
2432 import java.io.PrintStream;
3038 import java.net.Socket;
3139 import java.net.URI;
3240 import java.net.URL;
41 import java.nio.charset.Charset;
3342 import java.util.Arrays;
3443 import java.util.Comparator;
44 import java.util.Enumeration;
3545 import java.util.HashMap;
3646 import java.util.HashSet;
3747 import java.util.List;
48 import java.util.Locale;
49 import java.util.Map;
3850 import java.util.Map.Entry;
39 import java.util.Map;
4051 import java.util.Set;
41 import javax.xml.transform.OutputKeys;
42 import javax.xml.transform.TransformerConfigurationException;
43 import javax.xml.transform.sax.SAXTransformerFactory;
44 import javax.xml.transform.sax.TransformerHandler;
45 import javax.xml.transform.stream.StreamResult;
52 import java.util.TreeSet;
4653
4754 import org.apache.commons.logging.Log;
4855 import org.apache.commons.logging.LogFactory;
49 import org.apache.log4j.BasicConfigurator;
5056 import org.apache.log4j.Level;
57 import org.apache.log4j.LogManager;
5158 import org.apache.log4j.Logger;
52 import org.apache.log4j.SimpleLayout;
53 import org.apache.log4j.WriterAppender;
59 import org.apache.log4j.PropertyConfigurator;
5460 import org.apache.poi.poifs.filesystem.DirectoryEntry;
5561 import org.apache.poi.poifs.filesystem.DocumentEntry;
5662 import org.apache.poi.poifs.filesystem.DocumentInputStream;
5763 import org.apache.poi.poifs.filesystem.POIFSFileSystem;
5864 import org.apache.tika.Tika;
65 import org.apache.tika.batch.BatchProcessDriverCLI;
5966 import org.apache.tika.config.TikaConfig;
6067 import org.apache.tika.detect.CompositeDetector;
6168 import org.apache.tika.detect.DefaultDetector;
6572 import org.apache.tika.fork.ForkParser;
6673 import org.apache.tika.gui.TikaGUI;
6774 import org.apache.tika.io.CloseShieldInputStream;
75 import org.apache.tika.io.FilenameUtils;
6876 import org.apache.tika.io.IOUtils;
6977 import org.apache.tika.io.TikaInputStream;
70 import org.apache.tika.io.json.JsonMetadataSerializer;
7178 import org.apache.tika.language.LanguageProfilerBuilder;
7279 import org.apache.tika.language.ProfilingHandler;
7380 import org.apache.tika.metadata.Metadata;
7481 import org.apache.tika.metadata.serialization.JsonMetadata;
82 import org.apache.tika.metadata.serialization.JsonMetadataList;
7583 import org.apache.tika.mime.MediaType;
7684 import org.apache.tika.mime.MediaTypeRegistry;
85 import org.apache.tika.mime.MimeType;
7786 import org.apache.tika.mime.MimeTypeException;
87 import org.apache.tika.mime.MimeTypes;
7888 import org.apache.tika.parser.AutoDetectParser;
7989 import org.apache.tika.parser.CompositeParser;
8090 import org.apache.tika.parser.NetworkParser;
8292 import org.apache.tika.parser.Parser;
8393 import org.apache.tika.parser.ParserDecorator;
8494 import org.apache.tika.parser.PasswordProvider;
95 import org.apache.tika.parser.RecursiveParserWrapper;
8596 import org.apache.tika.parser.html.BoilerpipeContentHandler;
97 import org.apache.tika.sax.BasicContentHandlerFactory;
8698 import org.apache.tika.sax.BodyContentHandler;
99 import org.apache.tika.sax.ContentHandlerFactory;
87100 import org.apache.tika.sax.ExpandedTitleContentHandler;
88101 import org.apache.tika.xmp.XMPMetadata;
89102 import org.xml.sax.ContentHandler;
90103 import org.xml.sax.SAXException;
91104 import org.xml.sax.helpers.DefaultHandler;
92 import org.apache.tika.io.FilenameUtils;
93105
94106 /**
95107 * Simple command line interface for Apache Tika.
100112 private static final Log logger = LogFactory.getLog(TikaCLI.class);
101113
102114 public static void main(String[] args) throws Exception {
103 BasicConfigurator.configure(
104 new WriterAppender(new SimpleLayout(), System.err));
105 Logger.getRootLogger().setLevel(Level.INFO);
106115
107116 TikaCLI cli = new TikaCLI();
117 if (! isConfigured()) {
118 PropertyConfigurator.configure(cli.getClass().getResourceAsStream("/log4j.properties"));
119 }
120
121 if (cli.testForHelp(args)) {
122 cli.usage();
123 return;
124 } else if (cli.testForBatch(args)) {
125 String[] batchArgs = BatchCommandLineBuilder.build(args);
126 BatchProcessDriverCLI batchDriver = new BatchProcessDriverCLI(batchArgs);
127 batchDriver.execute();
128 return;
129 }
130
108131 if (args.length > 0) {
109132 for (int i = 0; i < args.length; i++) {
110133 cli.process(args[i]);
127150 }
128151 }
129152
153 private static boolean isConfigured() {
154 //Borrowed from: http://wiki.apache.org/logging-log4j/UsefulCode
155 Enumeration appenders = LogManager.getRootLogger().getAllAppenders();
156 if (appenders.hasMoreElements()) {
157 return true;
158 }
159 else {
160 Enumeration loggers = LogManager.getCurrentLoggers() ;
161 while (loggers.hasMoreElements()) {
162 Logger c = (Logger) loggers.nextElement();
163 if (c.getAllAppenders().hasMoreElements())
164 return true;
165 }
166 }
167 return false;
168 }
130169 private class OutputType {
131170
132171 public void process(
276315
277316 private Parser parser;
278317
318 private String configFilePath;
319
279320 private OutputType type = XML;
321
322 private boolean recursiveJSON = false;
280323
281324 private LanguageProfilerBuilder ngp = null;
282325
323366 Logger.getRootLogger().setLevel(Level.DEBUG);
324367 } else if (arg.equals("-g") || arg.equals("--gui")) {
325368 pipeMode = false;
326 TikaGUI.main(new String[0]);
369 if (configFilePath != null){
370 TikaGUI.main(new String[]{configFilePath});
371 } else {
372 TikaGUI.main(new String[0]);
373 }
327374 } else if (arg.equals("--list-parser") || arg.equals("--list-parsers")) {
328375 pipeMode = false;
329376 displayParsers(false, false);
342389 } else if(arg.equals("--list-supported-types")){
343390 pipeMode = false;
344391 displaySupportedTypes();
392 } else if (arg.startsWith("--compare-file-magic=")) {
393 pipeMode = false;
394 compareFileMagic(arg.substring(arg.indexOf('=')+1));
345395 } else if (arg.equals("--container-aware")
346396 || arg.equals("--container-aware-detector")) {
347397 // ignore, as container-aware detectors are now always used
348398 } else if (arg.equals("-f") || arg.equals("--fork")) {
349399 fork = true;
400 } else if (arg.startsWith("--config=")) {
401 configure(arg.substring("--config=".length()));
350402 } else if (arg.startsWith("-e")) {
351403 encoding = arg.substring("-e".length());
352404 } else if (arg.startsWith("--encoding=")) {
357409 password = arg.substring("--password=".length());
358410 } else if (arg.equals("-j") || arg.equals("--json")) {
359411 type = JSON;
360 } else if (arg.equals("-y") || arg.equals("--xmp")) {
412 } else if (arg.equals("-J") || arg.equals("--jsonRecursive")) {
413 recursiveJSON = true;
414 } else if (arg.equals("-y") || arg.equals("--xmp")) {
361415 type = XMP;
362416 } else if (arg.equals("-x") || arg.equals("--xml")) {
363417 type = XML;
413467 } else {
414468 url = new URL(arg);
415469 }
416 Metadata metadata = new Metadata();
417 InputStream input = TikaInputStream.get(url, metadata);
418 try {
419 type.process(input, System.out, metadata);
420 } finally {
421 input.close();
422 System.out.flush();
423 }
424 }
425 }
426 }
427
470 if (recursiveJSON) {
471 handleRecursiveJson(url, System.out);
472 } else {
473 Metadata metadata = new Metadata();
474 InputStream input = TikaInputStream.get(url, metadata);
475 try {
476 type.process(input, System.out, metadata);
477 } finally {
478 input.close();
479 System.out.flush();
480 }
481 }
482 }
483 }
484 }
485
486 private void handleRecursiveJson(URL url, OutputStream output) throws IOException, SAXException, TikaException {
487 Metadata metadata = new Metadata();
488 InputStream input = TikaInputStream.get(url, metadata);
489 RecursiveParserWrapper wrapper = new RecursiveParserWrapper(parser, getContentHandlerFactory(type));
490 try {
491 wrapper.parse(input, null, metadata, context);
492 } finally {
493 input.close();
494 }
495 JsonMetadataList.setPrettyPrinting(prettyPrint);
496 Writer writer = getOutputWriter(output, encoding);
497 try {
498 JsonMetadataList.toJson(wrapper.getMetadata(), writer);
499 } finally {
500 writer.flush();
501 }
502 }
503
504 private ContentHandlerFactory getContentHandlerFactory(OutputType type) {
505 BasicContentHandlerFactory.HANDLER_TYPE handlerType = BasicContentHandlerFactory.HANDLER_TYPE.IGNORE;
506 if (type.equals(HTML)) {
507 handlerType = BasicContentHandlerFactory.HANDLER_TYPE.HTML;
508 } else if (type.equals(XML)) {
509 handlerType = BasicContentHandlerFactory.HANDLER_TYPE.XML;
510 } else if (type.equals(TEXT)) {
511 handlerType = BasicContentHandlerFactory.HANDLER_TYPE.TEXT;
512 } else if (type.equals(TEXT_MAIN)) {
513 handlerType = BasicContentHandlerFactory.HANDLER_TYPE.BODY;
514 } else if (type.equals(METADATA)) {
515 handlerType = BasicContentHandlerFactory.HANDLER_TYPE.IGNORE;
516 }
517 return new BasicContentHandlerFactory(handlerType, -1);
518 }
428519 private void usage() {
429520 PrintStream out = System.out;
430521 out.println("usage: java -jar tika-app.jar [option...] [file|port...]");
438529 out.println(" -s or --server Start the Apache Tika server");
439530 out.println(" -f or --fork Use Fork Mode for out-of-process extraction");
440531 out.println();
532 out.println(" --config=<tika-config.xml>");
533 out.println(" TikaConfig file. Must be specified before -g, -s or -f!");
534 out.println("");
441535 out.println(" -x or --xml Output XHTML content (default)");
442536 out.println(" -h or --html Output HTML content");
443537 out.println(" -t or --text Output plain text content");
445539 out.println(" -m or --metadata Output only metadata");
446540 out.println(" -j or --json Output metadata in JSON");
447541 out.println(" -y or --xmp Output metadata in XMP");
542 out.println(" -J or --jsonRecursive Output metadata and content from all");
543 out.println(" embedded files (choose content type");
544 out.println(" with -x, -h, -t or -m; default is -x)");
448545 out.println(" -l or --language Output only language");
449546 out.println(" -d or --detect Detect document type");
450547 out.println(" -eX or --encoding=X Use output encoding X");
451548 out.println(" -pX or --password=X Use document password X");
452549 out.println(" -z or --extract Extract all attachments into current directory");
453550 out.println(" --extract-dir=<dir> Specify target directory for -z");
454 out.println(" -r or --pretty-print For XML and XHTML outputs, adds newlines and");
551 out.println(" -r or --pretty-print For JSON, XML and XHTML outputs, adds newlines and");
455552 out.println(" whitespace, for better readability");
456553 out.println();
457554 out.println(" --create-profile=X");
469566 out.println(" --list-supported-types");
470567 out.println(" List all known media types and related information");
471568 out.println();
569 out.println();
570 out.println(" --compare-file-magic=<dir>");
571 out.println(" Compares Tika's known media types to the File(1) tool's magic directory");
472572 out.println("Description:");
473573 out.println(" Apache Tika will parse the file(s) specified on the");
474574 out.println(" command line and output the extracted text content");
495595 out.println(" Apache Tika server. The server will listen to the");
496596 out.println(" ports you specify as one or more arguments.");
497597 out.println();
598 out.println("- Batch mode");
599 out.println();
600 out.println(" Simplest method.");
601 out.println(" Specify two directories as args with no other args:");
602 out.println(" java -jar tika-app.jar <inputDirectory> <outputDirectory>");
603 out.println();
604 out.println("Batch Options:");
605 out.println(" -i or --inputDir Input directory");
606 out.println(" -o or --outputDir Output directory");
607 out.println(" -numConsumers Number of processing threads");
608 out.println(" -bc Batch config file");
609 out.println(" -maxRestarts Maximum number of times the ");
610 out.println(" watchdog process will restart the child process.");
611 out.println(" -timeoutThresholdMillis Number of milliseconds allowed to a parse");
612 out.println(" before the process is killed and restarted");
613 out.println(" -fileList List of files to process, with");
614 out.println(" paths relative to the input directory");
615 out.println(" -includeFilePat Regular expression to determine which");
616 out.println(" files to process, e.g. \"(?i)\\.pdf\"");
617 out.println(" -excludeFilePat Regular expression to determine which");
618 out.println(" files to avoid processing, e.g. \"(?i)\\.pdf\"");
619 out.println(" -maxFileSizeBytes Skip files larger than this value");
620 out.println();
621 out.println(" Control the type of output with -x, -h, -t and/or -J.");
622 out.println();
623 out.println(" To modify child process jvm args, prepend \"J\" as in:");
624 out.println(" -JXmx4g or -JDlog4j.configuration=file:log4j.xml.");
498625 }
499626
500627 private void version() {
501628 System.out.println(new Tika().toString());
629 }
630
631 private boolean testForHelp(String[] args) {
632 for (String s : args) {
633 if (s.equals("-?") || s.equals("--help")) {
634 return true;
635 }
636 }
637 return false;
638 }
639
640 private boolean testForBatch(String[] args) {
641 if (args.length == 2 && ! args[0].startsWith("-")
642 && ! args[1].startsWith("-")) {
643 File inputCand = new File(args[0]);
644 File outputCand = new File(args[1]);
645 if (inputCand.isDirectory() && !outputCand.isFile()) {
646 return true;
647 }
648 }
649
650 for (String s : args) {
651 if (s.equals("-inputDir") || s.equals("--inputDir") || s.equals("-i")) {
652 return true;
653 }
654 }
655 return false;
656 }
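The batch-detection heuristic in testForBatch can be tried in isolation. This sketch (hypothetical class name, same two-part logic) answers true when an explicit input-directory flag is present and false for an ordinary single-file invocation; the two-bare-directory branch is kept but not exercised here because it touches the filesystem:

```java
import java.io.File;

public class BatchDetectDemo {

    // Sketch of the heuristic: two bare directory-ish arguments, or an
    // explicit -i/-inputDir/--inputDir flag, trigger batch mode.
    static boolean looksLikeBatch(String[] args) {
        if (args.length == 2 && !args[0].startsWith("-") && !args[1].startsWith("-")) {
            File in = new File(args[0]);
            File out = new File(args[1]);
            if (in.isDirectory() && !out.isFile()) {
                return true;
            }
        }
        for (String s : args) {
            if (s.equals("-inputDir") || s.equals("--inputDir") || s.equals("-i")) {
                return true;
            }
        }
        return false;
    }

    public static void main(String[] args) {
        System.out.println(looksLikeBatch(new String[]{"-i", "docs"}));
        System.out.println(looksLikeBatch(new String[]{"-t", "file.pdf"}));
    }
}
```

The first call prints `true` (flag match), the second `false`.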
657
658
659
660 private void configure(String configFilePath) throws Exception {
661 this.configFilePath = configFilePath;
662 TikaConfig config = new TikaConfig(new File(configFilePath));
663 parser = new AutoDetectParser(config);
664 detector = config.getDetector();
665 context.set(Parser.class, parser);
502666 }
503667
504668 private void displayMetModels(){
636800 }
637801 System.out.println(" parser: " + p.getClass().getName());
638802 }
803 }
804 }
805
806 /**
807 * Compares our mime types registry with the File(1) tool's
808 * directory of (uncompiled) Magic entries.
809 * (Well, those with mimetypes anyway)
810 * @param magicDir Path to the magic directory
811 */
812 private void compareFileMagic(String magicDir) throws Exception {
813 Set<String> tikaLacking = new TreeSet<String>();
814 Set<String> tikaNoMagic = new TreeSet<String>();
815
816 // Sanity check
817 File dir = new File(magicDir);
818 if ((new File(dir, "elf")).exists() &&
819 (new File(dir, "mime")).exists() &&
820 (new File(dir, "vorbis")).exists()) {
821 // Looks plausible
822 } else {
823 throw new IllegalArgumentException(
824 magicDir + " doesn't seem to hold uncompressed file magic entries");
825 }
826
827 // Find all the mimetypes in the directory
828 Set<String> fileMimes = new HashSet<String>();
829 for (File mf : dir.listFiles()) {
830 if (mf.isFile()) {
831 BufferedReader r = new BufferedReader(new InputStreamReader(
832 new FileInputStream(mf), IOUtils.UTF_8));
833 String line;
834 while ((line = r.readLine()) != null) {
835 if (line.startsWith("!:mime") ||
836 line.startsWith("#!:mime")) {
837 String mime = line.substring(7).trim();
838 fileMimes.add(mime);
839 }
840 }
841 r.close();
842 }
843 }
844
845 // See how those compare to the Tika ones
846 TikaConfig config = TikaConfig.getDefaultConfig();
847 MimeTypes mimeTypes = config.getMimeRepository();
848 MediaTypeRegistry registry = config.getMediaTypeRegistry();
849 for (String mime : fileMimes) {
850 try {
851 final MimeType type = mimeTypes.getRegisteredMimeType(mime);
852
853 if (type == null) {
854 // Tika doesn't know about this one
855 tikaLacking.add(mime);
856 } else {
857 // Tika knows about this one!
858
859 // Does Tika have magic for it?
860 boolean hasMagic = type.hasMagic();
861
862 // How about the children?
863 if (!hasMagic) {
864 for (MediaType child : registry.getChildTypes(type.getType())) {
865 MimeType childType = mimeTypes.getRegisteredMimeType(child.toString());
866 if (childType != null && childType.hasMagic()) {
867 hasMagic = true;
868 }
869 }
870 }
871
872 // How about the parents?
873 MimeType parentType = type;
874 while (parentType != null && !hasMagic) {
875 if (parentType.hasMagic()) {
876 // Has magic, fine
877 hasMagic = true;
878 } else {
879 // Check the parent next
880 MediaType parent = registry.getSupertype(type.getType());
881 if (parent == MediaType.APPLICATION_XML ||
882 parent == MediaType.TEXT_PLAIN ||
883 parent == MediaType.OCTET_STREAM) {
884 // Stop checking parents if we hit a top level type
885 parent = null;
886 }
887 if (parent != null) {
888 parentType = mimeTypes.getRegisteredMimeType(parent.toString());
889 } else {
890 parentType = null;
891 }
892 }
893 }
894 if (!hasMagic) {
895 tikaNoMagic.add(mime);
896 }
897 }
898 } catch (MimeTypeException e) {
899 // Broken entry in the file magic directory
900 // Silently skip
901 }
902 }
903
903 904 // Count how many types Tika knows about
905 int tikaTypes = 0;
906 int tikaAliases = 0;
907 for (MediaType type : registry.getTypes()) {
908 tikaTypes++;
909 tikaAliases += registry.getAliases(type).size();
910 }
911
912 // Report
913 System.out.println("Tika knows about " + tikaTypes + " unique mime types");
914 System.out.println("Tika knows about " + (tikaTypes+tikaAliases) + " mime types including aliases");
915 System.out.println("The File Magic directory knows about " + fileMimes.size() + " unique mime types");
916 System.out.println();
917 System.out.println("The following mime types are known to File but not Tika:");
918 for (String mime : tikaLacking) {
919 System.out.println(" " + mime);
920 }
921 System.out.println();
922 System.out.println("The following mime types from File have no Tika magic (but their children might):");
923 for (String mime : tikaNoMagic) {
924 System.out.println(" " + mime);
639925 }
640926 }
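The `!:mime` scan inside compareFileMagic can be illustrated against an in-memory sample rather than a real file(1) magic directory. The class name here is hypothetical, and a TreeSet stands in for the original HashSet so the printed order is deterministic; the substring/trim logic mirrors the loop above:

```java
import java.io.BufferedReader;
import java.io.StringReader;
import java.util.Set;
import java.util.TreeSet;

public class MagicMimeScanDemo {

    // Collect mime annotations from uncompiled magic source text. Both active
    // ("!:mime") and commented-out ("#!:mime") annotations are gathered,
    // matching the parser in compareFileMagic.
    static Set<String> collectMimes(String magicSource) throws Exception {
        Set<String> mimes = new TreeSet<String>();
        BufferedReader r = new BufferedReader(new StringReader(magicSource));
        String line;
        while ((line = r.readLine()) != null) {
            if (line.startsWith("!:mime") || line.startsWith("#!:mime")) {
                mimes.add(line.substring(7).trim());
            }
        }
        return mimes;
    }

    public static void main(String[] args) throws Exception {
        String sample = "0 string OggS Ogg data\n"
                + "!:mime audio/ogg\n"
                + "#!:mime application/x-elf\n";
        System.out.println(collectMimes(sample));
    }
}
```

This prints `[application/x-elf, audio/ogg]`: both annotations survive, in sorted order.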
641927
655941 if (encoding != null) {
656942 return new OutputStreamWriter(output, encoding);
657943 } else if (System.getProperty("os.name")
658 .toLowerCase().startsWith("mac os x")) {
944 .toLowerCase(Locale.ROOT).startsWith("mac os x")) {
659945 // TIKA-324: Override the default encoding on Mac OS X
660 return new OutputStreamWriter(output, "UTF-8");
946 return new OutputStreamWriter(output, IOUtils.UTF_8);
661947 } else {
662 return new OutputStreamWriter(output);
948 return new OutputStreamWriter(output, Charset.defaultCharset());
663949 }
664950 }
665951
7581044 // being a CLI program messages should go to the stderr too
7591045 //
7601046 String msg = String.format(
1047 Locale.ROOT,
7611048 "Ignoring unexpected exception trying to save embedded file %s (%s)",
7621049 name,
7631050 e.getMessage()
8201107 @Override
8211108 public void run() {
8221109 try {
1110 InputStream input = null;
8231111 try {
8241112 InputStream rawInput = socket.getInputStream();
8251113 OutputStream output = socket.getOutputStream();
826 InputStream input = TikaInputStream.get(rawInput);
1114 input = TikaInputStream.get(rawInput);
8271115 type.process(input, output, new Metadata());
8281116 output.flush();
8291117 } finally {
1118 if (input != null) {
1119 input.close();
1120 }
8301121 socket.close();
8311122 }
8321123 } catch (Exception e) {
9231214 @Override
9241215 public void endDocument() throws SAXException {
9251216 try {
1217 JsonMetadata.setPrettyPrinting(prettyPrint);
9261218 JsonMetadata.toJson(metadata, writer);
9271219 writer.flush();
9281220 } catch (TikaException e) {
1414 * limitations under the License.
1515 */
1616 package org.apache.tika.gui;
17
18 import java.awt.CardLayout;
19 import java.awt.Color;
20 import java.awt.Dimension;
21 import java.awt.Toolkit;
22 import java.awt.event.ActionEvent;
23 import java.awt.event.ActionListener;
24 import java.awt.event.KeyEvent;
25 import java.awt.event.WindowEvent;
26 import java.io.File;
27 import java.io.FileOutputStream;
28 import java.io.IOException;
29 import java.io.InputStream;
30 import java.io.PrintWriter;
31 import java.io.StringWriter;
32 import java.io.Writer;
33 import java.net.MalformedURLException;
34 import java.net.URL;
35 import java.util.Arrays;
36 import java.util.HashMap;
37 import java.util.Map;
38 import java.util.Set;
3917
4018 import javax.swing.Box;
4119 import javax.swing.JDialog;
6038 import javax.xml.transform.sax.SAXTransformerFactory;
6139 import javax.xml.transform.sax.TransformerHandler;
6240 import javax.xml.transform.stream.StreamResult;
63
41 import java.awt.CardLayout;
42 import java.awt.Color;
43 import java.awt.Dimension;
44 import java.awt.Toolkit;
45 import java.awt.event.ActionEvent;
46 import java.awt.event.ActionListener;
47 import java.awt.event.KeyEvent;
48 import java.awt.event.WindowEvent;
49 import java.io.File;
50 import java.io.FileOutputStream;
51 import java.io.IOException;
52 import java.io.InputStream;
53 import java.io.PrintWriter;
54 import java.io.StringWriter;
55 import java.io.Writer;
56 import java.net.MalformedURLException;
57 import java.net.URL;
58 import java.util.Arrays;
59 import java.util.HashMap;
60 import java.util.Map;
61 import java.util.Set;
62
63 import org.apache.tika.config.TikaConfig;
6464 import org.apache.tika.exception.TikaException;
6565 import org.apache.tika.extractor.DocumentSelector;
6666 import org.apache.tika.io.IOUtils;
6767 import org.apache.tika.io.TikaInputStream;
6868 import org.apache.tika.metadata.Metadata;
69 import org.apache.tika.metadata.serialization.JsonMetadataList;
6970 import org.apache.tika.mime.MediaType;
7071 import org.apache.tika.parser.AbstractParser;
7172 import org.apache.tika.parser.AutoDetectParser;
7273 import org.apache.tika.parser.ParseContext;
7374 import org.apache.tika.parser.Parser;
75 import org.apache.tika.parser.RecursiveParserWrapper;
7476 import org.apache.tika.parser.html.BoilerpipeContentHandler;
77 import org.apache.tika.sax.BasicContentHandlerFactory;
7578 import org.apache.tika.sax.BodyContentHandler;
7679 import org.apache.tika.sax.ContentHandlerDecorator;
7780 import org.apache.tika.sax.TeeContentHandler;
102105 * @throws Exception if an error occurs
103106 */
104107 public static void main(String[] args) throws Exception {
108 TikaConfig config = TikaConfig.getDefaultConfig();
109 if (args.length > 0) {
110 File configFile = new File(args[0]);
111 config = new TikaConfig(configFile);
112 }
105113 UIManager.setLookAndFeel(UIManager.getSystemLookAndFeelClassName());
114 final TikaConfig finalConfig = config;
106115 SwingUtilities.invokeLater(new Runnable() {
107116 public void run() {
108 new TikaGUI(new AutoDetectParser()).setVisible(true);
117 new TikaGUI(new AutoDetectParser(finalConfig)).setVisible(true);
109118 }
110119 });
111120 }
112121
122 //maximum number of bytes to mark so the stream can be reset and reparsed for JSON
123 private final int MAX_MARK = 20*1024*1024;//20MB
113124 /**
114125 * Parsing context.
115126 */
154165 * Raw XHTML source.
155166 */
156167 private final JEditorPane xml;
168
169 /**
170 * Raw JSON source.
171 */
172 private final JEditorPane json;
157173
158174 /**
159175 * Document metadata.
178194 text = addCard(cards, "text/plain", "text");
179195 textMain = addCard(cards, "text/plain", "main");
180196 xml = addCard(cards, "text/plain", "xhtml");
197 json = addCard(cards, "text/plain", "json");
181198 add(cards);
182199 layout.show(cards, "welcome");
183200
210227 addMenuItem(view, "Plain text", "text", KeyEvent.VK_P);
211228 addMenuItem(view, "Main content", "main", KeyEvent.VK_C);
212229 addMenuItem(view, "Structured text", "xhtml", KeyEvent.VK_S);
230 addMenuItem(view, "Recursive JSON", "json", KeyEvent.VK_J);
213231 bar.add(view);
214232
215233 bar.add(Box.createHorizontalGlue());
260278 layout.show(cards, command);
261279 } else if ("metadata".equals(command)) {
262280 layout.show(cards, command);
281 } else if ("json".equals(command)) {
282 layout.show(cards, command);
263283 } else if ("about".equals(command)) {
264284 textDialog(
265285 "About Apache Tika",
313333 getXmlContentHandler(xmlBuffer));
314334
315335 context.set(DocumentSelector.class, new ImageDocumentSelector());
316
336 if (input.markSupported()) {
337 input.mark(MAX_MARK);
338 }
317339 input = new ProgressMonitorInputStream(
318340 this, "Parsing stream", input);
319341 parser.parse(input, handler, md, context);
339361 setText(text, textBuffer.toString());
340362 setText(textMain, textMainBuffer.toString());
341363 setText(html, htmlBuffer.toString());
364 if (!input.markSupported()) {
365 setText(json, "InputStream does not support mark/reset for Recursive Parsing");
366 layout.show(cards, "metadata");
367 return;
368 }
369 boolean isReset = false;
370 try {
371 input.reset();
372 isReset = true;
373 } catch (IOException e) {
374 setText(json, "Error during stream reset.\n"+
375 "There's a limit of "+MAX_MARK + " bytes for this type of processing in the GUI.\n"+
376 "Try the app with command line argument of -J."
377 );
378 }
379 if (isReset) {
380 RecursiveParserWrapper wrapper = new RecursiveParserWrapper(parser,
381 new BasicContentHandlerFactory(
382 BasicContentHandlerFactory.HANDLER_TYPE.BODY, -1));
383 wrapper.parse(input, null, new Metadata(), new ParseContext());
384 StringWriter jsonBuffer = new StringWriter();
385 JsonMetadataList.setPrettyPrinting(true);
386 JsonMetadataList.toJson(wrapper.getMetadata(), jsonBuffer);
387 setText(json, jsonBuffer.toString());
388 }
342389 layout.show(cards, "metadata");
343390 }
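The mark/reset dance above (mark up front, parse once, reset for the recursive JSON pass) reduces to a minimal sketch. The class name is hypothetical and a plain in-memory byte stream stands in for the GUI's input stream:

```java
import java.io.ByteArrayInputStream;
import java.io.InputStream;

public class MarkResetDemo {

    public static void main(String[] args) throws Exception {
        InputStream in = new ByteArrayInputStream("hello".getBytes("UTF-8"));
        int limit = 1024; // stand-in for the GUI's MAX_MARK budget
        if (in.markSupported()) {
            in.mark(limit); // remember the start of the stream
        }
        while (in.read() != -1) {
            // first pass: consume the stream, as parser.parse() would
        }
        in.reset(); // rewind to the mark for the second (JSON) pass
        System.out.println((char) in.read());
    }
}
```

It prints `h`: after the reset, the second pass starts from the marked position. If more than `limit` bytes had been read first, reset() could throw IOException, which is exactly the failure the GUI reports for oversized inputs.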
344391
411458 InputStream stream = url.openStream();
412459 try {
413460 StringWriter writer = new StringWriter();
414 IOUtils.copy(stream, writer, "UTF-8");
461 IOUtils.copy(stream, writer, IOUtils.UTF_8.name());
415462
416463 JEditorPane editor =
417464 new JEditorPane("text/plain", writer.toString());
tika-app/src/main/java/org/apache/tika/io/json/JsonMetadataSerializer.java (+0/-89, file deleted)
0 package org.apache.tika.io.json;
1
2 /*
3 * Licensed to the Apache Software Foundation (ASF) under one or more
4 * contributor license agreements. See the NOTICE file distributed with
5 * this work for additional information regarding copyright ownership.
6 * The ASF licenses this file to You under the Apache License, Version 2.0
7 * (the "License"); you may not use this file except in compliance with
8 * the License. You may obtain a copy of the License at
9 *
10 * http://www.apache.org/licenses/LICENSE-2.0
11 *
12 * Unless required by applicable law or agreed to in writing, software
13 * distributed under the License is distributed on an "AS IS" BASIS,
14 * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
15 * See the License for the specific language governing permissions and
16 * limitations under the License.
17 */
18
19 import java.lang.reflect.Type;
20 import java.util.Arrays;
21
22 import org.apache.tika.metadata.Metadata;
23
24 import com.google.gson.JsonArray;
25 import com.google.gson.JsonElement;
26 import com.google.gson.JsonNull;
27 import com.google.gson.JsonObject;
28 import com.google.gson.JsonPrimitive;
29 import com.google.gson.JsonSerializationContext;
30 import com.google.gson.JsonSerializer;
31
32
33 public class JsonMetadataSerializer implements JsonSerializer<Metadata> {
34
35 /**
36 *
37 * @param metadata
38 * @param type
39 * @param context
40 * @return JsonObject with key/value(s) pairs or JsonNull if metadata is null.
41 */
42 @Override
43 public JsonElement serialize(Metadata metadata, Type type, JsonSerializationContext context) {
44 if (metadata == null){
45 return new JsonNull();
46 }
47 String[] names = getNames(metadata);
48 if (names == null) {
49 return new JsonNull();
50 }
51
52 JsonObject root = new JsonObject();
53
54 for (String n : names) {
55
56 String[] vals = metadata.getValues(n);
57 if (vals == null) {
58 //silently skip?
59 continue;
60 }
61
62 if (vals.length == 1) {
63 root.addProperty(n, vals[0]);
64 } else {
65 JsonArray jArr = new JsonArray();
66 for (int i = 0; i < vals.length; i++) {
67 jArr.add(new JsonPrimitive(vals[i]));
68 }
69 root.add(n, jArr);
70 }
71 }
72 return root;
73 }
74
75 /**
76 * Override to get a custom sort order
77 * or to filter names.
78 *
79 * @param metadata
80 * @return
81 */
82 protected String[] getNames(Metadata metadata) {
83 String[] names = metadata.names();
84 Arrays.sort(names);
85 return names;
86 }
87
88 }
0 # Licensed to the Apache Software Foundation (ASF) under one or more
1 # contributor license agreements. See the NOTICE file distributed with
2 # this work for additional information regarding copyright ownership.
3 # The ASF licenses this file to You under the Apache License, Version 2.0
4 # (the "License"); you may not use this file except in compliance with
5 # the License. You may obtain a copy of the License at
6 #
7 # http://www.apache.org/licenses/LICENSE-2.0
8 #
9 # Unless required by applicable law or agreed to in writing, software
10 # distributed under the License is distributed on an "AS IS" BASIS,
11 # WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
12 # See the License for the specific language governing permissions and
13 # limitations under the License.
14
15 # info, debug, error, fatal ...
16 log4j.rootLogger=info,stderr
17
18 #console
19 log4j.appender.stderr=org.apache.log4j.ConsoleAppender
20 log4j.appender.stderr.layout=org.apache.log4j.PatternLayout
21 log4j.appender.stderr.Target=System.err
22
23 log4j.appender.stderr.layout.ConversionPattern= %-5p %m%n
0 # Licensed to the Apache Software Foundation (ASF) under one or more
1 # contributor license agreements. See the NOTICE file distributed with
2 # this work for additional information regarding copyright ownership.
3 # The ASF licenses this file to You under the Apache License, Version 2.0
4 # (the "License"); you may not use this file except in compliance with
5 # the License. You may obtain a copy of the License at
6 #
7 # http://www.apache.org/licenses/LICENSE-2.0
8 #
9 # Unless required by applicable law or agreed to in writing, software
10 # distributed under the License is distributed on an "AS IS" BASIS,
11 # WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
12 # See the License for the specific language governing permissions and
13 # limitations under the License.
14
15 # info, debug, error, fatal ...
16 log4j.rootLogger=info,stdout
17
18 #console
19 log4j.appender.stdout=org.apache.log4j.ConsoleAppender
20 log4j.appender.stdout.layout=org.apache.log4j.PatternLayout
21
22
23 log4j.appender.stdout.layout.ConversionPattern=%m%n
0 /*
1 * Licensed to the Apache Software Foundation (ASF) under one or more
2 * contributor license agreements. See the NOTICE file distributed with
3 * this work for additional information regarding copyright ownership.
4 * The ASF licenses this file to You under the Apache License, Version 2.0
5 * (the "License"); you may not use this file except in compliance with
6 * the License. You may obtain a copy of the License at
7 *
8 * http://www.apache.org/licenses/LICENSE-2.0
9 *
10 * Unless required by applicable law or agreed to in writing, software
11 * distributed under the License is distributed on an "AS IS" BASIS,
12 * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
13 * See the License for the specific language governing permissions and
14 * limitations under the License.
15 */
16 package org.apache.tika.cli;
17
18 import static junit.framework.TestCase.assertTrue;
19 import static org.junit.Assert.assertEquals;
20
21 import java.io.File;
22 import java.io.FileOutputStream;
23 import java.io.IOException;
24 import java.io.OutputStream;
25 import java.util.LinkedHashMap;
26 import java.util.Map;
27
28 import org.apache.commons.io.FileUtils;
29 import org.apache.tika.io.IOUtils;
30 import org.junit.After;
31 import org.junit.Before;
32 import org.junit.Test;
33
34 public class TikaCLIBatchCommandLineTest {
35
36 File testInput = null;
37 File testFile = null;
38
39 @Before
40 public void init() {
41 testInput = new File("testInput");
42 if (!testInput.mkdirs()) {
43 throw new RuntimeException("Failed to create test input directory");
44 }
45 testFile = new File("testFile.txt");
46 OutputStream os = null;
47 try {
48 os = new FileOutputStream(testFile);
49 IOUtils.write("test output", os, "UTF-8");
50 } catch (IOException e) {
51 throw new RuntimeException("Couldn't write testFile");
52 } finally {
53 IOUtils.closeQuietly(os);
54 }
55 }
56
57 @After
58 public void tearDown() {
59 try {
60 FileUtils.deleteDirectory(testInput);
61 testFile.delete();
62 } catch (IOException e) {
63 throw new RuntimeException(e);
64 }
65 }
66
67 @Test
68 public void testJVMOpts() throws Exception {
69 String path = testInput.getAbsolutePath();
70 if (path.contains(" ")) {
71 path = "\"" + path + "\"";
72 }
73 String[] params = {"-JXmx1g", "-JDlog4j.configuration=batch_process_log4j.xml", "-inputDir",
74 path, "-outputDir", "testout-output"};
75
76
77 String[] commandLine = BatchCommandLineBuilder.build(params);
78 StringBuilder sb = new StringBuilder();
79
80 for (String s : commandLine) {
81 sb.append(s).append(" ");
82 }
83 String s = sb.toString();
84 int classInd = s.indexOf("org.apache.tika.batch.fs.FSBatchProcessCLI");
85 int xmx = s.indexOf("-Xmx1g");
86 int inputDir = s.indexOf("-inputDir");
87 int log = s.indexOf("-Dlog4j.configuration");
88 assertTrue(classInd > -1);
89 assertTrue(xmx > -1);
90 assertTrue(inputDir > -1);
91 assertTrue(log > -1);
92 assertTrue(xmx < classInd);
93 assertTrue(log < classInd);
94 assertTrue(inputDir > classInd);
95 }
96
97 @Test
98 public void testBasicMappingOfArgs() throws Exception {
99 String path = testInput.getAbsolutePath();
100 if (path.contains(" ")) {
101 path = "\"" + path + "\"";
102 }
103 String[] params = {"-JXmx1g", "-JDlog4j.configuration=batch_process_log4j.xml",
104 "-bc", "batch-config.xml",
105 "-J", "-h", "-inputDir", path};
106
107 String[] commandLine = BatchCommandLineBuilder.build(params);
108 Map<String, String> attrs = mapify(commandLine);
109 assertEquals("true", attrs.get("-recursiveParserWrapper"));
110 assertEquals("html", attrs.get("-basicHandlerType"));
111 assertEquals("json", attrs.get("-outputSuffix"));
112 assertEquals("batch-config.xml", attrs.get("-bc"));
113 assertEquals(path, attrs.get("-inputDir"));
114 }
115
116 @Test
117 public void testTwoDirsNoFlags() throws Exception {
118 String outputRoot = "outputRoot";
119 String path = testInput.getAbsolutePath();
120 if (path.contains(" ")) {
121 path = "\"" + path + "\"";
122 }
123 String[] params = {path, outputRoot};
124
125 String[] commandLine = BatchCommandLineBuilder.build(params);
126 Map<String, String> attrs = mapify(commandLine);
127 assertEquals(path, attrs.get("-inputDir"));
128 assertEquals(outputRoot, attrs.get("-outputDir"));
129 }
130
131 @Test
132 public void testTwoDirsVarious() throws Exception {
133 String outputRoot = "outputRoot";
134 String path = testInput.getAbsolutePath();
135 if (path.contains(" ")) {
136 path = "\"" + path + "\"";
137 }
138 String[] params = {"-i", path, "-o", outputRoot};
139
140 String[] commandLine = BatchCommandLineBuilder.build(params);
141 Map<String, String> attrs = mapify(commandLine);
142 assertEquals(path, attrs.get("-inputDir"));
143 assertEquals(outputRoot, attrs.get("-outputDir"));
144
145 params = new String[]{"--inputDir", path, "--outputDir", outputRoot};
146
147 commandLine = BatchCommandLineBuilder.build(params);
148 attrs = mapify(commandLine);
149 assertEquals(path, attrs.get("-inputDir"));
150 assertEquals(outputRoot, attrs.get("-outputDir"));
151
152 params = new String[]{"-inputDir", path, "-outputDir", outputRoot};
153
154 commandLine = BatchCommandLineBuilder.build(params);
155 attrs = mapify(commandLine);
156 assertEquals(path, attrs.get("-inputDir"));
157 assertEquals(outputRoot, attrs.get("-outputDir"));
158 }
159
160 @Test
161 public void testConfig() throws Exception {
162 String outputRoot = "outputRoot";
163 String configPath = "c:/somewhere/someConfig.xml";
164 String path = testInput.getAbsolutePath();
165
166 if (path.contains(" ")) {
167 path = "\"" + path + "\"";
168 }
169
170 String[] params = {"--inputDir", path, "--outputDir", outputRoot,
171 "--config="+configPath};
172 String[] commandLine = BatchCommandLineBuilder.build(params);
173 Map<String, String> attrs = mapify(commandLine);
174 assertEquals(path, attrs.get("-inputDir"));
175 assertEquals(outputRoot, attrs.get("-outputDir"));
176 assertEquals(configPath, attrs.get("-c"));
177
178 }
179
180 @Test
181 public void testOneDirOneFileException() throws Exception {
182 boolean ex = false;
183 try {
184 String outputRoot = "outputRoot";
185 String path = testInput.getAbsolutePath();
186 if (path.contains(" ")) {
187 path = "\"" + path + "\"";
188 }
189 String[] params = {path, testFile.getAbsolutePath()};
190
191 String[] commandLine = BatchCommandLineBuilder.build(params);
192
193 } catch (IllegalArgumentException e) {
194 ex = true;
195 }
196 assertTrue("exception on <dir> <file>", ex);
197 }
198
199 private Map<String, String> mapify(String[] args) {
200 Map<String, String> map = new LinkedHashMap<String, String>();
201 for (int i = 0; i < args.length; i++) {
202 if (args[i].startsWith("-")) {
203 String k = args[i];
204 String v = "";
205 if (i < args.length - 1 && !args[i + 1].startsWith("-")) {
206 v = args[i + 1];
207 i++;
208 }
209 map.put(k, v);
210 }
211 }
212 return map;
213 }
214
215 }
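The `mapify` helper above pairs each `-flag` token with the following non-flag token, mapping bare flags to an empty value. The same pairing rule can be exercised outside JUnit with a minimal standalone sketch (the class name `ArgMap` is illustrative, not part of Tika):

```java
import java.util.LinkedHashMap;
import java.util.Map;

public class ArgMap {
    // Pair each "-flag" with the next token, unless that token is itself a
    // flag, in which case the flag maps to an empty value.
    public static Map<String, String> mapify(String[] args) {
        Map<String, String> map = new LinkedHashMap<String, String>();
        for (int i = 0; i < args.length; i++) {
            if (args[i].startsWith("-")) {
                String k = args[i];
                String v = "";
                if (i < args.length - 1 && !args[i + 1].startsWith("-")) {
                    v = args[i + 1];
                    i++; // consume the value token
                }
                map.put(k, v);
            }
        }
        return map;
    }

    public static void main(String[] args) {
        Map<String, String> m =
                mapify(new String[]{"-inputDir", "in", "-J", "-numConsumers", "2"});
        if (!"in".equals(m.get("-inputDir"))) throw new AssertionError();
        if (!"".equals(m.get("-J"))) throw new AssertionError(); // bare flag
        if (!"2".equals(m.get("-numConsumers"))) throw new AssertionError();
        System.out.println("ok");
    }
}
```

A flag followed directly by another flag (such as `-J -h` in `testBasicMappingOfArgs`) therefore yields an empty-string value rather than swallowing the next flag.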
0 /*
1 * Licensed to the Apache Software Foundation (ASF) under one or more
2 * contributor license agreements. See the NOTICE file distributed with
3 * this work for additional information regarding copyright ownership.
4 * The ASF licenses this file to You under the Apache License, Version 2.0
5 * (the "License"); you may not use this file except in compliance with
6 * the License. You may obtain a copy of the License at
7 *
8 * http://www.apache.org/licenses/LICENSE-2.0
9 *
10 * Unless required by applicable law or agreed to in writing, software
11 * distributed under the License is distributed on an "AS IS" BASIS,
12 * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
13 * See the License for the specific language governing permissions and
14 * limitations under the License.
15 */
16
17 package org.apache.tika.cli;
18
19 import static org.junit.Assert.assertEquals;
20 import static org.junit.Assert.assertTrue;
21
22 import java.io.ByteArrayOutputStream;
23 import java.io.File;
24 import java.io.FileInputStream;
25 import java.io.InputStreamReader;
26 import java.io.OutputStream;
27 import java.io.PrintStream;
28 import java.io.Reader;
29 import java.util.List;
30
31 import org.apache.commons.io.FileUtils;
32 import org.apache.tika.io.IOUtils;
33 import org.apache.tika.metadata.Metadata;
34 import org.apache.tika.metadata.serialization.JsonMetadataList;
35 import org.apache.tika.parser.RecursiveParserWrapper;
36 import org.junit.After;
37 import org.junit.Before;
38 import org.junit.Test;
39
40 public class TikaCLIBatchIntegrationTest {
41
42 private File testDataFile = new File("src/test/resources/test-data");
43
44 private File tempDir;
45 private OutputStream out = null;
46 private OutputStream err = null;
47 private ByteArrayOutputStream outBuffer = null;
48
49 @Before
50 public void setup() throws Exception {
51 tempDir = File.createTempFile("tika-cli-test-batch-", "");
52 tempDir.delete();
53 tempDir.mkdir();
54 outBuffer = new ByteArrayOutputStream();
55 PrintStream outWriter = new PrintStream(outBuffer, true, IOUtils.UTF_8.name());
56 ByteArrayOutputStream errBuffer = new ByteArrayOutputStream();
57 PrintStream errWriter = new PrintStream(errBuffer, true, IOUtils.UTF_8.name());
58 out = System.out;
59 err = System.err;
60 System.setOut(outWriter);
61 System.setErr(errWriter);
62 }
63
64 @After
65 public void tearDown() throws Exception {
66 System.setOut(new PrintStream(out, true, IOUtils.UTF_8.name()));
67 System.setErr(new PrintStream(err, true, IOUtils.UTF_8.name()));
68 FileUtils.deleteDirectory(tempDir);
69 }
70
71 @Test
72 public void testSimplestBatchIntegration() throws Exception {
73 String[] params = {escape(testDataFile.getAbsolutePath()),
74 escape(tempDir.getAbsolutePath())};
75 TikaCLI.main(params);
76
77 assertTrue("bad_xml.xml.xml", new File(tempDir, "bad_xml.xml.xml").isFile());
78 assertTrue("coffee.xls.xml", new File(tempDir, "coffee.xls.xml").exists());
79 }
80
81 @Test
82 public void testBasicBatchIntegration() throws Exception {
83 String[] params = {"-i", escape(testDataFile.getAbsolutePath()),
84 "-o", escape(tempDir.getAbsolutePath()),
85 "-numConsumers", "2"
86 };
87 TikaCLI.main(params);
88
89 assertTrue("bad_xml.xml.xml", new File(tempDir, "bad_xml.xml.xml").isFile());
90 assertTrue("coffee.xls.xml", new File(tempDir, "coffee.xls.xml").exists());
91 }
92
93 @Test
94 public void testJsonRecursiveBatchIntegration() throws Exception {
95 Reader reader = null;
96 try {
97 String[] params = {"-i", escape(testDataFile.getAbsolutePath()),
98 "-o", escape(tempDir.getAbsolutePath()),
99 "-numConsumers", "10",
100 "-J", //recursive Json
101 "-t" //plain text in content
102 };
103 TikaCLI.main(params);
104 reader = new InputStreamReader(
105 new FileInputStream(new File(tempDir, "test_recursive_embedded.docx.json")), IOUtils.UTF_8);
106 List<Metadata> metadataList = JsonMetadataList.fromJson(reader);
107 assertEquals(12, metadataList.size());
108 assertTrue(metadataList.get(6).get(RecursiveParserWrapper.TIKA_CONTENT).contains("human events"));
109 } finally {
110 IOUtils.closeQuietly(reader);
111 }
112 }
113
114 @Test
115 public void testProcessLogFileConfig() throws Exception {
116 String[] params = {"-i", escape(testDataFile.getAbsolutePath()),
117 "-o", escape(tempDir.getAbsolutePath()),
118 "-numConsumers", "2",
119 "-JDlog4j.configuration=log4j_batch_process_test.properties"};
120 TikaCLI.main(params);
121
122 assertTrue("bad_xml.xml.xml", new File(tempDir, "bad_xml.xml.xml").isFile());
123 assertTrue("coffee.xls.xml", new File(tempDir, "coffee.xls.xml").exists());
124 String sysOutString = new String(outBuffer.toByteArray(), IOUtils.UTF_8);
125 assertTrue(sysOutString.contains("MY_CUSTOM_LOG_CONFIG"));
126 }
127
128 public static String escape(String path) {
129 if (path.indexOf(' ') > -1) {
130 return '"' + path + '"';
131 }
132 return path;
133 }
134
135 }
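The `escape` helper in the test above quotes a path only when it contains a space, so that the path survives re-tokenization when passed on a child-process command line. The rule in isolation (the class name `PathQuote` is illustrative, not part of Tika):

```java
public class PathQuote {
    // Wrap the path in double quotes only when it contains a space;
    // otherwise return it unchanged.
    public static String escape(String path) {
        if (path.indexOf(' ') > -1) {
            return '"' + path + '"';
        }
        return path;
    }

    public static void main(String[] args) {
        if (!escape("plain/dir").equals("plain/dir")) throw new AssertionError();
        if (!escape("My Documents/in").equals("\"My Documents/in\""))
            throw new AssertionError();
        System.out.println("ok");
    }
}
```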
1515 */
1616 package org.apache.tika.cli;
1717
18 import static org.junit.Assert.assertFalse;
19 import static org.junit.Assert.assertTrue;
20
1821 import java.io.ByteArrayOutputStream;
1922 import java.io.File;
2023 import java.io.PrintStream;
2124 import java.net.URI;
2225
2326 import org.apache.commons.io.FileUtils;
24
27 import org.apache.tika.exception.TikaException;
28 import org.apache.tika.io.IOUtils;
2529 import org.junit.After;
26 import static org.junit.Assert.assertTrue;
2730 import org.junit.Before;
2831 import org.junit.Test;
2932
3639 private File profile = null;
3740 private ByteArrayOutputStream outContent = null;
3841 private PrintStream stdout = null;
39 private URI testDataURI = new File("src/test/resources/test-data/").toURI();
40 private String resourcePrefix = testDataURI.toString();
42 private File testDataFile = new File("src/test/resources/test-data");
43 private URI testDataURI = testDataFile.toURI();
44 private String resourcePrefix;
4145
4246 @Before
4347 public void setUp() throws Exception {
4448 profile = new File("welsh.ngp");
4549 outContent = new ByteArrayOutputStream();
50 resourcePrefix = testDataURI.toString();
4651 stdout = System.out;
47 System.setOut(new PrintStream(outContent));
52 System.setOut(new PrintStream(outContent, true, IOUtils.UTF_8.name()));
4853 }
4954
5055 /**
6873 public void testListParserDetail() throws Exception{
6974 String[] params = {"--list-parser-detail"};
7075 TikaCLI.main(params);
71 assertTrue(outContent.toString().contains("application/vnd.oasis.opendocument.text-web"));
76 assertTrue(outContent.toString(IOUtils.UTF_8.name()).contains("application/vnd.oasis.opendocument.text-web"));
7277 }
7378
7479 /**
8186 String[] params = {"--list-parser"};
8287 TikaCLI.main(params);
8388 //Assert was commented out temporarily while tracking down the problem
84 // Assert.assertTrue(outContent != null && outContent.toString().contains("org.apache.tika.parser.iwork.IWorkPackageParser"));
89 // Assert.assertTrue(outContent != null && outContent.toString("UTF-8").contains("org.apache.tika.parser.iwork.IWorkPackageParser"));
8590 }
8691
8792 /**
9398 public void testXMLOutput() throws Exception{
9499 String[] params = {"-x", resourcePrefix + "alice.cli.test"};
95100 TikaCLI.main(params);
96 assertTrue(outContent.toString().contains("?xml version=\"1.0\" encoding=\"UTF-8\"?"));
101 assertTrue(outContent.toString(IOUtils.UTF_8.name()).contains("?xml version=\"1.0\" encoding=\"UTF-8\"?"));
97102 }
98103
99104 /**
105110 public void testHTMLOutput() throws Exception{
106111 String[] params = {"-h", resourcePrefix + "alice.cli.test"};
107112 TikaCLI.main(params);
108 assertTrue(outContent.toString().contains("html xmlns=\"http://www.w3.org/1999/xhtml"));
113 assertTrue(outContent.toString("UTF-8").contains("html xmlns=\"http://www.w3.org/1999/xhtml"));
109114 assertTrue("Expanded <title></title> element should be present",
110 outContent.toString().contains("<title></title>"));
115 outContent.toString(IOUtils.UTF_8.name()).contains("<title></title>"));
111116 }
112117
113118 /**
119124 public void testTextOutput() throws Exception{
120125 String[] params = {"-t", resourcePrefix + "alice.cli.test"};
121126 TikaCLI.main(params);
122 assertTrue(outContent.toString().contains("finished off the cake"));
127 assertTrue(outContent.toString(IOUtils.UTF_8.name()).contains("finished off the cake"));
123128 }
124129
125130 /**
130135 public void testMetadataOutput() throws Exception{
131136 String[] params = {"-m", resourcePrefix + "alice.cli.test"};
132137 TikaCLI.main(params);
133 assertTrue(outContent.toString().contains("text/plain"));
138 assertTrue(outContent.toString(IOUtils.UTF_8.name()).contains("text/plain"));
134139 }
135140
136141 /**
142147 public void testJsonMetadataOutput() throws Exception {
143148 String[] params = {"--json", resourcePrefix + "testJsonMultipleInts.html"};
144149 TikaCLI.main(params);
145 String json = outContent.toString();
150 String json = outContent.toString(IOUtils.UTF_8.name());
146151 //TIKA-1310
147152 assertTrue(json.contains("\"fb:admins\":\"1,2,3,4\","));
148153
155160 }
156161
157162 /**
163 * Test for -json with prettyprint option
164 *
165 * @throws Exception
166 */
167 @Test
168 public void testJsonMetadataPrettyPrintOutput() throws Exception {
169 String[] params = {"--json", "-r", resourcePrefix + "testJsonMultipleInts.html"};
170 TikaCLI.main(params);
171 String json = outContent.toString(IOUtils.UTF_8.name());
172
173 assertTrue(json.contains(" \"X-Parsed-By\": [\n" +
174 " \"org.apache.tika.parser.DefaultParser\",\n" +
175 " \"org.apache.tika.parser.html.HtmlParser\"\n" +
176 " ],\n"));
177 //test legacy alphabetic sort of keys
178 int enc = json.indexOf("\"Content-Encoding\"");
179 int fb = json.indexOf("fb:admins");
180 int title = json.indexOf("\"title\"");
181 assertTrue(enc > -1 && fb > -1 && enc < fb);
182 assertTrue (fb > -1 && title > -1 && fb < title);
183 }
184
185 /**
158186 * Tests -l option of the cli
159187 *
160188 * @throws Exception
163191 public void testLanguageOutput() throws Exception{
164192 String[] params = {"-l", resourcePrefix + "alice.cli.test"};
165193 TikaCLI.main(params);
166 assertTrue(outContent.toString().contains("en"));
194 assertTrue(outContent.toString(IOUtils.UTF_8.name()).contains("en"));
167195 }
168196
169197 /**
175203 public void testDetectOutput() throws Exception{
176204 String[] params = {"-d", resourcePrefix + "alice.cli.test"};
177205 TikaCLI.main(params);
178 assertTrue(outContent.toString().contains("text/plain"));
206 assertTrue(outContent.toString(IOUtils.UTF_8.name()).contains("text/plain"));
179207 }
180208
181209 /**
187215 public void testListMetModels() throws Exception{
188216 String[] params = {"--list-met-models", resourcePrefix + "alice.cli.test"};
189217 TikaCLI.main(params);
190 assertTrue(outContent.toString().contains("text/plain"));
218 assertTrue(outContent.toString(IOUtils.UTF_8.name()).contains("text/plain"));
191219 }
192220
193221 /**
199227 public void testListSupportedTypes() throws Exception{
200228 String[] params = {"--list-supported-types", resourcePrefix + "alice.cli.test"};
201229 TikaCLI.main(params);
202 assertTrue(outContent.toString().contains("supertype: application/octet-stream"));
230 assertTrue(outContent.toString(IOUtils.UTF_8.name()).contains("supertype: application/octet-stream"));
203231 }
204232
205233 /**
224252
225253 TikaCLI.main(params);
226254
255 StringBuffer allFiles = new StringBuffer();
256 for (String f : tempFile.list()) {
257 if (allFiles.length() > 0) allFiles.append(" : ");
258 allFiles.append(f);
259 }
260
227261 // ChemDraw file
228 File expected1 = new File(tempFile, "MBD002B040A.cdx");
262 File expectedCDX = new File(tempFile, "MBD002B040A.cdx");
263 // Image of the ChemDraw molecule
264 File expectedIMG = new File(tempFile, "file4.png");
229265 // OLE10Native
230 File expected2 = new File(tempFile, "MBD002B0FA6_file5");
266 File expectedOLE10 = new File(tempFile, "MBD002B0FA6_file5.bin");
267 // Something that really isn't a text file; its exact type is unclear
268 File expected262FE3 = new File(tempFile, "MBD00262FE3.txt");
231269 // Image of one of the embedded resources
232 File expected3 = new File(tempFile, "file0.emf");
270 File expectedEMF = new File(tempFile, "file0.emf");
233271
234 assertTrue(expected1.exists());
235 assertTrue(expected2.exists());
236 assertTrue(expected3.exists());
237
238 assertTrue(expected1.length()>0);
239 assertTrue(expected2.length()>0);
240 assertTrue(expected3.length()>0);
272 assertExtracted(expectedCDX, allFiles.toString());
273 assertExtracted(expectedIMG, allFiles.toString());
274 assertExtracted(expectedOLE10, allFiles.toString());
275 assertExtracted(expected262FE3, allFiles.toString());
276 assertExtracted(expectedEMF, allFiles.toString());
241277 } finally {
242278 FileUtils.deleteDirectory(tempFile);
243279 }
244
280 }
281 protected static void assertExtracted(File f, String allFiles) {
282
283 assertTrue(
284 "File " + f.getName() + " not found in " + allFiles,
285 f.exists()
286 );
287
288 assertFalse(
289 "File " + f.getName() + " is a directory!", f.isDirectory()
290 );
291
292 assertTrue(
293 "File " + f.getName() + " wasn't extracted with contents",
294 f.length() > 0
295 );
245296 }
246297
247298 // TIKA-920
249300 public void testMultiValuedMetadata() throws Exception {
250301 String[] params = {"-m", resourcePrefix + "testMultipleSheets.numbers"};
251302 TikaCLI.main(params);
252 String content = outContent.toString();
303 String content = outContent.toString(IOUtils.UTF_8.name());
253304 assertTrue(content.contains("sheetNames: Checking"));
254305 assertTrue(content.contains("sheetNames: Secon sheet"));
255306 assertTrue(content.contains("sheetNames: Logical Sheet 3"));
263314 new File("subdir/foo.txt").delete();
264315 new File("subdir").delete();
265316 TikaCLI.main(params);
266 String content = outContent.toString();
317 String content = outContent.toString(IOUtils.UTF_8.name());
267318 assertTrue(content.contains("Extracting 'subdir/foo.txt'"));
268319 // clean up. TODO: These should be in target.
269320 new File("target/subdir/foo.txt").delete();
270321 new File("target/subdir").delete();
271322 }
323
324 @Test
325 public void testDefaultConfigException() throws Exception {
326 //default xml parser will throw TikaException
327 //this and TestConfig() are broken into separate tests so that
328 //setUp and tearDown() are called each time
329 String[] params = {resourcePrefix + "bad_xml.xml"};
330 boolean tikaEx = false;
331 try {
332 TikaCLI.main(params);
333 } catch (TikaException e) {
334 tikaEx = true;
335 }
336 assertTrue(tikaEx);
337 }
338
339 @Test
340 public void testConfig() throws Exception {
341 String[] params = new String[]{"--config="+testDataFile.toString()+"/tika-config1.xml", resourcePrefix+"bad_xml.xml"};
342 TikaCLI.main(params);
343 String content = outContent.toString(IOUtils.UTF_8.name());
344 assertTrue(content.contains("apple"));
345 assertTrue(content.contains("org.apache.tika.parser.html.HtmlParser"));
346 }
347
348 @Test
349 public void testJsonRecursiveMetadataParserMetadataOnly() throws Exception {
350 String[] params = new String[]{"-m", "-J", "-r", resourcePrefix+"test_recursive_embedded.docx"};
351 TikaCLI.main(params);
352 String content = outContent.toString(IOUtils.UTF_8.name());
353 assertTrue(content.contains("[\n" +
354 " {\n" +
355 " \"Application-Name\": \"Microsoft Office Word\",\n" +
356 " \"Application-Version\": \"15.0000\",\n" +
357 " \"Character Count\": \"28\",\n" +
358 " \"Character-Count-With-Spaces\": \"31\","));
359 assertTrue(content.contains("\"X-TIKA:embedded_resource_path\": \"test_recursive_embedded.docx/embed1.zip\""));
360 assertFalse(content.contains("X-TIKA:content"));
361
362 }
363
364 @Test
365 public void testJsonRecursiveMetadataParserDefault() throws Exception {
366 String[] params = new String[]{"-J", "-r", resourcePrefix+"test_recursive_embedded.docx"};
367 TikaCLI.main(params);
368 String content = outContent.toString(IOUtils.UTF_8.name());
369 assertTrue(content.contains("\"X-TIKA:content\": \"\\u003chtml xmlns\\u003d\\\"http://www.w3.org/1999/xhtml"));
370 }
371
372 @Test
373 public void testJsonRecursiveMetadataParserText() throws Exception {
374 String[] params = new String[]{"-J", "-r", "-t", resourcePrefix+"test_recursive_embedded.docx"};
375 TikaCLI.main(params);
376 String content = outContent.toString(IOUtils.UTF_8.name());
377 assertTrue(content.contains("\\n\\nembed_4\\n"));
378 assertTrue(content.contains("\\n\\nembed_0"));
379 }
272380 }
0 <?xml version="1.0" encoding="UTF-8"?>
1 <project xmlns="http://maven.apache.org/POM/4.0.0" xmlns:xsi="http://www.w3.org/2001/XMLSchema-instance" xsi:schemaLocation="http://maven.apache.org/POM/4.0.0 http://maven.apache.org/xsd/maven-4.0.0.xsd">
2
3 <modelVersion>4.0.0</modelVersion>
4 <properties>
5 <cli.version>1.2</cli.version> <!--sync version with tika-server or move to parent? -->
6 <compress.version>1.9</compress.version> <!-- sync with tika-parsers or move to parent? -->
7
8 </properties>
9
10 <parent>
11 <groupId>org.apache.tika</groupId>
12 <artifactId>tika-parent</artifactId>
13 <version>1.8</version>
14 <relativePath>../tika-parent/pom.xml</relativePath>
15 </parent>
16
17 <artifactId>tika-batch</artifactId>
18 <packaging>bundle</packaging>
19 <name>Apache Tika batch</name>
20 <url>http://tika.apache.org/</url>
21
22 <dependencies>
23 <dependency>
24 <groupId>${project.groupId}</groupId>
25 <artifactId>tika-core</artifactId>
26 <version>${project.version}</version>
27 </dependency>
28 <dependency>
29 <groupId>${project.groupId}</groupId>
30 <artifactId>tika-serialization</artifactId>
31 <version>${project.version}</version>
32 </dependency>
33 <dependency>
34 <groupId>org.apache.commons</groupId>
35 <artifactId>commons-compress</artifactId>
36 <version>${compress.version}</version>
37 </dependency>
38 <dependency>
39 <groupId>org.slf4j</groupId>
40 <artifactId>slf4j-log4j12</artifactId>
41 </dependency>
42 <dependency>
43 <groupId>commons-cli</groupId>
44 <artifactId>commons-cli</artifactId>
45 <version>${cli.version}</version>
46 </dependency>
47 <dependency>
48 <groupId>org.apache.tika</groupId>
49 <artifactId>tika-core</artifactId>
50 <version>${project.version}</version>
51 <type>test-jar</type>
52 <scope>test</scope>
53 </dependency>
54 <dependency>
55 <groupId>org.apache.tika</groupId>
56 <artifactId>tika-parsers</artifactId>
57 <version>${project.version}</version>
58 <type>test-jar</type>
59 <scope>test</scope>
60 </dependency>
61 <dependency>
62 <groupId>junit</groupId>
63 <artifactId>junit</artifactId>
64 <scope>test</scope>
65 </dependency>
66 <dependency>
67 <groupId>commons-io</groupId>
68 <artifactId>commons-io</artifactId>
69 <scope>test</scope>
70 <version>2.1</version>
71 </dependency>
72
73
74 </dependencies>
75 <build>
76 <plugins>
77 <plugin>
78 <artifactId>maven-remote-resources-plugin</artifactId>
79 <version>1.5</version>
80 <executions>
81 <execution>
82 <goals>
83 <goal>bundle</goal>
84 </goals>
85 </execution>
86 </executions>
87 <configuration>
88 <includes>
89 <include>**/*.xml</include>
90 </includes>
91 </configuration>
92 </plugin>
93
94 <plugin>
95 <groupId>org.apache.felix</groupId>
96 <artifactId>maven-bundle-plugin</artifactId>
97 <extensions>true</extensions>
98 <configuration>
99 <instructions>
100 <Bundle-DocURL>${project.url}</Bundle-DocURL>
101 <Bundle-Activator>
102 org.apache.tika.config.TikaActivator
103 </Bundle-Activator>
104 <Bundle-ActivationPolicy>lazy</Bundle-ActivationPolicy>
105 </instructions>
106 </configuration>
107 </plugin>
108 <plugin>
109 <groupId>org.apache.rat</groupId>
110 <artifactId>apache-rat-plugin</artifactId>
111 <configuration>
112 <excludes>
113 <exclude>src/test/resources/org/apache/tika/**</exclude>
114 </excludes>
115 </configuration>
116 </plugin>
117 <plugin>
118 <groupId>org.apache.maven.plugins</groupId>
119 <artifactId>maven-jar-plugin</artifactId>
120 <executions>
121 <execution>
122 <goals>
123 <goal>test-jar</goal>
124 </goals>
125 </execution>
126 </executions>
127 </plugin>
128 <plugin>
129 <artifactId>maven-failsafe-plugin</artifactId>
130 <version>2.10</version>
131 <configuration>
132 <additionalClasspathElements>
133 <additionalClasspathElement>
134 ${project.build.directory}/${project.build.finalName}.jar
135 </additionalClasspathElement>
136 </additionalClasspathElements>
137 </configuration>
138 <executions>
139 <execution>
140 <goals>
141 <goal>integration-test</goal>
142 <goal>verify</goal>
143 </goals>
144 </execution>
145 </executions>
146 </plugin>
147 </plugins>
148 </build>
149
150
151
152 <organization>
153 <name>The Apache Software Foundation</name>
154 <url>http://www.apache.org</url>
155 </organization>
156 <scm>
157 <url>http://svn.apache.org/viewvc/tika/tags/1.8-rc2/tika-batch</url>
158 <connection>scm:svn:http://svn.apache.org/repos/asf/tika/tags/1.8-rc2/tika-batch</connection>
159 <developerConnection>scm:svn:https://svn.apache.org/repos/asf/tika/tags/1.8-rc2/tika-batch</developerConnection>
160 </scm>
161 <issueManagement>
162 <system>JIRA</system>
163 <url>https://issues.apache.org/jira/browse/TIKA</url>
164 </issueManagement>
165 <ciManagement>
166 <system>Jenkins</system>
167 <url>https://builds.apache.org/job/Tika-trunk/</url>
168 </ciManagement>
169
170
171 </project>
0 package org.apache.tika.batch;
1
2 /*
3 * Licensed to the Apache Software Foundation (ASF) under one or more
4 * contributor license agreements. See the NOTICE file distributed with
5 * this work for additional information regarding copyright ownership.
6 * The ASF licenses this file to You under the Apache License, Version 2.0
7 * (the "License"); you may not use this file except in compliance with
8 * the License. You may obtain a copy of the License at
9 *
10 * http://www.apache.org/licenses/LICENSE-2.0
11 *
12 * Unless required by applicable law or agreed to in writing, software
13 * distributed under the License is distributed on an "AS IS" BASIS,
14 * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
15 * See the License for the specific language governing permissions and
16 * limitations under the License.
17 */
18
19 import org.apache.tika.config.TikaConfig;
20 import org.apache.tika.parser.AutoDetectParser;
21 import org.apache.tika.parser.Parser;
22
23 /**
24 * Simple ParserFactory that builds an AutoDetectParser from a TikaConfig.
25 */
26 public class AutoDetectParserFactory implements ParserFactory {
27
28 @Override
29 public Parser getParser(TikaConfig config) {
30 return new AutoDetectParser(config);
31 }
32
33 }
0 package org.apache.tika.batch;
1 /*
2 * Licensed to the Apache Software Foundation (ASF) under one or more
3 * contributor license agreements. See the NOTICE file distributed with
4 * this work for additional information regarding copyright ownership.
5 * The ASF licenses this file to You under the Apache License, Version 2.0
6 * (the "License"); you may not use this file except in compliance with
7 * the License. You may obtain a copy of the License at
8 *
9 * http://www.apache.org/licenses/LICENSE-2.0
10 *
11 * Unless required by applicable law or agreed to in writing, software
12 * distributed under the License is distributed on an "AS IS" BASIS,
13 * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
14 * See the License for the specific language governing permissions and
15 * limitations under the License.
16 */
17
18 /**
19 * FileResourceConsumers should throw this if something
20 * catastrophic has happened and the BatchProcess should shut down
21 * and not be restarted.
22 *
23 */
24 public class BatchNoRestartError extends Error {
25
26 public BatchNoRestartError(Throwable t) {
27 super(t);
28 }
29 public BatchNoRestartError(String message) {
30 super(message);
31 }
32 }
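A watchdog that restarts a failed batch child needs a way to tell a restartable failure apart from a catastrophic one, which is the contract `BatchNoRestartError` expresses. The sketch below shows that contract with a local stand-in error class; the real restart logic lives in BatchProcessDriverCLI and is more involved, so all names here are illustrative:

```java
public class WatchdogSketch {
    // Local stand-in for org.apache.tika.batch.BatchNoRestartError.
    static class NoRestartError extends Error {
        NoRestartError(String message) { super(message); }
    }

    // Run the task once; return true if the watchdog should restart it.
    public static boolean runOnce(Runnable task) {
        try {
            task.run();
            return false;        // completed normally, nothing to restart
        } catch (NoRestartError e) {
            return false;        // catastrophic: shut down, do not restart
        } catch (RuntimeException e) {
            return true;         // ordinary failure: eligible for restart
        }
    }

    public static void main(String[] args) {
        if (!runOnce(() -> { throw new RuntimeException("transient"); }))
            throw new AssertionError();
        if (runOnce(() -> { throw new NoRestartError("disk gone"); }))
            throw new AssertionError();
        System.out.println("ok");
    }
}
```

Extending `Error` rather than `RuntimeException` is what lets the signal bypass ordinary `catch (Exception e)` recovery paths in consumers.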
0 package org.apache.tika.batch;
1
2 /*
3 * Licensed to the Apache Software Foundation (ASF) under one or more
4 * contributor license agreements. See the NOTICE file distributed with
5 * this work for additional information regarding copyright ownership.
6 * The ASF licenses this file to You under the Apache License, Version 2.0
7 * (the "License"); you may not use this file except in compliance with
8 * the License. You may obtain a copy of the License at
9 *
10 * http://www.apache.org/licenses/LICENSE-2.0
11 *
12 * Unless required by applicable law or agreed to in writing, software
13 * distributed under the License is distributed on an "AS IS" BASIS,
14 * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
15 * See the License for the specific language governing permissions and
16 * limitations under the License.
17 */
18
19 import java.io.IOException;
20 import java.io.PrintStream;
21 import java.util.Date;
22 import java.util.List;
23 import java.util.concurrent.ArrayBlockingQueue;
24 import java.util.concurrent.Callable;
25 import java.util.concurrent.CompletionService;
26 import java.util.concurrent.ExecutionException;
27 import java.util.concurrent.ExecutorCompletionService;
28 import java.util.concurrent.ExecutorService;
29 import java.util.concurrent.Executors;
30 import java.util.concurrent.Future;
31 import java.util.concurrent.TimeUnit;
32
33 import org.apache.tika.io.IOUtils;
34 import org.slf4j.Logger;
35 import org.slf4j.LoggerFactory;
36
37
38 /**
39 * This is the main processor class for a single process.
40 * This class can only be run once.
41 * <p/>
42 * It requires a {@link FileResourceCrawler} and {@link FileResourceConsumer}s, and it can also
43 * support a {@link StatusReporter} and an {@link Interrupter}.
44 * <p/>
45 * This is designed to shut down if a parser has timed out or if there is
46 * an OutOfMemoryError. Consider using {@link BatchProcessDriverCLI}
47 * as a daemon/watchdog that monitors and can restart this batch process.
48 * <p>
49 * Note that this class redirects stderr to stdout so that it can
50 * communicate with the parent process over stderr without interference.
51 */
52 public class BatchProcess implements Callable<ParallelFileProcessingResult> {
53
54 public enum BATCH_CONSTANTS {
55 BATCH_PROCESS_EXCEEDED_MAX_ALIVE_TIME,
56 BATCH_PROCESS_FATAL_MUST_RESTART
57 }
58
59 private enum CAUSE_FOR_TERMINATION {
60 COMPLETED_NORMALLY,
61 MAIN_LOOP_EXCEPTION_NO_RESTART,
62 CONSUMERS_MANAGER_DIDNT_INIT_IN_TIME_NO_RESTART,
63 MAIN_LOOP_EXCEPTION,
64 CRAWLER_TIMED_OUT,
65 TIMED_OUT_CONSUMER,
66 USER_INTERRUPTION,
67 BATCH_PROCESS_ALIVE_TOO_LONG,
68 }
69
70 private static final Logger logger;
71 static {
72 logger = LoggerFactory.getLogger(BatchProcess.class);
73 }
74
75 private PrintStream outputStreamWriter;
76
77 // If a file hasn't been processed in this amount of time,
78 // report it to the console. When the directory crawler has stopped, the thread will
79 // be terminated and the file name will be logged
80 private long timeoutThresholdMillis = 5 * 60 * 1000; // 5 minutes
81
82 private long timeoutCheckPulseMillis = 2 * 60 * 1000; //2 minutes
83 //if there was an early termination via the Interrupter
84 //or because of an uncaught runtime throwable, pause
85 //this long before shutting down to allow parsers to finish
86 private long pauseOnEarlyTerminationMillis = 30*1000; //30 seconds
87
88 private final long consumersManagerMaxMillis;
89
90 //maximum time that this process should stay alive;
91 //to avoid potential memory leaks, it is not a bad idea to shut down
92 //every hour or so.
93 private int maxAliveTimeSeconds = -1;
94
95 private final FileResourceCrawler fileResourceCrawler;
96
97 private final ConsumersManager consumersManager;
98
99 private final StatusReporter reporter;
100
101 private final Interrupter interrupter;
102
103 private final ArrayBlockingQueue<FileStarted> timedOuts;
104
105 private boolean alreadyExecuted = false;
106
107 public BatchProcess(FileResourceCrawler fileResourceCrawler,
108 ConsumersManager consumersManager,
109 StatusReporter reporter,
110 Interrupter interrupter) {
111 this.fileResourceCrawler = fileResourceCrawler;
112 this.consumersManager = consumersManager;
113 this.reporter = reporter;
114 this.interrupter = interrupter;
115 timedOuts = new ArrayBlockingQueue<FileStarted>(consumersManager.getConsumers().size());
116 this.consumersManagerMaxMillis = consumersManager.getConsumersManagerMaxMillis();
117 }
118
119 /**
120 * Runs main execution loop.
121 * <p>
122 * Redirects stderr to stdout to keep communications
123 * over stderr with the parent process clean
124 * @return result of the processing
125 * @throws InterruptedException
126 */
127 public ParallelFileProcessingResult call()
128 throws InterruptedException {
129 if (alreadyExecuted) {
130 throw new IllegalStateException("Can only execute BatchProcess once.");
131 }
132 //redirect streams; all organic warnings should go to System.err;
133 //System.err should be redirected to System.out
134 PrintStream sysErr = System.err;
135 try {
136 outputStreamWriter = new PrintStream(sysErr, true, IOUtils.UTF_8.toString());
137 } catch (IOException e) {
138 throw new RuntimeException("Can't redirect streams", e);
139 }
140 System.setErr(System.out);
141
142 ParallelFileProcessingResult result = null;
143 try {
144 int numConsumers = consumersManager.getConsumers().size();
145 // fileResourceCrawler, statusReporter, the Interrupter, timeoutChecker
146 int numNonConsumers = 4;
147
148 ExecutorService ex = Executors.newFixedThreadPool(numConsumers
149 + numNonConsumers);
150 CompletionService<IFileProcessorFutureResult> completionService =
151 new ExecutorCompletionService<IFileProcessorFutureResult>(
152 ex);
153 TimeoutChecker timeoutChecker = new TimeoutChecker();
154
155 try {
156 startConsumersManager();
157 } catch (BatchNoRestartError e) {
158 return new
159 ParallelFileProcessingResult(0, 0, 0, 0,
160 0, BatchProcessDriverCLI.PROCESS_NO_RESTART_EXIT_CODE,
161 CAUSE_FOR_TERMINATION.CONSUMERS_MANAGER_DIDNT_INIT_IN_TIME_NO_RESTART.toString());
162
163 }
164
165 State state = mainLoop(completionService, timeoutChecker);
166 result = shutdown(ex, completionService, timeoutChecker, state);
167 } finally {
168 shutdownConsumersManager();
169 }
170 return result;
171 }
172
173
174 private State mainLoop(CompletionService<IFileProcessorFutureResult> completionService,
175 TimeoutChecker timeoutChecker) {
176 alreadyExecuted = true;
177 State state = new State();
178 logger.info("BatchProcess starting up");
179
180
181 state.start = new Date().getTime();
182 completionService.submit(interrupter);
183 completionService.submit(fileResourceCrawler);
184 completionService.submit(reporter);
185 completionService.submit(timeoutChecker);
186
187
188 for (FileResourceConsumer consumer : consumersManager.getConsumers()) {
189 completionService.submit(consumer);
190 }
191
192 state.numConsumers = consumersManager.getConsumers().size();
193 CAUSE_FOR_TERMINATION causeForTermination = null;
194 //main processing loop
195 while (true) {
196 try {
197 Future<IFileProcessorFutureResult> futureResult =
198 completionService.poll(1, TimeUnit.SECONDS);
199
200 if (futureResult != null) {
201 state.removed++;
202 IFileProcessorFutureResult result = futureResult.get();
203 if (result instanceof FileConsumerFutureResult) {
204 state.consumersRemoved++;
205 } else if (result instanceof FileResourceCrawlerFutureResult) {
206 state.crawlersRemoved++;
207 if (fileResourceCrawler.wasTimedOut()) {
208 causeForTermination = CAUSE_FOR_TERMINATION.CRAWLER_TIMED_OUT;
209 break;
210 }
211 } else if (result instanceof InterrupterFutureResult) {
212 causeForTermination = CAUSE_FOR_TERMINATION.USER_INTERRUPTION;
213 break;
214 } else if (result instanceof TimeoutFutureResult) {
215 causeForTermination = CAUSE_FOR_TERMINATION.TIMED_OUT_CONSUMER;
216 break;
217 } //only thing left should be StatusReporterResult
218 }
219
220 if (state.consumersRemoved >= state.numConsumers) {
221 causeForTermination = CAUSE_FOR_TERMINATION.COMPLETED_NORMALLY;
222 break;
223 }
224 if (aliveTooLong(state.start)) {
225 causeForTermination = CAUSE_FOR_TERMINATION.BATCH_PROCESS_ALIVE_TOO_LONG;
226 break;
227 }
228 } catch (Throwable e) {
229 if (isNonRestart(e)) {
230 causeForTermination = CAUSE_FOR_TERMINATION.MAIN_LOOP_EXCEPTION_NO_RESTART;
231 } else {
232 causeForTermination = CAUSE_FOR_TERMINATION.MAIN_LOOP_EXCEPTION;
233 }
234 logger.error("Main loop execution exception: " + e.getMessage(), e);
235 break;
236 }
237 }
238 state.causeForTermination = causeForTermination;
239 return state;
240 }
241
242 private ParallelFileProcessingResult shutdown(ExecutorService ex,
243 CompletionService<IFileProcessorFutureResult> completionService,
244 TimeoutChecker timeoutChecker, State state) {
245
246 reporter.setIsShuttingDown(true);
247 int added = fileResourceCrawler.getAdded();
248 int considered = fileResourceCrawler.getConsidered();
249
250 //TODO: figure out a safe way to shut down the resource crawler
251 //if it hasn't finished. Does it need to add poison at this point?
252 //fileResourceCrawler.pleaseShutdown();
253
254 //Step 1: prevent uncalled threads from being started
255 ex.shutdown();
256
257 //Step 2: ask consumers to shutdown politely.
258 //Under normal circumstances, they should all have completed by now.
259 for (FileResourceConsumer consumer : consumersManager.getConsumers()) {
260 consumer.pleaseShutdown();
261 }
262 //The resourceCrawler should shutdown now. No need for poison.
263 fileResourceCrawler.shutDownNoPoison();
264 //if there are any active/asked to shutdown consumers, await termination
265 //this can happen if a user interrupts the process
266 //or if the crawler stops early, or ...
267 politelyAwaitTermination(state.causeForTermination);
268
269 //Step 3: Gloves come off. We've tried to ask kindly before.
270 //Now it is time to shut down. This will corrupt
271 //nio channels via thread interrupts! Hopefully, everything
272 //has shut down by now.
273 logger.trace("About to shutdownNow()");
274 List<Runnable> neverCalled = ex.shutdownNow();
275 logger.trace("TERMINATED " + ex.isTerminated() + " : "
276 + state.consumersRemoved + " : " + state.crawlersRemoved);
277
278 int end = state.numConsumers + state.numNonConsumers - state.removed - neverCalled.size();
279
280 for (int t = 0; t < end; t++) {
281 Future<IFileProcessorFutureResult> future = null;
282 try {
283 future = completionService.poll(10, TimeUnit.MILLISECONDS);
284 } catch (InterruptedException e) {
285 logger.warn("thread interrupt while polling in final shutdown loop");
286 break;
287 }
288 logger.trace("Polling completionService in final shutdown loop");
289 if (future == null) {
290 break;
291 }
292 try {
293 IFileProcessorFutureResult result = future.get();
294 if (result instanceof FileConsumerFutureResult) {
295 FileConsumerFutureResult consumerResult = (FileConsumerFutureResult) result;
296 FileStarted fileStarted = consumerResult.getFileStarted();
297 if (fileStarted != null
298 && fileStarted.getElapsedMillis() > timeoutThresholdMillis) {
299 logger.warn(fileStarted.getResourceId()
300 + "\t caused a file processor to hang or crash. You may need to remove "
301 + "this file from your input set and rerun.");
302 }
303 } else if (result instanceof FileResourceCrawlerFutureResult) {
304 FileResourceCrawlerFutureResult crawlerResult = (FileResourceCrawlerFutureResult) result;
305 considered += crawlerResult.getConsidered();
306 added += crawlerResult.getAdded();
307 } //else ...we don't care about anything else stopping at this point
308 } catch (ExecutionException e) {
309 logger.error("Execution exception trying to shutdown after shutdownNow:" + e.getMessage());
310 } catch (InterruptedException e) {
311 logger.error("Interrupted exception trying to shutdown after shutdownNow:" + e.getMessage());
312 }
313 }
314 //do we need to restart?
315 String restartMsg = null;
316 if (state.causeForTermination == CAUSE_FOR_TERMINATION.USER_INTERRUPTION
317 || state.causeForTermination == CAUSE_FOR_TERMINATION.MAIN_LOOP_EXCEPTION_NO_RESTART) {
318 //do not restart!!!
319 } else if (state.causeForTermination == CAUSE_FOR_TERMINATION.MAIN_LOOP_EXCEPTION) {
320 restartMsg = "Uncaught consumer throwable";
321 } else if (state.causeForTermination == CAUSE_FOR_TERMINATION.TIMED_OUT_CONSUMER) {
322 if (areResourcesPotentiallyRemaining()) {
323 restartMsg = "Consumer timed out with resources remaining";
324 }
325 } else if (state.causeForTermination == CAUSE_FOR_TERMINATION.BATCH_PROCESS_ALIVE_TOO_LONG) {
326 restartMsg = BATCH_CONSTANTS.BATCH_PROCESS_EXCEEDED_MAX_ALIVE_TIME.toString();
327 } else if (state.causeForTermination == CAUSE_FOR_TERMINATION.CRAWLER_TIMED_OUT) {
328 restartMsg = "Crawler timed out.";
329 } else if (fileResourceCrawler.wasTimedOut()) {
330 restartMsg = "Crawler was timed out.";
331 } else if (fileResourceCrawler.isActive()) {
332 restartMsg = "Crawler is still active.";
333 } else if (! fileResourceCrawler.isQueueEmpty()) {
334 restartMsg = "Resources still exist for processing";
335 }
336
337 int exitStatus = getExitStatus(state.causeForTermination, restartMsg);
338
339 //need to re-check, report, mark timed out consumers
340 timeoutChecker.checkForTimedOutConsumers();
341
342 for (FileStarted fs : timedOuts) {
343 logger.warn("A parser was still working on >" + fs.getResourceId() +
344 "< for " + fs.getElapsedMillis() + " milliseconds after it started." +
345 " This exceeds the timeoutThresholdMillis parameter");
346 }
347 double elapsed = ((double) new Date().getTime() - (double) state.start) / 1000.0;
348 int processed = 0;
349 int numExceptions = 0;
350 for (FileResourceConsumer c : consumersManager.getConsumers()) {
351 processed += c.getNumResourcesConsumed();
352 numExceptions += c.getNumHandledExceptions();
353 }
354 return new
355 ParallelFileProcessingResult(considered, added, processed, numExceptions,
356 elapsed, exitStatus, state.causeForTermination.toString());
357 }
358
359 private class State {
360 long start = -1;
361 int numConsumers = 0;
362 int numNonConsumers = 0;
363 int removed = 0;
364 int consumersRemoved = 0;
365 int crawlersRemoved = 0;
366 CAUSE_FOR_TERMINATION causeForTermination = null;
367 }
368
369 private void startConsumersManager() {
370 if (consumersManagerMaxMillis < 0) {
371 consumersManager.init();
372 return;
373 }
374 Thread timed = new Thread() {
375 public void run() {
376 logger.trace("about to start consumers manager");
377 consumersManager.init();
378 logger.trace("finished starting consumers manager");
379 }
380 };
381 //don't allow this thread to keep process alive
382 timed.setDaemon(true);
383 timed.start();
384 try {
385 timed.join(consumersManagerMaxMillis);
386 } catch (InterruptedException e) {
387 logger.warn("interruption exception during consumers manager startup");
388 }
389 if (timed.isAlive()) {
390 logger.error("ConsumersManager did not start within " + consumersManagerMaxMillis + "ms");
391 throw new BatchNoRestartError("ConsumersManager did not start within "+consumersManagerMaxMillis+"ms");
392 }
393 }
394
395 private void shutdownConsumersManager() {
396 if (consumersManagerMaxMillis < 0) {
397 consumersManager.shutdown();
398 return;
399 }
400 Thread timed = new Thread() {
401 public void run() {
402 logger.trace("starting to shutdown consumers manager");
403 consumersManager.shutdown();
404 logger.trace("finished shutting down consumers manager");
405 }
406 };
407 timed.setDaemon(true);
408 timed.start();
409 try {
410 timed.join(consumersManagerMaxMillis);
411 } catch (InterruptedException e) {
412 logger.warn("interruption exception during consumers manager shutdown");
413 }
414 if (timed.isAlive()) {
415 logger.error("ConsumersManager was still alive during shutdown!");
416 throw new BatchNoRestartError("ConsumersManager did not shutdown within: "+
417 consumersManagerMaxMillis+"ms");
418 }
419 }
420
421 /**
422 * This is used instead of shutdownNow() + awaitTermination(), which would
423 * interrupt the threads and then wait for their termination. This politely waits.
424 *
425 * @param causeForTermination reason for termination.
426 */
427 private void politelyAwaitTermination(CAUSE_FOR_TERMINATION causeForTermination) {
428 if (causeForTermination == CAUSE_FOR_TERMINATION.COMPLETED_NORMALLY) {
429 return;
430 }
431 long start = new Date().getTime();
432 while (countActiveConsumers() > 0) {
433 try {
434 Thread.sleep(500);
435 } catch (InterruptedException e) {
436 logger.warn("Thread interrupted while trying to politelyAwaitTermination");
437 return;
438 }
439 long elapsed = new Date().getTime()-start;
440 if (pauseOnEarlyTerminationMillis > -1 &&
441 elapsed > pauseOnEarlyTerminationMillis) {
442 logger.warn("I waited after an early termination for "+
443 elapsed + " milliseconds, but there was at least one active consumer");
444 return;
445 }
446 }
447 }
448
449 private boolean isNonRestart(Throwable e) {
450 if (e instanceof BatchNoRestartError) {
451 return true;
452 }
453 Throwable cause = e.getCause();
454 return cause != null && isNonRestart(cause);
455 }
456
457 private int getExitStatus(CAUSE_FOR_TERMINATION causeForTermination, String restartMsg) {
458 if (causeForTermination == CAUSE_FOR_TERMINATION.MAIN_LOOP_EXCEPTION_NO_RESTART) {
459 logger.info(CAUSE_FOR_TERMINATION.MAIN_LOOP_EXCEPTION_NO_RESTART.name());
460 return BatchProcessDriverCLI.PROCESS_NO_RESTART_EXIT_CODE;
461 }
462
463 if (restartMsg != null) {
464 if (restartMsg.equals(BATCH_CONSTANTS.BATCH_PROCESS_EXCEEDED_MAX_ALIVE_TIME.toString())) {
465 logger.warn(restartMsg);
466 } else {
467 logger.error(restartMsg);
468 }
469
470 //send over stdout wrapped in outputStreamWriter
471 outputStreamWriter.println(
472 BATCH_CONSTANTS.BATCH_PROCESS_FATAL_MUST_RESTART.toString() +
473 " >> " + restartMsg);
474 outputStreamWriter.flush();
475 return BatchProcessDriverCLI.PROCESS_RESTART_EXIT_CODE;
476 }
477 return 0;
478 }
479
480 //could new FileResources be consumed from the Queue?
481 //Because of race conditions, this can return a true
482 //when the real answer is false.
483 //This should never return false, though, if the answer is true!
484 private boolean areResourcesPotentiallyRemaining() {
485 if (fileResourceCrawler.isActive()) {
486 return true;
487 }
488 return !fileResourceCrawler.isQueueEmpty();
489 }
490
491 private boolean aliveTooLong(long started) {
492 if (maxAliveTimeSeconds < 0) {
493 return false;
494 }
495 double elapsedSeconds = (double) (new Date().getTime() - started) / (double) 1000;
496 return elapsedSeconds > (double) maxAliveTimeSeconds;
497 }
498
499 //snapshot of non-retired consumers; actual number may be smaller by the time
500 //this returns a value!
501 private int countActiveConsumers() {
502 int active = 0;
503 for (FileResourceConsumer consumer : consumersManager.getConsumers()) {
504 if (consumer.isStillActive()) {
505 active++;
506 }
507 }
508 return active;
509 }
510
511 /**
512 * If there is an early termination via an interrupt or too many timed out consumers
513 * or because a consumer or other Runnable threw a Throwable, pause this long
514 * before killing the consumers and other threads.
515 *
516 * Typically makes sense for this to be the same or slightly larger than
517 * timeoutThresholdMillis
518 *
519 * @param pauseOnEarlyTerminationMillis how long to pause if there is an early termination
520 */
521 public void setPauseOnEarlyTerminationMillis(long pauseOnEarlyTerminationMillis) {
522 this.pauseOnEarlyTerminationMillis = pauseOnEarlyTerminationMillis;
523 }
524
525 /**
526 * The amount of time allowed before a consumer should be timed out.
527 *
528 * @param timeoutThresholdMillis threshold in milliseconds before declaring a consumer timed out
529 */
530 public void setTimeoutThresholdMillis(long timeoutThresholdMillis) {
531 this.timeoutThresholdMillis = timeoutThresholdMillis;
532 }
533
534 public void setTimeoutCheckPulseMillis(long timeoutCheckPulseMillis) {
535 this.timeoutCheckPulseMillis = timeoutCheckPulseMillis;
536 }
537
538 /**
539 * The maximum amount of time that this process can be alive. To avoid
540 * memory leaks, it is sometimes beneficial to shutdown (and restart) the
541 * process periodically.
542 * <p/>
543 * If the value is &lt; 0, the process will run until completion, interruption or exception.
544 *
545 * @param maxAliveTimeSeconds maximum amount of time in seconds to remain alive
546 */
547 public void setMaxAliveTimeSeconds(int maxAliveTimeSeconds) {
548 this.maxAliveTimeSeconds = maxAliveTimeSeconds;
549 }
550
551 private class TimeoutChecker implements Callable<IFileProcessorFutureResult> {
552
553 @Override
554 public TimeoutFutureResult call() throws Exception {
555 while (timedOuts.size() == 0) {
556 try {
557 Thread.sleep(timeoutCheckPulseMillis);
558 } catch (InterruptedException e) {
559 logger.debug("Thread interrupted exception in TimeoutChecker");
560 //just stop.
561 break;
562 }
563 checkForTimedOutConsumers();
564 if (countActiveConsumers() == 0) {
565 logger.info("No activeConsumers in TimeoutChecker");
566 break;
567 }
568 }
569 logger.debug("TimeoutChecker quitting: " + timedOuts.size());
570 return new TimeoutFutureResult(timedOuts.size());
571 }
572
573 private void checkForTimedOutConsumers() {
574 for (FileResourceConsumer consumer : consumersManager.getConsumers()) {
575 FileStarted fs = consumer.checkForTimedOutMillis(timeoutThresholdMillis);
576 if (fs != null) {
577 timedOuts.add(fs);
578 }
579 }
580 }
581 }
582
583 private class TimeoutFutureResult implements IFileProcessorFutureResult {
584 //used to be used when more than one timeout was allowed
585 //TODO: get rid of this?
586 private final int timedOutCount;
587
588 private TimeoutFutureResult(final int timedOutCount) {
589 this.timedOutCount = timedOutCount;
590 }
591
592 protected int getTimedOutCount() {
593 return timedOutCount;
594 }
595 }
596 }
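
BatchProcess bounds `ConsumersManager.init()` and `shutdown()` by running them on a daemon thread and joining with a timeout (see `startConsumersManager`/`shutdownConsumersManager` above). The pattern can be sketched on its own; the helper name below is illustrative, not Tika API.

```java
// Sketch: join-with-timeout on a daemon thread, the pattern BatchProcess
// uses to keep a slow init()/shutdown() from hanging the whole process.
public class TimedInitSketch {

    // runs task on a daemon thread for at most maxMillis;
    // returns false if the task did not finish in time
    static boolean runWithTimeout(Runnable task, long maxMillis)
            throws InterruptedException {
        Thread timed = new Thread(task);
        timed.setDaemon(true);   // don't let this thread keep the JVM alive
        timed.start();
        timed.join(maxMillis);   // wait at most maxMillis for completion
        return !timed.isAlive(); // still alive => timed out
    }

    public static void main(String[] args) throws InterruptedException {
        boolean fast = runWithTimeout(() -> { /* quick init */ }, 1000);
        boolean slow = runWithTimeout(() -> {
            try { Thread.sleep(5000); } catch (InterruptedException e) { }
        }, 100);
        System.out.println(fast); // true
        System.out.println(slow); // false
    }
}
```

In BatchProcess, a `false` result corresponds to throwing `BatchNoRestartError`, since a hung ConsumersManager is treated as unrecoverable.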
0 package org.apache.tika.batch;
1 /*
2 * Licensed to the Apache Software Foundation (ASF) under one or more
3 * contributor license agreements. See the NOTICE file distributed with
4 * this work for additional information regarding copyright ownership.
5 * The ASF licenses this file to You under the Apache License, Version 2.0
6 * (the "License"); you may not use this file except in compliance with
7 * the License. You may obtain a copy of the License at
8 *
9 * http://www.apache.org/licenses/LICENSE-2.0
10 *
11 * Unless required by applicable law or agreed to in writing, software
12 * distributed under the License is distributed on an "AS IS" BASIS,
13 * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
14 * See the License for the specific language governing permissions and
15 * limitations under the License.
16 */
17
18 import java.io.BufferedInputStream;
19 import java.io.BufferedReader;
20 import java.io.File;
21 import java.io.IOException;
22 import java.io.InputStream;
23 import java.io.InputStreamReader;
24 import java.io.OutputStream;
25 import java.io.OutputStreamWriter;
26 import java.io.Writer;
27 import java.util.ArrayList;
28 import java.util.List;
29 import java.util.Locale;
30
31 import org.apache.tika.io.IOUtils;
32 import org.slf4j.Logger;
33 import org.slf4j.LoggerFactory;
34
35 public class BatchProcessDriverCLI {
36
37 /**
38 * This relies on special exit values: 254 (do not restart),
39 * 0 (ended correctly), 253 (ended with exception; do restart)
40 */
41 public static final int PROCESS_RESTART_EXIT_CODE = 253;
42 //keep this above typical system exit codes so that a system error
43 //is not mistaken for a no-restart exit; that is, if there is a
44 //system error (e.g. 143), you should restart the process.
45 public static final int PROCESS_NO_RESTART_EXIT_CODE = 254;
46 public static final int PROCESS_COMPLETED_SUCCESSFULLY = 0;
47 private static Logger logger = LoggerFactory.getLogger(BatchProcessDriverCLI.class);
48
49 private int maxProcessRestarts = -1;
50 private long pulseMillis = 1000;
51
52 //how many times to wait pulseMillis milliseconds if a restart
53 //message has been received through stdout, but the
54 //child process has not yet exited
55 private int waitNumLoopsAfterRestartmessage = 60;
56
57
58 private volatile boolean userInterrupted = false;
59 private boolean receivedRestartMsg = false;
60 private Process process = null;
61
62 private StreamGobbler errorWatcher = null;
63 private StreamGobbler outGobbler = null;
64 private InterruptWriter interruptWriter = null;
65 private final InterruptWatcher interruptWatcher =
66 new InterruptWatcher(System.in);
67
68 private Thread errorWatcherThread = null;
69 private Thread outGobblerThread = null;
70 private Thread interruptWriterThread = null;
71 private final Thread interruptWatcherThread = new Thread(interruptWatcher);
72
73 private final String[] commandLine;
74 private int numRestarts = 0;
75 private boolean redirectChildProcessToStdOut = true;
76
77 public BatchProcessDriverCLI(String[] commandLine){
78 this.commandLine = tryToReadMaxRestarts(commandLine);
79 }
80
81 private String[] tryToReadMaxRestarts(String[] commandLine) {
82 List<String> args = new ArrayList<String>();
83 for (int i = 0; i < commandLine.length; i++) {
84 String arg = commandLine[i];
85 if (arg.equals("-maxRestarts")) {
86 if (i == commandLine.length-1) {
87 throw new IllegalArgumentException("Must specify an integer after \"-maxRestarts\"");
88 }
89 String restartNumString = commandLine[i+1];
90 try {
91 maxProcessRestarts = Integer.parseInt(restartNumString);
92 } catch (NumberFormatException e) {
93 throw new IllegalArgumentException("Must specify an integer after \"-maxRestarts\" arg.");
94 }
95 i++;
96 } else {
97 args.add(arg);
98 }
99 }
100 return args.toArray(new String[args.size()]);
101 }
102
103 public void execute() throws Exception {
104
105 interruptWatcherThread.setDaemon(true);
106 interruptWatcherThread.start();
107 logger.info("about to start driver");
108 start();
109 int loopsAfterRestartMessageReceived = 0;
110 while (!userInterrupted) {
111 Integer exit = null;
112 try {
113 logger.trace("about to check exit value");
114 exit = process.exitValue();
115 logger.info("The child process has finished with an exit value of: "+exit);
116 stop();
117 } catch (IllegalThreadStateException e) {
118 //hasn't exited
119 logger.trace("process has not exited; IllegalThreadStateException");
120 }
121
122 logger.trace("Before sleep:" +
123 " exit=" + exit + " receivedRestartMsg=" + receivedRestartMsg);
124
125 //Even if the process has exited,
126 //wait just a little bit to make sure that
127 //mustRestart hasn't been set to true
128 try {
129 Thread.sleep(pulseMillis);
130 } catch (InterruptedException e) {
131 logger.trace("interrupted exception during sleep");
132 }
133 logger.trace("After sleep:" +
134 " exit=" + exit + " receivedRestartMsg=" + receivedRestartMsg);
135 //if we've gotten the message via stdout to restart
136 //but the process hasn't exited yet, give it another
137 //chance
138 if (receivedRestartMsg && exit == null && loopsAfterRestartMessageReceived <= waitNumLoopsAfterRestartmessage) {
139 loopsAfterRestartMessageReceived++;
140 logger.warn("Must restart, still not exited; loops after restart: " +
141 loopsAfterRestartMessageReceived);
142 continue;
143 }
144 if (loopsAfterRestartMessageReceived > waitNumLoopsAfterRestartmessage) {
145 logger.trace("About to try to restart because:" +
146 " exit=" + exit + " receivedRestartMsg=" + receivedRestartMsg);
147 logger.warn("Restarting after exceeded wait loops waiting for exit: "+
148 loopsAfterRestartMessageReceived);
149 boolean restarted = restart(exit, receivedRestartMsg);
150 if (!restarted) {
151 break;
152 }
153 } else if (exit != null && exit != BatchProcessDriverCLI.PROCESS_NO_RESTART_EXIT_CODE
154 && exit != BatchProcessDriverCLI.PROCESS_COMPLETED_SUCCESSFULLY) {
155 logger.trace("About to try to restart because:" +
156 " exit=" + exit + " receivedRestartMsg=" + receivedRestartMsg);
157
158 if (exit == BatchProcessDriverCLI.PROCESS_RESTART_EXIT_CODE) {
159 logger.info("Restarting on expected restart code");
160 } else {
161 logger.warn("Restarting on unexpected restart code: "+exit);
162 }
163 boolean restarted = restart(exit, receivedRestartMsg);
164 if (!restarted) {
165 break;
166 }
167 } else if (exit != null && (exit == PROCESS_COMPLETED_SUCCESSFULLY
168 || exit == BatchProcessDriverCLI.PROCESS_NO_RESTART_EXIT_CODE)) {
169 logger.trace("Will not restart: "+exit);
170 break;
171 }
172 }
173 logger.trace("about to call shutdown driver now");
174 shutdownDriverNow();
175 logger.info("Process driver has completed");
176 }
177
178 private void shutdownDriverNow() {
179 if (process != null) {
180 for (int i = 0; i < 60; i++) {
181
182 logger.trace("trying to shut down: "+i);
183 try {
184 int exit = process.exitValue();
185 logger.trace("trying to stop:"+exit);
186 stop();
187 interruptWatcherThread.interrupt();
188 return;
189 } catch (IllegalThreadStateException e) {
190 //hasn't exited
191 }
192 try {
193 Thread.sleep(1000);
194 } catch (InterruptedException e) {
195 //swallow
196 }
197 }
198 logger.error("Process didn't stop within 60 seconds of shutdown; " +
199 "giving up on it.");
200 }
201 interruptWatcherThread.interrupt();
202 }
203
204 public int getNumRestarts() {
205 return numRestarts;
206 }
207
208 public boolean getUserInterrupted() {
209 return userInterrupted;
210 }
211
212 /**
213 * Tries to restart (stop and then start) the child process
214 * @return whether or not this was successful, will be false if numRestarts >= maxProcessRestarts
215 * @throws Exception
216 */
217 private boolean restart(Integer exitValue, boolean receivedRestartMsg) throws Exception {
218 if (maxProcessRestarts > -1 && numRestarts >= maxProcessRestarts) {
219 logger.warn("Hit the maximum number of process restarts. Driver is shutting down now.");
220 stop();
221 return false;
222 }
223 logger.warn("Must restart process (exitValue="+exitValue+" numRestarts="+numRestarts+
224 " receivedRestartMessage="+receivedRestartMsg+")");
225 stop();
226 start();
227 numRestarts++;
228 return true;
229 }
230
231 private void stop() {
232 if (process != null) {
233 logger.trace("destroying a non-null process");
234 process.destroy();
235 }
236
237 receivedRestartMsg = false;
238 //interrupt the writer thread first
239 interruptWriterThread.interrupt();
240
241 errorWatcher.stopGobblingAndDie();
242 outGobbler.stopGobblingAndDie();
243 errorWatcherThread.interrupt();
244 outGobblerThread.interrupt();
245 }
246
247 private void start() throws Exception {
248 ProcessBuilder builder = new ProcessBuilder(commandLine);
249 builder.directory(new File("."));
250 process = builder.start();
251
252 errorWatcher = new StreamWatcher(process.getErrorStream());
253 errorWatcherThread = new Thread(errorWatcher);
254 errorWatcherThread.start();
255
256 outGobbler = new StreamGobbler(process.getInputStream());
257 outGobblerThread = new Thread(outGobbler);
258 outGobblerThread.start();
259
260 interruptWriter = new InterruptWriter(process.getOutputStream());
261 interruptWriterThread = new Thread(interruptWriter);
262 interruptWriterThread.start();
263
264 }
265
266 /**
267 * Typically only used for testing. This determines whether or not
268 * to redirect child process's stdOut to driver's stdout
269 * @param redirectChildProcessToStdOut should the driver redirect the child's stdout
270 */
271 public void setRedirectChildProcessToStdOut(boolean redirectChildProcessToStdOut) {
272 this.redirectChildProcessToStdOut = redirectChildProcessToStdOut;
273 }
274
275 /**
276 * Class to watch stdin from the driver for anything that is typed.
277 * This will currently cause an interrupt if anything followed by
278 * a return key is entered. We may want to add an "Are you sure?" dialogue.
279 */
280 private class InterruptWatcher implements Runnable {
281 private BufferedReader reader;
282
283 private InterruptWatcher(InputStream is) {
284 reader = new BufferedReader(new InputStreamReader(is, IOUtils.UTF_8));
285 }
286
287 @Override
288 public void run() {
289 try {
290 //this will block.
291 //as soon as it reads anything,
292 //set userInterrupted to true and stop
293 reader.readLine();
294 userInterrupted = true;
295 } catch (IOException e) {
296 //swallow
297 }
298 }
299 }
300
301 /**
302 * Class that writes to the child process
303 * to force an interrupt in the child process.
304 */
305 private class InterruptWriter implements Runnable {
306 private final Writer writer;
307
308 private InterruptWriter(OutputStream os) {
309 this.writer = new OutputStreamWriter(os, IOUtils.UTF_8);
310 }
311
312 @Override
313 public void run() {
314 try {
315 while (true) {
316 Thread.sleep(500);
317 if (userInterrupted) {
318 writer.write(String.format(Locale.ENGLISH, "Ave atque vale!%n"));
319 writer.flush();
320 }
321 }
322 } catch (IOException e) {
323 //swallow
324 } catch (InterruptedException e) {
325 //job is done, ok
326 }
327 }
328 }
329
330 private class StreamGobbler implements Runnable {
331 //plagiarized from org.apache.oodt's StreamGobbler
332 protected final BufferedReader reader;
333 protected boolean running = true;
334
335 private StreamGobbler(InputStream is) {
336 this.reader = new BufferedReader(new InputStreamReader(new BufferedInputStream(is),
337 IOUtils.UTF_8));
338 }
339
340 @Override
341 public void run() {
342 String line = null;
343 try {
344 logger.trace("gobbler starting to read");
345 while ((line = reader.readLine()) != null && this.running) {
346 if (redirectChildProcessToStdOut) {
347 System.out.println("BatchProcess:"+line);
348 }
349 }
350 } catch (IOException e) {
351 logger.trace("gobbler io exception");
352 //swallow ioe
353 }
354 logger.trace("gobbler done");
355 }
356
357 private void stopGobblingAndDie() {
358 logger.trace("stop gobbling");
359 running = false;
360 IOUtils.closeQuietly(reader);
361 }
362 }
363
364 private class StreamWatcher extends StreamGobbler implements Runnable {
365 //plagiarized from org.apache.oodt's StreamGobbler
366
367 private StreamWatcher(InputStream is){
368 super(is);
369 }
370
371 @Override
372 public void run() {
373 String line = null;
374 try {
375 logger.trace("watcher starting to read");
376 while ((line = reader.readLine()) != null && this.running) {
377 if (line.startsWith(BatchProcess.BATCH_CONSTANTS.BATCH_PROCESS_FATAL_MUST_RESTART.toString())) {
378 receivedRestartMsg = true;
379 }
380 logger.info("BatchProcess: "+line);
381 }
382 } catch (IOException e) {
383 logger.trace("watcher io exception");
384 //swallow ioe
385 }
386 logger.trace("watcher done");
387 }
388 }
389
390
391 public static void main(String[] args) throws Exception {
392
393 BatchProcessDriverCLI runner = new BatchProcessDriverCLI(args);
394 runner.execute();
395 System.out.println("BatchProcessDriverCLI has gracefully completed");
396 System.exit(0);
397 }
398 }
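The StreamGobbler/StreamWatcher machinery above drains the child process's output on separate threads so the child never blocks on a full pipe. A minimal, JDK-only sketch of the gobbling loop (the class name `GobblerSketch` is hypothetical; it reads from an in-memory stream rather than a real child process so it can run standalone):

```java
import java.io.BufferedReader;
import java.io.ByteArrayInputStream;
import java.io.IOException;
import java.io.InputStream;
import java.io.InputStreamReader;
import java.nio.charset.StandardCharsets;
import java.util.ArrayList;
import java.util.List;

public class GobblerSketch {

    // Drains an InputStream line by line; in real use this runs on
    // its own thread so the producing process is never blocked.
    public static List<String> gobble(InputStream is) {
        List<String> lines = new ArrayList<String>();
        try (BufferedReader reader = new BufferedReader(
                new InputStreamReader(is, StandardCharsets.UTF_8))) {
            String line;
            while ((line = reader.readLine()) != null) {
                lines.add(line);
            }
        } catch (IOException e) {
            // swallow, as the original gobbler does
        }
        return lines;
    }

    public static void main(String[] args) {
        InputStream fake = new ByteArrayInputStream(
                "one\ntwo\nthree\n".getBytes(StandardCharsets.UTF_8));
        List<String> lines = gobble(fake);
        System.out.println("gobbled " + lines.size() + " lines");
    }
}
```

A StreamWatcher-style subclass would additionally inspect each line for a restart marker before logging it.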
0 package org.apache.tika.batch;
1
2 /*
3 * Licensed to the Apache Software Foundation (ASF) under one or more
4 * contributor license agreements. See the NOTICE file distributed with
5 * this work for additional information regarding copyright ownership.
6 * The ASF licenses this file to You under the Apache License, Version 2.0
7 * (the "License"); you may not use this file except in compliance with
8 * the License. You may obtain a copy of the License at
9 *
10 * http://www.apache.org/licenses/LICENSE-2.0
11 *
12 * Unless required by applicable law or agreed to in writing, software
13 * distributed under the License is distributed on an "AS IS" BASIS,
14 * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
15 * See the License for the specific language governing permissions and
16 * limitations under the License.
17 */
18
19 import java.util.Collections;
20 import java.util.List;
21
22 /**
23 * Simple interface around a collection of consumers that allows
24 * for initializing and shutting down shared resources (e.g. db connection, index, writer, etc.)
25 */
26 public abstract class ConsumersManager {
27
28 //maximum time to allow the ConsumersManager for either init()
29 //or shutdown()
30 private long consumersManagerMaxMillis = 60000;
31 private final List<FileResourceConsumer> consumers;
32
33 public ConsumersManager(List<FileResourceConsumer> consumers) {
34 this.consumers = Collections.unmodifiableList(consumers);
35 }
36 /**
37 * Get the consumers
38 * @return consumers
39 */
40 public List<FileResourceConsumer> getConsumers() {
41 return consumers;
42 }
43
44 /**
45 * This is called by BatchProcess before submitting the threads
46 */
47 public void init(){
48
49 }
50
51 /**
52 * This is called by BatchProcess immediately before closing.
53 * Beware! Some of the consumers may have hung or may not
54 * have completed.
55 */
56 public void shutdown(){
57
58 }
59
60 /**
61 * {@link org.apache.tika.batch.BatchProcess} will throw an exception
62 * if the ConsumersManager doesn't complete init() or shutdown()
63 * within this amount of time.
64 * @return the maximum time allowed for init() or shutdown()
65 */
66 public long getConsumersManagerMaxMillis() {
67 return consumersManagerMaxMillis;
68 }
69
70 /**
71 * {@link #getConsumersManagerMaxMillis()}
72 *
73 * @param consumersManagerMaxMillis maximum number of milliseconds
74 * to allow for init() or shutdown()
75 */
76 public void setConsumersManagerMaxMillis(long consumersManagerMaxMillis) {
77 this.consumersManagerMaxMillis = consumersManagerMaxMillis;
78 }
79 }
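The javadoc above says BatchProcess throws an exception if init() or shutdown() does not complete within consumersManagerMaxMillis. A JDK-only sketch of bounding a lifecycle call that way, using a Future with a timed get (the class name `TimedLifecycleSketch` is hypothetical, not part of tika-batch):

```java
import java.util.concurrent.ExecutionException;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.Future;
import java.util.concurrent.TimeUnit;
import java.util.concurrent.TimeoutException;

public class TimedLifecycleSketch {

    // Runs a lifecycle task (e.g. init or shutdown) but gives up after
    // maxMillis; returns true if the task finished in time.
    public static boolean runWithTimeout(Runnable task, long maxMillis) {
        ExecutorService executor = Executors.newSingleThreadExecutor();
        Future<?> future = executor.submit(task);
        try {
            future.get(maxMillis, TimeUnit.MILLISECONDS);
            return true;
        } catch (TimeoutException e) {
            future.cancel(true); // task overran its budget
            return false;
        } catch (InterruptedException | ExecutionException e) {
            return false;
        } finally {
            executor.shutdownNow();
        }
    }

    public static void main(String[] args) {
        boolean fastOk = runWithTimeout(() -> { }, 1000);
        System.out.println("fast task completed in time: " + fastOk);
    }
}
```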
0 package org.apache.tika.batch;
1
2 /*
3 * Licensed to the Apache Software Foundation (ASF) under one or more
4 * contributor license agreements. See the NOTICE file distributed with
5 * this work for additional information regarding copyright ownership.
6 * The ASF licenses this file to You under the Apache License, Version 2.0
7 * (the "License"); you may not use this file except in compliance with
8 * the License. You may obtain a copy of the License at
9 *
10 * http://www.apache.org/licenses/LICENSE-2.0
11 *
12 * Unless required by applicable law or agreed to in writing, software
13 * distributed under the License is distributed on an "AS IS" BASIS,
14 * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
15 * See the License for the specific language governing permissions and
16 * limitations under the License.
17 */
18
19 class FileConsumerFutureResult implements IFileProcessorFutureResult {
20
21 private final FileStarted fileStarted;
22 private final int filesProcessed;
23
24 public FileConsumerFutureResult(FileStarted fs, int filesProcessed) {
25 this.fileStarted = fs;
26 this.filesProcessed = filesProcessed;
27 }
28
29 public FileStarted getFileStarted() {
30 return fileStarted;
31 }
32
33 public int getFilesProcessed() {
34 return filesProcessed;
35 }
36 }
0 package org.apache.tika.batch;
1
2 /*
3 * Licensed to the Apache Software Foundation (ASF) under one or more
4 * contributor license agreements. See the NOTICE file distributed with
5 * this work for additional information regarding copyright ownership.
6 * The ASF licenses this file to You under the Apache License, Version 2.0
7 * (the "License"); you may not use this file except in compliance with
8 * the License. You may obtain a copy of the License at
9 *
10 * http://www.apache.org/licenses/LICENSE-2.0
11 *
12 * Unless required by applicable law or agreed to in writing, software
13 * distributed under the License is distributed on an "AS IS" BASIS,
14 * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
15 * See the License for the specific language governing permissions and
16 * limitations under the License.
17 */
18
19 import org.apache.tika.metadata.Metadata;
20 import org.apache.tika.metadata.Property;
21
22 import java.io.IOException;
23 import java.io.InputStream;
24
25
26 /**
27 * This is a basic interface to handle a logical "file".
28 * This should enable code-agnostic handling of files from different
29 * sources: file system, database, etc.
30 *
31 */
32 public interface FileResource {
33
34 //The literal lowercased extension of a file. This may or may not
35 //have any relationship to the actual type of the file.
36 public static final Property FILE_EXTENSION = Property.internalText("tika:file_ext");
37
38 /**
39 * This is only used in logging to identify which file
40 * may have caused problems. While it is probably best
41 * to use unique ids for the sake of debugging, it is not
42 * necessary that the ids be unique. This id
43 * is never used as a hashkey by the batch processors, for example.
44 *
45 * @return an id for a FileResource
46 */
47 public String getResourceId();
48
49 /**
50 * This gets the metadata available before the parsing of the file.
51 * This will typically be "external" metadata: file name,
52 * file size, file location, data stream, etc. That is, things
53 * that are known about the file from outside information, not
54 * file-internal metadata.
55 *
56 * @return Metadata
57 */
58 public Metadata getMetadata();
59
60 /**
61 *
62 * @return an InputStream for the FileResource
63 * @throws java.io.IOException
64 */
65 public InputStream openInputStream() throws IOException;
66
67 }
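The interface above abstracts a logical "file" so that batch code never cares whether bytes come from a file system, a database, or memory. A standalone sketch of that idea, using a local stand-in for the interface (the real one lives in org.apache.tika.batch and also exposes Metadata; the names `InMemoryResourceSketch` and `Resource` are hypothetical):

```java
import java.io.ByteArrayInputStream;
import java.io.IOException;
import java.io.InputStream;
import java.nio.charset.StandardCharsets;

public class InMemoryResourceSketch {

    // Local stand-in for org.apache.tika.batch.FileResource,
    // reduced to the two methods exercised here.
    public interface Resource {
        String getResourceId();
        InputStream openInputStream() throws IOException;
    }

    // An in-memory "file": the id is just a label for logging,
    // and the stream is backed by a byte array.
    public static Resource of(final String id, final String content) {
        return new Resource() {
            public String getResourceId() { return id; }
            public InputStream openInputStream() {
                return new ByteArrayInputStream(content.getBytes(StandardCharsets.UTF_8));
            }
        };
    }

    // Consumers only see the interface, never the backing source.
    public static int streamLength(Resource r) {
        try (InputStream is = r.openInputStream()) {
            int n = 0;
            while (is.read() != -1) { n++; }
            return n;
        } catch (IOException e) {
            return -1;
        }
    }

    public static void main(String[] args) {
        Resource r = of("mem:demo", "hello");
        System.out.println(r.getResourceId() + " has " + streamLength(r) + " bytes");
    }
}
```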
0 package org.apache.tika.batch;
1
2 /*
3 * Licensed to the Apache Software Foundation (ASF) under one or more
4 * contributor license agreements. See the NOTICE file distributed with
5 * this work for additional information regarding copyright ownership.
6 * The ASF licenses this file to You under the Apache License, Version 2.0
7 * (the "License"); you may not use this file except in compliance with
8 * the License. You may obtain a copy of the License at
9 *
10 * http://www.apache.org/licenses/LICENSE-2.0
11 *
12 * Unless required by applicable law or agreed to in writing, software
13 * distributed under the License is distributed on an "AS IS" BASIS,
14 * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
15 * See the License for the specific language governing permissions and
16 * limitations under the License.
17 */
18
19 import javax.xml.stream.XMLOutputFactory;
20 import javax.xml.stream.XMLStreamException;
21 import javax.xml.stream.XMLStreamWriter;
22 import java.io.Closeable;
23 import java.io.Flushable;
24 import java.io.IOException;
25 import java.io.InputStream;
26 import java.io.PrintWriter;
27 import java.io.StringWriter;
28 import java.util.Date;
29 import java.util.concurrent.ArrayBlockingQueue;
30 import java.util.concurrent.Callable;
31 import java.util.concurrent.TimeUnit;
32 import java.util.concurrent.atomic.AtomicInteger;
33
34 import org.apache.tika.metadata.Metadata;
35 import org.apache.tika.parser.ParseContext;
36 import org.apache.tika.parser.Parser;
37 import org.slf4j.Logger;
38 import org.slf4j.LoggerFactory;
39 import org.xml.sax.ContentHandler;
40
41
42 /**
43 * This is a base class for file consumers. The
44 * goal of this class is to abstract out the multithreading
45 * and recordkeeping components.
46 * <p/>
47 */
48 public abstract class FileResourceConsumer implements Callable<IFileProcessorFutureResult> {
49
50 private enum STATE {
51 NOT_YET_STARTED,
52 ACTIVELY_CONSUMING,
53 SWALLOWED_POISON,
54 THREAD_INTERRUPTED,
55 EXCEEDED_MAX_CONSEC_WAIT_MILLIS,
56 ASKED_TO_SHUTDOWN,
57 TIMED_OUT,
58 CONSUMER_EXCEPTION,
59 CONSUMER_ERROR,
60 COMPLETED
61 }
62
63 public static String TIMED_OUT = "timed_out";
64 public static String OOM = "oom";
65 public static String IO_IS = "io_on_inputstream";
66 public static String IO_OS = "io_on_outputstream";
67 public static String PARSE_ERR = "parse_err";
68 public static String PARSE_EX = "parse_ex";
69
70 public static String ELAPSED_MILLIS = "elapsedMS";
71
72 private static AtomicInteger numConsumers = new AtomicInteger(-1);
73 protected static Logger logger = LoggerFactory.getLogger(FileResourceConsumer.class);
74
75 private long maxConsecWaitInMillis = 10*60*1000;// 10 minutes
76
77 private final ArrayBlockingQueue<FileResource> fileQueue;
78
79 private final XMLOutputFactory xmlOutputFactory = XMLOutputFactory.newFactory();
80 private final int consumerId;
81
82 //used to lock checks on state to prevent race conditions
83 private final Object lock = new Object();
84
85 //this records the file that is currently
86 //being processed. It is null if no file is currently being processed.
87 //no need for volatile because of lock for checkForStales
88 private FileStarted currentFile = null;
89
90 //total number of files consumed; volatile so that reporter
91 //sees the latest
92 private volatile int numResourcesConsumed = 0;
93
94 //total number of exceptions that were handled by subclasses;
95 //volatile so that reporter sees the latest
96 private volatile int numHandledExceptions = 0;
97
98 //after this has been set to ACTIVELY_CONSUMING,
99 //this should only be set by setEndedState.
100 private volatile STATE currentState = STATE.NOT_YET_STARTED;
101
102 public FileResourceConsumer(ArrayBlockingQueue<FileResource> fileQueue) {
103 this.fileQueue = fileQueue;
104 consumerId = numConsumers.incrementAndGet();
105 }
106
107 public IFileProcessorFutureResult call() {
108 currentState = STATE.ACTIVELY_CONSUMING;
109
110 try {
111 FileResource fileResource = getNextFileResource();
112 while (fileResource != null) {
113 logger.debug("file consumer is about to process: " + fileResource.getResourceId());
114 boolean consumed = _processFileResource(fileResource);
115 logger.debug("file consumer has finished processing: " + fileResource.getResourceId());
116
117 if (consumed) {
118 numResourcesConsumed++;
119 }
120 fileResource = getNextFileResource();
121 }
122 } catch (InterruptedException e) {
123 setEndedState(STATE.THREAD_INTERRUPTED);
124 }
125
126 setEndedState(STATE.COMPLETED);
127 return new FileConsumerFutureResult(currentFile, numResourcesConsumed);
128 }
129
130
131 /**
132 * Main piece of code that needs to be implemented. Clients
133 * are responsible for closing streams and handling the exceptions
134 * that they'd like to handle.
135 * <p/>
136 * Unchecked throwables can be thrown past this, of course. When an unchecked
137 * throwable is thrown, this logs the error, and then rethrows the exception.
138 * Clients/subclasses should make sure to catch and handle everything they can.
139 * <p/>
140 * The design goal is that the whole process should close up and shutdown soon after
141 * an unchecked exception or error is thrown.
142 * <p/>
143 * Make sure to call {@link #incrementHandledExceptions()} appropriately in
144 * your implementation of this method.
145 * <p/>
146 *
147 * @param fileResource resource to process
148 * @return whether or not a file was successfully processed
149 */
150 public abstract boolean processFileResource(FileResource fileResource);
151
152
153 /**
154 * Make sure to call this appropriately!
155 */
156 protected void incrementHandledExceptions() {
157 numHandledExceptions++;
158 }
159
160
161 /**
162 * Returns whether or not the consumer is still could process
163 * a file or is still processing a file (ACTIVELY_CONSUMING or ASKED_TO_SHUTDOWN)
164 * @return whether this consumer is still active
165 */
166 public boolean isStillActive() {
167 if (Thread.currentThread().isInterrupted()) {
168 return false;
169 } else if( currentState == STATE.NOT_YET_STARTED ||
170 currentState == STATE.ACTIVELY_CONSUMING ||
171 currentState == STATE.ASKED_TO_SHUTDOWN) {
172 return true;
173 }
174 return false;
175 }
176
177 private boolean _processFileResource(FileResource fileResource) {
178 currentFile = new FileStarted(fileResource.getResourceId());
179 boolean consumed = false;
180 try {
181 consumed = processFileResource(fileResource);
182 } catch (RuntimeException e) {
183 setEndedState(STATE.CONSUMER_EXCEPTION);
184 throw e;
185 } catch (Error e) {
186 setEndedState(STATE.CONSUMER_ERROR);
187 throw e;
188 }
189 //if anything is thrown from processFileResource, then the fileStarted
190 //will remain what it was right before the exception was thrown.
191 currentFile = null;
192 return consumed;
193 }
194
195 /**
196 * This politely asks the consumer to shutdown.
197 * Before processing another file, the consumer will check to see
198 * if it has been asked to terminate.
199 * <p>
200 * This offers another method for politely requesting
201 * that a FileResourceConsumer stop processing
202 * besides passing it {@link org.apache.tika.batch.PoisonFileResource}.
203 *
204 */
205 public void pleaseShutdown() {
206 setEndedState(STATE.ASKED_TO_SHUTDOWN);
207 }
208
209 /**
210 * Returns the name and start time of a file that is currently being processed.
211 * If no file is currently being processed, this will return null.
212 *
213 * @return FileStarted or null
214 */
215 public FileStarted getCurrentFile() {
216 return currentFile;
217 }
218
219 public int getNumResourcesConsumed() {
220 return numResourcesConsumed;
221 }
222
223 public int getNumHandledExceptions() {
224 return numHandledExceptions;
225 }
226
227 /**
228 * Checks to see if the currentFile being processed (if there is one)
229 * should be timed out (still being worked on after staleThresholdMillis).
230 * <p>
231 * If the consumer should be timed out, this will return the currentFile and
232 * set the state to TIMED_OUT.
233 * <p>
234 * If the consumer was already timed out earlier or
235 * is not processing a file or has been working on a file
236 * for less than #staleThresholdMillis, then this will return null.
237 * <p>
238 * @param staleThresholdMillis threshold to determine whether the consumer has gone stale.
239 * @return null or the file started that triggered the stale condition
240 */
241 public FileStarted checkForTimedOutMillis(long staleThresholdMillis) {
242 //if there isn't a current file, don't bother obtaining lock
243 if (currentFile == null) {
244 return null;
245 }
246 //if threshold is < 0, don't even look.
247 if (staleThresholdMillis < 0) {
248 return null;
249 }
250 synchronized(lock) {
251 //check again once the lock has been obtained
252 if (currentState != STATE.ACTIVELY_CONSUMING
253 && currentState != STATE.ASKED_TO_SHUTDOWN) {
254 return null;
255 }
256 FileStarted tmp = currentFile;
257 if (tmp == null) {
258 return null;
259 }
260 if (tmp.getElapsedMillis() > staleThresholdMillis) {
261 setEndedState(STATE.TIMED_OUT);
262 logger.error("{}", getXMLifiedLogMsg(
263 TIMED_OUT,
264 tmp.getResourceId(),
265 ELAPSED_MILLIS, Long.toString(tmp.getElapsedMillis())));
266 return tmp;
267 }
268 }
269 return null;
270 }
271
272 protected String getXMLifiedLogMsg(String type, String resourceId, String... attrs) {
273 return getXMLifiedLogMsg(type, resourceId, null, attrs);
274 }
275
276 /**
277 * Use this for structured output that captures resourceId and other attributes.
278 *
279 * @param type entity name for exception
280 * @param resourceId resourceId string
281 * @param t throwable can be null
282 * @param attrs (array of key0, value0, key1, value1, etc.)
283 */
284 protected String getXMLifiedLogMsg(String type, String resourceId, Throwable t, String... attrs) {
285
286 StringWriter writer = new StringWriter();
287 try {
288 XMLStreamWriter xml = xmlOutputFactory.createXMLStreamWriter(writer);
289 xml.writeStartDocument();
290 xml.writeStartElement(type);
291 xml.writeAttribute("resourceId", resourceId);
292 if (attrs != null) {
293 //this assumes args has name value pairs alternating, name0 at 0, val0 at 1, name1 at 2, val2 at 3, etc.
294 for (int i = 0; i < attrs.length - 1; i++) {
295 xml.writeAttribute(attrs[i], attrs[i + 1]);
296 }
297 }
298 if (t != null) {
299 StringWriter stackWriter = new StringWriter();
300 PrintWriter printWriter = new PrintWriter(stackWriter);
301 t.printStackTrace(printWriter);
302 printWriter.flush();
303 stackWriter.flush();
304 xml.writeCharacters(stackWriter.toString());
305 }
306 xml.writeEndElement();
307 xml.writeEndDocument();
308 xml.flush();
309 xml.close();
310 } catch (XMLStreamException e) {
311 logger.error("error writing xml stream for: " + resourceId, e);
312 }
313 return writer.toString();
314 }
315
316 private FileResource getNextFileResource() throws InterruptedException {
317 FileResource fileResource = null;
318 long start = new Date().getTime();
319 while (fileResource == null) {
320 //check to see if thread is interrupted before polling
321 if (Thread.currentThread().isInterrupted()) {
322 setEndedState(STATE.THREAD_INTERRUPTED);
323 logger.debug("Consumer thread was interrupted.");
324 break;
325 }
326
327 synchronized(lock) {
328 //need to lock here to prevent race condition with other threads setting state
329 if (currentState != STATE.ACTIVELY_CONSUMING) {
330 logger.debug("Consumer already closed because of: "+ currentState.toString());
331 break;
332 }
333 }
334 fileResource = fileQueue.poll(1L, TimeUnit.SECONDS);
335 if (fileResource != null) {
336 if (fileResource instanceof PoisonFileResource) {
337 setEndedState(STATE.SWALLOWED_POISON);
338 fileResource = null;
339 }
340 break;
341 }
342 logger.debug(consumerId + " is waiting for file and the queue size is: " + fileQueue.size());
343
344 long elapsed = new Date().getTime() - start;
345 if (maxConsecWaitInMillis > 0 && elapsed > maxConsecWaitInMillis) {
346 setEndedState(STATE.EXCEEDED_MAX_CONSEC_WAIT_MILLIS);
347 break;
348 }
349 }
350 return fileResource;
351 }
352
353 protected void close(Closeable closeable) {
354 if (closeable != null) {
355 try {
356 closeable.close();
357 } catch (IOException e){
358 logger.error(e.getMessage());
359 }
360 }
361 closeable = null;
362 }
363
364 protected void flushAndClose(Closeable closeable) {
365 if (closeable == null) {
366 return;
367 }
368 if (closeable instanceof Flushable){
369 try {
370 ((Flushable)closeable).flush();
371 } catch (IOException e) {
372 logger.error(e.getMessage());
373 }
374 }
375 close(closeable);
376 }
377
378 //do not overwrite a finished state except if
379 //not yet started, actively consuming or shutting down. This should
380 //represent the initial cause; all subsequent calls
381 //to set will be ignored!!!
382 private void setEndedState(STATE cause) {
383 synchronized(lock) {
384 if (currentState == STATE.NOT_YET_STARTED ||
385 currentState == STATE.ACTIVELY_CONSUMING ||
386 currentState == STATE.ASKED_TO_SHUTDOWN) {
387 currentState = cause;
388 }
389 }
390 }
391
392 /**
393 * Utility method to handle logging equivalently among all
394 * implementing classes. Use, override or avoid as desired.
395 *
396 * @param resourceId resourceId
397 * @param parser parser to use
398 * @param is inputStream (will be closed by this method!)
399 * @param handler handler for the content
400 * @param m metadata
401 * @param parseContext parse context
402 * @throws Throwable logs and then rethrows whatever was thrown (if anything)
403 */
404 protected void parse(final String resourceId, final Parser parser, InputStream is,
405 final ContentHandler handler,
406 final Metadata m, final ParseContext parseContext) throws Throwable {
407
408 try {
409 parser.parse(is, handler, m, parseContext);
410 } catch (Throwable t) {
411 if (t instanceof OutOfMemoryError) {
412 logger.error(getXMLifiedLogMsg(OOM,
413 resourceId, t));
414 } else if (t instanceof Error) {
415 logger.error(getXMLifiedLogMsg(PARSE_ERR,
416 resourceId, t));
417 } else {
418 logger.warn(getXMLifiedLogMsg(PARSE_EX,
419 resourceId, t));
420 incrementHandledExceptions();
421 }
422 throw t;
423 } finally {
424 close(is);
425 }
426 }
427
428 }
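The consumer's call() loop above polls the shared queue with a timeout and ends when it swallows a PoisonFileResource, is interrupted, or waits too long between files. A JDK-only sketch of that loop using a String sentinel in place of PoisonFileResource (the class name `ConsumerLoopSketch` and the `queueOf` helper are hypothetical):

```java
import java.util.concurrent.ArrayBlockingQueue;
import java.util.concurrent.TimeUnit;

public class ConsumerLoopSketch {

    // stand-in for PoisonFileResource
    public static final String POISON = "__POISON__";

    // Builds a small pre-filled queue for demos/tests.
    public static ArrayBlockingQueue<String> queueOf(String... items) {
        ArrayBlockingQueue<String> queue = new ArrayBlockingQueue<String>(10);
        for (String item : items) {
            queue.add(item);
        }
        return queue;
    }

    // Polls until it swallows the poison marker or exceeds the maximum
    // consecutive wait, loosely mirroring getNextFileResource();
    // returns the number of non-poison items consumed.
    public static int consume(ArrayBlockingQueue<String> queue, long maxWaitMillis) {
        int consumed = 0;
        long start = System.currentTimeMillis();
        try {
            while (true) {
                String item = queue.poll(10L, TimeUnit.MILLISECONDS);
                if (item != null) {
                    if (POISON.equals(item)) {
                        break; // SWALLOWED_POISON
                    }
                    consumed++;
                    start = System.currentTimeMillis();
                } else if (System.currentTimeMillis() - start > maxWaitMillis) {
                    break; // EXCEEDED_MAX_CONSEC_WAIT_MILLIS
                }
            }
        } catch (InterruptedException e) {
            Thread.currentThread().interrupt(); // THREAD_INTERRUPTED
        }
        return consumed;
    }

    public static void main(String[] args) {
        int n = consume(queueOf("a", "b", POISON), 1000);
        System.out.println("consumed " + n + " resources before poison");
    }
}
```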
0 package org.apache.tika.batch;
1
2 /*
3 * Licensed to the Apache Software Foundation (ASF) under one or more
4 * contributor license agreements. See the NOTICE file distributed with
5 * this work for additional information regarding copyright ownership.
6 * The ASF licenses this file to You under the Apache License, Version 2.0
7 * (the "License"); you may not use this file except in compliance with
8 * the License. You may obtain a copy of the License at
9 *
10 * http://www.apache.org/licenses/LICENSE-2.0
11 *
12 * Unless required by applicable law or agreed to in writing, software
13 * distributed under the License is distributed on an "AS IS" BASIS,
14 * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
15 * See the License for the specific language governing permissions and
16 * limitations under the License.
17 */
18
19 import java.util.Date;
20 import java.util.concurrent.ArrayBlockingQueue;
21 import java.util.concurrent.Callable;
22 import java.util.concurrent.TimeUnit;
23
24 import org.apache.tika.extractor.DocumentSelector;
25 import org.apache.tika.metadata.Metadata;
26 import org.slf4j.Logger;
27 import org.slf4j.LoggerFactory;
28
29 public abstract class FileResourceCrawler implements Callable<IFileProcessorFutureResult> {
30
31 protected final static int SKIPPED = 0;
32 protected final static int ADDED = 1;
33 protected final static int STOP_NOW = 2;
34
35 private volatile boolean hasCompletedCrawling = false;
36 private volatile boolean shutDownNoPoison = false;
37 private volatile boolean isActive = true;
38 private volatile boolean timedOut = false;
39
40 //how long to pause if can't add to queue
41 private static final long PAUSE_INCREMENT_MILLIS = 1000;
42
43 protected static Logger logger = LoggerFactory.getLogger(FileResourceCrawler.class);
44
45 private int maxFilesToAdd = -1;
46 private int maxFilesToConsider = -1;
47
48 private final ArrayBlockingQueue<FileResource> queue;
49 private final int numConsumers;
50
51
52 private long maxConsecWaitInMillis = 300000;//300,000ms = 5 minutes
53 private DocumentSelector documentSelector = null;
54
55 //number of files added to queue
56 private int added = 0;
57 //number of files considered including those that were rejected by documentSelector
58 private int considered = 0;
59
60 /**
61 * @param queue shared queue
62 * @param numConsumers number of consumers (needs to know how many poisons to add when done)
63 */
64 public FileResourceCrawler(ArrayBlockingQueue<FileResource> queue, int numConsumers) {
65 this.queue = queue;
66 this.numConsumers = numConsumers;
67 }
68
69 /**
70 * Implement this to control the addition of FileResources. Call {@link #tryToAdd}
71 * to add FileResources to the queue.
72 *
73 * @throws InterruptedException
74 */
75 public abstract void start() throws InterruptedException;
76
77 public FileResourceCrawlerFutureResult call() {
78 try {
79 start();
80 } catch (InterruptedException e) {
81 //this can be triggered by shutdownNow in BatchProcess
82 logger.info("InterruptedException in FileCrawler: " + e.getMessage());
83 } catch (Exception e) {
84 logger.error("Exception in FileResourceCrawler: " + e.getMessage());
85 } finally {
86 isActive = false;
87 }
88
89 try {
90 shutdown();
91 } catch (InterruptedException e) {
92 //swallow
93 }
94
95 return new FileResourceCrawlerFutureResult(considered, added);
96 }
97
98 /**
99 *
100 * @param fileResource resource to add
101 * @return int status of the attempt (SKIPPED, ADDED, STOP_NOW) to add the resource to the queue.
102 * @throws InterruptedException
103 */
104 protected int tryToAdd(FileResource fileResource) throws InterruptedException {
105
106 if (maxFilesToAdd > -1 && added >= maxFilesToAdd) {
107 return STOP_NOW;
108 }
109
110 if (maxFilesToConsider > -1 && considered > maxFilesToConsider) {
111 return STOP_NOW;
112 }
113
114 boolean isAdded = false;
115 if (select(fileResource.getMetadata())) {
116 long totalConsecutiveWait = 0;
117 while (!queue.offer(fileResource, 1L, TimeUnit.SECONDS)) {
118
119 logger.info("FileResourceCrawler is pausing. Queue is full: " + queue.size());
120 Thread.sleep(PAUSE_INCREMENT_MILLIS);
121 totalConsecutiveWait += PAUSE_INCREMENT_MILLIS;
122 if (maxConsecWaitInMillis > -1 && totalConsecutiveWait > maxConsecWaitInMillis) {
123 timedOut = true;
124 logger.error("Crawler had to wait longer than max consecutive wait time.");
125 throw new InterruptedException("FileResourceCrawler had to wait longer than max consecutive wait time.");
126 }
127 if (Thread.currentThread().isInterrupted()) {
128 logger.info("FileResourceCrawler shutting down because of interrupted thread.");
129 throw new InterruptedException("FileResourceCrawler interrupted.");
130 }
131 }
132 isAdded = true;
133 added++;
134 } else {
135 logger.debug("crawler did not select: "+fileResource.getResourceId());
136 }
137 considered++;
138 return (isAdded)?ADDED:SKIPPED;
139 }
140
141 //Warning! Depending on the value of maxConsecWaitInMillis
142 //this could try forever in vain to add poison to the queue.
143 private void shutdown() throws InterruptedException{
144 logger.debug("FileResourceCrawler entering shutdown");
145 if (hasCompletedCrawling || shutDownNoPoison) {
146 return;
147 }
148 int i = 0;
149 long start = new Date().getTime();
150 while (queue.offer(new PoisonFileResource(), 1L, TimeUnit.SECONDS)) {
151 if (shutDownNoPoison) {
152 logger.debug("quitting the poison loop because shutDownNoPoison is now true");
153 return;
154 }
155 if (Thread.currentThread().isInterrupted()) {
156 logger.debug("thread interrupted while trying to add poison");
157 return;
158 }
159 long elapsed = new Date().getTime() - start;
160 if (maxConsecWaitInMillis > -1 && elapsed > maxConsecWaitInMillis) {
161 logger.error("Crawler timed out while trying to add poison");
162 return;
163 }
164 logger.debug("added " + i + " PoisonFileResource(s)");
165 if (i++ >= numConsumers) {
166 break;
167 }
168
169 }
170 hasCompletedCrawling = true;
171 }
172
173 /**
174 * If the crawler stops for any reason, it is no longer active.
175 *
176 * @return whether crawler is active or not
177 */
178 public boolean isActive() {
179 return isActive;
180 }
181
182 public void setMaxConsecWaitInMillis(long maxConsecWaitInMillis) {
183 this.maxConsecWaitInMillis = maxConsecWaitInMillis;
184 }
185 public void setDocumentSelector(DocumentSelector documentSelector) {
186 this.documentSelector = documentSelector;
187 }
188
189 public int getConsidered() {
190 return considered;
191 }
192
193 protected boolean select(Metadata m) {
194 return documentSelector.select(m);
195 }
196
197 /**
198 * Maximum number of files to add. If {@link #maxFilesToAdd} &lt; 0 (default),
199 * then this crawler will add all documents.
200 *
201 * @param maxFilesToAdd maximum number of files to add to the queue
202 */
203 public void setMaxFilesToAdd(int maxFilesToAdd) {
204 this.maxFilesToAdd = maxFilesToAdd;
205 }
206
207
208 /**
209 * Maximum number of files to consider. A file is considered
210 * whether or not the DocumentSelector selects a document.
211 * <p/>
212 * If {@link #maxFilesToConsider} &lt; 0 (default), then this crawler
213 * will add all documents.
214 *
215 * @param maxFilesToConsider maximum number of files to consider adding to the queue
216 */
217 public void setMaxFilesToConsider(int maxFilesToConsider) {
218 this.maxFilesToConsider = maxFilesToConsider;
219 }
220
221 /**
222 * Use sparingly. This synchronizes on the queue!
223 * @return whether this queue contains any non-poison file resources
224 */
225 public boolean isQueueEmpty() {
226 int size= 0;
227 synchronized(queue) {
228 for (FileResource aQueue : queue) {
229 if (!(aQueue instanceof PoisonFileResource)) {
230 size++;
231 }
232 }
233 }
234 return size == 0;
235 }
236
237 /**
238 * Returns whether the crawler timed out while trying to add a resource
239 * to the queue.
240 * <p/>
241 * If the crawler timed out while trying to add poison, this is not
242 * set to true.
243 *
244 * @return whether this was timed out or not
245 */
246 public boolean wasTimedOut() {
247 return timedOut;
248 }
249
250 /**
251 *
252 * @return number of files that this crawler added to the queue
253 */
254 public int getAdded() {
255 return added;
256 }
257
258 /**
259 * Call this to shut down the FileResourceCrawler without
260 * adding poison. Do this only if you've already called another mechanism
261 * to request that consumers shut down. This prevents a potential deadlock issue
262 * where the crawler is trying to add to the queue, but it is full.
263 *
265 */
266 public void shutDownNoPoison() {
267 this.shutDownNoPoison = true;
268 }
269 }
0 package org.apache.tika.batch;
1
2 /*
3 * Licensed to the Apache Software Foundation (ASF) under one or more
4 * contributor license agreements. See the NOTICE file distributed with
5 * this work for additional information regarding copyright ownership.
6 * The ASF licenses this file to You under the Apache License, Version 2.0
7 * (the "License"); you may not use this file except in compliance with
8 * the License. You may obtain a copy of the License at
9 *
10 * http://www.apache.org/licenses/LICENSE-2.0
11 *
12 * Unless required by applicable law or agreed to in writing, software
13 * distributed under the License is distributed on an "AS IS" BASIS,
14 * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
15 * See the License for the specific language governing permissions and
16 * limitations under the License.
17 */
18
19 class FileResourceCrawlerFutureResult implements IFileProcessorFutureResult {
20
21 private final int considered;
22 private final int added;
23
24 protected FileResourceCrawlerFutureResult(int considered, int added) {
25 this.considered = considered;
26 this.added = added;
27 }
28
29 protected int getConsidered() {
30 return considered;
31 }
32
33 protected int getAdded() {
34 return added;
35 }
36 }
0 package org.apache.tika.batch;
1
2 /*
3 * Licensed to the Apache Software Foundation (ASF) under one or more
4 * contributor license agreements. See the NOTICE file distributed with
5 * this work for additional information regarding copyright ownership.
6 * The ASF licenses this file to You under the Apache License, Version 2.0
7 * (the "License"); you may not use this file except in compliance with
8 * the License. You may obtain a copy of the License at
9 *
10 * http://www.apache.org/licenses/LICENSE-2.0
11 *
12 * Unless required by applicable law or agreed to in writing, software
13 * distributed under the License is distributed on an "AS IS" BASIS,
14 * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
15 * See the License for the specific language governing permissions and
16 * limitations under the License.
17 */
18
19 import java.util.Date;
20
21 /**
22 * Simple class to record the time when a FileResource's processing started.
23 */
24 class FileStarted {
25
26 private final String resourceId;
27 private final long started;
28
29 /**
30 * Initializes a new FileStarted object with the given {@link #resourceId}
31 * and sets {@link #started} to the current time via <code>new Date().getTime()</code>.
32 *
33 * @param resourceId string for unique resource id
34 */
35 public FileStarted(String resourceId) {
36 this(resourceId, new Date().getTime());
37 }
38
39 public FileStarted(String resourceId, long started) {
40 this.resourceId = resourceId;
41 this.started = started;
42 }
43
44
45 /**
46 * @return id of resource
47 */
48 public String getResourceId() {
49 return resourceId;
50 }
51
52 /**
53 * @return time at which processing on this file started
54 */
55 public long getStarted() {
56 return started;
57 }
58
59 /**
60 * @return elapsed milliseconds since the start of processing of this
61 * file resource
62 */
63 public long getElapsedMillis() {
64 long now = new Date().getTime();
65 return now - started;
66 }
67
68 @Override
69 public int hashCode() {
70 final int prime = 31;
71 int result = 1;
72 result = prime * result
73 + ((resourceId == null) ? 0 : resourceId.hashCode());
74 result = prime * result + (int) (started ^ (started >>> 32));
75 return result;
76 }
77
78 @Override
79 public boolean equals(Object obj) {
80 if (this == obj) {
81 return true;
82 }
83 if (obj == null) {
84 return false;
85 }
86 if (!(obj instanceof FileStarted)) {
87 return false;
88 }
89 FileStarted other = (FileStarted) obj;
90 if (resourceId == null) {
91 if (other.resourceId != null) {
92 return false;
93 }
94 } else if (!resourceId.equals(other.resourceId)) {
95 return false;
96 }
97 return started == other.started;
98 }
99
100 @Override
101 public String toString() {
102 StringBuilder builder = new StringBuilder();
103 builder.append("FileStarted [resourceId=");
104 builder.append(resourceId);
105 builder.append(", started=");
106 builder.append(started);
107 builder.append("]");
108 return builder.toString();
109 }
110
111
112 }
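FileStarted's value semantics (equality and hashing over resourceId plus start time, elapsed time measured against the wall clock) can be exercised in a small self-contained sketch. The FileStartedDemo class name and the inlined copy of the logic are illustrative only, since the original class is package-private:

```java
import java.util.Date;

public class FileStartedDemo {
    // Minimal stand-in mirroring the package-private FileStarted class above.
    static final class FileStarted {
        final String resourceId;
        final long started;

        FileStarted(String resourceId, long started) {
            this.resourceId = resourceId;
            this.started = started;
        }

        long getElapsedMillis() {
            // Same wall-clock arithmetic as FileStarted.getElapsedMillis()
            return new Date().getTime() - started;
        }

        @Override
        public boolean equals(Object obj) {
            if (!(obj instanceof FileStarted)) {
                return false;
            }
            FileStarted other = (FileStarted) obj;
            return started == other.started
                    && (resourceId == null
                        ? other.resourceId == null
                        : resourceId.equals(other.resourceId));
        }

        @Override
        public int hashCode() {
            int result = 31 + (resourceId == null ? 0 : resourceId.hashCode());
            return 31 * result + (int) (started ^ (started >>> 32));
        }
    }

    public static void main(String[] args) {
        FileStarted a = new FileStarted("doc1.pdf", 1000L);
        FileStarted b = new FileStarted("doc1.pdf", 1000L);
        // Same id and start time: equal, with equal hash codes.
        System.out.println(a.equals(b) && a.hashCode() == b.hashCode()); // prints true
        // Started in the past, so elapsed time is non-negative.
        System.out.println(new FileStarted("x", new Date().getTime()).getElapsedMillis() >= 0);
    }
}
```

Because both fields participate in equals and hashCode, two FileStarted instances for the same resource but different start times are distinct, which is what lets the timeout checker detect a restarted parse.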
0 package org.apache.tika.batch;
1
2 /*
3 * Licensed to the Apache Software Foundation (ASF) under one or more
4 * contributor license agreements. See the NOTICE file distributed with
5 * this work for additional information regarding copyright ownership.
6 * The ASF licenses this file to You under the Apache License, Version 2.0
7 * (the "License"); you may not use this file except in compliance with
8 * the License. You may obtain a copy of the License at
9 *
10 * http://www.apache.org/licenses/LICENSE-2.0
11 *
12 * Unless required by applicable law or agreed to in writing, software
13 * distributed under the License is distributed on an "AS IS" BASIS,
14 * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
15 * See the License for the specific language governing permissions and
16 * limitations under the License.
17 */
18
19 /**
20 * Stub interface to allow for different result types from different processors.
21 *
22 */
23 public interface IFileProcessorFutureResult {
24
25 }
0 package org.apache.tika.batch;
1
2 /*
3 * Licensed to the Apache Software Foundation (ASF) under one or more
4 * contributor license agreements. See the NOTICE file distributed with
5 * this work for additional information regarding copyright ownership.
6 * The ASF licenses this file to You under the Apache License, Version 2.0
7 * (the "License"); you may not use this file except in compliance with
8 * the License. You may obtain a copy of the License at
9 *
10 * http://www.apache.org/licenses/LICENSE-2.0
11 *
12 * Unless required by applicable law or agreed to in writing, software
13 * distributed under the License is distributed on an "AS IS" BASIS,
14 * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
15 * See the License for the specific language governing permissions and
16 * limitations under the License.
17 */
18
19 import java.io.BufferedReader;
20 import java.io.IOException;
21 import java.io.InputStreamReader;
22 import java.util.concurrent.Callable;
23
24 import org.apache.tika.io.IOUtils;
25 import org.slf4j.Logger;
26 import org.slf4j.LoggerFactory;
27
28
29 /**
30 * Class that waits for input on System.in. If the user enters a keystroke on
31 * System.in, this will send a signal to the FileResourceRunner to shut down gracefully.
32 *
33 * <p>
34 * In the future, this may implement a common IInterrupter interface for more flexibility.
35 */
36 public class Interrupter implements Callable<IFileProcessorFutureResult> {
37
38 private Logger logger = LoggerFactory.getLogger(Interrupter.class);
39 public IFileProcessorFutureResult call(){
40 try{
41 BufferedReader reader = new BufferedReader(new InputStreamReader(System.in, IOUtils.UTF_8));
42 while (true){
43 if (reader.ready()){
44 reader.readLine();
45 break;
46 } else {
47 Thread.sleep(1000);
48 }
49 }
50 } catch (InterruptedException e){
51 //canceller was interrupted
52 } catch (IOException e){
53 logger.error("IOException from STDIN in Interrupter.");
54 }
55 return new InterrupterFutureResult();
56 }
57 }
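The loop above polls reader.ready() and sleeps instead of blocking in readLine(); a blocked read would ignore Thread.interrupt(), while the sleep keeps the Callable responsive when its executor cancels it. A generic sketch of that cancellable-poll idiom (CancellablePollDemo is an illustrative name, not part of Tika):

```java
import java.util.concurrent.Callable;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.Future;
import java.util.concurrent.TimeUnit;

public class CancellablePollDemo {

    // Submits a poll-and-sleep task (the Interrupter idiom) and cancels it;
    // returns true if cancellation took effect.
    static boolean runAndCancel() {
        ExecutorService pool = Executors.newSingleThreadExecutor();
        try {
            Future<String> future = pool.submit(new Callable<String>() {
                public String call() {
                    try {
                        while (true) {
                            // Poll a condition here (Interrupter checks reader.ready());
                            // sleeping rather than blocking keeps the task interruptible.
                            Thread.sleep(100);
                        }
                    } catch (InterruptedException e) {
                        return "interrupted"; // graceful exit, as in Interrupter.call()
                    }
                }
            });
            TimeUnit.MILLISECONDS.sleep(250);
            future.cancel(true);            // delivers the interrupt to the task
            pool.shutdown();
            pool.awaitTermination(1, TimeUnit.SECONDS);
            return future.isCancelled();
        } catch (InterruptedException e) {
            return false;
        }
    }

    public static void main(String[] args) {
        System.out.println(runAndCancel()); // prints true
    }
}
```

The same mechanism is what allows BatchProcess to tear the Interrupter down cleanly once the crawl finishes for other reasons.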
0 package org.apache.tika.batch;
1
2 /*
3 * Licensed to the Apache Software Foundation (ASF) under one or more
4 * contributor license agreements. See the NOTICE file distributed with
5 * this work for additional information regarding copyright ownership.
6 * The ASF licenses this file to You under the Apache License, Version 2.0
7 * (the "License"); you may not use this file except in compliance with
8 * the License. You may obtain a copy of the License at
9 *
10 * http://www.apache.org/licenses/LICENSE-2.0
11 *
12 * Unless required by applicable law or agreed to in writing, software
13 * distributed under the License is distributed on an "AS IS" BASIS,
14 * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
15 * See the License for the specific language governing permissions and
16 * limitations under the License.
17 */
18
19 public class InterrupterFutureResult implements IFileProcessorFutureResult {
20
21 }
0 package org.apache.tika.batch;
1
2 /*
3 * Licensed to the Apache Software Foundation (ASF) under one or more
4 * contributor license agreements. See the NOTICE file distributed with
5 * this work for additional information regarding copyright ownership.
6 * The ASF licenses this file to You under the Apache License, Version 2.0
7 * (the "License"); you may not use this file except in compliance with
8 * the License. You may obtain a copy of the License at
9 *
10 * http://www.apache.org/licenses/LICENSE-2.0
11 *
12 * Unless required by applicable law or agreed to in writing, software
13 * distributed under the License is distributed on an "AS IS" BASIS,
14 * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
15 * See the License for the specific language governing permissions and
16 * limitations under the License.
17 */
18
19 import org.apache.tika.metadata.Metadata;
20
21 import java.io.IOException;
22 import java.io.OutputStream;
23
24 public interface OutputStreamFactory {
25
26 public OutputStream getOutputStream(Metadata metadata) throws IOException;
27
28 }
0 package org.apache.tika.batch;
1
2 /*
3 * Licensed to the Apache Software Foundation (ASF) under one or more
4 * contributor license agreements. See the NOTICE file distributed with
5 * this work for additional information regarding copyright ownership.
6 * The ASF licenses this file to You under the Apache License, Version 2.0
7 * (the "License"); you may not use this file except in compliance with
8 * the License. You may obtain a copy of the License at
9 *
10 * http://www.apache.org/licenses/LICENSE-2.0
11 *
12 * Unless required by applicable law or agreed to in writing, software
13 * distributed under the License is distributed on an "AS IS" BASIS,
14 * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
15 * See the License for the specific language governing permissions and
16 * limitations under the License.
17 */
18
19 public class ParallelFileProcessingResult {
20 private final int considered;
21 private final int added;
22 private final int consumed;
23 private final int numberHandledExceptions;
24 private final double secondsElapsed;
25 private final int exitStatus;
26 private final String causeForTermination;
27
28 public ParallelFileProcessingResult(int considered, int added,
29 int consumed, int numberHandledExceptions,
30 double secondsElapsed,
31 int exitStatus,
32 String causeForTermination) {
33 this.considered = considered;
34 this.added = added;
35 this.consumed = consumed;
36 this.numberHandledExceptions = numberHandledExceptions;
37 this.secondsElapsed = secondsElapsed;
38 this.exitStatus = exitStatus;
39 this.causeForTermination = causeForTermination;
40 }
41
42 /**
43 * Returns the number of file resources considered.
44 * If a filter causes the crawler to ignore a number of resources,
45 * this number could be higher than that returned by {@link #getConsumed()}.
46 *
47 * @return number of file resources considered
48 */
49 public int getConsidered() {
50 return considered;
51 }
52
53 /**
54 * @return number of resources added to the queue
55 */
56 public int getAdded() {
57 return added;
58 }
59
60 /**
61 * @return number of resources whose consumption was attempted; some
62 * attempts may have ended in an exception.
63 */
64 public int getConsumed() {
65 return consumed;
66 }
67
68 /**
69 * @return cause for termination, e.g. whether the {@link BatchProcess}
70 * was interrupted by an {@link Interrupter}.
71 */
72 public String getCauseForTermination() {
73 return causeForTermination;
74 }
75
76 /**
77 *
78 * @return seconds elapsed since the start of the batch processing
79 */
80 public double secondsElapsed() {
81 return secondsElapsed;
82 }
83
84 public int getNumberHandledExceptions() {
85 return numberHandledExceptions;
86 }
87
88 /**
89 *
90 * @return the intended exit status
91 */
92 public int getExitStatus() {
93 return exitStatus;
94 }
95
96 @Override
97 public String toString() {
98 return "ParallelFileProcessingResult{" +
99 "considered=" + considered +
100 ", added=" + added +
101 ", consumed=" + consumed +
102 ", numberHandledExceptions=" + numberHandledExceptions +
103 ", secondsElapsed=" + secondsElapsed +
104 ", exitStatus=" + exitStatus +
105 ", causeForTermination='" + causeForTermination + '\'' +
106 '}';
107 }
108 }
0 package org.apache.tika.batch;
1
2 /*
3 * Licensed to the Apache Software Foundation (ASF) under one or more
4 * contributor license agreements. See the NOTICE file distributed with
5 * this work for additional information regarding copyright ownership.
6 * The ASF licenses this file to You under the Apache License, Version 2.0
7 * (the "License"); you may not use this file except in compliance with
8 * the License. You may obtain a copy of the License at
9 *
10 * http://www.apache.org/licenses/LICENSE-2.0
11 *
12 * Unless required by applicable law or agreed to in writing, software
13 * distributed under the License is distributed on an "AS IS" BASIS,
14 * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
15 * See the License for the specific language governing permissions and
16 * limitations under the License.
17 */
18
19 import org.apache.tika.config.TikaConfig;
20 import org.apache.tika.parser.Parser;
21
22 public interface ParserFactory {
23
24 public Parser getParser(TikaConfig config);
25
26 }
0 package org.apache.tika.batch;
1
2 /*
3 * Licensed to the Apache Software Foundation (ASF) under one or more
4 * contributor license agreements. See the NOTICE file distributed with
5 * this work for additional information regarding copyright ownership.
6 * The ASF licenses this file to You under the Apache License, Version 2.0
7 * (the "License"); you may not use this file except in compliance with
8 * the License. You may obtain a copy of the License at
9 *
10 * http://www.apache.org/licenses/LICENSE-2.0
11 *
12 * Unless required by applicable law or agreed to in writing, software
13 * distributed under the License is distributed on an "AS IS" BASIS,
14 * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
15 * See the License for the specific language governing permissions and
16 * limitations under the License.
17 */
18
19 import org.apache.tika.metadata.Metadata;
20
21 import java.io.InputStream;
22
23 /**
24 * Sentinel class for the crawler to add to the queue to let
25 * the consumers know that they should shut down.
26 */
27 class PoisonFileResource implements FileResource {
28
29 /**
30 * always returns null
31 */
32 @Override
33 public Metadata getMetadata() {
34 return null;
35 }
36
37 /**
38 * always returns null
39 */
40 @Override
41 public InputStream openInputStream() {
42 return null;
43 }
44
45 /**
46 * always returns null
47 */
48 @Override
49 public String getResourceId() {
50 return null;
51 }
52
53 }
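The sentinel ("poison pill") pattern that PoisonFileResource implements is the standard way to shut down blocking-queue consumers: the producer enqueues one pill per consumer, and each consumer stops when it dequeues one. A minimal generic sketch of the idea, independent of the Tika classes (PoisonPillDemo is an illustrative name; poll() stands in for the blocking take() a real consumer would use):

```java
import java.util.Queue;
import java.util.concurrent.ArrayBlockingQueue;

public class PoisonPillDemo {
    // Unique sentinel instance; compared by reference, like PoisonFileResource.
    static final String POISON = new String("POISON");

    // Drains items until the sentinel appears (or the queue is empty).
    static int drain(Queue<String> queue) {
        int consumed = 0;
        String item;
        while ((item = queue.poll()) != null) {
            if (item == POISON) {
                break;          // sentinel seen: this consumer shuts down
            }
            consumed++;         // a real consumer would parse the file here
        }
        return consumed;
    }

    public static void main(String[] args) {
        Queue<String> queue = new ArrayBlockingQueue<String>(10);
        queue.offer("a.pdf");
        queue.offer("b.doc");
        queue.offer(POISON);    // the crawler adds one pill per consumer
        System.out.println(drain(queue)); // prints 2
    }
}
```

Making all of PoisonFileResource's methods return null is safe here precisely because consumers check for the sentinel before touching the resource.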
0 package org.apache.tika.batch;
1
2 /*
3 * Licensed to the Apache Software Foundation (ASF) under one or more
4 * contributor license agreements. See the NOTICE file distributed with
5 * this work for additional information regarding copyright ownership.
6 * The ASF licenses this file to You under the Apache License, Version 2.0
7 * (the "License"); you may not use this file except in compliance with
8 * the License. You may obtain a copy of the License at
9 *
10 * http://www.apache.org/licenses/LICENSE-2.0
11 *
12 * Unless required by applicable law or agreed to in writing, software
13 * distributed under the License is distributed on an "AS IS" BASIS,
14 * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
15 * See the License for the specific language governing permissions and
16 * limitations under the License.
17 */
18
19 import java.text.NumberFormat;
20 import java.util.Date;
21 import java.util.Locale;
22 import java.util.concurrent.Callable;
23
24 import org.apache.tika.util.DurationFormatUtils;
25 import org.slf4j.Logger;
26 import org.slf4j.LoggerFactory;
27
28 /**
29 * Basic class to use for reporting status from both the crawler and the consumers.
30 * This wakes up roughly every {@link #sleepMillis} milliseconds and logs a status report at the INFO level.
31 */
32
33 public class StatusReporter implements Callable<IFileProcessorFutureResult> {
34
35 private final Logger logger = LoggerFactory.getLogger(StatusReporter.class);
36
37 //require references to these so that the
38 //StatusReporter can query them when it wakes up
39 private final ConsumersManager consumersManager;
40 private final FileResourceCrawler crawler;
41
42 //local time that the StatusReporter started
43 private final long start;
44 //how long to sleep between reporting intervals
45 private long sleepMillis = 1000;
46
47 //how long before considering a parse "stale" (potentially hung forever)
48 private long staleThresholdMillis = 100000;
49
50 private volatile boolean isShuttingDown = false;
51
52 /**
53 * Initialize with the crawler and consumers
54 *
55 * @param crawler crawler to ping at intervals
56 * @param consumersManager consumers to ping at intervals
57 */
58 public StatusReporter(FileResourceCrawler crawler, ConsumersManager consumersManager) {
59 this.consumersManager = consumersManager;
60 this.crawler = crawler;
61 start = new Date().getTime();
62 }
63
64 /**
65 * Override for different behavior.
66 * <p/>
67 * This reports the string at the info level to this class' logger.
68 *
69 * @param s string to report
70 */
71 protected void report(String s) {
72 logger.info(s);
73 }
74
75 /**
76 * Start the reporter.
77 */
78 public IFileProcessorFutureResult call() {
79 NumberFormat numberFormat = NumberFormat.getNumberInstance(Locale.ROOT);
80 try {
81 while (true) {
82 Thread.sleep(sleepMillis);
83 int cnt = getRoughCountConsumed();
84 int exceptions = getRoughCountExceptions();
85 long elapsed = new Date().getTime() - start;
86 double elapsedSecs = (double) elapsed / (double) 1000;
87 int avg = (elapsedSecs > 5 || cnt > 100) ? (int) ((double) cnt / elapsedSecs) : -1;
88
89 String elapsedString = DurationFormatUtils.formatMillis(new Date().getTime() - start);
90 String docsPerSec = avg > -1 ? String.format(Locale.ROOT,
91 " (%s docs per sec)",
92 numberFormat.format(avg)) : "";
93 String msg =
94 String.format(
95 Locale.ROOT,
96 "Processed %s documents in %s%s.",
97 numberFormat.format(cnt), elapsedString, docsPerSec);
98 report(msg);
99 if (exceptions == 1){
100 msg = "There has been one handled exception.";
101 } else {
102 msg =
103 String.format(Locale.ROOT,
104 "There have been %s handled exceptions.",
105 numberFormat.format(exceptions));
106 }
107 report(msg);
108
109 reportStale();
110
111 int stillAlive = getStillAlive();
112 if (stillAlive == 1) {
113 msg = "There is one file processor still active.";
114 } else {
115 msg = "There are " + numberFormat.format(stillAlive) + " file processors still active.";
116 }
117 report(msg);
118
119 int crawled = crawler.getConsidered();
120 int added = crawler.getAdded();
121 if (crawled == 1) {
122 msg = "The directory crawler has considered 1 file, ";
123 } else {
124 msg = "The directory crawler has considered " +
125 numberFormat.format(crawled) + " files, ";
126 }
127 if (added == 1) {
128 msg += "and it has added 1 file.";
129 } else {
130 msg += "and it has added " +
131 numberFormat.format(added) + " files.";
132 }
133 msg += "\n";
134 report(msg);
135
136 if (! crawler.isActive()) {
137 msg = "The directory crawler has completed its crawl.\n";
138 report(msg);
139 }
140 if (isShuttingDown) {
141 msg = "Process is shutting down now.";
142 report(msg);
143 }
144 }
145 } catch (InterruptedException e) {
146 //swallow
147 }
148 return new StatusReporterFutureResult();
149 }
150
151
152 /**
153 * Set the amount of time to sleep between reports.
154 * @param sleepMillis length of time to sleep between reports, in milliseconds
155 */
156 public void setSleepMillis(long sleepMillis) {
157 this.sleepMillis = sleepMillis;
158 }
159
160 /**
161 * Set the amount of time in milliseconds to use as the threshold for determining
162 * a stale parse.
163 *
164 * @param staleThresholdMillis threshold for determining whether or not to report a stale parse
165 */
166 public void setStaleThresholdMillis(long staleThresholdMillis) {
167 this.staleThresholdMillis = staleThresholdMillis;
168 }
169
170
171 private void reportStale() {
172 for (FileResourceConsumer consumer : consumersManager.getConsumers()) {
173 FileStarted fs = consumer.getCurrentFile();
174 if (fs == null) {
175 continue;
176 }
177 long elapsed = fs.getElapsedMillis();
178 if (elapsed > staleThresholdMillis) {
179 String elapsedString = Double.toString((double) elapsed / (double) 1000);
180 report("A thread has been working on " + fs.getResourceId() +
181 " for " + elapsedString + " seconds.");
182 }
183 }
184 }
185
186 /*
187 * This returns a rough (unsynchronized) count of resources consumed.
188 */
189 private int getRoughCountConsumed() {
190 int ret = 0;
191 for (FileResourceConsumer consumer : consumersManager.getConsumers()) {
192 ret += consumer.getNumResourcesConsumed();
193 }
194 return ret;
195 }
196
197 private int getStillAlive() {
198 int ret = 0;
199 for (FileResourceConsumer consumer : consumersManager.getConsumers()) {
200 if ( consumer.isStillActive()) {
201 ret++;
202 }
203 }
204 return ret;
205 }
206
207 /**
208 * This returns a rough (unsynchronized) count of caught/handled exceptions.
209 * @return rough count of exceptions
210 */
211 public int getRoughCountExceptions() {
212 int ret = 0;
213 for (FileResourceConsumer consumer : consumersManager.getConsumers()) {
214 ret += consumer.getNumHandledExceptions();
215 }
216 return ret;
217 }
218
219 /**
220 * Set whether the main process is in the process of shutting down.
221 * @param isShuttingDown whether or not the main process is shutting down
222 */
223 public void setIsShuttingDown(boolean isShuttingDown){
224 this.isShuttingDown = isShuttingDown;
225 }
226 }
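The throughput arithmetic in call() above only reports an average once at least 5 seconds have elapsed or more than 100 documents have been processed, returning -1 (meaning "don't report") before that. Isolated into a small sketch (ThroughputDemo is an illustrative name):

```java
import java.text.NumberFormat;
import java.util.Locale;

public class ThroughputDemo {
    // Mirrors the averaging rule in StatusReporter.call(): only compute an
    // average after 5 seconds have elapsed or more than 100 docs are done.
    static int averageDocsPerSec(int count, double elapsedSecs) {
        return (elapsedSecs > 5 || count > 100)
                ? (int) ((double) count / elapsedSecs) : -1;
    }

    public static void main(String[] args) {
        NumberFormat nf = NumberFormat.getNumberInstance(Locale.ROOT);
        // Too early for a meaningful average:
        System.out.println(averageDocsPerSec(50, 2.0));   // prints -1
        // Enough elapsed time: 600 docs / 10 secs
        System.out.println(averageDocsPerSec(600, 10.0)); // prints 60
        System.out.println(nf.format(averageDocsPerSec(600, 10.0)));
    }
}
```

The guard avoids reporting wildly noisy rates in the first few seconds of a run, when a handful of fast documents would otherwise suggest an unrealistic steady-state throughput.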
0 package org.apache.tika.batch;
1 /*
2 * Licensed to the Apache Software Foundation (ASF) under one or more
3 * contributor license agreements. See the NOTICE file distributed with
4 * this work for additional information regarding copyright ownership.
5 * The ASF licenses this file to You under the Apache License, Version 2.0
6 * (the "License"); you may not use this file except in compliance with
7 * the License. You may obtain a copy of the License at
8 *
9 * http://www.apache.org/licenses/LICENSE-2.0
10 *
11 * Unless required by applicable law or agreed to in writing, software
12 * distributed under the License is distributed on an "AS IS" BASIS,
13 * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
14 * See the License for the specific language governing permissions and
15 * limitations under the License.
16 */
17
18 /**
19 * Empty marker class that a StatusReporter returns when it finishes.
20 */
21 public class StatusReporterFutureResult implements IFileProcessorFutureResult {
22 }
0 package org.apache.tika.batch.builders;
1
2 /*
3 * Licensed to the Apache Software Foundation (ASF) under one or more
4 * contributor license agreements. See the NOTICE file distributed with
5 * this work for additional information regarding copyright ownership.
6 * The ASF licenses this file to You under the Apache License, Version 2.0
7 * (the "License"); you may not use this file except in compliance with
8 * the License. You may obtain a copy of the License at
9 *
10 * http://www.apache.org/licenses/LICENSE-2.0
11 *
12 * Unless required by applicable law or agreed to in writing, software
13 * distributed under the License is distributed on an "AS IS" BASIS,
14 * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
15 * See the License for the specific language governing permissions and
16 * limitations under the License.
17 */
18
19 import org.apache.tika.batch.ConsumersManager;
20 import org.apache.tika.batch.FileResource;
21 import org.w3c.dom.Node;
22
23 import java.util.Map;
24 import java.util.concurrent.ArrayBlockingQueue;
25
26 public abstract class AbstractConsumersBuilder {
27
28 public static int getDefaultNumConsumers(){
29 int n = Runtime.getRuntime().availableProcessors()-1;
30 return (n < 1) ? 1 : n;
31 }
32
33 public abstract ConsumersManager build(Node node, Map<String, String> runtimeAttributes,
34 ArrayBlockingQueue<FileResource> queue);
35
36
37 }
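The default-consumer calculation above takes the available processors minus one (leaving a core for the crawler and reporter) but never drops below one consumer. A quick sanity check of the rule (DefaultConsumersDemo is an illustrative name; the processor count is passed in rather than read from Runtime so the behavior is deterministic):

```java
public class DefaultConsumersDemo {
    // Same rule as AbstractConsumersBuilder.getDefaultNumConsumers(),
    // parameterized on the processor count for testability.
    static int defaultNumConsumers(int availableProcessors) {
        int n = availableProcessors - 1;
        return (n < 1) ? 1 : n;
    }

    public static void main(String[] args) {
        System.out.println(defaultNumConsumers(1)); // prints 1: never below one consumer
        System.out.println(defaultNumConsumers(2)); // prints 1
        System.out.println(defaultNumConsumers(8)); // prints 7: one core left free
    }
}
```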
0 package org.apache.tika.batch.builders;
1
2 /*
3 * Licensed to the Apache Software Foundation (ASF) under one or more
4 * contributor license agreements. See the NOTICE file distributed with
5 * this work for additional information regarding copyright ownership.
6 * The ASF licenses this file to You under the Apache License, Version 2.0
7 * (the "License"); you may not use this file except in compliance with
8 * the License. You may obtain a copy of the License at
9 *
10 * http://www.apache.org/licenses/LICENSE-2.0
11 *
12 * Unless required by applicable law or agreed to in writing, software
13 * distributed under the License is distributed on an "AS IS" BASIS,
14 * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
15 * See the License for the specific language governing permissions and
16 * limitations under the License.
17 */
18
19 import javax.xml.parsers.DocumentBuilder;
20 import javax.xml.parsers.DocumentBuilderFactory;
21 import javax.xml.parsers.ParserConfigurationException;
22 import java.io.IOException;
23 import java.io.InputStream;
24 import java.util.Collections;
25 import java.util.HashMap;
26 import java.util.Map;
27 import java.util.concurrent.ArrayBlockingQueue;
28
29 import org.apache.tika.batch.BatchProcess;
30 import org.apache.tika.batch.ConsumersManager;
31 import org.apache.tika.batch.FileResource;
32 import org.apache.tika.batch.FileResourceCrawler;
33 import org.apache.tika.batch.Interrupter;
34 import org.apache.tika.batch.StatusReporter;
35 import org.apache.tika.util.ClassLoaderUtil;
36 import org.apache.tika.util.XMLDOMUtil;
37 import org.w3c.dom.Document;
38 import org.w3c.dom.Node;
39 import org.w3c.dom.NodeList;
40 import org.xml.sax.SAXException;
41
42
43 /**
44 * Builds a BatchProcessor from a combination of runtime arguments and the
45 * config file.
46 */
47 public class BatchProcessBuilder {
48
49 public final static int DEFAULT_MAX_QUEUE_SIZE = 1000;
50 public final static String MAX_QUEUE_SIZE_KEY = "maxQueueSize";
51 public final static String NUM_CONSUMERS_KEY = "numConsumers";
52
53 /**
54 * Builds a BatchProcess from runtime arguments and an
55 * input stream of a configuration file. With the exception of the QueueBuilder,
56 * the builders choose how to adjudicate between
57 * runtime arguments and the elements in the configuration file.
58 * <p/>
59 * This does not close the InputStream!
60 * @param is inputStream
61 * @param runtimeAttributes incoming runtime attributes
62 * @return batch process
63 * @throws java.io.IOException if the configuration file cannot be read or parsed
64 */
65 public BatchProcess build(InputStream is, Map<String,String> runtimeAttributes) throws IOException {
66 Document doc = null;
67 DocumentBuilderFactory fact = DocumentBuilderFactory.newInstance();
68 DocumentBuilder docBuilder = null;
69 try {
70 docBuilder = fact.newDocumentBuilder();
71 doc = docBuilder.parse(is);
72 } catch (ParserConfigurationException e) {
73 throw new IOException(e);
74 } catch (SAXException e) {
75 throw new IOException(e);
76 }
77 Node docElement = doc.getDocumentElement();
78 return build(docElement, runtimeAttributes);
79 }
80
81 /**
82 * Builds a FileResourceBatchProcessor from runtime arguments and a
83 * document node of a configuration file. With the exception of the QueueBuilder,
84 * the builders choose how to adjudicate between
85 * runtime arguments and the elements in the configuration file.
86 *
87 * @param docElement document element of the xml config file
88 * @param incomingRuntimeAttributes runtime arguments
89 * @return FileResourceBatchProcessor
90 */
91 public BatchProcess build(Node docElement, Map<String, String> incomingRuntimeAttributes) {
92
93 //key components
94 long timeoutThresholdMillis = XMLDOMUtil.getLong("timeoutThresholdMillis",
95 incomingRuntimeAttributes, docElement);
96 long timeoutCheckPulseMillis = XMLDOMUtil.getLong("timeoutCheckPulseMillis",
97 incomingRuntimeAttributes, docElement);
98 long pauseOnEarlyTerminationMillis = XMLDOMUtil.getLong("pauseOnEarlyTerminationMillis",
99 incomingRuntimeAttributes, docElement);
100 int maxAliveTimeSeconds = XMLDOMUtil.getInt("maxAliveTimeSeconds",
101 incomingRuntimeAttributes, docElement);
102
103 FileResourceCrawler crawler = null;
104 ConsumersManager consumersManager = null;
105 StatusReporter reporter = null;
106 Interrupter interrupter = null;
107
108 /*
109 * TODO: This is a bit smelly. NumConsumers needs to be used by the crawler
110 * and the consumers. This copies the incomingRuntimeAttributes and then
111 * supplies the numConsumers from the commandline (if it exists) or from the config file
112 * At least this creates an unmodifiable defensive copy of incomingRuntimeAttributes...
113 */
114 Map<String, String> runtimeAttributes = setNumConsumersInRuntimeAttributes(docElement, incomingRuntimeAttributes);
115
116 //build queue
117 ArrayBlockingQueue<FileResource> queue = buildQueue(docElement, runtimeAttributes);
118
119 NodeList children = docElement.getChildNodes();
120 Map<String, Node> keyNodes = new HashMap<String, Node>();
121 for (int i = 0; i < children.getLength(); i++) {
122 Node child = children.item(i);
123 if (child.getNodeType() != Node.ELEMENT_NODE) {
124 continue;
125 }
126 String nodeName = child.getNodeName();
127 keyNodes.put(nodeName, child);
128 }
129 //build consumers
130 consumersManager = buildConsumersManager(keyNodes.get("consumers"), runtimeAttributes, queue);
131
132 //build crawler
133 crawler = buildCrawler(queue, keyNodes.get("crawler"), runtimeAttributes);
134
135 reporter = buildReporter(crawler, consumersManager, keyNodes.get("reporter"), runtimeAttributes);
136
137 interrupter = buildInterrupter(keyNodes.get("interrupter"), runtimeAttributes);
138
139 BatchProcess proc = new BatchProcess(
140 crawler, consumersManager, reporter, interrupter);
141
142 if (timeoutThresholdMillis > -1) {
143 proc.setTimeoutThresholdMillis(timeoutThresholdMillis);
144 }
145
146 if (pauseOnEarlyTerminationMillis > -1) {
147 proc.setPauseOnEarlyTerminationMillis(pauseOnEarlyTerminationMillis);
148 }
149
150 if (timeoutCheckPulseMillis > -1) {
151 proc.setTimeoutCheckPulseMillis(timeoutCheckPulseMillis);
152 }
153 proc.setMaxAliveTimeSeconds(maxAliveTimeSeconds);
154 return proc;
155 }
156
157 private Interrupter buildInterrupter(Node node, Map<String, String> runtimeAttributes) {
158 Map<String, String> attrs = XMLDOMUtil.mapifyAttrs(node, runtimeAttributes);
159 String className = attrs.get("builderClass");
160 if (className == null) {
161 throw new RuntimeException("Need to specify class name in interrupter element");
162 }
163 InterrupterBuilder builder = ClassLoaderUtil.buildClass(InterrupterBuilder.class, className);
164
165 return builder.build(node, runtimeAttributes);
166
167 }
168
169 private StatusReporter buildReporter(FileResourceCrawler crawler, ConsumersManager consumersManager,
170 Node node, Map<String, String> runtimeAttributes) {
171
172 Map<String, String> attrs = XMLDOMUtil.mapifyAttrs(node, runtimeAttributes);
173 String className = attrs.get("builderClass");
174 if (className == null) {
175 throw new RuntimeException("Need to specify class name in reporter element");
176 }
177 StatusReporterBuilder builder = ClassLoaderUtil.buildClass(StatusReporterBuilder.class, className);
178
179 return builder.build(crawler, consumersManager, node, runtimeAttributes);
180
181 }
182
183 /**
184 * numConsumers is needed by both the crawler and the consumers. This utility method
185 * is to be used to extract the number of consumers from a map of String key value pairs.
186 * <p>
187 * If the value is "default", not a parseable integer, or has a value &lt; 1,
188 * then the value of <code>AbstractConsumersBuilder</code>'s <code>getDefaultNumConsumers()</code> is returned.
189 * @param attrs attributes from which to select the NUM_CONSUMERS_KEY
190 * @return number of consumers
191 */
192 public static int getNumConsumers(Map<String, String> attrs) {
193 String nString = attrs.get(BatchProcessBuilder.NUM_CONSUMERS_KEY);
194 if (nString == null || nString.equals("default")) {
195 return AbstractConsumersBuilder.getDefaultNumConsumers();
196 }
197 int n = -1;
198 try {
199 n = Integer.parseInt(nString);
200 } catch (NumberFormatException e) {
201 //swallow
202 }
203 if (n < 1) {
204 n = AbstractConsumersBuilder.getDefaultNumConsumers();
205 }
206 return n;
207 }
208
209 private Map<String, String> setNumConsumersInRuntimeAttributes(Node docElement, Map<String, String> incomingRuntimeAttributes) {
210 Map<String, String> runtimeAttributes = new HashMap<String, String>();
211
212 for(Map.Entry<String, String> e : incomingRuntimeAttributes.entrySet()) {
213 runtimeAttributes.put(e.getKey(), e.getValue());
214 }
215
216 //if this is set at runtime use that value
217 if (runtimeAttributes.containsKey(NUM_CONSUMERS_KEY)){
218 return Collections.unmodifiableMap(runtimeAttributes);
219 }
220 Node ncNode = docElement.getAttributes().getNamedItem("numConsumers");
221 int numConsumers = -1;
222 String numConsumersString = (ncNode == null) ? null : ncNode.getNodeValue();
223 try {
224 numConsumers = Integer.parseInt(numConsumersString);
225 } catch (NumberFormatException e) {
226 //swallow (missing or unparseable value) and fall back to the default below
227 }
228 //TODO: should we have a max range check?
229 if (numConsumers < 1) {
230 numConsumers = AbstractConsumersBuilder.getDefaultNumConsumers();
231 }
232 runtimeAttributes.put(NUM_CONSUMERS_KEY, Integer.toString(numConsumers));
233 return Collections.unmodifiableMap(runtimeAttributes);
234 }
235
236 //tries to get maxQueueSize from main element
237 private ArrayBlockingQueue<FileResource> buildQueue(Node docElement,
238 Map<String, String> runtimeAttributes) {
239 int maxQueueSize = DEFAULT_MAX_QUEUE_SIZE;
240 String szString = runtimeAttributes.get(MAX_QUEUE_SIZE_KEY);
241
242 if (szString == null) {
243 Node szNode = docElement.getAttributes().getNamedItem(MAX_QUEUE_SIZE_KEY);
244 if (szNode != null) {
245 szString = szNode.getNodeValue();
246 }
247 }
248
249 if (szString != null) {
250 try {
251 maxQueueSize = Integer.parseInt(szString);
252 } catch (NumberFormatException e) {
253 //swallow
254 }
255 }
256
257 if (maxQueueSize < 0) {
258 maxQueueSize = DEFAULT_MAX_QUEUE_SIZE;
259 }
260
261 return new ArrayBlockingQueue<FileResource>(maxQueueSize);
262 }
263
264 private ConsumersManager buildConsumersManager(Node node,
265 Map<String, String> runtimeAttributes, ArrayBlockingQueue<FileResource> queue) {
266
267 Map<String, String> attrs = XMLDOMUtil.mapifyAttrs(node, runtimeAttributes);
268 String className = attrs.get("builderClass");
269 if (className == null) {
270 throw new RuntimeException("Need to specify class name in consumers element");
271 }
272 AbstractConsumersBuilder builder = ClassLoaderUtil.buildClass(AbstractConsumersBuilder.class, className);
273
274 return builder.build(node, runtimeAttributes, queue);
275 }
276
277
278 private FileResourceCrawler buildCrawler(ArrayBlockingQueue<FileResource> queue,
279 Node node, Map<String, String> runtimeAttributes) {
280 Map<String, String> attrs = XMLDOMUtil.mapifyAttrs(node, runtimeAttributes);
281 String className = attrs.get("builderClass");
282 if (className == null) {
283 throw new RuntimeException("Need to specify class name in crawler element");
284 }
285
286 ICrawlerBuilder builder = ClassLoaderUtil.buildClass(ICrawlerBuilder.class, className);
287 return builder.build(node, runtimeAttributes, queue);
288 }
289
290
291
292
293
294 }
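The parse-or-default behavior of `getNumConsumers` in the file above (missing key, the literal "default", an unparseable value, and any value below 1 all resolve to the default) can be sketched standalone. The key name `"numConsumers"` and the constant 5 here are illustrative stand-ins for `BatchProcessBuilder.NUM_CONSUMERS_KEY` and `AbstractConsumersBuilder.getDefaultNumConsumers()`:

```java
import java.util.HashMap;
import java.util.Map;

public class NumConsumersDemo {
    // stand-in for AbstractConsumersBuilder.getDefaultNumConsumers()
    public static final int DEFAULT_NUM_CONSUMERS = 5;

    public static int getNumConsumers(Map<String, String> attrs) {
        String nString = attrs.get("numConsumers"); // stand-in key name
        if (nString == null || nString.equals("default")) {
            return DEFAULT_NUM_CONSUMERS;
        }
        int n = -1;
        try {
            n = Integer.parseInt(nString);
        } catch (NumberFormatException e) {
            // swallow; n stays -1 and the default is used below
        }
        return (n < 1) ? DEFAULT_NUM_CONSUMERS : n;
    }

    public static void main(String[] args) {
        Map<String, String> attrs = new HashMap<String, String>();
        System.out.println(getNumConsumers(attrs));  // missing key -> 5
        attrs.put("numConsumers", "oops");
        System.out.println(getNumConsumers(attrs));  // unparseable -> 5
        attrs.put("numConsumers", "3");
        System.out.println(getNumConsumers(attrs));  // valid -> 3
    }
}
```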
0 package org.apache.tika.batch.builders;
1
2 /*
3 * Licensed to the Apache Software Foundation (ASF) under one or more
4 * contributor license agreements. See the NOTICE file distributed with
5 * this work for additional information regarding copyright ownership.
6 * The ASF licenses this file to You under the Apache License, Version 2.0
7 * (the "License"); you may not use this file except in compliance with
8 * the License. You may obtain a copy of the License at
9 *
10 * http://www.apache.org/licenses/LICENSE-2.0
11 *
12 * Unless required by applicable law or agreed to in writing, software
13 * distributed under the License is distributed on an "AS IS" BASIS,
14 * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
15 * See the License for the specific language governing permissions and
16 * limitations under the License.
17 */
18
19 import javax.xml.parsers.DocumentBuilder;
20 import javax.xml.parsers.DocumentBuilderFactory;
21 import javax.xml.parsers.ParserConfigurationException;
22 import java.io.IOException;
23 import java.io.InputStream;
24 import java.util.Locale;
25
26 import org.apache.commons.cli.Option;
27 import org.apache.commons.cli.Options;
28 import org.w3c.dom.Document;
29 import org.w3c.dom.NamedNodeMap;
30 import org.w3c.dom.Node;
31 import org.w3c.dom.NodeList;
32 import org.xml.sax.SAXException;
33
34 /**
35 * Reads configurable options from a config file and returns an org.apache.commons.cli.Options
36 * object to be used in a commandline parser. This allows users and developers to set
37 * which options should be made available via the commandline.
38 */
39 public class CommandLineParserBuilder {
40
41 public Options build(InputStream is) throws IOException {
42 Document doc = null;
43 DocumentBuilderFactory fact = DocumentBuilderFactory.newInstance();
44 DocumentBuilder docBuilder = null;
45 try {
46 docBuilder = fact.newDocumentBuilder();
47 doc = docBuilder.parse(is);
48 } catch (ParserConfigurationException e) {
49 throw new IOException(e);
50 } catch (SAXException e) {
51 throw new IOException(e);
52 }
53 Node docElement = doc.getDocumentElement();
54 NodeList children = docElement.getChildNodes();
55 Node commandlineNode = null;
56 for (int i = 0; i < children.getLength(); i++) {
57 Node child = children.item(i);
58 if (child.getNodeType() != Node.ELEMENT_NODE) {
59 continue;
60 }
61 String nodeName = child.getNodeName();
62 if (nodeName.equals("commandline")) {
63 commandlineNode = child;
64 break;
65 }
66 }
67 Options options = new Options();
68 if (commandlineNode == null) {
69 return options;
70 }
71 NodeList optionNodes = commandlineNode.getChildNodes();
72 for (int i = 0; i < optionNodes.getLength(); i++) {
73
74 Node optionNode = optionNodes.item(i);
75 if (optionNode.getNodeType() != Node.ELEMENT_NODE) {
76 continue;
77 }
78 Option opt = buildOption(optionNode);
79 if (opt != null) {
80 options.addOption(opt);
81 }
82 }
83 return options;
84 }
85
86 private Option buildOption(Node optionNode) {
87 NamedNodeMap map = optionNode.getAttributes();
88 String opt = getString(map, "opt", "");
89 String description = getString(map, "description", "");
90 String longOpt = getString(map, "longOpt", "");
91 boolean isRequired = getBoolean(map, "required", false);
92 boolean hasArg = getBoolean(map, "hasArg", false);
93 if(opt.trim().length() == 0 || description.trim().length() == 0) {
94 throw new IllegalArgumentException(
95 "Must specify at least option and description");
96 }
97 Option option = new Option(opt, description);
98 if (longOpt.trim().length() > 0) {
99 option.setLongOpt(longOpt);
100 }
101 if (isRequired) {
102 option.setRequired(true);
103 }
104 if (hasArg) {
105 option.setArgs(1);
106 }
107 return option;
108 }
109
110 private boolean getBoolean(NamedNodeMap map, String opt, boolean defaultValue) {
111 Node n = map.getNamedItem(opt);
112 if (n == null) {
113 return defaultValue;
114 }
115
116 if (n.getNodeValue() == null) {
117 return defaultValue;
118 }
119
120 if (n.getNodeValue().toLowerCase(Locale.ROOT).equals("true")) {
121 return true;
122 } else if (n.getNodeValue().toLowerCase(Locale.ROOT).equals("false")) {
123 return false;
124 }
125 return defaultValue;
126 }
127
128 private String getString(NamedNodeMap map, String opt, String defaultVal) {
129 Node n = map.getNamedItem(opt);
130 if (n == null) {
131 return defaultVal;
132 }
133 String value = n.getNodeValue();
134
135 if (value == null) {
136 return defaultVal;
137 }
138 return value;
139 }
140
141
142 }
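The XML shape `CommandLineParserBuilder` expects can be sketched from the parsing loop above: a `commandline` element whose child elements carry `opt`, `longOpt`, `description`, `required` and `hasArg` attributes. The root element name and option values below are illustrative, and commons-cli's `Option` is swapped for a plain string so the sketch is self-contained:

```java
import java.io.ByteArrayInputStream;
import java.nio.charset.StandardCharsets;
import java.util.ArrayList;
import java.util.List;
import javax.xml.parsers.DocumentBuilderFactory;
import org.w3c.dom.Document;
import org.w3c.dom.NamedNodeMap;
import org.w3c.dom.Node;
import org.w3c.dom.NodeList;

public class CommandLineConfigDemo {
    // minimal config in the shape CommandLineParserBuilder reads (root name is illustrative)
    public static final String CONFIG =
        "<tika-batch-config>" +
        "  <commandline>" +
        "    <option opt=\"bc\" longOpt=\"batch-config\" hasArg=\"true\"" +
        "            description=\"batch config file\"/>" +
        "    <option opt=\"h\" longOpt=\"help\" description=\"print help\"/>" +
        "  </commandline>" +
        "</tika-batch-config>";

    // collect the short names of all options under the commandline element
    public static List<String> readOpts(String xml) throws Exception {
        Document doc = DocumentBuilderFactory.newInstance().newDocumentBuilder()
            .parse(new ByteArrayInputStream(xml.getBytes(StandardCharsets.UTF_8)));
        List<String> opts = new ArrayList<String>();
        NodeList children = doc.getDocumentElement().getChildNodes();
        for (int i = 0; i < children.getLength(); i++) {
            Node child = children.item(i);
            if (!"commandline".equals(child.getNodeName())) {
                continue;
            }
            NodeList optionNodes = child.getChildNodes();
            for (int j = 0; j < optionNodes.getLength(); j++) {
                Node o = optionNodes.item(j);
                if (o.getNodeType() != Node.ELEMENT_NODE) {
                    continue; // skip whitespace text nodes
                }
                NamedNodeMap m = o.getAttributes();
                opts.add(m.getNamedItem("opt").getNodeValue());
            }
        }
        return opts;
    }

    public static void main(String[] args) throws Exception {
        System.out.println(readOpts(CONFIG)); // [bc, h]
    }
}
```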
0 package org.apache.tika.batch.builders;
1
2 /*
3 * Licensed to the Apache Software Foundation (ASF) under one or more
4 * contributor license agreements. See the NOTICE file distributed with
5 * this work for additional information regarding copyright ownership.
6 * The ASF licenses this file to You under the Apache License, Version 2.0
7 * (the "License"); you may not use this file except in compliance with
8 * the License. You may obtain a copy of the License at
9 *
10 * http://www.apache.org/licenses/LICENSE-2.0
11 *
12 * Unless required by applicable law or agreed to in writing, software
13 * distributed under the License is distributed on an "AS IS" BASIS,
14 * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
15 * See the License for the specific language governing permissions and
16 * limitations under the License.
17 */
18
19 import java.util.Locale;
20 import java.util.Map;
21
22 import org.apache.tika.sax.BasicContentHandlerFactory;
23 import org.apache.tika.sax.ContentHandlerFactory;
24 import org.apache.tika.util.XMLDOMUtil;
25 import org.w3c.dom.Node;
26
27 /**
28 * Builds a BasicContentHandlerFactory with the handler type defined by the
29 * attribute "basicHandlerType", with possible values: xml, html, text, body, ignore.
30 * Default is text.
31 * <p>
32 * Sets the writeLimit to the value of the "writeLimit" attribute.
33 * Default is -1.
34 */
35 public class DefaultContentHandlerFactoryBuilder implements IContentHandlerFactoryBuilder {
36
37 @Override
38 public ContentHandlerFactory build(Node node, Map<String, String> runtimeAttributes) {
39 Map<String, String> attributes = XMLDOMUtil.mapifyAttrs(node, runtimeAttributes);
40 BasicContentHandlerFactory.HANDLER_TYPE type = null;
41 String handlerTypeString = attributes.get("basicHandlerType");
42 if (handlerTypeString == null) {
43 handlerTypeString = "text";
44 }
45 handlerTypeString = handlerTypeString.toLowerCase(Locale.ROOT);
46 if (handlerTypeString.equals("xml")) {
47 type = BasicContentHandlerFactory.HANDLER_TYPE.XML;
48 } else if (handlerTypeString.equals("text")) {
49 type = BasicContentHandlerFactory.HANDLER_TYPE.TEXT;
50 } else if (handlerTypeString.equals("txt")) {
51 type = BasicContentHandlerFactory.HANDLER_TYPE.TEXT;
52 } else if (handlerTypeString.equals("html")) {
53 type = BasicContentHandlerFactory.HANDLER_TYPE.HTML;
54 } else if (handlerTypeString.equals("body")) {
55 type = BasicContentHandlerFactory.HANDLER_TYPE.BODY;
56 } else if (handlerTypeString.equals("ignore")) {
57 type = BasicContentHandlerFactory.HANDLER_TYPE.IGNORE;
58 } else {
59 type = BasicContentHandlerFactory.HANDLER_TYPE.TEXT;
60 }
61 int writeLimit = -1;
62 String writeLimitString = attributes.get("writeLimit");
63 if (writeLimitString != null) {
64 try {
65 writeLimit = Integer.parseInt(attributes.get("writeLimit"));
66 } catch (NumberFormatException e) {
67 //swallow and default to -1
68 //TODO: should we throw a RuntimeException?
69 }
70 }
71 return new BasicContentHandlerFactory(type, writeLimit);
72 }
73
74
75 }
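The string-to-handler-type mapping above ("txt" as an alias for "text", and anything unrecognized falling back to TEXT) can be sketched without Tika's classes; `HandlerType` here is a local stand-in for `BasicContentHandlerFactory.HANDLER_TYPE`:

```java
import java.util.Locale;

public class HandlerTypeDemo {
    public enum HandlerType { XML, TEXT, HTML, BODY, IGNORE }

    public static HandlerType fromString(String s) {
        if (s == null) {
            return HandlerType.TEXT;       // missing attribute -> default
        }
        switch (s.toLowerCase(Locale.ROOT)) {
            case "xml":    return HandlerType.XML;
            case "html":   return HandlerType.HTML;
            case "body":   return HandlerType.BODY;
            case "ignore": return HandlerType.IGNORE;
            case "text":
            case "txt":                    // alias, as in the builder above
            default:       return HandlerType.TEXT;
        }
    }

    public static void main(String[] args) {
        System.out.println(fromString("XML"));    // XML (case-insensitive)
        System.out.println(fromString("txt"));    // TEXT
        System.out.println(fromString("bogus"));  // TEXT (fallback)
    }
}
```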
0 package org.apache.tika.batch.builders;
1
2 /*
3 * Licensed to the Apache Software Foundation (ASF) under one or more
4 * contributor license agreements. See the NOTICE file distributed with
5 * this work for additional information regarding copyright ownership.
6 * The ASF licenses this file to You under the Apache License, Version 2.0
7 * (the "License"); you may not use this file except in compliance with
8 * the License. You may obtain a copy of the License at
9 *
10 * http://www.apache.org/licenses/LICENSE-2.0
11 *
12 * Unless required by applicable law or agreed to in writing, software
13 * distributed under the License is distributed on an "AS IS" BASIS,
14 * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
15 * See the License for the specific language governing permissions and
16 * limitations under the License.
17 */
18
19 import java.util.Map;
20
21 import org.apache.tika.sax.ContentHandlerFactory;
22 import org.w3c.dom.Node;
23
24 public interface IContentHandlerFactoryBuilder extends ObjectFromDOMBuilder<ContentHandlerFactory> {
25
26 public ContentHandlerFactory build(Node node, Map<String, String> attributes);
27
28 }
0 package org.apache.tika.batch.builders;
1
2 /*
3 * Licensed to the Apache Software Foundation (ASF) under one or more
4 * contributor license agreements. See the NOTICE file distributed with
5 * this work for additional information regarding copyright ownership.
6 * The ASF licenses this file to You under the Apache License, Version 2.0
7 * (the "License"); you may not use this file except in compliance with
8 * the License. You may obtain a copy of the License at
9 *
10 * http://www.apache.org/licenses/LICENSE-2.0
11 *
12 * Unless required by applicable law or agreed to in writing, software
13 * distributed under the License is distributed on an "AS IS" BASIS,
14 * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
15 * See the License for the specific language governing permissions and
16 * limitations under the License.
17 */
18
19 import org.apache.tika.batch.FileResource;
20 import org.apache.tika.batch.FileResourceCrawler;
21 import org.w3c.dom.Node;
22
23 import java.util.Map;
24 import java.util.concurrent.ArrayBlockingQueue;
25
26 public interface ICrawlerBuilder extends ObjectFromDOMAndQueueBuilder<FileResourceCrawler>{
27
28 public FileResourceCrawler build(Node node, Map<String, String> attributes,
29 ArrayBlockingQueue<FileResource> queue);
30
31 }
0 package org.apache.tika.batch.builders;
1
2 /*
3 * Licensed to the Apache Software Foundation (ASF) under one or more
4 * contributor license agreements. See the NOTICE file distributed with
5 * this work for additional information regarding copyright ownership.
6 * The ASF licenses this file to You under the Apache License, Version 2.0
7 * (the "License"); you may not use this file except in compliance with
8 * the License. You may obtain a copy of the License at
9 *
10 * http://www.apache.org/licenses/LICENSE-2.0
11 *
12 * Unless required by applicable law or agreed to in writing, software
13 * distributed under the License is distributed on an "AS IS" BASIS,
14 * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
15 * See the License for the specific language governing permissions and
16 * limitations under the License.
17 */
18
19 import java.util.Map;
20
21 import org.apache.tika.batch.Interrupter;
22 import org.w3c.dom.Node;
23
24 /**
25 * Builds an Interrupter
26 */
27 public class InterrupterBuilder {
28
29 public Interrupter build(Node n, Map<String, String> commandlineArguments) {
30 return new Interrupter();
31 }
32 }
0 package org.apache.tika.batch.builders;
1
2 /*
3 * Licensed to the Apache Software Foundation (ASF) under one or more
4 * contributor license agreements. See the NOTICE file distributed with
5 * this work for additional information regarding copyright ownership.
6 * The ASF licenses this file to You under the Apache License, Version 2.0
7 * (the "License"); you may not use this file except in compliance with
8 * the License. You may obtain a copy of the License at
9 *
10 * http://www.apache.org/licenses/LICENSE-2.0
11 *
12 * Unless required by applicable law or agreed to in writing, software
13 * distributed under the License is distributed on an "AS IS" BASIS,
14 * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
15 * See the License for the specific language governing permissions and
16 * limitations under the License.
17 */
18
19 import java.util.Map;
20 import java.util.concurrent.ArrayBlockingQueue;
21
22 import org.apache.tika.batch.FileResource;
23 import org.w3c.dom.Node;
24
25 /**
26 * Same as {@link org.apache.tika.batch.builders.ObjectFromDOMBuilder},
27 * but this is for objects that require access to the shared queue.
28 * @param <T>
29 */
30 public interface ObjectFromDOMAndQueueBuilder<T> {
31
32 public T build(Node node, Map<String, String> runtimeAttributes,
33 ArrayBlockingQueue<FileResource> resourceQueue);
34
35 }
0 package org.apache.tika.batch.builders;
1
2 /*
3 * Licensed to the Apache Software Foundation (ASF) under one or more
4 * contributor license agreements. See the NOTICE file distributed with
5 * this work for additional information regarding copyright ownership.
6 * The ASF licenses this file to You under the Apache License, Version 2.0
7 * (the "License"); you may not use this file except in compliance with
8 * the License. You may obtain a copy of the License at
9 *
10 * http://www.apache.org/licenses/LICENSE-2.0
11 *
12 * Unless required by applicable law or agreed to in writing, software
13 * distributed under the License is distributed on an "AS IS" BASIS,
14 * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
15 * See the License for the specific language governing permissions and
16 * limitations under the License.
17 */
18
19 import org.w3c.dom.Node;
20
21 import java.util.Map;
22
23 /**
24 * Interface for things that build objects from a DOM Node and a map of runtime attributes
25 * @param <T>
26 */
27 public interface ObjectFromDOMBuilder<T> {
28
29 public T build(Node node, Map<String, String> runtimeAttributes);
30 }
0 package org.apache.tika.batch.builders;
1
2 /*
3 * Licensed to the Apache Software Foundation (ASF) under one or more
4 * contributor license agreements. See the NOTICE file distributed with
5 * this work for additional information regarding copyright ownership.
6 * The ASF licenses this file to You under the Apache License, Version 2.0
7 * (the "License"); you may not use this file except in compliance with
8 * the License. You may obtain a copy of the License at
9 *
10 * http://www.apache.org/licenses/LICENSE-2.0
11 *
12 * Unless required by applicable law or agreed to in writing, software
13 * distributed under the License is distributed on an "AS IS" BASIS,
14 * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
15 * See the License for the specific language governing permissions and
16 * limitations under the License.
17 */
18
19 import java.util.Map;
20
21 import org.apache.tika.batch.StatusReporter;
22 import org.w3c.dom.Node;
23
24 /**
25 * Interface for reporter builders
26 */
27 public interface ReporterBuilder extends ObjectFromDOMBuilder<StatusReporter> {
28 public StatusReporter build(Node n, Map<String, String> runtimeAttributes);
29 }
0 package org.apache.tika.batch.builders;
1
2 /*
3 * Licensed to the Apache Software Foundation (ASF) under one or more
4 * contributor license agreements. See the NOTICE file distributed with
5 * this work for additional information regarding copyright ownership.
6 * The ASF licenses this file to You under the Apache License, Version 2.0
7 * (the "License"); you may not use this file except in compliance with
8 * the License. You may obtain a copy of the License at
9 *
10 * http://www.apache.org/licenses/LICENSE-2.0
11 *
12 * Unless required by applicable law or agreed to in writing, software
13 * distributed under the License is distributed on an "AS IS" BASIS,
14 * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
15 * See the License for the specific language governing permissions and
16 * limitations under the License.
17 */
18
19 import java.util.Map;
20
21 import org.apache.tika.batch.ConsumersManager;
22 import org.apache.tika.batch.FileResourceCrawler;
23 import org.apache.tika.batch.StatusReporter;
24 import org.apache.tika.util.PropsUtil;
25 import org.apache.tika.util.XMLDOMUtil;
26 import org.w3c.dom.Node;
27
28 public class SimpleLogReporterBuilder implements StatusReporterBuilder {
29
30 @Override
31 public StatusReporter build(FileResourceCrawler crawler, ConsumersManager consumersManager,
32 Node n, Map<String, String> commandlineArguments) {
33
34 Map<String, String> attributes = XMLDOMUtil.mapifyAttrs(n, commandlineArguments);
35 long sleepMillis = PropsUtil.getLong(attributes.get("reporterSleepMillis"), 1000L);
36 long staleThresholdMillis = PropsUtil.getLong(attributes.get("reporterStaleThresholdMillis"), 500000L);
37 StatusReporter reporter = new StatusReporter(crawler, consumersManager);
38 reporter.setSleepMillis(sleepMillis);
39 reporter.setStaleThresholdMillis(staleThresholdMillis);
40 return reporter;
41 }
42 }
0 package org.apache.tika.batch.builders;
1
2 /*
3 * Licensed to the Apache Software Foundation (ASF) under one or more
4 * contributor license agreements. See the NOTICE file distributed with
5 * this work for additional information regarding copyright ownership.
6 * The ASF licenses this file to You under the Apache License, Version 2.0
7 * (the "License"); you may not use this file except in compliance with
8 * the License. You may obtain a copy of the License at
9 *
10 * http://www.apache.org/licenses/LICENSE-2.0
11 *
12 * Unless required by applicable law or agreed to in writing, software
13 * distributed under the License is distributed on an "AS IS" BASIS,
14 * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
15 * See the License for the specific language governing permissions and
16 * limitations under the License.
17 */
18
19 import java.util.Map;
20
21 import org.apache.tika.batch.ConsumersManager;
22 import org.apache.tika.batch.FileResourceCrawler;
23 import org.apache.tika.batch.StatusReporter;
24 import org.w3c.dom.Node;
25
26 public interface StatusReporterBuilder {
27
28 public StatusReporter build(FileResourceCrawler crawler, ConsumersManager consumers,
29 Node n, Map<String, String> commandlineArguments);
30 }
0 package org.apache.tika.batch.fs;
1
2 /*
3 * Licensed to the Apache Software Foundation (ASF) under one or more
4 * contributor license agreements. See the NOTICE file distributed with
5 * this work for additional information regarding copyright ownership.
6 * The ASF licenses this file to You under the Apache License, Version 2.0
7 * (the "License"); you may not use this file except in compliance with
8 * the License. You may obtain a copy of the License at
9 *
10 * http://www.apache.org/licenses/LICENSE-2.0
11 *
12 * Unless required by applicable law or agreed to in writing, software
13 * distributed under the License is distributed on an "AS IS" BASIS,
14 * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
15 * See the License for the specific language governing permissions and
16 * limitations under the License.
17 */
18
19 import java.io.IOException;
20 import java.io.InputStream;
21 import java.io.OutputStream;
22 import java.util.concurrent.ArrayBlockingQueue;
23
24 import org.apache.tika.batch.BatchNoRestartError;
25 import org.apache.tika.batch.FileResource;
26 import org.apache.tika.batch.FileResourceConsumer;
27 import org.apache.tika.batch.OutputStreamFactory;
28 import org.apache.tika.metadata.Metadata;
29 import org.apache.tika.parser.ParseContext;
30 import org.apache.tika.parser.Parser;
31 import org.xml.sax.ContentHandler;
32
33 public abstract class AbstractFSConsumer extends FileResourceConsumer {
34
35 public AbstractFSConsumer(ArrayBlockingQueue<FileResource> fileQueue) {
36 super(fileQueue);
37 }
38
39 /**
40 * Use this for consistent logging of exceptions. Clients must
41 * check whether the returned OutputStream is null; null signals
42 * that the output file already exists and should be skipped.
43 *
44 * @param fsOSFactory factory that creates the outputstream
45 * @param fileResource used by the OSFactory to create the stream
46 * @return the OutputStream or null if the output file already exists
47 */
48 protected OutputStream getOutputStream(OutputStreamFactory fsOSFactory,
49 FileResource fileResource) {
50 OutputStream os = null;
51 try {
52 os = fsOSFactory.getOutputStream(fileResource.getMetadata());
53 } catch (IOException e) {
54 //This can happen if the disk has run out of space,
55 //or if there was a failure with mkdirs in fsOSFactory
56 logger.error("{}", getXMLifiedLogMsg(IO_OS,
57 fileResource.getResourceId(), e));
58 throw new BatchNoRestartError("IOException trying to open output stream for " +
59 fileResource.getResourceId() + " :: " + e.getMessage());
60 }
61 return os;
62 }
63
64 /**
65 * Opens an InputStream for the given resource, logging any IOException.
66 * @param fileResource resource whose input stream should be opened
67 * @return the InputStream, or null if an exception occurred while opening it
68 */
69 protected InputStream getInputStream(FileResource fileResource) {
70 InputStream is = null;
71 try {
72 is = fileResource.openInputStream();
73 } catch (IOException e) {
74 logger.warn("{}", getXMLifiedLogMsg(IO_IS,
75 fileResource.getResourceId(), e));
76 flushAndClose(is);
77 }
78 return is;
79 }
80
81 }
0 package org.apache.tika.batch.fs;
1
2 /*
3 * Licensed to the Apache Software Foundation (ASF) under one or more
4 * contributor license agreements. See the NOTICE file distributed with
5 * this work for additional information regarding copyright ownership.
6 * The ASF licenses this file to You under the Apache License, Version 2.0
7 * (the "License"); you may not use this file except in compliance with
8 * the License. You may obtain a copy of the License at
9 *
10 * http://www.apache.org/licenses/LICENSE-2.0
11 *
12 * Unless required by applicable law or agreed to in writing, software
13 * distributed under the License is distributed on an "AS IS" BASIS,
14 * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
15 * See the License for the specific language governing permissions and
16 * limitations under the License.
17 */
18
19 import java.io.InputStream;
20 import java.io.OutputStream;
21 import java.io.UnsupportedEncodingException;
22 import java.util.concurrent.ArrayBlockingQueue;
23
24 import org.apache.tika.batch.FileResource;
25 import org.apache.tika.batch.OutputStreamFactory;
26 import org.apache.tika.batch.ParserFactory;
27 import org.apache.tika.config.TikaConfig;
28 import org.apache.tika.io.IOUtils;
29 import org.apache.tika.parser.ParseContext;
30 import org.apache.tika.parser.Parser;
31 import org.apache.tika.sax.ContentHandlerFactory;
32 import org.xml.sax.ContentHandler;
33
34 /**
35 * Basic FileResourceConsumer that reads files from an input
36 * directory and writes content to the output directory.
37 * <p>
38 * This catches and logs all exceptions thrown during parsing;
39 * Errors, however, are re-thrown.
40 *
41 */
42 public class BasicTikaFSConsumer extends AbstractFSConsumer {
43
44 private boolean parseRecursively = true;
45 private final ParserFactory parserFactory;
46 private final ContentHandlerFactory contentHandlerFactory;
47 private final OutputStreamFactory fsOSFactory;
48 private final TikaConfig config;
49 private String outputEncoding = IOUtils.UTF_8.toString();
50
51
52 public BasicTikaFSConsumer(ArrayBlockingQueue<FileResource> queue,
53 ParserFactory parserFactory,
54 ContentHandlerFactory contentHandlerFactory,
55 OutputStreamFactory fsOSFactory,
56 TikaConfig config) {
57 super(queue);
58 this.parserFactory = parserFactory;
59 this.contentHandlerFactory = contentHandlerFactory;
60 this.fsOSFactory = fsOSFactory;
61 this.config = config;
62 }
63
64 @Override
65 public boolean processFileResource(FileResource fileResource) {
66
67 Parser parser = parserFactory.getParser(config);
68 ParseContext context = new ParseContext();
69 if (parseRecursively) {
70 context.set(Parser.class, parser);
71 }
72
73 OutputStream os = getOutputStream(fsOSFactory, fileResource);
74 //os can be null if fsOSFactory is set to skip processing a file if the output
75 //file already exists
76 if (os == null) {
77 logger.debug("Skipping: " + fileResource.getMetadata().get(FSProperties.FS_REL_PATH));
78 return false;
79 }
80
81 InputStream is = getInputStream(fileResource);
82 if (is == null) {
83 IOUtils.closeQuietly(os);
84 return false;
85 }
86 ContentHandler handler;
87 try {
88 handler = contentHandlerFactory.getNewContentHandler(os, getOutputEncoding());
89 } catch (UnsupportedEncodingException e) {
90 incrementHandledExceptions();
91 logger.error(getXMLifiedLogMsg("output_encoding_ex",
92 fileResource.getResourceId(), e));
93 flushAndClose(os);
94 throw new RuntimeException(e.getMessage(), e);
95 }
96
97 //now actually call parse!
98 Throwable thrown = null;
99 try {
100 parse(fileResource.getResourceId(), parser, is, handler,
101 fileResource.getMetadata(), context);
102 } catch (Error t) {
103 throw t;
104 } catch (Throwable t) {
105 thrown = t;
106 } finally {
107 flushAndClose(os);
108 }
109
110 if (thrown != null) {
111 return false;
112 }
113 return true;
114 }
115
116 public String getOutputEncoding() {
117 return outputEncoding;
118 }
119
120 public void setOutputEncoding(String outputEncoding) {
121 this.outputEncoding = outputEncoding;
122 }
123 }
0 package org.apache.tika.batch.fs;
1
2 /*
3 * Licensed to the Apache Software Foundation (ASF) under one or more
4 * contributor license agreements. See the NOTICE file distributed with
5 * this work for additional information regarding copyright ownership.
6 * The ASF licenses this file to You under the Apache License, Version 2.0
7 * (the "License"); you may not use this file except in compliance with
8 * the License. You may obtain a copy of the License at
9 *
10 * http://www.apache.org/licenses/LICENSE-2.0
11 *
12 * Unless required by applicable law or agreed to in writing, software
13 * distributed under the License is distributed on an "AS IS" BASIS,
14 * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
15 * See the License for the specific language governing permissions and
16 * limitations under the License.
17 */
18
19 import java.io.File;
20 import java.io.IOException;
21 import java.util.HashMap;
22 import java.util.Map;
23 import java.util.concurrent.ExecutorService;
24 import java.util.concurrent.Executors;
25 import java.util.concurrent.Future;
26
27 import org.apache.commons.cli.CommandLine;
28 import org.apache.commons.cli.CommandLineParser;
29 import org.apache.commons.cli.GnuParser;
30 import org.apache.commons.cli.HelpFormatter;
31 import org.apache.commons.cli.Option;
32 import org.apache.commons.cli.Options;
33 import org.apache.tika.batch.BatchProcess;
34 import org.apache.tika.batch.BatchProcessDriverCLI;
35 import org.apache.tika.batch.ParallelFileProcessingResult;
36 import org.apache.tika.batch.builders.BatchProcessBuilder;
37 import org.apache.tika.batch.builders.CommandLineParserBuilder;
38 import org.apache.tika.io.IOUtils;
39 import org.apache.tika.io.TikaInputStream;
40 import org.slf4j.Logger;
41 import org.slf4j.LoggerFactory;
42 import org.slf4j.MarkerFactory;
43
44 public class FSBatchProcessCLI {
45
46 public static String FINISHED_STRING = "Main thread in TikaFSBatchCLI has finished processing.";
47
48 private static Logger logger = LoggerFactory.getLogger(FSBatchProcessCLI.class);
49 private final Options options;
50
51 public FSBatchProcessCLI(String[] args) throws IOException {
52 TikaInputStream configIs = null;
53 try {
54 configIs = getConfigInputStream(args, true);
55 CommandLineParserBuilder builder = new CommandLineParserBuilder();
56 options = builder.build(configIs);
57 } finally {
58 IOUtils.closeQuietly(configIs);
59 }
60 }
61
62 public void usage() {
63 HelpFormatter helpFormatter = new HelpFormatter();
64 helpFormatter.printHelp("tika filesystem batch", options);
65 }
66
67 private TikaInputStream getConfigInputStream(String[] args, boolean logDefault) throws IOException {
68 TikaInputStream is = null;
69 File batchConfigFile = getConfigFile(args);
70 if (batchConfigFile != null) {
71 //this will throw IOException if it can't find a specified config file
72 //better to throw an exception than silently back off to default.
73 is = TikaInputStream.get(batchConfigFile);
74 } else {
75 if (logDefault) {
76 logger.info("No config file set via -bc, relying on default-tika-batch-config.xml");
77 }
78 is = TikaInputStream.get(
79 FSBatchProcessCLI.class.getResourceAsStream("default-tika-batch-config.xml"));
80 }
81 return is;
82 }
83
84 private void execute(String[] args) throws Exception {
85
86 CommandLineParser cliParser = new GnuParser();
87 CommandLine line = cliParser.parse(options, args);
88
89 if (line.hasOption("help")) {
90 usage();
91 System.exit(BatchProcessDriverCLI.PROCESS_NO_RESTART_EXIT_CODE);
92 }
93
94 Map<String, String> mapArgs = new HashMap<String, String>();
95 for (Option option : line.getOptions()) {
96 String v = option.getValue();
97 if (v == null || v.equals("")) {
98 v = "true";
99 }
100 mapArgs.put(option.getOpt(), v);
101 }
102
103 BatchProcessBuilder b = new BatchProcessBuilder();
104 TikaInputStream is = null;
105 BatchProcess process = null;
106 try {
107 is = getConfigInputStream(args, false);
108 process = b.build(is, mapArgs);
109 } finally {
110 IOUtils.closeQuietly(is);
111 }
112 final Thread mainThread = Thread.currentThread();
113
114
115 ExecutorService executor = Executors.newSingleThreadExecutor();
116 Future<ParallelFileProcessingResult> futureResult = executor.submit(process);
117
118 ParallelFileProcessingResult result = futureResult.get();
119 System.out.println(FINISHED_STRING);
120 System.out.println("\n");
121 System.out.println(result.toString());
122 System.exit(result.getExitStatus());
123 }
124
125 private File getConfigFile(String[] args) {
126 File configFile = null;
127 for (int i = 0; i < args.length; i++) {
128 if (args[i].equals("-bc") || args[i].equals("-batch-config")) {
129 if (i < args.length-1) {
130 configFile = new File(args[i+1]);
131 }
132 }
133 }
134 return configFile;
135 }
136
137 public static void main(String[] args) throws Exception {
138
139 try{
140 FSBatchProcessCLI cli = new FSBatchProcessCLI(args);
141 cli.execute(args);
142 } catch (Throwable t) {
143 t.printStackTrace();
144 logger.error(MarkerFactory.getMarker("FATAL"),
145 "Fatal exception from FSBatchProcessCLI: " + t.getMessage(), t);
146 System.exit(BatchProcessDriverCLI.PROCESS_NO_RESTART_EXIT_CODE);
147 }
148 }
149
150 }
0 package org.apache.tika.batch.fs;
1
2 /*
3 * Licensed to the Apache Software Foundation (ASF) under one or more
4 * contributor license agreements. See the NOTICE file distributed with
5 * this work for additional information regarding copyright ownership.
6 * The ASF licenses this file to You under the Apache License, Version 2.0
7 * (the "License"); you may not use this file except in compliance with
8 * the License. You may obtain a copy of the License at
9 *
10 * http://www.apache.org/licenses/LICENSE-2.0
11 *
12 * Unless required by applicable law or agreed to in writing, software
13 * distributed under the License is distributed on an "AS IS" BASIS,
14 * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
15 * See the License for the specific language governing permissions and
16 * limitations under the License.
17 */
18
19 import java.util.List;
20
21 import org.apache.tika.batch.ConsumersManager;
22 import org.apache.tika.batch.FileResourceConsumer;
23
24 public class FSConsumersManager extends ConsumersManager {
25
26
27 public FSConsumersManager(List<FileResourceConsumer> consumers) {
28 super(consumers);
29 }
30
31 @Override
32 public void init() {
33 //noop
34 }
35
36 @Override
37 public void shutdown() {
38 //noop
39 }
40
41 }
0 package org.apache.tika.batch.fs;
1 /*
2 * Licensed to the Apache Software Foundation (ASF) under one or more
3 * contributor license agreements. See the NOTICE file distributed with
4 * this work for additional information regarding copyright ownership.
5 * The ASF licenses this file to You under the Apache License, Version 2.0
6 * (the "License"); you may not use this file except in compliance with
7 * the License. You may obtain a copy of the License at
8 *
9 * http://www.apache.org/licenses/LICENSE-2.0
10 *
11 * Unless required by applicable law or agreed to in writing, software
12 * distributed under the License is distributed on an "AS IS" BASIS,
13 * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
14 * See the License for the specific language governing permissions and
15 * limitations under the License.
16 */
17
18 import java.io.File;
19 import java.util.ArrayList;
20 import java.util.Arrays;
21 import java.util.Collections;
22 import java.util.Comparator;
23 import java.util.List;
24 import java.util.concurrent.ArrayBlockingQueue;
25
26 import org.apache.tika.batch.FileResource;
27 import org.apache.tika.batch.FileResourceCrawler;
28
29 public class FSDirectoryCrawler extends FileResourceCrawler {
30
31 public enum CRAWL_ORDER
32 {
33 SORTED, //alphabetical order; necessary for cross-platform unit tests
34 RANDOM, //shuffle
35 OS_ORDER //operating system chooses
36 }
37
38 private final File root;
39 private final File startDirectory;
40 private final Comparator<File> fileComparator = new FileNameComparator();
41 private CRAWL_ORDER crawlOrder;
42
43 public FSDirectoryCrawler(ArrayBlockingQueue<FileResource> fileQueue,
44 int numConsumers, File root, CRAWL_ORDER crawlOrder) {
45 super(fileQueue, numConsumers);
46 this.root = root;
47 this.startDirectory = root;
48 this.crawlOrder = crawlOrder;
49 if (! startDirectory.isDirectory()) {
50 throw new RuntimeException("Crawler couldn't find this directory: " + startDirectory.getAbsolutePath());
51 }
52
53 }
54
55 public FSDirectoryCrawler(ArrayBlockingQueue<FileResource> fileQueue,
56 int numConsumers, File root, File startDirectory,
57 CRAWL_ORDER crawlOrder) {
58 super(fileQueue, numConsumers);
59 this.root = root;
60 this.startDirectory = startDirectory;
61 this.crawlOrder = crawlOrder;
62 assert(FSUtil.checkThisIsAncestorOfOrSameAsThat(root, startDirectory));
63 if (! startDirectory.isDirectory()) {
64 throw new RuntimeException("Crawler couldn't find this directory: " + startDirectory.getAbsolutePath());
65 }
66 }
67
68 public void start() throws InterruptedException {
69 addFiles(startDirectory);
70 }
71
72 private void addFiles(File directory) throws InterruptedException {
73
74 if (directory == null ||
75 !directory.isDirectory() || !directory.canRead()) {
76 String path = "null path";
77 if (directory != null) {
78 path = directory.getAbsolutePath();
79 }
80 logger.warn("FSFileAdder can't read this directory: " + path);
81 return;
82 }
83
84 List<File> directories = new ArrayList<File>();
85 File[] fileArr = directory.listFiles();
86 if (fileArr == null) {
87 logger.info("Empty directory: " + directory.getAbsolutePath());
88 return;
89 }
90
91 List<File> files = new ArrayList<File>(Arrays.asList(fileArr));
92
93 if (crawlOrder == CRAWL_ORDER.RANDOM) {
94 Collections.shuffle(files);
95 } else if (crawlOrder == CRAWL_ORDER.SORTED) {
96 Collections.sort(files, fileComparator);
97 }
98
99 int numFiles = 0;
100 for (File f : files) {
101 if (Thread.currentThread().isInterrupted()) {
102 throw new InterruptedException("file adder interrupted");
103 }
104
105 if (f.isFile()) {
106 numFiles++;
107 if (numFiles == 1) {
108 handleFirstFileInDirectory(f);
109 }
110 }
111 if (f.isDirectory()) {
112 directories.add(f);
113 continue;
114 }
115 int added = tryToAdd(new FSFileResource(root, f));
116 if (added == FileResourceCrawler.STOP_NOW) {
117 logger.debug("crawler has hit a limit: "+f.getAbsolutePath() + " : " + added);
118 return;
119 }
120 logger.debug("trying to add: "+f.getAbsolutePath() + " : " + added);
121 }
122
123 for (File f : directories) {
124 addFiles(f);
125 }
126 }
127
128 /**
129 * Override this if you have any special handling
130 * for the first actual file that the crawler comes across
131 * in a directory. For example, it might be handy to call
132 * mkdirs() on an output directory if your FileResourceConsumers
133 * are writing to a file.
134 *
135 * @param f file to handle
136 */
137 public void handleFirstFileInDirectory(File f) {
138 //no-op
139 }
140
141 //simple lexical order for the file name, we don't really care about localization.
142 //we do want this, though, because file.compareTo behaves differently
143 //on different OS's.
144 private class FileNameComparator implements Comparator<File> {
145
146 @Override
147 public int compare(File f1, File f2) {
148 if (f1 == null || f2 == null) {
149 return 0;
150 }
151 return f1.getName().compareTo(f2.getName());
152 }
153 }
154 }
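The FileNameComparator above exists because File.compareTo behaves differently across operating systems, while plain lexical comparison of names gives a stable order everywhere. A minimal standalone sketch (hypothetical class name, not part of the patch):

```java
import java.util.Arrays;
import java.util.Comparator;

public class CrawlOrderSketch {
    // Stand-in for FSDirectoryCrawler's FileNameComparator: compare raw
    // file names lexically, independent of OS collation rules.
    public static final Comparator<String> BY_NAME = Comparator.naturalOrder();

    public static String[] sorted(String[] names) {
        String[] copy = names.clone();
        Arrays.sort(copy, BY_NAME);
        return copy;
    }

    public static void main(String[] args) {
        // String.compareTo is case-sensitive: uppercase sorts before lowercase
        System.out.println(Arrays.toString(
                sorted(new String[]{"b.txt", "A.txt", "a.txt"}))); // [A.txt, a.txt, b.txt]
    }
}
```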
0 package org.apache.tika.batch.fs;
1
2 /*
3 * Licensed to the Apache Software Foundation (ASF) under one or more
4 * contributor license agreements. See the NOTICE file distributed with
5 * this work for additional information regarding copyright ownership.
6 * The ASF licenses this file to You under the Apache License, Version 2.0
7 * (the "License"); you may not use this file except in compliance with
8 * the License. You may obtain a copy of the License at
9 *
10 * http://www.apache.org/licenses/LICENSE-2.0
11 *
12 * Unless required by applicable law or agreed to in writing, software
13 * distributed under the License is distributed on an "AS IS" BASIS,
14 * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
15 * See the License for the specific language governing permissions and
16 * limitations under the License.
17 */
18
19 import java.util.regex.Matcher;
20 import java.util.regex.Pattern;
21
22 import org.apache.tika.extractor.DocumentSelector;
23 import org.apache.tika.metadata.Metadata;
24 import org.apache.tika.util.PropsUtil;
25
26 /**
27 * Selector that chooses files based on their file name
28 * and their size, as determined by Metadata.RESOURCE_NAME_KEY and Metadata.CONTENT_LENGTH.
29 * <p/>
30 * The {@link #excludeFileName} pattern is applied first (if it isn't null).
31 * Then the {@link #includeFileName} pattern is applied (if it isn't null),
32 * and finally, the size limit is applied if it is above 0.
33 */
34 public class FSDocumentSelector implements DocumentSelector {
35
36 //can be null!
37 private final Pattern includeFileName;
38
39 //can be null!
40 private final Pattern excludeFileName;
41 private final long maxFileSizeBytes;
42 private final long minFileSizeBytes;
43
44 public FSDocumentSelector(Pattern includeFileName, Pattern excludeFileName, long minFileSizeBytes,
45 long maxFileSizeBytes) {
46 this.includeFileName = includeFileName;
47 this.excludeFileName = excludeFileName;
48 this.minFileSizeBytes = minFileSizeBytes;
49 this.maxFileSizeBytes = maxFileSizeBytes;
50 }
51
52 @Override
53 public boolean select(Metadata metadata) {
54 String fName = metadata.get(Metadata.RESOURCE_NAME_KEY);
55 long sz = PropsUtil.getLong(metadata.get(Metadata.CONTENT_LENGTH), -1L);
56 if (maxFileSizeBytes > -1 && sz > 0) {
57 if (sz > maxFileSizeBytes) {
58 return false;
59 }
60 }
61
62 if (minFileSizeBytes > -1 && sz > 0) {
63 if (sz < minFileSizeBytes) {
64 return false;
65 }
66 }
67
68 if (excludeFileName != null && fName != null) {
69 Matcher m = excludeFileName.matcher(fName);
70 if (m.find()) {
71 return false;
72 }
73 }
74
75 if (includeFileName != null && fName != null) {
76 Matcher m = includeFileName.matcher(fName);
77 return m.find();
78 }
79 return true;
80 }
81
82 }
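The selection logic above (size limits, then exclude pattern, then include pattern) can be mimicked with stdlib regexes alone. This is a minimal re-implementation for illustration, not the Tika class itself:

```java
import java.util.regex.Pattern;

public class SelectorSketch {
    // Minimal re-implementation of FSDocumentSelector.select() for
    // illustration; mirrors the checks in the class above.
    public static boolean select(String name, long size,
                                 Pattern include, Pattern exclude,
                                 long minBytes, long maxBytes) {
        if (maxBytes > -1 && size > 0 && size > maxBytes) return false;
        if (minBytes > -1 && size > 0 && size < minBytes) return false;
        if (exclude != null && name != null && exclude.matcher(name).find()) return false;
        if (include != null && name != null) return include.matcher(name).find();
        return true;
    }

    public static void main(String[] args) {
        Pattern include = Pattern.compile("\\.docx$");
        Pattern exclude = Pattern.compile("~\\$");  // skip Office lock files
        System.out.println(select("report.docx", 1024, include, exclude, -1, 10000));   // true
        System.out.println(select("~$report.docx", 1024, include, exclude, -1, 10000)); // false: excluded
        System.out.println(select("report.docx", 20000, include, exclude, -1, 10000));  // false: too big
    }
}
```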
0 package org.apache.tika.batch.fs;
1
2 /*
3 * Licensed to the Apache Software Foundation (ASF) under one or more
4 * contributor license agreements. See the NOTICE file distributed with
5 * this work for additional information regarding copyright ownership.
6 * The ASF licenses this file to You under the Apache License, Version 2.0
7 * (the "License"); you may not use this file except in compliance with
8 * the License. You may obtain a copy of the License at
9 *
10 * http://www.apache.org/licenses/LICENSE-2.0
11 *
12 * Unless required by applicable law or agreed to in writing, software
13 * distributed under the License is distributed on an "AS IS" BASIS,
14 * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
15 * See the License for the specific language governing permissions and
16 * limitations under the License.
17 */
18
19 import java.io.File;
20 import java.io.IOException;
21 import java.io.InputStream;
22 import java.util.Locale;
23
24 import org.apache.tika.batch.FileResource;
25 import org.apache.tika.io.TikaInputStream;
26 import org.apache.tika.metadata.Metadata;
27
28 /**
29 * FileSystem (FS) resource that wraps a file name.
30 * <p/>
31 * This class automatically sets the following keys in Metadata:
32 * <ul>
33 * <li>Metadata.RESOURCE_NAME_KEY (file name)</li>
34 * <li>Metadata.CONTENT_LENGTH</li>
35 * <li>FSProperties.FS_REL_PATH</li>
36 * <li>FileResource.FILE_EXTENSION</li>
37 * </ul>
38 */
39 public class FSFileResource implements FileResource {
40
41 private final File fullPath;
42 private final String relativePath;
43 private final Metadata metadata;
44
45 public FSFileResource(File inputRoot, File fullPath) {
46 this.fullPath = fullPath;
47 this.metadata = new Metadata();
48 //child path must actually be a child
49 assert(FSUtil.checkThisIsAncestorOfThat(inputRoot, fullPath));
50 this.relativePath = fullPath.getAbsolutePath().substring(inputRoot.getAbsolutePath().length()+1);
51
52 //need to set these now so that the filter can determine
53 //whether or not to crawl this file
54 metadata.set(Metadata.RESOURCE_NAME_KEY, fullPath.getName());
55 metadata.set(Metadata.CONTENT_LENGTH, Long.toString(fullPath.length()));
56 metadata.set(FSProperties.FS_REL_PATH, relativePath);
57 metadata.set(FileResource.FILE_EXTENSION, getExtension(fullPath));
58 }
59
60 /**
61 * Simple extension extractor that takes whatever comes after the
62 * last period in the path. It returns a lowercased version of the "extension."
63 * <p>
64 * If there is no period, it returns an empty string.
65 *
66 * @param fullPath full path from which to try to find an extension
67 * @return the lowercased extension or an empty string
68 */
69 private String getExtension(File fullPath) {
70 String p = fullPath.getName();
71 int i = p.lastIndexOf(".");
72 if (i > -1) {
73 return p.substring(i + 1).toLowerCase(Locale.ROOT);
74 }
75 return "";
76 }
77
78 /**
79 *
80 * @return file's relativePath
81 */
82 @Override
83 public String getResourceId() {
84 return relativePath;
85 }
86
87 @Override
88 public Metadata getMetadata() {
89 return metadata;
90 }
91
92 @Override
93 public InputStream openInputStream() throws IOException {
94 //no need to include Metadata because we already set the
95 //same information in the initializer
96 return TikaInputStream.get(fullPath);
97 }
98 }
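The extension logic in getExtension() is simple enough to check in isolation; a standalone sketch with a hypothetical class name:

```java
import java.util.Locale;

public class ExtensionSketch {
    // Mirrors FSFileResource.getExtension(): everything after the last '.',
    // lowercased with Locale.ROOT; empty string when there is no period.
    public static String extension(String fileName) {
        int i = fileName.lastIndexOf('.');
        return (i > -1) ? fileName.substring(i + 1).toLowerCase(Locale.ROOT) : "";
    }

    public static void main(String[] args) {
        System.out.println(extension("Report.DOCX"));    // docx
        System.out.println(extension("archive.tar.gz")); // gz
        System.out.println(extension("Makefile"));       // (empty string)
    }
}
```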
0 package org.apache.tika.batch.fs;
1 /*
2 * Licensed to the Apache Software Foundation (ASF) under one or more
3 * contributor license agreements. See the NOTICE file distributed with
4 * this work for additional information regarding copyright ownership.
5 * The ASF licenses this file to You under the Apache License, Version 2.0
6 * (the "License"); you may not use this file except in compliance with
7 * the License. You may obtain a copy of the License at
8 *
9 * http://www.apache.org/licenses/LICENSE-2.0
10 *
11 * Unless required by applicable law or agreed to in writing, software
12 * distributed under the License is distributed on an "AS IS" BASIS,
13 * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
14 * See the License for the specific language governing permissions and
15 * limitations under the License.
16 */
17
18 import org.apache.tika.batch.FileResource;
19 import org.apache.tika.batch.FileResourceCrawler;
20
21 import java.io.BufferedReader;
22 import java.io.File;
23 import java.io.FileInputStream;
24 import java.io.FileNotFoundException;
25 import java.io.IOException;
26 import java.io.InputStreamReader;
27 import java.io.UnsupportedEncodingException;
28 import java.util.concurrent.ArrayBlockingQueue;
29
30 /**
31 * Class that "crawls" a list of files.
32 */
33 public class FSListCrawler extends FileResourceCrawler {
34
35 private final BufferedReader reader;
36 private final File root;
37
38 public FSListCrawler(ArrayBlockingQueue<FileResource> fileQueue,
39 int numConsumers, File root, File list, String encoding)
40 throws FileNotFoundException, UnsupportedEncodingException {
41 super(fileQueue, numConsumers);
42 reader = new BufferedReader(new InputStreamReader(new FileInputStream(list), encoding));
43 this.root = root;
44
45 }
46
47 public void start() throws InterruptedException {
48 String line = nextLine();
49
50 while (line != null) {
51 if (Thread.currentThread().isInterrupted()) {
52 throw new InterruptedException("file adder interrupted");
53 }
54 File f = new File(root, line);
55 if (! f.exists()) {
56 logger.warn("File doesn't exist: " + f.getAbsolutePath());
57 line = nextLine();
58 continue;
59 }
60 if (f.isDirectory()) {
61 logger.warn("File is a directory: " + f.getAbsolutePath());
62 line = nextLine();
63 continue;
64 }
65 tryToAdd(new FSFileResource(root, f));
66 line = nextLine();
67 }
68 }
69
70 private String nextLine() {
71 String line = null;
72 try {
73 line = reader.readLine();
74 } catch (IOException e) {
75 throw new RuntimeException(e);
76 }
77 return line;
78 }
79 }
0 package org.apache.tika.batch.fs;
1 /*
2 * Licensed to the Apache Software Foundation (ASF) under one or more
3 * contributor license agreements. See the NOTICE file distributed with
4 * this work for additional information regarding copyright ownership.
5 * The ASF licenses this file to You under the Apache License, Version 2.0
6 * (the "License"); you may not use this file except in compliance with
7 * the License. You may obtain a copy of the License at
8 *
9 * http://www.apache.org/licenses/LICENSE-2.0
10 *
11 * Unless required by applicable law or agreed to in writing, software
12 * distributed under the License is distributed on an "AS IS" BASIS,
13 * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
14 * See the License for the specific language governing permissions and
15 * limitations under the License.
16 */
17
18 import java.io.File;
19 import java.io.FileOutputStream;
20 import java.io.IOException;
21 import java.io.OutputStream;
22 import java.util.zip.GZIPOutputStream;
23
24 import org.apache.commons.compress.archivers.zip.ZipArchiveOutputStream;
25 import org.apache.commons.compress.compressors.bzip2.BZip2CompressorOutputStream;
26 import org.apache.tika.batch.OutputStreamFactory;
27 import org.apache.tika.metadata.Metadata;
28
29 public class FSOutputStreamFactory implements OutputStreamFactory {
30
31 public enum COMPRESSION {
32 NONE,
33 BZIP2,
34 GZIP,
35 ZIP
36 }
37
38 private final FSUtil.HANDLE_EXISTING handleExisting;
39 private final File outputRoot;
40 private final String suffix;
41 private final COMPRESSION compression;
42
43 public FSOutputStreamFactory(File outputRoot, FSUtil.HANDLE_EXISTING handleExisting,
44 COMPRESSION compression, String suffix) {
45 this.handleExisting = handleExisting;
46 this.outputRoot = outputRoot.getAbsoluteFile();
47 this.suffix = suffix;
48 this.compression = compression;
49 }
50
51 /**
52 * This tries to create a file based on the {@link org.apache.tika.batch.fs.FSUtil.HANDLE_EXISTING}
53 * value that was passed in during initialization.
54 * <p>
55 * If {@link #handleExisting} is set to "SKIP" and the output file already exists,
56 * this will return null.
57 * <p>
58 * If an output file can be found, this will try to mkdirs for that output file.
59 * If mkdirs() fails, this will throw an IOException.
60 * <p>
61 * Finally, this will open an output stream for the appropriate output file.
62 * @param metadata must have a value set for FSMetadataProperties.FS_ABSOLUTE_PATH or
63 * else NullPointerException will be thrown!
64 * @return OutputStream
65 * @throws java.io.IOException, NullPointerException
66 */
67 @Override
68 public OutputStream getOutputStream(Metadata metadata) throws IOException {
69 String initialRelativePath = metadata.get(FSProperties.FS_REL_PATH);
70 File outputFile = FSUtil.getOutputFile(outputRoot, initialRelativePath, handleExisting, suffix);
71 if (outputFile == null) {
72 return null;
73 }
74 if (! outputFile.getParentFile().isDirectory()) {
75 boolean success = outputFile.getParentFile().mkdirs();
76 //with multithreading, it is possible that the parent file was created between
77 //the test and the attempt to .mkdirs(); mkdirs() returns false if the dirs already exist
78 if (! success && ! outputFile.getParentFile().isDirectory()) {
79 throw new IOException("Couldn't create parent directory for:"+outputFile.getAbsolutePath());
80 }
81 }
82
83 OutputStream os = new FileOutputStream(outputFile);
84 if (compression == COMPRESSION.BZIP2){
85 os = new BZip2CompressorOutputStream(os);
86 } else if (compression == COMPRESSION.GZIP) {
87 os = new GZIPOutputStream(os);
88 } else if (compression == COMPRESSION.ZIP) {
89 os = new ZipArchiveOutputStream(os);
90 }
91 return os;
92 }
93 }
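The compression branches above simply wrap the raw FileOutputStream, so callers write plain bytes and get compressed output. A GZIP round trip with stdlib streams shows the idea (a sketch using in-memory streams instead of files):

```java
import java.io.ByteArrayInputStream;
import java.io.ByteArrayOutputStream;
import java.io.IOException;
import java.io.InputStream;
import java.io.OutputStream;
import java.nio.charset.StandardCharsets;
import java.util.zip.GZIPInputStream;
import java.util.zip.GZIPOutputStream;

public class CompressionSketch {
    // Illustrates the GZIP branch of getOutputStream(): wrap the raw stream,
    // write uncompressed bytes, close to flush the gzip trailer.
    public static byte[] gzip(byte[] data) throws IOException {
        ByteArrayOutputStream bos = new ByteArrayOutputStream();
        try (OutputStream os = new GZIPOutputStream(bos)) {
            os.write(data);
        }
        return bos.toByteArray();
    }

    public static byte[] gunzip(byte[] data) throws IOException {
        try (InputStream is = new GZIPInputStream(new ByteArrayInputStream(data))) {
            ByteArrayOutputStream bos = new ByteArrayOutputStream();
            byte[] buf = new byte[4096];
            int n;
            while ((n = is.read(buf)) != -1) {
                bos.write(buf, 0, n);
            }
            return bos.toByteArray();
        }
    }

    public static void main(String[] args) throws IOException {
        byte[] original = "hello tika-batch".getBytes(StandardCharsets.UTF_8);
        byte[] roundTrip = gunzip(gzip(original));
        System.out.println(new String(roundTrip, StandardCharsets.UTF_8)); // hello tika-batch
    }
}
```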
0 package org.apache.tika.batch.fs;
1
2 import org.apache.tika.metadata.Property;
3
4 public class FSProperties {
5 private final static String TIKA_BATCH_FS_NAMESPACE = "tika_batch_fs";
6
7 /**
8 * File's relative path (including file name) from a given source root
9 */
10 public final static Property FS_REL_PATH = Property.internalText(TIKA_BATCH_FS_NAMESPACE+":relative_path");
11 }
0 /*
1 * Licensed to the Apache Software Foundation (ASF) under one or more
2 * contributor license agreements. See the NOTICE file distributed with
3 * this work for additional information regarding copyright ownership.
4 * The ASF licenses this file to You under the Apache License, Version 2.0
5 * (the "License"); you may not use this file except in compliance with
6 * the License. You may obtain a copy of the License at
7 *
8 * http://www.apache.org/licenses/LICENSE-2.0
9 *
10 * Unless required by applicable law or agreed to in writing, software
11 * distributed under the License is distributed on an "AS IS" BASIS,
12 * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
13 * See the License for the specific language governing permissions and
14 * limitations under the License.
15 */
16
17 package org.apache.tika.batch.fs;
18
19 import java.io.File;
20 import java.io.IOException;
21 import java.util.UUID;
22 import java.util.regex.Matcher;
23 import java.util.regex.Pattern;
24
25 /**
26 * Utility class to handle some common issues when
27 * reading from and writing to a file system (FS).
28 */
29 public class FSUtil {
30
31 public static boolean checkThisIsAncestorOfThat(File ancestor, File child) {
32 int ancLen = ancestor.getAbsolutePath().length();
33 int childLen = child.getAbsolutePath().length();
34 if (childLen <= ancLen) {
35 return false;
36 }
37
38 String childBase = child.getAbsolutePath().substring(0, ancLen);
39 return childBase.equals(ancestor.getAbsolutePath());
40
41 }
42
43 public static boolean checkThisIsAncestorOfOrSameAsThat(File ancestor, File child) {
44 if (ancestor.equals(child)) {
45 return true;
46 }
47 return checkThisIsAncestorOfThat(ancestor, child);
48 }
49
50 public enum HANDLE_EXISTING {
51 OVERWRITE,
52 RENAME,
53 SKIP
54 }
55
56 private final static Pattern FILE_NAME_PATTERN =
57 Pattern.compile("\\A(.*?)(?:\\((\\d+)\\))?\\.([^\\.]+)\\Z");
58
59 /**
60 * Given an output root and an initial relative path,
61 * return the output file according to the HANDLE_EXISTING strategy
62 * <p/>
63 * In the most basic use case, given a root directory "input",
64 * a file's relative path "dir1/dir2/fileA.docx", and an output directory
65 * "output", the output file would be "output/dir1/dir2/fileA.docx."
66 * <p/>
67 * If HANDLE_EXISTING is set to OVERWRITE, this will not check to see if the output already exists,
68 * and the returned file could overwrite an existing file.
69 * <p/>
70 * If HANDLE_EXISTING is set to RENAME, this will try to increment a counter at the end of
71 * the file name (fileA(2).docx) until there is a file name that doesn't exist.
72 * <p/>
73 * This will return null if handleExisting == HANDLE_EXISTING.SKIP and
74 * the candidate file already exists.
75 * <p/>
76 * This will throw an IOException if HANDLE_EXISTING is set to
77 * RENAME, and a candidate output file cannot be found
78 * after trying to increment the file count (e.g. fileA(2).docx) 10000 times
79 * and then after trying 20,000 UUIDs.
80 *
81 * @param outputRoot directory root for output
82 * @param initialRelativePath initial relative path (including file name, which may be renamed)
83 * @param handleExisting what to do if the output file exists
84 * @param suffix suffix to add to files, can be null
85 * @return output file or null if no output file should be created
86 * @throws java.io.IOException
87 */
88 public static File getOutputFile(File outputRoot, String initialRelativePath,
89 HANDLE_EXISTING handleExisting, String suffix) throws IOException {
90 String localSuffix = (suffix == null) ? "" : suffix;
91 File cand = new File(outputRoot, initialRelativePath+ "." +localSuffix);
92 if (cand.isFile()) {
93 if (handleExisting.equals(HANDLE_EXISTING.OVERWRITE)) {
94 return cand;
95 } else if (handleExisting.equals(HANDLE_EXISTING.SKIP)) {
96 return null;
97 }
98 }
99
100 //if we're here, the output file exists, and
101 //we must find a new name for it.
102
103 //groups for "testfile(1).txt":
104 //group(1) is "testfile"
105 //group(2) is 1
106 //group(3) is "txt"
107 //Note: group(2) can be null
108 int cnt = 0;
109 String fNameBase = null;
110 String fNameExt = "";
111 //this doesn't include the addition of the localSuffix
112 File candOnly = new File(outputRoot, initialRelativePath);
113 Matcher m = FILE_NAME_PATTERN.matcher(candOnly.getName());
114 if (m.find()) {
115 fNameBase = m.group(1);
116
117 if (m.group(2) != null) {
118 try {
119 cnt = Integer.parseInt(m.group(2));
120 } catch (NumberFormatException e) {
121 //swallow
122 }
123 }
124 if (m.group(3) != null) {
125 fNameExt = m.group(3);
126 }
127 }
128
129 File outputParent = cand.getParentFile();
130 while (fNameBase != null && cand.isFile() && ++cnt < 10000) {
131 String candFileName = fNameBase + "(" + cnt + ")." + fNameExt + localSuffix;
132 cand = new File(outputParent, candFileName);
133 }
134 //reset count to 0 and try 20000 times
135 cnt = 0;
136 while (cand.isFile() && cnt++ < 20000) {
137 UUID uid = UUID.randomUUID();
138 cand = new File(outputParent, uid.toString() + fNameExt + localSuffix);
139 }
140
141 if (cand.isFile()) {
142 throw new IOException("Couldn't find an available output file name " +
143 "after trying 10,000 numbered names and 20,000 UUIDs");
144 }
145 return cand;
146 }
147
148 }
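The group numbering described in the comments above ("testfile(1).txt" -> base name, counter, extension) can be checked directly against the same regex; a standalone sketch with a hypothetical class name:

```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;

public class RenameSketch {
    // Same regex as FSUtil.FILE_NAME_PATTERN: lazy base name, optional "(n)"
    // counter, and a final extension with no embedded periods.
    public static final Pattern FILE_NAME_PATTERN =
            Pattern.compile("\\A(.*?)(?:\\((\\d+)\\))?\\.([^\\.]+)\\Z");

    // Returns {base, counter-or-null, extension}, or null on no match.
    public static String[] groups(String fileName) {
        Matcher m = FILE_NAME_PATTERN.matcher(fileName);
        if (m.find()) {
            return new String[]{m.group(1), m.group(2), m.group(3)};
        }
        return null;
    }

    public static void main(String[] args) {
        String[] g = groups("testfile(1).txt");
        System.out.println(g[0]); // testfile
        System.out.println(g[1]); // 1
        System.out.println(g[2]); // txt
        // Without a counter, group(2) is null:
        System.out.println(groups("testfile.txt")[1]); // null
    }
}
```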
0 package org.apache.tika.batch.fs;
1
2 /*
3 * Licensed to the Apache Software Foundation (ASF) under one or more
4 * contributor license agreements. See the NOTICE file distributed with
5 * this work for additional information regarding copyright ownership.
6 * The ASF licenses this file to You under the Apache License, Version 2.0
7 * (the "License"); you may not use this file except in compliance with
8 * the License. You may obtain a copy of the License at
9 *
10 * http://www.apache.org/licenses/LICENSE-2.0
11 *
12 * Unless required by applicable law or agreed to in writing, software
13 * distributed under the License is distributed on an "AS IS" BASIS,
14 * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
15 * See the License for the specific language governing permissions and
16 * limitations under the License.
17 */
18
19 import java.io.InputStream;
20 import java.io.OutputStream;
21 import java.io.OutputStreamWriter;
22 import java.io.Writer;
23 import java.util.LinkedList;
24 import java.util.List;
25 import java.util.concurrent.ArrayBlockingQueue;
26
27 import org.apache.tika.batch.FileResource;
28 import org.apache.tika.batch.OutputStreamFactory;
29 import org.apache.tika.batch.ParserFactory;
30 import org.apache.tika.config.TikaConfig;
31 import org.apache.tika.io.IOUtils;
32 import org.apache.tika.metadata.Metadata;
33 import org.apache.tika.metadata.TikaCoreProperties;
34 import org.apache.tika.metadata.serialization.JsonMetadataList;
35 import org.apache.tika.parser.ParseContext;
36 import org.apache.tika.parser.Parser;
37 import org.apache.tika.parser.RecursiveParserWrapper;
38 import org.apache.tika.sax.ContentHandlerFactory;
39 import org.apache.tika.util.TikaExceptionFilter;
40 import org.xml.sax.helpers.DefaultHandler;
41
42 /**
43 * Basic FileResourceConsumer that reads files from an input
44 * directory and writes content to the output directory.
45 * <p/>
46 * This tries to catch most of the common exceptions, log them and
47 * store them in the metadata list output.
48 */
49 public class RecursiveParserWrapperFSConsumer extends AbstractFSConsumer {
50
51
52 private final ParserFactory parserFactory;
53 private final ContentHandlerFactory contentHandlerFactory;
54 private final OutputStreamFactory fsOSFactory;
55 private final TikaConfig tikaConfig;
56 private String outputEncoding = "UTF-8";
57 //TODO: parameterize this
58 private TikaExceptionFilter exceptionFilter = new TikaExceptionFilter();
59
60
61 public RecursiveParserWrapperFSConsumer(ArrayBlockingQueue<FileResource> queue,
62 ParserFactory parserFactory,
63 ContentHandlerFactory contentHandlerFactory,
64 OutputStreamFactory fsOSFactory, TikaConfig tikaConfig) {
65 super(queue);
66 this.parserFactory = parserFactory;
67 this.contentHandlerFactory = contentHandlerFactory;
68 this.fsOSFactory = fsOSFactory;
69 this.tikaConfig = tikaConfig;
70 }
71
72 @Override
73 public boolean processFileResource(FileResource fileResource) {
74
75 Parser wrapped = parserFactory.getParser(tikaConfig);
76 RecursiveParserWrapper parser = new RecursiveParserWrapper(wrapped, contentHandlerFactory);
77 ParseContext context = new ParseContext();
78
79 // if (parseRecursively == true) {
80 context.set(Parser.class, parser);
81 // }
82
83 //try to open outputstream first
84 OutputStream os = getOutputStream(fsOSFactory, fileResource);
85
86 if (os == null) {
87 logger.debug("Skipping: " + fileResource.getMetadata().get(FSProperties.FS_REL_PATH));
88 return false;
89 }
90
91 //try to open the inputstream before the parse.
92 //if the parse hangs or throws a nasty exception, at least there will
93 //be a zero-byte file there so that the batch runner can skip that problematic
94 //file during the next run.
95 InputStream is = getInputStream(fileResource);
96 if (is == null) {
97 IOUtils.closeQuietly(os);
98 return false;
99 }
100
101 Throwable thrown = null;
102 List<Metadata> metadataList = null;
103 Metadata containerMetadata = fileResource.getMetadata();
104 try {
105 parse(fileResource.getResourceId(), parser, is, new DefaultHandler(),
106 containerMetadata, context);
107 metadataList = parser.getMetadata();
108 } catch (Throwable t) {
109 thrown = t;
110 metadataList = parser.getMetadata();
111 if (metadataList == null) {
112 metadataList = new LinkedList<Metadata>();
113 }
114 Metadata m = null;
115 if (metadataList.size() == 0) {
116 m = containerMetadata;
117 } else {
118 //take the top metadata item
119 m = metadataList.remove(0);
120 }
121 String stackTrace = exceptionFilter.getStackTrace(t);
122 m.add(TikaCoreProperties.TIKA_META_EXCEPTION_PREFIX+"runtime", stackTrace);
123 metadataList.add(0, m);
124 } finally {
125 IOUtils.closeQuietly(is);
126 }
127
128 Writer writer = null;
129
130 try {
131 writer = new OutputStreamWriter(os, getOutputEncoding());
132 JsonMetadataList.toJson(metadataList, writer);
133 } catch (Exception e) {
134 //this is a stop the world kind of thing
135 logger.error("{}", getXMLifiedLogMsg(IO_OS+"json",
136 fileResource.getResourceId(), e));
137 throw new RuntimeException(e);
138 } finally {
139 flushAndClose(writer);
140 }
141
142 if (thrown != null) {
143 if (thrown instanceof Error) {
144 throw (Error) thrown;
145 } else {
146 return false;
147 }
148 }
149
150 return true;
151 }
152
153 public String getOutputEncoding() {
154 return outputEncoding;
155 }
156
157 public void setOutputEncoding(String outputEncoding) {
158 this.outputEncoding = outputEncoding;
159 }
160 }
0 /*
1 * Licensed to the Apache Software Foundation (ASF) under one or more
2 * contributor license agreements. See the NOTICE file distributed with
3 * this work for additional information regarding copyright ownership.
4 * The ASF licenses this file to You under the Apache License, Version 2.0
5 * (the "License"); you may not use this file except in compliance with
6 * the License. You may obtain a copy of the License at
7 *
8 * http://www.apache.org/licenses/LICENSE-2.0
9 *
10 * Unless required by applicable law or agreed to in writing, software
11 * distributed under the License is distributed on an "AS IS" BASIS,
12 * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
13 * See the License for the specific language governing permissions and
14 * limitations under the License.
15 */
16
17 package org.apache.tika.batch.fs.builders;
18
19 import java.io.File;
20 import java.util.LinkedList;
21 import java.util.List;
22 import java.util.Map;
23 import java.util.concurrent.ArrayBlockingQueue;
24
25 import org.apache.tika.batch.ConsumersManager;
26 import org.apache.tika.batch.FileResource;
27 import org.apache.tika.batch.FileResourceConsumer;
28 import org.apache.tika.batch.OutputStreamFactory;
29 import org.apache.tika.batch.ParserFactory;
30 import org.apache.tika.batch.builders.AbstractConsumersBuilder;
31 import org.apache.tika.batch.builders.BatchProcessBuilder;
32 import org.apache.tika.batch.builders.IContentHandlerFactoryBuilder;
33 import org.apache.tika.batch.fs.BasicTikaFSConsumer;
34 import org.apache.tika.batch.fs.FSConsumersManager;
35 import org.apache.tika.batch.fs.FSOutputStreamFactory;
36 import org.apache.tika.batch.fs.FSUtil;
37 import org.apache.tika.batch.fs.RecursiveParserWrapperFSConsumer;
38 import org.apache.tika.config.TikaConfig;
39 import org.apache.tika.sax.ContentHandlerFactory;
40 import org.apache.tika.util.ClassLoaderUtil;
41 import org.apache.tika.util.PropsUtil;
42 import org.apache.tika.util.XMLDOMUtil;
43 import org.w3c.dom.Node;
44 import org.w3c.dom.NodeList;
45
46 public class BasicTikaFSConsumersBuilder extends AbstractConsumersBuilder {
47
48 @Override
49 public ConsumersManager build(Node node, Map<String, String> runtimeAttributes,
50 ArrayBlockingQueue<FileResource> queue) {
51
52 //figure out if we're building a recursiveParserWrapper
53 boolean recursiveParserWrapper = false;
54 String recursiveParserWrapperString = runtimeAttributes.get("recursiveParserWrapper");
55 if (recursiveParserWrapperString != null){
56 recursiveParserWrapper = PropsUtil.getBoolean(recursiveParserWrapperString, recursiveParserWrapper);
57 } else {
58 Node recursiveParserWrapperNode = node.getAttributes().getNamedItem("recursiveParserWrapper");
59 if (recursiveParserWrapperNode != null) {
60 recursiveParserWrapper = PropsUtil.getBoolean(recursiveParserWrapperNode.getNodeValue(), recursiveParserWrapper);
61 }
62 }
63
64 //how long to let the consumersManager run on init() and shutdown()
65 Long consumersManagerMaxMillis = null;
66 String consumersManagerMaxMillisString = runtimeAttributes.get("consumersManagerMaxMillis");
67 if (consumersManagerMaxMillisString != null){
68 consumersManagerMaxMillis = PropsUtil.getLong(consumersManagerMaxMillisString, null);
69 } else {
70 Node consumersManagerMaxMillisNode = node.getAttributes().getNamedItem("consumersManagerMaxMillis");
71 if (consumersManagerMaxMillisNode != null) {
72 consumersManagerMaxMillis = PropsUtil.getLong(consumersManagerMaxMillisNode.getNodeValue(),
73 null);
74 }
75 }
76
77 TikaConfig config = null;
78 String tikaConfigPath = runtimeAttributes.get("c");
79
80 if (tikaConfigPath == null) {
81 Node tikaConfigNode = node.getAttributes().getNamedItem("tikaConfig");
82 if (tikaConfigNode != null) {
83 tikaConfigPath = PropsUtil.getString(tikaConfigNode.getNodeValue(), null);
84 }
85 }
86 if (tikaConfigPath != null) {
87 try {
88 config = new TikaConfig(new File(tikaConfigPath));
89 } catch (Exception e) {
90 throw new RuntimeException(e);
91 }
92 } else {
93 config = TikaConfig.getDefaultConfig();
94 }
95
96 List<FileResourceConsumer> consumers = new LinkedList<FileResourceConsumer>();
97 int numConsumers = BatchProcessBuilder.getNumConsumers(runtimeAttributes);
98
99 NodeList nodeList = node.getChildNodes();
100 Node contentHandlerFactoryNode = null;
101 Node parserFactoryNode = null;
102 Node outputStreamFactoryNode = null;
103
104 for (int i = 0; i < nodeList.getLength(); i++){
105 Node child = nodeList.item(i);
106 String cn = child.getNodeName();
107 if (cn.equals("parser")){
108 parserFactoryNode = child;
109 } else if (cn.equals("contenthandler")) {
110 contentHandlerFactoryNode = child;
111 } else if (cn.equals("outputstream")) {
112 outputStreamFactoryNode = child;
113 }
114 }
115
116 if (contentHandlerFactoryNode == null || parserFactoryNode == null
117 || outputStreamFactoryNode == null) {
118 throw new RuntimeException("You must specify a ContentHandlerFactory, "+
119 "a ParserFactory and an OutputStreamFactory");
120 }
121 ContentHandlerFactory contentHandlerFactory = getContentHandlerFactory(contentHandlerFactoryNode, runtimeAttributes);
122 ParserFactory parserFactory = getParserFactory(parserFactoryNode, runtimeAttributes);
123 OutputStreamFactory outputStreamFactory = getOutputStreamFactory(outputStreamFactoryNode, runtimeAttributes);
124
125 if (recursiveParserWrapper) {
126 for (int i = 0; i < numConsumers; i++) {
127 FileResourceConsumer c = new RecursiveParserWrapperFSConsumer(queue,
128 parserFactory, contentHandlerFactory, outputStreamFactory, config);
129 consumers.add(c);
130 }
131 } else {
132 for (int i = 0; i < numConsumers; i++) {
133 FileResourceConsumer c = new BasicTikaFSConsumer(queue,
134 parserFactory, contentHandlerFactory, outputStreamFactory, config);
135 consumers.add(c);
136 }
137 }
138 ConsumersManager manager = new FSConsumersManager(consumers);
139 if (consumersManagerMaxMillis != null) {
140 manager.setConsumersManagerMaxMillis(consumersManagerMaxMillis);
141 }
142 return manager;
143 }
144
145
146 private ContentHandlerFactory getContentHandlerFactory(Node node, Map<String, String> runtimeAttributes) {
147
148 Map<String, String> localAttrs = XMLDOMUtil.mapifyAttrs(node, runtimeAttributes);
149 String className = localAttrs.get("builderClass");
150 if (className == null) {
151 throw new RuntimeException("Must specify builderClass for contentHandler");
152 }
153 IContentHandlerFactoryBuilder builder = ClassLoaderUtil.buildClass(IContentHandlerFactoryBuilder.class, className);
154 return builder.build(node, runtimeAttributes);
155 }
156
157 private ParserFactory getParserFactory(Node node, Map<String, String> runtimeAttributes) {
158 //TODO: add ability to set TikaConfig file path
159 Map<String, String> localAttrs = XMLDOMUtil.mapifyAttrs(node, runtimeAttributes);
160 String className = localAttrs.get("class");
161 return ClassLoaderUtil.buildClass(ParserFactory.class, className);
162 }
163
164 private OutputStreamFactory getOutputStreamFactory(Node node, Map<String, String> runtimeAttributes) {
165 Map<String, String> attrs = XMLDOMUtil.mapifyAttrs(node, runtimeAttributes);
166
167 File outputDir = PropsUtil.getFile(attrs.get("outputDir"), null);
168 /* FSUtil.HANDLE_EXISTING handleExisting = null;
169 String handleExistingString = attrs.get("handleExisting");
170 if (handleExistingString == null) {
171 handleExistingException();
172 } else if (handleExistingString.equals("overwrite")){
173 handleExisting = FSUtil.HANDLE_EXISTING.OVERWRITE;
174 } else if (handleExistingString.equals("rename")) {
175 handleExisting = FSUtil.HANDLE_EXISTING.RENAME;
176 } else if (handleExistingString.equals("skip")) {
177 handleExisting = FSUtil.HANDLE_EXISTING.SKIP;
178 } else {
179 handleExistingException();
180 }
181 */
182 String compressionString = attrs.get("compression");
183 FSOutputStreamFactory.COMPRESSION compression = FSOutputStreamFactory.COMPRESSION.NONE;
184 if (compressionString == null) {
185 //do nothing
186 } else if (compressionString.contains("bz")) {
187 compression = FSOutputStreamFactory.COMPRESSION.BZIP2;
188 } else if (compressionString.contains("gz")) {
189 compression = FSOutputStreamFactory.COMPRESSION.GZIP;
190 } else if (compressionString.contains("zip")) {
191 compression = FSOutputStreamFactory.COMPRESSION.ZIP;
192 }
193 String suffix = attrs.get("outputSuffix");
194
195 //TODO: possibly open up the different handle existings in the future
196 //but for now, lock it down to require skip. Too dangerous otherwise
197 //if the driver restarts and this is set to overwrite...
198 return new FSOutputStreamFactory(outputDir, FSUtil.HANDLE_EXISTING.SKIP,
199 compression, suffix);
200 }
201
202 }
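The builder above resolves each setting (e.g. `recursiveParserWrapper`, `consumersManagerMaxMillis`, the `c`/`tikaConfig` path) by checking the runtime attributes first and falling back to the XML node's attribute. A minimal, standalone sketch of that precedence rule; the class name `AttrResolveSketch` and the plain-map signature are invented for this example and are not part of Tika:

```java
import java.util.HashMap;
import java.util.Map;

//Sketch of the attribute-resolution order used in the builder above:
//a runtime (commandline) attribute wins; otherwise the XML node's
//attribute is consulted; otherwise the supplied default is used.
public class AttrResolveSketch {
    public static String resolve(Map<String, String> runtimeAttrs,
                                 Map<String, String> nodeAttrs,
                                 String name, String defaultValue) {
        String v = runtimeAttrs.get(name);
        if (v == null) {
            v = nodeAttrs.get(name);
        }
        return (v == null) ? defaultValue : v;
    }

    public static void main(String[] args) {
        Map<String, String> runtime = new HashMap<String, String>();
        Map<String, String> node = new HashMap<String, String>();
        node.put("recursiveParserWrapper", "false");
        runtime.put("recursiveParserWrapper", "true");
        //runtime attribute overrides the config-file attribute
        System.out.println(resolve(runtime, node, "recursiveParserWrapper", "false"));
    }
}
```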
0 package org.apache.tika.batch.fs.builders;
1
2 /*
3 * Licensed to the Apache Software Foundation (ASF) under one or more
4 * contributor license agreements. See the NOTICE file distributed with
5 * this work for additional information regarding copyright ownership.
6 * The ASF licenses this file to You under the Apache License, Version 2.0
7 * (the "License"); you may not use this file except in compliance with
8 * the License. You may obtain a copy of the License at
9 *
10 * http://www.apache.org/licenses/LICENSE-2.0
11 *
12 * Unless required by applicable law or agreed to in writing, software
13 * distributed under the License is distributed on an "AS IS" BASIS,
14 * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
15 * See the License for the specific language governing permissions and
16 * limitations under the License.
17 */
18
19
20 import java.io.File;
21 import java.util.Locale;
22 import java.util.Map;
23 import java.util.concurrent.ArrayBlockingQueue;
24 import java.util.regex.Pattern;
25
26 import org.apache.tika.batch.FileResource;
27 import org.apache.tika.batch.FileResourceCrawler;
28 import org.apache.tika.batch.builders.BatchProcessBuilder;
29 import org.apache.tika.batch.builders.ICrawlerBuilder;
30 import org.apache.tika.batch.fs.FSDirectoryCrawler;
31 import org.apache.tika.batch.fs.FSDocumentSelector;
32 import org.apache.tika.extractor.DocumentSelector;
33 import org.apache.tika.util.PropsUtil;
34 import org.apache.tika.util.XMLDOMUtil;
35 import org.w3c.dom.Node;
36
37 /**
38 * Builds either an FSDirectoryCrawler or an FSListCrawler.
39 */
40 public class FSCrawlerBuilder implements ICrawlerBuilder {
41
42 private final static String MAX_CONSEC_WAIT_MILLIS = "maxConsecWaitMillis";
43 private final static String MAX_FILES_TO_ADD_ATTR = "maxFilesToAdd";
44 private final static String MAX_FILES_TO_CONSIDER_ATTR = "maxFilesToConsider";
45
46
47 private final static String CRAWL_ORDER = "crawlOrder";
48 private final static String INPUT_DIR_ATTR = "inputDir";
49 private final static String INPUT_START_DIR_ATTR = "startDir";
50 private final static String MAX_FILE_SIZE_BYTES_ATTR = "maxFileSizeBytes";
51 private final static String MIN_FILE_SIZE_BYTES_ATTR = "minFileSizeBytes";
52
53
54 private final static String INCLUDE_FILE_PAT_ATTR = "includeFilePat";
55 private final static String EXCLUDE_FILE_PAT_ATTR = "excludeFilePat";
56
57 @Override
58 public FileResourceCrawler build(Node node, Map<String, String> runtimeAttributes,
59 ArrayBlockingQueue<FileResource> queue) {
60
61 Map<String, String> attributes = XMLDOMUtil.mapifyAttrs(node, runtimeAttributes);
62
63 int numConsumers = BatchProcessBuilder.getNumConsumers(runtimeAttributes);
64 File inputDir = PropsUtil.getFile(attributes.get(INPUT_DIR_ATTR), new File("input"));
65 FileResourceCrawler crawler = null;
66 if (attributes.containsKey("fileList")) {
67 String randomCrawlString = attributes.get(CRAWL_ORDER);
68
69 if (randomCrawlString != null) {
70 //TODO: change to logger warn or throw RuntimeException?
71 System.err.println("crawlOrder attribute is ignored by FSListCrawler");
72 }
73 File fileList = PropsUtil.getFile(attributes.get("fileList"), null);
74 String encoding = PropsUtil.getString(attributes.get("fileListEncoding"), "UTF-8");
75 try {
76 crawler = new org.apache.tika.batch.fs.FSListCrawler(queue, numConsumers, inputDir, fileList, encoding);
77 } catch (java.io.FileNotFoundException e) {
78 throw new RuntimeException("fileList file not found for FSListCrawler: " + fileList.getAbsolutePath(), e);
79 } catch (java.io.UnsupportedEncodingException e) {
80 throw new RuntimeException("fileList encoding not supported: " + encoding, e);
81 }
81 }
82 } else {
83 FSDirectoryCrawler.CRAWL_ORDER crawlOrder = getCrawlOrder(attributes.get(CRAWL_ORDER));
84 File startDir = PropsUtil.getFile(attributes.get(INPUT_START_DIR_ATTR), null);
85 if (startDir == null) {
86 crawler = new FSDirectoryCrawler(queue, numConsumers, inputDir, crawlOrder);
87 } else {
88 crawler = new FSDirectoryCrawler(queue, numConsumers, inputDir, startDir, crawlOrder);
89 }
90 }
91
92 crawler.setMaxFilesToConsider(PropsUtil.getInt(attributes.get(MAX_FILES_TO_CONSIDER_ATTR), -1));
93 crawler.setMaxFilesToAdd(PropsUtil.getInt(attributes.get(MAX_FILES_TO_ADD_ATTR), -1));
94
95 DocumentSelector selector = buildSelector(attributes);
96 if (selector != null) {
97 crawler.setDocumentSelector(selector);
98 }
99
100 crawler.setMaxConsecWaitInMillis(PropsUtil.getLong(attributes.get(MAX_CONSEC_WAIT_MILLIS), 300000L));//5 minutes
101 return crawler;
102 }
103
104 private FSDirectoryCrawler.CRAWL_ORDER getCrawlOrder(String s) {
105 if (s == null || s.trim().length() == 0 || s.equals("os")) {
106 return FSDirectoryCrawler.CRAWL_ORDER.OS_ORDER;
107 } else if (s.toLowerCase(Locale.ROOT).contains("rand")) {
108 return FSDirectoryCrawler.CRAWL_ORDER.RANDOM;
109 } else if (s.toLowerCase(Locale.ROOT).contains("sort")) {
110 return FSDirectoryCrawler.CRAWL_ORDER.SORTED;
111 } else {
112 return FSDirectoryCrawler.CRAWL_ORDER.OS_ORDER;
113 }
114 }
115
116 private DocumentSelector buildSelector(Map<String, String> attributes) {
117 String includeString = attributes.get(INCLUDE_FILE_PAT_ATTR);
118 String excludeString = attributes.get(EXCLUDE_FILE_PAT_ATTR);
119 long maxFileSize = PropsUtil.getLong(attributes.get(MAX_FILE_SIZE_BYTES_ATTR), -1L);
120 long minFileSize = PropsUtil.getLong(attributes.get(MIN_FILE_SIZE_BYTES_ATTR), -1L);
121 Pattern includePat = (includeString != null && includeString.length() > 0) ? Pattern.compile(includeString) : null;
122 Pattern excludePat = (excludeString != null && excludeString.length() > 0) ? Pattern.compile(excludeString) : null;
123
124 return new FSDocumentSelector(includePat, excludePat, minFileSize, maxFileSize);
125 }
126
127
128 }
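`buildSelector` above turns four optional attributes (include pattern, exclude pattern, min/max size in bytes, with `-1` meaning "no bound") into a `FSDocumentSelector`. A standalone sketch of plausible selection semantics; `SelectorSketch` is an invented name and `FSDocumentSelector`'s exact matching behavior is assumed, not copied:

```java
import java.util.regex.Pattern;

//Sketch of regex + size-bound document selection as configured by
//FSCrawlerBuilder above. A -1 size bound means "unbounded", matching
//the defaults passed to PropsUtil.getLong in buildSelector.
public class SelectorSketch {
    public static boolean accept(String fileName, long sizeBytes,
                                 Pattern includePat, Pattern excludePat,
                                 long minFileSize, long maxFileSize) {
        if (maxFileSize > -1 && sizeBytes > maxFileSize) {
            return false;
        }
        if (minFileSize > -1 && sizeBytes < minFileSize) {
            return false;
        }
        if (includePat != null && !includePat.matcher(fileName).find()) {
            return false;
        }
        if (excludePat != null && excludePat.matcher(fileName).find()) {
            return false;
        }
        return true;
    }

    public static void main(String[] args) {
        Pattern include = Pattern.compile("(?i)\\.pdf$");
        System.out.println(accept("report.pdf", 1024, include, null, -1, -1)); //true
        System.out.println(accept("report.txt", 1024, include, null, -1, -1)); //false
    }
}
```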
0 package org.apache.tika.batch.fs.strawman;
1
2 /*
3 * Licensed to the Apache Software Foundation (ASF) under one or more
4 * contributor license agreements. See the NOTICE file distributed with
5 * this work for additional information regarding copyright ownership.
6 * The ASF licenses this file to You under the Apache License, Version 2.0
7 * (the "License"); you may not use this file except in compliance with
8 * the License. You may obtain a copy of the License at
9 *
10 * http://www.apache.org/licenses/LICENSE-2.0
11 *
12 * Unless required by applicable law or agreed to in writing, software
13 * distributed under the License is distributed on an "AS IS" BASIS,
14 * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
15 * See the License for the specific language governing permissions and
16 * limitations under the License.
17 */
18
19 import java.io.File;
20 import java.io.FileOutputStream;
21 import java.io.IOException;
22 import java.io.InputStream;
23 import java.io.OutputStream;
24 import java.util.ArrayList;
25 import java.util.Arrays;
26 import java.util.Date;
27 import java.util.List;
28 import java.util.concurrent.Callable;
29 import java.util.concurrent.ExecutionException;
30 import java.util.concurrent.ExecutorCompletionService;
31 import java.util.concurrent.ExecutorService;
32 import java.util.concurrent.Executors;
33 import java.util.concurrent.Future;
34 import java.util.concurrent.atomic.AtomicInteger;
35
36 import org.apache.tika.io.IOUtils;
37 import org.slf4j.Logger;
38 import org.slf4j.LoggerFactory;
39 import org.slf4j.MarkerFactory;
40
41 /**
42 * Simple driver that forks a separate tika-app process for every file in a directory.
43 *
44 * This is exceedingly robust, because each file is handled in its own process,
45 * but it is also slow.
46 *
47 * Use it to compare performance against the tika-batch fs code.
48 *
49 */
50 public class StrawManTikaAppDriver implements Callable<Integer> {
51
52 private static AtomicInteger threadCount = new AtomicInteger(0);
53 private final int totalThreads;
54 private final int threadNum;
55 private int rootLen = -1;
56 private File inputDir = null;
57 private File outputDir = null;
58 private String[] args = null;
59 private Logger logger = LoggerFactory.getLogger(StrawManTikaAppDriver.class);
60
61
62 public StrawManTikaAppDriver(File inputDir, File outputDir, int totalThreads, String[] args) {
63 rootLen = inputDir.getAbsolutePath().length()+1;
64 this.inputDir = inputDir;
65 this.outputDir = outputDir;
66 this.args = args;
67 threadNum = threadCount.getAndIncrement();
68 this.totalThreads = totalThreads;
69 }
70
71
72 private int processDirectory(File inputDir) {
73 int processed = 0;
74 if (inputDir == null || inputDir.listFiles() == null) {
75 return processed;
76 }
77 for (File f : inputDir.listFiles()) {
78 List<File> childDirs = new ArrayList<File>();
79 if (f.isDirectory()) {
80 childDirs.add(f);
81 } else {
82 processed += processFile(f);
83 }
84 for (File dir : childDirs) {
85 processed += processDirectory(dir);
86
87 }
88 }
89 return processed;
90 }
91
92 private int processFile(File f) {
93 if (totalThreads > 1) {
94 int hashCode = f.getAbsolutePath().hashCode();
95 if (Math.abs(hashCode % totalThreads) != threadNum) {
96 return 0;
97 }
98 }
99 File outputFile = new File(outputDir, f.getAbsolutePath().substring(rootLen)+".txt");
100 outputFile.getAbsoluteFile().getParentFile().mkdirs();
101 if (! outputFile.getParentFile().exists()) {
102 logger.error(MarkerFactory.getMarker("FATAL"),
103 "parent directory for " + outputFile + " was not made!");
104 throw new RuntimeException("couldn't make parent directory for " + outputFile);
105 }
106 List<String> commandLine = new ArrayList<String>();
107 for (String arg : args) {
108 commandLine.add(arg);
109 }
110 commandLine.add("-t");
111 commandLine.add(f.getAbsolutePath()); //ProcessBuilder passes args verbatim; no shell quoting needed
112 ProcessBuilder builder = new ProcessBuilder(commandLine.toArray(new String[commandLine.size()])).redirectErrorStream(true); //merge stderr so the single gobbler can't deadlock on a full stderr buffer
113 logger.info("about to process: "+f.getAbsolutePath());
114 Process proc = null;
115 RedirectGobbler gobbler = null;
116 Thread gobblerThread = null;
117 try {
118 OutputStream os = new FileOutputStream(outputFile);
119 proc = builder.start();
120 gobbler = new RedirectGobbler(proc.getInputStream(), os);
121 gobblerThread = new Thread(gobbler);
122 gobblerThread.start();
123 } catch (IOException e) {
124 logger.error(e.getMessage());
125 return 0;
126 }
127
128 boolean finished = false;
129 long totalTime = 180000;//3 minutes
130 long pulse = 100;
131 for (int i = 0; i < totalTime; i += pulse) {
132 try {
133 Thread.sleep(pulse);
134 } catch (InterruptedException e) {
135 //swallow
136 }
137 try {
138 proc.exitValue(); //throws IllegalThreadStateException while the process is still running
139 finished = true;
140 break;
141 } catch (IllegalThreadStateException e) {
142 //swallow
143 }
144 }
145 if (!finished) {
146 logger.warn("Had to kill process working on: " + f.getAbsolutePath());
147 proc.destroy();
148 }
149 gobbler.close();
150 gobblerThread.interrupt();
151 return 1;
152 }
153
154
155 @Override
156 public Integer call() throws Exception {
157 long start = new Date().getTime();
158
159 int processed = processDirectory(inputDir);
160 double elapsedSecs = ((double)new Date().getTime()-(double)start)/(double)1000;
161 logger.info("Finished processing " + processed + " files in " + elapsedSecs + " seconds.");
162 return processed;
163 }
164
165 private class RedirectGobbler implements Runnable {
166 private OutputStream redirectOs = null;
167 private InputStream redirectIs = null;
168
169 private RedirectGobbler(InputStream is, OutputStream os) {
170 this.redirectIs = is;
171 this.redirectOs = os;
172 }
173
174 private void close() {
175 if (redirectOs != null) {
176 try {
177 redirectOs.flush();
178 } catch (IOException e) {
179 logger.error("can't flush");
180 }
181 try {
182 redirectIs.close();
183 } catch (IOException e) {
184 logger.error("can't close input in redirect gobbler");
185 }
186 try {
187 redirectOs.close();
188 } catch (IOException e) {
189 logger.error("can't close output in redirect gobbler");
190 }
191 }
192 }
193
194 @Override
195 public void run() {
196 try {
197 IOUtils.copy(redirectIs, redirectOs);
198 } catch (IOException e) {
199 logger.error("IOException while gobbling");
200 }
201 }
202 }
203
204 public static String usage() {
205 StringBuilder sb = new StringBuilder();
206 sb.append("Example usage:\n");
207 sb.append("java -cp <CP> org.apache.tika.batch.fs.strawman.StrawManTikaAppDriver ");
208 sb.append("<inputDir> <outputDir> <numThreads> ");
209 sb.append("java -jar tika-app-X.X.jar <...commandline arguments for tika-app>\n\n");
210 return sb.toString();
211 }
212
213 public static void main(String[] args) {
214 long start = new Date().getTime();
215 if (args.length < 6) {
216 System.err.println(StrawManTikaAppDriver.usage()); System.exit(1);
217 }
218 File inputDir = new File(args[0]);
219 File outputDir = new File(args[1]);
220 int totalThreads = Integer.parseInt(args[2]);
221
222 List<String> commandLine = new ArrayList<String>();
223 commandLine.addAll(Arrays.asList(args).subList(3, args.length));
224 totalThreads = (totalThreads < 1) ? 1 : totalThreads;
225 ExecutorService ex = Executors.newFixedThreadPool(totalThreads);
226 ExecutorCompletionService<Integer> completionService =
227 new ExecutorCompletionService<Integer>(ex);
228
229 for (int i = 0; i < totalThreads; i++) {
230 StrawManTikaAppDriver driver =
231 new StrawManTikaAppDriver(inputDir, outputDir, totalThreads, commandLine.toArray(new String[commandLine.size()]));
232 completionService.submit(driver);
233 }
234
235 int totalFilesProcessed = 0;
236 for (int i = 0; i < totalThreads; i++) {
237 try {
238 Future<Integer> future = completionService.take();
239 if (future != null) {
240 totalFilesProcessed += future.get();
241 }
242 } catch (InterruptedException e) {
243 e.printStackTrace();
244 } catch (ExecutionException e) {
245 e.printStackTrace();
246 }
247 }
248 double elapsedSeconds = (double)(new Date().getTime()-start)/(double)1000;
249 System.out.println("Processed "+totalFilesProcessed + " in " + elapsedSeconds + " seconds");
250 }
251 }
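In `processFile` above, the driver runs several instances concurrently without any shared queue: each worker claims a file only when `Math.abs(path.hashCode() % totalThreads)` equals its own thread number, so every file is handled by exactly one worker. A standalone sketch of that partitioning rule; `PartitionSketch` and its method name are invented for this example:

```java
//Sketch of the hash-based partitioning used in StrawManTikaAppDriver:
//each path is claimed by exactly one of totalThreads workers, with no
//coordination between them.
public class PartitionSketch {
    public static boolean claims(String path, int threadNum, int totalThreads) {
        if (totalThreads <= 1) {
            return true;
        }
        //hashCode % totalThreads is in (-totalThreads, totalThreads);
        //Math.abs maps it into [0, totalThreads)
        return Math.abs(path.hashCode() % totalThreads) == threadNum;
    }

    public static void main(String[] args) {
        String[] paths = {"/in/a.pdf", "/in/b.doc", "/in/c.xls"};
        int totalThreads = 2;
        for (String p : paths) {
            for (int t = 0; t < totalThreads; t++) {
                if (claims(p, t, totalThreads)) {
                    System.out.println(p + " -> worker " + t);
                }
            }
        }
    }
}
```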
0 package org.apache.tika.util;
1
2 /*
3 * Licensed to the Apache Software Foundation (ASF) under one or more
4 * contributor license agreements. See the NOTICE file distributed with
5 * this work for additional information regarding copyright ownership.
6 * The ASF licenses this file to You under the Apache License, Version 2.0
7 * (the "License"); you may not use this file except in compliance with
8 * the License. You may obtain a copy of the License at
9 *
10 * http://www.apache.org/licenses/LICENSE-2.0
11 *
12 * Unless required by applicable law or agreed to in writing, software
13 * distributed under the License is distributed on an "AS IS" BASIS,
14 * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
15 * See the License for the specific language governing permissions and
16 * limitations under the License.
17 */
18 public class ClassLoaderUtil {
19
20 @SuppressWarnings("unchecked")
21 public static <T> T buildClass(Class<T> iface, String className) {
22
23 ClassLoader loader = ClassLoader.getSystemClassLoader();
24 Class<?> clazz;
25 try {
26 clazz = loader.loadClass(className);
27 if (iface.isAssignableFrom(clazz)) {
28 return (T) clazz.newInstance();
29 }
30 throw new IllegalArgumentException(iface.toString() + " is not assignable from " + className);
31 } catch (ClassNotFoundException e) {
32 throw new RuntimeException(e);
33 } catch (InstantiationException e) {
34 throw new RuntimeException(e);
35 } catch (IllegalAccessException e) {
36 throw new RuntimeException(e);
37 }
38
39 }
40 }
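The load-then-check pattern in `ClassLoaderUtil.buildClass` (load by name, verify assignability to the requested interface, instantiate) can be exercised standalone with a JDK class so no Tika classes are needed; `BuildClassSketch` is an invented name for this example:

```java
import java.util.List;

//Sketch of the load-and-check reflection pattern used by ClassLoaderUtil,
//demonstrated with java.util.ArrayList so the example is self-contained.
public class BuildClassSketch {
    @SuppressWarnings("unchecked")
    public static <T> T buildClass(Class<T> iface, String className) {
        try {
            Class<?> clazz = ClassLoader.getSystemClassLoader().loadClass(className);
            if (!iface.isAssignableFrom(clazz)) {
                throw new IllegalArgumentException(iface + " is not assignable from " + className);
            }
            return (T) clazz.newInstance();
        } catch (ReflectiveOperationException e) {
            throw new RuntimeException(e);
        }
    }

    public static void main(String[] args) {
        Object o = buildClass(List.class, "java.util.ArrayList");
        System.out.println(o.getClass().getName()); //prints java.util.ArrayList
    }
}
```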
0 package org.apache.tika.util;
1
2 /*
3 * Licensed to the Apache Software Foundation (ASF) under one or more
4 * contributor license agreements. See the NOTICE file distributed with
5 * this work for additional information regarding copyright ownership.
6 * The ASF licenses this file to You under the Apache License, Version 2.0
7 * (the "License"); you may not use this file except in compliance with
8 * the License. You may obtain a copy of the License at
9 *
10 * http://www.apache.org/licenses/LICENSE-2.0
11 *
12 * Unless required by applicable law or agreed to in writing, software
13 * distributed under the License is distributed on an "AS IS" BASIS,
14 * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
15 * See the License for the specific language governing permissions and
16 * limitations under the License.
17 */
18
19 /**
20 * Functionality and naming conventions (roughly) copied from org.apache.commons.lang3
21 * so that we didn't have to add another dependency.
22 */
23 public class DurationFormatUtils {
24
25 public static String formatMillis(long duration) {
26 duration = Math.abs(duration);
27 StringBuilder sb = new StringBuilder();
28 int secs = (int) (duration / 1000) % 60;
29 int mins = (int) ((duration / (1000 * 60)) % 60);
30 int hrs = (int) ((duration / (1000 * 60 * 60)) % 24);
31 int days = (int) (duration / (1000 * 60 * 60 * 24)); //report total days; there is no weeks unit
32
33 //sb.append(millis + " milliseconds");
34 addUnitString(sb, days, "day");
35 addUnitString(sb, hrs, "hour");
36 addUnitString(sb, mins, "minute");
37 addUnitString(sb, secs, "second");
38 if (duration < 1000) {
39 addUnitString(sb, duration, "millisecond");
40 }
41
42 return sb.toString();
43 }
44
45 private static void addUnitString(StringBuilder sb, long unit, String unitString) {
46 //only add unit if >= 1
47 if (unit == 1) {
48 addComma(sb);
49 sb.append("1 ");
50 sb.append(unitString);
51 } else if (unit > 1) {
52 addComma(sb);
53 sb.append(unit);
54 sb.append(" ");
55 sb.append(unitString);
56 sb.append("s");
57 }
58 }
59
60 private static void addComma(StringBuilder sb) {
61 if (sb.length() > 0) {
62 sb.append(", ");
63 }
64 }
65 }
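The unit breakdown above (mod-60 seconds and minutes, mod-24 hours, whole days, with milliseconds only for sub-second durations) can be checked with a standalone sketch; `DurationFormatSketch` is an invented name and its `addUnitString` merges the two branches of the original into one, with identical output:

```java
//Standalone sketch of the duration formatting logic in DurationFormatUtils.
public class DurationFormatSketch {
    public static String formatMillis(long duration) {
        duration = Math.abs(duration);
        StringBuilder sb = new StringBuilder();
        int secs = (int) (duration / 1000) % 60;
        int mins = (int) ((duration / (1000 * 60)) % 60);
        int hrs = (int) ((duration / (1000 * 60 * 60)) % 24);
        int days = (int) (duration / (1000 * 60 * 60 * 24));
        addUnitString(sb, days, "day");
        addUnitString(sb, hrs, "hour");
        addUnitString(sb, mins, "minute");
        addUnitString(sb, secs, "second");
        if (duration < 1000) {
            addUnitString(sb, duration, "millisecond");
        }
        return sb.toString();
    }

    //only add the unit if >= 1; pluralize if > 1
    private static void addUnitString(StringBuilder sb, long unit, String unitString) {
        if (unit < 1) {
            return;
        }
        if (sb.length() > 0) {
            sb.append(", ");
        }
        sb.append(unit).append(" ").append(unitString);
        if (unit > 1) {
            sb.append("s");
        }
    }

    public static void main(String[] args) {
        //1 day + 2 hours + 3 minutes + 4 seconds = 93,784,000 ms
        System.out.println(formatMillis(93784000L));
    }
}
```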
0 package org.apache.tika.util;
1
2 /*
3 * Licensed to the Apache Software Foundation (ASF) under one or more
4 * contributor license agreements. See the NOTICE file distributed with
5 * this work for additional information regarding copyright ownership.
6 * The ASF licenses this file to You under the Apache License, Version 2.0
7 * (the "License"); you may not use this file except in compliance with
8 * the License. You may obtain a copy of the License at
9 *
10 * http://www.apache.org/licenses/LICENSE-2.0
11 *
12 * Unless required by applicable law or agreed to in writing, software
13 * distributed under the License is distributed on an "AS IS" BASIS,
14 * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
15 * See the License for the specific language governing permissions and
16 * limitations under the License.
17 */
18
19 import java.io.File;
20 import java.util.Locale;
21
22 /**
23 * Utility class to handle properties. If the value is null,
24 * or if there is a parser error, the defaultMissing value will be returned.
25 */
26 public class PropsUtil {
27
28 /**
29 * Parses v. If there is a problem, this returns defaultMissing.
30 *
31 * @param v string to parse
32 * @param defaultMissing value to return if value is null or unparseable
33 * @return parsed value
34 */
35 public static Boolean getBoolean(String v, Boolean defaultMissing) {
36 if (v == null || v.length() == 0) {
37 return defaultMissing;
38 }
39 if (v.toLowerCase(Locale.ROOT).equals("true")) {
40 return true;
41 }
42 if (v.toLowerCase(Locale.ROOT).equals("false")) {
43 return false;
44 }
45 return defaultMissing;
46 }
47
48 /**
49 * Parses v. If there is a problem, this returns defaultMissing.
50 *
51 * @param v string to parse
52 * @param defaultMissing value to return if value is null or unparseable
53 * @return parsed value
54 */
55 public static Integer getInt(String v, Integer defaultMissing) {
56 if (v == null || v.length() == 0) {
57 return defaultMissing;
58 }
59 try {
60 return Integer.parseInt(v);
61 } catch (NumberFormatException e) {
62 //NO OP
63 }
64 return defaultMissing;
65 }
66
67 /**
68 * Parses v. If there is a problem, this returns defaultMissing.
69 *
70 * @param v string to parse
71 * @param defaultMissing value to return if value is null or unparseable
72 * @return parsed value
73 */
74 public static Long getLong(String v, Long defaultMissing) {
75 if (v == null || v.length() == 0) {
76 return defaultMissing;
77 }
78 try {
79 return Long.parseLong(v);
80 } catch (NumberFormatException e) {
81 //swallow
82 }
83 return defaultMissing;
84 }
85
86
87 /**
88 * Parses v. If there is a problem, this returns defaultMissing.
89 *
90 * @param v string to parse
91 * @param defaultMissing value to return if value is null or unparseable
92 * @return parsed value
93 */
94 public static File getFile(String v, File defaultMissing) {
95 if (v == null || v.length() == 0) {
96 return defaultMissing;
97 }
98 //trim initial and final " if they exist
99 if (v.startsWith("\"")) {
100 v = v.substring(1);
101 }
102 if (v.endsWith("\"")) {
103 v = v.substring(0, v.length()-1);
104 }
105
106 return new File(v);
107 }
108
109 /**
110 * Parses v. If v is null, this returns defaultMissing.
111 *
112 * @param v string to parse
113 * @param defaultMissing value to return if value is null
114 * @return parsed value
115 */
116 public static String getString(String v, String defaultMissing) {
117 if (v == null) {
118 return defaultMissing;
119 }
120 return v;
121 }
122 }
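The null-safe, default-returning parsing contract described in the class javadoc above can be illustrated with a standalone sketch of two of the methods; `PropsSketch` is an invented name for this example:

```java
import java.util.Locale;

//Sketch of PropsUtil's contract: a null/empty value or a parse failure
//returns the supplied default instead of throwing.
public class PropsSketch {
    public static Long getLong(String v, Long defaultMissing) {
        if (v == null || v.length() == 0) {
            return defaultMissing;
        }
        try {
            return Long.parseLong(v);
        } catch (NumberFormatException e) {
            //fall through to the default
        }
        return defaultMissing;
    }

    public static Boolean getBoolean(String v, Boolean defaultMissing) {
        if (v == null || v.length() == 0) {
            return defaultMissing;
        }
        String lc = v.toLowerCase(Locale.ROOT);
        if (lc.equals("true")) {
            return Boolean.TRUE;
        }
        if (lc.equals("false")) {
            return Boolean.FALSE;
        }
        return defaultMissing;
    }

    public static void main(String[] args) {
        System.out.println(getLong("300000", null));      //300000
        System.out.println(getLong("not-a-number", -1L)); //-1
        System.out.println(getBoolean("TRUE", false));    //true
    }
}
```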
0 package org.apache.tika.util;
1 /*
2 * Licensed to the Apache Software Foundation (ASF) under one or more
3 * contributor license agreements. See the NOTICE file distributed with
4 * this work for additional information regarding copyright ownership.
5 * The ASF licenses this file to You under the Apache License, Version 2.0
6 * (the "License"); you may not use this file except in compliance with
7 * the License. You may obtain a copy of the License at
8 *
9 * http://www.apache.org/licenses/LICENSE-2.0
10 *
11 * Unless required by applicable law or agreed to in writing, software
12 * distributed under the License is distributed on an "AS IS" BASIS,
13 * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
14 * See the License for the specific language governing permissions and
15 * limitations under the License.
16 */
17
18 import java.io.PrintWriter;
19 import java.io.StringWriter;
20
21 import org.apache.tika.exception.TikaException;
22
23 /**
24 * Unwrap TikaExceptions and other wrappers that we might not care about
25 * in downstream analysis. This is similar to
26 * what tika-server does when returning stack traces.
27 */
28 public class TikaExceptionFilter {
29
30 /**
31 * Unwrap TikaExceptions and other wrappers that users might not
32 * care about in downstream analysis.
33 *
34 * @param t throwable to filter
35 * @return filtered throwable
36 */
37 public Throwable filter(Throwable t) {
38 if (t instanceof TikaException) {
39 Throwable cause = t.getCause();
40 if (cause != null) {
41 return cause;
42 }
43 }
44 return t;
45 }
46
47 /**
48 * This calls {@link #filter} and then prints the filtered
49 * <code>Throwable</code> to a <code>String</code>.
50 *
51 * @param t throwable
52 * @return a filtered version of the StackTrace
53 */
54 public String getStackTrace(Throwable t) {
55 Throwable filtered = filter(t);
56 StringWriter stringWriter = new StringWriter();
57 PrintWriter w = new PrintWriter(stringWriter);
58 filtered.printStackTrace(w);
59 w.flush();
60 stringWriter.flush();
61 return stringWriter.toString();
62 }
63 }
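The unwrap-and-print pattern above can be exercised standalone by substituting `RuntimeException` for `TikaException`, so the example compiles without Tika on the classpath; `ExceptionFilterSketch` is an invented name and the substitution changes which wrapper is unwrapped, not the mechanism:

```java
import java.io.PrintWriter;
import java.io.StringWriter;

//Sketch of TikaExceptionFilter's unwrap-and-print pattern, with
//RuntimeException standing in for TikaException.
public class ExceptionFilterSketch {
    //return the cause if the throwable is just a wrapper, else the throwable
    public static Throwable filter(Throwable t) {
        if (t instanceof RuntimeException && t.getCause() != null) {
            return t.getCause();
        }
        return t;
    }

    //print the filtered throwable's stack trace to a String
    public static String getStackTrace(Throwable t) {
        StringWriter stringWriter = new StringWriter();
        PrintWriter w = new PrintWriter(stringWriter);
        filter(t).printStackTrace(w);
        w.flush();
        return stringWriter.toString();
    }

    public static void main(String[] args) {
        Throwable wrapped = new RuntimeException(new IllegalStateException("root cause"));
        //the reported trace starts at the cause, not the wrapper
        System.out.println(getStackTrace(wrapped).split("\n")[0]);
    }
}
```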
0 package org.apache.tika.util;
1
2 /*
3 * Licensed to the Apache Software Foundation (ASF) under one or more
4 * contributor license agreements. See the NOTICE file distributed with
5 * this work for additional information regarding copyright ownership.
6 * The ASF licenses this file to You under the Apache License, Version 2.0
7 * (the "License"); you may not use this file except in compliance with
8 * the License. You may obtain a copy of the License at
9 *
10 * http://www.apache.org/licenses/LICENSE-2.0
11 *
12 * Unless required by applicable law or agreed to in writing, software
13 * distributed under the License is distributed on an "AS IS" BASIS,
14 * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
15 * See the License for the specific language governing permissions and
16 * limitations under the License.
17 */
18
19 import java.util.HashMap;
20 import java.util.Map;
21
22 import org.w3c.dom.NamedNodeMap;
23 import org.w3c.dom.Node;
24
25 public class XMLDOMUtil {
26
27 /**
28 * This grabs the attributes from a dom node and overwrites those values with those
29 * specified by the overwrite map.
30 *
31 * @param node node for building
32 * @param overwrite map of attributes to overwrite
33 * @return map of attributes
34 */
35 public static Map<String, String> mapifyAttrs(Node node, Map<String, String> overwrite) {
36 Map<String, String> map = new HashMap<String, String>();
37 NamedNodeMap nnMap = node.getAttributes();
38 for (int i = 0; i < nnMap.getLength(); i++) {
39 Node attr = nnMap.item(i);
40 map.put(attr.getNodeName(), attr.getNodeValue());
41 }
42 if (overwrite != null) {
43 for (Map.Entry<String, String> e : overwrite.entrySet()) {
44 map.put(e.getKey(), e.getValue());
45 }
46 }
47 return map;
48 }
49
50
51     /**
52      * Get an int value. Try the runtime attributes first and then fall back to
53      * the document element. Throw a RuntimeException if the attribute is not
54      * found or if the value is not parseable as an int.
55      *
56      * @param attrName attribute name to find
57      * @param runtimeAttributes runtime attributes
58      * @param docElement element that should carry the specified attribute
59      * @return specified int value
60      */
61 public static int getInt(String attrName, Map<String, String> runtimeAttributes, Node docElement) {
62 String stringValue = getStringValue(attrName, runtimeAttributes, docElement);
63 if (stringValue != null) {
64 try {
65 return Integer.parseInt(stringValue);
66 } catch (NumberFormatException e) {
67                 //swallow; fall through to the RuntimeException below
68 }
69 }
70 throw new RuntimeException("Need to specify a parseable int value in -- "
71 +attrName+" -- in commandline or in config file!");
72 }
73
74
75     /**
76      * Get a long value. Try the runtime attributes first and then fall back to
77      * the document element. Throw a RuntimeException if the attribute is not
78      * found or if the value is not parseable as a long.
79      *
80      * @param attrName attribute name to find
81      * @param runtimeAttributes runtime attributes
82      * @param docElement element that should carry the specified attribute
83      * @return specified long value
84      */
85 public static long getLong(String attrName, Map<String, String> runtimeAttributes, Node docElement) {
86 String stringValue = getStringValue(attrName, runtimeAttributes, docElement);
87 if (stringValue != null) {
88 try {
89 return Long.parseLong(stringValue);
90 } catch (NumberFormatException e) {
91                 //swallow; fall through to the RuntimeException below
92 }
93 }
94 throw new RuntimeException("Need to specify a \"long\" value in -- "
95 +attrName+" -- in commandline or in config file!");
96 }
97
98 private static String getStringValue(String attrName, Map<String, String> runtimeAttributes, Node docElement) {
99 String stringValue = runtimeAttributes.get(attrName);
100 if (stringValue == null) {
101 Node staleNode = docElement.getAttributes().getNamedItem(attrName);
102 if (staleNode != null) {
103 stringValue = staleNode.getNodeValue();
104 }
105 }
106 return stringValue;
107 }
108 }
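The attribute-merging behavior of mapifyAttrs above can be sketched standalone with the JDK's DOM API. This is a minimal illustration, not part of Tika: the XML snippet, class name, and inlined copy of the merge logic are all hypothetical stand-ins.

```java
import java.io.ByteArrayInputStream;
import java.nio.charset.StandardCharsets;
import java.util.HashMap;
import java.util.Map;
import javax.xml.parsers.DocumentBuilderFactory;
import org.w3c.dom.NamedNodeMap;
import org.w3c.dom.Node;

public class MapifyAttrsDemo {
    // Mirrors the logic of XMLDOMUtil.mapifyAttrs: copy a node's
    // attributes into a map, then apply any overrides on top.
    static Map<String, String> mapifyAttrs(Node node, Map<String, String> overwrite) {
        Map<String, String> map = new HashMap<String, String>();
        NamedNodeMap nnMap = node.getAttributes();
        for (int i = 0; i < nnMap.getLength(); i++) {
            Node attr = nnMap.item(i);
            map.put(attr.getNodeName(), attr.getNodeValue());
        }
        if (overwrite != null) {
            map.putAll(overwrite);
        }
        return map;
    }

    public static void main(String[] args) throws Exception {
        String xml = "<crawler crawlOrder=\"os\" maxFileSizeBytes=\"-1\"/>";
        Node root = DocumentBuilderFactory.newInstance().newDocumentBuilder()
                .parse(new ByteArrayInputStream(xml.getBytes(StandardCharsets.UTF_8)))
                .getDocumentElement();
        // runtime attributes win over values read from the document element
        Map<String, String> overwrite = new HashMap<String, String>();
        overwrite.put("crawlOrder", "random");
        Map<String, String> merged = mapifyAttrs(root, overwrite);
        System.out.println(merged.get("crawlOrder") + " " + merged.get("maxFileSizeBytes"));
    }
}
```

The same precedence (runtime value first, document element as fallback) is what getInt and getLong rely on via getStringValue.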
0 <!--
1 Licensed to the Apache Software Foundation (ASF) under one or more
2 contributor license agreements. See the NOTICE file distributed with
3 this work for additional information regarding copyright ownership.
4 The ASF licenses this file to You under the Apache License, Version 2.0
5 (the "License"); you may not use this file except in compliance with
6 the License. You may obtain a copy of the License at
7
8 http://www.apache.org/licenses/LICENSE-2.0
9
10 Unless required by applicable law or agreed to in writing, software
11 distributed under the License is distributed on an "AS IS" BASIS,
12 WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
13 See the License for the specific language governing permissions and
14 limitations under the License.
15 -->
16 <!DOCTYPE html>
17 <html>
18 <head lang="en">
19 <meta charset="UTF-8">
20 <title>Tika Batch Module</title>
21 </head>
22 <body>
23
24 <h1>The Batch Module for Apache Tika</h1>
25
26 <p>
27     The batch module is new to Tika in 1.8. The goal is to enable robust,
28     extensible batch processing with logging.
29 </p>
30 <p>
31 This module currently enables file system directory to directory processing.
32 To build out other interfaces, follow the example of BasicTikaFSConsumer and
33 extend FileResourceConsumer.
34 </p>
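<p>As a rough sketch of the producer/consumer shape involved (plain JDK classes only; the names, queue contents, and poison-pill convention below are illustrative, not Tika's API):</p>

```java
import java.util.concurrent.ArrayBlockingQueue;

public class ConsumerSketch {
    public static void main(String[] args) throws Exception {
        // crawler side: queue file resources, then a poison pill to signal completion
        ArrayBlockingQueue<String> queue = new ArrayBlockingQueue<String>(10);
        String POISON = "";
        queue.put("a.pdf");
        queue.put("b.doc");
        queue.put(POISON);

        // consumer side: take resources until the poison pill arrives
        int processed = 0;
        while (true) {
            String resource = queue.take();
            if (resource == POISON) { // same sentinel object, so == is intentional
                break;
            }
            processed++; // a real consumer would parse the file here
        }
        System.out.println("processed " + processed);
    }
}
```

<p>A real FileResourceConsumer subclass additionally handles timeouts, logging, and output; see BasicTikaFSConsumer in this module.</p>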
35 <p>
36     <b>NOTE: This package is new and experimental and is subject to change suddenly in the next release.</b>
37 </p>
38
39 </body>
40 </html>
0 <?xml version="1.0" encoding="UTF-8" standalone="no" ?>
1
2 <!--
3 Licensed to the Apache Software Foundation (ASF) under one
4 or more contributor license agreements. See the NOTICE file
5 distributed with this work for additional information
6 regarding copyright ownership. The ASF licenses this file
7 to you under the Apache License, Version 2.0 (the
8 "License"); you may not use this file except in compliance
9 with the License. You may obtain a copy of the License at
10
11 http://www.apache.org/licenses/LICENSE-2.0
12
13 Unless required by applicable law or agreed to in writing,
14 software distributed under the License is distributed on an
15 "AS IS" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY
16 KIND, either express or implied. See the License for the
17 specific language governing permissions and limitations
18 under the License.
19 -->
20 <!-- NOTE: tika-batch is still an experimental feature.
21     The configuration file format will likely change, in ways that are not
22     backward compatible, in new versions of Tika. Please stay tuned.
23 -->
24
25 <tika-batch-config
26 maxAliveTimeSeconds="-1"
27 pauseOnEarlyTerminationMillis="10000"
28 timeoutThresholdMillis="300000"
29 timeoutCheckPulseMillis="1000"
30 maxQueueSize="10000"
31 numConsumers="default"> <!-- numConsumers = number of file consumers, "default" = number of processors -1 -->
32
33 <!-- options to allow on the commandline -->
34 <commandline>
35 <option opt="c" longOpt="tika-config" hasArg="true"
36 description="TikaConfig file"/>
37 <option opt="bc" longOpt="batch-config" hasArg="true"
38 description="xml batch config file"/>
39         <!-- Sorted order was added for testing; random order for performance.
40             When crawling a directory is slow, it can be beneficial to
41             crawl in random order so that the parsers are triggered earlier. The
42             default is the operating system's order ("os"), meaning whatever order
43             the OS returns files in File.listFiles(). -->
44 <option opt="crawlOrder" hasArg="true"
45 description="how does the crawler sort the directories and files:
46 (random|sorted|os)"/>
47 <option opt="numConsumers" hasArg="true"
48 description="number of fileConsumers threads"/>
49 <option opt="maxFileSizeBytes" hasArg="true"
50 description="maximum file size to process; do not process files larger than this"/>
51 <option opt="maxQueueSize" hasArg="true"
52 description="maximum queue size for FileResources"/>
53 <option opt="fileList" hasArg="true"
54 description="file that contains a list of files (relative to inputDir) to process"/>
55 <option opt="fileListEncoding" hasArg="true"
56 description="encoding for fileList"/>
57 <option opt="inputDir" hasArg="true"
58 description="root directory for the files to be processed"/>
59 <option opt="startDir" hasArg="true"
60 description="directory (under inputDir) at which to start crawling"/>
61 <option opt="outputDir" hasArg="true"
62                description="root directory for output files"/> <!-- do we want to make this mandatory? -->
63 <option opt="recursiveParserWrapper"
64 description="use the RecursiveParserWrapper or not (default = false)"/>
65 <option opt="handleExisting" hasArg="true"
66 description="if an output file already exists, do you want to: overwrite, rename or skip"/>
67 <option opt="basicHandlerType" hasArg="true"
68 description="what type of content handler: xml, text, html, body"/>
69 <option opt="outputSuffix" hasArg="true"
70 description="suffix to add to the end of the output file name"/>
71 <option opt="timeoutThresholdMillis" hasArg="true"
72 description="how long to wait before determining that a consumer is stale"/>
73 <option opt="includeFilePat" hasArg="true"
74 description="regex that specifies which files to process"/>
75 <option opt="excludeFilePat" hasArg="true"
76 description="regex that specifies which files to avoid processing"/>
77 <option opt="reporterSleepMillis" hasArg="true"
78                description="milliseconds between reports by the reporter"/>
79 </commandline>
80
81
82 <!-- can specify inputDir="input", but the default config should not include this -->
83 <!-- can also specify startDir="input/someDir" to specify which child directory
84 to start processing -->
85 <crawler builderClass="org.apache.tika.batch.fs.builders.FSCrawlerBuilder"
86 crawlOrder="random"
87 maxFilesToAdd="-1"
88 maxFilesToConsider="-1"
89 includeFilePat=""
90 excludeFilePat=""
91 maxFileSizeBytes="-1"
92 />
93 <!--
94 This is an example of a crawler that reads a list of files to be processed from a
95 file. This assumes that the files in the list are relative to inputDir.
96 <crawler class="org.apache.tika.batch.fs.builders.FSCrawlerBuilder"
97 fileList="files.txt"
98 fileListEncoding="UTF-8"
99 maxFilesToAdd="-1"
100 maxFilesToConsider="-1"
101 includeFilePat="(?i).pdf$"
102 excludeFilePat="(?i).msg$"
103 maxFileSizeBytes="-1"
104 inputDir="input"
105 />
106 -->
107 <consumers builderClass="org.apache.tika.batch.fs.builders.BasicTikaFSConsumersBuilder"
108 recursiveParserWrapper="false" consumersManagerMaxMillis="60000">
109 <parser class="org.apache.tika.batch.AutoDetectParserFactory" parseRecursively="true"/>
110 <contenthandler builderClass="org.apache.tika.batch.builders.DefaultContentHandlerFactoryBuilder"
111 basicHandlerType="xml" writeLimit="-1"/>
112         <!-- overwritePolicy: "skip" a file if the output file exists, "rename" an output file, or "overwrite" -->
113 <!-- can include e.g. outputDir="output", but we don't want to include this in the default! -->
114 <outputstream class="FSOutputStreamFactory" encoding="UTF-8" outputSuffix="xml"/>
115 </consumers>
116
117 <!-- reporter and interrupter are optional -->
118 <reporter builderClass="org.apache.tika.batch.builders.SimpleLogReporterBuilder" reporterSleepMillis="1000"
119 reporterStaleThresholdMillis="60000"/>
120 <interrupter builderClass="org.apache.tika.batch.builders.InterrupterBuilder"/>
121 </tika-batch-config>
0 package org.apache.tika.batch;
1
2 /*
3 * Licensed to the Apache Software Foundation (ASF) under one or more
4 * contributor license agreements. See the NOTICE file distributed with
5 * this work for additional information regarding copyright ownership.
6 * The ASF licenses this file to You under the Apache License, Version 2.0
7 * (the "License"); you may not use this file except in compliance with
8 * the License. You may obtain a copy of the License at
9 *
10 * http://www.apache.org/licenses/LICENSE-2.0
11 *
12 * Unless required by applicable law or agreed to in writing, software
13 * distributed under the License is distributed on an "AS IS" BASIS,
14 * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
15 * See the License for the specific language governing permissions and
16 * limitations under the License.
17 */
18
19 import java.io.File;
20 import java.io.FileInputStream;
21 import java.io.InputStream;
22
23 import org.apache.commons.cli.Options;
24 import org.apache.tika.batch.builders.CommandLineParserBuilder;
25 import org.apache.tika.batch.fs.FSBatchTestBase;
26 import org.apache.tika.io.IOUtils;
27 import org.junit.Test;
28
29
30 public class CommandLineParserBuilderTest extends FSBatchTestBase {
31
32 @Test
33 public void testBasic() throws Exception {
34 String configFile = this.getClass().getResource(
35 "/tika-batch-config-test.xml").getFile();
36 InputStream is = null;
37 try {
38 is = new FileInputStream(new File(configFile));
39 CommandLineParserBuilder builder = new CommandLineParserBuilder();
40 Options options = builder.build(is);
41 //TODO: insert actual tests :)
42 } finally {
43 IOUtils.closeQuietly(is);
44 }
45
46 }
47 }
0 package org.apache.tika.batch.fs;
1
2 /*
3 * Licensed to the Apache Software Foundation (ASF) under one or more
4 * contributor license agreements. See the NOTICE file distributed with
5 * this work for additional information regarding copyright ownership.
6 * The ASF licenses this file to You under the Apache License, Version 2.0
7 * (the "License"); you may not use this file except in compliance with
8 * the License. You may obtain a copy of the License at
9 *
10 * http://www.apache.org/licenses/LICENSE-2.0
11 *
12 * Unless required by applicable law or agreed to in writing, software
13 * distributed under the License is distributed on an "AS IS" BASIS,
14 * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
15 * See the License for the specific language governing permissions and
16 * limitations under the License.
17 */
18
19 import static junit.framework.Assert.assertEquals;
20 import static junit.framework.TestCase.assertNotNull;
21 import static org.junit.Assert.assertFalse;
22 import static org.junit.Assert.assertTrue;
23
24 import java.io.File;
25 import java.util.Arrays;
26 import java.util.HashMap;
27 import java.util.Map;
28
29 import org.apache.commons.io.FileUtils;
30 import org.apache.tika.batch.BatchProcessDriverCLI;
31 import org.apache.tika.io.IOUtils;
32 import org.junit.Test;
33
34
35 public class BatchDriverTest extends FSBatchTestBase {
36
37 //for debugging, turn logging off/on via resources/log4j.properties for the driver
38 //and log4j_process.properties for the process.
39
40 @Test(timeout = 15000)
41 public void oneHeavyHangTest() throws Exception {
42         //batch runner hits one heavy-hang file and keeps going
43 File outputDir = getNewOutputDir("daemon-");
44 assertNotNull(outputDir.listFiles());
45 //make sure output directory is empty!
46 assertEquals(0, outputDir.listFiles().length);
47
48 String[] args = getDefaultCommandLineArgsArr("one_heavy_hang", outputDir, null);
49 BatchProcessDriverCLI driver = getNewDriver("/tika-batch-config-test.xml", args);
50 driver.execute();
51 assertEquals(0, driver.getNumRestarts());
52 assertFalse(driver.getUserInterrupted());
53 assertEquals(5, outputDir.listFiles().length);
54 assertContains("first test file",
55 FileUtils.readFileToString(new File(outputDir, "test2_ok.xml.xml"),
56 IOUtils.UTF_8.toString()));
57
58
59 }
60
61 @Test(timeout = 30000)
62 public void restartOnFullHangTest() throws Exception {
63 //batch runner hits more heavy hangs than threads; needs to restart
64 File outputDir = getNewOutputDir("daemon-");
65
66 //make sure output directory is empty!
67 assertEquals(0, outputDir.listFiles().length);
68
69 String[] args = getDefaultCommandLineArgsArr("heavy_heavy_hangs", outputDir, null);
70 BatchProcessDriverCLI driver = getNewDriver("/tika-batch-config-test.xml", args);
71 driver.execute();
72 //could be one or two depending on timing
73 assertTrue(driver.getNumRestarts() > 0);
74 assertFalse(driver.getUserInterrupted());
75 assertContains("first test file",
76 FileUtils.readFileToString(new File(outputDir, "test6_ok.xml.xml"),
77 IOUtils.UTF_8.toString()));
78 }
79
80 @Test(timeout = 15000)
81 public void noRestartTest() throws Exception {
82 File outputDir = getNewOutputDir("daemon-");
83
84 //make sure output directory is empty!
85 assertEquals(0, outputDir.listFiles().length);
86
87 String[] args = getDefaultCommandLineArgsArr("no_restart", outputDir, null);
88 String[] mod = Arrays.copyOf(args, args.length + 2);
89 mod[args.length] = "-numConsumers";
90 mod[args.length+1] = "1";
91
92 BatchProcessDriverCLI driver = getNewDriver("/tika-batch-config-test.xml", mod);
93 driver.execute();
94 assertEquals(0, driver.getNumRestarts());
95 assertFalse(driver.getUserInterrupted());
96 File[] files = outputDir.listFiles();
97 assertEquals(2, files.length);
98 File test2 = new File(outputDir, "test2_norestart.xml.xml");
99 assertTrue("test2_norestart.xml", test2.exists());
100 File test3 = new File(outputDir, "test3_ok.xml.xml");
101 assertFalse("test3_ok.xml", test3.exists());
102 assertEquals(0, test3.length());
103 }
104
105 @Test(timeout = 15000)
106 public void restartOnOOMTest() throws Exception {
107         //batch runner hits an OOM; needs to restart
108 File outputDir = getNewOutputDir("daemon-");
109
110 //make sure output directory is empty!
111 assertEquals(0, outputDir.listFiles().length);
112
113 String[] args = getDefaultCommandLineArgsArr("oom", outputDir, null);
114 BatchProcessDriverCLI driver = getNewDriver("/tika-batch-config-test.xml", args);
115 driver.execute();
116 assertEquals(1, driver.getNumRestarts());
117 assertFalse(driver.getUserInterrupted());
118 assertContains("first test file",
119 FileUtils.readFileToString(new File(outputDir, "test2_ok.xml.xml"),
120 IOUtils.UTF_8.toString()));
121 }
122
123 @Test(timeout = 30000)
124 public void allHeavyHangsTestWithStarvedCrawler() throws Exception {
125 //this tests that if all consumers are hung and the crawler is
126 //waiting to add to the queue, there isn't deadlock. The BatchProcess should
127 //just shutdown, and the driver should restart
128 File outputDir = getNewOutputDir("allHeavyHangsStarvedCrawler-");
129 Map<String, String> args = new HashMap<String,String>();
130 args.put("-numConsumers", "2");
131 args.put("-maxQueueSize", "2");
132 String[] commandLine = getDefaultCommandLineArgsArr("heavy_heavy_hangs", outputDir, args);
133 BatchProcessDriverCLI driver = getNewDriver("/tika-batch-config-test.xml", commandLine);
134 driver.execute();
135 assertEquals(3, driver.getNumRestarts());
136 assertFalse(driver.getUserInterrupted());
137 assertContains("first test file",
138 FileUtils.readFileToString(new File(outputDir, "test6_ok.xml.xml"),
139 IOUtils.UTF_8.toString()));
140 }
141
142 @Test(timeout = 30000)
143 public void maxRestarts() throws Exception {
144 //tests that maxRestarts works
145 //if -maxRestarts is not correctly removed from the commandline,
146 //FSBatchProcessCLI's cli parser will throw an Unrecognized option exception
147
148 File outputDir = getNewOutputDir("allHeavyHangsStarvedCrawler-");
149 Map<String, String> args = new HashMap<String,String>();
150 args.put("-numConsumers", "1");
151 args.put("-maxQueueSize", "10");
152 args.put("-maxRestarts", "2");
153
154 String[] commandLine = getDefaultCommandLineArgsArr("max_restarts", outputDir, args);
155
156 BatchProcessDriverCLI driver = getNewDriver("/tika-batch-config-test.xml", commandLine);
157 driver.execute();
158 assertEquals(2, driver.getNumRestarts());
159 assertFalse(driver.getUserInterrupted());
160 assertEquals(3, outputDir.listFiles().length);
161 }
162
163 @Test(timeout = 30000)
164 public void maxRestartsBadParameter() throws Exception {
165 //tests that maxRestarts must be followed by an Integer
166 File outputDir = getNewOutputDir("allHeavyHangsStarvedCrawler-");
167 Map<String, String> args = new HashMap<String,String>();
168 args.put("-numConsumers", "1");
169 args.put("-maxQueueSize", "10");
170 args.put("-maxRestarts", "zebra");
171
172 String[] commandLine = getDefaultCommandLineArgsArr("max_restarts", outputDir, args);
173 boolean ex = false;
174 try {
175 BatchProcessDriverCLI driver = getNewDriver("/tika-batch-config-test.xml", commandLine);
176 driver.execute();
177 } catch (IllegalArgumentException e) {
178 ex = true;
179 }
180 assertTrue("IllegalArgumentException should have been thrown", ex);
181 }
182
183 @Test(timeout = 30000)
184 public void testNoRestartIfProcessFails() throws Exception {
185 //tests that if something goes horribly wrong with FSBatchProcessCLI
186 //the driver will not restart it again and again
187 //this calls a bad xml file which should trigger a no restart exit.
188 File outputDir = getNewOutputDir("nostart-norestart-");
189 Map<String, String> args = new HashMap<String,String>();
190 args.put("-numConsumers", "1");
191 args.put("-maxQueueSize", "10");
192
193 String[] commandLine = getDefaultCommandLineArgsArr("basic", outputDir, args);
194 BatchProcessDriverCLI driver = getNewDriver("/tika-batch-config-broken.xml", commandLine);
195 driver.execute();
196 assertEquals(0, outputDir.listFiles().length);
197 assertEquals(0, driver.getNumRestarts());
198 }
199
200 @Test(timeout = 30000)
201 public void testNoRestartIfProcessFailsTake2() throws Exception {
202 File outputDir = getNewOutputDir("nostart-norestart-");
203 Map<String, String> args = new HashMap<String,String>();
204 args.put("-numConsumers", "1");
205 args.put("-maxQueueSize", "10");
206 args.put("-somethingOrOther", "I don't Know");
207
208 String[] commandLine = getDefaultCommandLineArgsArr("basic", outputDir, args);
209 BatchProcessDriverCLI driver = getNewDriver("/tika-batch-config-test.xml", commandLine);
210 driver.execute();
211 assertEquals(0, outputDir.listFiles().length);
212 assertEquals(0, driver.getNumRestarts());
213 }
214
215
216 }
0 package org.apache.tika.batch.fs;
1 /*
2 * Licensed to the Apache Software Foundation (ASF) under one or more
3 * contributor license agreements. See the NOTICE file distributed with
4 * this work for additional information regarding copyright ownership.
5 * The ASF licenses this file to You under the Apache License, Version 2.0
6 * (the "License"); you may not use this file except in compliance with
7 * the License. You may obtain a copy of the License at
8 *
9 * http://www.apache.org/licenses/LICENSE-2.0
10 *
11 * Unless required by applicable law or agreed to in writing, software
12 * distributed under the License is distributed on an "AS IS" BASIS,
13 * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
14 * See the License for the specific language governing permissions and
15 * limitations under the License.
16 */
17
18
19 import static junit.framework.TestCase.assertEquals;
20 import static junit.framework.TestCase.fail;
21 import static org.junit.Assert.assertFalse;
22 import static org.junit.Assert.assertTrue;
23
24 import java.io.File;
25 import java.io.IOException;
26 import java.util.Map;
27
28 import org.apache.commons.io.FileUtils;
29 import org.apache.tika.batch.BatchProcess;
30 import org.apache.tika.batch.BatchProcessDriverCLI;
31 import org.apache.tika.io.IOUtils;
32 import org.junit.Test;
33
34 public class BatchProcessTest extends FSBatchTestBase {
35
36 @Test(timeout = 15000)
37 public void oneHeavyHangTest() throws Exception {
38
39 File outputDir = getNewOutputDir("one_heavy_hang-");
40
41 Map<String, String> args = getDefaultArgs("one_heavy_hang", outputDir);
42 BatchProcessTestExecutor ex = new BatchProcessTestExecutor(args);
43 StreamStrings streamStrings = ex.execute();
44 assertEquals(5, outputDir.listFiles().length);
45 File hvyHang = new File(outputDir, "test0_heavy_hang.xml.xml");
46 assertTrue(hvyHang.exists());
47 assertEquals(0, hvyHang.length());
48 assertNotContained(BatchProcess.BATCH_CONSTANTS.BATCH_PROCESS_FATAL_MUST_RESTART.toString(),
49 streamStrings.getErrString());
50 }
51
52
53 @Test(timeout = 15000)
54 public void allHeavyHangsTest() throws Exception {
55 //each of the three threads hits a heavy hang. The BatchProcess runs into
56         //timeouts on all of them and shuts down.
57 File outputDir = getNewOutputDir("allHeavyHangs-");
58 Map<String, String> args = getDefaultArgs("heavy_heavy_hangs", outputDir);
59 BatchProcessTestExecutor ex = new BatchProcessTestExecutor(args);
60 StreamStrings streamStrings = ex.execute();
61
62 assertEquals(3, outputDir.listFiles().length);
63 for (File hvyHang : outputDir.listFiles()){
64 assertTrue(hvyHang.exists());
65 assertEquals("file length for "+hvyHang.getName()+" should be 0, but is: " +hvyHang.length(),
66 0, hvyHang.length());
67 }
68 assertContains(BatchProcess.BATCH_CONSTANTS.BATCH_PROCESS_FATAL_MUST_RESTART.toString(),
69 streamStrings.getErrString());
70 }
71
72 @Test(timeout = 30000)
73 public void allHeavyHangsTestWithCrazyNumberConsumersTest() throws Exception {
74 File outputDir = getNewOutputDir("allHeavyHangsCrazyNumberConsumers-");
75 Map<String, String> args = getDefaultArgs("heavy_heavy_hangs", outputDir);
76 args.put("numConsumers", "100");
77 BatchProcessTestExecutor ex = new BatchProcessTestExecutor(args);
78 StreamStrings streamStrings = ex.execute();
79 assertEquals(7, outputDir.listFiles().length);
80
81 for (int i = 0; i < 6; i++){
82 File hvyHang = new File(outputDir, "test"+i+"_heavy_hang.xml.xml");
83 assertTrue(hvyHang.exists());
84 assertEquals(0, hvyHang.length());
85 }
86 assertContains("This is tika-batch's first test file",
87 FileUtils.readFileToString(new File(outputDir, "test6_ok.xml.xml"),
88 IOUtils.UTF_8.toString()));
89
90         //the key is that the process realizes that there were no more processable files
91 //in the queue and does not ask for a restart!
92 assertNotContained(BatchProcess.BATCH_CONSTANTS.BATCH_PROCESS_FATAL_MUST_RESTART.toString(),
93 streamStrings.getErrString());
94 }
95
96 @Test(timeout = 30000)
97 public void allHeavyHangsTestWithStarvedCrawler() throws Exception {
98 //this tests that if all consumers are hung and the crawler is
99         //waiting to add to the queue, there isn't deadlock. The batch runner should
100 //shutdown and ask to be restarted.
101 File outputDir = getNewOutputDir("allHeavyHangsStarvedCrawler-");
102 Map<String, String> args = getDefaultArgs("heavy_heavy_hangs", outputDir);
103 args.put("numConsumers", "2");
104 args.put("maxQueueSize", "2");
105 args.put("timeoutThresholdMillis", "100000000");//make sure that the batch process doesn't time out
106 BatchProcessTestExecutor ex = new BatchProcessTestExecutor(args);
107 StreamStrings streamStrings = ex.execute();
108 assertEquals(2, outputDir.listFiles().length);
109
110 for (int i = 0; i < 2; i++){
111 File hvyHang = new File(outputDir, "test"+i+"_heavy_hang.xml.xml");
112 assertTrue(hvyHang.exists());
113 assertEquals(0, hvyHang.length());
114 }
115 assertContains(BatchProcess.BATCH_CONSTANTS.BATCH_PROCESS_FATAL_MUST_RESTART.toString(),
116 streamStrings.getErrString());
117 assertContains("Crawler timed out", streamStrings.getErrString());
118 }
119
120 @Test(timeout = 15000)
121 public void outOfMemory() throws Exception {
122 //the first consumer should sleep for 10 seconds
123 //the second should be tied up in a heavy hang
124 //the third one should hit the oom after processing test2_ok.xml
125 //no consumers should process test2-4.txt!
126 //i.e. the first consumer will finish in 10 seconds and
127         //would otherwise look for more files, but the oom should prevent that
128 File outputDir = getNewOutputDir("oom-");
129
130 Map<String, String> args = getDefaultArgs("oom", outputDir);
131 args.put("numConsumers", "3");
132 args.put("timeoutThresholdMillis", "30000");
133
134 BatchProcessTestExecutor ex = new BatchProcessTestExecutor(args);
135 StreamStrings streamStrings = ex.execute();
136
137 assertEquals(4, outputDir.listFiles().length);
138 assertContains("This is tika-batch's first test file",
139 FileUtils.readFileToString(new File(outputDir, "test2_ok.xml.xml"),
140 IOUtils.UTF_8.toString()));
141
142 assertContains(BatchProcess.BATCH_CONSTANTS.BATCH_PROCESS_FATAL_MUST_RESTART.toString(),
143 streamStrings.getErrString());
144 }
145
146
147
148 @Test(timeout = 15000)
149 public void noRestart() throws Exception {
150 File outputDir = getNewOutputDir("no_restart");
151
152 Map<String, String> args = getDefaultArgs("no_restart", outputDir);
153 args.put("numConsumers", "1");
154
155 BatchProcessTestExecutor ex = new BatchProcessTestExecutor(args);
156
157 StreamStrings streamStrings = ex.execute();
158 File[] files = outputDir.listFiles();
159 File test2 = new File(outputDir, "test2_norestart.xml.xml");
160 assertTrue("test2_norestart.xml", test2.exists());
161 File test3 = new File(outputDir, "test3_ok.xml.xml");
162 assertFalse("test3_ok.xml", test3.exists());
163 assertEquals(0, test3.length());
164 assertContains("exitStatus="+ BatchProcessDriverCLI.PROCESS_NO_RESTART_EXIT_CODE,
165 streamStrings.getOutString());
166 assertContains("causeForTermination='MAIN_LOOP_EXCEPTION_NO_RESTART'",
167 streamStrings.getOutString());
168 }
169
170 /**
171 * This tests to make sure that BatchProcess waits the appropriate
172 * amount of time on an early termination before stopping.
173 *
174 * If this fails, then interruptible parsers (e.g. those with
175 * nio channels) will be interrupted and there will be corrupted data.
176 */
177 @Test(timeout = 60000)
178 public void testWaitAfterEarlyTermination() throws Exception {
179 File outputDir = getNewOutputDir("wait_after_early_termination");
180
181 Map<String, String> args = getDefaultArgs("wait_after_early_termination", outputDir);
182 args.put("numConsumers", "1");
183 args.put("maxAliveTimeSeconds", "5");//main process loop should stop after 5 seconds
184 args.put("timeoutThresholdMillis", "300000");//effectively never
185 args.put("pauseOnEarlyTerminationMillis", "20000");//let the parser have up to 20 seconds
186
187 BatchProcessTestExecutor ex = new BatchProcessTestExecutor(args);
188
189 StreamStrings streamStrings = ex.execute();
190 File[] files = outputDir.listFiles();
191 assertEquals(1, files.length);
192 assertContains("<p>some content</p>",
193 FileUtils.readFileToString(new File(outputDir, "test0_sleep.xml.xml"),
194 IOUtils.UTF_8.toString()));
195
196 assertContains("exitStatus="+BatchProcessDriverCLI.PROCESS_RESTART_EXIT_CODE, streamStrings.getOutString());
197 assertContains("causeForTermination='BATCH_PROCESS_ALIVE_TOO_LONG'",
198 streamStrings.getOutString());
199 }
200
201 @Test(timeout = 60000)
202 public void testTimeOutAfterBeingAskedToShutdown() throws Exception {
203 File outputDir = getNewOutputDir("timeout_after_early_termination");
204
205 Map<String, String> args = getDefaultArgs("timeout_after_early_termination", outputDir);
206 args.put("numConsumers", "1");
207 args.put("maxAliveTimeSeconds", "5");//main process loop should stop after 5 seconds
208 args.put("timeoutThresholdMillis", "10000");
209 args.put("pauseOnEarlyTerminationMillis", "20000");//let the parser have up to 20 seconds
210
211 BatchProcessTestExecutor ex = new BatchProcessTestExecutor(args);
212 StreamStrings streamStrings = ex.execute();
213 File[] files = outputDir.listFiles();
214 assertEquals(1, files.length);
215 assertEquals(0, files[0].length());
216 assertContains("exitStatus="+BatchProcessDriverCLI.PROCESS_RESTART_EXIT_CODE, streamStrings.getOutString());
217 assertContains("causeForTermination='BATCH_PROCESS_ALIVE_TOO_LONG'",
218 streamStrings.getOutString());
219 }
220
221 @Test(timeout = 10000)
222 public void testRedirectionOfStreams() throws Exception {
223         //test redirection of System.err to System.out
224 File outputDir = getNewOutputDir("noisy_parsers");
225
226 Map<String, String> args = getDefaultArgs("noisy_parsers", outputDir);
227 args.put("numConsumers", "1");
228         args.put("maxAliveTimeSeconds", "20");//main process loop should stop after 20 seconds
229
230 BatchProcessTestExecutor ex = new BatchProcessTestExecutor(args);
231 StreamStrings streamStrings = ex.execute();
232 File[] files = outputDir.listFiles();
233 assertEquals(1, files.length);
234 assertContains("System.out", streamStrings.getOutString());
235 assertContains("System.err", streamStrings.getOutString());
236 assertEquals(0, streamStrings.getErrString().length());
237
238 }
239
240 @Test(timeout = 10000)
241 public void testConsumersManagerInitHang() throws Exception {
242 File outputDir = getNewOutputDir("init_hang");
243
244 Map<String, String> args = getDefaultArgs("noisy_parsers", outputDir);
245 args.put("numConsumers", "1");
246 args.put("hangOnInit", "true");
247 BatchProcessTestExecutor ex = new BatchProcessTestExecutor(args, "/tika-batch-config-MockConsumersBuilder.xml");
248 StreamStrings streamStrings = ex.execute();
249 assertEquals(BatchProcessDriverCLI.PROCESS_NO_RESTART_EXIT_CODE, ex.getExitValue());
250 assertContains("causeForTermination='CONSUMERS_MANAGER_DIDNT_INIT_IN_TIME_NO_RESTART'", streamStrings.getOutString());
251 }
252
253 @Test(timeout = 10000)
254 public void testConsumersManagerShutdownHang() throws Exception {
255 File outputDir = getNewOutputDir("shutdown_hang");
256
257 Map<String, String> args = getDefaultArgs("noisy_parsers", outputDir);
258 args.put("numConsumers", "1");
259 args.put("hangOnShutdown", "true");
260
261 BatchProcessTestExecutor ex = new BatchProcessTestExecutor(args, "/tika-batch-config-MockConsumersBuilder.xml");
262 StreamStrings streamStrings = ex.execute();
263 assertEquals(BatchProcessDriverCLI.PROCESS_NO_RESTART_EXIT_CODE, ex.getExitValue());
264 assertContains("ConsumersManager did not shutdown within", streamStrings.getOutString());
265 }
266
267 private class BatchProcessTestExecutor {
268 private final Map<String, String> args;
269 private final String configPath;
270 private int exitValue = Integer.MIN_VALUE;
271
272 public BatchProcessTestExecutor(Map<String, String> args) {
273 this(args, "/tika-batch-config-test.xml");
274 }
275
276 public BatchProcessTestExecutor(Map<String, String> args, String configPath) {
277 this.args = args;
278 this.configPath = configPath;
279 }
280
281 private StreamStrings execute() {
282 Process p = null;
283 try {
284 ProcessBuilder b = getNewBatchRunnerProcess(configPath, args);
285 p = b.start();
286 StringStreamGobbler errorGobbler = new StringStreamGobbler(p.getErrorStream());
287 StringStreamGobbler outGobbler = new StringStreamGobbler(p.getInputStream());
288 Thread errorThread = new Thread(errorGobbler);
289 Thread outThread = new Thread(outGobbler);
290 errorThread.start();
291 outThread.start();
292 while (true) {
293 try {
294 exitValue = p.exitValue();
295 break;
296 } catch (IllegalThreadStateException e) {
297 //still going; poll until the process exits
298 }
299 }
300 errorGobbler.stopGobblingAndDie();
301 outGobbler.stopGobblingAndDie();
302 errorThread.interrupt();
303 outThread.interrupt();
304 return new StreamStrings(outGobbler.toString(), errorGobbler.toString());
305 } catch (IOException e) {
306 fail();
307 } finally {
308 destroyProcess(p);
309 }
310 return null;
311 }
312
313 private int getExitValue() {
314 return exitValue;
315 }
316
317 }
318
319 private class StreamStrings {
320 private final String outString;
321 private final String errString;
322
323 private StreamStrings(String outString, String errString) {
324 this.outString = outString;
325 this.errString = errString;
326 }
327
328 private String getOutString() {
329 return outString;
330 }
331
332 private String getErrString() {
333 return errString;
334 }
335
336 @Override
337 public String toString() {
338 return "OUT>>"+outString+"<<\n"+
339 "ERR>>"+errString+"<<\n";
340 }
341 }
342 }
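The `execute()` method above polls `Process.exitValue()` in a tight loop, catching `IllegalThreadStateException` until the child exits. On Java 8+ the same wait can be expressed with `Process.waitFor(timeout, unit)` and no busy-spinning; a minimal sketch (the class name and timeout are illustrative, not part of Tika):

```java
import java.util.concurrent.TimeUnit;

public class WaitDemo {
    // Launch a short-lived child process and block with a timeout,
    // instead of polling exitValue() in a tight loop.
    public static int runAndWait(String... cmd) throws Exception {
        Process p = new ProcessBuilder(cmd).inheritIO().start();
        if (!p.waitFor(10, TimeUnit.SECONDS)) {
            p.destroy(); // timed out: kill the child
            throw new IllegalStateException("process did not exit in time");
        }
        return p.exitValue();
    }

    public static void main(String[] args) throws Exception {
        // "java -version" terminates almost immediately with exit code 0
        System.out.println(runAndWait("java", "-version"));
    }
}
```

The busy-wait in the test keeps the code compatible with pre-Java-8 runtimes, at the cost of burning a core while the child runs.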
0 package org.apache.tika.batch.fs;
1
2 /*
3 * Licensed to the Apache Software Foundation (ASF) under one or more
4 * contributor license agreements. See the NOTICE file distributed with
5 * this work for additional information regarding copyright ownership.
6 * The ASF licenses this file to You under the Apache License, Version 2.0
7 * (the "License"); you may not use this file except in compliance with
8 * the License. You may obtain a copy of the License at
9 *
10 * http://www.apache.org/licenses/LICENSE-2.0
11 *
12 * Unless required by applicable law or agreed to in writing, software
13 * distributed under the License is distributed on an "AS IS" BASIS,
14 * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
15 * See the License for the specific language governing permissions and
16 * limitations under the License.
17 */
18
19 import java.io.File;
20 import java.io.IOException;
21 import java.io.InputStream;
22 import java.util.ArrayList;
23 import java.util.HashMap;
24 import java.util.List;
25 import java.util.Map;
26 import java.util.concurrent.ExecutorService;
27 import java.util.concurrent.Executors;
28 import java.util.concurrent.Future;
29 import java.util.concurrent.TimeUnit;
30
31 import org.apache.commons.io.FileUtils;
32 import org.apache.tika.TikaTest;
33 import org.apache.tika.batch.BatchProcess;
34 import org.apache.tika.batch.BatchProcessDriverCLI;
35 import org.apache.tika.batch.ParallelFileProcessingResult;
36 import org.apache.tika.batch.builders.BatchProcessBuilder;
37 import org.apache.tika.io.IOUtils;
38 import org.junit.AfterClass;
39 import org.junit.BeforeClass;
40
41 /**
42 * This is the base class for file-system batch tests.
43 * <p/>
44 * There are a few areas for improvement in this test suite.
45 * <ol>
46 * <li>For the heavy load tests, the test cases leave behind files that
47 * cannot be deleted from within the same jvm. A thread is still actively writing to an
48 * OutputStream when tearDown() is called. The current solution is to create
49 * the temp dir within the target/tika-batch/test-classes so that they will at least
50 * be removed during each maven &quot;clean&quot;</li>
51 * <li>The &quot;mock&quot; tests are time-based. This is not
52 * extremely reliable across different machines with different number/power of cpus.
53 * </li>
54 * </ol>
55 */
56 public abstract class FSBatchTestBase extends TikaTest {
57
58 private static File outputRoot = null;
59
60 @BeforeClass
61 public static void setUp() throws Exception {
62
63 File testOutput = new File("target/test-classes/test-output");
64 testOutput.mkdirs();
65 outputRoot = File.createTempFile("tika-batch-output-root-", "", testOutput);
66 outputRoot.delete();
67 outputRoot.mkdirs();
68
69 }
70
71 @AfterClass
72 public static void tearDown() throws Exception {
73 //not ideal, but should be ok for testing
74 //see caveat in TikaCLITest's textExtract
75
76 try {
77 FileUtils.deleteDirectory(outputRoot);
78 } catch (IOException e) {
79 e.printStackTrace();
80 }
81 }
82
83 protected void destroyProcess(Process p) {
84 if (p == null)
85 return;
86
87 try {
88 p.exitValue();
89 } catch (IllegalThreadStateException e) {
90 p.destroy();
91 }
92 }
93
94 File getNewOutputDir(String subdirPrefix) throws IOException {
95 File outputDir = File.createTempFile(subdirPrefix, "", outputRoot);
96 outputDir.delete();
97 outputDir.mkdirs();
98 return outputDir;
99 }
100
101 Map<String, String> getDefaultArgs(String inputSubDir, File outputDir) throws Exception {
102 Map<String, String> args = new HashMap<String, String>();
103 args.put("inputDir", "\""+getInputRoot(inputSubDir).getAbsolutePath()+"\"");
104 if (outputDir != null) {
105 args.put("outputDir", "\""+outputDir.getAbsolutePath()+"\"");
106 }
107 return args;
108 }
109
110 public String[] getDefaultCommandLineArgsArr(String inputSubDir, File outputDir, Map<String, String> commandLine) throws Exception {
111 List<String> args = new ArrayList<String>();
112 //need to include "-" because these are going to the commandline!
113 if (inputSubDir != null) {
114 args.add("-inputDir");
115 args.add(getInputRoot(inputSubDir).getAbsolutePath());
116 }
117 if (outputDir != null) {
118 args.add("-outputDir");
119 args.add(outputDir.getAbsolutePath());
120 }
121 if (commandLine != null) {
122 for (Map.Entry<String, String> e : commandLine.entrySet()) {
123 args.add(e.getKey());
124 args.add(e.getValue());
125 }
126 }
127 return args.toArray(new String[args.size()]);
128 }
129
130
131 public File getInputRoot(String subdir) throws Exception {
132 String path = (subdir == null || subdir.length() == 0) ? "/test-input" : "/test-input/"+subdir;
133 return new File(this.getClass().getResource(path).toURI());
134 }
135
136 BatchProcess getNewBatchRunner(String testConfig,
137 Map<String, String> args) throws IOException {
138 InputStream is = this.getClass().getResourceAsStream(testConfig);
139 BatchProcessBuilder b = new BatchProcessBuilder();
140 BatchProcess runner = b.build(is, args);
141
142 IOUtils.closeQuietly(is);
143 return runner;
144 }
145
146 public ProcessBuilder getNewBatchRunnerProcess(String testConfig, Map<String, String> args) {
147 List<String> argList = new ArrayList<String>();
148 for (Map.Entry<String, String> e : args.entrySet()) {
149 argList.add("-"+e.getKey());
150 argList.add(e.getValue());
151 }
152
153 String[] fullCommandLine = commandLine(testConfig, argList.toArray(new String[argList.size()]));
154 return new ProcessBuilder(fullCommandLine);
155 }
156
157 private String[] commandLine(String testConfig, String[] args) {
158 List<String> commandLine = new ArrayList<String>();
159 commandLine.add("java");
160 commandLine.add("-Dlog4j.configuration=file:"+
161 this.getClass().getResource("/log4j_process.properties").getFile());
162 commandLine.add("-Xmx128m");
163 commandLine.add("-cp");
164 String cp = System.getProperty("java.class.path");
165 //need to test for " " on *nix, can't just add double quotes
166 //across platforms.
167 if (cp.contains(" ")){
168 cp = "\""+cp+"\"";
169 }
170 commandLine.add(cp);
171 commandLine.add("org.apache.tika.batch.fs.FSBatchProcessCLI");
172
173 String configFile = this.getClass().getResource(testConfig).getFile();
174 commandLine.add("-bc");
175
176 commandLine.add(configFile);
177
178 for (String s : args) {
179 commandLine.add(s);
180 }
181 return commandLine.toArray(new String[commandLine.size()]);
182 }
183
184 public BatchProcessDriverCLI getNewDriver(String testConfig,
185 String[] args) throws Exception {
186 List<String> commandLine = new ArrayList<String>();
187 commandLine.add("java");
188 commandLine.add("-Xmx128m");
189 commandLine.add("-cp");
190 String cp = System.getProperty("java.class.path");
191 //need to test for " " on *nix, can't just add double quotes
192 //across platforms.
193 if (cp.contains(" ")){
194 cp = "\""+cp+"\"";
195 }
196 commandLine.add(cp);
197 commandLine.add("org.apache.tika.batch.fs.FSBatchProcessCLI");
198
199 String configFile = this.getClass().getResource(testConfig).getFile();
200 commandLine.add("-bc");
201
202 commandLine.add(configFile);
203
204 for (String s : args) {
205 commandLine.add(s);
206 }
207
208 BatchProcessDriverCLI driver = new BatchProcessDriverCLI(
209 commandLine.toArray(new String[commandLine.size()]));
210 driver.setRedirectChildProcessToStdOut(false);
211 return driver;
212 }
213
214 protected ParallelFileProcessingResult run(BatchProcess process) throws Exception {
215 ExecutorService executor = Executors.newSingleThreadExecutor();
216 Future<ParallelFileProcessingResult> futureResult = executor.submit(process);
217 return futureResult.get(10, TimeUnit.SECONDS);
218 }
219 }
0 package org.apache.tika.batch.fs;
1
2 /*
3 * Licensed to the Apache Software Foundation (ASF) under one or more
4 * contributor license agreements. See the NOTICE file distributed with
5 * this work for additional information regarding copyright ownership.
6 * The ASF licenses this file to You under the Apache License, Version 2.0
7 * (the "License"); you may not use this file except in compliance with
8 * the License. You may obtain a copy of the License at
9 *
10 * http://www.apache.org/licenses/LICENSE-2.0
11 *
12 * Unless required by applicable law or agreed to in writing, software
13 * distributed under the License is distributed on an "AS IS" BASIS,
14 * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
15 * See the License for the specific language governing permissions and
16 * limitations under the License.
17 */
18
19 import static org.junit.Assert.assertFalse;
20 import static org.junit.Assert.assertTrue;
21
22 import java.io.File;
23 import java.util.Map;
24
25 import org.apache.commons.io.FileUtils;
26 import org.apache.tika.batch.BatchProcess;
27 import org.apache.tika.batch.ParallelFileProcessingResult;
28 import org.apache.tika.io.IOUtils;
29 import org.junit.Test;
30
31 public class HandlerBuilderTest extends FSBatchTestBase {
32
33 @Test
34 public void testXML() throws Exception {
35
36 File outputDir = getNewOutputDir("handler-xml-");
37 Map<String, String> args = getDefaultArgs("basic", outputDir);
38 args.put("basicHandlerType", "xml");
39 args.put("outputSuffix", "xml");
40
41 BatchProcess runner = getNewBatchRunner("/tika-batch-config-test.xml", args);
42 ParallelFileProcessingResult result = run(runner);
43 File outputFile = new File(outputDir, "test0.xml.xml");
44 String resultString = FileUtils.readFileToString(outputFile, IOUtils.UTF_8.toString());
45 assertTrue(resultString.contains("<html xmlns=\"http://www.w3.org/1999/xhtml\">"));
46 assertTrue(resultString.contains("<?xml version=\"1.0\" encoding=\"UTF-8\"?>"));
47 assertTrue(resultString.contains("This is tika-batch's first test file"));
48 }
49
50
51 @Test
52 public void testHTML() throws Exception {
53 File outputDir = getNewOutputDir("handler-html-");
54
55 Map<String, String> args = getDefaultArgs("basic", outputDir);
56 args.put("basicHandlerType", "html");
57 args.put("outputSuffix", "html");
58 BatchProcess runner = getNewBatchRunner("/tika-batch-config-test.xml", args);
59 ParallelFileProcessingResult result = run(runner);
60 File outputFile = new File(outputDir, "test0.xml.html");
61 String resultString = FileUtils.readFileToString(outputFile, IOUtils.UTF_8.toString());
62 assertTrue(resultString.contains("<html xmlns=\"http://www.w3.org/1999/xhtml\">"));
63 assertFalse(resultString.contains("<?xml version=\"1.0\" encoding=\"UTF-8\"?>"));
64 assertTrue(resultString.contains("This is tika-batch's first test file"));
65 }
66
67 @Test
68 public void testText() throws Exception {
69 File outputDir = getNewOutputDir("handler-txt-");
70
71 Map<String, String> args = getDefaultArgs("basic", outputDir);
72 args.put("basicHandlerType", "txt");
73 args.put("outputSuffix", "txt");
74
75 BatchProcess runner = getNewBatchRunner("/tika-batch-config-test.xml", args);
76 ParallelFileProcessingResult result = run(runner);
77 File outputFile = new File(outputDir, "test0.xml.txt");
78 String resultString = FileUtils.readFileToString(outputFile, IOUtils.UTF_8.toString());
79 assertFalse(resultString.contains("<html xmlns=\"http://www.w3.org/1999/xhtml\">"));
80 assertFalse(resultString.contains("<?xml version=\"1.0\" encoding=\"UTF-8\"?>"));
81 assertTrue(resultString.contains("This is tika-batch's first test file"));
82 }
83
84
85 @Test
86 public void testXMLWithWriteLimit() throws Exception {
87 File outputDir = getNewOutputDir("handler-xml-write-limit-");
88
89 Map<String, String> args = getDefaultArgs("basic", outputDir);
90 args.put("writeLimit", "5");
91
92 BatchProcess runner = getNewBatchRunner("/tika-batch-config-test.xml", args);
93 ParallelFileProcessingResult result = run(runner);
94
95 File outputFile = new File(outputDir, "test0.xml.xml");
96 String resultString = FileUtils.readFileToString(outputFile, IOUtils.UTF_8.toString());
97 //this is not ideal. How can we change handlers to writeout whatever
98 //they've gotten so far, up to the writeLimit?
99 assertTrue(resultString.equals(""));
100 }
101
102 @Test
103 public void testRecursiveParserWrapper() throws Exception {
104 File outputDir = getNewOutputDir("handler-recursive-parser");
105
106 Map<String, String> args = getDefaultArgs("basic", outputDir);
107 args.put("basicHandlerType", "txt");
108 args.put("outputSuffix", "json");
109 args.put("recursiveParserWrapper", "true");
110
111 BatchProcess runner = getNewBatchRunner("/tika-batch-config-test.xml", args);
112 ParallelFileProcessingResult result = run(runner);
113 File outputFile = new File(outputDir, "test0.xml.json");
114 String resultString = FileUtils.readFileToString(outputFile, IOUtils.UTF_8.toString());
115 assertTrue(resultString.contains("\"author\":\"Nikolai Lobachevsky\""));
116 assertTrue(resultString.contains("tika-batch\\u0027s first test file"));
117 }
118
119
120 }
0 package org.apache.tika.batch.fs;
1
2 /*
3 * Licensed to the Apache Software Foundation (ASF) under one or more
4 * contributor license agreements. See the NOTICE file distributed with
5 * this work for additional information regarding copyright ownership.
6 * The ASF licenses this file to You under the Apache License, Version 2.0
7 * (the "License"); you may not use this file except in compliance with
8 * the License. You may obtain a copy of the License at
9 *
10 * http://www.apache.org/licenses/LICENSE-2.0
11 *
12 * Unless required by applicable law or agreed to in writing, software
13 * distributed under the License is distributed on an "AS IS" BASIS,
14 * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
15 * See the License for the specific language governing permissions and
16 * limitations under the License.
17 */
18
19 import static org.junit.Assert.assertEquals;
20 import static org.junit.Assert.assertTrue;
21
22 import java.io.File;
23 import java.util.Map;
24 import java.util.concurrent.ExecutionException;
25
26 import org.apache.tika.batch.BatchProcess;
27 import org.apache.tika.batch.ParallelFileProcessingResult;
28 import org.junit.Test;
29
30 public class OutputStreamFactoryTest extends FSBatchTestBase {
31
32
33 @Test
34 public void testIllegalState() throws Exception {
35 File outputDir = getNewOutputDir("os-factory-illegal-state-");
36 Map<String, String> args = getDefaultArgs("basic", outputDir);
37 BatchProcess runner = getNewBatchRunner("/tika-batch-config-test.xml", args);
38 run(runner);
39 assertEquals(1, outputDir.listFiles().length);
40
41 boolean illegalState = false;
42 try {
43 ParallelFileProcessingResult result = run(runner);
44 } catch (ExecutionException e) {
45 if (e.getCause() instanceof IllegalStateException) {
46 illegalState = true;
47 }
48 }
49 assertTrue("Should have been an illegal state exception", illegalState);
50 }
51
52 @Test
53 public void testSkip() throws Exception {
54 File outputDir = getNewOutputDir("os-factory-skip-");
55 Map<String, String> args = getDefaultArgs("basic", outputDir);
56 args.put("handleExisting", "skip");
57 BatchProcess runner = getNewBatchRunner("/tika-batch-config-test.xml", args);
58 ParallelFileProcessingResult result = run(runner);
59 assertEquals(1, outputDir.listFiles().length);
60
61 runner = getNewBatchRunner("/tika-batch-config-test.xml", args);
62 result = run(runner);
63 assertEquals(1, outputDir.listFiles().length);
64 }
65
66 /* turn this back on if there is any need to add "handleExisting"
67 @Test
68 public void testRename() throws Exception {
69 File outputDir = getNewOutputDir("os-factory-rename-");
70 Map<String, String> args = getDefaultArgs("basic", outputDir);
71
72 args.put("handleExisting", "rename");
73 BatchProcess runner = getNewBatchRunner("/tika-batch-config-basic-test.xml", args);
74 ParallelFileProcessingResult result = runner.execute();
75 assertEquals(1, outputDir.listFiles().length);
76
77 runner = getNewBatchRunner("/tika-batch-config-basic-test.xml", args);
78 result = runner.execute();
79 assertEquals(2, outputDir.listFiles().length);
80
81 runner = getNewBatchRunner("/tika-batch-config-basic-test.xml", args);
82 result = runner.execute();
83 assertEquals(3, outputDir.listFiles().length);
84
85 int hits = 0;
86 for (File f : outputDir.listFiles()){
87 String name = f.getName();
88 if (name.equals("test2_ok.xml.xml")) {
89 hits++;
90 } else if (name.equals("test1(1).txt.xml")) {
91 hits++;
92 } else if (name.equals("test1(2).txt.xml")) {
93 hits++;
94 }
95 }
96 assertEquals(3, hits);
97 }
98 */
99
100 }
0 package org.apache.tika.batch.fs;
1
2 /*
3 * Licensed to the Apache Software Foundation (ASF) under one or more
4 * contributor license agreements. See the NOTICE file distributed with
5 * this work for additional information regarding copyright ownership.
6 * The ASF licenses this file to You under the Apache License, Version 2.0
7 * (the "License"); you may not use this file except in compliance with
8 * the License. You may obtain a copy of the License at
9 *
10 * http://www.apache.org/licenses/LICENSE-2.0
11 *
12 * Unless required by applicable law or agreed to in writing, software
13 * distributed under the License is distributed on an "AS IS" BASIS,
14 * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
15 * See the License for the specific language governing permissions and
16 * limitations under the License.
17 */
18
19 import java.io.BufferedInputStream;
20 import java.io.BufferedReader;
21 import java.io.IOException;
22 import java.io.InputStream;
23 import java.io.InputStreamReader;
24
25 import org.apache.tika.io.IOUtils;
26
27 public class StringStreamGobbler implements Runnable {
28
29 //plagiarized from org.apache.oodt's StreamGobbler
30 private final BufferedReader reader;
31 private volatile boolean running = true;
32 private final StringBuilder sb = new StringBuilder();
33
34 public StringStreamGobbler(InputStream is) {
35 this.reader = new BufferedReader(new InputStreamReader(new BufferedInputStream(is),
36 IOUtils.UTF_8));
37 }
38
39 @Override
40 public void run() {
41 String line = null;
42 try {
43 while ((line = reader.readLine()) != null && this.running) {
44 sb.append(line);
45 sb.append("\n");
46 }
47 } catch (IOException e) {
48 //swallow ioe
49 }
50 }
51
52 public void stopGobblingAndDie() {
53 running = false;
54 IOUtils.closeQuietly(reader);
55 }
56
57 @Override
58 public String toString() {
59 return sb.toString();
60 }
61
62 }
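`StringStreamGobbler` exists because a child process's stdout/stderr pipes have small OS buffers: if the parent never drains them, the child can block forever on a write. The pattern is to drain each stream on its own thread while the parent waits. A self-contained sketch of the same idea (reading from an in-memory stream rather than a real process, so it stands alone):

```java
import java.io.BufferedReader;
import java.io.ByteArrayInputStream;
import java.io.IOException;
import java.io.InputStream;
import java.io.InputStreamReader;
import java.nio.charset.StandardCharsets;

public class GobblerDemo implements Runnable {
    private final BufferedReader reader;
    private final StringBuilder sb = new StringBuilder();

    public GobblerDemo(InputStream is) {
        this.reader = new BufferedReader(
                new InputStreamReader(is, StandardCharsets.UTF_8));
    }

    @Override
    public void run() {
        try {
            String line;
            while ((line = reader.readLine()) != null) {
                sb.append(line).append('\n');
            }
        } catch (IOException e) {
            // stream closed underneath us: stop gobbling
        }
    }

    // Drain the stream on a separate thread and return what was read.
    public static String gobble(InputStream is) throws InterruptedException {
        GobblerDemo g = new GobblerDemo(is);
        Thread t = new Thread(g);
        t.start();
        t.join(); // wait for end-of-stream
        return g.sb.toString();
    }

    public static void main(String[] args) throws Exception {
        InputStream in = new ByteArrayInputStream(
                "out\nerr\n".getBytes(StandardCharsets.UTF_8));
        System.out.print(gobble(in)); // echoes the two lines back
    }
}
```

With a real `Process`, one gobbler thread is attached to `getInputStream()` and another to `getErrorStream()`, exactly as `BatchProcessTestExecutor.execute()` does above.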
0 package org.apache.tika.batch.fs.strawman;
1 /*
2 * Licensed to the Apache Software Foundation (ASF) under one or more
3 * contributor license agreements. See the NOTICE file distributed with
4 * this work for additional information regarding copyright ownership.
5 * The ASF licenses this file to You under the Apache License, Version 2.0
6 * (the "License"); you may not use this file except in compliance with
7 * the License. You may obtain a copy of the License at
8 *
9 * http://www.apache.org/licenses/LICENSE-2.0
10 *
11 * Unless required by applicable law or agreed to in writing, software
12 * distributed under the License is distributed on an "AS IS" BASIS,
13 * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
14 * See the License for the specific language governing permissions and
15 * limitations under the License.
16 */
17 import org.junit.Test;
18
19 public class StrawmanTest {
20 //TODO: actually write some tests!!!
21 @Test
22 public void basicTest() {
23
24 }
25 }
0 package org.apache.tika.batch.mock;
1
2 /*
3 * Licensed to the Apache Software Foundation (ASF) under one or more
4 * contributor license agreements. See the NOTICE file distributed with
5 * this work for additional information regarding copyright ownership.
6 * The ASF licenses this file to You under the Apache License, Version 2.0
7 * (the "License"); you may not use this file except in compliance with
8 * the License. You may obtain a copy of the License at
9 *
10 * http://www.apache.org/licenses/LICENSE-2.0
11 *
12 * Unless required by applicable law or agreed to in writing, software
13 * distributed under the License is distributed on an "AS IS" BASIS,
14 * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
15 * See the License for the specific language governing permissions and
16 * limitations under the License.
17 */
18 import java.util.Map;
19 import java.util.concurrent.ArrayBlockingQueue;
20
21 import org.apache.tika.batch.ConsumersManager;
22 import org.apache.tika.batch.FileResource;
23 import org.apache.tika.batch.fs.builders.BasicTikaFSConsumersBuilder;
24 import org.w3c.dom.Node;
25
26 public class MockConsumersBuilder extends BasicTikaFSConsumersBuilder {
27
28 @Override
29 public ConsumersManager build(Node node, Map<String, String> runtimeAttributes,
30 ArrayBlockingQueue<FileResource> queue) {
31 ConsumersManager manager = super.build(node, runtimeAttributes, queue);
32
33 boolean hangOnInit = runtimeAttributes.containsKey("hangOnInit");
34 boolean hangOnShutdown = runtimeAttributes.containsKey("hangOnShutdown");
35 return new MockConsumersManager(manager, hangOnInit, hangOnShutdown);
36 }
37 }
0 package org.apache.tika.batch.mock;
1
2 import org.apache.tika.batch.ConsumersManager;
3
4 /*
5 * Licensed to the Apache Software Foundation (ASF) under one or more
6 * contributor license agreements. See the NOTICE file distributed with
7 * this work for additional information regarding copyright ownership.
8 * The ASF licenses this file to You under the Apache License, Version 2.0
9 * (the "License"); you may not use this file except in compliance with
10 * the License. You may obtain a copy of the License at
11 *
12 * http://www.apache.org/licenses/LICENSE-2.0
13 *
14 * Unless required by applicable law or agreed to in writing, software
15 * distributed under the License is distributed on an "AS IS" BASIS,
16 * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
17 * See the License for the specific language governing permissions and
18 * limitations under the License.
19 */
20
21 public class MockConsumersManager extends ConsumersManager {
22
23 private static final long HANG_MS = 30000;
24
25 private final ConsumersManager wrapped;
26 private final boolean hangOnInit;
27 private final boolean hangOnClose;
28
29 public MockConsumersManager(ConsumersManager wrapped, boolean hangOnInit,
30 boolean hangOnClose) {
31 super(wrapped.getConsumers());
32 this.wrapped = wrapped;
33 this.hangOnInit = hangOnInit;
34 this.hangOnClose = hangOnClose;
35 }
36
37
38 @Override
39 public void init() {
40 if (hangOnInit) {
41 //interruptible light hang
42 try {
43 Thread.sleep(HANG_MS);
44 } catch (InterruptedException e) {
45 return;
46 }
47 return;
48 }
49 super.init();
50 }
51
52 @Override
53 public void shutdown() {
54 if (hangOnClose) {
55 //interruptible light hang
56 try {
57 Thread.sleep(HANG_MS);
58 } catch (InterruptedException e) {
59 return;
60 }
61 return;
62 }
63 super.shutdown();
64 }
65
66 @Override
67 public long getConsumersManagerMaxMillis() {
68 return wrapped.getConsumersManagerMaxMillis();
69 }
70
71 @Override
72 public void setConsumersManagerMaxMillis(long millis) {
73 wrapped.setConsumersManagerMaxMillis(millis);
74 }
75
76 }
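The "interruptible light hang" above relies on `Thread.sleep()` responding to `Thread.interrupt()`: the driver can cancel the simulated hang without killing the JVM. A minimal standalone sketch of that mechanism (names are illustrative, not Tika API):

```java
public class HangDemo {
    // Simulate an interruptible "hang": sleep until the timeout
    // elapses or the thread is interrupted, whichever comes first.
    public static long hangUntilInterrupted(long maxMillis) {
        long start = System.currentTimeMillis();
        try {
            Thread.sleep(maxMillis);
        } catch (InterruptedException e) {
            // interrupted: abandon the hang early
        }
        return System.currentTimeMillis() - start;
    }

    public static void main(String[] args) throws Exception {
        Thread t = new Thread(() -> hangUntilInterrupted(30000));
        t.start();
        Thread.sleep(200); // let the hang begin
        t.interrupt();     // cancel it
        t.join(5000);
        System.out.println(t.isAlive()); // false: the hang was interruptible
    }
}
```

Note that an interrupt delivered before `sleep()` starts is not lost: `sleep()` throws `InterruptedException` immediately if the interrupt flag is already set, which is why this pattern has no startup race.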
0 package org.apache.tika.parser.mock;
1
2 /*
3 * Licensed to the Apache Software Foundation (ASF) under one or more
4 * contributor license agreements. See the NOTICE file distributed with
5 * this work for additional information regarding copyright ownership.
6 * The ASF licenses this file to You under the Apache License, Version 2.0
7 * (the "License"); you may not use this file except in compliance with
8 * the License. You may obtain a copy of the License at
9 *
10 * http://www.apache.org/licenses/LICENSE-2.0
11 *
12 * Unless required by applicable law or agreed to in writing, software
13 * distributed under the License is distributed on an "AS IS" BASIS,
14 * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
15 * See the License for the specific language governing permissions and
16 * limitations under the License.
17 */
18
19 import org.apache.tika.batch.ParserFactory;
20 import org.apache.tika.config.TikaConfig;
21 import org.apache.tika.parser.Parser;
22
23 public class MockParserFactory implements ParserFactory {
24 @Override
25 public Parser getParser(TikaConfig config) {
26 return new MockParser();
27 }
28 }
0 package org.apache.tika.util;
1
2 import static org.junit.Assert.assertEquals;
3
4 import org.apache.tika.TikaTest;
5 import org.junit.Test;
6
7 /*
8 * Licensed to the Apache Software Foundation (ASF) under one or more
9 * contributor license agreements. See the NOTICE file distributed with
10 * this work for additional information regarding copyright ownership.
11 * The ASF licenses this file to You under the Apache License, Version 2.0
12 * (the "License"); you may not use this file except in compliance with
13 * the License. You may obtain a copy of the License at
14 *
15 * http://www.apache.org/licenses/LICENSE-2.0
16 *
17 * Unless required by applicable law or agreed to in writing, software
18 * distributed under the License is distributed on an "AS IS" BASIS,
19 * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
20 * See the License for the specific language governing permissions and
21 * limitations under the License.
22 */
23 public class TikaExceptionFilterTest extends TikaTest {
24
25 @Test
26 public void simpleNPETest() {
27 TikaExceptionFilter filter = new TikaExceptionFilter();
28 Throwable t = null;
29 try {
30 getXML("null_pointer.xml");
31 } catch (Throwable t2) {
32 assertContains("Unexpected RuntimeException", t2.getMessage());
33 t = filter.filter(t2);
34 }
35 assertEquals("another null pointer exception", t.getMessage());
36 }
37
38 }
2424 <parent>
2525 <groupId>org.apache.tika</groupId>
2626 <artifactId>tika-parent</artifactId>
27 <version>1.6</version>
27 <version>1.8</version>
2828 <relativePath>../tika-parent/pom.xml</relativePath>
2929 </parent>
3030
4040 <url>http://tika.apache.org/</url>
4141
4242 <properties>
43 <pax.exam.version>2.2.0</pax.exam.version>
43 <pax.exam.version>4.4.0</pax.exam.version>
4444 </properties>
4545
4646 <dependencies>
6060 <dependency>
6161 <groupId>junit</groupId>
6262 <artifactId>junit</artifactId>
63 <scope>test</scope>
64 <version>4.11</version>
6563 </dependency>
6664 <dependency>
6765 <groupId>org.ops4j.pax.exam</groupId>
7876 <dependency>
7977 <groupId>org.apache.felix</groupId>
8078 <artifactId>org.apache.felix.framework</artifactId>
81 <version>4.0.1</version>
79 <version>4.6.0</version>
8280 <scope>test</scope>
8381 </dependency>
8482 <dependency>
9088 <dependency>
9189 <groupId>org.ops4j.pax.url</groupId>
9290 <artifactId>pax-url-aether</artifactId>
93 <version>1.3.3</version>
91 <version>2.3.0</version>
9492 <scope>test</scope>
9593 </dependency>
9694 <dependency>
9795 <groupId>org.slf4j</groupId>
9896 <artifactId>slf4j-simple</artifactId>
99 <version>1.6.1</version>
97 <scope>test</scope>
98 </dependency>
99 <dependency>
100 <groupId>javax.inject</groupId>
101 <artifactId>javax.inject</artifactId>
102 <version>1</version>
103 <scope>test</scope>
104 </dependency>
105 <dependency>
106 <groupId>org.osgi</groupId>
107 <artifactId>org.osgi.core</artifactId>
108 <version>5.0.0</version>
100109 <scope>test</scope>
101110 </dependency>
102111 </dependencies>
114123 </Bundle-Activator>
115124 <Embed-Dependency>
116125 tika-parsers;inline=true,
117 commons-compress, xz, commons-codec,
118 pdfbox,fontbox,jempbox,bcmail-jdk15,bcprov-jdk15,
126 commons-compress, xz, commons-codec, commons-csv, junrar,
127 pdfbox,fontbox,jempbox,bcmail-jdk15on,bcprov-jdk15on,bcpkix-jdk15on,
119128 poi,poi-scratchpad,poi-ooxml,poi-ooxml-schemas,
120 xmlbeans, dom4j,
129 xmlbeans,
121130 tagsoup,
122131 asm-debug-all,
123132 juniversalchardet,
127136 boilerpipe, rome,
128137 apache-mime4j-core, apache-mime4j-dom,
129138 jhighlight, java-libpst,
130 netcdf, jcip-annotations, jmatio
139 netcdf4, grib, cdm, httpservices, jcip-annotations,
140 jmatio, guava
131141 </Embed-Dependency>
132142 <Embed-Transitive>true</Embed-Transitive>
133143 <Bundle-DocURL>${project.url}</Bundle-DocURL>
137147 org.apache.tika.parser.*
138148 </Export-Package>
139149 <Import-Package>
140 !org.junit,
150 !org.junit,
141151 *,
152 org.apache.tika.fork,
142153 android.util;resolution:=optional,
143154 com.adobe.xmp;resolution:=optional,
144155 com.adobe.xmp.properties;resolution:=optional,
173184 org.apache.commons.httpclient.params;resolution:=optional,
174185 org.apache.commons.httpclient.protocol;resolution:=optional,
175186 org.apache.commons.httpclient.util;resolution:=optional,
187 org.apache.commons.vfs2;resolution:=optional,
188 org.apache.commons.vfs2.provider;resolution:=optional,
189 org.apache.commons.vfs2.util;resolution:=optional,
176190 org.apache.crimson.jaxp;resolution:=optional,
191 org.apache.jcp.xml.dsig.internal.dom;resolution:=optional,
177192 org.apache.tools.ant;resolution:=optional,
178193 org.apache.tools.ant.taskdefs;resolution:=optional,
179194 org.apache.tools.ant.types;resolution:=optional,
183198 org.apache.xerces.xni.parser;resolution:=optional,
184199 org.apache.xml.resolver;resolution:=optional,
185200 org.apache.xml.resolver.tools;resolution:=optional,
201 org.apache.xml.security;resolution:=optional,
202 org.apache.xml.security.c14n;resolution:=optional,
203 org.apache.xml.security.utils;resolution:=optional,
186204 org.apache.xmlbeans.impl.xpath.saxon;resolution:=optional,
187205 org.apache.xmlbeans.impl.xquery.saxon;resolution:=optional,
206 org.bouncycastle.cert;resolution:=optional,
207 org.bouncycastle.cert.jcajce;resolution:=optional,
208 org.bouncycastle.cert.ocsp;resolution:=optional,
209 org.bouncycastle.cms.bc;resolution:=optional,
210 org.bouncycastle.operator;resolution:=optional,
211 org.bouncycastle.operator.bc;resolution:=optional,
212 org.bouncycastle.tsp;resolution:=optional,
188213 org.cyberneko.html.xercesbridge;resolution:=optional,
214 org.etsi.uri.x01903.v14;resolution:=optional,
215 org.ibex.nestedvm;resolution:=optional,
189216 org.gjt.xpp;resolution:=optional,
190217 org.jaxen;resolution:=optional,
191218 org.jaxen.dom4j;resolution:=optional,
194221 org.jdom;resolution:=optional,
195222 org.jdom.input;resolution:=optional,
196223 org.jdom.output;resolution:=optional,
224 org.jdom2;resolution:=optional,
225 org.jdom2.input;resolution:=optional,
226 org.jdom2.output;resolution:=optional,
197227 org.openxmlformats.schemas.officeDocument.x2006.math;resolution:=optional,
198228 org.openxmlformats.schemas.schemaLibrary.x2006.main;resolution:=optional,
199229 org.osgi.framework;resolution:=optional,
200230 org.quartz;resolution:=optional,
201231 org.quartz.impl;resolution:=optional,
232 org.slf4j;resolution:=optional,
233 org.sqlite;resolution:=optional,
202234 org.w3c.dom;resolution:=optional,
203235 org.relaxng.datatype;resolution:=optional,
204236 org.xml.sax;resolution:=optional,
208240 schemasMicrosoftComOfficePowerpoint;resolution:=optional,
209241 schemasMicrosoftComOfficeWord;resolution:=optional,
210242 sun.misc;resolution:=optional,
243 ucar.units;resolution:=optional,
244 ucar.httpservices;resolution:=optional,
245 ucar.nc2.util;resolution:=optional,
246 ucar.nc2.util.cache;resolution:=optional,
247 ucar.nc2.dataset;resolution:=optional,
248 ucar.nc2;resolution:=optional,
249 ucar.nc2.constants;resolution:=optional,
250 ucar.nc2.dt;resolution:=optional,
251 ucar.nc2.dt.grid;resolution:=optional,
252 ucar.nc2.ft;resolution:=optional,
253 ucar.nc2.iosp;resolution:=optional,
254 ucar.nc2.iosp.hdf4;resolution:=optional,
255 ucar.nc2.ncml;resolution:=optional,
256 ucar.nc2.stream;resolution:=optional,
257 ucar.nc2.time;resolution:=optional,
258 ucar.nc2.units;resolution:=optional,
259 ucar.nc2.wmo;resolution:=optional,
260 ucar.nc2.write;resolution:=optional,
261 ucar.ma2;resolution:=optional,
211262 ucar.grib;resolution:=optional,
212263 ucar.grib.grib1;resolution:=optional,
213264 ucar.grib.grib2;resolution:=optional,
223274 visad.data;resolution:=optional,
224275 visad.data.vis5d;resolution:=optional,
225276 visad.jmet;resolution:=optional,
226 visad.util;resolution:=optional
277 visad.util;resolution:=optional,
278 colorspace;resolution:=optional,
279 com.sun.jna;resolution:=optional,
280 com.sun.jna.ptr;resolution:=optional,
281 icc;resolution:=optional,
282 jj2000.j2k.codestream;resolution:=optional,
283 jj2000.j2k.codestream.reader;resolution:=optional,
284 jj2000.j2k.decoder;resolution:=optional,
285 jj2000.j2k.entropy.decoder;resolution:=optional,
286 jj2000.j2k.fileformat.reader;resolution:=optional,
287 jj2000.j2k.image;resolution:=optional,
288 jj2000.j2k.image.invcomptransf;resolution:=optional,
289 jj2000.j2k.image.output;resolution:=optional,
290 jj2000.j2k.io;resolution:=optional,
291 jj2000.j2k.quantization.dequantizer;resolution:=optional,
292 jj2000.j2k.roi;resolution:=optional,
293 jj2000.j2k.util;resolution:=optional,
294 jj2000.j2k.wavelet.synthesis;resolution:=optional,
295 org.itadaki.bzip2;resolution:=optional,
296 org.jsoup;resolution:=optional,
297 org.jsoup.nodes;resolution:=optional,
298 org.jsoup.select;resolution:=optional,
299 thredds.featurecollection;resolution:=optional,
300 thredds.filesystem;resolution:=optional,
301 thredds.inventory;resolution:=optional,
302 thredds.inventory.filter;resolution:=optional,
303 thredds.inventory.partition;resolution:=optional,
304 com.beust.jcommander;resolution:=optional,
305 com.google.common.base;resolution:=optional,
306 com.google.common.math;resolution:=optional,
307 org.apache.http;resolution:=optional,
308 org.joda.time;resolution:=optional,
309 org.joda.time.chrono;resolution:=optional,
310 org.joda.time.field;resolution:=optional,
311 org.joda.time.format;resolution:=optional,
312 sun.reflect.generics.reflectiveObjects;resolution:=optional,
313 org.apache.http.auth;resolution:=optional,
314 org.apache.http.client;resolution:=optional,
315 org.apache.http.client.entity;resolution:=optional,
316 org.apache.http.client.methods;resolution:=optional,
317 org.apache.http.conn;resolution:=optional,
318 org.apache.http.conn.scheme;resolution:=optional,
319 org.apache.http.cookie;resolution:=optional,
320 org.apache.http.entity;resolution:=optional,
321 org.apache.http.impl.client;resolution:=optional,
322 org.apache.http.impl.conn;resolution:=optional,
323 org.apache.http.message;resolution:=optional,
324 org.apache.http.params;resolution:=optional,
325 org.apache.http.protocol;resolution:=optional,
326 org.apache.http.util;resolution:=optional
227327 </Import-Package>
228328 </instructions>
229329 </configuration>
251351 </configuration>
252352 </execution>
253353 </executions>
354 </plugin>
355
356 <!-- The Tika Bundle has no java code of its own, so no need to do -->
357 <!-- any forbidden API checking against it (it gets confused...) -->
358 <plugin>
359 <groupId>de.thetaphi</groupId>
360 <artifactId>forbiddenapis</artifactId>
361 <configuration>
362 <skip>true</skip>
363 </configuration>
254364 </plugin>
255365 </plugins>
256366 </build>
291401 </execution>
292402 </executions>
293403 <configuration>
294 <systemPropertyVariables>
295 <org.ops4j.pax.logging.DefaultServiceLog.level>
296 WARN
297 </org.ops4j.pax.logging.DefaultServiceLog.level>
298 </systemPropertyVariables>
404 <systemPropertyVariables>
405 <org.ops4j.pax.logging.DefaultServiceLog.level>
406 WARN
407 </org.ops4j.pax.logging.DefaultServiceLog.level>
408 </systemPropertyVariables>
299409 </configuration>
300410 </plugin>
301411 </plugins>
304414 </profiles>
305415
306416 <organization>
307 <name>The Apache Software Foundation</name>
308 <url>http://www.apache.org</url>
417 <name>The Apache Software Foundation</name>
418 <url>http://www.apache.org</url>
309419 </organization>
310420 <scm>
311 <url>http://svn.apache.org/viewvc/tika/tags/1.6/tika-bundle</url>
312 <connection>scm:svn:http://svn.apache.org/repos/asf/tika/tags/1.6/tika-bundle</connection>
313 <developerConnection>scm:svn:https://svn.apache.org/repos/asf/tika/tags/1.6/tika-bundle</developerConnection>
421 <url>http://svn.apache.org/viewvc/tika/tags/1.8-rc2/tika-bundle</url>
422 <connection>scm:svn:http://svn.apache.org/repos/asf/tika/tags/1.8-rc2/tika-bundle</connection>
423 <developerConnection>scm:svn:https://svn.apache.org/repos/asf/tika/tags/1.8-rc2/tika-bundle</developerConnection>
314424 </scm>
315425 <issueManagement>
316 <system>JIRA</system>
317 <url>https://issues.apache.org/jira/browse/TIKA</url>
426 <system>JIRA</system>
427 <url>https://issues.apache.org/jira/browse/TIKA</url>
318428 </issueManagement>
319429 <ciManagement>
320 <system>Jenkins</system>
321 <url>https://builds.apache.org/job/Tika-trunk/</url>
430 <system>Jenkins</system>
431 <url>https://builds.apache.org/job/Tika-trunk/</url>
322432 </ciManagement>
323433 </project>
1919 import static org.junit.Assert.assertNotNull;
2020 import static org.junit.Assert.assertTrue;
2121 import static org.junit.Assert.fail;
22 import static org.ops4j.pax.exam.CoreOptions.bundle;
23 import static org.ops4j.pax.exam.CoreOptions.junitBundles;
24
22 import static org.ops4j.pax.exam.CoreOptions.*;
23
24 import java.io.ByteArrayInputStream;
2525 import java.io.File;
2626 import java.io.FileInputStream;
2727 import java.io.IOException;
2828 import java.io.InputStream;
29 import java.io.StringWriter;
30 import java.io.Writer;
2931 import java.net.URISyntaxException;
3032 import java.util.HashSet;
3133 import java.util.List;
3234 import java.util.Set;
35
36 import javax.inject.Inject;
3337
3438 import org.apache.tika.Tika;
3539 import org.apache.tika.config.ServiceLoader;
3640 import org.apache.tika.config.TikaConfig;
3741 import org.apache.tika.detect.DefaultDetector;
3842 import org.apache.tika.detect.Detector;
43 import org.apache.tika.fork.ForkParser;
3944 import org.apache.tika.metadata.Metadata;
45 import org.apache.tika.mime.MediaType;
4046 import org.apache.tika.parser.ParseContext;
4147 import org.apache.tika.parser.Parser;
48 import org.apache.tika.parser.internal.Activator;
4249 import org.apache.tika.sax.BodyContentHandler;
4350 import org.junit.Ignore;
4451 import org.junit.Test;
4552 import org.junit.runner.RunWith;
46 import org.ops4j.pax.exam.CoreOptions;
53 import org.ops4j.pax.exam.Configuration;
4754 import org.ops4j.pax.exam.Option;
48 import org.ops4j.pax.exam.junit.Configuration;
49 import org.ops4j.pax.exam.junit.JUnit4TestRunner;
55 import org.ops4j.pax.exam.junit.PaxExam;
56 import org.ops4j.pax.exam.spi.reactors.ExamReactorStrategy;
57 import org.ops4j.pax.exam.spi.reactors.PerMethod;
5058 import org.osgi.framework.Bundle;
5159 import org.osgi.framework.BundleContext;
5260 import org.xml.sax.ContentHandler;
5361
54 @RunWith( JUnit4TestRunner.class )
62 @RunWith(PaxExam.class)
63 @ExamReactorStrategy(PerMethod.class)
5564 public class BundleIT {
65
5666 private final File TARGET = new File("target");
57
67
68 @Inject
69 private Parser defaultParser;
70 @Inject
71 private Detector contentTypeDetector;
72 @Inject
73 private BundleContext bc;
74
5875 @Configuration
5976 public Option[] configuration() throws IOException, URISyntaxException {
6077 File base = new File(TARGET, "test-bundles");
61 return CoreOptions.options(
78 return options(
6279 junitBundles(),
6380 bundle(new File(base, "tika-core.jar").toURI().toURL().toString()),
6481 bundle(new File(base, "tika-bundle.jar").toURI().toURL().toString()));
6582 }
66
67 @Test
68 public void testBundleLoaded(BundleContext bc) throws Exception {
83
84
85 @Test
86 public void testBundleLoaded() throws Exception {
6987 boolean hasCore = false, hasBundle = false;
7088 for (Bundle b : bc.getBundles()) {
7189 if ("org.apache.tika.core".equals(b.getSymbolicName())) {
8098 assertTrue("Core bundle not found", hasCore);
8199 assertTrue("Bundle bundle not found", hasBundle);
82100 }
83
84 @Test
85 public void testBundleDetection(BundleContext bc) throws Exception {
101
102
103 @Test
104 public void testBundleDetection() throws Exception {
86105 Tika tika = new Tika();
87106
88107 // Simple type detection
90109 assertEquals("application/pdf", tika.detect("test.pdf"));
91110 }
92111
112
113 @Test
114 public void testForkParser() throws Exception {
115 ForkParser parser = new ForkParser(Activator.class.getClassLoader(), defaultParser);
116 String data = "<!DOCTYPE html>\n<html><body><p>test <span>content</span></p></body></html>";
117 InputStream stream = new ByteArrayInputStream(data.getBytes("UTF-8"));
118 Writer writer = new StringWriter();
119 ContentHandler contentHandler = new BodyContentHandler(writer);
120 Metadata metadata = new Metadata();
121 MediaType type = contentTypeDetector.detect(stream, metadata);
122 assertEquals("text/html", type.toString());
123 metadata.add(Metadata.CONTENT_TYPE, type.toString());
124 ParseContext parseCtx = new ParseContext();
125 parser.parse(stream, contentHandler, metadata, parseCtx);
126 writer.flush();
127 String content = writer.toString();
128 assertTrue(content.length() > 0);
129 assertEquals("test content", content.trim());
130 }
131
132
93133 @Ignore // TODO Fix this test
94134 @Test
95 public void testBundleSimpleText(BundleContext bc) throws Exception {
135 public void testBundleSimpleText() throws Exception {
96136 Tika tika = new Tika();
97
137
98138 // Simple text extraction
99139 String xml = tika.parseToString(new File("pom.xml"));
100140 assertTrue(xml.contains("tika-bundle"));
101141 }
102
142
143
103144 @Ignore // TODO Fix this test
104145 @Test
105 public void testBundleDetectors(BundleContext bc) throws Exception {
146 public void testBundleDetectors() throws Exception {
106147 // Get the raw detectors list
107148 // TODO Why is this not finding the detector service resource files?
108149 TestingServiceLoader loader = new TestingServiceLoader();
109150 List<String> rawDetectors = loader.identifyStaticServiceProviders(Detector.class);
110
151
111152 // Check we did get a few, just in case...
112153 assertNotNull(rawDetectors);
113154 assertTrue("Should have several Detector names, found " + rawDetectors.size(),
114155 rawDetectors.size() > 3);
115
156
116157 // Get the classes found within OSGi
117158 DefaultDetector detector = new DefaultDetector();
118159 Set<String> osgiDetectors = new HashSet<String>();
119160 for (Detector d : detector.getDetectors()) {
120161 osgiDetectors.add(d.getClass().getName());
121162 }
122
163
123164 // Check that OSGi didn't miss any
124165 for (String detectorName : rawDetectors) {
125166 if (!osgiDetectors.contains(detectorName)) {
126 fail("Detector " + detectorName +
127 " not found within OSGi Detector list: " + osgiDetectors);
167 fail("Detector " + detectorName
168 + " not found within OSGi Detector list: " + osgiDetectors);
128169 }
129170 }
130171 }
131
132 @Test
133 public void testBundleParsers(BundleContext bc) throws Exception {
172
173
174 @Test
175 public void testBundleParsers() throws Exception {
134176 TikaConfig tika = new TikaConfig();
135177
136178 // TODO Implement as with Detectors
137179 }
138
180
181
139182 @Ignore // TODO Fix this test
140183 @Test
141 public void testTikaBundle(BundleContext bc) throws Exception {
184 public void testTikaBundle() throws Exception {
142185 Tika tika = new Tika();
143186
144187 // Package extraction
148191 ParseContext context = new ParseContext();
149192 context.set(Parser.class, parser);
150193
151 InputStream stream =
152 new FileInputStream("src/test/resources/test-documents.zip");
194 InputStream stream
195 = new FileInputStream("src/test/resources/test-documents.zip");
153196 try {
154197 parser.parse(stream, handler, new Metadata(), context);
155198 } finally {
178221 }
179222
180223 /**
181 * Alternate ServiceLoader which works outside of OSGi, so we
182 * can compare between the two environments
224 * Alternate ServiceLoader which works outside of OSGi, so we can compare between the two environments
183225 */
184226 private static class TestingServiceLoader extends ServiceLoader {
227
185228 private TestingServiceLoader() {
186229 super();
187230 }
231
232
188233 public <T> List<String> identifyStaticServiceProviders(Class<T> iface) {
189234 return super.identifyStaticServiceProviders(iface);
190235 }
2424 <parent>
2525 <groupId>org.apache.tika</groupId>
2626 <artifactId>tika-parent</artifactId>
27 <version>1.6</version>
27 <version>1.8</version>
2828 <relativePath>../tika-parent/pom.xml</relativePath>
2929 </parent>
3030
5959 <dependency>
6060 <groupId>junit</groupId>
6161 <artifactId>junit</artifactId>
62 <scope>test</scope>
63 <version>4.11</version>
6462 </dependency>
6563 </dependencies>
6664
153151 </plugins>
154152 </build>
155153
156 <description>This is the core Apache Tika™ toolkit library from which all other modules inherit functionality. It also includes the core facades for the Tika API. </description>
154 <description>This is the core Apache Tika™ toolkit library from which all other modules inherit functionality. It also
155 includes the core facades for the Tika API.
156 </description>
157157 <organization>
158 <name>The Apache Software Foundation</name>
159 <url>http://www.apache.org</url>
158 <name>The Apache Software Foundation</name>
159 <url>http://www.apache.org</url>
160160 </organization>
161161 <scm>
162 <url>http://svn.apache.org/viewvc/tika/tags/1.6/core</url>
163 <connection>scm:svn:http://svn.apache.org/repos/asf/tika/tags/1.6/core</connection>
164 <developerConnection>scm:svn:https://svn.apache.org/repos/asf/tika/tags/1.6/core</developerConnection>
162 <url>http://svn.apache.org/viewvc/tika/tags/1.8-rc2/core</url>
163 <connection>scm:svn:http://svn.apache.org/repos/asf/tika/tags/1.8-rc2/core</connection>
164 <developerConnection>scm:svn:https://svn.apache.org/repos/asf/tika/tags/1.8-rc2/core</developerConnection>
165165 </scm>
166166 <issueManagement>
167 <system>JIRA</system>
168 <url>https://issues.apache.org/jira/browse/TIKA</url>
167 <system>JIRA</system>
168 <url>https://issues.apache.org/jira/browse/TIKA</url>
169169 </issueManagement>
170170 <ciManagement>
171 <system>Jenkins</system>
172 <url>https://builds.apache.org/job/Tika-trunk/</url>
171 <system>Jenkins</system>
172 <url>https://builds.apache.org/job/Tika-trunk/</url>
173173 </ciManagement>
174174 </project>
2828 import java.util.List;
2929 import java.util.Map;
3030 import java.util.regex.Pattern;
31 import org.apache.tika.io.IOUtils;
3132
3233 /**
3334 * Internal utility class that Tika uses to look up service providers.
315316 }
316317 }
317318 }
318
319319 return providers;
320320 }
321321
328328 InputStream stream = resource.openStream();
329329 try {
330330 BufferedReader reader =
331 new BufferedReader(new InputStreamReader(stream, "UTF-8"));
331 new BufferedReader(new InputStreamReader(stream, IOUtils.UTF_8));
332332 String line = reader.readLine();
333333 while (line != null) {
334334 line = COMMENT.matcher(line).replaceFirst("");
1919 import java.io.FileInputStream;
2020 import java.io.IOException;
2121 import java.io.InputStream;
22 import java.lang.reflect.Constructor;
23 import java.lang.reflect.InvocationTargetException;
2224 import java.net.URL;
2325 import java.util.ArrayList;
26 import java.util.Collection;
27 import java.util.Collections;
2428 import java.util.HashSet;
2529 import java.util.List;
2630 import java.util.Set;
330334 NodeList nodes = element.getElementsByTagName("parser");
331335 for (int i = 0; i < nodes.getLength(); i++) {
332336 Element node = (Element) nodes.item(i);
333 String name = node.getAttribute("class");
334
335 try {
336 Class<? extends Parser> parserClass =
337 loader.getServiceClass(Parser.class, name);
338 // https://issues.apache.org/jira/browse/TIKA-866
339 if (AutoDetectParser.class.isAssignableFrom(parserClass)) {
340 throw new TikaException(
341 "AutoDetectParser not supported in a <parser>"
342 + " configuration element: " + name);
343 }
344 Parser parser = parserClass.newInstance();
345
346 NodeList mimes = node.getElementsByTagName("mime");
347 if (mimes.getLength() > 0) {
348 Set<MediaType> types = new HashSet<MediaType>();
349 for (int j = 0; j < mimes.getLength(); j++) {
350 String mime = getText(mimes.item(j));
351 MediaType type = MediaType.parse(mime);
352 if (type != null) {
353 types.add(type);
354 } else {
355 throw new TikaException(
356 "Invalid media type name: " + mime);
357 }
358 }
359 parser = ParserDecorator.withTypes(parser, types);
360 }
361
362 parsers.add(parser);
363 } catch (ClassNotFoundException e) {
364 throw new TikaException(
365 "Unable to find a parser class: " + name, e);
366 } catch (IllegalAccessException e) {
367 throw new TikaException(
368 "Unable to access a parser class: " + name, e);
369 } catch (InstantiationException e) {
370 throw new TikaException(
371 "Unable to instantiate a parser class: " + name, e);
372 }
373 }
337 parsers.add(parserFromParserDomElement(node, mimeTypes, loader));
338 }
339
374340 if (parsers.isEmpty()) {
341 // No parsers defined, create a DefaultParser
375342 return getDefaultParser(mimeTypes, loader);
343 } else if (parsers.size() == 1 && parsers.get(0) instanceof CompositeParser) {
344 // Single Composite defined, use that
345 return (CompositeParser)parsers.get(0);
376346 } else {
347 // Wrap the defined parsers up in a Composite
377348 MediaTypeRegistry registry = mimeTypes.getMediaTypeRegistry();
378349 return new CompositeParser(registry, parsers);
379350 }
351 }
352 private static Parser parserFromParserDomElement(
353 Element parserNode, MimeTypes mimeTypes, ServiceLoader loader)
354 throws TikaException, IOException {
355 String name = parserNode.getAttribute("class");
356 Parser parser = null;
357
358 try {
359 Class<? extends Parser> parserClass =
360 loader.getServiceClass(Parser.class, name);
361 // https://issues.apache.org/jira/browse/TIKA-866
362 if (AutoDetectParser.class.isAssignableFrom(parserClass)) {
363 throw new TikaException(
364 "AutoDetectParser not supported in a <parser>"
365 + " configuration element: " + name);
366 }
367
368 // Is this a composite parser? If so, support recursion
369 if (CompositeParser.class.isAssignableFrom(parserClass)) {
370 // Get the child parsers for it
371 List<Parser> childParsers = new ArrayList<Parser>();
372 NodeList childParserNodes = parserNode.getElementsByTagName("parser");
373 if (childParserNodes.getLength() > 0) {
374 for (int i = 0; i < childParserNodes.getLength(); i++) {
375 childParsers.add(parserFromParserDomElement(
376 (Element)childParserNodes.item(i), mimeTypes, loader
377 ));
378 }
379 }
380
381 // Get the list of parsers to exclude
382 Set<Class<? extends Parser>> excludeParsers = new HashSet<Class<? extends Parser>>();
383 NodeList excludeParserNodes = parserNode.getElementsByTagName("parser-exclude");
384 if (excludeParserNodes.getLength() > 0) {
385 for (int i = 0; i < excludeParserNodes.getLength(); i++) {
386 Element excl = (Element)excludeParserNodes.item(i);
387 String exclName = excl.getAttribute("class");
388 excludeParsers.add(loader.getServiceClass(Parser.class, exclName));
389 }
390 }
391
392 // Create the Composite Parser
393 Constructor<? extends Parser> c = null;
394 if (c == null) {
395 try {
396 c = parserClass.getConstructor(MediaTypeRegistry.class, ServiceLoader.class, Collection.class);
397 parser = c.newInstance(mimeTypes.getMediaTypeRegistry(), loader, excludeParsers);
398 }
399 catch (NoSuchMethodException me) {}
400 }
401 if (c == null) {
402 try {
403 c = parserClass.getConstructor(MediaTypeRegistry.class, List.class, Collection.class);
404 parser = c.newInstance(mimeTypes.getMediaTypeRegistry(), childParsers, excludeParsers);
405 } catch (NoSuchMethodException me) {}
406 }
407 if (c == null) {
408 parser = parserClass.newInstance();
409 }
410 } else {
411 // Regular parser, create as-is
412 parser = parserClass.newInstance();
413 }
414
415 // Is there an explicit list of mime types for this to handle?
416 Set<MediaType> parserTypes = mediaTypesListFromDomElement(parserNode, "mime");
417 if (! parserTypes.isEmpty()) {
418 parser = ParserDecorator.withTypes(parser, parserTypes);
419 }
420 // Is there an explicit list of mime types this shouldn't handle?
421 Set<MediaType> parserExclTypes = mediaTypesListFromDomElement(parserNode, "mime-exclude");
422 if (! parserExclTypes.isEmpty()) {
423 parser = ParserDecorator.withoutTypes(parser, parserExclTypes);
424 }
425
426 // All done with setup
427 return parser;
428 } catch (ClassNotFoundException e) {
429 throw new TikaException(
430 "Unable to find a parser class: " + name, e);
431 } catch (IllegalAccessException e) {
432 throw new TikaException(
433 "Unable to access a parser class: " + name, e);
434 } catch (InvocationTargetException e) {
435 throw new TikaException(
436 "Unable to create a parser class: " + name, e);
437 } catch (InstantiationException e) {
438 throw new TikaException(
439 "Unable to instantiate a parser class: " + name, e);
440 }
441 }
442
443 private static Set<MediaType> mediaTypesListFromDomElement(
444 Element node, String tag)
445 throws TikaException, IOException {
446 NodeList mimes = node.getElementsByTagName(tag);
447 if (mimes.getLength() > 0) {
448 Set<MediaType> types = new HashSet<MediaType>();
449 for (int j = 0; j < mimes.getLength(); j++) {
450 String mime = getText(mimes.item(j));
451 MediaType type = MediaType.parse(mime);
452 if (type != null) {
453 types.add(type);
454 } else {
455 throw new TikaException(
456 "Invalid media type name: " + mime);
457 }
458 }
459 return types;
460 }
461 return Collections.emptySet();
380462 }
381463
382464 private static Detector detectorFromDomElement(
2121 import java.nio.ByteBuffer;
2222 import java.nio.CharBuffer;
2323 import java.nio.charset.Charset;
24 import java.util.Locale;
2425 import java.util.regex.Matcher;
2526 import java.util.regex.Pattern;
26
27 import org.apache.tika.io.IOUtils;
2728 import org.apache.tika.metadata.Metadata;
2829 import org.apache.tika.mime.MediaType;
2930
9495 || type.equals("unicodeBE")) {
9596 decoded = decodeString(value, type);
9697 } else if (type.equals("stringignorecase")) {
97 decoded = decodeString(value.toLowerCase(), type);
98 decoded = decodeString(value.toLowerCase(Locale.ROOT), type);
9899 } else if (type.equals("byte")) {
99 decoded = tmpVal.getBytes();
100 decoded = tmpVal.getBytes(IOUtils.UTF_8);
100101 } else if (type.equals("host16") || type.equals("little16")) {
101102 int i = Integer.parseInt(tmpVal, radix);
102103 decoded = new byte[] { (byte) (i & 0x00FF), (byte) (i >> 8) };
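The `host16`/`little16` branch above parses the magic value as an integer and emits it low byte first. A minimal standalone sketch of that little-endian byte order (the arithmetic is taken directly from the hunk; the class name is just for illustration):

```java
// Sketch of the little-endian 16-bit encoding used in the hunk above:
// the low byte is emitted first, then the high byte.
public class Little16 {
    static byte[] encodeLittle16(int i) {
        return new byte[] { (byte) (i & 0x00FF), (byte) (i >> 8) };
    }

    public static void main(String[] args) {
        byte[] b = encodeLittle16(0x1234);
        // low byte 0x34 first, then high byte 0x12
        System.out.printf("%02X %02X%n", b[0], b[1]);
    }
}
```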
392393 flags = Pattern.CASE_INSENSITIVE;
393394 }
394395
395 Pattern p = Pattern.compile(new String(this.pattern), flags);
396 Pattern p = Pattern.compile(new String(this.pattern, IOUtils.UTF_8), flags);
396397
397398 ByteBuffer bb = ByteBuffer.wrap(buffer);
398399 CharBuffer result = ISO_8859_1.decode(bb);
2121 import java.util.Map;
2222 import java.util.regex.Pattern;
2323
24 import org.apache.tika.io.IOUtils;
2425 import org.apache.tika.metadata.Metadata;
2526 import org.apache.tika.mime.MediaType;
2627
118119 int percent = name.indexOf('%');
119120 if (percent != -1) {
120121 try {
121 name = URLDecoder.decode(name, "UTF-8");
122 name = URLDecoder.decode(name, IOUtils.UTF_8.name());
122123 } catch (UnsupportedEncodingException e) {
123124 throw new IllegalStateException("UTF-8 not supported", e);
124125 }
2626 import org.apache.tika.sax.OfflineContentHandler;
2727 import org.xml.sax.Attributes;
2828 import org.xml.sax.SAXException;
29 import org.xml.sax.SAXNotRecognizedException;
2930 import org.xml.sax.helpers.DefaultHandler;
3031
3132 /**
4950 SAXParserFactory factory = SAXParserFactory.newInstance();
5051 factory.setNamespaceAware(true);
5152 factory.setValidating(false);
52 factory.setFeature(XMLConstants.FEATURE_SECURE_PROCESSING, true);
53 try {
54 factory.setFeature(XMLConstants.FEATURE_SECURE_PROCESSING, true);
55 } catch (SAXNotRecognizedException e) {
56 // TIKA-271 and TIKA-1000: Some XML parsers do not support the secure-processing
57 // feature, even though it's required by JAXP in Java 5. Ignoring
58 // the exception is fine here; deployments without this feature
59 // are inherently vulnerable to XML denial-of-service attacks.
60 }
5361 factory.newSAXParser().parse(
5462 new CloseShieldInputStream(stream),
5563 new OfflineContentHandler(handler));
412412 if (process.exitValue() != 0) {
413413 throw new TikaException("There was an error executing the command line" +
414414 "\nExecutable Command:\n\n" + cmd +
415 "\nExecutable Error:\n\n" + stdErrOutputStream.toString("UTF-8"));
415 "\nExecutable Error:\n\n" + stdErrOutputStream.toString(IOUtils.UTF_8.name()));
416416 }
417417 }
418418 }
0 /*
1 * Licensed to the Apache Software Foundation (ASF) under one or more
2 * contributor license agreements. See the NOTICE file distributed with
3 * this work for additional information regarding copyright ownership.
4 * The ASF licenses this file to You under the Apache License, Version 2.0
5 * (the "License"); you may not use this file except in compliance with
6 * the License. You may obtain a copy of the License at
7 *
8 * http://www.apache.org/licenses/LICENSE-2.0
9 *
10 * Unless required by applicable law or agreed to in writing, software
11 * distributed under the License is distributed on an "AS IS" BASIS,
12 * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
13 * See the License for the specific language governing permissions and
14 * limitations under the License.
15 */
16 package org.apache.tika.exception;
17
18 /**
19 * Exception to be thrown when a document does not allow content extraction.
20 * As of this writing, PDF documents are the only type of document that might
21 * cause this type of exception.
22 */
23 public class AccessPermissionException extends TikaException {
24 public AccessPermissionException() {
25 super("Unable to process: content extraction is not allowed");
26 }
27
28 public AccessPermissionException(Throwable th) {
29 super("Unable to process: content extraction is not allowed", th);
30 }
31
32 public AccessPermissionException(String info) {
33 super(info);
34 }
35
36 public AccessPermissionException(String info, Throwable th) {
37 super(info, th);
38 }
39 }
2020 import java.io.IOException;
2121 import java.io.InputStream;
2222
23 import org.apache.tika.exception.EncryptedDocumentException;
2324 import org.apache.tika.exception.TikaException;
2425 import org.apache.tika.io.CloseShieldInputStream;
2526 import org.apache.tika.io.TemporaryResources;
102103 newStream,
103104 new EmbeddedContentHandler(new BodyContentHandler(handler)),
104105 metadata, context);
106 } catch (EncryptedDocumentException ede) {
107 // TODO: can we log a warning that we lack the password?
108 // For now, just skip the content
105109 } catch (TikaException e) {
106110 // TODO: can we log a warning somehow?
107111 // Could not parse the entry, just skip the content
2323 import java.io.InputStream;
2424 import java.io.NotSerializableException;
2525 import java.util.ArrayList;
26 import java.util.Arrays;
2726 import java.util.List;
2827 import java.util.jar.JarEntry;
2928 import java.util.jar.JarOutputStream;
5049
5150 private final InputStream error;
5251
53 public ForkClient(ClassLoader loader, Object object, String java)
52 public ForkClient(ClassLoader loader, Object object, List<String> java)
5453 throws IOException, TikaException {
5554 boolean ok = false;
5655 try {
5958
6059 ProcessBuilder builder = new ProcessBuilder();
6160 List<String> command = new ArrayList<String>();
62 command.addAll(Arrays.asList(java.split("\\s+")));
61 command.addAll(java);
6362 command.add("-jar");
6463 command.add(jar.getPath());
6564 builder.command(command);
262261 String manifest =
263262 "Main-Class: " + ForkServer.class.getName() + "\n";
264263 jar.putNextEntry(new ZipEntry("META-INF/MANIFEST.MF"));
265 jar.write(manifest.getBytes("UTF-8"));
264 jar.write(manifest.getBytes(IOUtils.UTF_8));
266265
267266 Class<?>[] bootstrap = {
268267 ForkServer.class, ForkObjectInputStream.class,
1717
1818 import java.io.IOException;
1919 import java.io.InputStream;
20 import java.util.ArrayList;
21 import java.util.Arrays;
22 import java.util.Collections;
2023 import java.util.LinkedList;
24 import java.util.List;
2125 import java.util.Queue;
2226 import java.util.Set;
2327
4246 private final Parser parser;
4347
4448 /** Java command line */
45 private String java = "java -Xmx32m";
49 private List<String> java = Arrays.asList("java", "-Xmx32m");
4650
4751 /** Process pool size */
4852 private int poolSize = 5;
9498 * Returns the command used to start the forked server process.
9599 *
96100 * @return java command line
97 */
101 * @deprecated since 1.8
102 * @see ForkParser#getJavaCommandAsList()
103 */
104 @Deprecated
98105 public String getJavaCommand() {
99 return java;
106 StringBuilder sb = new StringBuilder();
107 for (String part : getJavaCommandAsList()) {
108 sb.append(part).append(' ');
109 }
110 sb.deleteCharAt(sb.length() - 1);
111 return sb.toString();
112 }
113
114 /**
115 * Returns the command used to start the forked server process.
116 * <p/>
117 * The returned list is unmodifiable.
118 * @return java command line args
119 */
120 public List<String> getJavaCommandAsList() {
121 return Collections.unmodifiableList(java);
122 }
123
124 /**
125 * Sets the command used to start the forked server process.
126 * The arguments "-jar" and "/path/to/bootstrap.jar" are
127 * appended to the given command when starting the process.
128 * The default setting is {"java", "-Xmx32m"}.
129 * <p/>
130 * Creates a defensive copy.
131 * @param java java command line
132 */
133 public void setJavaCommand(List<String> java) {
134 this.java = new ArrayList<String>(java);
100135 }
101136
102137 /**
103138 * Sets the command used to start the forked server process.
104139 * The given command line is split on whitespace and the arguments
105 * "-jar" and "/path/to/bootstrap.jar" are appended to it when starting
106 * the process. The default setting is "java -Xmx32m".
140 * "-jar" and "/path/to/bootstrap.jar" are appended to it when starting
141 * the process. The default setting is "java -Xmx32m".
107142 *
108143 * @param java java command line
109 */
144 * @deprecated since 1.8
145 * @see ForkParser#setJavaCommand(List)
146 */
147 @Deprecated
110148 public void setJavaCommand(String java) {
111 this.java = java;
149 setJavaCommand(Arrays.asList(java.split(" ")));
112150 }
113151
114152 public Set<MediaType> getSupportedTypes(ParseContext context) {
169169 (ch8 << 0);
170170 }
171171
172 /**
173 * Reads an integer stored in a UTF-8-like fashion: big endian, 7 bits per byte,
174 * with the high bit of each byte indicating whether another byte follows
175 */
176 public static long readUE7(InputStream stream) throws IOException {
177 int i;
178 long v = 0;
179 while ((i = stream.read()) >= 0) {
180 v = v << 7;
181 if ((i & 128) == 128) {
182 // Continues
183 v += (i&127);
184 } else {
185 // Last value
186 v += i;
187 break;
188 }
189 }
190 return v;
191 }
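The variable-length decoding added above can be exercised with a small standalone sketch; the class name `UE7Demo` is mine, but the decoding loop mirrors the `readUE7` method in this hunk:

```java
import java.io.ByteArrayInputStream;
import java.io.IOException;
import java.io.InputStream;

public class UE7Demo {
    // Decodes the 7-bit variable-length encoding: each byte with the high
    // bit set contributes its low 7 bits and signals that more bytes follow;
    // a byte with the high bit clear is the final byte of the value.
    public static long readUE7(InputStream stream) throws IOException {
        int i;
        long v = 0;
        while ((i = stream.read()) >= 0) {
            v = v << 7;
            if ((i & 128) == 128) {
                v += (i & 127); // continuation byte: keep low 7 bits
            } else {
                v += i;         // final byte
                break;
            }
        }
        return v;
    }

    public static void main(String[] args) throws IOException {
        // 300 = 0b10_0101100 -> bytes {0x82, 0x2C}: "2" with the
        // continuation bit set, then the final byte 44
        long v = readUE7(new ByteArrayInputStream(new byte[]{(byte) 0x82, 0x2C}));
        System.out.println(v); // prints 300
    }
}
```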
192
172193
173194 /**
174195 * Get a LE short value from the beginning of a byte array
1616 package org.apache.tika.io;
1717
1818 import java.util.HashSet;
19 import java.util.Locale;
1920
2021
2122 public class FilenameUtils {
6465
6566 for (char c: name.toCharArray()) {
6667 if (RESERVED.contains(c)) {
67 sb.append('%').append((c<16) ? "0" : "").append(Integer.toHexString(c).toUpperCase());
68 sb.append('%').append((c<16) ? "0" : "").append(Integer.toHexString(c).toUpperCase(Locale.ROOT));
6869 } else {
6970 sb.append(c);
7071 }
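The escaping changed in this hunk (only the `Locale.ROOT` argument is new) turns each reserved character into a percent sign followed by two uppercase hex digits, zero-padding values below 0x10. A minimal sketch of just that expression, with a class and helper name of my own choosing:

```java
import java.util.Locale;

public class EscapeDemo {
    // Percent-escape a single reserved character the way FilenameUtils does:
    // '%' followed by two uppercase hex digits, zero-padded for c < 0x10.
    static String escape(char c) {
        return "%" + (c < 16 ? "0" : "") + Integer.toHexString(c).toUpperCase(Locale.ROOT);
    }

    public static void main(String[] args) {
        System.out.println(escape('?'));   // '?' is 0x3F -> "%3F"
        System.out.println(escape('\t'));  // tab is 0x09 -> "%09" (zero-padded)
    }
}
```

Using `Locale.ROOT` keeps the hex digits stable regardless of the default locale (e.g. the Turkish dotless-i rule for `i`/`I` case mapping).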
2929 import java.io.StringWriter;
3030 import java.io.Writer;
3131 import java.nio.channels.Channel;
32 import java.nio.charset.Charset;
3233 import java.util.ArrayList;
3334 import java.util.List;
3435
7576 */
7677 public class IOUtils {
7778
79 //TODO: switch to StandardCharsets when we move to Java 1.7
80 public static final Charset UTF_8 = Charset.forName("UTF-8");
81
7882 /**
7983 * The default buffer size to use.
8084 */
253257 */
254258 @Deprecated
255259 public static byte[] toByteArray(String input) throws IOException {
256 return input.getBytes();
260 return input.getBytes(IOUtils.UTF_8);
257261 }
258262
259263 // read char[]
391395 */
392396 @Deprecated
393397 public static String toString(byte[] input) throws IOException {
394 return new String(input);
398 return new String(input, IOUtils.UTF_8);
395399 }
396400
397401 /**
411415 @Deprecated
412416 public static String toString(byte[] input, String encoding)
413417 throws IOException {
418 // If no encoding is specified, default to UTF-8.
414419 if (encoding == null) {
415 return new String(input);
420 return new String(input, IOUtils.UTF_8);
416421 } else {
417422 return new String(input, encoding);
418423 }
434439 * @since Commons IO 1.1
435440 */
436441 public static List<String> readLines(InputStream input) throws IOException {
437 InputStreamReader reader = new InputStreamReader(input);
442 InputStreamReader reader = new InputStreamReader(input, IOUtils.UTF_8);
438443 return readLines(reader);
439444 }
440445
528533 * @since Commons IO 1.1
529534 */
530535 public static InputStream toInputStream(String input) {
531 byte[] bytes = input.getBytes();
536 byte[] bytes = input.getBytes(IOUtils.UTF_8);
532537 return new ByteArrayInputStream(bytes);
533538 }
534539
546551 * @since Commons IO 1.1
547552 */
548553 public static InputStream toInputStream(String input, String encoding) throws IOException {
549 byte[] bytes = encoding != null ? input.getBytes(encoding) : input.getBytes();
554 byte[] bytes = encoding != null ? input.getBytes(encoding) : input.getBytes(IOUtils.UTF_8);
550555 return new ByteArrayInputStream(bytes);
551556 }
552557
584589 */
585590 public static void write(byte[] data, Writer output) throws IOException {
586591 if (data != null) {
587 output.write(new String(data));
592 output.write(new String(data, IOUtils.UTF_8));
588593 }
589594 }
590595
652657 public static void write(char[] data, OutputStream output)
653658 throws IOException {
654659 if (data != null) {
655 output.write(new String(data).getBytes());
660 output.write(new String(data).getBytes(IOUtils.UTF_8));
656661 }
657662 }
658663
778783 public static void write(String data, OutputStream output)
779784 throws IOException {
780785 if (data != null) {
781 output.write(data.getBytes());
786 output.write(data.getBytes(IOUtils.UTF_8));
782787 }
783788 }
784789
847852 public static void write(StringBuffer data, OutputStream output)
848853 throws IOException {
849854 if (data != null) {
850 output.write(data.toString().getBytes());
855 output.write(data.toString().getBytes(IOUtils.UTF_8));
851856 }
852857 }
853858
953958 */
954959 public static void copy(InputStream input, Writer output)
955960 throws IOException {
956 InputStreamReader in = new InputStreamReader(input);
961 InputStreamReader in = new InputStreamReader(input, IOUtils.UTF_8);
957962 copy(in, output);
958963 }
959964
10601065 */
10611066 public static void copy(Reader input, OutputStream output)
10621067 throws IOException {
1063 OutputStreamWriter out = new OutputStreamWriter(output);
1068 OutputStreamWriter out = new OutputStreamWriter(output, IOUtils.UTF_8);
10641069 copy(input, out);
10651070 // XXX Unless anyone is planning on rewriting OutputStreamWriter, we
10661071 // have to flush here.
3232 * a stream with custom pre-, post- or error processing functionality.
3333 *
3434 * @author Stephen Colebourne
35 * @version $Id: ProxyInputStream.java 934061 2010-04-14 17:56:37Z jukka $
35 * @version $Id$
3636 */
3737 public abstract class ProxyInputStream extends FilterInputStream {
3838
2424 import java.util.Properties;
2525 import java.util.Set;
2626
27 import org.apache.tika.io.IOUtils;
28
2729 /**
2830 * Identifier of the language that best matches a given content profile.
2931 * The content profile is compared to generic language profiles based on
4345 private static final Map<String, LanguageProfile> PROFILES =
4446 new HashMap<String, LanguageProfile>();
4547 private static final String PROFILE_SUFFIX = ".ngp";
46 private static final String PROFILE_ENCODING = "UTF-8";
4748
4849 private static Properties props = new Properties();
4950 private static String errors = "";
7576 LanguageIdentifier.class.getResourceAsStream(language + PROFILE_SUFFIX);
7677 try {
7778 BufferedReader reader =
78 new BufferedReader(new InputStreamReader(stream, PROFILE_ENCODING));
79 new BufferedReader(new InputStreamReader(stream, IOUtils.UTF_8));
7980 String line = reader.readLine();
8081 while (line != null) {
8182 if (line.length() > 0 && !line.startsWith("#")) {
1515 */
1616 package org.apache.tika.language;
1717
18
1819 import java.util.HashMap;
1920 import java.util.HashSet;
2021 import java.util.Map;
2122 import java.util.Set;
23 import java.util.List;
24 import java.util.ArrayList;
25 import java.util.Collections;
26 import java.util.Comparator;
2227
2328 /**
2429 * Language profile based on ngram counts.
3843 new HashMap<String, Counter>();
3944
4045 /**
46 * Sorted ngram cache for faster distance calculation.
47 */
48 private Interleaved interleaved = new Interleaved();
49 public static boolean useInterleaved = true; // For testing purposes
50
51 /**
4152 * The sum of all ngram counts in this profile.
4253 * Used to calculate relative ngram frequency.
4354 */
122133 * @return distance between the profiles
123134 */
124135 public double distance(LanguageProfile that) {
136 return useInterleaved ? distanceInterleaved(that) : distanceStandard(that);
137 }
138
139 private double distanceStandard(LanguageProfile that) {
125140 if (length != that.length) {
126141 throw new IllegalArgumentException(
127142 "Unable to calculate distance of language profiles"
151166 return ngrams.toString();
152167 }
153168
169 /* Code for interleaved distance calculation below */
170
171 private double distanceInterleaved(LanguageProfile that) {
172 if (length != that.length) {
173 throw new IllegalArgumentException(
174 "Unable to calculate distance of language profiles"
175 + " with different ngram lengths: "
176 + that.length + " != " + length);
177 }
178
179 double sumOfSquares = 0.0;
180 double thisCount = Math.max(this.count, 1.0);
181 double thatCount = Math.max(that.count, 1.0);
182
183 Interleaved.Entry thisEntry = updateInterleaved().firstEntry();
184 Interleaved.Entry thatEntry = that.updateInterleaved().firstEntry();
185
186 // Iterate the lists in parallel, until both lists have been depleted
187 while (thisEntry.hasNgram() || thatEntry.hasNgram()) {
188 if (!thisEntry.hasNgram()) { // Depleted this
189 sumOfSquares += square(thatEntry.count / thatCount);
190 thatEntry.next();
191 continue;
192 }
193
194 if (!thatEntry.hasNgram()) { // Depleted that
195 sumOfSquares += square(thisEntry.count / thisCount);
196 thisEntry.next();
197 continue;
198 }
199
200 final int compare = thisEntry.compareTo(thatEntry);
201
202 if (compare == 0) { // Term exists both in this and that
203 double difference = thisEntry.count/thisCount - thatEntry.count/thatCount;
204 sumOfSquares += square(difference);
205 thisEntry.next();
206 thatEntry.next();
207 } else if (compare < 0) { // Term exists only in this
208 sumOfSquares += square(thisEntry.count/thisCount);
209 thisEntry.next();
210 } else { // Term exists only in that
211 sumOfSquares += square(thatEntry.count/thatCount);
212 thatEntry.next();
213 }
214 }
215 return Math.sqrt(sumOfSquares);
216 }
217 private double square(double count) {
218 return count * count;
219 }
220
221 private class Interleaved {
222
223 private char[] entries = null; // <ngram(length chars)><count(2 chars)>*
224 private int size = 0; // Number of entries (one entry = length+2 chars)
225 private long entriesGeneratedAtCount = -1; // Keeps track of when the sequential structure was current
226
227 /**
228 * Ensure that the entries array is in sync with the ngrams.
229 */
230 public void update() {
231 if (count == entriesGeneratedAtCount) { // Already up to date
232 return;
233 }
234 size = ngrams.size();
235 final int numChars = (length+2)*size;
236 if (entries == null || entries.length < numChars) {
237 entries = new char[numChars];
238 }
239 int pos = 0;
240 for (Map.Entry<String, Counter> entry: getSortedNgrams()) {
241 for (int l = 0 ; l < length ; l++) {
242 entries[pos + l] = entry.getKey().charAt(l);
243 }
244 entries[pos + length] = (char)(entry.getValue().count / 65536); // Upper 16 bit
245 entries[pos + length + 1] = (char)(entry.getValue().count % 65536); // lower 16 bit
246 pos += length + 2;
247 }
248 entriesGeneratedAtCount = count;
249 }
250
251 public Entry firstEntry() {
252 Entry entry = new Entry();
253 if (size > 0) {
254 entry.update(0);
255 }
256 return entry;
257 }
258
259 private List<Map.Entry<String, Counter>> getSortedNgrams() {
260 List<Map.Entry<String, Counter>> entries = new ArrayList<Map.Entry<String, Counter>>(ngrams.size());
261 entries.addAll(ngrams.entrySet());
262 Collections.sort(entries, new Comparator<Map.Entry<String, Counter>>() {
263 @Override
264 public int compare(Map.Entry<String, Counter> o1, Map.Entry<String, Counter> o2) {
265 return o1.getKey().compareTo(o2.getKey());
266 }
267 });
268 return entries;
269 }
270
271 private class Entry implements Comparable<Entry> {
272 char[] ngram = new char[length];
273 int count = 0;
274 int pos = 0;
275
276 private void update(int pos) {
277 this.pos = pos;
278 if (pos >= size) { // Reached the end
279 return;
280 }
281 final int origo = pos*(length+2);
282 System.arraycopy(entries, origo, ngram, 0, length);
283 count = entries[origo+length] * 65536 + entries[origo+length+1];
284 }
285
286 @Override
287 public int compareTo(Entry other) {
288 for (int i = 0 ; i < ngram.length ; i++) {
289 if (ngram[i] != other.ngram[i]) {
290 return ngram[i] - other.ngram[i];
291 }
292 }
293 return 0;
294 }
295 public boolean hasNext() {
296 return pos < size-1;
297 }
298 public boolean hasNgram() {
299 return pos < size;
300 }
301 public void next() {
302 update(pos+1);
303 }
304 public String toString() {
305 return new String(ngram) + "(" + count + ")";
306 }
307 }
308 }
309 private Interleaved updateInterleaved() {
310 interleaved.update();
311 return interleaved;
312 }
154313 }
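The interleaved distance above is a merge join over two lexicographically sorted ngram lists. The same idea can be sketched with plain `TreeMap`s standing in for the packed `char[]` structure; the class and method names here are mine, not part of Tika:

```java
import java.util.Iterator;
import java.util.Map;
import java.util.TreeMap;

public class MergeDistanceDemo {
    // Walk two sorted ngram->count maps in parallel and accumulate the
    // squared differences of relative frequencies, exactly as
    // distanceInterleaved does over its char[]-packed entries.
    public static double distance(TreeMap<String, Integer> a, TreeMap<String, Integer> b) {
        double aTotal = Math.max(a.values().stream().mapToInt(Integer::intValue).sum(), 1);
        double bTotal = Math.max(b.values().stream().mapToInt(Integer::intValue).sum(), 1);
        double sumOfSquares = 0.0;
        Iterator<Map.Entry<String, Integer>> ai = a.entrySet().iterator();
        Iterator<Map.Entry<String, Integer>> bi = b.entrySet().iterator();
        Map.Entry<String, Integer> ae = ai.hasNext() ? ai.next() : null;
        Map.Entry<String, Integer> be = bi.hasNext() ? bi.next() : null;
        while (ae != null || be != null) {
            int cmp = ae == null ? 1 : be == null ? -1 : ae.getKey().compareTo(be.getKey());
            if (cmp == 0) {          // ngram present in both profiles
                double d = ae.getValue() / aTotal - be.getValue() / bTotal;
                sumOfSquares += d * d;
                ae = ai.hasNext() ? ai.next() : null;
                be = bi.hasNext() ? bi.next() : null;
            } else if (cmp < 0) {    // ngram only in a
                sumOfSquares += (ae.getValue() / aTotal) * (ae.getValue() / aTotal);
                ae = ai.hasNext() ? ai.next() : null;
            } else {                 // ngram only in b
                sumOfSquares += (be.getValue() / bTotal) * (be.getValue() / bTotal);
                be = bi.hasNext() ? bi.next() : null;
            }
        }
        return Math.sqrt(sumOfSquares);
    }

    public static void main(String[] args) {
        TreeMap<String, Integer> a = new TreeMap<>();
        a.put("th", 5);
        a.put("he", 3);
        TreeMap<String, Integer> b = new TreeMap<>(a);
        System.out.println(distance(a, b)); // identical profiles -> 0.0
    }
}
```

The win over the `HashMap`-based `distanceStandard` is that no lookups are needed: one linear pass over both sorted sequences visits every ngram exactly once.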
3232 import java.util.Iterator;
3333 import java.util.List;
3434 import java.util.Map;
35
3536 import org.apache.tika.exception.TikaException;
36
37 import org.apache.tika.io.IOUtils;
3738 /**
3839 * This class runs a ngram analysis over submitted text, results might be used
3940 * for automatic language identification.
340341
341342 ngrams.clear();
342343 ngramcounts = new int[maxLength + 1];
343 BufferedReader reader = new BufferedReader(new InputStreamReader(is, "UTF-8"));
344 BufferedReader reader = new BufferedReader(new InputStreamReader(is, IOUtils.UTF_8));
344345 String line = null;
345346
346347 while ((line = reader.readLine()) != null) {
404405 */
405406 public void save(OutputStream os) throws IOException {
406407 os.write(("# NgramProfile generated at " + new Date() +
407 " for Apache Tika Language Identification\n").getBytes());
408 " for Apache Tika Language Identification\n").getBytes(IOUtils.UTF_8));
408409
409410 // And then each ngram
410411
431432 for (int i = 0; i < list.size(); i++) {
432433 NGramEntry e = list.get(i);
433434 String line = e.toString() + " " + e.getCount() + "\n";
434 os.write(line.getBytes("UTF-8"));
435 os.write(line.getBytes(IOUtils.UTF_8));
435436 }
436437 os.flush();
437438 }
1717 package org.apache.tika.language.translate;
1818
1919 import org.apache.tika.config.ServiceLoader;
20 import org.apache.tika.exception.TikaException;
21
22 import java.io.IOException;
2023 import java.util.Collections;
2124 import java.util.Comparator;
2225 import java.util.List;
5558 return translators;
5659 }
5760
58 public String translate(String text, String sourceLanguage, String targetLanguage) throws Exception {
61 public String translate(String text, String sourceLanguage, String targetLanguage) throws TikaException, IOException {
5962 return getDefaultTranslators(loader).get(0).translate(text, sourceLanguage, targetLanguage);
6063 }
6164
62 public String translate(String text, String targetLanguage) throws Exception {
65 public String translate(String text, String targetLanguage) throws TikaException, IOException {
6366 return getDefaultTranslators(loader).get(0).translate(text, targetLanguage);
6467 }
6568
1414 * limitations under the License.
1515 */
1616 package org.apache.tika.language.translate;
17
18 import org.apache.tika.exception.TikaException;
19
20 import java.io.IOException;
1721
1822 /**
1923 * Interface for Translator services.
3337 * @param sourceLanguage The input text language (for example, "en").
3438 * @param targetLanguage The desired language to translate to (for example, "fr").
3539 * @return The translation result. If translation is unavailable, returns the same text back.
36 * @throws Exception When there is an error with the API call.
40 * @throws TikaException When there is an error translating.
41 * @throws java.io.IOException When there is an error reading the input or contacting the translation service.
3742 * @since Tika 1.6
3843 */
39 public String translate(String text, String sourceLanguage, String targetLanguage) throws Exception;
44 public String translate(String text, String sourceLanguage, String targetLanguage) throws TikaException, IOException;
4045
4146 /**
4247 * Translate text to the given language. This method attempts to auto-detect the source language of the text.
5156 * @param text The text to translate.
5257 * @param targetLanguage The desired language to translate to (for example, "hi").
5358 * @return The translation result. If translation is unavailable, returns the same text back.
54 * @throws Exception When there is an error with the API call.
59 * @throws TikaException When there is an error translating.
60 * @throws java.io.IOException When there is an error reading the input or contacting the translation service.
5561 * @since Tika 1.6
5662 */
57 public String translate(String text, String targetLanguage) throws Exception;
63 public String translate(String text, String targetLanguage) throws TikaException, IOException;
5864
5965 /**
6066 * @return true if this Translator is probably able to translate right now.
0 package org.apache.tika.metadata;
1
2 /*
3 * Licensed to the Apache Software Foundation (ASF) under one or more
4 * contributor license agreements. See the NOTICE file distributed with
5 * this work for additional information regarding copyright ownership.
6 * The ASF licenses this file to You under the Apache License, Version 2.0
7 * (the "License"); you may not use this file except in compliance with
8 * the License. You may obtain a copy of the License at
9 *
10 * http://www.apache.org/licenses/LICENSE-2.0
11 *
12 * Unless required by applicable law or agreed to in writing, software
13 * distributed under the License is distributed on an "AS IS" BASIS,
14 * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
15 * See the License for the specific language governing permissions and
16 * limitations under the License.
17 */
18
19 /**
20 * Until we can find a common standard, we'll use these options. They
21 * were mostly derived from PDFBox's AccessPermission, but some can
22 * apply to other document formats, especially CAN_MODIFY and FILL_IN_FORM.
23 */
24 public interface AccessPermissions {
25
26 final static String PREFIX = "access_permission"+Metadata.NAMESPACE_PREFIX_DELIMITER;
27
28 /**
29 * Can any modifications be made to the document
30 */
31 Property CAN_MODIFY = Property.externalTextBag(PREFIX+"can_modify");
32
33 /**
34 * Should content be extracted, generally.
35 */
36 Property EXTRACT_CONTENT = Property.externalText(PREFIX+"extract_content");
37
38 /**
39 * Should content be extracted for the purposes
40 * of accessibility.
41 */
42 Property EXTRACT_FOR_ACCESSIBILITY = Property.externalText(PREFIX + "extract_for_accessibility");
43
44 /**
45 * Can the user insert/rotate/delete pages.
46 */
47 Property ASSEMBLE_DOCUMENT = Property.externalText(PREFIX+"assemble_document");
48
49
50 /**
51 * Can the user fill in a form
52 */
53 Property FILL_IN_FORM = Property.externalText(PREFIX+"fill_in_form");
54
55 /**
56 * Can the user modify annotations
57 */
58 Property CAN_MODIFY_ANNOTATIONS = Property.externalText(PREFIX+"modify_annotations");
59
60 /**
61 * Can the user print the document
62 */
63 Property CAN_PRINT = Property.externalText(PREFIX+"can_print");
64
65 /**
66 * Can the user print an image-degraded version of the document.
67 */
68 Property CAN_PRINT_DEGRADED = Property.externalText(PREFIX+"can_print_degraded");
69
70 }
0 package org.apache.tika.metadata;
1
2 /*
3 * Licensed to the Apache Software Foundation (ASF) under one or more
4 * contributor license agreements. See the NOTICE file distributed with
5 * this work for additional information regarding copyright ownership.
6 * The ASF licenses this file to You under the Apache License, Version 2.0
7 * (the "License"); you may not use this file except in compliance with
8 * the License. You may obtain a copy of the License at
9 *
10 * http://www.apache.org/licenses/LICENSE-2.0
11 *
12 * Unless required by applicable law or agreed to in writing, software
13 * distributed under the License is distributed on an "AS IS" BASIS,
14 * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
15 * See the License for the specific language governing permissions and
16 * limitations under the License.
17 */
18 public interface Database {
19 final static String PREFIX = "database"+Metadata.NAMESPACE_PREFIX_DELIMITER;
20
21 Property TABLE_NAME = Property.externalTextBag(PREFIX+"table_name");
22 Property COLUMN_COUNT = Property.externalText(PREFIX+"column_count");
23 Property COLUMN_NAME = Property.externalTextBag(PREFIX+"column_name");
24 }
2121 * properties defined in the XMP standard.
2222 *
2323 * @since Apache Tika 0.8
24 * @see <a href="http://www.adobe.com/devnet/xmp/pdfs/XMPSpecificationPart2.pdf"
24 * @see <a href="http://wwwimages.adobe.com/content/dam/Adobe/en/devnet/xmp/pdfs/cc-201306/XMPSpecificationPart2.pdf"
2525 * >XMP Specification, Part 2: Standard Schemas</a>
2626 */
2727 public interface PagedText {
3737 Property AUTHORS_POSITION = Property.internalText(
3838 PREFIX_PHOTOSHOP + Metadata.NAMESPACE_PREFIX_DELIMITER + "AuthorsPosition");
3939
40 // TODO Replace this with proper indexed choices support
41 String[] _COLOR_MODE_CHOICES_INDEXED = { "Bitmap", "Greyscale", "Indexed Colour",
42 "RGB Color", "CMYK Colour", "Multi-Channel", "Duotone", "LAB Colour",
43 "reserved", "reserved", "YCbCr Colour", "YCgCo Colour", "YCbCrK Colour"};
44 Property COLOR_MODE = Property.internalClosedChoise(
45 PREFIX_PHOTOSHOP + Metadata.NAMESPACE_PREFIX_DELIMITER + "ColorMode",
46 _COLOR_MODE_CHOICES_INDEXED);
47
4048 Property CAPTION_WRITER = Property.internalText(
4149 PREFIX_PHOTOSHOP + Metadata.NAMESPACE_PREFIX_DELIMITER + "CaptionWriter");
4250
2121 * properties defined in the XMP standard.
2222 *
2323 * @since Apache Tika 0.8
24 * @see <a href="http://www.adobe.com/devnet/xmp/pdfs/XMPSpecificationPart2.pdf"
24 * @see <a href="http://wwwimages.adobe.com/content/dam/Adobe/en/devnet/xmp/pdfs/cc-201306/XMPSpecificationPart2.pdf"
2525 * >XMP Specification, Part 2: Standard Schemas</a>
2626 */
2727 public interface TIFF {
5050 ATTACHMENT
5151 };
5252
53 /**
54 * Use this to prefix metadata properties that store information
55 * about the parsing process. Users should be able to distinguish
56 * between metadata that was contained within the document and
57 * metadata about the parsing process.
58 * In Tika 2.0 (or earlier?), let's change X-ParsedBy to X-TIKA-Parsed-By.
59 */
60 public static String TIKA_META_PREFIX = "X-TIKA"+Metadata.NAMESPACE_PREFIX_DELIMITER;
61
62 /**
63 * Use this to store parse exception information in the Metadata object.
64 */
65 public static String TIKA_META_EXCEPTION_PREFIX = TIKA_META_PREFIX+"EXCEPTION"+
66 Metadata.NAMESPACE_PREFIX_DELIMITER;
67
68 /**
69 * This is currently used to identify Content-Type that may be
70 * included within a document, such as in html documents
71 * (e.g. &lt;meta http-equiv="content-type" content="text/html; charset=UTF-8"&gt;),
72 * or the value might come from outside the document. This information
73 * may be faulty and should be treated only as a hint.
74 */
75 public static final Property CONTENT_TYPE_HINT =
76 Property.internalText(HttpHeaders.CONTENT_TYPE+"-Hint");
5377
5478 /**
5579 * @see DublinCore#FORMAT
2323 * properties defined in the XMP standard.
2424 *
2525 * @since Apache Tika 0.7
26 * @see <a href="http://www.adobe.com/devnet/xmp/pdfs/XMPSpecificationPart2.pdf"
26 * @see <a href="http://wwwimages.adobe.com/content/dam/Adobe/en/devnet/xmp/pdfs/cc-201306/XMPSpecificationPart2.pdf"
2727 * >XMP Specification, Part 2: Standard Schemas</a>
2828 */
2929 public interface XMPDM {
5757 * "The name of the artist or artists."
5858 */
5959 Property ARTIST = Property.externalText("xmpDM:artist");
60
61 /**
62 * "The name of the album artist or group for compilation albums."
63 */
64 Property ALBUM_ARTIST = Property.externalText("xmpDM:albumArtist");
6065
6166 /**
6267 * "The date and time when the audio was last modified."
141146 // Property BEAT_SPLICE_PARAMS = "xmpDM:beatSpliceParams";
142147
143148 /**
149 * "An album created by various artists."
150 */
151 Property COMPILATION = Property.externalInteger("xmpDM:compilation");
152
153 /**
144154 * "The composer's name."
145155 */
146156 Property COMPOSER = Property.externalText("xmpDM:composer");
154164 * "The copyright information."
155165 */
156166 Property COPYRIGHT = Property.externalText("xmpDM:copyright");
167
168 /**
169 * "The disc number for part of an album set."
170 */
171 Property DISC_NUMBER = Property.externalInteger("xmpDM:discNumber");
157172
158173 /**
159174 * "The duration of the media file."
8181 }
8282 return aliases;
8383 }
84
85 /**
86 * Returns the set of known children of the given canonical media type
87 *
88 * @since Apache Tika 1.8
89 * @param type canonical media type
90 * @return known children
91 */
92 public SortedSet<MediaType> getChildTypes(MediaType type) {
93 SortedSet<MediaType> children = new TreeSet<MediaType>();
94 for (Map.Entry<MediaType, MediaType> entry : inheritance.entrySet()) {
95 if (entry.getValue().equals(type)) {
96 children.add(entry.getKey());
97 }
98 }
99 return children;
100 }
84101
85102 public void addType(MediaType type) {
86103 registry.put(type, type);
152169 }
153170
154171 /**
155 * Returns the supertype of the given type. If the given type has any
156 * parameters, then the respective base type is returned. Otherwise
157 * built-in heuristics like text/... -&gt; text/plain and
158 * .../...+xml -&gt; application/xml are used in addition to explicit
159 * type inheritance rules read from the media type database. Finally
160 * application/octet-stream is returned for all types for which no other
172 * Returns the supertype of the given type. If the media type database
173 * has an explicit inheritance rule for the type, then that is used.
174 * Next, if the given type has any parameters, then the respective base
175 * type (parameter-less) is returned. Otherwise built-in heuristics like
176 * text/... -&gt; text/plain and .../...+xml -&gt; application/xml are used.
177 * Finally application/octet-stream is returned for all types for which no other
161178 * supertype is known, and the return value for application/octet-stream
162179 * is <code>null</code>.
163180 *
168185 public MediaType getSupertype(MediaType type) {
169186 if (type == null) {
170187 return null;
188 } else if (inheritance.containsKey(type)) {
189 return inheritance.get(type);
171190 } else if (type.hasParameters()) {
172191 return type.getBaseType();
173 } else if (inheritance.containsKey(type)) {
174 return inheritance.get(type);
175192 } else if (type.getSubtype().endsWith("+xml")) {
176193 return MediaType.APPLICATION_XML;
177194 } else if (type.getSubtype().endsWith("+zip")) {
6161 * available parsers have their 3rd party jars included, as otherwise the
6262 * use of the default TikaConfig will throw various "ClassNotFound" exceptions.
6363 *
64 * @param detector Detector to use
6564 * @param parsers
6665 */
6766 public AutoDetectParser(Parser...parsers) {
1515 */
1616 package org.apache.tika.parser;
1717
18 import java.io.IOException;
19 import java.io.InputStream;
20 import java.util.ArrayList;
21 import java.util.Arrays;
22 import java.util.Collections;
23 import java.util.HashMap;
24 import java.util.List;
25 import java.util.Map;
26 import java.util.Set;
27
2818 import org.apache.tika.exception.TikaException;
2919 import org.apache.tika.io.TemporaryResources;
3020 import org.apache.tika.io.TikaInputStream;
3525 import org.xml.sax.ContentHandler;
3626 import org.xml.sax.SAXException;
3727
28 import java.io.IOException;
29 import java.io.InputStream;
30 import java.util.ArrayList;
31 import java.util.Arrays;
32 import java.util.Collection;
33 import java.util.Collections;
34 import java.util.HashMap;
35 import java.util.List;
36 import java.util.Map;
37 import java.util.Set;
38
3839 /**
3940 * Composite parser that delegates parsing tasks to a component parser
4041 * based on the declared content type of the incoming document. A fallback
6162 */
6263 private Parser fallback = new EmptyParser();
6364
65 public CompositeParser(MediaTypeRegistry registry, List<Parser> parsers,
66 Collection<Class<? extends Parser>> excludeParsers) {
67 if (excludeParsers == null || excludeParsers.isEmpty()) {
68 this.parsers = parsers;
69 } else {
70 this.parsers = new ArrayList<Parser>();
71 for (Parser p : parsers) {
72 if (!isExcluded(excludeParsers, p.getClass())) {
73 this.parsers.add(p);
74 }
75 }
76 }
77 this.registry = registry;
78 }
6479 public CompositeParser(MediaTypeRegistry registry, List<Parser> parsers) {
65 this.parsers = parsers;
66 this.registry = registry;
80 this(registry, parsers, null);
6781 }
6882
6983 public CompositeParser(MediaTypeRegistry registry, Parser... parsers) {
8296 }
8397 }
8498 return map;
99 }
100
101 private boolean isExcluded(Collection<Class<? extends Parser>> excludeParsers, Class<? extends Parser> p){
102 return excludeParsers.contains(p) || assignableFrom(excludeParsers, p);
103 }
104
105 private boolean assignableFrom(Collection<Class<? extends Parser>> excludeParsers, Class<? extends Parser> p) {
106 for (Class<? extends Parser> e : excludeParsers) {
107 if (e.isAssignableFrom(p)) return true;
108 }
109 return false;
85110 }
86111
87112 /**
139164 }
140165
141166 /**
167 * Returns all parsers registered with the Composite Parser,
168 * including ones which may not currently be active.
169 * This won't include the Fallback Parser, if defined.
170 */
171 public List<Parser> getAllComponentParsers() {
172 return Collections.unmodifiableList(parsers);
173 }
174
175 /**
142176 * Returns the component parsers.
143177 *
144178 * @return component parsers, keyed by media type
202236 // We always work on the normalised, canonical form
203237 type = registry.normalize(type);
204238 }
205
206239 while (type != null) {
207240 // Try finding a parser for the type
208241 Parser parser = map.get(type);
238271 TikaInputStream taggedStream = TikaInputStream.get(stream, tmp);
239272 TaggedContentHandler taggedHandler =
240273 handler != null ? new TaggedContentHandler(handler) : null;
274 if (parser instanceof ParserDecorator){
275 metadata.add("X-Parsed-By", ((ParserDecorator) parser).getWrappedParser().getClass().getName());
276 } else {
277 metadata.add("X-Parsed-By", parser.getClass().getName());
278 }
241279 try {
242280 parser.parse(taggedStream, taggedHandler, metadata, context);
243281 } catch (RuntimeException e) {
1515 */
1616 package org.apache.tika.parser;
1717
18 import java.util.Collection;
1819 import java.util.Collections;
1920 import java.util.Comparator;
2021 import java.util.List;
6869
6970 private transient final ServiceLoader loader;
7071
72 public DefaultParser(MediaTypeRegistry registry, ServiceLoader loader,
73 Collection<Class<? extends Parser>> excludeParsers) {
74 super(registry, getDefaultParsers(loader), excludeParsers);
75 this.loader = loader;
76 }
77
7178 public DefaultParser(MediaTypeRegistry registry, ServiceLoader loader) {
72 super(registry, getDefaultParsers(loader));
73 this.loader = loader;
79 this(registry, loader, null);
7480 }
7581
7682 public DefaultParser(MediaTypeRegistry registry, ClassLoader loader) {
3131 * for unknown document types.
3232 */
3333 public class EmptyParser extends AbstractParser {
34
3534 /**
3635 * Serial version UID.
3736 */
5453 xhtml.startDocument();
5554 xhtml.endDocument();
5655 }
57
5856 }
3030 * for unknown document types.
3131 */
3232 public class ErrorParser extends AbstractParser {
33
33 private static final long serialVersionUID = 7727423956957641824L;
34
3435 /**
3536 * Singleton instance of this class.
3637 */
4647 throws TikaException {
4748 throw new TikaException("Parse error");
4849 }
49
5050 }
1717
1818 import java.io.IOException;
1919 import java.io.InputStream;
20 import java.util.Collection;
21 import java.util.HashSet;
2022 import java.util.Set;
2123
2224 import org.apache.tika.exception.TikaException;
25 import org.apache.tika.io.TikaInputStream;
2326 import org.apache.tika.metadata.Metadata;
2427 import org.apache.tika.mime.MediaType;
2528 import org.xml.sax.ContentHandler;
5154 @Override
5255 public Set<MediaType> getSupportedTypes(ParseContext context) {
5356 return types;
57 }
58 };
59 }
60
61 /**
62 * Decorates the given parser so that it never claims to support
63 * parsing of the given media types, but will work for all others.
64 *
65 * @param parser the parser to be decorated
66 * @param types excluded/ignored media types
67 * @return the decorated parser
68 */
69 public static final Parser withoutTypes(
70 Parser parser, final Set<MediaType> excludeTypes) {
71 return new ParserDecorator(parser) {
72 private static final long serialVersionUID = 7979614774021768609L;
73 @Override
74 public Set<MediaType> getSupportedTypes(ParseContext context) {
75 // Get our own, writable copy of the types the parser supports
76 Set<MediaType> parserTypes =
77 new HashSet<MediaType>(super.getSupportedTypes(context));
78 // Remove anything on our excludes list
79 parserTypes.removeAll(excludeTypes);
80 // Return whatever is left
81 return parserTypes;
82 }
83 };
84 }
85
86 /**
87 * Decorates the given parsers into a virtual parser, where they'll
88 * be tried in preference order until one works without error.
89 * TODO Is this the right name?
90 * TODO Is this the right place to put this? Should it be in CompositeParser? Elsewhere?
91 * TODO Should we reset the Metadata if we try another parser?
92 * TODO Should we reset the ContentHandler if we try another parser?
93 * TODO Should we log/report failures anywhere?
94 * @deprecated Do not use until the TODOs are resolved, see TIKA-1509
95 */
96 public static final Parser withFallbacks(
97 final Collection<? extends Parser> parsers, final Set<MediaType> types) {
98 Parser parser = EmptyParser.INSTANCE;
99 if (!parsers.isEmpty()) parser = parsers.iterator().next();
100
101 return new ParserDecorator(parser) {
102 private static final long serialVersionUID = 1625187131782069683L;
103 @Override
104 public Set<MediaType> getSupportedTypes(ParseContext context) {
105 return types;
106 }
107 @Override
108 public void parse(InputStream stream, ContentHandler handler,
109 Metadata metadata, ParseContext context)
110 throws IOException, SAXException, TikaException {
111 // Must have a TikaInputStream, so we can re-use it if parsing fails
112 TikaInputStream tstream = TikaInputStream.get(stream);
113 tstream.getFile();
114 // Try each parser in turn
115 for (Parser p : parsers) {
116 tstream.mark(-1);
117 try {
118 p.parse(tstream, handler, metadata, context);
119 return;
120 } catch (Exception e) {
121 // TODO How to log / record this failure?
122 }
123 // Prepare for the next parser, if present
124 tstream.reset();
125 }
54126 }
55127 };
56128 }
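The withFallbacks decorator above tries each parser in preference order, rewinding the stream between attempts via mark/reset. A minimal stdlib sketch of that retry loop, assuming a hypothetical `Step` interface in place of Tika's `Parser` (none of these names are part of Tika's API):

```java
import java.io.BufferedInputStream;
import java.io.ByteArrayInputStream;
import java.io.IOException;
import java.io.InputStream;
import java.nio.charset.StandardCharsets;
import java.util.List;

public class FallbackSketch {
    /** Hypothetical stand-in for a Parser: consumes the stream or throws. */
    interface Step {
        String parse(InputStream in) throws IOException;
    }

    /** Try each step in turn; rewind the stream after a failure. */
    static String withFallbacks(List<Step> steps, InputStream in) throws IOException {
        BufferedInputStream buffered = new BufferedInputStream(in);
        for (Step step : steps) {
            buffered.mark(Integer.MAX_VALUE);   // remember the current position
            try {
                return step.parse(buffered);
            } catch (Exception e) {
                buffered.reset();               // rewind for the next attempt
            }
        }
        throw new IOException("all steps failed");
    }

    public static void main(String[] args) throws IOException {
        InputStream data = new ByteArrayInputStream("hello".getBytes(StandardCharsets.UTF_8));
        Step failing = in -> { in.read(); throw new RuntimeException("parse error"); };
        Step reading = in -> new String(in.readAllBytes(), StandardCharsets.UTF_8);
        // The failing step consumes a byte, but reset() restores it for the next step.
        System.out.println(withFallbacks(List.of(failing, reading), data));   // prints "hello"
    }
}
```

Tika's real implementation uses TikaInputStream (spooled to a file) rather than an in-memory buffer, precisely so large streams can be re-read after a failed parse.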
0 package org.apache.tika.parser;
1
2 /*
3 * Licensed to the Apache Software Foundation (ASF) under one or more
4 * contributor license agreements. See the NOTICE file distributed with
5 * this work for additional information regarding copyright ownership.
6 * The ASF licenses this file to You under the Apache License, Version 2.0
7 * (the "License"); you may not use this file except in compliance with
8 * the License. You may obtain a copy of the License at
9 *
10 * http://www.apache.org/licenses/LICENSE-2.0
11 *
12 * Unless required by applicable law or agreed to in writing, software
13 * distributed under the License is distributed on an "AS IS" BASIS,
14 * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
15 * See the License for the specific language governing permissions and
16 * limitations under the License.
17 */
18
19 import java.io.IOException;
20 import java.io.InputStream;
21 import java.util.Date;
22 import java.util.LinkedList;
23 import java.util.List;
24 import java.util.Set;
25
26 import org.apache.tika.exception.TikaException;
27 import org.apache.tika.io.FilenameUtils;
28 import org.apache.tika.metadata.Metadata;
29 import org.apache.tika.metadata.Property;
30 import org.apache.tika.metadata.TikaCoreProperties;
31 import org.apache.tika.metadata.TikaMetadataKeys;
32 import org.apache.tika.mime.MediaType;
33 import org.apache.tika.sax.ContentHandlerFactory;
34 import org.xml.sax.ContentHandler;
35 import org.xml.sax.SAXException;
36 import org.xml.sax.helpers.DefaultHandler;
37
38 /**
39 * This is a helper class that wraps a parser in a recursive handler.
40 * It takes care of setting the embedded parser in the ParseContext
41 * and handling the embedded path calculations.
42 * <p>
43 * After parsing a document, call getMetadata() to retrieve a list of
44 * Metadata objects, one for each embedded resource. The first item
45 * in the list will contain the Metadata for the outer container file.
46 * <p>
47 * Content can also be extracted and stored in the {@link #TIKA_CONTENT} field
48 * of a Metadata object. Select the type of content to be stored
49 * at initialization.
50 * <p>
51 * If a WriteLimitReachedException is encountered, the wrapper will stop
52 * processing the current resource, and it will not process
53 * any of the child resources for the given resource. However, it will try to
54 * parse as much as it can. If a WLRE is reached in the parent document,
55 * no child resources will be parsed.
56 * <p>
57 * The implementation is based on Jukka's RecursiveMetadataParser
58 * and Nick's additions. See:
59 * <a href="http://wiki.apache.org/tika/RecursiveMetadata#Jukka.27s_RecursiveMetadata_Parser">RecursiveMetadataParser</a>.
60 * <p>
61 * Note that this wrapper holds all data in memory and is not appropriate
62 * for files with content too large to be held in memory.
63 * <p>
64 * Note, too, that this wrapper is not thread safe because it stores state.
65 * The client must initialize a new wrapper for each thread, and the client
66 * is responsible for calling {@link #reset()} after each parse.
67 * <p>
68 * The unit tests for this class are in the tika-parsers module.
69 * </p>
70 */
71 public class RecursiveParserWrapper implements Parser {
72
73 /**
74 * Generated serial version
75 */
76 private static final long serialVersionUID = 9086536568120690938L;
77
78 //move this to TikaCoreProperties?
79 public final static Property TIKA_CONTENT = Property.internalText(TikaCoreProperties.TIKA_META_PREFIX+"content");
80 public final static Property PARSE_TIME_MILLIS = Property.internalText(TikaCoreProperties.TIKA_META_PREFIX+"parse_time_millis");
81 public final static Property WRITE_LIMIT_REACHED =
82 Property.internalBoolean(TikaCoreProperties.TIKA_META_EXCEPTION_PREFIX+"write_limit_reached");
83 public final static Property EMBEDDED_RESOURCE_LIMIT_REACHED =
84 Property.internalBoolean(TikaCoreProperties.TIKA_META_EXCEPTION_PREFIX+"embedded_resource_limit_reached");
85
86 //move this to TikaCoreProperties?
87 public final static Property EMBEDDED_RESOURCE_PATH =
88 Property.internalText(TikaCoreProperties.TIKA_META_PREFIX+"embedded_resource_path");
89
90 private final Parser wrappedParser;
91 private final ContentHandlerFactory contentHandlerFactory;
92 private final List<Metadata> metadatas = new LinkedList<Metadata>();
93
94 //used in naming embedded resources that don't have a name.
95 private int unknownCount = 0;
96 private int maxEmbeddedResources = -1;
97 private boolean hitMaxEmbeddedResources = false;
98
99 public RecursiveParserWrapper(Parser wrappedParser, ContentHandlerFactory contentHandlerFactory) {
100 this.wrappedParser = wrappedParser;
101 this.contentHandlerFactory = contentHandlerFactory;
102 }
103
104 @Override
105 public Set<MediaType> getSupportedTypes(ParseContext context) {
106 return wrappedParser.getSupportedTypes(context);
107 }
108
109 /**
110 * Acts like a regular parser except it ignores the ContentHandler
111 * and it automatically sets/overwrites the embedded Parser in the
112 * ParseContext object.
113 * <p>
114 * To retrieve the results of the parse, use {@link #getMetadata()}.
115 * <p>
116 * Make sure to call {@link #reset()} after each parse.
117 */
118 @Override
119 public void parse(InputStream stream, ContentHandler ignore,
120 Metadata metadata, ParseContext context) throws IOException,
121 SAXException, TikaException {
122
123 String name = getResourceName(metadata);
124 EmbeddedParserDecorator decorator = new EmbeddedParserDecorator(name);
125 context.set(Parser.class, decorator);
126 ContentHandler localHandler = contentHandlerFactory.getNewContentHandler();
127 long started = new Date().getTime();
128 try {
129 wrappedParser.parse(stream, localHandler, metadata, context);
130 } catch (SAXException e) {
131 boolean wlr = isWriteLimitReached(e);
132 if (!wlr) {
133 throw e;
134 }
135 metadata.set(WRITE_LIMIT_REACHED, "true");
136 }
137 long elapsedMillis = new Date().getTime()-started;
138 metadata.set(PARSE_TIME_MILLIS, Long.toString(elapsedMillis));
139 addContent(localHandler, metadata);
140
141 if (hitMaxEmbeddedResources) {
142 metadata.set(EMBEDDED_RESOURCE_LIMIT_REACHED, "true");
143 }
144 metadatas.add(0, deepCopy(metadata));
145 }
146
147 /**
148 *
149 * The first element in the returned list represents the
150 * data from the outer container file. There is no guarantee
151 * about the ordering of the list after that.
152 *
153 * @return list of Metadata objects that were gathered during the parse
154 */
155 public List<Metadata> getMetadata() {
156 return metadatas;
157 }
158
159 /**
160 * Set the maximum number of embedded resources to store.
161 * If the max is hit during parsing, the {@link #EMBEDDED_RESOURCE_LIMIT_REACHED}
162 * property will be added to the container document's Metadata.
163 *
164 * <p>
165 * If this value is &lt; 0 (the default), the wrapper will store all Metadata.
166 *
167 * @param max maximum number of embedded resources to store
168 */
169 public void setMaxEmbeddedResources(int max) {
170 maxEmbeddedResources = max;
171 }
172
173
174 /**
175 * This clears the metadata list and resets {@link #unknownCount} and
176 * {@link #hitMaxEmbeddedResources}
177 */
178 public void reset() {
179 metadatas.clear();
180 unknownCount = 0;
181 hitMaxEmbeddedResources = false;
182 }
183
184 /**
185 * Copied/modified from WriteOutContentHandler. That method could not be made
186 * static, and we need something that works with exceptions thrown
187 * from both BodyContentHandler and WriteOutContentHandler.
188 * @param t throwable whose message and cause chain are checked
189 * @return true if the throwable indicates that the write limit was reached
190 */
191 private boolean isWriteLimitReached(Throwable t) {
192 if (t.getMessage() != null && t.getMessage().indexOf("Your document contained more than") == 0) {
193 return true;
194 } else {
195 return t.getCause() != null && isWriteLimitReached(t.getCause());
196 }
197 }
198
199 //defensive copy
200 private Metadata deepCopy(Metadata m) {
201 Metadata clone = new Metadata();
202
203 for (String n : m.names()){
204 if (! m.isMultiValued(n)) {
205 clone.set(n, m.get(n));
206 } else {
207 String[] vals = m.getValues(n);
208 for (int i = 0; i < vals.length; i++) {
209 clone.add(n, vals[i]);
210 }
211 }
212 }
213 return clone;
214 }
215
216 private String getResourceName(Metadata metadata) {
217 String objectName = "";
218 if (metadata.get(TikaMetadataKeys.RESOURCE_NAME_KEY) != null) {
219 objectName = metadata.get(TikaMetadataKeys.RESOURCE_NAME_KEY);
220 } else if (metadata.get(TikaMetadataKeys.EMBEDDED_RELATIONSHIP_ID) != null) {
221 objectName = metadata.get(TikaMetadataKeys.EMBEDDED_RELATIONSHIP_ID);
222 } else {
223 objectName = "embedded-" + (++unknownCount);
224 }
225 //make sure that there isn't any path info in the objectName
226 //some parsers can return paths, not just file names
227 objectName = FilenameUtils.getName(objectName);
228 return objectName;
229 }
230
231 private void addContent(ContentHandler handler, Metadata metadata) {
232
233 if (handler.getClass().equals(DefaultHandler.class)){
234 //no-op: we can't rely on just testing for
235 //empty content because DefaultHandler's toString()
236 //returns e.g. "org.xml.sax.helpers.DefaultHandler@6c8b1edd"
237 } else {
238 String content = handler.toString();
239 if (content != null && content.trim().length() > 0 ) {
240 metadata.add(TIKA_CONTENT, content);
241 }
242 }
243
244 }
245
251
252
253 private class EmbeddedParserDecorator extends ParserDecorator {
254
255 private static final long serialVersionUID = 207648200464263337L;
256
257 private String location = null;
258
259
260 private EmbeddedParserDecorator(String location) {
261 super(wrappedParser);
262 this.location = location;
263 if (! this.location.endsWith("/")) {
264 this.location += "/";
265 }
266 }
267
268 @Override
269 public void parse(InputStream stream, ContentHandler ignore,
270 Metadata metadata, ParseContext context) throws IOException,
271 SAXException, TikaException {
272 //Test to see if we should avoid parsing
273 if (maxEmbeddedResources > -1 &&
274 metadatas.size() >= maxEmbeddedResources) {
275 hitMaxEmbeddedResources = true;
276 return;
277 }
278 // Work out what this thing is
279 String objectName = getResourceName(metadata);
280 String objectLocation = this.location + objectName;
281
282 metadata.add(EMBEDDED_RESOURCE_PATH, objectLocation);
283
284 //ignore the content handler that is passed in
285 //and get a fresh handler
286 ContentHandler localHandler = contentHandlerFactory.getNewContentHandler();
287
288 Parser preContextParser = context.get(Parser.class);
289 context.set(Parser.class, new EmbeddedParserDecorator(objectLocation));
290
291 try {
292 super.parse(stream, localHandler, metadata, context);
293 } catch (SAXException e) {
294 boolean wlr = isWriteLimitReached(e);
295 if (wlr) {
296 metadata.add(WRITE_LIMIT_REACHED, "true");
297 } else {
298 throw e;
299 }
300 } finally {
301 context.set(Parser.class, preContextParser);
302 }
303
304 //Because of recursion, we need
305 //to re-test to make sure that we limit the
306 //number of stored resources
307 if (maxEmbeddedResources > -1 &&
308 metadatas.size() >= maxEmbeddedResources) {
309 hitMaxEmbeddedResources = true;
310 return;
311 }
312 addContent(localHandler, metadata);
313 metadatas.add(deepCopy(metadata));
314 }
315 }
316
317
318 }
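The isWriteLimitReached helper above recursively walks an exception's cause chain looking for the write-limit marker message. A stdlib sketch of that pattern, with a null-safe guard on the message (the class name and marker constant here are illustrative; Tika matches on the message text produced by WriteOutContentHandler):

```java
public class CauseChainSketch {
    static final String MARKER = "Your document contained more than";

    /** Recursively check a throwable and its causes for the marker message. */
    static boolean isWriteLimitReached(Throwable t) {
        if (t == null) {
            return false;
        }
        if (t.getMessage() != null && t.getMessage().startsWith(MARKER)) {
            return true;
        }
        return isWriteLimitReached(t.getCause());
    }

    public static void main(String[] args) {
        Exception inner = new Exception(MARKER + " 100000 characters");
        Exception outer = new Exception("wrapper", new RuntimeException(inner));
        System.out.println(isWriteLimitReached(outer));                  // true: marker is two causes deep
        System.out.println(isWriteLimitReached(new Exception("other"))); // false
    }
}
```

The null check matters: SAXExceptions wrapped by intermediate handlers can carry a null message, and calling indexOf on it directly would throw a NullPointerException.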
230230 */
231231 private void extractOutput(InputStream stream, XHTMLContentHandler xhtml)
232232 throws SAXException, IOException {
233 Reader reader = new InputStreamReader(stream);
233 Reader reader = new InputStreamReader(stream, IOUtils.UTF_8);
234234 try {
235235 xhtml.startDocument();
236236 xhtml.startElement("p");
290290 private void extractMetadata(final InputStream stream, final Metadata metadata) {
291291 new Thread() {
292292 public void run() {
293 BufferedReader reader = new BufferedReader(new InputStreamReader(stream));
293 BufferedReader reader;
294 reader = new BufferedReader(new InputStreamReader(stream, IOUtils.UTF_8));
294295 try {
295296 String line;
296297 while ( (line = reader.readLine()) != null ) {
297298 for(Pattern p : metadataPatterns.keySet()) {
298299 Matcher m = p.matcher(line);
299300 if(m.find()) {
300 metadata.add( metadataPatterns.get(p), m.group(1) );
301 if (metadataPatterns.get(p) != null &&
302 !metadataPatterns.get(p).equals("")){
303 metadata.add( metadataPatterns.get(p), m.group(1) );
304 }
305 else{
306 metadata.add( m.group(1), m.group(2));
307 }
301308 }
302309 }
303310 }
304311 } catch (IOException e) {
312 // Ignore
305313 } finally {
306314 IOUtils.closeQuietly(reader);
307315 IOUtils.closeQuietly(stream);
321329 public static boolean check(String checkCmd, int... errorValue) {
322330 return check(new String[] {checkCmd}, errorValue);
323331 }
332
324333 public static boolean check(String[] checkCmd, int... errorValue) {
325334 if(errorValue.length == 0) {
326335 errorValue = new int[] { 127 };
327336 }
328337
329338 try {
330 Process process;
331 if(checkCmd.length == 1) {
332 process = Runtime.getRuntime().exec(checkCmd[0]);
333 } else {
334 process = Runtime.getRuntime().exec(checkCmd);
335 }
339 Process process = Runtime.getRuntime().exec(checkCmd);
336340 int result = process.waitFor();
337341
338342 for(int err : errorValue) {
345349 } catch (InterruptedException ie) {
346350 // Some problem, command isn't there or is broken
347351 return false;
348 }
352 } catch (Error err) {
353 if (err.getMessage() != null &&
354 (err.getMessage().contains("posix_spawn") ||
355 err.getMessage().contains("UNIXProcess"))) {
356 //Error forking command due to JVM locale bug
357 //(see TIKA-1526 and SOLR-6387)
358 return false;
359 }
360 //throw if a different kind of error
361 throw err;
362 }
349363 }
350364 }
0 package org.apache.tika.sax;
1 /*
2 * Licensed to the Apache Software Foundation (ASF) under one or more
3 * contributor license agreements. See the NOTICE file distributed with
4 * this work for additional information regarding copyright ownership.
5 * The ASF licenses this file to You under the Apache License, Version 2.0
6 * (the "License"); you may not use this file except in compliance with
7 * the License. You may obtain a copy of the License at
8 *
9 * http://www.apache.org/licenses/LICENSE-2.0
10 *
11 * Unless required by applicable law or agreed to in writing, software
12 * distributed under the License is distributed on an "AS IS" BASIS,
13 * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
14 * See the License for the specific language governing permissions and
15 * limitations under the License.
16 */
17
18 import java.io.OutputStream;
19 import java.io.OutputStreamWriter;
20 import java.io.UnsupportedEncodingException;
21
22 import org.xml.sax.ContentHandler;
23 import org.xml.sax.helpers.DefaultHandler;
24
25 /**
26 * Basic factory for creating common types of ContentHandlers
27 */
28 public class BasicContentHandlerFactory implements ContentHandlerFactory {
29
30 /**
31 * Common handler types for content.
32 */
33 public enum HANDLER_TYPE {
34 BODY,
35 IGNORE, //don't store content
36 TEXT,
37 HTML,
38 XML
39 };
40
41 private final HANDLER_TYPE type;
42 private final int writeLimit;
43
44 /**
45 *
46 * @param type basic type of handler
47 * @param writeLimit max number of characters to store; if &lt; 0, the handler will store all characters
48 */
49 public BasicContentHandlerFactory(HANDLER_TYPE type, int writeLimit) {
50 this.type = type;
51 this.writeLimit = writeLimit;
52 }
53
54 @Override
55 public ContentHandler getNewContentHandler() {
56
57 if (type == HANDLER_TYPE.BODY) {
58 return new BodyContentHandler(writeLimit);
59 } else if (type == HANDLER_TYPE.IGNORE) {
60 return new DefaultHandler();
61 }
62 if (writeLimit > -1) {
63 switch(type) {
64 case TEXT:
65 return new WriteOutContentHandler(new ToTextContentHandler(), writeLimit);
66 case HTML:
67 return new WriteOutContentHandler(new ToHTMLContentHandler(), writeLimit);
68 case XML:
69 return new WriteOutContentHandler(new ToXMLContentHandler(), writeLimit);
70 default:
71 return new WriteOutContentHandler(new ToTextContentHandler(), writeLimit);
72 }
73 } else {
74 switch (type) {
75 case TEXT:
76 return new ToTextContentHandler();
77 case HTML:
78 return new ToHTMLContentHandler();
79 case XML:
80 return new ToXMLContentHandler();
81 default:
82 return new ToTextContentHandler();
83
84 }
85 }
86 }
87
88 @Override
89 public ContentHandler getNewContentHandler(OutputStream os, String encoding) throws UnsupportedEncodingException {
90
91 if (type == HANDLER_TYPE.IGNORE) {
92 return new DefaultHandler();
93 }
94
95 if (writeLimit > -1) {
96 switch(type) {
97 case BODY:
98 return new WriteOutContentHandler(
99 new BodyContentHandler(
100 new OutputStreamWriter(os, encoding)), writeLimit);
101 case TEXT:
102 return new WriteOutContentHandler(new ToTextContentHandler(os, encoding), writeLimit);
103 case HTML:
104 return new WriteOutContentHandler(new ToHTMLContentHandler(os, encoding), writeLimit);
105 case XML:
106 return new WriteOutContentHandler(new ToXMLContentHandler(os, encoding), writeLimit);
107 default:
108 return new WriteOutContentHandler(new ToTextContentHandler(os, encoding), writeLimit);
109 }
110 } else {
111 switch (type) {
112 case BODY:
113 return new BodyContentHandler(new OutputStreamWriter(os, encoding));
114 case TEXT:
115 return new ToTextContentHandler(os, encoding);
116 case HTML:
117 return new ToHTMLContentHandler(os, encoding);
118 case XML:
119 return new ToXMLContentHandler(os, encoding);
120 default:
121 return new ToTextContentHandler(os, encoding);
122
123 }
124 }
125 }
126
127 }
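BasicContentHandlerFactory above pairs each handler type with an optional write limit: when the limit is positive, the handler is wrapped so that output stops once the cap is hit. A stdlib sketch of that capping idea, using a Writer that forwards at most a fixed number of characters (the class and exception names here are hypothetical, not Tika's):

```java
import java.io.IOException;
import java.io.StringWriter;
import java.io.Writer;

public class WriteLimitSketch {
    /** Thrown when the character limit is hit, mirroring the write-limit idea. */
    static class LimitReachedException extends IOException {
        LimitReachedException(int limit) {
            super("Your document contained more than " + limit + " characters");
        }
    }

    /** Writer that forwards at most `limit` characters, then throws. */
    static class LimitedWriter extends Writer {
        private final Writer out;
        private final int limit;
        private int written = 0;

        LimitedWriter(Writer out, int limit) {
            this.out = out;
            this.limit = limit;
        }

        @Override
        public void write(char[] cbuf, int off, int len) throws IOException {
            if (limit > -1 && written + len > limit) {
                out.write(cbuf, off, limit - written);   // keep what still fits
                written = limit;
                throw new LimitReachedException(limit);
            }
            out.write(cbuf, off, len);
            written += len;
        }

        @Override public void flush() throws IOException { out.flush(); }
        @Override public void close() throws IOException { out.close(); }
    }

    public static void main(String[] args) throws IOException {
        StringWriter sink = new StringWriter();
        Writer limited = new LimitedWriter(sink, 5);
        try {
            limited.write("hello world");
        } catch (LimitReachedException e) {
            // Truncated content is still available, as with Tika's write limit.
        }
        System.out.println(sink);   // prints "hello"
    }
}
```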
0 /*
1 * Licensed to the Apache Software Foundation (ASF) under one or more
2 * contributor license agreements. See the NOTICE file distributed with
3 * this work for additional information regarding copyright ownership.
4 * The ASF licenses this file to You under the Apache License, Version 2.0
5 * (the "License"); you may not use this file except in compliance with
6 * the License. You may obtain a copy of the License at
7 *
8 * http://www.apache.org/licenses/LICENSE-2.0
9 *
10 * Unless required by applicable law or agreed to in writing, software
11 * distributed under the License is distributed on an "AS IS" BASIS,
12 * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
13 * See the License for the specific language governing permissions and
14 * limitations under the License.
15 */
16
17 package org.apache.tika.sax;
18
19 import java.util.Locale;
20 import java.util.regex.Matcher;
21 import java.util.regex.Pattern;
22 import java.util.ArrayList;
23
24 /**
25 * Class to help de-obfuscate phone numbers in text.
26 */
27 public class CleanPhoneText {
28 // Regex to identify a phone number
29 static final String cleanPhoneRegex = "([2-9]\\d{2}[2-9]\\d{6})";
30
31 // Regex which attempts to ignore punctuation and other distractions.
32 static final String phoneRegex = "([{(<]{0,3}[2-9][\\W_]{0,3}\\d[\\W_]{0,3}\\d[\\W_]{0,6}[2-9][\\W_]{0,3}\\d[\\W_]{0,3}\\d[\\W_]{0,6}\\d[\\W_]{0,3}\\d[\\W_]{0,3}\\d[\\W_]{0,3}\\d)";
33
34 public static ArrayList<String> extractPhoneNumbers(String text) {
35 text = clean(text);
36 int idx = 0;
37 Pattern p = Pattern.compile(cleanPhoneRegex);
38 Matcher m = p.matcher(text);
39 ArrayList<String> phoneNumbers = new ArrayList<String>();
40 while (m.find(idx)) {
41 String digits = m.group(1);
42 int start = m.start(1);
43 int end = m.end(1);
44 String prefix = "";
45 if (start > 0) {
46 prefix = text.substring(start-1, start);
47 }
48 if (digits.startsWith("82") && prefix.equals("*")) {
49 // this number overlaps with a *82 sequence
50 idx += 2;
51 } else {
52 // seems good
53 phoneNumbers.add(digits);
54 idx = end;
55 }
56 }
57 return phoneNumbers;
58 }
59
60 public static String clean(String text) {
61 text = text.toLowerCase(Locale.ROOT);
62 for (String[][] group : cleanSubstitutions) {
63 for (String[] sub : group) {
64 text = text.replaceAll(sub[0], sub[1]);
65 }
66 }
67 // Delete all non-digits and white space.
68 text = text.replaceAll("[\\D+\\s]", "");
69 return text;
70 }
71
72
73 public static final String[][][] cleanSubstitutions = new String[][][]{
74 {{"&#\\d{1,3};", ""}}, // first simply remove numeric entities
75 {{"th0usand", "thousand"}, // handle common misspellings
76 {"th1rteen", "thirteen"},
77 {"f0urteen", "fourteen"},
78 {"e1ghteen", "eighteen"},
79 {"n1neteen", "nineteen"},
80 {"f1fteen", "fifteen"},
81 {"s1xteen", "sixteen"},
82 {"th1rty", "thirty"},
83 {"e1ghty", "eighty"},
84 {"n1nety", "ninety"},
85 {"fourty", "forty"},
86 {"f0urty", "forty"},
87 {"e1ght", "eight"},
88 {"f0rty", "forty"},
89 {"f1fty", "fifty"},
90 {"s1xty", "sixty"},
91 {"zer0", "zero"},
92 {"f0ur", "four"},
93 {"f1ve", "five"},
94 {"n1ne", "nine"},
95 {"0ne", "one"},
96 {"tw0", "two"},
97 {"s1x", "six"}},
98 // mixed compound numeral words
99 // consider 7teen, etc.
100 {{"twenty[\\W_]{0,3}1", "twenty-one"},
101 {"twenty[\\W_]{0,3}2", "twenty-two"},
102 {"twenty[\\W_]{0,3}3", "twenty-three"},
103 {"twenty[\\W_]{0,3}4", "twenty-four"},
104 {"twenty[\\W_]{0,3}5", "twenty-five"},
105 {"twenty[\\W_]{0,3}6", "twenty-six"},
106 {"twenty[\\W_]{0,3}7", "twenty-seven"},
107 {"twenty[\\W_]{0,3}8", "twenty-eight"},
108 {"twenty[\\W_]{0,3}9", "twenty-nine"},
109 {"thirty[\\W_]{0,3}1", "thirty-one"},
110 {"thirty[\\W_]{0,3}2", "thirty-two"},
111 {"thirty[\\W_]{0,3}3", "thirty-three"},
112 {"thirty[\\W_]{0,3}4", "thirty-four"},
113 {"thirty[\\W_]{0,3}5", "thirty-five"},
114 {"thirty[\\W_]{0,3}6", "thirty-six"},
115 {"thirty[\\W_]{0,3}7", "thirty-seven"},
116 {"thirty[\\W_]{0,3}8", "thirty-eight"},
117 {"thirty[\\W_]{0,3}9", "thirty-nine"},
118 {"forty[\\W_]{0,3}1", "forty-one"},
119 {"forty[\\W_]{0,3}2", "forty-two"},
120 {"forty[\\W_]{0,3}3", "forty-three"},
121 {"forty[\\W_]{0,3}4", "forty-four"},
122 {"forty[\\W_]{0,3}5", "forty-five"},
123 {"forty[\\W_]{0,3}6", "forty-six"},
124 {"forty[\\W_]{0,3}7", "forty-seven"},
125 {"forty[\\W_]{0,3}8", "forty-eight"},
126 {"forty[\\W_]{0,3}9", "forty-nine"},
127 {"fifty[\\W_]{0,3}1", "fifty-one"},
128 {"fifty[\\W_]{0,3}2", "fifty-two"},
129 {"fifty[\\W_]{0,3}3", "fifty-three"},
130 {"fifty[\\W_]{0,3}4", "fifty-four"},
131 {"fifty[\\W_]{0,3}5", "fifty-five"},
132 {"fifty[\\W_]{0,3}6", "fifty-six"},
133 {"fifty[\\W_]{0,3}7", "fifty-seven"},
134 {"fifty[\\W_]{0,3}8", "fifty-eight"},
135 {"fifty[\\W_]{0,3}9", "fifty-nine"},
136 {"sixty[\\W_]{0,3}1", "sixty-one"},
137 {"sixty[\\W_]{0,3}2", "sixty-two"},
138 {"sixty[\\W_]{0,3}3", "sixty-three"},
139 {"sixty[\\W_]{0,3}4", "sixty-four"},
140 {"sixty[\\W_]{0,3}5", "sixty-five"},
141 {"sixty[\\W_]{0,3}6", "sixty-six"},
142 {"sixty[\\W_]{0,3}7", "sixty-seven"},
143 {"sixty[\\W_]{0,3}8", "sixty-eight"},
144 {"sixty[\\W_]{0,3}9", "sixty-nine"},
145 {"seventy[\\W_]{0,3}1", "seventy-one"},
146 {"seventy[\\W_]{0,3}2", "seventy-two"},
147 {"seventy[\\W_]{0,3}3", "seventy-three"},
148 {"seventy[\\W_]{0,3}4", "seventy-four"},
149 {"seventy[\\W_]{0,3}5", "seventy-five"},
150 {"seventy[\\W_]{0,3}6", "seventy-six"},
151 {"seventy[\\W_]{0,3}7", "seventy-seven"},
152 {"seventy[\\W_]{0,3}8", "seventy-eight"},
153 {"seventy[\\W_]{0,3}9", "seventy-nine"},
154 {"eighty[\\W_]{0,3}1", "eighty-one"},
155 {"eighty[\\W_]{0,3}2", "eighty-two"},
156 {"eighty[\\W_]{0,3}3", "eighty-three"},
157 {"eighty[\\W_]{0,3}4", "eighty-four"},
158 {"eighty[\\W_]{0,3}5", "eighty-five"},
159 {"eighty[\\W_]{0,3}6", "eighty-six"},
160 {"eighty[\\W_]{0,3}7", "eighty-seven"},
161 {"eighty[\\W_]{0,3}8", "eighty-eight"},
162 {"eighty[\\W_]{0,3}9", "eighty-nine"},
163 {"ninety[\\W_]{0,3}1", "ninety-one"},
164 {"ninety[\\W_]{0,3}2", "ninety-two"},
165 {"ninety[\\W_]{0,3}3", "ninety-three"},
166 {"ninety[\\W_]{0,3}4", "ninety-four"},
167 {"ninety[\\W_]{0,3}5", "ninety-five"},
168 {"ninety[\\W_]{0,3}6", "ninety-six"},
169 {"ninety[\\W_]{0,3}7", "ninety-seven"},
170 {"ninety[\\W_]{0,3}8", "ninety-eight"},
171 {"ninety[\\W_]{0,3}9", "ninety-nine"}},
172 // now resolve compound numeral words
173 {{"twenty-one", "21"},
174 {"twenty-two", "22"},
175 {"twenty-three", "23"},
176 {"twenty-four", "24"},
177 {"twenty-five", "25"},
178 {"twenty-six", "26"},
179 {"twenty-seven", "27"},
180 {"twenty-eight", "28"},
181 {"twenty-nine", "29"},
182 {"thirty-one", "31"},
183 {"thirty-two", "32"},
184 {"thirty-three", "33"},
185 {"thirty-four", "34"},
186 {"thirty-five", "35"},
187 {"thirty-six", "36"},
188 {"thirty-seven", "37"},
189 {"thirty-eight", "38"},
190 {"thirty-nine", "39"},
191 {"forty-one", "41"},
192 {"forty-two", "42"},
193 {"forty-three", "43"},
194 {"forty-four", "44"},
195 {"forty-five", "45"},
196 {"forty-six", "46"},
197 {"forty-seven", "47"},
198 {"forty-eight", "48"},
199 {"forty-nine", "49"},
200 {"fifty-one", "51"},
201 {"fifty-two", "52"},
202 {"fifty-three", "53"},
203 {"fifty-four", "54"},
204 {"fifty-five", "55"},
205 {"fifty-six", "56"},
206 {"fifty-seven", "57"},
207 {"fifty-eight", "58"},
208 {"fifty-nine", "59"},
209 {"sixty-one", "61"},
210 {"sixty-two", "62"},
211 {"sixty-three", "63"},
212 {"sixty-four", "64"},
213 {"sixty-five", "65"},
214 {"sixty-six", "66"},
215 {"sixty-seven", "67"},
216 {"sixty-eight", "68"},
217 {"sixty-nine", "69"},
218 {"seventy-one", "71"},
219 {"seventy-two", "72"},
220 {"seventy-three", "73"},
221 {"seventy-four", "74"},
222 {"seventy-five", "75"},
223 {"seventy-six", "76"},
224 {"seventy-seven", "77"},
225 {"seventy-eight", "78"},
226 {"seventy-nine", "79"},
227 {"eighty-one", "81"},
228 {"eighty-two", "82"},
229 {"eighty-three", "83"},
230 {"eighty-four", "84"},
231 {"eighty-five", "85"},
232 {"eighty-six", "86"},
233 {"eighty-seven", "87"},
234 {"eighty-eight", "88"},
235 {"eighty-nine", "89"},
236 {"ninety-one", "91"},
237 {"ninety-two", "92"},
238 {"ninety-three", "93"},
239 {"ninety-four", "94"},
240 {"ninety-five", "95"},
241 {"ninety-six", "96"},
242 {"ninety-seven", "97"},
243 {"ninety-eight", "98"},
244 {"ninety-nine", "99"}},
245 // larger units function as suffixes now
246 // assume never have three hundred four, three hundred and four
247 {{"hundred", "00"},
248 {"thousand", "000"}},
249 // single numeral words now
250 // some would have been ambiguous
251 {{"seventeen", "17"},
252 {"thirteen", "13"},
253 {"fourteen", "14"},
254 {"eighteen", "18"},
255 {"nineteen", "19"},
256 {"fifteen", "15"},
257 {"sixteen", "16"},
258 {"seventy", "70"},
259 {"eleven", "11"},
260 {"twelve", "12"},
261 {"twenty", "20"},
262 {"thirty", "30"},
263 {"eighty", "80"},
264 {"ninety", "90"},
265 {"three", "3"},
266 {"seven", "7"},
267 {"eight", "8"},
268 {"forty", "40"},
269 {"fifty", "50"},
270 {"sixty", "60"},
271 {"zero", "0"},
272 {"four", "4"},
273 {"five", "5"},
274 {"nine", "9"},
275 {"one", "1"},
276 {"two", "2"},
277 {"six", "6"},
278 {"ten", "10"}},
279 // now do letter for digit substitutions
280 {{"oh", "0"},
281 {"o", "0"},
282 {"i", "1"},
283 {"l", "1"}}
284 };
285 }
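CleanPhoneText above normalizes spelled-out and obfuscated digits through a cascade of substitutions, strips everything that is not a digit, and then matches a strict NANP-style pattern (area code and exchange each start with 2-9). A simplified stdlib sketch of that pipeline, using a small illustrative subset of the class's substitution table and omitting its *82-overlap handling:

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Locale;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

public class PhoneCleanSketch {
    // Same strict pattern as the class above: 10 digits, positions 1 and 4 in 2-9.
    static final Pattern CLEAN_PHONE = Pattern.compile("([2-9]\\d{2}[2-9]\\d{6})");

    // Illustrative subset of the numeral-word substitutions.
    static final String[][] SUBS = {
        {"five", "5"}, {"four", "4"}, {"three", "3"}, {"two", "2"}, {"one", "1"},
    };

    static String clean(String text) {
        text = text.toLowerCase(Locale.ROOT);
        for (String[] sub : SUBS) {
            text = text.replaceAll(sub[0], sub[1]);
        }
        return text.replaceAll("\\D", "");   // drop everything but digits
    }

    static List<String> extractPhoneNumbers(String text) {
        List<String> numbers = new ArrayList<>();
        Matcher m = CLEAN_PHONE.matcher(clean(text));
        while (m.find()) {
            numbers.add(m.group(1));
        }
        return numbers;
    }

    public static void main(String[] args) {
        System.out.println(extractPhoneNumbers("call two one two - 555 - one 2 3 four"));
        // prints [2125551234]
    }
}
```

As in the full table, substitution order matters: longer words must be replaced before shorter ones they contain (e.g. "fourteen" before "four"), which is why the class groups its substitutions into ordered passes.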
0 package org.apache.tika.sax;
1
2 /*
3 * Licensed to the Apache Software Foundation (ASF) under one or more
4 * contributor license agreements. See the NOTICE file distributed with
5 * this work for additional information regarding copyright ownership.
6 * The ASF licenses this file to You under the Apache License, Version 2.0
7 * (the "License"); you may not use this file except in compliance with
8 * the License. You may obtain a copy of the License at
9 *
10 * http://www.apache.org/licenses/LICENSE-2.0
11 *
12 * Unless required by applicable law or agreed to in writing, software
13 * distributed under the License is distributed on an "AS IS" BASIS,
14 * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
15 * See the License for the specific language governing permissions and
16 * limitations under the License.
17 */
18
19 import org.xml.sax.ContentHandler;
20
21 import java.io.OutputStream;
22 import java.io.UnsupportedEncodingException;
23
24 /**
25 * Interface to allow easier injection of code for getting a new ContentHandler
26 */
27 public interface ContentHandlerFactory {
28 public ContentHandler getNewContentHandler();
29 public ContentHandler getNewContentHandler(OutputStream os, String encoding) throws UnsupportedEncodingException;
30
31 }
0 /*
1 * Licensed to the Apache Software Foundation (ASF) under one or more
2 * contributor license agreements. See the NOTICE file distributed with
3 * this work for additional information regarding copyright ownership.
4 * The ASF licenses this file to You under the Apache License, Version 2.0
5 * (the "License"); you may not use this file except in compliance with
6 * the License. You may obtain a copy of the License at
7 *
8 * http://www.apache.org/licenses/LICENSE-2.0
9 *
10 * Unless required by applicable law or agreed to in writing, software
11 * distributed under the License is distributed on an "AS IS" BASIS,
12 * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
13 * See the License for the specific language governing permissions and
14 * limitations under the License.
15 */
16
17 package org.apache.tika.sax;
18
19 import org.apache.tika.metadata.Metadata;
20 import org.apache.tika.sax.CleanPhoneText;
21 import org.apache.tika.sax.ContentHandlerDecorator;
22 import org.xml.sax.ContentHandler;
23 import org.xml.sax.SAXException;
24 import org.xml.sax.helpers.DefaultHandler;
25
26 import java.util.Arrays;
27 import java.util.List;
28
29 /**
30 * Class used to extract phone numbers while parsing.
31 *
32 * Every time a document is parsed in Tika, the content is split into SAX events.
33 * Those SAX events are handled by a ContentHandler. You can think of these events
34 * as marking a tag in an HTML file. Once you're finished parsing, you can call
35 * handler.toString(), for example, to get the text contents of the file. On the other
36 * hand, any of the metadata of the file will be added to the Metadata object passed
37 * in during the parse() call. So, the Parser class sends metadata to the Metadata
38 * object and content to the ContentHandler.
39 *
40 * This class is an example of how to combine a ContentHandler and a Metadata
41 * object. As content is passed to the handler, it is buffered in this class;
42 * once the document has been fully parsed, the buffered text is scanned for
43 * phone numbers, and each match is added to the metadata under the key
44 * "phonenumbers". So, if you used this ContentHandler when you parsed a
45 * document, then called metadata.getValues("phonenumbers"), you would get an
46 * array of Strings of phone numbers found in the document.
47 *
48 * Please see the PhoneExtractingContentHandlerTest for an example of how to use
49 * this class.
50 *
51 */
52 public class PhoneExtractingContentHandler extends ContentHandlerDecorator {
53 private Metadata metadata;
54 private static final String PHONE_NUMBERS = "phonenumbers";
55 private StringBuilder stringBuilder;
56
57 /**
58 * Creates a decorator for the given SAX event handler and Metadata object.
59 *
60 * @param handler SAX event handler to be decorated
61 * @param metadata Metadata object to which extracted phone numbers are added
61 */
62 public PhoneExtractingContentHandler(ContentHandler handler, Metadata metadata) {
63 super(handler);
64 this.metadata = metadata;
65 this.stringBuilder = new StringBuilder();
66 }
67
68 /**
69 * Creates a decorator that by default forwards incoming SAX events to
70 * a dummy content handler that simply ignores all the events. Subclasses
71 * should use the {@link #setContentHandler(ContentHandler)} method to
72 * switch to a more usable underlying content handler.
73 * Also creates a dummy Metadata object to store phone numbers in.
74 */
75 protected PhoneExtractingContentHandler() {
76 this(new DefaultHandler(), new Metadata());
77 }
78
79 /**
80 * The characters method is called whenever a Parser wants to pass raw
81 * characters to the ContentHandler. Phone numbers are sometimes split
82 * across different calls to characters, depending on the specific Parser
83 * used, so we append all characters to a StringBuilder and analyze the
84 * text once the document is finished.
85 */
86 @Override
87 public void characters(char[] ch, int start, int length) throws SAXException {
88 try {
89 String text = new String(ch, start, length);
90 stringBuilder.append(text);
91 super.characters(ch, start, length);
92 } catch (SAXException e) {
93 handleException(e);
94 }
95 }
96
97
98 /**
99 * This method is called whenever the Parser is done parsing the file. So,
100 * we check the output for any phone numbers.
101 */
102 @Override
103 public void endDocument() throws SAXException {
104 super.endDocument();
105 List<String> numbers = CleanPhoneText.extractPhoneNumbers(stringBuilder.toString());
106 for (String number : numbers) {
107 metadata.add(PHONE_NUMBERS, number);
108 }
109 }
110 }
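The CleanPhoneText.extractPhoneNumbers method that does the actual matching is not shown in this diff. As a rough, purely illustrative sketch of the kind of matching such a helper performs (the class name and pattern below are assumptions, not Tika's implementation), a regex-based extractor might look like:

```java
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

public class PhoneSketch {

    // Illustrative pattern only -- the real CleanPhoneText logic is more
    // involved. Matches forms like 555-123-4567 or (800) 555-0199.
    private static final Pattern PHONE =
            Pattern.compile("\\(?\\d{3}\\)?[-. ]?\\d{3}[-. ]?\\d{4}");

    public static List<String> extractPhoneNumbers(String text) {
        List<String> numbers = new ArrayList<String>();
        Matcher m = PHONE.matcher(text);
        while (m.find()) {
            numbers.add(m.group());
        }
        return numbers;
    }
}
```

The handler above would call such a method once from endDocument(), over the full buffered text, so numbers split across characters() calls are still found.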
0 /*
1 * Licensed to the Apache Software Foundation (ASF) under one or more
2 * contributor license agreements. See the NOTICE file distributed with
3 * this work for additional information regarding copyright ownership.
4 * The ASF licenses this file to You under the Apache License, Version 2.0
5 * (the "License"); you may not use this file except in compliance with
6 * the License. You may obtain a copy of the License at
7 *
8 * http://www.apache.org/licenses/LICENSE-2.0
9 *
10 * Unless required by applicable law or agreed to in writing, software
11 * distributed under the License is distributed on an "AS IS" BASIS,
12 * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
13 * See the License for the specific language governing permissions and
14 * limitations under the License.
15 */
16 package org.apache.tika.sax;
17
18 import java.io.OutputStream;
19 import java.io.UnsupportedEncodingException;
20 import java.util.Arrays;
21 import java.util.HashSet;
22 import java.util.Set;
23
24 import org.xml.sax.SAXException;
25
26 /**
27 * SAX event handler that serializes the HTML document to a character stream.
28 * The incoming SAX events are expected to be well-formed (properly nested,
29 * etc.) and valid HTML.
30 *
31 * @since Apache Tika 0.10
32 */
33 public class ToHTMLContentHandler extends ToXMLContentHandler {
34
35 private static final Set<String> EMPTY_ELEMENTS =
36 new HashSet<String>(Arrays.asList(
37 "area", "base", "basefont", "br", "col", "frame", "hr",
38 "img", "input", "isindex", "link", "meta", "param"));
39
40 public ToHTMLContentHandler(OutputStream stream, String encoding)
41 throws UnsupportedEncodingException {
42 super(stream, encoding);
43 }
44
45 public ToHTMLContentHandler() {
46 super();
47 }
48
49 @Override
50 public void startDocument() throws SAXException {
51 }
52
53 @Override
54 public void endElement(String uri, String localName, String qName)
55 throws SAXException {
56 if (inStartElement) {
57 write('>');
58 inStartElement = false;
59
60 if (EMPTY_ELEMENTS.contains(localName)) {
61 namespaces.clear();
62 return;
63 }
64 }
65
66 super.endElement(uri, localName, qName);
67 }
68
69 }
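The only behavioral difference from the XML serializer is the treatment of HTML void elements: their start tag is closed with a plain '>' and no end tag is emitted. A minimal self-contained sketch of that rule (class and method names are illustrative, not part of Tika):

```java
import java.util.Arrays;
import java.util.HashSet;
import java.util.Set;

public class VoidElementSketch {

    // Same element list as EMPTY_ELEMENTS above.
    private static final Set<String> VOID = new HashSet<String>(Arrays.asList(
            "area", "base", "basefont", "br", "col", "frame", "hr",
            "img", "input", "isindex", "link", "meta", "param"));

    // An HTML serializer emits just "<br>", where an XML serializer would
    // emit "<br />"; non-void empty elements get an explicit end tag.
    public static String serializeEmpty(String localName) {
        return VOID.contains(localName)
                ? "<" + localName + ">"
                : "<" + localName + "></" + localName + ">";
    }
}
```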
0 /*
1 * Licensed to the Apache Software Foundation (ASF) under one or more
2 * contributor license agreements. See the NOTICE file distributed with
3 * this work for additional information regarding copyright ownership.
4 * The ASF licenses this file to You under the Apache License, Version 2.0
5 * (the "License"); you may not use this file except in compliance with
6 * the License. You may obtain a copy of the License at
7 *
8 * http://www.apache.org/licenses/LICENSE-2.0
9 *
10 * Unless required by applicable law or agreed to in writing, software
11 * distributed under the License is distributed on an "AS IS" BASIS,
12 * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
13 * See the License for the specific language governing permissions and
14 * limitations under the License.
15 */
16 package org.apache.tika.sax;
17
18 import java.io.IOException;
19 import java.io.OutputStream;
20 import java.io.OutputStreamWriter;
21 import java.io.StringWriter;
22 import java.io.UnsupportedEncodingException;
23 import java.io.Writer;
24
25 import org.xml.sax.SAXException;
26 import org.xml.sax.helpers.DefaultHandler;
27
28 /**
29 * SAX event handler that writes all character content out to a character
30 * stream. No escaping or other transformations are made on the character
31 * content.
32 *
33 * @since Apache Tika 0.10
34 */
35 public class ToTextContentHandler extends DefaultHandler {
36
37 /**
38 * The character stream.
39 */
40 private final Writer writer;
41
42 /**
43 * Creates a content handler that writes character events to
44 * the given writer.
45 *
46 * @param writer writer
47 */
48 public ToTextContentHandler(Writer writer) {
49 this.writer = writer;
50 }
51
52 /**
53 * Creates a content handler that writes character events to
54 * the given output stream using the platform default encoding.
55 *
56 * @param stream output stream
57 */
58 public ToTextContentHandler(OutputStream stream) {
59 this(new OutputStreamWriter(stream));
60 }
61
62 /**
63 * Creates a content handler that writes character events to
64 * the given output stream using the given encoding.
65 *
66 * @param stream output stream
67 * @param encoding output encoding
68 * @throws UnsupportedEncodingException if the encoding is unsupported
69 */
70 public ToTextContentHandler(OutputStream stream, String encoding)
71 throws UnsupportedEncodingException {
72 this(new OutputStreamWriter(stream, encoding));
73 }
74
75 /**
76 * Creates a content handler that writes character events
77 * to an internal string buffer. Use the {@link #toString()}
78 * method to access the collected character content.
79 */
80 public ToTextContentHandler() {
81 this(new StringWriter());
82 }
83
84 /**
85 * Writes the given characters to the given character stream.
86 */
87 @Override
88 public void characters(char[] ch, int start, int length)
89 throws SAXException {
90 try {
91 writer.write(ch, start, length);
92 } catch (IOException e) {
93 throw new SAXException(
94 "Error writing: " + new String(ch, start, length), e);
95 }
96 }
97
98
99 /**
100 * Writes the given ignorable characters to the given character stream.
101 * The default implementation simply forwards the call to the
102 * {@link #characters(char[], int, int)} method.
103 */
104 @Override
105 public void ignorableWhitespace(char[] ch, int start, int length)
106 throws SAXException {
107 characters(ch, start, length);
108 }
109
110 /**
111 * Flushes the character stream so that no characters are forgotten
112 * in internal buffers.
113 *
114 * @see <a href="https://issues.apache.org/jira/browse/TIKA-179">TIKA-179</a>
115 * @throws SAXException if the stream can not be flushed
116 */
117 @Override
118 public void endDocument() throws SAXException {
119 try {
120 writer.flush();
121 } catch (IOException e) {
122 throw new SAXException("Error flushing character output", e);
123 }
124 }
125
126 /**
127 * Returns the contents of the internal string buffer where
128 * all the received characters have been collected. Only works
129 * when this object was constructed using the empty default
130 * constructor or by passing a {@link StringWriter} to the
131 * other constructor.
132 */
133 @Override
134 public String toString() {
135 return writer.toString();
136 }
137
138 }
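The same accumulate-characters pattern can be exercised with the JDK's built-in SAX parser. This self-contained sketch (class name is illustrative) collects the text content of an XML snippet the way ToTextContentHandler does for Tika parse events:

```java
import java.io.ByteArrayInputStream;
import java.io.StringWriter;
import java.nio.charset.StandardCharsets;
import javax.xml.parsers.SAXParser;
import javax.xml.parsers.SAXParserFactory;
import org.xml.sax.helpers.DefaultHandler;

public class TextCollector extends DefaultHandler {

    private final StringWriter writer = new StringWriter();

    @Override
    public void characters(char[] ch, int start, int length) {
        // No escaping or other transformation, as in ToTextContentHandler.
        writer.write(ch, start, length);
    }

    @Override
    public String toString() {
        return writer.toString();
    }

    public static String textOf(String xml) {
        try {
            SAXParser parser = SAXParserFactory.newInstance().newSAXParser();
            TextCollector collector = new TextCollector();
            parser.parse(new ByteArrayInputStream(
                    xml.getBytes(StandardCharsets.UTF_8)), collector);
            return collector.toString();
        } catch (Exception e) {
            throw new RuntimeException(e);
        }
    }
}
```

Markup is discarded and only character events survive, which is why Tika's plain-text output comes from exactly this handler family.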
0 /*
1 * Licensed to the Apache Software Foundation (ASF) under one or more
2 * contributor license agreements. See the NOTICE file distributed with
3 * this work for additional information regarding copyright ownership.
4 * The ASF licenses this file to You under the Apache License, Version 2.0
5 * (the "License"); you may not use this file except in compliance with
6 * the License. You may obtain a copy of the License at
7 *
8 * http://www.apache.org/licenses/LICENSE-2.0
9 *
10 * Unless required by applicable law or agreed to in writing, software
11 * distributed under the License is distributed on an "AS IS" BASIS,
12 * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
13 * See the License for the specific language governing permissions and
14 * limitations under the License.
15 */
16 package org.apache.tika.sax;
17
18 import java.io.IOException;
19 import java.io.OutputStream;
20 import java.io.OutputStreamWriter;
21 import java.io.StringWriter;
22 import java.io.UnsupportedEncodingException;
23 import java.io.Writer;
24 import java.nio.charset.Charset;
25
26 import org.xml.sax.SAXException;
27 import org.xml.sax.helpers.DefaultHandler;
28
29 /**
30 * SAX event handler that writes all character content out to a character
31 * stream. No escaping or other transformations are made on the character
32 * content.
33 *
34 * @since Apache Tika 0.10
35 */
36 public class ToTextContentHandler extends DefaultHandler {
37
38 /**
39 * The character stream.
40 */
41 private final Writer writer;
42
43 /**
44 * Creates a content handler that writes character events to
45 * the given writer.
46 *
47 * @param writer writer
48 */
49 public ToTextContentHandler(Writer writer) {
50 this.writer = writer;
51 }
52
53 /**
54 * Creates a content handler that writes character events to
55 * the given output stream using the platform default encoding.
56 *
57 * @param stream output stream
58 */
59 public ToTextContentHandler(OutputStream stream) {
60 this(new OutputStreamWriter(stream, Charset.defaultCharset()));
61 }
62
63 /**
64 * Creates a content handler that writes character events to
65 * the given output stream using the given encoding.
66 *
67 * @param stream output stream
68 * @param encoding output encoding
69 * @throws UnsupportedEncodingException if the encoding is unsupported
70 */
71 public ToTextContentHandler(OutputStream stream, String encoding)
72 throws UnsupportedEncodingException {
73 this(new OutputStreamWriter(stream, encoding));
74 }
75
76 /**
77 * Creates a content handler that writes character events
78 * to an internal string buffer. Use the {@link #toString()}
79 * method to access the collected character content.
80 */
81 public ToTextContentHandler() {
82 this(new StringWriter());
83 }
84
85 /**
86 * Writes the given characters to the given character stream.
87 */
88 @Override
89 public void characters(char[] ch, int start, int length)
90 throws SAXException {
91 try {
92 writer.write(ch, start, length);
93 } catch (IOException e) {
94 throw new SAXException(
95 "Error writing: " + new String(ch, start, length), e);
96 }
97 }
98
99
100 /**
101 * Writes the given ignorable characters to the given character stream.
102 * The default implementation simply forwards the call to the
103 * {@link #characters(char[], int, int)} method.
104 */
105 @Override
106 public void ignorableWhitespace(char[] ch, int start, int length)
107 throws SAXException {
108 characters(ch, start, length);
109 }
110
111 /**
112 * Flushes the character stream so that no characters are forgotten
113 * in internal buffers.
114 *
115 * @see <a href="https://issues.apache.org/jira/browse/TIKA-179">TIKA-179</a>
116 * @throws SAXException if the stream can not be flushed
117 */
118 @Override
119 public void endDocument() throws SAXException {
120 try {
121 writer.flush();
122 } catch (IOException e) {
123 throw new SAXException("Error flushing character output", e);
124 }
125 }
126
127 /**
128 * Returns the contents of the internal string buffer where
129 * all the received characters have been collected. Only works
130 * when this object was constructed using the empty default
131 * constructor or by passing a {@link StringWriter} to the
132 * other constructor.
133 */
134 @Override
135 public String toString() {
136 return writer.toString();
137 }
138
139 }
0 /*
1 * Licensed to the Apache Software Foundation (ASF) under one or more
2 * contributor license agreements. See the NOTICE file distributed with
3 * this work for additional information regarding copyright ownership.
4 * The ASF licenses this file to You under the Apache License, Version 2.0
5 * (the "License"); you may not use this file except in compliance with
6 * the License. You may obtain a copy of the License at
7 *
8 * http://www.apache.org/licenses/LICENSE-2.0
9 *
10 * Unless required by applicable law or agreed to in writing, software
11 * distributed under the License is distributed on an "AS IS" BASIS,
12 * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
13 * See the License for the specific language governing permissions and
14 * limitations under the License.
15 */
16 package org.apache.tika.sax;
17
18 import java.io.OutputStream;
19 import java.io.UnsupportedEncodingException;
20 import java.util.Collections;
21 import java.util.HashMap;
22 import java.util.Map;
23
24 import org.xml.sax.Attributes;
25 import org.xml.sax.SAXException;
26
27 /**
28 * SAX event handler that serializes the XML document to a character stream.
29 * The incoming SAX events are expected to be well-formed (properly nested,
30 * etc.) and to explicitly include namespace declaration attributes and
31 * corresponding namespace prefixes in element and attribute names.
32 *
33 * @since Apache Tika 0.10
34 */
35 public class ToXMLContentHandler extends ToTextContentHandler {
36
37 private static class ElementInfo {
38
39 private final ElementInfo parent;
40
41 private final Map<String, String> namespaces;
42
43 public ElementInfo(ElementInfo parent, Map<String, String> namespaces) {
44 this.parent = parent;
45 if (namespaces.isEmpty()) {
46 this.namespaces = Collections.emptyMap();
47 } else {
48 this.namespaces = new HashMap<String, String>(namespaces);
49 }
50 }
51
52 public String getPrefix(String uri) throws SAXException {
53 String prefix = namespaces.get(uri);
54 if (prefix != null) {
55 return prefix;
56 } else if (parent != null) {
57 return parent.getPrefix(uri);
58 } else if (uri == null || uri.length() == 0) {
59 return "";
60 } else {
61 throw new SAXException("Namespace " + uri + " not declared");
62 }
63 }
64
65 public String getQName(String uri, String localName)
66 throws SAXException {
67 String prefix = getPrefix(uri);
68 if (prefix.length() > 0) {
69 return prefix + ":" + localName;
70 } else {
71 return localName;
72 }
73 }
74
75 }
76
77 private final String encoding;
78
79 protected boolean inStartElement = false;
80
81 protected final Map<String, String> namespaces =
82 new HashMap<String, String>();
83
84 private ElementInfo currentElement;
85
86 /**
87 * Creates an XML serializer that writes to the given byte stream
88 * using the given character encoding.
89 *
90 * @param stream output stream
91 * @param encoding output encoding
92 * @throws UnsupportedEncodingException if the encoding is unsupported
93 */
94 public ToXMLContentHandler(OutputStream stream, String encoding)
95 throws UnsupportedEncodingException {
96 super(stream, encoding);
97 this.encoding = encoding;
98 }
99
100 public ToXMLContentHandler(String encoding) {
101 super();
102 this.encoding = encoding;
103 }
104
105 public ToXMLContentHandler() {
106 super();
107 this.encoding = null;
108 }
109
110 /**
111 * Writes the XML prefix.
112 */
113 @Override
114 public void startDocument() throws SAXException {
115 if (encoding != null) {
116 write("<?xml version=\"1.0\" encoding=\"");
117 write(encoding);
118 write("\"?>\n");
119 }
120
121 currentElement = null;
122 namespaces.clear();
123 }
124
125 @Override
126 public void startPrefixMapping(String prefix, String uri)
127 throws SAXException {
128 try {
129 if (currentElement != null
130 && prefix.equals(currentElement.getPrefix(uri))) {
131 return;
132 }
133 } catch (SAXException ignore) {
134 }
135 namespaces.put(uri, prefix);
136 }
137
138 @Override
139 public void startElement(
140 String uri, String localName, String qName, Attributes atts)
141 throws SAXException {
142 lazyCloseStartElement();
143
144 currentElement = new ElementInfo(currentElement, namespaces);
145
146 write('<');
147 write(currentElement.getQName(uri, localName));
148
149 for (int i = 0; i < atts.getLength(); i++) {
150 write(' ');
151 write(currentElement.getQName(atts.getURI(i), atts.getLocalName(i)));
152 write('=');
153 write('"');
154 char[] ch = atts.getValue(i).toCharArray();
155 writeEscaped(ch, 0, ch.length, true);
156 write('"');
157 }
158
159 for (Map.Entry<String, String> entry : namespaces.entrySet()) {
160 write(' ');
161 write("xmlns");
162 String prefix = entry.getValue();
163 if (prefix.length() > 0) {
164 write(':');
165 write(prefix);
166 }
167 write('=');
168 write('"');
169 char[] ch = entry.getKey().toCharArray();
170 writeEscaped(ch, 0, ch.length, true);
171 write('"');
172 }
173 namespaces.clear();
174
175 inStartElement = true;
176 }
177
178 @Override
179 public void endElement(String uri, String localName, String qName)
180 throws SAXException {
181 if (inStartElement) {
182 write(" />");
183 inStartElement = false;
184 } else {
185 write("</");
186 write(qName);
187 write('>');
188 }
189
190 namespaces.clear();
191
192 // Reset the position in the tree, to avoid endless stack overflow
193 // chains (see TIKA-1070)
194 currentElement = currentElement.parent;
195 }
196
197 @Override
198 public void characters(char[] ch, int start, int length)
199 throws SAXException {
200 lazyCloseStartElement();
201 writeEscaped(ch, start, start + length, false);
202 }
203
204 private void lazyCloseStartElement() throws SAXException {
205 if (inStartElement) {
206 write('>');
207 inStartElement = false;
208 }
209 }
210
211 /**
212 * Writes the given character as-is.
213 *
214 * @param ch character to be written
215 * @throws SAXException if the character could not be written
216 */
217 protected void write(char ch) throws SAXException {
218 super.characters(new char[] { ch }, 0, 1);
219 }
220
221 /**
222 * Writes the given string of characters as-is.
223 *
224 * @param string string of characters to be written
225 * @throws SAXException if the character string could not be written
226 */
227 protected void write(String string) throws SAXException {
228 super.characters(string.toCharArray(), 0, string.length());
229 }
230
231 /**
232 * Writes the given characters as-is followed by the given entity.
233 *
234 * @param ch character array
235 * @param from start position in the array
236 * @param to end position in the array
237 * @param entity entity code
238 * @return next position in the array,
239 * after the characters plus one entity
240 * @throws SAXException if the characters could not be written
241 */
242 private int writeCharsAndEntity(char[] ch, int from, int to, String entity)
243 throws SAXException {
244 super.characters(ch, from, to - from);
245 write('&');
246 write(entity);
247 write(';');
248 return to + 1;
249 }
250
251 /**
252 * Writes the given characters with XML meta characters escaped.
253 *
254 * @param ch character array
255 * @param from start position in the array
256 * @param to end position in the array
257 * @param attribute whether the characters should be escaped as
258 * an attribute value or normal character content
259 * @throws SAXException if the characters could not be written
260 */
261 private void writeEscaped(char[] ch, int from, int to, boolean attribute)
262 throws SAXException {
263 int pos = from;
264 while (pos < to) {
265 if (ch[pos] == '<') {
266 from = pos = writeCharsAndEntity(ch, from, pos, "lt");
267 } else if (ch[pos] == '>') {
268 from = pos = writeCharsAndEntity(ch, from, pos, "gt");
269 } else if (ch[pos] == '&') {
270 from = pos = writeCharsAndEntity(ch, from, pos, "amp");
271 } else if (attribute && ch[pos] == '"') {
272 from = pos = writeCharsAndEntity(ch, from, pos, "quot");
273 } else {
274 pos++;
275 }
276 }
277 super.characters(ch, from, to - from);
278 }
279
280 }
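The writeEscaped logic above amounts to a pure string function. This standalone sketch (class and method names are illustrative) applies the same four substitutions, escaping the double quote only in attribute context:

```java
public class XmlEscapeSketch {

    // Mirrors writeEscaped: '<', '>', and '&' are always escaped;
    // '"' only when serializing an attribute value.
    public static String escape(String text, boolean attribute) {
        StringBuilder sb = new StringBuilder();
        for (int i = 0; i < text.length(); i++) {
            char c = text.charAt(i);
            if (c == '<') {
                sb.append("&lt;");
            } else if (c == '>') {
                sb.append("&gt;");
            } else if (c == '&') {
                sb.append("&amp;");
            } else if (attribute && c == '"') {
                sb.append("&quot;");
            } else {
                sb.append(c);
            }
        }
        return sb.toString();
    }
}
```

The handler's version avoids the intermediate buffer by writing unescaped runs directly and splicing entities in between, but the substitutions are the same.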
2020 import java.io.Serializable;
2121 import java.io.StringWriter;
2222 import java.io.Writer;
23 import java.nio.charset.Charset;
2324 import java.util.UUID;
2425
2526 import org.xml.sax.ContentHandler;
8990 * @param stream output stream
9091 */
9192 public WriteOutContentHandler(OutputStream stream) {
92 this(new OutputStreamWriter(stream));
93 this(new OutputStreamWriter(stream, Charset.defaultCharset()));
9394 }
9495
9596 /**
5959 * skip them if they get sent to startElement/endElement by mistake.
6060 */
6161 private static final Set<String> AUTO =
62 unmodifiableSet("html", "head", "body", "frameset");
62 unmodifiableSet("html", "head", "frameset");
6363
6464 /**
6565 * The elements that get prepended with the {@link #TAB} character.
7979 */
8080 public static String formatDateUnknownTimezone(Date date) {
8181 // Create the Calendar object in the system timezone
82 Calendar calendar = GregorianCalendar.getInstance(Locale.US);
82 Calendar calendar = GregorianCalendar.getInstance(TimeZone.getDefault(), Locale.US);
8383 calendar.setTime(date);
8484 // Have it formatted
8585 String formatted = formatDate(calendar);
8888 }
8989 private static String doFormatDate(Calendar calendar) {
9090 return String.format(
91 Locale.ROOT,
9192 "%04d-%02d-%02dT%02d:%02d:%02dZ",
9293 calendar.get(Calendar.YEAR),
9394 calendar.get(Calendar.MONTH) + 1,
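The `Locale.ROOT` argument added to `String.format` above guards against locale-sensitive digit rendering (some default locales format `%d` with non-ASCII digits). A self-contained illustration of the same fixed-width UTC timestamp pattern, using only the JDK; the class name and `format` helper are hypothetical stand-ins for `doFormatDate()`:

```java
import java.util.Calendar;
import java.util.GregorianCalendar;
import java.util.Locale;
import java.util.TimeZone;

public class DateFormatSketch {
    // Same pattern as doFormatDate() above: zero-padded ISO-8601 UTC
    // timestamp, formatted with Locale.ROOT so output is locale-independent.
    public static String format(Calendar c) {
        return String.format(Locale.ROOT,
                "%04d-%02d-%02dT%02d:%02d:%02dZ",
                c.get(Calendar.YEAR),
                c.get(Calendar.MONTH) + 1, // Calendar months are 0-based
                c.get(Calendar.DAY_OF_MONTH),
                c.get(Calendar.HOUR_OF_DAY),
                c.get(Calendar.MINUTE),
                c.get(Calendar.SECOND));
    }

    public static void main(String[] args) {
        Calendar c = new GregorianCalendar(
                TimeZone.getTimeZone("UTC"), Locale.US);
        c.set(2015, Calendar.APRIL, 13, 12, 0, 0);
        System.out.println(format(c)); // 2015-04-13T12:00:00Z
    }
}
```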
7070 </mime-type>
7171
7272 <mime-type type="application/cals-1840"/>
73
74 <mime-type type="application/cbor">
75 <acronym>CBOR</acronym>
76 <_comment>Concise Binary Object Representation container</_comment>
77 <tika:link>http://tools.ietf.org/html/rfc7049</tika:link>
78 <magic priority="40">
79 <match value="0xd9d9f7" type="string" offset="0" />
80 </magic>
81 </mime-type>
82
7383 <mime-type type="application/ccxml+xml">
7484 <glob pattern="*.ccxml"/>
7585 </mime-type>
91101 <mime-type type="application/dca-rft"/>
92102 <mime-type type="application/dec-dx"/>
93103 <mime-type type="application/dialog-info+xml"/>
94 <mime-type type="application/dicom"/>
104
105 <mime-type type="application/dicom">
106 <_comment>DICOM medical imaging data</_comment>
107 <magic priority="50">
108 <match value="DICM" type="string" offset="128"/>
109 </magic>
110 </mime-type>
95111
96112 <mime-type type="application/dita+xml">
97113 <sub-class-of type="application/xml"/>
114130 <glob pattern="*.dita"/>
115131 </mime-type>
116132 <mime-type type="application/dita+xml;format=task">
117 <sub-class-of type="application/dita+xml;format=task"/>
133 <sub-class-of type="application/dita+xml"/>
118134 <_comment>DITA Task Topic</_comment>
119135 <root-XML localName="task"/>
120136 <root-XML localName="task" namespaceURI="http://docs.oasis-open.org/namespace"/>
201217 <mime-type type="application/index.obj"/>
202218 <mime-type type="application/index.response"/>
203219 <mime-type type="application/index.vnd"/>
220
221 <mime-type type="application/inf">
222 <_comment>Windows setup INFormation</_comment>
223 <tika:link>http://msdn.microsoft.com/en-us/library/windows/hardware/ff549520(v=vs.85).aspx</tika:link>
224 <alias type="application/x-setupscript"/>
225 <alias type="application/x-wine-extension-inf"/>
226 <sub-class-of type="text/plain"/>
227 <magic priority="30">
228 <match value="[version]" type="string" offset="0" />
229 <match value="[strings]" type="string" offset="0" />
230 </magic>
231 </mime-type>
232
204233 <mime-type type="application/iotp"/>
205234 <mime-type type="application/ipp"/>
206235 <mime-type type="application/isup"/>
597626 <mime-type type="application/sdp">
598627 <glob pattern="*.sdp"/>
599628 </mime-type>
629
630 <mime-type type="application/sereal">
631 <_comment>Sereal binary serialization format</_comment>
632 <tika:link>https://github.com/Sereal/Sereal/blob/master/sereal_spec.pod</tika:link>
633 <glob pattern="*.srl"/>
634 </mime-type>
635 <mime-type type="application/sereal;version=1">
636 <sub-class-of type="application/sereal"/>
637 <magic priority="50">
638 <match value="0x6C72733D" type="little32" offset="0">
639 <match value="0x01" mask="0x0F" type="string" offset="4"/>
640 </match>
641 </magic>
642 </mime-type>
643 <mime-type type="application/sereal;version=2">
644 <sub-class-of type="application/sereal"/>
645 <magic priority="50">
646 <match value="0x6C72733D" type="little32" offset="0">
647 <match value="0x02" mask="0x0F" type="string" offset="4"/>
648 </match>
649 </magic>
650 </mime-type>
651 <mime-type type="application/sereal;version=3">
652 <sub-class-of type="application/sereal"/>
653 <magic priority="50">
654 <match value="0x6C72F33D" type="little32" offset="0">
655 <match value="0x03" mask="0x0F" type="string" offset="4"/>
656 </match>
657 </magic>
658 </mime-type>
659
600660 <mime-type type="application/set-payment"/>
601661 <mime-type type="application/set-payment-initiation">
602662 <glob pattern="*.setpay"/>
13941454 <match value="Foglio\ di\ lavoro\ Microsoft\ Exce" type="string" offset="2080"/>
13951455 <match value="Biff5" type="string" offset="2114"/>
13961456 <match value="Biff5" type="string" offset="2121"/>
1397 <match value="\x09\x04\x06\x00\x00\x00\x10\x00" type="string" offset="0"/>
13981457 <match value="0xd0cf11e0a1b11ae1" type="string" offset="0:8">
13991458 <match value="W\x00o\x00r\x00k\x00b\x00o\x00o\x00k" type="string" offset="1152:4096" />
14001459 </match>
14261485 <_comment>Microsoft Excel 2007 Binary Spreadsheet</_comment>
14271486 <glob pattern="*.xlsb"/>
14281487 <sub-class-of type="application/x-tika-ooxml"/>
1488 </mime-type>
1489
1490 <mime-type type="application/vnd.ms-excel.sheet.4">
1491 <_comment>Microsoft Excel 4 Worksheet</_comment>
1492 <magic priority="60">
1493 <match value="0x09040600" type="string" offset="0">
1494 <match value="0x00001000" type="string" offset="4"/> <!-- Sheet -->
1495 <match value="0x00002000" type="string" offset="4"/> <!-- Chart -->
1496 <match value="0x00004000" type="string" offset="4"/> <!-- Macro -->
1497 </match>
1498 </magic>
1499 <sub-class-of type="application/x-tika-old-excel"/>
1500 </mime-type>
1501 <mime-type type="application/vnd.ms-excel.workspace.4">
1502 <_comment>Microsoft Excel 4 Workspace</_comment>
1503 <magic priority="60">
1504 <match value="0x09040600" type="string" offset="0">
1505 <match value="0x00000001" type="string" offset="4"/>
1506 </match>
1507 </magic>
1508 <sub-class-of type="application/x-tika-old-excel"/>
1509 </mime-type>
1510
1511 <mime-type type="application/vnd.ms-excel.sheet.3">
1512 <_comment>Microsoft Excel 3 Worksheet</_comment>
1513 <magic priority="60">
1514 <match value="0x09020600" type="string" offset="0">
1515 <match value="0x00001000" type="string" offset="4"/> <!-- Sheet -->
1516 <match value="0x00002000" type="string" offset="4"/> <!-- Chart -->
1517 <match value="0x00004000" type="string" offset="4"/> <!-- Macro -->
1518 </match>
1519 </magic>
1520 <sub-class-of type="application/x-tika-old-excel"/>
1521 </mime-type>
1522 <mime-type type="application/vnd.ms-excel.workspace.3">
1523 <_comment>Microsoft Excel 3 Workspace</_comment>
1524 <magic priority="60">
1525 <match value="0x09020600" type="string" offset="0">
1526 <match value="0x00000001" type="string" offset="4"/>
1527 </match>
1528 </magic>
1529 <sub-class-of type="application/x-tika-old-excel"/>
1530 </mime-type>
1531
1532 <mime-type type="application/vnd.ms-excel.sheet.2">
1533 <_comment>Microsoft Excel 2 Worksheet</_comment>
1534 <magic priority="60">
1535 <match value="0x09000400" type="string" offset="0">
1536 <match value="0x00001000" type="string" offset="4"/> <!-- Sheet -->
1537 <match value="0x00002000" type="string" offset="4"/> <!-- Chart -->
1538 <match value="0x00004000" type="string" offset="4"/> <!-- Macro -->
1539 </match>
1540 </magic>
1541 <sub-class-of type="application/x-tika-old-excel"/>
14291542 </mime-type>
14301543
14311544 <mime-type type="application/vnd.ms-fontobject">
22662379
22672380 <!-- http://www.iana.org/assignments/media-types/application/vnd.visio -->
22682381 <mime-type type="application/vnd.visio">
2382 <alias type="application/vnd.ms-visio"/>
22692383 <_comment>Microsoft Visio Diagram</_comment>
22702384 <glob pattern="*.vsd"/>
22712385 <glob pattern="*.vst"/>
22722386 <glob pattern="*.vss"/>
22732387 <glob pattern="*.vsw"/>
22742388 <sub-class-of type="application/x-tika-msoffice"/>
2389 </mime-type>
2390
2391 <mime-type type="application/vnd.ms-visio.drawing">
2392 <_comment>Office Open XML Visio Drawing (macro-free)</_comment>
2393 <glob pattern="*.vsdx"/>
2394 <sub-class-of type="application/x-tika-visio-ooxml"/>
2395 </mime-type>
2396 <mime-type type="application/vnd.ms-visio.template">
2397 <_comment>Office Open XML Visio Template (macro-free)</_comment>
2398 <glob pattern="*.vstx"/>
2399 <sub-class-of type="application/x-tika-visio-ooxml"/>
2400 </mime-type>
2401 <mime-type type="application/vnd.ms-visio.stencil">
2402 <_comment>Office Open XML Visio Stencil (macro-free)</_comment>
2403 <glob pattern="*.vssx"/>
2404 <sub-class-of type="application/x-tika-visio-ooxml"/>
2405 </mime-type>
2406 <mime-type type="application/vnd.ms-visio.drawing.macroEnabled.12">
2407 <_comment>Office Open XML Visio Drawing (macro-enabled)</_comment>
2408 <glob pattern="*.vsdm"/>
2409 <sub-class-of type="application/x-tika-visio-ooxml"/>
2410 </mime-type>
2411 <mime-type type="application/vnd.ms-visio.template.macroEnabled.12">
2412 <_comment>Office Open XML Visio Template (macro-enabled)</_comment>
2413 <glob pattern="*.vstm"/>
2414 <sub-class-of type="application/x-tika-visio-ooxml"/>
2415 </mime-type>
2416 <mime-type type="application/vnd.ms-visio.stencil.macroEnabled.12">
2417 <_comment>Office Open XML Visio Stencil (macro-enabled)</_comment>
2418 <glob pattern="*.vssm"/>
2419 <sub-class-of type="application/x-tika-visio-ooxml"/>
22752420 </mime-type>
22762421
22772422 <mime-type type="application/vnd.visionary">
23942539 <glob pattern="*.ace"/>
23952540 </mime-type>
23962541
2542 <mime-type type="application/x-axcrypt">
2543 <_comment>AxCrypt</_comment>
2544 <glob pattern="*.axx" />
2545 <magic priority="60">
2546 <!-- AxCrypt block header, skip length field, then Header of type Preamble -->
2547 <match value="0xc0b9072e4f93f146a015792ca1d9e821" type="string" offset="0">
2548 <match value="2" type="big32" offset="17" />
2549 </match>
2550 </magic>
2551 </mime-type>
2552
23972553 <mime-type type="application/x-adobe-indesign">
23982554 <acronym>INDD</acronym>
23992555 <_comment>Adobe InDesign document</_comment>
24592615 </mime-type>
24602616
24612617 <mime-type type="application/x-berkeley-db">
2462 <magic priority="50">
2618 <_comment>Berkeley DB</_comment>
2619 <alias type="application/x-dbm"/>
2620 </mime-type>
2621 <mime-type type="application/x-berkeley-db;format=hash">
2622 <_comment>Berkeley DB Hash Database</_comment>
2623 <magic priority="50">
2624 <match value="0x00061561" type="host32" offset="0"/>
24632625 <match value="0x00061561" type="big32" offset="0"/>
2626 <match value="0x00061561" type="little32" offset="0"/>
24642627 <match value="0x00061561" type="host32" offset="12"/>
24652628 <match value="0x00061561" type="big32" offset="12"/>
24662629 <match value="0x00061561" type="little32" offset="12"/>
2630 </magic>
2631 <sub-class-of type="application/x-berkeley-db"/>
2632 </mime-type>
2633 <mime-type type="application/x-berkeley-db;format=btree">
2634 <_comment>Berkeley DB BTree Database</_comment>
2635 <magic priority="50">
2636 <match value="0x00053162" type="host32" offset="0"/>
2637 <match value="0x00053162" type="big32" offset="0"/>
2638 <match value="0x00053162" type="little32" offset="0"/>
24672639 <match value="0x00053162" type="host32" offset="12"/>
24682640 <match value="0x00053162" type="big32" offset="12"/>
24692641 <match value="0x00053162" type="little32" offset="12"/>
2642 </magic>
2643 <sub-class-of type="application/x-berkeley-db"/>
2644 </mime-type>
2645 <mime-type type="application/x-berkeley-db;format=queue">
2646 <_comment>Berkeley DB Queue Database</_comment>
2647 <magic priority="50">
24702648 <match value="0x00042253" type="host32" offset="12"/>
24712649 <match value="0x00042253" type="big32" offset="12"/>
24722650 <match value="0x00042253" type="little32" offset="12"/>
2651 </magic>
2652 <sub-class-of type="application/x-berkeley-db"/>
2653 </mime-type>
2654 <mime-type type="application/x-berkeley-db;format=log">
2655 <_comment>Berkeley DB Log Database</_comment>
2656 <magic priority="50">
24732657 <match value="0x00040988" type="host32" offset="12"/>
24742658 <match value="0x00040988" type="little32" offset="12"/>
24752659 <match value="0x00040988" type="big32" offset="12"/>
2476 <match value="0x00053162" type="host32" offset="0"/>
2477 <match value="0x00053162" type="big32" offset="0"/>
2478 <match value="0x00053162" type="little32" offset="0"/>
2479 </magic>
2660 </magic>
2661 <sub-class-of type="application/x-berkeley-db"/>
2662 </mime-type>
2663
2664 <mime-type type="application/x-berkeley-db;format=hash;version=2">
2665 <_comment>Berkeley DB Version 2 Hash Database</_comment>
2666 <magic priority="60">
2667 <match value="0x00061561" type="host32" offset="12">
2668 <match value="0x0005" type="host32" offset="16"/>
2669 </match>
2670 <match value="0x00061561" type="big32" offset="12">
2671 <match value="0x0005" type="big32" offset="16"/>
2672 </match>
2673 <match value="0x00061561" type="little32" offset="12">
2674 <match value="0x0005" type="little32" offset="16"/>
2675 </match>
2676 </magic>
2677 <sub-class-of type="application/x-berkeley-db;format=hash"/>
2678 </mime-type>
2679 <mime-type type="application/x-berkeley-db;format=hash;version=3">
2680 <_comment>Berkeley DB Version 3 Hash Database</_comment>
2681 <magic priority="60">
2682 <match value="0x00061561" type="host32" offset="12">
2683 <match value="0x0007" type="host32" offset="16"/>
2684 </match>
2685 <match value="0x00061561" type="big32" offset="12">
2686 <match value="0x0007" type="big32" offset="16"/>
2687 </match>
2688 <match value="0x00061561" type="little32" offset="12">
2689 <match value="0x0007" type="little32" offset="16"/>
2690 </match>
2691 </magic>
2692 <sub-class-of type="application/x-berkeley-db;format=hash"/>
2693 </mime-type>
2694 <mime-type type="application/x-berkeley-db;format=hash;version=4">
2695 <_comment>Berkeley DB Version 4 Hash Database</_comment>
2696 <magic priority="60">
2697 <match value="0x00061561" type="host32" offset="12">
2698 <match value="0x0008" type="host32" offset="16"/>
2699 </match>
2700 <match value="0x00061561" type="big32" offset="12">
2701 <match value="0x0008" type="big32" offset="16"/>
2702 </match>
2703 <match value="0x00061561" type="little32" offset="12">
2704 <match value="0x0008" type="little32" offset="16"/>
2705 </match>
2706 </magic>
2707 <sub-class-of type="application/x-berkeley-db;format=hash"/>
2708 </mime-type>
2709 <mime-type type="application/x-berkeley-db;format=hash;version=5">
2710 <_comment>Berkeley DB Version 5 Hash Database</_comment>
2711 <magic priority="60">
2712 <match value="0x00061561" type="host32" offset="12">
2713 <match value="0x0009" type="host32" offset="16"/>
2714 </match>
2715 <match value="0x00061561" type="big32" offset="12">
2716 <match value="0x0009" type="big32" offset="16"/>
2717 </match>
2718 <match value="0x00061561" type="little32" offset="12">
2719 <match value="0x0009" type="little32" offset="16"/>
2720 </match>
2721 </magic>
2722 <sub-class-of type="application/x-berkeley-db;format=hash"/>
2723 </mime-type>
2724
2725 <mime-type type="application/x-berkeley-db;format=btree;version=2">
2726 <_comment>Berkeley DB Version 2 BTree Database</_comment>
2727 <magic priority="60">
2728 <match value="0x00053162" type="host32" offset="12">
2729 <match value="0x0006" type="host32" offset="16"/>
2730 </match>
2731 <match value="0x00053162" type="big32" offset="12">
2732 <match value="0x0006" type="big32" offset="16"/>
2733 </match>
2734 <match value="0x00053162" type="little32" offset="12">
2735 <match value="0x0006" type="little32" offset="16"/>
2736 </match>
2737 </magic>
2738 <sub-class-of type="application/x-berkeley-db;format=btree"/>
2739 </mime-type>
2740 <mime-type type="application/x-berkeley-db;format=btree;version=3">
2741 <_comment>Berkeley DB Version 3 BTree Database</_comment>
2742 <magic priority="60">
2743 <match value="0x00053162" type="host32" offset="12">
2744 <match value="0x0008" type="host32" offset="16"/>
2745 </match>
2746 <match value="0x00053162" type="big32" offset="12">
2747 <match value="0x0008" type="big32" offset="16"/>
2748 </match>
2749 <match value="0x00053162" type="little32" offset="12">
2750 <match value="0x0008" type="little32" offset="16"/>
2751 </match>
2752 </magic>
2753 <sub-class-of type="application/x-berkeley-db;format=btree"/>
2754 </mime-type>
2755 <mime-type type="application/x-berkeley-db;format=btree;version=4">
2756 <_comment>Berkeley DB Version 4 and 5 BTree Database</_comment>
2757 <magic priority="60">
2758 <match value="0x00053162" type="host32" offset="12">
2759 <match value="0x0009" type="host32" offset="16"/>
2760 </match>
2761 <match value="0x00053162" type="big32" offset="12">
2762 <match value="0x0009" type="big32" offset="16"/>
2763 </match>
2764 <match value="0x00053162" type="little32" offset="12">
2765 <match value="0x0009" type="little32" offset="16"/>
2766 </match>
2767 </magic>
2768 <sub-class-of type="application/x-berkeley-db;format=btree"/>
24802769 </mime-type>
24812770
24822771 <mime-type type="application/x-bibtex-text-file">
25912880 </mime-type>
25922881
25932882 <mime-type type="application/x-debian-package">
2883 <alias type="application/vnd.debian.binary-package"/>
25942884 <sub-class-of type="application/x-archive"/>
25952885 <magic priority="60">
25962886 <match value="!&lt;arch&gt;\ndebian-binary" type="string" offset="0"/>
2887 <match value="!&lt;arch&gt;\ndebian-split" type="string" offset="0"/>
25972888 </magic>
25982889 <glob pattern="*.deb"/>
25992890 <glob pattern="*.udeb"/>
26792970 <match value="\nDate:" type="string" offset="2:9"/>
26802971 </magic>
26812972 <glob pattern="*.emlx"/>
2973 <sub-class-of type="text/x-tika-text-based-message"/>
26822974 </mime-type>
26832975
26842976 <mime-type type="application/x-killustrator">
27323024 <mime-type type="application/x-emf">
27333025 <acronym>EMF</acronym>
27343026 <_comment>Extended Metafile</_comment>
3027 <tika:link>https://msdn.microsoft.com/en-us/library/cc230711.aspx</tika:link>
27353028 <glob pattern="*.emf"/>
27363029 <magic priority="50">
2737 <match value="0x01000000" type="string" offset="0"/>
3030 <match value="0x01000000" type="string" offset="0">
3031 <match value="0x464D4520" type="little32" offset="40"/>
3032 </match>
27383033 </magic>
27393034 </mime-type>
27403035
28443139 <glob pattern="*.gnumeric"/>
28453140 </mime-type>
28463141
3142 <mime-type type="application/x-grib">
3143 <acronym>GRIB</acronym>
3144 <_comment>General Regularly-distributed Information in Binary form</_comment>
3145 <tika:link>http://en.wikipedia.org/wiki/GRIB</tika:link>
3146 <magic priority="50">
3147 <match value="GRIB" type="string" offset="0"/>
3148 </magic>
3149 <glob pattern="*.grb"/>
3150 <glob pattern="*.grb1"/>
3151 <glob pattern="*.grb2"/>
3152 </mime-type>
3153
28473154 <mime-type type="application/x-gtar">
28483155 <_comment>GNU tar Compressed File Archive (GNU Tape Archive)</_comment>
28493156 <magic priority="50">
28583165 <_comment>Gzip Compressed Archive</_comment>
28593166 <alias type="application/x-gzip"/>
28603167 <alias type="application/x-gunzip"/>
2861 <alias type="application/gzip-compressed"/>
28623168 <alias type="application/gzipped"/>
28633169 <alias type="application/gzip-compressed"/>
28643170 <alias type="application/x-gzip-compressed"/>
28673173 <match value="\037\213" type="string" offset="0" />
28683174 <match value="\x1f\x8b" type="string" offset="0" />
28693175 </magic>
3176 <glob pattern="*.gz" />
28703177 <glob pattern="*.tgz" />
2871 <glob pattern="*.gz" />
28723178 <glob pattern="*-gz" />
28733179 <glob pattern="*.emz" />
28743180 </mime-type>
29063212 </match>
29073213 </magic>
29083214 <glob pattern="*.ibooks"/>
3215 </mime-type>
3216
3217 <mime-type type="application/x-isatab-investigation">
3218 <_comment>ISA-Tab Investigation file</_comment>
3219 <magic priority="50">
3220 <match value="ONTOLOGY SOURCE REFERENCE" type="string" offset="0"/>
3221 </magic>
3222 <glob pattern="i_*.txt"/>
3223 </mime-type>
3224
3225 <!--<mime-type type="application/x-isatab-study">-->
3226 <mime-type type="application/x-isatab">
3227 <_comment>ISA-Tab Study file</_comment>
3228 <magic priority="50">
3229 <match value="Source Name" type="string" offset="1"/>
3230 </magic>
3231 <glob pattern="s_*.txt"/>
3232 </mime-type>
3233
3234 <mime-type type="application/x-isatab-assay">
3235 <_comment>ISA-Tab Assay file</_comment>
3236 <magic priority="50">
3237 <match value="Sample Name" type="string" offset="1"/>
3238 </magic>
3239 <glob pattern="a_*.txt"/>
29093240 </mime-type>
29103241
29113242 <mime-type type="application/x-iso9660-image">
30283359 <sub-class-of type="application/x-msdownload"/>
30293360 <magic priority="55">
30303361 <!-- Technically the header offset is stored at 0x3c, and isn't a -->
3031 <!-- constant, but it's almost always set to start at 0x80 or 0xf0 -->
3032 <match value="PE\000\000" type="string" offset="128"/>
3033 <match value="PE\000\000" type="string" offset="240"/>
3362 <!-- constant, but it's almost always set to start at 0x80, 0xb0, -->
3363 <!-- 0xd0 or 0xf0. Will always have the MZ msdoc header too. -->
3364 <match value="MZ" type="string" offset="0">
3365 <match value="PE\000\000" type="string" offset="128"/>
3366 <match value="PE\000\000" type="string" offset="176"/>
3367 <match value="PE\000\000" type="string" offset="208"/>
3368 <match value="PE\000\000" type="string" offset="240"/>
3369 </match>
30343370 </magic>
30353371 </mime-type>
30363372 <!-- the PE header should be PEx00x00 then a two byte machine type -->
31233459 <mime-type type="application/x-mswrite">
31243460 <glob pattern="*.wri"/>
31253461 </mime-type>
3462
3463 <mime-type type="application/x-mysql-db">
3464 </mime-type>
3465 <mime-type type="application/x-mysql-table-definition">
3466 <_comment>MySQL Table Definition (Format)</_comment>
3467 <!-- Glob is normally .frm, but that's already taken -->
3468 <magic priority="40">
3469 <match value="0xfe0107" type="string" offset="0"/>
3470 <match value="0xfe0108" type="string" offset="0"/>
3471 <match value="0xfe0109" type="string" offset="0"/>
3472 <match value="0xfe010a" type="string" offset="0"/>
3473 <match value="0xfe010b" type="string" offset="0"/>
3474 <match value="0xfe010c" type="string" offset="0"/>
3475 </magic>
3476 <sub-class-of type="application/x-mysql-db"/>
3477 </mime-type>
3478 <mime-type type="application/x-mysql-misam-index">
3479 <_comment>MySQL MISAM Index</_comment>
3480 <magic priority="40">
3481 <match value="0xfefe03" type="string" offset="0"/>
3482 <match value="0xfefe05" type="string" offset="0"/>
3483 </magic>
3484 <sub-class-of type="application/x-mysql-db"/>
3485 </mime-type>
3486 <mime-type type="application/x-mysql-misam-compressed-index">
3487 <_comment>MySQL MISAM Compressed Index</_comment>
3488 <glob pattern="*.MYI"/>
3489 <magic priority="40">
3490 <match value="0xfefe06" type="string" offset="0"/>
3491 <match value="0xfefe07" type="string" offset="0"/>
3492 </magic>
3493 <sub-class-of type="application/x-mysql-db"/>
3494 </mime-type>
3495 <mime-type type="application/x-mysql-misam-data">
3496 <_comment>MySQL MISAM Data</_comment>
3497 <glob pattern="*.MYD"/>
3498 <!-- MISAM Data files are header-less, so no magic -->
3499 <sub-class-of type="application/x-mysql-db"/>
3500 </mime-type>
3501
31263502 <mime-type type="application/x-netcdf">
31273503 <glob pattern="*.nc"/>
31283504 <glob pattern="*.cdf"/>
31543530 <sub-class-of type="application/x-tika-msoffice"/>
31553531 </mime-type>
31563532
3533 <mime-type type="application/xquery">
3534 <_comment>XQuery source code</_comment>
3535 <glob pattern="*.xq"/>
3536 <glob pattern="*.xquery"/>
3537 <sub-class-of type="text/plain"/>
3538 </mime-type>
3539
31573540 <mime-type type="application/x-rar-compressed">
31583541 <_comment>RAR archive</_comment>
31593542 <alias type="application/x-rar"/>
32943677
32953678 <mime-type type="application/x-silverlight-app">
32963679 <glob pattern="*.xap"/>
3680 </mime-type>
3681
3682 <mime-type type="application/x-sqlite3">
3683 <magic priority="50">
3684 <match value="SQLite format 3\x00" type="string" offset="0"/>
3685 </magic>
32973686 </mime-type>
32983687
32993688 <mime-type type="application/x-stuffit">
33833772 </magic>
33843773 </mime-type>
33853774
3775 <mime-type type="application/x-tika-old-excel">
3776 <_comment>Pre-OLE2 (Old) Microsoft Excel Worksheets</_comment>
3777 </mime-type>
3778
33863779 <!-- =================================================================== -->
33873780 <!-- Office Open XML file formats -->
33883781 <!-- http://www.ecma-international.org/publications/standards/Ecma-376.htm -->
34033796 <_comment>Password Protected OOXML File</_comment>
34043797 </mime-type>
34053798
3799 <mime-type type="application/x-tika-visio-ooxml">
3800 <sub-class-of type="application/x-tika-ooxml"/>
3801 <_comment>Visio OOXML File</_comment>
3802 </mime-type>
3803
34063804 <!-- Older StarOffice formats extend up the Microsoft OLE2 format -->
34073805 <mime-type type="application/x-tika-staroffice">
34083806 <sub-class-of type="application/x-tika-msoffice"/>
34163814 </mime-type>
34173815 <mime-type type="application/x-ustar">
34183816 <glob pattern="*.ustar"/>
3817 </mime-type>
3818
3819 <mime-type type="application/x-vhd">
3820 <acronym>VHD</acronym>
3821 <_comment>Virtual PC Virtual Hard Disk</_comment>
3822 <tika:link>http://en.wikipedia.org/wiki/VHD_%28file_format%29</tika:link>
3823 <magic priority="50">
3824 <match value="conectix" type="string" offset="0"/>
3825 </magic>
34193826 </mime-type>
34203827
34213828 <mime-type type="application/x-vmdk">
34983905 <magic priority="50">
34993906 <match value="&lt;?xml" type="string" offset="0"/>
35003907 <match value="&lt;?XML" type="string" offset="0"/>
3501 <match value="&lt;!--" type="string" offset="0"/>
35023908 <!-- UTF-8 BOM -->
35033909 <match value="0xEFBBBF3C3F786D6C" type="string" offset="0"/>
35043910 <!-- UTF-16 LE/BE -->
35053911 <match value="0xFFFE3C003F0078006D006C00" type="string" offset="0"/>
35063912 <match value="0xFEFF003C003F0078006D006C" type="string" offset="0"/>
35073913 <!-- TODO: Add matches for the other possible XML encoding schemes -->
3914 </magic>
3915 <!-- XML files can start with a comment but then must not contain processing instructions.
3916 This should be rare so we assign lower priority here. Priority is also lower than text/html magics
3917 for them to be preferred for HTML starting with comment.-->
3918 <magic priority="30">
3919 <match value="&lt;!--" type="string" offset="0"/>
35083920 </magic>
35093921 <glob pattern="*.xml"/>
35103922 <glob pattern="*.xsl"/>
37594171 <match value="OggS\000.......................\001vorbis" type="string"
37604172 mask="0xFFFFFFFF00000000000000000000000000000000000000000000000000FFFFFFFFFFFF"
37614173 offset="0"/>
3762 <match value="\x4f\x67\x67\x53\x00\x02\x00\x00\x00\x00\x00\x00\x00\x00"
3763 type="string" offset="0"/>
37644174 </magic>
37654175 <glob pattern="*.ogg"/>
37664176 <sub-class-of type="audio/ogg"/>
40744484 <glob pattern="*.dib"/>
40754485 </mime-type>
40764486
4487 <mime-type type="image/x-bpg">
4488 <acronym>BPG</acronym>
4489 <_comment>Better Portable Graphics</_comment>
4490 <magic priority="50">
4491 <match value="0x425047FB" type="string" offset="0">
4492 </match>
4493 </magic>
4494 <glob pattern="*.bpg"/>
4495 </mime-type>
4496
40774497 <mime-type type="image/cgm">
40784498 <acronym>CGM</acronym>
40794499 <_comment>Computer Graphics Metafile</_comment>
42614681 <alias type="application/dwg"/>
42624682 <alias type="application/x-dwg"/>
42634683 <alias type="application/x-autocad"/>
4264 <alias type="image/vnd.dwg"/>
42654684 <alias type="drawing/dwg"/>
42664685 <glob pattern="*.dwg"/>
42674686 <magic priority="50">
43354754 <mime-type type="image/vnd.wap.wbmp">
43364755 <_comment>Wireless Bitmap File Format</_comment>
43374756 <glob pattern="*.wbmp"/>
4757 </mime-type>
4758
4759 <mime-type type="image/webp">
4760 <acronym>WEBP</acronym>
4761 <tika:link>http://en.wikipedia.org/wiki/WebP</tika:link>
4762 <!-- container spec https://developers.google.com/speed/webp/docs/riff_container -->
4763 <magic priority="50">
4764 <match value="RIFF....WEBP" type="string" offset="0"
4765 mask="0xFFFFFFFF00000000FFFFFFFF"/>
4766 </magic>
4767 <glob pattern="*.webp"/>
43384768 </mime-type>
43394769
43404770 <mime-type type="image/vnd.xiff">
46215051 <match value="Xref:" type="string" offset="0" />
46225052 <match value="Article" type="string" offset="0" />
46235053 </magic>
5054 <sub-class-of type="text/x-tika-text-based-message"/>
46245055 </mime-type>
46255056
46265057 <mime-type type="message/partial"/>
46465077 <glob pattern="*.mime"/>
46475078 <glob pattern="*.mht"/>
46485079 <glob pattern="*.mhtml"/>
5080 <sub-class-of type="text/x-tika-text-based-message"/>
46495081 </mime-type>
46505082
46515083 <mime-type type="message/s-http"/>
47165148 <mime-type type="multipart/signed"/>
47175149 <mime-type type="multipart/voice-message"/>
47185150
5151 <mime-type type="text/dif+xml">
5152 <root-XML localName="DIF"/>
5153 <root-XML localName="DIF" namespaceURI="http://gcmd.gsfc.nasa.gov/Aboutus/xml/dif/"/>
5154 <glob pattern="*.dif"/>
5155 <sub-class-of type="application/xml"/>
5156 </mime-type>
5157
47195158 <mime-type type="text/x-actionscript">
47205159 <_comment>ActionScript source code</_comment>
47215160 <glob pattern="*.as"/>
47225161 <sub-class-of type="text/plain"/>
5162 </mime-type>
5163
5164 <mime-type type="text/dif+xml">
5165 <root-XML localName="DIF"/>
5166 <root-XML localName="DIF" namespaceURI="http://gcmd.gsfc.nasa.gov/Aboutus/xml/dif/"/>
5167 <glob pattern="*.dif"/>
5168 <sub-class-of type="application/xml"/>
47235169 </mime-type>
47245170
47255171 <mime-type type="text/x-ada">
54435889 <sub-class-of type="text/plain"/>
54445890 </mime-type>
54455891
5892 <mime-type type="text/x-tika-text-based-message">
5893 <_comment>Text-based (non-binary) Message</_comment>
5894 </mime-type>
5895
54465896 <mime-type type="text/x-uuencode">
54475897 <glob pattern="*.uu"/>
54485898 </mime-type>
56146064 <sub-class-of type="application/ogg"/>
56156065 </mime-type>
56166066
6067 <mime-type type="video/daala">
6068 <_comment>Ogg Daala Video</_comment>
6069 <alias type="video/x-daala"/>
6070 <magic priority="60">
6071 <!-- Assumes the video stream comes before the audio stream, which may not always hold -->
6072 <match value="OggS\000.......................\x80daala" type="string"
6073 mask="0xFFFFFFFF00000000000000000000000000000000000000000000000000FFFFFFFFFFFF"
6074 offset="0"/>
6075 </magic>
6076 <sub-class-of type="video/ogg"/>
6077 </mime-type>
6078
56176079 <mime-type type="video/theora">
56186080 <_comment>Ogg Theora Video</_comment>
56196081 <alias type="video/x-theora"/>
58936355 <sub-class-of type="text/plain"/>
58946356 </mime-type>
58956357
5896 <mime-type type="application/xquery">
5897 <_comment>XQuery source code</_comment>
5898 <glob pattern="*.xq"/>
5899 <glob pattern="*.xquery"/>
5900 <sub-class-of type="text/plain"/>
5901 </mime-type>
5902
59036358 </mime-info>
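Many of the magic rules added above (for example the CBOR entry's `value="0xd9d9f7" type="string"`) express their signature as hex-escaped bytes compared against the file at a fixed offset. The sketch below approximates that comparison to show what such a rule means; it is an illustrative re-implementation, not Tika's actual magic-matching code, and the class and method names are hypothetical:

```java
import java.util.Arrays;

public class MagicSketch {
    // Decodes a hex-escaped magic value like "0xd9d9f7" into raw bytes.
    public static byte[] decodeHex(String value) {
        String hex = value.substring(2); // strip the "0x" prefix
        byte[] out = new byte[hex.length() / 2];
        for (int i = 0; i < out.length; i++) {
            out[i] = (byte) Integer.parseInt(
                    hex.substring(2 * i, 2 * i + 2), 16);
        }
        return out;
    }

    // True when the magic bytes appear in data at exactly this offset,
    // as with offset="0" in the CBOR rule above.
    public static boolean matches(byte[] magic, byte[] data, int offset) {
        if (data.length < offset + magic.length) return false;
        return Arrays.equals(magic,
                Arrays.copyOfRange(data, offset, offset + magic.length));
    }

    public static void main(String[] args) {
        byte[] cbor = decodeHex("0xd9d9f7");
        byte[] file = {(byte) 0xd9, (byte) 0xd9, (byte) 0xf7, 0x00};
        System.out.println(matches(cbor, file, 0)); // true
    }
}
```

Tika's real matcher additionally supports masks, nested `<match>` elements (all of which act as further constraints, as in the Sereal and EMF rules above), offset ranges like `1152:4096`, and numeric types such as `big32`/`little32`/`host32`.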
840840 assertEquals("video/x-msvideo", tika.detect("x.avi"));
841841 assertEquals("video/x-sgi-movie", tika.detect("x.movie"));
842842 assertEquals("x-conference/x-cooltalk", tika.detect("x.ice"));
843
844 assertEquals("application/x-grib", tika.detect("x.grb"));
845 assertEquals("application/x-grib", tika.detect("x.grb1"));
846 assertEquals("application/x-grib", tika.detect("x.grb2"));
847 assertEquals("text/dif+xml", tika.detect("x.dif"));
843848 }
844849
845850 }
1919 import java.io.File;
2020 import java.io.FileInputStream;
2121 import java.io.InputStream;
22 import java.util.Locale;
2223
2324 import org.apache.tika.io.IOUtils;
2425
3435 }
3536 } else {
3637 benchmark(new File(
37 "../tika-parsers/src/test/resources/test-documents"));
38 "../tika-parsers/src/test/resources/test-documents"));
3839 }
3940 System.out.println(
4041 "Total benchmark time: "
5556 tika.detect(new ByteArrayInputStream(content));
5657 }
5758 System.out.printf(
59 Locale.ROOT,
5860 "%6dns per Tika.detect(%s) = %s%n",
5961 System.currentTimeMillis() - start, file, type);
6062 } finally {
2222 import org.apache.tika.ResourceLoggingClassLoader;
2323 import org.apache.tika.exception.TikaException;
2424 import org.apache.tika.parser.AutoDetectParser;
25 import org.apache.tika.parser.CompositeParser;
2526 import org.apache.tika.parser.DefaultParser;
27 import org.apache.tika.parser.EmptyParser;
28 import org.apache.tika.parser.ErrorParser;
29 import org.apache.tika.parser.Parser;
30 import org.apache.tika.parser.ParserDecorator;
2631 import org.junit.Test;
2732
2833 import static org.junit.Assert.assertEquals;
3944 * @see <a href="https://issues.apache.org/jira/browse/TIKA-866">TIKA-866</a>
4045 */
4146 @Test
42 public void testInvalidParser() throws Exception {
47 public void withInvalidParser() throws Exception {
4348 URL url = TikaConfigTest.class.getResource("TIKA-866-invalid.xml");
4449 System.setProperty("tika.config", url.toExternalForm());
4550 try {
5863 *
5964 * @see <a href="https://issues.apache.org/jira/browse/TIKA-866">TIKA-866</a>
6065 */
61 public void testCompositeParser() throws Exception {
66 @Test
67 public void asCompositeParser() throws Exception {
6268 URL url = TikaConfigTest.class.getResource("TIKA-866-composite.xml");
6369 System.setProperty("tika.config", url.toExternalForm());
6470 try {
7682 *
7783 * @see <a href="https://issues.apache.org/jira/browse/TIKA-866">TIKA-866</a>
7884 */
79 public void testValidParser() throws Exception {
85 @Test
86 public void onlyValidParser() throws Exception {
8087 URL url = TikaConfigTest.class.getResource("TIKA-866-valid.xml");
8188 System.setProperty("tika.config", url.toExternalForm());
8289 try {
93100 * that should be used when loading the mimetypes and when
94101 * discovering services
95102 */
96 public void testClassLoaderUsedEverywhere() throws Exception {
103 @Test
104 public void ensureClassLoaderUsedEverywhere() throws Exception {
97105 ResourceLoggingClassLoader customLoader =
98106 new ResourceLoggingClassLoader(getClass().getClassLoader());
99107 TikaConfig config;
126134 // - Custom Mimetypes
127135 assertNotNull(resources.get("org/apache/tika/mime/custom-mimetypes.xml"));
128136 }
137
138 /**
139 * TIKA-1445 It should be possible to exclude DefaultParser from
140 * certain types, so another parser explicitly listed will take them
141 */
142 @Test
143 public void defaultParserWithExcludes() throws Exception {
144 URL url = TikaConfigTest.class.getResource("TIKA-1445-default-except.xml");
145 System.setProperty("tika.config", url.toExternalForm());
146 try {
147 TikaConfig config = new TikaConfig();
148
149 CompositeParser cp = (CompositeParser)config.getParser();
150 List<Parser> parsers = cp.getAllComponentParsers();
151 Parser p;
152
153 // Will be the three parsers defined in the xml
154 assertEquals(3, parsers.size());
155
156 // Should have a wrapped DefaultParser, not the main DefaultParser,
157 // as it is excluded from handling certain classes
158 p = parsers.get(0);
159 assertTrue(p.toString(), p instanceof ParserDecorator);
160 assertEquals(DefaultParser.class, ((ParserDecorator)p).getWrappedParser().getClass());
161
162 // Should have two others which claim things, which they wouldn't
163 // otherwise handle
164 p = parsers.get(1);
165 assertTrue(p.toString(), p instanceof ParserDecorator);
166 assertEquals(EmptyParser.class, ((ParserDecorator)p).getWrappedParser().getClass());
167 assertEquals("hello/world", p.getSupportedTypes(null).iterator().next().toString());
168
169 p = parsers.get(2);
170 assertTrue(p.toString(), p instanceof ParserDecorator);
171 assertEquals(ErrorParser.class, ((ParserDecorator)p).getWrappedParser().getClass());
172 assertEquals("fail/world", p.getSupportedTypes(null).iterator().next().toString());
173 } catch (TikaException e) {
174 fail("Unexpected TikaException: " + e);
175 } finally {
176 System.clearProperty("tika.config");
177 }
178 }
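The test above expects TIKA-1445-default-except.xml to yield three decorated parsers: a DefaultParser with exclusions plus two explicit parsers claiming the excluded types. A plausible shape for that config, reconstructed from the assertions; the `<mime-exclude>`/`<mime>` element names are assumptions based on the Tika config format of this era, not copied from the actual test resource:

```xml
<properties>
  <parsers>
    <!-- DefaultParser, decorated so it does NOT handle the two types below -->
    <parser class="org.apache.tika.parser.DefaultParser">
      <mime-exclude>hello/world</mime-exclude>
      <mime-exclude>fail/world</mime-exclude>
    </parser>
    <!-- explicitly listed parsers claim the excluded types instead -->
    <parser class="org.apache.tika.parser.EmptyParser">
      <mime>hello/world</mime>
    </parser>
    <parser class="org.apache.tika.parser.ErrorParser">
      <mime>fail/world</mime>
    </parser>
  </parsers>
</properties>
```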
129179 }
2020 import java.io.InputStream;
2121 import java.util.Arrays;
2222
23 import org.apache.tika.io.IOUtils;
2324 import org.apache.tika.metadata.Metadata;
2425 import org.apache.tika.mime.MediaType;
2526 import org.junit.Test;
5354
5455 @Test
5556 public void testDetectText() throws Exception {
56 assertText("Hello, World!".getBytes("UTF-8"));
57 assertText(" \t\r\n".getBytes("UTF-8"));
57 assertText("Hello, World!".getBytes(IOUtils.UTF_8));
58 assertText(" \t\r\n".getBytes(IOUtils.UTF_8));
5859 assertNotText(new byte[] { -1, -2, -3, 0x09, 0x0A, 0x0C, 0x0D, 0x1B });
5960 assertNotText(new byte[] { 0 });
6061 assertNotText(new byte[] { 'H', 'e', 'l', 'l', 'o', 0 });
0 /*
1 * Licensed to the Apache Software Foundation (ASF) under one or more
2 * contributor license agreements. See the NOTICE file distributed with
3 * this work for additional information regarding copyright ownership.
4 * The ASF licenses this file to You under the Apache License, Version 2.0
5 * (the "License"); you may not use this file except in compliance with
6 * the License. You may obtain a copy of the License at
7 *
8 * http://www.apache.org/licenses/LICENSE-2.0
9 *
10 * Unless required by applicable law or agreed to in writing, software
11 * distributed under the License is distributed on an "AS IS" BASIS,
12 * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
13 * See the License for the specific language governing permissions and
14 * limitations under the License.
15 */
16
17 package org.apache.tika.io;
18
19 import static org.junit.Assert.assertEquals;
20
21 import java.io.ByteArrayInputStream;
22
23 import org.junit.Test;
24
25 public class EndianUtilsTest {
26 @Test
27 public void testReadUE7() throws Exception {
28 byte[] data;
29
30 data = new byte[] { 0x08 };
31 assertEquals((long)8, EndianUtils.readUE7(new ByteArrayInputStream(data)));
32
33 data = new byte[] { (byte)0x84, 0x1e };
34 assertEquals((long)542, EndianUtils.readUE7(new ByteArrayInputStream(data)));
35
36 data = new byte[] { (byte)0xac, (byte)0xbe, 0x17 };
37 assertEquals((long)728855, EndianUtils.readUE7(new ByteArrayInputStream(data)));
38 }
39 }
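The three vectors above imply a base-128 varint-style encoding: the high bit of each byte is a continuation flag and the low seven bits are accumulated most-significant-group first. A standalone sketch inferred from those vectors (not the actual EndianUtils implementation):

```java
import java.io.ByteArrayInputStream;
import java.io.IOException;
import java.io.InputStream;

// Hedged sketch of UE7 decoding, reconstructed from the test vectors:
// 0x08 -> 8, 0x84 0x1e -> 542, 0xac 0xbe 0x17 -> 728855.
public class ReadUE7Sketch {
    static long readUE7(InputStream in) throws IOException {
        long value = 0;
        int b;
        while ((b = in.read()) != -1) {
            value = (value << 7) | (b & 0x7F); // append the 7 payload bits
            if ((b & 0x80) == 0) {             // high bit clear: last byte
                return value;
            }
        }
        throw new IOException("Stream ended in the middle of a UE7 value");
    }

    public static void main(String[] args) throws IOException {
        System.out.println(readUE7(new ByteArrayInputStream(
                new byte[] { 0x08 })));                                // 8
        System.out.println(readUE7(new ByteArrayInputStream(
                new byte[] { (byte) 0x84, 0x1e })));                   // 542
        System.out.println(readUE7(new ByteArrayInputStream(
                new byte[] { (byte) 0xac, (byte) 0xbe, 0x17 })));      // 728855
    }
}
```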
1616
1717 package org.apache.tika.io;
1818
19 import static org.junit.Assert.assertEquals;
20 import static org.junit.Assert.assertTrue;
21 import static org.junit.Assert.fail;
1922
2023 import org.junit.Test;
21 import static org.junit.Assert.*;
2224
2325 public class FilenameUtilsTest {
2426
6767 */
6868 private static InputStream generateStream(int from, int length)
6969 {
70 return new ByteArrayInputStream(generateText(from, length).getBytes());
70 return new ByteArrayInputStream(generateText(from, length).getBytes(IOUtils.UTF_8));
7171 }
7272
7373 /**
122122 TailStream stream = new TailStream(generateStream(0, 2 * count), count);
123123 readStream(stream);
124124 assertEquals("Wrong buffer", generateText(count, count), new String(
125 stream.getTail()));
125 stream.getTail(), IOUtils.UTF_8));
126126 }
127127
128128 /**
143143 read = stream.read(buf);
144144 }
145145 assertEquals("Wrong buffer", generateText(count - tailSize, tailSize),
146 new String(stream.getTail()));
146 new String(stream.getTail(), IOUtils.UTF_8));
147147 stream.close();
148148 }
149149
163163 stream.reset();
164164 readStream(stream);
165165 assertEquals("Wrong buffer", generateText(tailSize, tailSize),
166 new String(stream.getTail()));
166 new String(stream.getTail(), IOUtils.UTF_8));
167167 }
168168
169169 /**
179179 byte[] buf = new byte[count];
180180 stream.read(buf);
181181 assertEquals("Wrong buffer", generateText(count - tailSize, tailSize),
182 new String(stream.getTail()));
182 new String(stream.getTail(), IOUtils.UTF_8));
183183 stream.close();
184184 }
185185
196196 assertEquals("Wrong skip result", skipCount, stream.skip(skipCount));
197197 assertEquals("Wrong buffer",
198198 generateText(skipCount - tailSize, tailSize),
199 new String(stream.getTail()));
199 new String(stream.getTail(), IOUtils.UTF_8));
200200 stream.close();
201201 }
202202
210210 TailStream stream = new TailStream(generateStream(0, count), 2 * count);
211211 assertEquals("Wrong skip result", count, stream.skip(2 * count));
212212 assertEquals("Wrong buffer", generateText(0, count),
213 new String(stream.getTail()));
213 new String(stream.getTail(), IOUtils.UTF_8));
214214 stream.close();
215215 }
216216
1515 */
1616 package org.apache.tika.io;
1717
18 import static org.junit.Assert.assertEquals;
19 import static org.junit.Assert.assertFalse;
20 import static org.junit.Assert.assertTrue;
21
1822 import java.io.ByteArrayInputStream;
1923 import java.io.ByteArrayOutputStream;
2024 import java.io.File;
2630 import java.net.URL;
2731
2832 import org.apache.tika.metadata.Metadata;
29
3033 import org.junit.Test;
31 import static org.junit.Assert.assertEquals;
32 import static org.junit.Assert.assertFalse;
33 import static org.junit.Assert.assertTrue;
3434
3535 public class TikaInputStreamTest {
3636
6161 @Test
6262 public void testStreamBased() throws IOException {
6363 InputStream input =
64 new ByteArrayInputStream("Hello, World!".getBytes("UTF-8"));
64 new ByteArrayInputStream("Hello, World!".getBytes(IOUtils.UTF_8));
6565 InputStream stream = TikaInputStream.get(input);
6666
6767 File file = TikaInputStream.get(stream).getFile();
8888 File file = File.createTempFile("tika-", ".tmp");
8989 OutputStream stream = new FileOutputStream(file);
9090 try {
91 stream.write(data.getBytes("UTF-8"));
91 stream.write(data.getBytes(IOUtils.UTF_8));
9292 } finally {
9393 stream.close();
9494 }
107107 private String readStream(InputStream stream) throws IOException {
108108 ByteArrayOutputStream buffer = new ByteArrayOutputStream();
109109 IOUtils.copy(stream, buffer);
110 return buffer.toString("UTF-8");
110 return buffer.toString(IOUtils.UTF_8.name());
111111 }
112112
113113 @Test
1515 */
1616 package org.apache.tika.language;
1717
18 import static org.junit.Assert.assertEquals;
19 import static org.junit.Assert.assertFalse;
20 import static org.junit.Assert.assertTrue;
21
1822 import java.io.IOException;
1923 import java.io.InputStream;
2024 import java.io.InputStreamReader;
2125 import java.io.Writer;
2226 import java.util.HashMap;
23
24 import static org.junit.Assert.assertEquals;
25 import static org.junit.Assert.assertFalse;
26 import static org.junit.Assert.assertTrue;
27 import java.util.Locale;
2728
2829 import org.apache.tika.io.IOUtils;
2930 import org.junit.Before;
4849 public void setUp() {
4950 LanguageIdentifier.initProfiles();
5051 }
51
52
5253 @Test
5354 public void testLanguageDetection() throws IOException {
5455 for (String language : languages) {
104105 assertTrue(identifier.isReasonablyCertain());
105106 }
106107
107 @Test
108 public void testMixedLanguages() throws IOException {
108 // Enable this to compare performance
109 public void testPerformance() throws IOException {
110 final int MRUNS = 8;
111 final int IRUNS = 10;
112 int detected = 0; // To avoid code removal by JVM or compiler
113 String lastResult = null;
114 for (int m = 0 ; m < MRUNS ; m++) {
115 LanguageProfile.useInterleaved = (m & 1) == 1; // Alternate between standard and interleaved
116 String currentResult = "";
117 final long start = System.nanoTime();
118 for (int i = 0 ; i < IRUNS ; i++) {
119 for (String language : languages) {
120 ProfilingWriter writer = new ProfilingWriter();
121 writeTo(language, writer);
122 LanguageIdentifier identifier = new LanguageIdentifier(writer.getProfile());
123 if (identifier.isReasonablyCertain()) {
124 currentResult += identifier.getLanguage();
125 detected++;
126 }
127 }
128 }
129 System.out.println(String.format(Locale.ROOT,
130 "Performed %d detections at %2d ms/test with interleaved=%b",
131 languages.length*IRUNS, (System.nanoTime()-start)/1000000/(languages.length*IRUNS),
132 LanguageProfile.useInterleaved));
133 if (lastResult != null) { // Might as well test that they behave the same while we're at it
134 assertEquals("This result should be equal to the last", lastResult, currentResult);
135 }
136 lastResult = currentResult;
137 }
138 if (detected == -1) {
139 System.out.println("Never reached, but kept so the JVM cannot optimize the detection loop away");
140 }
141 }
142
143 @Test
144 public void testMixedLanguages() throws IOException {
109145 for (String language : languages) {
110146 for (String other : languages) {
111147 if (!language.equals(other)) {
138174 InputStream stream =
139175 LanguageIdentifierTest.class.getResourceAsStream(language + ".test");
140176 try {
141 IOUtils.copy(new InputStreamReader(stream, "UTF-8"), writer);
177 IOUtils.copy(new InputStreamReader(stream, IOUtils.UTF_8), writer);
142178 } finally {
143179 stream.close();
144180 }
1616
1717 package org.apache.tika.language;
1818
19 import static org.junit.Assert.assertEquals;
20 import static org.junit.Assert.assertTrue;
21
1922 import java.io.BufferedReader;
2023 import java.io.File;
2124 import java.io.FileInputStream;
2629 import java.net.URISyntaxException;
2730
2831 import org.apache.tika.exception.TikaException;
32 import org.apache.tika.io.IOUtils;
2933 import org.junit.After;
3034 import org.junit.Test;
31
32 import static org.junit.Assert.assertEquals;
33 import static org.junit.Assert.assertTrue;
3435
3536 public class LanguageProfilerBuilderTest {
3637 /* Test members */
3940 private final String profileName = "../tika-core/src/test/resources/org/apache/tika/language/langbuilder/"
4041 + LanguageProfilerBuilderTest.class.getName();
4142 private final String corpusName = "langbuilder/welsh_corpus.txt";
42 private final String encoding = "UTF-8";
4343 private final String FILE_EXTENSION = "ngp";
4444 private final String LANGUAGE = "welsh";
4545 private final int maxlen = 1000;
4949 InputStream is =
5050 LanguageProfilerBuilderTest.class.getResourceAsStream(corpusName);
5151 try {
52 ngramProfile = LanguageProfilerBuilder.create(profileName, is , encoding);
52 ngramProfile = LanguageProfilerBuilder.create(profileName, is , IOUtils.UTF_8.name());
5353 } finally {
5454 is.close();
5555 }
8181 + FILE_EXTENSION));
8282 try {
8383 BufferedReader reader = new BufferedReader(new InputStreamReader(
84 stream, encoding));
84 stream, IOUtils.UTF_8));
8585 String line = reader.readLine();
8686 while (line != null) {
8787 if (line.length() > 0 && !line.startsWith("#")) {// skips the
2424 import java.net.URL;
2525
2626 import org.apache.tika.config.TikaConfig;
27 import org.apache.tika.io.IOUtils;
2728 import org.apache.tika.metadata.Metadata;
28
2929 import org.junit.Before;
3030 import org.junit.Test;
3131
7373 testFile("image/cgm", "plotutils-bin-cgm-v3.cgm");
7474 // test HTML detection of malformed file, previously identified as image/cgm (TIKA-1170)
7575 testFile("text/html", "test-malformed-header.html.bin");
76
77 //test GCMD Directory Interchange Format (.dif) TIKA-1561
78 testFile("text/dif+xml", "brwNIMS_2014.dif");
7679 }
7780
7881 @Test
8487 new ByteArrayInputStream("\ufefftest".getBytes("UTF-16BE")),
8588 new Metadata()));
8689 assertEquals(MediaType.TEXT_PLAIN, mimeTypes.detect(
87 new ByteArrayInputStream("\ufefftest".getBytes("UTF-8")),
90 new ByteArrayInputStream("\ufefftest".getBytes(IOUtils.UTF_8)),
8891 new Metadata()));
8992 }
9093
194197 @Test
195198 public void testNotXML() throws IOException {
196199 assertEquals(MediaType.TEXT_PLAIN, mimeTypes.detect(
197 new ByteArrayInputStream("<!-- test -->".getBytes("UTF-8")),
200 new ByteArrayInputStream("<!-- test -->".getBytes(IOUtils.UTF_8)),
198201 new Metadata()));
199202 }
200203
218221 */
219222 @Test
220223 public void testMimeMagicClashSamePriority() throws IOException {
221 byte[] helloWorld = "Hello, World!".getBytes("UTF-8");
224 byte[] helloWorld = "Hello, World!".getBytes(IOUtils.UTF_8);
222225 MediaType helloType = MediaType.parse("hello/world-file");
223226 MediaType helloXType = MediaType.parse("hello/x-world-hello");
224227 Metadata metadata;
1919 import java.lang.reflect.Field;
2020 import java.util.ArrayList;
2121 import java.util.List;
22 import java.util.Set;
2223
2324 import org.apache.tika.config.TikaConfig;
2425 import org.apache.tika.metadata.Metadata;
25
2626 import org.junit.Before;
2727 import org.junit.Test;
28
2829 import static org.junit.Assert.assertEquals;
2930 import static org.junit.Assert.assertNotNull;
3031 import static org.junit.Assert.assertTrue;
141142 mime.getLinks().get(0).toString());
142143 }
143144
145 @Test
146 public void testReadParameterHierarchy() throws Exception {
147 MimeType mimeBTree4 = this.mimeTypes.forName("application/x-berkeley-db;format=btree;version=4");
148 MediaType mtBTree4 = mimeBTree4.getType();
149
150 // Canonicalised with spaces
151 assertEquals("application/x-berkeley-db; format=btree; version=4", mimeBTree4.toString());
152 assertEquals("application/x-berkeley-db; format=btree; version=4", mtBTree4.toString());
153
154 // Parent has one parameter
155 MediaType mtBTree = this.mimeTypes.getMediaTypeRegistry().getSupertype(mtBTree4);
156 assertEquals("application/x-berkeley-db; format=btree", mtBTree.toString());
157
158 // Parent has several children, for versions 2 through 4
159 Set<MediaType> mtBTreeChildren = this.mimeTypes.getMediaTypeRegistry().getChildTypes(mtBTree);
160 assertTrue(mtBTreeChildren.toString(), mtBTreeChildren.size() >= 3);
161 assertTrue(mtBTreeChildren.toString(), mtBTreeChildren.contains(mtBTree4));
162
163 // Parent of that has none
164 MediaType mtBD = this.mimeTypes.getMediaTypeRegistry().getSupertype(mtBTree);
165 assertEquals("application/x-berkeley-db", mtBD.toString());
166
167 // If we use one with parameters not known in the media registry,
168 // getting the parent will return the non-parameter version
169 MediaType mtAlt = MediaType.application("x-berkeley-db; format=unknown; version=42");
170 MediaType mtAltP = this.mimeTypes.getMediaTypeRegistry().getSupertype(mtAlt);
171 assertEquals("application/x-berkeley-db", mtAltP.toString());
172 }
173
144174 /**
145175 * TIKA-746 Ensures that the custom mimetype maps were also
146176 * loaded and used
3838 public class CompositeParserTest {
3939
4040 @Test
41 @SuppressWarnings("serial")
4142 public void testFindDuplicateParsers() {
4243 Parser a = new EmptyParser() {
4344 public Set<MediaType> getSupportedTypes(ParseContext context) {
2424 import org.apache.tika.exception.TikaException;
2525 import org.apache.tika.metadata.Metadata;
2626 import org.apache.tika.mime.MediaType;
27 import org.apache.tika.sax.XHTMLContentHandler;
2728 import org.xml.sax.ContentHandler;
2829 import org.xml.sax.SAXException;
2930
3031 /**
3132 * A Dummy Parser for use with unit tests.
33 * <p>
34 * See also {@link org.apache.tika.parser.mock.MockParser}.
3235 */
3336 public class DummyParser extends AbstractParser {
3437 private Set<MediaType> types;
5356 metadata.add(m.getKey(), m.getValue());
5457 }
5558
56 handler.startDocument();
59 XHTMLContentHandler xhtml = new XHTMLContentHandler(handler, metadata);
60 xhtml.startDocument();
5761 if (xmlText != null) {
58 handler.characters(xmlText.toCharArray(), 0, xmlText.length());
62 xhtml.characters(xmlText.toCharArray(), 0, xmlText.length());
5963 }
60 handler.endDocument();
64 xhtml.endDocument();
6165 }
6266
6367 }
0 /*
1 * Licensed to the Apache Software Foundation (ASF) under one or more
2 * contributor license agreements. See the NOTICE file distributed with
3 * this work for additional information regarding copyright ownership.
4 * The ASF licenses this file to You under the Apache License, Version 2.0
5 * (the "License"); you may not use this file except in compliance with
6 * the License. You may obtain a copy of the License at
7 *
8 * http://www.apache.org/licenses/LICENSE-2.0
9 *
10 * Unless required by applicable law or agreed to in writing, software
11 * distributed under the License is distributed on an "AS IS" BASIS,
12 * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
13 * See the License for the specific language governing permissions and
14 * limitations under the License.
15 */
16 package org.apache.tika.parser;
17
18 import static org.junit.Assert.assertEquals;
19
20 import java.io.ByteArrayInputStream;
21 import java.util.Arrays;
22 import java.util.Collections;
23 import java.util.HashMap;
24 import java.util.HashSet;
25 import java.util.Set;
26
27 import org.apache.tika.metadata.Metadata;
28 import org.apache.tika.mime.MediaType;
29 import org.apache.tika.sax.BodyContentHandler;
30 import org.junit.Test;
31
32 public class ParserDecoratorTest {
33 @Test
34 public void withAndWithoutTypes() {
35 Set<MediaType> onlyTxt = Collections.singleton(MediaType.TEXT_PLAIN);
36 Set<MediaType> onlyOct = Collections.singleton(MediaType.OCTET_STREAM);
37 Set<MediaType> both = new HashSet<MediaType>();
38 both.addAll(onlyOct);
39 both.addAll(onlyTxt);
40
41 Parser p;
42 Set<MediaType> types;
43 ParseContext context = new ParseContext();
44
45
46 // With a parser of no types, get the decorated type
47 p = ParserDecorator.withTypes(EmptyParser.INSTANCE, onlyTxt);
48 types = p.getSupportedTypes(context);
49 assertEquals(1, types.size());
50 assertEquals(types.toString(), true, types.contains(MediaType.TEXT_PLAIN));
51
52 // With a parser with other types, still just the decorated type
53 p = ParserDecorator.withTypes(
54 new DummyParser(onlyOct, new HashMap<String,String>(), ""), onlyTxt);
55 types = p.getSupportedTypes(context);
56 assertEquals(1, types.size());
57 assertEquals(types.toString(), true, types.contains(MediaType.TEXT_PLAIN));
58
59
60 // Exclude will remove if there
61 p = ParserDecorator.withoutTypes(EmptyParser.INSTANCE, onlyTxt);
62 types = p.getSupportedTypes(context);
63 assertEquals(0, types.size());
64
65 p = ParserDecorator.withoutTypes(
66 new DummyParser(onlyOct, new HashMap<String,String>(), ""), onlyTxt);
67 types = p.getSupportedTypes(context);
68 assertEquals(1, types.size());
69 assertEquals(types.toString(), true, types.contains(MediaType.OCTET_STREAM));
70
71 p = ParserDecorator.withoutTypes(
72 new DummyParser(both, new HashMap<String,String>(), ""), onlyTxt);
73 types = p.getSupportedTypes(context);
74 assertEquals(1, types.size());
75 assertEquals(types.toString(), true, types.contains(MediaType.OCTET_STREAM));
76 }
77
78 /**
79 * Testing one proposed implementation for TIKA-1509
80 */
81 @Test
82 public void withFallback() throws Exception {
83 Set<MediaType> onlyOct = Collections.singleton(MediaType.OCTET_STREAM);
84 Set<MediaType> octAndText = new HashSet<MediaType>(Arrays.asList(
85 MediaType.OCTET_STREAM, MediaType.TEXT_PLAIN));
86
87 ParseContext context = new ParseContext();
88 BodyContentHandler handler;
89 Metadata metadata;
90
91 ErrorParser pFail = new ErrorParser();
92 DummyParser pWork = new DummyParser(onlyOct, new HashMap<String,String>(), "Fell back!");
93 EmptyParser pNothing = new EmptyParser();
94
95 // Create a combination which will fail first
96 @SuppressWarnings("deprecation")
97 Parser p = ParserDecorator.withFallbacks(Arrays.asList(pFail, pWork), octAndText);
98
99 // Will claim to support the types given, not those on the child parsers
100 Set<MediaType> types = p.getSupportedTypes(context);
101 assertEquals(2, types.size());
102 assertEquals(types.toString(), true, types.contains(MediaType.TEXT_PLAIN));
103 assertEquals(types.toString(), true, types.contains(MediaType.OCTET_STREAM));
104
105 // Parsing will make it to the second one
106 metadata = new Metadata();
107 handler = new BodyContentHandler();
108 p.parse(new ByteArrayInputStream(new byte[] {0,1,2,3,4}), handler, metadata, context);
109 assertEquals("Fell back!", handler.toString());
110
111
112 // With a parser that will work with no output, will get nothing
113 p = ParserDecorator.withFallbacks(Arrays.asList(pNothing, pWork), octAndText);
114 metadata = new Metadata();
115 handler = new BodyContentHandler();
116 p.parse(new ByteArrayInputStream(new byte[] {0,1,2,3,4}), handler, metadata, context);
117 assertEquals("", handler.toString());
118 }
119 }
0 package org.apache.tika.parser.mock;
1
2 /*
3 * Licensed to the Apache Software Foundation (ASF) under one or more
4 * contributor license agreements. See the NOTICE file distributed with
5 * this work for additional information regarding copyright ownership.
6 * The ASF licenses this file to You under the Apache License, Version 2.0
7 * (the "License"); you may not use this file except in compliance with
8 * the License. You may obtain a copy of the License at
9 *
10 * http://www.apache.org/licenses/LICENSE-2.0
11 *
12 * Unless required by applicable law or agreed to in writing, software
13 * distributed under the License is distributed on an "AS IS" BASIS,
14 * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
15 * See the License for the specific language governing permissions and
16 * limitations under the License.
17 */
18
19
20 import javax.xml.parsers.DocumentBuilder;
21 import javax.xml.parsers.DocumentBuilderFactory;
22 import javax.xml.parsers.ParserConfigurationException;
23 import java.io.IOException;
24 import java.io.InputStream;
25 import java.lang.reflect.Constructor;
26 import java.util.ArrayList;
27 import java.util.Date;
28 import java.util.HashSet;
29 import java.util.List;
30 import java.util.Set;
31
32 import org.apache.tika.exception.TikaException;
33 import org.apache.tika.metadata.Metadata;
34 import org.apache.tika.mime.MediaType;
35 import org.apache.tika.parser.AbstractParser;
36 import org.apache.tika.parser.ParseContext;
37 import org.apache.tika.sax.XHTMLContentHandler;
38 import org.w3c.dom.Document;
39 import org.w3c.dom.NamedNodeMap;
40 import org.w3c.dom.Node;
41 import org.w3c.dom.NodeList;
42 import org.xml.sax.ContentHandler;
43 import org.xml.sax.SAXException;
44
45 /**
46 * This class enables mocking of parser behavior for use in testing
47 * wrappers and drivers of parsers.
48 * <p>
49 * See resources/test-documents/mock/example.xml in tika-parsers/test for the documentation
50 * of all the options for this MockParser.
51 * <p>
52 * Tests for this class are in tika-parsers.
53 * <p>
54 * See also {@link org.apache.tika.parser.DummyParser} for another option.
55 */
56
57 public class MockParser extends AbstractParser {
58
59 private static final long serialVersionUID = 1L;
60
61 @Override
62 public Set<MediaType> getSupportedTypes(ParseContext context) {
63 Set<MediaType> types = new HashSet<MediaType>();
64 MediaType type = MediaType.application("mock+xml");
65 types.add(type);
66 return types;
67 }
68
69 @Override
70 public void parse(InputStream stream, ContentHandler handler,
71 Metadata metadata, ParseContext context) throws IOException,
72 SAXException, TikaException {
73 Document doc = null;
74 DocumentBuilderFactory fact = DocumentBuilderFactory.newInstance();
75 DocumentBuilder docBuilder = null;
76 try {
77 docBuilder = fact.newDocumentBuilder();
78 doc = docBuilder.parse(stream);
79 } catch (ParserConfigurationException e) {
80 throw new IOException(e);
81 } catch (SAXException e) {
82 throw new IOException(e);
83 }
84 Node root = doc.getDocumentElement();
85 NodeList actions = root.getChildNodes();
86 XHTMLContentHandler xhtml = new XHTMLContentHandler(handler, metadata);
87 xhtml.startDocument();
88 for (int i = 0; i < actions.getLength(); i++) {
89 executeAction(actions.item(i), metadata, xhtml);
90 }
91 xhtml.endDocument();
92 }
93
94 private void executeAction(Node action, Metadata metadata, XHTMLContentHandler xhtml) throws SAXException,
95 IOException, TikaException {
96
97 if (action.getNodeType() != Node.ELEMENT_NODE) {
98 return;
99 }
100
101 String name = action.getNodeName();
102 if ("metadata".equals(name)) {
103 metadata(action, metadata);
104 } else if ("write".equals(name)) {
105 write(action, xhtml);
106 } else if ("throw".equals(name)) {
107 throwIt(action);
108 } else if ("hang".equals(name)) {
109 hang(action);
110 } else if ("oom".equals(name)) {
111 kabOOM();
112 } else if ("print_out".equals(name) || "print_err".equals(name)){
113 print(action, name);
114 } else {
115 throw new IllegalArgumentException("Didn't recognize mock action: "+name);
116 }
117 }
118
119 private void print(Node action, String name) {
120 String content = action.getTextContent();
121 if ("print_out".equals(name)) {
122 System.out.println(content);
123 } else if ("print_err".equals(name)) {
124 System.err.println(content);
125 } else {
126 throw new IllegalArgumentException("must be print_out or print_err");
127 }
128 }
129 private void hang(Node action) {
130 boolean interruptible = true;
131 boolean heavy = false;
132 long millis = -1;
133 long pulseMillis = -1;
134 NamedNodeMap attrs = action.getAttributes();
135 Node iNode = attrs.getNamedItem("interruptible");
136 if (iNode != null) {
137 interruptible = ("true".equals(iNode.getNodeValue()));
138 }
139 Node hNode = attrs.getNamedItem("heavy");
140 if (hNode != null) {
141 heavy = ("true".equals(hNode.getNodeValue()));
142 }
143
144 Node mNode = attrs.getNamedItem("millis");
145 if (mNode == null) {
146 throw new RuntimeException("Must specify \"millis\" attribute for hang.");
147 }
148 String millisString = mNode.getNodeValue();
149 try {
150 millis = Long.parseLong(millisString);
151 } catch (NumberFormatException e) {
152 throw new RuntimeException("Value for \"millis\" attribute must be a long.");
153 }
154
155 if (heavy) {
156 Node pNode = attrs.getNamedItem("pulse_millis");
157 if (pNode == null) {
158 throw new RuntimeException("Must specify attribute \"pulse_millis\" if the hang is \"heavy\"");
159 }
160 String pulseMillisString = pNode.getNodeValue();
161 try {
162 pulseMillis = Long.parseLong(pulseMillisString);
163 } catch (NumberFormatException e) {
164 throw new RuntimeException("Value for \"pulse_millis\" attribute must be a long.");
165 }
166 }
167 if (heavy) {
168 hangHeavy(millis, pulseMillis, interruptible);
169 } else {
170 sleep(millis, interruptible);
171 }
172 }
173
174 private void throwIt(Node action) throws IOException,
175 SAXException, TikaException {
176 NamedNodeMap attrs = action.getAttributes();
177 String className = attrs.getNamedItem("class").getNodeValue();
178 String msg = action.getTextContent();
179 throwIt(className, msg);
180 }
181
182 private void metadata(Node action, Metadata metadata) {
183 NamedNodeMap attrs = action.getAttributes();
184 // throws an NPE unless a "name" attribute is present
185 String name = attrs.getNamedItem("name").getNodeValue();
186 String value = action.getTextContent();
187 Node actionType = attrs.getNamedItem("action");
188 if (actionType == null) {
189 metadata.add(name, value);
190 } else {
191 if ("set".equals(actionType.getNodeValue())) {
192 metadata.set(name, value);
193 } else {
194 metadata.add(name, value);
195 }
196 }
197 }
198
199 private void write(Node action, XHTMLContentHandler xhtml) throws SAXException {
200 NamedNodeMap attrs = action.getAttributes();
201 Node eNode = attrs.getNamedItem("element");
202 String elementType = "p";
203 if (eNode != null) {
204 elementType = eNode.getTextContent();
205 }
206 String text = action.getTextContent();
207 xhtml.startElement(elementType);
208 xhtml.characters(text);
209 xhtml.endElement(elementType);
210 }
211
212
213 private void throwIt(String className, String msg) throws IOException,
214 SAXException, TikaException {
215 Throwable t = null;
216 if (msg == null || msg.equals("")) {
217 try {
218 t = (Throwable) Class.forName(className).newInstance();
219 } catch (Exception e) {
220 throw new RuntimeException("couldn't create throwable class:"+className, e);
221 }
222 } else {
223 try {
224 Class<?> clazz = Class.forName(className);
225 Constructor<?> con = clazz.getConstructor(String.class);
226 t = (Throwable) con.newInstance(msg);
227 } catch (Exception e) {
228 throw new RuntimeException("couldn't create throwable class:" + className, e);
229 }
230 }
231 if (t instanceof SAXException) {
232 throw (SAXException)t;
233 } else if (t instanceof IOException) {
234 throw (IOException) t;
235 } else if (t instanceof TikaException) {
236 throw (TikaException) t;
237 } else if (t instanceof Error) {
238 throw (Error) t;
239 } else if (t instanceof RuntimeException) {
240 throw (RuntimeException) t;
241 } else {
242 //wrap the throwable in a RuntimeException
243 throw new RuntimeException(t);
244 }
245 }
246
247 private void kabOOM() {
248 List<int[]> ints = new ArrayList<int[]>();
249
250 while (true) {
251 int[] intArr = new int[32000];
252 ints.add(intArr);
253 }
254 }
255
256 private void hangHeavy(long maxMillis, long pulseCheckMillis, boolean interruptible) {
257 //do some heavy computation and occasionally check for
258 //whether time has exceeded maxMillis (see TIKA-1132 for inspiration)
259 //or whether the thread was interrupted
260 long start = new Date().getTime();
261 int lastChecked = 0;
262 while (true) {
263 for (int i = 1; i < Integer.MAX_VALUE; i++) {
264 for (int j = 1; j < Integer.MAX_VALUE; j++) {
265 double div = (double) i / (double) j;
266 lastChecked++;
267 if (lastChecked > pulseCheckMillis) {
268 lastChecked = 0;
269 if (interruptible && Thread.currentThread().isInterrupted()) {
270 return;
271 }
272 long elapsed = new Date().getTime()-start;
273 if (elapsed > maxMillis) {
274 return;
275 }
276 }
277 }
278 }
279 }
280 }
281
282 private void sleep(long maxMillis, boolean isInterruptible) {
283 long start = new Date().getTime();
284 long millisRemaining = maxMillis;
285 while (true) {
286 try {
287 Thread.sleep(millisRemaining);
288 } catch (InterruptedException e) {
289 if (isInterruptible) {
290 return;
291 }
292 }
293 long elapsed = new Date().getTime()-start;
294 millisRemaining = maxMillis - elapsed;
295 if (millisRemaining <= 0) {
296 break;
297 }
298 }
299 }
300 }
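The sleep() helper above shows a common pattern: Thread.sleep may return early via InterruptedException, so the remaining duration is recomputed after every wake-up and the loop re-enters until the full wait has elapsed (unless interruption is meant to end the wait immediately). A minimal self-contained sketch of the same pattern — `sleepFully` is a hypothetical name, not part of Tika:

```java
public class SleepFully {
    // Sleep for at least maxMillis, re-sleeping after early wake-ups.
    // If isInterruptible is true, an interrupt ends the wait immediately.
    static void sleepFully(long maxMillis, boolean isInterruptible) {
        long start = System.currentTimeMillis();
        long remaining = maxMillis;
        while (remaining > 0) {
            try {
                Thread.sleep(remaining);
            } catch (InterruptedException e) {
                if (isInterruptible) {
                    Thread.currentThread().interrupt(); // restore the flag for callers
                    return;
                }
                // otherwise swallow the interrupt and keep waiting
            }
            remaining = maxMillis - (System.currentTimeMillis() - start);
        }
    }

    public static void main(String[] args) {
        long start = System.currentTimeMillis();
        sleepFully(50, false);
        // the loop only exits once elapsed >= maxMillis
        System.out.println(System.currentTimeMillis() - start >= 50); // prints true
    }
}
```

Restoring the interrupt flag before returning (rather than discarding it) lets callers further up the stack still observe the interruption.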
0 /*
1 * Licensed to the Apache Software Foundation (ASF) under one or more
2 * contributor license agreements. See the NOTICE file distributed with
3 * this work for additional information regarding copyright ownership.
4 * The ASF licenses this file to You under the Apache License, Version 2.0
5 * (the "License"); you may not use this file except in compliance with
6 * the License. You may obtain a copy of the License at
7 *
8 * http://www.apache.org/licenses/LICENSE-2.0
9 *
10 * Unless required by applicable law or agreed to in writing, software
11 * distributed under the License is distributed on an "AS IS" BASIS,
12 * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
13 * See the License for the specific language governing permissions and
14 * limitations under the License.
15 */
16 package org.apache.tika.sax;
17
18 import static junit.framework.Assert.assertFalse;
19 import static junit.framework.Assert.assertTrue;
20 import static org.junit.Assert.assertEquals;
21
22 import java.io.ByteArrayOutputStream;
23 import java.io.IOException;
24 import java.io.InputStream;
25 import java.io.UnsupportedEncodingException;
26 import java.util.Set;
27
28 import org.apache.tika.exception.TikaException;
29 import org.apache.tika.io.IOUtils;
30 import org.apache.tika.metadata.Metadata;
31 import org.apache.tika.mime.MediaType;
32 import org.apache.tika.parser.ParseContext;
33 import org.apache.tika.parser.Parser;
34 import org.junit.Test;
35 import org.xml.sax.Attributes;
36 import org.xml.sax.ContentHandler;
37 import org.xml.sax.SAXException;
38 import org.xml.sax.helpers.AttributesImpl;
39 import org.xml.sax.helpers.DefaultHandler;
40
41 /**
42 * Test cases for the {@link org.apache.tika.sax.BodyContentHandler} class.
43 */
44 public class BasicContentHandlerFactoryTest {
45
46 private static final String ENCODING = IOUtils.UTF_8.name();
47 //default max char len (at least in WriteOutContentHandler is 100k)
48 private static final int OVER_DEFAULT = 120000;
49
50 @Test
51 public void testIgnore() throws Exception {
52 Parser p = new MockParser(OVER_DEFAULT);
53 ContentHandler handler =
54 new BasicContentHandlerFactory(BasicContentHandlerFactory.HANDLER_TYPE.IGNORE, -1).getNewContentHandler();
55 assertTrue(handler instanceof DefaultHandler);
56 p.parse(null, handler, null, null);
57 //unfortunately, the DefaultHandler's toString() does not return ""
58 assertContains("org.xml.sax.helpers.DefaultHandler", handler.toString());
59
60 //tests that no write limit exception is thrown
61 p = new MockParser(100);
62 handler =
63 new BasicContentHandlerFactory(BasicContentHandlerFactory.HANDLER_TYPE.IGNORE, 5).getNewContentHandler();
64 assertTrue(handler instanceof DefaultHandler);
65 p.parse(null, handler, null, null);
66 assertContains("org.xml.sax.helpers.DefaultHandler", handler.toString());
67 }
68
69 @Test
70 public void testText() throws Exception {
71 Parser p = new MockParser(OVER_DEFAULT);
72 BasicContentHandlerFactory.HANDLER_TYPE type =
73 BasicContentHandlerFactory.HANDLER_TYPE.TEXT;
74 ContentHandler handler =
75 new BasicContentHandlerFactory(type, -1).getNewContentHandler();
76
77 assertTrue(handler instanceof ToTextContentHandler);
78 p.parse(null, handler, null, null);
79 String extracted = handler.toString();
80 assertContains("This is the title", extracted);
81 assertContains("aaaaaaaaaa", extracted);
82 assertNotContains("<body", extracted);
83 assertNotContains("<html", extracted);
84 assertTrue(extracted.length() > 110000);
85 //now test write limit
86 p = new MockParser(10);
87 handler = new BasicContentHandlerFactory(type, 5).getNewContentHandler();
88 assertTrue(handler instanceof WriteOutContentHandler);
89 assertWriteLimitReached(p, (WriteOutContentHandler) handler);
90 extracted = handler.toString();
91 assertContains("This ", extracted);
92 assertNotContains("aaaa", extracted);
93
94 //now test outputstream call
95 p = new MockParser(OVER_DEFAULT);
96 ByteArrayOutputStream os = new ByteArrayOutputStream();
97 handler = new BasicContentHandlerFactory(type, -1).getNewContentHandler(os, ENCODING);
98 assertTrue(handler instanceof ToTextContentHandler);
99 p.parse(null, handler, null, null);
100 assertContains("This is the title", os.toByteArray());
101 assertContains("aaaaaaaaaa", os.toByteArray());
102 assertTrue(os.toByteArray().length > 110000);
103 assertNotContains("<body", os.toByteArray());
104 assertNotContains("<html", os.toByteArray());
105
106 p = new MockParser(10);
107 os = new ByteArrayOutputStream();
108 handler = new BasicContentHandlerFactory(type, 5).getNewContentHandler(os, ENCODING);
109 assertTrue(handler instanceof WriteOutContentHandler);
110 assertWriteLimitReached(p, (WriteOutContentHandler) handler);
111 //When writing to an OutputStream and a write limit is reached,
112 //currently, nothing is written.
113 assertEquals(0, os.toByteArray().length);
114 }
115
116
117 @Test
118 public void testHTML() throws Exception {
119 Parser p = new MockParser(OVER_DEFAULT);
120 BasicContentHandlerFactory.HANDLER_TYPE type =
121 BasicContentHandlerFactory.HANDLER_TYPE.HTML;
122 ContentHandler handler =
123 new BasicContentHandlerFactory(type, -1).getNewContentHandler();
124
125 assertTrue(handler instanceof ToHTMLContentHandler);
126 p.parse(null, handler, null, null);
127 String extracted = handler.toString();
128 assertContains("<head><title>This is the title", extracted);
129 assertContains("aaaaaaaaaa", extracted);
130 assertTrue(extracted.length() > 110000);
131
132 //now test write limit
133 p = new MockParser(10);
134 handler = new BasicContentHandlerFactory(type, 5).getNewContentHandler();
135 assertTrue(handler instanceof WriteOutContentHandler);
136 assertWriteLimitReached(p, (WriteOutContentHandler) handler);
137 extracted = handler.toString();
138 assertContains("This ", extracted);
139 assertNotContains("aaaa", extracted);
140
141 //now test outputstream call
142 p = new MockParser(OVER_DEFAULT);
143 ByteArrayOutputStream os = new ByteArrayOutputStream();
144 handler = new BasicContentHandlerFactory(type, -1).getNewContentHandler(os, ENCODING);
145 assertTrue(handler instanceof ToHTMLContentHandler);
146 p.parse(null, handler, null, null);
147 assertContains("This is the title", os.toByteArray());
148 assertContains("aaaaaaaaaa", os.toByteArray());
149 assertContains("<body", os.toByteArray());
150 assertContains("<html", os.toByteArray());
151 assertTrue(os.toByteArray().length > 110000);
152
153
154 p = new MockParser(10);
155 os = new ByteArrayOutputStream();
156 handler = new BasicContentHandlerFactory(type, 5).getNewContentHandler(os, ENCODING);
157 assertTrue(handler instanceof WriteOutContentHandler);
158 assertWriteLimitReached(p, (WriteOutContentHandler) handler);
159 assertEquals(0, os.toByteArray().length);
160 }
161
162 @Test
163 public void testXML() throws Exception {
164 Parser p = new MockParser(OVER_DEFAULT);
165 BasicContentHandlerFactory.HANDLER_TYPE type =
166 BasicContentHandlerFactory.HANDLER_TYPE.XML;
167 ContentHandler handler =
168 new BasicContentHandlerFactory(type, -1).getNewContentHandler();
169
170 assertTrue(handler instanceof ToXMLContentHandler);
171 p.parse(null, handler, new Metadata(), null);
172 String extracted = handler.toString();
173 assertContains("<head><title>This is the title", extracted);
174 assertContains("aaaaaaaaaa", extracted);
175 assertTrue(handler.toString().length() > 110000);
176
177 //now test write limit
178 p = new MockParser(10);
179 handler = new BasicContentHandlerFactory(type, 5).getNewContentHandler();
180 assertTrue(handler instanceof WriteOutContentHandler);
181 assertWriteLimitReached(p, (WriteOutContentHandler) handler);
182 extracted = handler.toString();
183 assertContains("This ", extracted);
184 assertNotContains("aaaa", extracted);
185
186 //now test outputstream call
187 p = new MockParser(OVER_DEFAULT);
188 ByteArrayOutputStream os = new ByteArrayOutputStream();
189 handler = new BasicContentHandlerFactory(type, -1).getNewContentHandler(os, ENCODING);
190 assertTrue(handler instanceof ToXMLContentHandler);
191 p.parse(null, handler, null, null);
192
193 assertContains("This is the title", os.toByteArray());
194 assertContains("aaaaaaaaaa", os.toByteArray());
195 assertContains("<body", os.toByteArray());
196 assertContains("<html", os.toByteArray());
197 assertTrue(os.toByteArray().length > 110000);
198
199
200 p = new MockParser(10);
201 os = new ByteArrayOutputStream();
202 handler = new BasicContentHandlerFactory(type, 5).getNewContentHandler(os, ENCODING);
203 assertTrue(handler instanceof WriteOutContentHandler);
204 assertWriteLimitReached(p, (WriteOutContentHandler) handler);
205 assertEquals(0, os.toByteArray().length);
206 }
207
208
209 @Test
210 public void testBody() throws Exception {
211 Parser p = new MockParser(OVER_DEFAULT);
212 BasicContentHandlerFactory.HANDLER_TYPE type =
213 BasicContentHandlerFactory.HANDLER_TYPE.BODY;
214 ContentHandler handler =
215 new BasicContentHandlerFactory(type, -1).getNewContentHandler();
216
217 assertTrue(handler instanceof BodyContentHandler);
218
219 p.parse(null, handler, null, null);
220 String extracted = handler.toString();
221 assertNotContains("title", extracted);
222 assertContains("aaaaaaaaaa", extracted);
223 assertTrue(extracted.length() > 110000);
224
225 //now test write limit
226 p = new MockParser(10);
227 handler = new BasicContentHandlerFactory(type, 5).getNewContentHandler();
228 assertTrue(handler instanceof BodyContentHandler);
229 assertWriteLimitReached(p, (BodyContentHandler)handler);
230 extracted = handler.toString();
231 assertNotContains("This ", extracted);
232 assertContains("aaaa", extracted);
233
234 //now test outputstream call
235 p = new MockParser(OVER_DEFAULT);
236 ByteArrayOutputStream os = new ByteArrayOutputStream();
237 handler = new BasicContentHandlerFactory(type, -1).getNewContentHandler(os, ENCODING);
238 assertTrue(handler instanceof BodyContentHandler);
239 p.parse(null, handler, null, null);
240 assertNotContains("title", os.toByteArray());
241 assertContains("aaaaaaaaaa", os.toByteArray());
242 assertNotContains("<body", os.toByteArray());
243 assertNotContains("<html", os.toByteArray());
244 assertTrue(os.toByteArray().length > 110000);
245
246 p = new MockParser(10);
247 os = new ByteArrayOutputStream();
248 handler = new BasicContentHandlerFactory(type, 5).getNewContentHandler(os, ENCODING);
249 assertTrue(handler instanceof WriteOutContentHandler);
250 assertWriteLimitReached(p, (WriteOutContentHandler) handler);
251 assertEquals(0, os.toByteArray().length);
252 }
253
254 private void assertWriteLimitReached(Parser p, WriteOutContentHandler handler) throws Exception {
255 boolean wlr = false;
256 try {
257 p.parse(null, handler, null, null);
258 } catch (SAXException e) {
259 if (! handler.isWriteLimitReached(e)) {
260 throw e;
261 }
262 wlr = true;
263 }
264 assertTrue("WriteLimitReached", wlr);
265 }
266 //TODO: is there a better way than to repeat this with diff signature?
267 private void assertWriteLimitReached(Parser p, BodyContentHandler handler) throws Exception {
268 boolean wlr = false;
269 try {
270 p.parse(null, handler, null, null);
271 } catch (SAXException e) {
272 if (! e.getClass().toString().contains("org.apache.tika.sax.WriteOutContentHandler$WriteLimitReachedException")){
273 throw e;
274 }
275
276 wlr = true;
277 }
278 assertTrue("WriteLimitReached", wlr);
279 }
280
281
282 //copied from TikaTest in tika-parsers package
283 public static void assertNotContains(String needle, String haystack) {
284 assertFalse(needle + " found in:\n" + haystack, haystack.contains(needle));
285 }
286
287 public static void assertNotContains(String needle, byte[] hayStack)
288 throws UnsupportedEncodingException {
289 assertNotContains(needle, new String(hayStack, ENCODING));
290 }
291
292 public static void assertContains(String needle, String haystack) {
293 assertTrue(needle + " not found in:\n" + haystack, haystack.contains(needle));
294 }
295
296 public static void assertContains(String needle, byte[] hayStack)
297 throws UnsupportedEncodingException {
298 assertContains(needle, new String(hayStack, ENCODING));
299 }
300
301 //Simple mock parser that writes a title element
302 //followed by charsToWrite 'a' characters in the body
303 private class MockParser implements Parser {
304 private final String XHTML = "http://www.w3.org/1999/xhtml";
305 private final Attributes EMPTY_ATTRIBUTES = new AttributesImpl();
306 private final char[] TITLE = "This is the title".toCharArray();
307
308 private final int charsToWrite;
309 public MockParser(int charsToWrite) {
310 this.charsToWrite = charsToWrite;
311 }
312
313 @Override
314 public Set<MediaType> getSupportedTypes(ParseContext context) {
315 return null;
316 }
317
318 @Override
319 public void parse(InputStream stream, ContentHandler handler, Metadata metadata, ParseContext context) throws IOException, SAXException, TikaException {
320 handler.startDocument();
321 handler.startPrefixMapping("", XHTML);
322 handler.startElement(XHTML, "html", "html", EMPTY_ATTRIBUTES);
323 handler.startElement(XHTML, "head", "head", EMPTY_ATTRIBUTES);
324 handler.startElement(XHTML, "title", "title", EMPTY_ATTRIBUTES);
325 handler.characters(TITLE, 0, TITLE.length);
326 handler.endElement(XHTML, "title", "title");
327
328 handler.endElement(XHTML, "head", "head");
329 handler.startElement(XHTML, "body", "body", EMPTY_ATTRIBUTES);
330 char[] body = new char[charsToWrite];
331 for (int i = 0; i < charsToWrite; i++) {
332 body[i] = 'a';
333 }
334 handler.characters(body, 0, body.length);
335 handler.endElement(XHTML, "body", "body");
336 handler.endElement(XHTML, "html", "html");
337 handler.endDocument();
338 }
339 }
340 }
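The write-limit assertions above rely on WriteOutContentHandler throwing a dedicated SAXException once its character budget is exhausted, while keeping the characters that still fit. A minimal self-contained sketch of that behavior, using only the JDK's SAX API — `LimitedHandler` is a hypothetical stand-in, not the real Tika class:

```java
import org.xml.sax.SAXException;
import org.xml.sax.helpers.DefaultHandler;

public class LimitedHandler extends DefaultHandler {
    private final StringBuilder buffer = new StringBuilder();
    private final int writeLimit; // -1 means unlimited

    public LimitedHandler(int writeLimit) {
        this.writeLimit = writeLimit;
    }

    @Override
    public void characters(char[] ch, int start, int length) throws SAXException {
        if (writeLimit >= 0 && buffer.length() + length > writeLimit) {
            // keep what still fits, then signal the limit via a SAXException,
            // roughly as WriteOutContentHandler does
            buffer.append(ch, start, writeLimit - buffer.length());
            throw new SAXException("write limit reached");
        }
        buffer.append(ch, start, length);
    }

    @Override
    public String toString() {
        return buffer.toString();
    }

    public static void main(String[] args) {
        LimitedHandler h = new LimitedHandler(5);
        try {
            h.characters("This is the title".toCharArray(), 0, 17);
        } catch (SAXException e) {
            System.out.println(h.toString()); // prints "This "
        }
    }
}
```

This mirrors why the tests above expect "This " but not "aaaa" after the limit is hit: everything up to the budget survives, and the exception lets callers distinguish truncation from a real parse failure.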
2020 import java.io.ByteArrayOutputStream;
2121 import java.io.OutputStream;
2222
23 import org.apache.tika.io.IOUtils;
2324 import org.apache.tika.metadata.Metadata;
2425 import org.junit.Test;
2526
4445 xhtml.element("p", "Test text");
4546 xhtml.endDocument();
4647
47 assertEquals("Test text\n", buffer.toString());
48 assertEquals("Test text\n", buffer.toString(IOUtils.UTF_8.name()));
4849 }
4950
5051 }
0 /*
1 * Licensed to the Apache Software Foundation (ASF) under one or more
2 * contributor license agreements. See the NOTICE file distributed with
3 * this work for additional information regarding copyright ownership.
4 * The ASF licenses this file to You under the Apache License, Version 2.0
5 * (the "License"); you may not use this file except in compliance with
6 * the License. You may obtain a copy of the License at
7 *
8 * http://www.apache.org/licenses/LICENSE-2.0
9 *
10 * Unless required by applicable law or agreed to in writing, software
11 * distributed under the License is distributed on an "AS IS" BASIS,
12 * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
13 * See the License for the specific language governing permissions and
14 * limitations under the License.
15 */
16 package org.apache.tika.sax;
17
18 import static org.junit.Assert.assertEquals;
19 import org.junit.Test;
20 import org.xml.sax.ContentHandler;
21 import org.xml.sax.helpers.AttributesImpl;
22
23 public class SerializerTest {
24
25 @Test
26 public void testToTextContentHandler() throws Exception {
27 assertStartDocument("", new ToTextContentHandler());
28 assertCharacters("content", new ToTextContentHandler());
29 assertCharacterEscaping("<&\">", new ToTextContentHandler());
30 assertIgnorableWhitespace(" \t\r\n", new ToTextContentHandler());
31 assertEmptyElement("", new ToTextContentHandler());
32 assertEmptyElementWithAttributes("", new ToTextContentHandler());
33 assertEmptyElementWithAttributeEscaping("", new ToTextContentHandler());
34 assertElement("content", new ToTextContentHandler());
35 assertElementWithAttributes("content", new ToTextContentHandler());
36 }
37
38 @Test
39 public void testToXMLContentHandler() throws Exception {
40 assertStartDocument("", new ToXMLContentHandler());
41 assertStartDocument(
42 "<?xml version=\"1.0\" encoding=\"UTF-8\"?>\n",
43 new ToXMLContentHandler("UTF-8"));
44 assertCharacters("content", new ToXMLContentHandler());
45 assertCharacterEscaping("&lt;&amp;\"&gt;", new ToXMLContentHandler());
46 assertIgnorableWhitespace(" \t\r\n", new ToXMLContentHandler());
47 assertEmptyElement("<br />", new ToXMLContentHandler());
48 assertEmptyElementWithAttributes(
49 "<meta name=\"foo\" value=\"bar\" />",
50 new ToXMLContentHandler());
51 assertEmptyElementWithAttributeEscaping(
52 "<p class=\"&lt;&amp;&quot;&gt;\" />",
53 new ToXMLContentHandler());
54 assertElement("<p>content</p>", new ToXMLContentHandler());
55 assertElementWithAttributes(
56 "<p class=\"test\">content</p>",
57 new ToXMLContentHandler());
58 }
59
60 @Test
61 public void testToHTMLContentHandler() throws Exception {
62 assertStartDocument("", new ToHTMLContentHandler());
63 assertCharacters("content", new ToHTMLContentHandler());
64 assertCharacterEscaping("&lt;&amp;\"&gt;", new ToHTMLContentHandler());
65 assertIgnorableWhitespace(" \t\r\n", new ToHTMLContentHandler());
66 assertEmptyElement("<br>", new ToHTMLContentHandler());
67 assertEmptyElementWithAttributes(
68 "<meta name=\"foo\" value=\"bar\">",
69 new ToHTMLContentHandler());
70 assertEmptyElementWithAttributeEscaping(
71 "<p class=\"&lt;&amp;&quot;&gt;\"></p>",
72 new ToHTMLContentHandler());
73 assertElement("<p>content</p>", new ToHTMLContentHandler());
74 assertElementWithAttributes(
75 "<p class=\"test\">content</p>",
76 new ToHTMLContentHandler());
77 }
78
79 private void assertStartDocument(String expected, ContentHandler handler)
80 throws Exception {
81 handler.startDocument();
82 assertEquals(expected, handler.toString());
83 }
84
85 private void assertCharacters(String expected, ContentHandler handler)
86 throws Exception {
87 handler.characters("content".toCharArray(), 0, 7);
88 assertEquals(expected, handler.toString());
89 }
90
91 private void assertCharacterEscaping(
92 String expected, ContentHandler handler) throws Exception {
93 handler.characters("<&\">".toCharArray(), 0, 4);
94 assertEquals(expected, handler.toString());
95 }
96
97 private void assertIgnorableWhitespace(
98 String expected, ContentHandler handler) throws Exception {
99 handler.ignorableWhitespace(" \t\r\n".toCharArray(), 0, 4);
100 assertEquals(expected, handler.toString());
101 }
102
103 private void assertEmptyElement(String expected, ContentHandler handler)
104 throws Exception {
105 AttributesImpl attributes = new AttributesImpl();
106 handler.startElement("", "br", "br", attributes);
107 handler.endElement("", "br", "br");
108 assertEquals(expected, handler.toString());
109 }
110
111 private void assertEmptyElementWithAttributes(
112 String expected, ContentHandler handler) throws Exception {
113 AttributesImpl attributes = new AttributesImpl();
114 attributes.addAttribute("", "name", "name", "CDATA", "foo");
115 attributes.addAttribute("", "value", "value", "CDATA", "bar");
116 handler.startElement("", "meta", "meta", attributes);
117 handler.endElement("", "meta", "meta");
118 assertEquals(expected, handler.toString());
119 }
120
121 private void assertEmptyElementWithAttributeEscaping(
122 String expected, ContentHandler handler) throws Exception {
123 AttributesImpl attributes = new AttributesImpl();
124 attributes.addAttribute("", "class", "class", "CDATA", "<&\">");
125 handler.startElement("", "p", "p", attributes);
126 handler.endElement("", "p", "p");
127 assertEquals(expected, handler.toString());
128 }
129
130 private void assertElement(
131 String expected, ContentHandler handler) throws Exception {
132 AttributesImpl attributes = new AttributesImpl();
133 handler.startElement("", "p", "p", attributes);
134 handler.characters("content".toCharArray(), 0, 7);
135 handler.endElement("", "p", "p");
136 assertEquals(expected, handler.toString());
137 }
138
139 private void assertElementWithAttributes(
140 String expected, ContentHandler handler) throws Exception {
141 AttributesImpl attributes = new AttributesImpl();
142 attributes.addAttribute("", "class", "class", "CDATA", "test");
143 handler.startElement("", "p", "p", attributes);
144 handler.characters("content".toCharArray(), 0, 7);
145 handler.endElement("", "p", "p");
146 assertEquals(expected, handler.toString());
147 }
148
149 }
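The escaping expectations in SerializerTest follow a small set of rules: XML character content escapes '<', '&', and '>' but leaves quotes literal, while attribute values additionally escape '"'. A self-contained sketch of those rules — `escapeText` and `escapeAttribute` are hypothetical helper names, not Tika API:

```java
public class XmlEscape {
    // Escape rules for XML character content: quotes stay literal,
    // which is why the test expects &lt;&amp;"&gt; for input <&">.
    static String escapeText(String s) {
        return s.replace("&", "&amp;")   // must come first, or later entities get re-escaped
                .replace("<", "&lt;")
                .replace(">", "&gt;");
    }

    // Attribute values are quote-delimited, so the quote character
    // must be escaped as well.
    static String escapeAttribute(String s) {
        return escapeText(s).replace("\"", "&quot;");
    }

    public static void main(String[] args) {
        System.out.println(escapeText("<&\">"));      // prints &lt;&amp;"&gt;
        System.out.println(escapeAttribute("<&\">")); // prints &lt;&amp;&quot;&gt;
    }
}
```

Replacing '&' first is the key ordering constraint: doing it after the other replacements would corrupt the already-emitted entities.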
1616 package org.apache.tika.sax;
1717
1818 import static org.junit.Assert.assertEquals;
19 import static org.junit.Assert.assertTrue;
1920
2021 import java.util.ArrayList;
2122 import java.util.List;
2223
24 import org.apache.tika.config.TikaConfigTest;
2325 import org.apache.tika.metadata.Metadata;
2426
2527 import org.junit.Before;
2729
2830 import org.xml.sax.ContentHandler;
2931 import org.xml.sax.SAXException;
32 import org.xml.sax.helpers.AttributesImpl;
3033
3134 /**
3235 * Unit tests for the {@link XHTMLContentHandler} class.
120123 assertEquals("two", words[1]);
121124 }
122125
126 @Test
127 public void testAttributesOnBody() throws Exception {
128 ToHTMLContentHandler toHTMLContentHandler = new ToHTMLContentHandler();
129 XHTMLContentHandler xhtmlContentHandler = new XHTMLContentHandler(toHTMLContentHandler, new Metadata());
130 AttributesImpl attributes = new AttributesImpl();
131
132 attributes.addAttribute(XHTMLContentHandler.XHTML, "itemscope", "itemscope", "", "");
133 attributes.addAttribute(XHTMLContentHandler.XHTML, "itemtype", "itemtype", "", "http://schema.org/Event");
134
135 xhtmlContentHandler.startDocument();
136 xhtmlContentHandler.startElement(XHTMLContentHandler.XHTML, "body", "body", attributes);
137 xhtmlContentHandler.endElement("body");
138 xhtmlContentHandler.endDocument();
139
140 assertTrue(toHTMLContentHandler.toString().contains("itemscope"));
141 }
142
123143 /**
124144 * Return array of non-zerolength words. Splitting on whitespace will get us
125145 * empty words for emptylines.
1818 under the License.
1919 -->
2020
21 <project xmlns="http://maven.apache.org/POM/4.0.0" xmlns:xsi="http://www.w3.org/2001/XMLSchema-instance" xsi:schemaLocation="http://maven.apache.org/POM/4.0.0 http://maven.apache.org/maven-v4_0_0.xsd">
21 <project xmlns="http://maven.apache.org/POM/4.0.0" xmlns:xsi="http://www.w3.org/2001/XMLSchema-instance"
22 xsi:schemaLocation="http://maven.apache.org/POM/4.0.0 http://maven.apache.org/maven-v4_0_0.xsd">
2223 <modelVersion>4.0.0</modelVersion>
2324
2425 <parent>
2526 <groupId>org.apache.tika</groupId>
2627 <artifactId>tika-parent</artifactId>
27 <version>1.6-SNAPSHOT</version>
28 <version>1.8-SNAPSHOT</version>
2829 <relativePath>../tika-parent/pom.xml</relativePath>
2930 </parent>
3031
9192 <configuration>
9293 <target>
9394 <exec executable="${ikvm}/bin/ikvmc.exe">
94 <arg value="-nowarn:0100" />
95 <arg value="-nowarn:0105" />
96 <arg value="-nowarn:0109" />
97 <arg value="-nowarn:0111" />
98 <arg value="-nowarn:0112" />
99 <arg value="-reference:${ikvm}/bin/IKVM.OpenJDK.Charsets.dll" />
100 <arg value="-reference:${ikvm}/bin/IKVM.OpenJDK.Core.dll" />
101 <arg value="-reference:${ikvm}/bin/IKVM.OpenJDK.Text.dll" />
102 <arg value="-reference:${ikvm}/bin/IKVM.OpenJDK.Util.dll" />
103 <arg value="-reference:${ikvm}/bin/IKVM.OpenJDK.XML.API.dll" />
104 <arg value="-reference:${ikvm}/bin/IKVM.OpenJDK.XML.Transform.dll" />
105 <arg value="-target:library" />
106 <arg value="-compressresources" />
107 <arg value="-out:${project.build.directory}/${project.build.finalName}.dll" />
108 <arg value="-recurse:${project.build.directory}\*.class" />
109 <arg value="${project.build.directory}/dependency/tika-app.jar" />
95 <arg value="-nowarn:0100"/>
96 <arg value="-nowarn:0105"/>
97 <arg value="-nowarn:0109"/>
98 <arg value="-nowarn:0111"/>
99 <arg value="-nowarn:0112"/>
100 <arg value="-reference:${ikvm}/bin/IKVM.OpenJDK.Charsets.dll"/>
101 <arg value="-reference:${ikvm}/bin/IKVM.OpenJDK.Core.dll"/>
102 <arg value="-reference:${ikvm}/bin/IKVM.OpenJDK.Text.dll"/>
103 <arg value="-reference:${ikvm}/bin/IKVM.OpenJDK.Util.dll"/>
104 <arg value="-reference:${ikvm}/bin/IKVM.OpenJDK.XML.API.dll"/>
105 <arg value="-reference:${ikvm}/bin/IKVM.OpenJDK.XML.Transform.dll"/>
106 <arg value="-target:library"/>
107 <arg value="-compressresources"/>
108 <arg value="-out:${project.build.directory}/${project.build.finalName}.dll"/>
109 <arg value="-recurse:${project.build.directory}\*.class"/>
110 <arg value="${project.build.directory}/dependency/tika-app.jar"/>
110111 </exec>
111112 </target>
112113 </configuration>
169170
170171 <description>A .NET port of Tika functionality.</description>
171172 <organization>
172 <name>The Apache Software Foundation</name>
173 <url>http://www.apache.org</url>
173 <name>The Apache Software Foundation</name>
174 <url>http://www.apache.org</url>
174175 </organization>
175176 <scm>
176 <url>http://svn.apache.org/viewvc/tika/trunk/tika-dotnet</url>
177 <connection>scm:svn:http://svn.apache.org/repos/asf/tika/trunk/tika-dotnet</connection>
178 <developerConnection>scm:svn:https://svn.apache.org/repos/asf/tika/trunk/tika-dotnet</developerConnection>
177 <url>http://svn.apache.org/viewvc/tika/trunk/tika-dotnet</url>
178 <connection>scm:svn:http://svn.apache.org/repos/asf/tika/trunk/tika-dotnet</connection>
179 <developerConnection>scm:svn:https://svn.apache.org/repos/asf/tika/trunk/tika-dotnet</developerConnection>
179180 </scm>
180181 <issueManagement>
181 <system>JIRA</system>
182 <url>https://issues.apache.org/jira/browse/TIKA</url>
182 <system>JIRA</system>
183 <url>https://issues.apache.org/jira/browse/TIKA</url>
183184 </issueManagement>
184185 <ciManagement>
185 <system>Jenkins</system>
186 <url>https://builds.apache.org/job/Tika-trunk/</url>
186 <system>Jenkins</system>
187 <url>https://builds.apache.org/job/Tika-trunk/</url>
187188 </ciManagement>
188189 </project>
0 <?xml version="1.0" encoding="UTF-8"?>
1 <!--
2 Licensed to the Apache Software Foundation (ASF) under one
3 or more contributor license agreements. See the NOTICE file
4 distributed with this work for additional information
5 regarding copyright ownership. The ASF licenses this file
6 to you under the Apache License, Version 2.0 (the
7 "License"); you may not use this file except in compliance
8 with the License. You may obtain a copy of the License at
9
10 http://www.apache.org/licenses/LICENSE-2.0
11
12 Unless required by applicable law or agreed to in writing,
13 software distributed under the License is distributed on an
14 "AS IS" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY
15 KIND, either express or implied. See the License for the
16 specific language governing permissions and limitations
17 under the License.
18 -->
19 <project xmlns="http://maven.apache.org/POM/4.0.0" xmlns:xsi="http://www.w3.org/2001/XMLSchema-instance" xsi:schemaLocation="http://maven.apache.org/POM/4.0.0 http://maven.apache.org/xsd/maven-4.0.0.xsd">
20 <parent>
21 <artifactId>tika-parent</artifactId>
22 <groupId>org.apache.tika</groupId>
23 <version>1.8</version>
24 <relativePath>../tika-parent/pom.xml</relativePath>
25 </parent>
26 <modelVersion>4.0.0</modelVersion>
27
28 <artifactId>tika-example</artifactId>
29
30 <name>Apache Tika examples</name>
31 <url>http://tika.apache.org/</url>
32
33 <description>This module contains examples of how to use Apache Tika.</description>
34 <organization>
35 <name>The Apache Software Foundation</name>
36 <url>http://www.apache.org</url>
37 </organization>
38
39 <scm>
40 <url>http://svn.apache.org/viewvc/tika/tags/1.8-rc2/tika-example</url>
41 <connection>scm:svn:http://svn.apache.org/repos/asf/tika/tags/1.8-rc2/tika-example</connection>
42 <developerConnection>scm:svn:https://svn.apache.org/repos/asf/tika/tags/1.8-rc2/tika-example</developerConnection>
43 </scm>
44
45 <issueManagement>
46 <system>JIRA</system>
47 <url>https://issues.apache.org/jira/browse/TIKA</url>
48 </issueManagement>
49
50 <ciManagement>
51 <system>Jenkins</system>
52 <url>https://builds.apache.org/job/Tika-trunk/</url>
53 </ciManagement>
54
55 <!-- List of dependencies that we depend on for the examples. See the full list of Tika
56 modules and how to use them at http://mvnrepository.com/artifact/org.apache.tika.-->
57 <dependencies>
58 <dependency>
59 <groupId>org.apache.tika</groupId>
60 <artifactId>tika-parsers</artifactId>
61 <version>${project.version}</version>
62 </dependency>
63 <dependency>
64 <groupId>org.apache.tika</groupId>
65 <artifactId>tika-serialization</artifactId>
66 <version>${project.version}</version>
67 </dependency>
68 <dependency>
69 <groupId>org.apache.tika</groupId>
70 <artifactId>tika-translate</artifactId>
71 <version>${project.version}</version>
72 </dependency>
73 <dependency>
74 <groupId>org.apache.tika</groupId>
75 <artifactId>tika-parsers</artifactId>
76 <version>${project.version}</version>
77 <type>test-jar</type>
78 <scope>test</scope>
79 </dependency>
80 <dependency>
81 <groupId>junit</groupId>
82 <artifactId>junit</artifactId>
83 <scope>test</scope>
84 </dependency>
85 </dependencies>
86 </project>
0 /*
1 * Licensed to the Apache Software Foundation (ASF) under one or more
2 * contributor license agreements. See the NOTICE file distributed with
3 * this work for additional information regarding copyright ownership.
4 * The ASF licenses this file to You under the Apache License, Version 2.0
5 * (the "License"); you may not use this file except in compliance with
6 * the License. You may obtain a copy of the License at
7 *
8 * http://www.apache.org/licenses/LICENSE-2.0
9 *
10 * Unless required by applicable law or agreed to in writing, software
11 * distributed under the License is distributed on an "AS IS" BASIS,
12 * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
13 * See the License for the specific language governing permissions and
14 * limitations under the License.
15 */
16 package org.apache.tika.example;
17
18 import java.io.IOException;
19 import java.io.InputStream;
20 import java.util.ArrayList;
21 import java.util.List;
22
23 import org.apache.tika.exception.TikaException;
24 import org.apache.tika.metadata.Metadata;
25 import org.apache.tika.parser.AutoDetectParser;
26 import org.apache.tika.sax.BodyContentHandler;
27 import org.apache.tika.sax.ContentHandlerDecorator;
28 import org.apache.tika.sax.ToXMLContentHandler;
29 import org.apache.tika.sax.XHTMLContentHandler;
30 import org.apache.tika.sax.xpath.Matcher;
31 import org.apache.tika.sax.xpath.MatchingContentHandler;
32 import org.apache.tika.sax.xpath.XPathParser;
33 import org.xml.sax.ContentHandler;
34 import org.xml.sax.SAXException;
35
36 /**
37 * Examples of using different Content Handlers to
38 * get different parts of the file's contents
39 */
40 public class ContentHandlerExample {
41 /**
42 * Example of extracting the plain text of the contents.
43 * Will return only the "body" part of the document
44 */
45 public String parseToPlainText() throws IOException, SAXException, TikaException {
46 BodyContentHandler handler = new BodyContentHandler();
47
48 InputStream stream = ContentHandlerExample.class.getResourceAsStream("test.doc");
49 AutoDetectParser parser = new AutoDetectParser();
50 Metadata metadata = new Metadata();
51 try {
52 parser.parse(stream, handler, metadata);
53 return handler.toString();
54 } finally {
55 stream.close();
56 }
57 }
58
59 /**
60 * Example of extracting the contents as HTML, as a string.
61 */
62 public String parseToHTML() throws IOException, SAXException, TikaException {
63 ContentHandler handler = new ToXMLContentHandler();
64
65 InputStream stream = ContentHandlerExample.class.getResourceAsStream("test.doc");
66 AutoDetectParser parser = new AutoDetectParser();
67 Metadata metadata = new Metadata();
68 try {
69 parser.parse(stream, handler, metadata);
70 return handler.toString();
71 } finally {
72 stream.close();
73 }
74 }
75
76 /**
77 * Example of extracting just the body as HTML, without the
78 * head part, as a string
79 */
80 public String parseBodyToHTML() throws IOException, SAXException, TikaException {
81 ContentHandler handler = new BodyContentHandler(
82 new ToXMLContentHandler());
83
84 InputStream stream = ContentHandlerExample.class.getResourceAsStream("test.doc");
85 AutoDetectParser parser = new AutoDetectParser();
86 Metadata metadata = new Metadata();
87 try {
88 parser.parse(stream, handler, metadata);
89 return handler.toString();
90 } finally {
91 stream.close();
92 }
93 }
94
95 /**
96 * Example of extracting just one part of the document's body,
97 * as HTML as a string, excluding the rest
98 */
99 public String parseOnePartToHTML() throws IOException, SAXException, TikaException {
100 // Only get things under html -> body -> div (the XPath matches any div directly under body)
101 XPathParser xhtmlParser = new XPathParser("xhtml", XHTMLContentHandler.XHTML);
102 Matcher divContentMatcher = xhtmlParser.parse(
103 "/xhtml:html/xhtml:body/xhtml:div/descendant::node()");
104 ContentHandler handler = new MatchingContentHandler(
105 new ToXMLContentHandler(), divContentMatcher);
106
107 InputStream stream = ContentHandlerExample.class.getResourceAsStream("test2.doc");
108 AutoDetectParser parser = new AutoDetectParser();
109 Metadata metadata = new Metadata();
110 try {
111 parser.parse(stream, handler, metadata);
112 return handler.toString();
113 } finally {
114 stream.close();
115 }
116 }
117
118 protected final int MAXIMUM_TEXT_CHUNK_SIZE = 40;
119 /**
120 * Example of extracting the plain text in chunks, with each chunk
121 * of no more than a certain maximum size
122 */
123 public List<String> parseToPlainTextChunks() throws IOException, SAXException, TikaException {
124 final List<String> chunks = new ArrayList<String>();
125 chunks.add("");
126 ContentHandlerDecorator handler = new ContentHandlerDecorator() {
127 @Override
128 public void characters(char[] ch, int start, int length) {
129 String lastChunk = chunks.get(chunks.size()-1);
130 String thisStr = new String(ch, start, length);
131
132 if (lastChunk.length()+length > MAXIMUM_TEXT_CHUNK_SIZE) {
133 chunks.add(thisStr);
134 } else {
135 chunks.set(chunks.size()-1, lastChunk+thisStr);
136 }
137 }
138 };
139
140 InputStream stream = ContentHandlerExample.class.getResourceAsStream("test2.doc");
141 AutoDetectParser parser = new AutoDetectParser();
142 Metadata metadata = new Metadata();
143 try {
144 parser.parse(stream, handler, metadata);
145 return chunks;
146 } finally {
147 stream.close();
148 }
149 }
150 }
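The chunk-accumulation rule inside `parseToPlainTextChunks` can be exercised on its own, without a parser or test document. The sketch below is a hypothetical standalone helper (`ChunkingSketch.appendChunk` is not part of Tika) that reproduces the same logic: grow the last chunk until appending more text would exceed the maximum, then start a new chunk.

```java
import java.util.ArrayList;
import java.util.List;

public class ChunkingSketch {
    static final int MAX = 40;

    // Same rule as the characters() override above: if the combined length
    // would exceed MAX, start a new chunk; otherwise extend the last one.
    static void appendChunk(List<String> chunks, String text) {
        String last = chunks.get(chunks.size() - 1);
        if (last.length() + text.length() > MAX) {
            chunks.add(text);
        } else {
            chunks.set(chunks.size() - 1, last + text);
        }
    }

    public static void main(String[] args) {
        List<String> chunks = new ArrayList<String>();
        chunks.add("");
        String[] pieces = {"Lorem ipsum dolor sit amet, ",
                "consectetur adipiscing elit, ", "sed do eiusmod tempor."};
        for (String piece : pieces) {
            appendChunk(chunks, piece);
        }
        for (String c : chunks) {
            System.out.println(c.length() + ": " + c);
        }
    }
}
```

Note that, like the original handler, a single SAX `characters()` event longer than the maximum still becomes one oversized chunk; the rule only avoids concatenating across events.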
0 package org.apache.tika.example;
1 /*
2 * Licensed to the Apache Software Foundation (ASF) under one or more
3 * contributor license agreements. See the NOTICE file distributed with
4 * this work for additional information regarding copyright ownership.
5 * The ASF licenses this file to You under the Apache License, Version 2.0
6 * (the "License"); you may not use this file except in compliance with
7 * the License. You may obtain a copy of the License at
8 *
9 * http://www.apache.org/licenses/LICENSE-2.0
10 *
11 * Unless required by applicable law or agreed to in writing, software
12 * distributed under the License is distributed on an "AS IS" BASIS,
13 * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
14 * See the License for the specific language governing permissions and
15 * limitations under the License.
16 */
17
18 import javax.xml.parsers.DocumentBuilder;
19 import javax.xml.parsers.DocumentBuilderFactory;
20 import javax.xml.transform.OutputKeys;
21 import javax.xml.transform.Transformer;
22 import javax.xml.transform.TransformerFactory;
23 import javax.xml.transform.dom.DOMSource;
24 import javax.xml.transform.stream.StreamResult;
25
26 import java.io.File;
27 import java.io.FileOutputStream;
28 import java.io.IOException;
29 import java.io.OutputStreamWriter;
30 import java.io.StringWriter;
31 import java.io.Writer;
32 import java.nio.charset.Charset;
33 import java.util.List;
34 import java.util.Map;
35 import java.util.Set;
36 import java.util.TreeMap;
37 import java.util.TreeSet;
38
39 import org.apache.tika.config.TikaConfig;
40 import org.apache.tika.detect.DefaultDetector;
41 import org.apache.tika.detect.Detector;
42 import org.apache.tika.exception.TikaException;
43 import org.apache.tika.io.IOUtils;
44 import org.apache.tika.language.translate.DefaultTranslator;
45 import org.apache.tika.language.translate.Translator;
46 import org.apache.tika.mime.MediaType;
47 import org.apache.tika.parser.CompositeParser;
48 import org.apache.tika.parser.ParseContext;
49 import org.apache.tika.parser.Parser;
50 import org.w3c.dom.Document;
51 import org.w3c.dom.Element;
52 import org.w3c.dom.Node;
53
54
55 /**
56 * This class shows how to dump a TikaConfig object to a configuration file.
57 * This allows users to easily dump the default TikaConfig as a base from which
58 * to start if they want to modify the default configuration file.
59 * <p>
60 * For those who want to modify the mimes file, take a look at
61 * tika-core/src/main/resources/org/apache/tika/mime/tika-mimetypes.xml
62 * for inspiration. Consider adding org/apache/tika/mime/custom-mimetypes.xml
63 * for your custom mime types.
64 */
65 public class DumpTikaConfigExample {
66
67 /**
68 *
69 * @param config configuration to dump
70 * @param writer writer to which to write
71 * @param encoding character encoding to declare in the XML output
72 * @throws Exception
72 */
73 public void dump(TikaConfig config, Writer writer, String encoding) throws Exception {
74 DocumentBuilderFactory docFactory = DocumentBuilderFactory.newInstance();
75 DocumentBuilder docBuilder = docFactory.newDocumentBuilder();
76 // root elements
77 Document doc = docBuilder.newDocument();
78 Element rootElement = doc.createElement("properties");
79
80 doc.appendChild(rootElement);
81 addMimeComment(rootElement, doc);
82 addTranslator(rootElement, doc, config);
83 addDetectors(rootElement, doc, config);
84 addParsers(rootElement, doc, config);
85
86
87 //now write
88 TransformerFactory transformerFactory = TransformerFactory.newInstance();
89 Transformer transformer = transformerFactory.newTransformer();
90 transformer.setOutputProperty(OutputKeys.INDENT, "yes");
91 transformer.setOutputProperty("{http://xml.apache.org/xslt}indent-amount", "2");
92 transformer.setOutputProperty(OutputKeys.ENCODING, encoding);
93 DOMSource source = new DOMSource(doc);
94 StreamResult result = new StreamResult(writer);
95
96 transformer.transform(source, result);
97 }
98
99 private void addTranslator(Element rootElement, Document doc, TikaConfig config) {
100 //TikaConfig only reads the first translator from the list,
101 //but it looks like it expects a list
102 Translator translator = config.getTranslator();
103 if (translator instanceof DefaultTranslator) {
104 Node mimeComment = doc.createComment(
105 "for example: "+
106 "<translator class=\"org.apache.tika.language.translate.GoogleTranslator\"/>");
107 rootElement.appendChild(mimeComment);
108 } else {
109 Element translatorElement = doc.createElement("translator");
110 translatorElement.setAttribute("class", translator.getClass().getCanonicalName());
111 rootElement.appendChild(translatorElement);
112 }
113 }
114
115 private void addMimeComment(Element rootElement, Document doc) {
116 Node mimeComment = doc.createComment(
117 "for example: <mimeTypeRepository resource=\"/org/apache/tika/mime/tika-mimetypes.xml\"/>");
118 rootElement.appendChild(mimeComment);
119 }
120
121 private void addDetectors(Element rootElement, Document doc, TikaConfig config) throws Exception {
122 Detector detector = config.getDetector();
123 Element detectorsElement = doc.createElement("detectors");
124
125 if (detector instanceof DefaultDetector) {
126 List<Detector> children = ((DefaultDetector)detector).getDetectors();
127 for (Detector d : children) {
128 Element detectorElement = doc.createElement("detector");
129 detectorElement.setAttribute("class", d.getClass().getCanonicalName());
130 detectorsElement.appendChild(detectorElement);
131 }
132 }
133 rootElement.appendChild(detectorsElement);
134 }
135
136 private void addParsers(Element rootElement, Document doc, TikaConfig config) throws Exception {
137 Map<String, Parser> parsers = getConcreteParsers(config.getParser());
138
139 Element parsersElement = doc.createElement("parsers");
140 rootElement.appendChild(parsersElement);
141
142 ParseContext context = new ParseContext();
143 for (Map.Entry<String, Parser> e : parsers.entrySet()) {
144 Element parserElement = doc.createElement("parser");
145 Parser child = e.getValue();
146 String className = e.getKey();
147 parserElement.setAttribute("class", className);
148 Set<MediaType> types = new TreeSet<MediaType>();
149 types.addAll(child.getSupportedTypes(context));
150 for (MediaType type : types){
151 Element mimeElement = doc.createElement("mime");
152 mimeElement.appendChild(doc.createTextNode(type.toString()));
153 parserElement.appendChild(mimeElement);
154 }
155 parsersElement.appendChild(parserElement);
156 }
159 }
160
161 private Map<String, Parser> getConcreteParsers(Parser parentParser)throws TikaException, IOException {
162 Map<String, Parser> parsers = new TreeMap<String, Parser>();
163 if (parentParser instanceof CompositeParser) {
164 addParsers((CompositeParser)parentParser, parsers);
165 } else {
166 addParser(parentParser, parsers);
167 }
168 return parsers;
169 }
170
171 private void addParsers(CompositeParser p, Map<String, Parser> parsers) {
172 for (Parser child : p.getParsers().values()) {
173 if (child instanceof CompositeParser) {
174 addParsers((CompositeParser)child, parsers);
175 } else {
176 addParser(child, parsers);
177 }
178 }
179 }
180
181 private void addParser(Parser p, Map<String, Parser> parsers) {
182 parsers.put(p.getClass().getCanonicalName(), p);
183 }
184
185 /**
186 *
187 * @param args optional outputFile and outputEncoding; if args is empty, this prints to the console
188 * @throws Exception
189 */
190 public static void main(String[] args) throws Exception {
191
192 Charset encoding = IOUtils.UTF_8;
193 if (args.length > 1) {
194 encoding = Charset.forName(args[1]);
195 }
196
197 Writer writer = null;
198 if (args.length > 0) {
199 writer = new OutputStreamWriter(new FileOutputStream(new File(args[0])), encoding);
200 } else {
201 writer = new StringWriter();
202 }
203 DumpTikaConfigExample ex = new DumpTikaConfigExample();
204 ex.dump(TikaConfig.getDefaultConfig(), writer, encoding.name());
205
206 writer.flush();
207
208 if (writer instanceof StringWriter) {
209 System.out.println(writer.toString());
210 }
211 writer.close();
212 }
213 }
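The DOM building and `Transformer` setup in `dump()` uses only the JDK, so the serialization step can be sketched in isolation. The class below is a hypothetical, Tika-free reduction that builds a minimal `<properties>` document and applies the same indent and encoding output properties:

```java
import java.io.StringWriter;
import javax.xml.parsers.DocumentBuilderFactory;
import javax.xml.transform.OutputKeys;
import javax.xml.transform.Transformer;
import javax.xml.transform.TransformerFactory;
import javax.xml.transform.dom.DOMSource;
import javax.xml.transform.stream.StreamResult;
import org.w3c.dom.Document;
import org.w3c.dom.Element;

public class DomDumpSketch {
    // Build <properties><parsers/></properties> and serialize it with the
    // same output properties dump() sets (indentation plus an explicit encoding).
    static String dump() throws Exception {
        Document doc = DocumentBuilderFactory.newInstance()
                .newDocumentBuilder().newDocument();
        Element root = doc.createElement("properties");
        doc.appendChild(root);
        root.appendChild(doc.createElement("parsers"));

        Transformer t = TransformerFactory.newInstance().newTransformer();
        t.setOutputProperty(OutputKeys.INDENT, "yes");
        t.setOutputProperty(OutputKeys.ENCODING, "UTF-8");
        StringWriter out = new StringWriter();
        t.transform(new DOMSource(doc), new StreamResult(out));
        return out.toString();
    }

    public static void main(String[] args) throws Exception {
        System.out.println(dump());
    }
}
```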
0 package org.apache.tika.example;
1 /**
2 * Licensed under the Apache License, Version 2.0 (the "License");
3 * you may not use this file except in compliance with the License.
4 * You may obtain a copy of the License at
5 *
6 * http://www.apache.org/licenses/LICENSE-2.0
7 *
8 * Unless required by applicable law or agreed to in writing, software
9 * distributed under the License is distributed on an "AS IS" BASIS,
10 * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
11 * See the License for the specific language governing permissions and
12 * limitations under the License.
13 */
14 import org.apache.tika.metadata.Metadata;
15 import org.apache.tika.parser.AutoDetectParser;
16 import org.apache.tika.parser.ParseContext;
17 import org.apache.tika.parser.Parser;
18 import org.apache.tika.sax.BodyContentHandler;
19 import org.apache.tika.sax.PhoneExtractingContentHandler;
20
21 import java.io.File;
22 import java.io.FileInputStream;
23 import java.io.InputStream;
24 import java.util.HashSet;
25
26 /**
27 * Class to demonstrate how to use the {@link org.apache.tika.sax.PhoneExtractingContentHandler}
28 * to get a list of all of the phone numbers from every file in a directory.
29 *
30 * You can run this main method by running
31 * <code>
32 * mvn exec:java -Dexec.mainClass="org.apache.tika.example.GrabPhoneNumbersExample" -Dexec.args="/path/to/directory"
33 * </code>
34 * from the tika-example directory.
35 */
36 public class GrabPhoneNumbersExample {
37 private static HashSet<String> phoneNumbers = new HashSet<String>();
38 private static int failedFiles = 0, successfulFiles = 0;
39
40 public static void main(String[] args){
41 if (args.length != 1) {
42 System.err.println("Usage: java GrabPhoneNumbersExample <corpus>");
43 return;
44 }
45 final File folder = new File(args[0]);
46 System.out.println("Searching " + folder.getAbsolutePath() + "...");
47 processFolder(folder);
48 System.out.println(phoneNumbers.toString());
49 System.out.println("Parsed " + successfulFiles + "/" + (successfulFiles + failedFiles));
50 }
51
52 public static void processFolder(final File folder) {
53 for (final File fileEntry : folder.listFiles()) {
54 if (fileEntry.isDirectory()) {
55 processFolder(fileEntry);
56 } else {
57 try {
58 process(fileEntry);
59 successfulFiles++;
60 } catch (Exception e) {
61 failedFiles++;
62 // Ignore this file...
63 }
64 }
65 }
66 }
67
68 public static void process(File file) throws Exception {
69 Parser parser = new AutoDetectParser();
70 Metadata metadata = new Metadata();
71 // The PhoneExtractingContentHandler will examine any characters for phone numbers before passing them
72 // to the underlying Handler.
73 PhoneExtractingContentHandler handler = new PhoneExtractingContentHandler(new BodyContentHandler(), metadata);
74 InputStream stream = new FileInputStream(file);
75 try {
76 parser.parse(stream, handler, metadata, new ParseContext());
77 }
78 finally {
79 stream.close();
80 }
81 String[] numbers = metadata.getValues("phonenumbers");
82 for (String number : numbers) {
83 phoneNumbers.add(number);
84 }
85 }
86 }
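`processFolder()` recurses with `File.listFiles()`; on Java 7+ the same traversal can be written with `java.nio.file`. The sketch below is a hypothetical alternative (not part of the example) that counts regular files under a root with `Files.walkFileTree`, the piece that `processFolder` and the success/failure counters implement by hand:

```java
import java.io.IOException;
import java.nio.file.FileVisitResult;
import java.nio.file.Files;
import java.nio.file.Path;
import java.nio.file.SimpleFileVisitor;
import java.nio.file.attribute.BasicFileAttributes;

public class FolderWalkSketch {
    // Visit every regular file under root, mirroring processFolder()'s
    // recursion without manual listFiles()/isDirectory() handling.
    static int countFiles(Path root) throws IOException {
        final int[] count = {0};
        Files.walkFileTree(root, new SimpleFileVisitor<Path>() {
            @Override
            public FileVisitResult visitFile(Path file, BasicFileAttributes attrs) {
                count[0]++;
                return FileVisitResult.CONTINUE;
            }
        });
        return count[0];
    }

    public static void main(String[] args) throws IOException {
        // Build a tiny tree in a temp directory: one file at the top,
        // one in a subdirectory.
        Path dir = Files.createTempDirectory("walk-demo");
        Files.createFile(dir.resolve("a.txt"));
        Path sub = Files.createDirectory(dir.resolve("sub"));
        Files.createFile(sub.resolve("b.txt"));
        System.out.println(countFiles(dir)); // 2
    }
}
```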
0 /*
1 * Licensed to the Apache Software Foundation (ASF) under one or more
2 * contributor license agreements. See the NOTICE file distributed with
3 * this work for additional information regarding copyright ownership.
4 * The ASF licenses this file to You under the Apache License, Version 2.0
5 * (the "License"); you may not use this file except in compliance with
6 * the License. You may obtain a copy of the License at
7 *
8 * http://www.apache.org/licenses/LICENSE-2.0
9 *
10 * Unless required by applicable law or agreed to in writing, software
11 * distributed under the License is distributed on an "AS IS" BASIS,
12 * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
13 * See the License for the specific language governing permissions and
14 * limitations under the License.
15 */
16
17 package org.apache.tika.example;
18
19 import org.apache.tika.language.LanguageIdentifier;
20
21 public class LanguageIdentifierExample {
22 public String identifyLanguage(String text) {
23 LanguageIdentifier identifier = new LanguageIdentifier(text);
24 return identifier.getLanguage();
25 }
26 }
0 /*
1 * Licensed to the Apache Software Foundation (ASF) under one or more
2 * contributor license agreements. See the NOTICE file distributed with
3 * this work for additional information regarding copyright ownership.
4 * The ASF licenses this file to You under the Apache License, Version 2.0
5 * (the "License"); you may not use this file except in compliance with
6 * the License. You may obtain a copy of the License at
7 *
8 * http://www.apache.org/licenses/LICENSE-2.0
9 *
10 * Unless required by applicable law or agreed to in writing, software
11 * distributed under the License is distributed on an "AS IS" BASIS,
12 * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
13 * See the License for the specific language governing permissions and
14 * limitations under the License.
15 */
16 package org.apache.tika.example;
17
18 import java.io.IOException;
19 import java.io.InputStream;
20 import java.io.StringWriter;
21 import java.util.List;
22
23 import org.apache.tika.Tika;
24 import org.apache.tika.exception.TikaException;
25 import org.apache.tika.metadata.Metadata;
26 import org.apache.tika.metadata.serialization.JsonMetadataList;
27 import org.apache.tika.parser.AutoDetectParser;
28 import org.apache.tika.parser.ParseContext;
29 import org.apache.tika.parser.Parser;
30 import org.apache.tika.parser.RecursiveParserWrapper;
31 import org.apache.tika.sax.BasicContentHandlerFactory;
32 import org.apache.tika.sax.BodyContentHandler;
33 import org.apache.tika.sax.ContentHandlerFactory;
34 import org.xml.sax.SAXException;
35 import org.xml.sax.helpers.DefaultHandler;
36
37 public class ParsingExample {
38
39 /**
40 * Example of how to use Tika's parseToString method to parse the content of a file,
41 * and return any text found.
42 *
43 * Note: Tika.parseToString() will extract content from the outer container
44 * document and any embedded/attached documents.
45 *
46 * @return The content of a file.
47 */
48 public String parseToStringExample() throws IOException, SAXException, TikaException {
49 InputStream stream = ParsingExample.class.getResourceAsStream("test.doc");
50 Tika tika = new Tika();
51 try {
52 return tika.parseToString(stream);
53 } finally {
54 stream.close();
55 }
56 }
57
58 /**
59 * Example of how to use Tika to parse a file when you do not know its file type
60 * ahead of time.
61 *
62 * AutoDetectParser attempts to discover the file's type automatically, then calls
63 * the exact Parser built for that file type.
64 *
65 * The stream to be parsed by the Parser comes, in this case, from a file in the
66 * resources folder of this project.
67 *
68 * Handlers are used to get the exact information you want out of the host of
69 * information gathered by Parsers. The body content handler, intuitively, extracts
70 * everything that would go between HTML body tags.
71 *
72 * The Metadata object will be filled by the Parser with Metadata discovered about
73 * the file being parsed.
74 *
75 * Note: This example will extract content from the outer document and all
76 * embedded documents. However, if you choose to use a {@link ParseContext},
77 * make sure to set a {@link Parser} or else embedded content will not be
78 * parsed.
79 *
80 * @return The content of a file.
81 */
82 public String parseExample() throws IOException, SAXException, TikaException {
83 InputStream stream = ParsingExample.class.getResourceAsStream("test.doc");
84 AutoDetectParser parser = new AutoDetectParser();
85 BodyContentHandler handler = new BodyContentHandler();
86 Metadata metadata = new Metadata();
87 try {
88 parser.parse(stream, handler, metadata);
89 return handler.toString();
90 } finally {
91 stream.close();
92 }
93 }
94
95 /**
96 * If you don't want content from embedded documents, send in
97 * a {@link org.apache.tika.parser.ParseContext} that does not contain a
98 * {@link Parser}.
99 *
100 * @return The content of a file.
101 */
102 public String parseNoEmbeddedExample() throws IOException, SAXException, TikaException {
103 InputStream stream = ParsingExample.class.getResourceAsStream("test_recursive_embedded.docx");
104 AutoDetectParser parser = new AutoDetectParser();
105 BodyContentHandler handler = new BodyContentHandler();
106 Metadata metadata = new Metadata();
107 try {
108 parser.parse(stream, handler, metadata, new ParseContext());
109 return handler.toString();
110 } finally {
111 stream.close();
112 }
113 }
114
115
116 /**
117 * This example shows how to extract content from the outer document and all
118 * embedded documents. The key is to specify a {@link Parser} in the {@link ParseContext}.
119 *
120 * @return content, including from embedded documents
121 * @throws IOException
122 * @throws SAXException
123 * @throws TikaException
124 */
125 public String parseEmbeddedExample() throws IOException, SAXException, TikaException {
126 InputStream stream = ParsingExample.class.getResourceAsStream("test_recursive_embedded.docx");
127 AutoDetectParser parser = new AutoDetectParser();
128 BodyContentHandler handler = new BodyContentHandler();
129 Metadata metadata = new Metadata();
130 ParseContext context = new ParseContext();
131 context.set(Parser.class, parser);
132 try {
133 parser.parse(stream, handler, metadata, context);
134 return handler.toString();
135 } finally {
136 stream.close();
137 }
138
139 }
140
141 /**
142 * For documents that may contain embedded documents, it might be helpful
143 * to create list of metadata objects, one for the container document and
144 * one for each embedded document. This allows easy access to both the
145 * extracted content and the metadata of each embedded document.
146 * Note that many document formats can contain embedded documents,
147 * including traditional container formats -- zip, tar and others -- but also
148 * common office document formats including: MSWord, MSExcel,
149 * MSPowerPoint, RTF, PDF, MSG and several others.
150 * <p>
151 * The "content" format is determined by the ContentHandlerFactory, and
152 * the content is stored in {@link org.apache.tika.parser.RecursiveParserWrapper#TIKA_CONTENT}
153 * <p>
154 * The drawback to the RecursiveParserWrapper is that it caches metadata and contents
155 * in memory. This should not be used on files whose contents are too big to be handled
156 * in memory.
157 *
158 * @return a list of metadata object, one each for the container file and each embedded file
159 * @throws IOException
160 * @throws SAXException
161 * @throws TikaException
162 */
163 public List<Metadata> recursiveParserWrapperExample() throws IOException,
164 SAXException, TikaException {
165
166 Parser p = new AutoDetectParser();
167 ContentHandlerFactory factory = new BasicContentHandlerFactory(
168 BasicContentHandlerFactory.HANDLER_TYPE.HTML, -1);
169
170 RecursiveParserWrapper wrapper = new RecursiveParserWrapper(p, factory);
171 InputStream stream = ParsingExample.class.getResourceAsStream("test_recursive_embedded.docx");
172 Metadata metadata = new Metadata();
173 metadata.set(Metadata.RESOURCE_NAME_KEY, "test_recursive_embedded.docx");
174 ParseContext context = new ParseContext();
175
176 try {
177 wrapper.parse(stream, new DefaultHandler(), metadata, context);
178 } finally {
179 stream.close();
180 }
181 return wrapper.getMetadata();
182 }
183
184 /**
185 * We include a simple JSON serializer for a list of metadata with
186 * {@link org.apache.tika.metadata.serialization.JsonMetadataList}.
187 * That class also includes a deserializer to convert from JSON
188 * back to a {@code List<Metadata>}.
189 * <p>
190 * This functionality is also available in tika-app's GUI, and
191 * with the -J option on tika-app's commandline. For tika-server
192 * users, there is the "rmeta" service that will return this format.
193 *
194 * @return a JSON representation of a list of Metadata objects
195 * @throws IOException
196 * @throws SAXException
197 * @throws TikaException
198 */
199 public String serializedRecursiveParserWrapperExample() throws IOException,
200 SAXException, TikaException {
201 List<Metadata> metadataList = recursiveParserWrapperExample();
202 StringWriter writer = new StringWriter();
203 JsonMetadataList.toJson(metadataList, writer);
204 return writer.toString();
205 }
206 }
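Each method above releases its `InputStream` in a `finally` block. On Java 7+ the same guarantee can be written with try-with-resources; the sketch below is a hypothetical, Tika-free illustration of the idiom using a plain in-memory stream:

```java
import java.io.ByteArrayInputStream;
import java.io.IOException;
import java.io.InputStream;

public class TryWithResourcesSketch {
    // Reads a stream to a String; the stream is closed automatically when
    // the try block exits, replacing the explicit finally { stream.close(); }.
    static String readAll(InputStream source) throws IOException {
        try (InputStream stream = source) {
            StringBuilder sb = new StringBuilder();
            int b;
            while ((b = stream.read()) != -1) {
                sb.append((char) b);
            }
            return sb.toString();
        }
    }

    public static void main(String[] args) throws IOException {
        byte[] data = "test".getBytes("UTF-8");
        System.out.println(readAll(new ByteArrayInputStream(data))); // prints "test"
    }
}
```

try-with-resources also closes the stream when `parse()` throws, which the finally blocks above achieve only because `close()` sits alone in the finally clause.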
0 /*
1 * Licensed to the Apache Software Foundation (ASF) under one or more
2 * contributor license agreements. See the NOTICE file distributed with
3 * this work for additional information regarding copyright ownership.
4 * The ASF licenses this file to You under the Apache License, Version 2.0
5 * (the "License"); you may not use this file except in compliance with
6 * the License. You may obtain a copy of the License at
7 *
8 * http://www.apache.org/licenses/LICENSE-2.0
9 *
10 * Unless required by applicable law or agreed to in writing, software
11 * distributed under the License is distributed on an "AS IS" BASIS,
12 * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
13 * See the License for the specific language governing permissions and
14 * limitations under the License.
15 */
16
17 package org.apache.tika.example;
18
19 import org.apache.tika.language.translate.MicrosoftTranslator;
20
21 public class TranslatorExample {
22 public String microsoftTranslateToFrench(String text) {
23 MicrosoftTranslator translator = new MicrosoftTranslator();
24 // Change the id and secret! See http://msdn.microsoft.com/en-us/library/hh454950.aspx.
25 translator.setId("dummy-id");
26 translator.setSecret("dummy-secret");
27 try {
28 return translator.translate(text, "fr");
29 } catch (Exception e) {
30 return "Error while translating.";
31 }
32 }
33 }
0 /*
1 * Licensed to the Apache Software Foundation (ASF) under one or more
2 * contributor license agreements. See the NOTICE file distributed with
3 * this work for additional information regarding copyright ownership.
4 * The ASF licenses this file to You under the Apache License, Version 2.0
5 * (the "License"); you may not use this file except in compliance with
6 * the License. You may obtain a copy of the License at
7 *
8 * http://www.apache.org/licenses/LICENSE-2.0
9 *
10 * Unless required by applicable law or agreed to in writing, software
11 * distributed under the License is distributed on an "AS IS" BASIS,
12 * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
13 * See the License for the specific language governing permissions and
14 * limitations under the License.
15 */
16
17 package org.apache.tika.example;
18
19 import org.apache.tika.exception.TikaException;
20 import org.junit.Before;
21 import org.junit.Test;
22 import org.xml.sax.SAXException;
23
24 import java.io.IOException;
25 import java.util.List;
26
27 import static org.apache.tika.TikaTest.assertContains;
28 import static org.apache.tika.TikaTest.assertNotContained;
29 import static org.junit.Assert.assertEquals;
30 import static org.junit.Assert.assertTrue;
31
32 public class ContentHandlerExampleTest {
33 ContentHandlerExample example;
34
35 @Before
36 public void setUp() {
37 example = new ContentHandlerExample();
38 }
39
40 @Test
41 public void testParseToPlainText() throws IOException, SAXException, TikaException {
42 String result = example.parseToPlainText().trim();
43 assertEquals("Expected 'test', but got '" + result + "'", "test", result);
44 }
45
46 @Test
47 public void testParseToHTML() throws IOException, SAXException, TikaException {
48 String result = example.parseToHTML().trim();
49
50 assertContains("<html", result);
51 assertContains("<head>", result);
52 assertContains("<meta name=\"dc:creator\"", result);
53 assertContains("<title>", result);
54 assertContains("<body>", result);
55 assertContains(">test", result);
56 }
57
58 @Test
59 public void testParseBodyToHTML() throws IOException, SAXException, TikaException {
60 String result = example.parseBodyToHTML().trim();
61
62 assertNotContained("<html", result);
63 assertNotContained("<head>", result);
64 assertNotContained("<meta name=\"dc:creator\"", result);
65 assertNotContained("<title>", result);
66 assertNotContained("<body>", result);
67 assertContains(">test", result);
68 }
69
70 @Test
71 public void testParseOnePartToHTML() throws IOException, SAXException, TikaException {
72 String result = example.parseOnePartToHTML().trim();
73
74 assertNotContained("<html", result);
75 assertNotContained("<head>", result);
76 assertNotContained("<meta name=\"dc:creator\"", result);
77 assertNotContained("<title>", result);
78 assertNotContained("<body>", result);
79 assertContains("<p class=\"header\"", result);
80 assertContains("This is in the header", result);
81 assertNotContained("<h1>Test Document", result);
82 assertNotContained("<p>1 2 3", result);
83 }
84
85
86 @Test
87 public void testParseToPlainTextChunks() throws IOException, SAXException, TikaException {
88 List<String> result = example.parseToPlainTextChunks();
89
90 assertEquals(3, result.size());
91 for (String chunk : result) {
92 assertTrue("Chunk under max size", chunk.length() <= example.MAXIMUM_TEXT_CHUNK_SIZE);
93 }
94
95 assertContains("This is in the header", result.get(0));
96 assertContains("Test Document", result.get(0));
97
98 assertContains("Testing", result.get(1));
99 assertContains("1 2 3", result.get(1));
100 assertContains("TestTable", result.get(1));
101
102 assertContains("Testing 123", result.get(2));
103 }
104 }
0 package org.apache.tika.example;
1
2 /*
3 * Licensed to the Apache Software Foundation (ASF) under one or more
4 * contributor license agreements. See the NOTICE file distributed with
5 * this work for additional information regarding copyright ownership.
6 * The ASF licenses this file to You under the Apache License, Version 2.0
7 * (the "License"); you may not use this file except in compliance with
8 * the License. You may obtain a copy of the License at
9 *
10 * http://www.apache.org/licenses/LICENSE-2.0
11 *
12 * Unless required by applicable law or agreed to in writing, software
13 * distributed under the License is distributed on an "AS IS" BASIS,
14 * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
15 * See the License for the specific language governing permissions and
16 * limitations under the License.
17 */
18
19
20 import org.apache.tika.config.TikaConfig;
21 import org.apache.tika.detect.CompositeDetector;
22 import org.apache.tika.parser.AutoDetectParser;
23 import org.apache.tika.parser.CompositeParser;
24 import org.apache.tika.parser.Parser;
25 import org.junit.After;
26 import org.junit.Before;
27 import org.junit.Test;
28
29 import java.io.File;
30 import java.io.FileOutputStream;
31 import java.io.IOException;
32 import java.io.OutputStreamWriter;
33 import java.io.Writer;
34
35 import static junit.framework.TestCase.assertEquals;
36 import static junit.framework.TestCase.assertTrue;
37
38 public class DumpTikaConfigExampleTest {
39 private File configFile;
40 @Before
41 public void setUp() {
42 try {
43 configFile = File.createTempFile("tmp", ".xml");
44 } catch (IOException e) {
45 throw new RuntimeException("Failed to create tmp file", e);
46 }
47 }
48
49 @After
50 public void tearDown() {
51 if (configFile != null && configFile.exists()) {
52 configFile.delete();
53 }
54 if (configFile != null && configFile.exists()) {
55 throw new RuntimeException("Failed to clean up: "+configFile.getAbsolutePath());
56 }
57 }
58
59 @Test
60 public void testDump() throws Exception {
61 DumpTikaConfigExample ex = new DumpTikaConfigExample();
62 for (String encoding : new String[]{ "UTF-8", "UTF-16LE"}) {
63 Writer writer = new OutputStreamWriter(new FileOutputStream(configFile), encoding);
64 ex.dump(TikaConfig.getDefaultConfig(), writer, encoding);
65 writer.flush();
66 writer.close();
67
68 TikaConfig c = new TikaConfig(configFile);
69 assertEquals(CompositeParser.class, c.getParser().getClass());
70 assertEquals(CompositeDetector.class, c.getDetector().getClass());
71
72 CompositeParser p = (CompositeParser) c.getParser();
73 assertTrue("enough parsers?", p.getParsers().size() > 130);
74
75 CompositeDetector d = (CompositeDetector) c.getDetector();
76 assertTrue("enough detectors?", d.getDetectors().size() > 3);
77 //just try to load it into autodetect to make sure no errors are thrown
78 Parser auto = new AutoDetectParser(c);
79 }
80 }
81
82 }
0 /*
1 * Licensed to the Apache Software Foundation (ASF) under one or more
2 * contributor license agreements. See the NOTICE file distributed with
3 * this work for additional information regarding copyright ownership.
4 * The ASF licenses this file to You under the Apache License, Version 2.0
5 * (the "License"); you may not use this file except in compliance with
6 * the License. You may obtain a copy of the License at
7 *
8 * http://www.apache.org/licenses/LICENSE-2.0
9 *
10 * Unless required by applicable law or agreed to in writing, software
11 * distributed under the License is distributed on an "AS IS" BASIS,
12 * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
13 * See the License for the specific language governing permissions and
14 * limitations under the License.
15 */
16
17 package org.apache.tika.example;
18
19 import org.junit.Before;
20 import org.junit.Test;
21
22 import static org.junit.Assert.assertEquals;
23
24 public class LanguageIdentifierExampleTest {
25 LanguageIdentifierExample languageIdentifierExample;
26 @Before
27 public void setUp() {
28 languageIdentifierExample = new LanguageIdentifierExample();
29 }
30
31 @Test
32 public void testIdentifyLanguage() {
33 String text = "This is some text that should be identified as English.";
34 assertEquals("en", languageIdentifierExample.identifyLanguage(text));
35 }
36 }
0 /*
1 * Licensed to the Apache Software Foundation (ASF) under one or more
2 * contributor license agreements. See the NOTICE file distributed with
3 * this work for additional information regarding copyright ownership.
4 * The ASF licenses this file to You under the Apache License, Version 2.0
5 * (the "License"); you may not use this file except in compliance with
6 * the License. You may obtain a copy of the License at
7 *
8 * http://www.apache.org/licenses/LICENSE-2.0
9 *
10 * Unless required by applicable law or agreed to in writing, software
11 * distributed under the License is distributed on an "AS IS" BASIS,
12 * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
13 * See the License for the specific language governing permissions and
14 * limitations under the License.
15 */
16
17 package org.apache.tika.example;
18
19 import static junit.framework.TestCase.assertFalse;
20 import static org.junit.Assert.assertEquals;
21 import static org.junit.Assert.assertTrue;
22
23 import java.io.IOException;
24 import java.io.StringReader;
25 import java.util.List;
26
27 import org.apache.tika.exception.TikaException;
28 import org.apache.tika.metadata.Metadata;
29 import org.apache.tika.metadata.serialization.JsonMetadataList;
30 import org.junit.Before;
31 import org.junit.Test;
32 import org.xml.sax.SAXException;
33
34 public class TestParsingExample {
35 ParsingExample parsingExample;
36 @Before
37 public void setUp() {
38 parsingExample = new ParsingExample();
39 }
40
41 @Test
42 public void testParseToStringExample() throws IOException, SAXException, TikaException {
43 String result = parsingExample.parseToStringExample().trim();
44 assertEquals("Expected 'test', but got '" + result + "'", "test", result);
45 }
46
47 @Test
48 public void testParseExample() throws IOException, SAXException, TikaException {
49 String result = parsingExample.parseExample().trim();
50 assertEquals("Expected 'test', but got '" + result + "'", "test", result);
51 }
52
53 @Test
54 public void testNoEmbeddedExample() throws IOException, SAXException, TikaException {
55 String result = parsingExample.parseNoEmbeddedExample();
56 assertContains("embed_0", result);
57 assertNotContains("embed1/embed1a.txt", result);
58 assertNotContains("embed3/embed3.txt", result);
59 assertNotContains("When in the Course", result);
60 }
61
62
63 @Test
64 public void testRecursiveParseExample() throws IOException, SAXException, TikaException {
65 String result = parsingExample.parseEmbeddedExample();
66 assertContains("embed_0", result);
67 assertContains("embed1/embed1a.txt", result);
68 assertContains("embed3/embed3.txt", result);
69 assertContains("When in the Course", result);
70 }
71
72 @Test
73 public void testRecursiveParserWrapperExample() throws IOException, SAXException, TikaException {
74 List<Metadata> metadataList = parsingExample.recursiveParserWrapperExample();
75 assertEquals("Number of embedded documents + 1 for the container document", 12, metadataList.size());
76 Metadata m = metadataList.get(6);
77 //this is the location of the embed3.txt text file within the outer .docx
78 assertEquals("test_recursive_embedded.docx/embed1.zip/embed2.zip/embed3.zip/embed3.txt",
79 m.get("X-TIKA:embedded_resource_path"));
80 //it contains some HTML-encoded content
81 assertContains("When in the Course", m.get("X-TIKA:content"));
82 }
83
84 @Test
85 public void testSerializedRecursiveParserWrapperExample() throws IOException, SAXException, TikaException {
86 String json = parsingExample.serializedRecursiveParserWrapperExample();
87 assertTrue(json.contains("When in the Course"));
88 //now try deserializing the JSON
89 List<Metadata> metadataList = JsonMetadataList.fromJson(new StringReader(json));
90 assertEquals(12, metadataList.size());
91 }
92
93 public static void assertContains(String needle, String haystack) {
94 assertTrue("Should have found " + needle + " in: " + haystack, haystack.contains(needle));
95 }
96
97 public static void assertNotContains(String needle, String haystack) {
98 assertFalse("Should not have found " + needle + " in: " + haystack, haystack.contains(needle));
99 }
100
101 }
0 /*
1 * Licensed to the Apache Software Foundation (ASF) under one or more
2 * contributor license agreements. See the NOTICE file distributed with
3 * this work for additional information regarding copyright ownership.
4 * The ASF licenses this file to You under the Apache License, Version 2.0
5 * (the "License"); you may not use this file except in compliance with
6 * the License. You may obtain a copy of the License at
7 *
8 * http://www.apache.org/licenses/LICENSE-2.0
9 *
10 * Unless required by applicable law or agreed to in writing, software
11 * distributed under the License is distributed on an "AS IS" BASIS,
12 * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
13 * See the License for the specific language governing permissions and
14 * limitations under the License.
15 */
16
17 package org.apache.tika.example;
18
19 import org.junit.Before;
20 import org.junit.Test;
21
22 import java.util.Locale;
23
24 import static org.junit.Assert.assertEquals;
25
26 public class TranslatorExampleTest {
27 TranslatorExample translatorExample;
28
29 @Before
30 public void setUp() {
31 translatorExample = new TranslatorExample();
32 }
33
34 @Test
35 public void testMicrosoftTranslateToFrench() {
36 String text = "hello";
37 String expected = "salut";
38 String translated = translatorExample.microsoftTranslateToFrench(text);
39 // The user may not have set the id and secret, so we check whether we
40 // just got the original text back.
41 if (!translated.equals(text)) assertEquals(expected, translated.toLowerCase(Locale.ROOT));
42 }
43 }
2222 <modelVersion>4.0.0</modelVersion>
2323
2424 <parent>
25
2526 <groupId>org.apache.tika</groupId>
2627 <artifactId>tika-parent</artifactId>
27 <version>1.6</version>
28 <version>1.8</version>
2829 <relativePath>../tika-parent/pom.xml</relativePath>
2930 </parent>
3031
3334
3435 <name>Apache Tika Java-7 Components</name>
3536 <description>Java-7 reliant components, including FileTypeDetector implementations</description>
37
38 <properties>
39 <maven.compiler.source>1.7</maven.compiler.source>
40 <maven.compiler.target>1.7</maven.compiler.target>
41 </properties>
3642
3743 <build>
3844 <plugins>
5359 </Export-Package>
5460 <Private-Package />
5561 </instructions>
56 </configuration>
57 </plugin>
58 <plugin>
59 <groupId>org.apache.maven.plugins</groupId>
60 <artifactId>maven-compiler-plugin</artifactId>
61 <version>3.1</version>
62 <configuration>
63 <source>1.7</source>
64 <target>1.7</target>
6562 </configuration>
6663 </plugin>
6764 </plugins>
8683 <dependency>
8784 <groupId>junit</groupId>
8885 <artifactId>junit</artifactId>
89 <scope>test</scope>
90 <version>4.11</version>
9186 </dependency>
9287 </dependencies>
9388
9489 <url>http://tika.apache.org/</url>
9590 <organization>
96 <name>The Apache Software Foundation</name>
97 <url>http://www.apache.org</url>
91 <name>The Apache Software Foundation</name>
92 <url>http://www.apache.org</url>
9893 </organization>
9994 <scm>
100 <url>http://svn.apache.org/viewvc/tika/tags/1.6/tika-java7</url>
101 <connection>scm:svn:http://svn.apache.org/repos/asf/tika/tags/1.6/tika-java7</connection>
102 <developerConnection>scm:svn:https://svn.apache.org/repos/asf/tika/tags/1.6/tika-java7</developerConnection>
95 <url>http://svn.apache.org/viewvc/tika/tags/1.8-rc2/tika-java7</url>
96 <connection>scm:svn:http://svn.apache.org/repos/asf/tika/tags/1.8-rc2/tika-java7</connection>
97 <developerConnection>scm:svn:https://svn.apache.org/repos/asf/tika/tags/1.8-rc2/tika-java7</developerConnection>
10398 </scm>
10499 <issueManagement>
105 <system>JIRA</system>
106 <url>https://issues.apache.org/jira/browse/TIKA</url>
100 <system>JIRA</system>
101 <url>https://issues.apache.org/jira/browse/TIKA</url>
107102 </issueManagement>
108103 <ciManagement>
109 <system>Jenkins</system>
110 <url>https://builds.apache.org/job/Tika-trunk/</url>
104 <system>Jenkins</system>
105 <url>https://builds.apache.org/job/Tika-trunk/</url>
111106 </ciManagement>
112107 </project>
3030
3131 <groupId>org.apache.tika</groupId>
3232 <artifactId>tika-parent</artifactId>
33 <version>1.6</version>
33 <version>1.8</version>
3434 <packaging>pom</packaging>
3535
3636 <name>Apache Tika parent</name>
186186 </roles>
187187 <timezone>+3</timezone>
188188 </developer>
189 <developer>
189 <developer>
190190 <name>Oleg Tikhonov</name>
191191 <id>oleg</id>
192192 <roles>
204204 <role>committer</role>
205205 </roles>
206206 </developer>
207 <developer>
208 <name>Tyler Palsulich</name>
209 <id>tpalsulich</id>
210 <timezone>-8</timezone>
211 <roles>
212 <role>committer</role>
213 </roles>
214 </developer>
215 <developer>
216 <name>Tim Allison</name>
217 <id>tallison</id>
218 <timezone>-5</timezone>
219 <roles>
220 <role>committer</role>
221 </roles>
222 </developer>
223 <developer>
224 <name>Konstantin Gribov</name>
225 <id>grossws</id>
226 <timezone>+3</timezone>
227 <roles>
228 <role>committer</role>
229 </roles>
230 </developer>
207231 </developers>
208232 <contributors>
209233 <contributor>
241265 <dependency>
242266 <groupId>junit</groupId>
243267 <artifactId>junit</artifactId>
244 <version>4.10</version>
268 <version>4.11</version>
245269 <scope>test</scope>
270 </dependency>
271 <dependency>
272 <groupId>org.slf4j</groupId>
273 <artifactId>slf4j-api</artifactId>
274 <version>${slf4j.version}</version>
275 </dependency>
276 <dependency>
277 <groupId>org.slf4j</groupId>
278 <artifactId>slf4j-log4j12</artifactId>
279 <version>${slf4j.version}</version>
280 </dependency>
281 <dependency>
282 <groupId>org.slf4j</groupId>
283 <artifactId>slf4j-simple</artifactId>
284 <version>${slf4j.version}</version>
285 </dependency>
286 <dependency>
287 <groupId>org.slf4j</groupId>
288 <artifactId>jul-to-slf4j</artifactId>
289 <version>${slf4j.version}</version>
290 </dependency>
291 <dependency>
292 <groupId>org.slf4j</groupId>
293 <artifactId>jcl-over-slf4j</artifactId>
294 <version>${slf4j.version}</version>
246295 </dependency>
247296 </dependencies>
248297 </dependencyManagement>
249298
250299 <properties>
251 <maven.compile.source>1.6</maven.compile.source>
252 <maven.compile.target>1.6</maven.compile.target>
300 <maven.compiler.source>1.6</maven.compiler.source>
301 <maven.compiler.target>1.6</maven.compiler.target>
253302 <project.reporting.outputEncoding>${project.build.sourceEncoding}</project.reporting.outputEncoding>
303 <slf4j.version>1.7.12</slf4j.version>
254304 </properties>
255305
256306 <build>
257 <plugins>
258 <plugin>
259 <artifactId>maven-compiler-plugin</artifactId>
260 <configuration>
261 <source>${maven.compile.source}</source>
262 <target>${maven.compile.target}</target>
263 </configuration>
264 </plugin>
265 </plugins>
266 <pluginManagement>
267307 <plugins>
308 <plugin>
309 <artifactId>maven-compiler-plugin</artifactId>
310 <version>3.2</version>
311 <configuration>
312 <source>${maven.compiler.source}</source>
313 <target>${maven.compiler.target}</target>
314 </configuration>
315 </plugin>
316 <plugin>
317 <groupId>de.thetaphi</groupId>
318 <artifactId>forbiddenapis</artifactId>
319 <version>1.7</version>
320 <configuration>
321 <targetVersion>${maven.compiler.target}</targetVersion>
322 <internalRuntimeForbidden>true</internalRuntimeForbidden>
323 <failOnUnsupportedJava>false</failOnUnsupportedJava>
324 <bundledSignatures>
325 <bundledSignature>jdk-unsafe</bundledSignature>
326 <bundledSignature>jdk-deprecated</bundledSignature>
327 </bundledSignatures>
328 </configuration>
329 <executions>
330 <execution>
331 <goals>
332 <goal>check</goal>
333 <goal>testCheck</goal>
334 </goals>
335 </execution>
336 </executions>
337 </plugin>
268338 <plugin>
269339 <groupId>org.apache.felix</groupId>
270340 <artifactId>maven-bundle-plugin</artifactId>
273343 <plugin>
274344 <groupId>org.apache.maven.plugins</groupId>
275345 <artifactId>maven-surefire-plugin</artifactId>
276 <version>2.12</version>
346 <version>2.18.1</version>
347 <configuration>
348 <argLine>-Xmx2048m</argLine>
349 </configuration>
277350 </plugin>
278351 <plugin>
279352 <groupId>org.apache.maven.plugins</groupId>
280353 <artifactId>maven-shade-plugin</artifactId>
281 <version>1.6</version>
354 <version>2.3</version>
355 </plugin>
356 </plugins>
357
358 <pluginManagement>
359 <plugins>
360 <!--This plugin's configuration is used to store Eclipse m2e settings only. It has no influence on the Maven build itself.-->
361 <plugin>
362 <groupId>org.eclipse.m2e</groupId>
363 <artifactId>lifecycle-mapping</artifactId>
364 <version>1.0.0</version>
365 <configuration>
366 <lifecycleMappingMetadata>
367 <pluginExecutions>
368 <pluginExecution>
369 <pluginExecutionFilter>
370 <groupId>de.thetaphi</groupId>
371 <artifactId>forbiddenapis</artifactId>
372 <versionRange>[1.0,)</versionRange>
373 <goals>
374 <goal>check</goal>
375 <goal>testCheck</goal>
376 </goals>
377 </pluginExecutionFilter>
378 <action>
379 <ignore />
380 </action>
381 </pluginExecution>
382 </pluginExecutions>
383 </lifecycleMappingMetadata>
384 </configuration>
282385 </plugin>
283386 </plugins>
284387 </pluginManagement>
323426 </profiles>
324427
325428 <scm>
326 <connection>scm:svn:http://svn.apache.org/repos/asf/maven/pom/tags/1.6/tika-parent</connection>
327 <developerConnection>scm:svn:https://svn.apache.org/repos/asf/maven/pom/tags/1.6/tika-parent</developerConnection>
328 <url>http://svn.apache.org/viewvc/maven/pom/tags/1.6/tika-parent</url>
429 <connection>scm:svn:http://svn.apache.org/repos/asf/maven/pom/tags/1.8-rc2/tika-parent</connection>
430 <developerConnection>scm:svn:https://svn.apache.org/repos/asf/maven/pom/tags/1.8-rc2/tika-parent</developerConnection>
431 <url>http://svn.apache.org/viewvc/maven/pom/tags/1.8-rc2/tika-parent</url>
329432 </scm>
330433 </project>
2424 <parent>
2525 <groupId>org.apache.tika</groupId>
2626 <artifactId>tika-parent</artifactId>
27 <version>1.6</version>
27 <version>1.8</version>
2828 <relativePath>../tika-parent/pom.xml</relativePath>
2929 </parent>
3030
3434 <url>http://tika.apache.org/</url>
3535
3636 <properties>
37 <poi.version>3.10-FINAL</poi.version>
38 <codec.version>1.5</codec.version> <!-- NOTE: sync with POI -->
37 <poi.version>3.12-beta1</poi.version>
38 <codec.version>1.9</codec.version>
39 <!-- NOTE: sync with POI -->
40 <compress.version>1.9</compress.version>
41 <tukaani.version>1.5</tukaani.version>
42 <!-- NOTE: sync with commons-compress -->
3943 <mime4j.version>0.7.2</mime4j.version>
4044 <vorbis.version>0.6</vorbis.version>
41 <pdfbox.version>1.8.6</pdfbox.version>
45 <pdfbox.version>1.8.9</pdfbox.version>
46 <netcdf-java.version>4.5.5</netcdf-java.version>
4247 </properties>
4348
4449 <dependencies>
5762 <version>${project.version}</version>
5863 </dependency>
5964
65 <dependency>
66 <groupId>${project.groupId}</groupId>
67 <artifactId>tika-core</artifactId>
68 <version>${project.version}</version>
69 <type>test-jar</type>
70 <scope>test</scope>
71 </dependency>
72
6073 <!-- Externally Maintained Parsers -->
6174 <dependency>
6275 <groupId>org.gagravarr</groupId>
7386
7487 <!-- Upstream parser libraries -->
7588 <dependency>
76 <groupId>edu.ucar</groupId>
77 <artifactId>netcdf</artifactId>
78 <version>4.2.20</version>
79 </dependency>
80 <dependency>
8189 <groupId>net.sourceforge.jmatio</groupId>
8290 <artifactId>jmatio</artifactId>
8391 <version>1.0</version>
84 </dependency>
92 </dependency>
8593 <dependency>
8694 <groupId>org.apache.james</groupId>
8795 <artifactId>apache-mime4j-core</artifactId>
95103 <dependency>
96104 <groupId>org.apache.commons</groupId>
97105 <artifactId>commons-compress</artifactId>
98 <version>1.8</version>
99 </dependency>
106 <version>${compress.version}</version>
107 </dependency>
108 <dependency>
109 <groupId>org.tukaani</groupId>
110 <artifactId>xz</artifactId>
111 <version>${tukaani.version}</version>
112 </dependency>
113
100114 <dependency>
101115 <groupId>commons-codec</groupId>
102116 <artifactId>commons-codec</artifactId>
112126 problems with encrypted PDFs. -->
113127 <dependency>
114128 <groupId>org.bouncycastle</groupId>
115 <artifactId>bcmail-jdk15</artifactId>
116 <version>1.45</version>
129 <artifactId>bcmail-jdk15on</artifactId>
130 <version>1.52</version>
117131 </dependency>
118132 <dependency>
119133 <groupId>org.bouncycastle</groupId>
120 <artifactId>bcprov-jdk15</artifactId>
121 <version>1.45</version>
134 <artifactId>bcprov-jdk15on</artifactId>
135 <version>1.52</version>
122136 </dependency>
123137 <dependency>
124138 <groupId>org.apache.poi</groupId>
146160 </exclusions>
147161 </dependency>
148162 <dependency>
149 <groupId>org.apache.geronimo.specs</groupId>
150 <artifactId>geronimo-stax-api_1.0_spec</artifactId>
151 <version>1.0.1</version>
152 </dependency>
153 <dependency>
154163 <groupId>org.ccil.cowan.tagsoup</groupId>
155164 <artifactId>tagsoup</artifactId>
156165 <version>1.2.1</version>
166175 <version>1.0.2</version>
167176 </dependency>
168177 <dependency>
169 <groupId>com.drewnoakes</groupId>
170 <artifactId>metadata-extractor</artifactId>
171 <version>2.6.2</version>
178 <groupId>com.drewnoakes</groupId>
179 <artifactId>metadata-extractor</artifactId>
180 <version>2.8.0</version>
172181 </dependency>
173182 <dependency>
174183 <groupId>de.l3s.boilerpipe</groupId>
186195 <version>${vorbis.version}</version>
187196 </dependency>
188197 <dependency>
198 <groupId>org.xerial</groupId>
199 <artifactId>sqlite-jdbc</artifactId>
200 <version>3.8.6</version> <!-- 3.8.7 failed on Ubuntu -->
201 </dependency>
202 <dependency>
189203 <groupId>com.googlecode.juniversalchardet</groupId>
190204 <artifactId>juniversalchardet</artifactId>
191205 <version>1.0.3</version>
192206 </dependency>
193207 <dependency>
194 <groupId>com.uwyn</groupId>
208 <groupId>org.codelibs</groupId>
195209 <artifactId>jhighlight</artifactId>
196 <version>1.0</version>
210 <version>1.0.2</version>
211 </dependency>
212 <dependency>
213 <groupId>com.pff</groupId>
214 <artifactId>java-libpst</artifactId>
215 <version>0.8.1</version>
216 </dependency>
217 <dependency>
218 <groupId>com.github.junrar</groupId>
219 <artifactId>junrar</artifactId>
220 <version>0.7</version>
221 </dependency>
222
223 <!-- Test dependencies -->
224 <dependency>
225 <groupId>junit</groupId>
226 <artifactId>junit</artifactId>
227 </dependency>
228 <dependency>
229 <groupId>org.mockito</groupId>
230 <artifactId>mockito-core</artifactId>
231 <version>1.7</version>
232 <scope>test</scope>
233 </dependency>
234 <dependency>
235 <groupId>org.slf4j</groupId>
236 <artifactId>slf4j-log4j12</artifactId>
237 <scope>test</scope>
238 </dependency>
239
240 <!-- edu.ucar dependencies -->
241 <dependency>
242 <groupId>edu.ucar</groupId>
243 <artifactId>netcdf4</artifactId>
244 <version>${netcdf-java.version}</version>
245 </dependency>
246 <dependency>
247 <groupId>edu.ucar</groupId>
248 <artifactId>grib</artifactId>
249 <version>${netcdf-java.version}</version>
250 </dependency>
251 <dependency>
252 <groupId>edu.ucar</groupId>
253 <artifactId>cdm</artifactId>
254 <version>${netcdf-java.version}</version>
197255 <exclusions>
198256 <exclusion>
199 <groupId>javax.servlet</groupId>
200 <artifactId>servlet-api</artifactId>
257 <groupId>org.slf4j</groupId>
258 <artifactId>jcl-over-slf4j</artifactId>
201259 </exclusion>
202260 </exclusions>
203261 </dependency>
204262 <dependency>
205 <groupId>com.pff</groupId>
206 <artifactId>java-libpst</artifactId>
207 <version>0.8.1</version>
208 </dependency>
209
210 <!-- Test dependencies -->
211 <dependency>
212 <groupId>junit</groupId>
213 <artifactId>junit</artifactId>
214 <scope>test</scope>
215 </dependency>
216 <dependency>
217 <groupId>org.mockito</groupId>
218 <artifactId>mockito-core</artifactId>
219 <version>1.7</version>
220 <scope>test</scope>
221 </dependency>
222 <dependency>
223 <groupId>org.slf4j</groupId>
224 <artifactId>slf4j-log4j12</artifactId>
225 <version>1.6.1</version>
226 <scope>test</scope>
263 <groupId>edu.ucar</groupId>
264 <artifactId>httpservices</artifactId>
265 <version>${netcdf-java.version}</version>
266 </dependency>
267 <dependency>
268 <groupId>com.google.guava</groupId>
269 <artifactId>guava</artifactId>
270 <version>10.0.1</version>
271 </dependency>
272 <!-- Apache Commons CSV -->
273 <dependency>
274 <groupId>org.apache.commons</groupId>
275 <artifactId>commons-csv</artifactId>
276 <version>1.0</version>
227277 </dependency>
228278 </dependencies>
229279
240290 org.apache.tika.parser.internal.Activator
241291 </Bundle-Activator>
242292 <Import-Package>
243 org.w3c.dom,
244 org.apache.tika.*,
245 *;resolution:=optional
293 org.w3c.dom,
294 org.apache.tika.*,
295 *;resolution:=optional
246296 </Import-Package>
247297 </instructions>
248298 </configuration>
303353 </build>
304354
305355 <organization>
306 <name>The Apache Software Foundation</name>
307 <url>http://www.apache.org</url>
356 <name>The Apache Software Foundation</name>
357 <url>http://www.apache.org</url>
308358 </organization>
309359 <scm>
310 <url>http://svn.apache.org/viewvc/tika/tags/1.6/tika-parsers</url>
311 <connection>scm:svn:http://svn.apache.org/repos/asf/tika/tags/1.6/tika-parsers</connection>
312 <developerConnection>scm:svn:https://svn.apache.org/repos/asf/tika/tags/1.6/tika-parsers</developerConnection>
360 <url>http://svn.apache.org/viewvc/tika/tags/1.8-rc2/tika-parsers</url>
361 <connection>scm:svn:http://svn.apache.org/repos/asf/tika/tags/1.8-rc2/tika-parsers</connection>
362 <developerConnection>scm:svn:https://svn.apache.org/repos/asf/tika/tags/1.8-rc2/tika-parsers</developerConnection>
313363 </scm>
314364 <issueManagement>
315 <system>JIRA</system>
316 <url>https://issues.apache.org/jira/browse/TIKA</url>
365 <system>JIRA</system>
366 <url>https://issues.apache.org/jira/browse/TIKA</url>
317367 </issueManagement>
318368 <ciManagement>
319 <system>Jenkins</system>
320 <url>https://builds.apache.org/job/Tika-trunk/</url>
369 <system>Jenkins</system>
370 <url>https://builds.apache.org/job/Tika-trunk/</url>
321371 </ciManagement>
322372 </project>
3434 not be used in advertising or otherwise to promote the sale, use or other
3535 dealings in this Software without prior written authorization of the
3636 copyright holder.
37
38
39 JUnRAR (https://github.com/edmund-wagner/junrar/)
40
41 JUnRAR is based on the UnRAR tool, and is covered by the same license.
42 It was formerly available from http://java-unrar.svn.sourceforge.net/
43
44 ****** ***** ****** UnRAR - free utility for RAR archives
45 ** ** ** ** ** ** ~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
46 ****** ******* ****** License for use and distribution of
47 ** ** ** ** ** ** ~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
48 ** ** ** ** ** ** FREE portable version
49 ~~~~~~~~~~~~~~~~~~~~~
50
51 The source code of UnRAR utility is freeware. This means:
52
53 1. All copyrights to RAR and the utility UnRAR are exclusively
54 owned by the author - Alexander Roshal.
55
56 2. The UnRAR sources may be used in any software to handle RAR
57 archives without limitations free of charge, but cannot be used
58 to re-create the RAR compression algorithm, which is proprietary.
59 Distribution of modified UnRAR sources in separate form or as a
60 part of other software is permitted, provided that it is clearly
61 stated in the documentation and source comments that the code may
62 not be used to develop a RAR (WinRAR) compatible archiver.
63
64 3. The UnRAR utility may be freely distributed. It is allowed
65 to distribute UnRAR inside of other software packages.
66
67 4. THE RAR ARCHIVER AND THE UnRAR UTILITY ARE DISTRIBUTED "AS IS".
68 NO WARRANTY OF ANY KIND IS EXPRESSED OR IMPLIED. YOU USE AT
69 YOUR OWN RISK. THE AUTHOR WILL NOT BE LIABLE FOR DATA LOSS,
70 DAMAGES, LOSS OF PROFITS OR ANY OTHER KIND OF LOSS WHILE USING
71 OR MISUSING THIS SOFTWARE.
72
73 5. Installing and using the UnRAR utility signifies acceptance of
74 these terms and conditions of the license.
75
76 6. If you don't agree with terms of the license you must remove
77 UnRAR files from your storage devices and cease to use the
78 utility.
79
80 Thank you for your interest in RAR and UnRAR. Alexander L. Roshal
81
82 Sqlite (bundled in org.xerial's sqlite-jdbc)
83 This product bundles Sqlite, which is in the Public Domain. For details
84 see: https://www.sqlite.org/copyright.html
85
86 Two photos in test-documents (testWebp_Alpha_Lossy.webp and testWebp_Alpha_Lossless.webp)
87 are in the public domain. These files were retrieved from:
88 https://github.com/drewnoakes/metadata-extractor-images/tree/master/webp
89 These photos are also available here:
90 https://developers.google.com/speed/webp/gallery2#webp_links
91 Credits for the photo:
92 "Free Stock Photo in High Resolution - Yellow Rose 3 - Flowers"
93 Image Author: Jon Sullivan
2121 import java.util.Arrays;
2222 import java.util.Collections;
2323 import java.util.HashSet;
24 import java.util.Iterator;
2524 import java.util.Set;
2625
2726 import org.apache.tika.exception.TikaException;
3332 import org.apache.tika.parser.chm.core.ChmExtractor;
3433 import org.apache.tika.parser.html.HtmlParser;
3534 import org.apache.tika.sax.BodyContentHandler;
35 import org.apache.tika.sax.EmbeddedContentHandler;
3636 import org.apache.tika.sax.XHTMLContentHandler;
3737 import org.xml.sax.ContentHandler;
3838 import org.xml.sax.SAXException;
4848 MediaType.application("chm"),
4949 MediaType.application("x-chm"))));
5050
51 @Override
5152 public Set<MediaType> getSupportedTypes(ParseContext context) {
5253 return SUPPORTED_TYPES;
5354 }
5455
56 @Override
5557 public void parse(InputStream stream, ContentHandler handler,
5658 Metadata metadata, ParseContext context) throws IOException,
5759 SAXException, TikaException {
6567 xhtml.startDocument();
6668
6769 for (DirectoryListingEntry entry : chmExtractor.getChmDirList().getDirectoryListingEntryList()) {
68 if (entry.getName().endsWith(".html") || entry.getName().endsWith(".htm")) {
69 xhtml.characters(extract(chmExtractor.extractChmEntry(entry)));
70 final String entryName = entry.getName();
71 if (entryName.endsWith(".html")
72 || entryName.endsWith(".htm")) {
78 byte[] data = chmExtractor.extractChmEntry(entry);
79
80 parsePage(data, xhtml);
81
7083 }
7184 }
7285
7386 xhtml.endDocument();
7487 }
7588
76 /**
77 * Extracts data from byte[]
78 */
79 private String extract(byte[] byteObject) throws TikaException {// throws IOException
80 StringBuilder wBuf = new StringBuilder();
89
90 private void parsePage(byte[] byteObject, ContentHandler xhtml) throws TikaException {
8191 InputStream stream = null;
8292 Metadata metadata = new Metadata();
8393 HtmlParser htmlParser = new HtmlParser();
84 BodyContentHandler handler = new BodyContentHandler(-1);// -1
94 ContentHandler handler = new EmbeddedContentHandler(new BodyContentHandler(xhtml));
8595 ParseContext parser = new ParseContext();
8696 try {
8797 stream = new ByteArrayInputStream(byteObject);
8898 htmlParser.parse(stream, handler, metadata, parser);
89 wBuf.append(handler.toString()
90 + System.getProperty("line.separator"));
9199 } catch (SAXException e) {
92100 throw new RuntimeException(e);
93101 } catch (IOException e) {
94102 // Pushback overflow from tagsoup
95103 }
96 return wBuf.toString();
97104 }
98
99
105
100106 }
1818 import java.math.BigInteger;
1919 import java.util.ArrayList;
2020 import java.util.List;
21
2221 import org.apache.tika.exception.TikaException;
22 import org.apache.tika.io.IOUtils;
2323 import org.apache.tika.parser.chm.core.ChmCommons;
2424 import org.apache.tika.parser.chm.core.ChmConstants;
25 import org.apache.tika.parser.chm.exception.ChmParsingException;
2526
2627 /**
2728 * Holds chm listing entries
100101 }
101102
102103 /**
103 * Gets place holder
104 *
105 * @return place holder
106 */
107 private int getPlaceHolder() {
108 return placeHolder;
109 }
110
111 /**
112104 * Sets place holder
113105 *
114106 * @param placeHolder
117109 this.placeHolder = placeHolder;
118110 }
119111
112 private ChmPmglHeader PMGLheader;
120113 /**
121114 * Enumerates chm directory listing entries
122115 *
123116 * @param chmItsHeader
124 * chm itsf header
117 * chm itsf header
125118 * @param chmItspHeader
126 * chm itsp header
119 * chm itsp header
127120 */
128121 private void enumerateChmDirectoryListingList(ChmItsfHeader chmItsHeader,
129122 ChmItspHeader chmItspHeader) {
135128 setDataOffset(chmItsHeader.getDataOffset());
136129
137130 /* loops over all pmgls */
138 int previous_index = 0;
139131 byte[] dir_chunk = null;
140 for (int i = startPmgl; i <= stopPmgl; i++) {
141 int data_copied = ((1 + i) * (int) chmItspHeader.getBlock_len())
142 + dir_offset;
143 if (i == 0) {
144 dir_chunk = new byte[(int) chmItspHeader.getBlock_len()];
145 // dir_chunk = Arrays.copyOfRange(getData(), dir_offset,
146 // (((1+i) * (int)chmItspHeader.getBlock_len()) +
147 // dir_offset));
148 dir_chunk = ChmCommons
149 .copyOfRange(getData(), dir_offset,
150 (((1 + i) * (int) chmItspHeader
151 .getBlock_len()) + dir_offset));
152 previous_index = data_copied;
153 } else {
154 dir_chunk = new byte[(int) chmItspHeader.getBlock_len()];
155 // dir_chunk = Arrays.copyOfRange(getData(), previous_index,
156 // (((1+i) * (int)chmItspHeader.getBlock_len()) +
157 // dir_offset));
158 dir_chunk = ChmCommons
159 .copyOfRange(getData(), previous_index,
160 (((1 + i) * (int) chmItspHeader
161 .getBlock_len()) + dir_offset));
162 previous_index = data_copied;
163 }
132 for (int i = startPmgl; i>=0; ) {
133 dir_chunk = new byte[(int) chmItspHeader.getBlock_len()];
134 int start = i * (int) chmItspHeader.getBlock_len() + dir_offset;
135 dir_chunk = ChmCommons
136 .copyOfRange(getData(), start,
137 start +(int) chmItspHeader.getBlock_len());
138
139 PMGLheader = new ChmPmglHeader();
140 PMGLheader.parse(dir_chunk, PMGLheader);
164141 enumerateOneSegment(dir_chunk);
142
143 i=PMGLheader.getBlockNext();
165144 dir_chunk = null;
166145 }
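The rewritten loop above walks PMGL chunks by following each header's BlockNext link until it goes negative, instead of iterating a fixed index range. The traversal pattern can be sketched in isolation; the chunk layout and names below are simplified placeholders, not the real CHM structures:

```java
import java.util.ArrayList;
import java.util.List;

public class ChunkWalkDemo {
    // Simplified stand-in for a PMGL header: each chunk records the index of
    // the next chunk in the listing chain, or -1 at the end of the chain.
    record Chunk(String name, int blockNext) {}

    static List<String> walk(Chunk[] chunks, int start) {
        List<String> visited = new ArrayList<>();
        for (int i = start; i >= 0; ) {      // any negative index terminates
            visited.add(chunks[i].name());
            i = chunks[i].blockNext();       // follow the chain, not i++
        }
        return visited;
    }

    public static void main(String[] args) {
        Chunk[] chunks = {
            new Chunk("pmgl0", 2),           // chain: 0 -> 2 -> 1 -> end
            new Chunk("pmgl1", -1),
            new Chunk("pmgl2", 1),
        };
        System.out.println(walk(chunks, 0)); // [pmgl0, pmgl2, pmgl1]
    }
}
```

Chaining through BlockNext visits chunks in listing order even when they are not contiguous on disk, which is why the patch drops the `previous_index` arithmetic.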
167146 } catch (Exception e) {
201180 }
202181 }
203182
183 public static boolean startsWith(byte[] data, String prefix) {
184 if (data.length < prefix.length()) {
185 return false;
186 }
187 for (int i = 0; i < prefix.length(); i++) {
188 if (data[i] != prefix.charAt(i)) {
189 return false;
190 }
191 }
192 return true;
193 }
204192 /**
205193 * Enumerates chm directory listing entries in single chm segment
206194 *
207195 * @param dir_chunk
208196 */
209 private void enumerateOneSegment(byte[] dir_chunk) {
210 try {
197 private void enumerateOneSegment(byte[] dir_chunk) throws ChmParsingException {
211199 if (dir_chunk != null) {
212
213 int indexWorkData = ChmCommons.indexOf(dir_chunk,
214 "::".getBytes());
215 int indexUserData = ChmCommons.indexOf(dir_chunk,
216 "/".getBytes());
217
218 if (indexUserData < indexWorkData)
219 setPlaceHolder(indexUserData);
220 else
221 setPlaceHolder(indexWorkData);
222
223 if (getPlaceHolder() > 0
224 && dir_chunk[getPlaceHolder() - 1] != 115) {// #{
225 do {
226 if (dir_chunk[getPlaceHolder() - 1] > 0) {
227 DirectoryListingEntry dle = new DirectoryListingEntry();
228
229 // two cases: 1. when dir_chunk[getPlaceHolder() -
230 // 1] == 0x73
231 // 2. when dir_chunk[getPlaceHolder() + 1] == 0x2f
232 doNameCheck(dir_chunk, dle);
233
234 // dle.setName(new
235 // String(Arrays.copyOfRange(dir_chunk,
236 // getPlaceHolder(), (getPlaceHolder() +
237 // dle.getNameLength()))));
238 dle.setName(new String(ChmCommons.copyOfRange(
239 dir_chunk, getPlaceHolder(),
240 (getPlaceHolder() + dle.getNameLength()))));
241 checkControlData(dle);
242 checkResetTable(dle);
243 setPlaceHolder(getPlaceHolder()
244 + dle.getNameLength());
245
246 /* Sets entry type */
247 if (getPlaceHolder() < dir_chunk.length
248 && dir_chunk[getPlaceHolder()] == 0)
249 dle.setEntryType(ChmCommons.EntryType.UNCOMPRESSED);
250 else
251 dle.setEntryType(ChmCommons.EntryType.COMPRESSED);
252
253 setPlaceHolder(getPlaceHolder() + 1);
254 dle.setOffset(getEncint(dir_chunk));
255 dle.setLength(getEncint(dir_chunk));
256 getDirectoryListingEntryList().add(dle);
257 } else
258 setPlaceHolder(getPlaceHolder() + 1);
259
260 } while (hasNext(dir_chunk));
200 int header_len;
201 if (startsWith(dir_chunk, ChmConstants.CHM_PMGI_MARKER)) {
202 return; // skip PMGI index chunks; only PMGL chunks hold listing entries
261204 }
262 }
263
264 } catch (Exception e) {
265 e.printStackTrace();
266 }
267 }
268
269 /**
270 * Checks if a name and name length are correct. If not then handles it as
271 * follows: 1. when dir_chunk[getPlaceHolder() - 1] == 0x73 ('s') 2. when
272 * dir_chunk[getPlaceHolder() + 1] == 0x2f ('/')
273 *
274 * @param dir_chunk
275 * @param dle
276 */
277 private void doNameCheck(byte[] dir_chunk, DirectoryListingEntry dle) {
278 if (dir_chunk[getPlaceHolder() - 1] == 0x73) {
279 dle.setNameLength(dir_chunk[getPlaceHolder() - 1] & 0x21);
280 } else if (dir_chunk[getPlaceHolder() + 1] == 0x2f) {
281 dle.setNameLength(dir_chunk[getPlaceHolder()]);
282 setPlaceHolder(getPlaceHolder() + 1);
283 } else {
284 dle.setNameLength(dir_chunk[getPlaceHolder() - 1]);
285 }
286 }
287
288 /**
289 * Checks if it's possible to move further in the byte[]
290 *
291 * @param dir_chunk
292 *
293 * @return boolean
294 */
295 private boolean hasNext(byte[] dir_chunk) {
296 while (getPlaceHolder() < dir_chunk.length) {
297 if (dir_chunk[getPlaceHolder()] == 47
298 && dir_chunk[getPlaceHolder() + 1] != ':') {
299 setPlaceHolder(getPlaceHolder());
300 return true;
301 } else if (dir_chunk[getPlaceHolder()] == ':'
302 && dir_chunk[getPlaceHolder() + 1] == ':') {
303 setPlaceHolder(getPlaceHolder());
304 return true;
305 } else
306 setPlaceHolder(getPlaceHolder() + 1);
307 }
308 return false;
309 }
205 else if (startsWith(dir_chunk, ChmConstants.PMGL)) {
206 header_len = ChmConstants.CHM_PMGL_LEN;
207 }
208 else {
209 throw new ChmParsingException("Bad dir entry block.");
210 }
211
212 placeHolder = header_len;
214 while (placeHolder > 0
215 && placeHolder < dir_chunk.length - PMGLheader.getFreeSpace())
216 {
217 // get entry name length (ENCINT: 7 data bits per byte, high bit = continuation)
218 int strlen = 0;
219 byte temp;
220 while ((temp = dir_chunk[placeHolder++]) < 0) { // high bit set on a signed byte
221 strlen = (strlen << 7) + (temp & 0x7f);
222 }
223
224 strlen = (strlen << 7) + (temp & 0x7f); // parenthesized: '&' binds looser than '+'
227
228 if (strlen>dir_chunk.length) {
229 throw new ChmParsingException("Bad data of a string length.");
230 }
231
232 DirectoryListingEntry dle = new DirectoryListingEntry();
233 dle.setNameLength(strlen);
234 dle.setName(new String(ChmCommons.copyOfRange(
235 dir_chunk, placeHolder,
236 (placeHolder + dle.getNameLength())), IOUtils.UTF_8));
237
238 checkControlData(dle);
239 checkResetTable(dle);
240 setPlaceHolder(placeHolder
241 + dle.getNameLength());
242
243 /* Sets entry type */
244 if (placeHolder < dir_chunk.length
245 && dir_chunk[placeHolder] == 0)
246 dle.setEntryType(ChmCommons.EntryType.UNCOMPRESSED);
247 else
248 dle.setEntryType(ChmCommons.EntryType.COMPRESSED);
249
250 setPlaceHolder(placeHolder + 1);
251 dle.setOffset(getEncint(dir_chunk));
252 dle.setLength(getEncint(dir_chunk));
253 getDirectoryListingEntryList().add(dle);
254 }
255
309 }
310
314 }
315
310316
311317 /**
312318 * Returns an ENCINT-encoded integer
320326 BigInteger bi = BigInteger.ZERO;
321327 byte[] nb = new byte[1];
322328
323 if (getPlaceHolder() < data_chunk.length) {
324 while ((ob = data_chunk[getPlaceHolder()]) < 0) {
329 if (placeHolder < data_chunk.length) {
330 while ((ob = data_chunk[placeHolder]) < 0) {
325331 nb[0] = (byte) ((ob & 0x7f));
326332 bi = bi.shiftLeft(7).add(new BigInteger(nb));
327 setPlaceHolder(getPlaceHolder() + 1);
333 setPlaceHolder(placeHolder + 1);
328334 }
329335 nb[0] = (byte) ((ob & 0x7f));
330336 bi = bi.shiftLeft(7).add(new BigInteger(nb));
331 setPlaceHolder(getPlaceHolder() + 1);
337 setPlaceHolder(placeHolder + 1);
332338 }
333339 return bi.intValue();
334 }
335
336 /**
337 * @param args
338 */
339 public static void main(String[] args) {
340340 }
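The ENCINT decoding used by getEncint above (and by the entry-name length loop earlier in this class) packs seven data bits per byte and sets the high bit on every byte except the last. A standalone sketch of the same decoding; the helper name and raw byte-array input are illustrative, not part of Tika's API:

```java
public class EncintDemo {
    /**
     * Decodes a CHM ENCINT starting at offset: big-endian groups of 7 bits;
     * the high bit of each byte is set on every byte except the last.
     */
    static int decodeEncint(byte[] data, int offset) {
        int value = 0;
        byte b;
        while ((b = data[offset++]) < 0) {   // high bit set => continuation byte
            value = (value << 7) | (b & 0x7f);
        }
        return (value << 7) | (b & 0x7f);    // final byte has the high bit clear
    }

    public static void main(String[] args) {
        // 0x85 0x23 => (0x05 << 7) | 0x23 = 675
        System.out.println(decodeEncint(new byte[]{(byte) 0x85, 0x23}, 0));
        // a single byte below 0x80 decodes to itself
        System.out.println(decodeEncint(new byte[]{0x7f}, 0));
    }
}
```

Note the `< 0` loop test: Java bytes are signed, so "high bit set" reads naturally as "negative", which is exactly the comparison getEncint relies on.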
341341
342342 /**
1818 import java.math.BigInteger;
1919
2020 import org.apache.tika.exception.TikaException;
21 import org.apache.tika.io.IOUtils;
2122 import org.apache.tika.parser.chm.assertion.ChmAssert;
2223 import org.apache.tika.parser.chm.core.ChmConstants;
2324 import org.apache.tika.parser.chm.exception.ChmParsingException;
4142 /* structure of ITSF headers */
4243 public class ChmItsfHeader implements ChmAccessor<ChmItsfHeader> {
4344 private static final long serialVersionUID = 2215291838533213826L;
44 private byte[] signature = new String("ITSF").getBytes(); /* 0 (ITSF) */
45 private byte[] signature;
4546 private int version; /* 4 */
4647 private int header_len; /* 8 */
4748 private int unknown_000c; /* c */
5960 private int dataRemained;
6061 private int currentPlace = 0;
6162
63 public ChmItsfHeader() {
64 signature = ChmConstants.ITSF.getBytes(IOUtils.UTF_8); /* 0 (ITSF) */
65 }
66
6267 /**
6368 * Prints the values of ChmfHeader
6469 */
6570 public String toString() {
6671 StringBuilder sb = new StringBuilder();
67 sb.append(new String(getSignature()) + " ");
72 sb.append(new String(getSignature(), IOUtils.UTF_8) + " ");
6873 sb.append(getVersion() + " ");
6974 sb.append(getHeaderLen() + " ");
7075 sb.append(getUnknown_000c() + " ");
375380
376381 if (4 > this.getDataRemained())
377382 throw new TikaException("4 > dataLength");
378 dest = data[this.getCurrentPlace()]
379 | data[this.getCurrentPlace() + 1] << 8
380 | data[this.getCurrentPlace() + 2] << 16
381 | data[this.getCurrentPlace() + 3] << 24;
383 dest = (data[this.getCurrentPlace()] & 0xff)
384 | (data[this.getCurrentPlace() + 1] & 0xff) << 8
385 | (data[this.getCurrentPlace() + 2] & 0xff) << 16
386 | (data[this.getCurrentPlace() + 3] & 0xff) << 24;
382387
383388 this.setCurrentPlace(this.getCurrentPlace() + 4);
384389 this.setDataRemained(this.getDataRemained() - 4);
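The recurring `& 0xff` fix in these unmarshal methods exists because Java bytes are signed: without the mask, a byte like 0xFF sign-extends to 0xFFFFFFFF and its high bits leak into the OR-assembled word. A minimal self-contained demonstration (method names are illustrative):

```java
public class ByteMaskDemo {
    // Assembles a little-endian 32-bit value; mirrors the corrected unmarshalInt32.
    static int readLeInt32(byte[] d, int off) {
        return (d[off] & 0xff)
                | (d[off + 1] & 0xff) << 8
                | (d[off + 2] & 0xff) << 16
                | (d[off + 3] & 0xff) << 24;
    }

    // The buggy variant: each byte is sign-extended before the OR.
    static int readLeInt32Buggy(byte[] d, int off) {
        return d[off] | d[off + 1] << 8 | d[off + 2] << 16 | d[off + 3] << 24;
    }

    public static void main(String[] args) {
        byte[] d = {(byte) 0xFF, 0x00, 0x00, 0x00};  // little-endian 255
        System.out.println(readLeInt32(d, 0));       // 255
        System.out.println(readLeInt32Buggy(d, 0));  // -1: 0xFF sign-extended
    }
}
```

Any input byte with the high bit set triggers the bug, so the unmasked version corrupts most real offsets and lengths.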
457462 chmItsfHeader.setUnknownLen(chmItsfHeader.unmarshalUint64(data, chmItsfHeader.getUnknownLen()));
458463 chmItsfHeader.setDirOffset(chmItsfHeader.unmarshalUint64(data, chmItsfHeader.getDirOffset()));
459464 chmItsfHeader.setDirLen(chmItsfHeader.unmarshalUint64(data, chmItsfHeader.getDirLen()));
460
461 if (!new String(chmItsfHeader.getSignature()).equals(ChmConstants.ITSF))
465 if (!new String(chmItsfHeader.getSignature(), IOUtils.UTF_8).equals(ChmConstants.ITSF))
462466 throw new TikaException("seems not valid file");
463467 if (chmItsfHeader.getVersion() == ChmConstants.CHM_VER_2) {
464468 if (chmItsfHeader.getHeaderLen() < ChmConstants.CHM_ITSF_V2_LEN)
1515 */
1616 package org.apache.tika.parser.chm.accessor;
1717
18 import java.io.UnsupportedEncodingException;
19
1820 import org.apache.tika.exception.TikaException;
21 import org.apache.tika.io.IOUtils;
1922 import org.apache.tika.parser.chm.assertion.ChmAssert;
2023 import org.apache.tika.parser.chm.core.ChmCommons;
2124 import org.apache.tika.parser.chm.core.ChmConstants;
4447 public class ChmItspHeader implements ChmAccessor<ChmItspHeader> {
4548 // TODO: refactor all unmarshals
4649 private static final long serialVersionUID = 1962394421998181341L;
47 private byte[] signature = new String(ChmConstants.ITSP).getBytes(); /*
48 * 0
49 * (ITSP
50 * )
51 */
50 private byte[] signature;
5251 private int version; /* 4 */
5352 private int header_len; /* 8 */
5453 private int unknown_000c; /* c */
6867 private int dataRemained;
6968 private int currentPlace = 0;
7069
70 public ChmItspHeader() {
71 signature = ChmConstants.ITSP.getBytes(IOUtils.UTF_8); /*
72 * 0
73 * (ITSP
74 * )
75 */
76 }
77
7178 public String toString() {
7279 StringBuilder sb = new StringBuilder();
73 sb.append("[ signature:=" + new String(getSignature())
80 sb.append("[ signature:=" + new String(getSignature(), IOUtils.UTF_8)
7481 + System.getProperty("line.separator"));
7582 sb.append("version:=\t" + getVersion()
7683 + System.getProperty("line.separator"));
132139 ChmAssert.assertByteArrayNotNull(data);
133140 if (4 > this.getDataRemained())
134141 throw new TikaException("4 > dataLength");
135 dest = data[this.getCurrentPlace()]
136 | data[this.getCurrentPlace() + 1] << 8
137 | data[this.getCurrentPlace() + 2] << 16
138 | data[this.getCurrentPlace() + 3] << 24;
142 dest = (data[this.getCurrentPlace()] & 0xff)
143 | (data[this.getCurrentPlace() + 1] & 0xff) << 8
144 | (data[this.getCurrentPlace() + 2] & 0xff) << 16
145 | (data[this.getCurrentPlace() + 3] & 0xff) << 24;
139146
140147 this.setCurrentPlace(this.getCurrentPlace() + 4);
141148 this.setDataRemained(this.getDataRemained() - 4);
146153 ChmAssert.assertByteArrayNotNull(data);
147154 if (4 > dataLenght)
148155 throw new TikaException("4 > dataLenght");
149 dest = data[this.getCurrentPlace()]
150 | data[this.getCurrentPlace() + 1] << 8
151 | data[this.getCurrentPlace() + 2] << 16
152 | data[this.getCurrentPlace() + 3] << 24;
156 dest = (data[this.getCurrentPlace()] & 0xff)
157 | (data[this.getCurrentPlace() + 1] & 0xff) << 8
158 | (data[this.getCurrentPlace() + 2] & 0xff) << 16
159 | (data[this.getCurrentPlace() + 3] & 0xff) << 24;
153160
154161 setDataRemained(this.getDataRemained() - 4);
155162 this.setCurrentPlace(this.getCurrentPlace() + 4);
529536 ChmConstants.BYTE_ARRAY_LENGHT));
530537
531538 /* Checks validity of the itsp header */
532 if (!new String(chmItspHeader.getSignature()).equals(ChmConstants.ITSP))
533 throw new ChmParsingException("seems not valid signature");
539 if (!new String(chmItspHeader.getSignature(), IOUtils.UTF_8).equals(ChmConstants.ITSP))
540 throw new ChmParsingException("seems not valid signature");
534541
535542 if (chmItspHeader.getVersion() != ChmConstants.CHM_VER_1)
536543 throw new ChmParsingException("!=ChmConstants.CHM_VER_1");
538545 if (chmItspHeader.getHeader_len() != ChmConstants.CHM_ITSP_V1_LEN)
539546 throw new ChmParsingException("!= ChmConstants.CHM_ITSP_V1_LEN");
540547 }
541
542 /**
543 * @param args
544 */
545 public static void main(String[] args) {
546 }
547548 }
1515 */
1616 package org.apache.tika.parser.chm.accessor;
1717
18 import java.io.UnsupportedEncodingException;
19
1820 import org.apache.tika.exception.TikaException;
21 import org.apache.tika.io.IOUtils;
1922 import org.apache.tika.parser.chm.assertion.ChmAssert;
2023 import org.apache.tika.parser.chm.core.ChmConstants;
2124 import org.apache.tika.parser.chm.exception.ChmParsingException;
3942 private static final long serialVersionUID = -7897854774939631565L;
4043 /* class' members */
4144 private long size; /* 0 */
42 private byte[] signature = new String(ChmConstants.LZXC).getBytes(); /*
43 * 4
44 * (LZXC
45 * )
46 */
45 private byte[] signature;
4746 private long version; /* 8 */
4847 private long resetInterval; /* c */
4948 private long windowSize; /* 10 */
5352 /* local usage */
5453 private int dataRemained;
5554 private int currentPlace = 0;
55
56 public ChmLzxcControlData() {
57 signature = ChmConstants.LZXC.getBytes(IOUtils.UTF_8); /*
58 * 4
59 * (LZXC
60 * )
61 */
62 }
5663
5764 /**
5865 * Returns a remained data
247254 StringBuilder sb = new StringBuilder();
248255 sb.append("size(unknown):=" + this.getSize() + ", ");
249256 sb.append("signature(Compression type identifier):="
250 + new String(this.getSignature()) + ", ");
257 + new String(this.getSignature(), IOUtils.UTF_8) + ", ");
251258 sb.append("version(Possibly numeric code for LZX):="
252259 + this.getVersion() + System.getProperty("line.separator"));
253260 sb.append("resetInterval(The Huffman reset interval):="
298305 "window size / resetInterval should be more than 1");
299306
300307 /* checks a signature */
301 if (!new String(chmLzxcControlData.getSignature())
308 if (!new String(chmLzxcControlData.getSignature(), IOUtils.UTF_8)
302309 .equals(ChmConstants.LZXC))
303310 throw new ChmParsingException(
304311 "the signature does not seem to be correct");
157157
158158 private long unmarshalUInt32(byte[] data, long dest) throws TikaException {
159159 ChmAssert.assertByteArrayNotNull(data);
160 dest = data[this.getCurrentPlace()]
161 | data[this.getCurrentPlace() + 1] << 8
162 | data[this.getCurrentPlace() + 2] << 16
163 | data[this.getCurrentPlace() + 3] << 24;
160 dest = (data[this.getCurrentPlace()] & 0xff)
161 | (data[this.getCurrentPlace() + 1] & 0xff) << 8
162 | (data[this.getCurrentPlace() + 2] & 0xff) << 16
163 | (data[this.getCurrentPlace() + 3] & 0xff) << 24;
164164
165165 setDataRemained(this.getDataRemained() - 4);
166166 this.setCurrentPlace(this.getCurrentPlace() + 4);
315315 */
316316 public void setBlockLlen(long block_len) {
317317 this.block_len = block_len;
318 }
319
320 /**
321 * @param args
322 */
323 public static void main(String[] args) {
324
325318 }
326319
327320 // @Override
1818 import java.util.Arrays;
1919
2020 import org.apache.tika.exception.TikaException;
21 import org.apache.tika.io.IOUtils;
2122 import org.apache.tika.parser.chm.assertion.ChmAssert;
2223 import org.apache.tika.parser.chm.core.ChmCommons;
2324 import org.apache.tika.parser.chm.core.ChmConstants;
3839 * <p>
3940 * Note: This class is not in use
4041 *
41 * {@link http
42 * ://translated.by/you/microsoft-s-html-help-chm-format-incomplete/original
43 * /?show-translation-form=1 }
42 * {@link http://translated.by/you/microsoft-s-html-help-chm-format-incomplete/original/?show-translation-form=1 }
4443 *
4544 *
4645 */
4746 public class ChmPmgiHeader implements ChmAccessor<ChmPmgiHeader> {
4847 private static final long serialVersionUID = -2092282339894303701L;
49 private byte[] signature = new String(ChmConstants.CHM_PMGI_MARKER).getBytes(); /* 0 (PMGI) */
48 private byte[] signature;
5049 private long free_space; /* 4 */
5150
5251 /* local usage */
5352 private int dataRemained;
5453 private int currentPlace = 0;
54
55 public ChmPmgiHeader() {
56 signature = ChmConstants.CHM_PMGI_MARKER.getBytes(IOUtils.UTF_8); /* 0 (PMGI) */
57 }
5558
5659 private int getDataRemained() {
5760 return dataRemained;
7679 ChmAssert.assertChmAccessorNotNull(chmPmgiHeader);
7780 ChmAssert.assertPositiveInt(count);
7881 this.setDataRemained(data.length);
79 index = ChmCommons.indexOf(data,
80 ChmConstants.CHM_PMGI_MARKER.getBytes());
82 index = ChmCommons.indexOf(data,
83 ChmConstants.CHM_PMGI_MARKER.getBytes(IOUtils.UTF_8));
84
8185 if (index >= 0)
8286 System.arraycopy(data, index, chmPmgiHeader.getSignature(), 0, count);
8387 else{
9397
9498 if (4 > getDataRemained())
9599 throw new ChmParsingException("4 > dataLength");
96 dest = data[this.getCurrentPlace()]
97 | data[this.getCurrentPlace() + 1] << 8
98 | data[this.getCurrentPlace() + 2] << 16
99 | data[this.getCurrentPlace() + 3] << 24;
100 dest = (data[this.getCurrentPlace()] & 0xff)
101 | (data[this.getCurrentPlace() + 1] & 0xff) << 8
102 | (data[this.getCurrentPlace() + 2] & 0xff) << 16
103 | (data[this.getCurrentPlace() + 3] & 0xff) << 24;
100104
101105 setDataRemained(this.getDataRemained() - 4);
102106 this.setCurrentPlace(this.getCurrentPlace() + 4);
144148 */
145149 public String toString() {
146150 StringBuilder sb = new StringBuilder();
147 sb.append("signature:=" + new String(getSignature()) + ", ");
151 sb.append("signature:=" + new String(getSignature(), IOUtils.UTF_8) + ", ");
148152 sb.append("free space:=" + getFreeSpace()
149153 + System.getProperty("line.separator"));
150154 return sb.toString();
162166
163167 /* check structure */
164168 if (!Arrays.equals(chmPmgiHeader.getSignature(),
165 ChmConstants.CHM_PMGI_MARKER.getBytes()))
169 ChmConstants.CHM_PMGI_MARKER.getBytes(IOUtils.UTF_8)))
166170 throw new TikaException(
"it does not seem to be a valid PMGI signature; check ChmItsp index_root: if it was -1, there is no PMGI, use PMGL instead");
168172
169173 }
170
171 /**
172 * @param args
173 */
174 public static void main(String[] args) {
175
176 }
177174 }
1616 package org.apache.tika.parser.chm.accessor;
1717
1818 import org.apache.tika.exception.TikaException;
19 import org.apache.tika.io.IOUtils;
1920 import org.apache.tika.parser.chm.assertion.ChmAssert;
2021 import org.apache.tika.parser.chm.core.ChmConstants;
2122 import org.apache.tika.parser.chm.exception.ChmParsingException;
5455 */
5556 public class ChmPmglHeader implements ChmAccessor<ChmPmglHeader> {
5657 private static final long serialVersionUID = -6139486487475923593L;
57 private byte[] signature = new String(ChmConstants.PMGL).getBytes(); /*
58 private byte[] signature;
59 private long free_space; /* 4 */
60 private long unknown_0008; /* 8 */
61 private int block_prev; /* c */
62 private int block_next; /* 10 */
63
64 /* local usage */
65 private int dataRemained;
66 private int currentPlace = 0;
67
68 public ChmPmglHeader() {
69 signature = ChmConstants.PMGL.getBytes(IOUtils.UTF_8); /*
5870 * 0
5971 * (PMGL
6072 * )
6173 */
62 private long free_space; /* 4 */
63 private long unknown_0008; /* 8 */
64 private int block_prev; /* c */
65 private int block_next; /* 10 */
66
67 /* local usage */
68 private int dataRemained;
69 private int currentPlace = 0;
74 }
7075
7176 private int getDataRemained() {
7277 return dataRemained;
8893 return free_space;
8994 }
9095
91 public void setFreeSpace(long free_space) {
96 public void setFreeSpace(long free_space) throws TikaException {
97 if (free_space < 0) {
98 throw new TikaException("Bad PMGLheader.FreeSpace="+free_space);
99 }
92100 this.free_space = free_space;
93101 }
94102
95103 public String toString() {
96104 StringBuilder sb = new StringBuilder();
97 sb.append("signature:=" + new String(getSignature()) + ", ");
105 sb.append("signature:=" + new String(getSignature(), IOUtils.UTF_8) + ", ");
98106 sb.append("free space:=" + getFreeSpace() + ", ");
99107 sb.append("unknown0008:=" + getUnknown0008() + ", ");
100108 sb.append("prev block:=" + getBlockPrev() + ", ");
112120 this.setDataRemained(this.getDataRemained() - count);
113121 }
114122
115 private int unmarshalInt32(byte[] data, int dest) throws TikaException {
123 private int unmarshalInt32(byte[] data) throws TikaException {
116124 ChmAssert.assertByteArrayNotNull(data);
125 int dest;
117126 if (4 > this.getDataRemained())
118127 throw new TikaException("4 > dataLength");
119 dest = data[this.getCurrentPlace()]
120 | data[this.getCurrentPlace() + 1] << 8
121 | data[this.getCurrentPlace() + 2] << 16
122 | data[this.getCurrentPlace() + 3] << 24;
128 dest = (data[this.getCurrentPlace()] & 0xff)
129 | (data[this.getCurrentPlace() + 1] & 0xff) << 8
130 | (data[this.getCurrentPlace() + 2] & 0xff) << 16
131 | (data[this.getCurrentPlace() + 3] & 0xff) << 24;
123132
124133 this.setCurrentPlace(this.getCurrentPlace() + 4);
125134 this.setDataRemained(this.getDataRemained() - 4);
126135 return dest;
127136 }
128137
129 private long unmarshalUInt32(byte[] data, long dest) throws ChmParsingException {
138 private long unmarshalUInt32(byte[] data) throws ChmParsingException {
130139 ChmAssert.assertByteArrayNotNull(data);
140 long dest;
131141 if (4 > getDataRemained())
132142 throw new ChmParsingException("4 > dataLenght");
133 dest = data[this.getCurrentPlace()]
134 | data[this.getCurrentPlace() + 1] << 8
135 | data[this.getCurrentPlace() + 2] << 16
136 | data[this.getCurrentPlace() + 3] << 24;
143 dest = (data[this.getCurrentPlace()] & 0xff)
144 | (data[this.getCurrentPlace() + 1] & 0xff) << 8
145 | (data[this.getCurrentPlace() + 2] & 0xff) << 16
146 | (data[this.getCurrentPlace() + 3] & 0xff) << 24;
137147
138148 setDataRemained(this.getDataRemained() - 4);
139149 this.setCurrentPlace(this.getCurrentPlace() + 4);
149159 /* unmarshal fields */
150160 chmPmglHeader.unmarshalCharArray(data, chmPmglHeader,
151161 ChmConstants.CHM_SIGNATURE_LEN);
152 chmPmglHeader.setFreeSpace(chmPmglHeader.unmarshalUInt32(data,
153 chmPmglHeader.getFreeSpace()));
154 chmPmglHeader.setUnknown0008(chmPmglHeader.unmarshalUInt32(data,
155 chmPmglHeader.getUnknown0008()));
156 chmPmglHeader.setBlockPrev(chmPmglHeader.unmarshalInt32(data,
157 chmPmglHeader.getBlockPrev()));
158 chmPmglHeader.setBlockNext(chmPmglHeader.unmarshalInt32(data,
159 chmPmglHeader.getBlockNext()));
162 chmPmglHeader.setFreeSpace(chmPmglHeader.unmarshalUInt32(data));
163 chmPmglHeader.setUnknown0008(chmPmglHeader.unmarshalUInt32(data));
164 chmPmglHeader.setBlockPrev(chmPmglHeader.unmarshalInt32(data));
165 chmPmglHeader.setBlockNext(chmPmglHeader.unmarshalInt32(data));
160166
161167 /* check structure */
162 if (!new String(chmPmglHeader.getSignature()).equals(ChmConstants.PMGL))
168 if (!new String(chmPmglHeader.getSignature(), IOUtils.UTF_8).equals(ChmConstants.PMGL))
163169 throw new ChmParsingException(ChmPmglHeader.class.getName()
164170 + " pmgl != pmgl.signature");
165
166171 }
167172
168173 public byte[] getSignature() {
196201 protected void setBlockNext(int block_next) {
197202 this.block_next = block_next;
198203 }
199
200 /**
201 * @param args
202 */
203 public static void main(String[] args) {
204
205 }
206204 }
8080 sb.append("length:=" + getLength());
8181 return sb.toString();
8282 }
83
83
8484 /**
8585 * Returns an entry name length
8686 *
147147 protected void setLength(int length) {
148148 this.length = length;
149149 }
150
151 public static void main(String[] args) {
152 }
153150 }
1818 import java.io.FileNotFoundException;
1919 import java.io.FileOutputStream;
2020 import java.io.IOException;
21 import java.util.Iterator;
2221 import java.util.List;
2322
2423 import org.apache.tika.exception.TikaException;
358357 return str == null || str.length() == 0;
359358 }
360359
361 /**
362 * @param args
363 */
364 public static void main(String[] args) {
365 }
366
367360 }
1515 */
1616 package org.apache.tika.parser.chm.core;
1717
18 import org.apache.tika.io.IOUtils;
19
1820 public class ChmConstants {
1921 /* Prevents instantiation */
2022 private ChmConstants() {
2123 }
2224
23 public static final String DEFAULT_CHARSET = "UTF-8";
25 public static final String DEFAULT_CHARSET = IOUtils.UTF_8.name();
2426 public static final String ITSF = "ITSF";
2527 public static final String ITSP = "ITSP";
2628 public static final String PMGL = "PMGL";
1919 import java.io.IOException;
2020 import java.io.InputStream;
2121 import java.util.ArrayList;
22 import java.util.Iterator;
2322 import java.util.List;
24
2523 import org.apache.tika.exception.TikaException;
2624 import org.apache.tika.io.IOUtils;
2725 import org.apache.tika.parser.chm.accessor.ChmDirectoryListingSet;
9593 }
9694
9795 /**
98 * Returns lzxc block length
96 * Returns lzxc hit_cache length
9997 *
10098 * @return lzxBlockLength
10199 */
104102 }
105103
106104 /**
107 * Sets lzxc block length
105 * Sets lzxc hit_cache length
108106 *
109107 * @param lzxBlockLength
110108 */
113111 }
114112
115113 /**
116 * Returns lzxc block offset
114 * Returns lzxc hit_cache offset
117115 *
118116 * @return lzxBlockOffset
119117 */
122120 }
123121
124122 /**
125 * Sets lzxc block offset
123 * Sets lzxc hit_cache offset
126124 */
127125 private void setLzxBlockOffset(long lzxBlockOffset) {
128126 this.lzxBlockOffset = lzxBlockOffset;
173171
174172 int indexOfControlData = getChmDirList().getControlDataIndex();
175173 int indexOfResetData = ChmCommons.indexOfResetTableBlock(getData(),
176 ChmConstants.LZXC.getBytes());
174 ChmConstants.LZXC.getBytes(IOUtils.UTF_8));
177175 byte[] dir_chunk = null;
178176 if (indexOfResetData > 0)
179177 dir_chunk = ChmCommons.copyOfRange( getData(), indexOfResetData, indexOfResetData
214212 setLzxBlocksCache(new ArrayList<ChmLzxBlock>());
215213
216214 } catch (IOException e) {
217 // ignore
215 e.printStackTrace();
218216 }
219217 }
220218
256254 dataOffset + directoryListingEntry.getLength()));
257255 } else if (directoryListingEntry.getEntryType() == EntryType.COMPRESSED
258256 && !ChmCommons.hasSkip(directoryListingEntry)) {
259 /* Gets a chm block info */
257 /* Gets a chm hit_cache info */
260258 ChmBlockInfo bb = ChmBlockInfo.getChmBlockInfoInstance(
261259 directoryListingEntry, (int) getChmLzxcResetTable()
262260 .getBlockLen(), getChmLzxcControlData());
263261
264 int i = 0, start = 0, block = 0;
262 int i = 0, start = 0, hit_cache = 0;
265263
266264 if ((getLzxBlockLength() < Integer.MAX_VALUE)
267265 && (getLzxBlockOffset() < Integer.MAX_VALUE)) {
268266 // TODO: Improve the caching
269267 // caching ... = O(n^2) - depends on startBlock and endBlock
270 if (getLzxBlocksCache().size() != 0) {
268 start = -1;
269 if (!getLzxBlocksCache().isEmpty()) {
271270 for (i = 0; i < getLzxBlocksCache().size(); i++) {
272 lzxBlock = getLzxBlocksCache().get(i);
273 for (int j = bb.getIniBlock(); j <= bb
274 .getStartBlock(); j++) {
275 if (lzxBlock.getBlockNumber() == j)
271 //lzxBlock = getLzxBlocksCache().get(i);
272 int bn = getLzxBlocksCache().get(i).getBlockNumber();
273 for (int j = bb.getIniBlock(); j <= bb.getStartBlock(); j++) {
274 if (bn == j) {
276275 if (j > start) {
277276 start = j;
278 block = i;
277 hit_cache = i;
279278 }
280 if (start == bb.getStartBlock())
281 break;
279 }
282280 }
281 if (start == bb.getStartBlock())
282 break;
283283 }
284284 }
285285
286 if (i == getLzxBlocksCache().size() && i == 0) {
286 // if (i == getLzxBlocksCache().size() && i == 0) {
287 if (start<0) {
287288 start = bb.getIniBlock();
288289
289290 byte[] dataSegment = ChmCommons.getChmBlockSegment(
297298
298299 getLzxBlocksCache().add(lzxBlock);
299300 } else {
300 lzxBlock = getLzxBlocksCache().get(block);
301 lzxBlock = getLzxBlocksCache().get(hit_cache);
301302 }
302303
303304 for (i = start; i <= bb.getEndBlock();) {
348349 .getBlockCount()) {
349350 getLzxBlocksCache().clear();
350351 }
352 } //end of if
353
354 if (buffer.size() != directoryListingEntry.getLength()) {
355 throw new TikaException("CHM file extract error: extracted length is wrong.");
351356 }
352 }
357 } //end of if compressed
353358 } catch (Exception e) {
354359 throw new TikaException(e.getMessage());
355360 }
7171 % bytesPerBlock);
7272 // potential problem with casting long to int
7373 chmBlockInfo
74 .setIniBlock((chmBlockInfo.startBlock - chmBlockInfo.startBlock)
75 % (int) clcd.getResetInterval());
74 .setIniBlock(chmBlockInfo.startBlock -
75 chmBlockInfo.startBlock % (int) clcd.getResetInterval());
76 // .setIniBlock((chmBlockInfo.startBlock - chmBlockInfo.startBlock)
77 // % (int) clcd.getResetInterval());
7678 return chmBlockInfo;
7779 }
7880
8890 (dle.getOffset() + dle.getLength()) % bytesPerBlock);
8991 // potential problem with casting long to int
9092 getChmBlockInfo().setIniBlock(
91 (getChmBlockInfo().startBlock - getChmBlockInfo().startBlock)
93 getChmBlockInfo().startBlock - getChmBlockInfo().startBlock
9294 % (int) clcd.getResetInterval());
95 // (getChmBlockInfo().startBlock - getChmBlockInfo().startBlock)
96 // % (int) clcd.getResetInterval());
9397 return getChmBlockInfo();
9498 }
9599
4646 private int previousBlockType = -1;
4747
4848 public ChmLzxBlock(int blockNumber, byte[] dataSegment, long blockLength,
49 ChmLzxBlock prevBlock) {
49 ChmLzxBlock prevBlock) throws TikaException {
5050 try {
5151 if (validateConstructorParams(blockNumber, dataSegment, blockLength)) {
5252 setBlockNumber(blockNumber);
5454 if (prevBlock != null
5555 && prevBlock.getState().getBlockLength() > prevBlock
5656 .getState().getBlockRemaining())
57 setChmSection(new ChmSection(prevBlock.getContent()));
57 setChmSection(new ChmSection(dataSegment, prevBlock.getContent()));
5858 else
5959 setChmSection(new ChmSection(dataSegment));
6060
6464 // we need to take care of previous context
6565 // ============================================
6666 checkLzxBlock(prevBlock);
67 setContent((int) blockLength);
6867 if (prevBlock == null
69 || getContent().length < (int) getBlockLength()) {
68 || blockLength < (int) getBlockLength()) {
7069 setContent((int) getBlockLength());
70 }
71 else {
72 setContent((int) blockLength);
7173 }
7274
7375 if (prevBlock != null && prevBlock.getState() != null)
7678 extractContent();
7779 } else
7880 throw new TikaException("Check your chm lzx block parameters");
79 } catch (Exception e) {
80 // TODO: handle exception
81 } catch (TikaException e) {
82 throw e;
8183 }
8284 }
8385
135137 }
136138
137139 switch (getState().getBlockType()) {
138 case ChmCommons.ALIGNED_OFFSET:
139 createAlignedTreeTable();
140 case ChmCommons.VERBATIM:
141 /* Creates mainTreeTable */
142 createMainTreeTable();
143 createLengthTreeTable();
144 if (getState().getMainTreeLengtsTable()[0xe8] != 0)
140 case ChmCommons.ALIGNED_OFFSET:
141 createAlignedTreeTable();
142 //fall through
143 case ChmCommons.VERBATIM:
144 /* Creates mainTreeTable */
145 createMainTreeTable();
146 createLengthTreeTable();
147 if (getState().getMainTreeLengtsTable()[0xe8] != 0)
148 getState().setIntelState(IntelState.STARTED);
149 break;
150 case ChmCommons.UNCOMPRESSED:
145151 getState().setIntelState(IntelState.STARTED);
146 break;
147 case ChmCommons.UNCOMPRESSED:
148 getState().setIntelState(IntelState.STARTED);
149 if (getChmSection().getTotal() > 16)
150 getChmSection().setSwath(
151 getChmSection().getSwath() - 1);
152 getState().setR0(
153 (new BigInteger(getChmSection()
154 .reverseByteOrder(
155 getChmSection().unmarshalBytes(
156 4))).longValue()));
157 getState().setR1(
158 (new BigInteger(getChmSection()
159 .reverseByteOrder(
160 getChmSection().unmarshalBytes(
161 4))).longValue()));
162 getState().setR2(
163 (new BigInteger(getChmSection()
164 .reverseByteOrder(
165 getChmSection().unmarshalBytes(
166 4))).longValue()));
167 break;
168 default:
169 break;
170 }
171 }
152 if (getChmSection().getTotal() > 16)
153 getChmSection().setSwath(
154 getChmSection().getSwath() - 1);
155 getState().setR0(
156 (new BigInteger(getChmSection()
157 .reverseByteOrder(
158 getChmSection().unmarshalBytes(
159 4))).longValue()));
160 getState().setR1(
161 (new BigInteger(getChmSection()
162 .reverseByteOrder(
163 getChmSection().unmarshalBytes(
164 4))).longValue()));
165 getState().setR2(
166 (new BigInteger(getChmSection()
167 .reverseByteOrder(
168 getChmSection().unmarshalBytes(
169 4))).longValue()));
170 break;
171 default:
172 break;
173 }
174 } //end of if BlockRemaining == 0
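The reshuffled `switch` above relies on an intentional ALIGNED_OFFSET → VERBATIM fall-through (aligned blocks also need the verbatim tables) and gives UNCOMPRESSED its own branch. A compact sketch of that control flow; the constants and step names are illustrative stand-ins, not Tika's:

```java
public class FallThroughDemo {
    // Illustrative stand-ins for the ChmCommons block-type constants.
    static final int ALIGNED_OFFSET = 1, VERBATIM = 2, UNCOMPRESSED = 3;

    static String setup(int blockType) {
        StringBuilder steps = new StringBuilder();
        switch (blockType) {
        case ALIGNED_OFFSET:
            steps.append("aligned-table ");
            // fall through: aligned blocks also build the verbatim tables
        case VERBATIM:
            steps.append("main-table length-table");
            break;
        case UNCOMPRESSED:
            steps.append("raw-copy");
            break;
        }
        return steps.toString();
    }

    public static void main(String[] args) {
        System.out.println(setup(ALIGNED_OFFSET)); // aligned-table main-table length-table
        System.out.println(setup(VERBATIM));       // main-table length-table
    }
}
```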
172175
173176 int tempLen;
174177
187190 switch (getState().getBlockType()) {
188191 case ChmCommons.ALIGNED_OFFSET:
189192 // if(prevblock.lzxState.length>prevblock.lzxState.remaining)
190 decompressAlignedBlock(tempLen, getChmSection().getData());// prevcontext
193 decompressAlignedBlock(tempLen, getChmSection().getPrevContent() == null ? getChmSection().getData() : getChmSection().getPrevContent());// prevcontext
191194 break;
192195 case ChmCommons.VERBATIM:
193 decompressVerbatimBlock(tempLen, getChmSection().getData());
196 decompressVerbatimBlock(tempLen, getChmSection().getPrevContent() == null ? getChmSection().getData() : getChmSection().getPrevContent());
194197 break;
195198 case ChmCommons.UNCOMPRESSED:
196 decompressUncompressedBlock(tempLen, getChmSection()
197 .getData());
199 decompressUncompressedBlock(tempLen, getChmSection().getPrevContent() == null ? getChmSection().getData() : getChmSection().getPrevContent());
198200 break;
199201 }
200202 getState().increaseFramesRead();
253255 }
254256
255257 private void createLengthTreeTable() throws TikaException {
258 //Read Pre Tree Table
256259 short[] prelentable = createPreLenTable();
257260
258261 if (prelentable == null) {
269272 throw new ChmParsingException("pretreetable is null");
270273 }
271274
275 //Build Length Tree
272276 createLengthTreeLenTable(0, ChmConstants.LZX_NUM_SECONDARY_LENGTHS,
273277 pretreetable, prelentable);
274278
275279 getState().setLengthTreeTable(
276280 createTreeTable2(getState().getLengthTreeLengtsTable(),
277 (1 << ChmConstants.LZX_MAINTREE_TABLEBITS)
281 (1 << ChmConstants.LZX_LENGTH_TABLEBITS)
278282 + (ChmConstants.LZX_LENGTH_MAXSYMBOLS << 1),
279 ChmConstants.LZX_MAINTREE_TABLEBITS,
283 ChmConstants.LZX_LENGTH_TABLEBITS,
280284 ChmConstants.LZX_NUM_SECONDARY_LENGTHS));
281285 }
282286
311315 int matchoffset = 0;
312316 for (i = getContentLength(); i < len; i++) {
313317 /* new code */
314 border = getChmSection().getDesyncBits(
315 ChmConstants.LZX_MAINTREE_TABLEBITS, 0);
318 //read huffman tree from main tree
319 border = getChmSection().peekBits(
320 ChmConstants.LZX_MAINTREE_TABLEBITS);
316321 if (border >= getState().mainTreeTable.length)
317 break;
322 throw new ChmParsingException("error decompressing aligned block.");
323 //break;
318324 /* end new code */
319 s = getState().mainTreeTable[getChmSection().getDesyncBits(
320 ChmConstants.LZX_MAINTREE_TABLEBITS, 0)];
325 s = getState().mainTreeTable[getChmSection().peekBits(
326 ChmConstants.LZX_MAINTREE_TABLEBITS)];
321327 if (s >= getState().getMainTreeElements()) {
322328 x = ChmConstants.LZX_MAINTREE_TABLEBITS;
323329 do {
327333 } while ((s = getState().mainTreeTable[s]) >= getState()
328334 .getMainTreeElements());
329335 }
330 getChmSection().getSyncBits(getState().mainTreeTable[s]);
336 //System.out.printf("%d,", s);
337 //?getChmSection().getSyncBits(getState().mainTreeTable[s]);
338 getChmSection().getSyncBits(getState().getMainTreeLengtsTable()[s]);
331339 if (s < ChmConstants.LZX_NUM_CHARS) {
332340 content[i] = (byte) s;
333341 } else {
335343 matchlen = s & ChmConstants.LZX_NUM_PRIMARY_LENGTHS;
336344 if (matchlen == ChmConstants.LZX_NUM_PRIMARY_LENGTHS) {
337345 matchfooter = getState().lengthTreeTable[getChmSection()
338 .getDesyncBits(ChmConstants.LZX_MAINTREE_TABLEBITS,
339 0)];
340 if (matchfooter >= ChmConstants.LZX_MAINTREE_TABLEBITS) {
341 x = ChmConstants.LZX_MAINTREE_TABLEBITS;
346 .peekBits(ChmConstants.LZX_LENGTH_TABLEBITS)];//.LZX_MAINTREE_TABLEBITS)];
347 if (matchfooter >= ChmConstants.LZX_LENGTH_MAXSYMBOLS/*?LZX_LENGTH_TABLEBITS*/) {
348 x = ChmConstants.LZX_LENGTH_TABLEBITS;
342349 do {
343350 x++;
344351 matchfooter <<= 1;
356363 matchoffset = (ChmConstants.POSITION_BASE[matchoffset] - 2);
357364 if (extra > 3) {
358365 extra -= 3;
359 long l = getChmSection().getSyncBits(extra);
360 matchoffset += (l << 3);
361 int g = getChmSection().getDesyncBits(
362 ChmConstants.LZX_NUM_PRIMARY_LENGTHS, 0);
363 int t = getState().getAlignedTreeTable()[g];
366 long verbatim_bits = getChmSection().getSyncBits(extra);
367 matchoffset += (verbatim_bits << 3);
368 //READ HUFF SYM in Aligned Tree
369 int aligned_bits = getChmSection().peekBits(
370 ChmConstants.LZX_NUM_PRIMARY_LENGTHS);
371 int t = getState().getAlignedTreeTable()[aligned_bits];
364372 if (t >= getState().getMainTreeElements()) {
365 x = ChmConstants.LZX_MAINTREE_TABLEBITS;
373 x = ChmConstants.LZX_ALIGNED_TABLEBITS; //?LZX_MAINTREE_TABLEBITS; //?LZX_ALIGNED_TABLEBITS
366374 do {
367375 x++;
368376 t <<= 1;
371379 .getMainTreeElements());
372380 }
373381 getChmSection().getSyncBits(
374 getState().getAlignedTreeTable()[t]);
382 getState().getAlignedLenTable()[t]);
375383 matchoffset += t;
376384 } else if (extra == 3) {
377 int g = getChmSection().getDesyncBits(
378 ChmConstants.LZX_NUM_PRIMARY_LENGTHS, 0);
385 int g = getChmSection().peekBits(
386 ChmConstants.LZX_NUM_PRIMARY_LENGTHS);
379387 int t = getState().getAlignedTreeTable()[g];
380388 if (t >= getState().getMainTreeElements()) {
381 x = ChmConstants.LZX_MAINTREE_TABLEBITS;
389 x = ChmConstants.LZX_ALIGNED_TABLEBITS; //?LZX_MAINTREE_TABLEBITS;
382390 do {
383391 x++;
384392 t <<= 1;
387395 .getMainTreeElements());
388396 }
389397 getChmSection().getSyncBits(
390 getState().getAlignedTreeTable()[t]);
398 getState().getAlignedLenTable()[t]);
391399 matchoffset += t;
392400 } else if (extra > 0) {
393401 long l = getChmSection().getSyncBits(extra);
456464 int matchlen = 0, matchfooter = 0, extra, rundest, runsrc;
457465 int matchoffset = 0;
458466 for (i = getContentLength(); i < len; i++) {
459 int f = getChmSection().getDesyncBits(
460 ChmConstants.LZX_MAINTREE_TABLEBITS, 0);
467 int f = getChmSection().peekBits(
468 ChmConstants.LZX_MAINTREE_TABLEBITS);
461469 assertShortArrayNotNull(getState().getMainTreeTable());
462470 s = getState().getMainTreeTable()[f];
463471 if (s >= ChmConstants.LZX_MAIN_MAXSYMBOLS) {
476484 matchlen = s & ChmConstants.LZX_NUM_PRIMARY_LENGTHS;
477485 if (matchlen == ChmConstants.LZX_NUM_PRIMARY_LENGTHS) {
478486 matchfooter = getState().getLengthTreeTable()[getChmSection()
479 .getDesyncBits(ChmConstants.LZX_LENGTH_TABLEBITS, 0)];
487 .peekBits(ChmConstants.LZX_LENGTH_TABLEBITS)];
480488 if (matchfooter >= ChmConstants.LZX_NUM_SECONDARY_LENGTHS) {
481489 x = ChmConstants.LZX_LENGTH_TABLEBITS;
482490 do {
568576 int i = offset; // represents offset
569577 int z, y, x;// local counters
570578 while (i < tablelen) {
571 z = pretreetable[getChmSection().getDesyncBits(
572 ChmConstants.LZX_PRETREE_TABLEBITS, 0)];
579 //Read HUFF sym to z
580 z = pretreetable[getChmSection().peekBits(
581 ChmConstants.LZX_PRETREE_TABLEBITS)];
573582 if (z >= ChmConstants.LZX_PRETREE_NUM_ELEMENTS) {// 1 bug, should be
574583 // 20
575584 x = ChmConstants.LZX_PRETREE_TABLEBITS;
580589 } while ((z = pretreetable[z]) >= ChmConstants.LZX_PRETREE_NUM_ELEMENTS);
581590 }
582591 getChmSection().getSyncBits(prelentable[z]);
592
583593 if (z < 17) {
584594 z = getState().getLengthTreeLengtsTable()[i] - z;
585595 if (z < 0)
596606 y = getChmSection().getSyncBits(5);
597607 y += 20;
598608 for (int j = 0; j < y; j++)
599 if (i < getState().getLengthTreeLengtsTable().length)
609 //do not tolerate an out-of-range index here; let it fail //if (i < getState().getLengthTreeLengtsTable().length)
600610 getState().getLengthTreeLengtsTable()[i++] = 0;
601611 } else if (z == 19) {
602612 y = getChmSection().getSyncBits(1);
603613 y += 4;
604 z = pretreetable[getChmSection().getDesyncBits(
605 ChmConstants.LZX_PRETREE_TABLEBITS, 0)];
614 z = pretreetable[getChmSection().peekBits(
615 ChmConstants.LZX_PRETREE_TABLEBITS)];
606616 if (z >= ChmConstants.LZX_PRETREE_NUM_ELEMENTS) {// 20
607617 x = ChmConstants.LZX_PRETREE_TABLEBITS;// 6
608618 do {
609619 x++;
610620 z <<= 1;
611621 z += getChmSection().checkBit(x);
612 } while ((z = pretreetable[z]) >= ChmConstants.LZX_MAINTREE_TABLEBITS);
622 } while ((z = pretreetable[z]) >= ChmConstants.LZX_PRETREE_NUM_ELEMENTS);//LZX_MAINTREE_TABLEBITS);
613623 }
614624 getChmSection().getSyncBits(prelentable[z]);
615625 z = getState().getLengthTreeLengtsTable()[i] - z;
622632 }
623633
624634 private void createMainTreeTable() throws TikaException {
635 //Read Pre Tree Table
625636 short[] prelentable = createPreLenTable();
626637 short[] pretreetable = createTreeTable2(prelentable,
627638 (1 << ChmConstants.LZX_PRETREE_TABLEBITS)
628639 + (ChmConstants.LZX_PRETREE_MAXSYMBOLS << 1),
629640 ChmConstants.LZX_PRETREE_TABLEBITS,
630641 ChmConstants.LZX_PRETREE_MAXSYMBOLS);
642
631643 createMainTreeLenTable(0, ChmConstants.LZX_NUM_CHARS, pretreetable,
632644 prelentable);
645
646 //Read Pre Tree Table
633647 prelentable = createPreLenTable();
634648 pretreetable = createTreeTable2(prelentable,
635649 (1 << ChmConstants.LZX_PRETREE_TABLEBITS)
636650 + (ChmConstants.LZX_PRETREE_MAXSYMBOLS << 1),
637651 ChmConstants.LZX_PRETREE_TABLEBITS,
638652 ChmConstants.LZX_PRETREE_MAXSYMBOLS);
653
639654 createMainTreeLenTable(ChmConstants.LZX_NUM_CHARS,
640655 getState().mainTreeLengtsTable.length, pretreetable,
641656 prelentable);
646661 + (ChmConstants.LZX_MAINTREE_MAXSYMBOLS << 1),
647662 ChmConstants.LZX_MAINTREE_TABLEBITS, getState()
648663 .getMainTreeElements()));
649
650664 }
651665
652666 private void createMainTreeLenTable(int offset, int tablelen,
656670 int i = offset;
657671 int z, y, x;
658672 while (i < tablelen) {
659 int f = getChmSection().getDesyncBits(
660 ChmConstants.LZX_PRETREE_TABLEBITS, 0);
673 int f = getChmSection().peekBits(
674 ChmConstants.LZX_PRETREE_TABLEBITS);
661675 z = pretreetable[f];
662676 if (z >= ChmConstants.LZX_PRETREE_MAXSYMBOLS) {
663677 x = ChmConstants.LZX_PRETREE_TABLEBITS;
691705 } else if (z == 19) {
692706 y = getChmSection().getSyncBits(1);
693707 y += 4;
694 z = pretreetable[getChmSection().getDesyncBits(
695 ChmConstants.LZX_PRETREE_TABLEBITS, 0)];
708 z = pretreetable[getChmSection().peekBits(
709 ChmConstants.LZX_PRETREE_TABLEBITS)];
696710 if (z >= ChmConstants.LZX_PRETREE_MAXSYMBOLS) {
697711 x = ChmConstants.LZX_PRETREE_TABLEBITS;
698712 do {
719733 }
720734
721735 private short[] createAlignedLenTable() {
722 int tablelen = ChmConstants.LZX_BLOCKTYPE_UNCOMPRESSED;
736 int tablelen = ChmConstants.LZX_ALIGNED_NUM_ELEMENTS;//LZX_BLOCKTYPE_UNCOMPRESSED;//
723737 int bits = ChmConstants.LZX_BLOCKTYPE_UNCOMPRESSED;
724738 short[] tmp = new short[tablelen];
725739 for (int i = 0; i < tablelen; i++) {
728742 return tmp;
729743 }
730744
731 private void createAlignedTreeTable() {
745 private void createAlignedTreeTable() throws ChmParsingException {
732746 getState().setAlignedLenTable(createAlignedLenTable());
733 getState().setAlignedLenTable(
747 getState().setAlignedTreeTable(//setAlignedLenTable(
734748 createTreeTable2(getState().getAlignedLenTable(),
735749 (1 << ChmConstants.LZX_NUM_PRIMARY_LENGTHS)
736750 + (ChmConstants.LZX_ALIGNED_MAXSYMBOLS << 1),
739753 }
740754
741755 private short[] createTreeTable2(short[] lentable, int tablelen, int bits,
742 int maxsymbol) {
756 int maxsymbol) throws ChmParsingException {
743757 short[] tmp = new short[tablelen];
744758 short sym;
745759 int leaf;
755769 while (bit_num <= bits) {
756770 for (sym = 0; sym < maxsymbol; sym++) {
757771 if (lentable.length > sym && lentable[sym] == bit_num) {
758 leaf = pos;// pos=0
759
760 if ((pos += bit_mask) > table_mask)
761 return null;
772 leaf = pos;
773
774 if ((pos += bit_mask) > table_mask) {
775 /* table overflow */
776 throw new ChmParsingException("Table overflow");
777 }
762778
763779 fill = bit_mask;
764780 while (fill-- > 0)
807823 }
808824 tmp[leaf] = sym;
809825
810 if ((pos += bit_mask) > table_mask)
811 return null;
812 /* table overflow */
813 } else {
814 // return null;
826 if ((pos += bit_mask) > table_mask) {
827 /* table overflow */
828 throw new ChmParsingException("Table overflow");
829 }
815830 }
816831 }
817832 bit_mask >>= 1;
831846 }
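Both overflow checks in `createTreeTable2` now throw `ChmParsingException` instead of returning `null`, so a malformed Huffman length table fails with a named cause rather than a later NullPointerException in the caller. A simplified sketch of the same check (a Kraft-style slot count; the types and names here are not Tika's):

```java
public class OverflowDemo {
    // Builds a decode table of 2^bits slots; each code of length len
    // occupies size >> len slots. Overshooting the table means a corrupt
    // length table, reported by exception instead of a null return.
    static short[] buildTable(int bits, short[] lengths) throws Exception {
        int size = 1 << bits, pos = 0;
        for (short len : lengths) {
            pos += size >> len;
            if (pos > size) {
                throw new Exception("Table overflow"); // was: return null
            }
        }
        return new short[size];
    }

    public static void main(String[] args) throws Exception {
        System.out.println(buildTable(2, new short[]{1, 1}).length); // 4
        try {
            buildTable(2, new short[]{1, 1, 1}); // 3 codes of length 1 cannot fit
        } catch (Exception e) {
            System.out.println(e.getMessage()); // Table overflow
        }
    }
}
```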
832847
833848 public byte[] getContent(int startOffset, int endOffset) {
834 int length = endOffset - startOffset;
835 // return (getContent() != null) ? Arrays.copyOfRange(getContent(),
836 // startOffset, (startOffset + length)) : new byte[1];
837849 return (getContent() != null) ? ChmCommons.copyOfRange(getContent(),
838 startOffset, (startOffset + length)) : new byte[1];
850 startOffset, endOffset) : new byte[1];
839851 }
840852
841853 public byte[] getContent(int start) {
842 // return (getContent() != null) ? Arrays.copyOfRange(getContent(),
843 // start, (getContent().length + start)) : new byte[1];
844854 return (getContent() != null) ? ChmCommons.copyOfRange(getContent(),
845 start, (getContent().length + start)) : new byte[1];
855 start, getContent().length) : new byte[1];
846856 }
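The two `getContent` overloads above now pass a true exclusive end index. Copy-range helpers take `(from, to)`, not `(from, length)`, and the old `getContent().length + start` end ran past the array. Illustrated with `java.util.Arrays.copyOfRange` standing in for `ChmCommons.copyOfRange`:

```java
import java.util.Arrays;

public class CopyOfRangeDemo {
    public static void main(String[] args) {
        byte[] content = {10, 20, 30, 40, 50};
        // Correct: 'to' is exclusive, so the tail from index 2 ends at content.length.
        byte[] tail = Arrays.copyOfRange(content, 2, content.length);
        System.out.println(Arrays.toString(tail)); // [30, 40, 50]
        // The old shape, content.length + start, silently zero-pads past the
        // end (Arrays.copyOfRange allows to > length) instead of failing.
        byte[] padded = Arrays.copyOfRange(content, 2, content.length + 2);
        System.out.println(Arrays.toString(padded)); // [30, 40, 50, 0, 0]
    }
}
```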
847857
848858 private void setContent(int contentLength) {
853863 if (chmPrevLzxBlock == null && getBlockLength() < Integer.MAX_VALUE)
854864 setState(new ChmLzxState((int) getBlockLength()));
855865 else
856 setState(chmPrevLzxBlock.getState());
866 //use clone to avoid changing a cached or to-be-cached block
867 setState(chmPrevLzxBlock.getState().clone());
857868 }
858869
859870 private boolean validateConstructorParams(int blockNumber,
898909 private void setState(ChmLzxState state) {
899910 this.state = state;
900911 }
901
902 /**
903 * @param args
904 */
905 public static void main(String[] args) {
906 // TODO Auto-generated method stub
907
908 }
909912 }
1616 package org.apache.tika.parser.chm.lzx;
1717
1818 import java.util.concurrent.CancellationException;
19
2019 import org.apache.tika.exception.TikaException;
2120 import org.apache.tika.parser.chm.core.ChmCommons;
22 import org.apache.tika.parser.chm.core.ChmConstants;
2321 import org.apache.tika.parser.chm.core.ChmCommons.IntelState;
2422 import org.apache.tika.parser.chm.core.ChmCommons.LzxState;
23 import org.apache.tika.parser.chm.core.ChmConstants;
2524 import org.apache.tika.parser.chm.exception.ChmParsingException;
2625
27 public class ChmLzxState {
26 public class ChmLzxState implements Cloneable {
2827 /* Class' members */
2928 private int window; /* the actual decoding window */
3029 private long window_size; /* window size (32Kb through 2Mb) */
5251 protected short[] alignedLenTable;
5352 protected short[] alignedTreeTable;
5453
54 @Override
55 public ChmLzxState clone() {
56 try {
57 ChmLzxState clone = (ChmLzxState)super.clone();
58 clone.mainTreeLengtsTable = arrayClone(mainTreeLengtsTable);
59 clone.mainTreeTable = arrayClone(mainTreeTable);
60 clone.lengthTreeTable = arrayClone(lengthTreeTable);
61 clone.lengthTreeLengtsTable = arrayClone(lengthTreeLengtsTable);
62 clone.alignedLenTable = arrayClone(alignedLenTable);
63 clone.alignedTreeTable = arrayClone(alignedTreeTable);
64 return clone;
65 } catch (CloneNotSupportedException ex) {
66 return null;
67 }
68 }
69
5570 protected short[] getMainTreeTable() {
5671 return mainTreeTable;
5772 }
146161 position_slots = 50;
147162 else
148163 position_slots = win << 1;
149
164 //TODO: position_slots is not used ?
150165 setR0(1);
151166 setR1(1);
152167 setR2(1);
289304 return R2;
290305 }
291306
292 public static void main(String[] args) {
293 }
294
295307 public void setMainTreeLengtsTable(short[] mainTreeLengtsTable) {
296308 this.mainTreeLengtsTable = mainTreeLengtsTable;
297309 }
307319 public short[] getLengthTreeLengtsTable() {
308320 return lengthTreeLengtsTable;
309321 }
322
323 private static short[] arrayClone(short[] a) {
324 return a==null ? null : (short[]) a.clone();
325 }
310326 }
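The new `clone()` above exists because `Object.clone()` is shallow: handing a cached block's state to the next block by reference would let decompression mutate the cached copy, so each table array gets its own copy. A minimal sketch of the pattern (names illustrative):

```java
public class StateCloneDemo implements Cloneable {
    short[] table = {1, 2, 3};

    @Override
    public StateCloneDemo clone() {
        try {
            StateCloneDemo copy = (StateCloneDemo) super.clone();
            // super.clone() copied only the reference; without this line
            // both objects would share, and mutate, the same array.
            copy.table = table.clone();
            return copy;
        } catch (CloneNotSupportedException e) {
            throw new AssertionError(e); // cannot happen: we are Cloneable
        }
    }

    public static void main(String[] args) {
        StateCloneDemo cached = new StateCloneDemo();
        StateCloneDemo working = cached.clone();
        working.table[0] = 99;
        System.out.println(cached.table[0]); // 1 -- cached state untouched
    }
}
```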
2222 import org.apache.tika.parser.chm.core.ChmCommons;
2323
2424 public class ChmSection {
25 private byte[] data;
25 final private byte[] data;
26 final private byte[] prevcontent;
2627 private int swath;// kiks
2728 private int total;// remains
2829 private int buffer;// val
2930
3031 public ChmSection(byte[] data) throws TikaException {
32 this(data, null);
33 }
34
35 public ChmSection(byte[] data, byte[] prevcontent) throws TikaException {
3136 ChmCommons.assertByteArrayNotNull(data);
32 setData(data);
33 }
34
37 this.data = data;
38 this.prevcontent = prevcontent;
39 //setData(data);
40 }
41
3542 /* Utilities */
3643 public byte[] reverseByteOrder(byte[] toBeReversed) throws TikaException {
3744 ChmCommons.assertByteArrayNotNull(toBeReversed);
4754 return getDesyncBits(bit, bit);
4855 }
4956
50 public int getDesyncBits(int bit, int removeBit) {
57 public int peekBits(int bit) {
58 return getDesyncBits(bit, 0);
59 }
60
61 private int getDesyncBits(int bit, int removeBit) {
5162 while (getTotal() < 16) {
5263 setBuffer((getBuffer() << 16) + unmarshalUByte()
5364 + (unmarshalUByte() << 8));
7990 return data;
8091 }
8192
93 public byte[] getPrevContent() {
94 return prevcontent;
95 }
96
8297 public BigInteger getBigInteger(int i) {
8398 if (getData() == null)
8499 return BigInteger.ZERO;
165180 }
166181 }
167182
168 private void setData(byte[] data) {
169 this.data = data;
170 }
183 // private void setData(byte[] data) {
184 // this.data = data;
185 // }
171186
172187 public int getSwath() {
173188 return swath;
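The `peekBits` wrapper above splits "inspect the next n bits" from the consuming `getSyncBits`, replacing the scattered `getDesyncBits(bit, 0)` calls at the decode sites. A self-contained sketch of a little-endian 16-bit-word bit buffer with that peek/consume split; this mirrors the idea, not ChmSection's exact internals:

```java
public class BitReaderDemo {
    private final byte[] data;
    private int pos, buffer, total;

    BitReaderDemo(byte[] data) { this.data = data; }

    // Top up the buffer 16 bits at a time (low byte first, as in CHM).
    private void fill() {
        while (total < 16) {
            int lo = pos < data.length ? data[pos++] & 0xff : 0;
            int hi = pos < data.length ? data[pos++] & 0xff : 0;
            buffer = (buffer << 16) + lo + (hi << 8);
            total += 16;
        }
    }

    int peekBits(int n) {      // look at the next n bits, consume nothing
        fill();
        return (buffer >>> (total - n)) & ((1 << n) - 1);
    }

    int getSyncBits(int n) {   // look at the next n bits and consume them
        int value = peekBits(n);
        total -= n;
        return value;
    }

    public static void main(String[] args) {
        BitReaderDemo r = new BitReaderDemo(new byte[]{(byte) 0xAB, (byte) 0xCD});
        System.out.println(r.peekBits(4) == r.getSyncBits(4)); // true: peek is stable
    }
}
```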
4747 import com.uwyn.jhighlight.renderer.Renderer;
4848 import com.uwyn.jhighlight.renderer.XhtmlRendererFactory;
4949 /**
50 * Generic Source code parser for Java, Groovy, C++
50 * Generic Source code parser for Java, Groovy, C++.
51 * Note: this parser uses the JHighlight library (https://github.com/codelibs/jhighlight) under a CDDL/LGPL dual license.
5152 *
5253 * @author Hong-Thai.Nguyen
5354 * @since 1.6
3030 import org.bouncycastle.cms.CMSException;
3131 import org.bouncycastle.cms.CMSSignedDataParser;
3232 import org.bouncycastle.cms.CMSTypedStream;
33 import org.bouncycastle.operator.DigestCalculatorProvider;
34 import org.bouncycastle.operator.OperatorCreationException;
35 import org.bouncycastle.operator.jcajce.JcaDigestCalculatorProviderBuilder;
3336 import org.xml.sax.ContentHandler;
3437 import org.xml.sax.SAXException;
3538
5659 Metadata metadata, ParseContext context)
5760 throws IOException, SAXException, TikaException {
5861 try {
62 DigestCalculatorProvider digestCalculatorProvider =
63 new JcaDigestCalculatorProviderBuilder().setProvider("BC").build();
5964 CMSSignedDataParser parser =
60 new CMSSignedDataParser(new CloseShieldInputStream(stream));
65 new CMSSignedDataParser(digestCalculatorProvider, new CloseShieldInputStream(stream));
6166 try {
62 CMSTypedStream content = parser.getSignedContent();
67 CMSTypedStream content = parser.getSignedContent();
6368 if (content == null) {
64 throw new TikaException("cannot parse detached pkcs7 signature (no signed data to parse)");
69 throw new TikaException("cannot parse detached pkcs7 signature (no signed data to parse)");
6570 }
6671 InputStream input = content.getContentStream();
6772 try {
7479 } finally {
7580 parser.close();
7681 }
82 } catch (OperatorCreationException e) {
83 throw new TikaException("Unable to create DigestCalculatorProvider", e);
7784 } catch (CMSException e) {
7885 throw new TikaException("Unable to parse pkcs7 signed data", e);
7986 }
9292 ZipEntry entry = zip.getNextEntry();
9393 while (entry != null) {
9494 if (entry.getName().equals("mimetype")) {
95 String type = IOUtils.toString(zip, "UTF-8");
95 String type = IOUtils.toString(zip, IOUtils.UTF_8.name());
9696 metadata.set(Metadata.CONTENT_TYPE, type);
9797 } else if (entry.getName().equals("metadata.xml")) {
9898 meta.parse(zip, new DefaultHandler(), metadata, context);
9595 SyndContent content = entry.getDescription();
9696 if (content != null) {
9797 xhtml.newline();
98 xhtml.characters(content.getValue());
98 xhtml.characters(stripTags(content));
9999 }
100100 xhtml.endElement("li");
101101 }
0 /**
1 * Licensed to the Apache Software Foundation (ASF) under one or more
2 * contributor license agreements. See the NOTICE file distributed with
3 * this work for additional information regarding copyright ownership.
4 * The ASF licenses this file to You under the Apache License, Version 2.0
5 * (the "License"); you may not use this file except in compliance with
6 * the License. You may obtain a copy of the License at
7 *
8 * http://www.apache.org/licenses/LICENSE-2.0
9 *
10 * Unless required by applicable law or agreed to in writing, software
11 * distributed under the License is distributed on an "AS IS" BASIS,
12 * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
13 * See the License for the specific language governing permissions and
14 * limitations under the License.
15 */
16
17 package org.apache.tika.parser.gdal;
18
19 //JDK imports
20 import java.io.ByteArrayInputStream;
21 import java.io.IOException;
22 import java.io.InputStream;
23 import java.io.InputStreamReader;
24 import java.io.Reader;
25 import java.util.HashMap;
26 import java.util.HashSet;
27 import java.util.Map;
28 import java.util.Scanner;
29 import java.util.Set;
30 import java.util.regex.Matcher;
31 import java.util.regex.Pattern;
32 import org.apache.tika.exception.TikaException;
33 import org.apache.tika.io.IOUtils;
34 import org.apache.tika.io.TemporaryResources;
35 import org.apache.tika.io.TikaInputStream;
36 import org.apache.tika.metadata.Metadata;
37 import org.apache.tika.mime.MediaType;
38 import org.apache.tika.parser.AbstractParser;
39 import org.apache.tika.parser.ParseContext;
40 import org.apache.tika.parser.external.ExternalParser;
41 import org.apache.tika.sax.XHTMLContentHandler;
42 import org.xml.sax.ContentHandler;
43 import org.xml.sax.SAXException;
44
45 import static org.apache.tika.parser.external.ExternalParser.INPUT_FILE_TOKEN;
46
47 //Tika imports
48 //SAX imports
49
50 /**
51 * Wraps execution of the <a href="http://gdal.org/">Geospatial Data Abstraction
52 * Library (GDAL)</a> <code>gdalinfo</code> tool used to extract geospatial
53 * information out of hundreds of geo file formats.
54 * <p/>
55 * The parser requires the installation of GDAL and for <code>gdalinfo</code> to
56 * be located on the path.
57 * <p/>
58 * Basic information (Size, Coordinate System, Bounding Box, Driver, and
59 * resource info) are extracted as metadata, and the remaining metadata patterns
60 * are extracted and added.
61 * <p/>
62 * The output of the command is available from the provided
63 * {@link ContentHandler} in the
64 * {@link #parse(InputStream, ContentHandler, Metadata, ParseContext)} method.
65 */
66 public class GDALParser extends AbstractParser {
67
68 private static final long serialVersionUID = -3869130527323941401L;
69
70 private String command;
71
72 public GDALParser() {
73 setCommand("gdalinfo ${INPUT}");
74 }
75
76 public void setCommand(String command) {
77 this.command = command;
78 }
79
80 public String getCommand() {
81 return this.command;
82 }
83
84 public String processCommand(InputStream stream) {
85 TikaInputStream tis = (TikaInputStream) stream;
86 String pCommand = this.command;
87 try {
88 if (this.command.contains(INPUT_FILE_TOKEN)) {
89 pCommand = this.command.replace(INPUT_FILE_TOKEN, tis.getFile()
90 .getPath());
91 }
92 } catch (Exception e) {
93 e.printStackTrace();
94 }
95
96 return pCommand;
97 }
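`processCommand` above builds the external command line by substituting the spooled file's path for `ExternalParser.INPUT_FILE_TOKEN`, the `${INPUT}` placeholder the constructor puts in `"gdalinfo ${INPUT}"`. Stripped of the TikaInputStream plumbing, the substitution is just:

```java
public class CommandTokenDemo {
    // Stand-in for ExternalParser.INPUT_FILE_TOKEN, matching the
    // "${INPUT}" placeholder seen in the GDALParser constructor.
    static final String INPUT_FILE_TOKEN = "${INPUT}";

    static String processCommand(String template, String inputPath) {
        return template.contains(INPUT_FILE_TOKEN)
                ? template.replace(INPUT_FILE_TOKEN, inputPath)
                : template;
    }

    public static void main(String[] args) {
        System.out.println(processCommand("gdalinfo ${INPUT}", "/tmp/scene.tif"));
        // gdalinfo /tmp/scene.tif
    }
}
```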
98
99 @Override
100 public Set<MediaType> getSupportedTypes(ParseContext context) {
101 Set<MediaType> types = new HashSet<MediaType>();
102 types.add(MediaType.application("x-netcdf"));
103 types.add(MediaType.application("vrt"));
104 types.add(MediaType.image("geotiff"));
105 types.add(MediaType.image("ntif"));
106 types.add(MediaType.application("x-rpf-toc"));
107 types.add(MediaType.application("x-ecrg-toc"));
108 types.add(MediaType.image("hfa"));
109 types.add(MediaType.image("sar-ceos"));
110 types.add(MediaType.image("ceos"));
111 types.add(MediaType.application("jaxa-pal-sar"));
112 types.add(MediaType.application("gff"));
113 types.add(MediaType.application("elas"));
114 types.add(MediaType.application("aig"));
115 types.add(MediaType.application("aaigrid"));
116 types.add(MediaType.application("grass-ascii-grid"));
117 types.add(MediaType.application("sdts-raster"));
118 types.add(MediaType.application("dted"));
119 types.add(MediaType.image("png"));
120 types.add(MediaType.image("jpeg"));
121 types.add(MediaType.image("raster"));
122 types.add(MediaType.application("jdem"));
123 types.add(MediaType.image("gif"));
124 types.add(MediaType.image("big-gif"));
125 types.add(MediaType.image("envisat"));
126 types.add(MediaType.image("fits"));
127 types.add(MediaType.application("fits"));
128 types.add(MediaType.image("bsb"));
129 types.add(MediaType.application("xpm"));
130 types.add(MediaType.image("bmp"));
131 types.add(MediaType.image("x-dimap"));
132 types.add(MediaType.image("x-airsar"));
133 types.add(MediaType.application("x-rs2"));
134 types.add(MediaType.application("x-pcidsk"));
135 types.add(MediaType.application("pcisdk"));
136 types.add(MediaType.image("x-pcraster"));
137 types.add(MediaType.image("ilwis"));
138 types.add(MediaType.image("sgi"));
139 types.add(MediaType.application("x-srtmhgt"));
140 types.add(MediaType.application("leveller"));
141 types.add(MediaType.application("terragen"));
142 types.add(MediaType.application("x-gmt"));
143 types.add(MediaType.application("x-isis3"));
144 types.add(MediaType.application("x-isis2"));
145 types.add(MediaType.application("x-pds"));
146 types.add(MediaType.application("x-til"));
147 types.add(MediaType.application("x-ers"));
148 types.add(MediaType.application("x-l1b"));
149 types.add(MediaType.image("fit"));
150 types.add(MediaType.application("x-grib"));
151 types.add(MediaType.image("jp2"));
152 types.add(MediaType.application("x-rmf"));
153 types.add(MediaType.application("x-wcs"));
154 types.add(MediaType.application("x-wms"));
155 types.add(MediaType.application("x-msgn"));
158 types.add(MediaType.application("x-rst"));
159 types.add(MediaType.application("x-ingr"));
160 types.add(MediaType.application("x-gsag"));
161 types.add(MediaType.application("x-gsbg"));
162 types.add(MediaType.application("x-gs7bg"));
163 types.add(MediaType.application("x-cosar"));
164 types.add(MediaType.application("x-tsx"));
165 types.add(MediaType.application("x-coasp"));
166 types.add(MediaType.application("x-r"));
167 types.add(MediaType.application("x-map"));
168 types.add(MediaType.application("x-pnm"));
169 types.add(MediaType.application("x-doq1"));
170 types.add(MediaType.application("x-doq2"));
171 types.add(MediaType.application("x-envi"));
172 types.add(MediaType.application("x-envi-hdr"));
173 types.add(MediaType.application("x-generic-bin"));
174 types.add(MediaType.application("x-p-aux"));
175 types.add(MediaType.image("x-mff"));
176 types.add(MediaType.image("x-mff2"));
177 types.add(MediaType.image("x-fujibas"));
178 types.add(MediaType.application("x-gsc"));
179 types.add(MediaType.application("x-fast"));
180 types.add(MediaType.application("x-bt"));
181 types.add(MediaType.application("x-lan"));
182 types.add(MediaType.application("x-cpg"));
183 types.add(MediaType.image("ida"));
184 types.add(MediaType.application("x-ndf"));
185 types.add(MediaType.image("eir"));
186 types.add(MediaType.application("x-dipex"));
187 types.add(MediaType.application("x-lcp"));
188 types.add(MediaType.application("x-gtx"));
189 types.add(MediaType.application("x-los-las"));
190 types.add(MediaType.application("x-ntv2"));
191 types.add(MediaType.application("x-ctable2"));
192 types.add(MediaType.application("x-ace2"));
193 types.add(MediaType.application("x-snodas"));
194 types.add(MediaType.application("x-kro"));
195 types.add(MediaType.image("arg"));
196 types.add(MediaType.application("x-rik"));
197 types.add(MediaType.application("x-usgs-dem"));
198 types.add(MediaType.application("x-gxf"));
199 types.add(MediaType.application("x-dods"));
200 types.add(MediaType.application("x-http"));
201 types.add(MediaType.application("x-bag"));
202 types.add(MediaType.application("x-hdf"));
203 types.add(MediaType.image("x-hdf5-image"));
204 types.add(MediaType.application("x-nwt-grd"));
205 types.add(MediaType.application("x-nwt-grc"));
206 types.add(MediaType.image("adrg"));
207 types.add(MediaType.image("x-srp"));
208 types.add(MediaType.application("x-blx"));
209 types.add(MediaType.application("x-rasterlite"));
210 types.add(MediaType.application("x-epsilon"));
211 types.add(MediaType.application("x-sdat"));
212 types.add(MediaType.application("x-kml"));
213 types.add(MediaType.application("x-xyz"));
214 types.add(MediaType.application("x-geo-pdf"));
215 types.add(MediaType.image("x-ozi"));
216 types.add(MediaType.application("x-ctg"));
217 types.add(MediaType.application("x-e00-grid"));
218 types.add(MediaType.application("x-zmap"));
219 types.add(MediaType.application("x-webp"));
220 types.add(MediaType.application("x-ngs-geoid"));
221 types.add(MediaType.application("x-mbtiles"));
222 types.add(MediaType.application("x-ppi"));
223 types.add(MediaType.application("x-cappi"));
224 return types;
225 }
226
227 @Override
228 public void parse(InputStream stream, ContentHandler handler,
229 Metadata metadata, ParseContext context) throws IOException,
230 SAXException, TikaException {
231
232 if (!ExternalParser.check("gdalinfo")) {
233 return;
234 }
235
236 // first set up and run GDAL
237 // process the command
238 TemporaryResources tmp = new TemporaryResources();
239 TikaInputStream tis = TikaInputStream.get(stream, tmp);
240
241 String runCommand = processCommand(tis);
242 String output = execCommand(new String[]{runCommand});
243
244 // now extract the actual metadata params
245 // from the GDAL output in the content stream.
246 // To do this we have to process the invoked command's
247 // output ourselves, because ExternalParser cannot read
248 // metadata and emit text through the handler at the
249 // same time. For now, some of ExternalParser's
250 // functionality has been brought directly into this
251 // class.
252 // TODO: investigate a way to do both using ExternalParser
253
254 extractMetFromOutput(output, metadata);
255 applyPatternsToOutput(output, metadata, getPatterns());
256
257 // make the content handler and provide output there
258 // now that we have metadata
259 processOutput(handler, metadata, output);
260 }
261
262 private Map<Pattern, String> getPatterns() {
263 Map<Pattern, String> patterns = new HashMap<Pattern, String>();
264 this.addPatternWithColon("Driver", patterns);
265 this.addPatternWithColon("Files", patterns);
266 this.addPatternWithIs("Size", patterns);
267 this.addPatternWithIs("Coordinate System", patterns);
268 this.addBoundingBoxPattern("Upper Left", patterns);
269 this.addBoundingBoxPattern("Lower Left", patterns);
270 this.addBoundingBoxPattern("Upper Right", patterns);
271 this.addBoundingBoxPattern("Lower Right", patterns);
272 return patterns;
273 }
274
275 private void addPatternWithColon(String name, Map<Pattern, String> patterns) {
276 patterns.put(
277 Pattern.compile(name + "\\:\\s*([A-Za-z0-9/ _\\-\\.]+)\\s*"),
278 name);
279 }
280
281 private void addPatternWithIs(String name, Map<Pattern, String> patterns) {
282 patterns.put(Pattern.compile(name + " is ([A-Za-z0-9\\.,\\s`']+)"),
283 name);
284 }
285
286 private void addBoundingBoxPattern(String name,
287 Map<Pattern, String> patterns) {
288 patterns.put(
289 Pattern.compile(name
290 + "\\s*\\(\\s*([0-9]+\\.[0-9]+\\s*,\\s*[0-9]+\\.[0-9]+\\s*)\\)\\s*"),
291 name);
292 }
293
294 private void extractMetFromOutput(String output, Metadata met) {
295 Scanner scanner = new Scanner(output);
296 String currentKey = null;
297 String[] headings = {"Subdatasets", "Corner Coordinates"};
298 StringBuilder metVal = new StringBuilder();
299 while (scanner.hasNextLine()) {
300 String line = scanner.nextLine();
301 if (line.contains("=") || hasHeadings(line, headings)) {
302 if (currentKey != null) {
303 // time to flush this key and met val
304 met.add(currentKey, metVal.toString());
305 }
306 metVal.setLength(0);
307
308 String[] lineToks = line.split("=");
309 currentKey = lineToks[0].trim();
310 if (lineToks.length == 2) {
311 metVal.append(lineToks[1]);
312 }
315 } else {
316 metVal.append(line);
317 }
318
319 }
320 }
321
322 private boolean hasHeadings(String line, String[] headings) {
323 if (headings != null) {
324 for (String heading : headings) {
325 if (line.contains(heading)) {
326 return true;
327 }
328 }
329 }
330 return false;
331 }
332
333 private void applyPatternsToOutput(String output, Metadata metadata,
334 Map<Pattern, String> metadataPatterns) {
335 Scanner scanner = new Scanner(output);
336 while (scanner.hasNextLine()) {
337 String line = scanner.nextLine();
338 for (Pattern p : metadataPatterns.keySet()) {
339 Matcher m = p.matcher(line);
340 if (m.find()) {
341 if (metadataPatterns.get(p) != null
342 && !metadataPatterns.get(p).equals("")) {
343 metadata.add(metadataPatterns.get(p), m.group(1));
344 } else {
345 metadata.add(m.group(1), m.group(2));
346 }
347 }
348 }
349 }
350
351 }
352
353 private String execCommand(String[] cmd) throws IOException {
354 // Execute
355 Process process;
356 String output = null;
357 if (cmd.length == 1) {
358 process = Runtime.getRuntime().exec(cmd[0]);
359 } else {
360 process = Runtime.getRuntime().exec(cmd);
361 }
362
363 try {
364 InputStream out = process.getInputStream();
365
366 try {
367 output = extractOutput(out);
368 } catch (Exception e) {
369 e.printStackTrace();
370 output = "";
371 }
372
373 } finally {
374 try {
375 process.waitFor();
376 } catch (InterruptedException ignore) {
377 }
378 }
379 return output;
380
381 }
382
383 private String extractOutput(InputStream stream) throws SAXException,
384 IOException {
385 StringBuilder sb = new StringBuilder();
386 Reader reader = new InputStreamReader(stream, IOUtils.UTF_8);
387 try {
388 char[] buffer = new char[1024];
389 for (int n = reader.read(buffer); n != -1; n = reader.read(buffer)) {
390 sb.append(buffer, 0, n);
391 }
392 } finally {
393 reader.close();
394 }
395 return sb.toString();
396 }
397
398 private void processOutput(ContentHandler handler, Metadata metadata,
399 String output) throws SAXException, IOException {
400 XHTMLContentHandler xhtml = new XHTMLContentHandler(handler, metadata);
401 InputStream stream = new ByteArrayInputStream(output.getBytes(IOUtils.UTF_8));
402 Reader reader = new InputStreamReader(stream, IOUtils.UTF_8);
403 try {
404 xhtml.startDocument();
405 xhtml.startElement("p");
406 char[] buffer = new char[1024];
407 for (int n = reader.read(buffer); n != -1; n = reader.read(buffer)) {
408 xhtml.characters(buffer, 0, n);
409 }
410 xhtml.endElement("p");
411
412 } finally {
413 reader.close();
414 xhtml.endDocument();
415 }
416
417 }
418
419 }
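The `extractMetFromOutput` method above splits gdalinfo's `KEY=VALUE` lines into metadata entries. The core splitting logic can be sketched standalone; the sample output strings below are hypothetical, and a plain `Map` stands in for Tika's `Metadata`:

```java
import java.util.LinkedHashMap;
import java.util.Map;
import java.util.Scanner;

public class GdalinfoKeyValueSketch {

    // Splits each "KEY=VALUE" line the way extractMetFromOutput does;
    // in the real parser, lines without '=' are appended to the
    // previous key's value instead of being dropped.
    static Map<String, String> parseKeyValues(String output) {
        Map<String, String> met = new LinkedHashMap<String, String>();
        Scanner scanner = new Scanner(output);
        while (scanner.hasNextLine()) {
            String line = scanner.nextLine();
            String[] toks = line.split("=");
            if (toks.length == 2) {
                met.put(toks[0].trim(), toks[1].trim());
            }
        }
        scanner.close();
        return met;
    }

    public static void main(String[] args) {
        String sample = "AREA_OR_POINT=Area\nTIFFTAG_RESOLUTIONUNIT=2\n";
        // prints {AREA_OR_POINT=Area, TIFFTAG_RESOLUTIONUNIT=2}
        System.out.println(parseKeyValues(sample));
    }
}
```

The colon-separated fields ("Driver: ...", "Files: ...") are handled separately by the regex patterns returned from `getPatterns()`.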
0 /*
1 * Licensed to the Apache Software Foundation (ASF) under one or more
2 * contributor license agreements. See the NOTICE file distributed with
3 * this work for additional information regarding copyright ownership.
4 * The ASF licenses this file to You under the Apache License, Version 2.0
5 * (the "License"); you may not use this file except in compliance with
6 * the License. You may obtain a copy of the License at
7 *
8 * http://www.apache.org/licenses/LICENSE-2.0
9 *
10 * Unless required by applicable law or agreed to in writing, software
11 * distributed under the License is distributed on an "AS IS" BASIS,
12 * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
13 * See the License for the specific language governing permissions and
14 * limitations under the License.
15 */
16
17 package org.apache.tika.parser.grib;
18
19 import java.io.IOException;
20 import java.io.InputStream;
21 import java.io.File;
22 import java.util.Collections;
23 import java.util.Set;
24 import org.apache.tika.exception.TikaException;
25 import org.apache.tika.io.TemporaryResources;
26 import org.apache.tika.io.TikaInputStream;
27 import org.apache.tika.metadata.Metadata;
28 import org.apache.tika.metadata.Property;
29 import org.apache.tika.metadata.TikaCoreProperties;
30 import org.apache.tika.mime.MediaType;
31 import org.apache.tika.parser.AbstractParser;
32 import org.apache.tika.parser.ParseContext;
33 import org.apache.tika.sax.XHTMLContentHandler;
34 import org.xml.sax.ContentHandler;
35 import org.xml.sax.SAXException;
36 import ucar.nc2.Attribute;
37 import ucar.nc2.Dimension;
38 import ucar.nc2.NetcdfFile;
39 import ucar.nc2.Variable;
40 import ucar.nc2.dataset.NetcdfDataset;
41
42 public class GribParser extends AbstractParser {
43
44 private static final long serialVersionUID = 7855458954474247655L;
45
46 public static final String GRIB_MIME_TYPE = "application/x-grib2";
47
48 private final Set<MediaType> SUPPORTED_TYPES =
49 Collections.singleton(MediaType.application("x-grib2"));
50
51 public Set<MediaType> getSupportedTypes(ParseContext context) {
52 return SUPPORTED_TYPES;
53 }
54
55 public void parse(InputStream stream, ContentHandler handler,
56 Metadata metadata, ParseContext context) throws IOException,
57 SAXException, TikaException {
58
59 //Set MIME type as grib2
60 metadata.set(Metadata.CONTENT_TYPE, GRIB_MIME_TYPE);
61
62 TikaInputStream tis = TikaInputStream.get(stream, new TemporaryResources());
63 File gribFile = tis.getFile();
64
65 try {
66 NetcdfFile ncFile = NetcdfDataset.openFile(gribFile.getAbsolutePath(), null);
67
68 // first parse out the set of global attributes
69 for (Attribute attr : ncFile.getGlobalAttributes()) {
70 Property property = resolveMetadataKey(attr.getFullName());
71 if (attr.getDataType().isString()) {
72 metadata.add(property, attr.getStringValue());
73 } else if (attr.getDataType().isNumeric()) {
74 int value = attr.getNumericValue().intValue();
75 metadata.add(property, String.valueOf(value));
76 }
77 }
78
79 XHTMLContentHandler xhtml = new XHTMLContentHandler(handler, metadata);
80
81 xhtml.startDocument();
82
83 xhtml.newline();
84 xhtml.startElement("ul");
85 xhtml.characters("dimensions:");
86 xhtml.newline();
87
88 for (Dimension dim : ncFile.getDimensions()){
89 xhtml.element("li", dim.getFullName() + "=" + String.valueOf(dim.getLength()) + ";");
90 xhtml.newline();
91 }
92
93 xhtml.startElement("ul");
94 xhtml.characters("variables:");
95 xhtml.newline();
96
97 for (Variable var : ncFile.getVariables()){
98 xhtml.element("p", String.valueOf(var.getDataType()) + var.getNameAndDimensions() + ";");
99 for(Attribute element : var.getAttributes()){
100 xhtml.element("li", " :" + element + ";");
101 xhtml.newline();
102 }
103 }
104 xhtml.endElement("ul");
105 xhtml.endElement("ul");
106 xhtml.endDocument();
107
108 } catch (IOException e) {
109 throw new TikaException("NetCDF parse error", e);
110 }
111 }
112
113 private Property resolveMetadataKey(String localName) {
114 if ("title".equals(localName)) {
115 return TikaCoreProperties.TITLE;
116 }
117 return Property.internalText(localName);
118 }
119
120 }
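The global-attribute loop in `GribParser.parse` copies string attributes verbatim and truncates numeric ones to an `int` before stringifying them. A self-contained sketch of that branching follows; the nested `Attr` class is a hypothetical stand-in for UCAR's `Attribute`, not the real API:

```java
import java.util.LinkedHashMap;
import java.util.Map;

public class GribAttributeSketch {

    // Hypothetical stand-in for ucar.nc2.Attribute: just a name/value pair.
    static class Attr {
        final String name;
        final Object value;
        Attr(String name, Object value) { this.name = name; this.value = value; }
    }

    // Mirrors the GribParser loop: strings are copied as-is,
    // numeric values are truncated to int and stringified.
    static Map<String, String> collect(Attr[] attrs) {
        Map<String, String> met = new LinkedHashMap<String, String>();
        for (Attr a : attrs) {
            if (a.value instanceof String) {
                met.put(a.name, (String) a.value);
            } else if (a.value instanceof Number) {
                met.put(a.name, String.valueOf(((Number) a.value).intValue()));
            }
        }
        return met;
    }

    public static void main(String[] args) {
        Attr[] attrs = { new Attr("title", "sample"), new Attr("version", 2.0) };
        // prints {title=sample, version=2}
        System.out.println(collect(attrs));
    }
}
```

Note that the truncation means a fractional attribute value such as 2.5 is recorded as "2"; the real parser accepts this loss for non-string attributes.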
100100 group = ncFile.getRootGroup();
101101 }
102102
103 // get file type
104 met.set("File-Type-Description", ncFile.getFileTypeDescription());
103105 // unravel its string attrs
104106 for (Attribute attribute : group.getAttributes()) {
105107 if (attribute.isString()) {
106 met.add(attribute.getName(), attribute.getStringValue());
108 met.add(attribute.getFullName(), attribute.getStringValue());
107109 } else {
108110 // try and cast its value to a string
109 met.add(attribute.getName(), String.valueOf(attribute
111 met.add(attribute.getFullName(), String.valueOf(attribute
110112 .getNumericValue()));
111113 }
112114 }
1919 import java.util.ArrayList;
2020 import java.util.BitSet;
2121 import java.util.List;
22 import java.util.Locale;
2223
2324 import org.apache.tika.metadata.Metadata;
2425 import org.apache.tika.sax.WriteOutContentHandler;
8283
8384 @Override
8485 public String toString() {
85 return String.format("<%s> of type %s", localName, elementType);
86 };
86 return String.format(Locale.ROOT, "<%s> of type %s", localName, elementType);
87 }
8788
8889 public String getUri() {
8990 return uri;
159159 metadata.set("ICBM", value);
160160 }
161161 } else if (name.equalsIgnoreCase(Metadata.CONTENT_TYPE)){
162 //don't overwrite Metadata.CONTENT_TYPE!
162163 MediaType type = MediaType.parse(value);
163164 if (type != null) {
164 metadata.set(Metadata.CONTENT_TYPE, type.toString());
165 } else {
166 metadata.set(Metadata.CONTENT_TYPE, value);
165 metadata.set(TikaCoreProperties.CONTENT_TYPE_HINT, type.toString());
166 } else {
167 metadata.set(TikaCoreProperties.CONTENT_TYPE_HINT, value);
167168 }
168169 } else {
169 metadata.set(name, value);
170 metadata.add(name, value);
170171 }
171172 }
172173
4646 /** Serial version UID */
4747 private static final long serialVersionUID = 7895315240498733128L;
4848
49 private static final MediaType XHTML = MediaType.application("xhtml+xml");
50 private static final MediaType WAP_XHTML = MediaType.application("vnd.wap.xhtml+xml");
51 private static final MediaType X_ASP = MediaType.application("x-asp");
52
4953 private static final Set<MediaType> SUPPORTED_TYPES =
5054 Collections.unmodifiableSet(new HashSet<MediaType>(Arrays.asList(
5155 MediaType.text("html"),
52 MediaType.application("xhtml+xml"),
53 MediaType.application("vnd.wap.xhtml+xml"),
54 MediaType.application("x-asp"))));
56 XHTML,
57 WAP_XHTML,
58 X_ASP)));
5559
5660 private static final ServiceLoader LOADER =
5761 new ServiceLoader(HtmlParser.class.getClassLoader());
6064 * HTML schema singleton used to amortise the heavy instantiation time.
6165 */
6266 private static final Schema HTML_SCHEMA = new HTMLSchema();
67
6368
6469 public Set<MediaType> getSupportedTypes(ParseContext context) {
6570 return SUPPORTED_TYPES;
7681 try {
7782 Charset charset = reader.getCharset();
7883 String previous = metadata.get(Metadata.CONTENT_TYPE);
84 MediaType contentType = null;
7985 if (previous == null || previous.startsWith("text/html")) {
80 MediaType type = new MediaType(MediaType.TEXT_HTML, charset);
81 metadata.set(Metadata.CONTENT_TYPE, type.toString());
86 contentType = new MediaType(MediaType.TEXT_HTML, charset);
87 } else if (previous.startsWith("application/xhtml+xml")) {
88 contentType = new MediaType(XHTML, charset);
89 } else if (previous.startsWith("application/vnd.wap.xhtml+xml")) {
90 contentType = new MediaType(WAP_XHTML, charset);
91 } else if (previous.startsWith("application/x-asp")) {
92 contentType = new MediaType(X_ASP, charset);
93 }
94 if (contentType != null) {
95 metadata.set(Metadata.CONTENT_TYPE, contentType.toString());
8296 }
8397 // deprecated, see TIKA-431
8498 metadata.set(Metadata.CONTENT_ENCODING, charset.name());
152166 * the HTML mapping. This method will be removed in Tika 1.0.
153167 **/
154168 public String mapSafeAttribute(String elementName, String attributeName) {
155 return DefaultHtmlMapper.INSTANCE.mapSafeAttribute(elementName,attributeName) ;
169 return DefaultHtmlMapper.INSTANCE.mapSafeAttribute(elementName, attributeName) ;
156170 }
157171
158172 /**
0 /*
1 * Licensed to the Apache Software Foundation (ASF) under one or more
2 * contributor license agreements. See the NOTICE file distributed with
3 * this work for additional information regarding copyright ownership.
4 * The ASF licenses this file to You under the Apache License, Version 2.0
5 * (the "License"); you may not use this file except in compliance with
6 * the License. You may obtain a copy of the License at
7 *
8 * http://www.apache.org/licenses/LICENSE-2.0
9 *
10 * Unless required by applicable law or agreed to in writing, software
11 * distributed under the License is distributed on an "AS IS" BASIS,
12 * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
13 * See the License for the specific language governing permissions and
14 * limitations under the License.
15 */
16 package org.apache.tika.parser.image;
17
18 import java.io.IOException;
19 import java.io.InputStream;
20 import java.util.Arrays;
21 import java.util.Collections;
22 import java.util.HashSet;
23 import java.util.Set;
24
25 import org.apache.poi.util.IOUtils;
26 import org.apache.tika.exception.TikaException;
27 import org.apache.tika.io.EndianUtils;
28 import org.apache.tika.metadata.Metadata;
29 import org.apache.tika.metadata.Photoshop;
30 import org.apache.tika.metadata.TIFF;
31 import org.apache.tika.mime.MediaType;
32 import org.apache.tika.parser.AbstractParser;
33 import org.apache.tika.parser.ParseContext;
34 import org.apache.tika.sax.XHTMLContentHandler;
35 import org.xml.sax.ContentHandler;
36 import org.xml.sax.SAXException;
37
38 /**
39 * Parser for the Better Portable Graphics (BPG) File Format.
40 *
41 * Documentation on the file format is available from
42 * http://bellard.org/bpg/bpg_spec.txt
43 */
44 public class BPGParser extends AbstractParser {
45 private static final long serialVersionUID = -161736541253892772L;
46
47 private static final Set<MediaType> SUPPORTED_TYPES =
48 Collections.unmodifiableSet(new HashSet<MediaType>(Arrays.asList(
49 MediaType.image("x-bpg"), MediaType.image("bpg"))));
50
51 public Set<MediaType> getSupportedTypes(ParseContext context) {
52 return SUPPORTED_TYPES;
53 }
54
55 protected static final int EXTENSION_TAG_EXIF = 1;
56 protected static final int EXTENSION_TAG_ICC_PROFILE = 2;
57 protected static final int EXTENSION_TAG_XMP = 3;
58 protected static final int EXTENSION_TAG_THUMBNAIL = 4;
59
60 public void parse(
61 InputStream stream, ContentHandler handler,
62 Metadata metadata, ParseContext context)
63 throws IOException, SAXException, TikaException {
64 // Check for the magic header signature
65 byte[] signature = new byte[4];
66 IOUtils.readFully(stream, signature);
67 if(signature[0] == (byte)'B' && signature[1] == (byte)'P' &&
68 signature[2] == (byte)'G' && signature[3] == (byte)0xfb) {
69 // Good, signature found
70 } else {
71 throw new TikaException("BPG magic signature invalid");
72 }
73
74 // Grab and decode the first byte
75 int pdf = stream.read();
76
77 // Pixel format: Greyscale / 4:2:0 / 4:2:2 / 4:4:4
78 int pixelFormat = pdf & 0x7;
79 // TODO Identify a suitable metadata key for this
80
81 // Is there an alpha plane as well as a colour plane?
82 boolean hasAlphaPlane1 = (pdf & 0x8) == 0x8;
83 // TODO Identify a suitable metadata key for this+hasAlphaPlane2
84
85 // Bit depth minus 8
86 int bitDepth = (pdf >> 4) + 8;
87 metadata.set(TIFF.BITS_PER_SAMPLE, Integer.toString(bitDepth));
88
89 // Grab and decode the second byte
90 int cer = stream.read();
91
92 // Colour Space: YCbCr / RGB / YCgCo / YCbCrK / CMYK
93 int colourSpace = cer & 0x15;
94 switch (colourSpace) {
95 case 0:
96 metadata.set(Photoshop.COLOR_MODE, "YCbCr Colour");
97 break;
98 case 1:
99 metadata.set(Photoshop.COLOR_MODE, "RGB Colour");
100 break;
101 case 2:
102 metadata.set(Photoshop.COLOR_MODE, "YCgCo Colour");
103 break;
104 case 3:
105 metadata.set(Photoshop.COLOR_MODE, "YCbCrK Colour");
106 break;
107 case 4:
108 metadata.set(Photoshop.COLOR_MODE, "CMYK Colour");
109 break;
110 }
111
112 // Are there extensions or not?
113 boolean hasExtensions = (cer & 16) == 16;
114
115 // Is the Alpha Plane 2 flag set?
116 boolean hasAlphaPlane2 = (cer & 32) == 32;
117
118 // cer then holds 2 more booleans - limited range, reserved
119
120 // Width and height next
121 int width = (int)EndianUtils.readUE7(stream);
122 int height = (int)EndianUtils.readUE7(stream);
123 metadata.set(TIFF.IMAGE_LENGTH, height);
124 metadata.set(TIFF.IMAGE_WIDTH, width);
125
126 // Picture Data length
127 EndianUtils.readUE7(stream);
128
129 // Extension Data Length, if extensions present
130 long extensionDataLength = 0;
131 if (hasExtensions)
132 extensionDataLength = EndianUtils.readUE7(stream);
133
134 // Alpha Data Length, if alpha used
135 long alphaDataLength = 0;
136 if (hasAlphaPlane1 || hasAlphaPlane2)
137 alphaDataLength = EndianUtils.readUE7(stream);
138
139 // Extension Data
140 if (hasExtensions) {
141 long extensionsDataSeen = 0;
142 ImageMetadataExtractor metadataExtractor =
143 new ImageMetadataExtractor(metadata);
144
145 while (extensionsDataSeen < extensionDataLength) {
146 int extensionType = (int)EndianUtils.readUE7(stream);
147 int extensionLength = (int)EndianUtils.readUE7(stream);
148 switch (extensionType) {
149 case EXTENSION_TAG_EXIF:
150 metadataExtractor.parseRawExif(stream, extensionLength, true);
151 break;
152 case EXTENSION_TAG_XMP:
153 handleXMP(stream, extensionLength, metadataExtractor);
154 break;
155 default:
156 stream.skip(extensionLength);
157 }
158 extensionsDataSeen += extensionLength;
159 }
160 }
161
162 // HEVC Header + Data
163 // Alpha HEVC Header + Data
164 // We can't do anything with these parts
165
166 // We don't have any helpful text, sorry...
167 XHTMLContentHandler xhtml = new XHTMLContentHandler(handler, metadata);
168 xhtml.startDocument();
169 xhtml.endDocument();
170 }
171
172 protected void handleXMP(InputStream stream, int xmpLength,
173 ImageMetadataExtractor extractor) throws IOException, TikaException, SAXException {
174 byte[] xmp = new byte[xmpLength];
175 IOUtils.readFully(stream, xmp);
176 extractor.parseRawXMP(xmp);
177 }
178 }
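The first header byte after the magic packs three fields, as decoded above: the low three bits select the pixel format, bit 3 flags an alpha plane, and the high nibble stores the bit depth minus 8. A minimal sketch of that decoding, using a hypothetical sample byte:

```java
public class BpgHeaderByteSketch {

    // Mirrors BPGParser's decoding of the byte read after the
    // 'B','P','G',0xfb signature (the sample value is made up).
    public static void main(String[] args) {
        int pdf = 0x48;                              // 0b0100_1000
        int pixelFormat = pdf & 0x7;                 // low 3 bits -> 0 (greyscale)
        boolean hasAlphaPlane = (pdf & 0x8) == 0x8;  // bit 3 -> true
        int bitDepth = (pdf >> 4) + 8;               // high nibble + 8 -> 12
        System.out.println(pixelFormat + " " + hasAlphaPlane + " " + bitDepth);
    }
}
```

The second header byte is unpacked the same way for the colour space, extension flag, and second alpha flag.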
1717
1818 import java.io.File;
1919 import java.io.IOException;
20 import java.io.InputStream;
2021 import java.text.DecimalFormat;
2122 import java.text.DecimalFormatSymbols;
2223 import java.text.SimpleDateFormat;
2627 import java.util.regex.Matcher;
2728 import java.util.regex.Pattern;
2829
30 import com.drew.imaging.jpeg.JpegMetadataReader;
31 import com.drew.imaging.jpeg.JpegProcessingException;
32 import com.drew.imaging.riff.RiffProcessingException;
33 import com.drew.imaging.tiff.TiffMetadataReader;
34 import com.drew.imaging.tiff.TiffProcessingException;
35 import com.drew.imaging.webp.WebpMetadataReader;
36 import com.drew.lang.ByteArrayReader;
37 import com.drew.lang.GeoLocation;
38 import com.drew.lang.Rational;
39 import com.drew.metadata.Directory;
40 import com.drew.metadata.MetadataException;
41 import com.drew.metadata.Tag;
42 import com.drew.metadata.exif.ExifIFD0Directory;
43 import com.drew.metadata.exif.ExifReader;
44 import com.drew.metadata.exif.ExifSubIFDDirectory;
45 import com.drew.metadata.exif.ExifThumbnailDirectory;
46 import com.drew.metadata.exif.GpsDirectory;
47 import com.drew.metadata.iptc.IptcDirectory;
48 import com.drew.metadata.jpeg.JpegCommentDirectory;
49 import com.drew.metadata.jpeg.JpegDirectory;
50 import com.drew.metadata.xmp.XmpReader;
51 import org.apache.poi.util.IOUtils;
2952 import org.apache.tika.exception.TikaException;
3053 import org.apache.tika.metadata.IPTC;
3154 import org.apache.tika.metadata.Metadata;
3356 import org.apache.tika.metadata.TikaCoreProperties;
3457 import org.xml.sax.SAXException;
3558
36 import com.drew.imaging.jpeg.JpegMetadataReader;
37 import com.drew.imaging.jpeg.JpegProcessingException;
38 import com.drew.imaging.tiff.TiffMetadataReader;
39 import com.drew.lang.GeoLocation;
40 import com.drew.lang.Rational;
41 import com.drew.metadata.Directory;
42 import com.drew.metadata.MetadataException;
43 import com.drew.metadata.Tag;
44 import com.drew.metadata.exif.ExifIFD0Directory;
45 import com.drew.metadata.exif.ExifSubIFDDirectory;
46 import com.drew.metadata.exif.ExifThumbnailDirectory;
47 import com.drew.metadata.exif.GpsDirectory;
48 import com.drew.metadata.iptc.IptcDirectory;
49 import com.drew.metadata.jpeg.JpegCommentDirectory;
50 import com.drew.metadata.jpeg.JpegDirectory;
51
5259 /**
5360 * Uses the <a href="http://www.drewnoakes.com/code/exif/">Metadata Extractor</a> library
5461 * to read EXIF and IPTC image metadata and map to Tika fields.
55 *
62 * <p/>
5663 * As of 2.4.0 the library supports jpeg and tiff.
64 * As of 2.8.0 the library supports webp.
5765 */
5866 public class ImageMetadataExtractor {
5967
6674 */
6775 public ImageMetadataExtractor(Metadata metadata) {
6876 this(metadata,
69 new CopyUnknownFieldsHandler(),
70 new JpegCommentHandler(),
71 new ExifHandler(),
72 new DimensionsHandler(),
73 new GeotagHandler(),
74 new IptcHandler()
77 new CopyUnknownFieldsHandler(),
78 new JpegCommentHandler(),
79 new ExifHandler(),
80 new DimensionsHandler(),
81 new GeotagHandler(),
82 new IptcHandler()
7583 );
7684 }
77
85
7886 /**
7987 * @param metadata to extract to
8088 * @param handlers handlers in order, note that handlers may override values from earlier handlers
103111 handle(tiffMetadata);
104112 } catch (MetadataException e) {
105113 throw new TikaException("Can't read TIFF metadata", e);
114 } catch (TiffProcessingException e) {
115 throw new TikaException("Can't read TIFF metadata", e);
116 }
117 }
118
119 public void parseWebP(File file) throws IOException, TikaException {
120
121 try {
122 com.drew.metadata.Metadata webPMetadata =
123 WebpMetadataReader.readMetadata(file);
124 handle(webPMetadata);
127 } catch (RiffProcessingException e) {
128 throw new TikaException("Can't process Riff data", e);
129 } catch (MetadataException e) {
130 throw new TikaException("Can't process Riff data", e);
131 }
132 }
133
134 public void parseRawExif(InputStream stream, int length, boolean needsExifHeader)
135 throws IOException, SAXException, TikaException {
136 byte[] exif;
137 if (needsExifHeader) {
138 exif = new byte[length + 6];
139 exif[0] = (byte) 'E';
140 exif[1] = (byte) 'x';
141 exif[2] = (byte) 'i';
142 exif[3] = (byte) 'f';
143 IOUtils.readFully(stream, exif, 6, length);
144 } else {
145 exif = new byte[length];
146 IOUtils.readFully(stream, exif, 0, length);
147 }
148 parseRawExif(exif);
149 }
150
151 public void parseRawExif(byte[] exifData)
152 throws IOException, SAXException, TikaException {
153 com.drew.metadata.Metadata metadata = new com.drew.metadata.Metadata();
154 ExifReader reader = new ExifReader();
155 reader.extract(new ByteArrayReader(exifData), metadata, ExifReader.JPEG_SEGMENT_PREAMBLE.length());
156
157 try {
158 handle(metadata);
159 } catch (MetadataException e) {
160 throw new TikaException("Can't process the EXIF Data", e);
161 }
162 }
163
164 public void parseRawXMP(byte[] xmpData)
165 throws IOException, SAXException, TikaException {
166 com.drew.metadata.Metadata metadata = new com.drew.metadata.Metadata();
167 XmpReader reader = new XmpReader();
168 reader.extract(xmpData, metadata);
169
170 try {
171 handle(metadata);
172 } catch (MetadataException e) {
173 throw new TikaException("Can't process the XMP Data", e);
106174 }
107175 }
108176
109177 /**
110178 * Copies extracted tags to tika metadata using registered handlers.
179 *
111180 * @param metadataExtractor Tag directories from a Metadata Extractor "reader"
112181 * @throws MetadataException This method does not handle exceptions from Metadata Extractor
113182 */
114 protected void handle(com.drew.metadata.Metadata metadataExtractor)
183 protected void handle(com.drew.metadata.Metadata metadataExtractor)
115184 throws MetadataException {
116185 handle(metadataExtractor.getDirectories().iterator());
117186 }
118187
119188 /**
120189 * Copies extracted tags to tika metadata using registered handlers.
190 *
121191 * @param directories Metadata Extractor {@link com.drew.metadata.Directory} instances.
122192 * @throws MetadataException This method does not handle exceptions from Metadata Extractor
123 */
193 */
124194 protected void handle(Iterator<Directory> directories) throws MetadataException {
125195 while (directories.hasNext()) {
126196 Directory directory = directories.next();
132202 }
133203 }
134204
205 private static String trimPixels(String s) {
206 //if height/width appears as "100 pixels", trim " pixels"
207 if (s != null) {
208 int i = s.lastIndexOf(" pixels");
209 if (i > -1) s = s.substring(0, i);
210 }
211 return s;
212 }
213
135214 /**
136215 * Reads one or more type of Metadata Extractor fields.
137216 */
141220 * @return true if the directory type is supported by this handler
142221 */
143222 boolean supports(Class<? extends Directory> directoryType);
223
144224 /**
145225 * @param directory extracted tags
146 * @param metadata current tika metadata
226 * @param metadata current tika metadata
147227 * @throws MetadataException typically field extraction error, aborts all further extraction
148228 */
149 void handle(Directory directory, Metadata metadata)
229 void handle(Directory directory, Metadata metadata)
150230 throws MetadataException;
151231 }
152232
158238 public boolean supports(Class<? extends Directory> directoryType) {
159239 return true;
160240 }
241
161242 public void handle(Directory directory, Metadata metadata)
162243 throws MetadataException {
163244 if (directory.getTags() != null) {
166247 }
167248 }
168249 }
169 }
170
250 }
251
171252 /**
172253 * Copies all fields regardless of directory, if the tag name
173254 * is not identical to a known Metadata field name.
177258 public boolean supports(Class<? extends Directory> directoryType) {
178259 return true;
179260 }
261
180262 public void handle(Directory directory, Metadata metadata)
181263 throws MetadataException {
182264 if (directory.getTags() != null) {
183265 for (Tag tag : directory.getTags()) {
184266 String name = tag.getTagName();
185267 if (!MetadataFields.isMetadataField(name) && tag.getDescription() != null) {
186 String value = tag.getDescription().trim();
187 if (Boolean.TRUE.toString().equalsIgnoreCase(value)) {
188 value = Boolean.TRUE.toString();
189 } else if (Boolean.FALSE.toString().equalsIgnoreCase(value)) {
190 value = Boolean.FALSE.toString();
191 }
192 metadata.set(name, value);
268 String value = tag.getDescription().trim();
269 if (Boolean.TRUE.toString().equalsIgnoreCase(value)) {
270 value = Boolean.TRUE.toString();
271 } else if (Boolean.FALSE.toString().equalsIgnoreCase(value)) {
272 value = Boolean.FALSE.toString();
273 }
274 metadata.set(name, value);
193275 }
194276 }
195277 }
196278 }
197279 }
198
280
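The boolean canonicalization performed by `CopyUnknownFieldsHandler` above can be sketched on its own; `BooleanNormalizeDemo` is a hypothetical name used only for this illustration:

```java
public class BooleanNormalizeDemo {
    // Mirrors CopyUnknownFieldsHandler: canonicalize "True"/"FALSE" etc. to
    // the lowercase forms of Boolean.TRUE/Boolean.FALSE, leaving other values alone.
    static String normalize(String value) {
        value = value.trim();
        if (Boolean.TRUE.toString().equalsIgnoreCase(value)) {
            return Boolean.TRUE.toString();
        } else if (Boolean.FALSE.toString().equalsIgnoreCase(value)) {
            return Boolean.FALSE.toString();
        }
        return value;
    }

    public static void main(String[] args) {
        System.out.println(normalize(" True ")); // true
        System.out.println(normalize("maybe"));  // maybe
    }
}
```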
199281 /**
200282 * Basic image properties for TIFF and JPEG, at least.
201283 */
202284 static class DimensionsHandler implements DirectoryHandler {
203285 private final Pattern LEADING_NUMBERS = Pattern.compile("(\\d+)\\s*.*");
204 public boolean supports(Class<? extends Directory> directoryType) {
205 return directoryType == JpegDirectory.class ||
206 directoryType == ExifSubIFDDirectory.class ||
207 directoryType == ExifThumbnailDirectory.class ||
208 directoryType == ExifIFD0Directory.class;
209 }
286
287 public boolean supports(Class<? extends Directory> directoryType) {
288 return directoryType == JpegDirectory.class ||
289 directoryType == ExifSubIFDDirectory.class ||
290 directoryType == ExifThumbnailDirectory.class ||
291 directoryType == ExifIFD0Directory.class;
292 }
293
210294 public void handle(Directory directory, Metadata metadata) throws MetadataException {
211295 // The test TIFF has width and height stored as follows according to exiv2
212296 //Exif.Image.ImageWidth Short 1 100
213297 //Exif.Image.ImageLength Short 1 75
214298 // and the values are found in "Thumbnail Image Width" (and Height) from Metadata Extractor
215 set(directory, metadata, ExifThumbnailDirectory.TAG_THUMBNAIL_IMAGE_WIDTH, Metadata.IMAGE_WIDTH);
216 set(directory, metadata, JpegDirectory.TAG_JPEG_IMAGE_WIDTH, Metadata.IMAGE_WIDTH);
217 set(directory, metadata, ExifThumbnailDirectory.TAG_THUMBNAIL_IMAGE_HEIGHT, Metadata.IMAGE_LENGTH);
218 set(directory, metadata, JpegDirectory.TAG_JPEG_IMAGE_HEIGHT, Metadata.IMAGE_LENGTH);
299 set(directory, metadata, JpegDirectory.TAG_IMAGE_WIDTH, Metadata.IMAGE_WIDTH);
300 set(directory, metadata, JpegDirectory.TAG_IMAGE_HEIGHT, Metadata.IMAGE_LENGTH);
219301 // Bits per sample, two methods of extracting, exif overrides jpeg
220 set(directory, metadata, JpegDirectory.TAG_JPEG_DATA_PRECISION, Metadata.BITS_PER_SAMPLE);
302 set(directory, metadata, JpegDirectory.TAG_DATA_PRECISION, Metadata.BITS_PER_SAMPLE);
221303 set(directory, metadata, ExifSubIFDDirectory.TAG_BITS_PER_SAMPLE, Metadata.BITS_PER_SAMPLE);
222304 // Straightforward
223305 set(directory, metadata, ExifSubIFDDirectory.TAG_SAMPLES_PER_PIXEL, Metadata.SAMPLES_PER_PIXEL);
224306 }
307
225308 private void set(Directory directory, Metadata metadata, int extractTag, Property metadataField) {
226309 if (directory.containsTag(extractTag)) {
227310 Matcher m = LEADING_NUMBERS.matcher(directory.getString(extractTag));
228 if(m.matches()) {
311 if (m.matches()) {
229312 metadata.set(metadataField, m.group(1));
230313 }
231314 }
232315 }
233316 }
234
317
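`DimensionsHandler` relies on the `LEADING_NUMBERS` pattern to pull the numeric prefix out of tag values like "100 pixels". A standalone sketch (class and method names are illustrative, the pattern is the one from the diff):

```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;

public class LeadingNumbersDemo {
    // Same pattern as DimensionsHandler: capture the leading run of digits.
    private static final Pattern LEADING_NUMBERS = Pattern.compile("(\\d+)\\s*.*");

    static String leadingNumber(String raw) {
        Matcher m = LEADING_NUMBERS.matcher(raw);
        return m.matches() ? m.group(1) : null;
    }

    public static void main(String[] args) {
        System.out.println(leadingNumber("100 pixels")); // 100
        System.out.println(leadingNumber("75"));         // 75
    }
}
```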
235318 static class JpegCommentHandler implements DirectoryHandler {
236319 public boolean supports(Class<? extends Directory> directoryType) {
237320 return directoryType == JpegCommentDirectory.class;
238321 }
322
239323 public void handle(Directory directory, Metadata metadata) throws MetadataException {
240 if (directory.containsTag(JpegCommentDirectory.TAG_JPEG_COMMENT)) {
241 metadata.add(TikaCoreProperties.COMMENTS, directory.getString(JpegCommentDirectory.TAG_JPEG_COMMENT));
242 }
243 }
244 }
245
324 if (directory.containsTag(JpegCommentDirectory.TAG_COMMENT)) {
325 metadata.add(TikaCoreProperties.COMMENTS, directory.getString(JpegCommentDirectory.TAG_COMMENT));
326 }
327 }
328 }
329
246330 static class ExifHandler implements DirectoryHandler {
247 private static final SimpleDateFormat DATE_UNSPECIFIED_TZ = new SimpleDateFormat("yyyy-MM-dd'T'HH:mm:ss");
248 public boolean supports(Class<? extends Directory> directoryType) {
249 return directoryType == ExifIFD0Directory.class ||
331 // SimpleDateFormat is not thread safe, so use a ThreadLocal to keep one instance per thread
332 private static final ThreadLocal<SimpleDateFormat> DATE_UNSPECIFIED_TZ = new ThreadLocal<SimpleDateFormat>() {
333 @Override
334 protected SimpleDateFormat initialValue() {
335 return new SimpleDateFormat("yyyy-MM-dd'T'HH:mm:ss", Locale.US);
336 }
337 };
338
339 public boolean supports(Class<? extends Directory> directoryType) {
340 return directoryType == ExifIFD0Directory.class ||
250341 directoryType == ExifSubIFDDirectory.class;
251342 }
343
252344 public void handle(Directory directory, Metadata metadata) {
253345 try {
254346 handleDateTags(directory, metadata);
258350 // ignore date parse errors and proceed with other tags
259351 }
260352 }
353
261354 /**
262355 * EXIF may contain image description, although with undefined encoding.
263356 * Use IPTC for other annotation fields, and XMP for unicode support.
265358 public void handleCommentTags(Directory directory, Metadata metadata) {
266359 if (metadata.get(TikaCoreProperties.DESCRIPTION) == null &&
267360 directory.containsTag(ExifIFD0Directory.TAG_IMAGE_DESCRIPTION)) {
268 metadata.set(TikaCoreProperties.DESCRIPTION,
361 metadata.set(TikaCoreProperties.DESCRIPTION,
269362 directory.getString(ExifIFD0Directory.TAG_IMAGE_DESCRIPTION));
270363 }
271364 }
365
272366 /**
273367 * Maps common TIFF and EXIF tags onto the Tika
274 * TIFF image metadata namespace.
275 */
368 * TIFF image metadata namespace.
369 */
276370 public void handlePhotoTags(Directory directory, Metadata metadata) {
277 if(directory.containsTag(ExifSubIFDDirectory.TAG_EXPOSURE_TIME)) {
278 Object exposure = directory.getObject(ExifSubIFDDirectory.TAG_EXPOSURE_TIME);
279 if(exposure instanceof Rational) {
280 metadata.set(Metadata.EXPOSURE_TIME, ((Rational)exposure).doubleValue());
281 } else {
282 metadata.set(Metadata.EXPOSURE_TIME, directory.getString(ExifSubIFDDirectory.TAG_EXPOSURE_TIME));
283 }
284 }
285
286 if(directory.containsTag(ExifSubIFDDirectory.TAG_FLASH)) {
287 String flash = directory.getDescription(ExifSubIFDDirectory.TAG_FLASH);
288 if(flash.contains("Flash fired")) {
289 metadata.set(Metadata.FLASH_FIRED, Boolean.TRUE.toString());
290 }
291 else if(flash.contains("Flash did not fire")) {
292 metadata.set(Metadata.FLASH_FIRED, Boolean.FALSE.toString());
293 }
294 else {
295 metadata.set(Metadata.FLASH_FIRED, flash);
296 }
297 }
298
299 if(directory.containsTag(ExifSubIFDDirectory.TAG_FNUMBER)) {
300 Object fnumber = directory.getObject(ExifSubIFDDirectory.TAG_FNUMBER);
301 if(fnumber instanceof Rational) {
302 metadata.set(Metadata.F_NUMBER, ((Rational)fnumber).doubleValue());
303 } else {
304 metadata.set(Metadata.F_NUMBER, directory.getString(ExifSubIFDDirectory.TAG_FNUMBER));
305 }
306 }
307
308 if(directory.containsTag(ExifSubIFDDirectory.TAG_FOCAL_LENGTH)) {
309 Object length = directory.getObject(ExifSubIFDDirectory.TAG_FOCAL_LENGTH);
310 if(length instanceof Rational) {
311 metadata.set(Metadata.FOCAL_LENGTH, ((Rational)length).doubleValue());
312 } else {
313 metadata.set(Metadata.FOCAL_LENGTH, directory.getString(ExifSubIFDDirectory.TAG_FOCAL_LENGTH));
314 }
315 }
316
317 if(directory.containsTag(ExifSubIFDDirectory.TAG_ISO_EQUIVALENT)) {
318 metadata.set(Metadata.ISO_SPEED_RATINGS, directory.getString(ExifSubIFDDirectory.TAG_ISO_EQUIVALENT));
319 }
320
321 if(directory.containsTag(ExifIFD0Directory.TAG_MAKE)) {
322 metadata.set(Metadata.EQUIPMENT_MAKE, directory.getString(ExifIFD0Directory.TAG_MAKE));
323 }
324 if(directory.containsTag(ExifIFD0Directory.TAG_MODEL)) {
325 metadata.set(Metadata.EQUIPMENT_MODEL, directory.getString(ExifIFD0Directory.TAG_MODEL));
326 }
327
328 if(directory.containsTag(ExifIFD0Directory.TAG_ORIENTATION)) {
329 Object length = directory.getObject(ExifIFD0Directory.TAG_ORIENTATION);
330 if(length instanceof Integer) {
331 metadata.set(Metadata.ORIENTATION, Integer.toString((Integer)length));
332 } else {
333 metadata.set(Metadata.ORIENTATION, directory.getString(ExifIFD0Directory.TAG_ORIENTATION));
334 }
335 }
336
337 if(directory.containsTag(ExifIFD0Directory.TAG_SOFTWARE)) {
338 metadata.set(Metadata.SOFTWARE, directory.getString(ExifIFD0Directory.TAG_SOFTWARE));
339 }
340
341 if(directory.containsTag(ExifIFD0Directory.TAG_X_RESOLUTION)) {
342 Object resolution = directory.getObject(ExifIFD0Directory.TAG_X_RESOLUTION);
343 if(resolution instanceof Rational) {
344 metadata.set(Metadata.RESOLUTION_HORIZONTAL, ((Rational)resolution).doubleValue());
345 } else {
346 metadata.set(Metadata.RESOLUTION_HORIZONTAL, directory.getString(ExifIFD0Directory.TAG_X_RESOLUTION));
347 }
348 }
349 if(directory.containsTag(ExifIFD0Directory.TAG_Y_RESOLUTION)) {
350 Object resolution = directory.getObject(ExifIFD0Directory.TAG_Y_RESOLUTION);
351 if(resolution instanceof Rational) {
352 metadata.set(Metadata.RESOLUTION_VERTICAL, ((Rational)resolution).doubleValue());
353 } else {
354 metadata.set(Metadata.RESOLUTION_VERTICAL, directory.getString(ExifIFD0Directory.TAG_Y_RESOLUTION));
355 }
356 }
357 if(directory.containsTag(ExifIFD0Directory.TAG_RESOLUTION_UNIT)) {
358 metadata.set(Metadata.RESOLUTION_UNIT, directory.getDescription(ExifIFD0Directory.TAG_RESOLUTION_UNIT));
359 }
360 if(directory.containsTag(ExifThumbnailDirectory.TAG_THUMBNAIL_IMAGE_WIDTH)) {
361 metadata.set(Metadata.IMAGE_WIDTH, directory.getDescription(ExifThumbnailDirectory.TAG_THUMBNAIL_IMAGE_WIDTH));
362 }
363 if(directory.containsTag(ExifThumbnailDirectory.TAG_THUMBNAIL_IMAGE_HEIGHT)) {
364 metadata.set(Metadata.IMAGE_LENGTH, directory.getDescription(ExifThumbnailDirectory.TAG_THUMBNAIL_IMAGE_HEIGHT));
365 }
366 }
371 if (directory.containsTag(ExifSubIFDDirectory.TAG_EXPOSURE_TIME)) {
372 Object exposure = directory.getObject(ExifSubIFDDirectory.TAG_EXPOSURE_TIME);
373 if (exposure instanceof Rational) {
374 metadata.set(Metadata.EXPOSURE_TIME, ((Rational) exposure).doubleValue());
375 } else {
376 metadata.set(Metadata.EXPOSURE_TIME, directory.getString(ExifSubIFDDirectory.TAG_EXPOSURE_TIME));
377 }
378 }
379
380 if (directory.containsTag(ExifSubIFDDirectory.TAG_FLASH)) {
381 String flash = directory.getDescription(ExifSubIFDDirectory.TAG_FLASH);
382 if (flash.contains("Flash fired")) {
383 metadata.set(Metadata.FLASH_FIRED, Boolean.TRUE.toString());
384 } else if (flash.contains("Flash did not fire")) {
385 metadata.set(Metadata.FLASH_FIRED, Boolean.FALSE.toString());
386 } else {
387 metadata.set(Metadata.FLASH_FIRED, flash);
388 }
389 }
390
391 if (directory.containsTag(ExifSubIFDDirectory.TAG_FNUMBER)) {
392 Object fnumber = directory.getObject(ExifSubIFDDirectory.TAG_FNUMBER);
393 if (fnumber instanceof Rational) {
394 metadata.set(Metadata.F_NUMBER, ((Rational) fnumber).doubleValue());
395 } else {
396 metadata.set(Metadata.F_NUMBER, directory.getString(ExifSubIFDDirectory.TAG_FNUMBER));
397 }
398 }
399
400 if (directory.containsTag(ExifSubIFDDirectory.TAG_FOCAL_LENGTH)) {
401 Object length = directory.getObject(ExifSubIFDDirectory.TAG_FOCAL_LENGTH);
402 if (length instanceof Rational) {
403 metadata.set(Metadata.FOCAL_LENGTH, ((Rational) length).doubleValue());
404 } else {
405 metadata.set(Metadata.FOCAL_LENGTH, directory.getString(ExifSubIFDDirectory.TAG_FOCAL_LENGTH));
406 }
407 }
408
409 if (directory.containsTag(ExifSubIFDDirectory.TAG_ISO_EQUIVALENT)) {
410 metadata.set(Metadata.ISO_SPEED_RATINGS, directory.getString(ExifSubIFDDirectory.TAG_ISO_EQUIVALENT));
411 }
412
413 if (directory.containsTag(ExifIFD0Directory.TAG_MAKE)) {
414 metadata.set(Metadata.EQUIPMENT_MAKE, directory.getString(ExifIFD0Directory.TAG_MAKE));
415 }
416 if (directory.containsTag(ExifIFD0Directory.TAG_MODEL)) {
417 metadata.set(Metadata.EQUIPMENT_MODEL, directory.getString(ExifIFD0Directory.TAG_MODEL));
418 }
419
420 if (directory.containsTag(ExifIFD0Directory.TAG_ORIENTATION)) {
421 Object length = directory.getObject(ExifIFD0Directory.TAG_ORIENTATION);
422 if (length instanceof Integer) {
423 metadata.set(Metadata.ORIENTATION, Integer.toString((Integer) length));
424 } else {
425 metadata.set(Metadata.ORIENTATION, directory.getString(ExifIFD0Directory.TAG_ORIENTATION));
426 }
427 }
428
429 if (directory.containsTag(ExifIFD0Directory.TAG_SOFTWARE)) {
430 metadata.set(Metadata.SOFTWARE, directory.getString(ExifIFD0Directory.TAG_SOFTWARE));
431 }
432
433 if (directory.containsTag(ExifIFD0Directory.TAG_X_RESOLUTION)) {
434 Object resolution = directory.getObject(ExifIFD0Directory.TAG_X_RESOLUTION);
435 if (resolution instanceof Rational) {
436 metadata.set(Metadata.RESOLUTION_HORIZONTAL, ((Rational) resolution).doubleValue());
437 } else {
438 metadata.set(Metadata.RESOLUTION_HORIZONTAL, directory.getString(ExifIFD0Directory.TAG_X_RESOLUTION));
439 }
440 }
441 if (directory.containsTag(ExifIFD0Directory.TAG_Y_RESOLUTION)) {
442 Object resolution = directory.getObject(ExifIFD0Directory.TAG_Y_RESOLUTION);
443 if (resolution instanceof Rational) {
444 metadata.set(Metadata.RESOLUTION_VERTICAL, ((Rational) resolution).doubleValue());
445 } else {
446 metadata.set(Metadata.RESOLUTION_VERTICAL, directory.getString(ExifIFD0Directory.TAG_Y_RESOLUTION));
447 }
448 }
449 if (directory.containsTag(ExifIFD0Directory.TAG_RESOLUTION_UNIT)) {
450 metadata.set(Metadata.RESOLUTION_UNIT, directory.getDescription(ExifIFD0Directory.TAG_RESOLUTION_UNIT));
451 }
452 if (directory.containsTag(ExifThumbnailDirectory.TAG_IMAGE_WIDTH)) {
453 metadata.set(Metadata.IMAGE_WIDTH,
454 trimPixels(directory.getDescription(ExifThumbnailDirectory.TAG_IMAGE_WIDTH)));
455 }
456 if (directory.containsTag(ExifThumbnailDirectory.TAG_IMAGE_HEIGHT)) {
457 metadata.set(Metadata.IMAGE_LENGTH,
458 trimPixels(directory.getDescription(ExifThumbnailDirectory.TAG_IMAGE_HEIGHT)));
459 }
460 }
461
367462 /**
368463 * Maps exif dates to metadata fields.
369464 */
376471 // Unless we have GPS time we don't know the time zone so date must be set
377472 // as ISO 8601 datetime without timezone suffix (no Z or +/-)
378473 if (original != null) {
379 String datetimeNoTimeZone = DATE_UNSPECIFIED_TZ.format(original); // Same time zone as Metadata Extractor uses
474 String datetimeNoTimeZone = DATE_UNSPECIFIED_TZ.get().format(original); // Same time zone as Metadata Extractor uses
380475 metadata.set(TikaCoreProperties.CREATED, datetimeNoTimeZone);
381476 metadata.set(Metadata.ORIGINAL_DATE, datetimeNoTimeZone);
382477 }
384479 if (directory.containsTag(ExifIFD0Directory.TAG_DATETIME)) {
385480 Date datetime = directory.getDate(ExifIFD0Directory.TAG_DATETIME);
386481 if (datetime != null) {
387 String datetimeNoTimeZone = DATE_UNSPECIFIED_TZ.format(datetime);
482 String datetimeNoTimeZone = DATE_UNSPECIFIED_TZ.get().format(datetime);
388483 metadata.set(TikaCoreProperties.MODIFIED, datetimeNoTimeZone);
389484 // If Date/Time Original does not exist this might be creation date
390485 if (metadata.get(TikaCoreProperties.CREATED) == null) {
394489 }
395490 }
396491 }
397
492
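The per-thread `SimpleDateFormat` used by `ExifHandler` can be sketched standalone, since `SimpleDateFormat` itself is not thread safe (`ExifDateDemo` is a made-up name; the pattern matches the one in the diff):

```java
import java.text.SimpleDateFormat;
import java.util.Date;
import java.util.Locale;

public class ExifDateDemo {
    // One SimpleDateFormat per thread, because the class is not thread safe.
    private static final ThreadLocal<SimpleDateFormat> DATE_UNSPECIFIED_TZ =
            new ThreadLocal<SimpleDateFormat>() {
                @Override
                protected SimpleDateFormat initialValue() {
                    // ISO 8601 datetime without a timezone suffix (no Z or +/-)
                    return new SimpleDateFormat("yyyy-MM-dd'T'HH:mm:ss", Locale.US);
                }
            };

    static String format(Date d) {
        return DATE_UNSPECIFIED_TZ.get().format(d);
    }

    public static void main(String[] args) {
        // Epoch rendered in the JVM's default zone; always 19 characters.
        System.out.println(format(new Date(0L)));
    }
}
```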
398493 /**
399494 * Reads image comments, originally TIKA-472.
400495 * Metadata Extractor does not read XMP so we need to use the values from Iptc or EXIF
403498 public boolean supports(Class<? extends Directory> directoryType) {
404499 return directoryType == IptcDirectory.class;
405500 }
501
406502 public void handle(Directory directory, Metadata metadata)
407503 throws MetadataException {
408504 if (directory.containsTag(IptcDirectory.TAG_KEYWORDS)) {
436532 public boolean supports(Class<? extends Directory> directoryType) {
437533 return directoryType == GpsDirectory.class;
438534 }
535
439536 public void handle(Directory directory, Metadata metadata) throws MetadataException {
440537 GeoLocation geoLocation = ((GpsDirectory) directory).getGeoLocation();
441538 if (geoLocation != null) {
2727 import org.apache.tika.exception.TikaException;
2828 import org.apache.tika.io.EndianUtils;
2929 import org.apache.tika.metadata.Metadata;
30 import org.apache.tika.metadata.Photoshop;
3031 import org.apache.tika.metadata.TIFF;
3132 import org.apache.tika.metadata.TikaCoreProperties;
3233 import org.apache.tika.mime.MediaType;
9495 int depth = EndianUtils.readUShortBE(stream);
9596 metadata.set(TIFF.BITS_PER_SAMPLE, Integer.toString(depth));
9697
97 // Colour mode
98 // Bitmap = 0; Grayscale = 1; Indexed = 2; RGB = 3; CMYK = 4; Multichannel = 7; Duotone = 8; Lab = 9.
98 // Colour mode, e.g. Bitmap or RGB
9999 int colorMode = EndianUtils.readUShortBE(stream);
100 // TODO Identify a suitable metadata key for this
100 metadata.set(Photoshop.COLOR_MODE, Photoshop._COLOR_MODE_CHOICES_INDEXED[colorMode]);
101101
102102 // Next is the Color Mode section
103103 // We don't care about this bit
116116 if(rb.id == ResourceBlock.ID_CAPTION) {
117117 metadata.add(TikaCoreProperties.DESCRIPTION, rb.getDataAsString());
118118 } else if(rb.id == ResourceBlock.ID_EXIF_1) {
119 // TODO Parse the EXIF info
119 // TODO Parse the EXIF info via ImageMetadataExtractor
120120 } else if(rb.id == ResourceBlock.ID_EXIF_3) {
121 // TODO Parse the EXIF info
121 // TODO Parse the EXIF info via ImageMetadataExtractor
122122 } else if(rb.id == ResourceBlock.ID_XMP) {
123 // TODO Parse the XMP info
123 // TODO Parse the XMP info via ImageMetadataExtractor
124124 }
125125 }
126126
0 /*
1 * Licensed to the Apache Software Foundation (ASF) under one or more
2 * contributor license agreements. See the NOTICE file distributed with
3 * this work for additional information regarding copyright ownership.
4 * The ASF licenses this file to You under the Apache License, Version 2.0
5 * (the "License"); you may not use this file except in compliance with
6 * the License. You may obtain a copy of the License at
7 *
8 * http://www.apache.org/licenses/LICENSE-2.0
9 *
10 * Unless required by applicable law or agreed to in writing, software
11 * distributed under the License is distributed on an "AS IS" BASIS,
12 * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
13 * See the License for the specific language governing permissions and
14 * limitations under the License.
15 */
16 package org.apache.tika.parser.image;
17
18 import java.io.IOException;
19 import java.io.InputStream;
20 import java.util.Collections;
21 import java.util.Set;
22
23 import org.apache.tika.exception.TikaException;
24 import org.apache.tika.io.TemporaryResources;
25 import org.apache.tika.io.TikaInputStream;
26 import org.apache.tika.metadata.Metadata;
27 import org.apache.tika.mime.MediaType;
28 import org.apache.tika.parser.AbstractParser;
29 import org.apache.tika.parser.ParseContext;
30 import org.apache.tika.sax.XHTMLContentHandler;
31 import org.xml.sax.ContentHandler;
32 import org.xml.sax.SAXException;
33
34
35 public class WebPParser extends AbstractParser {
36
37 /** Serial version UID */
38 private static final long serialVersionUID = -3941143576535464926L;
39
40 private static final Set<MediaType> SUPPORTED_TYPES =
41 Collections.singleton(MediaType.image("webp"));
42
43 public Set<MediaType> getSupportedTypes(ParseContext context) {
44 return SUPPORTED_TYPES;
45 }
46
47 public void parse(
48 InputStream stream, ContentHandler handler,
49 Metadata metadata, ParseContext context)
50 throws IOException, SAXException, TikaException {
51 TemporaryResources tmp = new TemporaryResources();
52 try {
53 TikaInputStream tis = TikaInputStream.get(stream, tmp);
54 new ImageMetadataExtractor(metadata).parseWebP(tis.getFile());
55 } finally {
56 tmp.dispose();
57 }
58
59 XHTMLContentHandler xhtml = new XHTMLContentHandler(handler, metadata);
60 xhtml.startDocument();
61 xhtml.endDocument();
62 }
63 }
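The new `WebPParser` relies on Tika's MIME detection to be handed WebP files; that container check itself boils down to the RIFF header layout. A self-contained sketch of the check (this is not Tika's detector, just the byte layout it matches on):

```java
import java.io.ByteArrayInputStream;
import java.io.IOException;
import java.io.InputStream;

public class WebPSniffer {
    // A WebP file is a RIFF container: bytes 0-3 are "RIFF", bytes 8-11 are "WEBP".
    static boolean looksLikeWebP(InputStream in) {
        byte[] header = new byte[12];
        try {
            if (in.read(header) < 12) {
                return false; // too short to carry a RIFF header
            }
        } catch (IOException e) {
            return false;
        }
        return header[0] == 'R' && header[1] == 'I' && header[2] == 'F' && header[3] == 'F'
                && header[8] == 'W' && header[9] == 'E' && header[10] == 'B' && header[11] == 'P';
    }

    public static void main(String[] args) {
        byte[] fake = "RIFF....WEBPVP8 ".getBytes();
        System.out.println(looksLikeWebP(new ByteArrayInputStream(fake))); // true
    }
}
```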
2121 import java.io.InputStream;
2222 import java.io.InputStreamReader;
2323 import java.io.Reader;
24 import java.util.Iterator;
2524 import java.util.List;
2625
2726 import org.apache.jempbox.xmp.XMPMetadata;
2827 import org.apache.jempbox.xmp.XMPSchemaDublinCore;
2928 import org.apache.tika.exception.TikaException;
29 import org.apache.tika.io.IOUtils;
3030 import org.apache.tika.metadata.Metadata;
3131 import org.apache.tika.metadata.TikaCoreProperties;
3232 import org.xml.sax.InputSource;
3838 private Metadata metadata;
3939
4040 // The XMP spec says it must be unicode, but for most file formats it specifies "must be encoded in UTF-8"
41 private static final String DEFAULT_XMP_CHARSET = "UTF-8";
41 private static final String DEFAULT_XMP_CHARSET = IOUtils.UTF_8.name();
4242
4343 public JempboxExtractor(Metadata metadata) {
4444 this.metadata = metadata;
3131
3232 private ServiceRegistration parserService;
3333
34 @Override
3435 public void start(BundleContext context) throws Exception {
3536 detectorService = context.registerService(
3637 Detector.class.getName(),
3738 new DefaultDetector(Activator.class.getClassLoader()),
3839 new Properties());
40 Parser parser = new DefaultParser(Activator.class.getClassLoader());
3941 parserService = context.registerService(
4042 Parser.class.getName(),
41 new DefaultParser(Activator.class.getClassLoader()),
43 parser,
4244 new Properties());
4345 }
4446
47 @Override
4548 public void stop(BundleContext context) throws Exception {
4649 parserService.unregister();
4750 detectorService.unregister();
2222 import java.util.Collections;
2323 import java.util.Date;
2424 import java.util.HashMap;
25 import java.util.Locale;
2526 import java.util.Set;
2627 import java.util.TimeZone;
2728
2829 import org.apache.tika.exception.TikaException;
30 import org.apache.tika.io.IOUtils;
2931 import org.apache.tika.metadata.Metadata;
3032 import org.apache.tika.metadata.TikaCoreProperties;
3133 import org.apache.tika.mime.MediaType;
159161 }
160162 int msgsize = is.read(buf); // read in at least the full data
161163
162 String message = (new String(buf)).toLowerCase();
164 String message = (new String(buf, IOUtils.UTF_8)).toLowerCase(Locale.ROOT);
163165 // these are not if-then-else, because we want to go from most common
164166 // and fall through to least. this is imperfect, as these tags could
165167 // show up in other agency stories, but i can't find a spec or any
589591 --read;
590592 }
591593 }
592 if (tmp_line.toLowerCase().startsWith("by") || longline.equals("bdy_author")) {
594 if (tmp_line.toLowerCase(Locale.ROOT).startsWith("by") || longline.equals("bdy_author")) {
593595 longkey = "bdy_author";
594596
595597 // prepend a space to subsequent line, so it gets parsed consistent with the lead line
607609 }
608610 else if (FORMAT == this.FMT_IPTC_BLM) {
609611 String byline = " by ";
610 if (tmp_line.toLowerCase().contains(byline)) {
612 if (tmp_line.toLowerCase(Locale.ROOT).contains(byline)) {
611613 longkey = "bdy_author";
612614
613615 int term = tmp_line.length();
616618 term = Math.min(term, (tmp_line.contains("\n") ? tmp_line.indexOf("\n") : term));
617619 term = (term > 0 ) ? term : tmp_line.length();
618620 // for bloomberg, the author line sits below their copyright statement
619 bdy_author += tmp_line.substring(tmp_line.toLowerCase().indexOf(byline) + byline.length(), term) + " ";
621 bdy_author += tmp_line.substring(tmp_line.toLowerCase(Locale.ROOT).indexOf(byline) + byline.length(), term) + " ";
620622 metastarted = true;
621623 longline = ((tmp_line.contains("=")) && (!longline.equals(longkey)) ? longkey : "");
622624 }
623 else if(tmp_line.toLowerCase().startsWith("c.")) {
625 else if(tmp_line.toLowerCase(Locale.ROOT).startsWith("c.")) {
624626 // the author line for bloomberg is a multiline starting with c.2011 Bloomberg News
625627 // then containing the author info on the next line
626628 if (val_next == TB) {
628630 continue;
629631 }
630632 }
631 else if(tmp_line.toLowerCase().trim().startsWith("(") && tmp_line.toLowerCase().trim().endsWith(")")) {
633 else if(tmp_line.toLowerCase(Locale.ROOT).trim().startsWith("(") && tmp_line.toLowerCase(Locale.ROOT).trim().endsWith(")")) {
632634 // the author line may have one or more comment lines between the copyright
633635 // statement, and the By AUTHORNAME line
634636 if (val_next == TB) {
638640 }
639641 }
640642
641 else if (tmp_line.toLowerCase().startsWith("eds") || longline.equals("bdy_source")) {
643 else if (tmp_line.toLowerCase(Locale.ROOT).startsWith("eds") || longline.equals("bdy_source")) {
642644 longkey = "bdy_source";
643645 // prepend a space to subsequent line, so it gets parsed consistent with the lead line
644646 tmp_line = (longline.equals(longkey) ? " " : "") + tmp_line;
735737 // standard reuters format
736738 format_in = "HH:mm MM-dd-yy";
737739 }
738 SimpleDateFormat dfi = new SimpleDateFormat(format_in);
740 SimpleDateFormat dfi = new SimpleDateFormat(format_in, Locale.ROOT);
739741 dfi.setTimeZone(TimeZone.getTimeZone("UTC"));
740742 dateunix = dfi.parse(ftr_datetime);
741743 }
742744 catch (ParseException ep) {
743745 // failed, but this will just fall through to setting the date to now
744746 }
745 SimpleDateFormat dfo = new SimpleDateFormat(format_out);
747 SimpleDateFormat dfo = new SimpleDateFormat(format_out, Locale.ROOT);
746748 dfo.setTimeZone(TimeZone.getTimeZone("UTC"));
747749 ftr_datetime = dfo.format(dateunix);
748750 }
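The `Locale.ROOT` changes in this file pin `SimpleDateFormat` behavior regardless of the JVM's default locale. A minimal sketch of the UTC parse step (names are illustrative; returning null mirrors the parser's fall-through on `ParseException`):

```java
import java.text.ParseException;
import java.text.SimpleDateFormat;
import java.util.Date;
import java.util.Locale;
import java.util.TimeZone;

public class AnpaDateDemo {
    // Locale.ROOT keeps parsing stable regardless of the JVM's default locale.
    static Date parseUtc(String value, String pattern) {
        SimpleDateFormat dfi = new SimpleDateFormat(pattern, Locale.ROOT);
        dfi.setTimeZone(TimeZone.getTimeZone("UTC"));
        try {
            return dfi.parse(value);
        } catch (ParseException ep) {
            return null; // caller falls through and substitutes the current time
        }
    }

    public static void main(String[] args) {
        // Standard Reuters-style timestamp from the parser's format_in
        System.out.println(parseUtc("13:45 04-13-15", "HH:mm MM-dd-yy") != null); // true
    }
}
```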
0 /*
1 * Licensed to the Apache Software Foundation (ASF) under one or more
2 * contributor license agreements. See the NOTICE file distributed with
3 * this work for additional information regarding copyright ownership.
4 * The ASF licenses this file to You under the Apache License, Version 2.0
5 * (the "License"); you may not use this file except in compliance with
6 * the License. You may obtain a copy of the License at
7 *
8 * http://www.apache.org/licenses/LICENSE-2.0
9 *
10 * Unless required by applicable law or agreed to in writing, software
11 * distributed under the License is distributed on an "AS IS" BASIS,
12 * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
13 * See the License for the specific language governing permissions and
14 * limitations under the License.
15 */
16 package org.apache.tika.parser.isatab;
17
18 import java.io.IOException;
19 import java.io.InputStream;
20 import java.io.Reader;
21 import java.util.Arrays;
22 import java.util.HashMap;
23 import java.util.Iterator;
24 import java.util.Locale;
25 import java.util.Map;
26
27 import org.apache.commons.csv.CSVFormat;
28 import org.apache.commons.csv.CSVParser;
29 import org.apache.commons.csv.CSVRecord;
30 import org.apache.tika.config.ServiceLoader;
31 import org.apache.tika.detect.AutoDetectReader;
32 import org.apache.tika.exception.TikaException;
33 import org.apache.tika.io.CloseShieldInputStream;
34 import org.apache.tika.io.TikaInputStream;
35 import org.apache.tika.metadata.Metadata;
36 import org.apache.tika.parser.ParseContext;
37 import org.apache.tika.sax.XHTMLContentHandler;
38 import org.xml.sax.SAXException;
39
40 public class ISATabUtils {
41
42 private static final ServiceLoader LOADER = new ServiceLoader(ISATabUtils.class.getClassLoader());
43
44 /**
45 * INVESTIGATION
46 */
47
48 // Investigation section.
49 private static final String[] sections = {
50 "ONTOLOGY SOURCE REFERENCE",
51 "INVESTIGATION",
52 "INVESTIGATION PUBLICATIONS",
53 "INVESTIGATION CONTACTS"
54 };
55
56 // STUDY section (inside the Study section)
57 private static final String studySectionField = "STUDY";
58
59 // Study File Name (inside the STUDY section)
60 private static final String studyFileNameField = "Study File Name";
61
62 public static void parseInvestigation(InputStream stream, XHTMLContentHandler handler, Metadata metadata, ParseContext context, String studyFileName) throws IOException, TikaException, SAXException {
63 // Automatically detect the character encoding
64 AutoDetectReader reader = new AutoDetectReader(new CloseShieldInputStream(stream), metadata, context.get(ServiceLoader.class, LOADER));
65
66 try {
67 extractMetadata(reader, metadata, studyFileName);
68 } finally {
69 reader.close();
70 }
71 }
72
73 public static void parseInvestigation(InputStream stream, XHTMLContentHandler handler, Metadata metadata, ParseContext context) throws IOException, TikaException, SAXException {
74 parseInvestigation(stream, handler, metadata, context, null);
75 }
76
77 public static void parseStudy(InputStream stream, XHTMLContentHandler xhtml, Metadata metadata, ParseContext context) throws IOException, TikaException, SAXException {
78 TikaInputStream tis = TikaInputStream.get(stream);
79 // Automatically detect the character encoding
80 AutoDetectReader reader = new AutoDetectReader(new CloseShieldInputStream(tis), metadata, context.get(ServiceLoader.class, LOADER));
81 CSVParser csvParser = null;
82
83 try {
84 csvParser = new CSVParser(reader, CSVFormat.TDF);
85 Iterator<CSVRecord> iterator = csvParser.iterator();
86
87 xhtml.startElement("table");
88
89 xhtml.startElement("thead");
90 if (iterator.hasNext()) {
91 CSVRecord record = iterator.next();
92 for (int i = 0; i < record.size(); i++) {
93 xhtml.startElement("th");
94 xhtml.characters(record.get(i));
95 xhtml.endElement("th");
96 }
97 }
98 xhtml.endElement("thead");
99
100 xhtml.startElement("tbody");
101 while (iterator.hasNext()) {
102 CSVRecord record = iterator.next();
103 xhtml.startElement("tr");
104 for (int j = 0; j < record.size(); j++) {
105 xhtml.startElement("td");
106 xhtml.characters(record.get(j));
107 xhtml.endElement("td");
108 }
109 xhtml.endElement("tr");
110 }
111 xhtml.endElement("tbody");
112
113 xhtml.endElement("table");
114
115 } finally {
116 reader.close();
117 if (csvParser != null) { csvParser.close(); }
118 }
119 }
120
121 public static void parseAssay(InputStream stream, XHTMLContentHandler xhtml, Metadata metadata, ParseContext context) throws IOException, TikaException, SAXException {
122 TikaInputStream tis = TikaInputStream.get(stream);
123
124 // Automatically detect the character encoding
125 AutoDetectReader reader = new AutoDetectReader(new CloseShieldInputStream(tis), metadata, context.get(ServiceLoader.class, LOADER));
126 CSVParser csvParser = null;
127
128 try {
129 csvParser = new CSVParser(reader, CSVFormat.TDF);
130
131 xhtml.startElement("table");
132
133 Iterator<CSVRecord> iterator = csvParser.iterator();
134
135 xhtml.startElement("thead");
136 if (iterator.hasNext()) {
137 CSVRecord record = iterator.next();
138 for (int i = 0; i < record.size(); i++) {
139 xhtml.startElement("th");
140 xhtml.characters(record.get(i));
141 xhtml.endElement("th");
142 }
143 }
144 xhtml.endElement("thead");
145
146 xhtml.startElement("tbody");
147 while (iterator.hasNext()) {
148 CSVRecord record = iterator.next();
149 xhtml.startElement("tr");
150 for (int j = 0; j < record.size(); j++) {
151 xhtml.startElement("td");
152 xhtml.characters(record.get(j));
153 xhtml.endElement("td");
154 }
155 xhtml.endElement("tr");
156 }
157 xhtml.endElement("tbody");
158
159 xhtml.endElement("table");
160
161 } finally {
162 reader.close();
163 if (csvParser != null) { csvParser.close(); }
164 }
165 }
166
167 private static void extractMetadata(Reader reader, Metadata metadata, String studyFileName) throws IOException {
168 boolean investigationSection = false;
169 boolean studySection = false;
170 boolean studyTarget = false;
171
172 Map<String, String> map = new HashMap<String, String>();
173
174 CSVParser csvParser = null;
175 try {
176 csvParser = new CSVParser(reader, CSVFormat.TDF);
177
178 Iterator<CSVRecord> iterator = csvParser.iterator();
179
180 while (iterator.hasNext()) {
181 CSVRecord record = iterator.next();
182 String field = record.get(0);
183 if ((field.toUpperCase(Locale.ENGLISH).equals(field)) && (record.size() == 1)) {
184 investigationSection = Arrays.asList(sections).contains(field);
185 studySection = (studyFileName != null) && (field.equals(studySectionField));
186 }
187 else {
188 if (investigationSection) {
189 addMetadata(field, record, metadata);
190 }
191 else if (studySection) {
192 if (studyTarget) {
193 break;
194 }
195 String value = record.get(1);
196 map.put(field, value);
197 studyTarget = (field.equals(studyFileNameField)) && (value.equals(studyFileName));
198 if (studyTarget) {
199 mapStudyToMetadata(map, metadata);
200 studySection = false;
201 }
202 }
203 else if (studyTarget) {
204 addMetadata(field, record, metadata);
205 }
206 }
207 }
208 } finally {
209 if (csvParser != null) csvParser.close();
210 }
213 }
214
215 private static void addMetadata(String field, CSVRecord record, Metadata metadata) {
216 if ((record == null) || (record.size() <= 1)) {
217 return;
218 }
219
220 for (int i = 1; i < record.size(); i++) {
221 metadata.add(field, record.get(i));
222 }
223 }
224
225 private static void mapStudyToMetadata(Map<String, String> map, Metadata metadata) {
226 for (Map.Entry<String, String> entry : map.entrySet()) {
227 metadata.add(entry.getKey(), entry.getValue());
228 }
229 }
230 }
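extractMetadata above treats a record as the start of a new ISA-Tab section when it consists of a single, fully upper-case field. A minimal stand-alone sketch of that heuristic (the String[] record type here is an illustrative stand-in for commons-csv's CSVRecord):

```java
import java.util.Locale;

public class IsaTabSectionCheck {
    // A record opens a new section when it has exactly one field and
    // that field is already fully upper-case (e.g. "STUDY", "INVESTIGATION").
    static boolean isSectionHeader(String[] record) {
        return record.length == 1
                && record[0].toUpperCase(Locale.ENGLISH).equals(record[0]);
    }

    public static void main(String[] args) {
        System.out.println(isSectionHeader(new String[] {"STUDY"}));                      // true
        System.out.println(isSectionHeader(new String[] {"Study File Name", "s_x.txt"})); // false
    }
}
```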
0 /*
1 * Licensed to the Apache Software Foundation (ASF) under one or more
2 * contributor license agreements. See the NOTICE file distributed with
3 * this work for additional information regarding copyright ownership.
4 * The ASF licenses this file to You under the Apache License, Version 2.0
5 * (the "License"); you may not use this file except in compliance with
6 * the License. You may obtain a copy of the License at
7 *
8 * http://www.apache.org/licenses/LICENSE-2.0
9 *
10 * Unless required by applicable law or agreed to in writing, software
11 * distributed under the License is distributed on an "AS IS" BASIS,
12 * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
13 * See the License for the specific language governing permissions and
14 * limitations under the License.
15 */
16 package org.apache.tika.parser.isatab;
17
18 import java.io.File;
19 import java.io.FilenameFilter;
20 import java.io.IOException;
21 import java.io.InputStream;
22 import java.util.Collections;
23 import java.util.Set;
24
25 import org.apache.tika.exception.TikaException;
26 import org.apache.tika.io.TikaInputStream;
27 import org.apache.tika.metadata.Metadata;
28 import org.apache.tika.mime.MediaType;
29 import org.apache.tika.parser.ParseContext;
30 import org.apache.tika.parser.Parser;
31 import org.apache.tika.sax.XHTMLContentHandler;
32 import org.xml.sax.ContentHandler;
33 import org.xml.sax.SAXException;
34
35 public class ISArchiveParser implements Parser {
36
37 /**
38 * Serial version UID
39 */
40 private static final long serialVersionUID = 3640809327541300229L;
41
42 private final Set<MediaType> SUPPORTED_TYPES = Collections.singleton(MediaType.application("x-isatab"));
43
44 private static String studyAssayFileNameField = "Study Assay File Name";
45
46 private String location = null;
47
48 private String studyFileName = null;
49
50 /**
51 * Default constructor.
52 */
53 public ISArchiveParser() {
54 this(null);
55 }
56
57 /**
58 * Constructor that accepts the pathname of ISArchive folder.
59 * @param location pathname of ISArchive folder including ISA-Tab files
60 */
61 public ISArchiveParser(String location) {
62 if (location != null && !location.endsWith(File.separator)) {
63 location += File.separator;
64 }
65 this.location = location;
66 }
67
68 @Override
69 public Set<MediaType> getSupportedTypes(ParseContext context) {
70 return SUPPORTED_TYPES;
71 }
72
73 @Override
74 public void parse(InputStream stream, ContentHandler handler, Metadata metadata,
75 ParseContext context) throws IOException, SAXException, TikaException {
76
77 TikaInputStream tis = TikaInputStream.get(stream);
78 if (this.location == null) {
79 this.location = tis.getFile().getParent() + File.separator;
80 }
81 this.studyFileName = tis.getFile().getName();
82
83 File locationFile = new File(location);
84 String[] investigationList = locationFile.list(new FilenameFilter() {
85
86 @Override
87 public boolean accept(File dir, String name) {
88 return name.matches("i_.+\\.txt");
89 }
90 });
91
92 XHTMLContentHandler xhtml = new XHTMLContentHandler(handler, metadata);
93 xhtml.startDocument();
94
95 parseInvestigation(investigationList, xhtml, metadata, context);
96 parseStudy(stream, xhtml, metadata, context);
97 parseAssay(xhtml, metadata, context);
98
99 xhtml.endDocument();
100 }
101
102 private void parseInvestigation(String[] investigationList, XHTMLContentHandler xhtml, Metadata metadata, ParseContext context) throws IOException, SAXException, TikaException {
103 if ((investigationList == null) || (investigationList.length == 0)) {
104 // TODO warning
105 return;
106 }
107 if (investigationList.length > 1) {
108 // TODO warning
109 return;
110 }
111
112 String investigation = investigationList[0]; // TODO add to metadata?
113 InputStream stream = TikaInputStream.get(new File(this.location + investigation));
114
115 ISATabUtils.parseInvestigation(stream, xhtml, metadata, context, this.studyFileName);
116
117 xhtml.element("h1", "INVESTIGATION " + metadata.get("Investigation Identifier"));
118 }
119
120 private void parseStudy(InputStream stream, XHTMLContentHandler xhtml, Metadata metadata, ParseContext context) throws IOException, SAXException, TikaException {
121 xhtml.element("h2", "STUDY " + metadata.get("Study Identifier"));
122
123 ISATabUtils.parseStudy(stream, xhtml, metadata, context);
124 }
125
126 private void parseAssay(XHTMLContentHandler xhtml, Metadata metadata, ParseContext context) throws IOException, SAXException, TikaException {
127 for (String assayFileName : metadata.getValues(studyAssayFileNameField)) {
128 xhtml.startElement("div");
129 xhtml.element("h3", "ASSAY " + assayFileName);
130 InputStream stream = TikaInputStream.get(new File(this.location + assayFileName));
131 ISATabUtils.parseAssay(stream, xhtml, metadata, context);
132 xhtml.endElement("div");
133 }
134 }
135 }
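ISArchiveParser locates the investigation file in the archive folder via the FilenameFilter above, relying on the ISA-Tab naming convention that investigation files are named "i_*.txt". The convention check on its own:

```java
public class IsaTabNaming {
    // ISA-Tab convention: investigation files match "i_<name>.txt";
    // study files are "s_*.txt" and assay files "a_*.txt".
    static boolean isInvestigationFile(String name) {
        return name.matches("i_.+\\.txt");
    }

    public static void main(String[] args) {
        System.out.println(isInvestigationFile("i_investigation.txt")); // true
        System.out.println(isInvestigationFile("s_study.txt"));         // false
    }
}
```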
1414 * limitations under the License.
1515 */
1616 package org.apache.tika.parser.iwork;
17
18 import java.util.Locale;
1719
1820 /**
1921 * Utility class to allow for conversion from an integer to Roman numerals
4345 }
4446
4547 public static String asAlphaNumericLower(int i) {
46 return asAlphaNumeric(i).toLowerCase();
48 return asAlphaNumeric(i).toLowerCase(Locale.ROOT);
4749 }
4850
4951 /*
7274 }
7375
7476 public static String asRomanNumeralsLower(int i) {
75 return asRomanNumerals(i).toLowerCase();
77 return asRomanNumerals(i).toLowerCase(Locale.ROOT);
7678 }
7779
7880 private static int i2r(StringBuffer sbuff, int i,
215215
216216 entry = zip.getNextZipEntry();
217217 }
218 zip.close();
218 // Don't close the zip InputStream (TIKA-1117).
219219 }
220220
221221 }
0 package org.apache.tika.parser.jdbc;
1 /*
2 * Licensed to the Apache Software Foundation (ASF) under one or more
3 * contributor license agreements. See the NOTICE file distributed with
4 * this work for additional information regarding copyright ownership.
5 * The ASF licenses this file to You under the Apache License, Version 2.0
6 * (the "License"); you may not use this file except in compliance with
7 * the License. You may obtain a copy of the License at
8 *
9 * http://www.apache.org/licenses/LICENSE-2.0
10 *
11 * Unless required by applicable law or agreed to in writing, software
12 * distributed under the License is distributed on an "AS IS" BASIS,
13 * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
14 * See the License for the specific language governing permissions and
15 * limitations under the License.
16 */
17
18 import java.io.IOException;
19 import java.io.InputStream;
20 import java.sql.Connection;
21 import java.sql.DriverManager;
22 import java.sql.SQLException;
23 import java.util.List;
24 import java.util.Set;
25
26 import org.apache.tika.exception.TikaException;
27 import org.apache.tika.extractor.EmbeddedDocumentExtractor;
28 import org.apache.tika.extractor.ParsingEmbeddedDocumentExtractor;
29 import org.apache.tika.io.IOExceptionWithCause;
30 import org.apache.tika.metadata.Database;
31 import org.apache.tika.metadata.Metadata;
32 import org.apache.tika.mime.MediaType;
33 import org.apache.tika.parser.AbstractParser;
34 import org.apache.tika.parser.ParseContext;
35 import org.apache.tika.sax.XHTMLContentHandler;
36 import org.xml.sax.ContentHandler;
37 import org.xml.sax.SAXException;
38
39 /**
40 * Abstract class that handles iterating through tables within a database.
41 */
42 abstract class AbstractDBParser extends AbstractParser {
43
44 private final static byte[] EMPTY_BYTE_ARR = new byte[0];
45
46 private Connection connection;
47
48 @Override
49 public Set<MediaType> getSupportedTypes(ParseContext context) {
50 return null;
51 }
52
53 @Override
54 public void parse(InputStream stream, ContentHandler handler, Metadata metadata, ParseContext context) throws IOException, SAXException, TikaException {
55 connection = getConnection(stream, metadata, context);
56 XHTMLContentHandler xHandler = null;
57 List<String> tableNames = null;
58 try {
59 tableNames = getTableNames(connection, metadata, context);
60 } catch (SQLException e) {
61 throw new IOExceptionWithCause(e);
62 }
63 for (String tableName : tableNames) {
64 //add table names to parent metadata
65 metadata.add(Database.TABLE_NAME, tableName);
66 }
67 xHandler = new XHTMLContentHandler(handler, metadata);
68 xHandler.startDocument();
69
70 try {
71 for (String tableName : tableNames) {
72 JDBCTableReader tableReader = getTableReader(connection, tableName, context);
73 xHandler.startElement("table", "name", tableReader.getTableName());
74 xHandler.startElement("thead");
75 xHandler.startElement("tr");
76 for (String header : tableReader.getHeaders()) {
77 xHandler.startElement("th");
78 xHandler.characters(header);
79 xHandler.endElement("th");
80 }
81 xHandler.endElement("tr");
82 xHandler.endElement("thead");
83 xHandler.startElement("tbody");
84 while (tableReader.nextRow(xHandler, context)) {
85 //no-op
86 }
87 xHandler.endElement("tbody");
88 xHandler.endElement("table");
89 }
90 } finally {
91 if (xHandler != null) {
92 xHandler.endDocument();
93 }
94 try {
95 close();
96 } catch (SQLException e) {
97 //swallow
98 }
99 }
100 }
101
102 protected static EmbeddedDocumentExtractor getEmbeddedDocumentExtractor(ParseContext context) {
103 return context.get(EmbeddedDocumentExtractor.class,
104 new ParsingEmbeddedDocumentExtractor(context));
105 }
106
107 /**
108 * Override this for any special handling of closing the connection.
109 *
110 * @throws java.sql.SQLException
111 * @throws java.io.IOException
112 */
113 protected void close() throws SQLException, IOException {
114 connection.close();
115 }
116
117 /**
118 * Override this for special configuration of the connection, such as limiting
119 * the number of rows to be held in memory.
120 *
121 * @param stream stream to use
122 * @param metadata metadata that could be used in parameterizing the connection
123 * @param context parsecontext that could be used in parameterizing the connection
124 * @return connection
125 * @throws java.io.IOException
126 * @throws org.apache.tika.exception.TikaException
127 */
128 protected Connection getConnection(InputStream stream, Metadata metadata, ParseContext context) throws IOException, TikaException {
129 String connectionString = getConnectionString(stream, metadata, context);
130
131 Connection connection = null;
132 try {
133 Class.forName(getJDBCClassName());
134 } catch (ClassNotFoundException e) {
135 throw new TikaException(e.getMessage(), e);
136 }
137 try{
138 connection = DriverManager.getConnection(connectionString);
139 } catch (SQLException e) {
140 throw new IOExceptionWithCause(e);
141 }
142 return connection;
143 }
144
145 /**
146 * Implement for db specific connection information, e.g. "jdbc:sqlite:/docs/mydb.db"
147 * <p>
148 * Include any optimization settings, user name, password, etc.
149 * <p>
150 * @param stream stream for processing
151 * @param metadata metadata might be useful in determining connection info
152 * @param parseContext context to use to help create connectionString
153 * @return connection string to be used by {@link #getConnection}.
154 * @throws java.io.IOException
155 */
156 abstract protected String getConnectionString(InputStream stream,
157 Metadata metadata, ParseContext parseContext) throws IOException;
158
159 /**
160 * JDBC class name, e.g. org.sqlite.JDBC
161 * @return jdbc class name
162 */
163 abstract protected String getJDBCClassName();
164
165 /**
166 *
167 * Returns the names of the tables to process
168 *
169 * @param connection Connection to use to make the sql call(s) to get the names of the tables
170 * @param metadata Metadata to use (potentially) in decision about which tables to extract
171 * @param context ParseContext to use (potentially) in decision about which tables to extract
172 * @return names of the tables to extract
173 * @throws java.sql.SQLException
174 */
175 abstract protected List<String> getTableNames(Connection connection, Metadata metadata,
176 ParseContext context) throws SQLException;
177
178 /**
179 * Given a connection and a table name, return the JDBCTableReader for this db.
180 *
181 * @param connection
182 * @param tableName
184 * @return the table reader to use for this table
184 */
185 abstract protected JDBCTableReader getTableReader(Connection connection, String tableName, ParseContext parseContext);
186
187 }
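AbstractDBParser is a template method: the base class drives driver loading, connection setup, and the per-table traversal, while each database format supplies only the JDBC class name, the connection string, the table-name query, and a table reader. A dependency-free sketch of that shape (class and method names below are illustrative, not Tika API):

```java
import java.util.Arrays;
import java.util.List;

// Template-method sketch: the base class owns the fixed traversal,
// the subclass supplies the format-specific pieces.
abstract class DBTemplate {
    abstract String jdbcClassName();
    abstract String connectionString(String path);
    abstract List<String> tableNames();

    // Fixed algorithm: resolve the driver, build the connection
    // string, then visit every table.
    String describe(String path) {
        StringBuilder sb = new StringBuilder(jdbcClassName())
                .append(" -> ").append(connectionString(path));
        for (String t : tableNames()) {
            sb.append(" [table ").append(t).append(']');
        }
        return sb.toString();
    }
}

class SqliteTemplate extends DBTemplate {
    String jdbcClassName() { return "org.sqlite.JDBC"; }
    String connectionString(String path) { return "jdbc:sqlite:" + path; }
    List<String> tableNames() { return Arrays.asList("docs"); }
}
```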
0 package org.apache.tika.parser.jdbc;
1 /*
2 * Licensed to the Apache Software Foundation (ASF) under one or more
3 * contributor license agreements. See the NOTICE file distributed with
4 * this work for additional information regarding copyright ownership.
5 * The ASF licenses this file to You under the Apache License, Version 2.0
6 * (the "License"); you may not use this file except in compliance with
7 * the License. You may obtain a copy of the License at
8 *
9 * http://www.apache.org/licenses/LICENSE-2.0
10 *
11 * Unless required by applicable law or agreed to in writing, software
12 * distributed under the License is distributed on an "AS IS" BASIS,
13 * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
14 * See the License for the specific language governing permissions and
15 * limitations under the License.
16 */
17
18
19 import java.io.ByteArrayInputStream;
20 import java.io.IOException;
21 import java.io.InputStream;
22 import java.sql.Blob;
23 import java.sql.Clob;
24 import java.sql.Connection;
25 import java.sql.ResultSet;
26 import java.sql.ResultSetMetaData;
27 import java.sql.SQLException;
28 import java.sql.Statement;
29 import java.sql.Types;
30 import java.util.LinkedList;
31 import java.util.List;
32
33 import org.apache.tika.config.TikaConfig;
34 import org.apache.tika.detect.Detector;
35 import org.apache.tika.extractor.EmbeddedDocumentExtractor;
36 import org.apache.tika.io.FilenameUtils;
37 import org.apache.tika.io.IOExceptionWithCause;
38 import org.apache.tika.io.IOUtils;
39 import org.apache.tika.io.TikaInputStream;
40 import org.apache.tika.metadata.Database;
41 import org.apache.tika.metadata.Metadata;
42 import org.apache.tika.metadata.TikaMetadataKeys;
43 import org.apache.tika.mime.MediaType;
44 import org.apache.tika.mime.MimeType;
45 import org.apache.tika.mime.MimeTypeException;
46 import org.apache.tika.mime.MimeTypes;
47 import org.apache.tika.parser.ParseContext;
48 import org.apache.tika.sax.XHTMLContentHandler;
49 import org.xml.sax.Attributes;
50 import org.xml.sax.ContentHandler;
51 import org.xml.sax.SAXException;
52 import org.xml.sax.helpers.AttributesImpl;
53
54 /**
55 * General base class to iterate through rows of a JDBC table
56 */
57 class JDBCTableReader {
58
59 private final static Attributes EMPTY_ATTRIBUTES = new AttributesImpl();
60 private final Connection connection;
61 private final String tableName;
62 int maxClobLength = 1000000;
63 ResultSet results = null;
64 int rows = 0;
65 private TikaConfig tikaConfig = null;
66 private Detector detector = null;
67 private MimeTypes mimeTypes = null;
68
69 public JDBCTableReader(Connection connection, String tableName, ParseContext context) {
70 this.connection = connection;
71 this.tableName = tableName;
72 this.tikaConfig = context.get(TikaConfig.class);
73 }
74
75 public boolean nextRow(ContentHandler handler, ParseContext context) throws IOException, SAXException {
76 //lazy initialization
77 if (results == null) {
78 reset();
79 }
80 try {
81 if (!results.next()) {
82 return false;
83 }
84 } catch (SQLException e) {
85 throw new IOExceptionWithCause(e);
86 }
87 try {
88 ResultSetMetaData meta = results.getMetaData();
89 handler.startElement(XHTMLContentHandler.XHTML, "tr", "tr", EMPTY_ATTRIBUTES);
90 for (int i = 1; i <= meta.getColumnCount(); i++) {
91 handler.startElement(XHTMLContentHandler.XHTML, "td", "td", EMPTY_ATTRIBUTES);
92 handleCell(meta, i, handler, context);
93 handler.endElement(XHTMLContentHandler.XHTML, "td", "td");
94 }
95 handler.endElement(XHTMLContentHandler.XHTML, "tr", "tr");
96 } catch (SQLException e) {
97 throw new IOExceptionWithCause(e);
98 }
99 rows++;
100 return true;
101 }
102
103 private void handleCell(ResultSetMetaData rsmd, int i, ContentHandler handler, ParseContext context) throws SQLException, IOException, SAXException {
104 switch (rsmd.getColumnType(i)) {
105 case Types.BLOB:
106 handleBlob(tableName, rsmd.getColumnName(i), rows, results, i, handler, context);
107 break;
108 case Types.CLOB:
109 handleClob(tableName, rsmd.getColumnName(i), rows, results, i, handler, context);
110 break;
111 case Types.BOOLEAN:
112 handleBoolean(results.getBoolean(i), handler);
113 break;
114 case Types.DATE:
115 handleDate(results, i, handler);
116 break;
117 case Types.TIMESTAMP:
118 handleTimeStamp(results, i, handler);
119 break;
120 case Types.INTEGER:
121 handleInteger(rsmd.getColumnTypeName(i), results, i, handler);
122 break;
123 case Types.FLOAT:
124 //this is necessary to handle rounding issues in presentation
125 //Should we just use getString(i)?
126 addAllCharacters(Float.toString(results.getFloat(i)), handler);
127 break;
128 case Types.DOUBLE:
129 addAllCharacters(Double.toString(results.getDouble(i)), handler);
130 break;
131 default:
132 addAllCharacters(results.getString(i), handler);
133 break;
134 }
135 }
136
137 public List<String> getHeaders() throws IOException {
138 List<String> headers = new LinkedList<String>();
139 //lazy initialization
140 if (results == null) {
141 reset();
142 }
143 try {
144 ResultSetMetaData meta = results.getMetaData();
145 for (int i = 1; i <= meta.getColumnCount(); i++) {
146 headers.add(meta.getColumnName(i));
147 }
148 } catch (SQLException e) {
149 throw new IOExceptionWithCause(e);
150 }
151 return headers;
152 }
153
154 protected void handleInteger(String columnTypeName, ResultSet rs, int columnIndex, ContentHandler handler) throws SQLException, SAXException {
155 addAllCharacters(Integer.toString(rs.getInt(columnIndex)), handler);
156 }
157
158 private void handleBoolean(boolean aBoolean, ContentHandler handler) throws SAXException {
159 addAllCharacters(Boolean.toString(aBoolean), handler);
160 }
161
162
163 protected void handleClob(String tableName, String columnName, int rowNum,
164 ResultSet resultSet, int columnIndex,
165 ContentHandler handler, ParseContext context) throws SQLException, IOException, SAXException {
166 Clob clob = resultSet.getClob(columnIndex);
167 boolean truncated = clob.length() > Integer.MAX_VALUE || clob.length() > maxClobLength;
168
169 int readSize = (clob.length() < maxClobLength ? (int) clob.length() : maxClobLength);
170 Metadata m = new Metadata();
171 m.set(Database.TABLE_NAME, tableName);
172 m.set(Database.COLUMN_NAME, columnName);
173 m.set(Database.PREFIX + "ROW_NUM", Integer.toString(rowNum));
174 m.set(Database.PREFIX + "IS_CLOB", "true");
175 m.set(Database.PREFIX + "CLOB_LENGTH", Long.toString(clob.length()));
176 m.set(Database.PREFIX + "IS_CLOB_TRUNCATED", Boolean.toString(truncated));
177 m.set(Metadata.CONTENT_TYPE, "text/plain; charset=UTF-8");
178 m.set(Metadata.CONTENT_LENGTH, Integer.toString(readSize));
179 m.set(TikaMetadataKeys.RESOURCE_NAME_KEY,
180 //just in case something screwy is going on with the column name
181 FilenameUtils.normalize(FilenameUtils.getName(columnName + "_" + rowNum + ".txt")));
182
183
184 //is there a more efficient way to go from a Reader to an InputStream?
185 String s = clob.getSubString(1, readSize); //Clob positions are 1-based
186 EmbeddedDocumentExtractor ex = AbstractDBParser.getEmbeddedDocumentExtractor(context);
187 ex.parseEmbedded(new ByteArrayInputStream(s.getBytes("UTF-8")), handler, m, true);
188 }
189
190 protected void handleBlob(String tableName, String columnName, int rowNum, ResultSet resultSet, int columnIndex,
191 ContentHandler handler, ParseContext context) throws SQLException, IOException, SAXException {
192 Metadata m = new Metadata();
193 m.set(Database.TABLE_NAME, tableName);
194 m.set(Database.COLUMN_NAME, columnName);
195 m.set(Database.PREFIX + "ROW_NUM", Integer.toString(rowNum));
196 m.set(Database.PREFIX + "IS_BLOB", "true");
197 Blob blob = null;
198 InputStream is = null;
199 EmbeddedDocumentExtractor ex = AbstractDBParser.getEmbeddedDocumentExtractor(context);
200 try {
201 is = TikaInputStream.get(getInputStreamFromBlob(resultSet, columnIndex, blob, m));
202
203 AttributesImpl attrs = new AttributesImpl();
204 attrs.addAttribute("", "type", "type", "CDATA", "blob");
205 attrs.addAttribute("", "column_name", "column_name", "CDATA", columnName);
206 attrs.addAttribute("", "row_number", "row_number", "CDATA", Integer.toString(rowNum));
207 handler.startElement("", "span", "span", attrs);
208 MediaType mediaType = getDetector().detect(is, new Metadata());
209 String extension = "";
210 try {
211 MimeType mimeType = getMimeTypes().forName(mediaType.toString());
212 m.set(Metadata.CONTENT_TYPE, mimeType.toString());
213 extension = mimeType.getExtension();
214 } catch (MimeTypeException e) {
215 //swallow
216 }
217 m.set(TikaMetadataKeys.RESOURCE_NAME_KEY,
218 //just in case something screwy is going on with the column name
219 FilenameUtils.normalize(FilenameUtils.getName(columnName + "_" + rowNum + extension)));
220
221 ex.parseEmbedded(is, handler, m, true);
222
223 } finally {
224 if (blob != null) {
225 try {
226 blob.free();
227 } catch (SQLException e) {
228 //swallow
229 }
230 }
231 IOUtils.closeQuietly(is);
232 }
233 handler.endElement("", "span", "span");
234 }
235
236 protected InputStream getInputStreamFromBlob(ResultSet resultSet, int columnIndex, Blob blob, Metadata metadata) throws SQLException {
237 return TikaInputStream.get(blob, metadata);
238 }
239
240 protected void handleDate(ResultSet resultSet, int columnIndex, ContentHandler handler) throws SAXException, SQLException {
241 addAllCharacters(resultSet.getString(columnIndex), handler);
242 }
243
244 protected void handleTimeStamp(ResultSet resultSet, int columnIndex, ContentHandler handler) throws SAXException, SQLException {
245 addAllCharacters(resultSet.getString(columnIndex), handler);
246 }
247
248 protected void addAllCharacters(String s, ContentHandler handler) throws SAXException {
249 char[] chars = s.toCharArray();
250 handler.characters(chars, 0, chars.length);
251 }
252
253 void reset() throws IOException {
254
255 if (results != null) {
256 try {
257 results.close();
258 } catch (SQLException e) {
259 //swallow
260 }
261 }
262
263 String sql = "SELECT * FROM " + tableName;
264 try {
265 Statement st = connection.createStatement();
266 results = st.executeQuery(sql);
267 } catch (SQLException e) {
268 throw new IOExceptionWithCause(e);
269 }
270 rows = 0;
271 }
272
273 public String getTableName() {
274 return tableName;
275 }
276
277
278 protected TikaConfig getTikaConfig() {
279 if (tikaConfig == null) {
280 tikaConfig = TikaConfig.getDefaultConfig();
281 }
282 return tikaConfig;
283 }
284
285 protected Detector getDetector() {
286 if (detector != null) return detector;
287
288 detector = getTikaConfig().getDetector();
289 return detector;
290 }
291
292 protected MimeTypes getMimeTypes() {
293 if (mimeTypes != null) return mimeTypes;
294
295 mimeTypes = getTikaConfig().getMimeRepository();
296 return mimeTypes;
297 }
298
299 }
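handleClob above reads a bounded prefix of each CLOB before handing it to the embedded-document extractor. Note that java.sql.Clob positions are 1-based: position 1 is the first character. A small self-contained demonstration using the JDK's SerialClob:

```java
import java.sql.Clob;
import java.sql.SQLException;
import javax.sql.rowset.serial.SerialClob;

public class ClobPrefixDemo {
    // Clob.getSubString takes a 1-based start position: pos = 1 reads
    // from the first character of the CLOB.
    static String readPrefix(Clob clob, int maxLen) throws SQLException {
        int len = clob.length() < maxLen ? (int) clob.length() : maxLen;
        return clob.getSubString(1, len);
    }

    public static void main(String[] args) throws SQLException {
        Clob clob = new SerialClob("hello world".toCharArray());
        System.out.println(readPrefix(clob, 5)); // hello
    }
}
```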
0 package org.apache.tika.parser.jdbc;
1 /*
2 * Licensed to the Apache Software Foundation (ASF) under one or more
3 * contributor license agreements. See the NOTICE file distributed with
4 * this work for additional information regarding copyright ownership.
5 * The ASF licenses this file to You under the Apache License, Version 2.0
6 * (the "License"); you may not use this file except in compliance with
7 * the License. You may obtain a copy of the License at
8 *
9 * http://www.apache.org/licenses/LICENSE-2.0
10 *
11 * Unless required by applicable law or agreed to in writing, software
12 * distributed under the License is distributed on an "AS IS" BASIS,
13 * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
14 * See the License for the specific language governing permissions and
15 * limitations under the License.
16 */
17
18 import java.io.File;
19 import java.io.IOException;
20 import java.io.InputStream;
21 import java.sql.Connection;
22 import java.sql.ResultSet;
23 import java.sql.SQLException;
24 import java.sql.Statement;
25 import java.util.LinkedList;
26 import java.util.List;
27 import java.util.Set;
28
29 import org.apache.tika.io.IOExceptionWithCause;
30 import org.apache.tika.io.TikaInputStream;
31 import org.apache.tika.metadata.Metadata;
32 import org.apache.tika.mime.MediaType;
33 import org.apache.tika.parser.ParseContext;
34 import org.sqlite.SQLiteConfig;
35
36 /**
37 * This is the implementation of the db parser for SQLite.
38 * <p>
39 * This parser is internal only; it should not be registered in the services
40 * file or configured in the TikaConfig xml file.
41 */
42 class SQLite3DBParser extends AbstractDBParser {
43
44 protected static final String SQLITE_CLASS_NAME = "org.sqlite.JDBC";
45
46 /**
47 *
48 * @param context context
49 * @return null (always)
50 */
51 @Override
52 public Set<MediaType> getSupportedTypes(ParseContext context) {
53 return null;
54 }
55
56 @Override
57 protected Connection getConnection(InputStream stream, Metadata metadata, ParseContext context) throws IOException {
58 String connectionString = getConnectionString(stream, metadata, context);
59
60 Connection connection = null;
61 try {
62 Class.forName(getJDBCClassName());
63 } catch (ClassNotFoundException e) {
64 throw new IOExceptionWithCause(e);
65 }
66 try{
67 SQLiteConfig config = new SQLiteConfig();
68
69 //good habit, but effectively meaningless here
70 config.setReadOnly(true);
71 connection = config.createConnection(connectionString);
72
73 } catch (SQLException e) {
74 throw new IOExceptionWithCause(e);
75 }
76 return connection;
77 }
78
79 @Override
80 protected String getConnectionString(InputStream is, Metadata metadata, ParseContext context) throws IOException {
81 File dbFile = TikaInputStream.get(is).getFile();
82 return "jdbc:sqlite:"+dbFile.getAbsolutePath();
83 }
84
85 @Override
86 protected String getJDBCClassName() {
87 return SQLITE_CLASS_NAME;
88 }
89
90 @Override
91 protected List<String> getTableNames(Connection connection, Metadata metadata,
92 ParseContext context) throws SQLException {
93 List<String> tableNames = new LinkedList<String>();
94
95 Statement st = null;
96 try {
97 st = connection.createStatement();
98 String sql = "SELECT name FROM sqlite_master WHERE type='table'";
99 ResultSet rs = st.executeQuery(sql);
100
101 while (rs.next()) {
102 tableNames.add(rs.getString(1));
103 }
104 } finally {
105 if (st != null)
106 st.close();
107 }
108 return tableNames;
109 }
110
111 @Override
112 public JDBCTableReader getTableReader(Connection connection, String tableName, ParseContext context) {
113 return new SQLite3TableReader(connection, tableName, context);
114 }
115 }
0 package org.apache.tika.parser.jdbc;
1 /*
2 * Licensed to the Apache Software Foundation (ASF) under one or more
3 * contributor license agreements. See the NOTICE file distributed with
4 * this work for additional information regarding copyright ownership.
5 * The ASF licenses this file to You under the Apache License, Version 2.0
6 * (the "License"); you may not use this file except in compliance with
7 * the License. You may obtain a copy of the License at
8 *
9 * http://www.apache.org/licenses/LICENSE-2.0
10 *
11 * Unless required by applicable law or agreed to in writing, software
12 * distributed under the License is distributed on an "AS IS" BASIS,
13 * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
14 * See the License for the specific language governing permissions and
15 * limitations under the License.
16 */
17 import java.io.IOException;
18 import java.io.InputStream;
19 import java.util.Collections;
20 import java.util.Set;
21
22 import org.apache.tika.exception.TikaException;
23 import org.apache.tika.metadata.Metadata;
24 import org.apache.tika.mime.MediaType;
25 import org.apache.tika.parser.AbstractParser;
26 import org.apache.tika.parser.ParseContext;
27 import org.xml.sax.ContentHandler;
28 import org.xml.sax.SAXException;
29
30 /**
31 * This is the main class for parsing SQLite3 files. When {@link #parse} is called,
32 * this creates a new {@link org.apache.tika.parser.jdbc.SQLite3DBParser}.
33 * <p>
34 * Given potential conflicts of native libraries in web servers, users will
35 * need to add org.xerial's sqlite-jdbc jar to the class path for this parser
36 * to work. For development and testing, this jar is specified in tika-parsers'
37 * pom.xml, but it is currently set to "provided."
38 * <p>
39 * Note that this family of jdbc parsers is designed to treat each CLOB and each BLOB
40 * as embedded documents.
41 *
42 */
43 public class SQLite3Parser extends AbstractParser {
44 /** Serial version UID */
45 private static final long serialVersionUID = -752276948656079347L;
46
47 private static final MediaType MEDIA_TYPE = MediaType.application("x-sqlite3");
48
49 private final Set<MediaType> SUPPORTED_TYPES;
50
51 /**
52 * Checks to see if class is available for org.sqlite.JDBC.
53 * <p>
54 * If not, this class will return an EMPTY_SET for getSupportedTypes()
55 */
56 public SQLite3Parser() {
57 Set<MediaType> tmp;
58 try {
59 Class.forName(SQLite3DBParser.SQLITE_CLASS_NAME);
60 tmp = Collections.singleton(MEDIA_TYPE);
61 } catch (ClassNotFoundException e) {
62 tmp = Collections.emptySet();
63 }
64 SUPPORTED_TYPES = tmp;
65 }
66
67 @Override
68 public Set<MediaType> getSupportedTypes(ParseContext context) {
69 return SUPPORTED_TYPES;
70 }
71
72 @Override
73 public void parse(InputStream stream, ContentHandler handler, Metadata metadata, ParseContext context) throws IOException, SAXException, TikaException {
74 SQLite3DBParser p = new SQLite3DBParser();
75 p.parse(stream, handler, metadata, context);
76 }
77 }
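The constructor above advertises the x-sqlite3 media type only when the sqlite-jdbc driver can actually be loaded, so the parser silently disables itself when the jar is absent. That reflection-based capability check can be sketched generically (names below are illustrative):

```java
import java.util.Collections;
import java.util.Set;

public class DriverCheck {
    // Advertise support for a media type only if the given driver class
    // is on the class path; otherwise report an empty set, as
    // SQLite3Parser does in its constructor.
    static Set<String> supportedIf(String driverClass, String mediaType) {
        try {
            Class.forName(driverClass);
            return Collections.singleton(mediaType);
        } catch (ClassNotFoundException e) {
            return Collections.emptySet();
        }
    }

    public static void main(String[] args) {
        System.out.println(supportedIf("java.sql.DriverManager", "application/x-sqlite3"));
        System.out.println(supportedIf("no.such.Driver", "application/x-sqlite3"));
    }
}
```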
0 package org.apache.tika.parser.jdbc;
1 /*
2 * Licensed to the Apache Software Foundation (ASF) under one or more
3 * contributor license agreements. See the NOTICE file distributed with
4 * this work for additional information regarding copyright ownership.
5 * The ASF licenses this file to You under the Apache License, Version 2.0
6 * (the "License"); you may not use this file except in compliance with
7 * the License. You may obtain a copy of the License at
8 *
9 * http://www.apache.org/licenses/LICENSE-2.0
10 *
11 * Unless required by applicable law or agreed to in writing, software
12 * distributed under the License is distributed on an "AS IS" BASIS,
13 * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
14 * See the License for the specific language governing permissions and
15 * limitations under the License.
16 */
17
18 import java.io.IOException;
19 import java.io.InputStream;
20 import java.sql.Blob;
21 import java.sql.Connection;
22 import java.sql.ResultSet;
23 import java.sql.SQLException;
24 import java.text.DateFormat;
25 import java.text.SimpleDateFormat;
26 import java.util.Locale;
27
28 import org.apache.tika.io.TikaInputStream;
29 import org.apache.tika.metadata.Metadata;
30 import org.apache.tika.parser.ParseContext;
31 import org.xml.sax.ContentHandler;
32 import org.xml.sax.SAXException;
33
34
35
36 /**
37 * Concrete class for SQLite table parsing. This overrides
38 * column type handling from JDBCTableReader.
39 * <p>
40 * This class is not thread safe (because of DateFormat)!
41 * A new instance must be created for each parse, as AbstractDBParser does.
42 * <p>
43 * For now, this silently skips cells of type CLOB, because xerial's jdbc connector
44 * does not currently support them.
45 */
46 class SQLite3TableReader extends JDBCTableReader {
47
48
49 private final DateFormat dateFormat = new SimpleDateFormat("yyyy-MM-dd", Locale.ROOT);
50
51 public SQLite3TableReader(Connection connection, String tableName, ParseContext context) {
52 super(connection, tableName, context);
53 }
54
55
56 /**
57 * No-op for now in {@link SQLite3TableReader}.
58 *
59 * @param tableName
60 * @param fieldName
61 * @param rowNum
62 * @param resultSet
63 * @param columnIndex
64 * @param handler
65 * @param context
66 * @throws java.sql.SQLException
67 * @throws java.io.IOException
68 * @throws org.xml.sax.SAXException
69 */
70 @Override
71 protected void handleClob(String tableName, String fieldName, int rowNum,
72 ResultSet resultSet, int columnIndex,
73 ContentHandler handler, ParseContext context) throws SQLException, IOException, SAXException {
74 //no-op for now.
75 }
76
77 /**
78 * The JDBC driver for SQLite does not yet implement Blob; fall back to getBytes().
79 *
80 * @param resultSet resultSet
81 * @param columnIndex columnIndex for blob
82 * @return an InputStream over the raw blob bytes
83 * @throws java.sql.SQLException
84 */
85 @Override
86 protected InputStream getInputStreamFromBlob(ResultSet resultSet, int columnIndex, Blob blob, Metadata m) throws SQLException {
87 return TikaInputStream.get(resultSet.getBytes(columnIndex), m);
88 }
89
90 @Override
91 protected void handleInteger(String columnTypeName, ResultSet rs, int columnIndex,
92 ContentHandler handler) throws SQLException, SAXException {
93 //As of this writing, with xerial's sqlite jdbc connector, a timestamp is
94 //stored as a column of type Integer, but the columnTypeName is TIMESTAMP, and the
95 //value is a string representing a Long.
96 if (columnTypeName.equals("TIMESTAMP")) {
97 addAllCharacters(parseDateFromLongString(rs.getString(columnIndex)), handler);
98 } else {
99 addAllCharacters(Integer.toString(rs.getInt(columnIndex)), handler);
100 }
101
102 }
103
104 private String parseDateFromLongString(String longString) throws SAXException {
105 java.sql.Date d = new java.sql.Date(Long.parseLong(longString));
106 return dateFormat.format(d);
107
108 }
109 }
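As the comment in `handleInteger` notes, the xerial connector reports a TIMESTAMP column whose value is a string holding epoch milliseconds, which `parseDateFromLongString` formats with a `Locale.ROOT` SimpleDateFormat. A standalone sketch of that conversion (the time zone is pinned to UTC here only to make the output deterministic; the Tika code uses the default zone):

```java
import java.text.DateFormat;
import java.text.SimpleDateFormat;
import java.util.Date;
import java.util.Locale;
import java.util.TimeZone;

// Standalone version of parseDateFromLongString: epoch millis string -> yyyy-MM-dd
public class TimestampDemo {
    static String format(String longString) {
        DateFormat df = new SimpleDateFormat("yyyy-MM-dd", Locale.ROOT);
        df.setTimeZone(TimeZone.getTimeZone("UTC")); // pinned for determinism
        return df.format(new Date(Long.parseLong(longString)));
    }

    public static void main(String[] args) {
        System.out.println(format("0"));        // 1970-01-01
        System.out.println(format("86400000")); // 1970-01-02 (exactly one day later)
    }
}
```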
2525 import org.apache.james.mime4j.dom.address.AddressList;
2626 import org.apache.james.mime4j.dom.address.Mailbox;
2727 import org.apache.james.mime4j.dom.address.MailboxList;
28 import org.apache.james.mime4j.dom.field.*;
28 import org.apache.james.mime4j.dom.field.AddressListField;
29 import org.apache.james.mime4j.dom.field.DateTimeField;
30 import org.apache.james.mime4j.dom.field.MailboxListField;
31 import org.apache.james.mime4j.dom.field.ParsedField;
32 import org.apache.james.mime4j.dom.field.UnstructuredField;
2933 import org.apache.james.mime4j.field.LenientFieldParser;
3034 import org.apache.james.mime4j.parser.ContentHandler;
3135 import org.apache.james.mime4j.stream.BodyDescriptor;
3236 import org.apache.james.mime4j.stream.Field;
3337 import org.apache.tika.config.TikaConfig;
34 import org.apache.tika.exception.TikaException;
38 import org.apache.tika.extractor.EmbeddedDocumentExtractor;
39 import org.apache.tika.extractor.ParsingEmbeddedDocumentExtractor;
3540 import org.apache.tika.metadata.Metadata;
3641 import org.apache.tika.metadata.TikaCoreProperties;
3742 import org.apache.tika.parser.AutoDetectParser;
3843 import org.apache.tika.parser.ParseContext;
3944 import org.apache.tika.parser.Parser;
40 import org.apache.tika.sax.BodyContentHandler;
41 import org.apache.tika.sax.EmbeddedContentHandler;
4245 import org.apache.tika.sax.XHTMLContentHandler;
4346 import org.xml.sax.SAXException;
4447
5255 private boolean strictParsing = false;
5356
5457 private XHTMLContentHandler handler;
55 private ParseContext context;
5658 private Metadata metadata;
57 private TikaConfig tikaConfig = null;
59 private EmbeddedDocumentExtractor extractor;
5860
5961 private boolean inPart = false;
6062
6163 MailContentHandler(XHTMLContentHandler xhtml, Metadata metadata, ParseContext context, boolean strictParsing) {
6264 this.handler = xhtml;
63 this.context = context;
6465 this.metadata = metadata;
6566 this.strictParsing = strictParsing;
67
68 // Fetch / Build an EmbeddedDocumentExtractor with which
69 // to handle/process the parts/attachments
70
71 // Was an EmbeddedDocumentExtractor explicitly supplied?
72 this.extractor = context.get(EmbeddedDocumentExtractor.class);
73
74 // If there's no EmbeddedDocumentExtractor, then try using a normal parser
75 // This will ensure that the contents are made available to the user, so
76 // they see the text, but without fine-grained control/extraction
77 // (This also maintains backward compatibility with older versions!)
78 if (this.extractor == null) {
79 // If the user supplied a parser, use that; otherwise fall back to the default
80 Parser parser = context.get(AutoDetectParser.class);
81 if (parser == null) {
82 parser = context.get(Parser.class);
83 }
84 if (parser == null) {
85 TikaConfig tikaConfig = context.get(TikaConfig.class);
86 if (tikaConfig == null) {
87 tikaConfig = TikaConfig.getDefaultConfig();
88 }
89 parser = new AutoDetectParser(tikaConfig.getParser());
90 }
91 ParseContext ctx = new ParseContext();
92 ctx.set(Parser.class, parser);
93 extractor = new ParsingEmbeddedDocumentExtractor(ctx);
94 }
6695 }
6796
6897 public void body(BodyDescriptor body, InputStream is) throws MimeException,
6998 IOException {
70 // Work out the best underlying parser for the part
71 // Check first for a specified AutoDetectParser (which may have a
72 // specific Config), then a recursing parser, and finally the default
73 Parser parser = context.get(AutoDetectParser.class);
74 if (parser == null) {
75 parser = context.get(Parser.class);
76 }
77 if (parser == null) {
78 if (tikaConfig == null) {
79 tikaConfig = context.get(TikaConfig.class);
80 if (tikaConfig == null) {
81 tikaConfig = TikaConfig.getDefaultConfig();
82 }
83 }
84 parser = tikaConfig.getParser();
85 }
86
8799 // use a different metadata object
88100 // in order to specify the mime type of the
89101 // sub part without damaging the main metadata
93105 submd.set(Metadata.CONTENT_ENCODING, body.getCharset());
94106
95107 try {
96 BodyContentHandler bch = new BodyContentHandler(handler);
97 parser.parse(is, new EmbeddedContentHandler(bch), submd, context);
98 } catch (SAXException e) {
99 throw new MimeException(e);
100 } catch (TikaException e) {
108 if (extractor.shouldParseEmbedded(submd)) {
109 extractor.parseEmbedded(is, handler, submd, false);
110 }
111 } catch (SAXException e) {
101112 throw new MimeException(e);
102113 }
103114 }
140151 /**
141152 * Header for the whole message or its parts
142153 *
143 * @see http
144 * ://james.apache.org/mime4j/apidocs/org/apache/james/mime4j/parser/
154 * @see http://james.apache.org/mime4j/apidocs/org/apache/james/mime4j/parser/
145155 * Field.html
146156 **/
147157 public void field(Field field) throws MimeException {
2323 import java.util.Map;
2424
2525 import org.apache.tika.exception.TikaException;
26 import org.apache.tika.io.IOUtils;
2627 import org.apache.tika.io.TikaInputStream;
2728 import org.apache.tika.metadata.Metadata;
2829 import org.apache.tika.parser.AbstractParser;
8586 }
8687
8788 // Get endian indicator from header file
88 String endianBytes = new String(hdr.getEndianIndicator()); // Retrieve endian bytes and convert to string
89 String endianBytes = new String(hdr.getEndianIndicator(), IOUtils.UTF_8); // Retrieve endian bytes and convert to string
8990 String endianCode = String.valueOf(endianBytes.toCharArray()); // Convert bytes to characters to string
9091 metadata.set("endian", endianCode);
9192
166166 return; // ignore malformed header lines
167167 }
168168
169 String headerTag = headerMatcher.group(1).toLowerCase();
169 String headerTag = headerMatcher.group(1).toLowerCase(Locale.ROOT);
170170 String headerContent = headerMatcher.group(2);
171171
172172 if (headerTag.equalsIgnoreCase("From")) {
2424 import java.io.InputStream;
2525 import java.util.Set;
2626
27 import com.pff.PSTAttachment;
28 import com.pff.PSTFile;
29 import com.pff.PSTFolder;
30 import com.pff.PSTMessage;
2731 import org.apache.tika.exception.TikaException;
2832 import org.apache.tika.extractor.EmbeddedDocumentExtractor;
2933 import org.apache.tika.extractor.ParsingEmbeddedDocumentExtractor;
34 import org.apache.tika.io.IOUtils;
3035 import org.apache.tika.io.TemporaryResources;
3136 import org.apache.tika.io.TikaInputStream;
3237 import org.apache.tika.metadata.Metadata;
3944 import org.xml.sax.SAXException;
4045 import org.xml.sax.helpers.AttributesImpl;
4146
42 import com.pff.PSTAttachment;
43 import com.pff.PSTFile;
44 import com.pff.PSTFolder;
45 import com.pff.PSTMessage;
46
4747 /**
4848 * @author Tran Nam Quang
4949 * @author hong-thai.nguyen
7373 xhtml.startDocument();
7474
7575 TikaInputStream in = TikaInputStream.get(stream);
76 PSTFile pstFile = null;
7677 try {
77 PSTFile pstFile = new PSTFile(in.getFile().getPath());
78 pstFile = new PSTFile(in.getFile().getPath());
7879 metadata.set(Metadata.CONTENT_LENGTH, valueOf(pstFile.getFileHandle().length()));
7980 boolean isValid = pstFile.getFileHandle().getFD().valid();
8081 metadata.set("isValid", valueOf(isValid));
8384 }
8485 } catch (Exception e) {
8586 throw new TikaException(e.getMessage(), e);
87 } finally {
88 if (pstFile != null && pstFile.getFileHandle() != null) {
89 try {
90 pstFile.getFileHandle().close();
91 } catch (IOException e) {
92 //swallow closing exception
93 }
94 }
8695 }
8796
8897 xhtml.endDocument();
139148 mailMetadata.set("priority", valueOf(pstMail.getPriority()));
140149 mailMetadata.set("flagged", valueOf(pstMail.isFlagged()));
141150
142 byte[] mailContent = pstMail.getBody().getBytes();
151 byte[] mailContent = pstMail.getBody().getBytes(IOUtils.UTF_8);
143152 embeddedExtractor.parseEmbedded(new ByteArrayInputStream(mailContent), handler, mailMetadata, true);
144153 }
145154
3939 import org.apache.tika.mime.MimeTypeException;
4040 import org.apache.tika.mime.MimeTypes;
4141 import org.apache.tika.parser.ParseContext;
42 import org.apache.tika.parser.PasswordProvider;
4243 import org.apache.tika.parser.microsoft.OfficeParser.POIFSDocumentType;
4344 import org.apache.tika.parser.pkg.ZipContainerDetector;
4445 import org.apache.tika.sax.XHTMLContentHandler;
4647
4748 abstract class AbstractPOIFSExtractor {
4849 private final EmbeddedDocumentExtractor extractor;
50 private PasswordProvider passwordProvider;
4951 private TikaConfig tikaConfig;
5052 private MimeTypes mimeTypes;
5153 private Detector detector;
54 private Metadata metadata;
5255 private static final Log logger = LogFactory.getLog(AbstractPOIFSExtractor.class);
5356
5457 protected AbstractPOIFSExtractor(ParseContext context) {
58 this(context, null);
59 }
60 protected AbstractPOIFSExtractor(ParseContext context, Metadata metadata) {
5561 EmbeddedDocumentExtractor ex = context.get(EmbeddedDocumentExtractor.class);
5662
5763 if (ex==null) {
6066 this.extractor = ex;
6167 }
6268
63 tikaConfig = context.get(TikaConfig.class);
64 mimeTypes = context.get(MimeTypes.class);
65 detector = context.get(Detector.class);
69 this.passwordProvider = context.get(PasswordProvider.class);
70 this.tikaConfig = context.get(TikaConfig.class);
71 this.mimeTypes = context.get(MimeTypes.class);
72 this.detector = context.get(Detector.class);
73 this.metadata = metadata;
6674 }
6775
6876 // Note - these cache, but avoid creating the default TikaConfig if not needed
8391
8492 mimeTypes = getTikaConfig().getMimeRepository();
8593 return mimeTypes;
94 }
95
96 /**
97 * Returns the password to be used for this file, or null
98 * if no / default password should be used
99 */
100 protected String getPassword() {
101 if (passwordProvider != null) {
102 return passwordProvider.getPassword(metadata);
103 }
104 return null;
86105 }
87106
88107 protected void handleEmbeddedResource(TikaInputStream resource, String filename,
148167 try {
149168 // Try to un-wrap the OLE10Native record:
150169 Ole10Native ole = Ole10Native.createFromEmbeddedOleObject((DirectoryNode)dir);
151 metadata.set(Metadata.RESOURCE_NAME_KEY, dir.getName() + '/' + ole.getLabel());
152
170 if (ole.getLabel() != null) {
171 metadata.set(Metadata.RESOURCE_NAME_KEY, dir.getName() + '/' + ole.getLabel());
172 }
153173 byte[] data = ole.getDataBuffer();
154174 embedded = TikaInputStream.get(data);
155175 } catch (Ole10NativeException ex) {
3333 import org.apache.poi.hssf.eventusermodel.HSSFEventFactory;
3434 import org.apache.poi.hssf.eventusermodel.HSSFListener;
3535 import org.apache.poi.hssf.eventusermodel.HSSFRequest;
36 import org.apache.poi.hssf.extractor.OldExcelExtractor;
3637 import org.apache.poi.hssf.record.BOFRecord;
3738 import org.apache.poi.hssf.record.BoundSheetRecord;
3839 import org.apache.poi.hssf.record.CellValueRecordInterface;
5455 import org.apache.poi.hssf.record.TextObjectRecord;
5556 import org.apache.poi.hssf.record.chart.SeriesTextRecord;
5657 import org.apache.poi.hssf.record.common.UnicodeString;
58 import org.apache.poi.hssf.record.crypto.Biff8EncryptionKey;
5759 import org.apache.poi.hssf.usermodel.HSSFPictureData;
5860 import org.apache.poi.poifs.filesystem.DirectoryEntry;
5961 import org.apache.poi.poifs.filesystem.DirectoryNode;
6365 import org.apache.tika.exception.EncryptedDocumentException;
6466 import org.apache.tika.exception.TikaException;
6567 import org.apache.tika.io.TikaInputStream;
68 import org.apache.tika.metadata.Metadata;
6669 import org.apache.tika.parser.ParseContext;
6770 import org.apache.tika.sax.XHTMLContentHandler;
6871 import org.xml.sax.SAXException;
9598 private boolean listenForAllRecords = false;
9699
97100 private static final String WORKBOOK_ENTRY = "Workbook";
98
99 public ExcelExtractor(ParseContext context) {
100 super(context);
101 private static final String BOOK_ENTRY = "Book";
102
103 public ExcelExtractor(ParseContext context, Metadata metadata) {
104 super(context, metadata);
101105 }
102106
103107 /**
142146 DirectoryNode root, XHTMLContentHandler xhtml,
143147 Locale locale) throws IOException, SAXException, TikaException {
144148 if (! root.hasEntry(WORKBOOK_ENTRY)) {
145 // Corrupt file / very old file, just skip
146 return;
147 }
149 if (root.hasEntry(BOOK_ENTRY)) {
150 // Excel 5 / Excel 95 file
151 // Records are in a different structure so needs a
152 // different parser to process them
153 OldExcelExtractor extractor = new OldExcelExtractor(root);
154 OldExcelParser.parse(extractor, xhtml);
155 return;
156 } else {
157 // Corrupt file / very old file, just skip text extraction
158 return;
159 }
160 }
161
162 // If a password was supplied, use it, otherwise the default
163 Biff8EncryptionKey.setCurrentUserPassword(getPassword());
148164
165 // Have the file processed in event mode
149166 TikaHSSFListener listener = new TikaHSSFListener(xhtml, locale, this);
150167 listener.processFile(root, isListenForAllRecords());
151168 listener.throwStoredException();
609626 }
610627
611628 }
612
613629 }
1919 import java.util.HashSet;
2020
2121 import org.apache.poi.hslf.HSLFSlideShow;
22 import org.apache.poi.hslf.model.*;
22 import org.apache.poi.hslf.model.Comment;
23 import org.apache.poi.hslf.model.HeadersFooters;
24 import org.apache.poi.hslf.model.MasterSheet;
25 import org.apache.poi.hslf.model.Notes;
26 import org.apache.poi.hslf.model.OLEShape;
27 import org.apache.poi.hslf.model.Picture;
28 import org.apache.poi.hslf.model.Shape;
29 import org.apache.poi.hslf.model.Slide;
30 import org.apache.poi.hslf.model.Table;
31 import org.apache.poi.hslf.model.TableCell;
32 import org.apache.poi.hslf.model.TextRun;
33 import org.apache.poi.hslf.model.TextShape;
2334 import org.apache.poi.hslf.usermodel.ObjectData;
2435 import org.apache.poi.hslf.usermodel.PictureData;
2536 import org.apache.poi.hslf.usermodel.SlideShow;
157157 root = ((NPOIFSFileSystem) container).getRoot();
158158 } else if (container instanceof DirectoryNode) {
159159 root = (DirectoryNode) container;
160 } else if (tstream.hasFile()) {
161 root = new NPOIFSFileSystem(tstream.getFileChannel()).getRoot();
162160 } else {
163 root = new NPOIFSFileSystem(new CloseShieldInputStream(tstream)).getRoot();
161 NPOIFSFileSystem fs;
162 if (tstream.hasFile()) {
163 fs = new NPOIFSFileSystem(tstream.getFile(), true);
164 } else {
165 fs = new NPOIFSFileSystem(new CloseShieldInputStream(tstream));
166 }
167 tstream.setOpenContainer(fs);
168 root = fs.getRoot();
164169 }
165170 }
166171 parse(root, context, metadata, xhtml);
183188
184189 switch (type) {
185190 case SOLIDWORKS_PART:
186 // new SolidworksExtractor(context).parse(root, xhtml);
187 break;
188191 case SOLIDWORKS_ASSEMBLY:
189 break;
190192 case SOLIDWORKS_DRAWING:
191193 break;
192194 case PUBLISHER:
203205 case WORKBOOK:
204206 case XLR:
205207 Locale locale = context.get(Locale.class, Locale.getDefault());
206 new ExcelExtractor(context).parse(root, xhtml, locale);
208 new ExcelExtractor(context, metadata).parse(root, xhtml, locale);
207209 break;
208210 case PROJECT:
209211 // We currently can't do anything beyond the metadata
254256 } catch (GeneralSecurityException ex) {
255257 throw new EncryptedDocumentException(ex);
256258 }
259 default:
260 // For unsupported / unhandled types, just the metadata
261 // is extracted, which happened above
262 break;
257263 }
258264 }
259265
0 /*
1 * Licensed to the Apache Software Foundation (ASF) under one or more
2 * contributor license agreements. See the NOTICE file distributed with
3 * this work for additional information regarding copyright ownership.
4 * The ASF licenses this file to You under the Apache License, Version 2.0
5 * (the "License"); you may not use this file except in compliance with
6 * the License. You may obtain a copy of the License at
7 *
8 * http://www.apache.org/licenses/LICENSE-2.0
9 *
10 * Unless required by applicable law or agreed to in writing, software
11 * distributed under the License is distributed on an "AS IS" BASIS,
12 * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
13 * See the License for the specific language governing permissions and
14 * limitations under the License.
15 */
16 package org.apache.tika.parser.microsoft;
17
18 import java.io.BufferedReader;
19 import java.io.IOException;
20 import java.io.InputStream;
21 import java.io.StringReader;
22 import java.util.Arrays;
23 import java.util.Collections;
24 import java.util.HashSet;
25 import java.util.Set;
26
27 import org.apache.poi.hssf.extractor.OldExcelExtractor;
28 import org.apache.tika.exception.TikaException;
29 import org.apache.tika.metadata.Metadata;
30 import org.apache.tika.mime.MediaType;
31 import org.apache.tika.parser.AbstractParser;
32 import org.apache.tika.parser.ParseContext;
33 import org.apache.tika.sax.XHTMLContentHandler;
34 import org.xml.sax.ContentHandler;
35 import org.xml.sax.SAXException;
36
37 /**
38 * A POI-powered Tika Parser for very old versions of Excel, from
39 * pre-OLE2 days, such as Excel 4.
40 */
41 public class OldExcelParser extends AbstractParser {
42 private static final long serialVersionUID = 4611820730372823452L;
43
44 private static final Set<MediaType> SUPPORTED_TYPES =
45 Collections.unmodifiableSet(new HashSet<MediaType>(Arrays.asList(
46 MediaType.application("vnd.ms-excel.sheet.4"),
47 MediaType.application("vnd.ms-excel.workspace.4"),
48 MediaType.application("vnd.ms-excel.sheet.3"),
49 MediaType.application("vnd.ms-excel.workspace.3"),
50 MediaType.application("vnd.ms-excel.sheet.2")
51 )));
52
53 public Set<MediaType> getSupportedTypes(ParseContext context) {
54 return SUPPORTED_TYPES;
55 }
56
57 /**
58 * Extracts properties and text from an MS Document input stream
59 */
60 public void parse(
61 InputStream stream, ContentHandler handler,
62 Metadata metadata, ParseContext context)
63 throws IOException, SAXException, TikaException {
64 // Open the POI provided extractor
65 OldExcelExtractor extractor = new OldExcelExtractor(stream);
66
67 // We can't do anything about metadata, as these old formats
68 // didn't have any stored with them
69
70 // Set the content type
71 // TODO Get the version and type, to set as the Content Type
72
73 // Have the text extracted and given to our Content Handler
74 XHTMLContentHandler xhtml = new XHTMLContentHandler(handler, metadata);
75 parse(extractor, xhtml);
76 }
77
78 protected static void parse(OldExcelExtractor extractor,
79 XHTMLContentHandler xhtml) throws TikaException, IOException, SAXException {
80 // Get the whole text, as a single string
81 String text = extractor.getText();
82
83 // Split and output
84 xhtml.startDocument();
85
86 String line;
87 BufferedReader reader = new BufferedReader(new StringReader(text));
88 while ((line = reader.readLine()) != null) {
89 xhtml.startElement("p");
90 xhtml.characters(line);
91 xhtml.endElement("p");
92 }
93
94 xhtml.endDocument();
95 }
96 }
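The static `parse` helper above splits the extracted text into lines and wraps each one in a `p` element. That splitting step can be exercised in isolation; the `LineSplit` helper below is an illustrative sketch, not part of Tika:

```java
import java.io.BufferedReader;
import java.io.IOException;
import java.io.StringReader;
import java.util.ArrayList;
import java.util.List;

// Line-by-line splitting as done in OldExcelParser.parse, with each
// line wrapped in a paragraph element.
public class LineSplit {
    static List<String> paragraphs(String text) {
        List<String> out = new ArrayList<String>();
        BufferedReader reader = new BufferedReader(new StringReader(text));
        String line;
        try {
            while ((line = reader.readLine()) != null) {
                out.add("<p>" + line + "</p>");
            }
        } catch (IOException e) {
            throw new IllegalStateException(e); // cannot happen for a StringReader
        }
        return out;
    }

    public static void main(String[] args) {
        System.out.println(paragraphs("Sheet1\nA1\tB1"));
    }
}
```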
1919 import java.io.IOException;
2020 import java.text.ParseException;
2121 import java.util.Date;
22 import java.util.Locale;
2223
2324 import org.apache.poi.hmef.attribute.MAPIRtfAttribute;
2425 import org.apache.poi.hsmf.MAPIMessage;
125126 String[] headers = msg.getHeaders();
126127 if(headers != null && headers.length > 0) {
127128 for(String header: headers) {
128 if(header.toLowerCase().startsWith("date:")) {
129 if(header.toLowerCase(Locale.ROOT).startsWith("date:")) {
129130 String date = header.substring(header.indexOf(':')+1).trim();
130131
131132 // See if we can parse it as a normal mail date
1717
1818 import static org.apache.tika.mime.MediaType.application;
1919
20 import java.io.File;
2021 import java.io.IOException;
2122 import java.io.InputStream;
22 import java.nio.channels.FileChannel;
2323 import java.util.Collections;
2424 import java.util.HashSet;
2525 import java.util.Set;
374374 throws IOException {
375375 // Force the document stream to a (possibly temporary) file
376376 // so we don't modify the current position of the stream
377 FileChannel channel = stream.getFileChannel();
377 File file = stream.getFile();
378378
379379 try {
380 NPOIFSFileSystem fs = new NPOIFSFileSystem(channel);
380 NPOIFSFileSystem fs = new NPOIFSFileSystem(file, true);
381381
382382 // Optimize a possible later parsing process by keeping
383383 // a reference to the already opened POI file system
2121 import java.util.HashMap;
2222 import java.util.HashSet;
2323 import java.util.List;
24 import java.util.Locale;
2425 import java.util.Map;
2526 import java.util.Set;
2627
4546 import org.apache.poi.poifs.filesystem.Entry;
4647 import org.apache.poi.poifs.filesystem.NPOIFSFileSystem;
4748 import org.apache.tika.exception.TikaException;
49 import org.apache.tika.io.IOUtils;
4850 import org.apache.tika.io.TikaInputStream;
4951 import org.apache.tika.parser.ParseContext;
5052 import org.apache.tika.sax.XHTMLContentHandler;
232234 CharacterRun cr = p.getCharacterRun(j);
233235
234236 // FIELD_BEGIN_MARK:
235 if (cr.text().getBytes()[0] == 0x13) {
237 if (cr.text().getBytes(IOUtils.UTF_8)[0] == 0x13) {
236238 Field field = document.getFields().getFieldByStartOffset(docPart, cr.getStartOffset());
237239 // 58 is an embedded document
238240 // 56 is a document link
402404
403405 if((text.startsWith("HYPERLINK") || text.startsWith(" HYPERLINK"))
404406 && text.indexOf('"') > -1) {
405 String url = text.substring(
406 text.indexOf('"') + 1,
407 text.lastIndexOf('"')
408 );
407 int start = text.indexOf('"') + 1;
408 int end = findHyperlinkEnd(text, start);
409 String url = "";
410 if (start >= 0 && start < end && end <= text.length()) {
411 url = text.substring(start, end);
412 }
413
409414 xhtml.startElement("a", "href", url);
410415 for(CharacterRun cr : texts) {
411416 handleCharacterRun(cr, skipStyling, xhtml);
432437
433438 // Tell them how many to skip over
434439 return i-index;
440 }
441
442 //temporary work around for TIKA-1512
443 private int findHyperlinkEnd(String text, int start) {
444 int end = text.lastIndexOf('"');
445 if (end > start) {
446 return end;
447 }
448 end = text.lastIndexOf('\u201D');//smart right double quote
449 if (end > start) {
450 return end;
451 }
452 end = text.lastIndexOf('\r');
453 if (end > start) {
454 return end;
455 }
456 //if nothing so far, take the full length of the string
457 //If the full string is > 256 characters, it appears
458 //that the url is truncated in the .doc file. This
459 //will return the value as it is in the file, which
460 //may be incorrect; but it is the same behavior as opening
461 //the link in MSWord.
462 //This code does not currently check that the length is actually >= 256;
463 //we might want to add that.
464 return text.length();
435465 }
436466
437467 private void handlePictureCharacterRun(CharacterRun cr, Picture picture, PicturesSource pictures, XHTMLContentHandler xhtml)
547577 tag = "h" + Math.min(num, 6);
548578 } else {
549579 styleClass = styleName.replace(' ', '_');
550 styleClass = styleClass.substring(0,1).toLowerCase() +
580 styleClass = styleClass.substring(0,1).toLowerCase(Locale.ROOT) +
551581 styleClass.substring(1);
552582 }
553583
135135 private void handleThumbnail( ContentHandler handler ) {
136136 try {
137137 OPCPackage opcPackage = extractor.getPackage();
138 int thumbIndex = 0;
139138 for (PackageRelationship rel : opcPackage.getRelationshipsByType( PackageRelationshipTypes.THUMBNAIL )) {
140139 PackagePart tPart = opcPackage.getPart(rel);
141140 InputStream tStream = tPart.getInputStream();
142141 Metadata thumbnailMetadata = new Metadata();
143 String thumbName = "thumbnail_" + thumbIndex + "." + tPart.getPartName().getExtension();
142 String thumbName = tPart.getPartName().getName();
144143 thumbnailMetadata.set(Metadata.RESOURCE_NAME_KEY, thumbName);
145144
146145 AttributesImpl attributes = new AttributesImpl();
154153 thumbnailMetadata.set(TikaCoreProperties.TITLE, tPart.getPartName().getName());
155154
156155 if (embeddedExtractor.shouldParseEmbedded(thumbnailMetadata)) {
157 embeddedExtractor.parseEmbedded(TikaInputStream.get(tStream), new EmbeddedContentHandler(handler), thumbnailMetadata, true);
156 embeddedExtractor.parseEmbedded(TikaInputStream.get(tStream), new EmbeddedContentHandler(handler), thumbnailMetadata, false);
158157 }
159158
160159 tStream.close();
161 thumbIndex ++;
162160 }
163161 } catch (Exception ex) {
164162
216214 private void handleEmbeddedOLE(PackagePart part, ContentHandler handler, String rel)
217215 throws IOException, SAXException {
218216 // A POIFSFileSystem needs to be at least 3 blocks big to be valid
219 // TODO: TIKA-1118 Upgrade to POI 4.0 then enable this block of code
220 // if (part.getSize() >= 0 && part.getSize() < 512*3) {
221 // // Too small, skip
222 // return;
223 // }
217 if (part.getSize() >= 0 && part.getSize() < 512*3) {
218 // Too small, skip
219 return;
220 }
224221
225222 // Open the POIFS (OLE2) structure and process
226223 POIFSFileSystem fs = new POIFSFileSystem(part.getInputStream());
248245 // TIKA-704: OLE 1.0 embedded document
249246 Ole10Native ole =
250247 Ole10Native.createFromEmbeddedOleObject(fs);
251 metadata.set(Metadata.RESOURCE_NAME_KEY, ole.getLabel());
248 if (ole.getLabel() != null) {
249 metadata.set(Metadata.RESOURCE_NAME_KEY, ole.getLabel());
250 }
252251 byte[] data = ole.getDataBuffer();
253252 if (data != null) {
254253 stream = TikaInputStream.get(data);
158158 Metadata metadata) {
159159 org.openxmlformats.schemas.officeDocument.x2006.customProperties.CTProperties
160160 props = properties.getUnderlyingProperties();
161
162 for(CTProperty property : props.getPropertyList()) {
161 for (int i = 0; i < props.sizeOfPropertyArray(); i++) {
162 CTProperty property = props.getPropertyArray(i);
163163 String val = null;
164164 Date date = null;
165165
1515 */
1616 package org.apache.tika.parser.microsoft.ooxml;
1717
18 import javax.xml.namespace.QName;
1819 import java.io.IOException;
1920 import java.util.ArrayList;
2021 import java.util.List;
21 import javax.xml.namespace.QName;
2222
2323 import org.apache.poi.openxml4j.exceptions.InvalidFormatException;
2424 import org.apache.poi.openxml4j.opc.PackagePart;
4949 import org.apache.xmlbeans.XmlObject;
5050 import org.openxmlformats.schemas.presentationml.x2006.main.CTComment;
5151 import org.openxmlformats.schemas.presentationml.x2006.main.CTPicture;
52 import org.openxmlformats.schemas.presentationml.x2006.main.CTSlideIdList;
5253 import org.openxmlformats.schemas.presentationml.x2006.main.CTSlideIdListEntry;
5354 import org.xml.sax.SAXException;
5455 import org.xml.sax.helpers.AttributesImpl;
9899 // comments (if present)
99100 XSLFComments comments = slide.getComments();
100101 if (comments != null) {
101 for (CTComment comment : comments.getCTCommentsList().getCmList()) {
102 for (int i = 0; i < comments.getNumberOfComments(); i++) {
103 CTComment comment = comments.getCommentAt(i);
102104 xhtml.element("p", comment.getText());
103105 }
104106 }
180182 } catch(Exception e) {
181183 throw new TikaException(e.getMessage()); // Shouldn't happen
182184 }
183
184 for (CTSlideIdListEntry ctSlide : document.getSlideReferences().getSldIdList()) {
185 // Add the slide
186 PackagePart slidePart;
187 try {
188 slidePart = document.getSlidePart(ctSlide);
189 } catch(IOException e) {
190 throw new TikaException("Broken OOXML file", e);
191 } catch(XmlException xe) {
192 throw new TikaException("Broken OOXML file", xe);
193 }
194 parts.add(slidePart);
195
196 // If it has drawings, return those too
197 try {
198 for(PackageRelationship rel : slidePart.getRelationshipsByType(XSLFRelation.VML_DRAWING.getRelation())) {
199 if(rel.getTargetMode() == TargetMode.INTERNAL) {
200 PackagePartName relName = PackagingURIHelper.createPartName(rel.getTargetURI());
201 parts.add( rel.getPackage().getPart(relName) );
202 }
203 }
204 } catch(InvalidFormatException e) {
205 throw new TikaException("Broken OOXML file", e);
206 }
185
186 CTSlideIdList ctSlideIdList = document.getSlideReferences();
187 if (ctSlideIdList != null) {
188 for (int i = 0; i < ctSlideIdList.sizeOfSldIdArray(); i++) {
189 CTSlideIdListEntry ctSlide = ctSlideIdList.getSldIdArray(i);
190 // Add the slide
191 PackagePart slidePart;
192 try {
193 slidePart = document.getSlidePart(ctSlide);
194 } catch (IOException e) {
195 throw new TikaException("Broken OOXML file", e);
196 } catch (XmlException xe) {
197 throw new TikaException("Broken OOXML file", xe);
198 }
199 parts.add(slidePart);
200
201 // If it has drawings, return those too
202 try {
203 for (PackageRelationship rel : slidePart.getRelationshipsByType(XSLFRelation.VML_DRAWING.getRelation())) {
204 if (rel.getTargetMode() == TargetMode.INTERNAL) {
205 PackagePartName relName = PackagingURIHelper.createPartName(rel.getTargetURI());
206 parts.add(rel.getPackage().getPart(relName));
207 }
208 }
209 } catch (InvalidFormatException e) {
210 throw new TikaException("Broken OOXML file", e);
211 }
212 }
207213 }
208
209214 return parts;
210215 }
211216 }
120120 InputStream stream = iter.next();
121121 sheetParts.add(iter.getSheetPart());
122122
123 SheetTextAsHTML sheetExtractor = new SheetTextAsHTML(xhtml, iter.getSheetComments());
123 SheetTextAsHTML sheetExtractor = new SheetTextAsHTML(xhtml);
124 CommentsTable comments = iter.getSheetComments();
124125
125126 // Start, and output the sheet name
126127 xhtml.startElement("div");
130131 xhtml.startElement("table");
131132 xhtml.startElement("tbody");
132133
133 processSheet(sheetExtractor, styles, strings, stream);
134 processSheet(sheetExtractor, comments, styles, strings, stream);
134135
135136 xhtml.endElement("tbody");
136137 xhtml.endElement("table");
175176
176177 public void processSheet(
177178 SheetContentsHandler sheetContentsExtractor,
179 CommentsTable comments,
178180 StylesTable styles,
179181 ReadOnlySharedStringsTable strings,
180182 InputStream sheetInputStream)
186188 XMLReader sheetParser = saxParser.getXMLReader();
187189 XSSFSheetInterestingPartsCapturer handler =
188190 new XSSFSheetInterestingPartsCapturer(new XSSFSheetXMLHandler(
189 styles, strings, sheetContentsExtractor, formatter, false));
191 styles, comments, strings, sheetContentsExtractor, formatter, false));
190192 sheetParser.setContentHandler(handler);
191193 sheetParser.parse(sheetSource);
192194 sheetInputStream.close();
204206 */
205207 protected static class SheetTextAsHTML implements SheetContentsHandler {
206208 private XHTMLContentHandler xhtml;
207 private CommentsTable comments;
208209 private List<String> headers;
209210 private List<String> footers;
210211
211 protected SheetTextAsHTML(XHTMLContentHandler xhtml, CommentsTable comments) {
212 protected SheetTextAsHTML(XHTMLContentHandler xhtml) {
212213 this.xhtml = xhtml;
213 this.comments = comments;
214214 headers = new ArrayList<String>();
215215 footers = new ArrayList<String>();
216216 }
221221 } catch(SAXException e) {}
222222 }
223223
224 public void endRow() {
224 public void endRow(int rowNum) {
225225 try {
226226 xhtml.endElement("tr");
227227 } catch(SAXException e) {}
228228 }
229229
230 public void cell(String cellRef, String formattedValue) {
230 public void cell(String cellRef, String formattedValue, XSSFComment comment) {
231231 try {
232232 xhtml.startElement("td");
233233
234234 // Main cell contents
235 xhtml.characters(formattedValue);
235 if (formattedValue != null) {
236 xhtml.characters(formattedValue);
237 }
236238
237239 // Comments
238 if(comments != null) {
239 XSSFComment comment = comments.findCellComment(cellRef);
240 if(comment != null) {
241 xhtml.startElement("br");
242 xhtml.endElement("br");
243 xhtml.characters(comment.getAuthor());
244 xhtml.characters(": ");
245 xhtml.characters(comment.getString().getString());
246 }
240 if(comment != null) {
241 xhtml.startElement("br");
242 xhtml.endElement("br");
243 xhtml.characters(comment.getAuthor());
244 xhtml.characters(": ");
245 xhtml.characters(comment.getString().getString());
247246 }
248247
249248 xhtml.endElement("td");
1515 */
1616 package org.apache.tika.parser.microsoft.ooxml;
1717
18 import javax.xml.namespace.QName;
1819 import java.io.IOException;
1920 import java.util.ArrayList;
2021 import java.util.List;
21 import javax.xml.namespace.QName;
2222
2323 import org.apache.poi.openxml4j.opc.PackagePart;
2424 import org.apache.poi.xwpf.extractor.XWPFWordExtractor;
2727 import org.apache.poi.xwpf.usermodel.BodyType;
2828 import org.apache.poi.xwpf.usermodel.IBody;
2929 import org.apache.poi.xwpf.usermodel.IBodyElement;
30 import org.apache.poi.xwpf.usermodel.ICell;
3031 import org.apache.poi.xwpf.usermodel.IRunElement;
32 import org.apache.poi.xwpf.usermodel.ISDTContent;
3133 import org.apache.poi.xwpf.usermodel.XWPFDocument;
3234 import org.apache.poi.xwpf.usermodel.XWPFHeaderFooter;
3335 import org.apache.poi.xwpf.usermodel.XWPFHyperlink;
3739 import org.apache.poi.xwpf.usermodel.XWPFPictureData;
3840 import org.apache.poi.xwpf.usermodel.XWPFRun;
3941 import org.apache.poi.xwpf.usermodel.XWPFSDT;
40 import org.apache.poi.xwpf.usermodel.XWPFSDTContent;
42 import org.apache.poi.xwpf.usermodel.XWPFSDTCell;
4143 import org.apache.poi.xwpf.usermodel.XWPFStyle;
4244 import org.apache.poi.xwpf.usermodel.XWPFStyles;
4345 import org.apache.poi.xwpf.usermodel.XWPFTable;
4446 import org.apache.poi.xwpf.usermodel.XWPFTableCell;
4547 import org.apache.poi.xwpf.usermodel.XWPFTableRow;
4648 import org.apache.tika.parser.ParseContext;
49 import org.apache.tika.parser.microsoft.WordExtractor;
4750 import org.apache.tika.parser.microsoft.WordExtractor.TagAndStyle;
48 import org.apache.tika.parser.microsoft.WordExtractor;
4951 import org.apache.tika.sax.XHTMLContentHandler;
5052 import org.apache.xmlbeans.XmlCursor;
5153 import org.apache.xmlbeans.XmlException;
5254 import org.apache.xmlbeans.XmlObject;
5355 import org.openxmlformats.schemas.wordprocessingml.x2006.main.CTBookmark;
5456 import org.openxmlformats.schemas.wordprocessingml.x2006.main.CTObject;
57 import org.openxmlformats.schemas.wordprocessingml.x2006.main.CTP;
5558 import org.openxmlformats.schemas.wordprocessingml.x2006.main.CTSectPr;
56 import org.openxmlformats.schemas.wordprocessingml.x2006.main.CTP;
5759 import org.xml.sax.SAXException;
5860 import org.xml.sax.helpers.AttributesImpl;
5961
110112
111113 private void extractSDT(XWPFSDT element, XHTMLContentHandler xhtml) throws SAXException,
112114 XmlException, IOException {
113 XWPFSDTContent content = element.getContent();
115 ISDTContent content = element.getContent();
114116 String tag = "p";
115117 xhtml.startElement(tag);
116118 xhtml.characters(content.getText());
193195 // Attach bookmarks for the paragraph
194196 // (In future, we might put them in the right place, for now
195197 // we just put them in the correct paragraph)
196 for (CTBookmark bookmark : paragraph.getCTP().getBookmarkStartList()) {
198 for (int i = 0; i < paragraph.getCTP().sizeOfBookmarkStartArray(); i++) {
199 CTBookmark bookmark = paragraph.getCTP().getBookmarkStartArray(i);
197200 xhtml.startElement("a", "name", bookmark.getName());
198201 xhtml.endElement("a");
199202 }
330333 xhtml.startElement("tbody");
331334 for(XWPFTableRow row : table.getRows()) {
332335 xhtml.startElement("tr");
333 for(XWPFTableCell cell : row.getTableCells()) {
334 xhtml.startElement("td");
335 extractIBodyText(cell, xhtml);
336 xhtml.endElement("td");
336 for(ICell cell : row.getTableICells()){
337 xhtml.startElement("td");
338 if (cell instanceof XWPFTableCell) {
339 extractIBodyText((XWPFTableCell)cell, xhtml);
340 } else if (cell instanceof XWPFSDTCell) {
341 xhtml.characters(((XWPFSDTCell)cell).getContent().getText());
342 }
343 xhtml.endElement("td");
337344 }
338345 xhtml.endElement("tr");
339346 }
112112 return null;
113113 }
114114
115 public String getAlbumArtist() {
116 for (ID3Tags tag : tags) {
117 if (tag.getAlbumArtist() != null) {
118 return tag.getAlbumArtist();
119 }
120 }
121 return null;
122 }
123
124 public String getDisc() {
125 for (ID3Tags tag : tags) {
126 if (tag.getDisc() != null) {
127 return tag.getDisc();
128 }
129 }
130 return null;
131 }
132
133 public String getCompilation() {
134 for (ID3Tags tag : tags) {
135 if (tag.getCompilation() != null) {
136 return tag.getCompilation();
137 }
138 }
139 return null;
140 }
115141 }
1616 package org.apache.tika.parser.mp3;
1717
1818 import java.util.List;
19
2019
2120 /**
2221 * Interface that defines the common interface for ID3 tag parsers,
171170
172171 String getTitle();
173172
173 /**
174 * The Artist for the track
175 */
174176 String getArtist();
177
178 /**
179 * The Artist for the overall album / compilation of albums
180 */
181 String getAlbumArtist();
175182
176183 String getAlbum();
177184
178185 String getComposer();
179186
187 String getCompilation();
188
180189 /**
181190 * Retrieves the comments, if any.
182191 * Files may have more than one comment, but normally only
188197
189198 String getYear();
190199
200 /**
201 * The number of the track within the album / recording
202 */
191203 String getTrackNumber();
204
205 /**
206 * The number of the disc this belongs to, within the set
207 */
208 String getDisc();
192209
193210 /**
194211 * Represents a comment in ID3 (especially ID3 v2), where are
120120 }
121121
122122 /**
123 * ID3v1 doesn't have album-wide artists,
124 * so returns null.
125 */
126 public String getAlbumArtist() {
127 return null;
128 }
129
130 /**
131 * ID3v1 doesn't have disc numbers,
132 * so returns null.
133 */
134 public String getDisc() {
135 return null;
136 }
137
138 /**
139 * ID3v1 doesn't have compilations,
140 * so returns null.
141 */
142 public String getCompilation() {
143 return null;
144 }
145
146 /**
123147 * Returns the identified ISO-8859-1 substring from the given byte buffer.
124148 * The return value is the zero-terminated substring retrieved from
125149 * between the given start and end positions in the given byte buffer.
3838 private String composer;
3939 private String genre;
4040 private String trackNumber;
41 private String albumArtist;
42 private String disc;
4143 private List<ID3Comment> comments = new ArrayList<ID3Comment>();
4244
4345 public ID3v22Handler(ID3v2Frame frame)
4951 title = getTagString(tag.data, 0, tag.data.length);
5052 } else if (tag.name.equals("TP1")) {
5153 artist = getTagString(tag.data, 0, tag.data.length);
54 } else if (tag.name.equals("TP2")) {
55 albumArtist = getTagString(tag.data, 0, tag.data.length);
5256 } else if (tag.name.equals("TAL")) {
5357 album = getTagString(tag.data, 0, tag.data.length);
5458 } else if (tag.name.equals("TYE")) {
5963 comments.add( getComment(tag.data, 0, tag.data.length) );
6064 } else if (tag.name.equals("TRK")) {
6165 trackNumber = getTagString(tag.data, 0, tag.data.length);
66 } else if (tag.name.equals("TPA")) {
67 disc = getTagString(tag.data, 0, tag.data.length);
6268 } else if (tag.name.equals("TCO")) {
6369 genre = extractGenre( getTagString(tag.data, 0, tag.data.length) );
6470 }
128134 return trackNumber;
129135 }
130136
137 public String getAlbumArtist() {
138 return albumArtist;
139 }
140
141 public String getDisc() {
142 return disc;
143 }
144
145 /**
146 * ID3v22 doesn't have compilations,
147 * so returns null.
148 */
149 public String getCompilation() {
150 return null;
151 }
152
131153 private class RawV22TagIterator extends RawTagIterator {
132154 private RawV22TagIterator(ID3v2Frame frame) {
133155 frame.super(3, 3, 1, 0);
134156 }
135157 }
136
137158 }
3838 private String composer;
3939 private String genre;
4040 private String trackNumber;
41 private String albumArtist;
42 private String disc;
43 private String compilation;
4144 private List<ID3Comment> comments = new ArrayList<ID3Comment>();
4245
4346 public ID3v23Handler(ID3v2Frame frame)
4952 title = getTagString(tag.data, 0, tag.data.length);
5053 } else if (tag.name.equals("TPE1")) {
5154 artist = getTagString(tag.data, 0, tag.data.length);
55 } else if (tag.name.equals("TPE2")) {
56 albumArtist = getTagString(tag.data, 0, tag.data.length);
5257 } else if (tag.name.equals("TALB")) {
5358 album = getTagString(tag.data, 0, tag.data.length);
5459 } else if (tag.name.equals("TYER")) {
5964 comments.add( getComment(tag.data, 0, tag.data.length) );
6065 } else if (tag.name.equals("TRCK")) {
6166 trackNumber = getTagString(tag.data, 0, tag.data.length);
67 } else if (tag.name.equals("TPOS")) {
68 disc = getTagString(tag.data, 0, tag.data.length);
69 } else if (tag.name.equals("TCMP")) {
70 compilation = getTagString(tag.data, 0, tag.data.length);
6271 } else if (tag.name.equals("TCON")) {
6372 genre = ID3v22Handler.extractGenre( getTagString(tag.data, 0, tag.data.length) );
6473 }
108117 return trackNumber;
109118 }
110119
120 public String getAlbumArtist() {
121 return albumArtist;
122 }
123
124 public String getDisc() {
125 return disc;
126 }
127
128 public String getCompilation() {
129 return compilation;
130 }
131
111132 private class RawV23TagIterator extends RawTagIterator {
112133 private RawV23TagIterator(ID3v2Frame frame) {
113134 frame.super(4, 4, 1, 2);
114135 }
115136 }
116
117137 }
3939 private String composer;
4040 private String genre;
4141 private String trackNumber;
42 private String albumArtist;
43 private String disc;
44 private String compilation;
4245 private List<ID3Comment> comments = new ArrayList<ID3Comment>();
4346
4447 public ID3v24Handler(ID3v2Frame frame)
5053 title = getTagString(tag.data, 0, tag.data.length);
5154 } else if (tag.name.equals("TPE1")) {
5255 artist = getTagString(tag.data, 0, tag.data.length);
56 } else if (tag.name.equals("TPE2")) {
57 albumArtist = getTagString(tag.data, 0, tag.data.length);
5358 } else if (tag.name.equals("TALB")) {
5459 album = getTagString(tag.data, 0, tag.data.length);
5560 } else if (tag.name.equals("TYER")) {
6469 comments.add( getComment(tag.data, 0, tag.data.length) );
6570 } else if (tag.name.equals("TRCK")) {
6671 trackNumber = getTagString(tag.data, 0, tag.data.length);
72 } else if (tag.name.equals("TPOS")) {
73 disc = getTagString(tag.data, 0, tag.data.length);
74 } else if (tag.name.equals("TCMP")) {
75 compilation = getTagString(tag.data, 0, tag.data.length);
6776 } else if (tag.name.equals("TCON")) {
6877 genre = ID3v22Handler.extractGenre( getTagString(tag.data, 0, tag.data.length) );
6978 }
113122 return trackNumber;
114123 }
115124
125 public String getAlbumArtist() {
126 return albumArtist;
127 }
128
129 public String getDisc() {
130 return disc;
131 }
132
133 public String getCompilation() {
134 return compilation;
135 }
136
116137 private class RawV24TagIterator extends RawTagIterator {
117138 private RawV24TagIterator(ID3v2Frame frame) {
118139 frame.super(4, 4, 1, 2);
119140 }
120141 }
121
122142 }
412412
413413 // Now data
414414 int copyFrom = offset+nameLength+sizeLength+flagLength;
415 size = Math.min(size, frameData.length-copyFrom);
415 size = Math.max(0, Math.min(size, frameData.length-copyFrom)); // TIKA-1218, prevent negative size for malformed files.
416416 data = new byte[size];
417417 System.arraycopy(frameData, copyFrom, data, 0, size);
418418 }
1919 import java.io.InputStream;
2020
2121 import org.apache.tika.exception.TikaException;
22 import org.apache.tika.io.IOUtils;
2223 import org.xml.sax.ContentHandler;
2324 import org.xml.sax.SAXException;
2425
8182 // size including the LYRICSBEGIN but excluding the
8283 // length+LYRICS200 at the end.
8384 int length = Integer.parseInt(
84 new String(tagData, lookat-6, 6)
85 new String(tagData, lookat-6, 6, IOUtils.UTF_8)
8586 );
8687
8788 String lyrics = new String(
6969 // Create handlers for the various kinds of ID3 tags
7070 ID3TagsAndAudio audioAndTags = getAllTagHandlers(stream, handler);
7171
72 // Process tags metadata if the file has supported tags
7273 if (audioAndTags.tags.length > 0) {
7374 CompositeTagHandler tag = new CompositeTagHandler(audioAndTags.tags);
7475
7576 metadata.set(TikaCoreProperties.TITLE, tag.getTitle());
7677 metadata.set(TikaCoreProperties.CREATOR, tag.getArtist());
7778 metadata.set(XMPDM.ARTIST, tag.getArtist());
79 metadata.set(XMPDM.ALBUM_ARTIST, tag.getAlbumArtist());
7880 metadata.set(XMPDM.COMPOSER, tag.getComposer());
7981 metadata.set(XMPDM.ALBUM, tag.getAlbum());
82 metadata.set(XMPDM.COMPILATION, tag.getCompilation());
8083 metadata.set(XMPDM.RELEASE_DATE, tag.getYear());
8184 metadata.set(XMPDM.GENRE, tag.getGenre());
82 metadata.set(XMPDM.DURATION, audioAndTags.duration);
8385
8486 List<String> comments = new ArrayList<String>();
8587 for (ID3Comment comment : tag.getComments()) {
106108 xhtml.element("p", tag.getArtist());
107109
108110 // ID3v1.1 Track addition
111 StringBuilder sb = new StringBuilder();
112 sb.append(tag.getAlbum());
109113 if (tag.getTrackNumber() != null) {
110 xhtml.element("p", tag.getAlbum() + ", track " + tag.getTrackNumber());
114 sb.append(", track ").append(tag.getTrackNumber());
111115 metadata.set(XMPDM.TRACK_NUMBER, tag.getTrackNumber());
112 } else {
113 xhtml.element("p", tag.getAlbum());
114 }
116 }
117 if (tag.getDisc() != null) {
118 sb.append(", disc ").append(tag.getDisc());
119 metadata.set(XMPDM.DISC_NUMBER, tag.getDisc());
120 }
121 xhtml.element("p", sb.toString());
122
115123 xhtml.element("p", tag.getYear());
116124 xhtml.element("p", tag.getGenre());
117125 xhtml.element("p", String.valueOf(audioAndTags.duration));
118126 for (String comment : comments) {
119127 xhtml.element("p", comment);
120128 }
129 }
130 if (audioAndTags.duration > 0) {
131 metadata.set(XMPDM.DURATION, audioAndTags.duration);
121132 }
122133 if (audioAndTags.audio != null) {
123134 metadata.set("samplerate", String.valueOf(audioAndTags.audio.getSampleRate()));
1717
1818 import java.io.IOException;
1919 import java.io.InputStream;
20 import java.text.DecimalFormat;
21 import java.text.NumberFormat;
2022 import java.util.Arrays;
2123 import java.util.Collections;
2224 import java.util.HashMap;
2325 import java.util.List;
26 import java.util.Locale;
2427 import java.util.Map;
2528 import java.util.Set;
2629
3033 import org.apache.tika.metadata.Metadata;
3134 import org.apache.tika.metadata.Property;
3235 import org.apache.tika.metadata.TikaCoreProperties;
36 import org.apache.tika.metadata.XMP;
3337 import org.apache.tika.metadata.XMPDM;
3438 import org.apache.tika.mime.MediaType;
3539 import org.apache.tika.parser.AbstractParser;
5458 import com.coremedia.iso.boxes.sampleentry.AudioSampleEntry;
5559 import com.googlecode.mp4parser.boxes.apple.AppleAlbumBox;
5660 import com.googlecode.mp4parser.boxes.apple.AppleArtistBox;
61 import com.googlecode.mp4parser.boxes.apple.AppleArtist2Box;
5762 import com.googlecode.mp4parser.boxes.apple.AppleCommentBox;
63 import com.googlecode.mp4parser.boxes.apple.AppleCompilationBox;
64 import com.googlecode.mp4parser.boxes.apple.AppleDiskNumberBox;
5865 import com.googlecode.mp4parser.boxes.apple.AppleEncoderBox;
5966 import com.googlecode.mp4parser.boxes.apple.AppleGenreBox;
6067 import com.googlecode.mp4parser.boxes.apple.AppleNameBox;
7380 public class MP4Parser extends AbstractParser {
7481 /** Serial version UID */
7582 private static final long serialVersionUID = 84011216792285L;
83 /** TODO Replace this with a 2dp Duration Property Converter */
84 private static final DecimalFormat DURATION_FORMAT =
85 (DecimalFormat)NumberFormat.getNumberInstance(Locale.ROOT);
86 static {
87 DURATION_FORMAT.applyPattern("0.0#");
88 }
7689
7790 // Ensure this stays in Sync with the entries in tika-mimetypes.xml
7891 private static final Map<MediaType,List<String>> typesMap = new HashMap<MediaType, List<String>>();
159172
160173 // Get the duration
161174 double durationSeconds = ((double)mHeader.getDuration()) / mHeader.getTimescale();
162 // TODO Use this
175 metadata.set(XMPDM.DURATION, DURATION_FORMAT.format(durationSeconds));
163176
164177 // The timescale is normally the sampling rate
165178 metadata.set(XMPDM.AUDIO_SAMPLE_RATE, (int)mHeader.getTimescale());
216229 addMetadata(TikaCoreProperties.CREATOR, metadata, artist);
217230 addMetadata(XMPDM.ARTIST, metadata, artist);
218231
232 // Album Artist
233 AppleArtist2Box artist2 = getOrNull(apple, AppleArtist2Box.class);
234 addMetadata(XMPDM.ALBUM_ARTIST, metadata, artist2);
235
219236 // Album
220237 AppleAlbumBox album = getOrNull(apple, AppleAlbumBox.class);
221238 addMetadata(XMPDM.ALBUM, metadata, album);
241258 //metadata.set(XMPDM.NUMBER_OF_TRACKS, trackNum.getB()); // TODO
242259 }
243260
261 // Disc number
262 AppleDiskNumberBox discNum = getOrNull(apple, AppleDiskNumberBox.class);
263 if (discNum != null) {
264 metadata.set(XMPDM.DISC_NUMBER, discNum.getA());
265 }
266
267 // Compilation
268 AppleCompilationBox compilation = getOrNull(apple, AppleCompilationBox.class);
269 if (compilation != null) {
270 metadata.set(XMPDM.COMPILATION, (int)compilation.getValue());
271 }
272
244273 // Comment
245274 AppleCommentBox comment = getOrNull(apple, AppleCommentBox.class);
246275 addMetadata(XMPDM.LOG_COMMENT, metadata, comment);
247276
248277 // Encoder
249278 AppleEncoderBox encoder = getOrNull(apple, AppleEncoderBox.class);
250 // addMetadata(XMPDM.???, metadata, encoder); // TODO
279 if (encoder != null) {
280 metadata.set(XMP.CREATOR_TOOL, encoder.getValue());
281 }
251282
252283
253284 // As text
1616 package org.apache.tika.parser.netcdf;
1717
1818 //JDK imports
19 import java.io.ByteArrayOutputStream;
19
2020 import java.io.IOException;
2121 import java.io.InputStream;
2222 import java.util.Collections;
2424 import java.util.List;
2525
2626 import org.apache.tika.exception.TikaException;
27 import org.apache.tika.io.IOUtils;
27 import org.apache.tika.io.TemporaryResources;
28 import org.apache.tika.io.TikaInputStream;
2829 import org.apache.tika.metadata.Metadata;
2930 import org.apache.tika.metadata.Property;
3031 import org.apache.tika.metadata.TikaCoreProperties;
5051 */
5152 public class NetCDFParser extends AbstractParser {
5253
53 /** Serial version UID */
54 /**
55 * Serial version UID
56 */
5457 private static final long serialVersionUID = -5940938274907708665L;
5558
5659 private final Set<MediaType> SUPPORTED_TYPES =
57 Collections.singleton(MediaType.application("x-netcdf"));
60 Collections.singleton(MediaType.application("x-netcdf"));
5861
5962 /*
6063 * (non-Javadoc)
7578 * org.apache.tika.parser.ParseContext)
7679 */
7780 public void parse(InputStream stream, ContentHandler handler,
78 Metadata metadata, ParseContext context) throws IOException,
81 Metadata metadata, ParseContext context) throws IOException,
7982 SAXException, TikaException {
80 ByteArrayOutputStream os = new ByteArrayOutputStream();
81 IOUtils.copy(stream, os);
8283
83 String name = metadata.get(Metadata.RESOURCE_NAME_KEY);
84 if (name == null) {
85 name = "";
86 }
87
84 TikaInputStream tis = TikaInputStream.get(stream, new TemporaryResources());
8885 try {
89 NetcdfFile ncFile = NetcdfFile.openInMemory(name, os.toByteArray());
90
86 NetcdfFile ncFile = NetcdfFile.open(tis.getFile().getAbsolutePath());
87 metadata.set("File-Type-Description", ncFile.getFileTypeDescription());
9188 // first parse out the set of global attributes
9289 for (Attribute attr : ncFile.getGlobalAttributes()) {
93 Property property = resolveMetadataKey(attr.getName());
90 Property property = resolveMetadataKey(attr.getFullName());
9491 if (attr.getDataType().isString()) {
9592 metadata.add(property, attr.getStringValue());
9693 } else if (attr.getDataType().isNumeric()) {
9895 metadata.add(property, String.valueOf(value));
9996 }
10097 }
101
102
103 XHTMLContentHandler xhtml = new XHTMLContentHandler(handler, metadata);
104 xhtml.startDocument();
10598
106 xhtml.characters("dimensions:");
107 xhtml.newline();
108
109 for (Dimension dim : ncFile.getDimensions()){
110 xhtml.characters(dim.getName());
111 xhtml.characters(" = ");
112 xhtml.characters(String.valueOf(dim.getLength()));
113 xhtml.characters(";");
114 xhtml.newline();
115 }
116
117 xhtml.newline();
118 xhtml.characters("variables:");
119
120 for (Variable var : ncFile.getVariables()){
121 xhtml.newline();
122 xhtml.characters(String.valueOf(var.getDataType())); // data type
123 xhtml.characters(" ");
124 xhtml.characters(var.getNameAndDimensions()); //variable name and dimensions
125 xhtml.characters(";");
12699
127 xhtml.newline();
100 XHTMLContentHandler xhtml = new XHTMLContentHandler(handler, metadata);
101 xhtml.startDocument();
102 xhtml.newline();
103 xhtml.element("h1", "dimensions");
104 xhtml.startElement("ul");
105 xhtml.newline();
106 for (Dimension dim : ncFile.getDimensions()) {
107 xhtml.element("li", dim.getFullName() + " = " + dim.getLength());
108 }
109 xhtml.endElement("ul");
128110
129 for(Attribute element : var.getAttributes()){
130 String text = element.toString();
111 xhtml.element("h1", "variables");
112 xhtml.startElement("ul");
113 xhtml.newline();
114 for (Variable var : ncFile.getVariables()) {
115 xhtml.startElement("li");
116 xhtml.characters(var.getDataType() + " " + var.getNameAndDimensions());
117 xhtml.newline();
118 List<Attribute> attributes = var.getAttributes();
119 if (!attributes.isEmpty()) {
120 xhtml.startElement("ul");
121 for (Attribute element : attributes) {
122 xhtml.element("li", element.toString());
123 }
124 xhtml.endElement("ul");
125 }
126 xhtml.endElement("li");
127 }
128 xhtml.endElement("ul");
131129
132 xhtml.characters(" :");
133 xhtml.characters(text);
134 xhtml.characters(";");
135 xhtml.newline();
136 }
137 }
138
139 xhtml.endDocument();
140
130 xhtml.endDocument();
131
141132 } catch (IOException e) {
142133 throw new TikaException("NetCDF parse error", e);
143 }
134 }
144135 }
145
136
146137 private Property resolveMetadataKey(String localName) {
147138 if ("title".equals(localName)) {
148139 return TikaCoreProperties.TITLE;
0 /*
1 * Licensed to the Apache Software Foundation (ASF) under one or more
2 * contributor license agreements. See the NOTICE file distributed with
3 * this work for additional information regarding copyright ownership.
4 * The ASF licenses this file to You under the Apache License, Version 2.0
5 * (the "License"); you may not use this file except in compliance with
6 * the License. You may obtain a copy of the License at
7 *
8 * http://www.apache.org/licenses/LICENSE-2.0
9 *
10 * Unless required by applicable law or agreed to in writing, software
11 * distributed under the License is distributed on an "AS IS" BASIS,
12 * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
13 * See the License for the specific language governing permissions and
14 * limitations under the License.
15 */
16 package org.apache.tika.parser.ocr;
17
18 import java.io.File;
19 import java.io.IOException;
20 import java.io.InputStream;
21 import java.io.Serializable;
22 import java.util.Locale;
23 import java.util.Properties;
24
25 /**
26 * Configuration for TesseractOCRParser.
27 *
28 * This allows you to enable TesseractOCRParser and set its parameters:
29 * <p>
30 * TesseractOCRConfig config = new TesseractOCRConfig();<br>
31 * config.setTesseractPath(tesseractFolder);<br>
32 * parseContext.set(TesseractOCRConfig.class, config);<br>
33 * </p>
34 *
35 * Parameters can also be set by either editing the existing TesseractOCRConfig.properties file in
36 * tika-parser/src/main/resources/org/apache/tika/parser/ocr, or overriding it by creating your own
37 * and placing it in the package org/apache/tika/parser/ocr on the classpath.
38 *
39 */
40 public class TesseractOCRConfig implements Serializable{
41
42 private static final long serialVersionUID = -4861942486845757891L;
43
44 // Path to tesseract installation folder, if not on system path.
45 private String tesseractPath = "";
46
47 // Language dictionary to be used.
48 private String language = "eng";
49
50 // Tesseract page segmentation mode.
51 private String pageSegMode = "1";
52
53 // Minimum file size to submit file to ocr.
54 private int minFileSizeToOcr = 0;
55
56 // Maximum file size to submit file to ocr.
57 private int maxFileSizeToOcr = Integer.MAX_VALUE;
58
59 // Maximum time (seconds) to wait for the OCR process to terminate.
60 private int timeout = 120;
61
62 /**
63 * Default constructor.
64 */
65 public TesseractOCRConfig() {
66 init(this.getClass().getResourceAsStream("TesseractOCRConfig.properties"));
67 }
68
69 /**
70 * Loads properties from InputStream and then tries to close InputStream.
71 * If there is an IOException, this silently swallows the exception
72 * and goes back to the default.
73 *
74 * @param is
75 */
76 public TesseractOCRConfig(InputStream is) {
77 init(is);
78 }
79
80 private void init(InputStream is) {
81 if (is == null) {
82 return;
83 }
84 Properties props = new Properties();
85 try {
86 props.load(is);
87 } catch (IOException e) {
88 } finally {
89 if (is != null) {
90 try {
91 is.close();
92 } catch (IOException e) {
93 //swallow
94 }
95 }
96 }
97
98 setTesseractPath(
99 getProp(props, "tesseractPath", getTesseractPath()));
100 setLanguage(
101 getProp(props, "language", getLanguage()));
102 setPageSegMode(
103 getProp(props, "pageSegMode", getPageSegMode()));
104 setMinFileSizeToOcr(
105 getProp(props, "minFileSizeToOcr", getMinFileSizeToOcr()));
106 setMaxFileSizeToOcr(
107 getProp(props, "maxFileSizeToOcr", getMaxFileSizeToOcr()));
108 setTimeout(
109 getProp(props, "timeout", getTimeout()));
110
111 }
112
113 /** @see #setTesseractPath(String tesseractPath)*/
114 public String getTesseractPath() {
115 return tesseractPath;
116 }
117
118 /**
119 * Set tesseract installation folder, needed if it is not on system path.
120 */
121 public void setTesseractPath(String tesseractPath) {
122 if(!tesseractPath.isEmpty() && !tesseractPath.endsWith(File.separator))
123 tesseractPath += File.separator;
124
125 this.tesseractPath = tesseractPath;
126 }
127
128 /** @see #setLanguage(String language)*/
129 public String getLanguage() {
130 return language;
131 }
132
133 /**
134 * Set tesseract language dictionary to be used. Default is "eng".
135 * Multiple languages may be specified, separated by plus characters.
136 */
137 public void setLanguage(String language) {
138 if (!language.matches("([A-Za-z](\\+?))*")) {
139 throw new IllegalArgumentException("Invalid language code");
140 }
141 this.language = language;
142 }
143
144 /** @see #setPageSegMode(String pageSegMode)*/
145 public String getPageSegMode() {
146 return pageSegMode;
147 }
148
149 /**
150 * Set tesseract page segmentation mode.
151 * Default is 1 = Automatic page segmentation with OSD (Orientation and Script Detection)
152 */
153 public void setPageSegMode(String pageSegMode) {
154 if (!pageSegMode.matches("[1-9]|10")) {
155 throw new IllegalArgumentException("Invalid page segmentation mode");
156 }
157 this.pageSegMode = pageSegMode;
158 }
159
160 /** @see #setMinFileSizeToOcr(int minFileSizeToOcr)*/
161 public int getMinFileSizeToOcr() {
162 return minFileSizeToOcr;
163 }
164
165 /**
166 * Set minimum file size to submit file to ocr.
167 * Default is 0.
168 */
169 public void setMinFileSizeToOcr(int minFileSizeToOcr) {
170 this.minFileSizeToOcr = minFileSizeToOcr;
171 }
172
173 /** @see #setMaxFileSizeToOcr(int maxFileSizeToOcr)*/
174 public int getMaxFileSizeToOcr() {
175 return maxFileSizeToOcr;
176 }
177
178 /**
179 * Set maximum file size to submit file to ocr.
180 * Default is Integer.MAX_VALUE.
181 */
182 public void setMaxFileSizeToOcr(int maxFileSizeToOcr) {
183 this.maxFileSizeToOcr = maxFileSizeToOcr;
184 }
185
186 /**
187 * Set maximum time (seconds) to wait for the OCR process to terminate.
188 * Default value is 120s.
189 */
190 public void setTimeout(int timeout) {
191 this.timeout = timeout;
192 }
193
194 /** @see #setTimeout(int timeout)*/
195 public int getTimeout() {
196 return timeout;
197 }
198
199 /**
200 * Get property from the properties file passed in.
201 * @param properties properties file to read from.
202 * @param property the property to fetch.
203 * @param defaultMissing default parameter to use.
204 * @return the value.
205 */
206 private int getProp(Properties properties, String property, int defaultMissing) {
207 String p = properties.getProperty(property);
208 if (p == null || p.isEmpty()){
209 return defaultMissing;
210 }
211 try {
212 return Integer.parseInt(p);
213 } catch (Throwable ex) {
214 throw new RuntimeException(String.format(Locale.ROOT, "Cannot parse TesseractOCRConfig variable %s, invalid integer value",
215 property), ex);
216 }
217 }
218
219 /**
220 * Get property from the properties file passed in.
221 * @param properties properties file to read from.
222 * @param property the property to fetch.
223 * @param defaultMissing default parameter to use.
224 * @return the value.
225 */
226 private String getProp(Properties properties, String property, String defaultMissing) {
227 return properties.getProperty(property, defaultMissing);
228 }
229 }
0 /*
1 * Licensed to the Apache Software Foundation (ASF) under one or more
2 * contributor license agreements. See the NOTICE file distributed with
3 * this work for additional information regarding copyright ownership.
4 * The ASF licenses this file to You under the Apache License, Version 2.0
5 * (the "License"); you may not use this file except in compliance with
6 * the License. You may obtain a copy of the License at
7 *
8 * http://www.apache.org/licenses/LICENSE-2.0
9 *
10 * Unless required by applicable law or agreed to in writing, software
11 * distributed under the License is distributed on an "AS IS" BASIS,
12 * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
13 * See the License for the specific language governing permissions and
14 * limitations under the License.
15 */
16 package org.apache.tika.parser.ocr;
17
18 import javax.imageio.ImageIO;
19
20 import java.awt.Image;
21 import java.awt.image.BufferedImage;
22 import java.io.File;
23 import java.io.FileInputStream;
24 import java.io.FileOutputStream;
25 import java.io.IOException;
26 import java.io.InputStream;
27 import java.io.InputStreamReader;
28 import java.io.Reader;
29 import java.util.Arrays;
30 import java.util.Collections;
31 import java.util.HashMap;
32 import java.util.HashSet;
33 import java.util.List;
34 import java.util.Map;
35 import java.util.Set;
36 import java.util.concurrent.Callable;
37 import java.util.concurrent.ExecutionException;
38 import java.util.concurrent.FutureTask;
39 import java.util.concurrent.TimeUnit;
40 import java.util.concurrent.TimeoutException;
41
42 import org.apache.commons.logging.LogFactory;
43 import org.apache.tika.exception.TikaException;
44 import org.apache.tika.io.IOUtils;
45 import org.apache.tika.io.TemporaryResources;
46 import org.apache.tika.io.TikaInputStream;
47 import org.apache.tika.metadata.Metadata;
48 import org.apache.tika.mime.MediaType;
49 import org.apache.tika.mime.MediaTypeRegistry;
50 import org.apache.tika.parser.AbstractParser;
51 import org.apache.tika.parser.CompositeParser;
52 import org.apache.tika.parser.ParseContext;
53 import org.apache.tika.parser.Parser;
54 import org.apache.tika.parser.external.ExternalParser;
55 import org.apache.tika.parser.image.ImageParser;
56 import org.apache.tika.parser.image.TiffParser;
57 import org.apache.tika.parser.jpeg.JpegParser;
58 import org.apache.tika.sax.XHTMLContentHandler;
59 import org.xml.sax.ContentHandler;
60 import org.xml.sax.SAXException;
61
62 /**
63 * TesseractOCRParser powered by tesseract-ocr engine. To enable this parser,
64 * create a {@link TesseractOCRConfig} object and pass it through a
65 * ParseContext. Tesseract must be installed and on the system path, or the
66 * path to its root folder must be provided:
67 * <p>
68 * TesseractOCRConfig config = new TesseractOCRConfig();<br>
69 * //Needed if tesseract is not on system path<br>
70 * config.setTesseractPath(tesseractFolder);<br>
71 * parseContext.set(TesseractOCRConfig.class, config);<br>
72 * </p>
73 *
74 *
75 */
76 public class TesseractOCRParser extends AbstractParser {
77 private static final long serialVersionUID = -8167538283213097265L;
78 private static final TesseractOCRConfig DEFAULT_CONFIG = new TesseractOCRConfig();
79 private static final Set<MediaType> SUPPORTED_TYPES = Collections.unmodifiableSet(
80 new HashSet<MediaType>(Arrays.asList(new MediaType[] {
81 MediaType.image("png"), MediaType.image("jpeg"), MediaType.image("tiff"),
82 MediaType.image("x-ms-bmp"), MediaType.image("gif")
83 })));
84 private static Map<String,Boolean> TESSERACT_PRESENT = new HashMap<String, Boolean>();
85
86 @Override
87 public Set<MediaType> getSupportedTypes(ParseContext context) {
88 // If Tesseract is installed, offer our supported image types
89 TesseractOCRConfig config = context.get(TesseractOCRConfig.class, DEFAULT_CONFIG);
90 if (hasTesseract(config))
91 return SUPPORTED_TYPES;
92
93 // Otherwise don't advertise anything, so the other image parsers
94 // can be selected instead
95 return Collections.emptySet();
96 }
97
98 private void setEnv(TesseractOCRConfig config, ProcessBuilder pb) {
99 if (!config.getTesseractPath().isEmpty()) {
100 Map<String, String> env = pb.environment();
101 env.put("TESSDATA_PREFIX", config.getTesseractPath());
102 }
103 }
104
105 private boolean hasTesseract(TesseractOCRConfig config) {
106 // Fetch where the config says to find Tesseract
107 String tesseract = config.getTesseractPath() + getTesseractProg();
108
109 // Have we already checked for a copy of Tesseract there?
110 if (TESSERACT_PRESENT.containsKey(tesseract)) {
111 return TESSERACT_PRESENT.get(tesseract);
112 }
113
114 // Try running Tesseract from there, and see if it exists + works
115 String[] checkCmd = { tesseract };
116 try {
117 boolean hasTesseract = ExternalParser.check(checkCmd);
118 TESSERACT_PRESENT.put(tesseract, hasTesseract);
119 return hasTesseract;
120 } catch (NoClassDefFoundError e) {
121 // This happens under OSGi + Fork Parser - see TIKA-1507
122 // As a workaround for now, just say we can't use OCR
123 // TODO Resolve it so we don't need this try/catch block
124 TESSERACT_PRESENT.put(tesseract, false);
125 return false;
126 }
127 }
128
129 public void parse(Image image, ContentHandler handler, Metadata metadata, ParseContext context) throws IOException,
130 SAXException, TikaException {
131
132 TemporaryResources tmp = new TemporaryResources();
133 FileOutputStream fos = null;
134 TikaInputStream tis = null;
135 try {
136 int w = image.getWidth(null);
137 int h = image.getHeight(null);
138 BufferedImage bImage = new BufferedImage(w, h, BufferedImage.TYPE_INT_RGB);
    bImage.getGraphics().drawImage(image, 0, 0, null);
139 File file = tmp.createTemporaryFile();
140 fos = new FileOutputStream(file);
141 ImageIO.write(bImage, "png", fos);
142 tis = TikaInputStream.get(file);
143 parse(tis, handler, metadata, context);
144
145 } finally {
146 tmp.dispose();
147 if (tis != null)
148 tis.close();
149 if (fos != null)
150 fos.close();
151 }
152
153 }
154
155 @Override
156 public void parse(InputStream stream, ContentHandler handler, Metadata metadata, ParseContext context)
157 throws IOException, SAXException, TikaException {
158 TesseractOCRConfig config = context.get(TesseractOCRConfig.class, DEFAULT_CONFIG);
159
160 // If Tesseract is not on the path with the current config, do not try to run OCR
161 // getSupportedTypes shouldn't have listed us as handling it, so this should only
162 // occur if someone directly calls this parser, not via DefaultParser or similar
163 if (! hasTesseract(config))
164 return;
165
166 XHTMLContentHandler xhtml = new XHTMLContentHandler(handler, metadata);
167
168 TemporaryResources tmp = new TemporaryResources();
169 File output = null;
170 try {
171 TikaInputStream tikaStream = TikaInputStream.get(stream, tmp);
172 File input = tikaStream.getFile();
173 long size = tikaStream.getLength();
174
175 if (size >= config.getMinFileSizeToOcr() && size <= config.getMaxFileSizeToOcr()) {
176
177 output = tmp.createTemporaryFile();
178 doOCR(input, output, config);
179
180 // Tesseract appends .txt to output file name
181 output = new File(output.getAbsolutePath() + ".txt");
182
183 if (output.exists())
184 extractOutput(new FileInputStream(output), xhtml);
185
186 }
187
188 // Temporary workaround for TIKA-1445 - until we can specify
189 // composite parsers with strategies (eg Composite, Try In Turn),
190 // always send the image onwards to the regular parser to have
191 // the metadata for them extracted as well
192 _TMP_IMAGE_METADATA_PARSER.parse(tikaStream, handler, metadata, context);
193 } finally {
194 tmp.dispose();
195 if (output != null) {
196 output.delete();
197 }
198 }
199 }
200 // TIKA-1445 workaround parser
201 private static Parser _TMP_IMAGE_METADATA_PARSER = new CompositeImageParser();
202 private static class CompositeImageParser extends CompositeParser {
203 private static final long serialVersionUID = -2398203346206381382L;
204 private static List<Parser> imageParsers = Arrays.asList(new Parser[]{
205 new ImageParser(), new JpegParser(), new TiffParser()
206 });
207 CompositeImageParser() {
208 super(new MediaTypeRegistry(), imageParsers);
209 }
210 }
211
212 /**
213 * Run external tesseract-ocr process.
214 *
215 * @param input
216 File to be OCRed
217 * @param output
218 File to collect the OCR result
219 * @param config
220 * Configuration of tesseract-ocr engine
221 * @throws TikaException
222 * if the extraction timed out
223 * @throws IOException
224 * if an input error occurred
225 */
226 private void doOCR(File input, File output, TesseractOCRConfig config) throws IOException, TikaException {
227 String[] cmd = { config.getTesseractPath() + getTesseractProg(), input.getPath(), output.getPath(), "-l",
228 config.getLanguage(), "-psm", config.getPageSegMode() };
229
230 ProcessBuilder pb = new ProcessBuilder(cmd);
231 setEnv(config, pb);
232 final Process process = pb.start();
233
234 process.getOutputStream().close();
235 InputStream out = process.getInputStream();
236 InputStream err = process.getErrorStream();
237
238 logStream("OCR MSG", out, input);
239 logStream("OCR ERROR", err, input);
240
241 FutureTask<Integer> waitTask = new FutureTask<Integer>(new Callable<Integer>() {
242 public Integer call() throws Exception {
243 return process.waitFor();
244 }
245 });
246
247 Thread waitThread = new Thread(waitTask);
248 waitThread.start();
249
250 try {
251 waitTask.get(config.getTimeout(), TimeUnit.SECONDS);
252
253 } catch (InterruptedException e) {
254 waitThread.interrupt();
255 process.destroy();
256 Thread.currentThread().interrupt();
257 throw new TikaException("TesseractOCRParser interrupted", e);
258
259 } catch (ExecutionException e) {
260 // should not be thrown
261
262 } catch (TimeoutException e) {
263 waitThread.interrupt();
264 process.destroy();
265 throw new TikaException("TesseractOCRParser timeout", e);
266 }
267
268 }
269
270 /**
271 * Reads the contents of the given stream and writes it to the given XHTML
272 * content handler. The stream is closed once fully processed.
273 *
274 * @param stream
275 Stream containing the OCR result
276 * @param xhtml
277 * XHTML content handler
278 * @throws SAXException
279 * if the XHTML SAX events could not be handled
280 * @throws IOException
281 * if an input error occurred
282 */
283 private void extractOutput(InputStream stream, XHTMLContentHandler xhtml) throws SAXException, IOException {
284
285 Reader reader = new InputStreamReader(stream, IOUtils.UTF_8);
286 xhtml.startDocument();
287 xhtml.startElement("div");
288 try {
289 char[] buffer = new char[1024];
290 for (int n = reader.read(buffer); n != -1; n = reader.read(buffer)) {
291 if (n > 0)
292 xhtml.characters(buffer, 0, n);
293 }
294 } finally {
295 reader.close();
296 }
297 xhtml.endElement("div");
298 xhtml.endDocument();
299 }
300
301 /**
302 * Starts a thread that reads the contents of the standard output or error
303 * stream of the given process so that the process does not block. The stream is closed
304 * once fully processed.
305 */
306 private void logStream(final String logType, final InputStream stream, final File file) {
307 new Thread() {
308 public void run() {
309 Reader reader = new InputStreamReader(stream, IOUtils.UTF_8);
310 StringBuilder out = new StringBuilder();
311 char[] buffer = new char[1024];
312 try {
313 for (int n = reader.read(buffer); n != -1; n = reader.read(buffer))
314 out.append(buffer, 0, n);
315 } catch (IOException e) {
316 // ignore: the stream is read only to drain and log process output
317 } finally {
318 IOUtils.closeQuietly(stream);
319 }
320
321 String msg = out.toString();
322 LogFactory.getLog(TesseractOCRParser.class).debug(msg);
323 }
324 }.start();
325 }
326
327 static String getTesseractProg() {
328 return System.getProperty("os.name").startsWith("Windows") ? "tesseract.exe" : "tesseract";
329 }
330
331 }
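`doOCR()` above bounds the external Tesseract run by calling `Process.waitFor()` on a helper thread and waiting on a `FutureTask` with a timeout. The same pattern, reduced to plain JDK code; the sleeping `Callable` stands in for `process.waitFor()`, and the class and method names are illustrative:

```java
import java.util.concurrent.Callable;
import java.util.concurrent.ExecutionException;
import java.util.concurrent.FutureTask;
import java.util.concurrent.TimeUnit;
import java.util.concurrent.TimeoutException;

public class TimedWait {

    // Run a blocking call on a separate thread and give up after a timeout,
    // mirroring how doOCR() bounds Process.waitFor().
    static Integer waitFor(Callable<Integer> blockingCall, long timeoutSeconds)
            throws InterruptedException, ExecutionException, TimeoutException {
        FutureTask<Integer> task = new FutureTask<Integer>(blockingCall);
        Thread waiter = new Thread(task);
        waiter.start();
        try {
            return task.get(timeoutSeconds, TimeUnit.SECONDS);
        } catch (TimeoutException e) {
            waiter.interrupt();  // doOCR() additionally calls process.destroy() here
            throw e;
        }
    }

    public static void main(String[] args) throws Exception {
        // Completes well inside the timeout and returns the "exit value".
        Integer exit = waitFor(new Callable<Integer>() {
            public Integer call() throws Exception {
                Thread.sleep(50);  // stand-in for process.waitFor()
                return 0;
            }
        }, 5);
        System.out.println(exit);  // 0
    }
}
```

Interrupting the waiter thread on timeout matters: without it, the helper thread would stay blocked in the call for as long as the underlying process lives.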
1717
1818 import java.io.IOException;
1919 import java.io.StringReader;
20 import java.util.Locale;
2021
2122 import org.apache.tika.sax.ContentHandlerDecorator;
2223 import org.xml.sax.Attributes;
3435 public class NSNormalizerContentHandler extends ContentHandlerDecorator {
3536
3637 private static final String OLD_NS =
37 "http://openoffice.org/2000/";
38 "http://openoffice.org/2000/";
3839
3940 private static final String NEW_NS =
40 "urn:oasis:names:tc:opendocument:xmlns:";
41 "urn:oasis:names:tc:opendocument:xmlns:";
4142
4243 private static final String DTD_PUBLIC_ID =
43 "-//OpenOffice.org//DTD OfficeDocument 1.0//EN";
44 "-//OpenOffice.org//DTD OfficeDocument 1.0//EN";
4445
4546 public NSNormalizerContentHandler(ContentHandler handler) {
4647 super(handler);
8687 @Override
8788 public InputSource resolveEntity(String publicId, String systemId)
8889 throws IOException, SAXException {
89 if ((systemId != null && systemId.toLowerCase().endsWith(".dtd"))
90 if ((systemId != null && systemId.toLowerCase(Locale.ROOT).endsWith(".dtd"))
9091 || DTD_PUBLIC_ID.equals(publicId)) {
9192 return new InputSource(new StringReader(""));
9293 } else {
1515 */
1616 package org.apache.tika.parser.odf;
1717
18 import static org.apache.tika.sax.XHTMLContentHandler.XHTML;
18 import javax.xml.XMLConstants;
19 import javax.xml.namespace.QName;
20 import javax.xml.parsers.ParserConfigurationException;
21 import javax.xml.parsers.SAXParser;
22 import javax.xml.parsers.SAXParserFactory;
1923
2024 import java.io.IOException;
2125 import java.io.InputStream;
2529 import java.util.Map;
2630 import java.util.Set;
2731 import java.util.Stack;
28
29 import javax.xml.XMLConstants;
30 import javax.xml.namespace.QName;
31 import javax.xml.parsers.ParserConfigurationException;
32 import javax.xml.parsers.SAXParser;
33 import javax.xml.parsers.SAXParserFactory;
3432
3533 import org.apache.tika.exception.TikaException;
3634 import org.apache.tika.io.CloseShieldInputStream;
4947 import org.xml.sax.helpers.AttributesImpl;
5048 import org.xml.sax.helpers.DefaultHandler;
5149
50 import static org.apache.tika.sax.XHTMLContentHandler.XHTML;
51
5252 /**
5353 * Parser for ODF <code>content.xml</code> files.
5454 */
5555 public class OpenDocumentContentParser extends AbstractParser {
56 private interface Style {
57 }
58
59 private static class TextStyle implements Style {
60 public boolean italic;
61 public boolean bold;
62 public boolean underlined;
63 }
64
65 private static class ListStyle implements Style {
66 public boolean ordered;
67
68 public String getTag() {
69 return ordered ? "ol" : "ul";
70 }
71 }
5672
5773 private static final class OpenDocumentElementMappingContentHandler extends
58 ElementMappingContentHandler {
59 private final ContentHandler handler;
60 private final BitSet textNodeStack = new BitSet();
61 private int nodeDepth = 0;
62 private int completelyFiltered = 0;
63 private Stack<String> headingStack = new Stack<String>();
64
65 private OpenDocumentElementMappingContentHandler(ContentHandler handler,
66 Map<QName, TargetElement> mappings) {
67 super(handler, mappings);
68 this.handler = handler;
69 }
70
71 @Override
72 public void characters(char[] ch, int start, int length)
73 throws SAXException {
74 // only forward content of tags from text:-namespace
75 if (completelyFiltered == 0 && nodeDepth > 0
76 && textNodeStack.get(nodeDepth - 1)) {
77 super.characters(ch,start,length);
78 }
79 }
80
81 // helper for checking tags which need complete filtering
82 // (with sub-tags)
83 private boolean needsCompleteFiltering(
84 String namespaceURI, String localName) {
85 if (TEXT_NS.equals(namespaceURI)) {
86 return localName.endsWith("-template")
87 || localName.endsWith("-style");
88 } else if (TABLE_NS.equals(namespaceURI)) {
89 return "covered-table-cell".equals(localName);
90 } else {
91 return false;
92 }
93 }
94
95 // map the heading level to <hX> HTML tags
96 private String getXHTMLHeaderTagName(Attributes atts) {
97 String depthStr = atts.getValue(TEXT_NS, "outline-level");
98 if (depthStr == null) {
99 return "h1";
100 }
101
102 int depth = Integer.parseInt(depthStr);
103 if (depth >= 6) {
104 return "h6";
105 } else if (depth <= 1) {
106 return "h1";
107 } else {
108 return "h" + depth;
109 }
110 }
111
112 /**
113 * Check if a node is a text node
114 */
115 private boolean isTextNode(String namespaceURI, String localName) {
116 if (TEXT_NS.equals(namespaceURI) && !localName.equals("page-number") && !localName.equals("page-count")) {
117 return true;
118 }
119 if (SVG_NS.equals(namespaceURI)) {
120 return "title".equals(localName) ||
121 "desc".equals(localName);
122 }
123 return false;
124 }
125
126 @Override
127 public void startElement(
128 String namespaceURI, String localName, String qName,
129 Attributes atts) throws SAXException {
130 // keep track of current node type. If it is a text node,
131 // a bit at the current depth is set in textNodeStack.
132 // characters() checks the top bit to determine whether the
133 // current node is a text node whose content should be printed.
134 // nodeDepth contains the depth of the current node and marks the top of the stack.
135 assert nodeDepth >= 0;
136
137 textNodeStack.set(nodeDepth++,
138 isTextNode(namespaceURI, localName));
139 // filter *all* content of some tags
140 assert completelyFiltered >= 0;
141
142 if (needsCompleteFiltering(namespaceURI, localName)) {
143 completelyFiltered++;
144 }
145 // call next handler if no filtering
146 if (completelyFiltered == 0) {
147 // special handling of text:h elements, which are passed
148 // directly to the underlying handler
149 if (TEXT_NS.equals(namespaceURI) && "h".equals(localName)) {
150 final String el = headingStack.push(getXHTMLHeaderTagName(atts));
151 handler.startElement(XHTMLContentHandler.XHTML, el, el, EMPTY_ATTRIBUTES);
152 } else {
153 super.startElement(
154 namespaceURI, localName, qName, atts);
155 }
156 }
157 }
158
159 @Override
160 public void endElement(
161 String namespaceURI, String localName, String qName)
162 throws SAXException {
163 // call next handler if no filtering
164 if (completelyFiltered == 0) {
165 // special handling of text:h elements, which are passed
166 // directly to the underlying handler
167 if (TEXT_NS.equals(namespaceURI) && "h".equals(localName)) {
168 final String el = headingStack.pop();
169 handler.endElement(XHTMLContentHandler.XHTML, el, el);
170 } else {
171 super.endElement(namespaceURI,localName,qName);
172 }
173
174 // special handling of tabulators
175 if (TEXT_NS.equals(namespaceURI)
176 && ("tab-stop".equals(localName)
177 || "tab".equals(localName))) {
178 this.characters(TAB, 0, TAB.length);
179 }
180 }
181
182 // revert filter for *all* content of some tags
183 if (needsCompleteFiltering(namespaceURI,localName)) {
184 completelyFiltered--;
185 }
186 assert completelyFiltered >= 0;
187
188 // reduce current node depth
189 nodeDepth--;
190 assert nodeDepth >= 0;
191 }
192
193 @Override
194 public void startPrefixMapping(String prefix, String uri) {
195 // remove prefix mappings as they should not occur in XHTML
196 }
197
198 @Override
199 public void endPrefixMapping(String prefix) {
200 // remove prefix mappings as they should not occur in XHTML
201 }
202 }
203
204 public static final String TEXT_NS =
205 "urn:oasis:names:tc:opendocument:xmlns:text:1.0";
74 ElementMappingContentHandler {
75 private final ContentHandler handler;
76 private final BitSet textNodeStack = new BitSet();
77 private int nodeDepth = 0;
78 private int completelyFiltered = 0;
79 private Stack<String> headingStack = new Stack<String>();
80 private Map<String, TextStyle> textStyleMap = new HashMap<String, TextStyle>();
81 private Map<String, ListStyle> listStyleMap = new HashMap<String, ListStyle>();
82 private TextStyle textStyle;
83 private TextStyle lastTextStyle;
84 private Stack<ListStyle> listStyleStack = new Stack<ListStyle>();
85 private ListStyle listStyle;
86
87 private OpenDocumentElementMappingContentHandler(ContentHandler handler,
88 Map<QName, TargetElement> mappings) {
89 super(handler, mappings);
90 this.handler = handler;
91 }
92
93 @Override
94 public void characters(char[] ch, int start, int length)
95 throws SAXException {
96 // only forward content of tags from text:-namespace
97 if (completelyFiltered == 0 && nodeDepth > 0
98 && textNodeStack.get(nodeDepth - 1)) {
99 lazyEndSpan();
100 super.characters(ch, start, length);
101 }
102 }
103
104 // helper for checking tags which need complete filtering
105 // (with sub-tags)
106 private boolean needsCompleteFiltering(
107 String namespaceURI, String localName) {
108 if (TEXT_NS.equals(namespaceURI)) {
109 return localName.endsWith("-template")
110 || localName.endsWith("-style");
111 }
112 return TABLE_NS.equals(namespaceURI) && "covered-table-cell".equals(localName);
113 }
114
115 // map the heading level to <hX> HTML tags
116 private String getXHTMLHeaderTagName(Attributes atts) {
117 String depthStr = atts.getValue(TEXT_NS, "outline-level");
118 if (depthStr == null) {
119 return "h1";
120 }
121
122 int depth = Integer.parseInt(depthStr);
123 if (depth >= 6) {
124 return "h6";
125 } else if (depth <= 1) {
126 return "h1";
127 } else {
128 return "h" + depth;
129 }
130 }
131
132 /**
133 * Check if a node is a text node
134 */
135 private boolean isTextNode(String namespaceURI, String localName) {
136 if (TEXT_NS.equals(namespaceURI) && !localName.equals("page-number") && !localName.equals("page-count")) {
137 return true;
138 }
139 if (SVG_NS.equals(namespaceURI)) {
140 return "title".equals(localName) ||
141 "desc".equals(localName);
142 }
143 return false;
144 }
145
146 private void startList(String name) throws SAXException {
147 String elementName = "ul";
148 if (name != null) {
149 ListStyle style = listStyleMap.get(name);
150 elementName = style != null ? style.getTag() : "ul";
151 listStyleStack.push(style);
152 }
153 handler.startElement(XHTML, elementName, elementName, EMPTY_ATTRIBUTES);
154 }
155
156 private void endList() throws SAXException {
157 String elementName = "ul";
158 if (!listStyleStack.isEmpty()) {
159 ListStyle style = listStyleStack.pop();
160 elementName = style != null ? style.getTag() : "ul";
161 }
162 handler.endElement(XHTML, elementName, elementName);
163 }
164
165 private void startSpan(String name) throws SAXException {
166 if (name == null) {
167 return;
168 }
169
170 TextStyle style = textStyleMap.get(name);
171 if (style == null) {
172 return;
173 }
174
175 // End tags that refer to no longer valid styles
176 if (!style.underlined && lastTextStyle != null && lastTextStyle.underlined) {
177 handler.endElement(XHTML, "u", "u");
178 }
179 if (!style.italic && lastTextStyle != null && lastTextStyle.italic) {
180 handler.endElement(XHTML, "i", "i");
181 }
182 if (!style.bold && lastTextStyle != null && lastTextStyle.bold) {
183 handler.endElement(XHTML, "b", "b");
184 }
185
186 // Start tags for new styles
187 if (style.bold && (lastTextStyle == null || !lastTextStyle.bold)) {
188 handler.startElement(XHTML, "b", "b", EMPTY_ATTRIBUTES);
189 }
190 if (style.italic && (lastTextStyle == null || !lastTextStyle.italic)) {
191 handler.startElement(XHTML, "i", "i", EMPTY_ATTRIBUTES);
192 }
193 if (style.underlined && (lastTextStyle == null || !lastTextStyle.underlined)) {
194 handler.startElement(XHTML, "u", "u", EMPTY_ATTRIBUTES);
195 }
196
197 textStyle = style;
198 lastTextStyle = null;
199 }
200
201 private void endSpan() throws SAXException {
202 lastTextStyle = textStyle;
203 textStyle = null;
204 }
205
206 private void lazyEndSpan() throws SAXException {
207 if (lastTextStyle == null) {
208 return;
209 }
210
211 if (lastTextStyle.underlined) {
212 handler.endElement(XHTML, "u", "u");
213 }
214 if (lastTextStyle.italic) {
215 handler.endElement(XHTML, "i", "i");
216 }
217 if (lastTextStyle.bold) {
218 handler.endElement(XHTML, "b", "b");
219 }
220
221 lastTextStyle = null;
222 }
223
224 @Override
225 public void startElement(
226 String namespaceURI, String localName, String qName,
227 Attributes attrs) throws SAXException {
228 // keep track of current node type. If it is a text node,
229 // a bit at the current depth is set in textNodeStack.
230 // characters() checks the top bit to determine whether the
231 // current node is a text node whose content should be printed.
232 // nodeDepth contains the depth of the current node and marks the top of the stack.
233 assert nodeDepth >= 0;
234
235 // Set styles
236 if (STYLE_NS.equals(namespaceURI) && "style".equals(localName)) {
237 String family = attrs.getValue(STYLE_NS, "family");
238 if ("text".equals(family)) {
239 textStyle = new TextStyle();
240 String name = attrs.getValue(STYLE_NS, "name");
241 textStyleMap.put(name, textStyle);
242 }
243 } else if (TEXT_NS.equals(namespaceURI) && "list-style".equals(localName)) {
244 listStyle = new ListStyle();
245 String name = attrs.getValue(STYLE_NS, "name");
246 listStyleMap.put(name, listStyle);
247 } else if (textStyle != null && STYLE_NS.equals(namespaceURI)
248 && "text-properties".equals(localName)) {
249 String fontStyle = attrs.getValue(FORMATTING_OBJECTS_NS, "font-style");
250 if ("italic".equals(fontStyle) || "oblique".equals(fontStyle)) {
251 textStyle.italic = true;
252 }
253 String fontWeight = attrs.getValue(FORMATTING_OBJECTS_NS, "font-weight");
254 if ("bold".equals(fontWeight) || "bolder".equals(fontWeight)
255 || (fontWeight != null && Character.isDigit(fontWeight.charAt(0))
256 && Integer.valueOf(fontWeight) > 500)) {
257 textStyle.bold = true;
258 }
259 String underlineStyle = attrs.getValue(STYLE_NS, "text-underline-style");
260 if (underlineStyle != null) {
261 textStyle.underlined = true;
262 }
263 } else if (listStyle != null && TEXT_NS.equals(namespaceURI)) {
264 if ("list-level-style-bullet".equals(localName)) {
265 listStyle.ordered = false;
266 } else if ("list-level-style-number".equals(localName)) {
267 listStyle.ordered = true;
268 }
269 }
270
271 textNodeStack.set(nodeDepth++,
272 isTextNode(namespaceURI, localName));
273 // filter *all* content of some tags
274 assert completelyFiltered >= 0;
275
276 if (needsCompleteFiltering(namespaceURI, localName)) {
277 completelyFiltered++;
278 }
279 // call next handler if no filtering
280 if (completelyFiltered == 0) {
281 // special handling of text:h elements, which are passed
282 // directly to the underlying handler
283 if (TEXT_NS.equals(namespaceURI) && "h".equals(localName)) {
284 final String el = headingStack.push(getXHTMLHeaderTagName(attrs));
285 handler.startElement(XHTMLContentHandler.XHTML, el, el, EMPTY_ATTRIBUTES);
286 } else if (TEXT_NS.equals(namespaceURI) && "list".equals(localName)) {
287 startList(attrs.getValue(TEXT_NS, "style-name"));
288 } else if (TEXT_NS.equals(namespaceURI) && "span".equals(localName)) {
289 startSpan(attrs.getValue(TEXT_NS, "style-name"));
290 } else {
291 super.startElement(namespaceURI, localName, qName, attrs);
292 }
293 }
294 }
295
296 @Override
297 public void endElement(
298 String namespaceURI, String localName, String qName)
299 throws SAXException {
300 if (STYLE_NS.equals(namespaceURI) && "style".equals(localName)) {
301 textStyle = null;
302 } else if (TEXT_NS.equals(namespaceURI) && "list-style".equals(localName)) {
303 listStyle = null;
304 }
305
306 // call next handler if no filtering
307 if (completelyFiltered == 0) {
308 // special handling of text:h elements, which are passed
309 // directly to the underlying handler
310 if (TEXT_NS.equals(namespaceURI) && "h".equals(localName)) {
311 final String el = headingStack.pop();
312 handler.endElement(XHTMLContentHandler.XHTML, el, el);
313 } else if (TEXT_NS.equals(namespaceURI) && "list".equals(localName)) {
314 endList();
315 } else if (TEXT_NS.equals(namespaceURI) && "span".equals(localName)) {
316 endSpan();
317 } else {
318 if (TEXT_NS.equals(namespaceURI) && "p".equals(localName)) {
319 lazyEndSpan();
320 }
321 super.endElement(namespaceURI, localName, qName);
322 }
323
324 // special handling of tabulators
325 if (TEXT_NS.equals(namespaceURI)
326 && ("tab-stop".equals(localName)
327 || "tab".equals(localName))) {
328 this.characters(TAB, 0, TAB.length);
329 }
330 }
331
332 // revert filter for *all* content of some tags
333 if (needsCompleteFiltering(namespaceURI, localName)) {
334 completelyFiltered--;
335 }
336 assert completelyFiltered >= 0;
337
338 // reduce current node depth
339 nodeDepth--;
340 assert nodeDepth >= 0;
341 }
342
343 @Override
344 public void startPrefixMapping(String prefix, String uri) {
345 // remove prefix mappings as they should not occur in XHTML
346 }
347
348 @Override
349 public void endPrefixMapping(String prefix) {
350 // remove prefix mappings as they should not occur in XHTML
351 }
352 }
353
354 public static final String TEXT_NS =
355 "urn:oasis:names:tc:opendocument:xmlns:text:1.0";
206356
207357 public static final String TABLE_NS =
208 "urn:oasis:names:tc:opendocument:xmlns:table:1.0";
358 "urn:oasis:names:tc:opendocument:xmlns:table:1.0";
359
360 public static final String STYLE_NS =
361 "urn:oasis:names:tc:opendocument:xmlns:style:1.0";
362
363 public static final String FORMATTING_OBJECTS_NS =
364 "urn:oasis:names:tc:opendocument:xmlns:xsl-fo-compatible:1.0";
209365
210366 public static final String OFFICE_NS =
211 "urn:oasis:names:tc:opendocument:xmlns:office:1.0";
367 "urn:oasis:names:tc:opendocument:xmlns:office:1.0";
212368
213369 public static final String SVG_NS =
214 "urn:oasis:names:tc:opendocument:xmlns:svg-compatible:1.0";
370 "urn:oasis:names:tc:opendocument:xmlns:svg-compatible:1.0";
215371
216372 public static final String PRESENTATION_NS =
217 "urn:oasis:names:tc:opendocument:xmlns:presentation:1.0";
373 "urn:oasis:names:tc:opendocument:xmlns:presentation:1.0";
218374
219375 public static final String DRAW_NS =
220 "urn:oasis:names:tc:opendocument:xmlns:drawing:1.0";
376 "urn:oasis:names:tc:opendocument:xmlns:drawing:1.0";
221377
222378 public static final String XLINK_NS = "http://www.w3.org/1999/xlink";
223379
224 protected static final char[] TAB = new char[] { '\t' };
380 protected static final char[] TAB = new char[]{'\t'};
225381
226382 private static final Attributes EMPTY_ATTRIBUTES = new AttributesImpl();
227383
228384 /**
229385 * Mappings between ODF tag names and XHTML tag names
230386 * (including attributes). All other tag names/attributes are ignored
231 * and left out from event stream.
@@ -232 +388 @@
+     * and left out from event stream.
      */
     private static final HashMap<QName, TargetElement> MAPPINGS =
             new HashMap<QName, TargetElement>();
 
     static {
         // general mappings of text:-tags
@@ -243 +399 @@
                 new QName(TEXT_NS, "line-break"),
                 new TargetElement(XHTML, "br"));
         MAPPINGS.put(
-                new QName(TEXT_NS, "list"),
-                new TargetElement(XHTML, "ul"));
-        MAPPINGS.put(
                 new QName(TEXT_NS, "list-item"),
                 new TargetElement(XHTML, "li"));
         MAPPINGS.put(
@@ -272 +425 @@
         MAPPINGS.put(
                 new QName(TEXT_NS, "span"),
                 new TargetElement(XHTML, "span"));
 
-        final HashMap<QName,QName> aAttsMapping =
-                new HashMap<QName,QName>();
+        final HashMap<QName, QName> aAttsMapping =
+                new HashMap<QName, QName>();
         aAttsMapping.put(
                 new QName(XLINK_NS, "href"),
                 new QName("href"));
@@ -294 +447 @@
                 new QName(TABLE_NS, "table-row"),
                 new TargetElement(XHTML, "tr"));
         // special mapping for rowspan/colspan attributes
-        final HashMap<QName,QName> tableCellAttsMapping =
-                new HashMap<QName,QName>();
+        final HashMap<QName, QName> tableCellAttsMapping =
+                new HashMap<QName, QName>();
         tableCellAttsMapping.put(
                 new QName(TABLE_NS, "number-columns-spanned"),
                 new QName("colspan"));
@@ -325 +478 @@
             Metadata metadata, ParseContext context)
             throws IOException, SAXException, TikaException {
         parseInternal(stream,
-                new XHTMLContentHandler(handler,metadata),
+                new XHTMLContentHandler(handler, metadata),
                 metadata, context);
     }
 
     void parseInternal(
@@ -342 +495 @@
         factory.setNamespaceAware(true);
         try {
             factory.setFeature(XMLConstants.FEATURE_SECURE_PROCESSING, true);
-        } catch (SAXNotRecognizedException e){
+        } catch (SAXNotRecognizedException e) {
             // TIKA-329: Some XML parsers do not support the secure-processing
             // feature, even though it's required by JAXP in Java 5. Ignoring
             // the exception is fine here, deployments without this feature
@@ -359 +512 @@
     }
 
 }
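The TIKA-329 workaround in the hunk above — requesting JAXP secure processing and tolerating parsers that do not recognize the feature — can be sketched against the JDK's own SAX API. This is a minimal standalone model; the class name `SecureFactorySketch` is illustrative and not part of Tika:

```java
import javax.xml.XMLConstants;
import javax.xml.parsers.SAXParserFactory;
import org.xml.sax.SAXNotRecognizedException;

public class SecureFactorySketch {

    /**
     * Returns a namespace-aware SAXParserFactory with secure processing
     * enabled when the underlying parser supports it. Parsers that do not
     * recognize the feature are tolerated, mirroring the TIKA-329 catch.
     */
    public static SAXParserFactory newFactory() {
        SAXParserFactory factory = SAXParserFactory.newInstance();
        factory.setNamespaceAware(true);
        try {
            factory.setFeature(XMLConstants.FEATURE_SECURE_PROCESSING, true);
        } catch (SAXNotRecognizedException e) {
            // Some parsers do not support the feature; ignoring the
            // exception is fine, they simply lack the extra protection.
        } catch (Exception e) {
            // setFeature also declares SAXNotSupportedException and
            // ParserConfigurationException; treat them the same way here.
        }
        return factory;
    }

    public static void main(String[] args) {
        SAXParserFactory f = newFactory();
        System.out.println("namespaceAware=" + f.isNamespaceAware());
    }
}
```

The point of the pattern is that secure processing is best-effort: the parse proceeds either way, with or without the entity-expansion limits the feature provides.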
@@ -49 +49 @@
      * Serial version UID
      */
     private static final long serialVersionUID = -8739250869531737584L;
 
     private static final String META_NS = "urn:oasis:names:tc:opendocument:xmlns:meta:1.0";
     private static final XPathParser META_XPATH = new XPathParser("meta", META_NS);
 
     /**
      * @see OfficeOpenXMLCore#SUBJECT
      * @deprecated use OfficeOpenXMLCore#SUBJECT
      */
     @Deprecated
     private static final Property TRANSITION_INITIAL_CREATOR_TO_INITIAL_AUTHOR =
             Property.composite(Office.INITIAL_AUTHOR,
-                    new Property[] { Property.externalText("initial-creator") });
+                    new Property[]{Property.externalText("initial-creator")});
 
     private static ContentHandler getDublinCoreHandler(
             Metadata metadata, Property property, String element) {
         return new ElementMetadataHandler(
                 DublinCore.NAMESPACE_URI_DC, element,
                 metadata, property);
     }
 
     private static ContentHandler getMeta(
             ContentHandler ch, Metadata md, Property property, String element) {
         Matcher matcher = new CompositeMatcher(
                 META_XPATH.parse("//meta:" + element),
                 META_XPATH.parse("//meta:" + element + "//text()"));
         ContentHandler branch =
                 new MatchingContentHandler(new MetadataHandler(md, property), matcher);
         return new TeeContentHandler(ch, branch);
     }
 
@@ -86 +86 @@
                 META_XPATH.parse("//meta:user-defined//text()"));
         // eg <meta:user-defined meta:name="Info1">Text1</meta:user-defined> becomes custom:Info1=Text1
         ContentHandler branch = new MatchingContentHandler(
                 new AttributeDependantMetadataHandler(md, "meta:name", Metadata.USER_DEFINED_METADATA_NAME_PREFIX),
                 matcher);
         return new TeeContentHandler(ch, branch);
     }
 
-    @Deprecated private static ContentHandler getStatistic(
+    @Deprecated
+    private static ContentHandler getStatistic(
             ContentHandler ch, Metadata md, String name, String attribute) {
         Matcher matcher =
-                META_XPATH.parse("//meta:document-statistic/@meta:"+attribute);
+                META_XPATH.parse("//meta:document-statistic/@meta:" + attribute);
         ContentHandler branch = new MatchingContentHandler(
                 new AttributeMetadataHandler(META_NS, attribute, md, name), matcher);
         return new TeeContentHandler(ch, branch);
     }
+
     private static ContentHandler getStatistic(
             ContentHandler ch, Metadata md, Property property, String attribute) {
         Matcher matcher =
-                META_XPATH.parse("//meta:document-statistic/@meta:"+attribute);
+                META_XPATH.parse("//meta:document-statistic/@meta:" + attribute);
         ContentHandler branch = new MatchingContentHandler(
                 new AttributeMetadataHandler(META_NS, attribute, md, property), matcher);
         return new TeeContentHandler(ch, branch);
     }
 
     protected ContentHandler getContentHandler(ContentHandler ch, Metadata md, ParseContext context) {
         // We can no longer extend DcXMLParser due to the handling of dc:subject and dc:date
@@ -122 +124 @@
                 getDublinCoreHandler(md, TikaCoreProperties.IDENTIFIER, "identifier"),
                 getDublinCoreHandler(md, TikaCoreProperties.LANGUAGE, "language"),
                 getDublinCoreHandler(md, TikaCoreProperties.RIGHTS, "rights"));
 
         // Process the OO Meta Attributes
         ch = getMeta(ch, md, TikaCoreProperties.CREATED, "creation-date");
         // ODF uses dc:date for modified
         ch = new TeeContentHandler(ch, new ElementMetadataHandler(
                 DublinCore.NAMESPACE_URI_DC, "date",
                 md, TikaCoreProperties.MODIFIED));
 
         // ODF uses dc:subject for description
         ch = new TeeContentHandler(ch, new ElementMetadataHandler(
                 DublinCore.NAMESPACE_URI_DC, "subject",
                 md, TikaCoreProperties.TRANSITION_SUBJECT_TO_OO_SUBJECT));
         ch = getMeta(ch, md, TikaCoreProperties.TRANSITION_KEYWORDS_TO_DC_SUBJECT, "keyword");
 
         ch = getMeta(ch, md, Property.externalText(MSOffice.EDIT_TIME), "editing-duration");
         ch = getMeta(ch, md, Property.externalText("editing-cycles"), "editing-cycles");
         ch = getMeta(ch, md, TRANSITION_INITIAL_CREATOR_TO_INITIAL_AUTHOR, "initial-creator");
         ch = getMeta(ch, md, Property.externalText("generator"), "generator");
 
         // Process the user defined Meta Attributes
         ch = getUserDefined(ch, md);
 
         // Process the OO Statistics Attributes
         ch = getStatistic(ch, md, Office.OBJECT_COUNT, "object-count");
         ch = getStatistic(ch, md, Office.IMAGE_COUNT, "image-count");
         ch = getStatistic(ch, md, Office.PAGE_COUNT, "page-count");
         ch = getStatistic(ch, md, PagedText.N_PAGES, "page-count");
         ch = getStatistic(ch, md, Office.TABLE_COUNT, "table-count");
         ch = getStatistic(ch, md, Office.PARAGRAPH_COUNT, "paragraph-count");
         ch = getStatistic(ch, md, Office.WORD_COUNT, "word-count");
         ch = getStatistic(ch, md, Office.CHARACTER_COUNT, "character-count");
 
         // Legacy, Tika-1.0 style attributes
         // TODO Remove these in Tika 2.0
         ch = getStatistic(ch, md, MSOffice.OBJECT_COUNT, "object-count");
         ch = getStatistic(ch, md, MSOffice.IMAGE_COUNT, "image-count");
         ch = getStatistic(ch, md, MSOffice.PAGE_COUNT, "page-count");
         ch = getStatistic(ch, md, MSOffice.TABLE_COUNT, "table-count");
         ch = getStatistic(ch, md, MSOffice.PARAGRAPH_COUNT, "paragraph-count");
         ch = getStatistic(ch, md, MSOffice.WORD_COUNT, "word-count");
         ch = getStatistic(ch, md, MSOffice.CHARACTER_COUNT, "character-count");
 
         // Legacy Statistics Attributes, replaced with real keys above
         // TODO Remove these shortly, eg after Tika 1.1 (TIKA-770)
         ch = getStatistic(ch, md, "nbPage", "page-count");
@@ -173 +175 @@
         ch = getStatistic(ch, md, "nbTab", "table-count");
         ch = getStatistic(ch, md, "nbObject", "object-count");
         ch = getStatistic(ch, md, "nbImg", "image-count");
 
         // Normalise the rest
         ch = new NSNormalizerContentHandler(ch);
         return ch;
     }
 
     @Override
     public void parse(
             InputStream stream, ContentHandler handler,
@@ -187 +189 @@
         super.parse(stream, handler, metadata, context);
         // Copy subject to description for OO2
         String odfSubject = metadata.get(OfficeOpenXMLCore.SUBJECT);
         if (odfSubject != null && !odfSubject.equals("") &&
                 (metadata.get(TikaCoreProperties.DESCRIPTION) == null || metadata.get(TikaCoreProperties.DESCRIPTION).equals(""))) {
             metadata.set(TikaCoreProperties.DESCRIPTION, odfSubject);
         }
     }
 
 }
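The comment in `getUserDefined` states the mapping: `<meta:user-defined meta:name="Info1">Text1</meta:user-defined>` becomes `custom:Info1=Text1`. A minimal standalone model of that extraction using only the JDK's SAX parser; the class name is hypothetical, and Tika itself routes this through `MatchingContentHandler` and `AttributeDependantMetadataHandler` rather than a hand-written handler:

```java
import java.io.ByteArrayInputStream;
import java.nio.charset.StandardCharsets;
import java.util.HashMap;
import java.util.Map;

import javax.xml.parsers.SAXParserFactory;

import org.xml.sax.Attributes;
import org.xml.sax.helpers.DefaultHandler;

/** Toy extractor for ODF meta:user-defined entries (illustrative only). */
public class UserDefinedMetaSketch extends DefaultHandler {

    static final String META_NS = "urn:oasis:names:tc:opendocument:xmlns:meta:1.0";
    static final String PREFIX = "custom:"; // mirrors USER_DEFINED_METADATA_NAME_PREFIX

    final Map<String, String> metadata = new HashMap<String, String>();
    private String currentName;
    private StringBuilder text;

    @Override
    public void startElement(String uri, String local, String qName, Attributes atts) {
        if (META_NS.equals(uri) && "user-defined".equals(local)) {
            // meta:name carries the user-chosen key
            currentName = atts.getValue(META_NS, "name");
            text = new StringBuilder();
        }
    }

    @Override
    public void characters(char[] ch, int start, int length) {
        if (text != null) text.append(ch, start, length);
    }

    @Override
    public void endElement(String uri, String local, String qName) {
        if (META_NS.equals(uri) && "user-defined".equals(local) && currentName != null) {
            metadata.put(PREFIX + currentName, text.toString());
            currentName = null;
            text = null;
        }
    }

    public static Map<String, String> parse(String xml) throws Exception {
        SAXParserFactory factory = SAXParserFactory.newInstance();
        factory.setNamespaceAware(true);
        UserDefinedMetaSketch handler = new UserDefinedMetaSketch();
        factory.newSAXParser().parse(
                new ByteArrayInputStream(xml.getBytes(StandardCharsets.UTF_8)), handler);
        return handler.metadata;
    }

    public static void main(String[] args) throws Exception {
        String xml = "<meta:user-defined xmlns:meta=\"" + META_NS + "\""
                + " meta:name=\"Info1\">Text1</meta:user-defined>";
        System.out.println(parse(xml)); // {custom:Info1=Text1}
    }
}
```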
@@ -45 +45 @@
      */
 public class OpenDocumentParser extends AbstractParser {
 
-    /** Serial version UID */
+    /**
+     * Serial version UID
+     */
     private static final long serialVersionUID = -6410276875438618287L;
 
     private static final Set<MediaType> SUPPORTED_TYPES =
             Collections.unmodifiableSet(new HashSet<MediaType>(Arrays.asList(
                     MediaType.application("vnd.sun.xml.writer"),
                     MediaType.application("vnd.oasis.opendocument.text"),
                     MediaType.application("vnd.oasis.opendocument.graphics"),
                     MediaType.application("vnd.oasis.opendocument.presentation"),
                     MediaType.application("vnd.oasis.opendocument.spreadsheet"),
                     MediaType.application("vnd.oasis.opendocument.chart"),
                     MediaType.application("vnd.oasis.opendocument.image"),
                     MediaType.application("vnd.oasis.opendocument.formula"),
                     MediaType.application("vnd.oasis.opendocument.text-master"),
                     MediaType.application("vnd.oasis.opendocument.text-web"),
                     MediaType.application("vnd.oasis.opendocument.text-template"),
                     MediaType.application("vnd.oasis.opendocument.graphics-template"),
                     MediaType.application("vnd.oasis.opendocument.presentation-template"),
                     MediaType.application("vnd.oasis.opendocument.spreadsheet-template"),
                     MediaType.application("vnd.oasis.opendocument.chart-template"),
                     MediaType.application("vnd.oasis.opendocument.image-template"),
                     MediaType.application("vnd.oasis.opendocument.formula-template"),
                     MediaType.application("x-vnd.oasis.opendocument.text"),
                     MediaType.application("x-vnd.oasis.opendocument.graphics"),
                     MediaType.application("x-vnd.oasis.opendocument.presentation"),
                     MediaType.application("x-vnd.oasis.opendocument.spreadsheet"),
                     MediaType.application("x-vnd.oasis.opendocument.chart"),
                     MediaType.application("x-vnd.oasis.opendocument.image"),
                     MediaType.application("x-vnd.oasis.opendocument.formula"),
                     MediaType.application("x-vnd.oasis.opendocument.text-master"),
                     MediaType.application("x-vnd.oasis.opendocument.text-web"),
                     MediaType.application("x-vnd.oasis.opendocument.text-template"),
                     MediaType.application("x-vnd.oasis.opendocument.graphics-template"),
                     MediaType.application("x-vnd.oasis.opendocument.presentation-template"),
                     MediaType.application("x-vnd.oasis.opendocument.spreadsheet-template"),
                     MediaType.application("x-vnd.oasis.opendocument.chart-template"),
                     MediaType.application("x-vnd.oasis.opendocument.image-template"),
                     MediaType.application("x-vnd.oasis.opendocument.formula-template"))));
 
     private static final String META_NAME = "meta.xml";
 
     private Parser meta = new OpenDocumentMetaParser();
 
     private Parser content = new OpenDocumentContentParser();
@@ -125 +127 @@
             if (container instanceof ZipFile) {
                 zipFile = (ZipFile) container;
             } else if (tis.hasFile()) {
                 zipFile = new ZipFile(tis.getFile());
+            } else {
+                zipStream = new ZipInputStream(stream);
             }
         } else {
             zipStream = new ZipInputStream(stream);
@@ -136 +140 @@
 
         // As we don't know which of the metadata or the content
         // we'll hit first, catch the endDocument call initially
         EndDocumentShieldingContentHandler handler =
                 new EndDocumentShieldingContentHandler(xhtml);
 
         // If we can, process the metadata first, then the
         // rest of the file afterwards
         // Only possible to guarantee that when opened from a file not a stream
@@ -150 +154 @@
             Enumeration<? extends ZipEntry> entries = zipFile.entries();
             while (entries.hasMoreElements()) {
                 entry = entries.nextElement();
-                if (! META_NAME.equals(entry.getName())) {
+                if (!META_NAME.equals(entry.getName())) {
                     handleZipEntry(entry, zipFile.getInputStream(entry), metadata, context, handler);
                 }
             }
@@ -162 +166 @@
             } while (entry != null);
             zipStream.close();
         }
 
         // Only now call the end document
-        if(handler.getEndDocumentWasCalled()) {
+        if (handler.getEndDocumentWasCalled()) {
             handler.reallyEndDocument();
         }
     }
 
     private void handleZipEntry(ZipEntry entry, InputStream zip, Metadata metadata,
                                 ParseContext context, EndDocumentShieldingContentHandler handler)
             throws IOException, SAXException, TikaException {
         if (entry == null) return;
 
         if (entry.getName().equals("mimetype")) {
-            String type = IOUtils.toString(zip, "UTF-8");
+            String type = IOUtils.toString(zip, IOUtils.UTF_8.name());
             metadata.set(Metadata.CONTENT_TYPE, type);
         } else if (entry.getName().equals(META_NAME)) {
             meta.parse(zip, new DefaultHandler(), metadata, context);
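`handleZipEntry` above treats the bytes of the container's `mimetype` zip entry as its content type. That lookup can be sketched with only `java.util.zip`; the class and method names here are illustrative, not Tika's:

```java
import java.io.ByteArrayInputStream;
import java.io.ByteArrayOutputStream;
import java.io.IOException;
import java.io.InputStream;
import java.nio.charset.StandardCharsets;
import java.util.zip.ZipEntry;
import java.util.zip.ZipInputStream;
import java.util.zip.ZipOutputStream;

public class OdfMimetypeSketch {

    /** Scans the stream and returns the contents of the "mimetype" entry, or null. */
    public static String readMimetype(InputStream in) throws IOException {
        ZipInputStream zip = new ZipInputStream(in);
        ZipEntry entry;
        while ((entry = zip.getNextEntry()) != null) {
            if ("mimetype".equals(entry.getName())) {
                ByteArrayOutputStream buf = new ByteArrayOutputStream();
                byte[] b = new byte[1024];
                int n;
                while ((n = zip.read(b)) != -1) {
                    buf.write(b, 0, n);
                }
                return new String(buf.toByteArray(), StandardCharsets.UTF_8);
            }
        }
        return null;
    }

    /** Builds a minimal ODF-like container in memory for the demo. */
    public static byte[] fakeOdt() throws IOException {
        ByteArrayOutputStream out = new ByteArrayOutputStream();
        ZipOutputStream zip = new ZipOutputStream(out);
        zip.putNextEntry(new ZipEntry("mimetype"));
        zip.write("application/vnd.oasis.opendocument.text".getBytes(StandardCharsets.UTF_8));
        zip.closeEntry();
        zip.close();
        return out.toByteArray();
    }

    public static void main(String[] args) throws IOException {
        System.out.println(readMimetype(new ByteArrayInputStream(fakeOdt())));
    }
}
```

In a real ODF package the `mimetype` entry is required to be the first entry and stored uncompressed, which is what makes it usable for magic-based detection; the sketch ignores that packaging detail and simply scans for the entry by name.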
+/*
+ * Licensed to the Apache Software Foundation (ASF) under one or more
+ * contributor license agreements.  See the NOTICE file distributed with
+ * this work for additional information regarding copyright ownership.
+ * The ASF licenses this file to You under the Apache License, Version 2.0
+ * (the "License"); you may not use this file except in compliance with
+ * the License.  You may obtain a copy of the License at
+ *
+ *     http://www.apache.org/licenses/LICENSE-2.0
+ *
+ * Unless required by applicable law or agreed to in writing, software
+ * distributed under the License is distributed on an "AS IS" BASIS,
+ * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+ * See the License for the specific language governing permissions and
+ * limitations under the License.
+ */
+
+package org.apache.tika.parser.pdf;
+
+import java.io.Serializable;
+
+import org.apache.tika.exception.AccessPermissionException;
+import org.apache.tika.metadata.AccessPermissions;
+import org.apache.tika.metadata.Metadata;
+
+/**
+ * Checks whether or not a document allows extraction generally
+ * or extraction for accessibility only.
+ */
+public class AccessChecker implements Serializable {
+
+    private static final long serialVersionUID = 6492570218190936986L;
+
+    private final boolean needToCheck;
+    private final boolean allowAccessibility;
+
+    /**
+     * This constructs an {@link AccessChecker} that
+     * will not perform any checking and will always return without
+     * throwing an exception.
+     * <p>
+     * This constructor is available to allow for Tika's legacy ( <= v1.7) behavior.
+     */
+    public AccessChecker() {
+        needToCheck = false;
+        allowAccessibility = true;
+    }
+
+    /**
+     * This constructs an {@link AccessChecker} that will check
+     * for whether or not content should be extracted from a document.
+     *
+     * @param allowExtractionForAccessibility if general extraction is not allowed, is extraction for accessibility allowed
+     */
+    public AccessChecker(boolean allowExtractionForAccessibility) {
+        needToCheck = true;
+        this.allowAccessibility = allowExtractionForAccessibility;
+    }
+
+    /**
+     * Checks to see if a document's content should be extracted based
+     * on metadata values and the value of {@link #allowAccessibility} in the constructor.
+     *
+     * @param metadata
+     * @throws AccessPermissionException if access is not permitted
+     */
+    public void check(Metadata metadata) throws AccessPermissionException {
+        if (!needToCheck) {
+            return;
+        }
+        if ("false".equals(metadata.get(AccessPermissions.EXTRACT_CONTENT))) {
+            if (allowAccessibility) {
+                if ("true".equals(metadata.get(AccessPermissions.EXTRACT_FOR_ACCESSIBILITY))) {
+                    return;
+                }
+                throw new AccessPermissionException("Content extraction for accessibility is not allowed.");
+            }
+            throw new AccessPermissionException("Content extraction is not allowed.");
+        }
+    }
+}
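The `check` method above reduces to a small decision table. The following is a standalone model of the same logic that returns a refusal message instead of throwing `AccessPermissionException`, with plain strings standing in for Tika's `Metadata` values; the class name is hypothetical:

```java
public class AccessCheckSketch {

    /**
     * Mirrors AccessChecker.check: returns null when extraction may
     * proceed, otherwise a message describing why it is refused.
     *
     * @param needToCheck        false models the no-arg (legacy <= 1.7) constructor
     * @param allowAccessibility config: is accessibility-only extraction acceptable
     * @param extractContent             document's EXTRACT_CONTENT permission
     * @param extractForAccessibility    document's EXTRACT_FOR_ACCESSIBILITY permission
     */
    public static String check(boolean needToCheck, boolean allowAccessibility,
                               String extractContent, String extractForAccessibility) {
        if (!needToCheck) {
            return null; // legacy behavior: never refuse
        }
        if ("false".equals(extractContent)) {
            if (allowAccessibility) {
                if ("true".equals(extractForAccessibility)) {
                    return null; // accessibility extraction permitted by the doc
                }
                return "Content extraction for accessibility is not allowed.";
            }
            return "Content extraction is not allowed.";
        }
        return null; // general extraction permitted
    }

    public static void main(String[] args) {
        // extraction forbidden, but accessibility allowed by both doc and config
        System.out.println(check(true, true, "false", "true"));
        // extraction forbidden and accessibility extraction also forbidden
        System.out.println(check(true, true, "false", "false"));
    }
}
```

Note that only the literal string "false" in `EXTRACT_CONTENT` triggers a refusal: an absent permission (null) is treated as allowed, which keeps documents without permission metadata extractable.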
@@ -21 +21 @@
 import java.io.Writer;
 import java.text.SimpleDateFormat;
 import java.util.Calendar;
-import java.util.HashSet;
+import java.util.HashMap;
 import java.util.List;
 import java.util.ListIterator;
+import java.util.Locale;
 import java.util.Map;
-import java.util.Set;
 import java.util.TreeMap;
 
 import org.apache.pdfbox.pdmodel.PDDocument;
@@ -46 +46 @@
 import org.apache.pdfbox.pdmodel.graphics.xobject.PDXObjectImage;
 import org.apache.pdfbox.pdmodel.interactive.action.type.PDAction;
 import org.apache.pdfbox.pdmodel.interactive.action.type.PDActionURI;
+import org.apache.pdfbox.pdmodel.interactive.annotation.PDAnnotation;
+import org.apache.pdfbox.pdmodel.interactive.annotation.PDAnnotationFileAttachment;
 import org.apache.pdfbox.pdmodel.interactive.annotation.PDAnnotationLink;
 import org.apache.pdfbox.pdmodel.interactive.annotation.PDAnnotationMarkup;
 import org.apache.pdfbox.pdmodel.interactive.digitalsignature.PDSignature;
@@ -80 +82 @@
 class PDF2XHTML extends PDFTextStripper {
 
     /**
-     * format used for signature dates
+     * Format used for signature dates
+     * TODO Make this thread-safe
      */
-    private final SimpleDateFormat dateFormat = new SimpleDateFormat("yyyy-MM-dd'T'HH:mm:ssZ");
+    private final SimpleDateFormat dateFormat = new SimpleDateFormat("yyyy-MM-dd'T'HH:mm:ssZ", Locale.ROOT);
 
     /**
      * Maximum recursive depth during AcroForm processing.
@@ -90 +93 @@
      */
     private final static int MAX_ACROFORM_RECURSIONS = 10;
 
-
-    // TODO: remove once PDFBOX-2160 is fixed:
-    private boolean inParagraph = false;
-
     /**
      * This keeps track of the pdf object ids for inline
      * images that have been processed. If {@link PDFParserConfig#getExtractUniqueInlineImagesOnly()
      * is true, this will be checked before extracting an embedded image.
+     * The integer keeps track of the inlineImageCounter for that image.
+     * This integer is used to identify images in the markup.
      */
-    private Set<String> processedInlineImages = new HashSet<String>();
-
+    private Map<String, Integer> processedInlineImages = new HashMap<String, Integer>();
+
+    private int inlineImageCounter = 0;
 
     /**
      * Converts the given PDF document (and related metadata) to a stream
@@ -229 +231 @@
 
         extractImages(page.getResources());
 
-        // TODO: remove once PDFBOX-1143 is fixed:
-        if (config.getExtractAnnotationText()) {
-            for(Object o : page.getAnnotations()) {
-                if( o instanceof PDAnnotationLink ) {
-                    PDAnnotationLink annotationlink = (PDAnnotationLink) o;
-                    if (annotationlink.getAction() != null) {
+        EmbeddedDocumentExtractor extractor = getEmbeddedDocumentExtractor();
+        for (PDAnnotation annotation : page.getAnnotations()) {
+
+            if (annotation instanceof PDAnnotationFileAttachment) {
+                PDAnnotationFileAttachment fann = (PDAnnotationFileAttachment) annotation;
+                PDComplexFileSpecification fileSpec = (PDComplexFileSpecification) fann.getFile();
+                try {
+                    extractMultiOSPDEmbeddedFiles("", fileSpec, extractor);
+                } catch (SAXException e) {
+                    throw new IOExceptionWithCause("file embedded in annotation sax exception", e);
+                } catch (TikaException e) {
+                    throw new IOExceptionWithCause("file embedded in annotation tika exception", e);
+                }
+            }
+            // TODO: remove once PDFBOX-1143 is fixed:
+            if (config.getExtractAnnotationText()) {
+                if (annotation instanceof PDAnnotationLink) {
+                    PDAnnotationLink annotationlink = (PDAnnotationLink) annotation;
+                    if (annotationlink.getAction() != null) {
                         PDAction action = annotationlink.getAction();
-                        if( action instanceof PDActionURI ) {
+                        if (action instanceof PDActionURI) {
                             PDActionURI uri = (PDActionURI) action;
                             String link = uri.getURI();
                             if (link != null) {
@@ -245 +260 @@
                                 handler.endElement("a");
                                 handler.endElement("div");
                             }
                         }
                     }
                 }
 
-                if (o instanceof PDAnnotationMarkup) {
-                    PDAnnotationMarkup annot = (PDAnnotationMarkup) o;
-                    String title = annot.getTitlePopup();
-                    String subject = annot.getSubject();
-                    String contents = annot.getContents();
-                    // TODO: maybe also annot.getRichContents()?
+                if (annotation instanceof PDAnnotationMarkup) {
+                    PDAnnotationMarkup annotationMarkup = (PDAnnotationMarkup) annotation;
+                    String title = annotationMarkup.getTitlePopup();
+                    String subject = annotationMarkup.getSubject();
+                    String contents = annotationMarkup.getContents();
+                    // TODO: maybe also annotationMarkup.getRichContents()?
                     if (title != null || subject != null || contents != null) {
                         handler.startElement("div", "class", "annotation");
 
@@ -305 +320 @@
         if (object instanceof PDXObjectForm) {
             extractImages(((PDXObjectForm) object).getResources());
         } else if (object instanceof PDXObjectImage) {
 
+            PDXObjectImage image = (PDXObjectImage) object;
+
+            Metadata metadata = new Metadata();
+            String extension = "";
+            if (image instanceof PDJpeg) {
+                metadata.set(Metadata.CONTENT_TYPE, "image/jpeg");
+                extension = ".jpg";
+            } else if (image instanceof PDCcitt) {
+                metadata.set(Metadata.CONTENT_TYPE, "image/tiff");
+                extension = ".tif";
+            } else if (image instanceof PDPixelMap) {
+                metadata.set(Metadata.CONTENT_TYPE, "image/png");
+                extension = ".png";
+            }
+
+            Integer imageNumber = processedInlineImages.get(entry.getKey());
+            if (imageNumber == null) {
+                imageNumber = inlineImageCounter++;
+            }
+            String fileName = "image" + imageNumber + extension;
+            metadata.set(Metadata.RESOURCE_NAME_KEY, fileName);
+
+            // Output the img tag
+            AttributesImpl attr = new AttributesImpl();
+            attr.addAttribute("", "src", "src", "CDATA", "embedded:" + fileName);
+            attr.addAttribute("", "alt", "alt", "CDATA", fileName);
+            handler.startElement("img", attr);
+            handler.endElement("img");
+
             //Do we only want to process unique COSObject ids?
             //If so, have we already processed this one?
             if (config.getExtractUniqueInlineImagesOnly() == true) {
                 String cosObjectId = entry.getKey();
-                if (processedInlineImages.contains(cosObjectId)) {
+                if (processedInlineImages.containsKey(cosObjectId)) {
                     continue;
                 }
-                processedInlineImages.add(cosObjectId);
-            }
-
-            PDXObjectImage image = (PDXObjectImage) object;
-
-            Metadata metadata = new Metadata();
-            if (image instanceof PDJpeg) {
-                metadata.set(Metadata.CONTENT_TYPE, "image/jpeg");
-            } else if (image instanceof PDCcitt) {
-                metadata.set(Metadata.CONTENT_TYPE, "image/tiff");
-            } else if (image instanceof PDPixelMap) {
-                metadata.set(Metadata.CONTENT_TYPE, "image/png");
-            }
-            metadata.set(TikaCoreProperties.EMBEDDED_RESOURCE_TYPE,
+                processedInlineImages.put(cosObjectId, imageNumber);
+            }
+
+            metadata.set(TikaCoreProperties.EMBEDDED_RESOURCE_TYPE,
                     TikaCoreProperties.EmbeddedResourceType.INLINE.toString());
 
             EmbeddedDocumentExtractor extractor =
@@ -360 +394 @@
 
     @Override
     protected void writeParagraphStart() throws IOException {
-        // TODO: remove once PDFBOX-2160 is fixed
-        if (inParagraph) {
-            // Close last paragraph
-            writeParagraphEnd();
-        }
-        assert !inParagraph;
-        inParagraph = true;
+        super.writeParagraphStart();
         try {
             handler.startElement("p");
         } catch (SAXException e) {
@@ -376 +404 @@
 
     @Override
     protected void writeParagraphEnd() throws IOException {
-        // TODO: remove once PDFBOX-2160 is fixed
-        if (!inParagraph) {
-            writeParagraphStart();
-        }
-        assert inParagraph;
-        inParagraph = false;
+        super.writeParagraphEnd();
         try {
             handler.endElement("p");
         } catch (SAXException e) {
@@ -473 +496 @@
         EmbeddedDocumentExtractor extractor = getEmbeddedDocumentExtractor();
         for (Map.Entry<String,COSObjectable> ent : embeddedFileNames.entrySet()) {
             PDComplexFileSpecification spec = (PDComplexFileSpecification) ent.getValue();
-            if (spec == null) {
-                //skip silently
-                continue;
-            }
-            PDEmbeddedFile file = spec.getEmbeddedFile();
-            if (file == null) {
-                //skip silently
-                continue;
-            }
-
-            //current strategy is to pull all, not just first non-null
-            extractPDEmbeddedFile(ent.getKey(), spec.getFile(), spec.getEmbeddedFile(), extractor);
-            extractPDEmbeddedFile(ent.getKey(), spec.getFileMac(), spec.getEmbeddedFileMac(), extractor);
-            extractPDEmbeddedFile(ent.getKey(), spec.getFileDos(), spec.getEmbeddedFileDos(), extractor);
-            extractPDEmbeddedFile(ent.getKey(), spec.getFileUnix(), spec.getEmbeddedFileUnix(), extractor);
-
-        }
+            extractMultiOSPDEmbeddedFiles(ent.getKey(), spec, extractor);
+        }
+    }
+
+    private void extractMultiOSPDEmbeddedFiles(String defaultName,
+            PDComplexFileSpecification spec, EmbeddedDocumentExtractor extractor) throws IOException,
+            SAXException, TikaException {
+
+        if (spec == null) {
+            return;
+        }
+        //current strategy is to pull all, not just first non-null
+        extractPDEmbeddedFile(defaultName, spec.getFile(), spec.getEmbeddedFile(), extractor);
+        extractPDEmbeddedFile(defaultName, spec.getFileMac(), spec.getEmbeddedFileMac(), extractor);
+        extractPDEmbeddedFile(defaultName, spec.getFileDos(), spec.getEmbeddedFileDos(), extractor);
+        extractPDEmbeddedFile(defaultName, spec.getFileUnix(), spec.getEmbeddedFileUnix(), extractor);
     }
 
     private void extractPDEmbeddedFile(String defaultName, String fileName, PDEmbeddedFile file,
@@ -519 +541 @@
                     stream,
                     new EmbeddedContentHandler(handler),
                     metadata, false);
+
+            AttributesImpl attributes = new AttributesImpl();
+            attributes.addAttribute("", "class", "class", "CDATA", "embedded");
+            attributes.addAttribute("", "id", "id", "CDATA", fileName);
+            handler.startElement("div", attributes);
+            handler.endElement("div");
         } finally {
             IOUtils.closeQuietly(stream);
         }
@@ -617 +645 @@
             }
         } catch (IOException e) {
             //swallow
-        } catch (NullPointerException e) {
-            //TODO: remove once PDFBOX-2161 is fixed
         }
 
         if (attrs.getLength() > 0 || sb.length() > 0) {
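The switch above from `Set<String>` to `Map<String, Integer>` for `processedInlineImages` gives each COS object id a stable image number, so a repeated inline image can be referenced by the same `embedded:imageN` name in the markup while its bytes are extracted only once. A standalone model of that bookkeeping (class and method names are illustrative, not Tika's):

```java
import java.util.ArrayList;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

public class InlineImageNumbering {

    // COS object id -> stable image number, mirroring processedInlineImages
    private final Map<String, Integer> processedInlineImages = new HashMap<String, Integer>();
    private int inlineImageCounter = 0;

    // stands in for the actually-extracted embedded resources
    final List<String> extracted = new ArrayList<String>();

    /**
     * Returns the markup file name for an inline image. When uniqueOnly is
     * set, the bytes are "extracted" only the first time a COS object id
     * is seen; later occurrences reuse the same file name.
     */
    public String handleImage(String cosObjectId, boolean uniqueOnly) {
        Integer imageNumber = processedInlineImages.get(cosObjectId);
        if (imageNumber == null) {
            imageNumber = inlineImageCounter++;
        }
        String fileName = "image" + imageNumber + ".png";
        if (uniqueOnly) {
            if (processedInlineImages.containsKey(cosObjectId)) {
                return fileName; // img tag can be emitted, bytes not re-extracted
            }
            processedInlineImages.put(cosObjectId, imageNumber);
        }
        extracted.add(fileName);
        return fileName;
    }

    public static void main(String[] args) {
        InlineImageNumbering n = new InlineImageNumbering();
        System.out.println(n.handleImage("obj1", true));
        System.out.println(n.handleImage("obj1", true)); // same name, no re-extraction
        System.out.println(n.handleImage("obj2", true));
    }
}
```

This is why the diff emits the `<img>` tag before the uniqueness check: the markup always refers to the image by its stable name, even when the `continue` skips a second extraction of the bytes.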
2121 import java.util.Calendar;
2222 import java.util.Collections;
2323 import java.util.List;
24 import java.util.Locale;
2425 import java.util.Set;
2526
2627 import org.apache.jempbox.xmp.XMPSchema;
3132 import org.apache.pdfbox.cos.COSDictionary;
3233 import org.apache.pdfbox.cos.COSName;
3334 import org.apache.pdfbox.cos.COSString;
35 import org.apache.pdfbox.exceptions.CryptographyException;
3436 import org.apache.pdfbox.io.RandomAccess;
3537 import org.apache.pdfbox.io.RandomAccessBuffer;
3638 import org.apache.pdfbox.io.RandomAccessFile;
3739 import org.apache.pdfbox.pdmodel.PDDocument;
3840 import org.apache.pdfbox.pdmodel.PDDocumentInformation;
41 import org.apache.pdfbox.pdmodel.encryption.AccessPermission;
42 import org.apache.pdfbox.pdmodel.font.PDFont;
43 import org.apache.tika.exception.EncryptedDocumentException;
3944 import org.apache.tika.exception.TikaException;
4045 import org.apache.tika.extractor.EmbeddedDocumentExtractor;
4146 import org.apache.tika.io.CloseShieldInputStream;
4247 import org.apache.tika.io.TemporaryResources;
4348 import org.apache.tika.io.TikaInputStream;
49 import org.apache.tika.metadata.AccessPermissions;
4450 import org.apache.tika.metadata.Metadata;
4551 import org.apache.tika.metadata.PagedText;
4652 import org.apache.tika.metadata.Property;
103109 TemporaryResources tmp = new TemporaryResources();
104110 //config from context, or default if not set via context
105111 PDFParserConfig localConfig = context.get(PDFParserConfig.class, defaultConfig);
112 String password = "";
106113 try {
107114 // PDFBox can process entirely in memory, or can use a temp file
108115 // for unpacked / processed resources
109116 // Decide which to do based on if we're reading from a file or not already
110117 TikaInputStream tstream = TikaInputStream.cast(stream);
118 password = getPassword(metadata, context);
111119 if (tstream != null && tstream.hasFile()) {
112120 // File based, take that as a cue to use a temporary file
113121 RandomAccess scratchFile = new RandomAccessFile(tmp.createTemporaryFile(), "rw");
114122 if (localConfig.getUseNonSequentialParser() == true) {
115 pdfDocument = PDDocument.loadNonSeq(new CloseShieldInputStream(stream), scratchFile);
123 pdfDocument = PDDocument.loadNonSeq(new CloseShieldInputStream(stream), scratchFile, password);
116124 } else {
117125 pdfDocument = PDDocument.load(new CloseShieldInputStream(stream), scratchFile, true);
118126 }
119127 } else {
120128 // Go for the normal, stream based in-memory parsing
121129 if (localConfig.getUseNonSequentialParser() == true) {
122 pdfDocument = PDDocument.loadNonSeq(new CloseShieldInputStream(stream), new RandomAccessBuffer());
130 pdfDocument = PDDocument.loadNonSeq(new CloseShieldInputStream(stream), new RandomAccessBuffer(), password);
123131 } else {
124132 pdfDocument = PDDocument.load(new CloseShieldInputStream(stream), true);
125133 }
126134 }
127
128
129 if (pdfDocument.isEncrypted()) {
130 String password = null;
131
132 // Did they supply a new style Password Provider?
133 PasswordProvider passwordProvider = context.get(PasswordProvider.class);
134 if (passwordProvider != null) {
135 password = passwordProvider.getPassword(metadata);
136 }
137
138 // Fall back on the old style metadata if set
139 if (password == null && metadata.get(PASSWORD) != null) {
140 password = metadata.get(PASSWORD);
141 }
142
143 // If no password is given, use an empty string as the default
144 if (password == null) {
145 password = "";
146 }
147
148 try {
149 pdfDocument.decrypt(password);
150 } catch (Exception e) {
151 // Ignore
152 }
153 }
135 metadata.set("pdf:encrypted", Boolean.toString(pdfDocument.isEncrypted()));
136
137 //if using the classic parser and the doc is encrypted, we must manually decrypt
138 if (! localConfig.getUseNonSequentialParser() && pdfDocument.isEncrypted()) {
139 pdfDocument.decrypt(password);
140 }
141
154142 metadata.set(Metadata.CONTENT_TYPE, "application/pdf");
155143 extractMetadata(pdfDocument, metadata);
144
145 AccessChecker checker = localConfig.getAccessChecker();
146 checker.check(metadata);
156147 if (handler != null) {
157148 PDF2XHTML.process(pdfDocument, handler, context, metadata, localConfig);
158149 }
159150
151 } catch (CryptographyException e) {
152 //seq parser throws CryptographyException for bad password
153 throw new EncryptedDocumentException(e);
154 } catch (IOException e) {
155 //nonseq parser throws IOException for bad password
156 //At the Tika level, we want the same exception to be thrown
157 if (e.getMessage() != null && e.getMessage().contains("Error (CryptographyException)")) {
158 metadata.set("pdf:encrypted", Boolean.toString(true));
159 throw new EncryptedDocumentException(e);
160 }
161 //rethrow any other IOExceptions
162 throw e;
160163 } finally {
161164 if (pdfDocument != null) {
162165 pdfDocument.close();
163166 }
164167 tmp.dispose();
165 }
166 }
167
168
168 //TODO: once we migrate to PDFBox 2.0, remove this (PDFBOX-2200)
169 PDFont.clearResources();
170 }
171 }
172
173 private String getPassword(Metadata metadata, ParseContext context) {
174 String password = null;
175
176 // Did they supply a new style Password Provider?
177 PasswordProvider passwordProvider = context.get(PasswordProvider.class);
178 if (passwordProvider != null) {
179 password = passwordProvider.getPassword(metadata);
180 }
181
182 // Fall back on the old style metadata if set
183 if (password == null && metadata.get(PASSWORD) != null) {
184 password = metadata.get(PASSWORD);
185 }
186
187 // If no password is given, use an empty string as the default
188 if (password == null) {
189 password = "";
190 }
191 return password;
192 }
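The getPassword() helper above resolves the password in a fixed order: a new-style PasswordProvider from the ParseContext wins, then the legacy metadata entry, then an empty string as the default. A standalone sketch of that fallback chain (PasswordLookup and the "password" key here are illustrative stand-ins, not the Tika API):

```java
import java.util.Map;
import java.util.function.Function;

// Standalone sketch of the lookup order used by getPassword() above:
// 1) a caller-supplied provider, 2) a legacy metadata entry, 3) "" as default.
// PasswordLookup and the "password" key are hypothetical, not Tika classes.
public class PasswordLookup {
    public static String resolve(Function<Map<String, String>, String> provider,
                                 Map<String, String> metadata) {
        String password = null;
        if (provider != null) {
            password = provider.apply(metadata);   // new-style provider wins
        }
        if (password == null) {
            password = metadata.get("password");   // old-style metadata fallback
        }
        if (password == null) {
            password = "";                         // empty string as last resort
        }
        return password;
    }
}
```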
193
169194
170195 private void extractMetadata(PDDocument document, Metadata metadata)
171196 throws TikaException {
172197
198 //first extract AccessPermissions
199 AccessPermission ap = document.getCurrentAccessPermission();
200 metadata.set(AccessPermissions.EXTRACT_FOR_ACCESSIBILITY,
201 Boolean.toString(ap.canExtractForAccessibility()));
202 metadata.set(AccessPermissions.EXTRACT_CONTENT,
203 Boolean.toString(ap.canExtractContent()));
204 metadata.set(AccessPermissions.ASSEMBLE_DOCUMENT,
205 Boolean.toString(ap.canAssembleDocument()));
206 metadata.set(AccessPermissions.FILL_IN_FORM,
207 Boolean.toString(ap.canFillInForm()));
208 metadata.set(AccessPermissions.CAN_MODIFY,
209 Boolean.toString(ap.canModify()));
210 metadata.set(AccessPermissions.CAN_MODIFY_ANNOTATIONS,
211 Boolean.toString(ap.canModifyAnnotations()));
212 metadata.set(AccessPermissions.CAN_PRINT,
213 Boolean.toString(ap.canPrint()));
214 metadata.set(AccessPermissions.CAN_PRINT_DEGRADED,
215 Boolean.toString(ap.canPrintDegraded()));
216
217
218
219 //now go for the XMP stuff
173220 org.apache.jempbox.xmp.XMPMetadata xmp = null;
174221 XMPSchemaDublinCore dcSchema = null;
175222 try{
203250 // Invalid date format, just ignore
204251 }
205252 try {
206 Calendar modified = info.getModificationDate();
253 Calendar modified = info.getModificationDate();
207254 addMetadata(metadata, Metadata.LAST_MODIFIED, modified);
208255 addMetadata(metadata, TikaCoreProperties.MODIFIED, modified);
209256 } catch (IOException e) {
213260 // All remaining metadata is custom
214261 // Copy this over as-is
215262 List<String> handledMetadata = Arrays.asList("Author", "Creator", "CreationDate", "ModDate",
216 "Keywords", "Producer", "Subject", "Title", "Trapped");
263 "Keywords", "Producer", "Subject", "Title", "Trapped");
217264 for(COSName key : info.getDictionary().keySet()) {
218265 String name = key.getName();
219266 if(! handledMetadata.contains(name)) {
220267 addMetadata(metadata, name, info.getDictionary().getDictionaryObject(key));
221268 }
222269 }
223 metadata.set("pdf:encrypted", Boolean.toString(document.isEncrypted()));
224270
225271 //try to get the various versions
226272 //Caveats:
240286 metadata.set("pdfaid:part", Integer.toString(pdfaxmp.getPart()));
241287 if (pdfaxmp.getConformance() != null) {
242288 metadata.set("pdfaid:conformance", pdfaxmp.getConformance());
243 String version = "A-"+pdfaxmp.getPart()+pdfaxmp.getConformance().toLowerCase();
289 String version = "A-"+pdfaxmp.getPart()+pdfaxmp.getConformance().toLowerCase(Locale.ROOT);
244290 metadata.set("pdfa:PDFVersion", version );
245291 metadata.add(TikaCoreProperties.FORMAT.getName(),
246292 MEDIA_TYPE.toString()+"; version=\""+version+"\"" );
249295 // TODO WARN if this XMP version is inconsistent with document header version?
250296 }
251297 } catch (IOException e) {
252 metadata.set("pdf:metadata-xmp-parse-failed", ""+e);
298 metadata.set(TikaCoreProperties.TIKA_META_PREFIX+"pdf:metadata-xmp-parse-failed", ""+e);
253299 }
254300 //TODO: Let's try to move this into PDFBox.
255301 //Attempt to determine Adobe extension level, if present:
426472 }
427473 } else if(value instanceof COSString) {
428474 addMetadata(metadata, name, ((COSString)value).getString());
429 } else if (value != null) {
475 }
476 // Avoid calling COSDictionary#toString, since it can lead to infinite
477 // recursion. See TIKA-1038 and PDFBOX-1835.
478 else if (value != null && !(value instanceof COSDictionary)) {
430479 addMetadata(metadata, name, value.toString());
431480 }
432481 }
1313 * distributed under the License is distributed on an "AS IS" BASIS,
1414 * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
1515 * See the License for the specific language governing permissions and
16 * limitations under the License.
17 */
18
19 import java.io.IOException;
20 import java.io.InputStream;
21 import java.io.Serializable;
22 import java.util.Properties;
23
24 import org.apache.pdfbox.util.PDFTextStripper;
25
26 /**
27 * Config for PDFParser.
28 *
16 * limitations under the License.
17 */
18
19 import org.apache.pdfbox.util.PDFTextStripper;
20
21 import java.io.IOException;
22 import java.io.InputStream;
23 import java.io.Serializable;
24 import java.util.Locale;
25 import java.util.Properties;
26
27 /**
28 * Config for PDFParser.
29 *
2930 * This allows parameters to be set programmatically:
3031 * <ol>
3132 * <li>Calls to PDFParser, i.e. parser.getPDFParserConfig().setEnableAutoSpace() (as before)</li>
7576 //The character width-based tolerance value used to estimate where spaces in text should be added
7677 private Float averageCharTolerance;
7778
78 //The space width-based tolerance value used to estimate where spaces in text should be added
79 private Float spacingTolerance;
80
81 public PDFParserConfig() {
82 init(this.getClass().getResourceAsStream("PDFParser.properties"));
83 }
79 //The space width-based tolerance value used to estimate where spaces in text should be added
80 private Float spacingTolerance;
81
82 private AccessChecker accessChecker;
83
84 public PDFParserConfig() {
85 init(this.getClass().getResourceAsStream("PDFParser.properties"));
86 }
8487
8588 /**
8689 * Loads properties from InputStream and then tries to close InputStream.
132135 setExtractInlineImages(
133136 getProp(props.getProperty("extractInlineImages"),
134137 getExtractInlineImages()));
135 setExtractUniqueInlineImagesOnly(
136 getProp(props.getProperty("extractUniqueInlineImagesOnly"),
137 getExtractUniqueInlineImagesOnly()));
138 }
139
140 /**
141 * Configures the given pdf2XHTML.
138 setExtractUniqueInlineImagesOnly(
139 getProp(props.getProperty("extractUniqueInlineImagesOnly"),
140 getExtractUniqueInlineImagesOnly()));
141
142 boolean checkExtractAccessPermission = getProp(props.getProperty("checkExtractAccessPermission"), false);
143 boolean allowExtractionForAccessibility = getProp(props.getProperty("allowExtractionForAccessibility"), true);
144
145 if (checkExtractAccessPermission == false) {
146 //silently ignore the contradictory configuration of checkExtractAccessPermission = false
147 //combined with allowExtractionForAccessibility = false
148 accessChecker = new AccessChecker();
149 } else {
150 accessChecker = new AccessChecker(allowExtractionForAccessibility);
151 }
152 }
153
154 /**
155 * Configures the given pdf2XHTML.
142156 *
143157 * @param pdf2XHTML
144158 */
327341
328342 /**
329343 * See {@link PDFTextStripper#setSpacingTolerance(float)}
330 */
331 public void setSpacingTolerance(Float spacingTolerance) {
332 this.spacingTolerance = spacingTolerance;
333 }
334
335 private boolean getProp(String p, boolean defaultMissing){
344 */
345 public void setSpacingTolerance(Float spacingTolerance) {
346 this.spacingTolerance = spacingTolerance;
347 }
348
349 public void setAccessChecker(AccessChecker accessChecker) {
350 this.accessChecker = accessChecker;
351 }
352
353 public AccessChecker getAccessChecker() {
354 return accessChecker;
355 }
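The AccessChecker wired in above gates text extraction on the PDF's permission bits: extraction is allowed if the document permits content extraction outright, or if it permits extraction for accessibility and the checker was configured leniently. A minimal standalone sketch of that policy (the names below are hypothetical, not the org.apache.tika.parser.pdf API):

```java
// Minimal sketch of the policy selected by the AccessChecker configuration
// above. ExtractionPolicy is an illustrative stand-in, not a Tika class.
public class ExtractionPolicy {
    private final boolean allowExtractionForAccessibility;

    public ExtractionPolicy(boolean allowExtractionForAccessibility) {
        this.allowExtractionForAccessibility = allowExtractionForAccessibility;
    }

    // Mirrors the two PDF permission bits the parser records in metadata.
    public boolean permits(boolean canExtractContent, boolean canExtractForAccessibility) {
        if (canExtractContent) {
            return true; // document allows extraction for everyone
        }
        return allowExtractionForAccessibility && canExtractForAccessibility;
    }
}
```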
356
357 private boolean getProp(String p, boolean defaultMissing){
336358 if (p == null){
337359 return defaultMissing;
338360 }
339 if (p.toLowerCase().equals("true")) {
361 if (p.toLowerCase(Locale.ROOT).equals("true")) {
340362 return true;
341 } else if (p.toLowerCase().equals("false")) {
363 } else if (p.toLowerCase(Locale.ROOT).equals("false")) {
342364 return false;
343365 } else {
344366 return defaultMissing;
1515 */
1616 package org.apache.tika.parser.pkg;
1717
18 import static org.apache.tika.metadata.HttpHeaders.CONTENT_TYPE;
19
1820 import java.io.BufferedInputStream;
1921 import java.io.IOException;
2022 import java.io.InputStream;
23 import java.util.Date;
2124 import java.util.Set;
2225
2326 import org.apache.commons.compress.archivers.ArchiveEntry;
3134 import org.apache.commons.compress.archivers.jar.JarArchiveInputStream;
3235 import org.apache.commons.compress.archivers.sevenz.SevenZFile;
3336 import org.apache.commons.compress.archivers.tar.TarArchiveInputStream;
37 import org.apache.commons.compress.archivers.zip.UnsupportedZipFeatureException;
38 import org.apache.commons.compress.archivers.zip.UnsupportedZipFeatureException.Feature;
3439 import org.apache.commons.compress.archivers.zip.ZipArchiveInputStream;
40 import org.apache.tika.exception.EncryptedDocumentException;
3541 import org.apache.tika.exception.TikaException;
3642 import org.apache.tika.extractor.EmbeddedDocumentExtractor;
3743 import org.apache.tika.extractor.ParsingEmbeddedDocumentExtractor;
3945 import org.apache.tika.io.TemporaryResources;
4046 import org.apache.tika.io.TikaInputStream;
4147 import org.apache.tika.metadata.Metadata;
48 import org.apache.tika.metadata.TikaCoreProperties;
4249 import org.apache.tika.mime.MediaType;
4350 import org.apache.tika.parser.AbstractParser;
4451 import org.apache.tika.parser.ParseContext;
52 import org.apache.tika.parser.PasswordProvider;
4553 import org.apache.tika.sax.XHTMLContentHandler;
4654 import org.xml.sax.ContentHandler;
4755 import org.xml.sax.SAXException;
4856 import org.xml.sax.helpers.AttributesImpl;
49
50 import static org.apache.tika.metadata.HttpHeaders.CONTENT_TYPE;
5157
5258 /**
5359 * Parser for various packaging formats. Package entries will be written to
5460 * the XHTML event stream as &lt;div class="package-entry"&gt; elements that
5561 * contain the (optional) entry name as a &lt;h1&gt; element and the full
5662 * structured body content of the parsed entry.
63 * <p>
64 * The user must have the JCE Unlimited Strength jars installed for encryption to
65 * work with 7Z files (see: COMPRESS-299 and TIKA-1521). If the jars
66 * are not installed, an IOException will be thrown, and potentially
67 * wrapped in a TikaException.
5768 */
5869 public class PackageParser extends AbstractParser {
5970
6677 private static final MediaType CPIO = MediaType.application("x-cpio");
6778 private static final MediaType DUMP = MediaType.application("x-tika-unix-dump");
6879 private static final MediaType TAR = MediaType.application("x-tar");
69 // Enable this when COMPRESS-267 is fixed, see TIKA-1243
7080 private static final MediaType SEVENZ = MediaType.application("x-7z-compressed");
7181
7282 private static final Set<MediaType> SUPPORTED_TYPES =
104114 InputStream stream, ContentHandler handler,
105115 Metadata metadata, ParseContext context)
106116 throws IOException, SAXException, TikaException {
107 // At the end we want to close the archive stream to release
108 // any associated resources, but the underlying document stream
109 // should not be closed
110 stream = new CloseShieldInputStream(stream);
111
117
112118 // Ensure that the stream supports the mark feature
113 if (! TikaInputStream.isTikaInputStream(stream)) {
119 if (! TikaInputStream.isTikaInputStream(stream))
114120 stream = new BufferedInputStream(stream);
115 }
116
117 ArchiveInputStream ais;
121
122
123 TemporaryResources tmp = new TemporaryResources();
124 ArchiveInputStream ais = null;
118125 try {
119 ArchiveStreamFactory factory = context.get(
120 ArchiveStreamFactory.class, new ArchiveStreamFactory());
121 ais = factory.createArchiveInputStream(stream);
126 ArchiveStreamFactory factory = context.get(ArchiveStreamFactory.class, new ArchiveStreamFactory());
127 // At the end we want to close the archive stream to release
128 // any associated resources, but the underlying document stream
129 // should not be closed
130 ais = factory.createArchiveInputStream(new CloseShieldInputStream(stream));
131
122132 } catch (StreamingNotSupportedException sne) {
123133 // Most archive formats work on streams, but a few need files
124134 if (sne.getFormat().equals(ArchiveStreamFactory.SEVEN_Z)) {
125135 // Rework as a file, and wrap
126136 stream.reset();
127 TikaInputStream tstream = TikaInputStream.get(stream);
137 TikaInputStream tstream = TikaInputStream.get(stream, tmp);
128138
129 // Pending a fix for COMPRESS_269, this bit is a little nasty
130 ais = new SevenZWrapper(new SevenZFile(tstream.getFile()));
139 // Seven Zip supports passwords, was one given?
140 String password = null;
141 PasswordProvider provider = context.get(PasswordProvider.class);
142 if (provider != null) {
143 password = provider.getPassword(metadata);
144 }
145
146 SevenZFile sevenz;
147 if (password == null) {
148 sevenz = new SevenZFile(tstream.getFile());
149 } else {
150 sevenz = new SevenZFile(tstream.getFile(), password.getBytes("UnicodeLittleUnmarked"));
151 }
152
153 // Pending a fix for COMPRESS-269 / TIKA-1525, this bit is a little nasty
154 ais = new SevenZWrapper(sevenz);
131155 } else {
156 tmp.close();
132157 throw new TikaException("Unknown non-streaming format " + sne.getFormat(), sne);
133158 }
134159 } catch (ArchiveException e) {
160 tmp.close();
135161 throw new TikaException("Unable to unpack document stream", e);
136162 }
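The 7z branch above hands the password to SevenZFile as bytes in the "UnicodeLittleUnmarked" charset, the JDK's historical name for UTF-16LE without a byte-order mark, so each ASCII character becomes two bytes, low byte first, with no BOM prefix. A quick illustration of that encoding (the wrapper class name is hypothetical):

```java
import java.io.UnsupportedEncodingException;

// "UnicodeLittleUnmarked" is the JDK alias for UTF-16LE with no BOM:
// every ASCII character encodes as its byte followed by a zero byte.
public class SevenZPasswordBytes {
    public static byte[] encode(String password) throws UnsupportedEncodingException {
        return password.getBytes("UnicodeLittleUnmarked");
    }
}
```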
137163
139165 if (!type.equals(MediaType.OCTET_STREAM)) {
140166 metadata.set(CONTENT_TYPE, type.toString());
141167 }
142
143168 // Use the delegate parser to parse the contained document
144169 EmbeddedDocumentExtractor extractor = context.get(
145170 EmbeddedDocumentExtractor.class,
156181 }
157182 entry = ais.getNextEntry();
158183 }
184 } catch (UnsupportedZipFeatureException zfe) {
185 // If it's an encrypted document of unknown password, report as such
186 if (zfe.getFeature() == Feature.ENCRYPTION) {
187 throw new EncryptedDocumentException(zfe);
188 }
189 // Otherwise fall through to raise the exception as normal
190 } catch (IOException ie) {
191 // Is this a password protection error?
192 // (COMPRESS-298 should give a nicer way when implemented, see TIKA-1525)
193 if ("Cannot read encrypted files without a password".equals(ie.getMessage())) {
194 throw new EncryptedDocumentException();
195 }
196 // Otherwise fall through to raise the exception as normal
197 throw ie;
159198 } finally {
160199 ais.close();
200 tmp.close();
161201 }
162202
163203 xhtml.endDocument();
169209 throws SAXException, IOException, TikaException {
170210 String name = entry.getName();
171211 if (archive.canReadEntryData(entry)) {
172 Metadata entrydata = new Metadata();
173 if (name != null && name.length() > 0) {
174 entrydata.set(Metadata.RESOURCE_NAME_KEY, name);
175 AttributesImpl attributes = new AttributesImpl();
176 attributes.addAttribute("", "class", "class", "CDATA", "embedded");
177 attributes.addAttribute("", "id", "id", "CDATA", name);
178 xhtml.startElement("div", attributes);
179 xhtml.endElement("div");
180
181 entrydata.set(Metadata.EMBEDDED_RELATIONSHIP_ID, name);
182 }
212 // Fetch the metadata on the entry contained in the archive
213 Metadata entrydata = handleEntryMetadata(name, null,
214 entry.getLastModifiedDate(), entry.getSize(), xhtml);
215
216 // Recurse into the entry if desired
183217 if (extractor.shouldParseEmbedded(entrydata)) {
184218 // For detectors to work, we need a mark/reset supporting
185219 // InputStream, which ArchiveInputStream isn't, so wrap
195229 xhtml.element("p", name);
196230 }
197231 }
232
233 protected static Metadata handleEntryMetadata(
234 String name, Date createAt, Date modifiedAt,
235 Long size, XHTMLContentHandler xhtml)
236 throws SAXException, IOException, TikaException {
237 Metadata entrydata = new Metadata();
238 if (createAt != null) {
239 entrydata.set(TikaCoreProperties.CREATED, createAt);
240 }
241 if (modifiedAt != null) {
242 entrydata.set(TikaCoreProperties.MODIFIED, modifiedAt);
243 }
244 if (size != null) {
245 entrydata.set(Metadata.CONTENT_LENGTH, Long.toString(size));
246 }
247 if (name != null && name.length() > 0) {
248 name = name.replace("\\", "/");
249 entrydata.set(Metadata.RESOURCE_NAME_KEY, name);
250 AttributesImpl attributes = new AttributesImpl();
251 attributes.addAttribute("", "class", "class", "CDATA", "embedded");
252 attributes.addAttribute("", "id", "id", "CDATA", name);
253 xhtml.startElement("div", attributes);
254 xhtml.endElement("div");
255
256 entrydata.set(Metadata.EMBEDDED_RELATIONSHIP_ID, name);
257 }
258 return entrydata;
259 }
198260
199261 // Pending a fix for COMPRESS-269, we have to wrap ourselves
200262 private static class SevenZWrapper extends ArchiveInputStream {
220282 public ArchiveEntry getNextEntry() throws IOException {
221283 return file.getNextEntry();
222284 }
285
286 @Override
287 public void close() throws IOException {
288 file.close();
289 }
223290 }
224291 }
0 /*
1 * Licensed to the Apache Software Foundation (ASF) under one or more
2 * contributor license agreements. See the NOTICE file distributed with
3 * this work for additional information regarding copyright ownership.
4 * The ASF licenses this file to You under the Apache License, Version 2.0
5 * (the "License"); you may not use this file except in compliance with
6 * the License. You may obtain a copy of the License at
7 *
8 * http://www.apache.org/licenses/LICENSE-2.0
9 *
10 * Unless required by applicable law or agreed to in writing, software
11 * distributed under the License is distributed on an "AS IS" BASIS,
12 * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
13 * See the License for the specific language governing permissions and
14 * limitations under the License.
15 */
16 package org.apache.tika.parser.pkg;
17
18 import java.io.IOException;
19 import java.io.InputStream;
20 import java.util.Collections;
21 import java.util.Set;
22
23 import org.apache.tika.exception.EncryptedDocumentException;
24 import org.apache.tika.exception.TikaException;
25 import org.apache.tika.extractor.EmbeddedDocumentExtractor;
26 import org.apache.tika.extractor.ParsingEmbeddedDocumentExtractor;
27 import org.apache.tika.io.TemporaryResources;
28 import org.apache.tika.io.TikaInputStream;
29 import org.apache.tika.metadata.Metadata;
30 import org.apache.tika.mime.MediaType;
31 import org.apache.tika.parser.AbstractParser;
32 import org.apache.tika.parser.ParseContext;
33 import org.apache.tika.sax.XHTMLContentHandler;
34 import org.xml.sax.ContentHandler;
35 import org.xml.sax.SAXException;
36
37 import com.github.junrar.Archive;
38 import com.github.junrar.exception.RarException;
39 import com.github.junrar.rarfile.FileHeader;
40
41 /**
42 * Parser for Rar files.
43 */
44 public class RarParser extends AbstractParser {
45 private static final long serialVersionUID = 6157727985054451501L;
46
47 private static final Set<MediaType> SUPPORTED_TYPES = Collections
48 .singleton(MediaType.application("x-rar-compressed"));
49
50 @Override
51 public Set<MediaType> getSupportedTypes(ParseContext arg0) {
52 return SUPPORTED_TYPES;
53 }
54
55 @Override
56 public void parse(InputStream stream, ContentHandler handler,
57 Metadata metadata, ParseContext context) throws IOException,
58 SAXException, TikaException {
59
60 XHTMLContentHandler xhtml = new XHTMLContentHandler(handler, metadata);
61 xhtml.startDocument();
62
63 EmbeddedDocumentExtractor extractor = context.get(
64 EmbeddedDocumentExtractor.class,
65 new ParsingEmbeddedDocumentExtractor(context));
66
67 TemporaryResources tmp = new TemporaryResources();
68 Archive rar = null;
69 try {
70 TikaInputStream tis = TikaInputStream.get(stream, tmp);
71 rar = new Archive(tis.getFile());
72
73 if (rar.isEncrypted()) {
74 throw new EncryptedDocumentException();
75 }
76
77 //Without this, BodyContentHandler does not work
78 xhtml.element("div", " ");
79
80 FileHeader header = rar.nextFileHeader();
81 while (header != null && !Thread.currentThread().isInterrupted()) {
82 if (!header.isDirectory()) {
83 InputStream subFile = null;
84 try {
85 subFile = rar.getInputStream(header);
86
87 Metadata entrydata = PackageParser.handleEntryMetadata(
88 "".equals(header.getFileNameW())?header.getFileNameString():header.getFileNameW(),
89 header.getCTime(), header.getMTime(),
90 header.getFullUnpackSize(),
91 xhtml
92 );
93
94 if (extractor.shouldParseEmbedded(entrydata)) {
95 extractor.parseEmbedded(subFile, handler, entrydata, true);
96 }
97 } finally {
98 if (subFile != null)
99 subFile.close();
100 }
101 }
102
103 header = rar.nextFileHeader();
104 }
105
106 } catch (RarException e) {
107 throw new TikaException("RarParser Exception", e);
108 } finally {
109 if (rar != null)
110 rar.close();
111 tmp.close();
112 }
113
114 xhtml.endDocument();
115 }
116 }
2121 import java.util.Enumeration;
2222 import java.util.HashSet;
2323 import java.util.Iterator;
24 import java.util.Locale;
2425 import java.util.Set;
2526 import java.util.regex.Pattern;
2627
3334 import org.apache.commons.compress.compressors.CompressorException;
3435 import org.apache.commons.compress.compressors.CompressorInputStream;
3536 import org.apache.commons.compress.compressors.CompressorStreamFactory;
36 import org.apache.poi.extractor.ExtractorFactory;
3737 import org.apache.poi.openxml4j.exceptions.InvalidFormatException;
3838 import org.apache.poi.openxml4j.opc.OPCPackage;
3939 import org.apache.poi.openxml4j.opc.PackageAccess;
4040 import org.apache.poi.openxml4j.opc.PackagePart;
4141 import org.apache.poi.openxml4j.opc.PackageRelationshipCollection;
42 import org.apache.poi.openxml4j.opc.PackageRelationshipTypes;
4243 import org.apache.tika.detect.Detector;
4344 import org.apache.tika.exception.TikaException;
4445 import org.apache.tika.io.IOUtils;
5657 public class ZipContainerDetector implements Detector {
5758 private static final Pattern MACRO_TEMPLATE_PATTERN = Pattern.compile("macroenabledtemplate$", Pattern.CASE_INSENSITIVE);
5859
60 // TODO Remove this constant once we upgrade to POI 3.12 beta 2, then use PackageRelationshipTypes
61 private static final String VISIO_DOCUMENT =
62 "http://schemas.microsoft.com/visio/2010/relationships/document";
63 // TODO Remove this constant once we upgrade to POI 3.12 beta 2, then use PackageRelationshipTypes
64 private static final String STRICT_CORE_DOCUMENT =
65 "http://purl.oclc.org/ooxml/officeDocument/relationships/officeDocument";
66
5967 /** Serial version UID */
6068 private static final long serialVersionUID = 2891763938430295453L;
6169
179187 if (mimetype != null) {
180188 InputStream stream = zip.getInputStream(mimetype);
181189 try {
182 return MediaType.parse(IOUtils.toString(stream, "UTF-8"));
190 return MediaType.parse(IOUtils.toString(stream, IOUtils.UTF_8.name()));
183191 } finally {
184192 stream.close();
185193 }
229237 * opened Package
230238 */
231239 public static MediaType detectOfficeOpenXML(OPCPackage pkg) {
240 // Check for the normal Office core document
232241 PackageRelationshipCollection core =
233 pkg.getRelationshipsByType(ExtractorFactory.CORE_DOCUMENT_REL);
242 pkg.getRelationshipsByType(PackageRelationshipTypes.CORE_DOCUMENT);
243 // Otherwise check for some other Office core document types
244 if (core.size() == 0) {
245 core = pkg.getRelationshipsByType(STRICT_CORE_DOCUMENT);
246 }
247 if (core.size() == 0) {
248 core = pkg.getRelationshipsByType(VISIO_DOCUMENT);
249 }
250
251 // If we didn't find a single core document of any type, skip detection
234252 if (core.size() != 1) {
235253 // Invalid OOXML Package received
236254 return null;
244262 String docType = coreType.substring(0, coreType.lastIndexOf('.'));
245263
246264 // The Macro Enabled formats are a little special
247 if(docType.toLowerCase().endsWith("macroenabled")) {
248 docType = docType.toLowerCase() + ".12";
249 }
250
251 if(docType.toLowerCase().endsWith("macroenabledtemplate")) {
265 if(docType.toLowerCase(Locale.ROOT).endsWith("macroenabled")) {
266 docType = docType.toLowerCase(Locale.ROOT) + ".12";
267 }
268
269 if(docType.toLowerCase(Locale.ROOT).endsWith("macroenabledtemplate")) {
252270 docType = MACRO_TEMPLATE_PATTERN.matcher(docType).replaceAll("macroenabled.12");
253271 }
254272
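Several hunks in this diff replace toLowerCase() with toLowerCase(Locale.ROOT). The no-argument overload uses the JVM's default locale, and under a Turkish default locale an uppercase 'I' lowercases to dotless 'ı' (U+0131), so comparisons such as equals("true") or the docType checks above can silently fail. A small demonstration:

```java
import java.util.Locale;

// Demonstrates why toLowerCase(Locale.ROOT) is safer for protocol-style
// string comparisons: Turkish case mapping turns 'I' into dotless 'ı'.
public class LocaleSafeLower {
    public static String lower(String s) {
        return s.toLowerCase(Locale.ROOT); // locale-independent mapping
    }
}
```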
2323 import java.io.IOException;
2424 import java.io.InputStream;
2525 import java.io.UnsupportedEncodingException;
26 import java.util.Locale;
2627 import java.util.concurrent.atomic.AtomicInteger;
2728
2829 import org.apache.poi.poifs.filesystem.DirectoryNode;
101102 //readBytes tests for reading too many bytes
102103 byte[] embObjBytes = readBytes(is, dataSz);
103104
104 if (className.toLowerCase().equals("package")){
105 if (className.toLowerCase(Locale.ROOT).equals("package")){
105106 return handlePackage(embObjBytes, metadata);
106 } else if (className.toLowerCase().equals("pbrush")) {
107 } else if (className.toLowerCase(Locale.ROOT).equals("pbrush")) {
107108 //simple bitmap bytes
108109 return embObjBytes;
109110 } else {
3535 */
3636 public class RTFParser extends AbstractParser {
3737
38 /** Serial version UID */
38 /**
39 * Serial version UID
40 */
3941 private static final long serialVersionUID = -4165069489372320313L;
4042
4143 private static final Set<MediaType> SUPPORTED_TYPES =
4547 return SUPPORTED_TYPES;
4648 }
4749
48 /** maximum number of bytes per embedded object/pict (default: 20MB)*/
49 private static int EMB_OBJ_MAX_BYTES = 20*1024*1024; //20MB
50 /**
51 * maximum number of bytes per embedded object/pict (default: 20MB)
52 */
53 private static int EMB_OBJ_MAX_BYTES = 20 * 1024 * 1024; //20MB
5054
5155 /**
52 * Bytes for embedded objects are currently cached in memory.
53 * If something goes wrong during the parsing of an embedded object,
54 * it is possible that a read length may be crazily too long
56 * Bytes for embedded objects are currently cached in memory.
57 * If something goes wrong during the parsing of an embedded object,
58 * it is possible that a read length may be crazily too long
5559 * and cause a heap crash.
56 *
57 * @param max maximum number of bytes to allow for embedded objects. If
58 * the embedded object has more than this number of bytes, skip it.
60 *
61 * @param max maximum number of bytes to allow for embedded objects. If
62 * the embedded object has more than this number of bytes, skip it.
5963 */
6064 public static void setMaxBytesForEmbeddedObject(int max) {
6165 EMB_OBJ_MAX_BYTES = max;
6266 }
63
67
6468 /**
6569 * See {@link #setMaxBytesForEmbeddedObject(int)}.
66 *
70 *
6771 * @return maximum number of bytes allowed for an embedded object.
68 *
6972 */
7073 public static int getMaxBytesForEmbeddedObject() {
7174 return EMB_OBJ_MAX_BYTES;
7780 throws IOException, SAXException, TikaException {
7881 TaggedInputStream tagged = new TaggedInputStream(stream);
7982 try {
80 RTFEmbObjHandler embObjHandler = new RTFEmbObjHandler(handler,
81 metadata, context);
82 final TextExtractor ert =
83 new TextExtractor(new XHTMLContentHandler(handler,
84 metadata), metadata, embObjHandler);
83 XHTMLContentHandler xhtmlHandler = new XHTMLContentHandler(handler, metadata);
84 RTFEmbObjHandler embObjHandler = new RTFEmbObjHandler(xhtmlHandler, metadata, context);
85 final TextExtractor ert = new TextExtractor(xhtmlHandler, metadata, embObjHandler);
8586 ert.extract(stream);
8687 metadata.add(Metadata.CONTENT_TYPE, "application/rtf");
8788 } catch (IOException e) {
2828 import java.util.Calendar;
2929 import java.util.HashMap;
3030 import java.util.LinkedList;
31 import java.util.Locale;
3132 import java.util.Map;
33 import java.util.TimeZone;
3234
3335 import org.apache.tika.exception.TikaException;
3436 import org.apache.tika.metadata.Metadata;
619621
620622 private void endParagraph(boolean preserveStyles) throws IOException, SAXException, TikaException {
621623 pushText();
624 //maintain consecutive new lines
625 if (!inParagraph) {
626 lazyStartParagraph();
627 }
622628 if (inParagraph) {
623629 if (groupState.italic) {
624630 end("i");
13381344 if (inHeader) {
13391345 if (nextMetaData != null) {
13401346 if (nextMetaData == TikaCoreProperties.CREATED) {
1341 Calendar cal = Calendar.getInstance();
1347 Calendar cal = Calendar.getInstance(TimeZone.getDefault(), Locale.ROOT);
13421348 cal.set(year, month-1, day, hour, minute, 0);
13431349 metadata.set(nextMetaData, cal.getTime());
13441350 } else if (nextMetaData.isMultiValuePermitted()) {
0 /*
1 * Licensed under the Apache License, Version 2.0 (the "License");
2 * you may not use this file except in compliance with the License.
3 * You may obtain a copy of the License at
4 *
5 * http://www.apache.org/licenses/LICENSE-2.0
6 *
7 * Unless required by applicable law or agreed to in writing, software
8 * distributed under the License is distributed on an "AS IS" BASIS,
9 * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
10 * See the License for the specific language governing permissions and
11 * limitations under the License.
12 */
13 package org.apache.tika.parser.strings;
14
15 import java.io.Serializable;
16
17 /**
18 * Configuration for the "file" (or file-alternative) command.
19 *
20 */
21 public class FileConfig implements Serializable {
22 /**
23 * Serial version UID
24 */
25 private static final long serialVersionUID = 5712655467296441314L;
26
27 private String filePath = "";
28
29 private boolean mimetype = false;
30
31 /**
32 * Default constructor.
33 */
34 public FileConfig() {
35 // TODO Loads properties from InputStream.
36 }
37
38 /**
39 * Returns the "file" installation folder.
40 *
41 * @return the "file" installation folder.
42 */
43 public String getFilePath() {
44 return filePath;
45 }
46
47 /**
48 * Sets the "file" installation folder.
49 *
50 * @param filePath
51 * the "file" installation folder.
52 */
53 public void setFilePath(String filePath) {
54 this.filePath = filePath;
55 }
56
57 /**
58 * Returns {@code true} if the mime option is enabled.
59 *
60 * @return {@code true} if the mime option is enabled, {@code false} otherwise.
61 */
62 public boolean isMimetype() {
63 return mimetype;
64 }
65
66 /**
67 * Sets the mime option. If {@code true}, it causes the file command to
68 * output mime type strings rather than the more traditional human readable
69 * ones.
70 *
71 * @param mimetype
72 */
73 public void setMimetype(boolean mimetype) {
74 this.mimetype = mimetype;
75 }
76 }
0 /*
1 * Licensed under the Apache License, Version 2.0 (the "License");
2 * you may not use this file except in compliance with the License.
3 * You may obtain a copy of the License at
4 *
5 * http://www.apache.org/licenses/LICENSE-2.0
6 *
7 * Unless required by applicable law or agreed to in writing, software
8 * distributed under the License is distributed on an "AS IS" BASIS,
9 * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
10 * See the License for the specific language governing permissions and
11 * limitations under the License.
12 */
13 package org.apache.tika.parser.strings;
14
15 import java.io.IOException;
16 import java.io.InputStream;
17 import java.io.UnsupportedEncodingException;
18 import java.util.HashSet;
19 import java.util.Set;
20
21 import org.apache.tika.metadata.Metadata;
22 import org.apache.tika.mime.MediaType;
23 import org.apache.tika.parser.AbstractParser;
24 import org.apache.tika.parser.ParseContext;
25 import org.apache.tika.sax.XHTMLContentHandler;
26 import org.xml.sax.ContentHandler;
27 import org.xml.sax.SAXException;
28
29 /**
30 * Parser to extract printable Latin1 strings from arbitrary files with pure
31 * Java. Useful as a fallback parser for binary or unknown files, for files
32 * without a specific parser, and for corrupted files that cause a TikaException.
33 *
34 * The parser makes a best effort to extract Latin1 strings, as used by
35 * Western European languages, encoded with the ISO-8859-1, UTF-8 or UTF-16
36 * charsets, even when mixed within the same file.
37 *
38 * The implementation is optimized for fast parsing with only one pass.
39 */
40 public class Latin1StringsParser extends AbstractParser {
41
42 private static final long serialVersionUID = 1L;
43
44 /**
45 * The set of supported types
46 */
47 private static final Set<MediaType> SUPPORTED_TYPES = getTypes();
48
49 /**
50 * The valid ISO-8859-1 character map.
51 */
52 private static final boolean[] isChar = getCharMap();
53
54 /**
55 * The size of the internal buffers.
56 */
57 private static final int BUF_SIZE = 64 * 1024;
58
59 /**
60 * The minimum size of a character sequence to be extracted.
61 */
62 private int minSize = 4;
63
64 /**
65 * The output buffer.
66 */
67 private byte[] output = new byte[BUF_SIZE];
68
69 /**
70 * The input buffer.
71 */
72 private byte[] input = new byte[BUF_SIZE];
73
74 /**
75 * The temporary position into the output buffer.
76 */
77 private int tmpPos = 0;
78
79 /**
80 * The current position into the output buffer.
81 */
82 private int outPos = 0;
83
84 /**
85 * The number of bytes into the input buffer.
86 */
87 private int inSize = 0;
88
89 /**
90 * The position into the input buffer.
91 */
92 private int inPos = 0;
93
94 /**
95 * The output content handler.
96 */
97 private XHTMLContentHandler xhtml;
98
99 /**
100 * Returns the minimum size of a character sequence to be extracted.
101 *
102 * @return the minimum size of a character sequence
103 */
104 public int getMinSize() {
105 return minSize;
106 }
107
108 /**
109 * Sets the minimum size of a character sequence to be extracted.
110 *
111 * @param minSize
112 * the minimum size of a character sequence
113 */
114 public void setMinSize(int minSize) {
115 this.minSize = minSize;
116 }
117
118 /**
119 * Populates the valid ISO-8859-1 character map.
120 *
121 * @return the valid ISO-8859-1 character map.
122 */
123 private static boolean[] getCharMap() {
124
125 boolean[] isChar = new boolean[256];
126 for (int c = Byte.MIN_VALUE; c <= Byte.MAX_VALUE; c++)
127 if ((c >= 0x20 && c <= 0x7E)
128 || (c >= (byte) 0xC0 && c <= (byte) 0xFE) || c == 0x0A
129 || c == 0x0D || c == 0x09) {
130 isChar[c & 0xFF] = true;
131 }
132 return isChar;
133
134 }
135
136 /**
137 * Returns the set of supported types.
138 *
139 * @return the set of supported types
140 */
141 private static Set<MediaType> getTypes() {
142 HashSet<MediaType> supportedTypes = new HashSet<MediaType>();
143 supportedTypes.add(MediaType.OCTET_STREAM);
144 return supportedTypes;
145 }
146
147 /**
148 * Tests if the byte is an ISO-8859-1 char.
149 *
150 * @param c
151 * the byte to test.
152 *
153 * @return if the byte is a char.
154 */
155 private static final boolean isChar(byte c) {
156 return isChar[c & 0xFF];
157 }
158
159 /**
160 * Flushes the internal output buffer to the content handler.
161 *
162 * @throws UnsupportedEncodingException
163 * @throws SAXException
164 */
165 private void flushBuffer() throws UnsupportedEncodingException,
166 SAXException {
167 if (tmpPos - outPos >= minSize)
168 outPos = tmpPos - minSize;
169
170 xhtml.characters(new String(output, 0, outPos, "windows-1252"));
171
172 for (int k = 0; k < tmpPos - outPos; k++)
173 output[k] = output[outPos + k];
174 tmpPos = tmpPos - outPos;
175 outPos = 0;
176 }
177
178 @Override
179 public Set<MediaType> getSupportedTypes(ParseContext arg0) {
180 return SUPPORTED_TYPES;
181 }
182
183 /**
184 * @see org.apache.tika.parser.Parser#parse(java.io.InputStream,
185 * org.xml.sax.ContentHandler, org.apache.tika.metadata.Metadata,
186 * org.apache.tika.parser.ParseContext)
187 */
188 @Override
189 public void parse(InputStream stream, ContentHandler handler,
190 Metadata metadata, ParseContext context) throws IOException,
191 SAXException {
192 /*
193 * Creates a new instance because the object is not immutable.
194 */
195 new Latin1StringsParser().doParse(stream, handler, metadata, context);
196 }
197
198 /**
199 * Does a best effort to extract Latin1 strings encoded with ISO-8859-1,
200 * UTF-8 or UTF-16. Valid chars are saved into the output buffer and the
201 * temporary buffer position is incremented. When an invalid char is read,
202 * the difference of the temporary and current buffer position is checked.
203 * If it is greater than the minimum string size, the current buffer
204 * position is updated to the temp position. If it is not, the temp position
205 * is reset to the current position.
206 *
207 * @param stream
208 * the input stream.
209 * @param handler
210 * the output content handler
211 * @param metadata
212 * the metadata of the file
213 * @param context
214 * the parsing context
215 * @throws IOException
216 * if an io error occurs
217 * @throws SAXException
218 * if a sax error occurs
219 */
220 private void doParse(InputStream stream, ContentHandler handler,
221 Metadata metadata, ParseContext context) throws IOException,
222 SAXException {
223
224 tmpPos = 0;
225 outPos = 0;
226
227 xhtml = new XHTMLContentHandler(handler, metadata);
228 xhtml.startDocument();
229
230 int i = 0;
231 do {
232 inSize = 0;
233 while ((i = stream.read(input, inSize, BUF_SIZE - inSize)) > 0) {
234 inSize += i;
235 }
236 inPos = 0;
237 while (inPos < inSize) {
238 byte c = input[inPos++];
239 boolean utf8 = false;
240 /*
241 * Test for a possible UTF8 encoded char
242 */
243 if (c == (byte) 0xC3) {
244 byte c_ = inPos < inSize ? input[inPos++] : (byte) stream
245 .read();
246 /*
247 * Test if the next byte is in the valid UTF8 range
248 */
249 if (c_ >= (byte) 0x80 && c_ <= (byte) 0xBF) {
250 utf8 = true;
251 output[tmpPos++] = (byte) (c_ + 0x40);
252 } else {
253 output[tmpPos++] = c;
254 c = c_;
255 }
256 if (tmpPos == BUF_SIZE)
257 flushBuffer();
258
259 /*
260 * Test for a possible UTF8 encoded char
261 */
262 } else if (c == (byte) 0xC2) {
263 byte c_ = inPos < inSize ? input[inPos++] : (byte) stream
264 .read();
265 /*
266 * Test if the next byte is in the valid UTF8 range
267 */
268 if (c_ >= (byte) 0xA0 && c_ <= (byte) 0xBF) {
269 utf8 = true;
270 output[tmpPos++] = c_;
271 } else {
272 output[tmpPos++] = c;
273 c = c_;
274 }
275 if (tmpPos == BUF_SIZE)
276 flushBuffer();
277 }
278 if (!utf8)
279 /*
280 * Test if the byte is a valid char.
281 */
282 if (isChar(c)) {
283 output[tmpPos++] = c;
284 if (tmpPos == BUF_SIZE)
285 flushBuffer();
286 } else {
287 /*
288 * Test if the byte is an invalid char, marking a string
289 * end. If it is a zero, test 2 positions before or
290 * ahead for a valid char, meaning it marks the
291 * transition between ISO-8859-1 and UTF16 sequences.
292 */
293 if (c != 0
294 || (inPos >= 3 && isChar(input[inPos - 3]))
295 || (inPos + 1 < inSize && isChar(input[inPos + 1]))) {
296
297 if (tmpPos - outPos >= minSize) {
298 output[tmpPos++] = 0x0A;
299 outPos = tmpPos;
300
301 if (tmpPos == BUF_SIZE)
302 flushBuffer();
303 } else
304 tmpPos = outPos;
305
306 }
307 }
308 }
309 } while (i != -1 && !Thread.currentThread().isInterrupted());
310
311 if (tmpPos - outPos >= minSize) {
312 output[tmpPos++] = 0x0A;
313 outPos = tmpPos;
314 }
315 xhtml.characters(new String(output, 0, outPos, "windows-1252"));
316
317 xhtml.endDocument();
318
319 }
320
321 }
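The two-byte UTF-8 folding in Latin1StringsParser (the 0xC3 lead-byte branch, where the continuation byte plus 0x40 yields the Latin1 code point, later decoded as windows-1252) can be sketched standalone. This is a hypothetical demo class, not part of Tika; the class and method names are illustrative only:

```java
import java.nio.charset.Charset;
import java.nio.charset.StandardCharsets;

public class Latin1FoldDemo {
    // Mirrors the parser's handling of a 0xC3 lead byte: for a two-byte
    // UTF-8 sequence C3 xx (xx in 0x80..0xBF), the Latin1 byte is the
    // continuation byte plus 0x40.
    static byte foldC3(byte continuation) {
        return (byte) (continuation + 0x40);
    }

    public static void main(String[] args) {
        byte[] utf8 = "é".getBytes(StandardCharsets.UTF_8); // C3 A9
        byte latin1 = foldC3(utf8[1]);                       // E9
        // The parser emits its buffer as windows-1252, a superset of Latin1.
        String s = new String(new byte[] { latin1 }, Charset.forName("windows-1252"));
        System.out.println(s); // prints "é"
    }
}
```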
0 /*
1 * Licensed under the Apache License, Version 2.0 (the "License");
2 * you may not use this file except in compliance with the License.
3 * You may obtain a copy of the License at
4 *
5 * http://www.apache.org/licenses/LICENSE-2.0
6 *
7 * Unless required by applicable law or agreed to in writing, software
8 * distributed under the License is distributed on an "AS IS" BASIS,
9 * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
10 * See the License for the specific language governing permissions and
11 * limitations under the License.
12 */
13 package org.apache.tika.parser.strings;
14
15 import java.io.File;
16 import java.io.Serializable;
17 import java.util.Properties;
18 import java.io.InputStream;
19 import java.io.IOException;
20
21 /**
22 * Configuration for the "strings" (or strings-alternative) command.
23 *
24 */
25 public class StringsConfig implements Serializable {
26 /**
27 * Serial version UID
28 */
29 private static final long serialVersionUID = -1465227101645003594L;
30
31 private String stringsPath = "";
32
33 // Minimum sequence length (characters) to print
34 private int minLength = 4;
35
36 // Character encoding of the strings that are to be found
37 private StringsEncoding encoding = StringsEncoding.SINGLE_7_BIT;
38
39 // Maximum time (seconds) to wait for the strings process termination
40 private int timeout = 120;
41
42 /**
43 * Default constructor.
44 */
45 public StringsConfig() {
46 init(this.getClass().getResourceAsStream("Strings.properties"));
47 }
48
49 /**
50 * Loads properties from InputStream and then tries to close InputStream. If
51 * there is an IOException, this silently swallows the exception and goes
52 * back to the default.
53 *
54 * @param is
55 */
56 public StringsConfig(InputStream is) {
57 init(is);
58 }
59
60 /**
61 * Initializes attributes.
62 *
63 * @param is
64 */
65 private void init(InputStream is) {
66 if (is == null) {
67 return;
68 }
69 Properties props = new Properties();
70 try {
71 props.load(is);
72 } catch (IOException e) {
73 // swallow
74 } finally {
75 if (is != null) {
76 try {
77 is.close();
78 } catch (IOException e) {
79 // swallow
80 }
81 }
82 }
83
84 setStringsPath(props.getProperty("stringsPath", "" + getStringsPath()));
85
86 setMinLength(Integer.parseInt(props.getProperty("minLength", ""
87 + getMinLength())));
88
89 setEncoding(StringsEncoding.valueOf(props.getProperty("encoding", ""
90 + getEncoding().get())));
91
92 setTimeout(Integer.parseInt(props.getProperty("timeout", ""
93 + getTimeout())));
94 }
95
96 /**
97 * Returns the "strings" installation folder.
98 *
99 * @return the "strings" installation folder.
100 */
101 public String getStringsPath() {
102 return this.stringsPath;
103 }
104
105 /**
106 * Returns the minimum sequence length (characters) to print.
107 *
108 * @return the minimum sequence length (characters) to print.
109 */
110 public int getMinLength() {
111 return this.minLength;
112 }
113
114 /**
115 * Returns the character encoding of the strings that are to be found.
116 *
117 * @return {@link StringsEncoding} enum that represents the character
118 * encoding of the strings that are to be found.
119 */
120 public StringsEncoding getEncoding() {
121 return this.encoding;
122 }
123
124 /**
125 * Returns the maximum time (in seconds) to wait for the "strings" command
126 * to terminate.
127 *
128 * @return the maximum time (in seconds) to wait for the "strings" command
129 * to terminate.
130 */
131 public int getTimeout() {
132 return this.timeout;
133 }
134
135 /**
136 * Sets the "strings" installation folder.
137 *
138 * @param path
139 * the "strings" installation folder.
140 */
141 public void setStringsPath(String path) {
142 if (!path.isEmpty() && !path.endsWith(File.separator)) {
143 path += File.separatorChar;
144 }
145 this.stringsPath = path;
146 }
147
148 /**
149 * Sets the minimum sequence length (characters) to print.
150 *
151 * @param minLength
152 * the minimum sequence length (characters) to print.
153 */
154 public void setMinLength(int minLength) {
155 if (minLength < 1) {
156 throw new IllegalArgumentException("Invalid minimum length");
157 }
158 this.minLength = minLength;
159 }
160
161 /**
162 * Sets the character encoding of the strings that are to be found.
163 *
164 * @param encoding
165 * {@link StringsEncoding} enum that represents the character
166 * encoding of the strings that are to be found.
167 */
168 public void setEncoding(StringsEncoding encoding) {
169 this.encoding = encoding;
170 }
171
172 /**
173 * Sets the maximum time (in seconds) to wait for the "strings" command to
174 * terminate.
175 *
176 * @param timeout
177 * the maximum time (in seconds) to wait for the "strings"
178 * command to terminate.
179 */
180 public void setTimeout(int timeout) {
181 if (timeout < 1) {
182 throw new IllegalArgumentException("Invalid timeout");
183 }
184 this.timeout = timeout;
185 }
186 }
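StringsConfig.init above uses a load-with-fallback pattern: properties are read from a stream, an IOException is swallowed, and any missing key falls back to the field's current default. A minimal pure-JDK sketch of that pattern (hypothetical class and method names, not part of Tika):

```java
import java.io.ByteArrayInputStream;
import java.io.IOException;
import java.io.InputStream;
import java.util.Properties;

public class PropsDefaultsDemo {
    // Same idea as StringsConfig.init: load overrides if possible,
    // otherwise keep the caller-supplied current value.
    static int minLength(InputStream is, int current) {
        Properties props = new Properties();
        try {
            props.load(is);
        } catch (IOException e) {
            // swallow and keep defaults, as StringsConfig does
        }
        return Integer.parseInt(props.getProperty("minLength", "" + current));
    }

    public static void main(String[] args) {
        InputStream withKey = new ByteArrayInputStream("minLength=6\n".getBytes());
        System.out.println(minLength(withKey, 4)); // 6 (override applied)
        InputStream empty = new ByteArrayInputStream(new byte[0]);
        System.out.println(minLength(empty, 4));   // 4 (default kept)
    }
}
```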
0 /*
1 * Licensed under the Apache License, Version 2.0 (the "License");
2 * you may not use this file except in compliance with the License.
3 * You may obtain a copy of the License at
4 *
5 * http://www.apache.org/licenses/LICENSE-2.0
6 *
7 * Unless required by applicable law or agreed to in writing, software
8 * distributed under the License is distributed on an "AS IS" BASIS,
9 * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
10 * See the License for the specific language governing permissions and
11 * limitations under the License.
12 */
13 package org.apache.tika.parser.strings;
14
15 /**
16 * Character encoding of the strings that are to be found using the "strings" command.
17 *
18 */
19 public enum StringsEncoding {
20 SINGLE_7_BIT('s', "single-7-bit-byte"), // default
21 SINGLE_8_BIT('S', "single-8-bit-byte"),
22 BIGENDIAN_16_BIT('b', "16-bit bigendian"),
23 LITTLEENDIAN_16_BIT('l', "16-bit littleendian"),
24 BIGENDIAN_32_BIT('B', "32-bit bigendian"),
25 LITTLEENDIAN_32_BIT('L', "32-bit littleendian");
26
27 private char value;
28
29 private String encoding;
30
31 private StringsEncoding(char value, String encoding) {
32 this.value = value;
33 this.encoding = encoding;
34 }
35
36 public char get() {
37 return value;
38 }
39
40 @Override
41 public String toString() {
42 return encoding;
43 }
44 }
0 /*
1 * Licensed under the Apache License, Version 2.0 (the "License");
2 * you may not use this file except in compliance with the License.
3 * You may obtain a copy of the License at
4 *
5 * http://www.apache.org/licenses/LICENSE-2.0
6 *
7 * Unless required by applicable law or agreed to in writing, software
8 * distributed under the License is distributed on an "AS IS" BASIS,
9 * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
10 * See the License for the specific language governing permissions and
11 * limitations under the License.
12 */
13 package org.apache.tika.parser.strings;
14
15 import java.io.BufferedReader;
16 import java.io.File;
17 import java.io.IOException;
18 import java.io.InputStream;
19 import java.io.InputStreamReader;
20 import java.util.ArrayList;
21 import java.util.Collections;
22 import java.util.HashMap;
23 import java.util.Map;
24 import java.util.Set;
25 import java.util.concurrent.Callable;
26 import java.util.concurrent.ExecutionException;
27 import java.util.concurrent.FutureTask;
28 import java.util.concurrent.TimeUnit;
29 import java.util.concurrent.TimeoutException;
30
31 import org.apache.tika.exception.TikaException;
32 import org.apache.tika.io.IOUtils;
33 import org.apache.tika.io.TikaInputStream;
34 import org.apache.tika.metadata.Metadata;
35 import org.apache.tika.mime.MediaType;
36 import org.apache.tika.parser.AbstractParser;
37 import org.apache.tika.parser.ParseContext;
38 import org.apache.tika.parser.external.ExternalParser;
39 import org.apache.tika.sax.XHTMLContentHandler;
40 import org.xml.sax.ContentHandler;
41 import org.xml.sax.SAXException;
42
43 /**
44 * Parser that uses the "strings" (or strings-alternative) command to find the
45 * printable strings in an object file or other binary file
46 * (application/octet-stream). Useful as a "best-effort" parser for files
47 * detected as application/octet-stream.
48 *
49 * @author gtotaro
50 *
51 */
52 public class StringsParser extends AbstractParser {
53 /**
54 * Serial version UID
55 */
56 private static final long serialVersionUID = 802566634661575025L;
57
58 private static final Set<MediaType> SUPPORTED_TYPES = Collections
59 .singleton(MediaType.OCTET_STREAM);
60
61 private static final StringsConfig DEFAULT_STRINGS_CONFIG = new StringsConfig();
62
63 private static final FileConfig DEFAULT_FILE_CONFIG = new FileConfig();
64
65 /*
66 * This map is organized as follows:
67 * command's pathname (String) -> is it present? (Boolean), does it support -e option? (Boolean)
68 * It stores check results for command and, if present, -e (encoding) option.
69 */
70 private static Map<String,Boolean[]> STRINGS_PRESENT = new HashMap<String, Boolean[]>();
71
72 @Override
73 public Set<MediaType> getSupportedTypes(ParseContext context) {
74 return SUPPORTED_TYPES;
75 }
76
77 @Override
78 public void parse(InputStream stream, ContentHandler handler,
79 Metadata metadata, ParseContext context) throws IOException,
80 SAXException, TikaException {
81 StringsConfig stringsConfig = context.get(StringsConfig.class, DEFAULT_STRINGS_CONFIG);
82 FileConfig fileConfig = context.get(FileConfig.class, DEFAULT_FILE_CONFIG);
83
84 if (!hasStrings(stringsConfig)) {
85 return;
86 }
87
88 TikaInputStream tis = TikaInputStream.get(stream);
89 File input = tis.getFile();
90
91 // Metadata
92 metadata.set("strings:min-len", "" + stringsConfig.getMinLength());
93 metadata.set("strings:encoding", stringsConfig.getEncoding().toString());
94 metadata.set("strings:file_output", doFile(input, fileConfig));
95
96 int totalBytes = 0;
97
98 // Content
99 XHTMLContentHandler xhtml = new XHTMLContentHandler(handler, metadata);
100
101 xhtml.startDocument();
102
103 totalBytes = doStrings(input, stringsConfig, xhtml);
104
105 xhtml.endDocument();
106
107 // Metadata
108 metadata.set("strings:length", "" + totalBytes);
109 }
110
111 /**
112 * Checks if the "strings" command is supported.
113 *
114 * @param config
115 * {@link StringsConfig} object used for testing the strings
116 * command.
117 * @return {@code true} if the strings command is supported.
118 */
119 private boolean hasStrings(StringsConfig config) {
120 String stringsProg = config.getStringsPath() + getStringsProg();
121
122 if (STRINGS_PRESENT.containsKey(stringsProg)) {
123 return STRINGS_PRESENT.get(stringsProg)[0];
124 }
125
126 String[] checkCmd = { stringsProg, "--version" };
127 try {
128 boolean hasStrings = ExternalParser.check(checkCmd);
129
130 boolean encodingOpt = false;
131
132 // Check if the -e option (encoding) is supported
133 if (!System.getProperty("os.name").startsWith("Windows")) {
134 String[] checkOpt = {stringsProg, "-e", "" + config.getEncoding().get(), "/dev/null"};
135 int[] errorValues = {1, 2}; // Exit status code: 1 = general error; 2 = incorrect usage.
136 encodingOpt = ExternalParser.check(checkOpt, errorValues);
137 }
138
139 Boolean[] values = {hasStrings, encodingOpt};
140 STRINGS_PRESENT.put(stringsProg, values);
141
142 return hasStrings;
143 } catch (NoClassDefFoundError ncdfe) {
144 // This happens under OSGi + Fork Parser - see TIKA-1507
145 // As a workaround for now, just say we can't use strings
146 // TODO Resolve it so we don't need this try/catch block
147 Boolean[] values = {false, false};
148 STRINGS_PRESENT.put(stringsProg, values);
149 return false;
150 }
151 }
152
153 /**
154 * Checks if the "file" command is supported.
155 *
156 * @param config {@link FileConfig} used to locate the "file" command.
157 * @return {@code true} if the "file" command is supported.
158 */
159 private boolean hasFile(FileConfig config) {
160 String fileProg = config.getFilePath() + getFileProg();
161
162 String[] checkCmd = { fileProg, "--version" };
163
164 boolean hasFile = ExternalParser.check(checkCmd);
165
166 return hasFile;
167 }
168
169 /**
170 * Runs the "strings" command on the given file.
171 *
172 * @param input
173 * {@link File} object that represents the file to parse.
174 * @param config
175 * {@link StringsConfig} object including the strings
176 * configuration.
177 * @param xhtml
178 * {@link XHTMLContentHandler} object.
179 * @return the total number of bytes read using the strings command.
180 * @throws IOException
181 * if any I/O error occurs.
182 * @throws TikaException
183 * if the parsing process has been interrupted.
184 * @throws SAXException
185 */
186 private int doStrings(File input, StringsConfig config,
187 XHTMLContentHandler xhtml) throws IOException, TikaException,
188 SAXException {
189
190 String stringsProg = config.getStringsPath() + getStringsProg();
191
192 // Builds the command array
193 ArrayList<String> cmdList = new ArrayList<String>(4);
194 cmdList.add(stringsProg);
195 cmdList.add("-n");
196 cmdList.add("" + config.getMinLength());
197 // Currently, the encoding option is not supported on Windows (and some other platforms)
198 if (STRINGS_PRESENT.get(stringsProg)[1]) {
199 cmdList.add("-e");
200 cmdList.add("" + config.getEncoding().get());
201 }
202 cmdList.add(input.getPath());
203
204 String[] cmd = cmdList.toArray(new String[cmdList.size()]);
205
206 ProcessBuilder pb = new ProcessBuilder(cmd);
207 final Process process = pb.start();
208
209 InputStream out = process.getInputStream();
210
211 FutureTask<Integer> waitTask = new FutureTask<Integer>(
212 new Callable<Integer>() {
213 public Integer call() throws Exception {
214 return process.waitFor();
215 }
216 });
217
218 Thread waitThread = new Thread(waitTask);
219 waitThread.start();
220
221 // Reads content printed out by "strings" command
222 int totalBytes = 0;
223 totalBytes = extractOutput(out, xhtml);
224
225 try {
226 waitTask.get(config.getTimeout(), TimeUnit.SECONDS);
227
228 } catch (InterruptedException ie) {
229 waitThread.interrupt();
230 process.destroy();
231 Thread.currentThread().interrupt();
232 throw new TikaException(StringsParser.class.getName()
233 + " interrupted", ie);
234
235 } catch (ExecutionException ee) {
236 // should not be thrown
237
238 } catch (TimeoutException te) {
239 waitThread.interrupt();
240 process.destroy();
241 throw new TikaException(StringsParser.class.getName() + " timeout",
242 te);
243 }
244
245 return totalBytes;
246 }
247
248 /**
249 * Extracts ASCII strings using the "strings" command.
250 *
251 * @param stream
252 * {@link InputStream} object used for reading the binary file.
253 * @param xhtml
254 * {@link XHTMLContentHandler} object.
255 * @return the total number of bytes read using the "strings" command.
256 * @throws SAXException
257 * if the content element could not be written.
258 * @throws IOException
259 * if any I/O error occurs.
260 */
261 private int extractOutput(InputStream stream, XHTMLContentHandler xhtml)
262 throws SAXException, IOException {
263
264 char[] buffer = new char[1024];
265 BufferedReader reader = null;
266 int totalBytes = 0;
267
268 try {
269 reader = new BufferedReader(new InputStreamReader(stream, IOUtils.UTF_8));
270
271 int n = 0;
272 while ((n = reader.read(buffer)) != -1) {
273 if (n > 0) {
274 xhtml.characters(buffer, 0, n);
275 }
276 totalBytes += n;
277 }
278
279 } finally {
280 if (reader != null) reader.close();
281 }
282
283 return totalBytes;
284 }
285
286 /**
287 * Runs the "file" command on the given file that aims at providing an
288 * alternative way to determine the file type.
289 *
290 * @param input {@link File} object that represents the file to detect.
291 * @param config {@link FileConfig} object including the file configuration.
292 * @return the file type provided by the "file" command using the "-b"
293 * option (it stands for "brief mode").
294 * @throws IOException
295 * if any I/O error occurs.
296 */
297 private String doFile(File input, FileConfig config) throws IOException {
298 if (!hasFile(config)) {
299 return null;
300 }
301
302 // Builds the command array
303 ArrayList<String> cmdList = new ArrayList<String>(3);
304 cmdList.add(config.getFilePath() + getFileProg());
305 cmdList.add("-b");
306 if (config.isMimetype()) {
307 cmdList.add("-I");
308 }
309 cmdList.add(input.getPath());
310
311 String[] cmd = cmdList.toArray(new String[cmdList.size()]);
312
313 ProcessBuilder pb = new ProcessBuilder(cmd);
314 final Process process = pb.start();
315
316 InputStream out = process.getInputStream();
317
318 BufferedReader reader = null;
319 String fileOutput = null;
320
321 try {
322 reader = new BufferedReader(new InputStreamReader(out, IOUtils.UTF_8));
323 fileOutput = reader.readLine();
324
325 } catch (IOException ioe) {
326 // file output not available!
327 fileOutput = "";
328 } finally {
329 if (reader != null) reader.close();
330 }
331
332 return fileOutput;
333 }
334
335
336 public static String getStringsProg() {
337 return System.getProperty("os.name").startsWith("Windows") ? "strings.exe" : "strings";
338 }
339
340 public static String getFileProg() {
341 return System.getProperty("os.name").startsWith("Windows") ? "file.exe" : "file";
342 }
343 }
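StringsParser.doStrings builds the external command as `strings -n <minLen> [-e <enc>] <file>`, appending the `-e` pair only when the probe in hasStrings found the option supported. A hypothetical standalone sketch of that command assembly (class and method names are illustrative, not Tika API):

```java
import java.util.ArrayList;
import java.util.List;

public class StringsCmdDemo {
    // Mirrors the argument order used by doStrings; a null encoding flag
    // stands in for "the -e option is not supported on this platform".
    static List<String> buildCmd(String prog, int minLen, Character enc, String path) {
        List<String> cmd = new ArrayList<String>();
        cmd.add(prog);
        cmd.add("-n");
        cmd.add("" + minLen);
        if (enc != null) {
            cmd.add("-e");
            cmd.add("" + enc);
        }
        cmd.add(path);
        return cmd;
    }

    public static void main(String[] args) {
        System.out.println(buildCmd("strings", 4, 's', "/tmp/blob"));
        // [strings, -n, 4, -e, s, /tmp/blob]
        System.out.println(buildCmd("strings", 4, null, "/tmp/blob"));
        // [strings, -n, 4, /tmp/blob]
    }
}
```

The resulting list would typically be handed to ProcessBuilder exactly as doStrings does, so the per-argument form avoids any shell quoting concerns.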
2828 import java.util.Set;
2929
3030 import org.apache.tika.exception.TikaException;
31 import org.apache.tika.io.IOUtils;
3132 import org.apache.tika.metadata.Metadata;
3233 import org.apache.tika.mime.MediaType;
3334 import org.apache.tika.parser.AbstractParser;
129130 int size = input.readUnsignedShort();
130131 byte[] chars = new byte[size];
131132 input.readFully(chars);
132 return new String(chars);
133 return new String(chars, IOUtils.UTF_8);
133134 }
134135
135136 private Object readAMFObject(DataInputStream input) throws IOException {
2323 org.apache.tika.parser.font.AdobeFontMetricParser
2424 org.apache.tika.parser.font.TrueTypeParser
2525 org.apache.tika.parser.html.HtmlParser
26 org.apache.tika.parser.image.BPGParser
2627 org.apache.tika.parser.image.ImageParser
2728 org.apache.tika.parser.image.PSDParser
2829 org.apache.tika.parser.image.TiffParser
30 org.apache.tika.parser.image.WebPParser
2931 org.apache.tika.parser.iptc.IptcAnpaParser
3032 org.apache.tika.parser.iwork.IWorkPackageParser
3133 org.apache.tika.parser.jpeg.JpegParser
3335 org.apache.tika.parser.mbox.MboxParser
3436 org.apache.tika.parser.mbox.OutlookPSTParser
3537 org.apache.tika.parser.microsoft.OfficeParser
38 org.apache.tika.parser.microsoft.OldExcelParser
3639 org.apache.tika.parser.microsoft.TNEFParser
3740 org.apache.tika.parser.microsoft.ooxml.OOXMLParser
3841 org.apache.tika.parser.mp3.Mp3Parser
4346 org.apache.tika.parser.pdf.PDFParser
4447 org.apache.tika.parser.pkg.CompressorParser
4548 org.apache.tika.parser.pkg.PackageParser
49 org.apache.tika.parser.pkg.RarParser
4650 org.apache.tika.parser.rtf.RTFParser
4751 org.apache.tika.parser.txt.TXTParser
4852 org.apache.tika.parser.video.FLVParser
5155 org.apache.tika.parser.chm.ChmParser
5256 org.apache.tika.parser.code.SourceCodeParser
5357 org.apache.tika.parser.mat.MatParser
58 org.apache.tika.parser.ocr.TesseractOCRParser
59 org.apache.tika.parser.gdal.GDALParser
60 org.apache.tika.parser.grib.GribParser
61 org.apache.tika.parser.jdbc.SQLite3Parser
62 org.apache.tika.parser.isatab.ISArchiveParser
0 # Licensed to the Apache Software Foundation (ASF) under one or more
1 # contributor license agreements. See the NOTICE file distributed with
2 # this work for additional information regarding copyright ownership.
3 # The ASF licenses this file to You under the Apache License, Version 2.0
4 # (the "License"); you may not use this file except in compliance with
5 # the License. You may obtain a copy of the License at
6 #
7 # http://www.apache.org/licenses/LICENSE-2.0
8 #
9 # Unless required by applicable law or agreed to in writing, software
10 # distributed under the License is distributed on an "AS IS" BASIS,
11 # WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
12 # See the License for the specific language governing permissions and
13 # limitations under the License.
14
15 tesseractPath=
16 language=eng
17 pageSegMode=1
18 maxFileSizeToOcr=2147483647
19 minFileSizeToOcr=0
20 timeout=120
1717 sortByPosition false
1818 suppressDuplicateOverlappingText false
1919 useNonSequentialParser false
20 extractAcroFormContent true
21 extractInlineImages false
22 extractUniqueInlineImagesOnly true
20 extractAcroFormContent true
21 extractInlineImages false
22 extractUniqueInlineImagesOnly true
23 checkExtractAccessPermission false
24 allowExtractionForAccessibility true
107107
108108 @Test
109109 public void testComment() throws Exception {
110 final String[] extensions = new String[] {"ppt", "pptx", "doc", "docx", "pdf", "rtf"};
110 final String[] extensions = new String[] {"ppt", "pptx", "doc",
111 "docx", "xls", "xlsx", "pdf", "rtf"};
111112 for(String extension : extensions) {
112113 verifyComment(extension, "testComment");
113114 }
1515 */
1616 package org.apache.tika;
1717
18 import static org.junit.Assert.assertFalse;
1819 import static org.junit.Assert.assertTrue;
1920 import static org.junit.Assert.fail;
2021
2526 import java.net.URISyntaxException;
2627 import java.net.URL;
2728 import java.util.ArrayList;
29 import java.util.Collection;
2830 import java.util.HashSet;
2931 import java.util.List;
3032 import java.util.Set;
3133
32 import org.apache.tika.exception.TikaException;
3334 import org.apache.tika.extractor.EmbeddedResourceHandler;
3435 import org.apache.tika.io.IOUtils;
3536 import org.apache.tika.io.TikaInputStream;
3839 import org.apache.tika.parser.AutoDetectParser;
3940 import org.apache.tika.parser.ParseContext;
4041 import org.apache.tika.parser.Parser;
41 import org.apache.tika.parser.ParserDecorator;
4242 import org.apache.tika.sax.BodyContentHandler;
4343 import org.apache.tika.sax.ToXMLContentHandler;
4444 import org.xml.sax.ContentHandler;
45 import org.xml.sax.SAXException;
46 import org.xml.sax.helpers.DefaultHandler;
4745
4846 /**
4947 * Parent class of Tika tests
8684 public static void assertContains(String needle, String haystack) {
8785 assertTrue(needle + " not found in:\n" + haystack, haystack.contains(needle));
8886 }
87 public static <T> void assertContains(T needle, Collection<? extends T> haystack) {
88 assertTrue(needle + " not found in:\n" + haystack, haystack.contains(needle));
89 }
90
91 public static void assertNotContained(String needle, String haystack) {
92 assertFalse(needle + " unexpectedly found in:\n" + haystack, haystack.contains(needle));
93 }
94 public static <T> void assertNotContained(T needle, Collection<? extends T> haystack) {
95 assertFalse(needle + " unexpectedly found in:\n" + haystack, haystack.contains(needle));
96 }
8997
9098 protected static class XMLResult {
9199 public final String xml;
95103 this.xml = xml;
96104 this.metadata = metadata;
97105 }
106 }
107
108 protected XMLResult getXML(String filePath, Metadata metadata) throws Exception {
109 return getXML(getResourceAsStream("/test-documents/" + filePath), new AutoDetectParser(), metadata);
98110 }
99111
100112 protected XMLResult getXML(String filePath) throws Exception {
194206 }
195207 }
196208 }
197
198 /**
199 * Stores metadata and (optionally) content.
200 * Many thanks to Jukka's example:
201 * http://wiki.apache.org/tika/RecursiveMetadata
202 * This ignores the incoming handler and applies a
203 * new BodyContentHandler(-1) for each file.
204 */
205 public static class RecursiveMetadataParser extends ParserDecorator {
206 /** Key for content string if stored */
207 public static final String TIKA_CONTENT = "tika:content";
208
209 private static final long serialVersionUID = 1L;
210
211 private List<Metadata> metadatas = new ArrayList<Metadata>();
212 private final boolean storeContent;
213
214 public RecursiveMetadataParser(Parser parser,
215 boolean storeContent) {
216 super(parser);
217 this.storeContent = storeContent;
218 }
219
220 @Override
221 public void parse(
222 InputStream stream, ContentHandler ignoredHandler,
223 Metadata metadata, ParseContext context)
224 throws IOException, SAXException, TikaException {
225
226 ContentHandler contentHandler = null;
227 if (storeContent) {
228 contentHandler = new BodyContentHandler(-1);
229 } else {
230 contentHandler = new DefaultHandler();
231 }
232 super.parse(stream, contentHandler, metadata, context);
233
234 if (storeContent) {
235 metadata.add(TIKA_CONTENT, contentHandler.toString());
236 }
237 metadatas.add(metadata);
238 }
239
240 public List<Metadata> getAllMetadata() {
241 return metadatas;
242 }
243
244 public void clear() {
245 metadatas.clear();
246 }
247 }
248
249209 }
0 /*
1 * Licensed to the Apache Software Foundation (ASF) under one or more
2 * contributor license agreements. See the NOTICE file distributed with
3 * this work for additional information regarding copyright ownership.
4 * The ASF licenses this file to You under the Apache License, Version 2.0
5 * (the "License"); you may not use this file except in compliance with
6 * the License. You may obtain a copy of the License at
7 *
8 * http://www.apache.org/licenses/LICENSE-2.0
9 *
10 * Unless required by applicable law or agreed to in writing, software
11 * distributed under the License is distributed on an "AS IS" BASIS,
12 * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
13 * See the License for the specific language governing permissions and
14 * limitations under the License.
15 */
16 package org.apache.tika.config;
17
18 import static org.apache.tika.TikaTest.assertContains;
19 import static org.apache.tika.TikaTest.assertNotContained;
20 import static org.junit.Assert.assertEquals;
21 import static org.junit.Assert.assertNotNull;
22 import static org.junit.Assert.assertTrue;
23 import static org.junit.Assert.fail;
24
25 import java.net.URL;
26 import java.util.List;
27
28 import org.apache.tika.mime.MediaType;
29 import org.apache.tika.parser.CompositeParser;
30 import org.apache.tika.parser.DefaultParser;
31 import org.apache.tika.parser.EmptyParser;
32 import org.apache.tika.parser.ParseContext;
33 import org.apache.tika.parser.Parser;
34 import org.apache.tika.parser.ParserDecorator;
35 import org.apache.tika.parser.executable.ExecutableParser;
36 import org.apache.tika.parser.xml.XMLParser;
37 import org.junit.After;
38 import org.junit.Test;
39
40 /**
41 * JUnit test class for {@link TikaConfig}, which covers things
42 * that {@link TikaConfigTest} can't do due to a need for the
43 * full set of parsers
44 */
45 public class TikaParserConfigTest {
46 protected static ParseContext context = new ParseContext();
47 protected static TikaConfig getConfig(String config) throws Exception {
48 URL url = TikaConfig.class.getResource(config);
49 System.setProperty("tika.config", url.toExternalForm());
50 return new TikaConfig();
51 }
52 @After
53 public void resetConfig() {
54 System.clearProperty("tika.config");
55 }
56
57 @Test
58 public void testMimeExcludeInclude() throws Exception {
59 TikaConfig config = getConfig("TIKA-1558-blacklist.xml");
60 Parser parser = config.getParser();
61
62 MediaType PDF = MediaType.application("pdf");
63 MediaType JPEG = MediaType.image("jpeg");
64
65
66 // Has two parsers
67 assertEquals(CompositeParser.class, parser.getClass());
68 CompositeParser cParser = (CompositeParser)parser;
69 assertEquals(2, cParser.getAllComponentParsers().size());
70
71 // Both are decorated
72 assertTrue(cParser.getAllComponentParsers().get(0) instanceof ParserDecorator);
73 assertTrue(cParser.getAllComponentParsers().get(1) instanceof ParserDecorator);
74 ParserDecorator p0 = (ParserDecorator)cParser.getAllComponentParsers().get(0);
75 ParserDecorator p1 = (ParserDecorator)cParser.getAllComponentParsers().get(1);
76
77
78 // DefaultParser will be wrapped with excludes
79 assertEquals(DefaultParser.class, p0.getWrappedParser().getClass());
80
81 assertNotContained(PDF, p0.getSupportedTypes(context));
82 assertContains(PDF, p0.getWrappedParser().getSupportedTypes(context));
83 assertNotContained(JPEG, p0.getSupportedTypes(context));
84 assertContains(JPEG, p0.getWrappedParser().getSupportedTypes(context));
85
86
87 // Will have an empty parser for PDF
88 assertEquals(EmptyParser.class, p1.getWrappedParser().getClass());
89 assertEquals(1, p1.getSupportedTypes(context).size());
90 assertContains(PDF, p1.getSupportedTypes(context));
91 assertNotContained(PDF, p1.getWrappedParser().getSupportedTypes(context));
92 }
93
94 @Test
95 public void testParserExcludeFromDefault() throws Exception {
96 TikaConfig config = getConfig("TIKA-1558-blacklist.xml");
97 CompositeParser parser = (CompositeParser)config.getParser();
98
99 MediaType PE_EXE = MediaType.application("x-msdownload");
100 MediaType ELF = MediaType.application("x-elf");
101
102
103 // Get the DefaultParser from the config
104 ParserDecorator confWrappedParser = (ParserDecorator)parser.getParsers().get(MediaType.APPLICATION_XML);
105 assertNotNull(confWrappedParser);
106 DefaultParser confParser = (DefaultParser)confWrappedParser.getWrappedParser();
107
108 // Get a fresh "default" DefaultParser
109 DefaultParser normParser = new DefaultParser(config.getMediaTypeRegistry());
110
111
112 // The default one will offer the Executable Parser
113 assertContains(PE_EXE, normParser.getSupportedTypes(context));
114 assertContains(ELF, normParser.getSupportedTypes(context));
115
116 boolean hasExec = false;
117 for (Parser p : normParser.getParsers().values()) {
118 if (p instanceof ExecutableParser) {
119 hasExec = true;
120 break;
121 }
122 }
123 assertTrue(hasExec);
124
125
126 // The one from the config won't
127 assertNotContained(PE_EXE, confParser.getSupportedTypes(context));
128 assertNotContained(ELF, confParser.getSupportedTypes(context));
129
130 for (Parser p : confParser.getParsers().values()) {
131 if (p instanceof ExecutableParser)
132 fail("Shouldn't have the Executable Parser from config");
133 }
134 }
135 /**
136 * TIKA-1558 It should be possible to exclude Parsers from being picked up by
137 * DefaultParser.
138 */
139 @Test
140 public void defaultParserBlacklist() throws Exception {
141 TikaConfig config = new TikaConfig();
142 CompositeParser cp = (CompositeParser) config.getParser();
143 List<Parser> parsers = cp.getAllComponentParsers();
144
145 boolean hasXML = false;
146 for (Parser p : parsers) {
147 if (p instanceof XMLParser) {
148 hasXML = true;
149 break;
150 }
151 }
152 assertTrue("Default config should include an XMLParser.", hasXML);
153
154 // This custom TikaConfig should exclude XMLParser and all of its subclasses.
155 config = getConfig("TIKA-1558-blacklistsub.xml");
156 cp = (CompositeParser) config.getParser();
157 parsers = cp.getAllComponentParsers();
158
159 for (Parser p : parsers) {
160 if (p instanceof XMLParser)
161 fail("Custom config should not include an XMLParser (" + p.getClass() + ").");
162 }
163 }
164 }
9191 assertTypeByData("testPUBLISHER.pub", "application/x-mspublisher");
9292 assertTypeByData("testWORKS.wps", "application/vnd.ms-works");
9393 assertTypeByData("testWORKS2000.wps", "application/vnd.ms-works");
94
9495 // older Works Word Processor files can't be recognized
9596 // they were created with Works Word Processor 7.0 (hence the text inside)
9697 // and exported to the older formats with the "Save As" feature
99100 assertTypeByData("testWORKSSpreadsheet7.0.xlr", "application/x-tika-msworks-spreadsheet");
100101 assertTypeByData("testPROJECT2003.mpp", "application/vnd.ms-project");
101102 assertTypeByData("testPROJECT2007.mpp", "application/vnd.ms-project");
103
102104 // Excel 95 can be detected but not parsed
103105 assertTypeByData("testEXCEL_95.xls", "application/vnd.ms-excel");
104106
210212 assertTypeByData("testPPT.ppsx", "application/vnd.openxmlformats-officedocument.presentationml.slideshow");
211213 assertTypeByData("testPPT.ppsm", "application/vnd.ms-powerpoint.slideshow.macroEnabled.12");
212214 assertTypeByData("testDOTM.dotm", "application/vnd.ms-word.template.macroEnabled.12");
215 assertTypeByData("testEXCEL.strict.xlsx", "application/vnd.openxmlformats-officedocument.spreadsheetml.sheet");
213216 assertTypeByData("testPPT.xps", "application/vnd.ms-xpsdocument");
214217
218 assertTypeByData("testVISIO.vsdm", "application/vnd.ms-visio.drawing.macroenabled.12");
219 assertTypeByData("testVISIO.vsdx", "application/vnd.ms-visio.drawing");
220 assertTypeByData("testVISIO.vssm", "application/vnd.ms-visio.stencil.macroenabled.12");
221 assertTypeByData("testVISIO.vssx", "application/vnd.ms-visio.stencil");
222 assertTypeByData("testVISIO.vstm", "application/vnd.ms-visio.template.macroenabled.12");
223 assertTypeByData("testVISIO.vstx", "application/vnd.ms-visio.template");
224
215225 // .xlsb is an OOXML file containing the binary parts, and not
216226 // an OLE2 file as you might initially expect!
217227 assertTypeByData("testEXCEL.xlsb", "application/vnd.ms-excel.sheet.binary.macroEnabled.12");
3333 import java.text.SimpleDateFormat;
3434 import java.util.Date;
3535 import java.util.HashMap;
36 import java.util.Locale;
3637 import java.util.Map;
3738
38 import org.apache.tika.embedder.Embedder;
39 import org.apache.tika.embedder.ExternalEmbedder;
4039 import org.apache.tika.exception.TikaException;
40 import org.apache.tika.io.IOUtils;
4141 import org.apache.tika.io.TemporaryResources;
4242 import org.apache.tika.io.TikaInputStream;
4343 import org.apache.tika.metadata.Metadata;
5757 public class ExternalEmbedderTest {
5858
5959 protected static final DateFormat EXPECTED_METADATA_DATE_FORMATTER =
60 new SimpleDateFormat("yyyy-MM-dd'T'HH:mm:ss");
61 protected static final String DEFAULT_CHARSET = "UTF-8";
60 new SimpleDateFormat("yyyy-MM-dd'T'HH:mm:ss", Locale.ROOT);
61 protected static final String DEFAULT_CHARSET = IOUtils.UTF_8.name();
6262 private static final String COMMAND_METADATA_ARGUMENT_DESCRIPTION = "dc:description";
6363 private static final String TEST_TXT_PATH = "/test-documents/testTXT.txt";
6464
231231 }
232232
233233 /**
234 * Files from Excel 2 through 4 are based on the BIFF record
235 * structure, but without a wrapping OLE2 structure.
236 * Excel 5 and Excel 95+ use the OLE2 container format.
237 */
238 @Test
239 public void testOldExcel() throws Exception {
240 // With just a name, we'll think everything's a new Excel file
241 assertTypeByName("application/vnd.ms-excel","testEXCEL_4.xls");
242 assertTypeByName("application/vnd.ms-excel","testEXCEL_5.xls");
243 assertTypeByName("application/vnd.ms-excel","testEXCEL_95.xls");
244
245 // With data, we can work out if it's old or new style
246 assertTypeByData("application/vnd.ms-excel.sheet.4","testEXCEL_4.xls");
247 assertTypeByData("application/x-tika-msoffice","testEXCEL_5.xls");
248 assertTypeByData("application/x-tika-msoffice","testEXCEL_95.xls");
249
250 assertTypeByNameAndData("application/vnd.ms-excel.sheet.4","testEXCEL_4.xls");
251 assertTypeByNameAndData("application/vnd.ms-excel","testEXCEL_5.xls");
252 assertTypeByNameAndData("application/vnd.ms-excel","testEXCEL_95.xls");
253 }
254
255 /**
234256 * Note - detecting container formats by mime magic is very
235257 * unreliable, as we can't be sure where things will end up.
236258 * People really ought to use the container-aware detection...
258280 assertTypeByNameAndData("application/vnd.ms-powerpoint.presentation.macroenabled.12", "testPPT.pptm");
259281 assertTypeByNameAndData("application/vnd.ms-powerpoint.template.macroenabled.12", "testPPT.potm");
260282 assertTypeByNameAndData("application/vnd.ms-powerpoint.slideshow.macroenabled.12", "testPPT.ppsm");
283 }
284
285 /**
286 * Note - container-based formats need container detection
287 * to be identified correctly
288 */
289 @Test
290 public void testVisioDetection() throws Exception {
291 // By Name, should get it right
292 assertTypeByName("application/vnd.visio", "testVISIO.vsd");
293 assertTypeByName("application/vnd.ms-visio.drawing.macroenabled.12", "testVISIO.vsdm");
294 assertTypeByName("application/vnd.ms-visio.drawing", "testVISIO.vsdx");
295 assertTypeByName("application/vnd.ms-visio.stencil.macroenabled.12", "testVISIO.vssm");
296 assertTypeByName("application/vnd.ms-visio.stencil", "testVISIO.vssx");
297 assertTypeByName("application/vnd.ms-visio.template.macroenabled.12", "testVISIO.vstm");
298 assertTypeByName("application/vnd.ms-visio.template", "testVISIO.vstx");
299
300 // By Name and Data, should get it right
301 assertTypeByNameAndData("application/vnd.visio", "testVISIO.vsd");
302 assertTypeByNameAndData("application/vnd.ms-visio.drawing.macroenabled.12", "testVISIO.vsdm");
303 assertTypeByNameAndData("application/vnd.ms-visio.drawing", "testVISIO.vsdx");
304 assertTypeByNameAndData("application/vnd.ms-visio.stencil.macroenabled.12", "testVISIO.vssm");
305 assertTypeByNameAndData("application/vnd.ms-visio.stencil", "testVISIO.vssx");
306 assertTypeByNameAndData("application/vnd.ms-visio.template.macroenabled.12", "testVISIO.vstm");
307 assertTypeByNameAndData("application/vnd.ms-visio.template", "testVISIO.vstx");
308
309 // By Data only, will get the container parent
310 assertTypeByData("application/x-tika-msoffice", "testVISIO.vsd");
311 assertTypeByData("application/x-tika-ooxml", "testVISIO.vsdm");
312 assertTypeByData("application/x-tika-ooxml", "testVISIO.vsdx");
313 assertTypeByData("application/x-tika-ooxml", "testVISIO.vssm");
314 assertTypeByData("application/x-tika-ooxml", "testVISIO.vssx");
315 assertTypeByData("application/x-tika-ooxml", "testVISIO.vstm");
316 assertTypeByData("application/x-tika-ooxml", "testVISIO.vstx");
261317 }
262318
263319 /**
337393 }
338394
339395 @Test
396 public void testBpgDetection() throws Exception {
397 assertType("image/x-bpg", "testBPG.bpg");
398 assertTypeByData("image/x-bpg", "testBPG.bpg");
399 assertTypeByData("image/x-bpg", "testBPG_commented.bpg");
400 assertTypeByName("image/x-bpg", "x.bpg");
401 }
402
403 @Test
340404 public void testTiffDetection() throws Exception {
341405 assertType("image/tiff", "testTIFF.tif");
342406 assertTypeByData("image/tiff", "testTIFF.tif");
359423 assertTypeByData("image/png", "testPNG.png");
360424 assertTypeByName("image/png", "x.png");
361425 assertTypeByName("image/png", "x.PNG");
426 }
427
428 @Test
429 public void testWEBPDetection() throws Exception {
430 assertType("image/webp", "testWEBP.webp");
431 assertTypeByData("image/webp", "testWEBP.webp");
432 assertTypeByName("image/webp", "x.webp");
433 assertTypeByName("image/webp", "x.WEBP");
362434 }
363435
364436 @Test
488560 }
489561
490562 @Test
563 public void testXmlAndHtmlDetection() throws Exception {
564 assertTypeByData("application/xml", "<?xml version=\"1.0\" encoding=\"UTF-8\"?><records><record/></records>"
565 .getBytes("UTF-8"));
566 assertTypeByData("application/xml", "\uFEFF<?xml version=\"1.0\" encoding=\"UTF-16\"?><records><record/></records>"
567 .getBytes("UTF-16LE"));
568 assertTypeByData("application/xml", "\uFEFF<?xml version=\"1.0\" encoding=\"UTF-16\"?><records><record/></records>"
569 .getBytes("UTF-16BE"));
570 assertTypeByData("application/xml", "<!-- XML without processing instructions --><records><record/></records>"
571 .getBytes("UTF-8"));
572 assertTypeByData("text/html", "<html><body>HTML</body></html>"
573 .getBytes("UTF-8"));
574 assertTypeByData("text/html", "<!-- HTML comment --><html><body>HTML</body></html>"
575 .getBytes("UTF-8"));
576 }
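The XML variants exercised in testXmlAndHtmlDetection differ only in how the leading U+FEFF byte-order mark serialises, which is the clue a detector can use to tell the encodings apart. A minimal pure-JDK sketch of those byte sequences (class name `BomSketch` is ours, not part of Tika):

```java
import java.nio.charset.StandardCharsets;
import java.util.Arrays;

public class BomSketch {
    public static void main(String[] args) {
        // U+FEFF serialises to a different byte sequence per encoding.
        byte[] le   = "\uFEFF".getBytes(StandardCharsets.UTF_16LE); // FF FE
        byte[] be   = "\uFEFF".getBytes(StandardCharsets.UTF_16BE); // FE FF
        byte[] utf8 = "\uFEFF".getBytes(StandardCharsets.UTF_8);    // EF BB BF

        System.out.println(Arrays.toString(le));   // [-1, -2]
        System.out.println(Arrays.toString(be));   // [-2, -1]
        System.out.println(Arrays.toString(utf8)); // [-17, -69, -65]
    }
}
```

A detector that reads the first few bytes can thus pick the right decoder before ever looking at the `<?xml ... encoding=...?>` declaration.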
577
578 @Test
491579 public void testWmfDetection() throws Exception {
492580 assertTypeByName("application/x-msmetafile", "x.wmf");
493581 assertTypeByData("application/x-msmetafile", "testWMF.wmf");
548636 repo.getMediaTypeRegistry().getSupertype(getTypeByNameAndData("testDITA.ditamap")).toString());
549637 assertEquals("application/dita+xml",
550638 repo.getMediaTypeRegistry().getSupertype(getTypeByNameAndData("testDITA.dita")).toString());
551 assertEquals("application/dita+xml",
639 // Concept inherits from topic
640 assertEquals("application/dita+xml; format=topic",
552641 repo.getMediaTypeRegistry().getSupertype(getTypeByNameAndData("testDITA2.dita")).toString());
553642 }
554643
664753 assertType("audio/x-wav", "testWAV.wav");
665754 assertType("audio/midi", "testMID.mid");
666755 assertType("application/x-msaccess", "testACCESS.mdb");
667 assertType("application/x-font-ttf", "testTrueType.ttf");
756 assertType("application/x-font-ttf", "testTrueType3.ttf");
668757 }
669758
670759 @Test
717806 }
718807
719808 @Test
720 public void testEmlx() throws IOException {
809 public void testEmail() throws IOException {
810 // EMLX
721811 assertTypeDetection("testEMLX.emlx", "message/x-emlx");
722 }
723
724 @Test
725 public void testGroupWiseEml() throws Exception {
812
813 // Groupwise
726814 assertTypeDetection("testGroupWiseEml.eml", "message/rfc822");
815
816 // Lotus
817 assertTypeDetection("testLotusEml.eml", "message/rfc822");
818
819 // Thunderbird - doesn't currently work by name
820 assertTypeByNameAndData("message/rfc822", "testThunderbirdEml.eml");
821 }
822
823 @Test
824 public void testAxCrypt() throws Exception {
825 // test-TXT.txt encrypted with a key of "tika"
826 assertTypeDetection("testTXT-tika.axx", "application/x-axcrypt");
827 }
828
829 @Test
830 public void testWindowsEXE() throws Exception {
831 assertTypeByName("application/x-msdownload", "x.dll");
832 assertTypeByName("application/x-ms-installer", "x.msi");
833 assertTypeByName("application/x-dosexec", "x.exe");
834
835 assertTypeByData("application/x-msdownload; format=pe", "testTinyPE.exe");
836 assertTypeByNameAndData("application/x-msdownload; format=pe", "testTinyPE.exe");
837
838 // A jar file with part of a PE header, but not a full one
839 // should still be detected as a zip or jar (without/with name)
840 assertTypeByData("application/zip", "testJAR_with_PEHDR.jar");
841 assertTypeByNameAndData("application/java-archive", "testJAR_with_PEHDR.jar");
727842 }
728843
729844 @Test
758873 assertText(new byte[] { 'a', 'b', 'c' });
759874 assertText(new byte[] { '\t', '\r', '\n', 0x0C, 0x1B });
760875 assertNotText(new byte[] { '\t', '\r', '\n', 0x0E, 0x1C });
876 }
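The assertText/assertNotText pairs above imply a whitelist heuristic: TAB, CR, LF, form feed (0x0C) and escape (0x1B) are acceptable control bytes, while any other control byte marks the data as binary. A minimal sketch of such a heuristic (a hypothetical helper, not Tika's actual text detector):

```java
import java.nio.charset.StandardCharsets;

public class TextHeuristicSketch {
    // Returns true if every byte is printable or a whitelisted control char.
    static boolean looksLikeText(byte[] data) {
        for (byte b : data) {
            int c = b & 0xFF;
            boolean allowed = c >= 0x20          // printable range and above
                    || c == '\t' || c == '\r' || c == '\n'
                    || c == 0x0C                 // form feed
                    || c == 0x1B;                // escape
            if (!allowed) {
                return false;
            }
        }
        return true;
    }

    public static void main(String[] args) {
        System.out.println(looksLikeText("abc".getBytes(StandardCharsets.US_ASCII)));   // true
        System.out.println(looksLikeText(new byte[] { '\t', '\r', '\n', 0x0C, 0x1B })); // true
        System.out.println(looksLikeText(new byte[] { '\t', '\r', '\n', 0x0E, 0x1C })); // false
    }
}
```

This matches the three assertions in the test: plain ASCII and the whitelisted control characters pass, while 0x0E and 0x1C push the data into the non-text bucket.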
877
878 @Test
879 public void testBerkeleyDB() throws IOException {
880 assertTypeByData(
881 "application/x-berkeley-db; format=btree; version=2",
882 "testBDB_btree_2.db");
883 assertTypeByData(
884 "application/x-berkeley-db; format=btree; version=3",
885 "testBDB_btree_3.db");
886 assertTypeByData(
887 "application/x-berkeley-db; format=btree; version=4",
888 "testBDB_btree_4.db");
889 // V4 and V5 share the same btree format
890 assertTypeByData(
891 "application/x-berkeley-db; format=btree; version=4",
892 "testBDB_btree_5.db");
893
894 assertTypeByData(
895 "application/x-berkeley-db; format=hash; version=2",
896 "testBDB_hash_2.db");
897 assertTypeByData(
898 "application/x-berkeley-db; format=hash; version=3",
899 "testBDB_hash_3.db");
900 assertTypeByData(
901 "application/x-berkeley-db; format=hash; version=4",
902 "testBDB_hash_4.db");
903 assertTypeByData(
904 "application/x-berkeley-db; format=hash; version=5",
905 "testBDB_hash_5.db");
761906 }
762907
763908 private void assertText(byte[] prefix) throws IOException {
3232 import org.apache.tika.config.TikaConfig;
3333 import org.apache.tika.detect.Detector;
3434 import org.apache.tika.exception.TikaException;
35 import org.apache.tika.io.IOUtils;
3536 import org.apache.tika.metadata.Metadata;
3637 import org.apache.tika.metadata.TikaCoreProperties;
3738 import org.apache.tika.metadata.XMPDM;
388389 public void testSpecificParserList() throws Exception {
389390 AutoDetectParser parser = new AutoDetectParser(new MyDetector(), new MyParser());
390391
391 InputStream is = new ByteArrayInputStream("test".getBytes());
392 InputStream is = new ByteArrayInputStream("test".getBytes(IOUtils.UTF_8));
392393 Metadata metadata = new Metadata();
393394 parser.parse(is, new BodyContentHandler(), metadata, new ParseContext());
394395
1515 */
1616 package org.apache.tika.parser;
1717
18 import static org.junit.Assert.assertEquals;
19
2018 import java.io.ByteArrayInputStream;
2119 import java.io.InputStream;
2220 import java.io.Reader;
23
21 import org.apache.tika.io.IOUtils;
2422 import org.apache.tika.metadata.Metadata;
2523 import org.apache.tika.metadata.TikaCoreProperties;
2624 import org.junit.Test;
25
26 import static org.junit.Assert.assertEquals;
2727
2828 public class ParsingReaderTest {
2929
3030 @Test
3131 public void testPlainText() throws Exception {
3232 String data = "test content";
33 InputStream stream = new ByteArrayInputStream(data.getBytes("UTF-8"));
33 InputStream stream = new ByteArrayInputStream(data.getBytes(IOUtils.UTF_8));
3434 Reader reader = new ParsingReader(stream, "test.txt");
3535 assertEquals('t', reader.read());
3636 assertEquals('e', reader.read());
5353 @Test
5454 public void testXML() throws Exception {
5555 String data = "<p>test <span>content</span></p>";
56 InputStream stream = new ByteArrayInputStream(data.getBytes("UTF-8"));
56 InputStream stream = new ByteArrayInputStream(data.getBytes(IOUtils.UTF_8));
5757 Reader reader = new ParsingReader(stream, "test.xml");
5858 assertEquals(' ', (char) reader.read());
5959 assertEquals('t', (char) reader.read());
0 package org.apache.tika.parser;
1
2 /*
3 * Licensed to the Apache Software Foundation (ASF) under one or more
4 * contributor license agreements. See the NOTICE file distributed with
5 * this work for additional information regarding copyright ownership.
6 * The ASF licenses this file to You under the Apache License, Version 2.0
7 * (the "License"); you may not use this file except in compliance with
8 * the License. You may obtain a copy of the License at
9 *
10 * http://www.apache.org/licenses/LICENSE-2.0
11 *
12 * Unless required by applicable law or agreed to in writing, software
13 * distributed under the License is distributed on an "AS IS" BASIS,
14 * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
15 * See the License for the specific language governing permissions and
16 * limitations under the License.
17 */
18
19
20 import org.apache.tika.metadata.Metadata;
21 import org.apache.tika.sax.BasicContentHandlerFactory;
22 import org.apache.tika.sax.ContentHandlerFactory;
23 import org.junit.Test;
24 import org.xml.sax.helpers.DefaultHandler;
25
26 import java.io.InputStream;
27 import java.util.HashSet;
28 import java.util.List;
29 import java.util.Set;
30
31 import static org.junit.Assert.assertEquals;
32 import static org.junit.Assert.assertNull;
33 import static org.junit.Assert.assertTrue;
34
35 public class RecursiveParserWrapperTest {
36
37 @Test
38 public void testBasicXML() throws Exception {
39 List<Metadata> list = getMetadata(new Metadata(),
40 new BasicContentHandlerFactory(BasicContentHandlerFactory.HANDLER_TYPE.XML, -1));
41 Metadata container = list.get(0);
42 String content = container.get(RecursiveParserWrapper.TIKA_CONTENT);
43 //not much differentiates html from xml in this test file
44 assertTrue(content.indexOf("<p class=\"header\" />") > -1);
45 }
46
47 @Test
48 public void testBasicHTML() throws Exception {
49 List<Metadata> list = getMetadata(new Metadata(),
50 new BasicContentHandlerFactory(BasicContentHandlerFactory.HANDLER_TYPE.HTML, -1));
51 Metadata container = list.get(0);
52 String content = container.get(RecursiveParserWrapper.TIKA_CONTENT);
53 //not much differentiates html from xml in this test file
54 assertTrue(content.indexOf("<p class=\"header\"></p>") > -1);
55 }
56
57 @Test
58 public void testBasicText() throws Exception {
59 List<Metadata> list = getMetadata(new Metadata(),
60 new BasicContentHandlerFactory(BasicContentHandlerFactory.HANDLER_TYPE.TEXT, -1));
61 Metadata container = list.get(0);
62 String content = container.get(RecursiveParserWrapper.TIKA_CONTENT);
63 assertTrue(content.indexOf("<p ") < 0);
64 assertTrue(content.indexOf("embed_0") > -1);
65 }
66
67 @Test
68 public void testIgnoreContent() throws Exception {
69 List<Metadata> list = getMetadata(new Metadata(),
70 new BasicContentHandlerFactory(BasicContentHandlerFactory.HANDLER_TYPE.IGNORE, -1));
71 Metadata container = list.get(0);
72 String content = container.get(RecursiveParserWrapper.TIKA_CONTENT);
73 assertNull(content);
74 }
75
76
77 @Test
78 public void testCharLimit() throws Exception {
79 ParseContext context = new ParseContext();
80 Metadata metadata = new Metadata();
81
82 Parser wrapped = new AutoDetectParser();
83 RecursiveParserWrapper wrapper = new RecursiveParserWrapper(wrapped,
84 new BasicContentHandlerFactory(BasicContentHandlerFactory.HANDLER_TYPE.TEXT, 60));
85 InputStream stream = RecursiveParserWrapperTest.class.getResourceAsStream(
86 "/test-documents/test_recursive_embedded.docx");
87 wrapper.parse(stream, new DefaultHandler(), metadata, context);
88 List<Metadata> list = wrapper.getMetadata();
89
90 assertEquals(5, list.size());
91
92 int wlr = 0;
93 for (Metadata m : list) {
94 String limitReached = m.get(RecursiveParserWrapper.WRITE_LIMIT_REACHED);
95 if (limitReached != null && limitReached.equals("true")){
96 wlr++;
97 }
98 }
99 assertEquals(1, wlr);
100
101 }
102 @Test
103 public void testMaxEmbedded() throws Exception {
104 int maxEmbedded = 4;
105 int totalNoLimit = 12;//including outer container file
106 ParseContext context = new ParseContext();
107 Metadata metadata = new Metadata();
108 String limitReached = null;
109
110 Parser wrapped = new AutoDetectParser();
111 RecursiveParserWrapper wrapper = new RecursiveParserWrapper(wrapped,
112 new BasicContentHandlerFactory(BasicContentHandlerFactory.HANDLER_TYPE.TEXT, -1));
113
114 InputStream stream = RecursiveParserWrapperTest.class.getResourceAsStream(
115 "/test-documents/test_recursive_embedded.docx");
116 wrapper.parse(stream, new DefaultHandler(), metadata, context);
117 List<Metadata> list = wrapper.getMetadata();
118 //test default
119 assertEquals(totalNoLimit, list.size());
120
121 limitReached = list.get(0).get(RecursiveParserWrapper.EMBEDDED_RESOURCE_LIMIT_REACHED);
122 assertNull(limitReached);
123
124
125 wrapper.reset();
126 stream.close();
127
128 //test setting value
129 metadata = new Metadata();
130 stream = RecursiveParserWrapperTest.class.getResourceAsStream(
131 "/test-documents/test_recursive_embedded.docx");
132 wrapper.setMaxEmbeddedResources(maxEmbedded);
133 wrapper.parse(stream, new DefaultHandler(), metadata, context);
134 list = wrapper.getMetadata();
135
136 //add 1 for outer container file
137 assertEquals(maxEmbedded+1, list.size());
138
139 limitReached = list.get(0).get(RecursiveParserWrapper.EMBEDDED_RESOURCE_LIMIT_REACHED);
140 assertEquals("true", limitReached);
141
142 wrapper.reset();
143 stream.close();
144
145 //test setting value < 0
146 metadata = new Metadata();
147 stream = RecursiveParserWrapperTest.class.getResourceAsStream(
148 "/test-documents/test_recursive_embedded.docx");
149
150 wrapper.setMaxEmbeddedResources(-2);
151 wrapper.parse(stream, new DefaultHandler(), metadata, context);
152 assertEquals(totalNoLimit, list.size());
153 limitReached = list.get(0).get(RecursiveParserWrapper.EMBEDDED_RESOURCE_LIMIT_REACHED);
154 assertNull(limitReached);
155 }
156
157 @Test
158 public void testEmbeddedResourcePath() throws Exception {
159
160 Set<String> targets = new HashSet<String>();
161 targets.add("test_recursive_embedded.docx/embed1.zip");
162 targets.add("test_recursive_embedded.docx/embed1.zip/embed2.zip");
163 targets.add("test_recursive_embedded.docx/embed1.zip/embed2.zip/embed3.zip");
164 targets.add("test_recursive_embedded.docx/embed1.zip/embed2.zip/embed3.zip/embed4.zip");
165 targets.add("test_recursive_embedded.docx/embed1.zip/embed2.zip/embed3.zip/embed4.zip/embed4.txt");
166 targets.add("test_recursive_embedded.docx/embed1.zip/embed2.zip/embed3.zip/embed3.txt");
167 targets.add("test_recursive_embedded.docx/embed1.zip/embed2.zip/embed2a.txt");
168 targets.add("test_recursive_embedded.docx/embed1.zip/embed2.zip/embed2b.txt");
169 targets.add("test_recursive_embedded.docx/embed1.zip/embed1b.txt");
170 targets.add("test_recursive_embedded.docx/embed1.zip/embed1a.txt");
171 targets.add("test_recursive_embedded.docx/image1.emf");
172
173 Metadata metadata = new Metadata();
174 metadata.set(Metadata.RESOURCE_NAME_KEY, "test_recursive_embedded.docx");
175 List<Metadata> list = getMetadata(metadata,
176 new BasicContentHandlerFactory(BasicContentHandlerFactory.HANDLER_TYPE.XML, -1));
177 Metadata container = list.get(0);
178 String content = container.get(RecursiveParserWrapper.TIKA_CONTENT);
179 assertTrue(content.indexOf("<p class=\"header\" />") > -1);
180
181 Set<String> seen = new HashSet<String>();
182 for (Metadata m : list) {
183 String path = m.get(RecursiveParserWrapper.EMBEDDED_RESOURCE_PATH);
184 if (path != null) {
185 seen.add(path);
186 }
187 }
188 assertEquals(targets, seen);
189 }
190
191 private List<Metadata> getMetadata(Metadata metadata, ContentHandlerFactory contentHandlerFactory)
192 throws Exception{
193 ParseContext context = new ParseContext();
194 Parser wrapped = new AutoDetectParser();
195 RecursiveParserWrapper wrapper = new RecursiveParserWrapper(wrapped, contentHandlerFactory);
196 InputStream stream = RecursiveParserWrapperTest.class.getResourceAsStream(
197 "/test-documents/test_recursive_embedded.docx");
198 wrapper.parse(stream, new DefaultHandler(), metadata, context);
199 return wrapper.getMetadata();
200 }
201 }
1616 package org.apache.tika.parser.audio;
1717
1818 import static org.junit.Assert.assertEquals;
19 import static org.junit.Assert.assertTrue;
19 import static org.apache.tika.TikaTest.assertContains;
2020
2121 import org.apache.tika.Tika;
2222 import org.apache.tika.metadata.Metadata;
3636 assertEquals("0", metadata.get("patches"));
3737 assertEquals("PPQ", metadata.get("divisionType"));
3838
39 assertTrue(content.contains("Untitled"));
39 assertContains("Untitled", content);
4040 }
4141 }
1717
1818 import static org.junit.Assert.assertTrue;
1919
20 import java.util.Iterator;
21
20 import org.apache.tika.io.IOUtils;
2221 import org.apache.tika.parser.chm.accessor.ChmDirectoryListingSet;
2322 import org.apache.tika.parser.chm.accessor.ChmItsfHeader;
2423 import org.apache.tika.parser.chm.accessor.ChmItspHeader;
6867 int indexOfControlData = chmDirListCont.getControlDataIndex();
6968
7069 int indexOfResetTable = ChmCommons.indexOfResetTableBlock(data,
71 ChmConstants.LZXC.getBytes());
70 ChmConstants.LZXC.getBytes(IOUtils.UTF_8));
7271 byte[] dir_chunk = null;
7372 if (indexOfResetTable > 0) {
7473 // dir_chunk = Arrays.copyOfRange( data, indexOfResetTable,
1818 import static org.junit.Assert.assertTrue;
1919
2020 import java.io.ByteArrayInputStream;
21 import java.io.File;
22 import java.io.FileInputStream;
2123 import java.io.IOException;
2224 import java.io.InputStream;
25 import java.net.URL;
2326 import java.util.Arrays;
27 import java.util.HashSet;
2428 import java.util.List;
29 import java.util.Locale;
30 import java.util.Set;
2531 import java.util.concurrent.ExecutorService;
2632 import java.util.concurrent.Executors;
27
33 import java.util.regex.Pattern;
34
35 import org.apache.tika.exception.TikaException;
2836 import org.apache.tika.metadata.Metadata;
2937 import org.apache.tika.parser.ParseContext;
3038 import org.apache.tika.parser.Parser;
39 import org.apache.tika.parser.chm.accessor.ChmDirectoryListingSet;
40 import org.apache.tika.parser.chm.accessor.DirectoryListingEntry;
41 import org.apache.tika.parser.chm.core.ChmExtractor;
3142 import org.apache.tika.sax.BodyContentHandler;
3243 import org.junit.Test;
44 import org.xml.sax.SAXException;
3345
3446 public class TestChmExtraction {
3547
3749
3850 private final List<String> files = Arrays.asList(
3951 "/test-documents/testChm.chm",
52 "/test-documents/testChm2.chm",
4053 "/test-documents/testChm3.chm");
4154
4255 @Test
5265 @Test
5366 public void testChmParser() throws Exception{
5467 for (String fileName : files) {
68 InputStream stream = TestChmExtraction.class.getResourceAsStream(fileName);
69 testingChm(stream);
70 }
71 }
72
73 private void testingChm(InputStream stream) throws IOException, SAXException, TikaException {
74 try {
75 BodyContentHandler handler = new BodyContentHandler(-1);
76 parser.parse(stream, handler, new Metadata(), new ParseContext());
77 assertTrue(!handler.toString().isEmpty());
78 } finally {
79 stream.close();
80 }
81 }
82
83 @Test
84 public void testExtractChmEntries() throws TikaException, IOException{
85 for (String fileName : files) {
5586 InputStream stream =
5687 TestChmExtraction.class.getResourceAsStream(fileName);
5788 try {
58 BodyContentHandler handler = new BodyContentHandler(-1);
59 parser.parse(stream, handler, new Metadata(), new ParseContext());
60 assertTrue(!handler.toString().isEmpty());
89 testExtractChmEntry(stream);
6190 } finally {
6291 stream.close();
6392 }
6493 }
6594 }
66
95
96 protected boolean findZero(byte[] textData) {
97 for (byte b : textData) {
98 if (b==0) {
99 return true;
100 }
101 }
102
103 return false;
104 }
105
106 protected boolean niceAscFileName(String name) {
107 for (char c : name.toCharArray()) {
108 if (c>=127 || c<32) {
109 //non-ascii char or control char
110 return false;
111 }
112 }
113
114 return true;
115 }
116
117 protected void testExtractChmEntry(InputStream stream) throws TikaException, IOException{
118 ChmExtractor chmExtractor = new ChmExtractor(stream);
119 ChmDirectoryListingSet entries = chmExtractor.getChmDirList();
120 final Pattern htmlPairP = Pattern.compile("\\Q<html\\E.+\\Q</html>\\E"
121 , Pattern.CASE_INSENSITIVE | Pattern.MULTILINE | Pattern.DOTALL);
122
123 Set<String> names = new HashSet<String>();
124
125 for (DirectoryListingEntry directoryListingEntry : entries.getDirectoryListingEntryList()) {
126 byte[] data = chmExtractor.extractChmEntry(directoryListingEntry);
127
128 //Entry names should be nice. Disable this if the test chm files have bad-looking but valid entry names.
129 if (! niceAscFileName(directoryListingEntry.getName())) {
130 throw new TikaException("Warning: File name contains a non-ASCII char : " + directoryListingEntry.getName());
131 }
132
133 final String lowName = directoryListingEntry.getName().toLowerCase(Locale.ROOT);
134
135 //check for duplicate entry names seen earlier.
136 if (names.contains(lowName)) {
137 throw new TikaException("Duplicate File name detected : " + directoryListingEntry.getName());
138 }
139 names.add(lowName);
140
141 if (lowName.endsWith(".html")
142 || lowName.endsWith(".htm")
143 || lowName.endsWith(".hhk")
144 || lowName.endsWith(".hhc")
145 //|| name.endsWith(".bmp")
146 ) {
147 if (findZero(data)) {
148 throw new TikaException("Xhtml/text file contains '\\0' : " + directoryListingEntry.getName());
149 }
150
151 //validate html
152 String html = new String(data, "ISO-8859-1");
153 if (! htmlPairP.matcher(html).find()) {
154 System.err.println(lowName + " is invalid.");
155 System.err.println(html);
156 throw new TikaException("Invalid xhtml file : " + directoryListingEntry.getName());
157 }
158 // else {
159 // System.err.println(directoryListingEntry.getName() + " is valid.");
160 // }
161 }
162 }
163 }
164
67165
68166 @Test
69167 public void testMultiThreadedChmExtraction() throws InterruptedException {
97195 Thread.sleep(500);
98196 }
99197 }
198
199 @Test
200 public void test_TIKA_1446() throws Exception {
201 URL chmDir = TestChmExtraction.class.getResource("/test-documents/chm/");
202 File chmFolder = new File(chmDir.toURI());
203 for (String fileName : chmFolder.list()) {
204 File file = new File(chmFolder, fileName);
205 InputStream stream = new FileInputStream(file);
206 testingChm(stream);
207 }
208 }
100209 }
1515 */
1616 package org.apache.tika.parser.chm;
1717
18 import static org.junit.Assert.assertEquals;
19 import static org.junit.Assert.assertNotNull;
20
2118 import java.io.ByteArrayInputStream;
22 import java.util.Iterator;
2319 import java.util.List;
24
2520 import org.apache.tika.exception.TikaException;
2621 import org.apache.tika.parser.chm.accessor.ChmDirectoryListingSet;
2722 import org.apache.tika.parser.chm.accessor.DirectoryListingEntry;
2823 import org.apache.tika.parser.chm.core.ChmExtractor;
24 import static org.junit.Assert.assertEquals;
25 import static org.junit.Assert.assertNotNull;
2926 import org.junit.Before;
3027 import org.junit.Test;
3128
5350 @Test
5451 public void testExtractChmEntry() throws TikaException{
5552 ChmDirectoryListingSet entries = chmExtractor.getChmDirList();
53
5654 int count = 0;
5755 for (DirectoryListingEntry directoryListingEntry : entries.getDirectoryListingEntryList()) {
5856 chmExtractor.extractChmEntry(directoryListingEntry);
1818 import static org.junit.Assert.assertEquals;
1919 import static org.junit.Assert.assertTrue;
2020
21 import org.apache.tika.io.IOUtils;
2122 import org.apache.tika.parser.chm.accessor.ChmItsfHeader;
2223 import org.apache.tika.parser.chm.accessor.ChmItspHeader;
2324 import org.apache.tika.parser.chm.core.ChmCommons;
135136 @Test
136137 public void testGetSignature() {
137138 assertEquals(TestParameters.VP_ISTP_SIGNATURE, new String(
138 chmItspHeader.getSignature()));
139 chmItspHeader.getSignature(), IOUtils.UTF_8));
139140 }
140141
141142 @Test
2020 import static org.junit.Assert.assertTrue;
2121
2222 import org.apache.tika.exception.TikaException;
23 import org.apache.tika.io.IOUtils;
2324 import org.apache.tika.parser.chm.accessor.ChmDirectoryListingSet;
2425 import org.apache.tika.parser.chm.accessor.ChmItsfHeader;
2526 import org.apache.tika.parser.chm.accessor.ChmItspHeader;
6364 ChmConstants.CONTROL_DATA);
6465
6566 int indexOfResetTable = ChmCommons.indexOfResetTableBlock(data,
66 ChmConstants.LZXC.getBytes());
67 ChmConstants.LZXC.getBytes(IOUtils.UTF_8));
6768 byte[] dir_chunk = null;
6869 if (indexOfResetTable > 0) {
6970 // dir_chunk = Arrays.copyOfRange( data, indexOfResetTable,
1919 import static org.junit.Assert.assertNotNull;
2020 import static org.junit.Assert.assertTrue;
2121
22 import org.apache.tika.io.IOUtils;
2223 import org.apache.tika.parser.chm.accessor.ChmDirectoryListingSet;
2324 import org.apache.tika.parser.chm.accessor.ChmItsfHeader;
2425 import org.apache.tika.parser.chm.accessor.ChmItspHeader;
5960 int indexOfControlData = chmDirListCont.getControlDataIndex();
6061
6162 int indexOfResetTable = ChmCommons.indexOfResetTableBlock(data,
62 ChmConstants.LZXC.getBytes());
63 ChmConstants.LZXC.getBytes(IOUtils.UTF_8));
6364 byte[] dir_chunk = null;
6465 if (indexOfResetTable > 0) {
6566 // dir_chunk = Arrays.copyOfRange( data, indexOfResetTable,
128129 @Test
129130 public void testGetSignature() {
130131 assertEquals(
131 TestParameters.VP_CONTROL_DATA_SIGNATURE.getBytes().length,
132 TestParameters.VP_CONTROL_DATA_SIGNATURE.getBytes(IOUtils.UTF_8).length,
132133 chmLzxcControlData.getSignature().length);
133134 }
134135
135136 @Test
136137 public void testGetSignaure() {
137138 assertEquals(
138 TestParameters.VP_CONTROL_DATA_SIGNATURE.getBytes().length,
139 TestParameters.VP_CONTROL_DATA_SIGNATURE.getBytes(IOUtils.UTF_8).length,
139140 chmLzxcControlData.getSignature().length);
140141 }
141142
1919 import static org.junit.Assert.assertEquals;
2020 import static org.junit.Assert.assertTrue;
2121
22 import org.apache.tika.io.IOUtils;
2223 import org.apache.tika.parser.chm.accessor.ChmDirectoryListingSet;
2324 import org.apache.tika.parser.chm.accessor.ChmItsfHeader;
2425 import org.apache.tika.parser.chm.accessor.ChmItspHeader;
5859 int indexOfControlData = chmDirListCont.getControlDataIndex();
5960
6061 int indexOfResetTable = ChmCommons.indexOfResetTableBlock(data,
61 ChmConstants.LZXC.getBytes());
62 ChmConstants.LZXC.getBytes(IOUtils.UTF_8));
6263 byte[] dir_chunk = null;
6364 if (indexOfResetTable > 0) {
6465 // dir_chunk = Arrays.copyOfRange( data, indexOfResetTable,
1717
1818 import java.io.IOException;
1919 import java.io.InputStream;
20
2120 import org.apache.tika.io.IOUtils;
2221 import org.apache.tika.parser.chm.core.ChmCommons.EntryType;
2322
9089 static final int VP_CONTROL_DATA_VERSION = 2;
9190 static final int VP_WINDOW_SIZE = 65536;
9291 static final int VP_WINDOWS_PER_RESET = 1;
93 static final int VP_CHM_ENTITIES_NUMBER = 101;
92 static final int VP_CHM_ENTITIES_NUMBER = 100; //updated by Hawking
9493 static final int VP_PMGI_FREE_SPACE = 3;
9594 static final int VP_PMGL_BLOCK_NEXT = -1;
9695 static final int VP_PMGL_BLOCK_PREV = -1;
1818 import static org.junit.Assert.assertEquals;
1919 import static org.junit.Assert.assertTrue;
2020
21 import org.apache.tika.io.IOUtils;
2122 import org.apache.tika.parser.chm.accessor.ChmPmglHeader;
2223 import org.apache.tika.parser.chm.core.ChmCommons;
2324 import org.apache.tika.parser.chm.core.ChmConstants;
4546 @Test
4647 public void testChmPmglHeaderGet() {
4748 assertEquals(TestParameters.VP_PMGL_SIGNATURE, new String(
48 chmPmglHeader.getSignature()));
49 chmPmglHeader.getSignature(), IOUtils.UTF_8));
4950 }
5051
5152 @Test
2323 import java.util.Set;
2424
2525 import org.apache.tika.TikaTest;
26 import org.apache.tika.io.IOUtils;
2627 import org.apache.tika.metadata.Metadata;
2728 import org.apache.tika.metadata.TikaCoreProperties;
2829 import org.apache.tika.mime.MediaType;
6162 assertTrue(textContent.length() > 0);
6263 assertTrue(textContent.indexOf("html") < 0);
6364
64 textContent = getText(new ByteArrayInputStream("public class HelloWorld {}".getBytes()), sourceCodeParser, createMetadata("text/x-java-source"));
65 textContent = getText(new ByteArrayInputStream("public class HelloWorld {}".getBytes(IOUtils.UTF_8)), sourceCodeParser, createMetadata("text/x-java-source"));
6566 assertTrue(textContent.length() > 0);
6667 assertTrue(textContent.indexOf("html") < 0);
6768 }
1717
1818 import static org.junit.Assert.assertEquals;
1919 import static org.junit.Assert.assertNull;
20 import static org.junit.Assert.assertTrue;
20 import static org.apache.tika.TikaTest.assertContains;
2121
2222 import java.io.InputStream;
2323
132132 metadata.get(Metadata.SUBJECT));
133133
134134 String content = handler.toString();
135 assertTrue(content.contains("The quick brown fox jumps over the lazy dog"));
136 assertTrue(content.contains("Gym class"));
137 assertTrue(content.contains("www.alfresco.com"));
135 assertContains("The quick brown fox jumps over the lazy dog", content);
136 assertContains("Gym class", content);
137 assertContains("www.alfresco.com", content);
138138 } finally {
139139 input.close();
140140 }
158158 assertNull(metadata.get(TikaCoreProperties.RELATION));
159159
160160 String content = handler.toString();
161 assertTrue(content.contains(""));
161 assertEquals("", content);
162162 } finally {
163163 input.close();
164164 }
195195 metadata.get("MyCustomProperty"));
196196
197197 String content = handler.toString();
198 assertTrue(content.contains("This is a comment"));
199 assertTrue(content.contains("mycompany"));
198 assertContains("This is a comment", content);
199 assertContains("mycompany", content);
200200 } finally {
201201 input.close();
202202 }
1616
1717 package org.apache.tika.parser.envi;
1818
19 //Junit imports
19 import static org.apache.tika.TikaTest.assertContains;
2020 import static org.junit.Assert.assertNotNull;
21 import static org.junit.Assert.assertTrue;
2221
23 import org.apache.tika.sax.ToXMLContentHandler;
24 import org.junit.Test;
22 import java.io.InputStream;
2523
2624 import org.apache.tika.metadata.Metadata;
2725 import org.apache.tika.parser.ParseContext;
2826 import org.apache.tika.parser.Parser;
27 import org.apache.tika.sax.ToXMLContentHandler;
28 import org.junit.Test;
2929
30 import java.io.InputStream;
31
32 /*
30 /**
3331 * Test cases to exercise the {@link EnviHeaderParser}.
34 *
3532 */
3633 public class EnviHeaderParserTest {
37 @Test
38 public void testParseGlobalMetadata() throws Exception {
39 if (System.getProperty("java.version").startsWith("1.5")) {
40 return;
41 }
34 @Test
35 public void testParseGlobalMetadata() throws Exception {
36 if (System.getProperty("java.version").startsWith("1.5")) {
37 return;
38 }
4239
43 Parser parser = new EnviHeaderParser();
44 ToXMLContentHandler handler = new ToXMLContentHandler();
45 Metadata metadata = new Metadata();
40 Parser parser = new EnviHeaderParser();
41 ToXMLContentHandler handler = new ToXMLContentHandler();
42 Metadata metadata = new Metadata();
4643
47 InputStream stream = EnviHeaderParser.class
48 .getResourceAsStream("/test-documents/envi_test_header.hdr");
49 assertNotNull("Test ENVI file not found", stream);
50 try {
51 parser.parse(stream, handler, metadata, new ParseContext());
52 } finally {
53 stream.close();
54 }
44 InputStream stream = EnviHeaderParser.class
45 .getResourceAsStream("/test-documents/envi_test_header.hdr");
46 assertNotNull("Test ENVI file not found", stream);
47 try {
48 parser.parse(stream, handler, metadata, new ParseContext());
49 } finally {
50 stream.close();
51 }
5552
56 // Check content of test file
57 String content = handler.toString();
58 assertTrue(content.contains("<body><p>ENVI</p>"));
59 assertTrue(content.contains("<p>samples = 2400</p>"));
60 assertTrue(content.contains("<p>lines = 2400</p>"));
61 assertTrue(content.contains("<p>map info = {Sinusoidal, 1.5000, 1.5000, -10007091.3643, 5559289.2856, 4.6331271653e+02, 4.6331271653e+02, , units=Meters}</p>"));
62 assertTrue(content.contains("content=\"application/envi.hdr\""));
63 assertTrue(content
64 .contains("projection info = {16, 6371007.2, 0.000000, 0.0, 0.0, Sinusoidal, units=Meters}"));
65 }
53 // Check content of test file
54 String content = handler.toString();
55 assertContains("<body><p>ENVI</p>", content);
56 assertContains("<p>samples = 2400</p>", content);
57 assertContains("<p>lines = 2400</p>", content);
58 assertContains("<p>map info = {Sinusoidal, 1.5000, 1.5000, -10007091.3643, 5559289.2856, 4.6331271653e+02, 4.6331271653e+02, , units=Meters}</p>", content);
59 assertContains("content=\"application/envi.hdr\"", content);
60 assertContains("projection info = {16, 6371007.2, 0.000000, 0.0, 0.0, Sinusoidal, units=Meters}", content);
61 }
6662 }
1616 package org.apache.tika.parser.epub;
1717
1818 import static org.junit.Assert.assertEquals;
19 import static org.junit.Assert.assertTrue;
19 import static org.apache.tika.TikaTest.assertContains;
2020
2121 import java.io.InputStream;
2222
4848 metadata.get(TikaCoreProperties.PUBLISHER));
4949
5050 String content = handler.toString();
51 assertTrue(content.contains("Plus a simple div"));
52 assertTrue(content.contains("First item"));
53 assertTrue(content.contains("The previous headings were subchapters"));
54 assertTrue(content.contains("Table data"));
51 assertContains("Plus a simple div", content);
52 assertContains("First item", content);
53 assertContains("The previous headings were subchapters", content);
54 assertContains("Table data", content);
5555 } finally {
5656 input.close();
5757 }
1515 */
1616 package org.apache.tika.parser.font;
1717
18 import static org.apache.tika.TikaTest.assertContains;
19 import static org.apache.tika.parser.font.AdobeFontMetricParser.MET_FONT_FAMILY_NAME;
20 import static org.apache.tika.parser.font.AdobeFontMetricParser.MET_FONT_FULL_NAME;
21 import static org.apache.tika.parser.font.AdobeFontMetricParser.MET_FONT_NAME;
22 import static org.apache.tika.parser.font.AdobeFontMetricParser.MET_FONT_SUB_FAMILY_NAME;
23 import static org.apache.tika.parser.font.AdobeFontMetricParser.MET_FONT_VERSION;
24 import static org.apache.tika.parser.font.AdobeFontMetricParser.MET_FONT_WEIGHT;
25 import static org.apache.tika.parser.font.AdobeFontMetricParser.MET_PS_NAME;
1826 import static org.junit.Assert.assertEquals;
19 import static org.junit.Assert.assertTrue;
2027
21 import java.util.TimeZone;
22
28 import org.apache.tika.io.TikaInputStream;
2329 import org.apache.tika.metadata.Metadata;
2430 import org.apache.tika.metadata.TikaCoreProperties;
2531 import org.apache.tika.parser.AutoDetectParser;
2632 import org.apache.tika.parser.ParseContext;
2733 import org.apache.tika.parser.Parser;
2834 import org.apache.tika.sax.BodyContentHandler;
35 import org.junit.Test;
2936 import org.xml.sax.ContentHandler;
30 import org.apache.tika.io.TikaInputStream;
31 import org.junit.Test;
32
33 import static org.apache.tika.parser.font.AdobeFontMetricParser.*;
3437
3538 /**
3639 * Test case for parsing various different font files.
6669 String content = handler.toString();
6770
6871 // Test that the comments got extracted
69 assertTrue(content.contains("Comments"));
70 assertTrue(content.contains("This is a comment in a sample file"));
71 assertTrue(content.contains("UniqueID 12345"));
72 assertContains("Comments", content);
73 assertContains("This is a comment in a sample file", content);
74 assertContains("UniqueID 12345", content);
7275 }
7376
7477 @Test
7780 ContentHandler handler = new BodyContentHandler();
7881 Metadata metadata = new Metadata();
7982 ParseContext context = new ParseContext();
83 //Open Sans font is ASL 2.0 according to
84 //http://www.google.com/fonts/specimen/Open+Sans
85 //...despite the copyright in the file's metadata.
8086 TikaInputStream stream = TikaInputStream.get(
8187 FontParsersTest.class.getResource(
82 "/test-documents/testTrueType.ttf"));
83
84 //Pending PDFBOX-2122's integration (PDFBox 1.8.6)
85 //we must set the default timezone to something
86 //standard for this test.
87 //TODO: once we upgrade to PDFBox 1.8.6, remove
88 //this timezone code.
89 TimeZone defaultTimeZone = TimeZone.getDefault();
90 TimeZone.setDefault(TimeZone.getTimeZone("UTC"));
91
88 "/test-documents/testTrueType3.ttf"));
89
9290 try {
9391 parser.parse(stream, handler, metadata, context);
9492 } finally {
95 //make sure to reset default timezone
96 TimeZone.setDefault(defaultTimeZone);
9793 stream.close();
9894 }
9995
10096 assertEquals("application/x-font-ttf", metadata.get(Metadata.CONTENT_TYPE));
101 assertEquals("NewBaskervilleEF-Roman", metadata.get(TikaCoreProperties.TITLE));
97 assertEquals("Open Sans Bold", metadata.get(TikaCoreProperties.TITLE));
10298
103 assertEquals("1904-01-01T00:00:00Z", metadata.get(Metadata.CREATION_DATE));
104 assertEquals("1904-01-01T00:00:00Z", metadata.get(TikaCoreProperties.CREATED));
105 assertEquals("1904-01-01T00:00:00Z", metadata.get(TikaCoreProperties.MODIFIED));
99 assertEquals("2010-12-30T11:04:00Z", metadata.get(Metadata.CREATION_DATE));
100 assertEquals("2010-12-30T11:04:00Z", metadata.get(TikaCoreProperties.CREATED));
101 assertEquals("2011-05-05T12:37:53Z", metadata.get(TikaCoreProperties.MODIFIED));
106102
107 assertEquals("NewBaskervilleEF-Roman", metadata.get(MET_FONT_NAME));
108 assertEquals("NewBaskerville", metadata.get(MET_FONT_FAMILY_NAME));
109 assertEquals("Regular", metadata.get(MET_FONT_SUB_FAMILY_NAME));
110 assertEquals("NewBaskervilleEF-Roman", metadata.get(MET_PS_NAME));
103 assertEquals("Open Sans Bold", metadata.get(MET_FONT_NAME));
104 assertEquals("Open Sans", metadata.get(MET_FONT_FAMILY_NAME));
105 assertEquals("Bold", metadata.get(MET_FONT_SUB_FAMILY_NAME));
106 assertEquals("OpenSans-Bold", metadata.get(MET_PS_NAME));
111107
112 assertEquals("Copyright", metadata.get("Copyright").substring(0, 9));
113 assertEquals("ITC New Baskerville", metadata.get("Trademark").substring(0, 19));
108 assertEquals("Digitized", metadata.get("Copyright").substring(0, 9));
109 assertEquals("Open Sans", metadata.get("Trademark").substring(0, 9));
114110
115111 // Not extracted
116112 assertEquals(null, metadata.get(MET_FONT_FULL_NAME));
1515 */
1616 package org.apache.tika.parser.fork;
1717
18 import static org.apache.tika.TikaTest.assertContains;
1819 import static org.junit.Assert.assertEquals;
1920 import static org.junit.Assert.assertNotNull;
20 import static org.junit.Assert.assertTrue;
2121 import static org.junit.Assert.fail;
2222
2323 import java.io.IOException;
6565 parser.parse(stream, output, new Metadata(), context);
6666
6767 String content = output.toString();
68 assertTrue(content.contains("Test d'indexation"));
69 assertTrue(content.contains("http://www.apache.org"));
68 assertContains("Test d'indexation", content);
69 assertContains("http://www.apache.org", content);
7070 } finally {
7171 parser.close();
7272 }
119119 for (StackTraceElement ste : e.getStackTrace()) {
120120 if (ste.getClassName().equals(ForkParser.class.getName())) {
121121 found = true;
122 break;
122123 }
123124 }
124125 if (!found) {
223224 ForkParser parser = new ForkParser(
224225 ForkParserIntegrationTest.class.getClassLoader(),
225226 tika.getParser());
226 parser.setJavaCommand(
227 "java -Xmx32m -Xdebug -Xrunjdwp:"
228 + "transport=dt_socket,address=54321,server=y,suspend=n");
227 parser.setJavaCommand(Arrays.asList("java", "-Xmx32m", "-Xdebug",
228 "-Xrunjdwp:transport=dt_socket,address=54321,server=y,suspend=n"));
229229 try {
230230 ContentHandler body = new BodyContentHandler();
231231 InputStream stream = ForkParserIntegrationTest.class.getResourceAsStream(
232232 "/test-documents/testTXT.txt");
233233 parser.parse(stream, body, new Metadata(), context);
234234 String content = body.toString();
235 assertTrue(content.contains("Test d'indexation"));
236 assertTrue(content.contains("http://www.apache.org"));
235 assertContains("Test d'indexation", content);
236 assertContains("http://www.apache.org", content);
237237 } finally {
238238 parser.close();
239239 }
256256 parser.parse(stream, output, new Metadata(), context);
257257
258258 String content = output.toString();
259 assertTrue(content.contains("Apache Tika"));
260 assertTrue(content.contains("Tika - Content Analysis Toolkit"));
261 assertTrue(content.contains("incubator"));
262 assertTrue(content.contains("Apache Software Foundation"));
259 assertContains("Apache Tika", content);
260 assertContains("Tika - Content Analysis Toolkit", content);
261 assertContains("incubator", content);
262 assertContains("Apache Software Foundation", content);
263263 } finally {
264264 parser.close();
265265 }
0 /**
1 * Licensed to the Apache Software Foundation (ASF) under one or more
2 * contributor license agreements. See the NOTICE file distributed with
3 * this work for additional information regarding copyright ownership.
4 * The ASF licenses this file to You under the Apache License, Version 2.0
5 * (the "License"); you may not use this file except in compliance with
6 * the License. You may obtain a copy of the License at
7 *
8 * http://www.apache.org/licenses/LICENSE-2.0
9 *
10 * Unless required by applicable law or agreed to in writing, software
11 * distributed under the License is distributed on an "AS IS" BASIS,
12 * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
13 * See the License for the specific language governing permissions and
14 * limitations under the License.
15 */
16
17 package org.apache.tika.parser.gdal;
18
19 //JDK imports
20
21 import java.io.InputStream;
22
23 //Tika imports
24 import org.apache.tika.TikaTest;
25 import org.apache.tika.metadata.Metadata;
26 import org.apache.tika.parser.ParseContext;
27 import org.apache.tika.parser.external.ExternalParser;
28 import org.apache.tika.sax.BodyContentHandler;
29
30 //Junit imports
31 import org.junit.Test;
32
33 import static org.junit.Assert.fail;
34 import static org.junit.Assert.assertTrue;
35 import static org.junit.Assert.assertEquals;
36 import static org.junit.Assert.assertNotNull;
37 import static org.junit.Assume.assumeTrue;
38
39 /**
40 * Test harness for the GDAL parser.
41 */
42 public class TestGDALParser extends TikaTest {
43
44 private boolean canRun() {
45 String[] checkCmd = {"gdalinfo"};
46 // If GDAL is not on the path, do not run the test.
47 return ExternalParser.check(checkCmd);
48 }
49
50 @Test
51 public void testParseBasicInfo() {
52 assumeTrue(canRun());
53 final String expectedDriver = "netCDF/Network Common Data Format";
54 final String expectedUpperRight = "512.0, 0.0";
55 final String expectedUpperLeft = "0.0, 0.0";
56 final String expectedLowerLeft = "0.0, 512.0";
57 final String expectedLowerRight = "512.0, 512.0";
58 final String expectedCoordinateSystem = "`'";
59 final String expectedSize = "512, 512";
60
61 GDALParser parser = new GDALParser();
62 InputStream stream = TestGDALParser.class
63 .getResourceAsStream("/test-documents/sresa1b_ncar_ccsm3_0_run1_200001.nc");
64 Metadata met = new Metadata();
65 BodyContentHandler handler = new BodyContentHandler();
66 try {
67 parser.parse(stream, handler, met, new ParseContext());
68 assertNotNull(met);
69 assertNotNull(met.get("Driver"));
70 assertEquals(expectedDriver, met.get("Driver"));
71 assertNotNull(met.get("Files"));
72 assertNotNull(met.get("Coordinate System"));
73 assertEquals(expectedCoordinateSystem, met.get("Coordinate System"));
74 assertNotNull(met.get("Size"));
75 assertEquals(expectedSize, met.get("Size"));
76 assertNotNull(met.get("Upper Right"));
77 assertEquals(expectedUpperRight, met.get("Upper Right"));
78 assertNotNull(met.get("Upper Left"));
79 assertEquals(expectedUpperLeft, met.get("Upper Left"));
80 assertNotNull(met.get("Upper Right"));
81 assertEquals(expectedLowerRight, met.get("Lower Right"));
82 assertNotNull(met.get("Upper Right"));
83 assertEquals(expectedLowerLeft, met.get("Lower Left"));
84 } catch (Exception e) {
85 e.printStackTrace();
86 fail(e.getMessage());
87 }
88 }
89
90 @Test
91 public void testParseMetadata() {
92 assumeTrue(canRun());
93 final String expectedNcInst = "NCAR (National Center for Atmospheric Research, Boulder, CO, USA)";
94 final String expectedModelNameEnglish = "NCAR CCSM";
95 final String expectedProgramId = "Source file unknown Version unknown Date unknown";
96 final String expectedProjectId = "IPCC Fourth Assessment";
97 final String expectedRealization = "1";
98 final String expectedTitle = "model output prepared for IPCC AR4";
99 final String expectedSub8Name = "\":ua";
100 final String expectedSub8Desc = "[1x17x128x256] eastward_wind (32-bit floating-point)";
101
102 GDALParser parser = new GDALParser();
103 InputStream stream = TestGDALParser.class
104 .getResourceAsStream("/test-documents/sresa1b_ncar_ccsm3_0_run1_200001.nc");
105 Metadata met = new Metadata();
106 BodyContentHandler handler = new BodyContentHandler();
107 try {
108 parser.parse(stream, handler, met, new ParseContext());
109 assertNotNull(met);
110 assertNotNull(met.get("NC_GLOBAL#institution"));
111 assertEquals(expectedNcInst, met.get("NC_GLOBAL#institution"));
112 assertNotNull(met.get("NC_GLOBAL#model_name_english"));
113 assertEquals(expectedModelNameEnglish,
114 met.get("NC_GLOBAL#model_name_english"));
115 assertNotNull(met.get("NC_GLOBAL#prg_ID"));
116 assertEquals(expectedProgramId, met.get("NC_GLOBAL#prg_ID"));
117 assertNotNull(met.get("NC_GLOBAL#prg_ID"));
118 assertEquals(expectedProgramId, met.get("NC_GLOBAL#prg_ID"));
119 assertNotNull(met.get("NC_GLOBAL#project_id"));
120 assertEquals(expectedProjectId, met.get("NC_GLOBAL#project_id"));
121 assertNotNull(met.get("NC_GLOBAL#realization"));
122 assertEquals(expectedRealization, met.get("NC_GLOBAL#realization"));
123 assertNotNull(met.get("NC_GLOBAL#title"));
124 assertEquals(expectedTitle, met.get("NC_GLOBAL#title"));
125 assertNotNull(met.get("SUBDATASET_8_NAME"));
126 assertTrue(met.get("SUBDATASET_8_NAME").endsWith(expectedSub8Name));
127 assertNotNull(met.get("SUBDATASET_8_DESC"));
128 assertEquals(expectedSub8Desc, met.get("SUBDATASET_8_DESC"));
129 } catch (Exception e) {
130 e.printStackTrace();
131 fail(e.getMessage());
132 }
133 }
134
135 @Test
136 public void testParseFITS() {
137 String fitsFilename = "/test-documents/WFPC2u5780205r_c0fx.fits";
138
139 assumeTrue(canRun());
140 // If the exit code is 1 (meaning FITS isn't supported by the installed version of gdalinfo), don't run this test.
141 String[] fitsCommand = {"gdalinfo", TestGDALParser.class.getResource(fitsFilename).getPath()};
142 assumeTrue(ExternalParser.check(fitsCommand, 1));
143
144 String expectedAllgMin = "-7.319537E1";
145 String expectedAtodcorr = "COMPLETE";
146 String expectedAtodfile = "uref$dbu1405iu.r1h";
147 String expectedCalVersion = " ";
148 String expectedCalibDef = "1466";
149
150 GDALParser parser = new GDALParser();
151 InputStream stream = TestGDALParser.class
152 .getResourceAsStream(fitsFilename);
153 Metadata met = new Metadata();
154 BodyContentHandler handler = new BodyContentHandler();
155 try {
156 parser.parse(stream, handler, met, new ParseContext());
157 assertNotNull(met);
158 assertNotNull(met.get("ALLG-MIN"));
159 assertEquals(expectedAllgMin, met.get("ALLG-MIN"));
160 assertNotNull(met.get("ATODCORR"));
161 assertEquals(expectedAtodcorr, met.get("ATODCORR"));
162 assertNotNull(met.get("ATODFILE"));
163 assertEquals(expectedAtodfile, met.get("ATODFILE"));
164 assertNotNull(met.get("CAL_VER"));
165 assertEquals(expectedCalVersion, met.get("CAL_VER"));
166 assertNotNull(met.get("CALIBDEF"));
167 assertEquals(expectedCalibDef, met.get("CALIBDEF"));
168
169 } catch (Exception e) {
170 e.printStackTrace();
171 fail(e.getMessage());
172 }
173 }
174 }
0 /*
1 * Licensed to the Apache Software Foundation (ASF) under one or more
2 * contributor license agreements. See the NOTICE file distributed with
3 * this work for additional information regarding copyright ownership.
4 * The ASF licenses this file to You under the Apache License, Version 2.0
5 * (the "License"); you may not use this file except in compliance with
6 * the License. You may obtain a copy of the License at
7 *
8 * http://www.apache.org/licenses/LICENSE-2.0
9 *
10 * Unless required by applicable law or agreed to in writing, software
11 * distributed under the License is distributed on an "AS IS" BASIS,
12 * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
13 * See the License for the specific language governing permissions and
14 * limitations under the License.
15 */
16
17 package org.apache.tika.parser.grib;
18
19 //JDK imports
20 import static org.junit.Assert.*;
21 import java.io.InputStream;
22
23 //TIKA imports
24 import org.apache.tika.metadata.Metadata;
25 import org.apache.tika.metadata.TikaCoreProperties;
26 import org.apache.tika.parser.ParseContext;
27 import org.apache.tika.parser.Parser;
28 import org.apache.tika.sax.BodyContentHandler;
29 import org.junit.Test;
30 import org.xml.sax.ContentHandler;
31 import java.io.File;
32 /**
33 * Test cases to exercise the {@link org.apache.tika.parser.grib.GribParser}.
34 */
35
36 public class GribParserTest {
37
38 @Test
39 public void testParseGlobalMetadata() throws Exception {
40 Parser parser = new GribParser();
41 Metadata metadata = new Metadata();
42 ContentHandler handler = new BodyContentHandler();
43 InputStream stream = GribParser.class.getResourceAsStream("/test-documents/gdas1.forecmwf.2014062612.grib2");
44 try {
45 parser.parse(stream, handler, metadata, new ParseContext());
46 } finally {
47 stream.close();
48 }
49 assertNotNull(metadata);
50 String content = handler.toString();
51 assertTrue(content.contains("dimensions:"));
52 assertTrue(content.contains("variables:"));
53 }
54 }
55
9393 assertNotNull(metadata);
9494 assertEquals("Direct read of HDF4 file through CDM library", metadata.get("_History"));
9595 assertEquals("Ascending", metadata.get("Pass"));
96 assertEquals("Hierarchical Data Format, version 4",
97 metadata.get("File-Type-Description"));
9698 }
9799 }
1515 */
1616 package org.apache.tika.parser.html;
1717
18 import static org.apache.tika.TikaTest.assertContains;
1819 import static org.junit.Assert.assertEquals;
1920 import static org.junit.Assert.assertFalse;
2021 import static org.junit.Assert.assertNotNull;
2122 import static org.junit.Assert.assertTrue;
2223
24 import javax.xml.transform.OutputKeys;
25 import javax.xml.transform.sax.SAXTransformerFactory;
26 import javax.xml.transform.sax.TransformerHandler;
27 import javax.xml.transform.stream.StreamResult;
2328 import java.io.ByteArrayInputStream;
2429 import java.io.IOException;
2530 import java.io.InputStream;
2934 import java.util.List;
3035 import java.util.regex.Pattern;
3136
32 import javax.xml.transform.OutputKeys;
33 import javax.xml.transform.sax.SAXTransformerFactory;
34 import javax.xml.transform.sax.TransformerHandler;
35 import javax.xml.transform.stream.StreamResult;
36
3737 import org.apache.tika.Tika;
3838 import org.apache.tika.exception.TikaException;
39 import org.apache.tika.io.IOUtils;
3940 import org.apache.tika.metadata.Geographic;
4041 import org.apache.tika.metadata.Metadata;
4142 import org.apache.tika.metadata.TikaCoreProperties;
43 import org.apache.tika.parser.AutoDetectParser;
4244 import org.apache.tika.parser.ParseContext;
4345 import org.apache.tika.sax.BodyContentHandler;
4446 import org.apache.tika.sax.LinkContentHandler;
132134 String content = new Tika().parseToString(
133135 HtmlParserTest.class.getResourceAsStream(path), metadata);
134136
135 assertEquals("application/xhtml+xml", metadata.get(Metadata.CONTENT_TYPE));
137 //can't specify charset because default differs between OS's
138 assertTrue(metadata.get(Metadata.CONTENT_TYPE).startsWith("application/xhtml+xml; charset="));
136139 assertEquals("XHTML test document", metadata.get(TikaCoreProperties.TITLE));
137140
138141 assertEquals("Tika Developers", metadata.get("Author"));
139142 assertEquals("5", metadata.get("refresh"));
140 assertTrue(content.contains("ability of Apache Tika"));
141 assertTrue(content.contains("extract content"));
142 assertTrue(content.contains("an XHTML document"));
143 assertContains("ability of Apache Tika", content);
144 assertContains("extract content", content);
145 assertContains("an XHTML document", content);
143146 }
144147
145148 @Test
147150 ContentHandler handler = new BodyContentHandler();
148151 new HtmlParser().parse(
149152 new ByteArrayInputStream(new byte[0]),
150 handler, new Metadata(), new ParseContext());
153 handler, new Metadata(), new ParseContext());
151154 assertEquals("", handler.toString());
152155 }
153156
159162 public void testCharactersDirectlyUnderBodyElement() throws Exception {
160163 String test = "<html><body>test</body></html>";
161164 String content = new Tika().parseToString(
162 new ByteArrayInputStream(test.getBytes("UTF-8")));
165 new ByteArrayInputStream(test.getBytes(IOUtils.UTF_8)));
163166 assertEquals("test", content);
164167 }
165168
216219 + "<body><a href=\"" + relative + "\">test</a></body></html>";
217220 final List<String> links = new ArrayList<String>();
218221 new HtmlParser().parse(
219 new ByteArrayInputStream(test.getBytes("UTF-8")),
222 new ByteArrayInputStream(test.getBytes(IOUtils.UTF_8)),
220223 new DefaultHandler() {
221224 @Override
222225 public void startElement(
241244 String test =
242245 "<html><body><table><tr><td>a</td><td>b</td></table></body></html>";
243246 String content = new Tika().parseToString(
244 new ByteArrayInputStream(test.getBytes("UTF-8")));
245 assertTrue(content.contains("a"));
246 assertTrue(content.contains("b"));
247 new ByteArrayInputStream(test.getBytes(IOUtils.UTF_8)));
248 assertContains("a", content);
249 assertContains("b", content);
247250 assertFalse(content.contains("ab"));
248251 }
249252
259262 + "<title>the name is \u00e1ndre</title>"
260263 + "</head><body></body></html>";
261264 Metadata metadata = new Metadata();
262 new HtmlParser().parse (
265 new HtmlParser().parse(
263266 new ByteArrayInputStream(test.getBytes("ISO-8859-1")),
264 new BodyContentHandler(), metadata, new ParseContext());
267 new BodyContentHandler(), metadata, new ParseContext());
265268 assertEquals("ISO-8859-1", metadata.get(Metadata.CONTENT_ENCODING));
266269 }
267270
292295 "<html><head><title>\u017d</title></head><body></body></html>";
293296 Metadata metadata = new Metadata();
294297 new HtmlParser().parse (
295 new ByteArrayInputStream(test.getBytes("UTF-8")),
298 new ByteArrayInputStream(test.getBytes(IOUtils.UTF_8)),
296299 new BodyContentHandler(), metadata, new ParseContext());
297300 assertEquals("\u017d", metadata.get(TikaCoreProperties.TITLE));
298301 }
309312
310313 Metadata metadata = new Metadata();
311314 new HtmlParser().parse (
312 new ByteArrayInputStream(test.getBytes("UTF-8")),
315 new ByteArrayInputStream(test.getBytes(IOUtils.UTF_8)),
313316 new BodyContentHandler(), metadata, new ParseContext());
314317 assertEquals("UTF-8", metadata.get(Metadata.CONTENT_ENCODING));
315318
350353 String test = "<html><title>Simple Content</title><body></body></html>";
351354 Metadata metadata = new Metadata();
352355 metadata.add(Metadata.CONTENT_LANGUAGE, "en");
353 new HtmlParser().parse (
354 new ByteArrayInputStream(test.getBytes("UTF-8")),
355 new BodyContentHandler(), metadata, new ParseContext());
356 new HtmlParser().parse(
357 new ByteArrayInputStream(test.getBytes(IOUtils.UTF_8)),
358 new BodyContentHandler(), metadata, new ParseContext());
356359
357360 assertEquals("en", metadata.get(Metadata.CONTENT_LANGUAGE));
358361 }
399402
400403 Metadata metadata = new Metadata();
401404 new HtmlParser().parse (
402 new ByteArrayInputStream(test.getBytes("UTF-8")),
405 new ByteArrayInputStream(test.getBytes(IOUtils.UTF_8)),
403406 new BodyContentHandler(), metadata, new ParseContext());
404407 assertEquals("UTF-8", metadata.get(Metadata.CONTENT_ENCODING));
405408
461464
462465 StringWriter sw = new StringWriter();
463466 new HtmlParser().parse(
464 new ByteArrayInputStream(test.getBytes("UTF-8")),
467 new ByteArrayInputStream(test.getBytes(IOUtils.UTF_8)),
465468 makeHtmlTransformer(sw), new Metadata(), new ParseContext());
466469
467470 String result = sw.toString();
498501
499502 StringWriter sw = new StringWriter();
500503 new HtmlParser().parse(
501 new ByteArrayInputStream(test.getBytes("UTF-8")),
504 new ByteArrayInputStream(test.getBytes(IOUtils.UTF_8)),
502505 makeHtmlTransformer(sw), new Metadata(), new ParseContext());
503506
504507 String result = sw.toString();
519522
520523 StringWriter sw = new StringWriter();
521524 new HtmlParser().parse(
522 new ByteArrayInputStream(test.getBytes("UTF-8")),
525 new ByteArrayInputStream(test.getBytes(IOUtils.UTF_8)),
523526 makeHtmlTransformer(sw), new Metadata(), new ParseContext());
524527
525528 String result = sw.toString();
541544
542545 StringWriter sw = new StringWriter();
543546 new HtmlParser().parse(
544 new ByteArrayInputStream(test.getBytes("UTF-8")),
547 new ByteArrayInputStream(test.getBytes(IOUtils.UTF_8)),
545548 makeHtmlTransformer(sw), new Metadata(), new ParseContext());
546549
547550 String result = sw.toString();
564567
565568 StringWriter sw = new StringWriter();
566569 new HtmlParser().parse(
567 new ByteArrayInputStream(test.getBytes("UTF-8")),
570 new ByteArrayInputStream(test.getBytes(IOUtils.UTF_8)),
568571 makeHtmlTransformer(sw), new Metadata(), new ParseContext());
569572
570573 String result = sw.toString();
587590
588591 StringWriter sw = new StringWriter();
589592 new HtmlParser().parse(
590 new ByteArrayInputStream(test.getBytes("UTF-8")),
593 new ByteArrayInputStream(test.getBytes(IOUtils.UTF_8)),
591594 makeHtmlTransformer(sw), new Metadata(), new ParseContext());
592595
593596 String result = sw.toString();
613616
614617 StringWriter sw = new StringWriter();
615618 new HtmlParser().parse(
616 new ByteArrayInputStream(test.getBytes("UTF-8")),
619 new ByteArrayInputStream(test.getBytes(IOUtils.UTF_8)),
617620 makeHtmlTransformer(sw), metadata, new ParseContext());
618621
619622 String result = sw.toString();
635638
636639 StringWriter sw1 = new StringWriter();
637640 new HtmlParser().parse(
638 new ByteArrayInputStream(test1.getBytes("UTF-8")),
641 new ByteArrayInputStream(test1.getBytes(IOUtils.UTF_8)),
639642 makeHtmlTransformer(sw1), new Metadata(), new ParseContext());
640643
641644 String result = sw1.toString();
656659
657660 StringWriter sw2 = new StringWriter();
658661 new HtmlParser().parse(
659 new ByteArrayInputStream(test2.getBytes("UTF-8")),
662 new ByteArrayInputStream(test2.getBytes(IOUtils.UTF_8)),
660663 makeHtmlTransformer(sw2), new Metadata(), new ParseContext());
661664
662665 result = sw2.toString();
708711
709712 StringWriter sw = new StringWriter();
710713 new HtmlParser().parse(
711 new ByteArrayInputStream(test.getBytes("UTF-8")),
714 new ByteArrayInputStream(test.getBytes(IOUtils.UTF_8)),
712715 makeHtmlTransformer(sw), new Metadata(), new ParseContext());
713716
714717 String result = sw.toString();
790793 StringWriter sw = new StringWriter();
791794
792795 new HtmlParser().parse (
793 new ByteArrayInputStream(html.getBytes("UTF-8")),
796 new ByteArrayInputStream(html.getBytes(IOUtils.UTF_8)),
794797 makeHtmlTransformer(sw), metadata, parseContext);
795798
796799 String result = sw.toString();
811814
812815 BodyContentHandler handler = new BodyContentHandler();
813816 new HtmlParser().parse(
814 new ByteArrayInputStream(html.getBytes("UTF-8")),
817 new ByteArrayInputStream(html.getBytes(IOUtils.UTF_8)),
815818 handler, new Metadata(), new ParseContext());
816819
817820 // Make sure we get <tab>, "one", newline, newline
844847 assertFalse(content.contains("item_aitem_b"));
845848
846849 // Should contain the two list items with a newline in between.
847 assertTrue(content.contains("item_a\nitem_b"));
850 assertContains("item_a\nitem_b", content);
848851
849852 // Should contain 有什么需要我帮你的 (can i help you) without whitespace
850 assertTrue(content.contains("有什么需要我帮你的"));
853 assertContains("有什么需要我帮你的", content);
851854 }
852855
853856 /**
865868 + "<title>hello</title>"
866869 + "</head><body></body></html>";
867870 Metadata metadata = new Metadata();
868 new HtmlParser().parse (
871 new HtmlParser().parse(
869872 new ByteArrayInputStream(test1.getBytes("ISO-8859-1")),
870 new BodyContentHandler(), metadata, new ParseContext());
873 new BodyContentHandler(), metadata, new ParseContext());
871874 assertEquals("some description", metadata.get("og:description"));
872875 assertTrue(metadata.isMultiValued("og:image"));
873876 }
991994 // The text occurs at line 24 (if lines start at 0) or 25 (if lines start at 1).
992995 assertEquals(24, textPosition[line]);
993996 // The column reported seems fuzzy, just test it is close enough.
994 assertTrue(Math.abs(textPosition[col]-47) < 10);
997 assertTrue(Math.abs(textPosition[col] - 47) < 10);
995998 }
996999
9971000
10071010 + "<title>TitleToIgnore</title></body></html>";
10081011 Metadata metadata = new Metadata();
10091012
1010 new HtmlParser().parse (
1011 new ByteArrayInputStream(test.getBytes("UTF-8")),
1012 new BodyContentHandler(), metadata, new ParseContext());
1013 new HtmlParser().parse(
1014 new ByteArrayInputStream(test.getBytes(IOUtils.UTF_8)),
1015 new BodyContentHandler(), metadata, new ParseContext());
10131016
10141017 //Expecting first title to be set in meta data and second one to be ignored.
10151018 assertEquals("Simple Content", metadata.get(TikaCoreProperties.TITLE));
10161019 }
1017
1020
1021 @Test
1022 public void testMisleadingMetaContentTypeTags() throws Exception {
1023 //TIKA-1519
1024
1025 String test = "<html><head><meta http-equiv=\"content-type\" content=\"text/html; charset=UTF-ELEVEN\">"+
1026 "</head><title>title</title><body>body</body></html>";
1027 Metadata metadata = new Metadata();
1028
1029 new HtmlParser().parse(
1030 new ByteArrayInputStream(test.getBytes(IOUtils.UTF_8)),
1031 new BodyContentHandler(), metadata, new ParseContext());
1032 assertEquals("text/html; charset=UTF-ELEVEN", metadata.get(TikaCoreProperties.CONTENT_TYPE_HINT));
1033 assertEquals("text/html; charset=ISO-8859-1", metadata.get(Metadata.CONTENT_TYPE));
1034
1035 test = "<html><head><meta http-equiv=\"content-type\" content=\"application/pdf\">"+
1036 "</head><title>title</title><body>body</body></html>";
1037 metadata = new Metadata();
1038
1039 new HtmlParser().parse(
1040 new ByteArrayInputStream(test.getBytes(IOUtils.UTF_8)),
1041 new BodyContentHandler(), metadata, new ParseContext());
1042 assertEquals("application/pdf", metadata.get(TikaCoreProperties.CONTENT_TYPE_HINT));
1043 assertEquals("text/html; charset=ISO-8859-1", metadata.get(Metadata.CONTENT_TYPE));
1044
1045 //test two content values
1046 test = "<html><head><meta http-equiv=\"content-type\" content=\"application/pdf\" content=\"application/ms-word\">"+
1047 "</head><title>title</title><body>body</body></html>";
1048 metadata = new Metadata();
1049
1050 new HtmlParser().parse(
1051 new ByteArrayInputStream(test.getBytes(IOUtils.UTF_8)),
1052 new BodyContentHandler(), metadata, new ParseContext());
1053 assertEquals("application/ms-word", metadata.get(TikaCoreProperties.CONTENT_TYPE_HINT));
1054 assertEquals("text/html; charset=ISO-8859-1", metadata.get(Metadata.CONTENT_TYPE));
1055 }
1056
1057 @Test
1058 public void testXHTMLWithMisleading() throws Exception {
1059 //first test an acceptable XHTML header with http-equiv tags
1060 String test = "<?xml version=\"1.0\" ?>"+
1061 "<!DOCTYPE html PUBLIC \"-//W3C//DTD XHTML 1.0 Transitional//EN\" \"http://www.w3.org/TR/xhtml1/DTD/xhtml1-transitional.dtd\">\n" +
1062 "<html xmlns=\"http://www.w3.org/1999/xhtml\">\n" +
1063 "<head>\n" +
1064 "<meta http-equiv=\"Content-Type\" content=\"text/html; charset=iso-8859-1\" />\n" +
1065 "<title>title</title></head><body>body</body></html>";
1066 Metadata metadata = new Metadata();
1067 new AutoDetectParser().parse(
1068 new ByteArrayInputStream(test.getBytes(IOUtils.UTF_8)),
1069 new BodyContentHandler(), metadata, new ParseContext());
1070
1071 assertEquals("text/html; charset=iso-8859-1", metadata.get(TikaCoreProperties.CONTENT_TYPE_HINT));
1072 assertEquals("application/xhtml+xml; charset=ISO-8859-1", metadata.get(Metadata.CONTENT_TYPE));
1073
1074 test = "<?xml version=\"1.0\" ?>"+
1075 "<!DOCTYPE html PUBLIC \"-//W3C//DTD XHTML 1.0 Transitional//EN\" \"http://www.w3.org/TR/xhtml1/DTD/xhtml1-transitional.dtd\">\n" +
1076 "<html xmlns=\"http://www.w3.org/1999/xhtml\">\n" +
1077 "<head>\n" +
1078 "<meta http-equiv=\"Content-Type\" content=\"text/html; charset=iso-NUMBER_SEVEN\" />\n" +
1079 "<title>title</title></head><body>body</body></html>";
1080 metadata = new Metadata();
1081 new AutoDetectParser().parse(
1082 new ByteArrayInputStream(test.getBytes(IOUtils.UTF_8)),
1083 new BodyContentHandler(), metadata, new ParseContext());
1084
1085 assertEquals("text/html; charset=iso-NUMBER_SEVEN", metadata.get(TikaCoreProperties.CONTENT_TYPE_HINT));
1086 assertEquals("application/xhtml+xml; charset=ISO-8859-1", metadata.get(Metadata.CONTENT_TYPE));
1087
1088 }
10181089 }
5050 /* TODO For some reason, the xhtml files in iBooks-style ePub are not parsed properly, and the content comes back empty.
5151 String content = handler.toString();
5252 System.out.println("content="+content);
53 assertTrue(content.contains("Plus a simple div"));
54 assertTrue(content.contains("First item"));
55 assertTrue(content.contains("The previous headings were subchapters"));
56 assertTrue(content.contains("Table data"));
57 assertTrue(content.contains("Lorem ipsum dolor rutur amet"));
53 assertContains("Plus a simple div", content);
54 assertContains("First item", content);
55 assertContains("The previous headings were subchapters", content);
56 assertContains("Table data", content);
57 assertContains("Lorem ipsum dolor rutur amet", content);
5858 */
5959 } finally {
6060 input.close();
0 /*
1 * Licensed to the Apache Software Foundation (ASF) under one or more
2 * contributor license agreements. See the NOTICE file distributed with
3 * this work for additional information regarding copyright ownership.
4 * The ASF licenses this file to You under the Apache License, Version 2.0
5 * (the "License"); you may not use this file except in compliance with
6 * the License. You may obtain a copy of the License at
7 *
8 * http://www.apache.org/licenses/LICENSE-2.0
9 *
10 * Unless required by applicable law or agreed to in writing, software
11 * distributed under the License is distributed on an "AS IS" BASIS,
12 * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
13 * See the License for the specific language governing permissions and
14 * limitations under the License.
15 */
16 package org.apache.tika.parser.image;
17
18 import java.io.InputStream;
19 import java.util.Arrays;
20 import java.util.List;
21
22 import org.apache.tika.metadata.Metadata;
23 import org.apache.tika.metadata.Photoshop;
24 import org.apache.tika.metadata.TikaCoreProperties;
25 import org.apache.tika.parser.ParseContext;
26 import org.apache.tika.parser.Parser;
27 import org.junit.Test;
28 import org.xml.sax.helpers.DefaultHandler;
29
30 import static org.junit.Assert.assertEquals;
32 import static org.junit.Assert.assertTrue;
33
34 public class BPGParserTest {
35 private final Parser parser = new BPGParser();
36
37 /**
38 * Tests a very basic file, without much metadata
39 */
40 @Test
41 public void testBPG() throws Exception {
42 Metadata metadata = new Metadata();
43 metadata.set(Metadata.CONTENT_TYPE, "image/x-bpg");
44 InputStream stream =
45 getClass().getResourceAsStream("/test-documents/testBPG.bpg");
46 parser.parse(stream, new DefaultHandler(), metadata, new ParseContext());
47
48 assertEquals("100", metadata.get(Metadata.IMAGE_WIDTH));
49 assertEquals("75", metadata.get(Metadata.IMAGE_LENGTH));
50 assertEquals("10", metadata.get(Metadata.BITS_PER_SAMPLE));
51 assertEquals("YCbCr Colour", metadata.get(Photoshop.COLOR_MODE));
52 }
53
54 /**
55 * Tests a file with comments
56 */
57 @Test
58 public void testBPG_Commented() throws Exception {
59 Metadata metadata = new Metadata();
60 metadata.set(Metadata.CONTENT_TYPE, "image/x-bpg");
61 InputStream stream =
62 getClass().getResourceAsStream("/test-documents/testBPG_commented.bpg");
63 parser.parse(stream, new DefaultHandler(), metadata, new ParseContext());
64
65 assertEquals("103", metadata.get(Metadata.IMAGE_WIDTH));
66 assertEquals("77", metadata.get(Metadata.IMAGE_LENGTH));
67 assertEquals("10", metadata.get(Metadata.BITS_PER_SAMPLE));
68 assertEquals("YCbCr Colour", metadata.get(Photoshop.COLOR_MODE));
69
70 // TODO Get the exif comment data to be properly extracted, see TIKA-1495
71 if (false) {
72 assertEquals("Tosteberga \u00C4ngar", metadata.get(TikaCoreProperties.TITLE));
73 assertEquals("Bird site in north eastern Sk\u00E5ne, Sweden.\n(new line)", metadata.get(TikaCoreProperties.DESCRIPTION));
74 List<String> keywords = Arrays.asList(metadata.getValues(Metadata.KEYWORDS));
75 assertTrue(keywords.contains("coast"));
76 assertTrue(keywords.contains("bird watching"));
77 assertEquals(keywords, Arrays.asList(metadata.getValues(TikaCoreProperties.KEYWORDS)));
78 }
79
80 // TODO Get the exif data to be properly extracted, see TIKA-1495
81 if (false) {
82 assertEquals("1.0E-6", metadata.get(Metadata.EXPOSURE_TIME)); // 1/1000000
83 assertEquals("2.8", metadata.get(Metadata.F_NUMBER));
84 assertEquals("4.6", metadata.get(Metadata.FOCAL_LENGTH));
85 assertEquals("114", metadata.get(Metadata.ISO_SPEED_RATINGS));
86 assertEquals(null, metadata.get(Metadata.EQUIPMENT_MAKE));
87 assertEquals(null, metadata.get(Metadata.EQUIPMENT_MODEL));
88 assertEquals(null, metadata.get(Metadata.SOFTWARE));
89 assertEquals("1", metadata.get(Metadata.ORIENTATION));
90 assertEquals("300.0", metadata.get(Metadata.RESOLUTION_HORIZONTAL));
91 assertEquals("300.0", metadata.get(Metadata.RESOLUTION_VERTICAL));
92 assertEquals("Inch", metadata.get(Metadata.RESOLUTION_UNIT));
93 }
94 }
95
96 /**
97 * Tests a file with geographic information in it
98 */
99 @Test
100 public void testBPG_Geo() throws Exception {
101 Metadata metadata = new Metadata();
102 metadata.set(Metadata.CONTENT_TYPE, "image/x-bpg");
103 InputStream stream =
104 getClass().getResourceAsStream("/test-documents/testBPG_GEO.bpg");
105 parser.parse(stream, new DefaultHandler(), metadata, new ParseContext());
106
107 assertEquals("100", metadata.get(Metadata.IMAGE_WIDTH));
108 assertEquals("68", metadata.get(Metadata.IMAGE_LENGTH));
109 assertEquals("10", metadata.get(Metadata.BITS_PER_SAMPLE));
110 assertEquals("YCbCr Colour", metadata.get(Photoshop.COLOR_MODE));
111
112 // TODO Get the geographic data to be properly extracted, see TIKA-1495
113 if (false) {
114 assertEquals("12.54321", metadata.get(Metadata.LATITUDE));
115 assertEquals("-54.1234", metadata.get(Metadata.LONGITUDE));
116 }
117
118 // TODO Get the exif data to be properly extracted, see TIKA-1495
119 if (false) {
120 assertEquals("6.25E-4", metadata.get(Metadata.EXPOSURE_TIME)); // 1/1600
121 assertEquals("5.6", metadata.get(Metadata.F_NUMBER));
122 assertEquals("false", metadata.get(Metadata.FLASH_FIRED));
123 assertEquals("194.0", metadata.get(Metadata.FOCAL_LENGTH));
124 assertEquals("400", metadata.get(Metadata.ISO_SPEED_RATINGS));
125 assertEquals("Canon", metadata.get(Metadata.EQUIPMENT_MAKE));
126 assertEquals("Canon EOS 40D", metadata.get(Metadata.EQUIPMENT_MODEL));
127 assertEquals("Adobe Photoshop CS3 Macintosh", metadata.get(Metadata.SOFTWARE));
128 assertEquals("240.0", metadata.get(Metadata.RESOLUTION_HORIZONTAL));
129 assertEquals("240.0", metadata.get(Metadata.RESOLUTION_VERTICAL));
130 assertEquals("Inch", metadata.get(Metadata.RESOLUTION_UNIT));
131 }
132 }
133 }
1515 */
1616 package org.apache.tika.parser.image;
1717
18 import java.util.Arrays;
19 import java.util.GregorianCalendar;
20 import java.util.Iterator;
21 import java.util.List;
2218
2319 import org.apache.tika.metadata.Metadata;
2420 import org.apache.tika.metadata.TikaCoreProperties;
3127 import com.drew.metadata.exif.ExifSubIFDDirectory;
3228 import com.drew.metadata.jpeg.JpegCommentDirectory;
3329
30 import java.util.Arrays;
31 import java.util.GregorianCalendar;
32 import java.util.Iterator;
33 import java.util.List;
34 import java.util.Locale;
35 import java.util.TimeZone;
36
3437 import static org.junit.Assert.assertEquals;
3538 import static org.junit.Assert.assertFalse;
3639 import static org.junit.Assert.assertNull;
3740 import static org.junit.Assert.assertTrue;
38 import static org.mockito.Mockito.*;
41 import static org.mockito.Mockito.mock;
42 import static org.mockito.Mockito.verify;
43 import static org.mockito.Mockito.when;
3944
4045 public class ImageMetadataExtractorTest {
4146
5661 verify(handler1).supports(JpegCommentDirectory.class);
5762 verify(handler1).handle(directory, metadata);
5863 }
59
64
6065 @Test
6166 public void testExifHandlerSupports() {
6267 assertTrue(new ImageMetadataExtractor.ExifHandler().supports(ExifIFD0Directory.class));
6974 public void testExifHandlerParseDate() throws MetadataException {
7075 ExifSubIFDDirectory exif = mock(ExifSubIFDDirectory.class);
7176 when(exif.containsTag(ExifSubIFDDirectory.TAG_DATETIME_ORIGINAL)).thenReturn(true);
77 GregorianCalendar calendar = new GregorianCalendar(TimeZone.getDefault(), Locale.ROOT);
78 calendar.setTimeInMillis(0);
79 calendar.set(2000, 0, 1, 0, 0, 0);
7280 when(exif.getDate(ExifSubIFDDirectory.TAG_DATETIME_ORIGINAL)).thenReturn(
73 new GregorianCalendar(2000, 0, 1, 0, 0, 0).getTime()); // jvm default timezone as in Metadata Extractor
81 calendar.getTime()); // jvm default timezone as in Metadata Extractor
7482 Metadata metadata = new Metadata();
7583
7684 new ImageMetadataExtractor.ExifHandler().handle(exif, metadata);
8290 public void testExifHandlerParseDateFallback() throws MetadataException {
8391 ExifIFD0Directory exif = mock(ExifIFD0Directory.class);
8492 when(exif.containsTag(ExifIFD0Directory.TAG_DATETIME)).thenReturn(true);
93 GregorianCalendar calendar = new GregorianCalendar(TimeZone.getDefault(), Locale.ROOT);
94 calendar.setTimeInMillis(0);
95 calendar.set(1999, 0, 1, 0, 0, 0);
8596 when(exif.getDate(ExifIFD0Directory.TAG_DATETIME)).thenReturn(
86 new GregorianCalendar(1999, 0, 1, 0, 0, 0).getTime()); // jvm default timezone as in Metadata Extractor
97 calendar.getTime()); // jvm default timezone as in Metadata Extractor
8798 Metadata metadata = new Metadata();
8899
89100 new ImageMetadataExtractor.ExifHandler().handle(exif, metadata);
0 /*
1 * Licensed to the Apache Software Foundation (ASF) under one or more
2 * contributor license agreements. See the NOTICE file distributed with
3 * this work for additional information regarding copyright ownership.
4 * The ASF licenses this file to You under the Apache License, Version 2.0
5 * (the "License"); you may not use this file except in compliance with
6 * the License. You may obtain a copy of the License at
7 *
8 * http://www.apache.org/licenses/LICENSE-2.0
9 *
10 * Unless required by applicable law or agreed to in writing, software
11 * distributed under the License is distributed on an "AS IS" BASIS,
12 * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
13 * See the License for the specific language governing permissions and
14 * limitations under the License.
15 */
16
17 package org.apache.tika.parser.image;
18
19 import static org.junit.Assert.assertEquals;
20
21 import java.io.InputStream;
22
23 import org.apache.tika.io.IOUtils;
24 import org.apache.tika.metadata.Metadata;
25 import org.apache.tika.parser.AutoDetectParser;
26 import org.apache.tika.parser.ParseContext;
27 import org.apache.tika.parser.Parser;
28 import org.junit.Test;
29 import org.xml.sax.helpers.DefaultHandler;
30
31
32 public class WebPParserTest {
33
34 Parser parser = new AutoDetectParser();
35 /*
36 Two photos in test-documents (testWebp_Alpha_Lossy.webp and testWebp_Alpha_Lossless.webp)
37 are in the public domain. These files were retrieved from:
38 https://github.com/drewnoakes/metadata-extractor-images/tree/master/webp
39 These photos are also available here:
40 https://developers.google.com/speed/webp/gallery2#webp_links
41 Credits for the photo:
42 "Free Stock Photo in High Resolution - Yellow Rose 3 - Flowers"
43 Image Author: Jon Sullivan
44 */
45 @Test
46 public void testSimple() throws Exception {
47 Metadata metadata = new Metadata();
48 InputStream stream =
49 getClass().getResourceAsStream("/test-documents/testWebp_Alpha_Lossy.webp");
50
51 parser.parse(stream, new DefaultHandler(), metadata, new ParseContext());
52
53 assertEquals("301", metadata.get("Image Height"));
54 assertEquals("400", metadata.get("Image Width"));
55 assertEquals("true", metadata.get("Has Alpha"));
56 assertEquals("false", metadata.get("Is Animation"));
57 assertEquals("image/webp", metadata.get(Metadata.CONTENT_TYPE));
58
59 IOUtils.closeQuietly(stream);
60
61 metadata = new Metadata();
62 stream = getClass().getResourceAsStream("/test-documents/testWebp_Alpha_Lossless.webp");
63 parser.parse(stream, new DefaultHandler(), metadata, new ParseContext());
64
65 //unfortunately, there isn't much metadata in lossless
66 assertEquals("image/webp", metadata.get(Metadata.CONTENT_TYPE));
67 IOUtils.closeQuietly(stream);
68 }
69
70 }
0 /**
1 * Licensed to the Apache Software Foundation (ASF) under one or more
2 * contributor license agreements. See the NOTICE file distributed with
3 * this work for additional information regarding copyright ownership.
4 * The ASF licenses this file to You under the Apache License, Version 2.0
5 * (the "License"); you may not use this file except in compliance with
6 * the License. You may obtain a copy of the License at
7 *
8 * http://www.apache.org/licenses/LICENSE-2.0
9 *
10 * Unless required by applicable law or agreed to in writing, software
11 * distributed under the License is distributed on an "AS IS" BASIS,
12 * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
13 * See the License for the specific language governing permissions and
14 * limitations under the License.
15 */
16
17 package org.apache.tika.parser.isatab;
18
19 import static org.junit.Assert.*;
20
21 import java.io.InputStream;
22
23 import org.apache.tika.metadata.Metadata;
24 import org.apache.tika.parser.AutoDetectParser;
25 import org.apache.tika.parser.ParseContext;
26 import org.apache.tika.parser.Parser;
27 import org.apache.tika.sax.BodyContentHandler;
28 import org.junit.Test;
29 import org.xml.sax.ContentHandler;
30
31 public class ISArchiveParserTest {
32
33 @Test
34 public void testParseArchive() throws Exception {
35 String path = "/test-documents/testISATab_BII-I-1/s_BII-S-1.txt";
36
37 Parser parser = new ISArchiveParser(ISArchiveParserTest.class.getResource("/test-documents/testISATab_BII-I-1/").getPath());
38 //Parser parser = new AutoDetectParser();
39
40 ContentHandler handler = new BodyContentHandler();
41 Metadata metadata = new Metadata();
42 ParseContext context = new ParseContext();
43 InputStream stream = null;
44 try {
45 stream = ISArchiveParserTest.class.getResourceAsStream(path);
46 parser.parse(stream, handler, metadata, context);
47 }
48 finally {
49 if (stream != null) stream.close();
50 }
51
52 // INVESTIGATION
53 assertEquals("Invalid Investigation Identifier", "BII-I-1", metadata.get("Investigation Identifier"));
54 assertEquals("Invalid Investigation Title", "Growth control of the eukaryote cell: a systems biology study in yeast", metadata.get("Investigation Title"));
55
56 // INVESTIGATION PUBLICATIONS
57 assertEquals("Invalid Investigation PubMed ID", "17439666", metadata.get("Investigation PubMed ID"));
58 assertEquals("Invalid Investigation Publication DOI", "doi:10.1186/jbiol54", metadata.get("Investigation Publication DOI"));
59
60 // INVESTIGATION CONTACTS
61 assertEquals("Invalid Investigation Person Last Name", "Oliver", metadata.get("Investigation Person Last Name"));
62 assertEquals("Invalid Investigation Person First Name", "Stephen", metadata.get("Investigation Person First Name"));
63 }
64 }
1515 */
1616 package org.apache.tika.parser.iwork;
1717
18 import static org.apache.tika.TikaTest.assertContains;
1819 import static org.junit.Assert.assertEquals;
1920 import static org.junit.Assert.assertTrue;
2021
4546 iWorkParser = new IWorkPackageParser();
4647 parseContext = new ParseContext();
4748 parseContext.set(Parser.class, new AutoDetectParser());
49 }
50
51 /**
52 * Check the given InputStream is not closed by the Parser (TIKA-1117).
53 *
54 * @throws Exception
55 */
56 @Test
57 public void testStreamNotClosed() throws Exception {
58 InputStream input = IWorkParserTest.class.getResourceAsStream("/test-documents/testKeynote.key");
59 Metadata metadata = new Metadata();
60 ContentHandler handler = new BodyContentHandler();
61 iWorkParser.parse(input, handler, metadata, parseContext);
62 input.read(); // Will throw an Exception if the stream was already closed.
4863 }
4964
5065 @Test
7388 assertEquals("Apache tika", metadata.get(TikaCoreProperties.TITLE));
7489
7590 String content = handler.toString();
76 assertTrue(content.contains("A sample presentation"));
77 assertTrue(content.contains("For the Apache Tika project"));
78 assertTrue(content.contains("Slide 1"));
79 assertTrue(content.contains("Some random text for the sake of testability."));
80 assertTrue(content.contains("A nice comment"));
81 assertTrue(content.contains("A nice note"));
91 assertContains("A sample presentation", content);
92 assertContains("For the Apache Tika project", content);
93 assertContains("Slide 1", content);
94 assertContains("Some random text for the sake of testability.", content);
95 assertContains("A nice comment", content);
96 assertContains("A nice note", content);
8297
8398 // test table data
84 assertTrue(content.contains("Cell one"));
85 assertTrue(content.contains("Cell two"));
86 assertTrue(content.contains("Cell three"));
87 assertTrue(content.contains("Cell four"));
88 assertTrue(content.contains("Cell 5"));
89 assertTrue(content.contains("Cell six"));
90 assertTrue(content.contains("7"));
91 assertTrue(content.contains("Cell eight"));
92 assertTrue(content.contains("5/5/1985"));
99 assertContains("Cell one", content);
100 assertContains("Cell two", content);
101 assertContains("Cell three", content);
102 assertContains("Cell four", content);
103 assertContains("Cell 5", content);
104 assertContains("Cell six", content);
105 assertContains("7", content);
106 assertContains("Cell eight", content);
107 assertContains("5/5/1985", content);
93108 }
94109
95110 // TIKA-910
126141
127142 String content = handler.toString();
128143 content = content.replaceAll("\\s+", " ");
129 assertTrue(content.contains("row 1 row 2 row 3"));
144 assertContains("row 1 row 2 row 3", content);
130145 }
131146
132147 // TIKA-923
139154
140155 String content = handler.toString();
141156 content = content.replaceAll("\\s+", " ");
142 assertTrue(content.contains("master row 1"));
143 assertTrue(content.contains("master row 2"));
144 assertTrue(content.contains("master row 3"));
157 assertContains("master row 1", content);
158 assertContains("master row 2", content);
159 assertContains("master row 3", content);
145160 }
146161
147162 @Test
174189 String content = handler.toString();
175190
176191 // text on page 1
177 assertTrue(content.contains("Sample pages document"));
178 assertTrue(content.contains("Some plain text to parse."));
179 assertTrue(content.contains("Cell one"));
180 assertTrue(content.contains("Cell two"));
181 assertTrue(content.contains("Cell three"));
182 assertTrue(content.contains("Cell four"));
183 assertTrue(content.contains("Cell five"));
184 assertTrue(content.contains("Cell six"));
185 assertTrue(content.contains("Cell seven"));
186 assertTrue(content.contains("Cell eight"));
187 assertTrue(content.contains("Cell nine"));
188 assertTrue(content.contains("Both Pages 1.x and Keynote 2.x")); // ...
192 assertContains("Sample pages document", content);
193 assertContains("Some plain text to parse.", content);
194 assertContains("Cell one", content);
195 assertContains("Cell two", content);
196 assertContains("Cell three", content);
197 assertContains("Cell four", content);
198 assertContains("Cell five", content);
199 assertContains("Cell six", content);
200 assertContains("Cell seven", content);
201 assertContains("Cell eight", content);
202 assertContains("Cell nine", content);
203 assertContains("Both Pages 1.x and Keynote 2.x", content); // ...
189204
190205 // text on page 2
191 assertTrue(content.contains("A second page...."));
192 assertTrue(content.contains("Extensible Markup Language")); // ...
206 assertContains("A second page....", content);
207 assertContains("Extensible Markup Language", content); // ...
193208 }
194209
195210 // TIKA-904
202217 iWorkParser.parse(input, handler, metadata, parseContext);
203218
204219 String content = handler.toString();
205 assertTrue(content.contains("text box 1 - here is some text"));
206 assertTrue(content.contains("created in a text box in layout mode"));
207 assertTrue(content.contains("text box 2 - more text!@!$@#"));
208 assertTrue(content.contains("this is text inside of a green box"));
209 assertTrue(content.contains("text inside of a green circle"));
220 assertContains("text box 1 - here is some text", content);
221 assertContains("created in a text box in layout mode", content);
222 assertContains("text box 2 - more text!@!$@#", content);
223 assertContains("this is text inside of a green box", content);
224 assertContains("text inside of a green circle", content);
210225 }
211226
212227 @Test
235250 assertEquals("a comment", metadata.get(TikaCoreProperties.COMMENTS));
236251
237252 String content = handler.toString();
238 assertTrue(content.contains("Category"));
239 assertTrue(content.contains("Home"));
240 assertTrue(content.contains("-226"));
241 assertTrue(content.contains("-137.5"));
242 assertTrue(content.contains("Checking Account: 300545668"));
243 assertTrue(content.contains("4650"));
244 assertTrue(content.contains("Credit Card"));
245 assertTrue(content.contains("Groceries"));
246 assertTrue(content.contains("-210"));
247 assertTrue(content.contains("Food"));
248 assertTrue(content.contains("Try adding your own account transactions to this table."));
253 assertContains("Category", content);
254 assertContains("Home", content);
255 assertContains("-226", content);
256 assertContains("-137.5", content);
257 assertContains("Checking Account: 300545668", content);
258 assertContains("4650", content);
259 assertContains("Credit Card", content);
260 assertContains("Groceries", content);
261 assertContains("-210", content);
262 assertContains("Food", content);
263 assertContains("Try adding your own account transactions to this table.", content);
249264 }
250265
251266 // TIKA-924
256271 ContentHandler handler = new BodyContentHandler();
257272 iWorkParser.parse(input, handler, metadata, parseContext);
258273 String content = handler.toString();
259 assertTrue(content.contains("This is the main table"));
274 assertContains("This is the main table", content);
260275 }
261276
262277 @Test
268283
269284 String content = handler.toString();
270285 for(int header=1;header<=5;header++) {
271 assertTrue(content.contains("header" + header));
286 assertContains("header" + header, content);
272287 }
273288 for(int row=1;row<=3;row++) {
274 assertTrue(content.contains("row" + row));
289 assertContains("row" + row, content);
275290 }
276291 }
277292
315330 String contents = handler.toString();
316331
317332 // Check regular text
318 assertContains(contents, "Both Pages 1.x"); // P1
319 assertContains(contents, "understanding the Pages document"); // P1
320 assertContains(contents, "should be page 2"); // P2
333 assertContains("Both Pages 1.x", contents); // P1
334 assertContains("understanding the Pages document", contents); // P1
335 assertContains("should be page 2", contents); // P2
321336
322337 // Check for headers, footers and footnotes
323 assertContains(contents, header);
324 assertContains(contents, footer);
325 assertContains(contents, footer2);
326 assertContains(contents, footnote);
338 assertContains(header, contents);
339 assertContains(footer, contents);
340 assertContains(footer2, contents);
341 assertContains(footnote, contents);
327342 }
328343
329344 /**
342357 String contents = handler.toString();
343358
344359 // Check for headers, footers and footnotes
345 assertContains(contents, header);
346 assertContains(contents, footer);
347 assertContains(contents, footer2);
360 assertContains(header, contents);
361 assertContains(footer, contents);
362 assertContains(footer2, contents);
348363 }
349364
350365 /**
363378 String contents = handler.toString();
364379
365380 // Check for headers, footers and footnotes
366 assertContains(contents, header);
367 assertContains(contents, footer);
368 assertContains(contents, footer2);
381 assertContains(header, contents);
382 assertContains(footer, contents);
383 assertContains(footer2, contents);
369384 }
370385
371386 /**
384399 String contents = handler.toString();
385400
386401 // Check for headers, footers and footnotes
387 assertContains(contents, header);
388 assertContains(contents, footer);
389 assertContains(contents, footer2);
402 assertContains(header, contents);
403 assertContains(footer, contents);
404 assertContains(footer2, contents);
390405 }
391406
392407 /**
405420 String contents = handler.toString();
406421
407422 // Check for headers, footers and footnotes
408 assertContains(contents, header);
409 assertContains(contents, footer);
410 assertContains(contents, footer2);
423 assertContains(header, contents);
424 assertContains(footer, contents);
425 assertContains(footer2, contents);
411426 }
412427
413428 /**
426441 String contents = handler.toString();
427442
428443 // Check regular text
429 assertContains(contents, "Both Pages 1.x"); // P1
430 assertContains(contents, "understanding the Pages document"); // P1
431 assertContains(contents, "should be page 2"); // P2
444 assertContains("Both Pages 1.x", contents); // P1
445 assertContains("understanding the Pages document", contents); // P1
446 assertContains("should be page 2", contents); // P2
432447
433448 // Check for comments
434 assertContains(contents, commentA);
435 assertContains(contents, commentB);
449 assertContains(commentA, contents);
450 assertContains(commentB, contents);
436451 }
437452
438453 // TIKA-918
443458 ContentHandler handler = new BodyContentHandler();
444459 iWorkParser.parse(input, handler, metadata, parseContext);
445460 String contents = handler.toString();
446 assertContains(contents, "Expenditure by Category");
447 assertContains(contents, "Currency Chart name");
448 assertContains(contents, "Chart 2");
449 }
450
451 public void assertContains(String haystack, String needle) {
452 assertTrue(needle + " not found in:\n" + haystack, haystack.contains(needle));
461 assertContains("Expenditure by Category", contents);
462 assertContains("Currency Chart name", contents);
463 assertContains("Chart 2", contents);
453464 }
454465 }
0 package org.apache.tika.parser.jdbc;
1
2 /*
3 * Licensed to the Apache Software Foundation (ASF) under one or more
4 * contributor license agreements. See the NOTICE file distributed with
5 * this work for additional information regarding copyright ownership.
6 * The ASF licenses this file to You under the Apache License, Version 2.0
7 * (the "License"); you may not use this file except in compliance with
8 * the License. You may obtain a copy of the License at
9 *
10 * http://www.apache.org/licenses/LICENSE-2.0
11 *
12 * Unless required by applicable law or agreed to in writing, software
13 * distributed under the License is distributed on an "AS IS" BASIS,
14 * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
15 * See the License for the specific language governing permissions and
16 * limitations under the License.
17 */
18
19 import static org.junit.Assert.assertEquals;
20
21 import java.io.ByteArrayInputStream;
22 import java.io.ByteArrayOutputStream;
23 import java.io.IOException;
24 import java.io.InputStream;
25 import java.util.ArrayList;
26 import java.util.List;
27
28 import org.apache.tika.TikaTest;
29 import org.apache.tika.extractor.EmbeddedResourceHandler;
30 import org.apache.tika.extractor.ParserContainerExtractor;
31 import org.apache.tika.io.IOUtils;
32 import org.apache.tika.io.TikaInputStream;
33 import org.apache.tika.metadata.Database;
34 import org.apache.tika.metadata.Metadata;
35 import org.apache.tika.mime.MediaType;
36 import org.apache.tika.parser.AutoDetectParser;
37 import org.apache.tika.parser.ParseContext;
38 import org.apache.tika.parser.Parser;
39 import org.apache.tika.parser.RecursiveParserWrapper;
40 import org.apache.tika.sax.BasicContentHandlerFactory;
41 import org.apache.tika.sax.BodyContentHandler;
42 import org.apache.tika.sax.ToXMLContentHandler;
43 import org.junit.Test;
44 import org.xml.sax.ContentHandler;
45
46 public class SQLite3ParserTest extends TikaTest {
47 private final static String TEST_FILE_NAME = "testSqlite3b.db";
48 private final static String TEST_FILE1 = "/test-documents/" + TEST_FILE_NAME;
49
50 @Test
51 public void testBasic() throws Exception {
52 Parser p = new AutoDetectParser();
53
54 //test different types of input streams
55 //actual inputstream, memory buffered bytearray and literal file
56 InputStream[] streams = new InputStream[3];
57 streams[0] = getResourceAsStream(TEST_FILE1);
58 ByteArrayOutputStream bos = new ByteArrayOutputStream();
59 IOUtils.copy(getResourceAsStream(TEST_FILE1), bos);
60 streams[1] = new ByteArrayInputStream(bos.toByteArray());
61 streams[2] = TikaInputStream.get(getResourceAsFile(TEST_FILE1));
62 int tests = 0;
63 for (InputStream stream : streams) {
64 Metadata metadata = new Metadata();
65 metadata.set(Metadata.RESOURCE_NAME_KEY, TEST_FILE_NAME);
66 //1) getXML closes the stream
67 //2) getXML runs recursively on the contents, so the embedded docs should show up
68 XMLResult result = getXML(stream, p, metadata);
69 String x = result.xml;
70 //first table name
71 assertContains("<table name=\"my_table1\"><thead><tr>\t<th>INT_COL</th>", x);
72 //non-ascii
73 assertContains("<td>普林斯顿大学</td>", x);
74 //boolean
75 assertContains("<td>true</td>\t<td>2015-01-02</td>", x);
76 //date test
77 assertContains("2015-01-04", x);
78 //timestamp test
79 assertContains("2015-01-03 15:17:03", x);
80 //first embedded doc's image tag
81 assertContains("alt=\"image1.png\"", x);
82 //second embedded doc's image tag
83 assertContains("alt=\"A description...\"", x);
84 //second table name
85 assertContains("<table name=\"my_table2\"><thead><tr>\t<th>INT_COL2</th>", x);
86
87 Metadata post = result.metadata;
88 String[] tableNames = post.getValues(Database.TABLE_NAME);
89 assertEquals(2, tableNames.length);
90 assertEquals("my_table1", tableNames[0]);
91 assertEquals("my_table2", tableNames[1]);
92 tests++;
93 }
94 assertEquals(3, tests);
95 }
96
97 //make sure that table cells and rows are properly marked to
98 //yield \t and \n at the appropriate places
99 @Test
100 public void testSpacesInBodyContentHandler() throws Exception {
101 Parser p = new AutoDetectParser();
102 InputStream stream = null;
103 Metadata metadata = new Metadata();
104 metadata.set(Metadata.RESOURCE_NAME_KEY, TEST_FILE_NAME);
105 ContentHandler handler = new BodyContentHandler(-1);
106 ParseContext ctx = new ParseContext();
107 ctx.set(Parser.class, p);
108 try {
109 stream = getResourceAsStream(TEST_FILE1);
110 p.parse(stream, handler, metadata, ctx);
111 } finally {
112 stream.close();
113 }
114 String s = handler.toString();
115 assertContains("0\t2.3\t2.4\tlorem", s);
116 assertContains("tempor\n", s);
117 }
118
119 //test what happens if the user forgets to pass in a parser via context
120 //to handle embedded documents
121 @Test
122 public void testNotAddingEmbeddedParserToParseContext() throws Exception {
123 Parser p = new AutoDetectParser();
124
125 InputStream is = getResourceAsStream(TEST_FILE1);
126 Metadata metadata = new Metadata();
127 metadata.set(Metadata.RESOURCE_NAME_KEY, TEST_FILE_NAME);
128 ContentHandler handler = new ToXMLContentHandler();
129 p.parse(is, handler, metadata, new ParseContext());
130 String xml = handler.toString();
131 //just includes headers for embedded documents
132 assertContains("<table name=\"my_table1\"><thead><tr>", xml);
133 assertContains("<td><span type=\"blob\" column_name=\"BYTES_COL\" row_number=\"0\"><div class=\"package-entry\"><h1>BYTES_COL_0.doc</h1>", xml);
134 //but no other content
135 assertNotContained("dog", xml);
136 assertNotContained("alt=\"image1.png\"", xml);
137 //second embedded doc's image tag
138 assertNotContained("alt=\"A description...\"", xml);
139 }
140
141 @Test
142 public void testRecursiveParserWrapper() throws Exception {
143 Parser p = new AutoDetectParser();
144
145 RecursiveParserWrapper wrapper =
146 new RecursiveParserWrapper(p, new BasicContentHandlerFactory(
147 BasicContentHandlerFactory.HANDLER_TYPE.BODY, -1));
148 InputStream is = getResourceAsStream(TEST_FILE1);
149 Metadata metadata = new Metadata();
150 metadata.set(Metadata.RESOURCE_NAME_KEY, TEST_FILE_NAME);
151 wrapper.parse(is, new BodyContentHandler(-1), metadata, new ParseContext());
152 List<Metadata> metadataList = wrapper.getMetadata();
153 int i = 0;
154 assertEquals(5, metadataList.size());
155 //make sure the \t are inserted in a body handler
156
157 String table = metadataList.get(0).get(RecursiveParserWrapper.TIKA_CONTENT);
158 assertContains("0\t2.3\t2.4\tlorem", table);
159 assertContains("普林斯顿大学", table);
160
161 //make sure the \n is inserted
162 String table2 = metadataList.get(0).get(RecursiveParserWrapper.TIKA_CONTENT);
163 assertContains("do eiusmod tempor\n", table2);
164
165 assertContains("The quick brown fox", metadataList.get(2).get(RecursiveParserWrapper.TIKA_CONTENT));
166 assertContains("The quick brown fox", metadataList.get(4).get(RecursiveParserWrapper.TIKA_CONTENT));
167
168 //confirm .doc was added to blob
169 assertEquals("testSqlite3b.db/BYTES_COL_0.doc/image1.png", metadataList.get(1).get(RecursiveParserWrapper.EMBEDDED_RESOURCE_PATH));
170 }
171
172 @Test
173 public void testParserContainerExtractor() throws Exception {
174 //There should be 6 embedded documents:
175 //2x tables -- UTF-8 csv representations of the tables
176 //2x word files, one doc and one docx
177 //2x png files, the same image embedded in each of the doc and docx
178
179 ParserContainerExtractor ex = new ParserContainerExtractor();
180 ByteCopyingHandler byteCopier = new ByteCopyingHandler();
181 InputStream is = getResourceAsStream(TEST_FILE1);
182 Metadata metadata = new Metadata();
183 metadata.set(Metadata.RESOURCE_NAME_KEY, TEST_FILE_NAME);
184 ex.extract(TikaInputStream.get(is), ex, byteCopier);
185
186 assertEquals(4, byteCopier.bytes.size());
187 String[] strings = new String[4];
188 for (int i = 1; i < byteCopier.bytes.size(); i++) {
189 byte[] byteArr = byteCopier.bytes.get(i);
190 String s = new String(byteArr, 0, Math.min(byteArr.length,1000), "UTF-8");
191 strings[i] = s;
192 }
193 byte[] oleBytes = new byte[]{
194 (byte)-48,
195 (byte)-49,
196 (byte)17,
197 (byte)-32,
198 (byte)-95,
199 (byte)-79,
200 (byte)26,
201 (byte)-31,
202 (byte)0,
203 (byte)0,
204 };
205 //test OLE
206 for (int i = 0; i < 10; i++) {
207 assertEquals(oleBytes[i], byteCopier.bytes.get(0)[i]);
208 }
209 assertContains("PNG", strings[1]);
210 assertContains("PK", strings[2]);
211 assertContains("PNG", strings[3]);
212 }
213
214 //This confirms that reading the stream twice is not
215 //quadrupling the number of attachments.
216 @Test
217 public void testInputStreamReset() throws Exception {
218 //There should be 8 embedded documents:
219 //4x word files, two docs and two docxs
220 //4x png files, the same image embedded in each of the doc and docx
221
222 ParserContainerExtractor ex = new ParserContainerExtractor();
223 InputStreamResettingHandler byteCopier = new InputStreamResettingHandler();
224 InputStream is = getResourceAsStream(TEST_FILE1);
225 Metadata metadata = new Metadata();
226 metadata.set(Metadata.RESOURCE_NAME_KEY, TEST_FILE_NAME);
227 ex.extract(TikaInputStream.get(is), ex, byteCopier);
228 is.reset();
229 assertEquals(8, byteCopier.bytes.size());
230 }
231
232
233
234 public static class InputStreamResettingHandler implements EmbeddedResourceHandler {
235
236 public List<byte[]> bytes = new ArrayList<byte[]>();
237
238 @Override
239 public void handle(String filename, MediaType mediaType,
240 InputStream stream) {
241 ByteArrayOutputStream os = new ByteArrayOutputStream();
242 if (! stream.markSupported()) {
243 stream = TikaInputStream.get(stream);
244 }
245 stream.mark(1000000);
246 try {
247 IOUtils.copy(stream, os);
248 bytes.add(os.toByteArray());
249 stream.reset();
250 //now try again
251 os.reset();
252 IOUtils.copy(stream, os);
253 bytes.add(os.toByteArray());
254 stream.reset();
255 } catch (IOException e) {
256 //swallow
257 }
258 }
259 }
260
261 //code used for creating the test file
262 /*
263 private Connection getConnection(String dbFileName) throws Exception {
264 File testDirectory = new File(this.getClass().getResource("/test-documents").toURI());
265 System.out.println("Writing to: " + testDirectory.getAbsolutePath());
266 File testDB = new File(testDirectory, dbFileName);
267 Connection c = null;
268 try {
269 Class.forName("org.sqlite.JDBC");
270 c = DriverManager.getConnection("jdbc:sqlite:" + testDB.getAbsolutePath());
271 } catch ( Exception e ) {
272 System.err.println( e.getClass().getName() + ": " + e.getMessage() );
273 System.exit(0);
274 }
275 return c;
276 }
277
278 @Test
279 public void testCreateDB() throws Exception {
280 Connection c = getConnection("testSQLLite3b.db");
281 Statement st = c.createStatement();
282 String sql = "DROP TABLE if exists my_table1";
283 st.execute(sql);
284 sql = "CREATE TABLE my_table1 (" +
285 "INT_COL INT PRIMARY KEY, "+
286 "FLOAT_COL FLOAT, " +
287 "DOUBLE_COL DOUBLE, " +
288 "CHAR_COL CHAR(30), "+
289 "VARCHAR_COL VARCHAR(30), "+
290 "BOOLEAN_COL BOOLEAN,"+
291 "DATE_COL DATE,"+
292 "TIME_STAMP_COL TIMESTAMP,"+
293 "BYTES_COL BYTES" +
294 ")";
295 st.execute(sql);
296 sql = "insert into my_table1 (INT_COL, FLOAT_COL, DOUBLE_COL, CHAR_COL, " +
297 "VARCHAR_COL, BOOLEAN_COL, DATE_COL, TIME_STAMP_COL, BYTES_COL) " +
298 "values (?,?,?,?,?,?,?,?,?)";
299 SimpleDateFormat f = new SimpleDateFormat("yyyy-MM-dd HH:mm:ss");
300 java.util.Date d = f.parse("2015-01-03 15:17:03");
301 System.out.println(d.getTime());
302 long d1Long = 1420229823000L;// 2015-01-02 15:17:03
303 long d2Long = 1420316223000L;// 2015-01-03 15:17:03
304 PreparedStatement ps = c.prepareStatement(sql);
305 ps.setInt(1, 0);
306 ps.setFloat(2, 2.3f);
307 ps.setDouble(3, 2.4d);
308 ps.setString(4, "lorem");
309 ps.setString(5, "普林斯顿大学");
310 ps.setBoolean(6, true);
311 ps.setString(7, "2015-01-02");
312 ps.setString(8, "2015-01-03 15:17:03");
313 // ps.setClob(9, new StringReader(clobString));
314 ps.setBytes(9, getByteArray(this.getClass().getResourceAsStream("/test-documents/testWORD_1img.doc")));//contains "quick brown fox"
315 ps.executeUpdate();
316 ps.clearParameters();
317
318 ps.setInt(1, 1);
319 ps.setFloat(2, 4.6f);
320 ps.setDouble(3, 4.8d);
321 ps.setString(4, "dolor");
322 ps.setString(5, "sit");
323 ps.setBoolean(6, false);
324 ps.setString(7, "2015-01-04");
325 ps.setString(8, "2015-01-03 15:17:03");
326 //ps.setClob(9, new StringReader("consectetur adipiscing elit"));
327 ps.setBytes(9, getByteArray(this.getClass().getResourceAsStream("/test-documents/testWORD_1img.docx")));//contains "The end!"
328
329 ps.executeUpdate();
330
331 //build table2
332 sql = "DROP TABLE if exists my_table2";
333 st.execute(sql);
334
335 sql = "CREATE TABLE my_table2 (" +
336 "INT_COL2 INT PRIMARY KEY, "+
337 "VARCHAR_COL2 VARCHAR(64))";
338 st.execute(sql);
339 sql = "INSERT INTO my_table2 values(0,'sed, do eiusmod tempor')";
340 st.execute(sql);
341 sql = "INSERT INTO my_table2 values(1,'incididunt \nut labore')";
342 st.execute(sql);
343
344 c.close();
345 }
346
347 private byte[] getByteArray(InputStream is) throws IOException {
348 ByteArrayOutputStream bos = new ByteArrayOutputStream();
349 byte[] buff = new byte[1024];
350 for (int bytesRead; (bytesRead = is.read(buff)) != -1;) {
351 bos.write(buff, 0, bytesRead);
352 }
353 return bos.toByteArray();
354 }
355
356 */
357
358
359 }
1717
1818 import static org.junit.Assert.assertEquals;
1919 import static org.junit.Assert.assertFalse;
20 import static org.junit.Assert.assertNotNull;
2021 import static org.junit.Assert.assertTrue;
2122 import static org.junit.Assert.fail;
23 import static org.junit.Assume.assumeTrue;
2224 import static org.mockito.Matchers.any;
2325 import static org.mockito.Matchers.eq;
2426 import static org.mockito.Mockito.mock;
3032 import java.io.InputStream;
3133
3234 import org.apache.james.mime4j.stream.MimeConfig;
35 import org.apache.tika.TikaTest;
3336 import org.apache.tika.exception.TikaException;
37 import org.apache.tika.extractor.ContainerExtractor;
38 import org.apache.tika.extractor.ParserContainerExtractor;
39 import org.apache.tika.io.TikaInputStream;
3440 import org.apache.tika.metadata.Metadata;
3541 import org.apache.tika.metadata.TikaCoreProperties;
42 import org.apache.tika.mime.MediaType;
3643 import org.apache.tika.parser.ParseContext;
3744 import org.apache.tika.parser.Parser;
45 import org.apache.tika.parser.PasswordProvider;
46 import org.apache.tika.parser.ocr.TesseractOCRParserTest;
3847 import org.apache.tika.sax.BodyContentHandler;
3948 import org.apache.tika.sax.XHTMLContentHandler;
4049 import org.junit.Test;
4251 import org.xml.sax.ContentHandler;
4352 import org.xml.sax.helpers.DefaultHandler;
4453
45 public class RFC822ParserTest {
54 public class RFC822ParserTest extends TikaTest {
4655
4756 @Test
4857 public void testSimple() {
8291 try {
8392 parser.parse(stream, handler, metadata, new ParseContext());
8493 verify(handler).startDocument();
85 //4 body-part divs -- two outer bodies and two inner bodies
86 verify(handler, times(4)).startElement(eq(XHTMLContentHandler.XHTML), eq("div"), eq("div"), any(Attributes.class));
87 verify(handler, times(4)).endElement(XHTMLContentHandler.XHTML, "div", "div");
88 //5 paragraph elements, 4 for body-parts and 1 for encompassing message
89 verify(handler, times(5)).startElement(eq(XHTMLContentHandler.XHTML), eq("p"), eq("p"), any(Attributes.class));
90 verify(handler, times(5)).endElement(XHTMLContentHandler.XHTML, "p", "p");
94 int bodyExpectedTimes = 4, multipackExpectedTimes = 5;
95 // TIKA-1422. TesseractOCRParser interferes with the number of times the handler is invoked.
96 // But, different versions of Tesseract lead to a different number of invocations. So, we
97 // only verify the handler if Tesseract cannot run.
98 if (!TesseractOCRParserTest.canRun()) {
99 verify(handler, times(bodyExpectedTimes)).startElement(eq(XHTMLContentHandler.XHTML), eq("div"), eq("div"), any(Attributes.class));
100 verify(handler, times(bodyExpectedTimes)).endElement(XHTMLContentHandler.XHTML, "div", "div");
101 }
102 verify(handler, times(multipackExpectedTimes)).startElement(eq(XHTMLContentHandler.XHTML), eq("p"), eq("p"), any(Attributes.class));
103 verify(handler, times(multipackExpectedTimes)).endElement(XHTMLContentHandler.XHTML, "p", "p");
91104 verify(handler).endDocument();
105
92106 } catch (Exception e) {
93107 fail("Exception thrown: " + e.getMessage());
94108 }
140154 try {
141155 parser.parse(stream, handler, metadata, new ParseContext());
142156 //tests correct decoding of base64 text, including ISO-8859-1 bytes into Unicode
143 assertTrue(handler.toString().contains("Here is some text, with international characters, voil\u00E0!"));
157 assertContains("Here is some text, with international characters, voil\u00E0!", handler.toString());
144158 } catch (Exception e) {
145159 fail("Exception thrown: " + e.getMessage());
146160 }
244258 assertEquals("def", metadata.getValues(Metadata.MESSAGE_TO)[1]);
245259 assertEquals("abcd", metadata.get(TikaCoreProperties.TITLE));
246260 assertEquals("abcd", metadata.get(Metadata.SUBJECT));
247 assertTrue(handler.toString().contains("bar biz bat"));
261 assertContains("bar biz bat", handler.toString());
262 }
263
264 /**
265 * Test TIKA-1028 - If the mail contains an encrypted attachment (or
266 * an attachment that otherwise triggers an error), parsing should carry
267 * on for the remainder regardless
268 */
269 @Test
270 public void testEncryptedZipAttachment() throws Exception {
271 Parser parser = new RFC822Parser();
272 Metadata metadata = new Metadata();
273 ParseContext context = new ParseContext();
274 InputStream stream = getStream("test-documents/testRFC822_encrypted_zip");
275 ContentHandler handler = new BodyContentHandler();
276 parser.parse(stream, handler, metadata, context);
277
278 // Check we got the metadata
279 assertEquals("Juha Haaga <juha.haaga@gmail.com>", metadata.get(Metadata.MESSAGE_FROM));
280 assertEquals("Test mail for Tika", metadata.get(TikaCoreProperties.TITLE));
281
282 // Check we got the message text, for both Plain Text and HTML
283 assertContains("Includes encrypted zip file", handler.toString());
284 assertContains("password is \"test\".", handler.toString());
285 assertContains("This is the Plain Text part", handler.toString());
286 assertContains("This is the HTML part", handler.toString());
287
288 // We won't get the contents of the zip file, but we will get the name
289 assertContains("text.txt", handler.toString());
290 assertNotContained("ENCRYPTED ZIP FILES", handler.toString());
291
292 // Try again, this time with the password supplied
293 // Check that we also get the zip's contents as well
294 context.set(PasswordProvider.class, new PasswordProvider() {
295 public String getPassword(Metadata metadata) {
296 return "test";
297 }
298 });
299 stream = getStream("test-documents/testRFC822_encrypted_zip");
300 handler = new BodyContentHandler();
301 parser.parse(stream, handler, metadata, context);
302
303 assertContains("Includes encrypted zip file", handler.toString());
304 assertContains("password is \"test\".", handler.toString());
305 assertContains("This is the Plain Text part", handler.toString());
306 assertContains("This is the HTML part", handler.toString());
307
308 // We do get the name of the file in the encrypted zip file
309 assertContains("text.txt", handler.toString());
310
311 // TODO Upgrade to a version of Commons Compress with Encryption
312 // support, then verify we get the contents of the text file
313 // held within the encrypted zip
314 assumeTrue(false); // No Zip Encryption support yet
315 assertContains("TEST DATA FOR TIKA.", handler.toString());
316 assertContains("ENCRYPTED ZIP FILES", handler.toString());
317 assertContains("TIKA-1028", handler.toString());
318 }
319
320 /**
321 * Test TIKA-1028 - Ensure we can get the contents of an
322 * un-encrypted zip file
323 */
324 @Test
325 public void testNormalZipAttachment() throws Exception {
326 Parser parser = new RFC822Parser();
327 Metadata metadata = new Metadata();
328 ParseContext context = new ParseContext();
329 InputStream stream = getStream("test-documents/testRFC822_normal_zip");
330 ContentHandler handler = new BodyContentHandler();
331 parser.parse(stream, handler, metadata, context);
332
333 // Check we got the metadata
334 assertEquals("Juha Haaga <juha.haaga@gmail.com>", metadata.get(Metadata.MESSAGE_FROM));
335 assertEquals("Test mail for Tika", metadata.get(TikaCoreProperties.TITLE));
336
337 // Check we got the message text, for both Plain Text and HTML
338 assertContains("Includes a normal, unencrypted zip file", handler.toString());
339 assertContains("This is the Plain Text part", handler.toString());
340 assertContains("This is the HTML part", handler.toString());
341
342 // We get both name and contents of the zip file's contents
343 assertContains("text.txt", handler.toString());
344 assertContains("TEST DATA FOR TIKA.", handler.toString());
345 assertContains("This is text inside an unencrypted zip file", handler.toString());
346 assertContains("TIKA-1028", handler.toString());
347 }
348
349 /**
350 * TIKA-1222 When requested, ensure that the various attachments of
351 * the mail come through properly as embedded resources
352 */
353 @Test
354 public void testGetAttachmentsAsEmbeddedResources() throws Exception {
355 TrackingHandler tracker = new TrackingHandler();
356 TikaInputStream tis = null;
357 ContainerExtractor ex = new ParserContainerExtractor();
358 try {
359 tis = TikaInputStream.get(getStream("test-documents/testRFC822-multipart"));
360 assertEquals(true, ex.isSupported(tis));
361 ex.extract(tis, ex, tracker);
362 } finally {
363 if (tis != null)
364 tis.close();
365 }
366
367 // Check we found all 3 parts
368 assertEquals(3, tracker.filenames.size());
369 assertEquals(3, tracker.mediaTypes.size());
370
371 // No filenames available
372 assertEquals(null, tracker.filenames.get(0));
373 assertEquals(null, tracker.filenames.get(1));
374 assertEquals(null, tracker.filenames.get(2));
375 // Types are available
376 assertEquals(MediaType.TEXT_PLAIN, tracker.mediaTypes.get(0));
377 assertEquals(MediaType.TEXT_HTML, tracker.mediaTypes.get(1));
378 assertEquals(MediaType.image("gif"), tracker.mediaTypes.get(2));
248379 }
249380
250381 private static InputStream getStream(String name) {
251 return Thread.currentThread().getContextClassLoader()
252 .getResourceAsStream(name);
253 }
254
382 InputStream stream = Thread.currentThread().getContextClassLoader()
383 .getResourceAsStream(name);
384 assertNotNull("Test file not found " + name, stream);
385 return stream;
386 }
255387 }
1515 */
1616 package org.apache.tika.parser.mat;
1717
18 //JDK imports
18 import static org.apache.tika.TikaTest.assertContains;
1919 import static org.junit.Assert.assertEquals;
20 import static org.junit.Assert.assertTrue;
2120
2221 import java.io.InputStream;
2322
24 //TIKA imports
2523 import org.apache.tika.metadata.Metadata;
2624 import org.apache.tika.parser.AutoDetectParser;
2725 import org.apache.tika.parser.ParseContext;
3129
3230 /**
3331 * Test cases to exercise the {@link MatParser}.
34 *
3532 */
3633 public class MatParserTest {
37
3834 @Test
3935 public void testParser() throws Exception {
40
4136 AutoDetectParser parser = new AutoDetectParser();
4237 ToXMLContentHandler handler = new ToXMLContentHandler();
4338 Metadata metadata = new Metadata();
5146 stream.close();
5247 }
5348
54 //Check Metadata
49 // Check Metadata
5550 assertEquals("PCWIN64", metadata.get("platform"));
5651 assertEquals("MATLAB 5.0 MAT-file", metadata.get("fileType"));
5752 assertEquals("IM", metadata.get("endian"));
5853 assertEquals("Thu Feb 21 15:52:49 2013", metadata.get("createdOn"));
5954
60 //Check Content
55 // Check Content
6156 String content = handler.toString();
6257
63 assertTrue(content.contains("<li>[1x909 double array]</li>"));
64 assertTrue(content.contains("<p>c1:[1x1 struct array]</p>"));
65 assertTrue(content.contains("<li>[1024x1 double array]</li>"));
66 assertTrue(content.contains("<p>b1:[1x1 struct array]</p>"));
67 assertTrue(content.contains("<p>a1:[1x1 struct array]</p>"));
68 assertTrue(content.contains("<li>[1024x1261 double array]</li>"));
69 assertTrue(content.contains("<li>[1x1 double array]</li>"));
70 assertTrue(content.contains("</body></html>"));
58 assertContains("<li>[1x909 double array]</li>", content);
59 assertContains("<p>c1:[1x1 struct array]</p>", content);
60 assertContains("<li>[1024x1 double array]</li>", content);
61 assertContains("<p>b1:[1x1 struct array]</p>", content);
62 assertContains("<p>a1:[1x1 struct array]</p>", content);
63 assertContains("<li>[1024x1261 double array]</li>", content);
64 assertContains("<li>[1x1 double array]</li>", content);
65 assertContains("</body></html>", content);
7166 }
7267
7368 @Test
7469 public void testParserForText() throws Exception {
75
7670 Parser parser = new MatParser();
7771 ToXMLContentHandler handler = new ToXMLContentHandler();
7872 Metadata metadata = new Metadata();
8680 stream.close();
8781 }
8882
89 //Check Content
83 // Check Content
9084 String content = handler.toString();
91 assertTrue(content.contains("<p>double:[2x2 double array]</p>"));
85 assertContains("<p>double:[2x2 double array]</p>", content);
9286 }
93
9487 }
1616 package org.apache.tika.parser.mbox;
1717
1818 import static junit.framework.Assert.assertEquals;
19 import static org.junit.Assert.assertTrue;
19 import static org.apache.tika.TikaTest.assertContains;
2020
2121 import java.io.InputStream;
2222 import java.util.Map;
6363 }
6464
6565 String content = handler.toString();
66 assertTrue(content.contains("Test content 1"));
67 assertTrue(content.contains("Test content 2"));
66 assertContains("Test content 1", content);
67 assertContains("Test content 2", content);
6868 assertEquals("application/mbox", metadata.get(Metadata.CONTENT_TYPE));
6969
7070 Map<Integer, Metadata> mailsMetadata = mboxParser.getTrackingMetadata();
9191 stream.close();
9292 }
9393
94 assertTrue(handler.toString().contains("Test content"));
94 assertContains("Test content", handler.toString());
9595 assertEquals("Number of mails", 1, mboxParser.getTrackingMetadata().size());
9696
9797 Metadata mailMetadata = mboxParser.getTrackingMetadata().get(0);
135135 stream.close();
136136 }
137137
138 assertTrue(handler.toString().contains("Test content"));
139 assertTrue(handler.toString().contains("> quoted stuff"));
138 assertContains("Test content", handler.toString());
139 assertContains("> quoted stuff", handler.toString());
140140 }
141141
142142 @Test
160160 assertEquals("Jothi Padmanabhan <jothipn@yahoo-inc.com>", firstMail.get(TikaCoreProperties.CREATOR));
161161 assertEquals("core-user@hadoop.apache.org", firstMail.get(Metadata.MESSAGE_RECIPIENT_ADDRESS));
162162
163 assertTrue(handler.toString().contains("When a Mapper completes"));
163 assertContains("When a Mapper completes", handler.toString());
164164 }
165165
166166 private static InputStream getStream(String name) {
4949
5050 @Test
5151 public void testParse() throws Exception {
52 OutlookPSTParser pstParser = new OutlookPSTParser();
52 Parser pstParser = new AutoDetectParser();
5353 Metadata metadata = new Metadata();
5454 ContentHandler handler = new ToHTMLContentHandler();
5555
6767 }
6868 }
6969
70 protected TikaInputStream getTestFile(String filename) throws Exception {
70 protected static TikaInputStream getTestFile(String filename) throws Exception {
7171 URL input = AbstractPOIContainerExtractionTest.class.getResource(
7272 "/test-documents/" + filename);
7373 assertNotNull(filename + " not found", input);
1515 */
1616 package org.apache.tika.parser.microsoft;
1717
18 import static org.apache.tika.TikaTest.assertContains;
19 import static org.apache.tika.TikaTest.assertNotContained;
1820 import static org.junit.Assert.assertEquals;
19 import static org.junit.Assert.assertFalse;
2021 import static org.junit.Assert.assertTrue;
22 import static org.junit.Assert.fail;
2123
2224 import java.io.InputStream;
2325 import java.util.Locale;
2426
2527 import org.apache.tika.detect.DefaultDetector;
2628 import org.apache.tika.detect.Detector;
29 import org.apache.tika.exception.EncryptedDocumentException;
2730 import org.apache.tika.metadata.Metadata;
31 import org.apache.tika.metadata.Office;
2832 import org.apache.tika.metadata.OfficeOpenXMLExtended;
2933 import org.apache.tika.metadata.TikaCoreProperties;
3034 import org.apache.tika.mime.MediaType;
3135 import org.apache.tika.parser.AutoDetectParser;
3236 import org.apache.tika.parser.ParseContext;
37 import org.apache.tika.parser.PasswordProvider;
3338 import org.apache.tika.parser.microsoft.ooxml.OOXMLParser;
3439 import org.apache.tika.sax.BodyContentHandler;
3540 import org.junit.Test;
3641 import org.xml.sax.ContentHandler;
3742
3843 public class ExcelParserTest {
39
40 @Test
44 @Test
45 @SuppressWarnings("deprecation") // Checks legacy Tika-1.0 style metadata keys
4146 public void testExcelParser() throws Exception {
4247 InputStream input = ExcelParserTest.class.getResourceAsStream(
4348 "/test-documents/testEXCEL.xls");
6469 assertEquals("2007-10-01T16:31:43Z", metadata.get(Metadata.DATE));
6570
6671 String content = handler.toString();
67 assertTrue(content.contains("Sample Excel Worksheet"));
68 assertTrue(content.contains("Numbers and their Squares"));
69 assertTrue(content.contains("\t\tNumber\tSquare"));
70 assertTrue(content.contains("9"));
71 assertFalse(content.contains("9.0"));
72 assertTrue(content.contains("196"));
73 assertFalse(content.contains("196.0"));
72 assertContains("Sample Excel Worksheet", content);
73 assertContains("Numbers and their Squares", content);
74 assertContains("\t\tNumber\tSquare", content);
75 assertContains("9", content);
76 assertNotContained("9.0", content);
77 assertContains("196", content);
78 assertNotContained("196.0", content);
7479 } finally {
7580 input.close();
7681 }
9499 String content = handler.toString();
95100
96101 // Number #,##0.00
97 assertTrue(content.contains("1,599.99"));
98 assertTrue(content.contains("-1,599.99"));
102 assertContains("1,599.99", content);
103 assertContains("-1,599.99", content);
99104
100105 // Currency $#,##0.00;[Red]($#,##0.00)
101 assertTrue(content.contains("$1,599.99"));
102 assertTrue(content.contains("($1,599.99)"));
106 assertContains("$1,599.99", content);
107 assertContains("($1,599.99)", content);
103108
104109 // Scientific 0.00E+00
105110 // poi <=3.8beta1 returns 1.98E08, newer versions return 1.98+E08
107112 assertTrue(content.contains("-1.98E08") || content.contains("-1.98E+08"));
108113
109114 // Percentage.
110 assertTrue(content.contains("2.50%"));
115 assertContains("2.50%", content);
111116 // Excel rounds up to 3%, but that requires Java 1.6 or later
112117 if(System.getProperty("java.version").startsWith("1.5")) {
113 assertTrue(content.contains("2%"));
118 assertContains("2%", content);
114119 } else {
115 assertTrue(content.contains("3%"));
120 assertContains("3%", content);
116121 }
117122
118123 // Time Format: h:mm
119 assertTrue(content.contains("6:15"));
120 assertTrue(content.contains("18:15"));
124 assertContains("6:15", content);
125 assertContains("18:15", content);
121126
122127 // Date Format: d-mmm-yy
123 assertTrue(content.contains("17-May-07"));
128 assertContains("17-May-07", content);
124129
125130 // Date Format: m/d/yy
126 assertTrue(content.contains("10/3/09"));
131 assertContains("10/3/09", content);
127132
128133 // Date/Time Format: m/d/yy h:mm
129 assertTrue(content.contains("1/19/08 4:35"));
134 assertContains("1/19/08 4:35", content);
135
136 // Fraction (2.5): # ?/?
137 assertContains("2 1/2", content);
130138
131139
132140 // Below assertions represent outstanding formatting issues to be addressed
135143
136144 /*************************************************************************
137145 // Custom Number (0 "dollars and" .00 "cents")
138 assertTrue(content.contains("19 dollars and .99 cents"));
146 assertContains("19 dollars and .99 cents", content);
139147
140148 // Custom Number ("At" h:mm AM/PM "on" dddd mmmm d"," yyyy)
141 assertTrue(content.contains("At 4:20 AM on Thursday May 17, 2007"));
142
143 // Fraction (2.5): # ?/? (TODO Coming in POI 3.8 beta 6)
144 assertTrue(content.contains("2 1 / 2"));
149 assertContains("At 4:20 AM on Thursday May 17, 2007", content);
145150 **************************************************************************/
146151
152 } finally {
153 input.close();
154 }
155 }
156
157 @Test
158 public void testExcelParserPassword() throws Exception {
159 InputStream input = ExcelParserTest.class.getResourceAsStream(
160 "/test-documents/testEXCEL_protected_passtika.xls");
161 try {
162 Metadata metadata = new Metadata();
163 ContentHandler handler = new BodyContentHandler();
164 ParseContext context = new ParseContext();
165 context.set(Locale.class, Locale.US);
166 new OfficeParser().parse(input, handler, metadata, context);
167 fail("Document is encrypted, shouldn't parse");
168 } catch (EncryptedDocumentException e) {
169 // Good
170 } finally {
171 input.close();
172 }
173
174 // Try again, this time with the password
175 input = ExcelParserTest.class.getResourceAsStream(
176 "/test-documents/testEXCEL_protected_passtika.xls");
177 try {
178 Metadata metadata = new Metadata();
179 ContentHandler handler = new BodyContentHandler();
180 ParseContext context = new ParseContext();
181 context.set(Locale.class, Locale.US);
182 context.set(PasswordProvider.class, new PasswordProvider() {
183 @Override
184 public String getPassword(Metadata metadata) {
185 return "tika";
186 }
187 });
188 new OfficeParser().parse(input, handler, metadata, context);
189
190 assertEquals(
191 "application/vnd.ms-excel",
192 metadata.get(Metadata.CONTENT_TYPE));
193
194 assertEquals(null, metadata.get(TikaCoreProperties.TITLE));
195 assertEquals("Antoni", metadata.get(TikaCoreProperties.CREATOR));
196 assertEquals("2011-11-25T09:52:48Z", metadata.get(TikaCoreProperties.CREATED));
197
198 String content = handler.toString();
199 assertContains("This is an Encrypted Excel spreadsheet", content);
200 assertNotContained("9.0", content);
147201 } finally {
148202 input.close();
149203 }
170224 String content = handler.toString();
171225
172226 // The first sheet has a pie chart
173 assertTrue(content.contains("charttabyodawg"));
174 assertTrue(content.contains("WhamPuff"));
227 assertContains("charttabyodawg", content);
228 assertContains("WhamPuff", content);
175229
176230 // The second sheet has a bar chart and some text
177 assertTrue(content.contains("Sheet1"));
178 assertTrue(content.contains("Test Excel Spreasheet"));
179 assertTrue(content.contains("foo"));
180 assertTrue(content.contains("bar"));
181 assertTrue(content.contains("fizzlepuff"));
182 assertTrue(content.contains("whyaxis"));
183 assertTrue(content.contains("eksaxis"));
231 assertContains("Sheet1", content);
232 assertContains("Test Excel Spreasheet", content);
233 assertContains("foo", content);
234 assertContains("bar", content);
235 assertContains("fizzlepuff", content);
236 assertContains("whyaxis", content);
237 assertContains("eksaxis", content);
184238
185239 // The third sheet has some text
186 assertTrue(content.contains("Sheet2"));
187 assertTrue(content.contains("dingdong"));
240 assertContains("Sheet2", content);
241 assertContains("dingdong", content);
188242 } finally {
189243 input.close();
190244 }
205259 "application/vnd.ms-excel",
206260 metadata.get(Metadata.CONTENT_TYPE));
207261 String content = handler.toString();
208 assertTrue(content.contains("Number Formats"));
262 assertContains("Number Formats", content);
209263 } finally {
210264 input.close();
211265 }
223277 new OfficeParser().parse(input, handler, metadata, context);
224278
225279 String content = handler.toString();
226 assertTrue(content.contains("Microsoft Works"));
280 assertContains("Microsoft Works", content);
227281 } finally {
228282 input.close();
229283 }
276330 }
277331
278332 /**
279 * We don't currently support the old Excel 95 .xls file format,
280 * but we shouldn't break on these files either (TIKA-976)
333 * Excel 5 and 95 are older formats, and only get basic support
281334 */
282335 @Test
283336 public void testExcel95() throws Exception {
284337 Detector detector = new DefaultDetector();
285338 AutoDetectParser parser = new AutoDetectParser();
286
287 InputStream input = ExcelParserTest.class.getResourceAsStream(
288 "/test-documents/testEXCEL_95.xls");
289 Metadata m = new Metadata();
339 InputStream input;
340 MediaType type;
341 Metadata m;
342
343 // First try detection of Excel 5
344 m = new Metadata();
345 m.add(Metadata.RESOURCE_NAME_KEY, "excel_5.xls");
346 input = ExcelParserTest.class.getResourceAsStream("/test-documents/testEXCEL_5.xls");
347 try {
348 type = detector.detect(input, m);
349 assertEquals("application/vnd.ms-excel", type.toString());
350 } finally {
351 input.close();
352 }
353
354 // Now Excel 95
355 m = new Metadata();
290356 m.add(Metadata.RESOURCE_NAME_KEY, "excel_95.xls");
291
292 // Should be detected correctly
293 MediaType type = null;
357 input = ExcelParserTest.class.getResourceAsStream("/test-documents/testEXCEL_95.xls");
294358 try {
295 type = detector.detect(input, m);
296 assertEquals("application/vnd.ms-excel", type.toString());
297 } finally {
298 input.close();
299 }
300
301 // OfficeParser will claim to handle it
302 assertEquals(true, (new OfficeParser()).getSupportedTypes(new ParseContext()).contains(type));
303
304 // OOXMLParser won't handle it
305 assertEquals(false, (new OOXMLParser()).getSupportedTypes(new ParseContext()).contains(type));
306
307 // AutoDetectParser doesn't break on it
308 input = ExcelParserTest.class.getResourceAsStream("/test-documents/testEXCEL_95.xls");
309
310 try {
311 ContentHandler handler = new BodyContentHandler(-1);
312 ParseContext context = new ParseContext();
313 context.set(Locale.class, Locale.US);
314 parser.parse(input, handler, m, context);
315
316 String content = handler.toString();
317 assertEquals("", content);
318 } finally {
319 input.close();
320 }
359 type = detector.detect(input, m);
360 assertEquals("application/vnd.ms-excel", type.toString());
361 } finally {
362 input.close();
363 }
364
365 // OfficeParser can handle it
366 assertEquals(true, (new OfficeParser()).getSupportedTypes(new ParseContext()).contains(type));
367
368 // OOXMLParser won't handle it
369 assertEquals(false, (new OOXMLParser()).getSupportedTypes(new ParseContext()).contains(type));
370
371
372 // Parse the Excel 5 file
373 m = new Metadata();
374 input = ExcelParserTest.class.getResourceAsStream("/test-documents/testEXCEL_5.xls");
375 try {
376 ContentHandler handler = new BodyContentHandler(-1);
377 ParseContext context = new ParseContext();
378 context.set(Locale.class, Locale.US);
379 parser.parse(input, handler, m, context);
380
381 String content = handler.toString();
382
383 // Sheet names
384 assertContains("Feuil1", content);
385 assertContains("Feuil3", content);
386
387 // Text
388 assertContains("Sample Excel", content);
389 assertContains("Number", content);
390
391 // Numbers
392 assertContains("15", content);
393 assertContains("225", content);
394
395 // Metadata was also fetched
396 assertEquals("Simple Excel document", m.get(TikaCoreProperties.TITLE));
397 assertEquals("Keith Bennett", m.get(TikaCoreProperties.CREATOR));
398 } finally {
399 input.close();
400 }
401
402 // Parse the Excel 95 file
403 m = new Metadata();
404 input = ExcelParserTest.class.getResourceAsStream("/test-documents/testEXCEL_95.xls");
405 try {
406 ContentHandler handler = new BodyContentHandler(-1);
407 ParseContext context = new ParseContext();
408 context.set(Locale.class, Locale.US);
409 parser.parse(input, handler, m, context);
410
411 String content = handler.toString();
412
413 // Sheet name
414 assertContains("Foglio1", content);
415
416 // Very boring file, no actual text or numbers!
417
418 // Metadata was also fetched
419 assertEquals(null, m.get(TikaCoreProperties.TITLE));
420 assertEquals("Marco Quaranta", m.get(Office.LAST_AUTHOR));
421 } finally {
422 input.close();
423 }
321424 }
322425
323426 /**
0 /*
1 * Licensed to the Apache Software Foundation (ASF) under one or more
2 * contributor license agreements. See the NOTICE file distributed with
3 * this work for additional information regarding copyright ownership.
4 * The ASF licenses this file to You under the Apache License, Version 2.0
5 * (the "License"); you may not use this file except in compliance with
6 * the License. You may obtain a copy of the License at
7 *
8 * http://www.apache.org/licenses/LICENSE-2.0
9 *
10 * Unless required by applicable law or agreed to in writing, software
11 * distributed under the License is distributed on an "AS IS" BASIS,
12 * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
13 * See the License for the specific language governing permissions and
14 * limitations under the License.
15 */
16 package org.apache.tika.parser.microsoft;
17
18 import static org.apache.tika.parser.microsoft.AbstractPOIContainerExtractionTest.getTestFile;
19 import static org.junit.Assert.assertEquals;
20
21 import org.apache.tika.TikaTest;
22 import org.apache.tika.detect.DefaultDetector;
23 import org.apache.tika.detect.Detector;
24 import org.apache.tika.io.TikaInputStream;
25 import org.apache.tika.metadata.Metadata;
26 import org.apache.tika.metadata.TikaCoreProperties;
27 import org.apache.tika.mime.MediaType;
28 import org.apache.tika.parser.ParseContext;
29 import org.apache.tika.sax.BodyContentHandler;
30 import org.junit.Ignore;
31 import org.junit.Test;
32 import org.xml.sax.ContentHandler;
33
34 /**
35 * Tests for the Old Excel (2-4) parser
36 */
37 public class OldExcelParserTest extends TikaTest {
38 private static final String file = "testEXCEL_4.xls";
39
40 @Test
41 public void testDetection() throws Exception {
42 TikaInputStream stream = getTestFile(file);
43 Detector detector = new DefaultDetector();
44 try {
45 assertEquals(
46 MediaType.application("vnd.ms-excel.sheet.4"),
47 detector.detect(stream, new Metadata()));
48 } finally {
49 stream.close();
50 }
51 }
52
53 // Disabled, until we can get the POI code to tell us the version
54 @Test
55 @Ignore
56 public void testMetadata() throws Exception {
57 TikaInputStream stream = getTestFile(file);
58
59 Metadata metadata = new Metadata();
60 ContentHandler handler = new BodyContentHandler();
61
62 OldExcelParser parser = new OldExcelParser();
63 parser.parse(stream, handler, metadata, new ParseContext());
64
65 // We can get the content type
66 assertEquals("application/vnd.ms-excel.sheet.4", metadata.get(Metadata.CONTENT_TYPE));
67
68 // But no other metadata
69 assertEquals(null, metadata.get(TikaCoreProperties.TITLE));
70 assertEquals(null, metadata.get(Metadata.SUBJECT));
71 }
72
73 /**
74 * Check we can get the plain text properly
75 */
76 @Test
77 public void testPlainText() throws Exception {
78 ContentHandler handler = new BodyContentHandler();
79 Metadata metadata = new Metadata();
80
81 TikaInputStream stream = getTestFile(file);
82 try {
83 new OldExcelParser().parse(stream, handler, metadata, new ParseContext());
84 } finally {
85 stream.close();
86 }
87
88 String text = handler.toString();
89
90 // Check we find a few words we expect in there
91 assertContains("Size", text);
92 assertContains("Returns", text);
93
94 // Check we find a few numbers we expect in there
95 assertContains("11", text);
96 assertContains("784", text);
97 }
98
99 /**
100 * Check the HTML version comes through correctly
101 */
102 @Test
103 public void testHTML() throws Exception {
104 XMLResult result = getXML(file);
105 String xml = result.xml;
106
107 // Sheet name not found - only Excel 5+ files have sheet names
108 assertNotContained("<p>Sheet 1</p>", xml);
109
110 // String cells
111 assertContains("<p>Table 10 -", xml);
112 assertContains("<p>Tax</p>", xml);
113 assertContains("<p>N/A</p>", xml);
114
115 // Number cells
116 assertContains("<p>(1)</p>", xml);
117 assertContains("<p>5.0</p>", xml);
118 }
119 }
1515 */
1616 package org.apache.tika.parser.microsoft;
1717
18 import static org.apache.tika.TikaTest.assertContains;
1819 import static org.junit.Assert.assertEquals;
1920 import static org.junit.Assert.assertFalse;
2021 import static org.junit.Assert.assertTrue;
7980 metadata.get(TikaCoreProperties.CREATED));
8081
8182 String content = handler.toString();
82 assertTrue(content.contains(""));
83 assertTrue(content.contains("Microsoft Outlook Express 6"));
84 assertTrue(content.contains("L'\u00C9quipe Microsoft Outlook Express"));
85 assertTrue(content.contains("Nouvel utilisateur de Outlook Express"));
86 assertTrue(content.contains("Messagerie et groupes de discussion"));
83 assertContains("Microsoft Outlook Express 6", content);
84 assertContains("L'\u00C9quipe Microsoft Outlook Express", content);
85 assertContains("Nouvel utilisateur de Outlook Express", content);
86 assertContains("Messagerie et groupes de discussion", content);
8787 }
8888
8989 /**
143143 metadata.get(TikaCoreProperties.TITLE));
144144
145145 String content = handler.toString();
146 assertTrue(content.contains("Outlook 2003"));
147 assertTrue(content.contains("Streamlined Mail Experience"));
148 assertTrue(content.contains("Navigation Pane"));
146 assertContains("Outlook 2003", content);
147 assertContains("Streamlined Mail Experience", content);
148 assertContains("Navigation Pane", content);
149149 }
150150
151151 @Test
173173 // As the HTML version should have been processed, ensure
174174 // we got some of the links
175175 String content = sw.toString();
176 assertTrue(content.contains("<dd>tests.chang@fengttt.com</dd>"));
177 assertTrue(content.contains("<p>Alfresco MSG format testing"));
178 assertTrue(content.contains("<li>1"));
179 assertTrue(content.contains("<li>2"));
176 assertContains("<dd>tests.chang@fengttt.com</dd>", content);
177 assertContains("<p>Alfresco MSG format testing", content);
178 assertContains("<li>1", content);
179 assertContains("<li>2", content);
180180
181181 // Make sure we don't have nested html docs
182182 assertEquals(2, content.split("<body>").length);
236236 // As the HTML version should have been processed, ensure
237237 // we got some of the links
238238 String content = sw.toString().replaceAll("<p>\\s+","<p>");
239 assertTrue(content.contains("<dd>New Outlook User</dd>"));
240 assertTrue(content.contains("designed <i>to help you"));
241 assertTrue(content.contains("<p><a href=\"http://r.office.microsoft.com/r/rlidOutlookWelcomeMail10?clid=1033\">Cached Exchange Mode</a>"));
239 assertContains("<dd>New Outlook User</dd>", content);
240 assertContains("designed <i>to help you", content);
241 assertContains("<p><a href=\"http://r.office.microsoft.com/r/rlidOutlookWelcomeMail10?clid=1033\">Cached Exchange Mode</a>", content);
242242
243243 // Link - check text around it, and the link itself
244 assertTrue(content.contains("sign up for a free subscription"));
245 assertTrue(content.contains("Office Newsletter"));
246 assertTrue(content.contains("newsletter will be sent to you"));
247 assertTrue(content.contains("http://r.office.microsoft.com/r/rlidNewsletterSignUp?clid=1033"));
244 assertContains("sign up for a free subscription", content);
245 assertContains("Office Newsletter", content);
246 assertContains("newsletter will be sent to you", content);
247 assertContains("http://r.office.microsoft.com/r/rlidNewsletterSignUp?clid=1033", content);
248248
249249 // Make sure we don't have nested html docs
250250 assertEquals(2, content.split("<body>").length);
251 //assertEquals(2, content.split("<\\/body>").length); // TODO Fix
251 assertEquals(2, content.split("<\\/body>").length);
252252 }
253253 }
235235
236236
237237 // PowerPoint with excel and word
238 // TODO
238 handler = process("testPPT_embeded.ppt", extractor, false);
239 assertEquals(7, handler.filenames.size());
240 assertEquals(7, handler.mediaTypes.size());
241
242 // We don't get very helpful filenames
243 assertEquals("1", handler.filenames.get(0));
244 assertEquals("2", handler.filenames.get(1));
245 assertEquals(null, handler.filenames.get(2));
246 assertEquals(null, handler.filenames.get(3));
247 assertEquals(null, handler.filenames.get(4));
248 assertEquals(null, handler.filenames.get(5));
249 assertEquals(null, handler.filenames.get(6));
250 // But we do know their types
251 assertEquals(TYPE_XLS, handler.mediaTypes.get(0)); // Embedded office doc
252 assertEquals(TYPE_DOC, handler.mediaTypes.get(1)); // Embedded office doc
253 assertEquals(TYPE_EMF, handler.mediaTypes.get(2)); // Icon of embedded office doc
254 assertEquals(TYPE_EMF, handler.mediaTypes.get(3)); // Icon of embedded office doc
255 assertEquals(TYPE_PNG, handler.mediaTypes.get(4)); // Embedded image
256 assertEquals(TYPE_PNG, handler.mediaTypes.get(5)); // Embedded image
257 assertEquals(TYPE_PNG, handler.mediaTypes.get(6)); // Embedded image
258
259 // Run again on PowerPoint but with recursion
260 handler = process("testPPT_embeded.ppt", extractor, true);
261 assertEquals(11, handler.filenames.size());
262 assertEquals(11, handler.mediaTypes.size());
263
264 assertEquals("1", handler.filenames.get(0));
265 assertEquals(null, handler.filenames.get(1));
266 assertEquals("2", handler.filenames.get(2));
267 assertEquals("image1.png", handler.filenames.get(3));
268 assertEquals("image2.jpg", handler.filenames.get(4));
269 assertEquals("image3.png", handler.filenames.get(5));
270 assertEquals(null, handler.filenames.get(6));
271 assertEquals(null, handler.filenames.get(7));
272 assertEquals(null, handler.filenames.get(8));
273 assertEquals(null, handler.filenames.get(9));
274 assertEquals(null, handler.filenames.get(10));
275
276 assertEquals(TYPE_XLS, handler.mediaTypes.get(0)); // Embedded office doc
277 assertEquals(TYPE_PNG, handler.mediaTypes.get(1)); // PNG inside .xls
278 assertEquals(TYPE_DOC, handler.mediaTypes.get(2)); // Embedded office doc
279 assertEquals(TYPE_PNG, handler.mediaTypes.get(3)); // PNG inside .docx
280 assertEquals(TYPE_JPG, handler.mediaTypes.get(4)); // JPG inside .docx
281 assertEquals(TYPE_PNG, handler.mediaTypes.get(5)); // PNG inside .docx
282 assertEquals(TYPE_EMF, handler.mediaTypes.get(6)); // Icon of embedded office doc
283 assertEquals(TYPE_EMF, handler.mediaTypes.get(7)); // Icon of embedded office doc
284 assertEquals(TYPE_PNG, handler.mediaTypes.get(8)); // Embedded image
285 assertEquals(TYPE_PNG, handler.mediaTypes.get(9)); // Embedded image
286 assertEquals(TYPE_PNG, handler.mediaTypes.get(10)); // Embedded image
239287
240288
241289 // Word, with a non-office file (PDF)
4949 assertEquals("Keith Bennett", metadata.get(TikaCoreProperties.CREATOR));
5050 assertEquals("Keith Bennett", metadata.get(Metadata.AUTHOR));
5151 String content = handler.toString();
52 assertTrue(content.contains("Sample Powerpoint Slide"));
53 assertTrue(content.contains("Powerpoint X for Mac"));
52 assertContains("Sample Powerpoint Slide", content);
53 assertContains("Powerpoint X for Mac", content);
5454 } finally {
5555 input.close();
5656 }
9191 for(int row=1;row<=3;row++) {
9292 //assertContains("·\tBullet " + row, content);
9393 //assertContains("\u00b7\tBullet " + row, content);
94 // TODO OfficeParser fails to extract the bullet symbol
9495 assertContains("Bullet " + row, content);
9596 }
9697 assertContains("Here is a numbered list:", content);
9798 for(int row=1;row<=3;row++) {
9899 //assertContains(row + ")\tNumber bullet " + row, content);
99100 //assertContains(row + ") Number bullet " + row, content);
100 // TODO: OOXMLExtractor fails to number the bullets:
101 // TODO: OfficeParser fails to number the bullets:
101102 assertContains("Number bullet " + row, content);
102103 }
103104
104105 for(int row=1;row<=2;row++) {
105106 for(int col=1;col<=3;col++) {
106 // TODO Work out why the upgrade to POI 3.9 broke this test (table text)
107 // assertContains("Row " + row + " Col " + col, content);
107 assertContains("Row " + row + " Col " + col, content);
108108 }
109109 }
110110
152152 assertEquals(-1, content.indexOf("*"));
153153 }
154154
155 // TODO: once we fix TIKA-712, re-enable this
155 /**
156 * TIKA-712 Master Slide Text from PPT and PPTX files
157 * should be extracted too
158 */
156159 @Test
157160 public void testMasterText() throws Exception {
158161 ContentHandler handler = new BodyContentHandler();
176179 assertEquals(-1, content.indexOf("*"));
177180 }
178181
179 // TODO: once we fix TIKA-712, re-enable this
180182 @Test
181183 public void testMasterText2() throws Exception {
182184 ContentHandler handler = new BodyContentHandler();
1515 */
1616 package org.apache.tika.parser.microsoft;
1717
18 import static org.apache.tika.TikaTest.assertContains;
1819 import static org.junit.Assert.assertEquals;
19 import static org.junit.Assert.assertTrue;
2020
2121 import java.io.InputStream;
2222
4545 assertEquals("Nick Burch", metadata.get(TikaCoreProperties.CREATOR));
4646 assertEquals("Nick Burch", metadata.get(Metadata.AUTHOR));
4747 String content = handler.toString();
48 assertTrue(content.contains("0123456789"));
49 assertTrue(content.contains("abcdef"));
48 assertContains("0123456789", content);
49 assertContains("abcdef", content);
5050 } finally {
5151 input.close();
5252 }
1515 */
1616 package org.apache.tika.parser.microsoft;
1717
18 import static org.apache.tika.TikaTest.assertContains;
1819 import static org.junit.Assert.assertEquals;
19 import static org.junit.Assert.assertTrue;
2020
2121 import java.io.InputStream;
2222
4444 assertEquals("", metadata.get(TikaCoreProperties.TITLE));
4545 assertEquals("Hogwarts", metadata.get(TikaCoreProperties.CREATOR));
4646 String content = handler.toString();
47 assertTrue(content.contains("Some random text, on a page"));
47 assertContains("Some random text, on a page", content);
4848 } finally {
4949 input.close();
5050 }
3232 import org.apache.tika.metadata.TikaCoreProperties;
3333 import org.apache.tika.parser.ParseContext;
3434 import org.apache.tika.sax.BodyContentHandler;
35 import org.junit.Ignore;
3536 import org.junit.Test;
3637 import org.xml.sax.ContentHandler;
3738
5253 assertEquals("Sample Word Document", metadata.get(TikaCoreProperties.TITLE));
5354 assertEquals("Keith Bennett", metadata.get(TikaCoreProperties.CREATOR));
5455 assertEquals("Keith Bennett", metadata.get(Metadata.AUTHOR));
55 assertTrue(handler.toString().contains("Sample Word Document"));
56 assertContains("Sample Word Document", handler.toString());
5657 } finally {
5758 input.close();
5859 }
6768 Metadata metadata = new Metadata();
6869 new OfficeParser().parse(input, handler, metadata, new ParseContext());
6970
70 assertTrue(handler.toString().contains("MSj00974840000[1].wav"));
71 assertContains("MSj00974840000[1].wav", handler.toString());
7172 } finally {
7273 input.close();
7374 }
191192 assertEquals("Gym class featuring a brown fox and lazy dog", metadata.get(Metadata.SUBJECT));
192193 assertEquals("Nevin Nollop", metadata.get(TikaCoreProperties.CREATOR));
193194 assertEquals("Nevin Nollop", metadata.get(Metadata.AUTHOR));
194 assertTrue(handler.toString().contains("The quick brown fox jumps over the lazy dog"));
195 assertContains("The quick brown fox jumps over the lazy dog", handler.toString());
195196 } finally {
196197 input.close();
197198 }
396397 assertContains("<p>1. Organisering av vakten:</p>", xml);
397398
398399 }
400
401 @Test
402 public void testHyperlinkStringIOOBESmartQuote() throws Exception {
403 //TIKA-1512, one cause: closing double quote is a smart quote
404 //test file contributed by user
405 XMLResult result = getXML("testWORD_closingSmartQInHyperLink.doc");
406 assertContains("href=\"https://issues.apache.org/jira/browse/TIKA-1512", result.xml);
407 }
408
409 @Test
410 @Ignore //until we determine whether we can include test docs or not
411 public void testHyperlinkStringLongNoCloseQuote() throws Exception {
412 //TIKA-1512, one cause: no closing quote on really long string
413 //test file derived from govdocs1 012152.doc
414 XMLResult result = getXML("testWORD_longHyperLinkNoCloseQuote.doc");
415 assertContains("href=\"http://www.lexis.com", result.xml);
416 }
417
418 @Test
419 @Ignore //until we determine whether we can include test docs or not
420 public void testHyperlinkStringLongCarriageReturn() throws Exception {
421 //TIKA-1512, one cause: no closing quote, but carriage return
422 //test file derived from govdocs1 040044.doc
423 XMLResult result = getXML("testWORD_hyperLinkCarriageReturn.doc");
424 assertContains("href=\"http://www.nib.org", result.xml);
425 }
399426 }
1515 */
1616 package org.apache.tika.parser.microsoft;
1717
18 import static org.junit.Assert.assertTrue;
18 import static org.apache.tika.TikaTest.assertContains;
19
20 import java.io.InputStream;
1921
2022 import org.apache.tika.metadata.Metadata;
2123 import org.apache.tika.parser.ParseContext;
2224 import org.apache.tika.sax.BodyContentHandler;
2325 import org.junit.Test;
2426 import org.xml.sax.ContentHandler;
25
26 import java.io.InputStream;
2727
2828 public class WriteProtectedParserTest {
2929
3636 ContentHandler handler = new BodyContentHandler();
3737 new OfficeParser().parse(input, handler, metadata, new ParseContext());
3838 String content = handler.toString();
39 assertTrue(content.contains("Office"));
39 assertContains("Office", content);
4040 }
4141 }
1515 */
1616 package org.apache.tika.parser.microsoft.ooxml;
1717
18 import static org.junit.Assert.assertEquals;
19 import static org.junit.Assert.assertFalse;
20 import static org.junit.Assert.assertTrue;
18 import javax.xml.transform.OutputKeys;
19 import javax.xml.transform.sax.SAXTransformerFactory;
20 import javax.xml.transform.sax.TransformerHandler;
21 import javax.xml.transform.stream.StreamResult;
2122
2223 import java.io.ByteArrayOutputStream;
2324 import java.io.InputStream;
2425 import java.io.PrintStream;
2526 import java.io.StringWriter;
27 import java.util.HashMap;
2628 import java.util.Locale;
27
28 import javax.xml.transform.OutputKeys;
29 import javax.xml.transform.sax.SAXTransformerFactory;
30 import javax.xml.transform.sax.TransformerHandler;
31 import javax.xml.transform.stream.StreamResult;
29 import java.util.Map;
3230
3331 import org.apache.tika.TikaTest;
32 import org.apache.tika.exception.EncryptedDocumentException;
33 import org.apache.tika.io.IOUtils;
3434 import org.apache.tika.io.TikaInputStream;
3535 import org.apache.tika.metadata.Metadata;
3636 import org.apache.tika.metadata.Office;
4141 import org.apache.tika.parser.AutoDetectParser;
4242 import org.apache.tika.parser.ParseContext;
4343 import org.apache.tika.parser.Parser;
44 import org.apache.tika.parser.PasswordProvider;
4445 import org.apache.tika.parser.microsoft.WordParserTest;
4546 import org.apache.tika.sax.BodyContentHandler;
47 import org.junit.Ignore;
4648 import org.junit.Test;
4749 import org.xml.sax.ContentHandler;
50
51 import static org.junit.Assert.assertEquals;
52 import static org.junit.Assert.assertTrue;
4853
4954 public class OOXMLParserTest extends TikaTest {
5055
7277 assertEquals("Simple Excel document", metadata.get(TikaCoreProperties.TITLE));
7378 assertEquals("Keith Bennett", metadata.get(TikaCoreProperties.CREATOR));
7479 assertEquals("Keith Bennett", metadata.get(Metadata.AUTHOR));
80
7581 String content = handler.toString();
76 assertTrue(content.contains("Sample Excel Worksheet"));
77 assertTrue(content.contains("Numbers and their Squares"));
78 assertTrue(content.contains("9"));
79 assertFalse(content.contains("9.0"));
80 assertTrue(content.contains("196"));
81 assertFalse(content.contains("196.0"));
82 assertContains("Sample Excel Worksheet", content);
83 assertContains("Numbers and their Squares", content);
84 assertContains("9", content);
85 assertNotContained("9.0", content);
86 assertContains("196", content);
87 assertNotContained("196.0", content);
8288 assertEquals("false", metadata.get(TikaMetadataKeys.PROTECTED));
8389 } finally {
8490 input.close();
103109 String content = handler.toString();
104110
105111 // Number #,##0.00
106 assertTrue(content.contains("1,599.99"));
107 assertTrue(content.contains("-1,599.99"));
112 assertContains("1,599.99", content);
113 assertContains("-1,599.99", content);
108114
109115 // Currency $#,##0.00;[Red]($#,##0.00)
110 assertTrue(content.contains("$1,599.99"));
111 assertTrue(content.contains("$1,599.99)"));
112
113 // Scientific 0.00E+00
114 // poi <=3.8beta1 returns 1.98E08, newer versions return 1.98+E08
115 assertTrue(content.contains("1.98E08") || content.contains("1.98E+08"));
116 assertTrue(content.contains("-1.98E08") || content.contains("-1.98E+08"));
116 assertContains("$1,599.99", content);
117 assertContains("$1,599.99)", content);
118
119 // Scientific 0.00E+00
120 // poi <=3.8beta1 returns 1.98E08, newer versions return 1.98E+08
121 assertTrue(content.contains("1.98E08") || content.contains("1.98E+08"));
122 assertTrue(content.contains("-1.98E08") || content.contains("-1.98E+08"));
117123
118124 // Percentage
119 assertTrue(content.contains("2.50%"));
125 assertContains("2.50%", content);
120126 // Excel rounds up to 3%, but that requires Java 1.6 or later
121127 if(System.getProperty("java.version").startsWith("1.5")) {
122 assertTrue(content.contains("2%"));
128 assertContains("2%", content);
123129 } else {
124 assertTrue(content.contains("3%"));
130 assertContains("3%", content);
125131 }
126132
127133 // Time Format: h:mm
128 assertTrue(content.contains("6:15"));
129 assertTrue(content.contains("18:15"));
134 assertContains("6:15", content);
135 assertContains("18:15", content);
130136
131137 // Date Format: d-mmm-yy
132 assertTrue(content.contains("17-May-07"));
138 assertContains("17-May-07", content);
133139
134140 // Currency $#,##0.00;[Red]($#,##0.00)
135 assertTrue(content.contains("$1,599.99"));
136 assertTrue(content.contains("($1,599.99)"));
141 assertContains("$1,599.99", content);
142 assertContains("($1,599.99)", content);
143
144 // Fraction (2.5): # ?/?
145 assertContains("2 1/2", content);
137146
138147 // Below assertions represent outstanding formatting issues to be addressed
139148 // they are included to allow the issues to be progressed with the Apache POI
141150
142151 /*************************************************************************
143152 // Date Format: m/d/yy
144 assertTrue(content.contains("03/10/2009"));
153 assertContains("03/10/2009", content);
145154
146155 // Date/Time Format
147 assertTrue(content.contains("19/01/2008 04:35"));
156 assertContains("19/01/2008 04:35", content);
148157
149158 // Custom Number (0 "dollars and" .00 "cents")
150 assertTrue(content.contains("19 dollars and .99 cents"));
159 assertContains("19 dollars and .99 cents", content);
151160
152161 // Custom Number ("At" h:mm AM/PM "on" dddd mmmm d"," yyyy)
153 assertTrue(content.contains("At 4:20 AM on Thursday May 17, 2007"));
154
155 // Fraction (2.5): # ?/?
156 assertTrue(content.contains("2 1 / 2"));
162 assertContains("At 4:20 AM on Thursday May 17, 2007", content);
157163 **************************************************************************/
164 } finally {
165 input.close();
166 }
167 }
168
169 @Test
170 @Ignore("OOXML-Strict not currently supported by POI, see #57699")
171 public void testExcelStrict() throws Exception {
172 Metadata metadata = new Metadata();
173 ContentHandler handler = new BodyContentHandler();
174 ParseContext context = new ParseContext();
175 context.set(Locale.class, Locale.US);
176
177 InputStream input = getTestDocument("testEXCEL.strict.xlsx");
178 try {
179 parser.parse(input, handler, metadata, context);
180
181 assertEquals(
182 "application/vnd.openxmlformats-officedocument.spreadsheetml.sheet",
183 metadata.get(Metadata.CONTENT_TYPE));
184 assertEquals("Sample Spreadsheet", metadata.get(TikaCoreProperties.TITLE));
185 assertEquals("Nick Burch", metadata.get(TikaCoreProperties.CREATOR));
186 assertEquals("Spreadsheet for testing", metadata.get(TikaCoreProperties.DESCRIPTION));
187
188 String content = handler.toString();
189 assertContains("Test spreadsheet", content);
190 assertContains("This one is red", content);
191 assertContains("cb=10", content);
192 assertNotContained("10.0", content);
193 assertContains("cb=sum", content);
194 assertNotContained("13.0", content);
195 assertEquals("false", metadata.get(TikaMetadataKeys.PROTECTED));
158196 } finally {
159197 input.close();
160198 }
186224
187225 Parser parser = new AutoDetectParser();
188226 Metadata metadata = new Metadata();
189 // TODO: should auto-detect without the resource name
190 metadata.set(Metadata.RESOURCE_NAME_KEY, filename);
191227 ContentHandler handler = new BodyContentHandler();
192228 ParseContext context = new ParseContext();
193229
261297
262298 Parser parser = new AutoDetectParser();
263299 final Metadata metadata = new Metadata();
264 // TODO: should auto-detect without the resource name
265 metadata.set(Metadata.RESOURCE_NAME_KEY, filename);
266300
267301 // Allow the value to be accessed from the inner class
268302 final int currentI = i;
384418 */
385419 @Test
386420 public void testWordHTML() throws Exception {
387
388421 XMLResult result = getXML("testWORD.docx");
389422 String xml = result.xml;
390423 Metadata metadata = result.metadata;
527560 assertEquals("true", metadata.get(TikaMetadataKeys.PROTECTED));
528561
529562 String content = handler.toString();
530 assertTrue(content.contains("Office"));
563 assertContains("Office", content);
531564 } finally {
532565 input.close();
533566 }
547580 InputStream input = getTestDocument("NullHeader.docx");
548581 try {
549582 parser.parse(input, handler, metadata, context);
550 assertFalse(handler.toString().length()==0);
583 assertEquals("Should have found some text", false, handler.toString().isEmpty());
551584 } finally {
552585 input.close();
553586 }
714747 assertContains("Master footer is here", content);
715748 }
716749
717 // TODO: once we fix TIKA-712, re-enable this
718 /*
750 /**
751 * TIKA-712 Master Slide Text from PPT and PPTX files
752 * should be extracted too
753 */
754 @Test
719755 public void testMasterText() throws Exception {
720756 ContentHandler handler = new BodyContentHandler();
721757 Metadata metadata = new Metadata();
731767 String content = handler.toString();
732768 assertContains("Text that I added to the master slide", content);
733769 }
734 */
735
736 // TODO: once we fix TIKA-712, re-enable this
737 /*
770
771 @Test
738772 public void testMasterText2() throws Exception {
739773 ContentHandler handler = new BodyContentHandler();
740774 Metadata metadata = new Metadata();
750784 String content = handler.toString();
751785 assertContains("Text that I added to the master slide", content);
752786 }
753 */
754787
755788 @Test
756789 public void testWordArt() throws Exception {
9861019 /**
9871020 * Test for missing text described in
9881021 * <a href="https://issues.apache.org/jira/browse/TIKA-1130">TIKA-1130</a>.
1022 * and TIKA-1317
9891023 */
9901024 @Test
9911025 public void testMissingText() throws Exception {
9991033 assertEquals(
10001034 "application/vnd.openxmlformats-officedocument.wordprocessingml.document",
10011035 metadata.get(Metadata.CONTENT_TYPE));
1002 assertTrue(handler.toString().contains("BigCompany"));
1003 assertTrue(handler.toString().contains("Seasoned"));
1036 assertContains("BigCompany", handler.toString());
1037 assertContains("Seasoned", handler.toString());
1038 assertContains("Rich_text_in_cell", handler.toString());
10041039 } finally {
10051040 input.close();
10061041 }
10341069
10351070 //grab stderr
10361071 ByteArrayOutputStream errContent = new ByteArrayOutputStream();
1037 System.setErr(new PrintStream(errContent));
1072 System.setErr(new PrintStream(errContent, true, IOUtils.UTF_8.name()));
10381073 parser.parse(input, handler, metadata, context);
10391074
10401075 //restore stderr
10411076 System.setErr(origErr);
10421077
1043 String err = errContent.toString();
1078 String err = errContent.toString(IOUtils.UTF_8.name());
10441079 assertTrue(err.length() == 0);
10451080 input.close();
10461081 }
10581093
10591094 }
10601095
1061 //TIKA-1223
10621096 @Test
10631097 public void testDOCXThumbnail() throws Exception {
10641098 String xml = getXML("testDOCX_Thumbnail.docx").xml;
10651099 int a = xml.indexOf("This file contains a thumbnail");
1066 int b = xml.indexOf("<div class=\"embedded\" id=\"thumbnail_0.emf\" />");
1067 int c = xml.indexOf( "<div class=\"package-entry\"><h1>thumbnail_0.emf</h1></div>" );
1100 int b = xml.indexOf("<div class=\"embedded\" id=\"/docProps/thumbnail.emf\" />");
10681101
10691102 assertTrue(a != -1);
10701103 assertTrue(b != -1);
1071 assertTrue(c != -1);
10721104 assertTrue(a < b);
1073 assertTrue(b < c);
10741105 }
10751106
10761107 @Test
10771108 public void testXLSXThumbnail() throws Exception {
10781109 String xml = getXML("testXLSX_Thumbnail.xlsx").xml;
10791110 int a = xml.indexOf("This file contains an embedded thumbnail by default");
1080 int b = xml.indexOf("<div class=\"embedded\" id=\"thumbnail_0.wmf\" />");
1081 int c = xml.indexOf( "<div class=\"package-entry\"><h1>thumbnail_0.wmf</h1></div>" );
1111 int b = xml.indexOf("<div class=\"embedded\" id=\"/docProps/thumbnail.wmf\" />");
10821112
10831113 assertTrue(a != -1);
10841114 assertTrue(b != -1);
1085 assertTrue(c != -1);
10861115 assertTrue(a < b);
1087 assertTrue(b < c);
10881116 }
10891117
10901118 @Test
10911119 public void testPPTXThumbnail() throws Exception {
10921120 String xml = getXML("testPPTX_Thumbnail.pptx").xml;
10931121 int a = xml.indexOf("<body><p>This file contains an embedded thumbnail</p>");
1094 int b = xml.indexOf("<div class=\"embedded\" id=\"thumbnail_0.jpeg\" />");
1095 int c = xml.indexOf( "<div class=\"package-entry\"><h1>thumbnail_0.jpeg</h1></div>" );
1122 int b = xml.indexOf("<div class=\"embedded\" id=\"/docProps/thumbnail.jpeg\" />");
10961123
10971124 assertTrue(a != -1);
10981125 assertTrue(b != -1);
1099 assertTrue(c != -1);
11001126 assertTrue(a < b);
1101 assertTrue(b < c);
1127 }
1128
1129 @Test
1130 public void testEncrypted() throws Exception {
1131 Map<String, String> tests = new HashMap<String, String>();
1132 tests.put("testWORD_protected_passtika.docx",
1133 "This is an encrypted Word 2007 File");
1134 tests.put("testPPT_protected_passtika.pptx",
1135 "This is an encrypted PowerPoint 2007 slide.");
1136 tests.put("testEXCEL_protected_passtika.xlsx",
1137 "This is an Encrypted Excel spreadsheet.");
1138
1139 Parser parser = new AutoDetectParser();
1140 Metadata m = new Metadata();
1141 PasswordProvider passwordProvider = new PasswordProvider() {
1142 @Override
1143 public String getPassword(Metadata metadata) {
1144 return "tika";
1145 }
1146 };
1147 ParseContext passwordContext = new ParseContext();
1148 passwordContext.set(org.apache.tika.parser.PasswordProvider.class, passwordProvider);
1149
1150 for (Map.Entry<String, String> e : tests.entrySet()) {
1151 InputStream is = null;
1152 try {
1153 is = getTestDocument(e.getKey());
1154 ContentHandler handler = new BodyContentHandler();
1155 parser.parse(is, handler, m, passwordContext);
1156 assertContains(e.getValue(), handler.toString());
1157 } finally {
1158 IOUtils.closeQuietly(is);
1159 }
1160 }
1161
1162 ParseContext context = new ParseContext();
1163 //now try with no password
1164 for (Map.Entry<String, String> e : tests.entrySet()) {
1165 InputStream is = null;
1166 boolean exc = false;
1167 try {
1168 is = getTestDocument(e.getKey());
1169 ContentHandler handler = new BodyContentHandler();
1170 parser.parse(is, handler, m, context);
1171 } catch (EncryptedDocumentException ex) {
1172 exc = true;
1173 } finally {
1174 IOUtils.closeQuietly(is);
1175 }
1176 assertTrue(exc);
1177 }
1178
11021179 }
11031180 }
1181
0 package org.apache.tika.parser.mock;
1
2 /*
3 * Licensed to the Apache Software Foundation (ASF) under one or more
4 * contributor license agreements. See the NOTICE file distributed with
5 * this work for additional information regarding copyright ownership.
6 * The ASF licenses this file to You under the Apache License, Version 2.0
7 * (the "License"); you may not use this file except in compliance with
8 * the License. You may obtain a copy of the License at
9 *
10 * http://www.apache.org/licenses/LICENSE-2.0
11 *
12 * Unless required by applicable law or agreed to in writing, software
13 * distributed under the License is distributed on an "AS IS" BASIS,
14 * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
15 * See the License for the specific language governing permissions and
16 * limitations under the License.
17 */
18
19 import static junit.framework.TestCase.assertEquals;
20 import static junit.framework.TestCase.assertTrue;
21 import static org.junit.Assert.fail;
22
23 import java.io.ByteArrayOutputStream;
24 import java.io.IOException;
25 import java.io.InputStream;
26 import java.io.PrintStream;
27 import java.util.Date;
28
29 import org.apache.tika.TikaTest;
30 import org.apache.tika.exception.TikaException;
31 import org.apache.tika.io.IOUtils;
32 import org.apache.tika.metadata.Metadata;
33 import org.apache.tika.parser.AutoDetectParser;
34 import org.apache.tika.parser.Parser;
35 import org.junit.Test;
36
37 public class MockParserTest extends TikaTest {
38 private final static String M = "/test-documents/mock/";
39 private final static Parser PARSER = new AutoDetectParser();
40
41 @Override
42 public XMLResult getXML(String path, Metadata m) throws Exception {
43 //note: this override is specific to MockParserTest; it prepends M to the path
44 InputStream is = getResourceAsStream(M+path);
45 try {
46 return super.getXML(is, PARSER, m);
47 } finally {
48 IOUtils.closeQuietly(is);
49 }
50 }
51
52 @Test
53 public void testExample() throws Exception {
54 Metadata m = new Metadata();
55 PrintStream out = System.out;
56 PrintStream err = System.err;
57 ByteArrayOutputStream outBos = new ByteArrayOutputStream();
58 ByteArrayOutputStream errBos = new ByteArrayOutputStream();
59 PrintStream tmpOut = new PrintStream(outBos, true, IOUtils.UTF_8.name());
60 PrintStream tmpErr = new PrintStream(errBos, true, IOUtils.UTF_8.name());
61 System.setOut(tmpOut);
62 System.setErr(tmpErr);
63 try {
64 assertThrowable("example.xml", m, IOException.class, "not another IOException");
65 assertMockParser(m);
66 } finally {
67 System.setOut(out);
68 System.setErr(err);
69 }
70 String outString = new String(outBos.toByteArray(), IOUtils.UTF_8);
71 assertContains("writing to System.out", outString);
72
73 String errString = new String(errBos.toByteArray(), IOUtils.UTF_8);
74 assertContains("writing to System.err", errString);
75
76 }
77
78 @Test
79 public void testNothingBad() throws Exception {
80 Metadata m = new Metadata();
81 String content = getXML("nothing_bad.xml", m).xml;
82 assertEquals("Geoffrey Chaucer", m.get("author"));
83 assertContains("<p>And bathed every veyne in swich licour,</p>", content);
84 assertMockParser(m);
85 }
86
87 @Test
88 public void testNullPointer() throws Exception {
89 Metadata m = new Metadata();
90 assertThrowable("null_pointer.xml", m, NullPointerException.class, "another null pointer exception");
91 assertMockParser(m);
92 }
93
94 @Test
95 public void testNullPointerNoMsg() throws Exception {
96 Metadata m = new Metadata();
97 assertThrowable("null_pointer_no_msg.xml", m, NullPointerException.class, null);
98 assertMockParser(m);
99 }
100
101
102 @Test
103 public void testSleep() throws Exception {
104 long start = new Date().getTime();
105 Metadata m = new Metadata();
106 String content = getXML("sleep.xml", m).xml;
107 assertMockParser(m);
108 long elapsed = new Date().getTime()-start;
109 //should sleep for at least 3000
110 boolean enoughTimeHasElapsed = elapsed > 2000;
111 assertTrue("not enough time has elapsed: "+elapsed, enoughTimeHasElapsed);
112 assertMockParser(m);
113 }
114
115 @Test
116 public void testHeavyHang() throws Exception {
117 long start = new Date().getTime();
118 Metadata m = new Metadata();
119
120 String content = getXML("heavy_hang.xml", m).xml;
121 assertMockParser(m);
122 long elapsed = new Date().getTime()-start;
123 //should sleep for at least 3000
124 boolean enoughTimeHasElapsed = elapsed > 2000;
125 assertTrue("not enough time has elapsed: "+elapsed, enoughTimeHasElapsed);
126 assertMockParser(m);
127 }
128
129 @Test
130 public void testFakeOOM() throws Exception {
131 Metadata m = new Metadata();
132 assertThrowable("fake_oom.xml", m, OutOfMemoryError.class, "not another oom");
133 assertMockParser(m);
134 }
135
136 @Test
137 public void testRealOOM() throws Exception {
138 //Note: we're not actually testing the diff between fake and real oom
139 //i.e. by creating a child process and setting a different -Xmx or
140 //by doing memory profiling.
141 Metadata m = new Metadata();
142 assertThrowable("real_oom.xml", m, OutOfMemoryError.class, "Java heap space");
143 assertMockParser(m);
144 }
145
146 @Test
147 public void testInterruptibleSleep() {
148 //Without static initialization of the parser, it can take ~1 second after t.start()
149 //before the parser actually calls parse. This is
150 //just the time it takes to instantiate and call AutoDetectParser, do the detection, etc.
151 //This is not thread creation overhead.
152 ParserRunnable r = new ParserRunnable("sleep_interruptible.xml");
153 Thread t = new Thread(r);
154 t.start();
155 long start = new Date().getTime();
156 try {
157 Thread.sleep(1000);
158 } catch (InterruptedException e) {
159 //swallow
160 }
161
162 t.interrupt();
163
164 try {
165 t.join(10000);
166 } catch (InterruptedException e) {
167 //swallow
168 }
169 long elapsed = new Date().getTime()-start;
170 boolean shortEnough = elapsed < 2000;//the xml file specifies 3000
171 assertTrue("elapsed (" + elapsed + " millis) was not short enough", shortEnough);
172 }
173
174 @Test
175 public void testNonInterruptibleSleep() {
176 ParserRunnable r = new ParserRunnable("sleep_not_interruptible.xml");
177 Thread t = new Thread(r);
178 t.start();
179 long start = new Date().getTime();
180 try {
181 //make sure that the thread has actually started
182 Thread.sleep(1000);
183 } catch (InterruptedException e) {
184 //swallow
185 }
186 t.interrupt();
187 try {
188 t.join(20000);
189 } catch (InterruptedException e) {
190 //swallow
191 }
192 long elapsed = new Date().getTime()-start;
193 boolean longEnough = elapsed > 3000;//the xml file specifies a 3000ms sleep; this test waits only 1000ms before interrupting
194 assertTrue("elapsed ("+elapsed+" millis) was not long enough", longEnough);
195 }
196
197 private class ParserRunnable implements Runnable {
198 private final String path;
199 ParserRunnable(String path) {
200 this.path = path;
201 }
202 @Override
203 public void run() {
204 Metadata m = new Metadata();
205 try {
206 getXML(path, m);
207 } catch (Exception e) {
208 throw new RuntimeException(e);
209 } finally {
210 assertMockParser(m);
211 }
212 }
213 }
214
215 private void assertThrowable(String path, Metadata m, Class<? extends Throwable> expected, String message) {
216
217 try {
218 getXML(path, m);
219 } catch (Throwable t) {
220 //if this is a throwable wrapped in a TikaException, use the cause
221 if (t instanceof TikaException && t.getCause() != null) {
222 t = t.getCause();
223 }
224 if (!expected.isAssignableFrom(t.getClass())){
225 fail(t.getClass() +" is not assignable to "+expected);
226 }
227 if (message != null) {
228 assertEquals(message, t.getMessage());
229 }
230 }
231 }
232
233 private void assertMockParser(Metadata m) {
234 String[] parsers = m.getValues("X-Parsed-By");
235 //make sure that it was actually parsed by mock.
236 boolean parsedByMock = false;
237 for (String parser : parsers) {
238 if (parser.equals("org.apache.tika.parser.mock.MockParser")) {
239 parsedByMock = true;
240 break;
241 }
242 }
243 assertTrue("mock parser should have been called", parsedByMock);
244 }
245 }
1515 */
1616 package org.apache.tika.parser.mp3;
1717
18 import static org.apache.tika.TikaTest.assertContains;
1819 import static org.junit.Assert.assertEquals;
19 import static org.junit.Assert.assertTrue;
2020
2121 import java.io.ByteArrayInputStream;
22 import java.io.InputStream;
2323
2424 import org.apache.tika.metadata.Metadata;
2525 import org.apache.tika.metadata.TikaCoreProperties;
6969 assertEquals("Test Artist", metadata.get(Metadata.AUTHOR));
7070
7171 String content = handler.toString();
72 assertTrue(content.contains("Test Title"));
73 assertTrue(content.contains("Test Artist"));
74 assertTrue(content.contains("Test Album"));
75 assertTrue(content.contains("2008"));
76 assertTrue(content.contains("Test Comment"));
77 assertTrue(content.contains("Rock"));
72 assertContains("Test Title", content);
73 assertContains("Test Artist", content);
74 assertContains("Test Album", content);
75 assertContains("2008", content);
76 assertContains("Test Comment", content);
77 assertContains("Rock", content);
7878
7979 assertEquals("MPEG 3 Layer III Version 1", metadata.get("version"));
8080 assertEquals("44100", metadata.get("samplerate"));
108108
109109 // Check the textual contents
110110 String content = handler.toString();
111 assertTrue(content.contains("Test Title"));
112 assertTrue(content.contains("Test Artist"));
113 assertTrue(content.contains("Test Album"));
114 assertTrue(content.contains("2008"));
115 assertTrue(content.contains("Test Comment"));
116 assertTrue(content.contains("Rock"));
111 assertContains("Test Title", content);
112 assertContains("Test Artist", content);
113 assertContains("Test Album", content);
114 assertContains("2008", content);
115 assertContains("Test Comment", content);
116 assertContains("Rock", content);
117 assertContains(", track 1", content);
118 assertContains(", disc 1", content);
117119
118120 // Check un-typed audio properties
119121 assertEquals("MPEG 3 Layer III Version 1", metadata.get("version"));
123125 // Check XMPDM-typed audio properties
124126 assertEquals("Test Album", metadata.get(XMPDM.ALBUM));
125127 assertEquals("Test Artist", metadata.get(XMPDM.ARTIST));
128 assertEquals("Test Album Artist", metadata.get(XMPDM.ALBUM_ARTIST));
126129 assertEquals(null, metadata.get(XMPDM.COMPOSER));
127130 assertEquals("2008", metadata.get(XMPDM.RELEASE_DATE));
128131 assertEquals("Rock", metadata.get(XMPDM.GENRE));
129132 assertEquals("XXX - ID3v1 Comment\nTest Comment", metadata.get(XMPDM.LOG_COMMENT.getName()));
130133 assertEquals("1", metadata.get(XMPDM.TRACK_NUMBER));
134 assertEquals("1/1", metadata.get(XMPDM.DISC_NUMBER));
135 assertEquals("1", metadata.get(XMPDM.COMPILATION));
131136
132137 assertEquals("44100", metadata.get(XMPDM.AUDIO_SAMPLE_RATE));
133138 assertEquals("Mono", metadata.get(XMPDM.AUDIO_CHANNEL_TYPE));
159164 assertEquals("Test Artist", metadata.get(Metadata.AUTHOR));
160165
161166 String content = handler.toString();
162 assertTrue(content.contains("Test Title"));
163 assertTrue(content.contains("Test Artist"));
164 assertTrue(content.contains("Test Album"));
165 assertTrue(content.contains("2008"));
166 assertTrue(content.contains("Test Comment"));
167 assertTrue(content.contains("Rock"));
167 assertContains("Test Title", content);
168 assertContains("Test Artist", content);
169 assertContains("Test Album", content);
170 assertContains("2008", content);
171 assertContains("Test Comment", content);
172 assertContains("Rock", content);
168173
169174 assertEquals("MPEG 3 Layer III Version 1", metadata.get("version"));
170175 assertEquals("44100", metadata.get("samplerate"));
196201 assertEquals("Test Artist", metadata.get(Metadata.AUTHOR));
197202
198203 String content = handler.toString();
199 assertTrue(content.contains("Test Title"));
200 assertTrue(content.contains("Test Artist"));
201 assertTrue(content.contains("Test Album"));
202 assertTrue(content.contains("2008"));
203 assertTrue(content.contains("Test Comment"));
204 assertTrue(content.contains("Rock"));
204 assertContains("Test Title", content);
205 assertContains("Test Artist", content);
206 assertContains("Test Album", content);
207 assertContains("2008", content);
208 assertContains("Test Comment", content);
209 assertContains("Rock", content);
210 assertContains(", disc 1", content);
205211
206212 assertEquals("MPEG 3 Layer III Version 1", metadata.get("version"));
207213 assertEquals("44100", metadata.get("samplerate"));
208214 assertEquals("1", metadata.get("channels"));
209215 checkDuration(metadata, 2);
216
217 // Check XMPDM-typed audio properties
218 assertEquals("Test Album", metadata.get(XMPDM.ALBUM));
219 assertEquals("Test Artist", metadata.get(XMPDM.ARTIST));
220 assertEquals("Test Album Artist", metadata.get(XMPDM.ALBUM_ARTIST));
221 assertEquals(null, metadata.get(XMPDM.COMPOSER));
222 assertEquals("2008", metadata.get(XMPDM.RELEASE_DATE));
223 assertEquals("Rock", metadata.get(XMPDM.GENRE));
224 assertEquals("1", metadata.get(XMPDM.COMPILATION));
225
226 assertEquals(null, metadata.get(XMPDM.TRACK_NUMBER));
227 assertEquals("1", metadata.get(XMPDM.DISC_NUMBER));
210228 }
211229
212230 /**
274292 assertEquals("Test Artist", metadata.get(Metadata.AUTHOR));
275293
276294 String content = handler.toString();
277 assertTrue(content.contains("Test Title"));
278 assertTrue(content.contains("Test Artist"));
279 assertTrue(content.contains("Test Album"));
280 assertTrue(content.contains("2008"));
281 assertTrue(content.contains("Test Comment"));
282 assertTrue(content.contains("Rock"));
295 assertContains("Test Title", content);
296 assertContains("Test Artist", content);
297 assertContains("Test Album", content);
298 assertContains("2008", content);
299 assertContains("Test Comment", content);
300 assertContains("Rock", content);
283301
284302 assertEquals("MPEG 3 Layer III Version 1", metadata.get("version"));
285303 assertEquals("44100", metadata.get("samplerate"));
309327 assertEquals("", ID3v2Frame.getTagString(new byte[] {0,0,0,0}, 0, 3));
310328 assertEquals("A", ID3v2Frame.getTagString(new byte[] {(byte)'A',0,0,0}, 0, 3));
311329 }
330
331 @Test
332 public void testTIKA1589_noId3ReturnsDurationCorrectly() throws Exception {
333 Parser parser = new AutoDetectParser(); // Should auto-detect!
334 ContentHandler handler = new BodyContentHandler();
335 Metadata metadata = new Metadata();
336
337 InputStream stream = Mp3ParserTest.class.getResourceAsStream(
338 "/test-documents/testMP3noid3.mp3");
339 try {
340 parser.parse(stream, handler, metadata, new ParseContext());
341 } finally {
342 stream.close();
343 }
344
345 assertEquals("2455.510986328125", metadata.get(XMPDM.DURATION));
346 }
312347
313348 /**
314349 * This test will do nothing, unless you've downloaded the
343378 assertEquals("Merzhin", metadata.get(Metadata.AUTHOR));
344379
345380 String content = handler.toString();
346 assertTrue(content.contains("Plus loin vers l'ouest"));
381 assertContains("Plus loin vers l'ouest", content);
347382
348383 assertEquals("MPEG 3 Layer III Version 1", metadata.get("version"));
349384 assertEquals("44100", metadata.get("samplerate"));
380415 assertEquals("The White Stripes", metadata.get(Metadata.AUTHOR));
381416
382417 String content = handler.toString();
383 assertTrue(content.contains("Girl you have no faith in medicine"));
384 assertTrue(content.contains("The White Stripes"));
385 assertTrue(content.contains("Elephant"));
386 assertTrue(content.contains("2003"));
418 assertContains("Girl you have no faith in medicine", content);
419 assertContains("The White Stripes", content);
420 assertContains("Elephant", content);
421 assertContains("2003", content);
387422
388423 // File lacks any audio frames, so we can't know these
389424 assertEquals(null, metadata.get("version"));
2424 import java.io.ByteArrayOutputStream;
2525 import java.io.IOException;
2626 import java.io.OutputStream;
27
27 import org.apache.tika.io.IOUtils;
2828 import org.junit.After;
2929 import org.junit.Test;
3030
156156 public void testSkipNoCurrentHeader() throws IOException
157157 {
158158 ByteArrayOutputStream bos = new ByteArrayOutputStream();
159 bos.write("This is a test".getBytes());
159 bos.write("This is a test".getBytes(IOUtils.UTF_8));
160160 ByteArrayInputStream in = new ByteArrayInputStream(bos.toByteArray());
161161 stream = new MpegStream(in);
162162 assertFalse("Wrong result", stream.skipFrame());
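The fix above replaces the no-argument `String.getBytes()` with an explicit `IOUtils.UTF_8`, so the test bytes no longer depend on the JVM's platform-default charset. A standalone illustration of the same idea using `java.nio.charset.StandardCharsets` (the class name here is made up):

```java
import java.nio.charset.StandardCharsets;

public class CharsetExample {
    public static void main(String[] args) {
        // getBytes() with no argument uses the JVM default charset, which
        // varies by platform and locale; passing an explicit charset makes
        // the resulting byte sequence deterministic everywhere.
        String s = "This is a test";
        byte[] explicit = s.getBytes(StandardCharsets.UTF_8);

        System.out.println(explicit.length); // 14 ASCII chars -> 14 bytes
        System.out.println(new String(explicit, StandardCharsets.UTF_8).equals(s));
    }
}
```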
1515 */
1616 package org.apache.tika.parser.mp4;
1717
18 import static org.apache.tika.TikaTest.assertContains;
1819 import static org.junit.Assert.assertEquals;
19 import static org.junit.Assert.assertTrue;
2020
2121 import java.io.InputStream;
2222
2323 import org.apache.tika.io.TikaInputStream;
2424 import org.apache.tika.metadata.Metadata;
2525 import org.apache.tika.metadata.TikaCoreProperties;
26 import org.apache.tika.metadata.XMP;
2627 import org.apache.tika.metadata.XMPDM;
2728 import org.apache.tika.parser.AutoDetectParser;
2829 import org.apache.tika.parser.ParseContext;
6566
6667 // Check the textual contents
6768 String content = handler.toString();
68 assertTrue(content.contains("Test Title"));
69 assertTrue(content.contains("Test Artist"));
70 assertTrue(content.contains("Test Album"));
71 assertTrue(content.contains("2008"));
72 assertTrue(content.contains("Test Comment"));
73 assertTrue(content.contains("Test Genre"));
69 assertContains("Test Title", content);
70 assertContains("Test Artist", content);
71 assertContains("Test Album", content);
72 assertContains("2008", content);
73 assertContains("Test Comment", content);
74 assertContains("Test Genre", content);
7475
7576 // Check XMPDM-typed audio properties
7677 assertEquals("Test Album", metadata.get(XMPDM.ALBUM));
8081 assertEquals("Test Genre", metadata.get(XMPDM.GENRE));
8182 assertEquals("Test Comments", metadata.get(XMPDM.LOG_COMMENT.getName()));
8283 assertEquals("1", metadata.get(XMPDM.TRACK_NUMBER));
84 assertEquals("Test Album Artist", metadata.get(XMPDM.ALBUM_ARTIST));
85 assertEquals("6", metadata.get(XMPDM.DISC_NUMBER));
86 assertEquals("0", metadata.get(XMPDM.COMPILATION));
87
8388
8489 assertEquals("44100", metadata.get(XMPDM.AUDIO_SAMPLE_RATE));
85 //assertEquals("Stereo", metadata.get(XMPDM.AUDIO_CHANNEL_TYPE)); // TODO Extract
90 assertEquals("Stereo", metadata.get(XMPDM.AUDIO_CHANNEL_TYPE));
8691 assertEquals("M4A", metadata.get(XMPDM.AUDIO_COMPRESSOR));
92 assertEquals("0.07", metadata.get(XMPDM.DURATION));
93
94 assertEquals("iTunes 10.5.3.3", metadata.get(XMP.CREATOR_TOOL));
8795
8896
8997 // Check again by file, rather than stream
1616 package org.apache.tika.parser.netcdf;
1717
1818 //JDK imports
19 import static org.junit.Assert.assertEquals;
20 import static org.junit.Assert.assertTrue;
21
2219 import java.io.InputStream;
2320
2421 //TIKA imports
2926 import org.apache.tika.sax.BodyContentHandler;
3027 import org.junit.Test;
3128 import org.xml.sax.ContentHandler;
29
30 import static org.apache.tika.TikaTest.assertContains;
31 import static org.junit.Assert.assertEquals;
32
3233 /**
3334 * Test cases to exercise the {@link NetCDFParser}.
34 *
3535 */
3636 public class NetCDFParserTest {
3737
3838 @Test
3939 public void testParseGlobalMetadata() throws Exception {
40 if(System.getProperty("java.version").startsWith("1.5")) {
41 return;
42 }
43
4440 Parser parser = new NetCDFParser();
4541 ContentHandler handler = new BodyContentHandler();
4642 Metadata metadata = new Metadata();
6258 assertEquals(metadata.get(Metadata.REALIZATION), "1");
6359 assertEquals(metadata.get(Metadata.EXPERIMENT_ID),
6460 "720 ppm stabilization experiment (SRESA1B)");
65
61 assertEquals(metadata.get("File-Type-Description"),
62 "NetCDF-3/CDM");
63
6664 String content = handler.toString();
67 assertTrue(content.contains(":long_name = \"Surface area\";"));
68 assertTrue(content.contains("float area(lat=128, lon=256);"));
69 assertTrue(content.contains("float lat(lat=128);"));
70 assertTrue(content.contains("double lat_bnds(lat=128, bnds=2);"));
71 assertTrue(content.contains("double lon_bnds(lon=256, bnds=2);"));
72
73
65 assertContains("long_name = \"Surface area\"", content);
66 assertContains("float area(lat=128, lon=256)", content);
67 assertContains("float lat(lat=128)", content);
68 assertContains("double lat_bnds(lat=128, bnds=2)", content);
69 assertContains("double lon_bnds(lon=256, bnds=2)", content);
70
71
72
7473 }
7574
7675 }
0 /*
1 * Licensed to the Apache Software Foundation (ASF) under one or more
2 * contributor license agreements. See the NOTICE file distributed with
3 * this work for additional information regarding copyright ownership.
4 * The ASF licenses this file to You under the Apache License, Version 2.0
5 * (the "License"); you may not use this file except in compliance with
6 * the License. You may obtain a copy of the License at
7 *
8 * http://www.apache.org/licenses/LICENSE-2.0
9 *
10 * Unless required by applicable law or agreed to in writing, software
11 * distributed under the License is distributed on an "AS IS" BASIS,
12 * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
13 * See the License for the specific language governing permissions and
14 * limitations under the License.
15 */
16 package org.apache.tika.parser.ocr;
17
18 import org.apache.tika.TikaTest;
19 import org.junit.Test;
20
21 import java.io.File;
22 import java.io.InputStream;
23
24 import static org.junit.Assert.assertEquals;
25 import static org.junit.Assert.assertTrue;
26 import static org.junit.Assert.fail;
27
28 public class TesseractOCRConfigTest extends TikaTest {
29
30 @Test
31 public void testNoConfig() throws Exception {
32 TesseractOCRConfig config = new TesseractOCRConfig();
33 assertEquals("Invalid default tesseractPath value", "", config.getTesseractPath());
34 assertEquals("Invalid default language value", "eng", config.getLanguage());
35 assertEquals("Invalid default pageSegMode value", "1", config.getPageSegMode());
36 assertEquals("Invalid default minFileSizeToOcr value", 0, config.getMinFileSizeToOcr());
37 assertEquals("Invalid default maxFileSizeToOcr value", Integer.MAX_VALUE, config.getMaxFileSizeToOcr());
38 assertEquals("Invalid default timeout value", 120, config.getTimeout());
39 }
40
41 @Test
42 public void testPartialConfig() throws Exception {
43
44 InputStream stream = TesseractOCRConfigTest.class.getResourceAsStream(
45 "/test-properties/TesseractOCRConfig-partial.properties");
46
47 TesseractOCRConfig config = new TesseractOCRConfig(stream);
48 assertEquals("Invalid default tesseractPath value", "", config.getTesseractPath());
49 assertEquals("Invalid overridden language value", "fra+deu", config.getLanguage());
50 assertEquals("Invalid default pageSegMode value", "1", config.getPageSegMode());
51 assertEquals("Invalid overridden minFileSizeToOcr value", 1, config.getMinFileSizeToOcr());
52 assertEquals("Invalid default maxFileSizeToOcr value", Integer.MAX_VALUE, config.getMaxFileSizeToOcr());
53 assertEquals("Invalid overridden timeout value", 240, config.getTimeout());
54 }
55
56 @Test
57 public void testFullConfig() throws Exception {
58
59 InputStream stream = TesseractOCRConfigTest.class.getResourceAsStream(
60 "/test-properties/TesseractOCRConfig-full.properties");
61
62 TesseractOCRConfig config = new TesseractOCRConfig(stream);
63 assertEquals("Invalid overridden tesseractPath value", "/opt/tesseract" + File.separator, config.getTesseractPath());
64 assertEquals("Invalid overridden language value", "fra+deu", config.getLanguage());
65 assertEquals("Invalid overridden pageSegMode value", "2", config.getPageSegMode());
66 assertEquals("Invalid overridden minFileSizeToOcr value", 1, config.getMinFileSizeToOcr());
67 assertEquals("Invalid overridden maxFileSizeToOcr value", 2000000, config.getMaxFileSizeToOcr());
68 assertEquals("Invalid overridden timeout value", 240, config.getTimeout());
69 }
70
71 @Test(expected=IllegalArgumentException.class)
72 public void testValidateLanguage() {
73 TesseractOCRConfig config = new TesseractOCRConfig();
74 config.setLanguage("eng");
75 config.setLanguage("eng+fra");
76 assertTrue("Couldn't set valid values", true);
77 config.setLanguage("rm -Rf *");
78 }
79
80 @Test(expected=IllegalArgumentException.class)
81 public void testValidatePageSegMode() {
82 TesseractOCRConfig config = new TesseractOCRConfig();
83 config.setPageSegMode("0");
84 config.setPageSegMode("10");
85 assertTrue("Couldn't set valid values", true);
86 config.setPageSegMode("11");
87 }
88
89 }
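The three tests above show `TesseractOCRConfig` layering a (possibly partial) properties stream over built-in defaults: keys missing from the stream keep their default values. A minimal sketch of that defaults-plus-overrides pattern with plain `java.util.Properties` (keys and values mirror the assertions; the class itself is illustrative, not Tika's internals):

```java
import java.io.StringReader;
import java.util.Properties;

public class ConfigDefaultsExample {
    public static void main(String[] args) throws Exception {
        // Defaults mirror the values asserted in testNoConfig().
        Properties defaults = new Properties();
        defaults.setProperty("language", "eng");
        defaults.setProperty("timeout", "120");

        // A "partial" config only needs the keys it overrides.
        Properties config = new Properties(defaults);
        config.load(new StringReader("language=fra+deu\n"));

        System.out.println(config.getProperty("language")); // overridden
        System.out.println(config.getProperty("timeout"));  // falls back to default
    }
}
```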
0 /*
1 * Licensed to the Apache Software Foundation (ASF) under one or more
2 * contributor license agreements. See the NOTICE file distributed with
3 * this work for additional information regarding copyright ownership.
4 * The ASF licenses this file to You under the Apache License, Version 2.0
5 * (the "License"); you may not use this file except in compliance with
6 * the License. You may obtain a copy of the License at
7 *
8 * http://www.apache.org/licenses/LICENSE-2.0
9 *
10 * Unless required by applicable law or agreed to in writing, software
11 * distributed under the License is distributed on an "AS IS" BASIS,
12 * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
13 * See the License for the specific language governing permissions and
14 * limitations under the License.
15 */
16 package org.apache.tika.parser.ocr;
17
18 import static org.apache.tika.parser.ocr.TesseractOCRParser.getTesseractProg;
19 import static org.junit.Assert.assertEquals;
20 import static org.junit.Assert.assertTrue;
21 import static org.junit.Assume.assumeTrue;
22
23 import java.io.InputStream;
24 import java.util.List;
25
26 import org.apache.tika.TikaTest;
27 import org.apache.tika.metadata.Metadata;
28 import org.apache.tika.mime.MediaType;
29 import org.apache.tika.parser.AutoDetectParser;
30 import org.apache.tika.parser.DefaultParser;
31 import org.apache.tika.parser.ParseContext;
32 import org.apache.tika.parser.Parser;
33 import org.apache.tika.parser.RecursiveParserWrapper;
34 import org.apache.tika.parser.external.ExternalParser;
35 import org.apache.tika.parser.image.ImageParser;
36 import org.apache.tika.parser.pdf.PDFParserConfig;
37 import org.apache.tika.sax.BasicContentHandlerFactory;
38 import org.junit.Test;
39 import org.xml.sax.helpers.DefaultHandler;
40
41 public class TesseractOCRParserTest extends TikaTest {
42
43 public static boolean canRun() {
44 TesseractOCRConfig config = new TesseractOCRConfig();
45 TesseractOCRParserTest tesseractOCRTest = new TesseractOCRParserTest();
46 return tesseractOCRTest.canRun(config);
47 }
48
49 private boolean canRun(TesseractOCRConfig config) {
50 String[] checkCmd = {config.getTesseractPath() + getTesseractProg()};
51 // If Tesseract is not on the path, do not run the test.
52 return ExternalParser.check(checkCmd);
53 }
54
55 /*
56 Check that if Tesseract is not found, the TesseractOCRParser claims not to support
57 any file types, so the standard image parser is used instead.
58 */
59 @Test
60 public void offersNoTypesIfNotFound() throws Exception {
61 TesseractOCRParser parser = new TesseractOCRParser();
62 DefaultParser defaultParser = new DefaultParser();
63 MediaType png = MediaType.image("png");
64
65 // With an invalid path, will offer no types
66 TesseractOCRConfig invalidConfig = new TesseractOCRConfig();
67 invalidConfig.setTesseractPath("/made/up/path");
68
69 ParseContext parseContext = new ParseContext();
70 parseContext.set(TesseractOCRConfig.class, invalidConfig);
71
72 // No types offered
73 assertEquals(0, parser.getSupportedTypes(parseContext).size());
74
75 // And DefaultParser won't use us
76 assertEquals(ImageParser.class, defaultParser.getParsers(parseContext).get(png).getClass());
77 }
78
79 /*
80 If Tesseract is found, check that we retrieve the expected number of supporting parsers.
81 */
82 @Test
83 public void offersTypesIfFound() throws Exception {
84 TesseractOCRParser parser = new TesseractOCRParser();
85 DefaultParser defaultParser = new DefaultParser();
86
87 ParseContext parseContext = new ParseContext();
88 MediaType png = MediaType.image("png");
89
90 // Assuming that Tesseract is on the path, we should find 5 Parsers that support PNG.
91 assumeTrue(canRun());
92
93 assertEquals(5, parser.getSupportedTypes(parseContext).size());
94 assertTrue(parser.getSupportedTypes(parseContext).contains(png));
95
96 // DefaultParser will now select the TesseractOCRParser.
97 assertEquals(TesseractOCRParser.class, defaultParser.getParsers(parseContext).get(png).getClass());
98 }
99
100 @Test
101 public void testPDFOCR() throws Exception {
102 String resource = "/test-documents/testOCR.pdf";
103 String[] nonOCRContains = new String[0];
104 testBasicOCR(resource, nonOCRContains, 2);
105 }
106
107 @Test
108 public void testDOCXOCR() throws Exception {
109 String resource = "/test-documents/testOCR.docx";
110 String[] nonOCRContains = {
111 "This is some text.",
112 "Here is an embedded image:"
113 };
114 testBasicOCR(resource, nonOCRContains, 3);
115 }
116
117 @Test
118 public void testPPTXOCR() throws Exception {
119 String resource = "/test-documents/testOCR.pptx";
120 String[] nonOCRContains = {
121 "This is some text"
122 };
123 testBasicOCR(resource, nonOCRContains, 3);
124 }
125
126 private void testBasicOCR(String resource, String[] nonOCRContains, int numMetadatas) throws Exception {
127 TesseractOCRConfig config = new TesseractOCRConfig();
128 Parser parser = new RecursiveParserWrapper(new AutoDetectParser(),
129 new BasicContentHandlerFactory(
130 BasicContentHandlerFactory.HANDLER_TYPE.TEXT, -1));
131
132 PDFParserConfig pdfConfig = new PDFParserConfig();
133 pdfConfig.setExtractInlineImages(true);
134
135 ParseContext parseContext = new ParseContext();
136 parseContext.set(TesseractOCRConfig.class, config);
137 parseContext.set(Parser.class, parser);
138 parseContext.set(PDFParserConfig.class, pdfConfig);
139
140 InputStream stream = TesseractOCRParserTest.class.getResourceAsStream(resource);
141
142 try {
143 parser.parse(stream, new DefaultHandler(), new Metadata(), parseContext);
144 } finally {
145 stream.close();
146 }
147 List<Metadata> metadataList = ((RecursiveParserWrapper) parser).getMetadata();
148 assertEquals(numMetadatas, metadataList.size());
149
150 StringBuilder contents = new StringBuilder();
151 for (Metadata m : metadataList) {
152 contents.append(m.get(RecursiveParserWrapper.TIKA_CONTENT));
153 }
154 if (canRun()) {
155 assertTrue(contents.toString().contains("Happy New Year 2003!"));
156 }
157 for (String needle : nonOCRContains) {
158 assertContains(needle, contents.toString());
159 }
160 assertTrue(metadataList.get(0).names().length > 10);
161 assertTrue(metadataList.get(1).names().length > 10);
162 //test at least one value
163 assertEquals("deflate", metadataList.get(1).get("Compression CompressionTypeName"));
164 }
165
166 @Test
167 public void testSingleImage() throws Exception {
168 assumeTrue(canRun());
169 String xml = getXML("testOCR.jpg").xml;
170 assertContains("OCR Testing", xml);
171 }
172
173 @Test
174 public void getNormalMetadataToo() throws Exception {
175 //This should succeed whether or not Tesseract OCR is installed/active.
176 //If Tesseract is installed, the internal metadata extraction parser should
177 //work; if it isn't, the regular parsers should take over.
178
179 //gif
180 Metadata m = getXML("testGIF.gif").metadata;
181 assertTrue(m.names().length > 20);
182 assertEquals("RGB", m.get("Chroma ColorSpaceType"));
183
184 //jpg
185 m = getXML("testOCR.jpg").metadata;
186 assertEquals("136", m.get(Metadata.IMAGE_WIDTH));
187 assertEquals("66", m.get(Metadata.IMAGE_LENGTH));
188 assertEquals("8", m.get(Metadata.BITS_PER_SAMPLE));
189 assertEquals(null, m.get(Metadata.SAMPLES_PER_PIXEL));
190 assertContains("This is a test Apache Tika imag", m.get(Metadata.COMMENTS));
191
192 //bmp
193 m = getXML("testBMP.bmp").metadata;
194 assertEquals("100", m.get(Metadata.IMAGE_WIDTH));
195 assertEquals("75", m.get(Metadata.IMAGE_LENGTH));
196
197 //png
198 m = getXML("testPNG.png").metadata;
199 assertEquals("100", m.get(Metadata.IMAGE_WIDTH));
200 assertEquals("75", m.get(Metadata.IMAGE_LENGTH));
201 assertEquals("UnsignedIntegral", m.get("Data SampleFormat"));
202
203 //tiff
204 m = getXML("testTIFF.tif").metadata;
205 assertEquals("100", m.get(Metadata.IMAGE_WIDTH));
206 assertEquals("75", m.get(Metadata.IMAGE_LENGTH));
207 assertEquals("72 dots per inch", m.get("Y Resolution"));
208 }
209 }
6161 metadata.get(Metadata.CONTENT_TYPE));
6262
6363 String content = handler.toString();
64 assertTrue(content.contains("Tika is part of the Lucene project."));
65 assertTrue(content.contains("Solr"));
66 assertTrue(content.contains("one embedded"));
67 assertTrue(content.contains("Rectangle Title"));
68 assertTrue(content.contains("a blue background and dark border"));
64 assertContains("Tika is part of the Lucene project.", content);
65 assertContains("Solr", content);
66 assertContains("one embedded", content);
67 assertContains("Rectangle Title", content);
68 assertContains("a blue background and dark border", content);
6969 } finally {
7070 input.close();
7171 }
345345 metadata.get(Metadata.CONTENT_TYPE));
346346
347347 String content = handler.toString();
348 assertTrue(content.contains("Tika is part of the Lucene project."));
348 assertContains("Tika is part of the Lucene project.", content);
349349 } finally {
350350 tis.close();
351351 }
352352 }
353
354 @Test
355 public void testNPEFromFile() throws Exception {
356 TikaInputStream tis = TikaInputStream.get(this.getClass().getResource(
357 "/test-documents/testNPEOpenDocument.odt"));
358 OpenDocumentParser parser = new OpenDocumentParser();
359
360 try {
361 Metadata metadata = new Metadata();
362 ContentHandler handler = new BodyContentHandler();
363 parser.parse(tis, handler, metadata, new ParseContext());
364
365 assertEquals(
366 "application/vnd.oasis.opendocument.text",
367 metadata.get(Metadata.CONTENT_TYPE));
368
369 String content = handler.toString();
370 assertContains("primero hay que generar un par de claves", content);
371 } finally {
372 tis.close();
373 }
374 }
375
376 // TIKA-1063: Test basic style support.
377 @Test
378 public void testODTStyles() throws Exception {
379 String xml = getXML("testStyles.odt").xml;
380 assertContains("This <i>is</i> <b>just</b> a <u>test</u>", xml);
381 assertContains("<p>And <b>another <i>test</i> is</b> here.</p>", xml);
382 assertContains("<ol>\t<li><p>One</p>", xml);
383 assertContains("</ol>", xml);
384 assertContains("<ul>\t<li><p>First</p>", xml);
385 assertContains("</ul>", xml);
386 }
387
388 //TIKA-1600: Test that null pointer doesn't break parsing.
389 @Test
390 public void testNullStylesInODTFooter() throws Exception {
391 Parser parser = new OpenDocumentParser();
392 InputStream input = ODFParserTest.class.getResourceAsStream("/test-documents/testODT-TIKA-6000.odt");
393 try {
394 Metadata metadata = new Metadata();
395 ContentHandler handler = new BodyContentHandler();
396 parser.parse(input, handler, metadata, new ParseContext());
397
398 assertEquals("application/vnd.oasis.opendocument.text", metadata.get(Metadata.CONTENT_TYPE));
399
400 String content = handler.toString();
401
402 assertContains("Utilisation de ce document", content);
403 assertContains("Copyright and License", content);
404 assertContains("Changer la langue", content);
405 assertContains("La page d’accueil permet de faire une recherche simple", content);
406 } finally {
407 input.close();
408 }
409 }
353410 }
0 /*
1 * Licensed to the Apache Software Foundation (ASF) under one or more
2 * contributor license agreements. See the NOTICE file distributed with
3 * this work for additional information regarding copyright ownership.
4 * The ASF licenses this file to You under the Apache License, Version 2.0
5 * (the "License"); you may not use this file except in compliance with
6 * the License. You may obtain a copy of the License at
7 *
8 * http://www.apache.org/licenses/LICENSE-2.0
9 *
10 * Unless required by applicable law or agreed to in writing, software
11 * distributed under the License is distributed on an "AS IS" BASIS,
12 * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
13 * See the License for the specific language governing permissions and
14 * limitations under the License.
15 */
16 package org.apache.tika.parser.pdf;
17
18
19 import static org.junit.Assert.assertTrue;
20
21 import org.apache.tika.exception.AccessPermissionException;
22 import org.apache.tika.metadata.AccessPermissions;
23 import org.apache.tika.metadata.Metadata;
24 import org.apache.tika.metadata.PropertyTypeException;
25 import org.junit.Test;
26
27 public class AccessCheckerTest {
28
29 @Test
30 public void testLegacy() throws AccessPermissionException{
31
32 Metadata m = getMetadata(false, false);
33 //legacy behavior; don't bother checking
34 AccessChecker checker = new AccessChecker();
35 checker.check(m);
36 assertTrue("no exception", true);
37
38 m = getMetadata(false, true);
39 assertTrue("no exception", true);
40 checker.check(m);
41
42 m = getMetadata(true, true);
43 assertTrue("no exception", true);
44 checker.check(m);
45 }
46
47 @Test
48 public void testNoExtraction() {
49
50 Metadata m = null;
51 //allow nothing
52 AccessChecker checker = new AccessChecker(false);
53 boolean ex = false;
54 try {
55 m = getMetadata(false, false);
56 checker.check(m);
57 } catch (AccessPermissionException e) {
58 ex = true;
59 }
60 assertTrue("correct exception with no extraction, no extract for accessibility", ex);
61 ex = false;
62 try {
63 //document allows extraction for accessibility
64 m = getMetadata(false, true);
65 checker.check(m);
66 } catch (AccessPermissionException e) {
67 //but application is not an accessibility application
68 ex = true;
69 }
70 assertTrue("correct exception with no extraction, no extract for accessibility", ex);
71 }
72
73 @Test
74 public void testExtractOnlyForAccessibility() throws AccessPermissionException {
75 Metadata m = getMetadata(false, true);
76 //allow accessibility
77 AccessChecker checker = new AccessChecker(true);
78 checker.check(m);
79 assertTrue("no exception", true);
80 boolean ex = false;
81 try {
82 m = getMetadata(false, false);
83 checker.check(m);
84 } catch (AccessPermissionException e) {
85 ex = true;
86 }
87 assertTrue("correct exception", ex);
88 }
89
90 @Test
91 public void testCrazyExtractNotForAccessibility() throws AccessPermissionException {
92 Metadata m = getMetadata(true, false);
93 //allow accessibility
94 AccessChecker checker = new AccessChecker(true);
95 checker.check(m);
96 assertTrue("no exception", true);
97
98 //don't extract for accessibility
99 checker = new AccessChecker(false);
100 //if extract content is allowed, the checker shouldn't
101 //check the value of extract for accessibility
102 checker.check(m);
103 assertTrue("no exception", true);
104
105 }
106
107 @Test
108 public void testCantAddMultiplesToMetadata() {
109 Metadata m = new Metadata();
110 boolean ex = false;
111 m.add(AccessPermissions.EXTRACT_CONTENT, "true");
112 try {
113 m.add(AccessPermissions.EXTRACT_CONTENT, "false");
114 } catch (PropertyTypeException e) {
115 ex = true;
116 }
117 assertTrue("can't add multiple values", ex);
118
119 m = new Metadata();
120 ex = false;
121 m.add(AccessPermissions.EXTRACT_FOR_ACCESSIBILITY, "true");
122 try {
123 m.add(AccessPermissions.EXTRACT_FOR_ACCESSIBILITY, "false");
124 } catch (PropertyTypeException e) {
125 ex = true;
126 }
127 assertTrue("can't add multiple values", ex);
128 }
129
130 private Metadata getMetadata(boolean allowExtraction, boolean allowExtractionForAccessibility) {
131 Metadata m = new Metadata();
132 m.set(AccessPermissions.EXTRACT_CONTENT, Boolean.toString(allowExtraction));
133 m.set(AccessPermissions.EXTRACT_FOR_ACCESSIBILITY, Boolean.toString(allowExtractionForAccessibility));
134 return m;
135 }
136 }
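The tests above exercise three permission combinations. A simplified standalone sketch of the decision rule they imply: full content extraction wins outright, and the extract-for-accessibility flag is only consulted when the checker was constructed to allow accessibility extraction. (This is an inference from the assertions, not Tika's actual `AccessChecker` code.)

```java
public class AccessCheckSketch {
    static void check(boolean allowExtraction,
                      boolean allowForAccessibility,
                      boolean checkerAllowsAccessibility) throws Exception {
        // If content extraction is permitted, nothing else matters.
        if (allowExtraction) return;
        // Otherwise the accessibility carve-out may still permit extraction.
        if (checkerAllowsAccessibility && allowForAccessibility) return;
        throw new Exception("extraction not permitted");
    }

    public static void main(String[] args) throws Exception {
        check(true, false, false);  // ok: extraction allowed outright
        check(false, true, true);   // ok: accessibility carve-out applies
        try {
            check(false, false, true);
            System.out.println("no exception");
        } catch (Exception e) {
            System.out.println("blocked");
        }
    }
}
```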
2626 import java.util.HashMap;
2727 import java.util.HashSet;
2828 import java.util.List;
29 import java.util.Locale;
2930 import java.util.Map;
3031 import java.util.Set;
3132
33 import org.apache.log4j.Level;
34 import org.apache.log4j.Logger;
3235 import org.apache.tika.TikaTest;
36 import org.apache.tika.exception.AccessPermissionException;
37 import org.apache.tika.exception.EncryptedDocumentException;
38 import org.apache.tika.exception.TikaException;
3339 import org.apache.tika.extractor.ContainerExtractor;
3440 import org.apache.tika.extractor.DocumentSelector;
3541 import org.apache.tika.extractor.ParserContainerExtractor;
42 import org.apache.tika.io.IOUtils;
3643 import org.apache.tika.io.TikaInputStream;
3744 import org.apache.tika.metadata.Metadata;
3845 import org.apache.tika.metadata.OfficeOpenXMLCore;
4249 import org.apache.tika.parser.ParseContext;
4350 import org.apache.tika.parser.Parser;
4451 import org.apache.tika.parser.PasswordProvider;
52 import org.apache.tika.parser.RecursiveParserWrapper;
53 import org.apache.tika.sax.BasicContentHandlerFactory;
4554 import org.apache.tika.sax.BodyContentHandler;
4655 import org.apache.tika.sax.ContentHandlerDecorator;
56 import org.apache.tika.sax.ToXMLContentHandler;
57 import org.junit.AfterClass;
58 import org.junit.BeforeClass;
4759 import org.junit.Test;
4860 import org.xml.sax.ContentHandler;
4961 /**
5668 public static final MediaType TYPE_PDF = MediaType.application("pdf");
5769 public static final MediaType TYPE_DOCX = MediaType.application("vnd.openxmlformats-officedocument.wordprocessingml.document");
5870 public static final MediaType TYPE_DOC = MediaType.application("msword");
71 public static Level PDFBOX_LOG_LEVEL = Level.INFO;
72
73 @BeforeClass
74 public static void setup() {
75 //remember default logging level, but turn off for PDFParserTest
76 PDFBOX_LOG_LEVEL = Logger.getLogger("org.apache.pdfbox").getLevel();
77 Logger.getLogger("org.apache.pdfbox").setLevel(Level.OFF);
78 }
79
80 @AfterClass
81 public static void tearDown() {
82 //return to regular logging level
83 Logger.getLogger("org.apache.pdfbox").setLevel(PDFBOX_LOG_LEVEL);
84 }
5985
6086 @Test
6187 public void testPdfParsing() throws Exception {
77103 // assertEquals("Sat Sep 15 10:02:31 BST 2007", metadata.get(Metadata.CREATION_DATE));
78104 // assertEquals("Sat Sep 15 10:02:31 BST 2007", metadata.get(Metadata.LAST_MODIFIED));
79105
80 assertTrue(content.contains("Apache Tika"));
81 assertTrue(content.contains("Tika - Content Analysis Toolkit"));
82 assertTrue(content.contains("incubator"));
83 assertTrue(content.contains("Apache Software Foundation"));
106 assertContains("Apache Tika", content);
107 assertContains("Tika - Content Analysis Toolkit", content);
108 assertContains("incubator", content);
109 assertContains("Apache Software Foundation", content);
84110 // testing how the end of one paragraph is separated from start of the next one
85111 assertTrue("should have word boundary after headline",
86112 !content.contains("ToolkitApache"));
130156 assertEquals("Array Entry 1", metadata.getValues("Custom Array")[0]);
131157 assertEquals("Array Entry 2", metadata.getValues("Custom Array")[1]);
132158
133 assertTrue(content.contains("Hello World!"));
159 assertContains("Hello World!", content);
134160 }
135161
136162 /**
153179 stream.close();
154180 }
155181
182 assertEquals("true", metadata.get("pdf:encrypted"));
156183 assertEquals("application/pdf", metadata.get(Metadata.CONTENT_TYPE));
157184 assertEquals("The Bank of England", metadata.get(TikaCoreProperties.CREATOR));
158185 assertEquals("The Bank of England", metadata.get(Metadata.AUTHOR));
161188 assertEquals("Rethinking the Financial Network, Speech by Andrew G Haldane, Executive Director, Financial Stability delivered at the Financial Student Association, Amsterdam on 28 April 2009", metadata.get(TikaCoreProperties.TITLE));
162189
163190 String content = handler.toString();
164 assertTrue(content.contains("RETHINKING THE FINANCIAL NETWORK"));
165 assertTrue(content.contains("On 16 November 2002"));
166 assertTrue(content.contains("In many important respects"));
191 assertContains("RETHINKING THE FINANCIAL NETWORK", content);
192 assertContains("On 16 November 2002", content);
193 assertContains("In many important respects", content);
167194
168195
169196 // Try again with an explicit empty password
184211 } finally {
185212 stream.close();
186213 }
214 assertEquals("true", metadata.get("pdf:encrypted"));
187215
188216 assertEquals("application/pdf", metadata.get(Metadata.CONTENT_TYPE));
189217 assertEquals("The Bank of England", metadata.get(TikaCoreProperties.CREATOR));
191219 assertEquals("Speeches by Andrew G Haldane", metadata.get(Metadata.SUBJECT));
192220 assertEquals("Rethinking the Financial Network, Speech by Andrew G Haldane, Executive Director, Financial Stability delivered at the Financial Student Association, Amsterdam on 28 April 2009", metadata.get(TikaCoreProperties.TITLE));
193221
194 assertTrue(content.contains("RETHINKING THE FINANCIAL NETWORK"));
195 assertTrue(content.contains("On 16 November 2002"));
196 assertTrue(content.contains("In many important respects"));
222 assertContains("RETHINKING THE FINANCIAL NETWORK", content);
223 assertContains("On 16 November 2002", content);
224 assertContains("In many important respects", content);
225
226 //now test wrong password
227 handler = new BodyContentHandler();
228 metadata = new Metadata();
229 context = new ParseContext();
230 context.set(PasswordProvider.class, new PasswordProvider() {
231 public String getPassword(Metadata metadata) {
232 return "WRONG!!!!";
233 }
234 });
235
236 stream = PDFParserTest.class.getResourceAsStream(
237 "/test-documents/testPDF_protected.pdf");
238 boolean ex = false;
239 try {
240 parser.parse(stream, handler, metadata, context);
241 } catch (EncryptedDocumentException e) {
242 ex = true;
243 } finally {
244 stream.close();
245 }
246 content = handler.toString();
247
248 assertTrue("encryption exception", ex);
249 assertEquals("application/pdf", metadata.get(Metadata.CONTENT_TYPE));
250 assertEquals("true", metadata.get("pdf:encrypted"));
251 //pdf:encrypted, X-Parsed-By and Content-Type
252 assertEquals("very little metadata should be parsed", 3, metadata.names().length);
253 assertEquals(0, content.length());
254
255 //now test wrong password with the non-sequential parser
256 handler = new BodyContentHandler();
257 metadata = new Metadata();
258 context = new ParseContext();
259 context.set(PasswordProvider.class, new PasswordProvider() {
260 public String getPassword(Metadata metadata) {
261 return "WRONG!!!!";
262 }
263 });
264 PDFParserConfig config = new PDFParserConfig();
265 config.setUseNonSequentialParser(true);
266 context.set(PDFParserConfig.class, config);
267
268 stream = PDFParserTest.class.getResourceAsStream(
269 "/test-documents/testPDF_protected.pdf");
270 ex = false;
271 try {
272 parser.parse(stream, handler, metadata, context);
273 } catch (EncryptedDocumentException e) {
274 ex = true;
275 } finally {
276 stream.close();
277 }
278 content = handler.toString();
279 assertTrue("encryption exception", ex);
280 assertEquals("application/pdf", metadata.get(Metadata.CONTENT_TYPE));
281 assertEquals("true", metadata.get("pdf:encrypted"));
282
283 //pdf:encrypted, X-Parsed-By and Content-Type
284 assertEquals("very little metadata should be parsed", 3, metadata.names().length);
285 assertEquals(0, content.length());
197286 }
198287
199288 @Test
203292 "/test-documents/testPDFTwoTextBoxes.pdf");
204293 String content = getText(stream, parser);
205294 content = content.replaceAll("\\s+"," ");
206 assertTrue(content.contains("Left column line 1 Left column line 2 Right column line 1 Right column line 2"));
295 assertContains("Left column line 1 Left column line 2 Right column line 1 Right column line 2", content);
207296 }
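Many assertions in this diff migrate from `assertTrue(content.contains(...))` to `assertContains(expected, content)`, whose failure message reports the missing substring instead of a bare "expected true". A minimal standalone sketch of such a helper (plain Java, no JUnit or Tika dependencies; the class name and message format here are illustrative, not Tika's actual `TikaTest` implementation):

```java
// Minimal stand-in for an assertContains(expected, actual) test helper.
// Unlike assertTrue(actual.contains(expected)), a failure here reports
// which substring was expected and in what text.
public class ContainsAssert {
    public static void assertContains(String expected, String actual) {
        if (actual == null || !actual.contains(expected)) {
            throw new AssertionError(
                    "Expected to find <" + expected + "> in <" + actual + ">");
        }
    }
}
```

The payoff is purely diagnostic: the test's pass/fail behavior is unchanged, but a failing run immediately shows what text was being searched for.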
208297
209298 @Test
560649 Set<String> knownMetadataDiffs = new HashSet<String>();
561650 //PDFBox-1792/Tika-1203
562651 knownMetadataDiffs.add("testAnnotations.pdf");
652 // Added for TIKA-93.
653 knownMetadataDiffs.add("testOCR.pdf");
563654
564655 //empty for now
565656 Set<String> knownContentDiffs = new HashSet<String>();
566657
567658 for (File f : testDocs.listFiles()) {
568 if (! f.getName().toLowerCase().endsWith(".pdf")) {
659 if (! f.getName().toLowerCase(Locale.ROOT).endsWith(".pdf")) {
569660 continue;
570661 }
571662
663 String sequentialContent = null;
664 Metadata sequentialMetadata = new Metadata();
665 try {
666 sequentialContent = getText(new FileInputStream(f),
667 sequentialParser, seqContext, sequentialMetadata);
668 } catch (EncryptedDocumentException e) {
669 //silently skip a file that requires a user password
670 continue;
671 } catch (Exception e) {
672 throw new TikaException("Sequential Parser failed on test file " + f, e);
673 }
674
572675 pdfs++;
573 Metadata sequentialMetadata = new Metadata();
574 String sequentialContent = getText(new FileInputStream(f),
575 sequentialParser, seqContext, sequentialMetadata);
576
676
677 String nonSequentialContent = null;
577678 Metadata nonSequentialMetadata = new Metadata();
578 String nonSequentialContent = getText(new FileInputStream(f),
579 nonSequentialParser, nonSeqContext, nonSequentialMetadata);
679 try {
680 nonSequentialContent = getText(new FileInputStream(f),
681 nonSequentialParser, nonSeqContext, nonSequentialMetadata);
682 } catch (Exception e) {
683 throw new TikaException("Non-Sequential Parser failed on test file " + f, e);
684 }
580685
581686 if (knownContentDiffs.contains(f.getName())) {
582687 assertFalse(f.getName(), sequentialContent.equals(nonSequentialContent));
663768 //"regressiveness" exists only in Unit10.doc not in the container pdf document
664769 assertTrue(xml.contains("regressiveness"));
665770
666 RecursiveMetadataParser p = new RecursiveMetadataParser(new AutoDetectParser(), false);
771 RecursiveParserWrapper p = new RecursiveParserWrapper(new AutoDetectParser(),
772 new BasicContentHandlerFactory(BasicContentHandlerFactory.HANDLER_TYPE.IGNORE, -1));
667773 TikaInputStream tis = null;
668774 ParseContext context = new ParseContext();
669775 PDFParserConfig config = new PDFParserConfig();
682788 }
683789 }
684790
685 List<Metadata> metadatas = p.getAllMetadata();
791 List<Metadata> metadatas = p.getMetadata();
792
686793 assertEquals(5, metadatas.size());
687794 assertNull(metadatas.get(0).get(Metadata.RESOURCE_NAME_KEY));
688 assertNull(metadatas.get(1).get(Metadata.RESOURCE_NAME_KEY));
689 assertEquals("Press Quality(1).joboptions", metadatas.get(2).get(Metadata.RESOURCE_NAME_KEY));
690 assertEquals("Unit10.doc", metadatas.get(3).get(Metadata.RESOURCE_NAME_KEY));
691 assertEquals(MediaType.image("jpeg").toString(), metadatas.get(0).get(Metadata.CONTENT_TYPE));
692 assertEquals(MediaType.image("tiff").toString(), metadatas.get(1).get(Metadata.CONTENT_TYPE));
693 assertEquals("text/plain; charset=ISO-8859-1", metadatas.get(2).get(Metadata.CONTENT_TYPE));
694 assertEquals(TYPE_DOC.toString(), metadatas.get(3).get(Metadata.CONTENT_TYPE));
695 }
696
795 assertEquals("image0.jpg", metadatas.get(1).get(Metadata.RESOURCE_NAME_KEY));
796 assertEquals("Press Quality(1).joboptions", metadatas.get(3).get(Metadata.RESOURCE_NAME_KEY));
797 assertEquals("Unit10.doc", metadatas.get(4).get(Metadata.RESOURCE_NAME_KEY));
798 assertEquals(MediaType.image("jpeg").toString(), metadatas.get(1).get(Metadata.CONTENT_TYPE));
799 assertEquals(MediaType.image("tiff").toString(), metadatas.get(2).get(Metadata.CONTENT_TYPE));
800 assertEquals("text/plain; charset=ISO-8859-1", metadatas.get(3).get(Metadata.CONTENT_TYPE));
801 assertEquals(TYPE_DOC.toString(), metadatas.get(4).get(Metadata.CONTENT_TYPE));
802 }
803
804
805 @Test
806 public void testEmbeddedFilesInAnnotations() throws Exception {
807 String xml = getXML("/testPDFFileEmbInAnnotation.pdf").xml;
808
809 assertTrue(xml.contains("This is a Excel"));
810 }
697811
698812 @Test
699813 public void testSingleCloseDoc() throws Exception {
845959
846960 Parser defaultParser = new AutoDetectParser();
847961
848 RecursiveMetadataParser p = new RecursiveMetadataParser(defaultParser, false);
962 RecursiveParserWrapper p = new RecursiveParserWrapper(defaultParser,
963 new BasicContentHandlerFactory(BasicContentHandlerFactory.HANDLER_TYPE.IGNORE, -1));
849964 ParseContext context = new ParseContext();
850965 context.set(org.apache.tika.parser.pdf.PDFParserConfig.class, config);
851966 context.set(org.apache.tika.parser.Parser.class, p);
856971
857972 p.parse(stream, handler, metadata, context);
858973
859 List<Metadata> metadatas = p.getAllMetadata();
974 List<Metadata> metadatas = p.getMetadata();
860975 int inline = 0;
861976 int attach = 0;
862977 for (Metadata m : metadatas) {
873988 assertEquals(2, attach);
874989
875990 stream.close();
876 p.clear();
991 p.reset();
877992
878993 //now try turning off inline
879994 stream = TikaInputStream.get(this.getClass().getResource(path));
8851000 metadata = new Metadata();
8861001 p.parse(stream, handler, metadata, context);
8871002
888 metadatas = p.getAllMetadata();
1003 metadatas = p.getMetadata();
8891004 for (Metadata m : metadatas) {
8901005 String v = m.get(TikaCoreProperties.EMBEDDED_RESOURCE_TYPE);
8911006 if (v != null) {
9061021 public void testInlineConfig() throws Exception {
9071022
9081023 Parser defaultParser = new AutoDetectParser();
909 RecursiveMetadataParser p = new RecursiveMetadataParser(defaultParser, false);
1024 RecursiveParserWrapper p = new RecursiveParserWrapper(defaultParser,
1025 new BasicContentHandlerFactory(BasicContentHandlerFactory.HANDLER_TYPE.IGNORE, -1));
9101026 ParseContext context = new ParseContext();
9111027 context.set(org.apache.tika.parser.Parser.class, p);
9121028 Metadata metadata = new Metadata();
9161032
9171033 p.parse(stream, handler, metadata, context);
9181034
919 List<Metadata> metadatas = p.getAllMetadata();
1035 List<Metadata> metadatas = p.getMetadata();
9201036 int inline = 0;
9211037 int attach = 0;
9221038 for (Metadata m : metadatas) {
9331049 assertEquals(2, attach);
9341050
9351051 stream.close();
936 p.clear();
1052 p.reset();
9371053
9381054 //now try turning off inline
9391055 stream = TikaInputStream.get(this.getClass().getResource(path));
9481064 metadata = new Metadata();
9491065 p.parse(stream, handler, metadata, context);
9501066
951 metadatas = p.getAllMetadata();
1067 metadatas = p.getMetadata();
9521068 for (Metadata m : metadatas) {
9531069 String v = m.get(TikaCoreProperties.EMBEDDED_RESOURCE_TYPE);
9541070 if (v != null) {
9671083 public void testEmbeddedFileNameExtraction() throws Exception {
9681084 InputStream is = PDFParserTest.class.getResourceAsStream(
9691085 "/test-documents/testPDF_multiFormatEmbFiles.pdf");
970 RecursiveMetadataParser p = new RecursiveMetadataParser(new AutoDetectParser(), false);
1086 RecursiveParserWrapper p = new RecursiveParserWrapper(
1087 new AutoDetectParser(),
1088 new BasicContentHandlerFactory(BasicContentHandlerFactory.HANDLER_TYPE.IGNORE, -1));
9711089 Metadata m = new Metadata();
9721090 ParseContext c = new ParseContext();
9731091 c.set(org.apache.tika.parser.Parser.class, p);
9741092 ContentHandler h = new BodyContentHandler();
9751093 p.parse(is, h, m, c);
9761094 is.close();
977 List<Metadata> metadatas = p.getAllMetadata();
1095 List<Metadata> metadatas = p.getMetadata();
9781096 assertEquals("metadata size", 5, metadatas.size());
979 Metadata firstAttachment = metadatas.get(0);
1097 Metadata firstAttachment = metadatas.get(1);
9801098 assertEquals("attachment file name", "Test.txt", firstAttachment.get(Metadata.RESOURCE_NAME_KEY));
9811099 }
9821100
9841102 public void testOSSpecificEmbeddedFileExtraction() throws Exception {
9851103 InputStream is = PDFParserTest.class.getResourceAsStream(
9861104 "/test-documents/testPDF_multiFormatEmbFiles.pdf");
987 RecursiveMetadataParser p = new RecursiveMetadataParser(new AutoDetectParser(), true);
1105 RecursiveParserWrapper p = new RecursiveParserWrapper(
1106 new AutoDetectParser(),
1107 new BasicContentHandlerFactory(BasicContentHandlerFactory.HANDLER_TYPE.TEXT, -1));
9881108 Metadata m = new Metadata();
9891109 ParseContext c = new ParseContext();
9901110 c.set(org.apache.tika.parser.Parser.class, p);
9911111 ContentHandler h = new BodyContentHandler();
9921112 p.parse(is, h, m, c);
9931113 is.close();
994 List<Metadata> metadatas = p.getAllMetadata();
1114 List<Metadata> metadatas = p.getMetadata();
9951115 assertEquals("metadata size", 5, metadatas.size());
9961116
997 assertEquals("file name", "Test.txt", metadatas.get(0).get(Metadata.RESOURCE_NAME_KEY));
998 assertContains("os specific", metadatas.get(0).get(RecursiveMetadataParser.TIKA_CONTENT));
999 assertEquals("file name", "TestMac.txt", metadatas.get(1).get(Metadata.RESOURCE_NAME_KEY));
1000 assertContains("mac embedded", metadatas.get(1).get(RecursiveMetadataParser.TIKA_CONTENT));
1001 assertEquals("file name", "TestDos.txt", metadatas.get(2).get(Metadata.RESOURCE_NAME_KEY));
1002 assertContains("dos embedded", metadatas.get(2).get(RecursiveMetadataParser.TIKA_CONTENT));
1003 assertEquals("file name", "TestUnix.txt", metadatas.get(3).get(Metadata.RESOURCE_NAME_KEY));
1004 assertContains("unix embedded", metadatas.get(3).get(RecursiveMetadataParser.TIKA_CONTENT));
1005
1006 }
1007
1117 assertEquals("file name", "Test.txt", metadatas.get(1).get(Metadata.RESOURCE_NAME_KEY));
1118 assertContains("os specific", metadatas.get(1).get(RecursiveParserWrapper.TIKA_CONTENT));
1119 assertEquals("file name", "TestMac.txt", metadatas.get(2).get(Metadata.RESOURCE_NAME_KEY));
1120 assertContains("mac embedded", metadatas.get(2).get(RecursiveParserWrapper.TIKA_CONTENT));
1121 assertEquals("file name", "TestDos.txt", metadatas.get(3).get(Metadata.RESOURCE_NAME_KEY));
1122 assertContains("dos embedded", metadatas.get(3).get(RecursiveParserWrapper.TIKA_CONTENT));
1123 assertEquals("file name", "TestUnix.txt", metadatas.get(4).get(Metadata.RESOURCE_NAME_KEY));
1124 assertContains("unix embedded", metadatas.get(4).get(RecursiveParserWrapper.TIKA_CONTENT));
1125
1126 }
1127
1128 @Test //TIKA-1427
1129 public void testEmbeddedFileMarkup() throws Exception {
1130 Parser parser = new AutoDetectParser();
1131 ParseContext context = new ParseContext();
1132 context.set(org.apache.tika.parser.Parser.class, parser);
1133
1134 PDFParserConfig config = new PDFParserConfig();
1135 config.setExtractInlineImages(true);
1136 config.setExtractUniqueInlineImagesOnly(false);
1137 context.set(org.apache.tika.parser.pdf.PDFParserConfig.class, config);
1138
1139
1140 Metadata metadata = new Metadata();
1141 ContentHandler handler = new ToXMLContentHandler();
1142 String path = "/test-documents/testPDF_childAttachments.pdf";
1143 InputStream stream = null;
1144 try {
1145 stream = TikaInputStream.get(this.getClass().getResource(path));
1146 parser.parse(stream, handler, metadata, context);
1147 } finally {
1148 IOUtils.closeQuietly(stream);
1149 }
1150
1151 String xml = handler.toString();
1152 //regular attachment
1153 assertContains("<div class=\"embedded\" id=\"Unit10.doc\" />", xml);
1154 //inline image
1155 assertContains("<img src=\"embedded:image1.tif\" alt=\"image1.tif\" />", xml);
1156
1157 //doc embedded inside an annotation
1158 xml = getXML("testPDFFileEmbInAnnotation.pdf").xml;
1159 assertContains("<div class=\"embedded\" id=\"Excel.xlsx\" />", xml);
1160 }
1161
1162 //Access checker tests
1163
1164 @Test
1165 public void testLegacyAccessChecking() throws Exception {
1166 //test that default behavior doesn't throw AccessPermissionException
1167 for (String file : new String[] {
1168 "testPDF_no_extract_no_accessibility_owner_empty.pdf",
1169 "testPDF_no_extract_yes_accessibility_owner_empty.pdf",
1170 }) {
1171 String xml = getXML(file).xml;
1172 assertContains("Hello World", xml);
1173 }
1174
1175 //now try with the user password
1176 PasswordProvider provider = new PasswordProvider() {
1177 @Override
1178 public String getPassword(Metadata metadata) {
1179 return "user";
1180 }
1181 };
1182
1183 ParseContext context = new ParseContext();
1184 context.set(PasswordProvider.class, provider);
1185 Parser parser = new AutoDetectParser();
1186
1187 for (String path : new String[] {
1188 "testPDF_no_extract_no_accessibility_owner_user.pdf",
1189 "testPDF_no_extract_yes_accessibility_owner_user.pdf",
1190 }) {
1191 InputStream stream = null;
1192 try {
1193 stream = TikaInputStream.get(this.getClass().getResource("/test-documents/"+path));
1194 String text = getText(stream, parser, context);
1195 assertContains("Hello World", text);
1196 } finally {
1197 IOUtils.closeQuietly(stream);
1198 }
1199 }
1200 }
1201
1202 @Test
1203 public void testAccessCheckingEmptyPassword() throws Exception {
1204 PDFParserConfig config = new PDFParserConfig();
1205
1206 //don't allow extraction, not even for accessibility
1207 config.setAccessChecker(new AccessChecker(false));
1208 Parser parser = new AutoDetectParser();
1209 ParseContext context = new ParseContext();
1210 context.set(PDFParserConfig.class, config);
1211
1212 //test exception for empty password
1213 for (String path : new String[] {
1214 "testPDF_no_extract_no_accessibility_owner_empty.pdf",
1215 "testPDF_no_extract_yes_accessibility_owner_empty.pdf",
1216 }) {
1217 assertException("/test-documents/"+path, parser, context, AccessPermissionException.class);
1218 }
1219
1220 config.setAccessChecker(new AccessChecker(true));
1221 assertException("/test-documents/" + "testPDF_no_extract_no_accessibility_owner_empty.pdf",
1222 parser, context, AccessPermissionException.class);
1223
1224 InputStream is = null;
1225 try {
1226 is = getResourceAsStream("/test-documents/"+ "testPDF_no_extract_yes_accessibility_owner_empty.pdf");
1227 assertContains("Hello World", getText(is, parser, context));
1228 } finally {
1229 IOUtils.closeQuietly(is);
1230 }
1231 }
1232
1233 @Test
1234 public void testAccessCheckingUserPassword() throws Exception {
1235 ParseContext context = new ParseContext();
1236
1237 PDFParserConfig config = new PDFParserConfig();
1238 //don't allow extraction, not even for accessibility
1239 config.setAccessChecker(new AccessChecker(false));
1240 PasswordProvider passwordProvider = new PasswordProvider() {
1241 @Override
1242 public String getPassword(Metadata metadata) {
1243 return "user";
1244 }
1245 };
1246
1247 context.set(PasswordProvider.class, passwordProvider);
1248 context.set(PDFParserConfig.class, config);
1249
1250 Parser parser = new AutoDetectParser();
1251
1252 //test bad passwords
1253 for (String path : new String[] {
1254 "testPDF_no_extract_no_accessibility_owner_empty.pdf",
1255 "testPDF_no_extract_yes_accessibility_owner_empty.pdf",
1256 }) {
1257 assertException("/test-documents/"+path, parser, context, EncryptedDocumentException.class);
1258 }
1259
1260 //bad password is still a bad password
1261 config.setAccessChecker(new AccessChecker(true));
1262 for (String path : new String[] {
1263 "testPDF_no_extract_no_accessibility_owner_empty.pdf",
1264 "testPDF_no_extract_yes_accessibility_owner_empty.pdf",
1265 }) {
1266 assertException("/test-documents/"+path, parser, context, EncryptedDocumentException.class);
1267 }
1268
1269 //now test documents that require this "user" password
1270 assertException("/test-documents/"+"testPDF_no_extract_no_accessibility_owner_user.pdf",
1271 parser, context, AccessPermissionException.class);
1272
1273
1274 InputStream is = null;
1275 try {
1276 is = getResourceAsStream("/test-documents/"+ "testPDF_no_extract_yes_accessibility_owner_user.pdf");
1277 assertContains("Hello World", getText(is, parser, context));
1278 } finally {
1279 IOUtils.closeQuietly(is);
1280 }
1281
1282 config.setAccessChecker(new AccessChecker(false));
1283 for (String path : new String[] {
1284 "testPDF_no_extract_no_accessibility_owner_user.pdf",
1285 "testPDF_no_extract_yes_accessibility_owner_user.pdf",
1286 }) {
1287 assertException("/test-documents/"+path, parser, context, AccessPermissionException.class);
1288 }
1289 }
1290
1291 @Test
1292 public void testAccessCheckingOwnerPassword() throws Exception {
1293 ParseContext context = new ParseContext();
1294
1295 PDFParserConfig config = new PDFParserConfig();
1296 //don't allow extraction, not even for accessibility
1297 config.setAccessChecker(new AccessChecker(true));
1298 PasswordProvider passwordProvider = new PasswordProvider() {
1299 @Override
1300 public String getPassword(Metadata metadata) {
1301 return "owner";
1302 }
1303 };
1304
1305 context.set(PasswordProvider.class, passwordProvider);
1306 context.set(PDFParserConfig.class, config);
1307
1308 Parser parser = new AutoDetectParser();
1309 //with the owner's password, text can be extracted, no matter the AccessChecker settings
1310 for (String path : new String[] {
1311 "testPDF_no_extract_no_accessibility_owner_user.pdf",
1312 "testPDF_no_extract_yes_accessibility_owner_user.pdf",
1313 "testPDF_no_extract_no_accessibility_owner_empty.pdf",
1314 "testPDF_no_extract_yes_accessibility_owner_empty.pdf",
1315 }) {
1316
1317 InputStream is = null;
1318 try {
1319 is = getResourceAsStream("/test-documents/" + path);
1320 assertContains("Hello World", getText(is, parser, context));
1321 } finally {
1322 IOUtils.closeQuietly(is);
1323 }
1324 }
1325
1326 //really, with owner's password, all extraction is allowed
1327 config.setAccessChecker(new AccessChecker(false));
1328 for (String path : new String[] {
1329 "testPDF_no_extract_no_accessibility_owner_user.pdf",
1330 "testPDF_no_extract_yes_accessibility_owner_user.pdf",
1331 "testPDF_no_extract_no_accessibility_owner_empty.pdf",
1332 "testPDF_no_extract_yes_accessibility_owner_empty.pdf",
1333 }) {
1334
1335 InputStream is = null;
1336 try {
1337 is = getResourceAsStream("/test-documents/" + path);
1338 assertContains("Hello World", getText(is, parser, context));
1339 } finally {
1340 IOUtils.closeQuietly(is);
1341 }
1342 }
1343 }
1344
1345 private void assertException(String path, Parser parser, ParseContext context, Class<? extends Exception> expected) {
1346 boolean noEx = false;
1347 InputStream is = getResourceAsStream(path);
1348 try {
1349 String text = getText(is, parser, context);
1350 noEx = true;
1351 } catch (Exception e) {
1352 assertEquals("Not the right exception: "+path, expected, e.getClass());
1353 } finally {
1354 IOUtils.closeQuietly(is);
1355 }
1356 assertFalse(path + " should have thrown exception", noEx);
1357 }
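The `assertException` helper above captures a common pattern: run the parse, record whether it completed, and compare the class of whatever was thrown against the expected exception type. A dependency-free sketch of the same pattern (hypothetical names; this is not Tika code, and JUnit 4.13+ ships an equivalent `Assert.assertThrows`):

```java
// Generic form of the assertException pattern: fail if the action
// completes normally, and fail if it throws the wrong exception class.
public class ExceptionAssert {
    public interface ThrowingRunnable {
        void run() throws Exception;
    }

    public static void assertThrows(Class<? extends Exception> expected,
                                    ThrowingRunnable action) {
        boolean completed = false;
        try {
            action.run();
            completed = true;
        } catch (Exception e) {
            if (!expected.equals(e.getClass())) {
                throw new AssertionError("Not the right exception: got "
                        + e.getClass() + ", expected " + expected);
            }
        }
        if (completed) {
            throw new AssertionError("Expected " + expected + " to be thrown");
        }
    }
}
```

Note the `completed` flag is set *inside* the try, after the action returns: throwing the "did not throw" AssertionError directly at that point would itself be caught by a `catch (Exception e)` only if it were an Exception, but AssertionError is an Error, so raising it after the try/catch keeps the control flow unambiguous.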
10081358 /**
10091359 *
10101360 * Simple class to count end of document events. If functionality is useful,
2424 import org.apache.tika.TikaTest;
2525 import org.apache.tika.exception.TikaException;
2626 import org.apache.tika.metadata.Metadata;
27 import org.apache.tika.metadata.TikaCoreProperties;
2728 import org.apache.tika.mime.MediaType;
2829 import org.apache.tika.parser.AbstractParser;
2930 import org.apache.tika.parser.AutoDetectParser;
5960 protected static class EmbeddedTrackingParser extends AbstractParser {
6061 protected List<String> filenames = new ArrayList<String>();
6162 protected List<String> mediatypes = new ArrayList<String>();
63 protected List<String> createdAts = new ArrayList<String>();
64 protected List<String> modifiedAts = new ArrayList<String>();
6265 protected byte[] lastSeenStart;
6366
6467 public void reset() {
6568 filenames.clear();
6669 mediatypes.clear();
70 createdAts.clear();
71 modifiedAts.clear();
6772 }
6873
6974 public Set<MediaType> getSupportedTypes(ParseContext context) {
7681 SAXException, TikaException {
7782 filenames.add(metadata.get(Metadata.RESOURCE_NAME_KEY));
7883 mediatypes.add(metadata.get(Metadata.CONTENT_TYPE));
84 createdAts.add(metadata.get(TikaCoreProperties.CREATED));
85 modifiedAts.add(metadata.get(TikaCoreProperties.MODIFIED));
7986
8087 lastSeenStart = new byte[32];
8188 stream.read(lastSeenStart);
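The tracking parser above fills `lastSeenStart` with a single `stream.read(lastSeenStart)`, which `InputStream` only guarantees to return at least one byte; a short read would leave the tail of the buffer zeroed. The test files here are long enough that this works in practice, but a fully-reading variant is safer. A sketch using the standard library's `DataInputStream.readFully` (the helper name is illustrative):

```java
import java.io.DataInputStream;
import java.io.IOException;
import java.io.InputStream;

public class HeaderRead {
    // Read exactly n bytes from the stream into a fresh buffer.
    // DataInputStream.readFully loops internally until the buffer is
    // full, and throws EOFException if the stream ends first - unlike
    // a single InputStream.read(buf), which may legally return fewer.
    public static byte[] readHeader(InputStream in, int n) throws IOException {
        byte[] buf = new byte[n];
        new DataInputStream(in).readFully(buf);
        return buf;
    }
}
```

This matches how the surrounding assertions use the buffer, e.g. comparing the first 15 bytes against `"test-documents/"` for tar entries.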
3131
3232 public class ArParserTest extends AbstractPkgTest {
3333 @Test
34 public void testArParsing() throws Exception {
35 Parser parser = new AutoDetectParser();
34 public void testArParsing() throws Exception {
35 Parser parser = new AutoDetectParser();
3636
37 ContentHandler handler = new BodyContentHandler();
38 Metadata metadata = new Metadata();
37 ContentHandler handler = new BodyContentHandler();
38 Metadata metadata = new Metadata();
3939
40 InputStream stream = ArParserTest.class
41 .getResourceAsStream("/test-documents/testARofText.ar");
42 try {
43 parser.parse(stream, handler, metadata, recursingContext);
44 } finally {
45 stream.close();
46 }
40 InputStream stream = ArParserTest.class.getResourceAsStream(
41 "/test-documents/testARofText.ar");
42 try {
43 parser.parse(stream, handler, metadata, recursingContext);
44 } finally {
45 stream.close();
46 }
4747
48 assertEquals("application/x-archive",
49 metadata.get(Metadata.CONTENT_TYPE));
50 String content = handler.toString();
51 assertTrue(content.contains("testTXT.txt"));
52 assertTrue(content.contains("Test d'indexation de Txt"));
53 assertTrue(content.contains("http://www.apache.org"));
48 assertEquals("application/x-archive",
49 metadata.get(Metadata.CONTENT_TYPE));
50 String content = handler.toString();
51 assertContains("testTXT.txt", content);
52 assertContains("Test d'indexation de Txt", content);
53 assertContains("http://www.apache.org", content);
5454
55 stream = ArParserTest.class
56 .getResourceAsStream("/test-documents/testARofSND.ar");
57 try {
58 parser.parse(stream, handler, metadata, recursingContext);
59 } finally {
60 stream.close();
61 }
55 stream = ArParserTest.class.getResourceAsStream(
56 "/test-documents/testARofSND.ar");
57 try {
58 parser.parse(stream, handler, metadata, recursingContext);
59 } finally {
60 stream.close();
61 }
6262
63 assertEquals("application/x-archive",
64 metadata.get(Metadata.CONTENT_TYPE));
65 content = handler.toString();
66 assertTrue(content.contains("testAU.au"));
67 }
63 assertEquals("application/x-archive",
64 metadata.get(Metadata.CONTENT_TYPE));
65 content = handler.toString();
66 assertContains("testAU.au", content);
67 }
6868
69 /**
70 * Tests that the ParseContext parser is correctly fired for all the
71 * embedded entries.
72 */
69 /**
70 * Tests that the ParseContext parser is correctly fired for all the
71 * embedded entries.
72 */
7373 @Test
74 public void testEmbedded() throws Exception {
75 Parser parser = new AutoDetectParser(); // Should auto-detect!
76 ContentHandler handler = new BodyContentHandler();
77 Metadata metadata = new Metadata();
74 public void testEmbedded() throws Exception {
75 Parser parser = new AutoDetectParser(); // Should auto-detect!
76 ContentHandler handler = new BodyContentHandler();
77 Metadata metadata = new Metadata();
7878
79 InputStream stream = ArParserTest.class
80 .getResourceAsStream("/test-documents/testARofText.ar");
81 try {
82 parser.parse(stream, handler, metadata, trackingContext);
83 } finally {
84 stream.close();
85 }
79 InputStream stream = ArParserTest.class.getResourceAsStream(
80 "/test-documents/testARofText.ar");
81 try {
82 parser.parse(stream, handler, metadata, trackingContext);
83 } finally {
84 stream.close();
85 }
8686
87 assertEquals(1, tracker.filenames.size());
88 assertEquals(1, tracker.mediatypes.size());
87 assertEquals(1, tracker.filenames.size());
88 assertEquals(1, tracker.mediatypes.size());
89 assertEquals(1, tracker.modifiedAts.size());
8990
90 assertEquals("testTXT.txt", tracker.filenames.get(0));
91 assertEquals("testTXT.txt", tracker.filenames.get(0));
9192
92 for (String type : tracker.mediatypes) {
93 assertNull(type);
94 }
93 String modifiedAt = tracker.modifiedAts.get(0);
94 assertTrue("Modified at " + modifiedAt, modifiedAt.startsWith("201"));
9595
96 tracker.reset();
97 stream = ArParserTest.class
98 .getResourceAsStream("/test-documents/testARofSND.ar");
99 try {
100 parser.parse(stream, handler, metadata, trackingContext);
101 } finally {
102 stream.close();
103 }
96 for (String type : tracker.mediatypes) {
97 assertNull(type);
98 }
99 for (String crt : tracker.createdAts) {
100 assertNull(crt);
101 }
104102
105 assertEquals(1, tracker.filenames.size());
106 assertEquals(1, tracker.mediatypes.size());
107 assertEquals("testAU.au", tracker.filenames.get(0));
103 tracker.reset();
104 stream = ArParserTest.class.getResourceAsStream(
105 "/test-documents/testARofSND.ar");
106 try {
107 parser.parse(stream, handler, metadata, trackingContext);
108 } finally {
109 stream.close();
110 }
108111
109 for (String type : tracker.mediatypes) {
110 assertNull(type);
111 }
112 }
112 assertEquals(1, tracker.filenames.size());
113 assertEquals(1, tracker.mediatypes.size());
114 assertEquals(1, tracker.modifiedAts.size());
115 assertEquals("testAU.au", tracker.filenames.get(0));
116
117 modifiedAt = tracker.modifiedAts.get(0);
118 assertTrue("Modified at " + modifiedAt, modifiedAt.startsWith("201"));
119
120 for (String type : tracker.mediatypes) {
121 assertNull(type);
122 }
123 for (String crt : tracker.createdAts) {
124 assertNull(crt);
125 }
126 }
113127 }
1616 package org.apache.tika.parser.pkg;
1717
1818 import static org.junit.Assert.assertEquals;
19 import static org.junit.Assert.assertTrue;
2019
2120 import java.io.InputStream;
2221
4847
4948 assertEquals("application/x-bzip2", metadata.get(Metadata.CONTENT_TYPE));
5049 String content = handler.toString();
51 assertTrue(content.contains("test-documents/testEXCEL.xls"));
52 assertTrue(content.contains("Sample Excel Worksheet"));
53 assertTrue(content.contains("test-documents/testHTML.html"));
54 assertTrue(content.contains("Test Indexation Html"));
55 assertTrue(content.contains("test-documents/testOpenOffice2.odt"));
56 assertTrue(content.contains("This is a sample Open Office document"));
57 assertTrue(content.contains("test-documents/testPDF.pdf"));
58 assertTrue(content.contains("Apache Tika"));
59 assertTrue(content.contains("test-documents/testPPT.ppt"));
60 assertTrue(content.contains("Sample Powerpoint Slide"));
61 assertTrue(content.contains("test-documents/testRTF.rtf"));
62 assertTrue(content.contains("indexation Word"));
63 assertTrue(content.contains("test-documents/testTXT.txt"));
64 assertTrue(content.contains("Test d'indexation de Txt"));
65 assertTrue(content.contains("test-documents/testWORD.doc"));
66 assertTrue(content.contains("This is a sample Microsoft Word Document"));
67 assertTrue(content.contains("test-documents/testXML.xml"));
68 assertTrue(content.contains("Rida Benjelloun"));
50 assertContains("test-documents/testEXCEL.xls", content);
51 assertContains("Sample Excel Worksheet", content);
52 assertContains("test-documents/testHTML.html", content);
53 assertContains("Test Indexation Html", content);
54 assertContains("test-documents/testOpenOffice2.odt", content);
55 assertContains("This is a sample Open Office document", content);
56 assertContains("test-documents/testPDF.pdf", content);
57 assertContains("Apache Tika", content);
58 assertContains("test-documents/testPPT.ppt", content);
59 assertContains("Sample Powerpoint Slide", content);
60 assertContains("test-documents/testRTF.rtf", content);
61 assertContains("indexation Word", content);
62 assertContains("test-documents/testTXT.txt", content);
63 assertContains("Test d'indexation de Txt", content);
64 assertContains("test-documents/testWORD.doc", content);
65 assertContains("This is a sample Microsoft Word Document", content);
66 assertContains("test-documents/testXML.xml", content);
67 assertContains("Rida Benjelloun", content);
6968 }
7069
7170
9089 // Should find a single entry, for the (compressed) tar file
9190 assertEquals(1, tracker.filenames.size());
9291 assertEquals(1, tracker.mediatypes.size());
92 assertEquals(1, tracker.modifiedAts.size());
9393
9494 assertEquals(null, tracker.filenames.get(0));
9595 assertEquals(null, tracker.mediatypes.get(0));
96 assertEquals(null, tracker.createdAts.get(0));
97 assertEquals(null, tracker.modifiedAts.get(0));
9698
9799 // Tar file starts with the directory name
98100 assertEquals("test-documents/", new String(tracker.lastSeenStart, 0, 15, "ASCII"));
1616 package org.apache.tika.parser.pkg;
1717
1818 import static org.junit.Assert.assertEquals;
19 import static org.junit.Assert.assertTrue;
2019
2120 import java.io.InputStream;
2221
4847
4948 assertEquals("application/gzip", metadata.get(Metadata.CONTENT_TYPE));
5049 String content = handler.toString();
51 assertTrue(content.contains("test-documents/testEXCEL.xls"));
52 assertTrue(content.contains("Sample Excel Worksheet"));
53 assertTrue(content.contains("test-documents/testHTML.html"));
54 assertTrue(content.contains("Test Indexation Html"));
55 assertTrue(content.contains("test-documents/testOpenOffice2.odt"));
56 assertTrue(content.contains("This is a sample Open Office document"));
57 assertTrue(content.contains("test-documents/testPDF.pdf"));
58 assertTrue(content.contains("Apache Tika"));
59 assertTrue(content.contains("test-documents/testPPT.ppt"));
60 assertTrue(content.contains("Sample Powerpoint Slide"));
61 assertTrue(content.contains("test-documents/testRTF.rtf"));
62 assertTrue(content.contains("indexation Word"));
63 assertTrue(content.contains("test-documents/testTXT.txt"));
64 assertTrue(content.contains("Test d'indexation de Txt"));
65 assertTrue(content.contains("test-documents/testWORD.doc"));
66 assertTrue(content.contains("This is a sample Microsoft Word Document"));
67 assertTrue(content.contains("test-documents/testXML.xml"));
68 assertTrue(content.contains("Rida Benjelloun"));
50 assertContains("test-documents/testEXCEL.xls", content);
51 assertContains("Sample Excel Worksheet", content);
52 assertContains("test-documents/testHTML.html", content);
53 assertContains("Test Indexation Html", content);
54 assertContains("test-documents/testOpenOffice2.odt", content);
55 assertContains("This is a sample Open Office document", content);
56 assertContains("test-documents/testPDF.pdf", content);
57 assertContains("Apache Tika", content);
58 assertContains("test-documents/testPPT.ppt", content);
59 assertContains("Sample Powerpoint Slide", content);
60 assertContains("test-documents/testRTF.rtf", content);
61 assertContains("indexation Word", content);
62 assertContains("test-documents/testTXT.txt", content);
63 assertContains("Test d'indexation de Txt", content);
64 assertContains("test-documents/testWORD.doc", content);
65 assertContains("This is a sample Microsoft Word Document", content);
66 assertContains("test-documents/testXML.xml", content);
67 assertContains("Rida Benjelloun", content);
6968 }
7069
7170 /**
8988 // Should find a single entry, for the (compressed) tar file
9089 assertEquals(1, tracker.filenames.size());
9190 assertEquals(1, tracker.mediatypes.size());
91 assertEquals(1, tracker.modifiedAts.size());
9292
9393 assertEquals(null, tracker.filenames.get(0));
9494 assertEquals(null, tracker.mediatypes.get(0));
95 assertEquals(null, tracker.modifiedAts.get(0));
9596
9697 // Tar file starts with the directory name
9798 assertEquals("test-documents/", new String(tracker.lastSeenStart, 0, 15, "ASCII"));
113114
114115 assertEquals("application/gzip", metadata.get(Metadata.CONTENT_TYPE));
115116 String content = handler.toString();
116 assertTrue(content.contains("Test SVG image"));
117 assertContains("Test SVG image", content);
117118 }
118119
119120 }
0 /*
1 * Licensed to the Apache Software Foundation (ASF) under one or more
2 * contributor license agreements. See the NOTICE file distributed with
3 * this work for additional information regarding copyright ownership.
4 * The ASF licenses this file to You under the Apache License, Version 2.0
5 * (the "License"); you may not use this file except in compliance with
6 * the License. You may obtain a copy of the License at
7 *
8 * http://www.apache.org/licenses/LICENSE-2.0
9 *
10 * Unless required by applicable law or agreed to in writing, software
11 * distributed under the License is distributed on an "AS IS" BASIS,
12 * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
13 * See the License for the specific language governing permissions and
14 * limitations under the License.
15 */
16 package org.apache.tika.parser.pkg;
17
18 import static org.junit.Assert.assertEquals;
19 import static org.junit.Assert.assertNotNull;
20 import static org.junit.Assert.assertNull;
21 import static org.junit.Assert.assertTrue;
22
23 import java.io.InputStream;
24
25 import org.apache.tika.metadata.Metadata;
26 import org.apache.tika.parser.AutoDetectParser;
27 import org.apache.tika.parser.Parser;
28 import org.apache.tika.sax.BodyContentHandler;
29 import org.junit.Test;
30 import org.xml.sax.ContentHandler;
31
32 /**
33 * Test case for parsing RAR files.
34 */
35 public class RarParserTest extends AbstractPkgTest {
36
37 @Test
38 public void testRarParsing() throws Exception {
39 Parser parser = new AutoDetectParser(); // Should auto-detect!
40 ContentHandler handler = new BodyContentHandler();
41 Metadata metadata = new Metadata();
42
43 InputStream stream = RarParserTest.class.getResourceAsStream(
44 "/test-documents/test-documents.rar");
45 try {
46 parser.parse(stream, handler, metadata, recursingContext);
47 } finally {
48 stream.close();
49 }
50
51 assertEquals("application/x-rar-compressed", metadata.get(Metadata.CONTENT_TYPE));
52 String content = handler.toString();
53 assertContains("test-documents/testEXCEL.xls", content);
54 assertContains("Sample Excel Worksheet", content);
55 assertContains("test-documents/testHTML.html", content);
56 assertContains("Test Indexation Html", content);
57 assertContains("test-documents/testOpenOffice2.odt", content);
58 assertContains("This is a sample Open Office document", content);
59 assertContains("test-documents/testPDF.pdf", content);
60 assertContains("Apache Tika", content);
61 assertContains("test-documents/testPPT.ppt", content);
62 assertContains("Sample Powerpoint Slide", content);
63 assertContains("test-documents/testRTF.rtf", content);
64 assertContains("indexation Word", content);
65 assertContains("test-documents/testTXT.txt", content);
66 assertContains("Test d'indexation de Txt", content);
67 assertContains("test-documents/testWORD.doc", content);
68 assertContains("This is a sample Microsoft Word Document", content);
69 assertContains("test-documents/testXML.xml", content);
70 assertContains("Rida Benjelloun", content);
71 }
72
73 /**
74 * Tests that the ParseContext parser is correctly
75 * fired for all the embedded entries.
76 */
77 @Test
78 public void testEmbedded() throws Exception {
79 Parser parser = new AutoDetectParser(); // Should auto-detect!
80 ContentHandler handler = new BodyContentHandler();
81 Metadata metadata = new Metadata();
82
83 InputStream stream = RarParserTest.class.getResourceAsStream(
84 "/test-documents/test-documents.rar");
85 try {
86 parser.parse(stream, handler, metadata, trackingContext);
87 } finally {
88 stream.close();
89 }
90
91 // Should have found all 9 documents, but not the directory
92 assertEquals(9, tracker.filenames.size());
93 assertEquals(9, tracker.mediatypes.size());
94 assertEquals(9, tracker.modifiedAts.size());
95
96 // Should have names but not content types, as rar doesn't
97 // store the content types
98 assertEquals("test-documents/testEXCEL.xls", tracker.filenames.get(0));
99 assertEquals("test-documents/testHTML.html", tracker.filenames.get(1));
100 assertEquals("test-documents/testOpenOffice2.odt", tracker.filenames.get(2));
101 assertEquals("test-documents/testPDF.pdf", tracker.filenames.get(3));
102 assertEquals("test-documents/testPPT.ppt", tracker.filenames.get(4));
103 assertEquals("test-documents/testRTF.rtf", tracker.filenames.get(5));
104 assertEquals("test-documents/testTXT.txt", tracker.filenames.get(6));
105 assertEquals("test-documents/testWORD.doc", tracker.filenames.get(7));
106 assertEquals("test-documents/testXML.xml", tracker.filenames.get(8));
107
108 for(String type : tracker.mediatypes) {
109 assertNull(type);
110 }
111 for(String crt : tracker.createdAts) {
112 assertNull(crt);
113 }
114 for(String mod : tracker.modifiedAts) {
115 assertNotNull(mod);
116 assertTrue("Modified at " + mod, mod.startsWith("20"));
117 }
118
119 // Should have filenames in the content string
120 String content = handler.toString();
121 assertContains("test-documents/testHTML.html", content);
122 assertContains("test-documents/testEXCEL.xls", content);
123 assertContains("test-documents/testOpenOffice2.odt", content);
124 assertContains("test-documents/testPDF.pdf", content);
125 assertContains("test-documents/testPPT.ppt", content);
126 assertContains("test-documents/testRTF.rtf", content);
127 assertContains("test-documents/testTXT.txt", content);
128 assertContains("test-documents/testWORD.doc", content);
129 assertContains("test-documents/testXML.xml", content);
130 }
131 }
1616 package org.apache.tika.parser.pkg;
1717
1818 import static org.junit.Assert.assertEquals;
19 import static org.junit.Assert.assertNotNull;
1920 import static org.junit.Assert.assertNull;
2021 import static org.junit.Assert.assertTrue;
22 import static org.junit.Assert.fail;
23
24 import javax.crypto.Cipher;
2125
2226 import java.io.InputStream;
23
27 import java.security.NoSuchAlgorithmException;
28
29 import org.apache.tika.exception.EncryptedDocumentException;
30 import org.apache.tika.exception.TikaException;
2431 import org.apache.tika.metadata.Metadata;
2532 import org.apache.tika.mime.MediaType;
2633 import org.apache.tika.parser.AutoDetectParser;
2734 import org.apache.tika.parser.Parser;
35 import org.apache.tika.parser.PasswordProvider;
2836 import org.apache.tika.sax.BodyContentHandler;
2937 import org.junit.Test;
3038 import org.xml.sax.ContentHandler;
4654 parser.getSupportedTypes(recursingContext).contains(TYPE_7ZIP));
4755
4856 // Parse
49 InputStream stream = TarParserTest.class.getResourceAsStream(
57 InputStream stream = Seven7ParserTest.class.getResourceAsStream(
5058 "/test-documents/test-documents.7z");
5159 try {
5260 parser.parse(stream, handler, metadata, recursingContext);
5664
5765 assertEquals(TYPE_7ZIP.toString(), metadata.get(Metadata.CONTENT_TYPE));
5866 String content = handler.toString();
59 assertTrue(content.contains("test-documents/testEXCEL.xls"));
60 assertTrue(content.contains("Sample Excel Worksheet"));
61 assertTrue(content.contains("test-documents/testHTML.html"));
62 assertTrue(content.contains("Test Indexation Html"));
63 assertTrue(content.contains("test-documents/testOpenOffice2.odt"));
64 assertTrue(content.contains("This is a sample Open Office document"));
65 assertTrue(content.contains("test-documents/testPDF.pdf"));
66 assertTrue(content.contains("Apache Tika"));
67 assertTrue(content.contains("test-documents/testPPT.ppt"));
68 assertTrue(content.contains("Sample Powerpoint Slide"));
69 assertTrue(content.contains("test-documents/testRTF.rtf"));
70 assertTrue(content.contains("indexation Word"));
71 assertTrue(content.contains("test-documents/testTXT.txt"));
72 assertTrue(content.contains("Test d'indexation de Txt"));
73 assertTrue(content.contains("test-documents/testWORD.doc"));
74 assertTrue(content.contains("This is a sample Microsoft Word Document"));
75 assertTrue(content.contains("test-documents/testXML.xml"));
76 assertTrue(content.contains("Rida Benjelloun"));
67 assertContains("test-documents/testEXCEL.xls", content);
68 assertContains("Sample Excel Worksheet", content);
69 assertContains("test-documents/testHTML.html", content);
70 assertContains("Test Indexation Html", content);
71 assertContains("test-documents/testOpenOffice2.odt", content);
72 assertContains("This is a sample Open Office document", content);
73 assertContains("test-documents/testPDF.pdf", content);
74 assertContains("Apache Tika", content);
75 assertContains("test-documents/testPPT.ppt", content);
76 assertContains("Sample Powerpoint Slide", content);
77 assertContains("test-documents/testRTF.rtf", content);
78 assertContains("indexation Word", content);
79 assertContains("test-documents/testTXT.txt", content);
80 assertContains("Test d'indexation de Txt", content);
81 assertContains("test-documents/testWORD.doc", content);
82 assertContains("This is a sample Microsoft Word Document", content);
83 assertContains("test-documents/testXML.xml", content);
84 assertContains("Rida Benjelloun", content);
7785 }
7886
7987 /**
8694 ContentHandler handler = new BodyContentHandler();
8795 Metadata metadata = new Metadata();
8896
89 InputStream stream = ZipParserTest.class.getResourceAsStream(
97 InputStream stream = Seven7ParserTest.class.getResourceAsStream(
9098 "/test-documents/test-documents.7z");
9199 try {
92100 parser.parse(stream, handler, metadata, trackingContext);
97105 // Should have found all 9 documents, but not the directory
98106 assertEquals(9, tracker.filenames.size());
99107 assertEquals(9, tracker.mediatypes.size());
108 assertEquals(9, tracker.modifiedAts.size());
100109
101110 // Should have names but not content types, as 7z doesn't
102111 // store the content types
113122 for(String type : tracker.mediatypes) {
114123 assertNull(type);
115124 }
125 for(String mod : tracker.modifiedAts) {
126 assertNotNull(mod);
127 assertTrue("Modified at " + mod, mod.startsWith("20"));
128 }
129 }
130
131 @Test
132 public void testPasswordProtected() throws Exception {
133 Parser parser = new AutoDetectParser();
134 ContentHandler handler = new BodyContentHandler();
135 Metadata metadata = new Metadata();
136
137 // No password, will fail with EncryptedDocumentException
138 InputStream stream = Seven7ParserTest.class.getResourceAsStream(
139 "/test-documents/test7Z_protected_passTika.7z");
140 boolean ex = false;
141 try {
142 parser.parse(stream, handler, metadata, recursingContext);
143 fail("Shouldn't be able to read a password protected 7z without the password");
144 } catch (EncryptedDocumentException e) {
145 // Good
146 ex = true;
147 } finally {
148 stream.close();
149 }
150
151 assertTrue("Expected EncryptedDocumentException when no password is given", ex);
152
153 ex = false;
154
155 // Wrong password currently silently gives no content
156 // Ideally we'd like Commons Compress to give an error, but it doesn't...
157 recursingContext.set(PasswordProvider.class, new PasswordProvider() {
158 @Override
159 public String getPassword(Metadata metadata) {
160 return "wrong";
161 }
162 });
163 handler = new BodyContentHandler();
164 stream = Seven7ParserTest.class.getResourceAsStream(
165 "/test-documents/test7Z_protected_passTika.7z");
166 try {
167 parser.parse(stream, handler, metadata, recursingContext);
168 fail("Shouldn't be able to read a password protected 7z with wrong password");
169 } catch (TikaException e) {
170 // if JCE is installed, the cause will be org.tukaani.xz.CorruptedInputException: Compressed data is corrupt
171 // if JCE is not installed, the message will include
172 // "(do you have the JCE Unlimited Strength Jurisdiction Policy Files installed?)"
173 ex = true;
174 } finally {
175 stream.close();
176 }
177 assertTrue("TikaException for bad password", ex);
178 // Will be empty
179 assertEquals("", handler.toString());
180
181 ex = false;
182 // Right password works fine if JCE Unlimited Strength has been installed!!!
183 if (isStrongCryptoAvailable()) {
184 recursingContext.set(PasswordProvider.class, new PasswordProvider() {
185 @Override
186 public String getPassword(Metadata metadata) {
187 return "Tika";
188 }
189 });
190 handler = new BodyContentHandler();
191 stream = Seven7ParserTest.class.getResourceAsStream(
192 "/test-documents/test7Z_protected_passTika.7z");
193 try {
194 parser.parse(stream, handler, metadata, recursingContext);
195 } finally {
196 stream.close();
197 }
198
199 assertEquals(TYPE_7ZIP.toString(), metadata.get(Metadata.CONTENT_TYPE));
200 String content = handler.toString();
201
202 // Should get filename
203 assertContains("text.txt", content);
204
205 // Should get contents from the text file in the 7z file
206 assertContains("TEST DATA FOR TIKA.", content);
207 assertContains("This is text inside an encrypted 7zip (7z) file.", content);
208 assertContains("It should be processed by Tika just fine!", content);
209 assertContains("TIKA-1521", content);
210 } else {
211 // if JCE is not installed, test for IOException wrapped in TikaException
212 boolean ioe = false;
213 recursingContext.set(PasswordProvider.class, new PasswordProvider() {
214 @Override
215 public String getPassword(Metadata metadata) {
216 return "Tika";
217 }
218 });
219 handler = new BodyContentHandler();
220 stream = Seven7ParserTest.class.getResourceAsStream(
221 "/test-documents/test7Z_protected_passTika.7z");
222 try {
223 parser.parse(stream, handler, metadata, recursingContext);
224 } catch (TikaException e) {
225 ioe = true;
226 } finally {
227 stream.close();
228 }
229 assertTrue("Expected TikaException (wrapping an IOException) because JCE was not installed", ioe);
230 }
231 }
232
233 private static boolean isStrongCryptoAvailable() throws NoSuchAlgorithmException {
234 return Cipher.getMaxAllowedKeyLength("AES/ECB/PKCS5Padding") >= 256;
116235 }
117236 }
1616 package org.apache.tika.parser.pkg;
1717
1818 import static org.junit.Assert.assertEquals;
19 import static org.junit.Assert.assertNotNull;
1920 import static org.junit.Assert.assertNull;
2021 import static org.junit.Assert.assertTrue;
2122
4950
5051 assertEquals("application/x-tar", metadata.get(Metadata.CONTENT_TYPE));
5152 String content = handler.toString();
52 assertTrue(content.contains("test-documents/testEXCEL.xls"));
53 assertTrue(content.contains("Sample Excel Worksheet"));
54 assertTrue(content.contains("test-documents/testHTML.html"));
55 assertTrue(content.contains("Test Indexation Html"));
56 assertTrue(content.contains("test-documents/testOpenOffice2.odt"));
57 assertTrue(content.contains("This is a sample Open Office document"));
58 assertTrue(content.contains("test-documents/testPDF.pdf"));
59 assertTrue(content.contains("Apache Tika"));
60 assertTrue(content.contains("test-documents/testPPT.ppt"));
61 assertTrue(content.contains("Sample Powerpoint Slide"));
62 assertTrue(content.contains("test-documents/testRTF.rtf"));
63 assertTrue(content.contains("indexation Word"));
64 assertTrue(content.contains("test-documents/testTXT.txt"));
65 assertTrue(content.contains("Test d'indexation de Txt"));
66 assertTrue(content.contains("test-documents/testWORD.doc"));
67 assertTrue(content.contains("This is a sample Microsoft Word Document"));
68 assertTrue(content.contains("test-documents/testXML.xml"));
69 assertTrue(content.contains("Rida Benjelloun"));
53 assertContains("test-documents/testEXCEL.xls", content);
54 assertContains("Sample Excel Worksheet", content);
55 assertContains("test-documents/testHTML.html", content);
56 assertContains("Test Indexation Html", content);
57 assertContains("test-documents/testOpenOffice2.odt", content);
58 assertContains("This is a sample Open Office document", content);
59 assertContains("test-documents/testPDF.pdf", content);
60 assertContains("Apache Tika", content);
61 assertContains("test-documents/testPPT.ppt", content);
62 assertContains("Sample Powerpoint Slide", content);
63 assertContains("test-documents/testRTF.rtf", content);
64 assertContains("indexation Word", content);
65 assertContains("test-documents/testTXT.txt", content);
66 assertContains("Test d'indexation de Txt", content);
67 assertContains("test-documents/testWORD.doc", content);
68 assertContains("This is a sample Microsoft Word Document", content);
69 assertContains("test-documents/testXML.xml", content);
70 assertContains("Rida Benjelloun", content);
7071 }
7172
7273 /**
9091 // Should have found all 9 documents, but not the directory
9192 assertEquals(9, tracker.filenames.size());
9293 assertEquals(9, tracker.mediatypes.size());
94 assertEquals(9, tracker.modifiedAts.size());
9395
9496 // Should have names but not content types, as tar doesn't
9597 // store the content types
106108 for(String type : tracker.mediatypes) {
107109 assertNull(type);
108110 }
111 for(String crt : tracker.createdAts) {
112 assertNull(crt);
113 }
114 for(String mod : tracker.modifiedAts) {
115 assertNotNull(mod);
116 assertTrue("Modified at " + mod, mod.startsWith("20"));
117 }
109118 }
110119 }
1616 package org.apache.tika.parser.pkg;
1717
1818 import static org.junit.Assert.assertEquals;
19 import static org.junit.Assert.assertNotNull;
1920 import static org.junit.Assert.assertNull;
2021 import static org.junit.Assert.assertTrue;
2122
5859
5960 assertEquals("application/zip", metadata.get(Metadata.CONTENT_TYPE));
6061 String content = handler.toString();
61 assertTrue(content.contains("testEXCEL.xls"));
62 assertTrue(content.contains("Sample Excel Worksheet"));
63 assertTrue(content.contains("testHTML.html"));
64 assertTrue(content.contains("Test Indexation Html"));
65 assertTrue(content.contains("testOpenOffice2.odt"));
66 assertTrue(content.contains("This is a sample Open Office document"));
67 assertTrue(content.contains("testPDF.pdf"));
68 assertTrue(content.contains("Apache Tika"));
69 assertTrue(content.contains("testPPT.ppt"));
70 assertTrue(content.contains("Sample Powerpoint Slide"));
71 assertTrue(content.contains("testRTF.rtf"));
72 assertTrue(content.contains("indexation Word"));
73 assertTrue(content.contains("testTXT.txt"));
74 assertTrue(content.contains("Test d'indexation de Txt"));
75 assertTrue(content.contains("testWORD.doc"));
76 assertTrue(content.contains("This is a sample Microsoft Word Document"));
77 assertTrue(content.contains("testXML.xml"));
78 assertTrue(content.contains("Rida Benjelloun"));
62 assertContains("testEXCEL.xls", content);
63 assertContains("Sample Excel Worksheet", content);
64 assertContains("testHTML.html", content);
65 assertContains("Test Indexation Html", content);
66 assertContains("testOpenOffice2.odt", content);
67 assertContains("This is a sample Open Office document", content);
68 assertContains("testPDF.pdf", content);
69 assertContains("Apache Tika", content);
70 assertContains("testPPT.ppt", content);
71 assertContains("Sample Powerpoint Slide", content);
72 assertContains("testRTF.rtf", content);
73 assertContains("indexation Word", content);
74 assertContains("testTXT.txt", content);
75 assertContains("Test d'indexation de Txt", content);
76 assertContains("testWORD.doc", content);
77 assertContains("This is a sample Microsoft Word Document", content);
78 assertContains("testXML.xml", content);
79 assertContains("Rida Benjelloun", content);
7980 }
8081
8182 /**
99100 // Should have found all 9 documents
100101 assertEquals(9, tracker.filenames.size());
101102 assertEquals(9, tracker.mediatypes.size());
103 assertEquals(9, tracker.modifiedAts.size());
102104
103 // Should have names but not content types, as zip doesn't
104 // store the content types
105 // Should have names and modified dates, but not content types,
106 // as zip doesn't store the content types
105107 assertEquals("testEXCEL.xls", tracker.filenames.get(0));
106108 assertEquals("testHTML.html", tracker.filenames.get(1));
107109 assertEquals("testOpenOffice2.odt", tracker.filenames.get(2));
115117 for(String type : tracker.mediatypes) {
116118 assertNull(type);
117119 }
120 for(String crt : tracker.createdAts) {
121 assertNull(crt);
122 }
123 for(String mod : tracker.modifiedAts) {
124 assertNotNull(mod);
125 assertTrue("Modified at " + mod, mod.startsWith("20"));
126 }
118127 }
119128
120129 /**
129138 String content = new Tika().parseToString(
130139 ZipParserTest.class.getResourceAsStream(
131140 "/test-documents/moby.zip"));
132 assertTrue(content.contains("README"));
141 assertContains("README", content);
133142 }
134143
135144 private class GatherRelIDsDocumentExtractor implements EmbeddedDocumentExtractor {
1717
1818 import static org.junit.Assert.assertEquals;
1919 import static org.junit.Assert.assertFalse;
20 import static org.junit.Assert.assertNotNull;
21 import static org.junit.Assert.assertNull;
2022 import static org.junit.Assert.assertTrue;
21 import static org.junit.Assert.assertNull;
22 import static org.junit.Assert.assertNotNull;
2323
2424 import java.io.File;
2525 import java.io.FileInputStream;
3030 import java.util.HashSet;
3131 import java.util.List;
3232 import java.util.Set;
33
3433
3534 import org.apache.tika.Tika;
3635 import org.apache.tika.TikaTest;
4746 import org.apache.tika.parser.AutoDetectParser;
4847 import org.apache.tika.parser.ParseContext;
4948 import org.apache.tika.parser.Parser;
49 import org.apache.tika.parser.RecursiveParserWrapper;
50 import org.apache.tika.sax.BasicContentHandlerFactory;
5051 import org.apache.tika.sax.BodyContentHandler;
5152 import org.apache.tika.sax.WriteOutContentHandler;
5253 import org.junit.Test;
123124 File file = getResourceAsFile("/test-documents/testRTFTableCellSeparation.rtf");
124125 String content = tika.parseToString(file);
125126 content = content.replaceAll("\\s+"," ");
126 assertTrue(content.contains("a b c d \u00E4 \u00EB \u00F6 \u00FC"));
127 assertContains("a b c d \u00E4 \u00EB \u00F6 \u00FC", content);
127128 assertContains("a b c d \u00E4 \u00EB \u00F6 \u00FC", content);
128129 }
129130
441442 trueNames.add("file_3.pdf");
442443 trueNames.add("file_4.ppt");
443444 trueNames.add("file_5.pptx");
444 trueNames.add("thumbnail_0.jpeg");
445 trueNames.add("thumbnail.jpeg");
445446 trueNames.add("file_6.doc");
446447 trueNames.add("file_7.doc");
447448 trueNames.add("file_8.docx");
515516 public void testRegularImages() throws Exception {
516517 Parser base = new AutoDetectParser();
517518 ParseContext ctx = new ParseContext();
518 RecursiveMetadataParser parser = new RecursiveMetadataParser(base, false);
519 RecursiveParserWrapper parser = new RecursiveParserWrapper(base,
520 new BasicContentHandlerFactory(BasicContentHandlerFactory.HANDLER_TYPE.IGNORE, -1));
519521 ctx.set(org.apache.tika.parser.Parser.class, parser);
520522 TikaInputStream tis = null;
521523 ContentHandler handler = new BodyContentHandler();
527529 } finally {
528530 tis.close();
529531 }
530 List<Metadata> metadatas = parser.getAllMetadata();
532 List<Metadata> metadatas = parser.getMetadata();
531533
532534 Metadata meta_jpg_exif = metadatas.get(0);//("testJPEG_EXIF_\u666E\u6797\u65AF\u987F.jpg");
533535 Metadata meta_jpg = metadatas.get(2);//("testJPEG_\u666E\u6797\u65AF\u987F.jpg");
544546 assertEquals(40, meta_jpg.names().length);
545547 assertEquals(105, meta_jpg.names().length);
546548 }
547
549
550 @Test
551 public void testMultipleNewlines() throws Exception {
552 String content = getXML("testRTFNewlines.rtf").xml;
553 content = content.replaceAll("[\r\n]+", " ");
554 assertContains("<body><p>one</p> " +
555 "<p /> " +
556 "<p>two</p> " +
557 "<p /> " +
558 "<p /> " +
559 "<p>three</p> " +
560 "<p /> " +
561 "<p /> " +
562 "<p /> " +
563 "<p>four</p>", content);
564 }
565
566
548567 //TIKA-1010 test linked embedded doc
549568 @Test
550569 public void testEmbeddedLinkedDocument() throws Exception {
0 /*
1 * Licensed under the Apache License, Version 2.0 (the "License");
2 * you may not use this file except in compliance with the License.
3 * You may obtain a copy of the License at
4 *
5 * http://www.apache.org/licenses/LICENSE-2.0
6 *
7 * Unless required by applicable law or agreed to in writing, software
8 * distributed under the License is distributed on an "AS IS" BASIS,
9 * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
10 * See the License for the specific language governing permissions and
11 * limitations under the License.
12 */
13 package org.apache.tika.parser.strings;
14
15 import static org.junit.Assert.*;
16
17 import org.junit.Test;
18
19 public class FileConfigTest {
20
21 @Test
22 public void testNoConfig() {
23 FileConfig config = new FileConfig();
24 assertEquals("Invalid default filePath value", "", config.getFilePath());
25 assertFalse("Invalid default mime option value", config.isMimetype());
26 }
27 }
0 /*
1 * Licensed under the Apache License, Version 2.0 (the "License");
2 * you may not use this file except in compliance with the License.
3 * You may obtain a copy of the License at
4 *
5 * http://www.apache.org/licenses/LICENSE-2.0
6 *
7 * Unless required by applicable law or agreed to in writing, software
8 * distributed under the License is distributed on an "AS IS" BASIS,
9 * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
10 * See the License for the specific language governing permissions and
11 * limitations under the License.
12 */
13 package org.apache.tika.parser.strings;
14
15 import static org.junit.Assert.assertTrue;
16
17 import java.io.ByteArrayInputStream;
18 import java.io.ByteArrayOutputStream;
19 import java.io.InputStream;
20
21 import org.apache.tika.metadata.Metadata;
22 import org.apache.tika.parser.ParseContext;
23 import org.apache.tika.parser.Parser;
24 import org.apache.tika.sax.BodyContentHandler;
25 import org.junit.Test;
26 import org.xml.sax.ContentHandler;
27
28 public class Latin1StringsParserTest {
29
30 @Test
31 public void testParse() throws Exception {
32
33 String testStr = "These are Latin1 accented scripts: \u00C2 \u00C3 \u00C9 \u00DC \u00E2 \u00E3 \u00E9 \u00FC";
34 String smallStr = "ab";
35
36 byte[] iso8859Bytes = testStr.getBytes("ISO-8859-1");
37 byte[] utf8Bytes = testStr.getBytes("UTF-8");
38 byte[] utf16Bytes = testStr.getBytes("UTF-16");
39 byte[] zeros = new byte[10];
40 byte[] smallString = smallStr.getBytes("ISO-8859-1");
41 byte[] trashBytes = { 0x00, 0x01, 0x02, 0x03, 0x1E, 0x1F, (byte) 0xFF };
42
43 ByteArrayOutputStream baos = new ByteArrayOutputStream();
44 baos.write(iso8859Bytes);
45 baos.write(zeros);
46 baos.write(utf8Bytes);
47 baos.write(trashBytes);
48 baos.write(utf16Bytes);
49 baos.write(zeros);
50 baos.write(smallString);
51
52 Parser parser = new Latin1StringsParser();
53 ContentHandler handler = new BodyContentHandler();
54
55 InputStream stream = new ByteArrayInputStream(baos.toByteArray());
56
57 try {
58 parser.parse(stream, handler, new Metadata(), new ParseContext());
59 } finally {
60 stream.close();
61 }
62
63 String result = handler.toString();
64 String expected = testStr + "\n" + testStr + "\n" + testStr + "\n";
65
66 // Test if result contains only the test string appended 3 times
67 assertTrue("Result should be the test string repeated 3 times", result.equals(expected));
68 }
69 }
0 /*
1 * Licensed under the Apache License, Version 2.0 (the "License");
2 * you may not use this file except in compliance with the License.
3 * You may obtain a copy of the License at
4 *
5 * http://www.apache.org/licenses/LICENSE-2.0
6 *
7 * Unless required by applicable law or agreed to in writing, software
8 * distributed under the License is distributed on an "AS IS" BASIS,
9 * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
10 * See the License for the specific language governing permissions and
11 * limitations under the License.
12 */
13 package org.apache.tika.parser.strings;
14
15 import static org.junit.Assert.*;
16
17 import java.io.File;
18 import java.io.InputStream;
19
20 import org.junit.Test;
21
22 public class StringsConfigTest {
23
24 @Test
25 public void testNoConfig() {
26 StringsConfig config = new StringsConfig();
27 assertEquals("Invalid default stringsPath value", "", config.getStringsPath());
28 assertEquals("Invalid default encoding value", StringsEncoding.SINGLE_7_BIT, config.getEncoding());
29 assertEquals("Invalid default min-len value", 4, config.getMinLength());
30 assertEquals("Invalid default timeout value", 120, config.getTimeout());
31 }
32
33 @Test
34 public void testPartialConfig() {
35 InputStream stream = StringsConfigTest.class.getResourceAsStream("/test-properties/StringsConfig-partial.properties");
36
37 StringsConfig config = new StringsConfig(stream);
38 assertEquals("Invalid default stringsPath value", "", config.getStringsPath());
39 assertEquals("Invalid overridden encoding value", StringsEncoding.BIGENDIAN_16_BIT, config.getEncoding());
40 assertEquals("Invalid default min-len value", 4, config.getMinLength());
41 assertEquals("Invalid overridden timeout value", 60, config.getTimeout());
42 }
43
44 @Test
45 public void testFullConfig() {
46 InputStream stream = StringsConfigTest.class.getResourceAsStream("/test-properties/StringsConfig-full.properties");
47
48 StringsConfig config = new StringsConfig(stream);
49 assertEquals("Invalid overridden stringsPath value", "/opt/strings" + File.separator, config.getStringsPath());
50 assertEquals("Invalid overridden encoding value", StringsEncoding.BIGENDIAN_16_BIT, config.getEncoding());
51 assertEquals("Invalid overridden min-len value", 3, config.getMinLength());
52 assertEquals("Invalid overridden timeout value", 60, config.getTimeout());
53 }
54
55 @Test(expected=IllegalArgumentException.class)
56 public void testValidateMinLength() {
57 StringsConfig config = new StringsConfig();
58 config.setMinLength(0);
59 }
60 }
0 /*
1 * Licensed under the Apache License, Version 2.0 (the "License");
2 * you may not use this file except in compliance with the License.
3 * You may obtain a copy of the License at
4 *
5 * http://www.apache.org/licenses/LICENSE-2.0
6 *
7 * Unless required by applicable law or agreed to in writing, software
8 * distributed under the License is distributed on an "AS IS" BASIS,
9 * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
10 * See the License for the specific language governing permissions and
11 * limitations under the License.
12 */
13 package org.apache.tika.parser.strings;
14
15 import static org.apache.tika.parser.strings.StringsParser.getStringsProg;
16 import static org.junit.Assert.*;
17 import static org.junit.Assume.assumeTrue;
18
19 import java.io.InputStream;
20 import java.util.Arrays;
21
22 import org.apache.tika.metadata.Metadata;
23 import org.apache.tika.parser.ParseContext;
24 import org.apache.tika.parser.Parser;
25 import org.apache.tika.parser.external.ExternalParser;
26 import org.apache.tika.sax.BodyContentHandler;
27 import org.junit.Test;
28 import org.xml.sax.ContentHandler;
29
30 public class StringsParserTest {
31 public static boolean canRun() {
32 StringsConfig config = new StringsConfig();
33 String[] checkCmd = {config.getStringsPath() + getStringsProg(), "--version"};
34 boolean hasStrings = ExternalParser.check(checkCmd);
35 return hasStrings;
36 }
37
38 @Test
39 public void testParse() throws Exception {
40 assumeTrue(canRun());
41
42 String resource = "/test-documents/testOCTET_header.dbase3";
43
44 String[] content = { "CLASSNO", "TITLE", "ITEMNO", "LISTNO", "LISTDATE" };
45
46 String[] met_attributes = {"min-len", "encoding", "strings:file_output"};
47
48 StringsConfig stringsConfig = new StringsConfig();
49 FileConfig fileConfig = new FileConfig();
50
51 Parser parser = new StringsParser();
52 ContentHandler handler = new BodyContentHandler();
53 Metadata metadata = new Metadata();
54
55 ParseContext context = new ParseContext();
56 context.set(StringsConfig.class, stringsConfig);
57 context.set(FileConfig.class, fileConfig);
58
59 InputStream stream = StringsParserTest.class.getResourceAsStream(resource);
60
61 try {
62 parser.parse(stream, handler, metadata, context);
63 } catch (Exception e) {
64 e.printStackTrace();
65 } finally {
66 stream.close();
67 }
68
69 // Content
70 for (String word : content) {
71 assertTrue(handler.toString().contains(word));
72 }
73
74 // Metadata
75 assertTrue(Arrays.equals(met_attributes, metadata.names()));
76 }
77 }
1515 */
1616 package org.apache.tika.parser.txt;
1717
18 import static org.apache.tika.TikaTest.assertContains;
1819 import static org.junit.Assert.assertEquals;
1920 import static org.junit.Assert.assertNull;
20 import static org.junit.Assert.assertTrue;
2121
2222 import java.io.ByteArrayInputStream;
2323 import java.io.StringWriter;
2424
25 import org.apache.tika.io.IOUtils;
2526 import org.apache.tika.metadata.Metadata;
2627 import org.apache.tika.metadata.TikaCoreProperties;
2728 import org.apache.tika.parser.ParseContext;
5859 assertNull(metadata.get(Metadata.CONTENT_LANGUAGE));
5960 assertNull(metadata.get(TikaCoreProperties.LANGUAGE));
6061
61 assertTrue(content.contains("Hello"));
62 assertTrue(content.contains("World"));
63 assertTrue(content.contains("autodetection"));
64 assertTrue(content.contains("stream"));
62 assertContains("Hello", content);
63 assertContains("World", content);
64 assertContains("autodetection", content);
65 assertContains("stream", content);
6566 }
6667
6768 @Test
7172 ContentHandler handler = new BodyContentHandler();
7273 Metadata metadata = new Metadata();
7374 parser.parse(
74 new ByteArrayInputStream(text.getBytes("UTF-8")),
75 new ByteArrayInputStream(text.getBytes(IOUtils.UTF_8)),
7576 handler, metadata, new ParseContext());
7677 assertEquals("text/plain; charset=UTF-8", metadata.get(Metadata.CONTENT_TYPE));
7778 assertEquals("UTF-8", metadata.get(Metadata.CONTENT_ENCODING)); // deprecated
7879
79 assertTrue(handler.toString().contains(text));
80 assertContains(text, handler.toString());
8081 }
8182
8283 @Test
222223 metadata.set(TikaCoreProperties.LANGUAGE, "en");
223224
224225 parser.parse(
225 new ByteArrayInputStream(test.getBytes("UTF-8")),
226 new ByteArrayInputStream(test.getBytes(IOUtils.UTF_8)),
226227 new BodyContentHandler(), metadata, new ParseContext());
227228
228229 assertEquals("en", metadata.get(TikaCoreProperties.LANGUAGE));
276277
277278 Metadata metadata = new Metadata();
278279 parser.parse(
279 new ByteArrayInputStream(text.getBytes("UTF-8")),
280 new ByteArrayInputStream(text.getBytes(IOUtils.UTF_8)),
280281 new BodyContentHandler(), metadata, new ParseContext());
281282 assertEquals("text/plain; charset=ISO-8859-1", metadata.get(Metadata.CONTENT_TYPE));
282283
284285 // we get back (see TIKA-868)
285286 metadata.set(Metadata.CONTENT_TYPE, "application/binary; charset=UTF-8");
286287 parser.parse(
287 new ByteArrayInputStream(text.getBytes("UTF-8")),
288 new ByteArrayInputStream(text.getBytes(IOUtils.UTF_8)),
288289 new BodyContentHandler(), metadata, new ParseContext());
289290 assertEquals("text/plain; charset=UTF-8", metadata.get(Metadata.CONTENT_TYPE));
290291 }
7575 assertTrue(metadata.get(TikaCoreProperties.RIGHTS).contains("testing chars"));
7676
7777 String content = handler.toString();
78 assertTrue(content.contains("Tika test document"));
78 assertContains("Tika test document", content);
7979
8080 assertEquals("2000-12-01T00:00:00.000Z", metadata.get(TikaCoreProperties.CREATED));
8181 } finally {
1515 */
1616 package org.apache.tika.parser.xml;
1717
18 import static org.apache.tika.TikaTest.assertContains;
1819 import static org.junit.Assert.assertEquals;
19 import static org.junit.Assert.assertTrue;
20
21 import java.io.InputStream;
2022
2123 import org.apache.tika.TikaTest.TrackingHandler;
2224 import org.apache.tika.extractor.ContainerExtractor;
2325 import org.apache.tika.extractor.ParserContainerExtractor;
2426 import org.apache.tika.io.TikaInputStream;
2527 import org.apache.tika.metadata.Metadata;
28 import org.apache.tika.parser.ParseContext;
2629 import org.apache.tika.sax.BodyContentHandler;
2730 import org.junit.Test;
2831 import org.xml.sax.ContentHandler;
29
30 import java.io.InputStream;
3132
3233 public class FictionBookParserTest {
3334
3738 try {
3839 Metadata metadata = new Metadata();
3940 ContentHandler handler = new BodyContentHandler();
40 new FictionBookParser().parse(input, handler, metadata);
41 new FictionBookParser().parse(input, handler, metadata, new ParseContext());
4142 String content = handler.toString();
4243
43 assertTrue(content.contains("1812"));
44 assertContains("1812", content);
4445 } finally {
4546 input.close();
4647 }
0 /*
1 * Licensed to the Apache Software Foundation (ASF) under one or more
2 * contributor license agreements. See the NOTICE file distributed with
3 * this work for additional information regarding copyright ownership.
4 * The ASF licenses this file to You under the Apache License, Version 2.0
5 * (the "License"); you may not use this file except in compliance with
6 * the License. You may obtain a copy of the License at
7 *
8 * http://www.apache.org/licenses/LICENSE-2.0
9 *
10 * Unless required by applicable law or agreed to in writing, software
11 * distributed under the License is distributed on an "AS IS" BASIS,
12 * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
13 * See the License for the specific language governing permissions and
14 * limitations under the License.
15 */
16
17 package org.apache.tika.sax;
18
19 import org.apache.tika.metadata.Metadata;
20 import org.apache.tika.parser.AutoDetectParser;
21 import org.apache.tika.parser.ParseContext;
22 import org.apache.tika.parser.Parser;
23 import org.junit.Test;
24
25 import java.io.InputStream;
26
27 import static org.apache.tika.TikaTest.assertContains;
28
29 /**
30 * Test class for the {@link org.apache.tika.sax.PhoneExtractingContentHandler}
31 * class. This demonstrates how to parse a document and retrieve any phone numbers
32 * found within.
33 *
34 * The phone numbers are added to a multivalued Metadata object under the key "phonenumbers".
35 * You can get an array of phone numbers by calling metadata.getValues("phonenumbers").
36 */
37 public class PhoneExtractingContentHandlerTest {
38 @Test
39 public void testExtractPhoneNumbers() throws Exception {
40 Parser parser = new AutoDetectParser();
41 Metadata metadata = new Metadata();
42 // The PhoneExtractingContentHandler will examine any characters for phone numbers before passing them
43 // to the underlying Handler.
44 PhoneExtractingContentHandler handler = new PhoneExtractingContentHandler(new BodyContentHandler(), metadata);
45 InputStream stream = PhoneExtractingContentHandlerTest.class.getResourceAsStream("/test-documents/testPhoneNumberExtractor.odt");
46 try {
47 parser.parse(stream, handler, metadata, new ParseContext());
48 }
49 finally {
50 stream.close();
51 }
52 String[] phoneNumbers = metadata.getValues("phonenumbers");
53 assertContains("9498888888", phoneNumbers[0]);
54 assertContains("9497777777", phoneNumbers[1]);
55 assertContains("9496666666", phoneNumbers[2]);
56 assertContains("9495555555", phoneNumbers[3]);
57 assertContains("4193404645", phoneNumbers[4]);
58 assertContains("9044687081", phoneNumbers[5]);
59 assertContains("2604094811", phoneNumbers[6]);
60 }
61 }
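The handler under test buffers the character stream and scans it for phone numbers before recording them in the Metadata object. A standalone sketch of that character-buffering idea, using only the JDK's SAX API (the class name, the naive 10-digit regex, and the sample XML are illustrative assumptions, not Tika's actual matching logic):

```java
import java.io.StringReader;
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;
import javax.xml.parsers.SAXParserFactory;
import org.xml.sax.InputSource;
import org.xml.sax.helpers.DefaultHandler;

public class PhoneSniffer extends DefaultHandler {
    // naive 10-digit matcher; Tika's real extractor is more sophisticated
    private static final Pattern PHONE = Pattern.compile("\\b(\\d{10})\\b");
    private final StringBuilder buf = new StringBuilder();
    final List<String> phones = new ArrayList<>();

    @Override
    public void characters(char[] ch, int start, int length) {
        // accumulate all character events, then match once at end of document
        buf.append(ch, start, length);
    }

    @Override
    public void endDocument() {
        Matcher m = PHONE.matcher(buf);
        while (m.find()) {
            phones.add(m.group(1));
        }
    }

    public static void main(String[] args) throws Exception {
        String xml = "<doc>Call 9498888888 or 9497777777 today.</doc>";
        PhoneSniffer sniffer = new PhoneSniffer();
        SAXParserFactory.newInstance().newSAXParser()
                .parse(new InputSource(new StringReader(xml)), sniffer);
        System.out.println(sniffer.phones);
        // prints [9498888888, 9497777777]
    }
}
```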
2323 <parent>
2424 <groupId>org.apache.tika</groupId>
2525 <artifactId>tika-parent</artifactId>
26 <version>1.6</version>
26 <version>1.8</version>
2727 <relativePath>../tika-parent/pom.xml</relativePath>
2828 </parent>
2929
3232 <url>http://tika.apache.org</url>
3333
3434 <dependencies>
35 <!-- Optional OSGi dependency, used only when running within OSGi -->
35 <!-- Optional OSGi dependency, used only when running within OSGi -->
3636
3737 <dependency>
3838 <groupId>org.osgi</groupId>
5151 <dependency>
5252 <groupId>com.google.code.gson</groupId>
5353 <artifactId>gson</artifactId>
54 <version>1.7.1</version>
54 <version>2.2.4</version>
5555 </dependency>
5656
5757 <!-- Test dependencies -->
5858 <dependency>
5959 <groupId>junit</groupId>
6060 <artifactId>junit</artifactId>
61 <scope>test</scope>
6261 </dependency>
6362
6463 </dependencies>
8483 <url>http://www.apache.org</url>
8584 </organization>
8685 <scm>
87 <url>http://svn.apache.org/viewvc/tika/tags/1.6/tika-app</url>
88 <connection>scm:svn:http://svn.apache.org/repos/asf/tika/tags/1.6/tika-serialization</connection>
89 <developerConnection>scm:svn:https://svn.apache.org/repos/asf/tika/tags/1.6/tika-serialization</developerConnection>
86 <url>http://svn.apache.org/viewvc/tika/tags/1.8-rc2/tika-app</url>
87 <connection>scm:svn:http://svn.apache.org/repos/asf/tika/tags/1.8-rc2/tika-serialization</connection>
88 <developerConnection>scm:svn:https://svn.apache.org/repos/asf/tika/tags/1.8-rc2/tika-serialization</developerConnection>
9089 </scm>
9190 <issueManagement>
9291 <system>JIRA</system>
2020 import java.io.Reader;
2121 import java.io.Writer;
2222
23 import com.google.gson.Gson;
24 import com.google.gson.JsonIOException;
2325 import org.apache.tika.exception.TikaException;
2426 import org.apache.tika.metadata.Metadata;
2527
26 import com.google.gson.Gson;
27 import com.google.gson.GsonBuilder;
28 import com.google.gson.JsonIOException;
28 public class JsonMetadata extends JsonMetadataBase {
29 private static Gson GSON;
2930
30 public class JsonMetadata {
31
32 private static Gson GSON;
33
3431 static {
35 GsonBuilder builder = new GsonBuilder();
36 builder.registerTypeHierarchyAdapter(Metadata.class, new JsonMetadataSerializer());
37 builder.registerTypeHierarchyAdapter(Metadata.class, new JsonMetadataDeserializer());
38 GSON = builder.create();
32 GSON = defaultInit();
3933 }
40
41
4234 /**
4335 * Serializes a Metadata object to Json. This does not flush or close the writer.
4436 *
7163 }
7264 return m;
7365 }
74
66
7567 /**
7668 * Enables setting custom configurations on Gson. Remember to register
7769 * a serializer and a deserializer for Metadata. This does a literal set
7870 * and does not add the default serializer and deserializers.
79 *
71 *
8072 * @param gson
8173 */
8274 public static void setGson(Gson gson) {
8375 GSON = gson;
8476 }
77
78 public static void setPrettyPrinting(boolean prettyPrint) {
79 if (prettyPrint) {
80 GSON = prettyInit();
81 } else {
82 GSON = defaultInit();
83 }
84 }
85
8586 }
0 package org.apache.tika.metadata.serialization;
1
2 /*
3 * Licensed to the Apache Software Foundation (ASF) under one or more
4 * contributor license agreements. See the NOTICE file distributed with
5 * this work for additional information regarding copyright ownership.
6 * The ASF licenses this file to You under the Apache License, Version 2.0
7 * (the "License"); you may not use this file except in compliance with
8 * the License. You may obtain a copy of the License at
9 *
10 * http://www.apache.org/licenses/LICENSE-2.0
11 *
12 * Unless required by applicable law or agreed to in writing, software
13 * distributed under the License is distributed on an "AS IS" BASIS,
14 * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
15 * See the License for the specific language governing permissions and
16 * limitations under the License.
17 */
18
19 import java.util.Arrays;
20
21 import com.google.gson.Gson;
22 import com.google.gson.GsonBuilder;
23 import org.apache.tika.metadata.Metadata;
24
25 public class JsonMetadataBase {
26
27
28 static Gson defaultInit() {
29 GsonBuilder builder = new GsonBuilder();
30 builder.registerTypeHierarchyAdapter(Metadata.class, new JsonMetadataSerializer());
31 builder.registerTypeHierarchyAdapter(Metadata.class, new JsonMetadataDeserializer());
32 return builder.create();
33 }
34
35 static Gson prettyInit() {
36 GsonBuilder builder = new GsonBuilder();
37 builder.registerTypeHierarchyAdapter(Metadata.class, new SortedJsonMetadataSerializer());
38 builder.registerTypeHierarchyAdapter(Metadata.class, new JsonMetadataDeserializer());
39 builder.setPrettyPrinting();
40 return builder.create();
41 }
42
43 private static class SortedJsonMetadataSerializer extends JsonMetadataSerializer {
44 @Override
45 public String[] getNames(Metadata m) {
46 String[] names = m.names();
47 Arrays.sort(names, new PrettyMetadataKeyComparator());
48 return names;
49 }
50 }
51 }
0 package org.apache.tika.metadata.serialization;
1
2 /*
3 * Licensed to the Apache Software Foundation (ASF) under one or more
4 * contributor license agreements. See the NOTICE file distributed with
5 * this work for additional information regarding copyright ownership.
6 * The ASF licenses this file to You under the Apache License, Version 2.0
7 * (the "License"); you may not use this file except in compliance with
8 * the License. You may obtain a copy of the License at
9 *
10 * http://www.apache.org/licenses/LICENSE-2.0
11 *
12 * Unless required by applicable law or agreed to in writing, software
13 * distributed under the License is distributed on an "AS IS" BASIS,
14 * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
15 * See the License for the specific language governing permissions and
16 * limitations under the License.
17 */
18
19
20 import java.io.Reader;
21 import java.io.Writer;
22 import java.lang.reflect.Type;
23 import java.util.List;
24
25 import com.google.gson.Gson;
26 import com.google.gson.JsonIOException;
27 import com.google.gson.reflect.TypeToken;
28 import org.apache.tika.exception.TikaException;
29 import org.apache.tika.metadata.Metadata;
30
31 public class JsonMetadataList extends JsonMetadataBase {
32
33 private static final Type listType = new TypeToken<List<Metadata>>(){}.getType();
34 private static Gson GSON;
35 static {
36 GSON = defaultInit();
37 }
38
39 /**
40 * Serializes a Metadata object to Json. This does not flush or close the writer.
41 *
42 * @param metadataList list of metadata to write
43 * @param writer writer
44 * @throws org.apache.tika.exception.TikaException if there is an IOException during writing
45 */
46 public static void toJson(List<Metadata> metadataList, Writer writer) throws TikaException {
47 try {
48 GSON.toJson(metadataList, writer);
49 } catch (JsonIOException e) {
50 throw new TikaException(e.getMessage());
51 }
52 }
53
54 /**
55 * Read metadata from reader.
56 *
57 * @param reader
58 * @return Metadata or null if nothing could be read from the reader
59 * @throws org.apache.tika.exception.TikaException in case of parse failure by Gson or IO failure with Reader
60 */
61 public static List<Metadata> fromJson(Reader reader) throws TikaException {
62 List<Metadata> ms = null;
63 if (reader == null) {
64 return ms;
65 }
66 try {
67 ms = GSON.fromJson(reader, listType);
68 } catch (com.google.gson.JsonParseException e){
69 //covers both io and parse exceptions
70 throw new TikaException(e.getMessage());
71 }
72 return ms;
73 }
74
75 /**
76 * Enables setting custom configurations on Gson. Remember to register
77 * a serializer and a deserializer for Metadata. This does a literal set
78 * and does not add the default serializer and deserializers.
79 *
80 * @param gson
81 */
82 public static void setGson(Gson gson) {
83 GSON = gson;
84 }
85
86 public static void setPrettyPrinting(boolean prettyPrint) {
87 if (prettyPrint) {
88 GSON = prettyInit();
89 } else {
90 GSON = defaultInit();
91 }
92 }
93
94
95 }
0 package org.apache.tika.metadata.serialization;
1
2 /*
3 * Licensed to the Apache Software Foundation (ASF) under one or more
4 * contributor license agreements. See the NOTICE file distributed with
5 * this work for additional information regarding copyright ownership.
6 * The ASF licenses this file to You under the Apache License, Version 2.0
7 * (the "License"); you may not use this file except in compliance with
8 * the License. You may obtain a copy of the License at
9 *
10 * http://www.apache.org/licenses/LICENSE-2.0
11 *
12 * Unless required by applicable law or agreed to in writing, software
13 * distributed under the License is distributed on an "AS IS" BASIS,
14 * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
15 * See the License for the specific language governing permissions and
16 * limitations under the License.
17 */
18
19 public class PrettyMetadataKeyComparator implements java.util.Comparator<String> {
20 @Override
21 public int compare(String s1, String s2) {
22 if (s1 == null) {
23 return (s2 == null) ? 0 : 1;
24 } else if (s2 == null) {
25 return -1;
26 }
27
28 //this is stinky. This should reference RecursiveParserWrapper.TIKA_CONTENT
29 //but that would require making core a dependency of serialization...
30 //do we want to do that?
31 if (s1.equals("tika:content")) {
32 if (s2.equals("tika:content")) {
33 return 0;
34 }
35 return 2;
36 } else if (s2.equals("tika:content")) {
37 return -2;
38 }
39 //do we want to lowercase?
40 return s1.compareTo(s2);
41 }
42 }
43
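The comparator above orders metadata keys alphabetically while forcing `tika:content` to sort last, so the (often large) extracted text appears at the end of pretty-printed JSON. A minimal self-contained sketch of the same ordering rule (class and field names are illustrative, not Tika's):

```java
import java.util.Arrays;
import java.util.Comparator;

public class KeyOrder {
    // alphabetical ordering, except "tika:content" always sorts last
    static final Comparator<String> PRETTY = (s1, s2) -> {
        if (s1 == null) return (s2 == null) ? 0 : 1;
        if (s2 == null) return -1;
        boolean c1 = s1.equals("tika:content");
        boolean c2 = s2.equals("tika:content");
        if (c1 && c2) return 0;
        if (c1) return 1;   // push content to the end
        if (c2) return -1;
        return s1.compareTo(s2);
    };

    public static void main(String[] args) {
        String[] keys = {"zk1", "tika:content", "k3", "zk2"};
        Arrays.sort(keys, PRETTY);
        System.out.println(String.join(",", keys));
        // prints k3,zk1,zk2,tika:content
    }
}
```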
0 package org.apache.tika.metadata.serialization;
1
2 /*
3 * Licensed to the Apache Software Foundation (ASF) under one or more
4 * contributor license agreements. See the NOTICE file distributed with
5 * this work for additional information regarding copyright ownership.
6 * The ASF licenses this file to You under the Apache License, Version 2.0
7 * (the "License"); you may not use this file except in compliance with
8 * the License. You may obtain a copy of the License at
9 *
10 * http://www.apache.org/licenses/LICENSE-2.0
11 *
12 * Unless required by applicable law or agreed to in writing, software
13 * distributed under the License is distributed on an "AS IS" BASIS,
14 * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
15 * See the License for the specific language governing permissions and
16 * limitations under the License.
17 */
18
19 import java.io.StringReader;
20 import java.io.StringWriter;
21 import java.util.LinkedList;
22 import java.util.List;
23
24 import org.apache.tika.metadata.Metadata;
25 import org.junit.Test;
26
27 import static org.junit.Assert.assertTrue;
28 import static org.junit.Assert.assertNull;
29 import static org.junit.Assert.assertEquals;
30
31 public class JsonMetadataListTest {
32
33
34 @Test
35 public void testListBasic() throws Exception {
36 Metadata m1 = new Metadata();
37 m1.add("k1", "v1");
38 m1.add("k1", "v2");
39 m1.add("k1", "v3");
40 m1.add("k1", "v4");
41 m1.add("k1", "v4");
42 m1.add("k2", "v1");
43
44 Metadata m2 = new Metadata();
45 m2.add("k3", "v1");
46 m2.add("k3", "v2");
47 m2.add("k3", "v3");
48 m2.add("k3", "v4");
49 m2.add("k3", "v4");
50 m2.add("k4", "v1");
51
52 List<Metadata> metadataList = new LinkedList<Metadata>();
53 metadataList.add(m1);
54 metadataList.add(m2);
55 StringWriter writer = new StringWriter();
56 JsonMetadataList.toJson(metadataList, writer);
57 List<Metadata> deserialized = JsonMetadataList.fromJson(new StringReader(writer.toString()));
58 assertEquals(metadataList, deserialized);
59 }
60
61 @Test
62 public void testListNull() throws Exception {
63 StringWriter writer = new StringWriter();
64 JsonMetadataList.toJson(null, writer);
65 assertEquals("null", writer.toString().trim());
66
67 List<Metadata> m = JsonMetadataList.fromJson(null);
68 assertNull(m);
69 }
70
71 @Test
72 public void testListCorrupted() throws Exception {
73 String json = "[{\"k1\":[\"v1\",\"v2\",\"v3\",\"v4\",\"v4\"],\"k2\":\"v1\"}," +
74 "\"k3\":[\"v1\",\"v2\",\"v3\",\"v4\",\"v4\"],\"k4\":\"v1\"}]";
75 boolean ex = false;
76 try {
77 JsonMetadataList.fromJson(new StringReader(json));
78 } catch (org.apache.tika.exception.TikaException e) {
79 ex = true;
80 }
81 assertTrue(ex);
77 }
78
79 @Test
80 public void testPrettyPrint() throws Exception {
81 Metadata m1 = new Metadata();
82 m1.add("tika:content", "this is the content");
83 m1.add("zk1", "v1");
84 m1.add("zk1", "v2");
85 m1.add("zk1", "v3");
86 m1.add("zk1", "v4");
87 m1.add("zk1", "v4");
88 m1.add("zk2", "v1");
89
90 Metadata m2 = new Metadata();
91 m2.add("k3", "v1");
92 m2.add("k3", "v2");
93 m2.add("k3", "v3");
94 m2.add("k3", "v4");
95 m2.add("k3", "v4");
96 m2.add("k4", "v1");
97
98 List<Metadata> metadataList = new LinkedList<Metadata>();
99 metadataList.add(m1);
100 metadataList.add(m2);
101 StringWriter writer = new StringWriter();
102 JsonMetadataList.toJson(metadataList, writer);
103 assertTrue(writer.toString().startsWith("[{\"tika:content\":\"this is the content\",\"zk1\":[\"v1\",\"v2\","));
104 writer = new StringWriter();
105 JsonMetadataList.setPrettyPrinting(true);
106 JsonMetadataList.toJson(metadataList, writer);
107 assertTrue(writer.toString().startsWith("[\n" +
108 " {\n" +
109 " \"zk1\": [\n" +
110 " \"v1\",\n" +
111 " \"v2\","));
112 assertTrue(writer.toString().contains(" \"zk2\": \"v1\",\n" +
113 " \"tika:content\": \"this is the content\"\n" +
114 " },"));
115
116 //now set it back to false
117 JsonMetadataList.setPrettyPrinting(false);
118 writer = new StringWriter();
119 JsonMetadataList.toJson(metadataList, writer);
120 assertTrue(writer.toString().startsWith("[{\"tika:content\":\"this is the content\",\"zk1\":[\"v1\",\"v2\","));
121 }
122 }
1616 * limitations under the License.
1717 */
1818
19 import static org.junit.Assert.*;
20
2119 import java.io.StringReader;
2220 import java.io.StringWriter;
2321
2422 import org.apache.tika.exception.TikaException;
2523 import org.apache.tika.metadata.Metadata;
2624 import org.junit.Test;
25
26 import static org.junit.Assert.assertEquals;
27 import static org.junit.Assert.assertFalse;
28 import static org.junit.Assert.assertTrue;
2729
2830 public class JsonMetadataTest {
2931
5456
5557 //test that this really is 6 Chinese characters
5658 assertEquals(6, deserialized.get("alma_mater").length());
59
60 //now test pretty print;
61 writer = new StringWriter();
62 JsonMetadata.setPrettyPrinting(true);
63 JsonMetadata.toJson(metadata, writer);
64 assertTrue(writer.toString().contains(
65 " \"json_escapes\": \"the: \\\"quick\\\" brown, fox\",\n" +
66 " \"k1\": [\n" +
67 " \"v1\",\n" +
68 " \"v2\"\n" +
69 " ],\n" +
70 " \"k3\": [\n" +
71 " \"v3\",\n" +
72 " \"v3\"\n" +
73 " ],\n" +
74 " \"k4\": \"500,000\",\n" +
75 " \"url\": \"/myApp/myAction.html?method\\u003drouter\\u0026cmd\\u003d1\"\n" +
76 "}"));
5777 }
5878
5979 @Test
7999 ex = true;
80100 }
81101 assertFalse(ex);
82 assertEquals("", writer.toString());
102 assertEquals("null", writer.toString());
83103 }
84104
85105 @Test
108128 Metadata deserialized = JsonMetadata.fromJson(new StringReader(writer.toString()));
109129 assertEquals(m, deserialized);
110130 }
111
112131 }
0 # Licensed to the Apache Software Foundation (ASF) under one
1 # or more contributor license agreements. See the NOTICE file
2 # distributed with this work for additional information
3 # regarding copyright ownership. The ASF licenses this file
4 # to you under the Apache License, Version 2.0 (the
5 # "License"); you may not use this file except in compliance
6 # with the License. You may obtain a copy of the License at
7 # http://www.apache.org/licenses/LICENSE-2.0
8 # Unless required by applicable law or agreed to in writing,
9 # software distributed under the License is distributed on an
10 # "AS IS" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY
11 # KIND, either express or implied. See the License for the
12 # specific language governing permissions and limitations
13 # under the License.
14
15 FROM ubuntu:latest
16 MAINTAINER Apache Tika Team
17
18 ENV TIKA_VERSION 1.7
19 ENV TIKA_SERVER_URL https://www.apache.org/dist/tika/tika-server-$TIKA_VERSION.jar
20
21 RUN apt-get update \
22 && apt-get install openjdk-7-jre-headless curl gdal-bin tesseract-ocr \
23 tesseract-ocr-eng tesseract-ocr-ita tesseract-ocr-fra tesseract-ocr-spa tesseract-ocr-deu -y \
24 && curl -sSL https://people.apache.org/keys/group/tika.asc -o /tmp/tika.asc \
25 && gpg --import /tmp/tika.asc \
26 && curl -sSL "$TIKA_SERVER_URL.asc" -o /tmp/tika-server-${TIKA_VERSION}.jar.asc \
27 && NEAREST_TIKA_SERVER_URL=$(curl -sSL http://www.apache.org/dyn/closer.cgi/${TIKA_SERVER_URL#https://www.apache.org/dist/}\?asjson\=1 \
28 | awk '/"path_info": / { pi=$2; }; /"preferred":/ { pref=$2; }; END { print pref " " pi; };' \
29 | sed -r -e 's/^"//; s/",$//; s/" "//') \
30 && echo "Nearest mirror: $NEAREST_TIKA_SERVER_URL" \
31 && curl -sSL "$NEAREST_TIKA_SERVER_URL" -o /tika-server-${TIKA_VERSION}.jar \
32 && gpg --verify /tmp/tika-server-${TIKA_VERSION}.jar.asc /tika-server-${TIKA_VERSION}.jar \
33 && apt-get clean -y && rm -rf /var/lib/apt/lists/* /tmp/* /var/tmp/*
34
35 EXPOSE 9998
36 ENTRYPOINT java -jar /tika-server-${TIKA_VERSION}.jar -h 0.0.0.0
+0
-40
tika-server/README less more
0 This is JAX-RS Tika server for Tika
1 (https://issues.apache.org/jira/browse/TIKA-593)
2
3 Running
4 -------
5 java -jar target/tikaserver-1.0-SNAPSHOT.jar
6
7 Usage
8 -----
9 Usage examples from command line with curl utility:
10
11 1) Extract plain text:
12
13 curl -T price.xls http://localhost:9998/tika
14
15 2) Extract text with mime-type hint:
16
17 curl -v -H "Content-type: application/vnd.openxmlformats-officedocument.wordprocessingml.document" -T document.docx http://localhost:9998/tika
18
19 3) Get all document attachments as ZIP-file:
20
21 curl -v -T Doc1_ole.doc http://localhost:9998/unpacker > /var/tmp/x.zip
22
23 4) Extract metadata to CSV format:
24
25 curl -T price.xls http://localhost:9998/meta
26
27 5) Detect media type from CSV format using file extension hint
28
29 curl -X PUT -H "Content-Disposition: attachment; filename=foo.csv" --upload-file foo.csv http://localhost:9998/detect/stream
30
31
32 HTTP Codes
33 ----------
34 200 - Ok
35 204 - No content (for example when we are unpacking file without attachments)
36 415 - Unknown file type
37 422 - Unparsable document of known type (password protected documents and unsupported versions like Biff5 Excel)
38 500 - Internal error
39
0 # Apache Tika JAX-RS Server
1
2 https://issues.apache.org/jira/browse/TIKA-593
3
4 Running
5 -------
6 ```
7 $ java -jar tika-server/target/tika-server.jar --help
8 usage: tikaserver
9 -?,--help this help message
10 -h,--host <arg> host name (default = localhost)
11 -l,--log <arg> request URI log level ('debug' or 'info')
12 -p,--port <arg> listen port (default = 9998)
13 -s,--includeStack whether or not to return a stack trace
14 if there is an exception during 'parse'
15 ```
16
17 Usage
18 -----
19 Usage examples from command line with `curl` utility:
20
21 * Extract plain text:
22 `curl -T price.xls http://localhost:9998/tika`
23
24 * Extract text with mime-type hint:
25 `curl -v -H "Content-type: application/vnd.openxmlformats-officedocument.wordprocessingml.document" -T document.docx http://localhost:9998/tika`
26
27 * Get all document attachments as ZIP-file:
28 `curl -v -T Doc1_ole.doc http://localhost:9998/unpacker > /var/tmp/x.zip`
29
30 * Extract metadata to CSV format:
31 `curl -T price.xls http://localhost:9998/meta`
32
33 * Detect media type from CSV format using file extension hint:
34 `curl -X PUT -H "Content-Disposition: attachment; filename=foo.csv" --upload-file foo.csv http://localhost:9998/detect/stream`
35
36
37 HTTP Return Codes
38 -----------------
39 `200` - Ok
40 `204` - No content (for example when unpacking a file with no attachments)
41 `415` - Unknown file type
42 `422` - Unparsable document of known type (password protected documents and unsupported versions like Biff5 Excel)
43 `500` - Internal error
44
1919 <parent>
2020 <groupId>org.apache.tika</groupId>
2121 <artifactId>tika-parent</artifactId>
22 <version>1.6</version>
22 <version>1.8</version>
2323 <relativePath>../tika-parent/pom.xml</relativePath>
2424 </parent>
2525
2626 <artifactId>tika-server</artifactId>
2727 <name>Apache Tika server</name>
2828
29 <properties>
30 <cxf.version>3.0.3</cxf.version>
31 </properties>
32
33 <pluginRepositories>
34 <pluginRepository>
35 <id>miredot</id>
36 <name>MireDot Releases</name>
37 <url>http://nexus.qmino.com/content/repositories/miredot</url>
38 </pluginRepository>
39 </pluginRepositories>
40
2941 <dependencies>
3042 <dependency>
3143 <groupId>${project.groupId}</groupId>
3547 <dependency>
3648 <groupId>${project.groupId}</groupId>
3749 <artifactId>tika-serialization</artifactId>
50 <version>${project.version}</version>
51 </dependency>
52 <dependency>
53 <groupId>${project.groupId}</groupId>
54 <artifactId>tika-xmp</artifactId>
3855 <version>${project.version}</version>
3956 </dependency>
4057 <dependency>
4562 <dependency>
4663 <groupId>org.apache.cxf</groupId>
4764 <artifactId>cxf-rt-frontend-jaxrs</artifactId>
48 <version>2.7.8</version>
65 <version>${cxf.version}</version>
66 </dependency>
67 <dependency>
68 <groupId>org.apache.cxf</groupId>
69 <artifactId>cxf-rt-rs-service-description</artifactId>
70 <version>${cxf.version}</version>
71 <scope>test</scope>
4972 </dependency>
5073 <dependency>
5174 <groupId>org.apache.cxf</groupId>
5275 <artifactId>cxf-rt-transports-http-jetty</artifactId>
53 <version>2.7.8</version>
76 <version>${cxf.version}</version>
77 </dependency>
78 <dependency>
79 <groupId>org.apache.cxf</groupId>
80 <artifactId>cxf-rt-rs-security-cors</artifactId>
81 <version>${cxf.version}</version>
82 </dependency>
83 <dependency>
84 <groupId>javax.mail</groupId>
85 <artifactId>mail</artifactId>
86 <version>1.4.4</version>
5487 </dependency>
5588 <dependency>
5689 <groupId>commons-cli</groupId>
6396 <version>2.5</version>
6497 </dependency>
6598 <dependency>
99 <groupId>org.apache.cxf</groupId>
100 <artifactId>cxf-rt-rs-client</artifactId>
101 <version>${cxf.version}</version>
102 <scope>test</scope>
103 </dependency>
104 <dependency>
105 <groupId>${project.groupId}</groupId>
106 <artifactId>tika-core</artifactId>
107 <version>${project.version}</version>
108 <type>test-jar</type>
109 <scope>test</scope>
110 </dependency>
111 <dependency>
112 <groupId>${project.groupId}</groupId>
113 <artifactId>tika-parsers</artifactId>
114 <version>${project.version}</version>
115 <type>test-jar</type>
116 <scope>test</scope>
117 </dependency>
118 <dependency>
66119 <groupId>junit</groupId>
67120 <artifactId>junit</artifactId>
68 <scope>test</scope>
69 <version>4.11</version>
121 </dependency>
122 <dependency>
123 <groupId>org.slf4j</groupId>
124 <artifactId>slf4j-jcl</artifactId>
125 <version>1.6.1</version>
70126 </dependency>
71127 </dependencies>
72128
114170 <exclude>CHANGES</exclude>
115171 <exclude>README</exclude>
116172 <exclude>builddef.lst</exclude>
173 <!-- clutter not needed in jar -->
174 <exclude>resources/grib1/nasa/README*.pdf</exclude>
175 <exclude>resources/grib1/**/readme*.txt</exclude>
176 <exclude>resources/grib2/**/readme*.txt</exclude>
117177 <!-- TIKA-763: Workaround to avoid including LGPL classes -->
118178 <exclude>ucar/nc2/iosp/fysat/Fysat*.class</exclude>
119179 <exclude>ucar/nc2/dataset/transform/VOceanSG1*class</exclude>
184244 </execution>
185245 </executions>
186246 </plugin>
247 <plugin>
248 <groupId>com.qmino</groupId>
249 <artifactId>miredot-maven-plugin</artifactId>
250 <version>1.4</version>
251 <executions>
252 <execution>
253 <goals>
254 <goal>restdoc</goal>
255 </goals>
256 </execution>
257 </executions>
258 <configuration>
259 <licence>
260 <!-- Miredot license key valid until August 1st, 2016 when we can apply for a new one - http://s.apache.org/oE -->
261 UHJvamVjdHxvcmcuYXBhY2hlLnRpa2EudGlrYS1zZXJ2ZXJ8MjAxNi0wOC0wMXx0cnVlI01Dd0NGRklXRzRqRmNTZXNJb2laRElKZVF4RXpieUNTQWhSMHBmTzZCMUdMbDBPQ1B1WmJYQ3NpZElZSCtRPT0=
262 </licence>
263 <!-- insert other configuration here (optional) -->
264 </configuration>
265 </plugin>
187266 </plugins>
188267 </build>
189268 <profiles>
190 <profile>
191 <id>server</id>
192 <build>
193 <defaultGoal>test</defaultGoal>
194 <plugins>
195 <plugin>
196 <groupId>org.codehaus.mojo</groupId>
197 <artifactId>exec-maven-plugin</artifactId>
198 <executions>
199 <execution>
200 <phase>test</phase>
201 <goals>
202 <goal>java</goal>
203 </goals>
204 <configuration>
205 <mainClass>org.apache.tika.server.TikaServerCli</mainClass>
206 </configuration>
207 </execution>
208 </executions>
209 </plugin>
210 </plugins>
211 </build>
212 </profile>
213 </profiles>
269 <profile>
270 <id>server</id>
271 <build>
272 <defaultGoal>test</defaultGoal>
273 <plugins>
274 <plugin>
275 <groupId>org.codehaus.mojo</groupId>
276 <artifactId>exec-maven-plugin</artifactId>
277 <executions>
278 <execution>
279 <phase>test</phase>
280 <goals>
281 <goal>java</goal>
282 </goals>
283 <configuration>
284 <mainClass>org.apache.tika.server.TikaServerCli</mainClass>
285 </configuration>
286 </execution>
287 </executions>
288 </plugin>
289 </plugins>
290 </build>
291 </profile>
292 </profiles>
214293 <url>http://tika.apache.org/</url>
215294 <organization>
216 <name>The Apache Software Foundation</name>
217 <url>http://www.apache.org</url>
295 <name>The Apache Software Foundation</name>
296 <url>http://www.apache.org</url>
218297 </organization>
219298 <scm>
220 <url>http://svn.apache.org/viewvc/tika/tags/1.6/tika-server</url>
221 <connection>scm:svn:http://svn.apache.org/repos/asf/tika/tags/1.6/tika-server</connection>
222 <developerConnection>scm:svn:https://svn.apache.org/repos/asf/tika/tags/1.6/tika-server</developerConnection>
299 <url>http://svn.apache.org/viewvc/tika/tags/1.8-rc2/tika-server</url>
300 <connection>scm:svn:http://svn.apache.org/repos/asf/tika/tags/1.8-rc2/tika-server</connection>
301 <developerConnection>scm:svn:https://svn.apache.org/repos/asf/tika/tags/1.8-rc2/tika-server</developerConnection>
223302 </scm>
224303 <issueManagement>
225 <system>JIRA</system>
226 <url>https://issues.apache.org/jira/browse/TIKA</url>
304 <system>JIRA</system>
305 <url>https://issues.apache.org/jira/browse/TIKA</url>
227306 </issueManagement>
228307 <ciManagement>
229 <system>Jenkins</system>
230 <url>https://builds.apache.org/job/Tika-trunk/</url>
308 <system>Jenkins</system>
309 <url>https://builds.apache.org/job/Tika-trunk/</url>
231310 </ciManagement>
232311 </project>
+0
-70
tika-server/src/main/java/org/apache/tika/server/CSVMessageBodyWriter.java
0 /*
1 * Licensed to the Apache Software Foundation (ASF) under one or more
2 * contributor license agreements. See the NOTICE file distributed with
3 * this work for additional information regarding copyright ownership.
4 * The ASF licenses this file to You under the Apache License, Version 2.0
5 * (the "License"); you may not use this file except in compliance with
6 * the License. You may obtain a copy of the License at
7 *
8 * http://www.apache.org/licenses/LICENSE-2.0
9 *
10 * Unless required by applicable law or agreed to in writing, software
11 * distributed under the License is distributed on an "AS IS" BASIS,
12 * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
13 * See the License for the specific language governing permissions and
14 * limitations under the License.
15 */
16
17 package org.apache.tika.server;
18
19 import org.apache.tika.metadata.Metadata;
20
21 import javax.ws.rs.Produces;
22 import javax.ws.rs.WebApplicationException;
23 import javax.ws.rs.core.MediaType;
24 import javax.ws.rs.core.MultivaluedMap;
25 import javax.ws.rs.ext.MessageBodyWriter;
26 import javax.ws.rs.ext.Provider;
27
28 import java.io.IOException;
29 import java.io.OutputStream;
30 import java.io.OutputStreamWriter;
31 import java.lang.annotation.Annotation;
32 import java.lang.reflect.Type;
33 import java.util.ArrayList;
34 import java.util.Arrays;
35
36 import au.com.bytecode.opencsv.CSVWriter;
37
38 @Provider
39 @Produces("text/csv")
40 public class CSVMessageBodyWriter implements MessageBodyWriter<Metadata> {
41
42 public boolean isWriteable(Class<?> type, Type genericType, Annotation[] annotations, MediaType mediaType) {
43 return Metadata.class.isAssignableFrom(type);
44 }
45
46 public long getSize(Metadata data, Class<?> type, Type genericType, Annotation[] annotations, MediaType mediaType) {
47 return -1;
48 }
49
50 @Override
51 @SuppressWarnings("resource")
52 public void writeTo(Metadata metadata, Class<?> type, Type genericType, Annotation[] annotations,
53 MediaType mediaType, MultivaluedMap<String, Object> httpHeaders, OutputStream entityStream) throws IOException,
54 WebApplicationException {
55
56 CSVWriter writer = new CSVWriter(new OutputStreamWriter(entityStream, "UTF-8"));
57
58 for (String name : metadata.names()) {
59 String[] values = metadata.getValues(name);
60 ArrayList<String> list = new ArrayList<String>(values.length + 1);
61 list.add(name);
62 list.addAll(Arrays.asList(values));
63 writer.writeNext(list.toArray(values));
64 }
65
66 // Don't close, just flush the stream
67 writer.flush();
68 }
69 }
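The removed CSVMessageBodyWriter emits one CSV row per metadata field: the field name first, then each of its values. A minimal stdlib-only sketch of that row layout (illustrative; the real writer uses opencsv's CSVWriter, which also handles quoting and escaping):

```java
import java.util.ArrayList;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;

public class CsvRowSketch {
    // One row per metadata field: name first, then every value.
    // No quoting is done here; opencsv handles that in the real writer.
    static String toCsv(Map<String, List<String>> metadata) {
        StringBuilder out = new StringBuilder();
        for (Map.Entry<String, List<String>> field : metadata.entrySet()) {
            List<String> row = new ArrayList<>();
            row.add(field.getKey());
            row.addAll(field.getValue());
            out.append(String.join(",", row)).append('\n');
        }
        return out.toString();
    }

    public static void main(String[] args) {
        Map<String, List<String>> metadata = new LinkedHashMap<>();
        metadata.put("Content-Type", List.of("text/plain"));
        metadata.put("dc:creator", List.of("alice", "bob"));
        System.out.print(toCsv(metadata));
    }
}
```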
+0
-72
tika-server/src/main/java/org/apache/tika/server/DetectorResource.java
0 /**
1 * Licensed to the Apache Software Foundation (ASF) under one or more
2 * contributor license agreements. See the NOTICE file distributed with
3 * this work for additional information regarding copyright ownership.
4 * The ASF licenses this file to You under the Apache License, Version 2.0
5 * (the "License"); you may not use this file except in compliance with
6 * the License. You may obtain a copy of the License at
7 *
8 * http://www.apache.org/licenses/LICENSE-2.0
9 *
10 * Unless required by applicable law or agreed to in writing, software
11 * distributed under the License is distributed on an "AS IS" BASIS,
12 * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
13 * See the License for the specific language governing permissions and
14 * limitations under the License.
15 */
16
17 package org.apache.tika.server;
18
19 import java.io.IOException;
20 import java.io.InputStream;
21
22 import javax.ws.rs.Consumes;
23 import javax.ws.rs.PUT;
24 import javax.ws.rs.Path;
25 import javax.ws.rs.Produces;
26 import javax.ws.rs.core.Context;
27 import javax.ws.rs.core.HttpHeaders;
28 import javax.ws.rs.core.UriInfo;
29
30 import org.apache.commons.logging.Log;
31 import org.apache.commons.logging.LogFactory;
32 import org.apache.tika.config.TikaConfig;
33 import org.apache.tika.io.TikaInputStream;
34 import org.apache.tika.metadata.Metadata;
35 import org.apache.tika.mime.MediaType;
36
37 @Path("/detect")
38 public class DetectorResource {
39
40 private static final Log logger = LogFactory.getLog(DetectorResource.class
41 .getName());
42
43 private TikaConfig config = null;
44
45 public DetectorResource(TikaConfig config) {
46 this.config = config;
47 }
48
49 @PUT
50 @Path("stream")
51 @Consumes("*/*")
52 @Produces("text/plain")
53 public String detect(final InputStream is,
54 @Context HttpHeaders httpHeaders, @Context final UriInfo info) {
55 Metadata met = new Metadata();
56 TikaInputStream tis = TikaInputStream.get(is);
57 String filename = TikaResource.detectFilename(httpHeaders
58 .getRequestHeaders());
59 logger.info("Detecting media type for Filename: " + filename);
60 met.add(Metadata.RESOURCE_NAME_KEY, filename);
61 try {
62 return this.config.getDetector().detect(tis, met).toString();
63 } catch (IOException e) {
64 logger.warn("Unable to detect MIME type for file. Reason: "
65 + e.getMessage());
66 e.printStackTrace();
67 return MediaType.OCTET_STREAM.toString();
68 }
69 }
70
71 }
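DetectorResource.detect never propagates a detection failure to the client: on IOException it logs a warning and answers application/octet-stream. A self-contained sketch of that fallback (the Detector interface here is a stand-in, not Tika's org.apache.tika.detect.Detector):

```java
public class DetectFallbackSketch {
    // Stand-in for Tika's Detector; one method that may fail on I/O.
    interface Detector {
        String detect(byte[] prefix) throws java.io.IOException;
    }

    // Mirrors DetectorResource.detect's error handling: a failed
    // detection degrades to the generic binary type instead of an error.
    static String detectOrOctetStream(Detector detector, byte[] prefix) {
        try {
            return detector.detect(prefix);
        } catch (java.io.IOException e) {
            return "application/octet-stream";
        }
    }

    public static void main(String[] args) {
        Detector failing = prefix -> { throw new java.io.IOException("truncated stream"); };
        System.out.println(detectOrOctetStream(failing, new byte[0]));
    }
}
```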
2020 import java.io.InputStream;
2121
2222 import org.apache.tika.io.IOUtils;
23
24 /**
25 * Helps produce user facing HTML output.
26 *
27 * TODO Decide if this would be better done as a MessageBodyWriter
28 */
29 public class HTMLHelper {
23
24 /**
25 * Helps produce user facing HTML output.
26 * <p/>
27 * TODO Decide if this would be better done as a MessageBodyWriter
28 */
29 public class HTMLHelper {
3030 private static final String PATH = "/tikaserver-template.html";
3131 private static final String TITLE_VAR = "[[TITLE]]";
32 private static final String BODY_VAR = "[[BODY]]";
33 private String PRE_BODY;
34 private String POST_BODY;
35
36 public HTMLHelper() {
37 InputStream htmlStr = getClass().getResourceAsStream(PATH);
38 if (htmlStr == null) {
32 private static final String BODY_VAR = "[[BODY]]";
33 private String PRE_BODY;
34 private String POST_BODY;
35
36 public HTMLHelper() {
37 InputStream htmlStr = getClass().getResourceAsStream(PATH);
38 if (htmlStr == null) {
3939 throw new IllegalArgumentException("Template Not Found - " + PATH);
4040 }
4141 try {
42 String html = IOUtils.toString(htmlStr, "UTF-8");
42 String html = IOUtils.toString(htmlStr, IOUtils.UTF_8.name());
4343 int bodyAt = html.indexOf(BODY_VAR);
4444 PRE_BODY = html.substring(0, bodyAt);
4545 POST_BODY = html.substring(bodyAt + BODY_VAR.length());
4646 } catch (IOException e) {
47 throw new IllegalStateException("Unable to read template");
48 }
49 }
50
51 /**
52 * Generates the HTML Header for the user facing page, adding
53 * in the given title as required
54 */
55 public void generateHeader(StringBuffer html, String title) {
56 html.append(PRE_BODY.replace(TITLE_VAR, title));
57 }
58 public void generateFooter(StringBuffer html) {
59 html.append(POST_BODY);
60 }
61 }
47 throw new IllegalStateException("Unable to read template");
48 }
49 }
50
51 /**
52 * Generates the HTML Header for the user facing page, adding
53 * in the given title as required
54 */
55 public void generateHeader(StringBuffer html, String title) {
56 html.append(PRE_BODY.replace(TITLE_VAR, title));
57 }
58
59 public void generateFooter(StringBuffer html) {
60 html.append(POST_BODY);
61 }}
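HTMLHelper splits its template once, in the constructor, around the [[BODY]] marker, then builds every page as PRE_BODY + body + POST_BODY. The split itself can be sketched in isolation (the template string below is a stand-in for /tikaserver-template.html):

```java
public class TemplateSplitSketch {
    static final String BODY_VAR = "[[BODY]]";

    // Mirrors HTMLHelper's constructor: one indexOf, two substrings.
    static String[] split(String template) {
        int at = template.indexOf(BODY_VAR);
        if (at < 0) {
            throw new IllegalArgumentException("Template has no " + BODY_VAR);
        }
        return new String[] {
            template.substring(0, at),                    // PRE_BODY
            template.substring(at + BODY_VAR.length())    // POST_BODY
        };
    }

    public static void main(String[] args) {
        String[] parts = split("<html><body>[[BODY]]</body></html>");
        System.out.println(parts[0] + "Hello" + parts[1]);
    }
}
```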
+0
-63
tika-server/src/main/java/org/apache/tika/server/JSONMessageBodyWriter.java
0 /*
1 * Licensed to the Apache Software Foundation (ASF) under one or more
2 * contributor license agreements. See the NOTICE file distributed with
3 * this work for additional information regarding copyright ownership.
4 * The ASF licenses this file to You under the Apache License, Version 2.0
5 * (the "License"); you may not use this file except in compliance with
6 * the License. You may obtain a copy of the License at
7 *
8 * http://www.apache.org/licenses/LICENSE-2.0
9 *
10 * Unless required by applicable law or agreed to in writing, software
11 * distributed under the License is distributed on an "AS IS" BASIS,
12 * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
13 * See the License for the specific language governing permissions and
14 * limitations under the License.
15 */
16
17 package org.apache.tika.server;
18
19 import org.apache.tika.exception.TikaException;
20 import org.apache.tika.metadata.Metadata;
21 import org.apache.tika.metadata.serialization.JsonMetadata;
22
23 import javax.ws.rs.Produces;
24 import javax.ws.rs.WebApplicationException;
25 import javax.ws.rs.core.MediaType;
26 import javax.ws.rs.core.MultivaluedMap;
27 import javax.ws.rs.ext.MessageBodyWriter;
28 import javax.ws.rs.ext.Provider;
29
30 import java.io.IOException;
31 import java.io.OutputStream;
32 import java.io.OutputStreamWriter;
33 import java.io.Writer;
34 import java.lang.annotation.Annotation;
35 import java.lang.reflect.Type;
36
37 @Provider
38 @Produces(MediaType.APPLICATION_JSON)
39 public class JSONMessageBodyWriter implements MessageBodyWriter<Metadata> {
40
41 public boolean isWriteable(Class<?> type, Type genericType, Annotation[] annotations, MediaType mediaType) {
42 return Metadata.class.isAssignableFrom(type);
43 }
44
45 public long getSize(Metadata data, Class<?> type, Type genericType, Annotation[] annotations, MediaType mediaType) {
46 return -1;
47 }
48
49 @Override
50 public void writeTo(Metadata metadata, Class<?> type, Type genericType, Annotation[] annotations,
51 MediaType mediaType, MultivaluedMap<String, Object> httpHeaders, OutputStream entityStream) throws IOException,
52 WebApplicationException {
53 try {
54 Writer writer = new OutputStreamWriter(entityStream, "UTF-8");
55 JsonMetadata.toJson(metadata, writer);
56 writer.flush();
57 } catch (TikaException e) {
58 throw new IOException(e);
59 }
60 entityStream.flush();
61 }
62 }
+0
-168
tika-server/src/main/java/org/apache/tika/server/MetadataEP.java
0 /*
1 * Licensed to the Apache Software Foundation (ASF) under one or more
2 * contributor license agreements. See the NOTICE file distributed with
3 * this work for additional information regarding copyright ownership.
4 * The ASF licenses this file to You under the Apache License, Version 2.0
5 * (the "License"); you may not use this file except in compliance with
6 * the License. You may obtain a copy of the License at
7 *
8 * http://www.apache.org/licenses/LICENSE-2.0
9 *
10 * Unless required by applicable law or agreed to in writing, software
11 * distributed under the License is distributed on an "AS IS" BASIS,
12 * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
13 * See the License for the specific language governing permissions and
14 * limitations under the License.
15 */
16
17 package org.apache.tika.server;
18
19 import java.io.InputStream;
20
21 import javax.ws.rs.POST;
22 import javax.ws.rs.Path;
23 import javax.ws.rs.PathParam;
24 import javax.ws.rs.Produces;
25 import javax.ws.rs.core.Context;
26 import javax.ws.rs.core.HttpHeaders;
27 import javax.ws.rs.core.MediaType;
28 import javax.ws.rs.core.Response;
29 import javax.ws.rs.core.Response.Status;
30 import javax.ws.rs.core.UriInfo;
31
32 import org.apache.commons.logging.Log;
33 import org.apache.commons.logging.LogFactory;
34 import org.apache.tika.config.TikaConfig;
35 import org.apache.tika.metadata.Metadata;
36 import org.apache.tika.parser.AutoDetectParser;
37 import org.xml.sax.helpers.DefaultHandler;
38
39 /**
40 * This JAX-RS endpoint provides access to the metadata contained within a
41 * document. It is possible to submit a relatively small prefix (a few KB) of a
42 * document's content to retrieve individual metadata fields.
43 * <p>
44 */
45 @Path("/metadata")
46 public class MetadataEP {
47 private static final Log logger = LogFactory.getLog(MetadataEP.class);
48
49 private TikaConfig config;
50 private final AutoDetectParser parser;
51
52 /** The metadata for the request */
53 private final Metadata metadata = new Metadata();
54
55 public MetadataEP(@Context HttpHeaders httpHeaders, @Context UriInfo info) {
56 // TODO How to get this better?
57 config = TikaConfig.getDefaultConfig();
58 parser = TikaResource.createParser(config);
59
60 TikaResource.fillMetadata(parser, metadata, httpHeaders.getRequestHeaders());
61 TikaResource.logRequest(logger, info, metadata);
62 }
63
64 /**
65 * Get all metadata that can be parsed from the specified input stream. An
66 * error is produced if the input stream cannot be parsed.
67 *
68 * @param is
69 * an input stream
70 * @return the metadata
71 * @throws Exception
72 */
73 @POST
74 public Response getMetadata(InputStream is) throws Exception {
75 parser.parse(is, new DefaultHandler(), metadata);
76 return Response.ok(metadata).build();
77 }
78
79 /**
80 * Get a specific TIKA metadata field as a simple text string. If the field is
81 * multivalued, then only the first value is returned. If the input stream
82 * cannot be parsed, but a value was found for the given metadata field, then
83 * the value of the field is returned as part of a 200 OK response; otherwise
84 * a {@link Status#BAD_REQUEST} is generated. If the stream was successfully
85 * parsed but the specific metadata field was not found, then a
86 * {@link Status#NOT_FOUND} is returned.
87 * <p>
88 *
89 * @param field
90 * the tika metadata field name
91 * @param is
92 * the document stream
93 * @return one of {@link Status#OK}, {@link Status#NOT_FOUND}, or
94 * {@link Status#BAD_REQUEST}
95 * @throws Exception
96 */
97 @POST
98 @Path("{field}")
99 @Produces(MediaType.TEXT_PLAIN)
100 public Response getSimpleMetadataField(@PathParam("field") String field, InputStream is) throws Exception {
101
102 // use BAD request to indicate that we may not have had enough data to
103 // process the request
104 Status defaultErrorResponse = Status.BAD_REQUEST;
105 try {
106 parser.parse(is, new DefaultHandler(), metadata);
107 // once we've parsed the document successfully, we should use NOT_FOUND
108 // if we did not see the field
109 defaultErrorResponse = Status.NOT_FOUND;
110 } catch (Exception e) {
111 logger.info("Failed to process field " + field, e);
112 }
113 String value = metadata.get(field);
114 if (value == null) {
115 return Response.status(defaultErrorResponse).entity("Failed to get metadata field " + field).build();
116 }
117 return Response.ok(value, MediaType.TEXT_PLAIN_TYPE).build();
118 }
119
120 /**
121 * Get a specific metadata field. If the input stream cannot be parsed, but a
122 * value was found for the given metadata field, then the value of the field
123 * is returned as part of a 200 OK response; otherwise a
124 * {@link Status#BAD_REQUEST} is generated. If the stream was successfully
125 * parsed but the specific metadata field was not found, then a
126 * {@link Status#NOT_FOUND} is returned.
127 * <p>
128 * Note that this method handles multivalue fields and returns possibly more
129 * metadata than requested.
130 *
131 * @param field
132 * the tika metadata field name
133 * @param is
134 * the document stream
135 * @return one of {@link Status#OK}, {@link Status#NOT_FOUND}, or
136 * {@link Status#BAD_REQUEST}
137 * @throws Exception
138 */
139 @POST
140 @Path("{field}")
141 public Response getMetadataField(@PathParam("field") String field, InputStream is) throws Exception {
142
143 // use BAD request to indicate that we may not have had enough data to
144 // process the request
145 Status defaultErrorResponse = Status.BAD_REQUEST;
146 try {
147 parser.parse(is, new DefaultHandler(), metadata);
148 // once we've parsed the document successfully, we should use NOT_FOUND
149 // if we did not see the field
150 defaultErrorResponse = Status.NOT_FOUND;
151 } catch (Exception e) {
152 logger.info("Failed to process field " + field, e);
153 }
154 String[] values = metadata.getValues(field);
155 if (values.length == 0) {
156 return Response.status(defaultErrorResponse).entity("Failed to get metadata field " + field).build();
157 }
158 // remove fields we don't care about for the response
159 for (String name : metadata.names()) {
160 if (!field.equals(name)) {
161 metadata.remove(name);
162 }
163 }
164 return Response.ok(metadata).build();
165 }
166
167 }
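The removed MetadataEP field endpoints pick their error status in two steps: before a successful parse, failure means BAD_REQUEST (perhaps too small a prefix was submitted); after a successful parse, a missing field means NOT_FOUND; a found value is always OK, even if parsing failed partway. That decision, reduced to a pure function (status names as strings for illustration):

```java
public class MetadataStatusSketch {
    // Mirrors MetadataEP.getSimpleMetadataField/getMetadataField:
    // defaultErrorResponse starts as BAD_REQUEST and flips to NOT_FOUND
    // once the parse succeeds; a found field short-circuits to OK.
    static String statusFor(boolean parseSucceeded, boolean fieldFound) {
        if (fieldFound) {
            return "OK";
        }
        return parseSucceeded ? "NOT_FOUND" : "BAD_REQUEST";
    }

    public static void main(String[] args) {
        System.out.println(statusFor(false, false)); // BAD_REQUEST
        System.out.println(statusFor(true, false));  // NOT_FOUND
        System.out.println(statusFor(true, true));   // OK
    }
}
```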
0 package org.apache.tika.server;
1
2 /*
3 * Licensed to the Apache Software Foundation (ASF) under one or more
4 * contributor license agreements. See the NOTICE file distributed with
5 * this work for additional information regarding copyright ownership.
6 * The ASF licenses this file to You under the Apache License, Version 2.0
7 * (the "License"); you may not use this file except in compliance with
8 * the License. You may obtain a copy of the License at
9 *
10 * http://www.apache.org/licenses/LICENSE-2.0
11 *
12 * Unless required by applicable law or agreed to in writing, software
13 * distributed under the License is distributed on an "AS IS" BASIS,
14 * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
15 * See the License for the specific language governing permissions and
16 * limitations under the License.
17 */
18
19 import org.apache.tika.metadata.Metadata;
20
21 import java.util.List;
22
23 /**
24 * Wrapper class to make isWriteable in MetadataListMBW simpler.
25 */
26 public class MetadataList {
27 private final List<Metadata> metadata;
28
29 public MetadataList(List<Metadata> metadata) {
30 this.metadata = metadata;
31 }
32
33 public List<Metadata> getMetadata() {
34 return metadata;
35 }
36 }
+0
-99
tika-server/src/main/java/org/apache/tika/server/MetadataResource.java
0 /*
1 * Licensed to the Apache Software Foundation (ASF) under one or more
2 * contributor license agreements. See the NOTICE file distributed with
3 * this work for additional information regarding copyright ownership.
4 * The ASF licenses this file to You under the Apache License, Version 2.0
5 * (the "License"); you may not use this file except in compliance with
6 * the License. You may obtain a copy of the License at
7 *
8 * http://www.apache.org/licenses/LICENSE-2.0
9 *
10 * Unless required by applicable law or agreed to in writing, software
11 * distributed under the License is distributed on an "AS IS" BASIS,
12 * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
13 * See the License for the specific language governing permissions and
14 * limitations under the License.
15 */
16
17 package org.apache.tika.server;
18
19 import java.io.IOException;
20 import java.io.InputStream;
21 import java.io.OutputStream;
22 import java.io.OutputStreamWriter;
23 import java.util.ArrayList;
24 import java.util.Arrays;
25
26 import javax.ws.rs.Consumes;
27 import javax.ws.rs.PUT;
28 import javax.ws.rs.Path;
29 import javax.ws.rs.Produces;
30 import javax.ws.rs.WebApplicationException;
31 import javax.ws.rs.core.Context;
32 import javax.ws.rs.core.HttpHeaders;
33 import javax.ws.rs.core.MultivaluedMap;
34 import javax.ws.rs.core.StreamingOutput;
35 import javax.ws.rs.core.UriInfo;
36
37 import org.apache.commons.logging.Log;
38 import org.apache.commons.logging.LogFactory;
39 import org.apache.cxf.jaxrs.ext.multipart.Attachment;
40 import org.apache.tika.config.TikaConfig;
41 import org.apache.tika.metadata.Metadata;
42 import org.apache.tika.parser.AutoDetectParser;
43 import org.xml.sax.helpers.DefaultHandler;
44
45 import au.com.bytecode.opencsv.CSVWriter;
46
47 @Path("/meta")
48 public class MetadataResource {
49 private static final Log logger = LogFactory.getLog(MetadataResource.class);
50
51 private TikaConfig tikaConfig;
52 public MetadataResource(TikaConfig tikaConfig) {
53 this.tikaConfig = tikaConfig;
54 }
55
56 @PUT
57 @Consumes("multipart/form-data")
58 @Produces("text/csv")
59 @Path("form")
60 public StreamingOutput getMetadataFromMultipart(Attachment att, @Context UriInfo info) throws Exception {
61 return produceMetadata(att.getObject(InputStream.class), att.getHeaders(), info);
62 }
63
64 @PUT
65 @Produces("text/csv")
66 public StreamingOutput getMetadata(InputStream is, @Context HttpHeaders httpHeaders, @Context UriInfo info) throws Exception {
67 return produceMetadata(is, httpHeaders.getRequestHeaders(), info);
68 }
69
70 private StreamingOutput produceMetadata(InputStream is, MultivaluedMap<String, String> httpHeaders, UriInfo info) throws Exception {
71 final Metadata metadata = new Metadata();
72 AutoDetectParser parser = TikaResource.createParser(tikaConfig);
73 TikaResource.fillMetadata(parser, metadata, httpHeaders);
74 TikaResource.logRequest(logger, info, metadata);
75
76 parser.parse(is, new DefaultHandler(), metadata);
77
78 return new StreamingOutput() {
79 public void write(OutputStream outputStream) throws IOException, WebApplicationException {
80 metadataToCsv(metadata, outputStream);
81 }
82 };
83 }
84
85 public static void metadataToCsv(Metadata metadata, OutputStream outputStream) throws IOException {
86 CSVWriter writer = new CSVWriter(new OutputStreamWriter(outputStream, "UTF-8"));
87
88 for (String name : metadata.names()) {
89 String[] values = metadata.getValues(name);
90 ArrayList<String> list = new ArrayList<String>(values.length+1);
91 list.add(name);
92 list.addAll(Arrays.asList(values));
93 writer.writeNext(list.toArray(values));
94 }
95
96 writer.close();
97 }
98 }
1313 * See the License for the specific language governing permissions and
1414 * limitations under the License.
1515 */
16
17 package org.apache.tika.server;
18
19 import org.apache.tika.sax.WriteOutContentHandler;
20 import org.xml.sax.Attributes;
21 import org.xml.sax.SAXException;
22
23 import java.io.Writer;
24
25 class RichTextContentHandler extends WriteOutContentHandler {
26 public RichTextContentHandler(Writer writer) {
27 super(writer);
28 }
29
30 @Override
31 public void startElement(String uri, String localName, String qName, Attributes attributes) throws SAXException {
32 super.startElement(uri, localName, qName, attributes);
33
34 if ("img".equals(localName) && attributes.getValue("alt")!=null) {
35 String nfo = "[image: "+attributes.getValue("alt")+ ']';
36
37 characters(nfo.toCharArray(), 0, nfo.length());
38 }
39
40 if ("a".equals(localName) && attributes.getValue("name")!=null) {
41 String nfo = "[bookmark: "+attributes.getValue("name")+ ']';
42
43 characters(nfo.toCharArray(), 0, nfo.length());
44 }
45 }
46 }
16
17 package org.apache.tika.server;
18
19 import java.io.Writer;
20
21 import org.apache.tika.sax.WriteOutContentHandler;
22 import org.xml.sax.Attributes;
23 import org.xml.sax.SAXException;
24
25 public class RichTextContentHandler extends WriteOutContentHandler {
26 public RichTextContentHandler(Writer writer) {
27 super(writer);
28 }
29
30 @Override
31 public void startElement(String uri, String localName, String qName, Attributes attributes) throws SAXException {
32 super.startElement(uri, localName, qName, attributes);
33
34 if ("img".equals(localName) && attributes.getValue("alt") != null) {
35 String nfo = "[image: " + attributes.getValue("alt") + ']';
36
37 characters(nfo.toCharArray(), 0, nfo.length());
38 }
39
40 if ("a".equals(localName) && attributes.getValue("name") != null) {
41 String nfo = "[bookmark: " + attributes.getValue("name") + ']';
42
43 characters(nfo.toCharArray(), 0, nfo.length());
44 }
45 }
46 }
+0
-67
tika-server/src/main/java/org/apache/tika/server/TarWriter.java
0 /*
1 * Licensed to the Apache Software Foundation (ASF) under one or more
2 * contributor license agreements. See the NOTICE file distributed with
3 * this work for additional information regarding copyright ownership.
4 * The ASF licenses this file to You under the Apache License, Version 2.0
5 * (the "License"); you may not use this file except in compliance with
6 * the License. You may obtain a copy of the License at
7 *
8 * http://www.apache.org/licenses/LICENSE-2.0
9 *
10 * Unless required by applicable law or agreed to in writing, software
11 * distributed under the License is distributed on an "AS IS" BASIS,
12 * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
13 * See the License for the specific language governing permissions and
14 * limitations under the License.
15 */
16
17 package org.apache.tika.server;
18
19 import org.apache.commons.compress.archivers.tar.TarArchiveEntry;
20 import org.apache.commons.compress.archivers.tar.TarArchiveOutputStream;
21
22 import javax.ws.rs.Produces;
23 import javax.ws.rs.WebApplicationException;
24 import javax.ws.rs.core.MediaType;
25 import javax.ws.rs.core.MultivaluedMap;
26 import javax.ws.rs.ext.MessageBodyWriter;
27 import javax.ws.rs.ext.Provider;
28 import java.io.IOException;
29 import java.io.OutputStream;
30 import java.lang.annotation.Annotation;
31 import java.lang.reflect.Type;
32 import java.util.Map;
33
34 @Provider
35 @Produces("application/x-tar")
36 public class TarWriter implements MessageBodyWriter<Map<String, byte[]>> {
37 private static void tarStoreBuffer(TarArchiveOutputStream zip, String name, byte[] dataBuffer) throws IOException {
38 TarArchiveEntry entry = new TarArchiveEntry(name);
39
40 entry.setSize(dataBuffer.length);
41
42 zip.putArchiveEntry(entry);
43
44 zip.write(dataBuffer);
45
46 zip.closeArchiveEntry();
47 }
48
49 public boolean isWriteable(Class<?> type, Type genericType, Annotation[] annotations, MediaType mediaType) {
50 return Map.class.isAssignableFrom(type);
51 }
52
53 public long getSize(Map<String, byte[]> stringMap, Class<?> type, Type genericType, Annotation[] annotations, MediaType mediaType) {
54 return -1;
55 }
56
57 public void writeTo(Map<String, byte[]> parts, Class<?> type, Type genericType, Annotation[] annotations, MediaType mediaType, MultivaluedMap<String, Object> httpHeaders, OutputStream entityStream) throws IOException, WebApplicationException {
58 TarArchiveOutputStream zip = new TarArchiveOutputStream(entityStream);
59
60 for (Map.Entry<String, byte[]> entry : parts.entrySet()) {
61 tarStoreBuffer(zip, entry.getKey(), entry.getValue());
62 }
63
64 zip.close();
65 }
66 }
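TarWriter's writeTo is a simple store-each-named-buffer loop. The same pattern can be sketched with the JDK's ZipOutputStream instead of commons-compress's TarArchiveOutputStream, so the sketch needs no external dependency (entry names below are illustrative, not the ones tika-server emits):

```java
import java.io.ByteArrayOutputStream;
import java.io.IOException;
import java.nio.charset.StandardCharsets;
import java.util.LinkedHashMap;
import java.util.Map;
import java.util.zip.ZipEntry;
import java.util.zip.ZipOutputStream;

public class ArchiveWriterSketch {
    // Same loop shape as TarWriter.writeTo: for each (name, bytes) pair,
    // open an entry, write the buffer, close the entry.
    static byte[] pack(Map<String, byte[]> parts) throws IOException {
        ByteArrayOutputStream bytes = new ByteArrayOutputStream();
        try (ZipOutputStream zip = new ZipOutputStream(bytes)) {
            for (Map.Entry<String, byte[]> part : parts.entrySet()) {
                zip.putNextEntry(new ZipEntry(part.getKey()));
                zip.write(part.getValue());
                zip.closeEntry();
            }
        }
        return bytes.toByteArray();
    }

    public static void main(String[] args) throws IOException {
        Map<String, byte[]> parts = new LinkedHashMap<>();
        parts.put("text.txt", "extracted text".getBytes(StandardCharsets.UTF_8));
        parts.put("meta.csv", "Content-Type,text/plain".getBytes(StandardCharsets.UTF_8));
        System.out.println(pack(parts).length > 0);
    }
}
```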
+0
-125
tika-server/src/main/java/org/apache/tika/server/TikaDetectors.java
0 /*
1 * Licensed to the Apache Software Foundation (ASF) under one or more
2 * contributor license agreements. See the NOTICE file distributed with
3 * this work for additional information regarding copyright ownership.
4 * The ASF licenses this file to You under the Apache License, Version 2.0
5 * (the "License"); you may not use this file except in compliance with
6 * the License. You may obtain a copy of the License at
7 *
8 * http://www.apache.org/licenses/LICENSE-2.0
9 *
10 * Unless required by applicable law or agreed to in writing, software
11 * distributed under the License is distributed on an "AS IS" BASIS,
12 * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
13 * See the License for the specific language governing permissions and
14 * limitations under the License.
15 */
16 package org.apache.tika.server;
17
18 import java.util.ArrayList;
19 import java.util.HashMap;
20 import java.util.List;
21 import java.util.Map;
22
23 import javax.ws.rs.GET;
24 import javax.ws.rs.Path;
25 import javax.ws.rs.Produces;
26
27 import org.apache.tika.config.TikaConfig;
28 import org.apache.tika.detect.CompositeDetector;
29 import org.apache.tika.detect.Detector;
30 import org.eclipse.jetty.util.ajax.JSON;
31
32 /**
33 * <p>Provides details of all the {@link Detector}s registered with
34 * Apache Tika, similar to <em>--list-detectors</em> with the Tika CLI.
35 */
36 @Path("/detectors")
37 public class TikaDetectors {
38 private TikaConfig tika;
39 private HTMLHelper html;
40
41 public TikaDetectors(TikaConfig tika) {
42 this.tika = tika;
43 this.html = new HTMLHelper();
44 }
45
46 @GET
47 @Produces("text/html")
48 public String getDectorsHTML() {
49 StringBuffer h = new StringBuffer();
50 html.generateHeader(h, "Detectors available to Apache Tika");
51 detectorAsHTML(tika.getDetector(), h, 2);
52 html.generateFooter(h);
53 return h.toString();
54 }
55 private void detectorAsHTML(Detector d, StringBuffer html, int level) {
56 html.append("<h");
57 html.append(level);
58 html.append(">");
59 String name = d.getClass().getName();
60 html.append(name.substring(name.lastIndexOf('.')+1));
61 html.append("</h");
62 html.append(level);
63 html.append(">");
64 html.append("<p>Class: ");
65 html.append(name);
66 html.append("</p>");
67 if (d instanceof CompositeDetector) {
68 html.append("<p>Composite Detector</p>");
69 for (Detector cd : ((CompositeDetector)d).getDetectors()) {
70 detectorAsHTML(cd, html, level+1);
71 }
72 }
73 }
74
75 @GET
76 @Produces(javax.ws.rs.core.MediaType.APPLICATION_JSON)
77 public String getDetectorsJSON() {
78 Map<String,Object> details = new HashMap<String, Object>();
79 detectorAsMap(tika.getDetector(), details);
80 return JSON.toString(details);
81 }
82 private void detectorAsMap(Detector d, Map<String, Object> details) {
83 details.put("name", d.getClass().getName());
84
85 boolean isComposite = (d instanceof CompositeDetector);
86 details.put("composite", isComposite);
87 if (isComposite) {
88 List<Map<String, Object>> c = new ArrayList<Map<String,Object>>();
89 for (Detector cd : ((CompositeDetector)d).getDetectors()) {
90 Map<String,Object> cdet = new HashMap<String, Object>();
91 detectorAsMap(cd, cdet);
92 c.add(cdet);
93 }
94 details.put("children", c);
95 }
96 }
97
98 @GET
99 @Produces("text/plain")
100 public String getDetectorsPlain() {
101 StringBuffer text = new StringBuffer();
102 renderDetector(tika.getDetector(), text, 0);
103 return text.toString();
104 }
105 private void renderDetector(Detector d, StringBuffer text, int indent) {
106 boolean isComposite = (d instanceof CompositeDetector);
107 String name = d.getClass().getName();
108
109 for (int i=0; i<indent; i++) {
110 text.append(" ");
111 }
112 text.append(name);
113 if (isComposite) {
114 text.append(" (Composite Detector):\n");
115
116 List<Detector> subDetectors = ((CompositeDetector)d).getDetectors();
117 for(Detector sd : subDetectors) {
118 renderDetector(sd, text, indent+1);
119 }
120 } else {
121 text.append("\n");
122 }
123 }
124 }
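TikaDetectors renders the detector hierarchy by recursing into composites, indenting children one level deeper than their parent. A stand-alone version of that walk, with a plain Node class standing in for Tika's Detector/CompositeDetector pair:

```java
import java.util.ArrayList;
import java.util.List;

public class DetectorTreeSketch {
    // Stand-in for Detector: a composite is simply a node with children.
    static class Node {
        final String name;
        final List<Node> children = new ArrayList<>();
        Node(String name) { this.name = name; }
    }

    // Mirrors TikaDetectors.renderDetector: one space of indent per
    // level, composites labelled and recursed into.
    static void render(Node d, StringBuilder text, int indent) {
        for (int i = 0; i < indent; i++) {
            text.append(" ");
        }
        text.append(d.name);
        if (!d.children.isEmpty()) {
            text.append(" (Composite Detector):\n");
            for (Node child : d.children) {
                render(child, text, indent + 1);
            }
        } else {
            text.append("\n");
        }
    }

    public static void main(String[] args) {
        Node root = new Node("DefaultDetector");
        root.children.add(new Node("MimeTypes"));
        StringBuilder text = new StringBuilder();
        render(root, text, 0);
        System.out.print(text);
    }
}
```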
+0
-36
tika-server/src/main/java/org/apache/tika/server/TikaExceptionMapper.java
0 /*
1 * Licensed to the Apache Software Foundation (ASF) under one or more
2 * contributor license agreements. See the NOTICE file distributed with
3 * this work for additional information regarding copyright ownership.
4 * The ASF licenses this file to You under the Apache License, Version 2.0
5 * (the "License"); you may not use this file except in compliance with
6 * the License. You may obtain a copy of the License at
7 *
8 * http://www.apache.org/licenses/LICENSE-2.0
9 *
10 * Unless required by applicable law or agreed to in writing, software
11 * distributed under the License is distributed on an "AS IS" BASIS,
12 * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
13 * See the License for the specific language governing permissions and
14 * limitations under the License.
15 */
16
17 package org.apache.tika.server;
18
19 import org.apache.tika.exception.TikaException;
20
21 import javax.ws.rs.WebApplicationException;
22 import javax.ws.rs.core.Response;
23 import javax.ws.rs.ext.ExceptionMapper;
24 import javax.ws.rs.ext.Provider;
25
26 @Provider
27 public class TikaExceptionMapper implements ExceptionMapper<TikaException> {
28 public Response toResponse(TikaException e) {
29 if (e.getCause() !=null && e.getCause() instanceof WebApplicationException) {
30 return ((WebApplicationException) e.getCause()).getResponse();
31 } else {
32 return Response.serverError().build();
33 }
34 }
35 }
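`TikaExceptionMapper` turns a `TikaException` into either the response carried by a wrapped `WebApplicationException` or a generic 500. That unwrap-or-default decision can be sketched without the JAX-RS types; `WebAppException` and `statusFor` below are illustrative stand-ins, not Tika or JAX-RS API:

```java
public class ExceptionStatusDemo {
    // Hypothetical stand-in for a WebApplicationException carrying a status.
    static class WebAppException extends RuntimeException {
        final int status;
        WebAppException(int status) { this.status = status; }
    }

    // Mirrors TikaExceptionMapper.toResponse(): prefer the status of a
    // wrapped web-layer exception, otherwise fall back to 500.
    static int statusFor(Exception e) {
        if (e.getCause() instanceof WebAppException) {
            return ((WebAppException) e.getCause()).status;
        }
        return 500;
    }

    public static void main(String[] args) {
        System.out.println(statusFor(new Exception(new WebAppException(415)))); // prints 415
        System.out.println(statusFor(new Exception("parse failure")));          // prints 500
    }
}
```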
0 /*
1 * Licensed to the Apache Software Foundation (ASF) under one or more
2 * contributor license agreements. See the NOTICE file distributed with
3 * this work for additional information regarding copyright ownership.
4 * The ASF licenses this file to You under the Apache License, Version 2.0
5 * (the "License"); you may not use this file except in compliance with
6 * the License. You may obtain a copy of the License at
7 *
8 * http://www.apache.org/licenses/LICENSE-2.0
9 *
10 * Unless required by applicable law or agreed to in writing, software
11 * distributed under the License is distributed on an "AS IS" BASIS,
12 * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
13 * See the License for the specific language governing permissions and
14 * limitations under the License.
15 */
16
17 package org.apache.tika.server;
18
19 import javax.ws.rs.container.ContainerRequestContext;
20 import javax.ws.rs.container.ContainerRequestFilter;
21 import javax.ws.rs.container.PreMatching;
22 import javax.ws.rs.ext.Provider;
23
24 import java.io.IOException;
25
26 import org.apache.commons.logging.Log;
27 import org.apache.commons.logging.LogFactory;
28
29 @Provider
30 @PreMatching
31 public class TikaLoggingFilter implements ContainerRequestFilter {
32 private static final Log logger = LogFactory.getLog(TikaLoggingFilter.class);
33 private boolean infoLevel;
34
35 public TikaLoggingFilter(boolean infoLevel) {
36 this.infoLevel = infoLevel;
37 }
38
39 @Override
40 public void filter(ContainerRequestContext requestContext) throws IOException {
41 String requestUri = requestContext.getUriInfo().getRequestUri().toString();
42 String logMessage = "Request URI: " + requestUri;
43 if (infoLevel) {
44 logger.info(logMessage);
45 } else {
46 logger.debug(logMessage);
47 }
48 }
49
50 }
tika-server/src/main/java/org/apache/tika/server/TikaMimeTypes.java (0 additions, 178 deletions)
0 /*
1 * Licensed to the Apache Software Foundation (ASF) under one or more
2 * contributor license agreements. See the NOTICE file distributed with
3 * this work for additional information regarding copyright ownership.
4 * The ASF licenses this file to You under the Apache License, Version 2.0
5 * (the "License"); you may not use this file except in compliance with
6 * the License. You may obtain a copy of the License at
7 *
8 * http://www.apache.org/licenses/LICENSE-2.0
9 *
10 * Unless required by applicable law or agreed to in writing, software
11 * distributed under the License is distributed on an "AS IS" BASIS,
12 * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
13 * See the License for the specific language governing permissions and
14 * limitations under the License.
15 */
16 package org.apache.tika.server;
17
18 import java.util.ArrayList;
19 import java.util.HashMap;
20 import java.util.List;
21 import java.util.Map;
22 import java.util.SortedMap;
23 import java.util.TreeMap;
24
25 import javax.ws.rs.GET;
26 import javax.ws.rs.Path;
27 import javax.ws.rs.Produces;
28
29 import org.apache.tika.config.TikaConfig;
30 import org.apache.tika.mime.MediaType;
31 import org.apache.tika.mime.MediaTypeRegistry;
32 import org.apache.tika.parser.CompositeParser;
33 import org.apache.tika.parser.Parser;
34 import org.eclipse.jetty.util.ajax.JSON;
35
36 /**
37 * <p>Provides details of all the mimetypes known to Apache Tika,
38 * similar to <em>--list-supported-types</em> with the Tika CLI.
39 */
40 @Path("/mime-types")
41 public class TikaMimeTypes {
42 private TikaConfig tika;
43 private HTMLHelper html;
44
45 public TikaMimeTypes(TikaConfig tika) {
46 this.tika = tika;
47 this.html = new HTMLHelper();
48 }
49
50 @GET
51 @Produces("text/html")
52 public String getMimeTypesHTML() {
53 StringBuffer h = new StringBuffer();
54 html.generateHeader(h, "Apache Tika Supported Mime Types");
55
56 // Get our types
57 List<MediaTypeDetails> types = getMediaTypes();
58
59 // Get the first type in each section
60 SortedMap<String,String> firstType = new TreeMap<String, String>();
61 for (MediaTypeDetails type : types) {
62 if (! firstType.containsKey(type.type.getType())) {
63 firstType.put(type.type.getType(), type.type.toString());
64 }
65 }
66 h.append("<ul>");
67 for (String section : firstType.keySet()) {
68 h.append("<li><a href=\"#" + firstType.get(section) + "\">" +
69 section + "</a></li>\n");
70 }
71 h.append("</ul>");
72
73 // Output all of them
74 for (MediaTypeDetails type : types) {
75 h.append("<a name=\"" + type.type + "\"></a>\n");
76 h.append("<h2>" + type.type + "</h2>\n");
77
78 for (MediaType alias : type.aliases) {
79 h.append("<div>Alias: " + alias + "</div>\n");
80 }
81 if (type.supertype != null) {
82 h.append("<div>Super Type: <a href=\"#" + type.supertype +
83 "\">" + type.supertype + "</a></div>\n");
84 }
85
86 if (type.parser != null) {
87 h.append("<div>Parser: " + type.parser + "</div>\n");
88 }
89 }
90
91 html.generateFooter(h);
92 return h.toString();
93 }
94
95 @GET
96 @Produces(javax.ws.rs.core.MediaType.APPLICATION_JSON)
97 public String getMimeTypesJSON() {
98 Map<String,Object> details = new HashMap<String, Object>();
99
100 for (MediaTypeDetails type : getMediaTypes()) {
101 Map<String,Object> typeDets = new HashMap<String, Object>();
102
103 typeDets.put("alias", type.aliases);
104 if (type.supertype != null) {
105 typeDets.put("supertype", type.supertype);
106 }
107 if (type.parser != null) {
108 typeDets.put("parser", type.parser);
109 }
110
111 details.put(type.type.toString(), typeDets);
112 }
113
114 return JSON.toString(details);
115 }
116
117 @GET
118 @Produces("text/plain")
119 public String getMimeTypesPlain() {
120 StringBuffer text = new StringBuffer();
121
122 for (MediaTypeDetails type : getMediaTypes()) {
123 text.append(type.type.toString());
124 text.append("\n");
125
126 for (MediaType alias : type.aliases) {
127 text.append(" alias: " + alias + "\n");
128 }
129 if (type.supertype != null) {
130 text.append(" supertype: " + type.supertype.toString() + "\n");
131 }
132
133 if (type.parser != null) {
134 text.append(" parser: " + type.parser + "\n");
135 }
136 }
137
138 return text.toString();
139 }
140
141 protected List<MediaTypeDetails> getMediaTypes() {
142 MediaTypeRegistry registry = tika.getMediaTypeRegistry();
143 Map<MediaType, Parser> parsers = ((CompositeParser)tika.getParser()).getParsers();
144 List<MediaTypeDetails> types =
145 new ArrayList<TikaMimeTypes.MediaTypeDetails>(registry.getTypes().size());
146
147 for (MediaType type : registry.getTypes()) {
148 MediaTypeDetails details = new MediaTypeDetails();
149 details.type = type;
150 details.aliases = registry.getAliases(type).toArray(new MediaType[0]);
151
152 MediaType supertype = registry.getSupertype(type);
153 if (supertype != null && !MediaType.OCTET_STREAM.equals(supertype)) {
154 details.supertype = supertype;
155 }
156
157 Parser p = parsers.get(type);
158 if (p != null) {
159 if (p instanceof CompositeParser) {
160 p = ((CompositeParser)p).getParsers().get(type);
161 }
162 details.parser = p.getClass().getName();
163 }
164
165 types.add(details);
166 }
167
168 return types;
169 }
170
171 private static class MediaTypeDetails {
172 private MediaType type;
173 private MediaType[] aliases;
174 private MediaType supertype;
175 private String parser;
176 }
177 }
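`getMimeTypesHTML()` builds its table of contents by remembering, for each primary type ("text", "application", ...), the first full type seen, with the sections kept alphabetical by a `TreeMap`. A stdlib-only sketch of that grouping, using plain strings in place of Tika's `MediaType`:

```java
import java.util.Arrays;
import java.util.List;
import java.util.SortedMap;
import java.util.TreeMap;

public class FirstTypeDemo {
    // For each primary type, keep the first full type encountered;
    // the TreeMap keeps sections sorted alphabetically.
    static SortedMap<String, String> firstTypePerSection(List<String> types) {
        SortedMap<String, String> firstType = new TreeMap<String, String>();
        for (String type : types) {
            String section = type.substring(0, type.indexOf('/'));
            if (!firstType.containsKey(section)) {
                firstType.put(section, type);
            }
        }
        return firstType;
    }

    public static void main(String[] args) {
        List<String> types = Arrays.asList(
                "text/plain", "application/pdf", "text/html", "application/zip");
        System.out.println(firstTypePerSection(types));
    }
}
```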
tika-server/src/main/java/org/apache/tika/server/TikaParsers.java (0 additions, 226 deletions)
0 /*
1 * Licensed to the Apache Software Foundation (ASF) under one or more
2 * contributor license agreements. See the NOTICE file distributed with
3 * this work for additional information regarding copyright ownership.
4 * The ASF licenses this file to You under the Apache License, Version 2.0
5 * (the "License"); you may not use this file except in compliance with
6 * the License. You may obtain a copy of the License at
7 *
8 * http://www.apache.org/licenses/LICENSE-2.0
9 *
10 * Unless required by applicable law or agreed to in writing, software
11 * distributed under the License is distributed on an "AS IS" BASIS,
12 * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
13 * See the License for the specific language governing permissions and
14 * limitations under the License.
15 */
16 package org.apache.tika.server;
17
18 import java.util.ArrayList;
19 import java.util.Collections;
20 import java.util.Comparator;
21 import java.util.HashMap;
22 import java.util.HashSet;
23 import java.util.List;
24 import java.util.Map;
25 import java.util.Set;
26
27 import javax.ws.rs.GET;
28 import javax.ws.rs.Path;
29 import javax.ws.rs.Produces;
30
31 import org.apache.tika.config.TikaConfig;
32 import org.apache.tika.mime.MediaType;
33 import org.apache.tika.parser.CompositeParser;
34 import org.apache.tika.parser.ParseContext;
35 import org.apache.tika.parser.Parser;
36 import org.apache.tika.parser.ParserDecorator;
37 import org.eclipse.jetty.util.ajax.JSON;
38
39 /**
40 * <p>Provides details of all the {@link Parser}s registered with
41 * Apache Tika, similar to <em>--list-parsers</em> and
42 * <em>--list-parser-details</em> within the Tika CLI.
43 */
44 @Path("/parsers")
45 public class TikaParsers {
46 private static final ParseContext EMPTY_PC = new ParseContext();
47 private TikaConfig tika;
48 private HTMLHelper html;
49
50 public TikaParsers(TikaConfig tika) {
51 this.tika = tika;
52 this.html = new HTMLHelper();
53 }
54
55 @GET
56 @Path("/details")
57 @Produces("text/html")
58 public String getParserDetailsHTML() {
59 return getParsersHTML(true);
60 }
61 @GET
62 @Produces("text/html")
63 public String getParsersHTML() {
64 return getParsersHTML(false);
65 }
66 protected String getParsersHTML(boolean withMimeTypes) {
67 ParserDetails p = new ParserDetails(tika.getParser());
68
69 StringBuffer h = new StringBuffer();
70 html.generateHeader(h, "Parsers available to Apache Tika");
71 parserAsHTML(p, withMimeTypes, h, 2);
72 html.generateFooter(h);
73 return h.toString();
74 }
75 private void parserAsHTML(ParserDetails p, boolean withMimeTypes, StringBuffer html, int level) {
76 html.append("<h");
77 html.append(level);
78 html.append(">");
79 html.append(p.shortName);
80 html.append("</h");
81 html.append(level);
82 html.append(">");
83 html.append("<p>Class: ");
84 html.append(p.className);
85 html.append("</p>");
86 if (p.isDecorated) {
87 html.append("<p>Decorated Parser</p>");
88 }
89 if (p.isComposite) {
90 html.append("<p>Composite Parser</p>");
91 for (Parser cp : p.childParsers) {
92 parserAsHTML(new ParserDetails(cp), withMimeTypes, html, level+1);
93 }
94 } else if (withMimeTypes) {
95 html.append("<p>Mime Types:");
96 html.append("<ul>");
97 for (MediaType mt : p.supportedTypes) {
98 html.append("<li>");
99 html.append(mt.toString());
100 html.append("</li>");
101 }
102 html.append("</ul>");
103 html.append("</p>");
104 }
105 }
106
107 @GET
108 @Path("/details")
109 @Produces(javax.ws.rs.core.MediaType.APPLICATION_JSON)
110 public String getParserDetailsJSON() {
111 return getParsersJSON(true);
112 }
113 @GET
114 @Produces(javax.ws.rs.core.MediaType.APPLICATION_JSON)
115 public String getParsersJSON() {
116 return getParsersJSON(false);
117 }
118 protected String getParsersJSON(boolean withMimeTypes) {
119 Map<String,Object> details = new HashMap<String, Object>();
120 parserAsMap(new ParserDetails(tika.getParser()), withMimeTypes, details);
121 return JSON.toString(details);
122 }
123 private void parserAsMap(ParserDetails p, boolean withMimeTypes, Map<String, Object> details) {
124 details.put("name", p.className);
125 details.put("composite", p.isComposite);
126 details.put("decorated", p.isDecorated);
127
128 if (p.isComposite) {
129 List<Map<String, Object>> c = new ArrayList<Map<String,Object>>();
130 for (Parser cp : p.childParsers) {
131 Map<String,Object> cdet = new HashMap<String, Object>();
132 parserAsMap(new ParserDetails(cp), withMimeTypes, cdet);
133 c.add(cdet);
134 }
135 details.put("children", c);
136 } else if (withMimeTypes) {
137 List<String> mts = new ArrayList<String>(p.supportedTypes.size());
138 for (MediaType mt : p.supportedTypes) {
139 mts.add(mt.toString());
140 }
141 details.put("supportedTypes", mts);
142 }
143 }
144
145 @GET
146 @Path("/details")
147 @Produces("text/plain")
148 public String getParserDetailssPlain() {
149 return getParsersPlain(true);
150 }
151 @GET
152 @Produces("text/plain")
153 public String getParsersPlain() {
154 return getParsersPlain(false);
155 }
156 protected String getParsersPlain(boolean withMimeTypes) {
157 StringBuffer text = new StringBuffer();
158 renderParser(new ParserDetails(tika.getParser()), withMimeTypes, text, "");
159 return text.toString();
160 }
161 private void renderParser(ParserDetails p, boolean withMimeTypes, StringBuffer text, String indent) {
162 String nextIndent = indent + " ";
163
164 text.append(indent);
165 text.append(p.className);
166 if (p.isDecorated) {
167 text.append(" (Decorated Parser)");
168 }
169 if (p.isComposite) {
170 text.append(" (Composite Parser):\n");
171
172 for (Parser cp : p.childParsers) {
173 renderParser(new ParserDetails(cp), withMimeTypes, text, nextIndent);
174 }
175 } else {
176 text.append("\n");
177 if (withMimeTypes) {
178 for (MediaType mt : p.supportedTypes) {
179 text.append(nextIndent);
180 text.append("Supports: ");
181 text.append(mt.toString());
182 text.append("\n");
183 }
184 }
185 }
186 }
187
188 private static class ParserDetails {
189 private String className;
190 private String shortName;
191 private boolean isComposite;
192 private boolean isDecorated;
193 private Set<MediaType> supportedTypes;
194 private List<Parser> childParsers;
195
196 private ParserDetails(Parser p) {
197 if (p instanceof ParserDecorator) {
198 isDecorated = true;
199 p = ((ParserDecorator)p).getWrappedParser();
200 }
201
202 className = p.getClass().getName();
203 shortName = className.substring(className.lastIndexOf('.')+1);
204
205 if (p instanceof CompositeParser) {
206 isComposite = true;
207 supportedTypes = Collections.emptySet();
208
209 // Get the unique set of child parsers
210 Set<Parser> children = new HashSet<Parser>(
211 ((CompositeParser)p).getParsers(EMPTY_PC).values());
212 // Sort it by class name
213 childParsers = new ArrayList<Parser>(children);
214 Collections.sort(childParsers, new Comparator<Parser>() {
215 @Override
216 public int compare(Parser p1, Parser p2) {
217 return p1.getClass().getName().compareTo(p2.getClass().getName());
218 }
219 });
220 } else {
221 supportedTypes = p.getSupportedTypes(EMPTY_PC);
222 }
223 }
224 }
225 }
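`ParserDetails` derives a display name by taking everything after the last dot of the class name, and orders child parsers with a `Comparator` on the full class name. Both rules, sketched standalone on plain strings rather than `Parser` instances:

```java
import java.util.Arrays;
import java.util.Collections;
import java.util.Comparator;
import java.util.List;

public class ParserNameDemo {
    // Mirrors ParserDetails.shortName:
    // "org.apache.tika.parser.pdf.PDFParser" -> "PDFParser".
    // (lastIndexOf returns -1 for unqualified names, so they pass through.)
    static String shortName(String className) {
        return className.substring(className.lastIndexOf('.') + 1);
    }

    public static void main(String[] args) {
        List<String> names = Arrays.asList(
                "org.apache.tika.parser.txt.TXTParser",
                "org.apache.tika.parser.html.HtmlParser");
        // Same ordering rule as the Comparator<Parser> in ParserDetails.
        Collections.sort(names, new Comparator<String>() {
            @Override
            public int compare(String a, String b) {
                return a.compareTo(b);
            }
        });
        System.out.println(shortName(names.get(0))); // prints "HtmlParser"
    }
}
```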
tika-server/src/main/java/org/apache/tika/server/TikaResource.java (0 additions, 349 deletions)
0 /*
1 * Licensed to the Apache Software Foundation (ASF) under one or more
2 * contributor license agreements. See the NOTICE file distributed with
3 * this work for additional information regarding copyright ownership.
4 * The ASF licenses this file to You under the Apache License, Version 2.0
5 * (the "License"); you may not use this file except in compliance with
6 * the License. You may obtain a copy of the License at
7 *
8 * http://www.apache.org/licenses/LICENSE-2.0
9 *
10 * Unless required by applicable law or agreed to in writing, software
11 * distributed under the License is distributed on an "AS IS" BASIS,
12 * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
13 * See the License for the specific language governing permissions and
14 * limitations under the License.
15 */
16
17 package org.apache.tika.server;
18
19 import java.io.IOException;
20 import java.io.InputStream;
21 import java.io.OutputStream;
22 import java.io.OutputStreamWriter;
23 import java.io.Writer;
24 import java.util.Map;
25 import java.util.Set;
26
27 import javax.mail.internet.ContentDisposition;
28 import javax.mail.internet.ParseException;
29 import javax.ws.rs.Consumes;
30 import javax.ws.rs.GET;
31 import javax.ws.rs.PUT;
32 import javax.ws.rs.Path;
33 import javax.ws.rs.Produces;
34 import javax.ws.rs.WebApplicationException;
35 import javax.ws.rs.core.Context;
36 import javax.ws.rs.core.HttpHeaders;
37 import javax.ws.rs.core.MultivaluedMap;
38 import javax.ws.rs.core.Response;
39 import javax.ws.rs.core.StreamingOutput;
40 import javax.ws.rs.core.UriInfo;
41 import javax.xml.transform.OutputKeys;
42 import javax.xml.transform.TransformerConfigurationException;
43 import javax.xml.transform.sax.SAXTransformerFactory;
44 import javax.xml.transform.sax.TransformerHandler;
45 import javax.xml.transform.stream.StreamResult;
46
47 import org.apache.commons.logging.Log;
48 import org.apache.commons.logging.LogFactory;
49 import org.apache.cxf.jaxrs.ext.multipart.Attachment;
50 import org.apache.poi.extractor.ExtractorFactory;
51 import org.apache.poi.hwpf.OldWordFileFormatException;
52 import org.apache.tika.config.TikaConfig;
53 import org.apache.tika.detect.Detector;
54 import org.apache.tika.exception.EncryptedDocumentException;
55 import org.apache.tika.exception.TikaException;
56 import org.apache.tika.io.TikaInputStream;
57 import org.apache.tika.metadata.Metadata;
58 import org.apache.tika.metadata.TikaMetadataKeys;
59 import org.apache.tika.mime.MediaType;
60 import org.apache.tika.parser.AutoDetectParser;
61 import org.apache.tika.parser.ParseContext;
62 import org.apache.tika.parser.Parser;
63 import org.apache.tika.parser.html.HtmlParser;
64 import org.apache.tika.sax.BodyContentHandler;
65 import org.apache.tika.sax.ExpandedTitleContentHandler;
66 import org.xml.sax.ContentHandler;
67 import org.xml.sax.SAXException;
68
69 @Path("/tika")
70 public class TikaResource {
71 public static final String GREETING = "This is Tika Server. Please PUT\n";
72 private final Log logger = LogFactory.getLog(TikaResource.class);
73
74 private TikaConfig tikaConfig;
75 public TikaResource(TikaConfig tikaConfig) {
76 this.tikaConfig = tikaConfig;
77 }
78
79 static {
80 ExtractorFactory.setAllThreadsPreferEventExtractors(true);
81 }
82
83 @GET
84 @Produces("text/plain")
85 public String getMessage() {
86 return GREETING;
87 }
88
89 @SuppressWarnings("serial")
90 public static AutoDetectParser createParser(TikaConfig tikaConfig) {
91 final AutoDetectParser parser = new AutoDetectParser(tikaConfig);
92
93 Map<MediaType,Parser> parsers = parser.getParsers();
94 parsers.put(MediaType.APPLICATION_XML, new HtmlParser());
95 parser.setParsers(parsers);
96
97 parser.setFallback(new Parser() {
98 public Set<MediaType> getSupportedTypes(ParseContext parseContext) {
99 return parser.getSupportedTypes(parseContext);
100 }
101
102 public void parse(InputStream inputStream, ContentHandler contentHandler, Metadata metadata, ParseContext parseContext) {
103 throw new WebApplicationException(Response.Status.UNSUPPORTED_MEDIA_TYPE);
104 }
105 });
106
107 return parser;
108 }
109
110 public static String detectFilename(MultivaluedMap<String, String> httpHeaders) {
111
112 String disposition = httpHeaders.getFirst("Content-Disposition");
113 if (disposition != null) {
114 try {
115 ContentDisposition c = new ContentDisposition(disposition);
116
117 // only support "attachment" dispositions
118 if ("attachment".equals(c.getDisposition())) {
119 String fn = c.getParameter("filename");
120 if (fn != null) {
121 return fn;
122 }
123 }
124 } catch (ParseException e) {
125 // not a valid content-disposition field
126 }
127 }
128
129 // this really should not be used, since it's not an official field
130 return httpHeaders.getFirst("File-Name");
131 }
132
133 @SuppressWarnings("serial")
134 public static void fillMetadata(AutoDetectParser parser, Metadata metadata, MultivaluedMap<String, String> httpHeaders) {
135 String fileName = detectFilename(httpHeaders);
136 if (fileName != null) {
137 metadata.set(TikaMetadataKeys.RESOURCE_NAME_KEY, fileName);
138 }
139
140 String contentTypeHeader = httpHeaders.getFirst(HttpHeaders.CONTENT_TYPE);
141 javax.ws.rs.core.MediaType mediaType = contentTypeHeader == null ? null
142 : javax.ws.rs.core.MediaType.valueOf(contentTypeHeader);
143 if (mediaType!=null && "xml".equals(mediaType.getSubtype()) ) {
144 mediaType = null;
145 }
146
147 if (mediaType !=null && mediaType.equals(javax.ws.rs.core.MediaType.APPLICATION_OCTET_STREAM_TYPE)) {
148 mediaType = null;
149 }
150
151 if (mediaType !=null) {
152 metadata.add(org.apache.tika.metadata.HttpHeaders.CONTENT_TYPE, mediaType.toString());
153
154 final Detector detector = parser.getDetector();
155
156 parser.setDetector(new Detector() {
157 public MediaType detect(InputStream inputStream, Metadata metadata) throws IOException {
158 String ct = metadata.get(org.apache.tika.metadata.HttpHeaders.CONTENT_TYPE);
159
160 if (ct!=null) {
161 return MediaType.parse(ct);
162 } else {
163 return detector.detect(inputStream, metadata);
164 }
165 }
166 });
167 }
168 }
169
170 @PUT
171 @Consumes("multipart/form-data")
172 @Produces("text/plain")
173 @Path("form")
174 public StreamingOutput getTextFromMultipart(Attachment att, @Context final UriInfo info) {
175 return produceText(att.getObject(InputStream.class), att.getHeaders(), info);
176 }
177
178 @PUT
179 @Consumes("*/*")
180 @Produces("text/plain")
181 public StreamingOutput getText(final InputStream is, @Context HttpHeaders httpHeaders, @Context final UriInfo info) {
182 return produceText(is, httpHeaders.getRequestHeaders(), info);
183 }
184 public StreamingOutput produceText(final InputStream is, MultivaluedMap<String, String> httpHeaders, final UriInfo info) {
185 final AutoDetectParser parser = createParser(tikaConfig);
186 final Metadata metadata = new Metadata();
187
188 fillMetadata(parser, metadata, httpHeaders);
189
190 logRequest(logger, info, metadata);
191
192 return new StreamingOutput() {
193 public void write(OutputStream outputStream) throws IOException, WebApplicationException {
194 Writer writer = new OutputStreamWriter(outputStream, "UTF-8");
195
196 BodyContentHandler body = new BodyContentHandler(new RichTextContentHandler(writer));
197
198 TikaInputStream tis = TikaInputStream.get(is);
199
200 try {
201 parser.parse(tis, body, metadata);
202 } catch (SAXException e) {
203 throw new WebApplicationException(e);
204 } catch (EncryptedDocumentException e) {
205 logger.warn(String.format(
206 "%s: Encrypted document",
207 info.getPath()
208 ), e);
209
210 throw new WebApplicationException(e, Response.status(422).build());
211 } catch (TikaException e) {
212 logger.warn(String.format(
213 "%s: Text extraction failed",
214 info.getPath()
215 ), e);
216
217 if (e.getCause()!=null && e.getCause() instanceof WebApplicationException) {
218 throw (WebApplicationException) e.getCause();
219 }
220
221 if (e.getCause()!=null && e.getCause() instanceof IllegalStateException) {
222 throw new WebApplicationException(Response.status(422).build());
223 }
224
225 if (e.getCause()!=null && e.getCause() instanceof OldWordFileFormatException) {
226 throw new WebApplicationException(Response.status(422).build());
227 }
228
229 throw new WebApplicationException(Response.Status.INTERNAL_SERVER_ERROR);
230 } finally {
231 tis.close();
232 }
233 }
234 };
235 }
236
237 @PUT
238 @Consumes("multipart/form-data")
239 @Produces("text/html")
240 @Path("form")
241 public StreamingOutput getHTMLFromMultipart(Attachment att, @Context final UriInfo info) {
242 return produceOutput(att.getObject(InputStream.class), att.getHeaders(), info, "html");
243 }
244
245 @PUT
246 @Consumes("*/*")
247 @Produces("text/html")
248 public StreamingOutput getHTML(final InputStream is, @Context HttpHeaders httpHeaders, @Context final UriInfo info) {
249 return produceOutput(is, httpHeaders.getRequestHeaders(), info, "html");
250 }
251
252 @PUT
253 @Consumes("multipart/form-data")
254 @Produces("text/xml")
255 @Path("form")
256 public StreamingOutput getXMLFromMultipart(Attachment att, @Context final UriInfo info) {
257 return produceOutput(att.getObject(InputStream.class), att.getHeaders(), info, "xml");
258 }
259
260 @PUT
261 @Consumes("*/*")
262 @Produces("text/xml")
263 public StreamingOutput getXML(final InputStream is, @Context HttpHeaders httpHeaders, @Context final UriInfo info) {
264 return produceOutput(is, httpHeaders.getRequestHeaders(), info, "xml");
265 }
266
267 private StreamingOutput produceOutput(final InputStream is, final MultivaluedMap<String, String> httpHeaders,
268 final UriInfo info, final String format) {
269 final AutoDetectParser parser = createParser(tikaConfig);
270 final Metadata metadata = new Metadata();
271
272 fillMetadata(parser, metadata, httpHeaders);
273
274 logRequest(logger, info, metadata);
275
276 return new StreamingOutput() {
277 public void write(OutputStream outputStream)
278 throws IOException, WebApplicationException {
279 Writer writer = new OutputStreamWriter(outputStream, "UTF-8");
280 ContentHandler content;
281
282 try {
283 SAXTransformerFactory factory = (SAXTransformerFactory)SAXTransformerFactory.newInstance( );
284 TransformerHandler handler = factory.newTransformerHandler( );
285 handler.getTransformer().setOutputProperty(OutputKeys.METHOD, format);
286 handler.getTransformer().setOutputProperty(OutputKeys.INDENT, "yes");
287 handler.getTransformer().setOutputProperty(OutputKeys.ENCODING, "UTF-8");
288 handler.setResult(new StreamResult(writer));
289 content = new ExpandedTitleContentHandler( handler );
290 }
291 catch ( TransformerConfigurationException e ) {
292 throw new WebApplicationException( e );
293 }
294
295 TikaInputStream tis = TikaInputStream.get(is);
296
297 try {
298 parser.parse(tis, content, metadata);
299 }
300 catch (SAXException e) {
301 throw new WebApplicationException(e);
302 }
303 catch (EncryptedDocumentException e) {
304 logger.warn(String.format(
305 "%s: Encrypted document",
306 info.getPath()
307 ), e);
308 throw new WebApplicationException(e, Response.status(422).build());
309 }
310 catch (TikaException e) {
311 logger.warn(String.format(
312 "%s: Text extraction failed",
313 info.getPath()
314 ), e);
315
316 if (e.getCause()!=null && e.getCause() instanceof WebApplicationException)
317 throw (WebApplicationException) e.getCause();
318
319 if (e.getCause()!=null && e.getCause() instanceof IllegalStateException)
320 throw new WebApplicationException(Response.status(422).build());
321
322 if (e.getCause()!=null && e.getCause() instanceof OldWordFileFormatException)
323 throw new WebApplicationException(Response.status(422).build());
324
325 throw new WebApplicationException(Response.Status.INTERNAL_SERVER_ERROR);
326 }
327 finally {
328 tis.close();
329 }
330 }
331 };
332 }
333
334 public static void logRequest(Log logger, UriInfo info, Metadata metadata) {
335 if (metadata.get(org.apache.tika.metadata.HttpHeaders.CONTENT_TYPE)==null) {
336 logger.info(String.format(
337 "%s (autodetecting type)",
338 info.getPath()
339 ));
340 } else {
341 logger.info(String.format(
342 "%s (%s)",
343 info.getPath(),
344 metadata.get(org.apache.tika.metadata.HttpHeaders.CONTENT_TYPE)
345 ));
346 }
347 }
348 }
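`detectFilename()` extracts the filename from an attachment-style Content-Disposition header via javax.mail's `ContentDisposition`, falling back to the unofficial `File-Name` header. The decision it makes can be sketched with naive string handling; this parser is illustrative only, and the real code relies on javax.mail to handle quoting and parameter syntax robustly:

```java
public class FilenameDemo {
    // Naive sketch of the same rule: only honour "attachment" dispositions,
    // and only when a filename parameter is present.
    static String filenameFrom(String disposition) {
        if (disposition == null || !disposition.startsWith("attachment")) {
            return null;
        }
        for (String part : disposition.split(";")) {
            String p = part.trim();
            if (p.startsWith("filename=")) {
                return p.substring("filename=".length()).replace("\"", "");
            }
        }
        return null;
    }

    public static void main(String[] args) {
        System.out.println(filenameFrom("attachment; filename=\"report.pdf\""));
        System.out.println(filenameFrom("inline; filename=\"x.png\""));
    }
}
```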
  16   16
  17   17     package org.apache.tika.server;
  18   18
  19       -  import java.io.IOException;
  20   19     import java.util.ArrayList;
       20  +  import java.util.Arrays;
       21  +  import java.util.HashSet;
  21   22     import java.util.List;
  22       -  import java.util.Properties;
       23  +  import java.util.Set;
  23   24
  24   25     import org.apache.commons.cli.CommandLine;
  25   26     import org.apache.commons.cli.CommandLineParser;
  33   34     import org.apache.cxf.jaxrs.JAXRSServerFactoryBean;
  34   35     import org.apache.cxf.jaxrs.lifecycle.ResourceProvider;
  35   36     import org.apache.cxf.jaxrs.lifecycle.SingletonResourceProvider;
       37  +  import org.apache.cxf.rs.security.cors.CrossOriginResourceSharingFilter;
  36   38     import org.apache.tika.Tika;
  37   39     import org.apache.tika.config.TikaConfig;
       40  +  import org.apache.tika.server.resource.DetectorResource;
       41  +  import org.apache.tika.server.resource.MetadataResource;
       42  +  import org.apache.tika.server.resource.RecursiveMetadataResource;
       43  +  import org.apache.tika.server.writer.TarWriter;
       44  +  import org.apache.tika.server.resource.TikaDetectors;
       45  +  import org.apache.tika.server.resource.TikaMimeTypes;
       46  +  import org.apache.tika.server.resource.TikaParsers;
       47  +  import org.apache.tika.server.resource.TikaResource;
       48  +  import org.apache.tika.server.resource.TikaVersion;
       49  +  import org.apache.tika.server.resource.TikaWelcome;
       50  +  import org.apache.tika.server.resource.UnpackerResource;
       51  +  import org.apache.tika.server.writer.CSVMessageBodyWriter;
       52  +  import org.apache.tika.server.writer.JSONMessageBodyWriter;
       53  +  import org.apache.tika.server.writer.MetadataListMessageBodyWriter;
       54  +  import org.apache.tika.server.writer.TextMessageBodyWriter;
       55  +  import org.apache.tika.server.writer.XMPMessageBodyWriter;
       56  +  import org.apache.tika.server.writer.ZipWriter;
  38   57
  39   58     public class TikaServerCli {
  40       -      private static final Log logger = LogFactory.getLog(TikaServerCli.class);
  41       -      public static final int DEFAULT_PORT = 9998;
  42       -      public static final String DEFAULT_HOST = "localhost";
       59  +      public static final int DEFAULT_PORT = 9998;
       60  +      public static final String DEFAULT_HOST = "localhost";
       61  +      public static final Set<String> LOG_LEVELS =
       62  +              new HashSet<String>(Arrays.asList("debug", "info"));
       63  +      private static final Log logger = LogFactory.getLog(TikaServerCli.class);
  43   64
  44       -      private static Options getOptions() {
  45       -          Options options = new Options();
  46       -          options.addOption("h", "host", true, "host name (default = " + DEFAULT_HOST + ')');
  47       -          options.addOption("p", "port", true, "listen port (default = " + DEFAULT_PORT + ')');
  48       -          options.addOption("?", "help", false, "this help message");
       65  +      private static Options getOptions() {
       66  +          Options options = new Options();
       67  +          options.addOption("C", "cors", true, "origin allowed to make CORS requests (default=NONE)\nall allowed if \"all\"");
       68  +          options.addOption("h", "host", true, "host name (default = " + DEFAULT_HOST + ')');
       69  +          options.addOption("p", "port", true, "listen port (default = " + DEFAULT_PORT + ')');
       70  +          options.addOption("l", "log", true, "request URI log level ('debug' or 'info')");
       71  +          options.addOption("s", "includeStack", false, "whether or not to return a stack trace\nif there is an exception during 'parse'");
       72  +          options.addOption("?", "help", false, "this help message");
  49   73
  50       -          return options;
  51       -      }
       74  +          return options;
       75  +      }
  52   76
  53       -      public static void main(String[] args) {
  54       -
  55       -          logger.info("Starting " + new Tika().toString() + " server");
       77  +      public static void main(String[] args) {
  56   78
  57       -          try {
  58       -              Options options = getOptions();
       79  +          logger.info("Starting " + new Tika().toString() + " server");
  59   80
  60       -              CommandLineParser cliParser = new GnuParser();
  61       -              CommandLine line = cliParser.parse(options, args);
       81  +          try {
       82  +              Options options = getOptions();
  62   83
  63       -              if (line.hasOption("help")) {
  64       -                  HelpFormatter helpFormatter = new HelpFormatter();
  65       -                  helpFormatter.printHelp("tikaserver", options);
  66       -                  System.exit(-1);
  67       -              }
  68       -
  69       -              String host = DEFAULT_HOST;
       84  +              CommandLineParser cliParser = new GnuParser();
       85  +              CommandLine line = cliParser.parse(options, args);
  70   86
  71       -              if (line.hasOption("host")) {
  72       -                  host = line.getOptionValue("host");
  73       -              }
  74       -
  75       -              int port = DEFAULT_PORT;
       87  +              if (line.hasOption("help")) {
       88  +                  HelpFormatter helpFormatter = new HelpFormatter();
       89  +                  helpFormatter.printHelp("tikaserver", options);
       90  +                  System.exit(-1);
       91  +              }
  76   92
  77       -              if (line.hasOption("port")) {
  78       -                  port = Integer.valueOf(line.getOptionValue("port"));
  79       -              }
  80       -
  81       -              // The Tika Configuration to use throughout
  82       -              TikaConfig tika = TikaConfig.getDefaultConfig();
       93  +              String host = DEFAULT_HOST;
  83   94
  84       -              JAXRSServerFactoryBean sf = new JAXRSServerFactoryBean();
  85       -              // Note - at least one of these stops TikaWelcome matching on /
  86       -              // This prevents TikaWelcome acting as a partial solution to TIKA-1269
  87       -              sf.setResourceClasses(MetadataEP.class, MetadataResource.class, DetectorResource.class,
  88       -                      TikaResource.class, UnpackerResource.class,
  89       -                      TikaDetectors.class, TikaParsers.class,
  90       -                      TikaMimeTypes.class, TikaVersion.class,
  91       -                      TikaWelcome.class);
  92       -              // Use this one instead for the Welcome page to work
  93       -              /*
  94       -              sf.setResourceClasses(
  95       -                      // MetadataEP.class,
  96       -                      MetadataResource.class,
  97       -                      TikaResource.class,
  98       -                      // UnpackerResource.class,
  99       -                      TikaDetectors.class,
 100       -                      TikaMimeTypes.class,
 101       -                      TikaParsers.class,
 102       -                      TikaVersion.class,
 103       -                      TikaWelcome.class
 104       -              );
 105       -              */
       95  +              if (line.hasOption("host")) {
       96  +                  host = line.getOptionValue("host");
       97  +              }
 106   98
 107       -              List<Object> providers = new ArrayList<Object>();
 108       -              providers.add(new TarWriter());
 109       -              providers.add(new ZipWriter());
 110       -              providers.add(new CSVMessageBodyWriter());
 111       -              providers.add(new JSONMessageBodyWriter());
 112       -              providers.add(new TikaExceptionMapper());
 113       -              sf.setProviders(providers);
 114       -
 115       -              List<ResourceProvider> rProviders = new ArrayList<ResourceProvider>();
 116       -              rProviders.add(new SingletonResourceProvider(new MetadataResource(tika)));
 117       -              rProviders.add(new SingletonResourceProvider(new DetectorResource(tika)));
 118       -              rProviders.add(new SingletonResourceProvider(new TikaResource(tika)));
 119       -              rProviders.add(new SingletonResourceProvider(new UnpackerResource(tika)));
 120       -              rProviders.add(new SingletonResourceProvider(new TikaMimeTypes(tika)));
 121       -              rProviders.add(new SingletonResourceProvider(new TikaDetectors(tika)));
 122       -              rProviders.add(new SingletonResourceProvider(new TikaParsers(tika)));
 123       -              rProviders.add(new SingletonResourceProvider(new TikaVersion(tika)));
 124       -              rProviders.add(new SingletonResourceProvider(new TikaWelcome(tika, sf)));
125 sf.setResourceProviders(rProviders);
126
127 sf.setAddress("http://" + host + ":" + port + "/");
128 BindingFactoryManager manager = sf.getBus().getExtension(
129 BindingFactoryManager.class);
130 JAXRSBindingFactory factory = new JAXRSBindingFactory();
131 factory.setBus(sf.getBus());
132 manager.registerBindingFactory(JAXRSBindingFactory.JAXRS_BINDING_ID,
133 factory);
134 sf.create();
135 logger.info("Started");
136 } catch (Exception ex) {
137 logger.fatal("Can't start", ex);
138 System.exit(-1);
99 int port = DEFAULT_PORT;
100
101 if (line.hasOption("port")) {
102 port = Integer.valueOf(line.getOptionValue("port"));
103 }
104
105 boolean returnStackTrace = false;
106 if (line.hasOption("includeStack")) {
107 returnStackTrace = true;
108 }
109
110 TikaLoggingFilter logFilter = null;
111 if (line.hasOption("log")) {
112 String logLevel = line.getOptionValue("log");
113 if (LOG_LEVELS.contains(logLevel)) {
114 boolean isInfoLevel = "info".equals(logLevel);
115 logFilter = new TikaLoggingFilter(isInfoLevel);
116 } else {
117 logger.info("Unsupported request URI log level: " + logLevel);
118 }
119 }
120
121 CrossOriginResourceSharingFilter corsFilter = null;
122 if (line.hasOption("cors")) {
123 corsFilter = new CrossOriginResourceSharingFilter();
124 String url = line.getOptionValue("cors");
125 List<String> origins = new ArrayList<String>();
126 if (!url.equals("*")) origins.add(url); // Empty list allows all origins.
127 corsFilter.setAllowOrigins(origins);
128 }
129
130 // The Tika Configuration to use throughout
131 TikaConfig tika = TikaConfig.getDefaultConfig();
132
133 JAXRSServerFactoryBean sf = new JAXRSServerFactoryBean();
134
135 List<ResourceProvider> rCoreProviders = new ArrayList<ResourceProvider>();
136 rCoreProviders.add(new SingletonResourceProvider(new MetadataResource(tika)));
137 rCoreProviders.add(new SingletonResourceProvider(new RecursiveMetadataResource(tika)));
138 rCoreProviders.add(new SingletonResourceProvider(new DetectorResource(tika)));
139 rCoreProviders.add(new SingletonResourceProvider(new TikaResource(tika)));
140 rCoreProviders.add(new SingletonResourceProvider(new UnpackerResource(tika)));
141 rCoreProviders.add(new SingletonResourceProvider(new TikaMimeTypes(tika)));
142 rCoreProviders.add(new SingletonResourceProvider(new TikaDetectors(tika)));
143 rCoreProviders.add(new SingletonResourceProvider(new TikaParsers(tika)));
144 rCoreProviders.add(new SingletonResourceProvider(new TikaVersion(tika)));
145 List<ResourceProvider> rAllProviders = new ArrayList<ResourceProvider>(rCoreProviders);
146 rAllProviders.add(new SingletonResourceProvider(new TikaWelcome(tika, rCoreProviders)));
147 sf.setResourceProviders(rAllProviders);
148
149 List<Object> providers = new ArrayList<Object>();
150 providers.add(new TarWriter());
151 providers.add(new ZipWriter());
152 providers.add(new CSVMessageBodyWriter());
153 providers.add(new MetadataListMessageBodyWriter());
154 providers.add(new JSONMessageBodyWriter());
155 providers.add(new XMPMessageBodyWriter());
156 providers.add(new TextMessageBodyWriter());
157 providers.add(new TikaServerParseExceptionMapper(returnStackTrace));
158 if (logFilter != null) {
159 providers.add(logFilter);
160 }
161 if (corsFilter != null) {
162 providers.add(corsFilter);
163 }
164 sf.setProviders(providers);
165
166 sf.setAddress("http://" + host + ":" + port + "/");
167 BindingFactoryManager manager = sf.getBus().getExtension(
168 BindingFactoryManager.class);
169 JAXRSBindingFactory factory = new JAXRSBindingFactory();
170 factory.setBus(sf.getBus());
171 manager.registerBindingFactory(JAXRSBindingFactory.JAXRS_BINDING_ID,
172 factory);
173 sf.create();
174 logger.info("Started");
175 } catch (Exception ex) {
176 logger.fatal("Can't start", ex);
177 System.exit(-1);
178 }
139179 }
140 }
141180 }
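The new "--cors" handling in main() above leaves the allow-origins list empty when the option value is "*", relying on CXF's convention that an empty allow-origins list permits every origin. A minimal, self-contained sketch of that semantics (buildAllowOrigins and isOriginAllowed are illustrative helpers invented for this demo, not Tika or CXF API):

```java
import java.util.ArrayList;
import java.util.List;

public class CorsOriginSketch {
    /** Builds the allow list the way TikaServerCli does: "*" yields an empty list. */
    static List<String> buildAllowOrigins(String optionValue) {
        List<String> origins = new ArrayList<String>();
        if (!optionValue.equals("*")) {
            origins.add(optionValue); // empty list allows all origins
        }
        return origins;
    }

    /** Hypothetical check mirroring the "empty list = allow all" convention. */
    static boolean isOriginAllowed(List<String> allowOrigins, String origin) {
        return allowOrigins.isEmpty() || allowOrigins.contains(origin);
    }

    public static void main(String[] args) {
        System.out.println(isOriginAllowed(buildAllowOrigins("*"), "http://example.org"));               // true
        System.out.println(isOriginAllowed(buildAllowOrigins("http://a.example"), "http://a.example"));  // true
        System.out.println(isOriginAllowed(buildAllowOrigins("http://a.example"), "http://b.example"));  // false
    }
}
```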
0 package org.apache.tika.server;
1
2 /*
3 * Licensed to the Apache Software Foundation (ASF) under one or more
4 * contributor license agreements. See the NOTICE file distributed with
5 * this work for additional information regarding copyright ownership.
6 * The ASF licenses this file to You under the Apache License, Version 2.0
7 * (the "License"); you may not use this file except in compliance with
8 * the License. You may obtain a copy of the License at
9 *
10 * http://www.apache.org/licenses/LICENSE-2.0
11 *
12 * Unless required by applicable law or agreed to in writing, software
13 * distributed under the License is distributed on an "AS IS" BASIS,
14 * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
15 * See the License for the specific language governing permissions and
16 * limitations under the License.
17 */
18
19 import javax.ws.rs.WebApplicationException;
20
21 /**
22 * Simple wrapper exception to be thrown for consistent handling
23 * of exceptions that can happen during a parse.
24 */
25 public class TikaServerParseException extends WebApplicationException {
26
27 public TikaServerParseException(String msg) {
28 super(msg);
29 }
30
31 public TikaServerParseException(Exception e) {
32 super(e);
33 }
34 }
0 package org.apache.tika.server;
1 /*
2 * Licensed to the Apache Software Foundation (ASF) under one or more
3 * contributor license agreements. See the NOTICE file distributed with
4 * this work for additional information regarding copyright ownership.
5 * The ASF licenses this file to You under the Apache License, Version 2.0
6 * (the "License"); you may not use this file except in compliance with
7 * the License. You may obtain a copy of the License at
8 *
9 * http://www.apache.org/licenses/LICENSE-2.0
10 *
11 * Unless required by applicable law or agreed to in writing, software
12 * distributed under the License is distributed on an "AS IS" BASIS,
13 * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
14 * See the License for the specific language governing permissions and
15 * limitations under the License.
16 */
17
18 import javax.ws.rs.WebApplicationException;
19 import javax.ws.rs.core.Response;
20 import javax.ws.rs.ext.ExceptionMapper;
21 import javax.ws.rs.ext.Provider;
22
23
24 import java.io.IOException;
25 import java.io.PrintWriter;
26 import java.io.StringWriter;
27 import java.io.Writer;
28
29 import org.apache.poi.hwpf.OldWordFileFormatException;
30 import org.apache.tika.exception.EncryptedDocumentException;
31 import org.apache.tika.exception.TikaException;
32
33 @Provider
34 public class TikaServerParseExceptionMapper implements ExceptionMapper<TikaServerParseException> {
35
36 private final boolean returnStack;
37
38 public TikaServerParseExceptionMapper(boolean returnStack) {
39 this.returnStack = returnStack;
40 }
41
42 public Response toResponse(TikaServerParseException e) {
43 if (e.getMessage().equals(Response.Status.UNSUPPORTED_MEDIA_TYPE.toString())) {
44 return buildResponse(e, 415);
45 }
46 Throwable cause = e.getCause();
47 if (cause == null) {
48 return buildResponse(e, Response.Status.INTERNAL_SERVER_ERROR.getStatusCode());
49 } else {
50 if (cause instanceof EncryptedDocumentException) {
51 return buildResponse(cause, 422);
52 } else if (cause instanceof TikaException) {
53 //unsupported media type
54 Throwable causeOfCause = cause.getCause();
55 if (causeOfCause instanceof WebApplicationException) {
56 return ((WebApplicationException) causeOfCause).getResponse();
57 }
58 return buildResponse(cause, 422);
59 } else if (cause instanceof IllegalStateException) {
60 return buildResponse(cause, 422);
61 } else if (cause instanceof OldWordFileFormatException) {
62 return buildResponse(cause, 422);
63 } else if (cause instanceof WebApplicationException) {
64 return ((WebApplicationException) cause).getResponse();
65 } else {
66 return buildResponse(e, 500);
67 }
68 }
69 }
70
71 private Response buildResponse(Throwable cause, int i) {
72 if (returnStack && cause != null) {
73 Writer result = new StringWriter();
74 PrintWriter writer = new PrintWriter(result);
75 cause.printStackTrace(writer);
76 writer.flush();
77 try {
78 result.flush();
79 } catch (IOException e) {
80 //something went seriously wrong
81 return Response.status(500).build();
82 }
83 return Response.status(i).entity(result.toString()).type("text/plain").build();
84 } else {
85 return Response.status(i).build();
86 }
87 }
88 }
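TikaServerParseExceptionMapper above folds parse failures into HTTP statuses: encrypted documents, TikaExceptions, IllegalStateException and old Word formats map to 422 (Unprocessable Entity), a missing or unrecognized cause to 500. The dispatch shape can be sketched with JDK-only types (statusFor is a hypothetical reduction using IllegalStateException as the only 422 case; the real mapper also handles the Tika/POI exceptions and unwraps nested WebApplicationExceptions):

```java
public class StatusMapperSketch {
    /** Maps a parse-failure cause to an HTTP status, mirroring the mapper's shape. */
    static int statusFor(Throwable cause) {
        if (cause == null) {
            return 500; // no cause: internal server error
        }
        if (cause instanceof IllegalStateException) {
            return 422; // unprocessable entity, as for TikaException et al.
        }
        return 500; // anything unrecognized stays a server error
    }

    public static void main(String[] args) {
        System.out.println(statusFor(new IllegalStateException("bad state"))); // 422
        System.out.println(statusFor(new RuntimeException("boom")));           // 500
        System.out.println(statusFor(null));                                   // 500
    }
}
```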
+0 -38 tika-server/src/main/java/org/apache/tika/server/TikaVersion.java
0 /*
1 * Licensed to the Apache Software Foundation (ASF) under one or more
2 * contributor license agreements. See the NOTICE file distributed with
3 * this work for additional information regarding copyright ownership.
4 * The ASF licenses this file to You under the Apache License, Version 2.0
5 * (the "License"); you may not use this file except in compliance with
6 * the License. You may obtain a copy of the License at
7 *
8 * http://www.apache.org/licenses/LICENSE-2.0
9 *
10 * Unless required by applicable law or agreed to in writing, software
11 * distributed under the License is distributed on an "AS IS" BASIS,
12 * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
13 * See the License for the specific language governing permissions and
14 * limitations under the License.
15 */
16 package org.apache.tika.server;
17
18 import javax.ws.rs.GET;
19 import javax.ws.rs.Path;
20 import javax.ws.rs.Produces;
21
22 import org.apache.tika.Tika;
23 import org.apache.tika.config.TikaConfig;
24
25 @Path("/version")
26 public class TikaVersion {
27 private Tika tika;
28 public TikaVersion(TikaConfig tika) {
29 this.tika = new Tika(tika);
30 }
31
32 @GET
33 @Produces("text/plain")
34 public String getVersion() {
35 return tika.toString();
36 }
37 }
+0 -203 tika-server/src/main/java/org/apache/tika/server/TikaWelcome.java
0 /*
1 * Licensed to the Apache Software Foundation (ASF) under one or more
2 * contributor license agreements. See the NOTICE file distributed with
3 * this work for additional information regarding copyright ownership.
4 * The ASF licenses this file to You under the Apache License, Version 2.0
5 * (the "License"); you may not use this file except in compliance with
6 * the License. You may obtain a copy of the License at
7 *
8 * http://www.apache.org/licenses/LICENSE-2.0
9 *
10 * Unless required by applicable law or agreed to in writing, software
11 * distributed under the License is distributed on an "AS IS" BASIS,
12 * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
13 * See the License for the specific language governing permissions and
14 * limitations under the License.
15 */
16 package org.apache.tika.server;
17
18 import java.lang.annotation.Annotation;
19 import java.lang.reflect.Method;
20 import java.util.ArrayList;
21 import java.util.Arrays;
22 import java.util.Collections;
23 import java.util.Comparator;
24 import java.util.HashMap;
25 import java.util.List;
26 import java.util.Map;
27
28 import javax.ws.rs.DELETE;
29 import javax.ws.rs.GET;
30 import javax.ws.rs.HEAD;
31 import javax.ws.rs.OPTIONS;
32 import javax.ws.rs.POST;
33 import javax.ws.rs.PUT;
34 import javax.ws.rs.Path;
35 import javax.ws.rs.Produces;
36
37 import org.apache.cxf.jaxrs.JAXRSServerFactoryBean;
38 import org.apache.tika.Tika;
39 import org.apache.tika.config.TikaConfig;
40
41 /**
42 * <p>Provides a basic welcome to the Apache Tika Server.</p>
43 */
44 @Path("/")
45 public class TikaWelcome {
46 private static final String DOCS_URL = "https://wiki.apache.org/tika/TikaJAXRS";
47
48 private static final Map<Class<? extends Annotation>, String> HTTP_METHODS =
49 new HashMap<Class<? extends Annotation>, String>();
50 static {
51 HTTP_METHODS.put(DELETE.class , "DELETE");
52 HTTP_METHODS.put(GET.class, "GET");
53 HTTP_METHODS.put(HEAD.class, "HEAD");
54 HTTP_METHODS.put(OPTIONS.class, "OPTIONS");
55 HTTP_METHODS.put(POST.class, "POST");
56 HTTP_METHODS.put(PUT.class, "PUT");
57 }
58
59 private Tika tika;
60 private HTMLHelper html;
61 private List<Class<?>> endpoints;
62
63 public TikaWelcome(TikaConfig tika, JAXRSServerFactoryBean sf) {
64 this.tika = new Tika(tika);
65 this.html = new HTMLHelper();
66 this.endpoints = sf.getResourceClasses();
67 }
68
69 protected List<Endpoint> identifyEndpoints() {
70 List<Endpoint> found = new ArrayList<Endpoint>();
71 for (Class<?> endpoint : endpoints) {
72 Path p = endpoint.getAnnotation(Path.class);
73 String basePath = null;
74 if (p != null)
75 basePath = p.value();
76
77 for (Method m : endpoint.getMethods()) {
78 String httpMethod = null;
79 String methodPath = null;
80 String[] produces = null;
81
82 for (Annotation a : m.getAnnotations()) {
83 for (Class<? extends Annotation> httpMethAnn : HTTP_METHODS.keySet()) {
84 if (httpMethAnn.isInstance(a)) {
85 httpMethod = HTTP_METHODS.get(httpMethAnn);
86 }
87 }
88 if (a instanceof Path) {
89 methodPath = ((Path)a).value();
90 }
91 if (a instanceof Produces) {
92 produces = ((Produces)a).value();
93 }
94 }
95
96 if (httpMethod != null) {
97 String mPath = basePath;
98 if (mPath == null) {
99 mPath = "";
100 }
101 if (methodPath != null) {
102 mPath += methodPath;
103 }
104 if (produces == null) {
105 produces = new String[0];
106 }
107 found.add(new Endpoint(endpoint, m, mPath, httpMethod, produces));
108 }
109 }
110 }
111 Collections.sort(found, new Comparator<Endpoint>() {
112 @Override
113 public int compare(Endpoint e1, Endpoint e2) {
114 int res = e1.path.compareTo(e2.path);
115 if (res == 0) {
116 res = e1.methodName.compareTo(e2.methodName);
117 }
118 return res;
119 }
120 });
121 return found;
122 }
123
124 @GET
125 @Produces("text/html")
126 public String getWelcomeHTML() {
127 StringBuffer h = new StringBuffer();
128 html.generateHeader(h, "Welcome to the " + tika.toString() + " Server");
129
130 h.append("<p>For endpoints, please see <a href=\"");
131 h.append(DOCS_URL);
132 h.append("\">");
133 h.append(DOCS_URL);
134 h.append("</a></p>\n");
135
136 h.append("<ul>\n");
137 for (Endpoint e : identifyEndpoints()) {
138 h.append("<li><b>");
139 h.append(e.httpMethod);
140 h.append("</b> <i><a href=\"");
141 h.append(e.path);
142 h.append("\">");
143 h.append(e.path);
144 h.append("</a></i><br />");
145 h.append("Class: ");
146 h.append(e.className);
147 h.append("<br />Method: ");
148 h.append(e.methodName);
149 for (String produces : e.produces) {
150 h.append("<br />Produces: ");
151 h.append(produces);
152 }
153 h.append("</li>\n");
154 }
155 h.append("</ul>\n");
156
157 html.generateFooter(h);
158 return h.toString();
159 }
160
161 @GET
162 @Produces("text/plain")
163 public String getWelcomePlain() {
164 StringBuffer text = new StringBuffer();
165
166 text.append(tika.toString());
167 text.append("\n");
168 text.append("For endpoints, please see ");
169 text.append(DOCS_URL);
170 text.append("\n\n");
171
172 for (Endpoint e : identifyEndpoints()) {
173 text.append(e.httpMethod);
174 text.append(" ");
175 text.append(e.path);
176 text.append("\n");
177 for (String produces : e.produces) {
178 text.append(" => ");
179 text.append(produces);
180 text.append("\n");
181 }
182 }
183
184 return text.toString();
185 }
186
187 protected class Endpoint {
188 public final String className;
189 public final String methodName;
190 public final String path;
191 public final String httpMethod;
192 public final List<String> produces;
193 protected Endpoint(Class<?> endpoint, Method method, String path,
194 String httpMethod, String[] produces) {
195 this.className = endpoint.getCanonicalName();
196 this.methodName = method.getName();
197 this.path = path;
198 this.httpMethod = httpMethod;
199 this.produces = Collections.unmodifiableList(Arrays.asList(produces));
200 }
201 }
202 }
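TikaWelcome.identifyEndpoints() above builds its listing purely by reflection: the class-level @Path supplies the base path, and each method contributes its own @Path suffix plus the HTTP-verb annotation. The same scanning pattern, reduced to JDK-only annotations (@MyPath and @MyGet are stand-ins invented for this demo, not the JAX-RS annotations):

```java
import java.lang.annotation.Retention;
import java.lang.annotation.RetentionPolicy;
import java.lang.reflect.Method;
import java.util.ArrayList;
import java.util.Collections;
import java.util.List;

public class EndpointScanSketch {
    @Retention(RetentionPolicy.RUNTIME) @interface MyPath { String value(); }
    @Retention(RetentionPolicy.RUNTIME) @interface MyGet { }

    @MyPath("/version")
    public static class VersionResource {
        @MyGet public String getVersion() { return "1.8"; }
        @MyGet @MyPath("/full") public String getFullVersion() { return "Apache Tika 1.8"; }
    }

    /** Collects "GET basePath+methodPath" entries the way identifyEndpoints() does. */
    static List<String> scan(Class<?> resource) {
        List<String> found = new ArrayList<String>();
        MyPath base = resource.getAnnotation(MyPath.class);
        String basePath = base == null ? "" : base.value();
        for (Method m : resource.getMethods()) {
            if (m.getAnnotation(MyGet.class) == null) continue; // not an endpoint method
            MyPath mp = m.getAnnotation(MyPath.class);
            found.add("GET " + basePath + (mp == null ? "" : mp.value()));
        }
        Collections.sort(found); // TikaWelcome sorts by path, then method name
        return found;
    }

    public static void main(String[] args) {
        System.out.println(scan(VersionResource.class)); // [GET /version, GET /version/full]
    }
}
```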
+0 -257 tika-server/src/main/java/org/apache/tika/server/UnpackerResource.java
0 /*
1 * Licensed to the Apache Software Foundation (ASF) under one or more
2 * contributor license agreements. See the NOTICE file distributed with
3 * this work for additional information regarding copyright ownership.
4 * The ASF licenses this file to You under the Apache License, Version 2.0
5 * (the "License"); you may not use this file except in compliance with
6 * the License. You may obtain a copy of the License at
7 *
8 * http://www.apache.org/licenses/LICENSE-2.0
9 *
10 * Unless required by applicable law or agreed to in writing, software
11 * distributed under the License is distributed on an "AS IS" BASIS,
12 * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
13 * See the License for the specific language governing permissions and
14 * limitations under the License.
15 */
16
17 package org.apache.tika.server;
18
19 import java.io.ByteArrayInputStream;
20 import java.io.ByteArrayOutputStream;
21 import java.io.IOException;
22 import java.io.InputStream;
23 import java.io.OutputStreamWriter;
24 import java.util.HashMap;
25 import java.util.Map;
26
27 import javax.ws.rs.PUT;
28 import javax.ws.rs.Path;
29 import javax.ws.rs.Produces;
30 import javax.ws.rs.WebApplicationException;
31 import javax.ws.rs.core.Context;
32 import javax.ws.rs.core.HttpHeaders;
33 import javax.ws.rs.core.Response;
34 import javax.ws.rs.core.UriInfo;
35
36 import org.apache.commons.lang.mutable.MutableInt;
37 import org.apache.commons.logging.Log;
38 import org.apache.commons.logging.LogFactory;
39 import org.apache.poi.poifs.filesystem.DirectoryEntry;
40 import org.apache.poi.poifs.filesystem.DocumentEntry;
41 import org.apache.poi.poifs.filesystem.DocumentInputStream;
42 import org.apache.poi.poifs.filesystem.Entry;
43 import org.apache.poi.poifs.filesystem.Ole10Native;
44 import org.apache.poi.poifs.filesystem.Ole10NativeException;
45 import org.apache.poi.poifs.filesystem.POIFSFileSystem;
46 import org.apache.poi.util.IOUtils;
47 import org.apache.tika.config.TikaConfig;
48 import org.apache.tika.exception.TikaException;
49 import org.apache.tika.extractor.EmbeddedDocumentExtractor;
50 import org.apache.tika.io.TikaInputStream;
51 import org.apache.tika.metadata.Metadata;
52 import org.apache.tika.metadata.TikaMetadataKeys;
53 import org.apache.tika.mime.MimeTypeException;
54 import org.apache.tika.parser.AutoDetectParser;
55 import org.apache.tika.parser.ParseContext;
56 import org.apache.tika.parser.microsoft.OfficeParser;
57 import org.apache.tika.sax.BodyContentHandler;
58 import org.xml.sax.ContentHandler;
59 import org.xml.sax.SAXException;
60 import org.xml.sax.helpers.DefaultHandler;
61
62 @Path("/unpack")
63 public class UnpackerResource {
64 private static final Log logger = LogFactory.getLog(UnpackerResource.class);
65 public static final String TEXT_FILENAME = "__TEXT__";
66 private static final String META_FILENAME = "__METADATA__";
67
68 private TikaConfig tikaConfig;
69 public UnpackerResource(TikaConfig tikaConfig) {
70 this.tikaConfig = tikaConfig;
71 }
72
73 @Path("/{id:(/.*)?}")
74 @PUT
75 @Produces({"application/zip", "application/x-tar"})
76 public Map<String, byte[]> unpack(
77 InputStream is,
78 @Context HttpHeaders httpHeaders,
79 @Context UriInfo info
80 ) throws Exception {
81 return process(is, httpHeaders, info, false);
82 }
83
84 @Path("/all{id:(/.*)?}")
85 @PUT
86 @Produces({"application/zip", "application/x-tar"})
87 public Map<String, byte[]> unpackAll(
88 InputStream is,
89 @Context HttpHeaders httpHeaders,
90 @Context UriInfo info
91 ) throws Exception {
92 return process(is, httpHeaders, info, true);
93 }
94
95 private Map<String, byte[]> process(
96 InputStream is,
97 @Context HttpHeaders httpHeaders,
98 @Context UriInfo info,
99 boolean saveAll
100 ) throws Exception {
101 Metadata metadata = new Metadata();
102
103 AutoDetectParser parser = TikaResource.createParser(tikaConfig);
104
105 TikaResource.fillMetadata(parser, metadata, httpHeaders.getRequestHeaders());
106 TikaResource.logRequest(logger, info, metadata);
107
108 ContentHandler ch;
109 ByteArrayOutputStream text = new ByteArrayOutputStream();
110
111 if (saveAll) {
112 ch = new BodyContentHandler(new RichTextContentHandler(new OutputStreamWriter(text, "UTF-8")));
113 } else {
114 ch = new DefaultHandler();
115 }
116
117 ParseContext pc = new ParseContext();
118
119 Map<String, byte[]> files = new HashMap<String, byte[]>();
120 MutableInt count = new MutableInt();
121
122 pc.set(EmbeddedDocumentExtractor.class, new MyEmbeddedDocumentExtractor(count, files));
123
124 try {
125 parser.parse(is, ch, metadata, pc);
126 } catch (TikaException ex) {
127 logger.warn(String.format(
128 "%s: Unpacker failed",
129 info.getPath()
130 ), ex);
131
132 throw ex;
133 }
134
135 if (count.intValue() == 0 && !saveAll) {
136 throw new WebApplicationException(Response.Status.NO_CONTENT);
137 }
138
139 if (saveAll) {
140 files.put(TEXT_FILENAME, text.toByteArray());
141
142 ByteArrayOutputStream metaStream = new ByteArrayOutputStream();
143 MetadataResource.metadataToCsv(metadata, metaStream);
144
145 files.put(META_FILENAME, metaStream.toByteArray());
146 }
147
148 return files;
149 }
150
151 private class MyEmbeddedDocumentExtractor implements EmbeddedDocumentExtractor {
152 private final MutableInt count;
153 private final Map<String, byte[]> zout;
154
155 MyEmbeddedDocumentExtractor(MutableInt count, Map<String, byte[]> zout) {
156 this.count = count;
157 this.zout = zout;
158 }
159
160 public boolean shouldParseEmbedded(Metadata metadata) {
161 return true;
162 }
163
164 public void parseEmbedded(InputStream inputStream, ContentHandler contentHandler, Metadata metadata, boolean b) throws SAXException, IOException {
165 ByteArrayOutputStream bos = new ByteArrayOutputStream();
166 IOUtils.copy(inputStream, bos);
167 byte[] data = bos.toByteArray();
168
169 String name = metadata.get(TikaMetadataKeys.RESOURCE_NAME_KEY);
170 String contentType = metadata.get(org.apache.tika.metadata.HttpHeaders.CONTENT_TYPE);
171
172 if (name == null) {
173 name = Integer.toString(count.intValue());
174 }
175
176 if (!name.contains(".") && contentType!=null) {
177 try {
178 String ext = tikaConfig.getMimeRepository().forName(contentType).getExtension();
179
180 if (ext!=null) {
181 name += ext;
182 }
183 } catch (MimeTypeException e) {
184 logger.warn("Unexpected MimeTypeException", e);
185 }
186 }
187
188 if ("application/vnd.openxmlformats-officedocument.oleObject".equals(contentType)) {
189 POIFSFileSystem poifs = new POIFSFileSystem(new ByteArrayInputStream(data));
190 OfficeParser.POIFSDocumentType type = OfficeParser.POIFSDocumentType.detectType(poifs);
191
192 if (type == OfficeParser.POIFSDocumentType.OLE10_NATIVE) {
193 try {
194 Ole10Native ole = Ole10Native.createFromEmbeddedOleObject(poifs);
195 if (ole.getDataSize()>0) {
196 String label = ole.getLabel();
197
198 if (label.startsWith("ole-")) {
199 label = Integer.toString(count.intValue()) + '-' + label;
200 }
201
202 name = label;
203
204 data = ole.getDataBuffer();
205 }
206 } catch (Ole10NativeException ex) {
207 logger.warn("Skipping invalid part", ex);
208 }
209 } else {
210 name += '.' + type.getExtension();
211 }
212 }
213
214 final String finalName = name;
215
216 if (data.length > 0) {
217 zout.put(finalName, data);
218
219 count.increment();
220 } else {
221 if (inputStream instanceof TikaInputStream) {
222 TikaInputStream tin = (TikaInputStream) inputStream;
223
224 if (tin.getOpenContainer()!=null && tin.getOpenContainer() instanceof DirectoryEntry) {
225 POIFSFileSystem fs = new POIFSFileSystem();
226 copy((DirectoryEntry) tin.getOpenContainer(), fs.getRoot());
227 ByteArrayOutputStream bos2 = new ByteArrayOutputStream();
228 fs.writeFilesystem(bos2);
229 bos2.close();
230
231 zout.put(finalName, bos2.toByteArray());
232 }
233 }
234 }
235 }
236
237 protected void copy(DirectoryEntry sourceDir, DirectoryEntry destDir)
238 throws IOException {
239 for (Entry entry : sourceDir) {
240 if (entry instanceof DirectoryEntry) {
241 // Need to recurse
242 DirectoryEntry newDir = destDir.createDirectory(entry.getName());
243 copy((DirectoryEntry) entry, newDir);
244 } else {
245 // Copy entry
246 InputStream contents = new DocumentInputStream((DocumentEntry) entry);
247 try {
248 destDir.createDocument(entry.getName(), contents);
249 } finally {
250 contents.close();
251 }
252 }
253 }
254 }
255 }
256 }
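In MyEmbeddedDocumentExtractor above, unnamed attachments fall back to the running counter, and a MIME-derived extension is appended only when the name contains no dot. That naming rule in isolation (chooseName is an illustrative helper, not a Tika method; in the real code the extension comes from the MIME repository):

```java
public class AttachmentNameSketch {
    /** Mirrors the naming rules: counter when unnamed, extension appended when missing. */
    static String chooseName(String name, String ext, int count) {
        if (name == null) {
            name = Integer.toString(count); // fall back to the attachment counter
        }
        if (!name.contains(".") && ext != null) {
            name += ext; // only append when no extension is present
        }
        return name;
    }

    public static void main(String[] args) {
        System.out.println(chooseName(null, ".pdf", 3));      // 3.pdf
        System.out.println(chooseName("report", ".docx", 4)); // report.docx
        System.out.println(chooseName("a.b", ".txt", 5));     // a.b
    }
}
```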
+0 -85 tika-server/src/main/java/org/apache/tika/server/ZipWriter.java
0 /*
1 * Licensed to the Apache Software Foundation (ASF) under one or more
2 * contributor license agreements. See the NOTICE file distributed with
3 * this work for additional information regarding copyright ownership.
4 * The ASF licenses this file to You under the Apache License, Version 2.0
5 * (the "License"); you may not use this file except in compliance with
6 * the License. You may obtain a copy of the License at
7 *
8 * http://www.apache.org/licenses/LICENSE-2.0
9 *
10 * Unless required by applicable law or agreed to in writing, software
11 * distributed under the License is distributed on an "AS IS" BASIS,
12 * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
13 * See the License for the specific language governing permissions and
14 * limitations under the License.
15 */
16
17 package org.apache.tika.server;
18
19 import org.apache.commons.compress.archivers.zip.ZipArchiveEntry;
20 import org.apache.commons.compress.archivers.zip.ZipArchiveOutputStream;
21
22 import javax.ws.rs.Produces;
23 import javax.ws.rs.WebApplicationException;
24 import javax.ws.rs.core.MediaType;
25 import javax.ws.rs.core.MultivaluedMap;
26 import javax.ws.rs.ext.MessageBodyWriter;
27 import javax.ws.rs.ext.Provider;
28 import java.io.IOException;
29 import java.io.OutputStream;
30 import java.lang.annotation.Annotation;
31 import java.lang.reflect.Type;
32 import java.util.Map;
33 import java.util.UUID;
34 import java.util.zip.CRC32;
35 import java.util.zip.ZipEntry;
36 import java.util.zip.ZipException;
37 import java.util.zip.ZipOutputStream;
38
39 @Provider
40 @Produces("application/zip")
41 public class ZipWriter implements MessageBodyWriter<Map<String, byte[]>> {
42 private static void zipStoreBuffer(ZipArchiveOutputStream zip, String name, byte[] dataBuffer) throws IOException {
43 ZipEntry zipEntry = new ZipEntry(name!=null?name: UUID.randomUUID().toString());
44 zipEntry.setMethod(ZipOutputStream.STORED);
45
46 zipEntry.setSize(dataBuffer.length);
47 CRC32 crc32 = new CRC32();
48 crc32.update(dataBuffer);
49 zipEntry.setCrc(crc32.getValue());
50
51 try {
52 zip.putArchiveEntry(new ZipArchiveEntry(zipEntry));
53 } catch (ZipException ex) {
54 if (name!=null) {
55 zipStoreBuffer(zip, "x-"+name, dataBuffer);
56 return;
57 }
58 }
59
60 zip.write(dataBuffer);
61
62 zip.closeArchiveEntry();
63 }
64
65 public boolean isWriteable(Class<?> type, Type genericType, Annotation[] annotations, MediaType mediaType) {
66 return Map.class.isAssignableFrom(type);
67 }
68
69 public long getSize(Map<String, byte[]> stringMap, Class<?> type, Type genericType, Annotation[] annotations, MediaType mediaType) {
70 return -1;
71 }
72
73 public void writeTo(Map<String, byte[]> parts, Class<?> type, Type genericType, Annotation[] annotations, MediaType mediaType, MultivaluedMap<String, Object> httpHeaders, OutputStream entityStream) throws IOException, WebApplicationException {
74 ZipArchiveOutputStream zip = new ZipArchiveOutputStream(entityStream);
75
76 zip.setMethod(ZipArchiveOutputStream.STORED);
77
78 for (Map.Entry<String, byte[]> entry : parts.entrySet()) {
79 zipStoreBuffer(zip, entry.getKey(), entry.getValue());
80 }
81
82 zip.close();
83 }
84 }
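ZipWriter above writes every entry with the STORED (uncompressed) method, which is why zipStoreBuffer must set the entry size and a CRC32 checksum before writing: the zip local file header for stored data requires both up front. The same requirement demonstrated round-trip with java.util.zip alone (checked IOExceptions are wrapped as RuntimeException for brevity):

```java
import java.io.ByteArrayInputStream;
import java.io.ByteArrayOutputStream;
import java.io.IOException;
import java.nio.charset.StandardCharsets;
import java.util.zip.CRC32;
import java.util.zip.ZipEntry;
import java.util.zip.ZipInputStream;
import java.util.zip.ZipOutputStream;

public class StoredZipSketch {
    /** Writes one STORED entry; size and CRC must be precomputed, as in ZipWriter. */
    static byte[] zipStored(String name, byte[] data) {
        try {
            ByteArrayOutputStream bos = new ByteArrayOutputStream();
            ZipOutputStream zip = new ZipOutputStream(bos);
            ZipEntry entry = new ZipEntry(name);
            entry.setMethod(ZipEntry.STORED);
            entry.setSize(data.length);       // required before putNextEntry for STORED
            CRC32 crc = new CRC32();
            crc.update(data);
            entry.setCrc(crc.getValue());     // likewise required
            zip.putNextEntry(entry);
            zip.write(data);
            zip.closeEntry();
            zip.close();
            return bos.toByteArray();
        } catch (IOException e) {
            throw new RuntimeException(e);
        }
    }

    /** Reads the first entry back to confirm the round trip. */
    static byte[] unzipFirst(byte[] zipped) {
        try {
            ZipInputStream in = new ZipInputStream(new ByteArrayInputStream(zipped));
            in.getNextEntry();
            ByteArrayOutputStream out = new ByteArrayOutputStream();
            byte[] buf = new byte[4096];
            int n;
            while ((n = in.read(buf)) != -1) {
                out.write(buf, 0, n);
            }
            in.close();
            return out.toByteArray();
        } catch (IOException e) {
            throw new RuntimeException(e);
        }
    }

    public static void main(String[] args) {
        byte[] original = "__TEXT__ payload".getBytes(StandardCharsets.UTF_8);
        byte[] roundTrip = unzipFirst(zipStored("__TEXT__", original));
        System.out.println(new String(roundTrip, StandardCharsets.UTF_8));
    }
}
```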
0 /**
1 * Licensed to the Apache Software Foundation (ASF) under one or more
2 * contributor license agreements. See the NOTICE file distributed with
3 * this work for additional information regarding copyright ownership.
4 * The ASF licenses this file to You under the Apache License, Version 2.0
5 * (the "License"); you may not use this file except in compliance with
6 * the License. You may obtain a copy of the License at
7 *
8 * http://www.apache.org/licenses/LICENSE-2.0
9 *
10 * Unless required by applicable law or agreed to in writing, software
11 * distributed under the License is distributed on an "AS IS" BASIS,
12 * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
13 * See the License for the specific language governing permissions and
14 * limitations under the License.
15 */
16
17 package org.apache.tika.server.resource;
18
19 import javax.ws.rs.Consumes;
20 import javax.ws.rs.PUT;
21 import javax.ws.rs.Path;
22 import javax.ws.rs.Produces;
23 import javax.ws.rs.core.Context;
24 import javax.ws.rs.core.HttpHeaders;
25 import javax.ws.rs.core.UriInfo;
26
27 import java.io.IOException;
28 import java.io.InputStream;
29
30 import org.apache.commons.logging.Log;
31 import org.apache.commons.logging.LogFactory;
32 import org.apache.tika.config.TikaConfig;
33 import org.apache.tika.io.TikaInputStream;
34 import org.apache.tika.metadata.Metadata;
35 import org.apache.tika.mime.MediaType;
36
37 @Path("/detect")
38 public class DetectorResource {
39
40 private static final Log logger = LogFactory.getLog(DetectorResource.class
41 .getName());
42
43 private TikaConfig config = null;
44
45 public DetectorResource(TikaConfig config) {
46 this.config = config;
47 }
48
49 @PUT
50 @Path("stream")
51 @Consumes("*/*")
52 @Produces("text/plain")
53 public String detect(final InputStream is,
54 @Context HttpHeaders httpHeaders, @Context final UriInfo info) {
55 Metadata met = new Metadata();
56 TikaInputStream tis = TikaInputStream.get(is);
57 String filename = TikaResource.detectFilename(httpHeaders
58 .getRequestHeaders());
59 logger.info("Detecting media type for Filename: " + filename);
60 met.add(Metadata.RESOURCE_NAME_KEY, filename);
61 try {
62 return this.config.getDetector().detect(tis, met).toString();
63 } catch (IOException e) {
64 logger.warn("Unable to detect MIME type for file. Reason: "
65 + e.getMessage());
66 e.printStackTrace();
67 return MediaType.OCTET_STREAM.toString();
68 }
69 }
70
71 }
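DetectorResource.detect() above never propagates an IOException to the client; it degrades to application/octet-stream so the response body is always a valid media type string. A reduced sketch of that fallback (Detector here is a hypothetical one-method stand-in, not org.apache.tika.detect.Detector):

```java
import java.io.IOException;

public class DetectFallbackSketch {
    /** Hypothetical stand-in for a detector that may fail with IOException. */
    public interface Detector {
        String detect() throws IOException;
    }

    /** Mirrors DetectorResource: return the detected type, or octet-stream on failure. */
    static String detectOrOctetStream(Detector detector) {
        try {
            return detector.detect();
        } catch (IOException e) {
            return "application/octet-stream"; // safe default, as in DetectorResource
        }
    }

    public static void main(String[] args) {
        System.out.println(detectOrOctetStream(new Detector() {
            public String detect() { return "application/pdf"; }
        })); // application/pdf
        System.out.println(detectOrOctetStream(new Detector() {
            public String detect() throws IOException { throw new IOException("unreadable"); }
        })); // application/octet-stream
    }
}
```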
0 /*
1 * Licensed to the Apache Software Foundation (ASF) under one or more
2 * contributor license agreements. See the NOTICE file distributed with
3 * this work for additional information regarding copyright ownership.
4 * The ASF licenses this file to You under the Apache License, Version 2.0
5 * (the "License"); you may not use this file except in compliance with
6 * the License. You may obtain a copy of the License at
7 *
8 * http://www.apache.org/licenses/LICENSE-2.0
9 *
10 * Unless required by applicable law or agreed to in writing, software
11 * distributed under the License is distributed on an "AS IS" BASIS,
12 * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
13 * See the License for the specific language governing permissions and
14 * limitations under the License.
15 */
16
17 package org.apache.tika.server.resource;
18
19 import javax.ws.rs.Consumes;
20 import javax.ws.rs.POST;
21 import javax.ws.rs.PUT;
22 import javax.ws.rs.Path;
23 import javax.ws.rs.PathParam;
24 import javax.ws.rs.Produces;
25 import javax.ws.rs.core.Context;
26 import javax.ws.rs.core.HttpHeaders;
27 import javax.ws.rs.core.MultivaluedMap;
28 import javax.ws.rs.core.Response;
29 import javax.ws.rs.core.UriInfo;
30
31 import java.io.IOException;
32 import java.io.InputStream;
33
34 import org.apache.commons.logging.Log;
35 import org.apache.commons.logging.LogFactory;
36 import org.apache.cxf.jaxrs.ext.multipart.Attachment;
37 import org.apache.tika.config.TikaConfig;
38 import org.apache.tika.metadata.Metadata;
39 import org.apache.tika.parser.AutoDetectParser;
40 import org.apache.tika.parser.ParseContext;
41 import org.xml.sax.helpers.DefaultHandler;
42
43
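/**
 * Provides the <code>/meta</code> endpoint, which parses a document and
 * returns its metadata as CSV, JSON, or RDF/XML. Example usage, assuming
 * the default server address and a hypothetical local file:
 * <pre>curl -X PUT -T test.doc -H "Accept: application/json" http://localhost:9998/meta</pre>
 */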
44 @Path("/meta")
45 public class MetadataResource {
46 private static final Log logger = LogFactory.getLog(MetadataResource.class);
47
48 private TikaConfig tikaConfig;
49
50 public MetadataResource(TikaConfig tikaConfig) {
51 this.tikaConfig = tikaConfig;
52 }
53
54 @POST
55 @Consumes("multipart/form-data")
56 @Produces({"text/csv", "application/json", "application/rdf+xml"})
57 @Path("form")
58 public Response getMetadataFromMultipart(Attachment att, @Context UriInfo info) throws Exception {
59 return Response.ok(
60 parseMetadata(att.getObject(InputStream.class), att.getHeaders(), info)).build();
61 }
62
63 @PUT
64 @Produces({"text/csv", "application/json", "application/rdf+xml"})
65 public Response getMetadata(InputStream is, @Context HttpHeaders httpHeaders, @Context UriInfo info) throws Exception {
66 return Response.ok(
67 parseMetadata(is, httpHeaders.getRequestHeaders(), info)).build();
68 }
69
70 /**
71 * Get a specific metadata field. If the input stream cannot be parsed, but a
72 * value was found for the given metadata field, then the value of the field
73 * is returned as part of a 200 OK response; otherwise a
74 * {@link javax.ws.rs.core.Response.Status#BAD_REQUEST} is generated. If the stream was successfully
75 * parsed but the specific metadata field was not found, then a
76 * {@link javax.ws.rs.core.Response.Status#NOT_FOUND} is returned.
77 * <p/>
78 * Note that this method handles multivalued fields and may return more
79 * metadata values than requested.
80 * <p/>
81 * If you want XMP, you must be careful to specify the exact XMP key.
82 * For example, "Author" will return nothing, but "dc:creator" will return the correct value.
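 * <p/>
 * Example usage, assuming the default server address and a hypothetical
 * local file:
 * <pre>curl -X PUT -T test.doc http://localhost:9998/meta/dc:creator</pre>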
83 *
84 * @param is the document input stream
85 * @param httpHeaders the HTTP headers of the request
86 * @param info the URI info of the request
87 * @param field the tika metadata field name
88 * @return one of {@link javax.ws.rs.core.Response.Status#OK}, {@link javax.ws.rs.core.Response.Status#NOT_FOUND}, or
89 * {@link javax.ws.rs.core.Response.Status#BAD_REQUEST}
90 * @throws Exception
91 */
92 @PUT
93 @Path("{field}")
94 @Produces({"text/csv", "application/json", "application/rdf+xml", "text/plain"})
95 public Response getMetadataField(InputStream is, @Context HttpHeaders httpHeaders,
96 @Context UriInfo info, @PathParam("field") String field) throws Exception {
97
98 // use BAD request to indicate that we may not have had enough data to
99 // process the request
100 Response.Status defaultErrorResponse = Response.Status.BAD_REQUEST;
101 Metadata metadata = null;
102 try {
103 metadata = parseMetadata(is, httpHeaders.getRequestHeaders(), info);
104 // once we've parsed the document successfully, we should use NOT_FOUND
105 // if we did not see the field
106 defaultErrorResponse = Response.Status.NOT_FOUND;
107 } catch (Exception e) {
108 logger.info("Failed to process field " + field, e);
109 }
110
111 if (metadata == null || metadata.get(field) == null) {
112 return Response.status(defaultErrorResponse).entity("Failed to get metadata field " + field).build();
113 }
114
115 // remove fields we don't care about for the response
116 for (String name : metadata.names()) {
117 if (!field.equals(name)) {
118 metadata.remove(name);
119 }
120 }
121 return Response.ok(metadata).build();
122 }
123
124 private Metadata parseMetadata(InputStream is,
125 MultivaluedMap<String, String> httpHeaders, UriInfo info) throws IOException {
126 final Metadata metadata = new Metadata();
127 final ParseContext context = new ParseContext();
128 AutoDetectParser parser = TikaResource.createParser(tikaConfig);
129 TikaResource.fillMetadata(parser, metadata, context, httpHeaders);
130 //no need to pass parser for embedded document parsing
131 TikaResource.fillParseContext(context, httpHeaders, null);
132 TikaResource.logRequest(logger, info, metadata);
133 TikaResource.parse(parser, logger, info.getPath(), is, new DefaultHandler(), metadata, context);
134 return metadata;
135 }
136 }
0 /*
1 * Licensed to the Apache Software Foundation (ASF) under one or more
2 * contributor license agreements. See the NOTICE file distributed with
3 * this work for additional information regarding copyright ownership.
4 * The ASF licenses this file to You under the Apache License, Version 2.0
5 * (the "License"); you may not use this file except in compliance with
6 * the License. You may obtain a copy of the License at
7 *
8 * http://www.apache.org/licenses/LICENSE-2.0
9 *
10 * Unless required by applicable law or agreed to in writing, software
11 * distributed under the License is distributed on an "AS IS" BASIS,
12 * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
13 * See the License for the specific language governing permissions and
14 * limitations under the License.
15 */
16
17 package org.apache.tika.server.resource;
18
19 import javax.ws.rs.Consumes;
20 import javax.ws.rs.POST;
21 import javax.ws.rs.PUT;
22 import javax.ws.rs.Path;
23 import javax.ws.rs.Produces;
24 import javax.ws.rs.core.Context;
25 import javax.ws.rs.core.HttpHeaders;
26 import javax.ws.rs.core.MultivaluedMap;
27 import javax.ws.rs.core.Response;
28 import javax.ws.rs.core.UriInfo;
29
30 import java.io.InputStream;
31
32 import org.apache.commons.logging.Log;
33 import org.apache.commons.logging.LogFactory;
34 import org.apache.cxf.jaxrs.ext.multipart.Attachment;
35 import org.apache.tika.config.TikaConfig;
36 import org.apache.tika.metadata.Metadata;
37 import org.apache.tika.parser.AutoDetectParser;
38 import org.apache.tika.parser.ParseContext;
39 import org.apache.tika.parser.RecursiveParserWrapper;
40 import org.apache.tika.sax.BasicContentHandlerFactory;
41 import org.apache.tika.server.MetadataList;
42 import org.xml.sax.helpers.DefaultHandler;
43
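/**
 * Provides the <code>/rmeta</code> endpoint, which parses a document
 * recursively and returns the metadata of the container document and of
 * each embedded document as a JSON list. Example usage, assuming the
 * default server address and a hypothetical local file:
 * <pre>curl -X PUT -T test.doc http://localhost:9998/rmeta</pre>
 */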
44 @Path("/rmeta")
45 public class RecursiveMetadataResource {
46 private static final Log logger = LogFactory.getLog(RecursiveMetadataResource.class);
47
48 private TikaConfig tikaConfig;
49
50 public RecursiveMetadataResource(TikaConfig tikaConfig) {
51 this.tikaConfig = tikaConfig;
52 }
53
54 @POST
55 @Consumes("multipart/form-data")
56 @Produces({"text/csv", "application/json"})
57 @Path("form")
58 public Response getMetadataFromMultipart(Attachment att, @Context UriInfo info) throws Exception {
59 return Response.ok(
60 parseMetadata(att.getObject(InputStream.class), att.getHeaders(), info)).build();
61 }
62
63 @PUT
64 @Produces("application/json")
65 public Response getMetadata(InputStream is, @Context HttpHeaders httpHeaders, @Context UriInfo info) throws Exception {
66 return Response.ok(
67 parseMetadata(is, httpHeaders.getRequestHeaders(), info)).build();
68 }
69
70 private MetadataList parseMetadata(InputStream is,
71 MultivaluedMap<String, String> httpHeaders, UriInfo info) throws Exception {
72 final Metadata metadata = new Metadata();
73 final ParseContext context = new ParseContext();
74 AutoDetectParser parser = TikaResource.createParser(tikaConfig);
75 //TODO: parameterize choice of handler and max chars?
76 BasicContentHandlerFactory.HANDLER_TYPE type = BasicContentHandlerFactory.HANDLER_TYPE.TEXT;
77 RecursiveParserWrapper wrapper = new RecursiveParserWrapper(parser,
78 new BasicContentHandlerFactory(type, -1));
79 TikaResource.fillMetadata(parser, metadata, context, httpHeaders);
80 //no need to add parser to parse recursively
81 TikaResource.fillParseContext(context, httpHeaders, null);
82 TikaResource.logRequest(logger, info, metadata);
83 TikaResource.parse(wrapper, logger, info.getPath(), is, new DefaultHandler(), metadata, context);
84 return new MetadataList(wrapper.getMetadata());
85 }
86 }
0 /*
1 * Licensed to the Apache Software Foundation (ASF) under one or more
2 * contributor license agreements. See the NOTICE file distributed with
3 * this work for additional information regarding copyright ownership.
4 * The ASF licenses this file to You under the Apache License, Version 2.0
5 * (the "License"); you may not use this file except in compliance with
6 * the License. You may obtain a copy of the License at
7 *
8 * http://www.apache.org/licenses/LICENSE-2.0
9 *
10 * Unless required by applicable law or agreed to in writing, software
11 * distributed under the License is distributed on an "AS IS" BASIS,
12 * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
13 * See the License for the specific language governing permissions and
14 * limitations under the License.
15 */
16 package org.apache.tika.server.resource;
17
18 import javax.ws.rs.GET;
19 import javax.ws.rs.Path;
20 import javax.ws.rs.Produces;
21
22 import java.util.ArrayList;
23 import java.util.HashMap;
24 import java.util.List;
25 import java.util.Map;
26
27 import org.apache.tika.config.TikaConfig;
28 import org.apache.tika.detect.CompositeDetector;
29 import org.apache.tika.detect.Detector;
30 import org.apache.tika.server.HTMLHelper;
31 import org.eclipse.jetty.util.ajax.JSON;
32
33 /**
34 * <p>Provides details of all the {@link Detector}s registered with
35 * Apache Tika, similar to <em>--list-detectors</em> with the Tika CLI.
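 * <p>Example usage, assuming the default server address:
 * <pre>curl http://localhost:9998/detectors</pre>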
36 */
37 @Path("/detectors")
38 public class TikaDetectors {
39 private TikaConfig tika;
40 private HTMLHelper html;
41
42 public TikaDetectors(TikaConfig tika) {
43 this.tika = tika;
44 this.html = new HTMLHelper();
45 }
46
47 @GET
48 @Produces("text/html")
49 public String getDetectorsHTML() {
50 StringBuffer h = new StringBuffer();
51 html.generateHeader(h, "Detectors available to Apache Tika");
52 detectorAsHTML(tika.getDetector(), h, 2);
53 html.generateFooter(h);
54 return h.toString();
55 }
56
57 private void detectorAsHTML(Detector d, StringBuffer html, int level) {
58 html.append("<h");
59 html.append(level);
60 html.append(">");
61 String name = d.getClass().getName();
62 html.append(name.substring(name.lastIndexOf('.') + 1));
63 html.append("</h");
64 html.append(level);
65 html.append(">");
66 html.append("<p>Class: ");
67 html.append(name);
68 html.append("</p>");
69 if (d instanceof CompositeDetector) {
70 html.append("<p>Composite Detector</p>");
71 for (Detector cd : ((CompositeDetector) d).getDetectors()) {
72 detectorAsHTML(cd, html, level + 1);
73 }
74 }
75 }
76
77 @GET
78 @Produces(javax.ws.rs.core.MediaType.APPLICATION_JSON)
79 public String getDetectorsJSON() {
80 Map<String, Object> details = new HashMap<String, Object>();
81 detectorAsMap(tika.getDetector(), details);
82 return JSON.toString(details);
83 }
84
85 private void detectorAsMap(Detector d, Map<String, Object> details) {
86 details.put("name", d.getClass().getName());
87
88 boolean isComposite = (d instanceof CompositeDetector);
89 details.put("composite", isComposite);
90 if (isComposite) {
91 List<Map<String, Object>> c = new ArrayList<Map<String, Object>>();
92 for (Detector cd : ((CompositeDetector) d).getDetectors()) {
93 Map<String, Object> cdet = new HashMap<String, Object>();
94 detectorAsMap(cd, cdet);
95 c.add(cdet);
96 }
97 details.put("children", c);
98 }
99 }
100
101 @GET
102 @Produces("text/plain")
103 public String getDetectorsPlain() {
104 StringBuffer text = new StringBuffer();
105 renderDetector(tika.getDetector(), text, 0);
106 return text.toString();
107 }
108
109 private void renderDetector(Detector d, StringBuffer text, int indent) {
110 boolean isComposite = (d instanceof CompositeDetector);
111 String name = d.getClass().getName();
112
113 for (int i = 0; i < indent; i++) {
114 text.append(" ");
115 }
116 text.append(name);
117 if (isComposite) {
118 text.append(" (Composite Detector):\n");
119
120 List<Detector> subDetectors = ((CompositeDetector) d).getDetectors();
121 for (Detector sd : subDetectors) {
122 renderDetector(sd, text, indent + 1);
123 }
124 } else {
125 text.append("\n");
}
126 }
127 }
0 /*
1 * Licensed to the Apache Software Foundation (ASF) under one or more
2 * contributor license agreements. See the NOTICE file distributed with
3 * this work for additional information regarding copyright ownership.
4 * The ASF licenses this file to You under the Apache License, Version 2.0
5 * (the "License"); you may not use this file except in compliance with
6 * the License. You may obtain a copy of the License at
7 *
8 * http://www.apache.org/licenses/LICENSE-2.0
9 *
10 * Unless required by applicable law or agreed to in writing, software
11 * distributed under the License is distributed on an "AS IS" BASIS,
12 * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
13 * See the License for the specific language governing permissions and
14 * limitations under the License.
15 */
16 package org.apache.tika.server.resource;
17
18 import javax.ws.rs.GET;
19 import javax.ws.rs.Path;
20 import javax.ws.rs.Produces;
21
22 import java.util.ArrayList;
23 import java.util.HashMap;
24 import java.util.List;
25 import java.util.Map;
26 import java.util.SortedMap;
27 import java.util.TreeMap;
28
29 import org.apache.tika.config.TikaConfig;
30 import org.apache.tika.mime.MediaType;
31 import org.apache.tika.mime.MediaTypeRegistry;
32 import org.apache.tika.parser.CompositeParser;
33 import org.apache.tika.parser.Parser;
34 import org.apache.tika.server.HTMLHelper;
35 import org.eclipse.jetty.util.ajax.JSON;
36
37 /**
38 * <p>Provides details of all the mimetypes known to Apache Tika,
39 * similar to <em>--list-supported-types</em> with the Tika CLI.
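 * <p>Example usage, assuming the default server address:
 * <pre>curl http://localhost:9998/mime-types</pre>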
40 */
41 @Path("/mime-types")
42 public class TikaMimeTypes {
43 private TikaConfig tika;
44 private HTMLHelper html;
45
46 public TikaMimeTypes(TikaConfig tika) {
47 this.tika = tika;
48 this.html = new HTMLHelper();
49 }
50
51 @GET
52 @Produces("text/html")
53 public String getMimeTypesHTML() {
54 StringBuffer h = new StringBuffer();
55 html.generateHeader(h, "Apache Tika Supported Mime Types");
56
57 // Get our types
58 List<MediaTypeDetails> types = getMediaTypes();
59
60 // Get the first type in each section
61 SortedMap<String, String> firstType = new TreeMap<String, String>();
62 for (MediaTypeDetails type : types) {
63 if (!firstType.containsKey(type.type.getType())) {
64 firstType.put(type.type.getType(), type.type.toString());
65 }
66 }
67 h.append("<ul>");
68 for (String section : firstType.keySet()) {
69 h.append("<li><a href=\"#").append(firstType.get(section)).append("\">").append(section).append("</a></li>\n");
70 }
71 h.append("</ul>");
72
73 // Output all of them
74 for (MediaTypeDetails type : types) {
75 h.append("<a name=\"").append(type.type).append("\"></a>\n");
76 h.append("<h2>").append(type.type).append("</h2>\n");
77
78 for (MediaType alias : type.aliases) {
79 h.append("<div>Alias: ").append(alias).append("</div>\n");
80 }
81 if (type.supertype != null) {
82 h.append("<div>Super Type: <a href=\"#").append(type.supertype).append("\">").append(type.supertype).append("</a></div>\n");
83 }
84
85 if (type.parser != null) {
86 h.append("<div>Parser: ").append(type.parser).append("</div>\n");
87 }
88 }
89
90 html.generateFooter(h);
91 return h.toString();
92 }
93
94 @GET
95 @Produces(javax.ws.rs.core.MediaType.APPLICATION_JSON)
96 public String getMimeTypesJSON() {
97 Map<String, Object> details = new HashMap<String, Object>();
98
99 for (MediaTypeDetails type : getMediaTypes()) {
100 Map<String, Object> typeDets = new HashMap<String, Object>();
101
102 typeDets.put("alias", type.aliases);
103 if (type.supertype != null) {
104 typeDets.put("supertype", type.supertype);
105 }
106 if (type.parser != null) {
107 typeDets.put("parser", type.parser);
108 }
109
110 details.put(type.type.toString(), typeDets);
111 }
112
113 return JSON.toString(details);
114 }
115
116 @GET
117 @Produces("text/plain")
118 public String getMimeTypesPlain() {
119 StringBuffer text = new StringBuffer();
120
121 for (MediaTypeDetails type : getMediaTypes()) {
122 text.append(type.type.toString());
123 text.append("\n");
124
125 for (MediaType alias : type.aliases) {
126 text.append(" alias: ").append(alias).append("\n");
127 }
128 if (type.supertype != null) {
129 text.append(" supertype: ").append(type.supertype.toString()).append("\n");
130 }
131
132 if (type.parser != null) {
133 text.append(" parser: ").append(type.parser).append("\n");
134 }
135 }
136
137 return text.toString();
138 }
139
140 protected List<MediaTypeDetails> getMediaTypes() {
141 MediaTypeRegistry registry = tika.getMediaTypeRegistry();
142 Map<MediaType, Parser> parsers = ((CompositeParser) tika.getParser()).getParsers();
143 List<MediaTypeDetails> types =
144 new ArrayList<TikaMimeTypes.MediaTypeDetails>(registry.getTypes().size());
145
146 for (MediaType type : registry.getTypes()) {
147 MediaTypeDetails details = new MediaTypeDetails();
148 details.type = type;
149 details.aliases = registry.getAliases(type).toArray(new MediaType[0]);
150
151 MediaType supertype = registry.getSupertype(type);
152 if (supertype != null && !MediaType.OCTET_STREAM.equals(supertype)) {
153 details.supertype = supertype;
154 }
155
156 Parser p = parsers.get(type);
157 if (p != null) {
158 if (p instanceof CompositeParser) {
159 p = ((CompositeParser) p).getParsers().get(type);
160 }
161 details.parser = p.getClass().getName();
162 }
163
164 types.add(details);
165 }
166
167 return types;
168 }
169
170 private static class MediaTypeDetails {
171 private MediaType type;
172 private MediaType[] aliases;
173 private MediaType supertype;
174 private String parser;
175 }
176 }
0 /*
1 * Licensed to the Apache Software Foundation (ASF) under one or more
2 * contributor license agreements. See the NOTICE file distributed with
3 * this work for additional information regarding copyright ownership.
4 * The ASF licenses this file to You under the Apache License, Version 2.0
5 * (the "License"); you may not use this file except in compliance with
6 * the License. You may obtain a copy of the License at
7 *
8 * http://www.apache.org/licenses/LICENSE-2.0
9 *
10 * Unless required by applicable law or agreed to in writing, software
11 * distributed under the License is distributed on an "AS IS" BASIS,
12 * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
13 * See the License for the specific language governing permissions and
14 * limitations under the License.
15 */
16 package org.apache.tika.server.resource;
17
18 import javax.ws.rs.GET;
19 import javax.ws.rs.Path;
20 import javax.ws.rs.Produces;
21
22 import java.util.ArrayList;
23 import java.util.Collections;
24 import java.util.Comparator;
25 import java.util.HashMap;
26 import java.util.HashSet;
27 import java.util.List;
28 import java.util.Map;
29 import java.util.Set;
30
31 import org.apache.tika.config.TikaConfig;
32 import org.apache.tika.mime.MediaType;
33 import org.apache.tika.parser.CompositeParser;
34 import org.apache.tika.parser.ParseContext;
35 import org.apache.tika.parser.Parser;
36 import org.apache.tika.parser.ParserDecorator;
37 import org.apache.tika.server.HTMLHelper;
38 import org.eclipse.jetty.util.ajax.JSON;
39
40 /**
41 * <p>Provides details of all the {@link Parser}s registered with
42 * Apache Tika, similar to <em>--list-parsers</em> and
43 * <em>--list-parser-details</em> within the Tika CLI.
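 * <p>Example usage, assuming the default server address:
 * <pre>curl http://localhost:9998/parsers/details</pre>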
44 */
45 @Path("/parsers")
46 public class TikaParsers {
47 private static final ParseContext EMPTY_PC = new ParseContext();
48 private TikaConfig tika;
49 private HTMLHelper html;
50
51 public TikaParsers(TikaConfig tika) {
52 this.tika = tika;
53 this.html = new HTMLHelper();
54 }
55
56 @GET
57 @Path("/details")
58 @Produces("text/html")
59 public String getParserDetailsHTML() {
60 return getParsersHTML(true);
61 }
62
63 @GET
64 @Produces("text/html")
65 public String getParsersHTML() {
66 return getParsersHTML(false);
67 }
68
69 protected String getParsersHTML(boolean withMimeTypes) {
70 ParserDetails p = new ParserDetails(tika.getParser());
71
72 StringBuffer h = new StringBuffer();
73 html.generateHeader(h, "Parsers available to Apache Tika");
74 parserAsHTML(p, withMimeTypes, h, 2);
75 html.generateFooter(h);
76 return h.toString();
77 }
78
79 private void parserAsHTML(ParserDetails p, boolean withMimeTypes, StringBuffer html, int level) {
80 html.append("<h");
81 html.append(level);
82 html.append(">");
83 html.append(p.shortName);
84 html.append("</h");
85 html.append(level);
86 html.append(">");
87 html.append("<p>Class: ");
88 html.append(p.className);
89 html.append("</p>");
90 if (p.isDecorated) {
91 html.append("<p>Decorated Parser</p>");
92 }
93 if (p.isComposite) {
94 html.append("<p>Composite Parser</p>");
95 for (Parser cp : p.childParsers) {
96 parserAsHTML(new ParserDetails(cp), withMimeTypes, html, level + 1);
97 }
98 } else if (withMimeTypes) {
99 html.append("<p>Mime Types:");
100 html.append("<ul>");
101 for (MediaType mt : p.supportedTypes) {
102 html.append("<li>");
103 html.append(mt.toString());
104 html.append("</li>");
105 }
106 html.append("</ul>");
107 html.append("</p>");
108 }
109 }
110
111 @GET
112 @Path("/details")
113 @Produces(javax.ws.rs.core.MediaType.APPLICATION_JSON)
114 public String getParserDetailsJSON() {
115 return getParsersJSON(true);
116 }
117
118 @GET
119 @Produces(javax.ws.rs.core.MediaType.APPLICATION_JSON)
120 public String getParsersJSON() {
121 return getParsersJSON(false);
122 }
123
124 protected String getParsersJSON(boolean withMimeTypes) {
125 Map<String, Object> details = new HashMap<String, Object>();
126 parserAsMap(new ParserDetails(tika.getParser()), withMimeTypes, details);
127 return JSON.toString(details);
128 }
129
130 private void parserAsMap(ParserDetails p, boolean withMimeTypes, Map<String, Object> details) {
131 details.put("name", p.className);
132 details.put("composite", p.isComposite);
133 details.put("decorated", p.isDecorated);
134
135 if (p.isComposite) {
136 List<Map<String, Object>> c = new ArrayList<Map<String, Object>>();
137 for (Parser cp : p.childParsers) {
138 Map<String, Object> cdet = new HashMap<String, Object>();
139 parserAsMap(new ParserDetails(cp), withMimeTypes, cdet);
140 c.add(cdet);
141 }
142 details.put("children", c);
143 } else if (withMimeTypes) {
144 List<String> mts = new ArrayList<String>(p.supportedTypes.size());
145 for (MediaType mt : p.supportedTypes) {
146 mts.add(mt.toString());
147 }
148 details.put("supportedTypes", mts);
149 }
150 }
151
152 @GET
153 @Path("/details")
154 @Produces("text/plain")
155 public String getParserDetailsPlain() {
156 return getParsersPlain(true);
157 }
158
159 @GET
160 @Produces("text/plain")
161 public String getParsersPlain() {
162 return getParsersPlain(false);
163 }
164
165 protected String getParsersPlain(boolean withMimeTypes) {
166 StringBuffer text = new StringBuffer();
167 renderParser(new ParserDetails(tika.getParser()), withMimeTypes, text, "");
168 return text.toString();
169 }
170
171 private void renderParser(ParserDetails p, boolean withMimeTypes, StringBuffer text, String indent) {
172 String nextIndent = indent + " ";
173
174 text.append(indent);
175 text.append(p.className);
176 if (p.isDecorated) {
177 text.append(" (Decorated Parser)");
178 }
179 if (p.isComposite) {
180 text.append(" (Composite Parser):\n");
181
182 for (Parser cp : p.childParsers) {
183 renderParser(new ParserDetails(cp), withMimeTypes, text, nextIndent);
184 }
185 } else {
186 text.append("\n");
187 if (withMimeTypes) {
188 for (MediaType mt : p.supportedTypes) {
189 text.append(nextIndent);
190 text.append("Supports: ");
191 text.append(mt.toString());
192 text.append("\n");
193 }
194 }
195 }
196 }
197
198 private static class ParserDetails {
199 private String className;
200 private String shortName;
201 private boolean isComposite;
202 private boolean isDecorated;
203 private Set<MediaType> supportedTypes;
204 private List<Parser> childParsers;
205
206 private ParserDetails(Parser p) {
207 if (p instanceof ParserDecorator) {
208 isDecorated = true;
209 p = ((ParserDecorator) p).getWrappedParser();
210 }
211
212 className = p.getClass().getName();
213 shortName = className.substring(className.lastIndexOf('.') + 1);
214
215 if (p instanceof CompositeParser) {
216 isComposite = true;
217 supportedTypes = Collections.emptySet();
218
219 // Get the unique set of child parsers
220 Set<Parser> children = new HashSet<Parser>(
221 ((CompositeParser) p).getParsers(EMPTY_PC).values());
222 // Sort it by class name
223 childParsers = new ArrayList<Parser>(children);
224 Collections.sort(childParsers, new Comparator<Parser>() {
@Override
225 public int compare(Parser p1, Parser p2) {
226 return p1.getClass().getName().compareTo(p2.getClass().getName());
227 }
228 });
229 } else {
230 supportedTypes = p.getSupportedTypes(EMPTY_PC);
231 }
232 }
233 }
234 }
0 /*
1 * Licensed to the Apache Software Foundation (ASF) under one or more
2 * contributor license agreements. See the NOTICE file distributed with
3 * this work for additional information regarding copyright ownership.
4 * The ASF licenses this file to You under the Apache License, Version 2.0
5 * (the "License"); you may not use this file except in compliance with
6 * the License. You may obtain a copy of the License at
7 *
8 * http://www.apache.org/licenses/LICENSE-2.0
9 *
10 * Unless required by applicable law or agreed to in writing, software
11 * distributed under the License is distributed on an "AS IS" BASIS,
12 * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
13 * See the License for the specific language governing permissions and
14 * limitations under the License.
15 */
16
17 package org.apache.tika.server.resource;
18
19 import javax.mail.internet.ContentDisposition;
20 import javax.mail.internet.ParseException;
21 import javax.ws.rs.Consumes;
22 import javax.ws.rs.GET;
23 import javax.ws.rs.POST;
24 import javax.ws.rs.PUT;
25 import javax.ws.rs.Path;
26 import javax.ws.rs.Produces;
27 import javax.ws.rs.WebApplicationException;
28 import javax.ws.rs.core.Context;
29 import javax.ws.rs.core.HttpHeaders;
30 import javax.ws.rs.core.MultivaluedMap;
31 import javax.ws.rs.core.Response;
32 import javax.ws.rs.core.StreamingOutput;
33 import javax.ws.rs.core.UriInfo;
34 import javax.xml.transform.OutputKeys;
35 import javax.xml.transform.TransformerConfigurationException;
36 import javax.xml.transform.sax.SAXTransformerFactory;
37 import javax.xml.transform.sax.TransformerHandler;
38 import javax.xml.transform.stream.StreamResult;
39
40 import java.io.IOException;
41 import java.io.InputStream;
42 import java.io.OutputStream;
43 import java.io.OutputStreamWriter;
44 import java.io.Writer;
45 import java.lang.reflect.Field;
46 import java.util.Locale;
47 import java.util.Map;
48 import java.util.Set;
49
50 import org.apache.commons.lang.StringUtils;
51 import org.apache.commons.logging.Log;
52 import org.apache.commons.logging.LogFactory;
53 import org.apache.cxf.jaxrs.ext.multipart.Attachment;
54 import org.apache.poi.extractor.ExtractorFactory;
55 import org.apache.tika.config.TikaConfig;
56 import org.apache.tika.detect.Detector;
57 import org.apache.tika.exception.EncryptedDocumentException;
58 import org.apache.tika.io.IOUtils;
59 import org.apache.tika.io.TikaInputStream;
60 import org.apache.tika.metadata.Metadata;
61 import org.apache.tika.metadata.TikaMetadataKeys;
62 import org.apache.tika.mime.MediaType;
63 import org.apache.tika.parser.AutoDetectParser;
64 import org.apache.tika.parser.ParseContext;
65 import org.apache.tika.parser.Parser;
66 import org.apache.tika.parser.PasswordProvider;
67 import org.apache.tika.parser.html.HtmlParser;
68 import org.apache.tika.parser.ocr.TesseractOCRConfig;
69 import org.apache.tika.parser.pdf.PDFParserConfig;
70 import org.apache.tika.sax.BodyContentHandler;
71 import org.apache.tika.sax.ExpandedTitleContentHandler;
72 import org.apache.tika.server.RichTextContentHandler;
73 import org.apache.tika.server.TikaServerParseException;
74 import org.xml.sax.ContentHandler;
75 import org.xml.sax.SAXException;
76
77 @Path("/tika")
78 public class TikaResource {
79 public static final String GREETING = "This is Tika Server. Please PUT\n";
80 public static final String X_TIKA_OCR_HEADER_PREFIX = "X-Tika-OCR";
81 public static final String X_TIKA_PDF_HEADER_PREFIX = "X-Tika-PDF";
82
83
84 private final Log logger = LogFactory.getLog(TikaResource.class);
85
86 private TikaConfig tikaConfig;
87
88 public TikaResource(TikaConfig tikaConfig) {
89 this.tikaConfig = tikaConfig;
90 }
91
92 static {
93 ExtractorFactory.setAllThreadsPreferEventExtractors(true);
94 }
95
96 @SuppressWarnings("serial")
97 public static AutoDetectParser createParser(TikaConfig tikaConfig) {
98 final AutoDetectParser parser = new AutoDetectParser(tikaConfig);
99
100 Map<MediaType, Parser> parsers = parser.getParsers();
101 parsers.put(MediaType.APPLICATION_XML, new HtmlParser());
102 parser.setParsers(parsers);
103
104 parser.setFallback(new Parser() {
105 public Set<MediaType> getSupportedTypes(ParseContext parseContext) {
106 return parser.getSupportedTypes(parseContext);
107 }
108
109 public void parse(InputStream inputStream, ContentHandler contentHandler, Metadata metadata, ParseContext parseContext) {
110 throw new WebApplicationException(Response.Status.UNSUPPORTED_MEDIA_TYPE);
111 }
112 });
113
114 return parser;
115 }
116
117 public static String detectFilename(MultivaluedMap<String, String> httpHeaders) {
118
119 String disposition = httpHeaders.getFirst("Content-Disposition");
120 if (disposition != null) {
121 try {
122 ContentDisposition c = new ContentDisposition(disposition);
123
124 // only support "attachment" dispositions
125 if ("attachment".equals(c.getDisposition())) {
126 String fn = c.getParameter("filename");
127 if (fn != null) {
128 return fn;
129 }
130 }
131 } catch (ParseException e) {
132 // not a valid content-disposition field
133 }
134 }
135
136 // this really should not be used, since it's not an official field
137 return httpHeaders.getFirst("File-Name");
138 }
139
140 public static void fillParseContext(ParseContext parseContext, MultivaluedMap<String, String> httpHeaders,
141 Parser embeddedParser) {
142 TesseractOCRConfig ocrConfig = new TesseractOCRConfig();
143 PDFParserConfig pdfParserConfig = new PDFParserConfig();
144 for (String key : httpHeaders.keySet()) {
145 if (StringUtils.startsWith(key, X_TIKA_OCR_HEADER_PREFIX)) {
146 processHeaderConfig(httpHeaders, ocrConfig, key, X_TIKA_OCR_HEADER_PREFIX);
147 } else if (StringUtils.startsWith(key, X_TIKA_PDF_HEADER_PREFIX)) {
148 processHeaderConfig(httpHeaders, pdfParserConfig, key, X_TIKA_PDF_HEADER_PREFIX);
149 }
150 }
151 parseContext.set(TesseractOCRConfig.class, ocrConfig);
152 parseContext.set(PDFParserConfig.class, pdfParserConfig);
153 if (embeddedParser != null) {
154 parseContext.set(Parser.class, embeddedParser);
155 }
156 }
157
158 /**
159 * Utility method to set a property on a class via reflection.
160 *
161 * @param httpHeaders the HTTP headers set.
162 * @param object the <code>Object</code> to set the property on.
163 * @param key the key of the HTTP Header.
164 * @param prefix the name of the HTTP Header prefix used to find property.
165 * @throws WebApplicationException thrown when the field cannot be found or its value cannot be set.
166 */
167 private static void processHeaderConfig(MultivaluedMap<String, String> httpHeaders, Object object, String key, String prefix) {
168 try {
169 String property = StringUtils.removeStart(key, prefix);
170 Field field = object.getClass().getDeclaredField(StringUtils.uncapitalize(property));
171 field.setAccessible(true);
172 if (field.getType() == String.class) {
173 field.set(object, httpHeaders.getFirst(key));
174 } else if (field.getType() == int.class) {
175 field.setInt(object, Integer.parseInt(httpHeaders.getFirst(key)));
176 } else if (field.getType() == double.class) {
177 field.setDouble(object, Double.parseDouble(httpHeaders.getFirst(key)));
178 } else if (field.getType() == boolean.class) {
179 field.setBoolean(object, Boolean.parseBoolean(httpHeaders.getFirst(key)));
180 }
181 } catch (Throwable ex) {
182 throw new WebApplicationException(String.format(Locale.ROOT,
183 "%s is an invalid %s header", key, prefix));
184 }
185 }
186
187 @SuppressWarnings("serial")
188 public static void fillMetadata(AutoDetectParser parser, Metadata metadata, ParseContext context, MultivaluedMap<String, String> httpHeaders) {
189 String fileName = detectFilename(httpHeaders);
190 if (fileName != null) {
191 metadata.set(TikaMetadataKeys.RESOURCE_NAME_KEY, fileName);
192 }
193
194 String contentTypeHeader = httpHeaders.getFirst(HttpHeaders.CONTENT_TYPE);
195 javax.ws.rs.core.MediaType mediaType = contentTypeHeader == null ? null
196 : javax.ws.rs.core.MediaType.valueOf(contentTypeHeader);
197 if (mediaType != null && "xml".equals(mediaType.getSubtype())) {
198 mediaType = null;
199 }
200
201 if (mediaType != null && mediaType.equals(javax.ws.rs.core.MediaType.APPLICATION_OCTET_STREAM_TYPE)) {
202 mediaType = null;
203 }
204
205 if (mediaType != null) {
206 metadata.add(org.apache.tika.metadata.HttpHeaders.CONTENT_TYPE, mediaType.toString());
207
208 final Detector detector = parser.getDetector();
209
210 parser.setDetector(new Detector() {
211 public MediaType detect(InputStream inputStream, Metadata metadata) throws IOException {
212 String ct = metadata.get(org.apache.tika.metadata.HttpHeaders.CONTENT_TYPE);
213
214 if (ct != null) {
215 return MediaType.parse(ct);
216 } else {
217 return detector.detect(inputStream, metadata);
218 }
219 }
220 });
221 }
222
223 final String password = httpHeaders.getFirst("Password");
224 if (password != null) {
225 context.set(PasswordProvider.class, new PasswordProvider() {
226 @Override
227 public String getPassword(Metadata metadata) {
228 return password;
229 }
230 });
231 }
232 }
233
234 public static void parse(Parser parser, Log logger, String path, InputStream inputStream,
235 ContentHandler handler, Metadata metadata, ParseContext parseContext) throws IOException {
236 try {
237 parser.parse(inputStream, handler, metadata, parseContext);
238 } catch (SAXException e) {
239 throw new TikaServerParseException(e);
240 } catch (EncryptedDocumentException e) {
241 logger.warn(String.format(
242 Locale.ROOT,
243 "%s: Encrypted document",
244 path
245 ), e);
246 throw new TikaServerParseException(e);
247 } catch (Exception e) {
248 logger.warn(String.format(
249 Locale.ROOT,
250 "%s: Text extraction failed",
251 path
252 ), e);
253 throw new TikaServerParseException(e);
254 }
255 }
256
257 public static void logRequest(Log logger, UriInfo info, Metadata metadata) {
258 if (metadata.get(org.apache.tika.metadata.HttpHeaders.CONTENT_TYPE) == null) {
259 logger.info(String.format(
260 Locale.ROOT,
261 "%s (autodetecting type)",
262 info.getPath()
263 ));
264 } else {
265 logger.info(String.format(
266 Locale.ROOT,
267 "%s (%s)",
268 info.getPath(),
269 metadata.get(org.apache.tika.metadata.HttpHeaders.CONTENT_TYPE)
270 ));
271 }
272 }
273
274 @GET
275 @Produces("text/plain")
276 public String getMessage() {
277 return GREETING;
278 }
279
280 @POST
281 @Consumes("multipart/form-data")
282 @Produces("text/plain")
283 @Path("form")
284 public StreamingOutput getTextFromMultipart(Attachment att, @Context final UriInfo info) {
285 return produceText(att.getObject(InputStream.class), att.getHeaders(), info);
286 }
287
288 @PUT
289 @Consumes("*/*")
290 @Produces("text/plain")
291 public StreamingOutput getText(final InputStream is, @Context HttpHeaders httpHeaders, @Context final UriInfo info) {
292 return produceText(is, httpHeaders.getRequestHeaders(), info);
293 }
294
295 public StreamingOutput produceText(final InputStream is, MultivaluedMap<String, String> httpHeaders, final UriInfo info) {
296 final AutoDetectParser parser = createParser(tikaConfig);
297 final Metadata metadata = new Metadata();
298 final ParseContext context = new ParseContext();
299
300 fillMetadata(parser, metadata, context, httpHeaders);
301 fillParseContext(context, httpHeaders, parser);
302
303 logRequest(logger, info, metadata);
304
305 return new StreamingOutput() {
306 public void write(OutputStream outputStream) throws IOException, WebApplicationException {
307 Writer writer = new OutputStreamWriter(outputStream, IOUtils.UTF_8);
308
309 BodyContentHandler body = new BodyContentHandler(new RichTextContentHandler(writer));
310
311 TikaInputStream tis = TikaInputStream.get(is);
312
313 try {
314 parse(parser, logger, info.getPath(), tis, body, metadata, context);
315 } finally {
316 tis.close();
317 }
318 }
319 };
320 }
321
322 @POST
323 @Consumes("multipart/form-data")
324 @Produces("text/html")
325 @Path("form")
326 public StreamingOutput getHTMLFromMultipart(Attachment att, @Context final UriInfo info) {
327 return produceOutput(att.getObject(InputStream.class), att.getHeaders(), info, "html");
328 }
329
330 @PUT
331 @Consumes("*/*")
332 @Produces("text/html")
333 public StreamingOutput getHTML(final InputStream is, @Context HttpHeaders httpHeaders, @Context final UriInfo info) {
334 return produceOutput(is, httpHeaders.getRequestHeaders(), info, "html");
335 }
336
337 @POST
338 @Consumes("multipart/form-data")
339 @Produces("text/xml")
340 @Path("form")
341 public StreamingOutput getXMLFromMultipart(Attachment att, @Context final UriInfo info) {
342 return produceOutput(att.getObject(InputStream.class), att.getHeaders(), info, "xml");
343 }
344
345 @PUT
346 @Consumes("*/*")
347 @Produces("text/xml")
348 public StreamingOutput getXML(final InputStream is, @Context HttpHeaders httpHeaders, @Context final UriInfo info) {
349 return produceOutput(is, httpHeaders.getRequestHeaders(), info, "xml");
350 }
351
352 private StreamingOutput produceOutput(final InputStream is, final MultivaluedMap<String, String> httpHeaders,
353 final UriInfo info, final String format) {
354 final AutoDetectParser parser = createParser(tikaConfig);
355 final Metadata metadata = new Metadata();
356 final ParseContext context = new ParseContext();
357
358 fillMetadata(parser, metadata, context, httpHeaders);
359 fillParseContext(context, httpHeaders, parser);
360
361
362 logRequest(logger, info, metadata);
363
364 return new StreamingOutput() {
365 public void write(OutputStream outputStream)
366 throws IOException, WebApplicationException {
367 Writer writer = new OutputStreamWriter(outputStream, IOUtils.UTF_8);
368 ContentHandler content;
369
370 try {
371 SAXTransformerFactory factory = (SAXTransformerFactory) SAXTransformerFactory.newInstance();
372 TransformerHandler handler = factory.newTransformerHandler();
373 handler.getTransformer().setOutputProperty(OutputKeys.METHOD, format);
374 handler.getTransformer().setOutputProperty(OutputKeys.INDENT, "yes");
375 handler.getTransformer().setOutputProperty(OutputKeys.ENCODING, IOUtils.UTF_8.name());
376 handler.setResult(new StreamResult(writer));
377 content = new ExpandedTitleContentHandler(handler);
378 } catch (TransformerConfigurationException e) {
379 throw new WebApplicationException(e);
380 }
381
382 TikaInputStream tis = TikaInputStream.get(is);
383
384 try {
385 parse(parser, logger, info.getPath(), tis, content, metadata, context);
386 } finally {
387 tis.close();
388 }
389 }
390 };
391 }
392 }
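The header-driven configuration above (fillParseContext / processHeaderConfig) maps `X-Tika-OCR*` and `X-Tika-PDF*` request headers onto config-object fields by reflection: strip the prefix, uncapitalize the remainder, then set the matching field. A minimal standalone sketch of that mapping, using a hypothetical `OcrConfig` stand-in rather than Tika's real `TesseractOCRConfig`:

```java
import java.lang.reflect.Field;
import java.util.LinkedHashMap;
import java.util.Map;

public class HeaderConfigDemo {
    // Hypothetical stand-in for TesseractOCRConfig; field names chosen for the demo.
    static class OcrConfig {
        private String language = "eng";
        private int timeout = 120;
    }

    static final String PREFIX = "X-Tika-OCR";

    // Mirrors processHeaderConfig: strip the prefix, uncapitalize the rest,
    // then set the matching field via reflection, converting by field type.
    static void apply(Map<String, String> headers, Object target) throws Exception {
        for (Map.Entry<String, String> e : headers.entrySet()) {
            if (!e.getKey().startsWith(PREFIX)) continue;
            String prop = e.getKey().substring(PREFIX.length());
            prop = Character.toLowerCase(prop.charAt(0)) + prop.substring(1);
            Field f = target.getClass().getDeclaredField(prop);
            f.setAccessible(true);
            if (f.getType() == String.class) {
                f.set(target, e.getValue());
            } else if (f.getType() == int.class) {
                f.setInt(target, Integer.parseInt(e.getValue()));
            }
        }
    }

    public static void main(String[] args) throws Exception {
        Map<String, String> headers = new LinkedHashMap<>();
        headers.put("X-Tika-OCRLanguage", "deu");
        headers.put("X-Tika-OCRTimeout", "300");
        OcrConfig cfg = new OcrConfig();
        apply(headers, cfg);
        System.out.println(cfg.language + " " + cfg.timeout);  // deu 300
    }
}
```

The real implementation also handles `double` and `boolean` fields the same way; any lookup or parse failure is reported back to the client as a WebApplicationException.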
0 /*
1 * Licensed to the Apache Software Foundation (ASF) under one or more
2 * contributor license agreements. See the NOTICE file distributed with
3 * this work for additional information regarding copyright ownership.
4 * The ASF licenses this file to You under the Apache License, Version 2.0
5 * (the "License"); you may not use this file except in compliance with
6 * the License. You may obtain a copy of the License at
7 *
8 * http://www.apache.org/licenses/LICENSE-2.0
9 *
10 * Unless required by applicable law or agreed to in writing, software
11 * distributed under the License is distributed on an "AS IS" BASIS,
12 * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
13 * See the License for the specific language governing permissions and
14 * limitations under the License.
15 */
16 package org.apache.tika.server.resource;
17
18 import javax.ws.rs.GET;
19 import javax.ws.rs.Path;
20 import javax.ws.rs.Produces;
21
22 import org.apache.tika.Tika;
23 import org.apache.tika.config.TikaConfig;
24
25 @Path("/version")
26 public class TikaVersion {
27 private Tika tika;
28
29 public TikaVersion(TikaConfig tika) {
30 this.tika = new Tika(tika);
31 }
32
33 @GET
34 @Produces("text/plain")
35 public String getVersion() {
36 return tika.toString();
37 }
38 }
0 /*
1 * Licensed to the Apache Software Foundation (ASF) under one or more
2 * contributor license agreements. See the NOTICE file distributed with
3 * this work for additional information regarding copyright ownership.
4 * The ASF licenses this file to You under the Apache License, Version 2.0
5 * (the "License"); you may not use this file except in compliance with
6 * the License. You may obtain a copy of the License at
7 *
8 * http://www.apache.org/licenses/LICENSE-2.0
9 *
10 * Unless required by applicable law or agreed to in writing, software
11 * distributed under the License is distributed on an "AS IS" BASIS,
12 * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
13 * See the License for the specific language governing permissions and
14 * limitations under the License.
15 */
16 package org.apache.tika.server.resource;
17
18 import javax.ws.rs.DELETE;
19 import javax.ws.rs.GET;
20 import javax.ws.rs.HEAD;
21 import javax.ws.rs.OPTIONS;
22 import javax.ws.rs.POST;
23 import javax.ws.rs.PUT;
24 import javax.ws.rs.Path;
25 import javax.ws.rs.Produces;
26
27 import java.lang.annotation.Annotation;
28 import java.lang.reflect.Method;
29 import java.util.ArrayList;
30 import java.util.Arrays;
31 import java.util.Collections;
32 import java.util.Comparator;
33 import java.util.HashMap;
34 import java.util.LinkedList;
35 import java.util.List;
36 import java.util.Map;
37 import java.util.regex.Matcher;
38 import java.util.regex.Pattern;
39
40 import org.apache.cxf.jaxrs.lifecycle.ResourceProvider;
41 import org.apache.tika.Tika;
42 import org.apache.tika.config.TikaConfig;
43 import org.apache.tika.server.HTMLHelper;
44
45 /**
46 * <p>Provides a basic welcome to the Apache Tika Server.</p>
47 */
48 @Path("/")
49 public class TikaWelcome {
50 private static final String DOCS_URL = "https://wiki.apache.org/tika/TikaJAXRS";
51
52 private static final Map<Class<? extends Annotation>, String> HTTP_METHODS =
53 new HashMap<Class<? extends Annotation>, String>();
54
55 static {
56 HTTP_METHODS.put(DELETE.class, "DELETE");
57 HTTP_METHODS.put(GET.class, "GET");
58 HTTP_METHODS.put(HEAD.class, "HEAD");
59 HTTP_METHODS.put(OPTIONS.class, "OPTIONS");
60 HTTP_METHODS.put(POST.class, "POST");
61 HTTP_METHODS.put(PUT.class, "PUT");
62 }
63
64 private Tika tika;
65 private HTMLHelper html;
66 private List<Class<?>> endpoints = new LinkedList<Class<?>>();
67
68 public TikaWelcome(TikaConfig tika, List<ResourceProvider> rCoreProviders) {
69 this.tika = new Tika(tika);
70 this.html = new HTMLHelper();
71 for (ResourceProvider rp : rCoreProviders) {
72 this.endpoints.add(rp.getResourceClass());
73 }
74 }
75
76 protected List<Endpoint> identifyEndpoints() {
77 List<Endpoint> found = new ArrayList<Endpoint>();
78 for (Class<?> endpoint : endpoints) {
79 Path p = endpoint.getAnnotation(Path.class);
80 String basePath = null;
81 if (p != null)
82 basePath = p.value().endsWith("/") ? p.value().substring(0, p.value().length() - 1) : p.value();
83
84 for (Method m : endpoint.getMethods()) {
85 String httpMethod = null;
86 String methodPath = null;
87 String[] produces = null;
88
89 for (Annotation a : m.getAnnotations()) {
90 for (Class<? extends Annotation> httpMethAnn : HTTP_METHODS.keySet()) {
91 if (httpMethAnn.isInstance(a)) {
92 httpMethod = HTTP_METHODS.get(httpMethAnn);
93 }
94 }
95 if (a instanceof Path) {
96 methodPath = ((Path) a).value();
97 }
98 if (a instanceof Produces) {
99 produces = ((Produces) a).value();
100 }
101 }
102
103 if (httpMethod != null) {
104 String mPath = basePath;
105 if (mPath == null) {
106 mPath = "";
107 }
108 if (methodPath != null) {
109 if (methodPath.startsWith("/")) {
110 mPath += methodPath;
111 } else {
112 mPath += "/" + methodPath;
113 }
115 }
116 if (produces == null) {
117 produces = new String[0];
118 }
119 found.add(new Endpoint(endpoint, m, mPath, httpMethod, produces));
120 }
121 }
122 }
123 Collections.sort(found, new Comparator<Endpoint>() {
124 @Override
125 public int compare(Endpoint e1, Endpoint e2) {
126 int res = e1.path.compareTo(e2.path);
127 if (res == 0) {
128 res = e1.methodName.compareTo(e2.methodName);
129 }
130 return res;
131 }
132 });
133 return found;
134 }
135
136 @GET
137 @Produces("text/html")
138 public String getWelcomeHTML() {
139 StringBuffer h = new StringBuffer();
140 String tikaVersion = tika.toString();
141
142 html.generateHeader(h, "Welcome to the " + tikaVersion + " Server");
143
144 h.append("<p>For endpoints, please see <a href=\"");
145 h.append(DOCS_URL);
146 h.append("\">");
147 h.append(DOCS_URL);
148 h.append("</a>");
149
150 // TIKA-1269 -- Miredot documentation
151 // As the SNAPSHOT endpoints are updated, please update the website by running
152 // the server tests and doing step 12.6 of https://wiki.apache.org/tika/ReleaseProcess.
153 Pattern p = Pattern.compile("\\d+\\.\\d+");
154 Matcher m = p.matcher(tikaVersion);
155 if (m.find()) {
156 String versionNumber = m.group();
157 String miredot = "http://tika.apache.org/" + versionNumber + "/miredot/index.html";
158 h.append(" and <a href=\"")
159 .append(miredot)
160 .append("\">")
161 .append(miredot)
162 .append("</a>");
163 }
164 h.append("</p>\n");
165
166 h.append("<ul>\n");
167 for (Endpoint e : identifyEndpoints()) {
168 h.append("<li><b>");
169 h.append(e.httpMethod);
170 h.append("</b> <i><a href=\"");
171 h.append(e.path);
172 h.append("\">");
173 h.append(e.path);
174 h.append("</a></i><br />");
175 h.append("Class: ");
176 h.append(e.className);
177 h.append("<br />Method: ");
178 h.append(e.methodName);
179 for (String produces : e.produces) {
180 h.append("<br />Produces: ");
181 h.append(produces);
182 }
183 h.append("</li>\n");
184 }
185 h.append("</ul>\n");
186
187 html.generateFooter(h);
188 return h.toString();
189 }
190
191 @GET
192 @Produces("text/plain")
193 public String getWelcomePlain() {
194 StringBuilder text = new StringBuilder();
195
196 text.append(tika.toString());
197 text.append("\n");
198 text.append("For endpoints, please see ");
199 text.append(DOCS_URL);
200 text.append("\n\n");
201
202 for (Endpoint e : identifyEndpoints()) {
203 text.append(e.httpMethod);
204 text.append(" ");
205 text.append(e.path);
206 text.append("\n");
207 for (String produces : e.produces) {
208 text.append(" => ");
209 text.append(produces);
210 text.append("\n");
211 }
212 }
213
214 return text.toString();
215 }
216
217 protected class Endpoint {
218 public final String className;
219 public final String methodName;
220 public final String path;
221 public final String httpMethod;
222 public final List<String> produces;
223
224 protected Endpoint(Class<?> endpoint, Method method, String path,
225 String httpMethod, String[] produces) {
226 this.className = endpoint.getCanonicalName();
227 this.methodName = method.getName();
228 this.path = path;
229 this.httpMethod = httpMethod;
230 this.produces = Collections.unmodifiableList(Arrays.asList(produces));
231 }
232 }
233 }
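The Miredot link in getWelcomeHTML is built by pulling the major.minor version out of the Tika version string with the regex `\d+\.\d+`. The same extraction in isolation (the version string below is an example value, not output from a live server):

```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;

public class VersionLinkDemo {
    public static void main(String[] args) {
        // Same pattern TikaWelcome applies to the string returned by Tika#toString().
        String tikaVersion = "Apache Tika 1.8";  // example value
        Pattern p = Pattern.compile("\\d+\\.\\d+");
        Matcher m = p.matcher(tikaVersion);
        if (m.find()) {
            // First major.minor match becomes the documentation path segment.
            String miredot = "http://tika.apache.org/" + m.group() + "/miredot/index.html";
            System.out.println(miredot);
        }
    }
}
```

Note that `find()` takes the first match, so a snapshot string such as "Apache Tika 1.9-SNAPSHOT" still yields a "1.9" link.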
0 /*
1 * Licensed to the Apache Software Foundation (ASF) under one or more
2 * contributor license agreements. See the NOTICE file distributed with
3 * this work for additional information regarding copyright ownership.
4 * The ASF licenses this file to You under the Apache License, Version 2.0
5 * (the "License"); you may not use this file except in compliance with
6 * the License. You may obtain a copy of the License at
7 *
8 * http://www.apache.org/licenses/LICENSE-2.0
9 *
10 * Unless required by applicable law or agreed to in writing, software
11 * distributed under the License is distributed on an "AS IS" BASIS,
12 * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
13 * See the License for the specific language governing permissions and
14 * limitations under the License.
15 */
16
17 package org.apache.tika.server.resource;
18
19 import javax.ws.rs.PUT;
20 import javax.ws.rs.Path;
21 import javax.ws.rs.Produces;
22 import javax.ws.rs.WebApplicationException;
23 import javax.ws.rs.core.Context;
24 import javax.ws.rs.core.HttpHeaders;
25 import javax.ws.rs.core.Response;
26 import javax.ws.rs.core.UriInfo;
27
28 import java.io.ByteArrayInputStream;
29 import java.io.ByteArrayOutputStream;
30 import java.io.IOException;
31 import java.io.InputStream;
32 import java.io.OutputStream;
33 import java.io.OutputStreamWriter;
34 import java.util.ArrayList;
35 import java.util.Arrays;
36 import java.util.HashMap;
37 import java.util.Map;
38
39 import au.com.bytecode.opencsv.CSVWriter;
40 import org.apache.commons.lang.mutable.MutableInt;
41 import org.apache.commons.logging.Log;
42 import org.apache.commons.logging.LogFactory;
43 import org.apache.poi.poifs.filesystem.DirectoryEntry;
44 import org.apache.poi.poifs.filesystem.DocumentEntry;
45 import org.apache.poi.poifs.filesystem.DocumentInputStream;
46 import org.apache.poi.poifs.filesystem.Entry;
47 import org.apache.poi.poifs.filesystem.Ole10Native;
48 import org.apache.poi.poifs.filesystem.Ole10NativeException;
49 import org.apache.poi.poifs.filesystem.POIFSFileSystem;
50 import org.apache.poi.util.IOUtils;
51 import org.apache.tika.config.TikaConfig;
52 import org.apache.tika.extractor.EmbeddedDocumentExtractor;
53 import org.apache.tika.io.TikaInputStream;
54 import org.apache.tika.metadata.Metadata;
55 import org.apache.tika.metadata.TikaMetadataKeys;
56 import org.apache.tika.mime.MimeTypeException;
57 import org.apache.tika.parser.AutoDetectParser;
58 import org.apache.tika.parser.ParseContext;
59 import org.apache.tika.parser.microsoft.OfficeParser;
60 import org.apache.tika.sax.BodyContentHandler;
61 import org.apache.tika.server.RichTextContentHandler;
62 import org.xml.sax.ContentHandler;
63 import org.xml.sax.SAXException;
64 import org.xml.sax.helpers.DefaultHandler;
65
66 @Path("/unpack")
67 public class UnpackerResource {
68 public static final String TEXT_FILENAME = "__TEXT__";
69 private static final Log logger = LogFactory.getLog(UnpackerResource.class);
70 private static final String META_FILENAME = "__METADATA__";
71
72 private TikaConfig tikaConfig;
73
74 public UnpackerResource(TikaConfig tikaConfig) {
75 this.tikaConfig = tikaConfig;
76 }
77
78 public static void metadataToCsv(Metadata metadata, OutputStream outputStream) throws IOException {
79 CSVWriter writer = new CSVWriter(new OutputStreamWriter(outputStream, org.apache.tika.io.IOUtils.UTF_8));
80
81 for (String name : metadata.names()) {
82 String[] values = metadata.getValues(name);
83 ArrayList<String> list = new ArrayList<String>(values.length + 1);
84 list.add(name);
85 list.addAll(Arrays.asList(values));
86 writer.writeNext(list.toArray(values));
87 }
88
89 writer.close();
90 }
91
92 @Path("/{id:(/.*)?}")
93 @PUT
94 @Produces({"application/zip", "application/x-tar"})
95 public Map<String, byte[]> unpack(
96 InputStream is,
97 @Context HttpHeaders httpHeaders,
98 @Context UriInfo info
99 ) throws Exception {
100 return process(is, httpHeaders, info, false);
101 }
102
103 @Path("/all{id:(/.*)?}")
104 @PUT
105 @Produces({"application/zip", "application/x-tar"})
106 public Map<String, byte[]> unpackAll(
107 InputStream is,
108 @Context HttpHeaders httpHeaders,
109 @Context UriInfo info
110 ) throws Exception {
111 return process(is, httpHeaders, info, true);
112 }
113
114 private Map<String, byte[]> process(
115 InputStream is,
116 @Context HttpHeaders httpHeaders,
117 @Context UriInfo info,
118 boolean saveAll
119 ) throws Exception {
120 Metadata metadata = new Metadata();
121 ParseContext pc = new ParseContext();
122
123 AutoDetectParser parser = TikaResource.createParser(tikaConfig);
124
125 TikaResource.fillMetadata(parser, metadata, pc, httpHeaders.getRequestHeaders());
126 TikaResource.logRequest(logger, info, metadata);
127
128 ContentHandler ch;
129 ByteArrayOutputStream text = new ByteArrayOutputStream();
130
131 if (saveAll) {
132 ch = new BodyContentHandler(new RichTextContentHandler(new OutputStreamWriter(text, org.apache.tika.io.IOUtils.UTF_8)));
133 } else {
134 ch = new DefaultHandler();
135 }
136
137 Map<String, byte[]> files = new HashMap<String, byte[]>();
138 MutableInt count = new MutableInt();
139
140 pc.set(EmbeddedDocumentExtractor.class, new MyEmbeddedDocumentExtractor(count, files));
141 TikaResource.parse(parser, logger, info.getPath(), is, ch, metadata, pc);
142
143 if (count.intValue() == 0 && !saveAll) {
144 throw new WebApplicationException(Response.Status.NO_CONTENT);
145 }
146
147 if (saveAll) {
148 files.put(TEXT_FILENAME, text.toByteArray());
149
150 ByteArrayOutputStream metaStream = new ByteArrayOutputStream();
151 metadataToCsv(metadata, metaStream);
152
153 files.put(META_FILENAME, metaStream.toByteArray());
154 }
155
156 return files;
157 }
158
159 private class MyEmbeddedDocumentExtractor implements EmbeddedDocumentExtractor {
160 private final MutableInt count;
161 private final Map<String, byte[]> zout;
162
163 MyEmbeddedDocumentExtractor(MutableInt count, Map<String, byte[]> zout) {
164 this.count = count;
165 this.zout = zout;
166 }
167
168 public boolean shouldParseEmbedded(Metadata metadata) {
169 return true;
170 }
171
172 public void parseEmbedded(InputStream inputStream, ContentHandler contentHandler, Metadata metadata, boolean b) throws SAXException, IOException {
173 ByteArrayOutputStream bos = new ByteArrayOutputStream();
174 IOUtils.copy(inputStream, bos);
175 byte[] data = bos.toByteArray();
176
177 String name = metadata.get(TikaMetadataKeys.RESOURCE_NAME_KEY);
178 String contentType = metadata.get(org.apache.tika.metadata.HttpHeaders.CONTENT_TYPE);
179
180 if (name == null) {
181 name = Integer.toString(count.intValue());
182 }
183
184 if (!name.contains(".") && contentType != null) {
185 try {
186 String ext = tikaConfig.getMimeRepository().forName(contentType).getExtension();
187
188 if (ext != null) {
189 name += ext;
190 }
191 } catch (MimeTypeException e) {
192 logger.warn("Unexpected MimeTypeException", e);
193 }
194 }
195
196 if ("application/vnd.openxmlformats-officedocument.oleObject".equals(contentType)) {
197 POIFSFileSystem poifs = new POIFSFileSystem(new ByteArrayInputStream(data));
198 OfficeParser.POIFSDocumentType type = OfficeParser.POIFSDocumentType.detectType(poifs);
199
200 if (type == OfficeParser.POIFSDocumentType.OLE10_NATIVE) {
201 try {
202 Ole10Native ole = Ole10Native.createFromEmbeddedOleObject(poifs);
203 if (ole.getDataSize() > 0) {
204 String label = ole.getLabel();
205
206 if (label.startsWith("ole-")) {
207 label = Integer.toString(count.intValue()) + '-' + label;
208 }
209
210 name = label;
211
212 data = ole.getDataBuffer();
213 }
214 } catch (Ole10NativeException ex) {
215 logger.warn("Skipping invalid part", ex);
216 }
217 } else {
218 name += '.' + type.getExtension();
219 }
220 }
221
222 final String finalName = name;
223
224 if (data.length > 0) {
225 zout.put(finalName, data);
226
227 count.increment();
228 } else {
229 if (inputStream instanceof TikaInputStream) {
230 TikaInputStream tin = (TikaInputStream) inputStream;
231
232 if (tin.getOpenContainer() != null && tin.getOpenContainer() instanceof DirectoryEntry) {
233 POIFSFileSystem fs = new POIFSFileSystem();
234 copy((DirectoryEntry) tin.getOpenContainer(), fs.getRoot());
235 ByteArrayOutputStream bos2 = new ByteArrayOutputStream();
236 fs.writeFilesystem(bos2);
237 bos2.close();
238
239 zout.put(finalName, bos2.toByteArray());
240 }
241 }
242 }
243 }
244
245 protected void copy(DirectoryEntry sourceDir, DirectoryEntry destDir)
246 throws IOException {
247 for (Entry entry : sourceDir) {
248 if (entry instanceof DirectoryEntry) {
249 // Need to recurse
250 DirectoryEntry newDir = destDir.createDirectory(entry.getName());
251 copy((DirectoryEntry) entry, newDir);
252 } else {
253 // Copy entry
254 InputStream contents = new DocumentInputStream((DocumentEntry) entry);
255 try {
256 destDir.createDocument(entry.getName(), contents);
257 } finally {
258 contents.close();
259 }
260 }
261 }
262 }
263 }
264 }
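MyEmbeddedDocumentExtractor names each extracted part in two steps: fall back to the running counter when the resource has no name, then append an extension looked up from the declared content type when the name has none. A self-contained sketch of that naming rule, with a small hypothetical map standing in for Tika's MIME repository (`tikaConfig.getMimeRepository().forName(...).getExtension()`):

```java
import java.util.Map;

public class EmbeddedNameDemo {
    // Tiny stand-in for Tika's MIME-type-to-extension lookup.
    static final Map<String, String> EXT = Map.of(
            "image/png", ".png",
            "application/pdf", ".pdf");

    // Mirrors the naming logic in MyEmbeddedDocumentExtractor.parseEmbedded.
    static String resolveName(String name, String contentType, int count) {
        if (name == null) {
            name = Integer.toString(count);        // unnamed part: use the counter
        }
        if (!name.contains(".") && contentType != null) {
            String ext = EXT.get(contentType);     // no extension: derive one from the type
            if (ext != null) {
                name += ext;
            }
        }
        return name;
    }

    public static void main(String[] args) {
        System.out.println(resolveName(null, "image/png", 3));        // 3.png
        System.out.println(resolveName("report", "application/pdf", 4)); // report.pdf
        System.out.println(resolveName("photo.jpg", "image/jpeg", 5));   // photo.jpg
    }
}
```

The real extractor additionally special-cases OLE 1.0 embedded objects, relabeling and re-reading their payload before the bytes are added to the result map.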
0 /*
1 * Licensed to the Apache Software Foundation (ASF) under one or more
2 * contributor license agreements. See the NOTICE file distributed with
3 * this work for additional information regarding copyright ownership.
4 * The ASF licenses this file to You under the Apache License, Version 2.0
5 * (the "License"); you may not use this file except in compliance with
6 * the License. You may obtain a copy of the License at
7 *
8 * http://www.apache.org/licenses/LICENSE-2.0
9 *
10 * Unless required by applicable law or agreed to in writing, software
11 * distributed under the License is distributed on an "AS IS" BASIS,
12 * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
13 * See the License for the specific language governing permissions and
14 * limitations under the License.
15 */
16
17 package org.apache.tika.server.writer;
18
19 import javax.ws.rs.Produces;
20 import javax.ws.rs.WebApplicationException;
21 import javax.ws.rs.core.MediaType;
22 import javax.ws.rs.core.MultivaluedMap;
23 import javax.ws.rs.ext.MessageBodyWriter;
24 import javax.ws.rs.ext.Provider;
25
26 import java.io.IOException;
27 import java.io.OutputStream;
28 import java.io.OutputStreamWriter;
29 import java.lang.annotation.Annotation;
30 import java.lang.reflect.Type;
31 import java.util.ArrayList;
32 import java.util.Arrays;
33
34 import au.com.bytecode.opencsv.CSVWriter;
35 import org.apache.tika.io.IOUtils;
36 import org.apache.tika.metadata.Metadata;
37
38 @Provider
39 @Produces("text/csv")
40 public class CSVMessageBodyWriter implements MessageBodyWriter<Metadata> {
41
42 public boolean isWriteable(Class<?> type, Type genericType, Annotation[] annotations, MediaType mediaType) {
43 return Metadata.class.isAssignableFrom(type);
44 }
45
46 public long getSize(Metadata data, Class<?> type, Type genericType, Annotation[] annotations, MediaType mediaType) {
47 return -1;
48 }
49
50 @Override
51 @SuppressWarnings("resource")
52 public void writeTo(Metadata metadata, Class<?> type, Type genericType, Annotation[] annotations,
53 MediaType mediaType, MultivaluedMap<String, Object> httpHeaders, OutputStream entityStream) throws IOException,
54 WebApplicationException {
55
56 CSVWriter writer = new CSVWriter(new OutputStreamWriter(entityStream, IOUtils.UTF_8));
57
58 for (String name : metadata.names()) {
59 String[] values = metadata.getValues(name);
60 ArrayList<String> list = new ArrayList<String>(values.length + 1);
61 list.add(name);
62 list.addAll(Arrays.asList(values));
63 writer.writeNext(list.toArray(values));
64 }
65
66 // Don't close, just flush the stream
67 writer.flush();
68 }
69 }
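One subtlety in the CSV rows above: writeNext is handed `list.toArray(values)`, where `values` is one element shorter than the list (the metadata name was prepended). In that case `ArrayList.toArray(T[])` allocates a correctly sized new array and leaves the passed-in one untouched, so the row still comes out as name followed by all values. A quick demonstration:

```java
import java.util.ArrayList;
import java.util.Arrays;

public class ToArrayDemo {
    public static void main(String[] args) {
        String[] values = {"a", "b"};
        ArrayList<String> list = new ArrayList<>(values.length + 1);
        list.add("name");
        list.addAll(Arrays.asList(values));
        // The argument array is one element too small, so toArray
        // allocates and returns a fresh String[3]; 'values' is untouched.
        String[] row = list.toArray(values);
        System.out.println(row.length + " " + String.join(",", row));
        System.out.println(String.join(",", values));
    }
}
```

Passing `new String[0]` would make the allocation explicit; the existing call works only because of this documented too-small-array behavior.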
0 /*
1 * Licensed to the Apache Software Foundation (ASF) under one or more
2 * contributor license agreements. See the NOTICE file distributed with
3 * this work for additional information regarding copyright ownership.
4 * The ASF licenses this file to You under the Apache License, Version 2.0
5 * (the "License"); you may not use this file except in compliance with
6 * the License. You may obtain a copy of the License at
7 *
8 * http://www.apache.org/licenses/LICENSE-2.0
9 *
10 * Unless required by applicable law or agreed to in writing, software
11 * distributed under the License is distributed on an "AS IS" BASIS,
12 * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
13 * See the License for the specific language governing permissions and
14 * limitations under the License.
15 */
16
17 package org.apache.tika.server.writer;
18
19 import javax.ws.rs.Produces;
20 import javax.ws.rs.WebApplicationException;
21 import javax.ws.rs.core.MediaType;
22 import javax.ws.rs.core.MultivaluedMap;
23 import javax.ws.rs.ext.MessageBodyWriter;
24 import javax.ws.rs.ext.Provider;
25
26 import java.io.IOException;
27 import java.io.OutputStream;
28 import java.io.OutputStreamWriter;
29 import java.io.Writer;
30 import java.lang.annotation.Annotation;
31 import java.lang.reflect.Type;
32
33 import org.apache.tika.exception.TikaException;
34 import org.apache.tika.io.IOUtils;
35 import org.apache.tika.metadata.Metadata;
36 import org.apache.tika.metadata.serialization.JsonMetadata;
37
38 @Provider
39 @Produces(MediaType.APPLICATION_JSON)
40 public class JSONMessageBodyWriter implements MessageBodyWriter<Metadata> {
41
42 public boolean isWriteable(Class<?> type, Type genericType, Annotation[] annotations, MediaType mediaType) {
43 return Metadata.class.isAssignableFrom(type);
44 }
45
46 public long getSize(Metadata data, Class<?> type, Type genericType, Annotation[] annotations, MediaType mediaType) {
47 return -1;
48 }
49
50 @Override
51 public void writeTo(Metadata metadata, Class<?> type, Type genericType, Annotation[] annotations,
52 MediaType mediaType, MultivaluedMap<String, Object> httpHeaders, OutputStream entityStream) throws IOException,
53 WebApplicationException {
54 try {
55 Writer writer = new OutputStreamWriter(entityStream, IOUtils.UTF_8);
56 JsonMetadata.toJson(metadata, writer);
57 writer.flush();
58 } catch (TikaException e) {
59 throw new IOException(e);
60 }
61 entityStream.flush();
62 }
63 }
0 /*
1 * Licensed to the Apache Software Foundation (ASF) under one or more
2 * contributor license agreements. See the NOTICE file distributed with
3 * this work for additional information regarding copyright ownership.
4 * The ASF licenses this file to You under the Apache License, Version 2.0
5 * (the "License"); you may not use this file except in compliance with
6 * the License. You may obtain a copy of the License at
7 *
8 * http://www.apache.org/licenses/LICENSE-2.0
9 *
10 * Unless required by applicable law or agreed to in writing, software
11 * distributed under the License is distributed on an "AS IS" BASIS,
12 * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
13 * See the License for the specific language governing permissions and
14 * limitations under the License.
15 */
16
17 package org.apache.tika.server.writer;
18
19 import javax.ws.rs.Produces;
20 import javax.ws.rs.WebApplicationException;
21 import javax.ws.rs.core.MediaType;
22 import javax.ws.rs.core.MultivaluedMap;
23 import javax.ws.rs.ext.MessageBodyWriter;
24 import javax.ws.rs.ext.Provider;
25
26 import java.io.IOException;
27 import java.io.OutputStream;
28 import java.io.OutputStreamWriter;
29 import java.io.Writer;
30 import java.lang.annotation.Annotation;
31 import java.lang.reflect.Type;
32
33 import org.apache.tika.exception.TikaException;
34 import org.apache.tika.io.IOUtils;
35 import org.apache.tika.metadata.serialization.JsonMetadataList;
36 import org.apache.tika.server.MetadataList;
37
38 @Provider
39 @Produces(MediaType.APPLICATION_JSON)
40 public class MetadataListMessageBodyWriter implements MessageBodyWriter<MetadataList> {
41
42 public boolean isWriteable(Class<?> type, Type genericType, Annotation[] annotations, MediaType mediaType) {
43 if (!MediaType.APPLICATION_JSON_TYPE.equals(mediaType)) {
44 return false;
45 }
46 return type.isAssignableFrom(MetadataList.class);
47 }
48
49 public long getSize(MetadataList data, Class<?> type, Type genericType, Annotation[] annotations, MediaType mediaType) {
50 return -1;
51 }
52
53 @Override
54 public void writeTo(MetadataList list, Class<?> type, Type genericType, Annotation[] annotations,
55 MediaType mediaType, MultivaluedMap<String, Object> httpHeaders, OutputStream entityStream) throws IOException,
56 WebApplicationException {
57 try {
58 Writer writer = new OutputStreamWriter(entityStream, IOUtils.UTF_8);
59 JsonMetadataList.toJson(list.getMetadata(), writer);
60 writer.flush();
61 } catch (TikaException e) {
62 throw new IOException(e);
63 }
64 entityStream.flush();
65 }
66 }
0 /*
1 * Licensed to the Apache Software Foundation (ASF) under one or more
2 * contributor license agreements. See the NOTICE file distributed with
3 * this work for additional information regarding copyright ownership.
4 * The ASF licenses this file to You under the Apache License, Version 2.0
5 * (the "License"); you may not use this file except in compliance with
6 * the License. You may obtain a copy of the License at
7 *
8 * http://www.apache.org/licenses/LICENSE-2.0
9 *
10 * Unless required by applicable law or agreed to in writing, software
11 * distributed under the License is distributed on an "AS IS" BASIS,
12 * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
13 * See the License for the specific language governing permissions and
14 * limitations under the License.
15 */
16
17 package org.apache.tika.server.writer;
18
19 import javax.ws.rs.Produces;
20 import javax.ws.rs.WebApplicationException;
21 import javax.ws.rs.core.MediaType;
22 import javax.ws.rs.core.MultivaluedMap;
23 import javax.ws.rs.ext.MessageBodyWriter;
24 import javax.ws.rs.ext.Provider;
25
26 import java.io.IOException;
27 import java.io.OutputStream;
28 import java.lang.annotation.Annotation;
29 import java.lang.reflect.Type;
30 import java.util.Map;
31
32 import org.apache.commons.compress.archivers.tar.TarArchiveEntry;
33 import org.apache.commons.compress.archivers.tar.TarArchiveOutputStream;
34
35 @Provider
36 @Produces("application/x-tar")
37 public class TarWriter implements MessageBodyWriter<Map<String, byte[]>> {
38 private static void tarStoreBuffer(TarArchiveOutputStream zip, String name, byte[] dataBuffer) throws IOException {
39 TarArchiveEntry entry = new TarArchiveEntry(name);
40
41 entry.setSize(dataBuffer.length);
42
43 zip.putArchiveEntry(entry);
44
45 zip.write(dataBuffer);
46
47 zip.closeArchiveEntry();
48 }
49
50 public boolean isWriteable(Class<?> type, Type genericType, Annotation[] annotations, MediaType mediaType) {
51 return Map.class.isAssignableFrom(type);
52 }
53
54 public long getSize(Map<String, byte[]> stringMap, Class<?> type, Type genericType, Annotation[] annotations, MediaType mediaType) {
55 return -1;
56 }
57
58 public void writeTo(Map<String, byte[]> parts, Class<?> type, Type genericType, Annotation[] annotations, MediaType mediaType, MultivaluedMap<String, Object> httpHeaders, OutputStream entityStream) throws IOException, WebApplicationException {
59 TarArchiveOutputStream zip = new TarArchiveOutputStream(entityStream);
60
61 for (Map.Entry<String, byte[]> entry : parts.entrySet()) {
62 tarStoreBuffer(zip, entry.getKey(), entry.getValue());
63 }
64
65 zip.close();
66 }
67 }
0 /*
1 * Licensed to the Apache Software Foundation (ASF) under one or more
2 * contributor license agreements. See the NOTICE file distributed with
3 * this work for additional information regarding copyright ownership.
4 * The ASF licenses this file to You under the Apache License, Version 2.0
5 * (the "License"); you may not use this file except in compliance with
6 * the License. You may obtain a copy of the License at
7 *
8 * http://www.apache.org/licenses/LICENSE-2.0
9 *
10 * Unless required by applicable law or agreed to in writing, software
11 * distributed under the License is distributed on an "AS IS" BASIS,
12 * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
13 * See the License for the specific language governing permissions and
14 * limitations under the License.
15 */
16
17 package org.apache.tika.server.writer;
18
19 import javax.ws.rs.Produces;
20 import javax.ws.rs.WebApplicationException;
21 import javax.ws.rs.core.MediaType;
22 import javax.ws.rs.core.MultivaluedMap;
23 import javax.ws.rs.ext.MessageBodyWriter;
24 import javax.ws.rs.ext.Provider;
25
26 import java.io.IOException;
27 import java.io.OutputStream;
28 import java.io.OutputStreamWriter;
29 import java.io.Writer;
30 import java.lang.annotation.Annotation;
31 import java.lang.reflect.Type;
32
33 import org.apache.tika.io.IOUtils;
34 import org.apache.tika.metadata.Metadata;
35
36 /**
37 * Returns a simple text string for a particular metadata value.
38 * This assumes that the metadata object has exactly one key;
39 * if there are no keys or more than one key, this will throw a webapp exception.
40 * <p/>
41 * For that one key, the first value returned is written.
42 */
43 @Provider
44 @Produces(MediaType.TEXT_PLAIN)
45 public class TextMessageBodyWriter implements MessageBodyWriter<Metadata> {
46
47 public boolean isWriteable(Class<?> type, Type genericType, Annotation[] annotations, MediaType mediaType) {
48 return mediaType.equals(MediaType.TEXT_PLAIN_TYPE) && Metadata.class.isAssignableFrom(type);
49 }
50
51 public long getSize(Metadata data, Class<?> type, Type genericType, Annotation[] annotations, MediaType mediaType) {
52 return -1;
53 }
54
55 @Override
56 @SuppressWarnings("resource")
57 public void writeTo(Metadata metadata, Class<?> type, Type genericType, Annotation[] annotations,
58 MediaType mediaType, MultivaluedMap<String, Object> httpHeaders, OutputStream entityStream) throws IOException,
59 WebApplicationException {
60
61 if (metadata.names().length != 1) {
62 throw new WebApplicationException("Metadata object must only have one entry!");
63 }
64 Writer writer = new OutputStreamWriter(entityStream, IOUtils.UTF_8);
65
66 for (String name : metadata.names()) {
67 writer.write(metadata.get(name));
68 }
69
70 // Don't close, just flush the stream
71 writer.flush();
72 }
73 }
74
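Editor's note: every `writeTo` above flushes the entity stream but deliberately never closes it, because the JAX-RS container owns that stream. A minimal JDK-only sketch of the same flush-without-close pattern (class name `FlushNotClose` is hypothetical, not part of Tika):

```java
import java.io.ByteArrayOutputStream;
import java.io.OutputStreamWriter;
import java.io.Writer;
import java.nio.charset.StandardCharsets;

public class FlushNotClose {
    // Writes text through a Writer but leaves the underlying stream open,
    // mirroring how the MessageBodyWriters above hand the entity stream
    // back to the container instead of closing it.
    static byte[] writeValue(ByteArrayOutputStream out, String value) throws Exception {
        Writer writer = new OutputStreamWriter(out, StandardCharsets.UTF_8);
        writer.write(value);
        writer.flush(); // push buffered chars into 'out'...
        // ...but do NOT call writer.close(), which would close 'out' too
        return out.toByteArray();
    }

    public static void main(String[] args) throws Exception {
        ByteArrayOutputStream out = new ByteArrayOutputStream();
        byte[] first = writeValue(out, "text/csv");
        out.write('\n'); // stream is still usable after the flush
        if (first.length != 8 || out.size() != 9) throw new AssertionError();
    }
}
```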
0 /*
1 * Licensed to the Apache Software Foundation (ASF) under one or more
2 * contributor license agreements. See the NOTICE file distributed with
3 * this work for additional information regarding copyright ownership.
4 * The ASF licenses this file to You under the Apache License, Version 2.0
5 * (the "License"); you may not use this file except in compliance with
6 * the License. You may obtain a copy of the License at
7 *
8 * http://www.apache.org/licenses/LICENSE-2.0
9 *
10 * Unless required by applicable law or agreed to in writing, software
11 * distributed under the License is distributed on an "AS IS" BASIS,
12 * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
13 * See the License for the specific language governing permissions and
14 * limitations under the License.
15 */
16
17 package org.apache.tika.server.writer;
18
19 import javax.ws.rs.Produces;
20 import javax.ws.rs.WebApplicationException;
21 import javax.ws.rs.core.MediaType;
22 import javax.ws.rs.core.MultivaluedMap;
23 import javax.ws.rs.ext.MessageBodyWriter;
24 import javax.ws.rs.ext.Provider;
25
26 import java.io.IOException;
27 import java.io.OutputStream;
28 import java.io.OutputStreamWriter;
29 import java.io.Writer;
30 import java.lang.annotation.Annotation;
31 import java.lang.reflect.Type;
32
33 import org.apache.tika.exception.TikaException;
34 import org.apache.tika.io.IOUtils;
35 import org.apache.tika.metadata.Metadata;
36 import org.apache.tika.xmp.XMPMetadata;
37
38 @Provider
39 @Produces("application/rdf+xml")
40 public class XMPMessageBodyWriter implements MessageBodyWriter<Metadata> {
41
42 private static MediaType RDF_XML = MediaType.valueOf("application/rdf+xml");
43
44 public boolean isWriteable(Class<?> type, Type genericType, Annotation[] annotations, MediaType mediaType) {
45 return mediaType.equals(RDF_XML) && Metadata.class.isAssignableFrom(type);
46 }
47
48 public long getSize(Metadata data, Class<?> type, Type genericType, Annotation[] annotations, MediaType mediaType) {
49 return -1;
50 }
51
52 @Override
53 public void writeTo(Metadata metadata, Class<?> type, Type genericType, Annotation[] annotations,
54 MediaType mediaType, MultivaluedMap<String, Object> httpHeaders, OutputStream entityStream) throws IOException,
55 WebApplicationException {
56 try {
57 Writer writer = new OutputStreamWriter(entityStream, IOUtils.UTF_8);
58 XMPMetadata xmp = new XMPMetadata(metadata);
59 writer.write(xmp.toString());
60 writer.flush();
61 } catch (TikaException e) {
62 throw new IOException(e);
63 }
64 entityStream.flush();
65 }
66 }
0 /*
1 * Licensed to the Apache Software Foundation (ASF) under one or more
2 * contributor license agreements. See the NOTICE file distributed with
3 * this work for additional information regarding copyright ownership.
4 * The ASF licenses this file to You under the Apache License, Version 2.0
5 * (the "License"); you may not use this file except in compliance with
6 * the License. You may obtain a copy of the License at
7 *
8 * http://www.apache.org/licenses/LICENSE-2.0
9 *
10 * Unless required by applicable law or agreed to in writing, software
11 * distributed under the License is distributed on an "AS IS" BASIS,
12 * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
13 * See the License for the specific language governing permissions and
14 * limitations under the License.
15 */
16
17 package org.apache.tika.server.writer;
18
19 import javax.ws.rs.Produces;
20 import javax.ws.rs.WebApplicationException;
21 import javax.ws.rs.core.MediaType;
22 import javax.ws.rs.core.MultivaluedMap;
23 import javax.ws.rs.ext.MessageBodyWriter;
24 import javax.ws.rs.ext.Provider;
25
26 import java.io.IOException;
27 import java.io.OutputStream;
28 import java.lang.annotation.Annotation;
29 import java.lang.reflect.Type;
30 import java.util.Map;
31 import java.util.UUID;
32 import java.util.zip.CRC32;
33 import java.util.zip.ZipEntry;
34 import java.util.zip.ZipException;
35 import java.util.zip.ZipOutputStream;
36
37 import org.apache.commons.compress.archivers.zip.ZipArchiveEntry;
38 import org.apache.commons.compress.archivers.zip.ZipArchiveOutputStream;
39
40 @Provider
41 @Produces("application/zip")
42 public class ZipWriter implements MessageBodyWriter<Map<String, byte[]>> {
43 private static void zipStoreBuffer(ZipArchiveOutputStream zip, String name, byte[] dataBuffer) throws IOException {
44 ZipEntry zipEntry = new ZipEntry(name != null ? name : UUID.randomUUID().toString());
45 zipEntry.setMethod(ZipOutputStream.STORED);
46
47 zipEntry.setSize(dataBuffer.length);
48 CRC32 crc32 = new CRC32();
49 crc32.update(dataBuffer);
50 zipEntry.setCrc(crc32.getValue());
51
52 try {
53 zip.putArchiveEntry(new ZipArchiveEntry(zipEntry));
54 } catch (ZipException ex) {
55 if (name != null) {
56 zipStoreBuffer(zip, "x-" + name, dataBuffer);
57 return;
58 }
59 }
60
61 zip.write(dataBuffer);
62
63 zip.closeArchiveEntry();
64 }
65
66 public boolean isWriteable(Class<?> type, Type genericType, Annotation[] annotations, MediaType mediaType) {
67 return Map.class.isAssignableFrom(type);
68 }
69
70 public long getSize(Map<String, byte[]> stringMap, Class<?> type, Type genericType, Annotation[] annotations, MediaType mediaType) {
71 return -1;
72 }
73
74 public void writeTo(Map<String, byte[]> parts, Class<?> type, Type genericType, Annotation[] annotations, MediaType mediaType, MultivaluedMap<String, Object> httpHeaders, OutputStream entityStream) throws IOException, WebApplicationException {
75 ZipArchiveOutputStream zip = new ZipArchiveOutputStream(entityStream);
76
77 zip.setMethod(ZipArchiveOutputStream.STORED);
78
79 for (Map.Entry<String, byte[]> entry : parts.entrySet()) {
80 zipStoreBuffer(zip, entry.getKey(), entry.getValue());
81 }
82
83 zip.close();
84 }
85 }
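Editor's note: the CRC32 precomputation in `zipStoreBuffer` is required by the zip format — a STORED (uncompressed) entry must declare its size and CRC-32 before any data is written. A self-contained JDK-only sketch of the same requirement (class name `StoredZipSketch` is hypothetical, using `java.util.zip` rather than commons-compress):

```java
import java.io.ByteArrayOutputStream;
import java.util.zip.CRC32;
import java.util.zip.ZipEntry;
import java.util.zip.ZipOutputStream;

public class StoredZipSketch {
    // For STORED entries the zip local file header carries the size and
    // CRC-32 up front, so both must be computed before putNextEntry().
    static byte[] storeEntry(String name, byte[] data) throws Exception {
        ByteArrayOutputStream bos = new ByteArrayOutputStream();
        ZipOutputStream zip = new ZipOutputStream(bos);
        ZipEntry entry = new ZipEntry(name);
        entry.setMethod(ZipEntry.STORED);
        entry.setSize(data.length);
        CRC32 crc = new CRC32();
        crc.update(data);
        entry.setCrc(crc.getValue()); // omitting this throws ZipException
        zip.putNextEntry(entry);
        zip.write(data);
        zip.closeEntry();
        zip.close();
        return bos.toByteArray();
    }
}
```

Skipping `setSize` or `setCrc` on a STORED entry makes `putNextEntry` throw, which is why the writer computes both before opening the archive entry.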
0 /*
1 * Licensed to the Apache Software Foundation (ASF) under one or more
2 * contributor license agreements. See the NOTICE file distributed with
3 * this work for additional information regarding copyright ownership.
4 * The ASF licenses this file to You under the Apache License, Version 2.0
5 * (the "License"); you may not use this file except in compliance with
6 * the License. You may obtain a copy of the License at
7 *
8 * http://www.apache.org/licenses/LICENSE-2.0
9 *
10 * Unless required by applicable law or agreed to in writing, software
11 * distributed under the License is distributed on an "AS IS" BASIS,
12 * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
13 * See the License for the specific language governing permissions and
14 * limitations under the License.
15 */
16
17 package org.apache.tika.server;
18
19 import static org.junit.Assert.assertTrue;
20 import static org.junit.Assert.assertFalse;
21
22 import java.io.ByteArrayOutputStream;
23 import java.io.File;
24 import java.io.FileOutputStream;
25 import java.io.IOException;
26 import java.io.InputStream;
27 import java.util.Enumeration;
28 import java.util.HashMap;
29 import java.util.Map;
30
31 import org.apache.commons.codec.digest.DigestUtils;
32 import org.apache.commons.compress.archivers.ArchiveEntry;
33 import org.apache.commons.compress.archivers.ArchiveInputStream;
34 import org.apache.commons.compress.archivers.zip.ZipArchiveEntry;
35 import org.apache.commons.compress.archivers.zip.ZipFile;
36 import org.apache.cxf.binding.BindingFactoryManager;
37 import org.apache.cxf.endpoint.Server;
38 import org.apache.cxf.jaxrs.JAXRSBindingFactory;
39 import org.apache.cxf.jaxrs.JAXRSServerFactoryBean;
40 import org.apache.tika.config.TikaConfig;
41 import org.apache.tika.io.IOUtils;
42 import org.junit.After;
43 import org.junit.Before;
44
45 public abstract class CXFTestBase {
46 protected static final String endPoint =
47 "http://localhost:" + TikaServerCli.DEFAULT_PORT;
48 protected Server server;
49 protected TikaConfig tika;
50
51 @Before
52 public void setUp() {
53 this.tika = TikaConfig.getDefaultConfig();
54
55 JAXRSServerFactoryBean sf = new JAXRSServerFactoryBean();
56 setUpResources(sf);
57 setUpProviders(sf);
58 sf.setAddress(endPoint + "/");
59
60 BindingFactoryManager manager = sf.getBus().getExtension(
61 BindingFactoryManager.class
62 );
63
64 JAXRSBindingFactory factory = new JAXRSBindingFactory();
65 factory.setBus(sf.getBus());
66
67 manager.registerBindingFactory(
68 JAXRSBindingFactory.JAXRS_BINDING_ID,
69 factory
70 );
71
72 server = sf.create();
73 }
74
75 /**
76 * Have the test do {@link JAXRSServerFactoryBean#setResourceClasses(Class...)}
77 * and {@link JAXRSServerFactoryBean#setResourceProvider(Class, org.apache.cxf.jaxrs.lifecycle.ResourceProvider)}
78 */
79 protected abstract void setUpResources(JAXRSServerFactoryBean sf);
80 /**
81 * Have the test do {@link JAXRSServerFactoryBean#setProviders(java.util.List)}, if needed
82 */
83 protected abstract void setUpProviders(JAXRSServerFactoryBean sf);
84
85 @After
86 public void tearDown() throws Exception {
87 server.stop();
88 server.destroy();
89 }
90
91 public static void assertContains(String needle, String haystack) {
92 assertTrue(needle + " not found in:\n" + haystack, haystack.contains(needle));
93 }
94 public static void assertNotFound(String needle, String haystack) {
95 assertFalse(needle + " unexpectedly found in:\n" + haystack, haystack.contains(needle));
96 }
97
98 protected String getStringFromInputStream(InputStream in) throws Exception {
99 return IOUtils.toString(in);
100 }
101
102 protected Map<String, String> readZipArchive(InputStream inputStream) throws IOException {
103 Map<String, String> data = new HashMap<String, String>();
104 File tempFile = writeTemporaryArchiveFile(inputStream, "zip");
105 ZipFile zip = new ZipFile(tempFile);
106 Enumeration<ZipArchiveEntry> entries = zip.getEntries();
107 while (entries.hasMoreElements()) {
108 ZipArchiveEntry entry = entries.nextElement();
109 ByteArrayOutputStream bos = new ByteArrayOutputStream();
110 IOUtils.copy(zip.getInputStream(entry), bos);
111 data.put(entry.getName(), DigestUtils.md5Hex(bos.toByteArray()));
112 }
113
114 zip.close();
115 tempFile.delete();
116 return data;
117 }
118
119 protected String readArchiveText(InputStream inputStream) throws IOException {
120 File tempFile = writeTemporaryArchiveFile(inputStream, "zip");
121 ZipFile zip = new ZipFile(tempFile);
122 zip.getEntry(UnpackerResource.TEXT_FILENAME);
123 ByteArrayOutputStream bos = new ByteArrayOutputStream();
124 IOUtils.copy(zip.getInputStream(zip.getEntry(UnpackerResource.TEXT_FILENAME)), bos);
125
126 zip.close();
127 tempFile.delete();
128 return bos.toString("UTF-8");
129 }
130
131 protected Map<String, String> readArchiveFromStream(ArchiveInputStream zip) throws IOException {
132 Map<String, String> data = new HashMap<String, String>();
133 while (true) {
134 ArchiveEntry entry = zip.getNextEntry();
135 if (entry == null) {
136 break;
137 }
138
139 ByteArrayOutputStream bos = new ByteArrayOutputStream();
140 IOUtils.copy(zip, bos);
141 data.put(entry.getName(), DigestUtils.md5Hex(bos.toByteArray()));
142 }
143
144 return data;
145 }
146
147 private File writeTemporaryArchiveFile(InputStream inputStream, String archiveType) throws IOException {
148 File tempFile = File.createTempFile("tmp-", "." + archiveType);
149 IOUtils.copy(inputStream, new FileOutputStream(tempFile));
150 return tempFile;
151 }
152 }
0 /*
1 * Licensed to the Apache Software Foundation (ASF) under one or more
2 * contributor license agreements. See the NOTICE file distributed with
3 * this work for additional information regarding copyright ownership.
4 * The ASF licenses this file to You under the Apache License, Version 2.0
5 * (the "License"); you may not use this file except in compliance with
6 * the License. You may obtain a copy of the License at
7 *
8 * http://www.apache.org/licenses/LICENSE-2.0
9 *
10 * Unless required by applicable law or agreed to in writing, software
11 * distributed under the License is distributed on an "AS IS" BASIS,
12 * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
13 * See the License for the specific language governing permissions and
14 * limitations under the License.
15 */
16
17 package org.apache.tika.server;
18
19 import static org.junit.Assert.assertFalse;
20 import static org.junit.Assert.assertTrue;
21
22 import java.io.ByteArrayInputStream;
23 import java.io.ByteArrayOutputStream;
24 import java.io.File;
25 import java.io.FileOutputStream;
26 import java.io.IOException;
27 import java.io.InputStream;
28 import java.util.Enumeration;
29 import java.util.HashMap;
30 import java.util.Map;
31
32 import org.apache.commons.codec.digest.DigestUtils;
33 import org.apache.commons.compress.archivers.ArchiveEntry;
34 import org.apache.commons.compress.archivers.ArchiveInputStream;
35 import org.apache.commons.compress.archivers.zip.ZipArchiveEntry;
36 import org.apache.commons.compress.archivers.zip.ZipFile;
37 import org.apache.cxf.binding.BindingFactoryManager;
38 import org.apache.cxf.endpoint.Server;
39 import org.apache.cxf.jaxrs.JAXRSBindingFactory;
40 import org.apache.cxf.jaxrs.JAXRSServerFactoryBean;
41 import org.apache.tika.config.TikaConfig;
42 import org.apache.tika.io.IOUtils;
43 import org.apache.tika.server.resource.UnpackerResource;
44 import org.junit.After;
45 import org.junit.Before;
46
47 public abstract class CXFTestBase {
48 protected static final String endPoint =
49 "http://localhost:" + TikaServerCli.DEFAULT_PORT;
50 protected Server server;
51 protected TikaConfig tika;
52
53 public static void assertContains(String needle, String haystack) {
54 assertTrue(needle + " not found in:\n" + haystack, haystack.contains(needle));
55 }
56
57 public static void assertNotFound(String needle, String haystack) {
58 assertFalse(needle + " unexpectedly found in:\n" + haystack, haystack.contains(needle));
59 }
60
61 protected static InputStream copy(InputStream in, int remaining) throws IOException {
62 ByteArrayOutputStream out = new ByteArrayOutputStream();
63 while (remaining > 0) {
64 byte[] bytes = new byte[remaining];
65 int n = in.read(bytes);
66 if (n <= 0) {
67 break;
68 }
69 out.write(bytes, 0, n);
70 remaining -= n;
71 }
72 return new ByteArrayInputStream(out.toByteArray());
73 }
74
75 @Before
76 public void setUp() {
77 this.tika = TikaConfig.getDefaultConfig();
78
79 JAXRSServerFactoryBean sf = new JAXRSServerFactoryBean();
80 setUpResources(sf);
81 setUpProviders(sf);
82 sf.setAddress(endPoint + "/");
83
84 BindingFactoryManager manager = sf.getBus().getExtension(
85 BindingFactoryManager.class
86 );
87
88 JAXRSBindingFactory factory = new JAXRSBindingFactory();
89 factory.setBus(sf.getBus());
90
91 manager.registerBindingFactory(
92 JAXRSBindingFactory.JAXRS_BINDING_ID,
93 factory
94 );
95
96 server = sf.create();
97 }
98
99 /**
100 * Have the test do {@link JAXRSServerFactoryBean#setResourceClasses(Class...)}
101 * and {@link JAXRSServerFactoryBean#setResourceProvider(Class, org.apache.cxf.jaxrs.lifecycle.ResourceProvider)}
102 */
103 protected abstract void setUpResources(JAXRSServerFactoryBean sf);
104
105 /**
106 * Have the test do {@link JAXRSServerFactoryBean#setProviders(java.util.List)}, if needed
107 */
108 protected abstract void setUpProviders(JAXRSServerFactoryBean sf);
109
110 @After
111 public void tearDown() throws Exception {
112 server.stop();
113 server.destroy();
114 }
115
116 protected String getStringFromInputStream(InputStream in) throws Exception {
117 return IOUtils.toString(in);
118 }
119
120 protected Map<String, String> readZipArchive(InputStream inputStream) throws IOException {
121 Map<String, String> data = new HashMap<String, String>();
122 File tempFile = writeTemporaryArchiveFile(inputStream, "zip");
123 ZipFile zip = new ZipFile(tempFile);
124 Enumeration<ZipArchiveEntry> entries = zip.getEntries();
125 while (entries.hasMoreElements()) {
126 ZipArchiveEntry entry = entries.nextElement();
127 ByteArrayOutputStream bos = new ByteArrayOutputStream();
128 IOUtils.copy(zip.getInputStream(entry), bos);
129 data.put(entry.getName(), DigestUtils.md5Hex(bos.toByteArray()));
130 }
131
132 zip.close();
133 tempFile.delete();
134 return data;
135 }
136
137 protected String readArchiveText(InputStream inputStream) throws IOException {
138 File tempFile = writeTemporaryArchiveFile(inputStream, "zip");
139 ZipFile zip = new ZipFile(tempFile);
140 zip.getEntry(UnpackerResource.TEXT_FILENAME);
141 ByteArrayOutputStream bos = new ByteArrayOutputStream();
142 IOUtils.copy(zip.getInputStream(zip.getEntry(UnpackerResource.TEXT_FILENAME)), bos);
143
144 zip.close();
145 tempFile.delete();
146 return bos.toString(IOUtils.UTF_8.name());
147 }
148
149 protected Map<String, String> readArchiveFromStream(ArchiveInputStream zip) throws IOException {
150 Map<String, String> data = new HashMap<String, String>();
151 while (true) {
152 ArchiveEntry entry = zip.getNextEntry();
153 if (entry == null) {
154 break;
155 }
156
157 ByteArrayOutputStream bos = new ByteArrayOutputStream();
158 IOUtils.copy(zip, bos);
159 data.put(entry.getName(), DigestUtils.md5Hex(bos.toByteArray()));
160 }
161
162 return data;
163 }
164
165 private File writeTemporaryArchiveFile(InputStream inputStream, String archiveType) throws IOException {
166 File tempFile = File.createTempFile("tmp-", "." + archiveType);
167 FileOutputStream out = new FileOutputStream(tempFile);
168 try {
169 IOUtils.copy(inputStream, out);
170 } finally {
171 out.close();
172 }
173 return tempFile;
174 }
175
176 }
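Editor's note: `readZipArchive` and `readArchiveFromStream` fingerprint each entry with commons-codec's `DigestUtils.md5Hex`. The same lowercase-hex MD5 can be produced with the JDK alone — a sketch assuming commons-codec is unavailable (class name `Md5HexSketch` is hypothetical):

```java
import java.math.BigInteger;
import java.security.MessageDigest;

public class Md5HexSketch {
    // JDK-only equivalent of commons-codec DigestUtils.md5Hex(byte[]):
    // the MD5 digest rendered as 32 lowercase hex characters.
    static String md5Hex(byte[] data) throws Exception {
        byte[] digest = MessageDigest.getInstance("MD5").digest(data);
        // BigInteger drops leading zero bytes, so pad back to 32 chars
        String hex = new BigInteger(1, digest).toString(16);
        while (hex.length() < 32) {
            hex = "0" + hex;
        }
        return hex;
    }
}
```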
0 /**
1 * Licensed to the Apache Software Foundation (ASF) under one or more
2 * contributor license agreements. See the NOTICE file distributed with
3 * this work for additional information regarding copyright ownership.
4 * The ASF licenses this file to You under the Apache License, Version 2.0
5 * (the "License"); you may not use this file except in compliance with
6 * the License. You may obtain a copy of the License at
7 *
8 * http://www.apache.org/licenses/LICENSE-2.0
9 *
10 * Unless required by applicable law or agreed to in writing, software
11 * distributed under the License is distributed on an "AS IS" BASIS,
12 * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
13 * See the License for the specific language governing permissions and
14 * limitations under the License.
15 */
16
17 package org.apache.tika.server;
18
19 import static org.junit.Assert.assertEquals;
20 import static org.junit.Assert.assertNotNull;
21
22 import java.io.InputStream;
23 import java.util.ArrayList;
24 import java.util.List;
25
26 import javax.ws.rs.core.Response;
27
28 import org.apache.cxf.jaxrs.JAXRSServerFactoryBean;
29 import org.apache.cxf.jaxrs.client.WebClient;
30 import org.apache.cxf.jaxrs.lifecycle.SingletonResourceProvider;
31 import org.junit.Test;
32
33 public class DetectorResourceTest extends CXFTestBase {
34
35 private static final String DETECT_PATH = "/detect";
36 private static final String DETECT_STREAM_PATH = DETECT_PATH + "/stream";
37 private static final String FOO_CSV = "foo.csv";
38 private static final String CDEC_CSV_NO_EXT = "CDEC_WEATHER_2010_03_02";
39
40 @Override
41 protected void setUpResources(JAXRSServerFactoryBean sf) {
42 sf.setResourceClasses(DetectorResource.class);
43 sf.setResourceProvider(DetectorResource.class,
44 new SingletonResourceProvider(new DetectorResource(tika)));
45
46 }
47
48 @Override
49 protected void setUpProviders(JAXRSServerFactoryBean sf) {
50 List<Object> providers = new ArrayList<Object>();
51 providers.add(new TarWriter());
52 providers.add(new ZipWriter());
53 providers.add(new TikaExceptionMapper());
54 sf.setProviders(providers);
55
56 }
57
58 @Test
59 public void testDetectCsvWithExt() throws IllegalStateException, Exception {
60 String url = endPoint + DETECT_STREAM_PATH;
61 Response response = WebClient
62 .create(endPoint + DETECT_STREAM_PATH)
63 .type("text/csv")
64 .accept("*/*")
65 .header("Content-Disposition",
66 "attachment; filename=" + FOO_CSV)
67 .put(ClassLoader.getSystemResourceAsStream(FOO_CSV));
68 assertNotNull(response);
69 String readMime = getStringFromInputStream((InputStream) response
70 .getEntity());
71 assertEquals("text/csv", readMime);
72
73 }
74
75 @Test
76 public void testDetectCsvNoExt() throws IllegalStateException, Exception {
77 String url = endPoint + DETECT_STREAM_PATH;
78 Response response = WebClient
79 .create(endPoint + DETECT_STREAM_PATH)
80 .type("text/csv")
81 .accept("*/*")
82 .header("Content-Disposition",
83 "attachment; filename=" + CDEC_CSV_NO_EXT)
84 .put(ClassLoader.getSystemResourceAsStream(CDEC_CSV_NO_EXT));
85 assertNotNull(response);
86 String readMime = getStringFromInputStream((InputStream) response
87 .getEntity());
88 assertEquals("text/plain", readMime);
89
90 // now trick it by adding .csv to the end
91 response = WebClient
92 .create(endPoint + DETECT_STREAM_PATH)
93 .type("text/csv")
94 .accept("*/*")
95 .header("Content-Disposition",
96 "attachment; filename=" + CDEC_CSV_NO_EXT + ".csv")
97 .put(ClassLoader.getSystemResourceAsStream(CDEC_CSV_NO_EXT));
98 assertNotNull(response);
99 readMime = getStringFromInputStream((InputStream) response.getEntity());
100 assertEquals("text/csv", readMime);
101
102 }
103 }
0 /**
1 * Licensed to the Apache Software Foundation (ASF) under one or more
2 * contributor license agreements. See the NOTICE file distributed with
3 * this work for additional information regarding copyright ownership.
4 * The ASF licenses this file to You under the Apache License, Version 2.0
5 * (the "License"); you may not use this file except in compliance with
6 * the License. You may obtain a copy of the License at
7 *
8 * http://www.apache.org/licenses/LICENSE-2.0
9 *
10 * Unless required by applicable law or agreed to in writing, software
11 * distributed under the License is distributed on an "AS IS" BASIS,
12 * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
13 * See the License for the specific language governing permissions and
14 * limitations under the License.
15 */
16
17 package org.apache.tika.server;
18
19 import static org.junit.Assert.assertEquals;
20 import static org.junit.Assert.assertNotNull;
21
22 import javax.ws.rs.core.Response;
23
24 import java.io.InputStream;
25 import java.util.ArrayList;
26 import java.util.List;
27
28 import org.apache.cxf.jaxrs.JAXRSServerFactoryBean;
29 import org.apache.cxf.jaxrs.client.WebClient;
30 import org.apache.cxf.jaxrs.lifecycle.SingletonResourceProvider;
31 import org.apache.tika.server.resource.DetectorResource;
32 import org.apache.tika.server.writer.TarWriter;
33 import org.apache.tika.server.writer.ZipWriter;
34 import org.junit.Test;
35
36 public class DetectorResourceTest extends CXFTestBase {
37
38 private static final String DETECT_PATH = "/detect";
39 private static final String DETECT_STREAM_PATH = DETECT_PATH + "/stream";
40 private static final String FOO_CSV = "foo.csv";
41 private static final String CDEC_CSV_NO_EXT = "CDEC_WEATHER_2010_03_02";
42
43 @Override
44 protected void setUpResources(JAXRSServerFactoryBean sf) {
45 sf.setResourceClasses(DetectorResource.class);
46 sf.setResourceProvider(DetectorResource.class,
47 new SingletonResourceProvider(new DetectorResource(tika)));
48
49 }
50
51 @Override
52 protected void setUpProviders(JAXRSServerFactoryBean sf) {
53 List<Object> providers = new ArrayList<Object>();
54 providers.add(new TarWriter());
55 providers.add(new ZipWriter());
56 providers.add(new TikaServerParseExceptionMapper(false));
57 sf.setProviders(providers);
58
59 }
60
61 @Test
62 public void testDetectCsvWithExt() throws Exception {
63 String url = endPoint + DETECT_STREAM_PATH;
64 Response response = WebClient
65 .create(endPoint + DETECT_STREAM_PATH)
66 .type("text/csv")
67 .accept("*/*")
68 .header("Content-Disposition",
69 "attachment; filename=" + FOO_CSV)
70 .put(ClassLoader.getSystemResourceAsStream(FOO_CSV));
71 assertNotNull(response);
72 String readMime = getStringFromInputStream((InputStream) response
73 .getEntity());
74 assertEquals("text/csv", readMime);
75
76 }
77
78 @Test
79 public void testDetectCsvNoExt() throws Exception {
80 String url = endPoint + DETECT_STREAM_PATH;
81 Response response = WebClient
82 .create(endPoint + DETECT_STREAM_PATH)
83 .type("text/csv")
84 .accept("*/*")
85 .header("Content-Disposition",
86 "attachment; filename=" + CDEC_CSV_NO_EXT)
87 .put(ClassLoader.getSystemResourceAsStream(CDEC_CSV_NO_EXT));
88 assertNotNull(response);
89 String readMime = getStringFromInputStream((InputStream) response
90 .getEntity());
91 assertEquals("text/plain", readMime);
92
93 // now trick it by adding .csv to the end
94 response = WebClient
95 .create(endPoint + DETECT_STREAM_PATH)
96 .type("text/csv")
97 .accept("*/*")
98 .header("Content-Disposition",
99 "attachment; filename=" + CDEC_CSV_NO_EXT + ".csv")
100 .put(ClassLoader.getSystemResourceAsStream(CDEC_CSV_NO_EXT));
101 assertNotNull(response);
102 readMime = getStringFromInputStream((InputStream) response.getEntity());
103 assertEquals("text/csv", readMime);
104
105 }
106 }
+0
-169
tika-server/src/test/java/org/apache/tika/server/MetadataEPTest.java
0 package org.apache.tika.server;
1
2 /*
3 * Licensed to the Apache Software Foundation (ASF) under one or more
4 * contributor license agreements. See the NOTICE file distributed with
5 * this work for additional information regarding copyright ownership.
6 * The ASF licenses this file to You under the Apache License, Version 2.0
7 * (the "License"); you may not use this file except in compliance with
8 * the License. You may obtain a copy of the License at
9 *
10 * http://www.apache.org/licenses/LICENSE-2.0
11 *
12 * Unless required by applicable law or agreed to in writing, software
13 * distributed under the License is distributed on an "AS IS" BASIS,
14 * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
15 * See the License for the specific language governing permissions and
16 * limitations under the License.
17 */
18
19 import static org.junit.Assert.assertEquals;
20 import static org.junit.Assert.assertNotNull;
21
22 import java.io.ByteArrayInputStream;
23 import java.io.ByteArrayOutputStream;
24 import java.io.IOException;
25 import java.io.InputStream;
26 import java.io.InputStreamReader;
27 import java.io.Reader;
28 import java.io.StringWriter;
29 import java.util.ArrayList;
30 import java.util.HashMap;
31 import java.util.List;
32 import java.util.Map;
33
34 import javax.ws.rs.core.MediaType;
35 import javax.ws.rs.core.Response;
36 import javax.ws.rs.core.Response.Status;
37
38 import org.apache.cxf.jaxrs.JAXRSServerFactoryBean;
39 import org.apache.cxf.jaxrs.client.WebClient;
40 import org.apache.tika.io.IOUtils;
41 import org.apache.tika.metadata.Metadata;
42 import org.apache.tika.metadata.serialization.JsonMetadata;
43 import org.junit.Assert;
44 import org.junit.Test;
45
46 import au.com.bytecode.opencsv.CSVReader;
47
48 public class MetadataEPTest extends CXFTestBase {
49 private static final String META_PATH = "/metadata";
50
51 @Override
52 protected void setUpResources(JAXRSServerFactoryBean sf) {
53 sf.setResourceClasses(MetadataEP.class);
54 }
55
56 @Override
57 protected void setUpProviders(JAXRSServerFactoryBean sf) {
58 List<Object> providers = new ArrayList<Object>();
59 providers.add(new CSVMessageBodyWriter());
60 providers.add(new JSONMessageBodyWriter());
61 sf.setProviders(providers);
62 }
63
64 private static InputStream copy(InputStream in, int remaining) throws IOException {
65 ByteArrayOutputStream out = new ByteArrayOutputStream();
66 while (remaining > 0) {
67 byte[] bytes = new byte[remaining];
68 int n = in.read(bytes);
69 if (n <= 0) {
70 break;
71 }
72 out.write(bytes, 0, n);
73 remaining -= n;
74 }
75 return new ByteArrayInputStream(out.toByteArray());
76 }
77
78 @Test
79 public void testSimpleWord_CSV() throws Exception {
80 Response response = WebClient.create(endPoint + META_PATH).type("application/msword").accept("text/csv")
81 .post(ClassLoader.getSystemResourceAsStream(TikaResourceTest.TEST_DOC));
82 Assert.assertEquals(Status.OK.getStatusCode(), response.getStatus());
83
84 Reader reader = new InputStreamReader((InputStream) response.getEntity());
85
86 @SuppressWarnings("resource")
87 CSVReader csvReader = new CSVReader(reader);
88
89 Map<String, String> metadata = new HashMap<String, String>();
90
91 String[] nextLine;
92 while ((nextLine = csvReader.readNext()) != null) {
93 metadata.put(nextLine[0], nextLine[1]);
94 }
95
96 assertNotNull(metadata.get("Author"));
97 assertEquals("Maxim Valyanskiy", metadata.get("Author"));
98 }
99
100 @Test
101 public void testSimpleWord_JSON() throws Exception {
102 Response response = WebClient.create(endPoint + META_PATH).type("application/msword")
103 .accept(MediaType.APPLICATION_JSON).post(ClassLoader.getSystemResourceAsStream(TikaResourceTest.TEST_DOC));
104
105 Assert.assertEquals(Status.OK.getStatusCode(), response.getStatus());
106
107 Reader reader = new InputStreamReader((InputStream) response.getEntity());
108 Metadata metadata = JsonMetadata.fromJson(reader);
109 assertNotNull(metadata.get("Author"));
110 assertEquals("Maxim Valyanskiy", metadata.get("Author"));
111 }
112
113 @Test
114 public void testGetField_Author_TEXT() throws Exception {
115 Response response = WebClient.create(endPoint + META_PATH + "/Author").type("application/msword")
116 .accept(MediaType.TEXT_PLAIN).post(ClassLoader.getSystemResourceAsStream(TikaResourceTest.TEST_DOC));
117 Assert.assertEquals(Status.OK.getStatusCode(), response.getStatus());
118
119 StringWriter w = new StringWriter();
120 IOUtils.copy((InputStream) response.getEntity(), w);
121 assertEquals("Maxim Valyanskiy", w.toString());
122 }
123
124 @Test
125 public void testGetField_Author_JSON() throws Exception {
126 Response response = WebClient.create(endPoint + META_PATH + "/Author").type("application/msword")
127 .accept(MediaType.APPLICATION_JSON).post(ClassLoader.getSystemResourceAsStream(TikaResourceTest.TEST_DOC));
128 Assert.assertEquals(Status.OK.getStatusCode(), response.getStatus());
129
130 Reader reader = new InputStreamReader((InputStream) response.getEntity());
131 Metadata metadata = JsonMetadata.fromJson(reader);
132
133 assertNotNull(metadata.get("Author"));
134 assertEquals("Maxim Valyanskiy", metadata.get("Author"));
135 }
136
137 @Test
138 public void testGetField_XXX_NotFound() throws Exception {
139 Response response = WebClient.create(endPoint + META_PATH + "/xxx").type("application/msword")
140 .accept(MediaType.APPLICATION_JSON).post(ClassLoader.getSystemResourceAsStream(TikaResourceTest.TEST_DOC));
141 Assert.assertEquals(Status.NOT_FOUND.getStatusCode(), response.getStatus());
142 }
143
144 @Test
145 public void testGetField_Author_TEXT_Partial_BAD_REQUEST() throws Exception {
146
147 InputStream stream = ClassLoader.getSystemResourceAsStream(TikaResourceTest.TEST_DOC);
148
149 Response response = WebClient.create(endPoint + META_PATH + "/Author").type("application/msword")
150 .accept(MediaType.TEXT_PLAIN).post(copy(stream, 8000));
151 Assert.assertEquals(Status.BAD_REQUEST.getStatusCode(), response.getStatus());
152 }
153
154 @Test
155 public void testGetField_Author_TEXT_Partial_Found() throws Exception {
156
157 InputStream stream = ClassLoader.getSystemResourceAsStream(TikaResourceTest.TEST_DOC);
158
159 Response response = WebClient.create(endPoint + META_PATH + "/Author").type("application/msword")
160 .accept(MediaType.TEXT_PLAIN).post(copy(stream, 12000));
161 Assert.assertEquals(Status.OK.getStatusCode(), response.getStatus());
162
163 StringWriter w = new StringWriter();
164 IOUtils.copy((InputStream) response.getEntity(), w);
165 assertEquals("Maxim Valyanskiy", w.toString());
166 }
167
168 }
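The partial-document tests above rely on the `copy(InputStream, int)` helper, which buffers at most the first N bytes of a stream so a truncated .doc can be PUT to the server. A self-contained sketch of the same bounded-prefix pattern (standalone demo class, not the test's helper itself; it adds a fixed-size buffer instead of allocating `remaining` bytes per iteration):

```java
import java.io.ByteArrayInputStream;
import java.io.ByteArrayOutputStream;
import java.io.IOException;
import java.io.InputStream;

public class BoundedCopyDemo {
    // Read at most `remaining` bytes from `in` into a repeatable
    // in-memory stream, stopping early if the source is exhausted.
    static InputStream copy(InputStream in, int remaining) throws IOException {
        ByteArrayOutputStream out = new ByteArrayOutputStream();
        byte[] buf = new byte[8192];
        while (remaining > 0) {
            int n = in.read(buf, 0, Math.min(buf.length, remaining));
            if (n <= 0) {
                break;
            }
            out.write(buf, 0, n);
            remaining -= n;
        }
        return new ByteArrayInputStream(out.toByteArray());
    }

    public static void main(String[] args) throws IOException {
        // Take an 8000-byte prefix of a 10000-byte source, as in
        // testGetField_Author_TEXT_Partial_BAD_REQUEST.
        InputStream prefix = copy(new ByteArrayInputStream(new byte[10000]), 8000);
        int count = 0;
        while (prefix.read() != -1) {
            count++;
        }
        System.out.println(count); // prints 8000
    }
}
```

Truncating at 8000 bytes cuts the Word file before its metadata, hence BAD_REQUEST; 12000 bytes is enough for the Author field to be recovered.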
1919 import static org.junit.Assert.assertEquals;
2020 import static org.junit.Assert.assertNotNull;
2121
22 import javax.ws.rs.core.MediaType;
23 import javax.ws.rs.core.Response;
24
2225 import java.io.InputStream;
2326 import java.io.InputStreamReader;
2427 import java.io.Reader;
28 import java.util.ArrayList;
2529 import java.util.HashMap;
30 import java.util.List;
2631 import java.util.Map;
2732
28 import javax.ws.rs.core.Response;
29
33 import au.com.bytecode.opencsv.CSVReader;
34 import org.apache.cxf.helpers.IOUtils;
3035 import org.apache.cxf.jaxrs.JAXRSServerFactoryBean;
3136 import org.apache.cxf.jaxrs.client.WebClient;
3237 import org.apache.cxf.jaxrs.lifecycle.SingletonResourceProvider;
38 import org.apache.tika.metadata.Metadata;
39 import org.apache.tika.metadata.serialization.JsonMetadata;
40 import org.apache.tika.server.resource.MetadataResource;
41 import org.apache.tika.server.writer.CSVMessageBodyWriter;
42 import org.apache.tika.server.writer.JSONMessageBodyWriter;
43 import org.apache.tika.server.writer.TextMessageBodyWriter;
44 import org.apache.tika.server.writer.XMPMessageBodyWriter;
45 import org.junit.Assert;
3346 import org.junit.Test;
34
35 import au.com.bytecode.opencsv.CSVReader;
3647
3748 public class MetadataResourceTest extends CXFTestBase {
3849 private static final String META_PATH = "/meta";
4152 protected void setUpResources(JAXRSServerFactoryBean sf) {
4253 sf.setResourceClasses(MetadataResource.class);
4354 sf.setResourceProvider(MetadataResource.class,
44 new SingletonResourceProvider(new MetadataResource(tika)));
55 new SingletonResourceProvider(new MetadataResource(tika)));
4556 }
4657
4758 @Override
48 protected void setUpProviders(JAXRSServerFactoryBean sf) {}
49
50 @Test
51 public void testSimpleWord() throws Exception {
52 Response response = WebClient
53 .create(endPoint + META_PATH)
54 .type("application/msword")
55 .accept("text/csv")
56 .put(ClassLoader
57 .getSystemResourceAsStream(TikaResourceTest.TEST_DOC));
58
59 Reader reader = new InputStreamReader(
60 (InputStream) response.getEntity());
61
62 CSVReader csvReader = new CSVReader(reader);
63
64 Map<String, String> metadata = new HashMap<String, String>();
65
66 String[] nextLine;
67 while ((nextLine = csvReader.readNext()) != null) {
68 metadata.put(nextLine[0], nextLine[1]);
69 }
70 csvReader.close();
71
72 assertNotNull(metadata.get("Author"));
73 assertEquals("Maxim Valyanskiy", metadata.get("Author"));
74 }
59 protected void setUpProviders(JAXRSServerFactoryBean sf) {
60 List<Object> providers = new ArrayList<Object>();
61 providers.add(new JSONMessageBodyWriter());
62 providers.add(new CSVMessageBodyWriter());
63 providers.add(new XMPMessageBodyWriter());
64 providers.add(new TextMessageBodyWriter());
65 sf.setProviders(providers);
66 }
67
68 @Test
69 public void testSimpleWord() throws Exception {
70 Response response = WebClient
71 .create(endPoint + META_PATH)
72 .type("application/msword")
73 .accept("text/csv")
74 .put(ClassLoader
75 .getSystemResourceAsStream(TikaResourceTest.TEST_DOC));
76
77 Reader reader = new InputStreamReader((InputStream) response.getEntity(), org.apache.tika.io.IOUtils.UTF_8);
78
79 CSVReader csvReader = new CSVReader(reader);
80
81 Map<String, String> metadata = new HashMap<String, String>();
82
83 String[] nextLine;
84 while ((nextLine = csvReader.readNext()) != null) {
85 metadata.put(nextLine[0], nextLine[1]);
86 }
87 csvReader.close();
88
89 assertNotNull(metadata.get("Author"));
90 assertEquals("Maxim Valyanskiy", metadata.get("Author"));
91 }
92
93 @Test
94 public void testPasswordProtected() throws Exception {
95 Response response = WebClient
96 .create(endPoint + META_PATH)
97 .type("application/vnd.ms-excel")
98 .accept("text/csv")
99 .put(ClassLoader
100 .getSystemResourceAsStream(TikaResourceTest.TEST_PASSWORD_PROTECTED));
101
102 // Won't work, no password given
103 assertEquals(500, response.getStatus());
104
105 // Try again, this time with the wrong password
106 response = WebClient
107 .create(endPoint + META_PATH)
108 .type("application/vnd.ms-excel")
109 .accept("text/csv")
110 .header("Password", "wrong password")
111 .put(ClassLoader.getSystemResourceAsStream(TikaResourceTest.TEST_PASSWORD_PROTECTED));
112
113 assertEquals(500, response.getStatus());
114
115 // Try again, this time with the password
116 response = WebClient
117 .create(endPoint + META_PATH)
118 .type("application/vnd.ms-excel")
119 .accept("text/csv")
120 .header("Password", "password")
121 .put(ClassLoader.getSystemResourceAsStream(TikaResourceTest.TEST_PASSWORD_PROTECTED));
122
123 // Will work
124 assertEquals(200, response.getStatus());
125
126 // Check results
127 Reader reader = new InputStreamReader((InputStream) response.getEntity(), org.apache.tika.io.IOUtils.UTF_8);
128 CSVReader csvReader = new CSVReader(reader);
129
130 Map<String, String> metadata = new HashMap<String, String>();
131
132 String[] nextLine;
133 while ((nextLine = csvReader.readNext()) != null) {
134 metadata.put(nextLine[0], nextLine[1]);
135 }
136 csvReader.close();
137
138 assertNotNull(metadata.get("Author"));
139 assertEquals("pavel", metadata.get("Author"));
140 }
141
142 @Test
143 public void testJSON() throws Exception {
144 Response response = WebClient
145 .create(endPoint + META_PATH)
146 .type("application/msword")
147 .accept("application/json")
148 .put(ClassLoader
149 .getSystemResourceAsStream(TikaResourceTest.TEST_DOC));
150
151 Reader reader = new InputStreamReader((InputStream) response.getEntity(), org.apache.tika.io.IOUtils.UTF_8);
152
153 Metadata metadata = JsonMetadata.fromJson(reader);
154 assertNotNull(metadata.get("Author"));
155 assertEquals("Maxim Valyanskiy", metadata.get("Author"));
156 }
157
158 @Test
159 public void testXMP() throws Exception {
160 Response response = WebClient
161 .create(endPoint + META_PATH)
162 .type("application/msword")
163 .accept("application/rdf+xml")
164 .put(ClassLoader
165 .getSystemResourceAsStream(TikaResourceTest.TEST_DOC));
166
167 String result = IOUtils.readStringFromStream((InputStream) response.getEntity());
168 assertContains("<rdf:li>Maxim Valyanskiy</rdf:li>", result);
169 }
170
171 //Now test requesting one field
172 @Test
173 public void testGetField_XXX_NotFound() throws Exception {
174 Response response = WebClient.create(endPoint + META_PATH + "/xxx").type("application/msword")
175 .accept(MediaType.APPLICATION_JSON).put(ClassLoader.getSystemResourceAsStream(TikaResourceTest.TEST_DOC));
176 Assert.assertEquals(Response.Status.NOT_FOUND.getStatusCode(), response.getStatus());
177 }
178
179 @Test
180 public void testGetField_Author_TEXT_Partial_BAD_REQUEST() throws Exception {
181
182 InputStream stream = ClassLoader.getSystemResourceAsStream(TikaResourceTest.TEST_DOC);
183
184 Response response = WebClient.create(endPoint + META_PATH + "/Author").type("application/msword")
185 .accept(MediaType.TEXT_PLAIN).put(copy(stream, 8000));
186 Assert.assertEquals(Response.Status.BAD_REQUEST.getStatusCode(), response.getStatus());
187 }
188
189 @Test
190 public void testGetField_Author_TEXT_Partial_Found() throws Exception {
191
192 InputStream stream = ClassLoader.getSystemResourceAsStream(TikaResourceTest.TEST_DOC);
193
194 Response response = WebClient.create(endPoint + META_PATH + "/Author").type("application/msword")
195 .accept(MediaType.TEXT_PLAIN).put(copy(stream, 12000));
196 Assert.assertEquals(Response.Status.OK.getStatusCode(), response.getStatus());
197 String s = IOUtils.readStringFromStream((InputStream) response.getEntity());
198 assertEquals("Maxim Valyanskiy", s);
199 }
200
201 @Test
202 public void testGetField_Author_JSON_Partial_Found() throws Exception {
203
204 InputStream stream = ClassLoader.getSystemResourceAsStream(TikaResourceTest.TEST_DOC);
205
206 Response response = WebClient.create(endPoint + META_PATH + "/Author").type("application/msword")
207 .accept(MediaType.APPLICATION_JSON).put(copy(stream, 12000));
208 Assert.assertEquals(Response.Status.OK.getStatusCode(), response.getStatus());
209 Metadata metadata = JsonMetadata.fromJson(new InputStreamReader(
210 (InputStream) response.getEntity(), org.apache.tika.io.IOUtils.UTF_8));
211 assertEquals("Maxim Valyanskiy", metadata.get("Author"));
212 assertEquals(1, metadata.names().length);
213 }
214
215 @Test
216 public void testGetField_Author_XMP_Partial_Found() throws Exception {
217
218 InputStream stream = ClassLoader.getSystemResourceAsStream(TikaResourceTest.TEST_DOC);
219
220 Response response = WebClient.create(endPoint + META_PATH + "/dc:creator").type("application/msword")
221 .accept("application/rdf+xml").put(copy(stream, 12000));
222 Assert.assertEquals(Response.Status.OK.getStatusCode(), response.getStatus());
223 String s = IOUtils.readStringFromStream((InputStream) response.getEntity());
224 assertContains("<rdf:li>Maxim Valyanskiy</rdf:li>", s);
225 }
226
75227
76228 }
229
0 /*
1 * Licensed to the Apache Software Foundation (ASF) under one or more
2 * contributor license agreements. See the NOTICE file distributed with
3 * this work for additional information regarding copyright ownership.
4 * The ASF licenses this file to You under the Apache License, Version 2.0
5 * (the "License"); you may not use this file except in compliance with
6 * the License. You may obtain a copy of the License at
7 *
8 * http://www.apache.org/licenses/LICENSE-2.0
9 *
10 * Unless required by applicable law or agreed to in writing, software
11 * distributed under the License is distributed on an "AS IS" BASIS,
12 * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
13 * See the License for the specific language governing permissions and
14 * limitations under the License.
15 */
16
17 package org.apache.tika.server;
18
19 import static org.junit.Assert.assertEquals;
20 import static org.junit.Assert.assertNotNull;
21
22 import javax.ws.rs.core.Response;
23
24 import java.io.InputStream;
25 import java.io.InputStreamReader;
26 import java.io.Reader;
27 import java.util.ArrayList;
28 import java.util.List;
29
30 import org.apache.cxf.jaxrs.JAXRSServerFactoryBean;
31 import org.apache.cxf.jaxrs.client.WebClient;
32 import org.apache.cxf.jaxrs.lifecycle.SingletonResourceProvider;
33 import org.apache.tika.io.IOUtils;
34 import org.apache.tika.metadata.Metadata;
35 import org.apache.tika.metadata.serialization.JsonMetadataList;
36 import org.apache.tika.server.resource.RecursiveMetadataResource;
37 import org.apache.tika.server.writer.MetadataListMessageBodyWriter;
38 import org.junit.Test;
39
40 public class RecursiveMetadataResourceTest extends CXFTestBase {
41 private static final String META_PATH = "/rmeta";
42 private static final String TEST_RECURSIVE_DOC = "test_recursive_embedded.docx";
43
44 @Override
45 protected void setUpResources(JAXRSServerFactoryBean sf) {
46 sf.setResourceClasses(RecursiveMetadataResource.class);
47 sf.setResourceProvider(RecursiveMetadataResource.class,
48 new SingletonResourceProvider(new RecursiveMetadataResource(tika)));
49 }
50
51 @Override
52 protected void setUpProviders(JAXRSServerFactoryBean sf) {
53 List<Object> providers = new ArrayList<Object>();
54 providers.add(new MetadataListMessageBodyWriter());
55 sf.setProviders(providers);
56 }
57
58 @Test
59 public void testSimpleWord() throws Exception {
60 Response response = WebClient
61 .create(endPoint + META_PATH)
62 .accept("application/json")
63 .put(ClassLoader
64 .getSystemResourceAsStream(TEST_RECURSIVE_DOC));
65
66 Reader reader = new InputStreamReader((InputStream) response.getEntity(), IOUtils.UTF_8);
67 List<Metadata> metadataList = JsonMetadataList.fromJson(reader);
68
69 assertEquals(11, metadataList.size());
70 assertEquals("Microsoft Office Word", metadataList.get(0).get("Application-Name"));
71 assertContains("plundered our seas", metadataList.get(5).get("X-TIKA:content"));
72 }
73
74 @Test
75 public void testPasswordProtected() throws Exception {
76 Response response = WebClient
77 .create(endPoint + META_PATH)
78 .type("application/vnd.ms-excel")
79 .accept("application/json")
80 .put(ClassLoader
81 .getSystemResourceAsStream(TikaResourceTest.TEST_PASSWORD_PROTECTED));
82
83 // Won't work, no password given
84 assertEquals(500, response.getStatus());
85
86 // Try again, this time with the password
87 response = WebClient
88 .create(endPoint + META_PATH)
89 .type("application/vnd.ms-excel")
90 .accept("application/json")
91 .header("Password", "password")
92 .put(ClassLoader.getSystemResourceAsStream(TikaResourceTest.TEST_PASSWORD_PROTECTED));
93
94 // Will work
95 assertEquals(200, response.getStatus());
96
97 // Check results
98 Reader reader = new InputStreamReader((InputStream) response.getEntity(), IOUtils.UTF_8);
99 List<Metadata> metadataList = JsonMetadataList.fromJson(reader);
100 assertNotNull(metadataList.get(0).get("Author"));
101 assertEquals("pavel", metadataList.get(0).get("Author"));
102 }
103 }
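Note that the tests above decode every response entity with an explicit UTF-8 charset (`IOUtils.UTF_8`) rather than the platform default, so metadata values with non-ASCII characters survive the round trip on any locale. Tika 1.x used its own charset constant because it still targeted older Java; `StandardCharsets.UTF_8` (Java 7+) is the stdlib equivalent, as in this small sketch:

```java
import java.io.ByteArrayInputStream;
import java.io.IOException;
import java.io.InputStream;
import java.io.InputStreamReader;
import java.io.Reader;
import java.nio.charset.StandardCharsets;

public class Utf8ReadDemo {
    // Decode a byte stream as UTF-8 explicitly, never via the platform default.
    static String readAll(InputStream in) throws IOException {
        Reader reader = new InputStreamReader(in, StandardCharsets.UTF_8);
        StringBuilder sb = new StringBuilder();
        int c;
        while ((c = reader.read()) != -1) {
            sb.append((char) c);
        }
        return sb.toString();
    }

    public static void main(String[] args) throws IOException {
        // A non-ASCII value like the contributor name in the 1.8 changelog.
        String author = "D\u00f6rfler";
        InputStream in = new ByteArrayInputStream(author.getBytes(StandardCharsets.UTF_8));
        System.out.println(readAll(in).equals(author)); // prints true
    }
}
```

On a machine whose default charset is not UTF-8, omitting the charset argument would corrupt such values, which is exactly what the explicit `IOUtils.UTF_8` in these tests guards against.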
0 package org.apache.tika.server;
1
2 /**
3 * Licensed to the Apache Software Foundation (ASF) under one or more
4 * contributor license agreements. See the NOTICE file distributed with
5 * this work for additional information regarding copyright ownership.
6 * The ASF licenses this file to You under the Apache License, Version 2.0
7 * (the "License"); you may not use this file except in compliance with
8 * the License. You may obtain a copy of the License at
9 * <p/>
10 * http://www.apache.org/licenses/LICENSE-2.0
11 * <p/>
12 * Unless required by applicable law or agreed to in writing, software
13 * distributed under the License is distributed on an "AS IS" BASIS,
14 * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
15 * See the License for the specific language governing permissions and
16 * limitations under the License.
17 */
18
19 import static org.junit.Assert.assertEquals;
20 import static org.junit.Assert.assertNotNull;
21
22 import javax.ws.rs.core.MediaType;
23 import javax.ws.rs.core.Response;
24
25 import java.io.InputStream;
26 import java.util.ArrayList;
27 import java.util.List;
28
29 import org.apache.cxf.jaxrs.JAXRSServerFactoryBean;
30 import org.apache.cxf.jaxrs.client.WebClient;
31 import org.apache.cxf.jaxrs.lifecycle.ResourceProvider;
32 import org.apache.cxf.jaxrs.lifecycle.SingletonResourceProvider;
33 import org.apache.tika.server.resource.DetectorResource;
34 import org.apache.tika.server.resource.MetadataResource;
35 import org.apache.tika.server.resource.RecursiveMetadataResource;
36 import org.apache.tika.server.resource.TikaResource;
37 import org.apache.tika.server.resource.UnpackerResource;
38 import org.apache.tika.server.writer.CSVMessageBodyWriter;
39 import org.apache.tika.server.writer.JSONMessageBodyWriter;
40 import org.apache.tika.server.writer.TextMessageBodyWriter;
41 import org.apache.tika.server.writer.XMPMessageBodyWriter;
42 import org.junit.Assert;
43 import org.junit.Test;
44
45
46 /**
47 * Test to make sure that no stack traces are returned
48 * when the stack trace param is set to false.
49 */
50 public class StackTraceOffTest extends CXFTestBase {
51 public static final String TEST_NULL = "mock/null_pointer.xml";
52 public static final String TEST_PASSWORD_PROTECTED = "password.xls";
53
54 private static final String[] PATHS = new String[]{
55 "/tika",
56 "/rmeta",
57 "/unpack",
58 "/meta",
59 };
60 private static final int UNPROCESSEABLE = 422;
61
62 @Override
63 protected void setUpResources(JAXRSServerFactoryBean sf) {
64 List<ResourceProvider> rCoreProviders = new ArrayList<ResourceProvider>();
65 rCoreProviders.add(new SingletonResourceProvider(new MetadataResource(tika)));
66 rCoreProviders.add(new SingletonResourceProvider(new RecursiveMetadataResource(tika)));
67 rCoreProviders.add(new SingletonResourceProvider(new DetectorResource(tika)));
68 rCoreProviders.add(new SingletonResourceProvider(new TikaResource(tika)));
69 rCoreProviders.add(new SingletonResourceProvider(new UnpackerResource(tika)));
70 sf.setResourceProviders(rCoreProviders);
71 }
72
73 @Override
74 protected void setUpProviders(JAXRSServerFactoryBean sf) {
75 List<Object> providers = new ArrayList<Object>();
76 providers.add(new TikaServerParseExceptionMapper(false));
77 providers.add(new JSONMessageBodyWriter());
78 providers.add(new CSVMessageBodyWriter());
79 providers.add(new XMPMessageBodyWriter());
80 providers.add(new TextMessageBodyWriter());
81 sf.setProviders(providers);
82 }
83
84 @Test
85 public void testEncrypted() throws Exception {
86 for (String path : PATHS) {
87 Response response = WebClient
88 .create(endPoint + path)
89 .accept("*/*")
90 .header("Content-Disposition",
91 "attachment; filename=" + TEST_PASSWORD_PROTECTED)
92 .put(ClassLoader.getSystemResourceAsStream(TEST_PASSWORD_PROTECTED));
93 assertNotNull("null response: " + path, response);
94 assertEquals("unprocessable: " + path, UNPROCESSEABLE, response.getStatus());
95 String msg = getStringFromInputStream((InputStream) response
96 .getEntity());
97 assertEquals("should be empty: " + path, "", msg);
98 }
99 }
100
101 @Test
102 public void testNullPointerOnTika() throws Exception {
103 for (String path : PATHS) {
104 Response response = WebClient
105 .create(endPoint + path)
106 .accept("*/*")
107 .put(ClassLoader.getSystemResourceAsStream(TEST_NULL));
108 assertNotNull("null response: " + path, response);
109 assertEquals("unprocessable: " + path, UNPROCESSEABLE, response.getStatus());
110 String msg = getStringFromInputStream((InputStream) response
111 .getEntity());
112 assertEquals("should be empty: " + path, "", msg);
113 }
114 }
115
116 @Test
117 public void test415() throws Exception {
118 //no stack traces for 415
119 for (String path : PATHS) {
120 Response response = WebClient
121 .create(endPoint + path)
122 .type("blechdeblah/deblechdeblah")
123 .accept("*/*")
124 .header("Content-Disposition",
125 "attachment; filename=null_pointer.evil")
126 .put(ClassLoader.getSystemResourceAsStream(TEST_NULL));
127 assertNotNull("null response: " + path, response);
128 assertEquals("bad type: " + path, 415, response.getStatus());
129 String msg = getStringFromInputStream((InputStream) response
130 .getEntity());
131 assertEquals("should be empty: " + path, "", msg);
132 }
133 }
134
135 //For now, make sure that non-complete document
136 //still returns BAD_REQUEST. We may want to
137 //make MetadataResource return the same types of parse
138 //exceptions as the others...
139 @Test
140 public void testMeta() throws Exception {
141 InputStream stream = ClassLoader.getSystemResourceAsStream(TikaResourceTest.TEST_DOC);
142
143 Response response = WebClient.create(endPoint + "/meta" + "/Author").type("application/msword")
144 .accept(MediaType.TEXT_PLAIN).put(copy(stream, 8000));
145 Assert.assertEquals(Response.Status.BAD_REQUEST.getStatusCode(), response.getStatus());
146 String msg = getStringFromInputStream((InputStream) response.getEntity());
147 assertEquals("Failed to get metadata field Author", msg);
148 }
149 }
0 package org.apache.tika.server;
1 /**
2 * Licensed to the Apache Software Foundation (ASF) under one or more
3 * contributor license agreements. See the NOTICE file distributed with
4 * this work for additional information regarding copyright ownership.
5 * The ASF licenses this file to You under the Apache License, Version 2.0
6 * (the "License"); you may not use this file except in compliance with
7 * the License. You may obtain a copy of the License at
8 * <p/>
9 * http://www.apache.org/licenses/LICENSE-2.0
10 * <p/>
11 * Unless required by applicable law or agreed to in writing, software
12 * distributed under the License is distributed on an "AS IS" BASIS,
13 * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
14 * See the License for the specific language governing permissions and
15 * limitations under the License.
16 */
17
18 import static org.junit.Assert.assertEquals;
19 import static org.junit.Assert.assertNotNull;
20
21 import javax.ws.rs.core.MediaType;
22 import javax.ws.rs.core.Response;
23
24 import java.io.InputStream;
25 import java.util.ArrayList;
26 import java.util.List;
27
28 import org.apache.cxf.jaxrs.JAXRSServerFactoryBean;
29 import org.apache.cxf.jaxrs.client.WebClient;
30 import org.apache.cxf.jaxrs.lifecycle.ResourceProvider;
31 import org.apache.cxf.jaxrs.lifecycle.SingletonResourceProvider;
32 import org.apache.tika.server.resource.DetectorResource;
33 import org.apache.tika.server.resource.MetadataResource;
34 import org.apache.tika.server.resource.RecursiveMetadataResource;
35 import org.apache.tika.server.resource.TikaResource;
36 import org.apache.tika.server.resource.UnpackerResource;
37 import org.apache.tika.server.writer.CSVMessageBodyWriter;
38 import org.apache.tika.server.writer.JSONMessageBodyWriter;
39 import org.apache.tika.server.writer.TextMessageBodyWriter;
40 import org.apache.tika.server.writer.XMPMessageBodyWriter;
41 import org.junit.Assert;
42 import org.junit.Test;
43
44 public class StackTraceTest extends CXFTestBase {
45 public static final String TEST_NULL = "mock/null_pointer.xml";
46 public static final String TEST_PASSWORD_PROTECTED = "password.xls";
47
48 private static final String[] PATHS = new String[]{
49 "/tika",
50 "/rmeta",
51 "/unpack",
52 "/meta",
53 };
54 private static final int UNPROCESSEABLE = 422;
55
56 @Override
57 protected void setUpResources(JAXRSServerFactoryBean sf) {
58 List<ResourceProvider> rCoreProviders = new ArrayList<ResourceProvider>();
59 rCoreProviders.add(new SingletonResourceProvider(new MetadataResource(tika)));
60 rCoreProviders.add(new SingletonResourceProvider(new RecursiveMetadataResource(tika)));
61 rCoreProviders.add(new SingletonResourceProvider(new DetectorResource(tika)));
62 rCoreProviders.add(new SingletonResourceProvider(new TikaResource(tika)));
63 rCoreProviders.add(new SingletonResourceProvider(new UnpackerResource(tika)));
64 sf.setResourceProviders(rCoreProviders);
65 }
66
67 @Override
68 protected void setUpProviders(JAXRSServerFactoryBean sf) {
69 List<Object> providers = new ArrayList<Object>();
70 providers.add(new TikaServerParseExceptionMapper(true));
71 providers.add(new JSONMessageBodyWriter());
72 providers.add(new CSVMessageBodyWriter());
73 providers.add(new XMPMessageBodyWriter());
74 providers.add(new TextMessageBodyWriter());
75 sf.setProviders(providers);
76 }
77
78 @Test
79 public void testEncrypted() throws Exception {
80 for (String path : PATHS) {
81 Response response = WebClient
82 .create(endPoint + path)
83 .accept("*/*")
84 .header("Content-Disposition",
85 "attachment; filename=" + TEST_PASSWORD_PROTECTED)
86 .put(ClassLoader.getSystemResourceAsStream(TEST_PASSWORD_PROTECTED));
87 assertNotNull("null response: " + path, response);
88 assertEquals("unprocessable: " + path, UNPROCESSEABLE, response.getStatus());
89 String msg = getStringFromInputStream((InputStream) response
90 .getEntity());
91 assertContains("org.apache.tika.exception.EncryptedDocumentException",
92 msg);
93 }
94 }
95
96 @Test
97 public void testNullPointerOnTika() throws Exception {
98 for (String path : PATHS) {
99 Response response = WebClient
100 .create(endPoint + path)
101 .accept("*/*")
102 .put(ClassLoader.getSystemResourceAsStream(TEST_NULL));
103 assertNotNull("null response: " + path, response);
104 assertEquals("unprocessable: " + path, UNPROCESSEABLE, response.getStatus());
105 String msg = getStringFromInputStream((InputStream) response
106 .getEntity());
107 assertContains("Caused by: java.lang.NullPointerException: null pointer message",
108 msg);
109 }
110 }
111
112 @Test
113 public void test415() throws Exception {
114 //no stack traces for 415
115 for (String path : PATHS) {
116 Response response = WebClient
117 .create(endPoint + path)
118 .type("blechdeblah/deblechdeblah")
119 .accept("*/*")
120 .header("Content-Disposition",
121 "attachment; filename=null_pointer.evil")
122 .put(ClassLoader.getSystemResourceAsStream(TEST_NULL));
123 assertNotNull("null response: " + path, response);
124 assertEquals("bad type: " + path, 415, response.getStatus());
125 String msg = getStringFromInputStream((InputStream) response
126 .getEntity());
127 assertEquals("should be empty: " + path, "", msg);
128 }
129 }
130
131 //For now, make sure that non-complete document
132 //still returns BAD_REQUEST. We may want to
133 //make MetadataResource return the same types of parse
134 //exceptions as the others...
135 @Test
136 public void testMeta() throws Exception {
137 InputStream stream = ClassLoader.getSystemResourceAsStream(TikaResourceTest.TEST_DOC);
138
139 Response response = WebClient.create(endPoint + "/meta" + "/Author").type("application/msword")
140 .accept(MediaType.TEXT_PLAIN).put(copy(stream, 8000));
141 Assert.assertEquals(Response.Status.BAD_REQUEST.getStatusCode(), response.getStatus());
142 String msg = getStringFromInputStream((InputStream) response.getEntity());
143 assertEquals("Failed to get metadata field Author", msg);
144 }
145 }
0 /*
1 * Licensed to the Apache Software Foundation (ASF) under one or more
2 * contributor license agreements. See the NOTICE file distributed with
3 * this work for additional information regarding copyright ownership.
4 * The ASF licenses this file to You under the Apache License, Version 2.0
5 * (the "License"); you may not use this file except in compliance with
6 * the License. You may obtain a copy of the License at
7 *
8 * http://www.apache.org/licenses/LICENSE-2.0
9 *
10 * Unless required by applicable law or agreed to in writing, software
11 * distributed under the License is distributed on an "AS IS" BASIS,
12 * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
13 * See the License for the specific language governing permissions and
14 * limitations under the License.
15 */
16
17 package org.apache.tika.server;
18
19 import static org.junit.Assert.assertEquals;
20 import static org.junit.Assert.assertTrue;
21
22 import java.io.InputStream;
23 import java.util.Map;
24
25 import javax.ws.rs.core.Response;
26
27 import org.apache.cxf.jaxrs.JAXRSServerFactoryBean;
28 import org.apache.cxf.jaxrs.client.WebClient;
29 import org.apache.cxf.jaxrs.lifecycle.SingletonResourceProvider;
30 import org.apache.tika.mime.MimeTypes;
31 import org.apache.tika.parser.microsoft.POIFSContainerDetector;
32 import org.apache.tika.parser.pkg.ZipContainerDetector;
33 import org.eclipse.jetty.util.ajax.JSON;
34 import org.gagravarr.tika.OggDetector;
35 import org.junit.Test;
36
37 public class TikaDetectorsTest extends CXFTestBase {
38 private static final String DETECTORS_PATH = "/detectors";
39
40 @Override
41 protected void setUpResources(JAXRSServerFactoryBean sf) {
42 sf.setResourceClasses(TikaDetectors.class);
43 sf.setResourceProvider(
44 TikaDetectors.class,
45 new SingletonResourceProvider(new TikaDetectors(tika))
46 );
47 }
48
49 @Override
50 protected void setUpProviders(JAXRSServerFactoryBean sf) {}
51
52 @Test
53 public void testGetPlainText() throws Exception {
54 Response response = WebClient
55 .create(endPoint + DETECTORS_PATH)
56 .type("text/plain")
57 .accept("text/plain")
58 .get();
59
60 String text = getStringFromInputStream((InputStream) response.getEntity());
61 assertContains("org.apache.tika.detect.DefaultDetector (Composite Detector)", text);
62 assertContains(OggDetector.class.getName(), text);
63 assertContains(POIFSContainerDetector.class.getName(), text);
64 assertContains(ZipContainerDetector.class.getName(), text);
65 assertContains(MimeTypes.class.getName(), text);
66 }
67
68 @Test
69 public void testGetHTML() throws Exception {
70 Response response = WebClient
71 .create(endPoint + DETECTORS_PATH)
72 .type("text/html")
73 .accept("text/html")
74 .get();
75
76 String text = getStringFromInputStream((InputStream) response.getEntity());
77 assertContains("<h2>DefaultDetector</h2>", text);
78 assertContains("Composite", text);
79
80 assertContains("<h3>OggDetector", text);
81 assertContains("<h3>POIFSContainerDetector", text);
82 assertContains("<h3>MimeTypes", text);
83
84 assertContains(OggDetector.class.getName(), text);
85 assertContains(POIFSContainerDetector.class.getName(), text);
86 assertContains(ZipContainerDetector.class.getName(), text);
87 assertContains(MimeTypes.class.getName(), text);
88 }
89
90 @Test
91 @SuppressWarnings("unchecked")
92 public void testGetJSON() throws Exception {
93 Response response = WebClient
94 .create(endPoint + DETECTORS_PATH)
95 .type(javax.ws.rs.core.MediaType.APPLICATION_JSON)
96 .accept(javax.ws.rs.core.MediaType.APPLICATION_JSON)
97 .get();
98
99 String jsonStr = getStringFromInputStream((InputStream) response.getEntity());
100 Map<String,Map<String,Object>> json = (Map<String,Map<String,Object>>)JSON.parse(jsonStr);
101
102 // Should have a nested structure
103 assertEquals(true, json.containsKey("name"));
104 assertEquals(true, json.containsKey("composite"));
105 assertEquals(true, json.containsKey("children"));
106 assertEquals("org.apache.tika.detect.DefaultDetector", json.get("name"));
107 assertEquals(Boolean.TRUE, json.get("composite"));
108
109 // At least 4 child detectors, none of them composite
110 Object[] children = (Object[])(Object)json.get("children");
111 assertTrue(children.length >= 4);
112 boolean hasOgg = false, hasPOIFS = false, hasZIP = false, hasMime = false;
113 for (Object o : children) {
114 Map<String,Object> d = (Map<String,Object>)o;
115 assertEquals(true, d.containsKey("name"));
116 assertEquals(true, d.containsKey("composite"));
117 assertEquals(Boolean.FALSE, d.get("composite"));
118 assertEquals(false, d.containsKey("children"));
119
120 String name = (String)d.get("name");
121 if (OggDetector.class.getName().equals(name)) {
122 hasOgg = true;
123 }
124 if (POIFSContainerDetector.class.getName().equals(name)) {
125 hasPOIFS = true;
126 }
127 if (ZipContainerDetector.class.getName().equals(name)) {
128 hasZIP = true;
129 }
130 if (MimeTypes.class.getName().equals(name)) {
131 hasMime = true;
132 }
133 }
134 assertEquals(true, hasOgg);
135 assertEquals(true, hasPOIFS);
136 assertEquals(true, hasZIP);
137 assertEquals(true, hasMime);
138 }
139 }
0 /*
1 * Licensed to the Apache Software Foundation (ASF) under one or more
2 * contributor license agreements. See the NOTICE file distributed with
3 * this work for additional information regarding copyright ownership.
4 * The ASF licenses this file to You under the Apache License, Version 2.0
5 * (the "License"); you may not use this file except in compliance with
6 * the License. You may obtain a copy of the License at
7 *
8 * http://www.apache.org/licenses/LICENSE-2.0
9 *
10 * Unless required by applicable law or agreed to in writing, software
11 * distributed under the License is distributed on an "AS IS" BASIS,
12 * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
13 * See the License for the specific language governing permissions and
14 * limitations under the License.
15 */
16
17 package org.apache.tika.server;
18
19 import static org.junit.Assert.assertEquals;
20 import static org.junit.Assert.assertTrue;
21
22 import javax.ws.rs.core.Response;
23
24 import java.io.InputStream;
25 import java.util.Map;
26
27 import org.apache.cxf.jaxrs.JAXRSServerFactoryBean;
28 import org.apache.cxf.jaxrs.client.WebClient;
29 import org.apache.cxf.jaxrs.lifecycle.SingletonResourceProvider;
30 import org.apache.tika.mime.MimeTypes;
31 import org.apache.tika.parser.microsoft.POIFSContainerDetector;
32 import org.apache.tika.parser.pkg.ZipContainerDetector;
33 import org.apache.tika.server.resource.TikaDetectors;
34 import org.eclipse.jetty.util.ajax.JSON;
35 import org.gagravarr.tika.OggDetector;
36 import org.junit.Test;
37
38 public class TikaDetectorsTest extends CXFTestBase {
39 private static final String DETECTORS_PATH = "/detectors";
40
41 @Override
42 protected void setUpResources(JAXRSServerFactoryBean sf) {
43 sf.setResourceClasses(TikaDetectors.class);
44 sf.setResourceProvider(
45 TikaDetectors.class,
46 new SingletonResourceProvider(new TikaDetectors(tika))
47 );
48 }
49
50 @Override
51 protected void setUpProviders(JAXRSServerFactoryBean sf) {
52 }
53
54 @Test
55 public void testGetPlainText() throws Exception {
56 Response response = WebClient
57 .create(endPoint + DETECTORS_PATH)
58 .type("text/plain")
59 .accept("text/plain")
60 .get();
61
62 String text = getStringFromInputStream((InputStream) response.getEntity());
63 assertContains("org.apache.tika.detect.DefaultDetector (Composite Detector)", text);
64 assertContains(OggDetector.class.getName(), text);
65 assertContains(POIFSContainerDetector.class.getName(), text);
66 assertContains(ZipContainerDetector.class.getName(), text);
67 assertContains(MimeTypes.class.getName(), text);
68 }
69
70 @Test
71 public void testGetHTML() throws Exception {
72 Response response = WebClient
73 .create(endPoint + DETECTORS_PATH)
74 .type("text/html")
75 .accept("text/html")
76 .get();
77
78 String text = getStringFromInputStream((InputStream) response.getEntity());
79 assertContains("<h2>DefaultDetector</h2>", text);
80 assertContains("Composite", text);
81
82 assertContains("<h3>OggDetector", text);
83 assertContains("<h3>POIFSContainerDetector", text);
84 assertContains("<h3>MimeTypes", text);
85
86 assertContains(OggDetector.class.getName(), text);
87 assertContains(POIFSContainerDetector.class.getName(), text);
88 assertContains(ZipContainerDetector.class.getName(), text);
89 assertContains(MimeTypes.class.getName(), text);
90 }
91
92 @Test
93 @SuppressWarnings("unchecked")
94 public void testGetJSON() throws Exception {
95 Response response = WebClient
96 .create(endPoint + DETECTORS_PATH)
97 .type(javax.ws.rs.core.MediaType.APPLICATION_JSON)
98 .accept(javax.ws.rs.core.MediaType.APPLICATION_JSON)
99 .get();
100
101 String jsonStr = getStringFromInputStream((InputStream) response.getEntity());
102 Map<String, Map<String, Object>> json = (Map<String, Map<String, Object>>) JSON.parse(jsonStr);
103
104 // Should have a nested structure
105 assertTrue(json.containsKey("name"));
106 assertTrue(json.containsKey("composite"));
107 assertTrue(json.containsKey("children"));
108 assertEquals("org.apache.tika.detect.DefaultDetector", json.get("name"));
109 assertEquals(Boolean.TRUE, json.get("composite"));
110
111 // At least 4 child detectors, none of them composite
112 Object[] children = (Object[]) (Object) json.get("children");
113 assertTrue(children.length >= 4);
114 boolean hasOgg = false, hasPOIFS = false, hasZIP = false, hasMime = false;
115 for (Object o : children) {
116 Map<String, Object> d = (Map<String, Object>) o;
117 assertTrue(d.containsKey("name"));
118 assertTrue(d.containsKey("composite"));
119 assertEquals(Boolean.FALSE, d.get("composite"));
120 assertEquals(false, d.containsKey("children"));
121
122 String name = (String) d.get("name");
123 if (OggDetector.class.getName().equals(name)) {
124 hasOgg = true;
125 }
126 if (POIFSContainerDetector.class.getName().equals(name)) {
127 hasPOIFS = true;
128 }
129 if (ZipContainerDetector.class.getName().equals(name)) {
130 hasZIP = true;
131 }
132 if (MimeTypes.class.getName().equals(name)) {
133 hasMime = true;
134 }
135 }
136 assertTrue(hasOgg);
137 assertTrue(hasPOIFS);
138 assertTrue(hasZIP);
139 assertTrue(hasMime);
140 }
141 }
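For reference, the assertions in testGetJSON above correspond to a /detectors response shaped roughly as follows. This is an illustrative sketch reconstructed from the assertions, not verbatim server output; the exact set of children depends on which detectors are on the classpath:

```json
{
  "name": "org.apache.tika.detect.DefaultDetector",
  "composite": true,
  "children": [
    { "name": "org.gagravarr.tika.OggDetector", "composite": false },
    { "name": "org.apache.tika.parser.microsoft.POIFSContainerDetector", "composite": false },
    { "name": "org.apache.tika.parser.pkg.ZipContainerDetector", "composite": false },
    { "name": "org.apache.tika.mime.MimeTypes", "composite": false }
  ]
}
```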
0 /*
1 * Licensed to the Apache Software Foundation (ASF) under one or more
2 * contributor license agreements. See the NOTICE file distributed with
3 * this work for additional information regarding copyright ownership.
4 * The ASF licenses this file to You under the Apache License, Version 2.0
5 * (the "License"); you may not use this file except in compliance with
6 * the License. You may obtain a copy of the License at
7 *
8 * http://www.apache.org/licenses/LICENSE-2.0
9 *
10 * Unless required by applicable law or agreed to in writing, software
11 * distributed under the License is distributed on an "AS IS" BASIS,
12 * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
13 * See the License for the specific language governing permissions and
14 * limitations under the License.
15 */
16
17 package org.apache.tika.server;
18
19 import static org.junit.Assert.assertEquals;
20
21 import java.io.InputStream;
22 import java.util.Map;
23
24 import javax.ws.rs.core.Response;
25
26 import org.apache.cxf.jaxrs.JAXRSServerFactoryBean;
27 import org.apache.cxf.jaxrs.client.WebClient;
28 import org.apache.cxf.jaxrs.lifecycle.SingletonResourceProvider;
29 import org.eclipse.jetty.util.ajax.JSON;
30 import org.junit.Test;
31
32 public class TikaMimeTypesTest extends CXFTestBase {
33 private static final String MIMETYPES_PATH = "/mime-types";
34
35 @Override
36 protected void setUpResources(JAXRSServerFactoryBean sf) {
37 sf.setResourceClasses(TikaMimeTypes.class);
38 sf.setResourceProvider(
39 TikaMimeTypes.class,
40 new SingletonResourceProvider(new TikaMimeTypes(tika))
41 );
42 }
43
44 @Override
45 protected void setUpProviders(JAXRSServerFactoryBean sf) {}
46
47 @Test
48 public void testGetPlainText() throws Exception {
49 Response response = WebClient
50 .create(endPoint + MIMETYPES_PATH)
51 .type("text/plain")
52 .accept("text/plain")
53 .get();
54
55 String text = getStringFromInputStream((InputStream) response.getEntity());
56 assertContains("text/plain", text);
57 assertContains("application/xml", text);
58 assertContains("video/x-ogm", text);
59
60 assertContains("supertype: video/ogg", text);
61
62 assertContains("alias: image/bmp", text);
63 }
64
65 @Test
66 public void testGetHTML() throws Exception {
67 Response response = WebClient
68 .create(endPoint + MIMETYPES_PATH)
69 .type("text/html")
70 .accept("text/html")
71 .get();
72
73 String text = getStringFromInputStream((InputStream) response.getEntity());
74 assertContains("text/plain", text);
75 assertContains("application/xml", text);
76 assertContains("video/x-ogm", text);
77
78 assertContains("<h2>text/plain", text);
79 assertContains("name=\"text/plain", text);
80
81 assertContains("Super Type: <a href=\"#video/ogg\">video/ogg", text);
82
83 assertContains("Alias: image/bmp", text);
84 }
85
86 @Test
87 @SuppressWarnings("unchecked")
88 public void testGetJSON() throws Exception {
89 Response response = WebClient
90 .create(endPoint + MIMETYPES_PATH)
91 .type(javax.ws.rs.core.MediaType.APPLICATION_JSON)
92 .accept(javax.ws.rs.core.MediaType.APPLICATION_JSON)
93 .get();
94
95 String jsonStr = getStringFromInputStream((InputStream) response.getEntity());
96 Map<String,Map<String,Object>> json = (Map<String,Map<String,Object>>)JSON.parse(jsonStr);
97
98 assertEquals(true, json.containsKey("text/plain"));
99 assertEquals(true, json.containsKey("application/xml"));
100 assertEquals(true, json.containsKey("video/x-ogm"));
101 assertEquals(true, json.containsKey("image/x-ms-bmp"));
102
103 Map<String,Object> bmp = json.get("image/x-ms-bmp");
104 assertEquals(true, bmp.containsKey("alias"));
105 Object[] aliases = (Object[])bmp.get("alias");
106 assertEquals(1, aliases.length);
107 assertEquals("image/bmp", aliases[0]);
108 assertEquals("org.apache.tika.parser.image.ImageParser", bmp.get("parser"));
109
110 Map<String,Object> ogm = json.get("video/x-ogm");
111 assertEquals("video/ogg", ogm.get("supertype"));
112 assertEquals("org.gagravarr.tika.OggParser", ogm.get("parser"));
113 }
114 }
0 /*
1 * Licensed to the Apache Software Foundation (ASF) under one or more
2 * contributor license agreements. See the NOTICE file distributed with
3 * this work for additional information regarding copyright ownership.
4 * The ASF licenses this file to You under the Apache License, Version 2.0
5 * (the "License"); you may not use this file except in compliance with
6 * the License. You may obtain a copy of the License at
7 *
8 * http://www.apache.org/licenses/LICENSE-2.0
9 *
10 * Unless required by applicable law or agreed to in writing, software
11 * distributed under the License is distributed on an "AS IS" BASIS,
12 * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
13 * See the License for the specific language governing permissions and
14 * limitations under the License.
15 */
16
17 package org.apache.tika.server;
18
19 import static org.junit.Assert.assertEquals;
20 import static org.junit.Assert.assertTrue;
21
22 import javax.ws.rs.core.Response;
23
24 import java.io.InputStream;
25 import java.util.Map;
26
27 import org.apache.cxf.jaxrs.JAXRSServerFactoryBean;
28 import org.apache.cxf.jaxrs.client.WebClient;
29 import org.apache.cxf.jaxrs.lifecycle.SingletonResourceProvider;
30 import org.apache.tika.server.resource.TikaMimeTypes;
31 import org.eclipse.jetty.util.ajax.JSON;
32 import org.junit.Test;
33
34 public class TikaMimeTypesTest extends CXFTestBase {
35 private static final String MIMETYPES_PATH = "/mime-types";
36
37 @Override
38 protected void setUpResources(JAXRSServerFactoryBean sf) {
39 sf.setResourceClasses(TikaMimeTypes.class);
40 sf.setResourceProvider(
41 TikaMimeTypes.class,
42 new SingletonResourceProvider(new TikaMimeTypes(tika))
43 );
44 }
45
46 @Override
47 protected void setUpProviders(JAXRSServerFactoryBean sf) {
48 }
49
50 @Test
51 public void testGetPlainText() throws Exception {
52 Response response = WebClient
53 .create(endPoint + MIMETYPES_PATH)
54 .type("text/plain")
55 .accept("text/plain")
56 .get();
57
58 String text = getStringFromInputStream((InputStream) response.getEntity());
59 assertContains("text/plain", text);
60 assertContains("application/xml", text);
61 assertContains("video/x-ogm", text);
62
63 assertContains("supertype: video/ogg", text);
64
65 assertContains("alias: image/bmp", text);
66 }
67
68 @Test
69 public void testGetHTML() throws Exception {
70 Response response = WebClient
71 .create(endPoint + MIMETYPES_PATH)
72 .type("text/html")
73 .accept("text/html")
74 .get();
75
76 String text = getStringFromInputStream((InputStream) response.getEntity());
77 assertContains("text/plain", text);
78 assertContains("application/xml", text);
79 assertContains("video/x-ogm", text);
80
81 assertContains("<h2>text/plain", text);
82 assertContains("name=\"text/plain", text);
83
84 assertContains("Super Type: <a href=\"#video/ogg\">video/ogg", text);
85
86 assertContains("Alias: image/bmp", text);
87 }
88
89 @Test
90 @SuppressWarnings("unchecked")
91 public void testGetJSON() throws Exception {
92 Response response = WebClient
93 .create(endPoint + MIMETYPES_PATH)
94 .type(javax.ws.rs.core.MediaType.APPLICATION_JSON)
95 .accept(javax.ws.rs.core.MediaType.APPLICATION_JSON)
96 .get();
97
98 String jsonStr = getStringFromInputStream((InputStream) response.getEntity());
99 Map<String, Map<String, Object>> json = (Map<String, Map<String, Object>>) JSON.parse(jsonStr);
100
101 assertEquals(true, json.containsKey("text/plain"));
102 assertEquals(true, json.containsKey("application/xml"));
103 assertEquals(true, json.containsKey("video/x-ogm"));
104 assertEquals(true, json.containsKey("image/x-ms-bmp"));
105
106 Map<String, Object> bmp = json.get("image/x-ms-bmp");
107 assertEquals(true, bmp.containsKey("alias"));
108 Object[] aliases = (Object[]) bmp.get("alias");
109 assertEquals(1, aliases.length);
110 assertEquals("image/bmp", aliases[0]);
111
112 String whichParser = bmp.get("parser").toString();
113 assertTrue("Which parser", whichParser.equals("org.apache.tika.parser.ocr.TesseractOCRParser") ||
114 whichParser.equals("org.apache.tika.parser.image.ImageParser"));
115
116 Map<String, Object> ogm = json.get("video/x-ogm");
117 assertEquals("video/ogg", ogm.get("supertype"));
118 assertEquals("org.gagravarr.tika.OggParser", ogm.get("parser"));
119 }
120 }
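The /mime-types JSON assertions above imply a response shaped roughly like the sketch below (only two of the many entries are shown, and this is reconstructed from the assertions rather than captured output). Note that the test deliberately accepts either ImageParser or TesseractOCRParser for image/x-ms-bmp, since the registered parser depends on whether Tesseract is available:

```json
{
  "image/x-ms-bmp": {
    "alias": ["image/bmp"],
    "parser": "org.apache.tika.parser.image.ImageParser"
  },
  "video/x-ogm": {
    "supertype": "video/ogg",
    "parser": "org.gagravarr.tika.OggParser"
  }
}
```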
0 /*
1 * Licensed to the Apache Software Foundation (ASF) under one or more
2 * contributor license agreements. See the NOTICE file distributed with
3 * this work for additional information regarding copyright ownership.
4 * The ASF licenses this file to You under the Apache License, Version 2.0
5 * (the "License"); you may not use this file except in compliance with
6 * the License. You may obtain a copy of the License at
7 *
8 * http://www.apache.org/licenses/LICENSE-2.0
9 *
10 * Unless required by applicable law or agreed to in writing, software
11 * distributed under the License is distributed on an "AS IS" BASIS,
12 * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
13 * See the License for the specific language governing permissions and
14 * limitations under the License.
15 */
16
17 package org.apache.tika.server;
18
19 import static org.junit.Assert.assertEquals;
20 import static org.junit.Assert.assertTrue;
21
22 import java.io.InputStream;
23 import java.util.Map;
24
25 import javax.ws.rs.core.Response;
26
27 import org.apache.cxf.jaxrs.JAXRSServerFactoryBean;
28 import org.apache.cxf.jaxrs.client.WebClient;
29 import org.apache.cxf.jaxrs.lifecycle.SingletonResourceProvider;
30 import org.apache.tika.parser.microsoft.ooxml.OOXMLParser;
31 import org.apache.tika.parser.pdf.PDFParser;
32 import org.apache.tika.parser.pkg.PackageParser;
33 import org.eclipse.jetty.util.ajax.JSON;
34 import org.gagravarr.tika.OpusParser;
35 import org.junit.Test;
36
37 public class TikaParsersTest extends CXFTestBase {
38 private static final String PARSERS_SUMMARY_PATH = "/parsers";
39 private static final String PARSERS_DETAILS_PATH = "/parsers/details";
40
41 @Override
42 protected void setUpResources(JAXRSServerFactoryBean sf) {
43 sf.setResourceClasses(TikaParsers.class);
44 sf.setResourceProvider(
45 TikaParsers.class,
46 new SingletonResourceProvider(new TikaParsers(tika))
47 );
48 }
49
50 @Override
51 protected void setUpProviders(JAXRSServerFactoryBean sf) {}
52
53 protected String getPath(boolean withDetails) {
54 return withDetails ? PARSERS_DETAILS_PATH : PARSERS_SUMMARY_PATH;
55 }
56
57 @Test
58 public void testGetPlainText() throws Exception {
59 for (boolean details : new boolean[] { false, true }) {
60 Response response = WebClient
61 .create(endPoint + getPath(details))
62 .type("text/plain")
63 .accept("text/plain")
64 .get();
65
66 String text = getStringFromInputStream((InputStream) response.getEntity());
67 assertContains("org.apache.tika.parser.DefaultParser (Composite Parser)", text);
68 assertContains(OpusParser.class.getName(), text);
69 assertContains(PackageParser.class.getName(), text);
70 assertContains(OOXMLParser.class.getName(), text);
71
72 if (details) {
73 // Should have the mimetypes they handle
74 assertContains("text/plain", text);
75 assertContains("application/pdf", text);
76 assertContains("audio/ogg", text);
77 } else {
78 // Shouldn't do
79 assertNotFound("text/plain", text);
80 assertNotFound("application/pdf", text);
81 assertNotFound("audio/ogg", text);
82 }
83 }
84 }
85
86 @Test
87 public void testGetHTML() throws Exception {
88 for (boolean details : new boolean[] { false, true }) {
89 Response response = WebClient
90 .create(endPoint + getPath(details))
91 .type("text/html")
92 .accept("text/html")
93 .get();
94
95 String text = getStringFromInputStream((InputStream) response.getEntity());
96 assertContains("<h2>DefaultParser</h2>", text);
97 assertContains("Composite", text);
98
99 assertContains("<h3>OpusParser", text);
100 assertContains("<h3>PackageParser", text);
101 assertContains("<h3>OOXMLParser", text);
102
103 assertContains(OpusParser.class.getName(), text);
104 assertContains(PackageParser.class.getName(), text);
105 assertContains(OOXMLParser.class.getName(), text);
106
107 if (details) {
108 // Should have the mimetypes they handle
109 assertContains("<li>text/plain", text);
110 assertContains("<li>application/pdf", text);
111 assertContains("<li>audio/ogg", text);
112 } else {
113 // Shouldn't do
114 assertNotFound("text/plain", text);
115 assertNotFound("application/pdf", text);
116 assertNotFound("audio/ogg", text);
117 }
118 }
119 }
120
121 @Test
122 @SuppressWarnings("unchecked")
123 public void testGetJSON() throws Exception {
124 for (boolean details : new boolean[] { false, true }) {
125 Response response = WebClient
126 .create(endPoint + getPath(details))
127 .type(javax.ws.rs.core.MediaType.APPLICATION_JSON)
128 .accept(javax.ws.rs.core.MediaType.APPLICATION_JSON)
129 .get();
130
131 String jsonStr = getStringFromInputStream((InputStream) response.getEntity());
132 Map<String,Map<String,Object>> json = (Map<String,Map<String,Object>>)JSON.parse(jsonStr);
133
134 // Should have a nested structure
135 assertEquals(true, json.containsKey("name"));
136 assertEquals(true, json.containsKey("composite"));
137 assertEquals(true, json.containsKey("children"));
138 assertEquals("org.apache.tika.parser.DefaultParser", json.get("name"));
139 assertEquals(Boolean.TRUE, json.get("composite"));
140
141 // At least 20 child parsers which aren't composite
142 Object[] children = (Object[])(Object)json.get("children");
143 assertTrue(children.length >= 20);
144 boolean hasOpus = false, hasOOXML = false, hasPDF = false, hasZip = false;
145 int nonComposite = 0;
146 for (Object o : children) {
147 Map<String,Object> d = (Map<String,Object>)o;
148 assertEquals(true, d.containsKey("name"));
149 assertEquals(true, d.containsKey("composite"));
150 assertEquals(Boolean.FALSE, d.get("composite"));
151 assertEquals(false, d.containsKey("children"));
152
153 if (d.get("composite") == Boolean.FALSE) nonComposite++;
154
155 // Will only have mime types if requested
156 assertEquals(details, d.containsKey("supportedTypes"));
157
158 String name = (String)d.get("name");
159 if (OpusParser.class.getName().equals(name)) {
160 hasOpus = true;
161 }
162 if (OOXMLParser.class.getName().equals(name)) {
163 hasOOXML = true;
164 }
165 if (PDFParser.class.getName().equals(name)) {
166 hasPDF = true;
167 }
168 if (PackageParser.class.getName().equals(name)) {
169 hasZip = true;
170 }
171 }
172 assertEquals(true, hasOpus);
173 assertEquals(true, hasOOXML);
174 assertEquals(true, hasPDF);
175 assertEquals(true, hasZip);
176 assertTrue(nonComposite > 20);
177 }
178 }
179 }
0 /*
1 * Licensed to the Apache Software Foundation (ASF) under one or more
2 * contributor license agreements. See the NOTICE file distributed with
3 * this work for additional information regarding copyright ownership.
4 * The ASF licenses this file to You under the Apache License, Version 2.0
5 * (the "License"); you may not use this file except in compliance with
6 * the License. You may obtain a copy of the License at
7 *
8 * http://www.apache.org/licenses/LICENSE-2.0
9 *
10 * Unless required by applicable law or agreed to in writing, software
11 * distributed under the License is distributed on an "AS IS" BASIS,
12 * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
13 * See the License for the specific language governing permissions and
14 * limitations under the License.
15 */
16
17 package org.apache.tika.server;
18
19 import static org.junit.Assert.assertEquals;
20 import static org.junit.Assert.assertTrue;
21
22 import javax.ws.rs.core.Response;
23
24 import java.io.InputStream;
25 import java.util.Map;
26
27 import org.apache.cxf.jaxrs.JAXRSServerFactoryBean;
28 import org.apache.cxf.jaxrs.client.WebClient;
29 import org.apache.cxf.jaxrs.lifecycle.SingletonResourceProvider;
30 import org.apache.tika.parser.microsoft.ooxml.OOXMLParser;
31 import org.apache.tika.parser.pdf.PDFParser;
32 import org.apache.tika.parser.pkg.PackageParser;
33 import org.apache.tika.server.resource.TikaParsers;
34 import org.eclipse.jetty.util.ajax.JSON;
35 import org.gagravarr.tika.OpusParser;
36 import org.junit.Test;
37
38 public class TikaParsersTest extends CXFTestBase {
39 private static final String PARSERS_SUMMARY_PATH = "/parsers";
40 private static final String PARSERS_DETAILS_PATH = "/parsers/details";
41
42 @Override
43 protected void setUpResources(JAXRSServerFactoryBean sf) {
44 sf.setResourceClasses(TikaParsers.class);
45 sf.setResourceProvider(
46 TikaParsers.class,
47 new SingletonResourceProvider(new TikaParsers(tika))
48 );
49 }
50
51 @Override
52 protected void setUpProviders(JAXRSServerFactoryBean sf) {
53 }
54
55 protected String getPath(boolean withDetails) {
56 return withDetails ? PARSERS_DETAILS_PATH : PARSERS_SUMMARY_PATH;
57 }
58
59 @Test
60 public void testGetPlainText() throws Exception {
61 for (boolean details : new boolean[]{false, true}) {
62 Response response = WebClient
63 .create(endPoint + getPath(details))
64 .type("text/plain")
65 .accept("text/plain")
66 .get();
67
68 String text = getStringFromInputStream((InputStream) response.getEntity());
69 assertContains("org.apache.tika.parser.DefaultParser (Composite Parser)", text);
70 assertContains(OpusParser.class.getName(), text);
71 assertContains(PackageParser.class.getName(), text);
72 assertContains(OOXMLParser.class.getName(), text);
73
74 if (details) {
75 // Should have the mimetypes they handle
76 assertContains("text/plain", text);
77 assertContains("application/pdf", text);
78 assertContains("audio/ogg", text);
79 } else {
80 // Shouldn't do
81 assertNotFound("text/plain", text);
82 assertNotFound("application/pdf", text);
83 assertNotFound("audio/ogg", text);
84 }
85 }
86 }
87
88 @Test
89 public void testGetHTML() throws Exception {
90 for (boolean details : new boolean[]{false, true}) {
91 Response response = WebClient
92 .create(endPoint + getPath(details))
93 .type("text/html")
94 .accept("text/html")
95 .get();
96
97 String text = getStringFromInputStream((InputStream) response.getEntity());
98 assertContains("<h2>DefaultParser</h2>", text);
99 assertContains("Composite", text);
100
101 assertContains("<h3>OpusParser", text);
102 assertContains("<h3>PackageParser", text);
103 assertContains("<h3>OOXMLParser", text);
104
105 assertContains(OpusParser.class.getName(), text);
106 assertContains(PackageParser.class.getName(), text);
107 assertContains(OOXMLParser.class.getName(), text);
108
109 if (details) {
110 // Should have the mimetypes they handle
111 assertContains("<li>text/plain", text);
112 assertContains("<li>application/pdf", text);
113 assertContains("<li>audio/ogg", text);
114 } else {
115 // Shouldn't do
116 assertNotFound("text/plain", text);
117 assertNotFound("application/pdf", text);
118 assertNotFound("audio/ogg", text);
119 }
120 }
121 }
122
123 @Test
124 @SuppressWarnings("unchecked")
125 public void testGetJSON() throws Exception {
126 for (boolean details : new boolean[]{false, true}) {
127 Response response = WebClient
128 .create(endPoint + getPath(details))
129 .type(javax.ws.rs.core.MediaType.APPLICATION_JSON)
130 .accept(javax.ws.rs.core.MediaType.APPLICATION_JSON)
131 .get();
132
133 String jsonStr = getStringFromInputStream((InputStream) response.getEntity());
134 Map<String, Map<String, Object>> json = (Map<String, Map<String, Object>>) JSON.parse(jsonStr);
135
136 // Should have a nested structure
137 assertEquals(true, json.containsKey("name"));
138 assertEquals(true, json.containsKey("composite"));
139 assertEquals(true, json.containsKey("children"));
140 assertEquals("org.apache.tika.parser.DefaultParser", json.get("name"));
141 assertEquals(Boolean.TRUE, json.get("composite"));
142
143 // At least 20 child parsers which aren't composite
144 Object[] children = (Object[]) (Object) json.get("children");
145 assertTrue(children.length >= 20);
146 boolean hasOpus = false, hasOOXML = false, hasPDF = false, hasZip = false;
147 int nonComposite = 0;
148 for (Object o : children) {
149 Map<String, Object> d = (Map<String, Object>) o;
150 assertEquals(true, d.containsKey("name"));
151 assertEquals(true, d.containsKey("composite"));
152 assertEquals(Boolean.FALSE, d.get("composite"));
153 assertEquals(false, d.containsKey("children"));
154
155 if (d.get("composite") == Boolean.FALSE) nonComposite++;
156
157 // Will only have mime types if requested
158 assertEquals(details, d.containsKey("supportedTypes"));
159
160 String name = (String) d.get("name");
161 if (OpusParser.class.getName().equals(name)) {
162 hasOpus = true;
163 }
164 if (OOXMLParser.class.getName().equals(name)) {
165 hasOOXML = true;
166 }
167 if (PDFParser.class.getName().equals(name)) {
168 hasPDF = true;
169 }
170 if (PackageParser.class.getName().equals(name)) {
171 hasZip = true;
172 }
173 }
174 assertEquals(true, hasOpus);
175 assertEquals(true, hasOOXML);
176 assertEquals(true, hasPDF);
177 assertEquals(true, hasZip);
178 assertTrue(nonComposite > 20);
179 }
180 }
181 }
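The /parsers/details assertions above correspond to a response shaped roughly as follows (an illustrative sketch with a single child shown; the real response has at least 20 non-composite children). The supportedTypes array is present only on the /parsers/details endpoint, matching the `assertEquals(details, d.containsKey("supportedTypes"))` check:

```json
{
  "name": "org.apache.tika.parser.DefaultParser",
  "composite": true,
  "children": [
    {
      "name": "org.gagravarr.tika.OpusParser",
      "composite": false,
      "supportedTypes": ["audio/ogg"]
    }
  ]
}
```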
1919 import static org.junit.Assert.assertEquals;
2020 import static org.junit.Assert.assertTrue;
2121
22 import javax.ws.rs.core.Response;
23
2224 import java.io.InputStream;
23
24 import javax.ws.rs.core.Response;
25 import java.util.ArrayList;
26 import java.util.List;
2527
2628 import org.apache.cxf.jaxrs.JAXRSServerFactoryBean;
2729 import org.apache.cxf.jaxrs.client.WebClient;
2830 import org.apache.cxf.jaxrs.ext.multipart.Attachment;
2931 import org.apache.cxf.jaxrs.lifecycle.SingletonResourceProvider;
32 import org.apache.tika.server.resource.TikaResource;
3033 import org.junit.Test;
3134
3235 public class TikaResourceTest extends CXFTestBase {
33 private static final String TIKA_PATH = "/tika";
3436 public static final String TEST_DOC = "test.doc";
3537 public static final String TEST_XLSX = "16637.xlsx";
38 public static final String TEST_PASSWORD_PROTECTED = "password.xls";
39 private static final String TEST_RECURSIVE_DOC = "test_recursive_embedded.docx";
40
41 private static final String TIKA_PATH = "/tika";
3642 private static final int UNPROCESSEABLE = 422;
3743
3844 @Override
3945 protected void setUpResources(JAXRSServerFactoryBean sf) {
4046 sf.setResourceClasses(TikaResource.class);
4147 sf.setResourceProvider(TikaResource.class,
42 new SingletonResourceProvider(new TikaResource(tika)));
48 new SingletonResourceProvider(new TikaResource(tika)));
4349 }
4450
4551 @Override
46 protected void setUpProviders(JAXRSServerFactoryBean sf) {}
52 protected void setUpProviders(JAXRSServerFactoryBean sf) {
53 List<Object> providers = new ArrayList<Object>();
54 providers.add(new TikaServerParseExceptionMapper(false));
55 sf.setProviders(providers);
56 }
4757
48 @Test
49 public void testHelloWorld() throws Exception {
50 Response response = WebClient.create(endPoint + TIKA_PATH)
51 .type("text/plain").accept("text/plain").get();
52 assertEquals(TikaResource.GREETING,
53 getStringFromInputStream((InputStream) response.getEntity()));
54 }
58 @Test
59 public void testHelloWorld() throws Exception {
60 Response response = WebClient.create(endPoint + TIKA_PATH)
61 .type("text/plain").accept("text/plain").get();
62 assertEquals(TikaResource.GREETING,
63 getStringFromInputStream((InputStream) response.getEntity()));
64 }
5565
56 @Test
57 public void testSimpleWord() throws Exception {
58 Response response = WebClient.create(endPoint + TIKA_PATH)
59 .type("application/msword")
60 .accept("text/plain")
61 .put(ClassLoader.getSystemResourceAsStream(TEST_DOC));
62 String responseMsg = getStringFromInputStream((InputStream) response
63 .getEntity());
64 assertTrue(responseMsg.contains("test"));
65 }
66 @Test
67 public void testSimpleWord() throws Exception {
68 Response response = WebClient.create(endPoint + TIKA_PATH)
69 .type("application/msword")
70 .accept("text/plain")
71 .put(ClassLoader.getSystemResourceAsStream(TEST_DOC));
72 String responseMsg = getStringFromInputStream((InputStream) response
73 .getEntity());
74 assertTrue(responseMsg.contains("test"));
75 }
6676
67 @Test
68 public void testApplicationWadl() throws Exception {
69 Response response = WebClient
70 .create(endPoint + TIKA_PATH + "?_wadl")
71 .accept("text/plain").get();
72 String resp = getStringFromInputStream((InputStream) response
73 .getEntity());
74 assertTrue(resp.startsWith("<application"));
75 }
77 @Test
78 public void testApplicationWadl() throws Exception {
79 Response response = WebClient
80 .create(endPoint + TIKA_PATH + "?_wadl")
81 .accept("text/plain").get();
82 String resp = getStringFromInputStream((InputStream) response
83 .getEntity());
84 assertTrue(resp.startsWith("<application"));
85 }
7686
77 @Test
78 public void testPasswordXLS() throws Exception {
79 Response response = WebClient.create(endPoint + TIKA_PATH)
80 .type("application/vnd.ms-excel")
81 .accept("text/plain")
82 .put(ClassLoader.getSystemResourceAsStream("password.xls"));
87 @Test
88 public void testPasswordXLS() throws Exception {
89 Response response = WebClient.create(endPoint + TIKA_PATH)
90 .type("application/vnd.ms-excel")
91 .accept("text/plain")
92 .put(ClassLoader.getSystemResourceAsStream("password.xls"));
8393
84 assertEquals(UNPROCESSEABLE, response.getStatus());
85 }
94 assertEquals(UNPROCESSEABLE, response.getStatus());
95 }
8696
87 @Test
88 public void testSimpleWordHTML() throws Exception {
89 Response response = WebClient.create(endPoint + TIKA_PATH)
90 .type("application/msword")
91 .accept("text/html")
92 .put(ClassLoader.getSystemResourceAsStream(TEST_DOC));
93 String responseMsg = getStringFromInputStream((InputStream) response
94 .getEntity());
95 assertTrue(responseMsg.contains("test"));
96 }
97 @Test
98 public void testSimpleWordHTML() throws Exception {
99 Response response = WebClient.create(endPoint + TIKA_PATH)
100 .type("application/msword")
101 .accept("text/html")
102 .put(ClassLoader.getSystemResourceAsStream(TEST_DOC));
103 String responseMsg = getStringFromInputStream((InputStream) response
104 .getEntity());
105 assertTrue(responseMsg.contains("test"));
106 }
97107
98 @Test
99 public void testPasswordXLSHTML() throws Exception {
100 Response response = WebClient.create(endPoint + TIKA_PATH)
101 .type("application/vnd.ms-excel")
102 .accept("text/html")
103 .put(ClassLoader.getSystemResourceAsStream("password.xls"));
108 @Test
109 public void testPasswordXLSHTML() throws Exception {
110 Response response = WebClient.create(endPoint + TIKA_PATH)
111 .type("application/vnd.ms-excel")
112 .accept("text/html")
113 .put(ClassLoader.getSystemResourceAsStream("password.xls"));
104114
105 assertEquals(UNPROCESSEABLE, response.getStatus());
106 }
115 assertEquals(UNPROCESSEABLE, response.getStatus());
116 }
107117
108 @Test
109 public void testSimpleWordXML() throws Exception {
110 Response response = WebClient.create(endPoint + TIKA_PATH)
111 .type("application/msword")
112 .accept("text/xml")
113 .put(ClassLoader.getSystemResourceAsStream(TEST_DOC));
114 String responseMsg = getStringFromInputStream((InputStream) response
115 .getEntity());
116 assertTrue(responseMsg.contains("test"));
117 }
118 @Test
119 public void testSimpleWordXML() throws Exception {
120 Response response = WebClient.create(endPoint + TIKA_PATH)
121 .type("application/msword")
122 .accept("text/xml")
123 .put(ClassLoader.getSystemResourceAsStream(TEST_DOC));
124 String responseMsg = getStringFromInputStream((InputStream) response
125 .getEntity());
126 assertTrue(responseMsg.contains("test"));
127 }
118128
119 @Test
120 public void testPasswordXLSXML() throws Exception {
121 Response response = WebClient.create(endPoint + TIKA_PATH)
122 .type("application/vnd.ms-excel")
123 .accept("text/xml")
124 .put(ClassLoader.getSystemResourceAsStream("password.xls"));
129 @Test
130 public void testPasswordXLSXML() throws Exception {
131 Response response = WebClient.create(endPoint + TIKA_PATH)
132 .type("application/vnd.ms-excel")
133 .accept("text/xml")
134 .put(ClassLoader.getSystemResourceAsStream("password.xls"));
125135
126 assertEquals(UNPROCESSEABLE, response.getStatus());
127 }
128
129 @Test
130 public void testSimpleWordMultipartXML() throws Exception {
131 ClassLoader.getSystemResourceAsStream(TEST_DOC);
132 Attachment attachmentPart =
133 new Attachment("myworddoc", "application/msword", ClassLoader.getSystemResourceAsStream(TEST_DOC));
134 WebClient webClient = WebClient.create(endPoint + TIKA_PATH + "/form");
135 Response response = webClient.type("multipart/form-data")
136 .accept("text/xml")
137 .put(attachmentPart);
138 String responseMsg = getStringFromInputStream((InputStream) response
139 .getEntity());
140 assertTrue(responseMsg.contains("test"));
141 }
142
136 assertEquals(UNPROCESSEABLE, response.getStatus());
137 }
138
139 @Test
140 public void testSimpleWordMultipartXML() throws Exception {
141 ClassLoader.getSystemResourceAsStream(TEST_DOC);
142 Attachment attachmentPart =
143 new Attachment("myworddoc", "application/msword", ClassLoader.getSystemResourceAsStream(TEST_DOC));
144 WebClient webClient = WebClient.create(endPoint + TIKA_PATH + "/form");
145 Response response = webClient.type("multipart/form-data")
146 .accept("text/xml")
147 .post(attachmentPart);
148 String responseMsg = getStringFromInputStream((InputStream) response
149 .getEntity());
150 assertTrue(responseMsg.contains("test"));
151 }
152
153 @Test
154 public void testEmbedded() throws Exception {
155 //first try text
156 Response response = WebClient.create(endPoint + TIKA_PATH)
157 .accept("text/plain")
158 .put(ClassLoader.getSystemResourceAsStream(TEST_RECURSIVE_DOC));
159 String responseMsg = getStringFromInputStream((InputStream) response
160 .getEntity());
161 assertTrue(responseMsg.contains("Course of human events"));
162
163 //now go for xml -- different call than text
164 response = WebClient.create(endPoint + TIKA_PATH)
165 .accept("text/xml")
166 .put(ClassLoader.getSystemResourceAsStream(TEST_RECURSIVE_DOC));
167 responseMsg = getStringFromInputStream((InputStream) response
168 .getEntity());
169 assertTrue(responseMsg.contains("Course of human events"));
170 }
171
143172 }
0 /*
1 * Licensed to the Apache Software Foundation (ASF) under one or more
2 * contributor license agreements. See the NOTICE file distributed with
3 * this work for additional information regarding copyright ownership.
4 * The ASF licenses this file to You under the Apache License, Version 2.0
5 * (the "License"); you may not use this file except in compliance with
6 * the License. You may obtain a copy of the License at
7 *
8 * http://www.apache.org/licenses/LICENSE-2.0
9 *
10 * Unless required by applicable law or agreed to in writing, software
11 * distributed under the License is distributed on an "AS IS" BASIS,
12 * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
13 * See the License for the specific language governing permissions and
14 * limitations under the License.
15 */
16
17 package org.apache.tika.server;
18
19 import static org.junit.Assert.assertEquals;
20
21 import java.io.InputStream;
22
23 import javax.ws.rs.core.Response;
24
25 import org.apache.cxf.jaxrs.JAXRSServerFactoryBean;
26 import org.apache.cxf.jaxrs.client.WebClient;
27 import org.apache.cxf.jaxrs.lifecycle.SingletonResourceProvider;
28 import org.apache.tika.Tika;
29 import org.junit.Test;
30
31 public class TikaVersionTest extends CXFTestBase {
32 protected static final String VERSION_PATH = "/version";
33
34 @Override
35 protected void setUpResources(JAXRSServerFactoryBean sf) {
36 sf.setResourceClasses(TikaVersion.class);
37 sf.setResourceProvider(
38 TikaVersion.class,
39 new SingletonResourceProvider(new TikaVersion(tika))
40 );
41 }
42
43 @Override
44 protected void setUpProviders(JAXRSServerFactoryBean sf) {}
45
46 @Test
47 public void testGetVersion() throws Exception {
48 Response response = WebClient
49 .create(endPoint + VERSION_PATH)
50 .type("text/plain")
51 .accept("text/plain")
52 .get();
53
54 assertEquals(new Tika().toString(),
55 getStringFromInputStream((InputStream) response.getEntity()));
56 }
57 }
0 /*
1 * Licensed to the Apache Software Foundation (ASF) under one or more
2 * contributor license agreements. See the NOTICE file distributed with
3 * this work for additional information regarding copyright ownership.
4 * The ASF licenses this file to You under the Apache License, Version 2.0
5 * (the "License"); you may not use this file except in compliance with
6 * the License. You may obtain a copy of the License at
7 *
8 * http://www.apache.org/licenses/LICENSE-2.0
9 *
10 * Unless required by applicable law or agreed to in writing, software
11 * distributed under the License is distributed on an "AS IS" BASIS,
12 * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
13 * See the License for the specific language governing permissions and
14 * limitations under the License.
15 */
16
17 package org.apache.tika.server;
18
19 import static org.junit.Assert.assertEquals;
20
21 import javax.ws.rs.core.Response;
22
23 import java.io.InputStream;
24
25 import org.apache.cxf.jaxrs.JAXRSServerFactoryBean;
26 import org.apache.cxf.jaxrs.client.WebClient;
27 import org.apache.cxf.jaxrs.lifecycle.SingletonResourceProvider;
28 import org.apache.tika.Tika;
29 import org.apache.tika.server.resource.TikaVersion;
30 import org.junit.Test;
31
32 public class TikaVersionTest extends CXFTestBase {
33 protected static final String VERSION_PATH = "/version";
34
35 @Override
36 protected void setUpResources(JAXRSServerFactoryBean sf) {
37 sf.setResourceClasses(TikaVersion.class);
38 sf.setResourceProvider(
39 TikaVersion.class,
40 new SingletonResourceProvider(new TikaVersion(tika))
41 );
42 }
43
44 @Override
45 protected void setUpProviders(JAXRSServerFactoryBean sf) {
46 }
47
48 @Test
49 public void testGetVersion() throws Exception {
50 Response response = WebClient
51 .create(endPoint + VERSION_PATH)
52 .type("text/plain")
53 .accept("text/plain")
54 .get();
55
56 assertEquals(new Tika().toString(),
57 getStringFromInputStream((InputStream) response.getEntity()));
58 }
59 }
0 /*
1 * Licensed to the Apache Software Foundation (ASF) under one or more
2 * contributor license agreements. See the NOTICE file distributed with
3 * this work for additional information regarding copyright ownership.
4 * The ASF licenses this file to You under the Apache License, Version 2.0
5 * (the "License"); you may not use this file except in compliance with
6 * the License. You may obtain a copy of the License at
7 *
8 * http://www.apache.org/licenses/LICENSE-2.0
9 *
10 * Unless required by applicable law or agreed to in writing, software
11 * distributed under the License is distributed on an "AS IS" BASIS,
12 * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
13 * See the License for the specific language governing permissions and
14 * limitations under the License.
15 */
16
17 package org.apache.tika.server;
18
19 import java.io.InputStream;
20
21 import javax.ws.rs.core.Response;
22
23 import org.apache.cxf.jaxrs.JAXRSServerFactoryBean;
24 import org.apache.cxf.jaxrs.client.WebClient;
25 import org.apache.cxf.jaxrs.lifecycle.SingletonResourceProvider;
26 import org.apache.tika.Tika;
27 import org.junit.Test;
28
29 public class TikaWelcomeTest extends CXFTestBase {
30 protected static final String WELCOME_PATH = "/";
31 private static final String VERSION_PATH = TikaVersionTest.VERSION_PATH;
32
33 @Override
34 protected void setUpResources(JAXRSServerFactoryBean sf) {
35 sf.setResourceClasses(TikaWelcome.class, TikaVersion.class);
36 sf.setResourceProvider(
37 TikaWelcome.class,
38 new SingletonResourceProvider(new TikaWelcome(tika, sf))
39 );
40 sf.setResourceProvider(
41 TikaVersion.class,
42 new SingletonResourceProvider(new TikaVersion(tika))
43 );
44 }
45
46 @Override
47 protected void setUpProviders(JAXRSServerFactoryBean sf) {}
48
49 @Test
50 public void testGetHTMLWelcome() throws Exception {
51 Response response = WebClient
52 .create(endPoint + WELCOME_PATH)
53 .type("text/html")
54 .accept("text/html")
55 .get();
56
57 String html = getStringFromInputStream((InputStream) response.getEntity());
58
59 assertContains(new Tika().toString(), html);
60 assertContains("href=\"http", html);
61
62 // Check our details were found
63 assertContains("GET", html);
64 assertContains(WELCOME_PATH, html);
65 assertContains("text/plain", html);
66 assertContains("text/html", html);
67
68 // Check that the Tika Version details come through too
69 assertContains(VERSION_PATH, html);
70 }
71
72 @Test
73 public void testGetTextWelcome() throws Exception {
74 Response response = WebClient
75 .create(endPoint + WELCOME_PATH)
76 .type("text/plain")
77 .accept("text/plain")
78 .get();
79
80 String text = getStringFromInputStream((InputStream) response.getEntity());
81 assertContains(new Tika().toString(), text);
82
83 // Check our details were found
84 assertContains("GET " + WELCOME_PATH, text);
85 assertContains("=> text/plain", text);
86 assertContains("=> text/html", text);
87
88 // Check that the Tika Version details come through too
89 assertContains("GET " + VERSION_PATH, text);
90 }
91 }
0 /*
1 * Licensed to the Apache Software Foundation (ASF) under one or more
2 * contributor license agreements. See the NOTICE file distributed with
3 * this work for additional information regarding copyright ownership.
4 * The ASF licenses this file to You under the Apache License, Version 2.0
5 * (the "License"); you may not use this file except in compliance with
6 * the License. You may obtain a copy of the License at
7 *
8 * http://www.apache.org/licenses/LICENSE-2.0
9 *
10 * Unless required by applicable law or agreed to in writing, software
11 * distributed under the License is distributed on an "AS IS" BASIS,
12 * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
13 * See the License for the specific language governing permissions and
14 * limitations under the License.
15 */
16
17 package org.apache.tika.server;
18
19 import javax.ws.rs.core.Response;
20
21 import java.io.InputStream;
22 import java.util.ArrayList;
23 import java.util.List;
24
25 import org.apache.cxf.jaxrs.JAXRSServerFactoryBean;
26 import org.apache.cxf.jaxrs.client.WebClient;
27 import org.apache.cxf.jaxrs.lifecycle.ResourceProvider;
28 import org.apache.cxf.jaxrs.lifecycle.SingletonResourceProvider;
29 import org.apache.tika.Tika;
30 import org.apache.tika.server.resource.DetectorResource;
31 import org.apache.tika.server.resource.MetadataResource;
32 import org.apache.tika.server.resource.TikaVersion;
33 import org.apache.tika.server.resource.TikaWelcome;
34 import org.junit.Test;
35
36 public class TikaWelcomeTest extends CXFTestBase {
37 protected static final String WELCOME_PATH = "/";
38 private static final String VERSION_PATH = TikaVersionTest.VERSION_PATH;
39 protected static final String PATH_RESOURCE = "/detect/stream"; // TIKA-1567
40 protected static final String PATH_RESOURCE_2 = "/meta/form"; //TIKA-1567
41
42 @Override
43 protected void setUpResources(JAXRSServerFactoryBean sf) {
44 List<ResourceProvider> rpsCore =
45 new ArrayList<ResourceProvider>();
46 rpsCore.add(new SingletonResourceProvider(new TikaVersion(tika)));
47 rpsCore.add(new SingletonResourceProvider(new DetectorResource(tika)));
48 rpsCore.add(new SingletonResourceProvider(new MetadataResource(tika)));
49 List<ResourceProvider> all = new ArrayList<ResourceProvider>(rpsCore);
50 all.add(new SingletonResourceProvider(new TikaWelcome(tika, rpsCore)));
51 sf.setResourceProviders(all);
52 }
53
54 @Override
55 protected void setUpProviders(JAXRSServerFactoryBean sf) {
56 }
57
58 @Test
59 public void testGetHTMLWelcome() throws Exception {
60 String html = WebClient
61 .create(endPoint + WELCOME_PATH)
62 .type("text/html")
63 .accept("text/html")
64 .get(String.class);
65
66
67 assertContains(new Tika().toString(), html);
68 assertContains("href=\"http", html);
69
70 // Check our details were found
71 assertContains("GET", html);
72 assertContains(WELCOME_PATH, html);
73 assertContains("text/plain", html);
74
75 // Check that the Tika Version details come through too
76 assertContains(VERSION_PATH, html);
77 }
78
79 @Test
80 public void testGetTextWelcome() throws Exception {
81 Response response = WebClient
82 .create(endPoint + WELCOME_PATH)
83 .type("text/plain")
84 .accept("text/plain")
85 .get();
86
87 String text = getStringFromInputStream((InputStream) response.getEntity());
88 assertContains(new Tika().toString(), text);
89
90 // Check our details were found
91 assertContains("GET " + WELCOME_PATH, text);
92 assertContains("=> text/plain", text);
93
94 // Check that the Tika Version details come through too
95 assertContains("GET " + VERSION_PATH, text);
96 }
97
98
99 @Test
100 public void testProperPathWelcome() throws Exception{
101 Response response = WebClient
102 .create(endPoint + WELCOME_PATH)
103 .type("text/html")
104 .accept("text/html")
105 .get();
106
107 String html = getStringFromInputStream((InputStream) response.getEntity());
108 assertContains(PATH_RESOURCE, html);
109 assertContains(PATH_RESOURCE_2, html);
110 }
111 }
2121 import static org.junit.Assert.assertNotNull;
2222 import static org.junit.Assert.assertTrue;
2323
24 import javax.ws.rs.core.Response;
25
2426 import java.io.InputStream;
2527 import java.util.ArrayList;
2628 import java.util.List;
2729 import java.util.Map;
2830
29 import javax.ws.rs.core.Response;
30
3131 import org.apache.commons.compress.archivers.tar.TarArchiveInputStream;
3232 import org.apache.cxf.jaxrs.JAXRSServerFactoryBean;
3333 import org.apache.cxf.jaxrs.client.WebClient;
3434 import org.apache.cxf.jaxrs.lifecycle.SingletonResourceProvider;
35 import org.apache.tika.server.writer.TarWriter;
36 import org.apache.tika.server.resource.UnpackerResource;
37 import org.apache.tika.server.writer.ZipWriter;
3538 import org.junit.Test;
3639
3740 public class UnpackerResourceTest extends CXFTestBase {
38 private static final String BASE_PATH = "/unpack";
39 private static final String UNPACKER_PATH = BASE_PATH + "";
40 private static final String ALL_PATH = BASE_PATH + "/all";
41 private static final String BASE_PATH = "/unpack";
42 private static final String UNPACKER_PATH = BASE_PATH + "";
43 private static final String ALL_PATH = BASE_PATH + "/all";
4144
42 private static final String TEST_DOC_WAV = "Doc1_ole.doc";
43 private static final String WAV1_MD5 = "bdd0a78a54968e362445364f95d8dc96";
44 private static final String WAV1_NAME = "_1310388059/MSj00974840000[1].wav";
45 private static final String WAV2_MD5 = "3bbd42fb1ac0e46a95350285f16d9596";
46 private static final String WAV2_NAME = "_1310388058/MSj00748450000[1].wav";
47 private static final String JPG_NAME = "image1.jpg";
48 private static final String XSL_IMAGE1_MD5 = "68ead8f4995a3555f48a2f738b2b0c3d";
49 private static final String JPG_MD5 = XSL_IMAGE1_MD5;
50 private static final String JPG2_NAME = "image2.jpg";
51 private static final String JPG2_MD5 = "b27a41d12c646d7fc4f3826cf8183c68";
52 private static final String TEST_DOCX_IMAGE = "2pic.docx";
53 private static final String DOCX_IMAGE1_MD5 = "5516590467b069fa59397432677bad4d";
54 private static final String DOCX_IMAGE2_MD5 = "a5dd81567427070ce0a2ff3e3ef13a4c";
55 private static final String DOCX_IMAGE1_NAME = "image1.jpeg";
56 private static final String DOCX_IMAGE2_NAME = "image2.jpeg";
57 private static final String DOCX_EXE1_MD5 = "d71ffa0623014df725f8fd2710de4411";
58 private static final String DOCX_EXE1_NAME = "GMapTool.exe";
59 private static final String DOCX_EXE2_MD5 = "2485435c7c22d35f2de9b4c98c0c2e1a";
60 private static final String DOCX_EXE2_NAME = "Setup.exe";
61 private static final String XSL_IMAGE2_MD5 = "8969288f4245120e7c3870287cce0ff3";
62 private static final String APPLICATION_MSWORD = "application/msword";
63 private static final String APPLICATION_XML = "application/xml";
64 private static final String CONTENT_TYPE = "Content-type";
45 private static final String TEST_DOC_WAV = "Doc1_ole.doc";
46 private static final String WAV1_MD5 = "bdd0a78a54968e362445364f95d8dc96";
47 private static final String WAV1_NAME = "_1310388059/MSj00974840000[1].wav";
48 private static final String WAV2_MD5 = "3bbd42fb1ac0e46a95350285f16d9596";
49 private static final String WAV2_NAME = "_1310388058/MSj00748450000[1].wav";
50 private static final String JPG_NAME = "image1.jpg";
51 private static final String XSL_IMAGE1_MD5 = "68ead8f4995a3555f48a2f738b2b0c3d";
52 private static final String JPG_MD5 = XSL_IMAGE1_MD5;
53 private static final String JPG2_NAME = "image2.jpg";
54 private static final String JPG2_MD5 = "b27a41d12c646d7fc4f3826cf8183c68";
55 private static final String TEST_DOCX_IMAGE = "2pic.docx";
56 private static final String DOCX_IMAGE1_MD5 = "5516590467b069fa59397432677bad4d";
57 private static final String DOCX_IMAGE2_MD5 = "a5dd81567427070ce0a2ff3e3ef13a4c";
58 private static final String DOCX_IMAGE1_NAME = "image1.jpeg";
59 private static final String DOCX_IMAGE2_NAME = "image2.jpeg";
60 private static final String DOCX_EXE1_MD5 = "d71ffa0623014df725f8fd2710de4411";
61 private static final String DOCX_EXE1_NAME = "GMapTool.exe";
62 private static final String DOCX_EXE2_MD5 = "2485435c7c22d35f2de9b4c98c0c2e1a";
63 private static final String DOCX_EXE2_NAME = "Setup.exe";
64 private static final String XSL_IMAGE2_MD5 = "8969288f4245120e7c3870287cce0ff3";
65 private static final String APPLICATION_MSWORD = "application/msword";
66 private static final String APPLICATION_XML = "application/xml";
67 private static final String CONTENT_TYPE = "Content-type";
6568
6669 @Override
6770 protected void setUpResources(JAXRSServerFactoryBean sf) {
6871 sf.setResourceClasses(UnpackerResource.class);
6972 sf.setResourceProvider(UnpackerResource.class,
70 new SingletonResourceProvider(new UnpackerResource(tika)));
73 new SingletonResourceProvider(new UnpackerResource(tika)));
7174 }
7275
7376 @Override
7578 List<Object> providers = new ArrayList<Object>();
7679 providers.add(new TarWriter());
7780 providers.add(new ZipWriter());
78 providers.add(new TikaExceptionMapper());
81 providers.add(new TikaServerParseExceptionMapper(false));
7982 sf.setProviders(providers);
8083 }
8184
82 @Test
83 public void testDocWAV() throws Exception {
84 Response response = WebClient.create(endPoint + UNPACKER_PATH)
85 .type(APPLICATION_MSWORD).accept("application/zip")
86 .put(ClassLoader.getSystemResourceAsStream(TEST_DOC_WAV));
85 @Test
86 public void testDocWAV() throws Exception {
87 Response response = WebClient.create(endPoint + UNPACKER_PATH)
88 .type(APPLICATION_MSWORD).accept("application/zip")
89 .put(ClassLoader.getSystemResourceAsStream(TEST_DOC_WAV));
8790
88 Map<String, String> data = readZipArchive((InputStream) response.getEntity());
89 assertEquals(WAV1_MD5, data.get(WAV1_NAME));
90 assertEquals(WAV2_MD5, data.get(WAV2_NAME));
91 assertFalse(data.containsKey(UnpackerResource.TEXT_FILENAME));
92 }
91 Map<String, String> data = readZipArchive((InputStream) response.getEntity());
92 assertEquals(WAV1_MD5, data.get(WAV1_NAME));
93 assertEquals(WAV2_MD5, data.get(WAV2_NAME));
94 assertFalse(data.containsKey(UnpackerResource.TEXT_FILENAME));
95 }
9396
94 @Test
95 public void testDocWAVText() throws Exception {
96 Response response = WebClient.create(endPoint + ALL_PATH)
97 .type(APPLICATION_MSWORD).accept("application/zip")
98 .put(ClassLoader.getSystemResourceAsStream(TEST_DOC_WAV));
97 @Test
98 public void testDocWAVText() throws Exception {
99 Response response = WebClient.create(endPoint + ALL_PATH)
100 .type(APPLICATION_MSWORD).accept("application/zip")
101 .put(ClassLoader.getSystemResourceAsStream(TEST_DOC_WAV));
99102
100 Map<String, String> data = readZipArchive((InputStream) response.getEntity());
101 assertEquals(WAV1_MD5, data.get(WAV1_NAME));
102 assertEquals(WAV2_MD5, data.get(WAV2_NAME));
103 assertTrue(data.containsKey(UnpackerResource.TEXT_FILENAME));
104 }
103 Map<String, String> data = readZipArchive((InputStream) response.getEntity());
104 assertEquals(WAV1_MD5, data.get(WAV1_NAME));
105 assertEquals(WAV2_MD5, data.get(WAV2_NAME));
106 assertTrue(data.containsKey(UnpackerResource.TEXT_FILENAME));
107 }
105108
106 @Test
107 public void testDocPicture() throws Exception {
108 Response response = WebClient.create(endPoint + UNPACKER_PATH)
109 .type(APPLICATION_MSWORD).accept("application/zip")
110 .put(ClassLoader.getSystemResourceAsStream(TEST_DOC_WAV));
109 @Test
110 public void testDocPicture() throws Exception {
111 Response response = WebClient.create(endPoint + UNPACKER_PATH)
112 .type(APPLICATION_MSWORD).accept("application/zip")
113 .put(ClassLoader.getSystemResourceAsStream(TEST_DOC_WAV));
111114
112 Map<String, String> data = readZipArchive((InputStream) response.getEntity());
115 Map<String, String> data = readZipArchive((InputStream) response.getEntity());
113116
114 assertEquals(JPG_MD5, data.get(JPG_NAME));
115 }
117 assertEquals(JPG_MD5, data.get(JPG_NAME));
118 }
116119
117 @Test
118 public void testDocPictureNoOle() throws Exception {
119 Response response = WebClient.create(endPoint + UNPACKER_PATH)
120 .type(APPLICATION_MSWORD).accept("application/zip")
121 .put(ClassLoader.getSystemResourceAsStream("2pic.doc"));
120 @Test
121 public void testDocPictureNoOle() throws Exception {
122 Response response = WebClient.create(endPoint + UNPACKER_PATH)
123 .type(APPLICATION_MSWORD).accept("application/zip")
124 .put(ClassLoader.getSystemResourceAsStream("2pic.doc"));
122125
123 Map<String, String> data = readZipArchive((InputStream) response.getEntity());
124 assertEquals(JPG2_MD5, data.get(JPG2_NAME));
125 }
126 Map<String, String> data = readZipArchive((InputStream) response.getEntity());
127 assertEquals(JPG2_MD5, data.get(JPG2_NAME));
128 }
126129
127 @Test
128 public void testImageDOCX() throws Exception {
129 Response response = WebClient.create(endPoint + UNPACKER_PATH)
130 .accept("application/zip").put(
131 ClassLoader.getSystemResourceAsStream(TEST_DOCX_IMAGE));
130 @Test
131 public void testImageDOCX() throws Exception {
132 Response response = WebClient.create(endPoint + UNPACKER_PATH)
133 .accept("application/zip").put(
134 ClassLoader.getSystemResourceAsStream(TEST_DOCX_IMAGE));
132135
133 Map<String, String> data = readZipArchive((InputStream) response.getEntity());
134 assertEquals(DOCX_IMAGE1_MD5, data.get(DOCX_IMAGE1_NAME));
135 assertEquals(DOCX_IMAGE2_MD5, data.get(DOCX_IMAGE2_NAME));
136 }
136 Map<String, String> data = readZipArchive((InputStream) response.getEntity());
137 assertEquals(DOCX_IMAGE1_MD5, data.get(DOCX_IMAGE1_NAME));
138 assertEquals(DOCX_IMAGE2_MD5, data.get(DOCX_IMAGE2_NAME));
139 }
137140
138 @Test
139 public void test415() throws Exception {
140 Response response = WebClient.create(endPoint + UNPACKER_PATH)
141 .type("xxx/xxx")
142 .accept("*/*")
143 .put(ClassLoader.getSystemResourceAsStream(TEST_DOC_WAV));
141 @Test
142 public void test415() throws Exception {
143 Response response = WebClient.create(endPoint + UNPACKER_PATH)
144 .type("xxx/xxx")
145 .accept("*/*")
146 .put(ClassLoader.getSystemResourceAsStream(TEST_DOC_WAV));
144147
145 assertEquals(415, response.getStatus());
146 }
148 assertEquals(415, response.getStatus());
149 }
147150
148 @Test
149 public void testExeDOCX() throws Exception {
150 String TEST_DOCX_EXE = "2exe.docx";
151 Response response = WebClient.create(endPoint + UNPACKER_PATH)
152 .accept("application/zip")
153 .put(ClassLoader.getSystemResourceAsStream(TEST_DOCX_EXE));
151 @Test
152 public void testExeDOCX() throws Exception {
153 String TEST_DOCX_EXE = "2exe.docx";
154 Response response = WebClient.create(endPoint + UNPACKER_PATH)
155 .accept("application/zip")
156 .put(ClassLoader.getSystemResourceAsStream(TEST_DOCX_EXE));
154157
155 Map<String, String> data = readZipArchive((InputStream) response.getEntity());
158 Map<String, String> data = readZipArchive((InputStream) response.getEntity());
156159
157 assertEquals(DOCX_EXE1_MD5, data.get(DOCX_EXE1_NAME));
158 assertEquals(DOCX_EXE2_MD5, data.get(DOCX_EXE2_NAME));
159 }
160 assertEquals(DOCX_EXE1_MD5, data.get(DOCX_EXE1_NAME));
161 assertEquals(DOCX_EXE2_MD5, data.get(DOCX_EXE2_NAME));
162 }
160163
161 @Test
162 public void testImageXSL() throws Exception {
163 Response response = WebClient.create(endPoint + UNPACKER_PATH)
164 .accept("application/zip")
165 .put(ClassLoader.getSystemResourceAsStream("pic.xls"));
164 @Test
165 public void testImageXSL() throws Exception {
166 Response response = WebClient.create(endPoint + UNPACKER_PATH)
167 .accept("application/zip")
168 .put(ClassLoader.getSystemResourceAsStream("pic.xls"));
166169
167 Map<String, String> data = readZipArchive((InputStream) response.getEntity());
168 assertEquals(XSL_IMAGE1_MD5, data.get("0.jpg"));
169 assertEquals(XSL_IMAGE2_MD5, data.get("1.jpg"));
170 }
170 Map<String, String> data = readZipArchive((InputStream) response.getEntity());
171 assertEquals(XSL_IMAGE1_MD5, data.get("0.jpg"));
172 assertEquals(XSL_IMAGE2_MD5, data.get("1.jpg"));
173 }
171174
172 @Test
173 public void testTarDocPicture() throws Exception {
174 Response response = WebClient.create(endPoint + UNPACKER_PATH)
175 .type(APPLICATION_MSWORD).accept("application/x-tar")
176 .put(ClassLoader.getSystemResourceAsStream(TEST_DOC_WAV));
175 @Test
176 public void testTarDocPicture() throws Exception {
177 Response response = WebClient.create(endPoint + UNPACKER_PATH)
178 .type(APPLICATION_MSWORD).accept("application/x-tar")
179 .put(ClassLoader.getSystemResourceAsStream(TEST_DOC_WAV));
177180
178 Map<String, String> data = readArchiveFromStream(new TarArchiveInputStream((InputStream) response.getEntity()));
181 Map<String, String> data = readArchiveFromStream(new TarArchiveInputStream((InputStream) response.getEntity()));
179182
180 assertEquals(JPG_MD5, data.get(JPG_NAME));
181 }
183 assertEquals(JPG_MD5, data.get(JPG_NAME));
184 }
182185
183 @Test
184 public void testText() throws Exception {
185 Response response = WebClient.create(endPoint + ALL_PATH)
186 .header(CONTENT_TYPE, APPLICATION_XML)
187 .accept("application/zip")
188 .put(ClassLoader.getSystemResourceAsStream("test.doc"));
186 @Test
187 public void testText() throws Exception {
188 Response response = WebClient.create(endPoint + ALL_PATH)
189 .header(CONTENT_TYPE, APPLICATION_XML)
190 .accept("application/zip")
191 .put(ClassLoader.getSystemResourceAsStream("test.doc"));
189192
190 String responseMsg = readArchiveText((InputStream) response.getEntity());
191 assertNotNull(responseMsg);
192 assertTrue(responseMsg.contains("test"));
193 }
193 String responseMsg = readArchiveText((InputStream) response.getEntity());
194 assertNotNull(responseMsg);
195 assertTrue(responseMsg.contains("test"));
196 }
194197
195198 }
2424 <parent>
2525 <groupId>org.apache.tika</groupId>
2626 <artifactId>tika-parent</artifactId>
27 <version>1.6</version>
27 <version>1.8</version>
2828 <relativePath>../tika-parent/pom.xml</relativePath>
2929 </parent>
3030
3737 <dependency>
3838 <groupId>org.apache.tika</groupId>
3939 <artifactId>tika-core</artifactId>
40 <version>1.6</version>
40 <version>${project.version}</version>
4141 </dependency>
4242 <dependency>
4343 <groupId>com.memetix</groupId>
5050 <artifactId>cxf-rt-frontend-jaxrs</artifactId>
5151 <version>2.7.8</version>
5252 </dependency>
53 <dependency>
54 <groupId>com.fasterxml.jackson.jaxrs</groupId>
55 <artifactId>jackson-jaxrs-json-provider</artifactId>
56 <version>2.4.0</version>
57 </dependency>
53 <dependency>
54 <groupId>com.fasterxml.jackson.jaxrs</groupId>
55 <artifactId>jackson-jaxrs-json-provider</artifactId>
56 <version>2.4.0</version>
57 </dependency>
5858
5959 <!-- Test dependencies -->
6060 <dependency>
6161 <groupId>junit</groupId>
6262 <artifactId>junit</artifactId>
63 <scope>test</scope>
64 <version>4.11</version>
6563 </dependency>
6664 </dependencies>
67 <build>
68 <plugins>
69 <plugin>
70 <groupId>org.apache.felix</groupId>
71 <artifactId>maven-bundle-plugin</artifactId>
72 <extensions>true</extensions>
73 <configuration>
74 <instructions>
75 <Bundle-DocURL>${project.url}</Bundle-DocURL>
76 <Bundle-Activator>
77 org.apache.tika.parser.internal.Activator
78 </Bundle-Activator>
79 <Import-Package>
80 org.w3c.dom,
81 org.apache.tika.*,
82 *;resolution:=optional
83 </Import-Package>
84 </instructions>
85 </configuration>
86 </plugin>
87 <plugin>
88 <groupId>org.apache.rat</groupId>
89 <artifactId>apache-rat-plugin</artifactId>
90 <configuration>
91 <excludes>
92 <exclude>src/main/java/org/apache/tika/parser/txt/Charset*.java</exclude>
93 <exclude>src/test/resources/test-documents/**</exclude>
94 </excludes>
95 </configuration>
96 </plugin>
97 <plugin>
98 <groupId>org.apache.maven.plugins</groupId>
99 <artifactId>maven-jar-plugin</artifactId>
100 <executions>
101 <execution>
102 <goals>
103 <goal>test-jar</goal>
104 </goals>
105 </execution>
106 </executions>
107 </plugin>
108 </plugins>
65 <build>
66 <plugins>
67 <plugin>
68 <groupId>org.apache.felix</groupId>
69 <artifactId>maven-bundle-plugin</artifactId>
70 <extensions>true</extensions>
71 <configuration>
72 <instructions>
73 <Bundle-DocURL>${project.url}</Bundle-DocURL>
74 <Bundle-Activator>
75 org.apache.tika.parser.internal.Activator
76 </Bundle-Activator>
77 <Import-Package>
78 org.w3c.dom,
79 org.apache.tika.*,
80 *;resolution:=optional
81 </Import-Package>
82 </instructions>
83 </configuration>
84 </plugin>
85 <plugin>
86 <groupId>org.apache.rat</groupId>
87 <artifactId>apache-rat-plugin</artifactId>
88 <configuration>
89 <excludes>
90 <exclude>src/main/java/org/apache/tika/parser/txt/Charset*.java</exclude>
91 <exclude>src/test/resources/test-documents/**</exclude>
92 </excludes>
93 </configuration>
94 </plugin>
95 <plugin>
96 <groupId>org.apache.maven.plugins</groupId>
97 <artifactId>maven-jar-plugin</artifactId>
98 <executions>
99 <execution>
100 <goals>
101 <goal>test-jar</goal>
102 </goals>
103 </execution>
104 </executions>
105 </plugin>
106 </plugins>
109107
110 <pluginManagement>
111 <plugins>
112 <!-- This plugin's configuration is used to store Eclipse m2e -->
113 <!-- settings only. It has no influence on the Maven build itself. -->
114 <plugin>
115 <groupId>org.eclipse.m2e</groupId>
116 <artifactId>lifecycle-mapping</artifactId>
117 <version>1.0.0</version>
118 <configuration>
119 <lifecycleMappingMetadata>
120 <pluginExecutions>
121 <pluginExecution>
122 <pluginExecutionFilter>
123 <groupId>org.apache.felix</groupId>
124 <artifactId>maven-scr-plugin</artifactId>
125 <versionRange>[1.7.2,)</versionRange>
126 <goals>
127 <goal>scr</goal>
128 </goals>
129 </pluginExecutionFilter>
130 <action>
131 <execute />
132 </action>
133 </pluginExecution>
134 </pluginExecutions>
135 </lifecycleMappingMetadata>
136 </configuration>
137 </plugin>
138 </plugins>
139 </pluginManagement>
140 </build>
108 <pluginManagement>
109 <plugins>
110 <!-- This plugin's configuration is used to store Eclipse m2e -->
111 <!-- settings only. It has no influence on the Maven build itself. -->
112 <plugin>
113 <groupId>org.eclipse.m2e</groupId>
114 <artifactId>lifecycle-mapping</artifactId>
115 <version>1.0.0</version>
116 <configuration>
117 <lifecycleMappingMetadata>
118 <pluginExecutions>
119 <pluginExecution>
120 <pluginExecutionFilter>
121 <groupId>org.apache.felix</groupId>
122 <artifactId>maven-scr-plugin</artifactId>
123 <versionRange>[1.7.2,)</versionRange>
124 <goals>
125 <goal>scr</goal>
126 </goals>
127 </pluginExecutionFilter>
128 <action>
129 <execute />
130 </action>
131 </pluginExecution>
132 </pluginExecutions>
133 </lifecycleMappingMetadata>
134 </configuration>
135 </plugin>
136 </plugins>
137 </pluginManagement>
138 </build>
141139
142 <description>This is the translate Apache Tika™ toolkit. Translator implementations may depend on web services. </description>
140   <description>This is the translation module of the Apache Tika™ toolkit. Translator implementations may depend on web services.
141   </description>
143142 <organization>
144 <name>The Apache Software Foundation</name>
145 <url>http://www.apache.org</url>
143 <name>The Apache Software Foundation</name>
144 <url>http://www.apache.org</url>
146145 </organization>
147146 <scm>
148 <url>http://svn.apache.org/viewvc/tika/tags/1.6/translate</url>
149 <connection>scm:svn:http://svn.apache.org/repos/asf/tika/tags/1.6/translate</connection>
150 <developerConnection>scm:svn:https://svn.apache.org/repos/asf/tika/tags/1.6/translate</developerConnection>
147 <url>http://svn.apache.org/viewvc/tika/tags/1.8-rc2/tika-translate</url>
148 <connection>scm:svn:http://svn.apache.org/repos/asf/tika/tags/1.8-rc2/tika-translate</connection>
149 <developerConnection>scm:svn:https://svn.apache.org/repos/asf/tika/tags/1.8-rc2/tika-translate</developerConnection>
151150 </scm>
152151 <issueManagement>
153 <system>JIRA</system>
154 <url>https://issues.apache.org/jira/browse/TIKA</url>
152 <system>JIRA</system>
153 <url>https://issues.apache.org/jira/browse/TIKA</url>
155154 </issueManagement>
156155 <ciManagement>
157 <system>Jenkins</system>
158 <url>https://builds.apache.org/job/Tika-trunk/</url>
156 <system>Jenkins</system>
157 <url>https://builds.apache.org/job/Tika-trunk/</url>
159158 </ciManagement>
160159 </project>
1717 package org.apache.tika.language.translate;
1818
1919 import com.fasterxml.jackson.databind.util.LRUMap;
20 import org.apache.tika.exception.TikaException;
2021 import org.apache.tika.language.LanguageIdentifier;
2122 import org.apache.tika.language.LanguageProfile;
2223
24 import java.io.IOException;
2325 import java.util.HashMap;
2426
2527 /**
4648 }
4749
4850 @Override
49 public String translate(String text, String sourceLanguage, String targetLanguage) throws Exception {
51 public String translate(String text, String sourceLanguage, String targetLanguage) throws TikaException, IOException {
5052 HashMap<String, String> translationCache = getTranslationCache(sourceLanguage, targetLanguage);
5153 String translatedText = translationCache.get(text);
5254 if (translatedText == null) {
5759 }
5860
5961 @Override
60 public String translate(String text, String targetLanguage) throws Exception {
62 public String translate(String text, String targetLanguage) throws TikaException, IOException {
6163 LanguageIdentifier language = new LanguageIdentifier(
6264 new LanguageProfile(text));
6365 String sourceLanguage = language.getLanguage();
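The CachedTranslator hunk above keeps one LRU map per (source, target) language pair and consults it before delegating to the wrapped translator. A minimal, self-contained sketch of that caching pattern, using an access-ordered `java.util.LinkedHashMap` in place of Jackson's `LRUMap` (class name, cache size, and key format here are illustrative, not the upstream values):

```java
import java.util.HashMap;
import java.util.LinkedHashMap;
import java.util.Map;

public class TranslationCacheDemo {
    static final int MAX_ENTRIES = 1000;
    // One cache per language pair, keyed "source:target".
    static final Map<String, Map<String, String>> caches = new HashMap<>();

    static Map<String, String> cacheFor(String source, String target) {
        return caches.computeIfAbsent(source + ":" + target,
                k -> new LinkedHashMap<String, String>(16, 0.75f, true) {
                    @Override
                    protected boolean removeEldestEntry(Map.Entry<String, String> eldest) {
                        // Evict the least-recently-accessed entry once full.
                        return size() > MAX_ENTRIES;
                    }
                });
    }

    public static void main(String[] args) {
        cacheFor("en", "fr").put("hello", "salut");
        // Same language pair returns the same cache instance.
        System.out.println(cacheFor("en", "fr").get("hello")); // salut
    }
}
```

The access-order constructor flag (`true`) is what turns `LinkedHashMap` into an LRU structure; insertion order alone would evict the oldest insert rather than the least recently used translation.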
0 /**
1 * Licensed to the Apache Software Foundation (ASF) under one or more
2 * contributor license agreements. See the NOTICE file distributed with
3 * this work for additional information regarding copyright ownership.
4 * The ASF licenses this file to You under the Apache License, Version 2.0
5 * (the "License"); you may not use this file except in compliance with
6 * the License. You may obtain a copy of the License at
7 *
8 * http://www.apache.org/licenses/LICENSE-2.0
9 *
10 * Unless required by applicable law or agreed to in writing, software
11 * distributed under the License is distributed on an "AS IS" BASIS,
12 * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
13 * See the License for the specific language governing permissions and
14 * limitations under the License.
15 */
16
17 package org.apache.tika.language.translate;
18
19 import org.apache.tika.exception.TikaException;
20 import org.apache.tika.language.LanguageIdentifier;
21 import org.apache.tika.language.LanguageProfile;
22
23 import java.io.BufferedReader;
24 import java.io.File;
25 import java.io.IOException;
26 import java.io.InputStreamReader;
27 import java.io.Reader;
28 import java.nio.charset.Charset;
29 import java.util.Locale;
30
31 /**
32 * Abstract class used to interact with command line/external Translators.
33 *
34 * @see org.apache.tika.language.translate.MosesTranslator for an example of extending this class.
35 *
36 * @since Tika 1.7
37 */
38 public abstract class ExternalTranslator implements Translator {
39
40 /**
41 * Run the given command and return the output written to standard out.
42 *
43 * @param command The complete command to run.
44 * @param env The environment to pass along to the Runtime.
45 * @param workingDirectory The directory from which to run the command.
46 * @return The output of the command written to standard out.
47      * @throws IOException if the process cannot be started or its output read.
48      * @throws InterruptedException if this thread is interrupted while waiting for the process.
49 */
50 public Reader runAndGetOutput(String command, String[] env, File workingDirectory) throws IOException, InterruptedException {
51 Process process = Runtime.getRuntime().exec(command, env, workingDirectory);
52 InputStreamReader reader = new InputStreamReader(process.getInputStream(), Charset.defaultCharset());
53 BufferedReader bufferedReader = new BufferedReader(reader);
54 process.waitFor();
55 return bufferedReader;
56 }
57
58 /**
59 * Checks to see if the command can be run. Typically used with
60 * something like "myapp --version" to check to see if "myapp"
61 * is installed and on the path.
62 *
63 * @param checkCommandString The command to run and check the return code of.
64 * @param successCodes Return codes that signify success.
65 */
66 public boolean checkCommand(String checkCommandString, int... successCodes) {
67 try {
68 Process process = Runtime.getRuntime().exec(checkCommandString);
69             int result = process.waitFor();
71 for (int code : successCodes) {
72 if (code == result) return true;
73 }
74 return false;
75 } catch(IOException e) {
76             // Some problem; the command is not there or is broken
77             System.err.println("Broken pipe");
78 return false;
79 } catch (InterruptedException ie) {
80             // Some problem; the command is not there or is broken
81             System.err.println("Interrupted");
82 return false;
83 }
84 }
85
86 /**
87 * Default translate method which uses built Tika language identification.
88 * @param text The text to translate.
89 * @param targetLanguage The desired language to translate to (for example, "hi").
90 * @return The translated text.
91      * @throws TikaException if an error occurs during translation.
92 */
93 @Override
94 public String translate(String text, String targetLanguage) throws TikaException, IOException {
95 LanguageIdentifier language = new LanguageIdentifier(
96 new LanguageProfile(text));
97 String sourceLanguage = language.getLanguage();
98 return translate(text, sourceLanguage, targetLanguage);
99 }
100 }
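The `checkCommand` probe above runs a cheap command (typically something like `myapp --version`) and compares its exit code against a set of accepted codes. A self-contained sketch of the same pattern (class name is hypothetical; the demo assumes a POSIX `true`/`false` binary on the path):

```java
import java.io.IOException;

public class CheckCommandDemo {
    // Run the command, wait for it, and accept any of the given exit codes.
    static boolean checkCommand(String command, int... successCodes) {
        try {
            Process process = Runtime.getRuntime().exec(command);
            int result = process.waitFor();
            for (int code : successCodes) {
                if (code == result) return true;
            }
            return false;
        } catch (IOException | InterruptedException e) {
            // Command missing, not executable, or we were interrupted.
            return false;
        }
    }

    public static void main(String[] args) {
        // "true" exits with status 0 on POSIX systems.
        System.out.println(checkCommand("true", 0));
    }
}
```

A nonexistent binary surfaces as an `IOException` from `exec`, which is why the probe treats that case the same as a failing exit code.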
1616
1717 package org.apache.tika.language.translate;
1818
19 import javax.ws.rs.core.MediaType;
20 import javax.ws.rs.core.Response;
21
1922 import java.io.BufferedReader;
23 import java.io.IOException;
2024 import java.io.InputStream;
2125 import java.io.InputStreamReader;
2226 import java.util.Properties;
2327 import java.util.logging.Logger;
2428
25 import javax.ws.rs.core.MediaType;
26 import javax.ws.rs.core.Response;
27
29 import com.fasterxml.jackson.databind.JsonNode;
30 import com.fasterxml.jackson.databind.ObjectMapper;
2831 import org.apache.cxf.jaxrs.client.WebClient;
32 import org.apache.tika.exception.TikaException;
33 import org.apache.tika.io.IOUtils;
2934 import org.apache.tika.language.LanguageIdentifier;
3035 import org.apache.tika.language.LanguageProfile;
31
32 import com.fasterxml.jackson.databind.JsonNode;
33 import com.fasterxml.jackson.databind.ObjectMapper;
3436
3537 /**
3638 * An implementation of a REST client to the <a
7779
7880 @Override
7981 public String translate(String text, String sourceLanguage,
80 String targetLanguage) throws Exception {
82 String targetLanguage) throws TikaException, IOException {
8183 if (!this.isAvailable)
8284 return text;
8385 Response response = client.accept(MediaType.APPLICATION_JSON)
8486 .query("key", apiKey).query("source", sourceLanguage)
8587 .query("target", targetLanguage).query("q", text).get();
8688 BufferedReader reader = new BufferedReader(new InputStreamReader(
87 (InputStream) response.getEntity()));
89 (InputStream) response.getEntity(), IOUtils.UTF_8));
8890 String line = null;
8991 StringBuffer responseText = new StringBuffer();
9092 while ((line = reader.readLine()) != null) {
98100
99101 @Override
100102 public String translate(String text, String targetLanguage)
101 throws Exception {
103 throws TikaException, IOException {
102104 if (!this.isAvailable)
103105 return text;
104106 LanguageIdentifier language = new LanguageIdentifier(
1616
1717 package org.apache.tika.language.translate;
1818
19 import javax.ws.rs.core.MediaType;
20 import javax.ws.rs.core.Response;
21
22 import java.io.BufferedReader;
23 import java.io.IOException;
24 import java.io.InputStream;
25 import java.io.InputStreamReader;
26 import java.util.Properties;
27
1928 import com.fasterxml.jackson.databind.JsonNode;
2029 import com.fasterxml.jackson.databind.ObjectMapper;
2130 import org.apache.cxf.jaxrs.client.WebClient;
2231 import org.apache.tika.exception.TikaException;
32 import org.apache.tika.io.IOUtils;
2333 import org.apache.tika.language.LanguageIdentifier;
2434 import org.apache.tika.language.LanguageProfile;
25
26 import javax.ws.rs.core.MediaType;
27 import javax.ws.rs.core.Response;
28 import java.io.BufferedReader;
29 import java.io.InputStream;
30 import java.io.InputStreamReader;
31 import java.util.Properties;
3235
3336 /**
3437 * An implementation of a REST client for the
6871
6972 @Override
7073 public String translate(String text, String sourceLanguage,
71 String targetLanguage) throws Exception {
74 String targetLanguage) throws TikaException, IOException {
7275 if (!this.isAvailable)
7376 return text;
7477 Response response = client.accept(MediaType.APPLICATION_JSON)
7578 .query("user_key", userKey).query("source", sourceLanguage)
7679 .query("target", targetLanguage).query("q", text).get();
7780 BufferedReader reader = new BufferedReader(new InputStreamReader(
78 (InputStream) response.getEntity()));
81 (InputStream) response.getEntity(), IOUtils.UTF_8));
7982 String line = null;
8083 StringBuffer responseText = new StringBuffer();
8184 while ((line = reader.readLine()) != null) {
9396
9497 @Override
9598 public String translate(String text, String targetLanguage)
96 throws Exception {
99 throws TikaException, IOException {
97100 if (!this.isAvailable)
98101 return text;
99102 LanguageIdentifier language = new LanguageIdentifier(
1818
1919 import com.memetix.mst.language.Language;
2020 import com.memetix.mst.translate.Translate;
21 import org.apache.tika.exception.TikaException;
2122
2223 import java.io.IOException;
2324 import java.io.InputStream;
5556 props.load(stream);
5657 clientId = props.getProperty(ID_PROPERTY);
5758 clientSecret = props.getProperty(SECRET_PROPERTY);
58 if (!clientId.equals(DEFAULT_ID) && !clientSecret.equals(DEFAULT_SECRET)) available = true;
59 this.available = checkAvailable();
5960 }
6061 } catch (IOException e) {
6162 e.printStackTrace();
7677 * @see org.apache.tika.language.translate.Translator
7778 * @since Tika 1.6
7879 */
79 public String translate(String text, String sourceLanguage, String targetLanguage) throws Exception {
80 public String translate(String text, String sourceLanguage, String targetLanguage) throws TikaException, IOException {
8081 if (!available) return text;
8182 Language source = Language.fromString(sourceLanguage);
8283 Language target = Language.fromString(targetLanguage);
8384 Translate.setClientId(clientId);
8485 Translate.setClientSecret(clientSecret);
85 return Translate.execute(text, source, target);
86 try {
87 return Translate.execute(text, source, target);
88 } catch (Exception e) {
89 throw new TikaException("Error with Microsoft Translation: " + e.getMessage());
90 }
8691 }
8792
8893 /**
95100 * @see org.apache.tika.language.translate.Translator
96101 * @since Tika 1.6
97102 */
98 public String translate(String text, String targetLanguage) throws Exception {
103 public String translate(String text, String targetLanguage) throws TikaException, IOException {
99104 if (!available) return text;
100105 Language target = Language.fromString(targetLanguage);
101106 Translate.setClientId(clientId);
102107 Translate.setClientSecret(clientSecret);
103 return Translate.execute(text, target);
108 try {
109 return Translate.execute(text, target);
110 } catch (Exception e) {
111 throw new TikaException("Error with Microsoft Translation: " + e.getMessage());
112 }
104113 }
105114
106115 /**
118127 */
119128 public void setId(String id){
120129 this.clientId = id;
121 if (!clientId.equals(DEFAULT_ID) && !clientSecret.equals(DEFAULT_SECRET)) available = true;
130 this.available = checkAvailable();
122131 }
123132
124133 /**
127136 */
128137 public void setSecret(String secret){
129138 this.clientSecret = secret;
130 if (!clientId.equals(DEFAULT_ID) && !clientSecret.equals(DEFAULT_SECRET)) available = true;
139 this.available = checkAvailable();
140 }
141
142 private boolean checkAvailable(){
143 return clientId != null &&
144 !clientId.equals(DEFAULT_ID) &&
145 clientSecret != null &&
146 !clientSecret.equals(DEFAULT_SECRET);
131147 }
132148 }
0 /**
1 * Licensed to the Apache Software Foundation (ASF) under one or more
2 * contributor license agreements. See the NOTICE file distributed with
3 * this work for additional information regarding copyright ownership.
4 * The ASF licenses this file to You under the Apache License, Version 2.0
5 * (the "License"); you may not use this file except in compliance with
6 * the License. You may obtain a copy of the License at
7 *
8 * http://www.apache.org/licenses/LICENSE-2.0
9 *
10 * Unless required by applicable law or agreed to in writing, software
11 * distributed under the License is distributed on an "AS IS" BASIS,
12 * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
13 * See the License for the specific language governing permissions and
14 * limitations under the License.
15 */
16
17 package org.apache.tika.language.translate;
18
19 import org.apache.tika.exception.TikaException;
20
21 import java.io.BufferedReader;
22 import java.io.BufferedWriter;
23 import java.io.File;
24 import java.io.FileInputStream;
25 import java.io.FileOutputStream;
26 import java.io.FileReader;
27 import java.io.FileWriter;
28 import java.io.IOException;
29 import java.io.InputStreamReader;
30 import java.io.OutputStreamWriter;
31 import java.nio.charset.Charset;
32 import java.util.Properties;
33
34 /**
35 * Translator that uses the Moses decoder for translation.
36  * Users must install the Moses system before using this Translator. See http://www.statmt.org/moses/.
37 */
38 public class MosesTranslator extends ExternalTranslator {
39
40 private static final String DEFAULT_PATH = "dummy-path";
41 private static final String TMP_FILE_NAME = "tika.moses.translation.tmp";
42
43 private String smtPath = DEFAULT_PATH;
44 private String scriptPath = DEFAULT_PATH;
45
46 /**
47 * Default constructor that attempts to read the smt jar and script paths from the
48 * translator.moses.properties file.
49 *
50 * @throws java.lang.AssertionError When the properties file is unreadable.
51 */
52 public MosesTranslator() {
53 Properties config = new Properties();
54 try {
55 config.load(MosesTranslator.class
56 .getClassLoader()
57 .getResourceAsStream("org/apache/tika/language/translate/translator.moses.properties"));
58             this.smtPath = config.getProperty("translator.smt_path");
59             this.scriptPath = config.getProperty("translator.script_path");
61 } catch (IOException e) {
62 throw new AssertionError("Failed to read translator.moses.properties.");
63 }
64 }
65
66 /**
67 * Create a Moses Translator with the specified smt jar and script paths.
68 *
69 * @param smtPath Full path to the jar to run.
70 * @param scriptPath Full path to the script to pass to the smt jar.
71 */
72 public MosesTranslator(String smtPath, String scriptPath) {
73 this.smtPath = smtPath;
74 this.scriptPath = scriptPath;
75 System.out.println(buildCommand(smtPath, scriptPath));
76 }
77
78 @Override
79 public String translate(String text, String sourceLanguage, String targetLanguage) throws TikaException, IOException {
80 if (!isAvailable() || !checkCommand(buildCheckCommand(smtPath), 1)) return text;
81 File tmpFile = new File(TMP_FILE_NAME);
82 OutputStreamWriter out = new OutputStreamWriter(new FileOutputStream(tmpFile), Charset.defaultCharset());
83 out.append(text).append('\n').close();
84
85 Runtime.getRuntime().exec(buildCommand(smtPath, scriptPath), new String[]{}, buildWorkingDirectory(scriptPath));
86
87 File tmpTranslatedFile = new File(TMP_FILE_NAME + ".translated");
88
89 StringBuilder stringBuilder = new StringBuilder();
90 BufferedReader reader = new BufferedReader(new InputStreamReader(
91 new FileInputStream(tmpTranslatedFile),
92 Charset.defaultCharset()
93 ));
94 String line;
95 while ((line = reader.readLine()) != null) stringBuilder.append(line);
96
97 if (!tmpFile.delete() || !tmpTranslatedFile.delete()){
98 throw new IOException("Failed to delete temporary files.");
99 }
100 return stringBuilder.toString();
101 }
102
103 @Override
104 public boolean isAvailable() {
105 return !smtPath.equals(DEFAULT_PATH) && !scriptPath.equals(DEFAULT_PATH);
106 }
107
108 /**
109 * Build the command String to be executed.
110 * @param smtPath Full path to the jar to run.
111 * @param scriptPath Full path to the script to pass to the smt jar.
112 * @return String to run on the command line.
113 */
114 private String buildCommand(String smtPath, String scriptPath) {
115 return "java -jar " + smtPath +
116 " -c NONE " +
117 scriptPath + " " +
118 System.getProperty("user.dir") + "/" + TMP_FILE_NAME;
119 }
120
121 /**
122 * Build the command String to check if we can execute the smt jar.
123 * @param smtPath Full path to the jar to run.
124 * @return String to run on the command line.
125 */
126 private String buildCheckCommand(String smtPath) {
127 return "java -jar " + smtPath;
128 }
129
130 /**
131 * Build the File that represents the desired working directory. In this case,
132 * the directory the script is in.
133 * @param scriptPath Full path to the script passed to the smt jar.
134 * @return File of the directory with the script in it.
135 */
136 private File buildWorkingDirectory(String scriptPath) {
137 return new File(scriptPath.substring(0, scriptPath.lastIndexOf("/") + 1));
138 }
139
140 }
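For reference, the command string that `buildCommand` assembles has the shape `java -jar <smtPath> -c NONE <scriptPath> <cwd>/<tmp file>`. A standalone sketch of just that string construction (the jar and script paths below are hypothetical):

```java
public class BuildCommandDemo {
    static final String TMP_FILE_NAME = "tika.moses.translation.tmp";

    // Mirrors MosesTranslator.buildCommand: smt jar, fixed -c NONE flag,
    // the script, and the temp file in the current working directory.
    static String buildCommand(String smtPath, String scriptPath) {
        return "java -jar " + smtPath +
                " -c NONE " +
                scriptPath + " " +
                System.getProperty("user.dir") + "/" + TMP_FILE_NAME;
    }

    public static void main(String[] args) {
        System.out.println(buildCommand("/opt/moses/smt.jar", "/opt/moses/detok.perl"));
    }
}
```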
0 # Licensed to the Apache Software Foundation (ASF) under one or more
1 # contributor license agreements. See the NOTICE file distributed with
2 # this work for additional information regarding copyright ownership.
3 # The ASF licenses this file to You under the Apache License, Version 2.0
4 # (the "License"); you may not use this file except in compliance with
5 # the License. You may obtain a copy of the License at
6 #
7 # http://www.apache.org/licenses/LICENSE-2.0
8 #
9 # Unless required by applicable law or agreed to in writing, software
10 # distributed under the License is distributed on an "AS IS" BASIS,
11 # WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
12 # See the License for the specific language governing permissions and
13 # limitations under the License.
19
20 # smt_path is the full path to the Moses jar to run.
21 # script_path is the full path to the script to pass to the smt jar.
22
23 translator.smt_path=dummy-path
24 translator.script_path=dummy-path
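The `dummy-path` values above act as sentinels: `MosesTranslator.isAvailable()` only returns true once both paths have been replaced with real locations. A small sketch of that gate (class and method names are illustrative):

```java
import java.io.StringReader;
import java.util.Properties;

public class MosesConfigDemo {
    static final String DEFAULT_PATH = "dummy-path";

    // Available only when both paths have been set to something real.
    static boolean isConfigured(String smtPath, String scriptPath) {
        return smtPath != null && !DEFAULT_PATH.equals(smtPath)
                && scriptPath != null && !DEFAULT_PATH.equals(scriptPath);
    }

    public static void main(String[] args) throws Exception {
        Properties p = new Properties();
        p.load(new StringReader(
                "translator.smt_path=dummy-path\ntranslator.script_path=dummy-path\n"));
        System.out.println(isConfigured(
                p.getProperty("translator.smt_path"),
                p.getProperty("translator.script_path"))); // false: defaults still in place
    }
}
```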
1919 import org.junit.Before;
2020 import org.junit.Test;
2121
22 import static org.junit.Assert.*;
23
22 import static org.junit.Assert.assertEquals;
23 import static org.junit.Assert.assertFalse;
24 import static org.junit.Assert.assertNotNull;
25 import static org.junit.Assert.assertTrue;
2426
2527 /**
2628 * Test harness for the {@link CachedTranslator}. Take care to choose your target language carefully
2020 import org.junit.Test;
2121
2222 import static org.junit.Assert.assertTrue;
23 import static org.junit.Assert.fail;
2324
2425 /**
2526 * Test cases for the {@link MicrosoftTranslator} class.
2627 */
2728 public class MicrosoftTranslatorTest {
28 Tika tika;
29 MicrosoftTranslator translator;
2930 @Before
3031 public void setUp() {
31 tika = new Tika();
32 translator = new MicrosoftTranslator();
3233 }
3334
3435 @Test
3536 public void testSimpleTranslate() throws Exception {
3637 String source = "hello";
3738 String expected = "salut";
38 String translated = tika.translate(source, "en", "fr");
39 System.err.println(tika.getTranslator().isAvailable());
40 if (tika.getTranslator().isAvailable()) assertTrue("Translate " + source + " to " + expected + " (was " + translated + ")",
39 String translated = translator.translate(source, "en", "fr");
40 if (translator.isAvailable()) assertTrue("Translate " + source + " to " + expected + " (was " + translated + ")",
4141 expected.equalsIgnoreCase(translated));
4242 }
4343
4545 public void testSimpleDetectTranslate() throws Exception {
4646 String source = "hello";
4747 String expected = "salut";
48 String translated = tika.translate(source, "fr");
49 System.err.println(tika.getTranslator().isAvailable());
50 if (tika.getTranslator().isAvailable()) assertTrue("Translate " + source + " to " + expected + " (was " + translated + ")",
48 String translated = translator.translate(source, "fr");
49 if (translator.isAvailable()) assertTrue("Translate " + source + " to " + expected + " (was " + translated + ")",
5150 expected.equalsIgnoreCase(translated));
51 }
52
53 @Test
54 public void testSettersAndIsAvailable(){
55 try{
56 translator.setId("foo");
57 translator.setSecret("bar");
58 }
59 catch(Exception e){
60 e.printStackTrace();
61 fail(e.getMessage());
62 }
63 //reset
64 translator = new MicrosoftTranslator();
65 try{
66 translator.setSecret("bar");
67 translator.setId("foo");
68 }
69 catch(Exception e){
70 e.printStackTrace();
71 fail(e.getMessage());
72 }
5273 }
5374
5475 }
0 package org.apache.tika.language.translate;
1
2 import org.junit.Before;
3 import org.junit.Test;
4
5 import static org.junit.Assert.assertTrue;
6
7 public class MosesTranslatorTest {
8 MosesTranslator translator;
9 @Before
10 public void setUp() {
11 translator = new MosesTranslator();
12 }
13
14 @Test
15 public void testSimpleTranslate() throws Exception {
16 String source = "hola";
17 String expected = "hello";
18 String translated = translator.translate(source, "sp", "en");
19 if (translator.isAvailable()) assertTrue("Translate " + source + " to " + expected + " (was " + translated + ")",
20 expected.equalsIgnoreCase(translated));
21 }
22 }
2424 <parent>
2525 <groupId>org.apache.tika</groupId>
2626 <artifactId>tika-parent</artifactId>
27 <version>1.6</version>
27 <version>1.8</version>
2828 <relativePath>../tika-parent/pom.xml</relativePath>
2929 </parent>
3030
8989 <dependency>
9090 <groupId>junit</groupId>
9191 <artifactId>junit</artifactId>
92 <scope>test</scope>
93 <version>4.11</version>
9492 </dependency>
9593 </dependencies>
9694
9795 <url>http://tika.apache.org/</url>
9896 <organization>
99 <name>The Apache Software Foundation</name>
100 <url>http://www.apache.org</url>
97 <name>The Apache Software Foundation</name>
98 <url>http://www.apache.org</url>
10199 </organization>
102100 <scm>
103 <url>http://svn.apache.org/viewvc/tika/tags/1.6/tika-xmp</url>
104 <connection>scm:svn:http://svn.apache.org/repos/asf/tika/tags/1.6/tika-xmp</connection>
105 <developerConnection>scm:svn:https://svn.apache.org/repos/asf/tika/tags/1.6/tika-xmp</developerConnection>
101 <url>http://svn.apache.org/viewvc/tika/tags/1.8-rc2/tika-xmp</url>
102 <connection>scm:svn:http://svn.apache.org/repos/asf/tika/tags/1.8-rc2/tika-xmp</connection>
103 <developerConnection>scm:svn:https://svn.apache.org/repos/asf/tika/tags/1.8-rc2/tika-xmp</developerConnection>
106104 </scm>
107105 <issueManagement>
108 <system>JIRA</system>
109 <url>https://issues.apache.org/jira/browse/TIKA</url>
106 <system>JIRA</system>
107 <url>https://issues.apache.org/jira/browse/TIKA</url>
110108 </issueManagement>
111109 <ciManagement>
112 <system>Jenkins</system>
113 <url>https://builds.apache.org/job/Tika-trunk/</url>
110 <system>Jenkins</system>
111 <url>https://builds.apache.org/job/Tika-trunk/</url>
114112 </ciManagement>
115113 </project>
1515 */
1616 package org.apache.tika.xmp;
1717
18 import static org.junit.Assert.*;
19
2018 import java.util.Date;
2119 import java.util.Properties;
2220
3533 import com.adobe.xmp.XMPMeta;
3634 import com.adobe.xmp.XMPUtils;
3735 import com.adobe.xmp.properties.XMPProperty;
36
37 import static org.junit.Assert.assertEquals;
38 import static org.junit.Assert.assertFalse;
39 import static org.junit.Assert.assertNotNull;
40 import static org.junit.Assert.assertNull;
41 import static org.junit.Assert.assertTrue;
3842
3943 public class XMPMetadataTest {
4044 private Metadata tikaMetadata;