diff --git a/.gitignore b/.gitignore index e03fecf..260bf13 100644 --- a/.gitignore +++ b/.gitignore @@ -1,4 +1,3 @@ -.svn target .idea .classpath @@ -7,5 +6,3 @@ *.iml *.ipr *.iws -nbactions.xml -nb-configuration.xml \ No newline at end of file diff --git a/CHANGES.txt b/CHANGES.txt index e0c77ae..b90c297 100644 --- a/CHANGES.txt +++ b/CHANGES.txt @@ -1,459 +1,3 @@ -Release 1.11 - 10/18/2015 - - * Java7 API support for allowing java.nio.file.Path as method arguments - was added to Tika and to ParsingReader, TikaFileTypeDetector, and to - Tika Config (TIKA-1745, TIKA-1746, TIKA-1751). - - * MIME support was added for WebVTT: The Web Video Text Tracks Format - files (TIKA-1772). - - * MIME magic improved to ensure emails detected as message/rfc822 - (TIKA-1771). - - * Upgrade to Jackcess Encrypt 2.1.1 to avoid binary incompatibility - with Bouncy Castle (TIKA-1736). - - * Make div and other markup more consistent between PPT and - PPTX (TIKA-1755). - - * Parse multiple authors from MSOffice's semi-colon delimited - author field (TIKA-1765). - - * Include CTAKESConfig.properties within tika-parsers resources - by default (TIKA-1741). - - * Prevent infinite recursion when processing inline images - in PDF files by limiting extraction of duplicate images - within the same page (TIKA-1742). - - * Upgrade to POI 3.13-final (via Andreas Beeker) (TIKA-1707). - - * Upgraded tika-batch to use Path throughout (TIKA-1747 and - (TIKA-1754). - - * Upgraded to Path in TikaInputStream (via Yaniv Kunda) (TIKA-1744). - - * Changed default content handler type for "/rmeta" in tika-server - to "xml" to align with "-J" option in tika-app. - Clients can now specify handler types via PathParam. (TIKA-1716). - - * The fantastic GROBID (or Grobid) GeneRation Of BIbliographic Data - for machine learning from PDF files is now integrated as a - Tika parser (TIKA-1699, TIKA-1712). - - * The ability to specify the Tesseract Config Path was added - to the OCR Parser (TIKA-1703). 
- - * Upgraded to ASM 5.0.4 (TIKA-1705). - - * Corrected Tika Config XML detector definition explicit loading - of MimeTypes (TIKA-1708) - - * In Tika Parsers, Batch, Server, App and Examples, use Apache - Commons IO instead of inlined ex-Commons classes, and the Java 7 - Standard Charset definitions (TIKA-1710) - - * Upgraded to Commons Compress 1.10, which enables zlib compressed - archives support (TIKA-1718) - - -Release 1.10 - 8/1/2015 - - * Tika Config XML can now be used to create composite detectors, - and exclude detectors that DefaultDetector would otherwise - have used. This brings support in-line with Parsers. (TIKA-1702) - - * Reverted to legacy sort order of parsers that was - mistakenly reversed in Tika 1.9 (TIKA-1689). - - * Upgrade to POI 3.13-beta1 (TIKA-1667). - - * Upgrade to PDFBox 1.8.10 (TIKA-1588). - - * MimeTypes now tries to find a registered type with and - without parameters (TIKA-1692). - - * Added more robust error handling for encoding detection - of .MSG files (TIKA-1238). - - * Fixed bug in Tika's use of the Jackcess parser that - prevented reading of v97 Access files (TIKA-1681). - - * Upgrade xerial.org's sqlite-jdbc to 3.8.10.1. NOTE: - as of Tika 1.9, this jar is "provided." Make sure - to upgrade your provided jar! (TIKA-1687). - - * Add header/footer extraction to xls (via Aeham Abushwashi) - (TIKA-1400). - - * Drop the source file name from the embedded file path in - RecursiveParserWrapper's "X-TIKA:embedded_resource_path" - (TIKA-1673). - - * Upgraded to Java 7 (TIKA-1536). - - * Non-standards compliant emails are now correctly detected - as message/rfc822 (TIKA-1602). - - * Added parser for MS Access files via Jackcess. Many thanks - to Health Market Science, Brian O'Neill and James Ahlborn - for relicensing Jackcess to Apache v2! (TIKA-1601) - - * GDALParser now correctly sets "nitf" as a supported - MediaType (TIKA-1664). - - * Added DigestingParser to calculate digest hashes - and record them in metadata. 
Integrated with - tika-app and tika-server (TIKA-1663). - - * Fixed ZipContainerDetector to detect all IPA files - (TIKA-1659). - - -Release 1.9 - 6/6/2015 - - * The ability to use the cTAKES clinical text - knowledge extraction system for biomedical data is - now included as a Tika parser (TIKA-1645, TIKA-1642). - - * Tika-server allows a user to specify the Tika config - from the command line (TIKA-1652, TIKA-1426). - - * Matlab file detection has been improved (TIKA-1634). - - * The EXIFTool was added as an External parser - (TIKA-1639). - - * If FFMPEG is installed and on the PATH, it is a - usable Parser in Tika now (TIKA-1510). - - * Fixes have been applied to the ExternalParser to make - it functional (TIKA-1638). - - * Tika service loading can now be more verbose with the - org.apache.tika.service.error.warn system property (TIKA-1636). - - * Tika Server now allows for metadata extraction from remote - URLs and in addition it outputs the detected language as a - metadata field (TIKA-1625). - - * OUTPUT_FILE_TOKEN not being replaced in ExternalParser - contributed by Pascal Essiembre (TIKA-1620). - - * Tika REST server now supports language identification - (TIKA-1622). - - * All of the example code from the Tika in Action book has - been donated to Tika and added to tika-examples (TIKA-1562). - - * Tika server now logs errors determining ContentDisposition - (TIKA-1621). - - * An algorithm for using Byte Histogram frequencies to construct - a Neural Network and to perform MIME detection was added - (TIKA-1582). - - * A Bayesian algorithm for MIME detection by probabilistic - means was added (TIKA-1517). - - * Tika now incorporates the Apache Spatial Information - System capability of parsing Geographic ISO 19139 - files (TIKA-443). It can also detect those files as - well. - - * Update the MimeTypes code to support inheritance - (TIKA-1535). 
- - * Provide ability to parse and identify Global Change - Master Directory Interchange Format (GCMD DIF) - scientific data files (TIKA-1532). - - * Improvements to detect CBOR files by extension (TIKA-1610). - - * Change xerial.org's sqlite-jdbc jar to "provided" (TIKA-1511). - Users will now need to add sqlite-jdbc to their classpath for - the Sqlite3Parser to work. - - * ExternalParser.check now catches (suppresses) SecurityException - and returns false, so it's OK to run Tika with a security policy - that does not allow execution of external processes (TIKA-1628). - -Release 1.8 - 4/13/2015 - - * Fix null pointer when processing ODT footer styles (TIKA-1600). - - * Upgrade to com.drewnoakes' metadata-extractor to 2.0 and - add parser for webp metadata (TIKA-1594). - - * Duration extracted from MP3s with no ID3 tags (TIKA-1589). - - * Upgraded to PDFBox 1.8.9 (TIKA-1575). - - * Tika now supports the IsaTab data standard for bioinformatics - both in terms of MIME identification and in terms of parsing - (TIKA-1580). - - * Tika server can now enable CORS requests with the command line - "--cors" or "-C" option (TIKA-1586). - - * Update jhighlight dependency to avoid using LGPL license. Thank - @kkrugler for his great contribution (TIKA-1581). - - * Updated HDF and NetCDF parsers to output file version in - metadata (TIKA-1578 and TIKA-1579). - - * Upgraded to POI 3.12-beta1 (TIKA-1531). - - * Added tika-batch module for directory to directory batch - processing. This is a new, experimental capability, and the API will - likely change in future releases (TIKA-1330). - - * Translator.translate() Exceptions are now restricted to - TikaException and IOException (TIKA-1416). - - * Tika now supports MIME detection for Microsoft Extended - Makefiles (EMF) (TIKA-1554). - - * Tika has improved delineation in XML and HTML MIME detection - (TIKA-1365). - - * Upgraded the Drew Noakes metadata-extractor to version 2.7.2 - (TIKA-1576). 
- - * Added basic style support for ODF documents, contributed by - Axel Dörfler (TIKA-1063). - - * Move Tika server resources and writers to separate - org.apache.tika.server.resource and writer packages (TIKA-1564). - - * Upgrade UCAR dependencies to 4.5.5 (TIKA-1571). - - * Fix Paths in Tika server welcome page (TIKA-1567). - - * Fixed infinite recursion while parsing some PDFs (TIKA-1038). - - * XHTMLContentHandler now properly passes along body attributes, - contributed by Markus Jelsma (TIKA-995). - - * TikaCLI option --compare-file-magic to report mime types known to - the file(1) tool but not known / fully known to Tika. - - * MediaTypeRegistry support for returning known child types. - - * Support for excluding (blacklisting) certain Parsers from being - used by DefaultParser via the Tika Config file, using the new - parser-exclude tag (TIKA-1558). - - * Detect Global Change Master Directory (GCMD) Directory - Interchange Format (DIF) files (TIKA-1561). - - * Tika's JAX-RS server can now return stacktraces for - parse exceptions (TIKA-1323). - - * Added MockParser for testing handling of exceptions, errors - and hangs in code that uses parsers (TIKA-1553). - - * The ForkParser service removed from Activator. Rollback of (TIKA-1354). - - * Increased the speed of language identification by - a factor of two -- contributed by Toke Eskildsen (TIKA-1549). - - * Added parser for Sqlite3 db files. Some users will need to - exclude the dependency on xerial.org's sqlite-jdbc because - it contains native libs (TIKA-1511). - - * Use POST instead of PUT for tika-server form methods - (TIKA-1547). - - * A basic wrapper around the UNIX file command was - added to extract Strings. In addition a parse to - handle Strings parsing from octet-streams using Latin1 - charsets as added (TIKA-1541, TIKA-1483). - - * Add test files and detection mechanism for Gridded - Binary (GRIB) files (TIKA-1539). 
- - * The RAR parser was updated to handle Chinese characters - using the functionality provided by allowing encoding to - be used within ZipArchiveInputStream (TIKA-936). - - * Fix out of memory error in surefire plugin (TIKA-1537). - - * Build a parser to extract data from GRIB formats (TIKA-1423). - - * Upgrade to Commons Compress 1.9 (TIKA-1534). - - * Include media duration in metadata parsed by MP4Parser (TIKA-1530). - - * Support password protected 7zip files (using a PasswordProvider, - in keeping with the other password supporting formats) (TIKA-1521). - - * Password protected Zip files should not trigger an exception (TIKA-1028). - -Release 1.7 - 1/9/2015 - - * Fixed resource leak in OutlookPSTParser that caused TikaException - when invoked via AutoDetectParser on Windows (TIKA-1506). - - * HTML tags are properly stripped from content by FeedParser - (TIKA-1500). - - * Tika Server support for selecting a single metadata key; - wrapped MetadataEP into MetadataResource (TIKA-1499). - - * Tika Server support for JSON and XMP views of metadata (TIKA-1497). - - * Tika Parent uses dependency management to keep duplicate - dependencies in different modules the same version (TIKA-1384). - - * Upgraded slf4j to version 1.7.7 (TIKA-1496). - - * Tika Server support for RecursiveParserWrapper's JSON output - (endpoint=rmeta) equivalent to (TIKA-1451's) -J option - in tika-app (TIKA-1498). - - * Tika Server support for providing the password for files on a - per-request basis through the Password http header (TIKA-1494). - - * Simple support for the BPG (Better Portable Graphics) image format - (TIKA-1491, TIKA-1495). - - * Prevent exceptions from being thrown for some malformed - mp3 files (TIKA-1218). - - * Reformat pom.xml files to use two spaces per indent (TIKA-1475). - - * Fix warning of slf4j logger on Tika Server startup (TIKA-1472). - - * Tika CLI and GUI now have option to view JSON rendering of output - of RecursiveParserWrapper (TIKA-1451). 
- - * Tika now integrates the Geospatial Data Abstraction Library - (GDAL) for parsing hundreds of geospatial formats (TIKA-605, - TIKA-1503). - - * ExternalParsers can now use Regexs to specify dynamic keys - (TIKA-1441). - - * Thread safety issues in ImageMetadataExtractor were resolved - (TIKA-1369). - - * The ForkParser service is now registered in Activator - (TIKA-1354). - - * The Rome Library was upgraded to version 1.5 (TIKA-1435). - - * Add markup for files embedded in PDFs (TIKA-1427). - - * Extract files embedded in annotations in PDFS (TIKA-1433). - - * Upgrade to PDFBox 1.8.8 (TIKA-1419, TIKA-1442). - - * Add RecursiveParserWrapper (aka Jukka's and Nick's) - RecursiveMetadataParser (TIKA-1329) - - * Add example for how to dump TikaConfig to XML (TIKA-1418). - - * Allow users to specify a tika config file for tika-app (TIKA-1426). - - * PackageParser includes the last-modified date from the archive - in the metadata, when handling embedded entries (TIKA-1246) - - * Created a new Tesseract OCR Parser to extract text from images. - Requires installation of Tesseract before use (TIKA-93). - - * Basic parser for older Excel formats, such as Excel 4, 5 and 95, - which can get simple text, and metadata for Excel 5+95 (TIKA-1490) - - -Release 1.6 - 08/31/2014 - - * Parse output should indicate which Parser was actually used - (TIKA-674). - - * Use the forbidden-apis Maven plugin to check for unsafe Java - operations (TIKA-1387). - - * Created an ExternalTranslator class to interface with command - line Translators (TIKA-1385). - - * Created a MosesTranslator as a subclass of ExternalTranslator - that calls the Moses Decoder machine translation program (TIKA-1385). - - * Created the tika-example module. It will have examples of how to - use the main Tika interfaces (TIKA-1390). - - * Upgraded to Commons Compress 1.8.1 (TIKA-1275). - - * Upgraded to POI 3.11-beta1 (TIKA-1380). - - * Tika now extracts SDTCell content from tables in .docx files (TIKA-1317). 
- - * Tika now supports detection of the Persian/Farsi language. - (TIKA-1337) - - * The Tika Detector interface is now exposed through the JAX-RS - server (TIKA-1336, TIKA-1336). - - * Tika now has support for parsing binary Matlab files as part of - our larger effort to increase the number of scientific data formats - supported. (TIKA-1327) - - * The Tika Server URLs for the unpacker resources have been changed, - to bring them under a common prefix (TIKA-1324). The mapping is - /unpacker/{id} -> /unpack/{id} - /all/{id} -> /unpack/all/{id} - - * Added module and core Tika interface for translating text between - languages and added a default implementation that call's Microsoft's - translate service (TIKA-1319) - - * Added an Translator implementation that calls Lingo24's Premium - Machine Translation API (TIKA-1381) - - * Made RTFParser's list handling slightly more robust against corrupt - list metadata (TIKA-1305) - - * Fixed bug in CLI json output (TIKA-1291/TIKA-1310) - - * Added ability to turn off image extraction from PDFs (TIKA-1294). - Users must now turn on this capability via the PDFParserConfig. - - * Upgrade to PDFBox 1.8.6 (TIKA-1290, TIKA-1231, TIKA-1233, TIKA-1352) - - * Zip Container Detection for DWFX and XPS formats, which are OPC - based (TIKA-1204, TIKA-1221) - - * Added a user facing welcome page to the Tika Server, which - says what it is, and a very brief summary of what is available. 
- (TIKA-1269) - - * Added Tika Server endpoints to list the available mime types, - Parsers and Detectors, similar to the --list- methods on - the Tika CLI App (TIKA-1270) - - * Improvements to NetCDF and HDF parsing to mimic the output of - ncdump and extract text dimensions and spatial and variable - information from scientific data files (TIKA-1265) - - * Extract attachments from RTF files (TIKA-1010) - - * Support Outlook Personal Folders File Format *.pst (TIKA-623) - - * Added mime entries for additional Ogg based formats (TIKA-1259) - - * Updated the Ogg Vorbis plugin to v0.4, which adds detection for a wider - range of Ogg formats, and parsers for more Ogg Audio ones (TIKA-1113) - - * PDF: Images in PDF documents can now be extracted as embedded resources. - (TIKA-1268) - - * Fixed RuntimeException thrown for certain Word Documents (TIKA-1251). - - * CLI: TikaCLI now has another option: --list-parser-details-apt, which outputs - the list of supported parsers in APT format. This is used to generate the list - on the formats page (TIKA-411). - Release 1.5 - 02/04/2014 * Fixed bug in handling of embedded file processing in PDFs (TIKA-1228). 
diff --git a/KEYS b/KEYS index 6dd15f5..0649aa1 100644 --- a/KEYS +++ b/KEYS @@ -230,7 +230,7 @@ sub 4096R/84C15C40 2014-02-04 -----BEGIN PGP PUBLIC KEY BLOCK----- -Version: GnuPG v1.4.11 (GNU/Linux) +Version: GnuPG v1 mQINBFLxaSYBEADUywK+vv9sbxjLrW5aAM5bSxyZdPLgv8xUphG40XEGQPAamGiL aDg9cgob1eZNcxmzMmp/O4vHdcdjzHN0iRMUpsYaSlm9YjqbK3sYynrXqahmHJFa @@ -256,89 +256,28 @@ U3+oNL01iP7fdTp+Nu6eHqCg3GXIkCEwN88Vr9IbAkoQD3DrRWerh35X9zOeb56i GT+UulAJayBWIgypp6j+uiDqOtDWysOQBn1wQxkERSHzsHtKJ4OXTqXudZ+gNhAK cQPDzrm1vaT/WoGLxL/hvjf1jo0UD/UtCKnFbCphjKXifuXiRFmr0MkI12ui79rk -iQQcBBABCAAGBQJTMWqBAAoJEIqviNbYTkGu/ZkgAJuXfdwRbvhuodF8f944sRN1 -7dMDEBf2Uu07zIWR/zXw/ivduf89/5pOXKTjxWFtOqkxPNPPobc8qwAJ8RvnG8fJ -Wz5GP4X3b4Li56U+jSeIuMQqAPivnQQZAykKHwwZcjKrJK0zhpMwITbuQ9ng/tMk -kuV5S3twt73CrIcet2rd7VztesmD2cheNupwlal7tImMwOboE9lF6l4gln48dxzt -huW8h2jo2mWHqzGradcDLlFsecBFFYute3O5gVVweNG0p9/k7GtBTlcKga8ukhuK -nIKzcxGN2wDdMD/xXH0cSGhcDaxHPFvRcX5Og3bAVvdIdkrZueBoV7pd9BWpFA54 -ArTBPkK9WUd0JVRMN1yDZqpel7JHAgKgQ1YiTT41RKj2XFvjN6S4kYJ3elRvDX8f -W4kZ4XOKzmqagGfJV061wFRI4sx1Vp3H1vZiF4NkVxdXApBlCdKtO4tyCZiqqSKZ -MaDDPYBiBp1WaNvwmP9zhLw6ZLoO8/0XMKjpguUWzIJh4D4lqw1pN690Dcvfm05r -gKnJUBH8KCq2XgPg8pgAsjuJ6EzIjHLxmvXffiQQ1jgvRT1yU/XM4glSqRnaTVXB -ymg6KrKDTf6iZE94TpmRkZu1lxvg8bKGo9T+otAVq44Ns/qKPaZ/pgZZi+Ip9v80 -GQo4KHGaObjwYbZgHurwlZu59FRRTbm1iU0nt/h49xXOwGZuNM3LRO5gC3usvvL3 -BYv18jSX9RZnsA0MAVLv9AcjtHMhAYoYZlmogA8Y6S/dIUuEEqTcQMG76GP8hRRL -RUXIViQ7aaOw7y61xVL65OFrfJcuJ0VSB7tuUVFffK6sijRtxu93JOmigq7/IlHy -vatCv6UW+NXaCg1gKezwBFjUQqwQc8ECNt8cq9A8VqhIk5w3xhu/oQpTarUuVJ9H -ZygV5sD/qamJ015fm9lePA7FgNwS0Jf9Bn/BO70IP72/s4bUWvDHGd2a7w0PM9fd -eXt5xTFxNuJJ2BOpO4XIFCiT7igSz5aEC9HoSBv6P+LEqDjoqp/Wize61iWJ9DVP -pZoFQUF3UMH5WajQ/wZxF7YZJsJX+YNiV+cYahO/6bJ6/nMMo1vLtRYLoliujXXQ -K92x7au66cDKUvc/5F1HbpuJ8ZkLtzUERcOdncU5hMeJjTxBH6wmqmBGYUvcqXgY -IfYYGv/J3z/Fai66q2fDuHFV1cIBtdR8wM00XuprMYLDmS0capQc3s7Ft8jsJVGt -y4hhM/zNzltkM7UB3XeEJNDwMHFZi2+yY49H9sPFrzh8izOFrYW670YFLhBTVNCw 
-C2i2rESxDs13do8UvuxK4qPnQTN2pAh21lUGrzGTm/NUqRJlpuK+NZYJ1EqjIoSI -nAQQAQIABgUCUwOEzQAKCRDurUz9SaVj2UUQA/9rfy/wgwWejyiei60Fnn4H7bfO -FRuNj3etXWOGksF+KciFY+TwKEmtC+Sxgzfq4jqLZCcTJWIVpAv+xD+bU0wXbU83 -dv9BrYfuT1Q9O2r4m4YGGLROoaUU75/CKbyeKxUJZdulvB5DjxWPOVADUGV9k7Ct -2Xlpcxt0owafAWMa8rkCDQRS8WkmARAAqEDffKwNTY9rIX2lg3tz44aXe3+qop/O -s4im8PBLvwMXYhuV4T6WW2Jut/hijopwMD9E+FdRZZ/o/lx+udGLHntXXsWvIF7n -woX/ORSqv6wxo0zxtERbyOg5Bgz/ruLCSXcgI5GY6Ga1Dkemot5EiyHFQLAUU/zS -55Q535Y1XytYW6xNWKS1cSpnk2ybL2To0ANTIAooAJR9wAF3H2RPVGPR5Tx6zQwT -Ws73PPX85LYMlAUu2x/fSfQ8ZrViJ/16qbBRf5UPOs3oq1kkqHyb35OhuussUuQK -SgezteRmdQMqlpAMwyn7C33eIAH3s0XygyXFq2PPw8o7MkX26heh6AZ1dFDi6c6m -fFG4z+fuuFgxqmo1HOqVKFtJQSVMbtWWL4wsuzv5tRjUA/7+QLaItDNjFDi7ot9M -jdQMQvaoWyhWQfL29ReHDLk7grXxfhnjpPCG9oJfEYUlZiJa23pO6ecJOqT0fCX/ -/Z4MMkq+HYUAbNxnSR4fHwYHuM5IOpBaNaQqPfWxc4pwd4Umwld4U539riwG9iwo -s69dubgHOVDC/43h6kl0mbR8H3MtcGCipmspy3M/IjDDPgmgrjU1FS0tlSOD9KeX -h1B9UgC9ak/U9kt4KYDPuacmZngzlooLW1Die/UATNe5W6pXsbzUP7dU3Mljx5PW -1173QI6DQZ8AEQEAAYkCHwQYAQoACQUCUvFpJgIbDAAKCRBSQUsLDrMLByVAEACr -JT8tmTScDLxcimh35BdJIvrRg8t2cv6mFvGkBcZ87EeH05fwSTVwQodhKWExqUhE -n3b2XjF5PPo1FWravwvLS3xaA88Kz7ncAHpsMpxCfjKivJCSm5jL5l3Xc6iySqx1 -482Qpou9OrLAJIqKiF0iWGbfnfd7U3GbJla1sKMl1BPmQW0H5nbN9h6ns4nqZnxl -2rFp0Y9f34XWOV2ViawiGZ8AmnacSpT4++F8XxGrLXClbCPhFq+0o/Nwe0bBktSD -43JWO+zwTUwgUgdVN5AUw+5f3WAy16YiWeTgMkTvTK11uwPO+l4WgbEXXEcVKhHv -fDRS1YVriTukU2/PS4n9yxsbL7mh6SEA9DYJDTU2C/T3yG0Y02aST0LVhK7fnkSu -ELmUQNR2Ck/4bWFR+v6KrDCpayIfQKFVz0PhMQsPE6jaNI00odCncuEo9LwEiw83 -JKcxVc9nDKDwX4LI61NYfhXN5TD041lv2GVGj7e1fptY6yOIpJvhAuEyFirYBPlb -xwFx51DypthuGtmtGPmuR8G7c/uoUgsaWtljCFeMAYWqBHqEa4H5Zan6LzSdgSR6 -WQLxqmrxX/8qp6l7LSTeE9OviRHHs6wlEw8mfzNBjRlTav3HTQKXNVyJfrbpFAmE -v5oOrlFoVc31np7pDFg1nZ8o8Rb//arjsRzPe+oZIA== -=wanG ------END PGP PUBLIC KEY BLOCK----- -pub 2048R/D4F10117 2015-01-01 -uid Tyler Palsulich -sig 3 D4F10117 2015-01-01 Tyler Palsulich -sub 2048R/6137D1E6 2015-01-01 -sig D4F10117 2015-01-01 Tyler Palsulich - ------BEGIN PGP PUBLIC 
KEY BLOCK----- -Version: GnuPG v1 - -mQENBFSlspUBCADJfADZ0ep3o/wo5sUSHDcFvmcuTRsHZDgsoHrdk83oqsQtHBZK -EQ4KeTbPTONgyNSU13kQDT6BYX3CA4AB9rqSBCI/Gghi56+I4d8mjZODY5bpnILC -vU9FyLsJNdbV8J48+oDF/5LToo5VB8QYslZ8ZZ7DJZvNmh4EovlnP9bVVS4Txk7d -mywSr1MTy5u6lb71oczK95pxO2dRwvJzLcQNTAgh3nrqk1JCLMxJoGGaKKLiGZgF -psn5nusGzOoRHeUa33V3/ms3ZYM6mS/9MmyU5P1zOUZ2Exc9C6Tps0bYbB/oztgM -4bx9NFwpeuILi4OJ/wEIJNp809CXXoYFuWlNABEBAAG0J1R5bGVyIFBhbHN1bGlj -aCA8dHBhbHN1bGljaEBhcGFjaGUub3JnPokBOAQTAQIAIgUCVKWylQIbAwYLCQgH -AwIGFQgCCQoLBBYCAwECHgECF4AACgkQiBC7GdTxARd0nQf/S2yLJ8U7P/Hix5zR -3idwrAmfDtYhUJXuEedKCw9RFnq9Q45hs1zIHVsOtnYaPvyQqSF8rY/E5LR6KJ1W -I1reFc5wKJLfmCWPAJ0Og8U4N1DOwwxESesugUT16iAXQL58xbSAzGJ1/v4L8eTj -P7maZcEdW7FLLTqJFuSfJsu8VowU8pD+v2DGHehARhDyJhhQxrX1Zb1t8vffspXw -bND1CbdB87VZJOj1apRL47nG6Qev7On+XKEXR9tHz/MWdJ/0kyNju6OLcjPJ2QFb -Q/Dwj6VYblvKq5eIYuhSNzbaI2AayZGpC9/PpFSPPWPhqa+eukUoPd3rGEG2PGBh -1shjYLkBDQRUpbKVAQgAsHL1+04Um1nOQJyeBhZ6tIa5VBPvhwk+Gccy3rWFZ66W -4byZ16Hc4tM9mU2CcPpdLYITPJaAEi+T7frXuiJwmVeAe1o9LElVAOGwbDlybv6s -wJvQqnrbwRBQLmblXeSqffAE4bpz4iU4haD2LpyjKNs5D/YS9QfhjuTKh9gGu+uP -DhXmD1hGn0UvDy9GuX6PgWijeOIUlvuZaiN8cZjsG87MLXcLLxbvCZIfrmyheF22 -zSYMEvNB3r8dLTnCIt7SqbdGGyyV0kBMQWic2Epk7WzQWNsshCVPhZNkJ4oQN4Yo -AMdGyLHTJ8HvH6L8trDFQEdJrt1lIcLn43lv1AzF9QARAQABiQEfBBgBAgAJBQJU -pbKVAhsMAAoJEIgQuxnU8QEX4+oIALw2qD3KyAKKwHGK8X93woHY19tDH4zCKsQa -r2qXy7aoAsNhERkg24OUkJu0T/c/HzAQPs0RbEZUxqhzsezmJKwey+9TmNsmTcM6 -52nVMa5fl7+38A54dqLOtK965ZggSroM6Qyk9lrfsJRQ/4BbNfagsXPP7Fvs1DDe -JcWAy7md7XR9MiVgSQuw040wqSzcSA5M6RCFZ9gN+G0kP1CNZ5vDz+JktV4nJZzh -/i/wH25qTePHz6Clp6mye68cqtCTKX2RF5cTlFCWIqyFYFCfrKCi3LF0bhpWqq7S -JF8xV9E4P/Msl8hqmOOocZ4LDJdw/nt1UWlUmattMLBVWdSeuu0= -=pYQ7 ------END PGP PUBLIC KEY BLOCK----- +uQINBFLxaSYBEACoQN98rA1Nj2shfaWDe3Pjhpd7f6qin86ziKbw8Eu/AxdiG5Xh +PpZbYm63+GKOinAwP0T4V1Fln+j+XH650Ysee1dexa8gXufChf85FKq/rDGjTPG0 +RFvI6DkGDP+u4sJJdyAjkZjoZrUOR6ai3kSLIcVAsBRT/NLnlDnfljVfK1hbrE1Y +pLVxKmeTbJsvZOjQA1MgCigAlH3AAXcfZE9UY9HlPHrNDBNazvc89fzktgyUBS7b 
+H99J9DxmtWIn/XqpsFF/lQ86zeirWSSofJvfk6G66yxS5ApKB7O15GZ1AyqWkAzD +KfsLfd4gAfezRfKDJcWrY8/DyjsyRfbqF6HoBnV0UOLpzqZ8UbjP5+64WDGqajUc +6pUoW0lBJUxu1ZYvjCy7O/m1GNQD/v5Atoi0M2MUOLui30yN1AxC9qhbKFZB8vb1 +F4cMuTuCtfF+GeOk8Ib2gl8RhSVmIlrbek7p5wk6pPR8Jf/9ngwySr4dhQBs3GdJ +Hh8fBge4zkg6kFo1pCo99bFzinB3hSbCV3hTnf2uLAb2LCizr125uAc5UML/jeHq +SXSZtHwfcy1wYKKmaynLcz8iMMM+CaCuNTUVLS2VI4P0p5eHUH1SAL1qT9T2S3gp +gM+5pyZmeDOWigtbUOJ79QBM17lbqlexvNQ/t1TcyWPHk9bXXvdAjoNBnwARAQAB +iQIfBBgBCgAJBQJS8WkmAhsMAAoJEFJBSwsOswsHJUAQAKslPy2ZNJwMvFyKaHfk +F0ki+tGDy3Zy/qYW8aQFxnzsR4fTl/BJNXBCh2EpYTGpSESfdvZeMXk8+jUVatq/ +C8tLfFoDzwrPudwAemwynEJ+MqK8kJKbmMvmXddzqLJKrHXjzZCmi706ssAkioqI +XSJYZt+d93tTcZsmVrWwoyXUE+ZBbQfmds32HqeziepmfGXasWnRj1/fhdY5XZWJ +rCIZnwCadpxKlPj74XxfEastcKVsI+EWr7Sj83B7RsGS1IPjclY77PBNTCBSB1U3 +kBTD7l/dYDLXpiJZ5OAyRO9MrXW7A876XhaBsRdcRxUqEe98NFLVhWuJO6RTb89L +if3LGxsvuaHpIQD0NgkNNTYL9PfIbRjTZpJPQtWErt+eRK4QuZRA1HYKT/htYVH6 +/oqsMKlrIh9AoVXPQ+ExCw8TqNo0jTSh0Kdy4Sj0vASLDzckpzFVz2cMoPBfgsjr +U1h+Fc3lMPTjWW/YZUaPt7V+m1jrI4ikm+EC4TIWKtgE+VvHAXHnUPKm2G4a2a0Y ++a5Hwbtz+6hSCxpa2WMIV4wBhaoEeoRrgfllqfovNJ2BJHpZAvGqavFf/yqnqXst +JN4T06+JEcezrCUTDyZ/M0GNGVNq/cdNApc1XIl+tukUCYS/mg6uUWhVzfWenukM +WDWdnyjxFv/9quOxHM976hkg +=CmKg +-----END PGP PUBLIC KEY BLOCK----- diff --git a/LICENSE.txt b/LICENSE.txt index 9576237..2c98686 100644 --- a/LICENSE.txt +++ b/LICENSE.txt @@ -322,51 +322,3 @@ 14. This Specifications License Agreement reflects the entire agreement of the parties regarding the subject matter hereof and supersedes all prior agreements or representations regarding such matters, whether written or oral. 
To the extent any portion or provision of this Specifications License Agreement is found to be illegal or unenforceable, then the remaining provisions of this Specifications License Agreement will remain in full force and effect and the illegal or unenforceable provision will be construed to give it such effect as it may properly have that is consistent with the intentions of the parties. 15. This Specifications License Agreement may only be modified in writing signed by an authorized representative of the IPTC. 16. This Specifications License Agreement is governed by the law of United Kingdom, as such law is applied to contracts made and fully performed in the United Kingdom. Any disputes arising from or relating to this Specifications License Agreement will be resolved in the courts of the United Kingdom. You consent to the jurisdiction of such courts over you and covenant not to assert before such courts any objection to proceeding in such forums. - - -JUnRAR (https://github.com/edmund-wagner/junrar/) - - JUnRAR is based on the UnRAR tool, and covered by the same license - It was formerly available from http://java-unrar.svn.sourceforge.net/ - - ****** ***** ****** UnRAR - free utility for RAR archives - ** ** ** ** ** ** ~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~ - ****** ******* ****** License for use and distribution of - ** ** ** ** ** ** ~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~ - ** ** ** ** ** ** FREE portable version - ~~~~~~~~~~~~~~~~~~~~~ - - The source code of UnRAR utility is freeware. This means: - - 1. All copyrights to RAR and the utility UnRAR are exclusively - owned by the author - Alexander Roshal. - - 2. The UnRAR sources may be used in any software to handle RAR - archives without limitations free of charge, but cannot be used - to re-create the RAR compression algorithm, which is proprietary. 
- Distribution of modified UnRAR sources in separate form or as a - part of other software is permitted, provided that it is clearly - stated in the documentation and source comments that the code may - not be used to develop a RAR (WinRAR) compatible archiver. - - 3. The UnRAR utility may be freely distributed. It is allowed - to distribute UnRAR inside of other software packages. - - 4. THE RAR ARCHIVER AND THE UnRAR UTILITY ARE DISTRIBUTED "AS IS". - NO WARRANTY OF ANY KIND IS EXPRESSED OR IMPLIED. YOU USE AT - YOUR OWN RISK. THE AUTHOR WILL NOT BE LIABLE FOR DATA LOSS, - DAMAGES, LOSS OF PROFITS OR ANY OTHER KIND OF LOSS WHILE USING - OR MISUSING THIS SOFTWARE. - - 5. Installing and using the UnRAR utility signifies acceptance of - these terms and conditions of the license. - - 6. If you don't agree with terms of the license you must remove - UnRAR files from your storage devices and cease to use the - utility. - - Thank you for your interest in RAR and UnRAR. Alexander L. Roshal - -Sqlite (bundled in org.xerial's sqlite-jdbc) - This product bundles Sqlite, which is in the Public Domain. For details - see: https://www.sqlite.org/copyright.html diff --git a/NOTICE.txt b/NOTICE.txt index 7b50eb1..a1bf620 100644 --- a/NOTICE.txt +++ b/NOTICE.txt @@ -1,5 +1,5 @@ Apache Tika -Copyright 2015 The Apache Software Foundation +Copyright 2011 The Apache Software Foundation This product includes software developed at The Apache Software Foundation (http://www.apache.org/). @@ -7,10 +7,8 @@ Copyright 1993-2010 University Corporation for Atmospheric Research/Unidata This software contains code derived from UCAR/Unidata's NetCDF library. 
-Tika-server component uses CDDL-licensed dependencies: jersey (http://jersey.java.net/) and +Tika-server component uses CDDL-licensed dependencies: jersey (http://jersey.java.net/) and Grizzly (http://grizzly.java.net/) - -Tika-parsers component uses CDDL/LGPL dual-licensed dependency: jhighlight (https://github.com/codelibs/jhighlight) OpenCSV: Copyright 2005 Bytecode Pty Ltd. Licensed under the Apache License, Version 2.0 diff --git a/README.md b/README.md deleted file mode 100644 index 8b50315..0000000 --- a/README.md +++ /dev/null @@ -1,85 +0,0 @@ -Welcome to Apache Tika -================================================= - -Apache Tika(TM) is a toolkit for detecting and extracting metadata and structured text content from various documents using existing parser libraries. - -Tika is a project of the [Apache Software Foundation](http://www.apache.org). - -Apache Tika, Tika, Apache, the Apache feather logo, and the Apache Tika project logo are trademarks of The Apache Software Foundation. - -Getting Started ---------------- - -Tika is based on Java 6 and uses the [Maven 3](http://maven.apache.org) build system. To build Tika, use the following command in this directory: - - mvn clean install - -The build consists of a number of components, including a standalone runnable jar that you can use to try out Tika features. You can run it like this: - - java -jar tika-app/target/tika-app-*.jar --help - -Contributing via Github -======================= -To contribute a patch, follow these instructions (note that installing -[Hub](http://hub.github.com) is not strictly required, but is recommended). - -``` -0. Download and install hub.github.com -1. File JIRA issue for your fix at https://issues.apache.org/jira/browse/TIKA -- you will get issue id TIKA-xxx where xxx is the issue ID. -2. git clone http://github.com/apache/tika.git -3. cd tika -4. git checkout -b TIKA-xxx -5. edit files -6. git status (make sure it shows what files you expected to edit) -7. git add -8.
git commit -m “fix for TIKA-xxx contributed by <your username>” -9. git fork -10. git push -u <your username> TIKA-xxx -11. git pull-request -``` - -License (see also LICENSE.txt) ------------------------------- - -Collective work: Copyright 2011 The Apache Software Foundation. - -Licensed to the Apache Software Foundation (ASF) under one or more contributor license agreements. See the NOTICE file distributed with this work for additional information regarding copyright ownership. The ASF licenses this file to You under the Apache License, Version 2.0 (the "License"); you may not use this file except in compliance with the License. You may obtain a copy of the License at - - http://www.apache.org/licenses/LICENSE-2.0 - -Unless required by applicable law or agreed to in writing, software distributed under the License is distributed on an "AS IS" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. See the License for the specific language governing permissions and limitations under the License. - -Apache Tika includes a number of subcomponents with separate copyright notices and license terms. Your use of these subcomponents is subject to the terms and conditions of the licenses listed in the LICENSE.txt file. - -Export control -------------- - -This distribution includes cryptographic software. The country in which you currently reside may have restrictions on the import, possession, use, and/or re-export to another country, of encryption software. BEFORE using any encryption software, please check your country's laws, regulations and policies concerning the import, possession, or use, and re-export of encryption software, to see if this is permitted. See http://www.wassenaar.org/ for more information. - -The U.S. Government Department of Commerce, Bureau of Industry and Security (BIS), has classified this software as Export Commodity Control Number (ECCN) 5D002.C.1, which includes information security software using or performing cryptographic functions with asymmetric algorithms.
The form and manner of this Apache Software Foundation distribution makes it eligible for export under the License Exception ENC Technology Software Unrestricted (TSU) exception (see the BIS Export Administration Regulations, Section 740.13) for both object code and source code. - -The following provides more details on the included cryptographic software: - -Apache Tika uses the Bouncy Castle generic encryption libraries for extracting text content and metadata from encrypted PDF files. See http://www.bouncycastle.org/ for more details on Bouncy Castle. - -Mailing Lists ------------- - -Discussion about Tika takes place on the following mailing lists: - -* user@tika.apache.org - About using Tika -* dev@tika.apache.org - About developing Tika - -Notification on all code changes are sent to the following mailing list: - -* commits@tika.apache.org - -The mailing lists are open to anyone and publicly archived. - -You can subscribe the mailing lists by sending a message to [LIST]-subscribe@tika.apache.org (for example user-subscribe@...). To unsubscribe, send a message to [LIST]-unsubscribe@tika.apache.org. For more instructions, send a message to [LIST]-help@tika.apache.org. - -Issue Tracker ------------- - -If you encounter errors in Tika or want to suggest an improvement or a new feature, please visit the [Tika issue tracker](https://issues.apache.org/jira/browse/TIKA). There you can also find the latest information on known issues and recent bug fixes and enhancements. diff --git a/README.txt b/README.txt new file mode 100644 index 0000000..d93ffac --- /dev/null +++ b/README.txt @@ -0,0 +1,102 @@ +================================================= +Welcome to Apache Tika +================================================= + +Apache Tika(TM) is a toolkit for detecting and extracting metadata and +structured text content from various documents using existing parser +libraries. + +Tika is a project of the Apache Software Foundation.
+ +Apache Tika, Tika, Apache, the Apache feather logo, and the Apache Tika +project logo are trademarks of The Apache Software Foundation. + +Getting Started +=============== + +Tika is based on Java 5 and uses the Maven 2 +build system. To build Tika, use the following command in this directory: + + mvn clean install + +The build consists of a number of components, including a standalone runnable +jar that you can use to try out Tika features. You can run it like this: + + java -jar tika-app/target/tika-app-*.jar --help + +License (see also LICENSE.txt) +============================== + +Collective work: Copyright 2011 The Apache Software Foundation. + +Licensed to the Apache Software Foundation (ASF) under one or more +contributor license agreements. See the NOTICE file distributed with +this work for additional information regarding copyright ownership. +The ASF licenses this file to You under the Apache License, Version 2.0 +(the "License"); you may not use this file except in compliance with +the License. You may obtain a copy of the License at + + http://www.apache.org/licenses/LICENSE-2.0 + +Unless required by applicable law or agreed to in writing, software +distributed under the License is distributed on an "AS IS" BASIS, +WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. +See the License for the specific language governing permissions and +limitations under the License. + +Apache Tika includes a number of subcomponents with separate copyright +notices and license terms. Your use of these subcomponents is subject to +the terms and conditions of the licenses listed in the LICENSE.txt file. + +Export control +============== + +This distribution includes cryptographic software. The country in which +you currently reside may have restrictions on the import, possession, use, +and/or re-export to another country, of encryption software. 
BEFORE using +any encryption software, please check your country's laws, regulations and +policies concerning the import, possession, or use, and re-export of +encryption software, to see if this is permitted. See + for more information. + +The U.S. Government Department of Commerce, Bureau of Industry and +Security (BIS), has classified this software as Export Commodity Control +Number (ECCN) 5D002.C.1, which includes information security software using +or performing cryptographic functions with asymmetric algorithms. The form +and manner of this Apache Software Foundation distribution makes it eligible +for export under the License Exception ENC Technology Software Unrestricted +(TSU) exception (see the BIS Export Administration Regulations, Section +740.13) for both object code and source code. + +The following provides more details on the included cryptographic software: + + Apache Tika uses the Bouncy Castle generic encryption libraries for + extracting text content and metadata from encrypted PDF files. + See http://www.bouncycastle.org/ for more details on Bouncy Castle. + +Mailing Lists +============= + +Discussion about Tika takes place on the following mailing lists: + + user@tika.apache.org - About using Tika + dev@tika.apache.org - About developing Tika + +Notifications of all code changes are sent to the following mailing list: + + commits@tika.apache.org + +The mailing lists are open to anyone and publicly archived. + +You can subscribe to the mailing lists by sending a message to +[LIST]-subscribe@tika.apache.org (for example user-subscribe@...). +To unsubscribe, send a message to [LIST]-unsubscribe@tika.apache.org. +For more instructions, send a message to [LIST]-help@tika.apache.org. + +Issue Tracker +============= + +If you encounter errors in Tika or want to suggest an improvement or +a new feature, please visit the Tika issue tracker at +https://issues.apache.org/jira/browse/TIKA.
There you can also find the +latest information on known issues and recent bug fixes and enhancements. diff --git a/pom.xml b/pom.xml index db5c188..eae704d 100644 --- a/pom.xml +++ b/pom.xml @@ -25,7 +25,7 @@ org.apache.tika tika-parent - 1.11 + 1.5 tika-parent/pom.xml @@ -36,12 +36,12 @@ - scm:svn:http://svn.apache.org/repos/asf/tika/tags/1.11-rc1 + scm:svn:http://svn.apache.org/repos/asf/tika/tags/1.5/ - scm:svn:https://svn.apache.org/repos/asf/tika/tags/1.11-rc1 + scm:svn:https://svn.apache.org/repos/asf/tika/tags/1.5/ - http://svn.apache.org/viewvc/tika/tags/1.11-rc1 + http://svn.apache.org/viewvc/tika/tags/1.5/ @@ -49,15 +49,44 @@ tika-core tika-parsers tika-xmp - tika-serialization - tika-batch tika-app tika-bundle tika-server - tika-translate - tika-example - tika-java7 + + + + + maven-deploy-plugin + + true + + + + maven-site-plugin + + src/site + + + + + org.apache.rat + apache-rat-plugin + + + .*/** + CHANGES.txt + tika-dotnet/AssemblyInfo.cs + tika-dotnet/Tika.csproj + tika-dotnet/Tika.sln + tika-dotnet/Tika.sln.cache + tika-dotnet/obj/** + tika-dotnet/target/** + + + + + @@ -105,8 +134,7 @@ - - + @@ -125,41 +153,38 @@ From: ${username}@apache.org To: dev@tika.apache.org - user@tika.apache.org -Subject: [VOTE] Release Apache Tika ${project.version} Candidate #N +Subject: [VOTE] Release Apache Tika ${project.version} A candidate for the Tika ${project.version} release is available at: - https://dist.apache.org/repos/dist/dev/tika/ + + http://people.apache.org/~${username}/tika/${project.version}/ The release candidate is a zip archive of the sources in: - http://svn.apache.org/repos/asf/tika/tags/${project.version}-rcN/ - -The SHA1 checksum of the archive is - ${checksum}. - -In addition, a staged maven repository is available here: - https://repository.apache.org/content/repositories/orgapachetika-.../org/apache/tika + + http://svn.apache.org/repos/asf/tika/tags/${project.version}/ + +The SHA1 checksum of the archive is ${checksum}. 
Please vote on releasing this package as Apache Tika ${project.version}. The vote is open for the next 72 hours and passes if a majority of at least three +1 Tika PMC votes are cast. -[ ] +1 Release this package as Apache Tika ${project.version} -[ ] -1 Do not release this package because...${line.separator} + [ ] +1 Release this package as Apache Tika ${project.version} + [ ] -1 Do not release this package because...${line.separator} - The release candidate has been prepared in: - - ${basedir}/target/${project.version} - - Please deploy it to people.apache.org like this: - - scp -r ${basedir}/target/${project.version} people.apache.org:public_html/tika/ - - A release vote template has been generated for you: - - file://${basedir}/target/vote.txt +The release candidate has been prepared in: + + ${basedir}/target/${project.version} + +Please deploy it to people.apache.org like this: + + scp -r ${basedir}/target/${project.version} people.apache.org:public_html/tika/ + +A release vote template has been generated for you: + + file://${basedir}/target/vote.txt @@ -168,30 +193,37 @@ - org.apache.ant - ant-nodeps - 1.8.1 - + org.apache.ant + ant-nodeps + 1.8.1 + + + java7 + + [1.7,] + + + tika-java7 + + - The Apache Tika™ toolkit detects and extracts metadata and structured text content from various documents - using existing parser libraries. - + The Apache Tika™ toolkit detects and extracts metadata and structured text content from various documents using existing parser libraries. 
- The Apache Software Foundation - http://www.apache.org + The Apache Software Foundation + http://www.apache.org - JIRA - https://issues.apache.org/jira/browse/TIKA + JIRA + https://issues.apache.org/jira/browse/TIKA - Jenkins - https://builds.apache.org/job/Tika-trunk/ + Jenkins + https://builds.apache.org/job/Tika-trunk/ diff --git a/src/site/apt/detection.apt b/src/site/apt/detection.apt new file mode 100644 index 0000000..4fd9883 --- /dev/null +++ b/src/site/apt/detection.apt @@ -0,0 +1,152 @@ + ----------------- + Content Detection + ----------------- + +~~ Licensed to the Apache Software Foundation (ASF) under one or more +~~ contributor license agreements. See the NOTICE file distributed with +~~ this work for additional information regarding copyright ownership. +~~ The ASF licenses this file to You under the Apache License, Version 2.0 +~~ (the "License"); you may not use this file except in compliance with +~~ the License. You may obtain a copy of the License at +~~ +~~ http://www.apache.org/licenses/LICENSE-2.0 +~~ +~~ Unless required by applicable law or agreed to in writing, software +~~ distributed under the License is distributed on an "AS IS" BASIS, +~~ WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. +~~ See the License for the specific language governing permissions and +~~ limitations under the License. + +Content Detection + + This page gives you information on how content and language detection + works with Apache Tika, and how to tune the behaviour of Tika. + +%{toc|section=1|fromDepth=1} + +* {The Detector Interface} + + The + {{{./api/org/apache/tika/detect/Detector.html}org.apache.tika.detect.Detector}} + interface is the basis for most of the content type detection in Apache + Tika. 
All the different ways of detecting content implement the + same common method: + +--- +MediaType detect(java.io.InputStream input, + Metadata metadata) throws java.io.IOException +--- + + The <<>> method takes the stream to inspect, and a + <<>> object that holds any additional information on + the content. The detector will return a + {{{./api/org/apache/tika/mime/MediaType.html}MediaType}} object describing + its best guess as to the type of the file. + + In general, only two keys on the Metadata object are used by Detectors. + These are <<>> which should hold the name + of the file (where known), and <<>> which should + hold the advertised content type of the file (e.g. from a webserver or + a content repository). + + +* {Mime Magic Detection} + + By looking for special ("magic") patterns of bytes near the start of + the file, it is often possible to detect the type of the file. For + some file types, this is a simple process. For others, typically + container based formats, the magic detection may not be enough. (More + detail on detecting container formats below) + + Tika is able to make use of a mime magic info file, in the + {{{http://www.freedesktop.org/standards/shared-mime-info}Freedesktop MIME-info}} + format to perform mime magic detection. + + This is provided within Tika by + {{{./api/org/apache/tika/detect/MagicDetector.html}org.apache.tika.detect.MagicDetector}}. It is most commonly accessed via + {{{./api/org/apache/tika/mime/MimeTypes.html}org.apache.tika.mime.MimeTypes}}, + normally sourced from the <<>> file. + + +* {Resource Name Based Detection} + + Where the name of the file is known, it is sometimes possible to guess + the file type from the name or extension. Within the + <<>> file is a list of patterns which are used to + identify the type from the filename. + + However, because files may be renamed, this method of detection is quick + but not always accurate.
+ + This is provided within Tika by + {{{./api/org/apache/tika/detect/NameDetector.html}org.apache.tika.detect.NameDetector}}. + + +* {Known Content Type Detection} + + Sometimes, the mime type for a file is already known, such as when + downloading from a webserver, or when retrieving from a content store. + This information can be used by detectors, such as + {{{./api/org/apache/tika/mime/MimeTypes.html}org.apache.tika.mime.MimeTypes}}. + + +* {The default Mime Types Detector} + + By default, the mime type detection in Tika is provided by + {{{./api/org/apache/tika/mime/MimeTypes.html}org.apache.tika.mime.MimeTypes}}. + This detector makes use of <<>> to power + magic based and filename based detection. + + Firstly, magic based detection is used on the start of the file. + If the file is an XML file, then the start of the XML is processed + to look for root elements. Next, if available, the filename + (from <<>>) is + then used to improve the detail of the detection, such as when magic + detects a text file, and the filename hints it's really a CSV. Finally, + if available, the supplied content type (from <<>>) + is used to further refine the type. + + +* {Container Aware Detection} + + Several common file formats are actually held within a common container + format. One example is the PowerPoint .ppt and Word .doc formats, which + are both held within an OLE2 container. Another is the Apple iWork formats, + which are actually a series of XML files within a Zip file. + + Using magic detection, it is easy to spot that a given file is an OLE2 + document, or a Zip file. Using magic detection alone, it is very difficult + (and often impossible) to tell what kind of file lives inside the container. + + For some use cases, speed is important, so having a quick way to know the + container type is sufficient. For other cases, however, you don't mind + spending a bit of time (and memory!) processing the container to get a + more accurate answer on its contents.
For these cases, a container + aware detector should be used. + + Tika provides a wrapping detector in the parsers bundle: + {{{./api/org/apache/tika/detect/ContainerAwareDetector.html}org.apache.tika.detect.ContainerAwareDetector}}. + This detector will check for certain known containers, and if found, + will open them and detect the appropriate type based on the contents. + If the file isn't a known container, it will fall back to another + detector for the answer (most commonly the default + <<>> detector). + + Because this detector needs to read the whole file to process the + container, it must be used with a + {{{./api/org/apache/tika/io/TikaInputStream.html}org.apache.tika.io.TikaInputStream}}. + If called with a regular <<>>, then all work will be done + by the fallback detector. + + For more information on container formats and Tika, see + {{{http://wiki.apache.org/tika/MetadataDiscussion}}} + + +* {Language Detection} + + Tika is able to help identify the language of a piece of text, which + is useful when extracting text from document formats which do not include + language information in their metadata. + + The language detection is provided by + {{{./api/org/apache/tika/language/LanguageIdentifier.html}org.apache.tika.language.LanguageIdentifier}} diff --git a/src/site/apt/formats.apt b/src/site/apt/formats.apt new file mode 100644 index 0000000..01086cc --- /dev/null +++ b/src/site/apt/formats.apt @@ -0,0 +1,145 @@ + -------------------------- + Supported Document Formats + -------------------------- + +~~ Licensed to the Apache Software Foundation (ASF) under one or more +~~ contributor license agreements. See the NOTICE file distributed with +~~ this work for additional information regarding copyright ownership. +~~ The ASF licenses this file to You under the Apache License, Version 2.0 +~~ (the "License"); you may not use this file except in compliance with +~~ the License.
You may obtain a copy of the License at +~~ +~~ http://www.apache.org/licenses/LICENSE-2.0 +~~ +~~ Unless required by applicable law or agreed to in writing, software +~~ distributed under the License is distributed on an "AS IS" BASIS, +~~ WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. +~~ See the License for the specific language governing permissions and +~~ limitations under the License. + +Supported Document Formats + + This page lists all the document formats supported by Apache Tika 0.6. + Follow the links to the various parser class javadocs for more detailed + information about each document format and how it is parsed by Tika. + +%{toc|section=1|fromDepth=1} + +* {HyperText Markup Language} + + The HyperText Markup Language (HTML) is the lingua franca of the web. + Tika uses the {{{http://home.ccil.org/~cowan/XML/tagsoup/}TagSoup}} + library to support virtually any kind of HTML found on the web. + The output from the + {{{api/org/apache/tika/parser/html/HtmlParser.html}HtmlParser}} class + is guaranteed to be well-formed and valid XHTML, and various heuristics + are used to prevent things like inline scripts from cluttering the + extracted text content. + +* {XML and derived formats} + + The Extensible Markup Language (XML) format is a generic format that can + be used for all kinds of content. Tika has custom parsers for some widely + used XML vocabularies like XHTML, OOXML and ODF, but the default + {{{api/org/apache/tika/parser/xml/DcXMLParser.html}DcXMLParser}} + class simply extracts the text content of the document and ignores any XML + structure. The only exception to this rule is Dublin Core metadata + elements, which are used for the document metadata. + +* {Microsoft Office document formats} + + Microsoft Office and some related applications produce documents in the + generic OLE 2 Compound Document and Office Open XML (OOXML) formats.
The + older OLE 2 format was introduced in Microsoft Office version 97 and was + the default format until Office 2007 introduced the new XML-based + OOXML format. The + {{{api/org/apache/tika/parser/microsoft/OfficeParser.html}OfficeParser}} + and + {{{api/org/apache/tika/parser/microsoft/ooxml/OOXMLParser.html}OOXMLParser}} + classes use {{{http://poi.apache.org/}Apache POI}} libraries to support + text and metadata extraction from both OLE2 and OOXML documents. + +* {OpenDocument Format} + + The OpenDocument format (ODF) is used most notably as the default format + of the OpenOffice.org office suite. The + {{{api/org/apache/tika/parser/odf/OpenDocumentParser.html}OpenDocumentParser}} + class supports this format and the earlier OpenOffice 1.0 format on which + ODF is based. + +* {Portable Document Format} + + The {{{api/org/apache/tika/parser/pdf/PDFParser.html}PDFParser}} class + parses Portable Document Format (PDF) documents using the + {{{http://pdfbox.apache.org/}Apache PDFBox}} library. + +* {Electronic Publication Format} + + The {{{api/org/apache/tika/parser/epub/EpubParser.html}EpubParser}} class + supports the Electronic Publication Format (EPUB) used for many digital + books. + +* {Rich Text Format} + + The {{{api/org/apache/tika/parser/rtf/RTFParser.html}RTFParser}} class + uses the standard javax.swing.text.rtf feature to extract text content + from Rich Text Format (RTF) documents. + +* {Compression and packaging formats} + + Tika uses the {{{http://commons.apache.org/compress/}Commons Compress}} + library to support various compression and packaging formats. The + {{{api/org/apache/tika/parser/pkg/PackageParser.html}PackageParser}} + class and its subclasses first parse the top level compression or + packaging format and then pass the unpacked document streams to a + second parsing stage using the parser instance specified in the + parse context.
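The two-stage flow described above (parse the container first, then hand each unpacked entry stream to the second-stage parser taken from the parse context) can be illustrated with a pure-JDK sketch. This uses java.util.zip as a stand-in for Commons Compress and a trivial text-collecting second stage instead of a real Tika Parser, so it shows the shape of the approach rather than Tika's actual implementation:

```java
import java.io.ByteArrayInputStream;
import java.io.ByteArrayOutputStream;
import java.io.IOException;
import java.io.InputStream;
import java.util.zip.ZipEntry;
import java.util.zip.ZipInputStream;
import java.util.zip.ZipOutputStream;

// Stage one walks the container; stage two receives each unpacked entry
// stream. In Tika the second stage is the Parser registered in ParseContext.
public class TwoStageSketch {

    // Hypothetical second stage: just collects the entry's bytes as text.
    static String secondStage(InputStream entry) throws IOException {
        ByteArrayOutputStream out = new ByteArrayOutputStream();
        byte[] buf = new byte[4096];
        for (int n; (n = entry.read(buf)) != -1; ) {
            out.write(buf, 0, n);
        }
        return out.toString("UTF-8");
    }

    public static void main(String[] args) throws IOException {
        // Build a tiny zip in memory so the example is self-contained
        // (Java 7 try-with-resources used for brevity).
        ByteArrayOutputStream bytes = new ByteArrayOutputStream();
        try (ZipOutputStream zip = new ZipOutputStream(bytes)) {
            zip.putNextEntry(new ZipEntry("a.txt"));
            zip.write("alpha".getBytes("UTF-8"));
            zip.closeEntry();
        }

        // Stage one: iterate the container, delegating each entry.
        try (ZipInputStream zip =
                 new ZipInputStream(new ByteArrayInputStream(bytes.toByteArray()))) {
            for (ZipEntry e; (e = zip.getNextEntry()) != null; ) {
                System.out.println(e.getName() + ": " + secondStage(zip));
            }
        }
    }
}
```

The key design point mirrored here is that the container walker never needs to know how entries are parsed; it only forwards streams to whatever second stage was configured.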
+ +* {Text formats} + + Extracting text content from plain text files seems like a simple task + until you start thinking of all the possible character encodings. The + {{{api/org/apache/tika/parser/txt/TXTParser.html}TXTParser}} class uses + encoding detection code from the {{{http://site.icu-project.org/}ICU}} + project to automatically detect the character encoding of a text document. + +* {Audio formats} + + Tika can detect several common audio formats and extract metadata + from them. Even text extraction is supported for some audio files that + contain lyrics or other textual content. The + {{{api/org/apache/tika/parser/audio/AudioParser.html}AudioParser}} + and {{{api/org/apache/tika/parser/audio/MidiParser.html}MidiParser}} + classes use standard javax.sound features to process simple audio + formats, and the + {{{api/org/apache/tika/parser/mp3/Mp3Parser.html}Mp3Parser}} class + adds support for the widely used MP3 format. + +* {Image formats} + + The {{{api/org/apache/tika/parser/image/ImageParser.html}ImageParser}} + class uses the standard javax.imageio feature to extract simple metadata + from image formats supported by the Java platform. More complex image + metadata is available through the + {{{api/org/apache/tika/parser/jpeg/JpegParser.html}JpegParser}} class + that uses the metadata-extractor library to support Exif metadata + extraction from Jpeg images. + +* {Video formats} + + Currently Tika only supports the Flash video format using a simple + parsing algorithm implemented in the + {{{api/org/apache/tika/parser/flv/FLVParser}FLVParser}} class. + +* {Java class files and archives} + + The {{{api/org/apache/tika/parser/asm/ClassParser}ClassParser}} class + extracts class names and method signatures from Java class files, and + the {{{api/org/apache/tika/parser/pkg/ZipParser.html}ZipParser}} class + also supports jar archives.
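Class files also make a tiny, concrete illustration of the magic-byte detection discussed on the detection page, because the Java class file format begins with the fixed 0xCAFEBABE header. The following pure-JDK sketch is not Tika code; Tika's MagicDetector generalizes the same idea to a whole database of patterns:

```java
import java.io.ByteArrayInputStream;
import java.io.DataInputStream;
import java.io.EOFException;
import java.io.IOException;
import java.io.InputStream;

// Sniff a stream for the Java class file magic number. Real detectors
// check many patterns at many offsets; this checks exactly one.
public class ClassFileSniffer {

    static boolean isClassFile(InputStream in) throws IOException {
        DataInputStream data = new DataInputStream(in);
        try {
            // The first four bytes of every class file are 0xCAFEBABE.
            return data.readInt() == 0xCAFEBABE;
        } catch (EOFException tooShort) {
            return false;
        }
    }

    public static void main(String[] args) throws IOException {
        byte[] fakeClass = {
            (byte) 0xCA, (byte) 0xFE, (byte) 0xBA, (byte) 0xBE, 0, 0, 0, 52
        };
        System.out.println(isClassFile(new ByteArrayInputStream(fakeClass)));          // true
        System.out.println(isClassFile(new ByteArrayInputStream("PK".getBytes())));    // false
    }
}
```

Note that magic detection like this consumes the front of the stream, which is one reason Tika's detectors work against streams that support mark/reset.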
+ +* {The mbox format} + + The {{{api/org/apache/tika/parser/mbox/MboxParser.html}MboxParser}} can + extract email messages from the mbox format used by many email archives + and Unix-style mailboxes. diff --git a/src/site/apt/gettingstarted.apt b/src/site/apt/gettingstarted.apt new file mode 100644 index 0000000..61c9a0e --- /dev/null +++ b/src/site/apt/gettingstarted.apt @@ -0,0 +1,208 @@ + -------------------------------- + Getting Started with Apache Tika + -------------------------------- + +~~ Licensed to the Apache Software Foundation (ASF) under one or more +~~ contributor license agreements. See the NOTICE file distributed with +~~ this work for additional information regarding copyright ownership. +~~ The ASF licenses this file to You under the Apache License, Version 2.0 +~~ (the "License"); you may not use this file except in compliance with +~~ the License. You may obtain a copy of the License at +~~ +~~ http://www.apache.org/licenses/LICENSE-2.0 +~~ +~~ Unless required by applicable law or agreed to in writing, software +~~ distributed under the License is distributed on an "AS IS" BASIS, +~~ WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. +~~ See the License for the specific language governing permissions and +~~ limitations under the License. + +Getting Started with Apache Tika + + This document describes how to build Apache Tika from sources and + how to start using Tika in an application. + +Getting and building the sources + + To build Tika from sources you first need to either + {{{../download.html}download}} a source release or + {{{../source-repository.html}checkout}} the latest sources from + version control. + + Once you have the sources, you can build them using the + {{{http://maven.apache.org/}Maven 2}} build system. Executing the + following command in the base directory will build the sources + and install the resulting artifacts in your local Maven repository. 
+ +--- +mvn install +--- + + See the Maven documentation for more information about the available + build options. + + Note that you need Java 5 or higher to build Tika. + +Build artifacts + + The Tika build consists of a number of components and produces + the following main binaries: + + [tika-core/target/tika-core-*.jar] + Tika core library. Contains the core interfaces and classes of Tika, + but none of the parser implementations. Depends only on Java 5. + + [tika-parsers/target/tika-parsers-*.jar] + Tika parsers. Collection of classes that implement the Tika Parser + interface based on various external parser libraries. + + [tika-app/target/tika-app-*.jar] + Tika application. Combines the above components and all the external + parser libraries into a single runnable jar with a GUI and a command + line interface. + + [tika-bundle/target/tika-bundle-*.jar] + Tika bundle. An OSGi bundle that combines tika-parsers with non-OSGified + parser libraries to make them easy to deploy in an OSGi environment. + +Using Tika as a Maven dependency + + The core library, tika-core, contains the key interfaces and classes of Tika + and can be used by itself if you don't need the full set of parsers from + the tika-parsers component. The tika-core dependency looks like this: + +--- + <dependency> + <groupId>org.apache.tika</groupId> + <artifactId>tika-core</artifactId> + <version>...</version> + </dependency> +--- + + If you want to use Tika to parse documents (instead of simply detecting + document types, etc.), you'll want to depend on tika-parsers instead: + +--- + <dependency> + <groupId>org.apache.tika</groupId> + <artifactId>tika-parsers</artifactId> + <version>...</version> + </dependency> +--- + + Note that adding this dependency will introduce a number of + transitive dependencies to your project, including one on tika-core. + You need to make sure that these dependencies won't conflict with your + existing project dependencies. You can use the following command in + the tika-parsers directory to get a full listing of all the dependencies.
+ +--- +$ mvn dependency:tree | grep :compile +--- + +Using Tika in an Ant project + + Unless you use a dependency manager tool like + {{{http://ant.apache.org/ivy/}Apache Ivy}}, the easiest way to use + Tika is to include either the tika-core or the tika-app jar in your + classpath, depending on whether you want just the core functionality + or also all the parser implementations. + +--- + + ... + + + + + + + +--- + +Using Tika as a command line utility + + The Tika application jar (tika-app-*.jar) can be used as a command + line utility for extracting text content and metadata from all sorts of + files. This runnable jar contains all the dependencies it needs, so + you don't need to worry about classpath settings to run it. + + The usage instructions are shown below. + +--- +usage: java -jar tika-app.jar [option...] [file|port...] + +Options: + -? or --help Print this usage message + -v or --verbose Print debug level messages + -V or --version Print the Apache Tika version number + + -g or --gui Start the Apache Tika GUI + -s or --server Start the Apache Tika server + -f or --fork Use Fork Mode for out-of-process extraction + + -x or --xml Output XHTML content (default) + -h or --html Output HTML content + -t or --text Output plain text content + -T or --text-main Output plain text content (main content only) + -m or --metadata Output only metadata + -j or --json Output metadata in JSON + -y or --xmp Output metadata in XMP + -l or --language Output only language + -d or --detect Detect document type + -eX or --encoding=X Use output encoding X + -pX or --password=X Use document password X + -z or --extract Extract all attachements into current directory + --extract-dir= Specify target directory for -z + -r or --pretty-print For XML and XHTML outputs, adds newlines and + whitespace, for better readability + + --create-profile=X + Create NGram profile, where X is a profile name + --list-parsers + List the available document parsers + --list-parser-details + List the 
available document parsers, and their supported mime types + --list-detectors + List the available document detectors + --list-met-models + List the available metadata models, and their supported keys + --list-supported-types + List all known media types and related information + +Description: + Apache Tika will parse the file(s) specified on the + command line and output the extracted text content + or metadata to standard output. + + Instead of a file name you can also specify the URL + of a document to be parsed. + + If no file name or URL is specified (or the special + name "-" is used), then the standard input stream + is parsed. If no arguments were given and no input + data is available, the GUI is started instead. + +- GUI mode + + Use the "--gui" (or "-g") option to start the + Apache Tika GUI. You can drag and drop files from + a normal file explorer to the GUI window to extract + text content and metadata from the files. + +- Server mode + + Use the "--server" (or "-s") option to start the + Apache Tika server. The server will listen to the + ports you specify as one or more arguments. +--- + + You can also use the jar as a component in a Unix pipeline or + as an external tool in many scripting languages. + +--- +# Check if an Internet resource contains a specific keyword +curl http://.../document.doc \ + | java -jar tika-app.jar --text \ + | grep -q keyword +--- diff --git a/src/site/apt/index.apt b/src/site/apt/index.apt new file mode 100644 index 0000000..3abaef2 --- /dev/null +++ b/src/site/apt/index.apt @@ -0,0 +1,31 @@ + --------------- + Apache Tika 1.3 + --------------- + +~~ Licensed to the Apache Software Foundation (ASF) under one or more +~~ contributor license agreements. See the NOTICE file distributed with +~~ this work for additional information regarding copyright ownership. +~~ The ASF licenses this file to You under the Apache License, Version 2.0 +~~ (the "License"); you may not use this file except in compliance with +~~ the License. 
You may obtain a copy of the License at +~~ +~~ http://www.apache.org/licenses/LICENSE-2.0 +~~ +~~ Unless required by applicable law or agreed to in writing, software +~~ distributed under the License is distributed on an "AS IS" BASIS, +~~ WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. +~~ See the License for the specific language governing permissions and +~~ limitations under the License. + +Apache Tika 1.3 + + The most notable changes in Tika 1.3 over the previous release are: + + * TBD + + The following people have contributed to Tika 1.3 by submitting or + commenting on the issues resolved in this release: + + * TBD + + See TBD for more details on these contributions. diff --git a/src/site/apt/parser.apt b/src/site/apt/parser.apt new file mode 100644 index 0000000..9f35aaa --- /dev/null +++ b/src/site/apt/parser.apt @@ -0,0 +1,245 @@ + -------------------- + The Parser interface + -------------------- + +~~ Licensed to the Apache Software Foundation (ASF) under one or more +~~ contributor license agreements. See the NOTICE file distributed with +~~ this work for additional information regarding copyright ownership. +~~ The ASF licenses this file to You under the Apache License, Version 2.0 +~~ (the "License"); you may not use this file except in compliance with +~~ the License. You may obtain a copy of the License at +~~ +~~ http://www.apache.org/licenses/LICENSE-2.0 +~~ +~~ Unless required by applicable law or agreed to in writing, software +~~ distributed under the License is distributed on an "AS IS" BASIS, +~~ WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. +~~ See the License for the specific language governing permissions and +~~ limitations under the License. + +The Parser interface + + The + {{{api/org/apache/tika/parser/Parser.html}org.apache.tika.parser.Parser}} + interface is the key concept of Apache Tika. 
It hides the complexity of + different file formats and parsing libraries while providing a simple and + powerful mechanism for client applications to extract structured text + content and metadata from all sorts of documents. All this is achieved + with a single method: + +--- +void parse( + InputStream stream, ContentHandler handler, Metadata metadata, + ParseContext context) throws IOException, SAXException, TikaException; +--- + + The <<>> method takes the document to be parsed and related metadata + as input and outputs the results as XHTML SAX events and extra metadata. + The parse context argument is used to specify context information (like + the current locale) that is not related to any individual document. + The main criteria that led to this design were: + + [Streamed parsing] The interface should require neither the client + application nor the parser implementation to keep the full document + content in memory or spooled to disk. This allows even huge documents + to be parsed without excessive resource requirements. + + [Structured content] A parser implementation should be able to + include structural information (headings, links, etc.) in the extracted + content. A client application can use this information for example to + better judge the relevance of different parts of the parsed document. + + [Input metadata] A client application should be able to include metadata + like the file name or declared content type with the document to be + parsed. The parser implementation can use this information to better + guide the parsing process. + + [Output metadata] A parser implementation should be able to return + document metadata in addition to document content. Many document + formats contain metadata like the name of the author that may be useful + to client applications.
+ + [Context sensitivity] While the default settings and behaviour of Tika + parsers should work well for most use cases, there are still situations + where more fine-grained control over the parsing process is desirable. + It should be easy to inject such context-specific information into the + parsing process without breaking the layers of abstraction. + + [] + + These criteria are reflected in the arguments of the <<<parse()>>> method. + +* Document input stream + + The first argument is an + {{{http://java.sun.com/j2se/1.5.0/docs/api/java/io/InputStream.html}InputStream}} + for reading the document to be parsed. + + If this document stream cannot be read, then parsing stops and the thrown + {{{http://java.sun.com/j2se/1.5.0/docs/api/java/io/IOException.html}IOException}} + is passed up to the client application. If the stream can be read but + not parsed (for example if the document is corrupted), then the parser + throws a {{{api/org/apache/tika/exception/TikaException.html}TikaException}}. + + The parser implementation will consume this stream but will not close it. + Closing the stream is the responsibility of the client application that + opened it in the first place. The recommended pattern for using streams + with the <<<parse()>>> method is: + +--- +InputStream stream = ...; // open the stream +try { + parser.parse(stream, ...); // parse the stream +} finally { + stream.close(); // close the stream +} +--- + + Some document formats, like the OLE2 Compound Document Format used by + Microsoft Office, are best parsed as random access files. In such cases the + content of the input stream is automatically spooled to a temporary file + that gets removed once parsed. A future version of Tika may make it possible + to avoid this extra file if the input document is already a file in the + local file system. See + {{{https://issues.apache.org/jira/browse/TIKA-153}TIKA-153}} for the status + of this feature request.
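Filled out with concrete arguments, the stream-handling pattern above might look like the following. This is only a sketch: it assumes the AutoDetectParser and BodyContentHandler classes discussed elsewhere in this document, and a file path supplied by the caller.

```java
import java.io.FileInputStream;
import java.io.InputStream;

import org.apache.tika.metadata.Metadata;
import org.apache.tika.parser.AutoDetectParser;
import org.apache.tika.parser.ParseContext;
import org.apache.tika.parser.Parser;
import org.apache.tika.sax.BodyContentHandler;
import org.xml.sax.ContentHandler;

public class ParseExample {
    public static void main(String[] args) throws Exception {
        Parser parser = new AutoDetectParser();      // any Parser implementation works here
        InputStream stream = new FileInputStream(args[0]); // open the stream
        try {
            ContentHandler handler = new BodyContentHandler();
            Metadata metadata = new Metadata();
            parser.parse(stream, handler, metadata, new ParseContext());
            System.out.println(handler.toString());  // the extracted plain text
        } finally {
            stream.close();                          // the client, not the parser, closes the stream
        }
    }
}
```

Note that the try/finally block belongs to the client: the parser consumes the stream but leaves closing it to whoever opened it.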
+ +* XHTML SAX events + + The parsed content of the document stream is returned to the client + application as a sequence of XHTML SAX events. XHTML is used to express + the structured content of the document and SAX events enable streamed + processing. Note that the XHTML format is used here only to convey + structural information, not to render the documents for browsing! + + The XHTML SAX events produced by the parser implementation are sent to a + {{{http://java.sun.com/j2se/1.5.0/docs/api/org/xml/sax/ContentHandler.html}ContentHandler}} + instance given to the <<<parse()>>> method. If the content handler + fails to process an event, then parsing stops and the thrown + {{{http://java.sun.com/j2se/1.5.0/docs/api/org/xml/sax/SAXException.html}SAXException}} + is passed up to the client application. + + The overall structure of the generated event stream is (with indenting + added for clarity): + +--- +<html xmlns="http://www.w3.org/1999/xhtml"> + <head> + <title>...</title> + </head> + <body> + ... + </body> +</html> +--- + + Parser implementations typically use the + {{{apidocs/org/apache/tika/sax/XHTMLContentHandler.html}XHTMLContentHandler}} + utility class to generate the XHTML output. + + Dealing with the raw SAX events can be a bit complex, so Apache Tika + comes with a number of utility classes that can be used to process and + convert the event stream to other representations. + + For example, the + {{{api/org/apache/tika/sax/BodyContentHandler.html}BodyContentHandler}} + class can be used to extract just the body part of the XHTML output and + feed it either as SAX events to another content handler or as characters + to an output stream, a writer, or simply a string.
The following code + snippet parses a document from the standard input stream and outputs the + extracted text content to standard output: + +--- +ContentHandler handler = new BodyContentHandler(System.out); +parser.parse(System.in, handler, ...); +--- + + Another useful class is + {{{api/org/apache/tika/parser/ParsingReader.html}ParsingReader}}, which + uses a background thread to parse the document and returns the extracted + text content as a character stream: + +--- +InputStream stream = ...; // the document to be parsed +Reader reader = new ParsingReader(parser, stream, ...); +try { + ...; // read the document text using the reader +} finally { + reader.close(); // the document stream is closed automatically +} +--- + +* Document metadata + + The third argument to the <<<parse()>>> method is used to pass document + metadata both in and out of the parser. Document metadata is expressed + as a {{{api/org/apache/tika/metadata/Metadata.html}Metadata}} object. + + The following are some of the more interesting metadata properties: + + [Metadata.RESOURCE_NAME_KEY] The name of the file or resource that contains + the document. + + A client application can set this property to allow the parser to use + file name heuristics to determine the format of the document. + + The parser implementation may set this property if the file format + contains the canonical name of the file (for example the Gzip format + has a slot for the file name). + + [Metadata.CONTENT_TYPE] The declared content type of the document. + + A client application can set this property based on, for example, an HTTP + Content-Type header. The declared content type may help the parser to + correctly interpret the document. + + The parser implementation sets this property to the content type according + to which the document was parsed. + + [Metadata.TITLE] The title of the document. + + The parser implementation sets this property if the document format + contains an explicit title field.
+ + [Metadata.AUTHOR] The name of the author of the document. + + The parser implementation sets this property if the document format + contains an explicit author field. + + [] + + Note that metadata handling is still being discussed by the Tika development + team, and it is likely that there will be some (backwards incompatible) + changes in metadata handling before Tika 1.0. + +* Parse context + + The final argument to the <<<parse()>>> method is used to inject + context-specific information into the parsing process. This is useful, + for example, when dealing with locale-specific date and number formats + in Microsoft Excel spreadsheets. Another important use of the parse + context is passing in the delegate parser instance to be used by + two-phase parsers like the + {{{api/org/apache/tika/parser/pkg/PackageParser.html}PackageParser}} subclasses. + Some parser classes allow customization of the parsing process through + strategy objects in the parse context. + +* Parser implementations + + Apache Tika comes with a number of parser classes for parsing + {{{formats.html}various document formats}}. You can also extend Tika + with your own parsers, and of course any contributions to Tika are + warmly welcome. + + The goal of Tika is to reuse existing parser libraries like + {{{http://www.pdfbox.org/}PDFBox}} or + {{{http://poi.apache.org/}Apache POI}} as much as possible, and so most + of the parser classes in Tika are adapters to such external libraries. + + Tika also contains some general purpose parser implementations that are + not targeted at any specific document formats. The most notable of these + is the {{{apidocs/org/apache/tika/parser/AutoDetectParser.html}AutoDetectParser}} + class that encapsulates all Tika functionality into a single parser that + can handle any type of document. This parser will automatically determine + the type of the incoming document based on various heuristics and will then + parse the document accordingly.
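To illustrate the auto-detection described above, a client might let AutoDetectParser pick the format and then read the detected type back from the metadata. This is a sketch only: it assumes a Tika dependency on the classpath and a file path supplied on the command line, and the resource-name hint is optional.

```java
import java.io.FileInputStream;
import java.io.InputStream;

import org.apache.tika.metadata.Metadata;
import org.apache.tika.parser.AutoDetectParser;
import org.apache.tika.parser.ParseContext;
import org.apache.tika.sax.BodyContentHandler;

public class AutoDetectExample {
    public static void main(String[] args) throws Exception {
        AutoDetectParser parser = new AutoDetectParser();
        Metadata metadata = new Metadata();
        // Optional hint: the resource name lets the parser use file name heuristics.
        metadata.set(Metadata.RESOURCE_NAME_KEY, args[0]);
        InputStream stream = new FileInputStream(args[0]);
        try {
            parser.parse(stream, new BodyContentHandler(), metadata, new ParseContext());
        } finally {
            stream.close(); // closing the stream remains the client's job
        }
        // The parser records the content type according to which the document was parsed.
        System.out.println(metadata.get(Metadata.CONTENT_TYPE));
    }
}
```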
diff --git a/src/site/apt/parser_guide.apt b/src/site/apt/parser_guide.apt new file mode 100644 index 0000000..778d50c --- /dev/null +++ b/src/site/apt/parser_guide.apt @@ -0,0 +1,135 @@ + -------------------------------------------- + Get Tika parsing up and running in 5 minutes + -------------------------------------------- + Arturo Beltran + -------------------------------------------- + +~~ Licensed to the Apache Software Foundation (ASF) under one or more +~~ contributor license agreements.  See the NOTICE file distributed with +~~ this work for additional information regarding copyright ownership. +~~ The ASF licenses this file to You under the Apache License, Version 2.0 +~~ (the "License"); you may not use this file except in compliance with +~~ the License.  You may obtain a copy of the License at +~~ +~~      http://www.apache.org/licenses/LICENSE-2.0 +~~ +~~ Unless required by applicable law or agreed to in writing, software +~~ distributed under the License is distributed on an "AS IS" BASIS, +~~ WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. +~~ See the License for the specific language governing permissions and +~~ limitations under the License. + +Get Tika parsing up and running in 5 minutes + + This page is a quick start guide showing how to add a new parser to Apache Tika. + By following the simple steps listed below, your new parser can be up and running in only 5 minutes. + +%{toc|section=1|fromDepth=1} + +* {Getting Started} + + The {{{gettingstarted.html}Getting Started}} document describes how to + build Apache Tika from sources and how to start using Tika in an application. Pay close attention + and follow the instructions in the "Getting and building the sources" section.
+ + +* {Add your MIME-Type} + + You first need to modify {{{http://svn.apache.org/repos/asf/tika/trunk/tika-core/src/main/resources/org/apache/tika/mime/tika-mimetypes.xml}tika-core/src/main/resources/org/apache/tika/mime/tika-mimetypes.xml}} + so that Tika can map the file extension to its MIME type. You should add something like this: + +--- + <mime-type type="application/hello"> + <glob pattern="*.hi"/> + </mime-type> +--- + +* {Create your Parser class} + + Now, you need to create your new parser. This is a class that must implement the Parser interface + offered by Tika. A very simple Tika Parser looks like this: + +--- +/* + * Licensed to the Apache Software Foundation (ASF) under one or more + * contributor license agreements.  See the NOTICE file distributed with + * this work for additional information regarding copyright ownership. + * The ASF licenses this file to You under the Apache License, Version 2.0 + * (the "License"); you may not use this file except in compliance with + * the License.  You may obtain a copy of the License at + * + *     http://www.apache.org/licenses/LICENSE-2.0 + * + * Unless required by applicable law or agreed to in writing, software + * distributed under the License is distributed on an "AS IS" BASIS, + * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. + * See the License for the specific language governing permissions and + * limitations under the License.
+ * + * @Author: Arturo Beltran + */ +package org.apache.tika.parser.hello; + +import java.io.IOException; +import java.io.InputStream; +import java.util.Collections; +import java.util.Set; + +import org.apache.tika.exception.TikaException; +import org.apache.tika.metadata.Metadata; +import org.apache.tika.mime.MediaType; +import org.apache.tika.parser.ParseContext; +import org.apache.tika.parser.Parser; +import org.apache.tika.sax.XHTMLContentHandler; +import org.xml.sax.ContentHandler; +import org.xml.sax.SAXException; + +public class HelloParser implements Parser { + + private static final Set<MediaType> SUPPORTED_TYPES = Collections.singleton(MediaType.application("hello")); + public static final String HELLO_MIME_TYPE = "application/hello"; + + public Set<MediaType> getSupportedTypes(ParseContext context) { + return SUPPORTED_TYPES; + } + + public void parse( + InputStream stream, ContentHandler handler, + Metadata metadata, ParseContext context) + throws IOException, SAXException, TikaException { + + metadata.set(Metadata.CONTENT_TYPE, HELLO_MIME_TYPE); + metadata.set("Hello", "World"); + + XHTMLContentHandler xhtml = new XHTMLContentHandler(handler, metadata); + xhtml.startDocument(); + xhtml.endDocument(); + } + + /** + * @deprecated This method will be removed in Apache Tika 1.0. + */ + public void parse( + InputStream stream, ContentHandler handler, Metadata metadata) + throws IOException, SAXException, TikaException { + parse(stream, handler, metadata, new ParseContext()); + } +} +--- + + Pay special attention to the definition of the SUPPORTED_TYPES static class + field in the parser class, which defines the MIME types it supports. + + It is in the "parse" method where you will do all your work: that is, extract + the information from the resource and then set the metadata. + +* {List the new parser} + + Finally, you should explicitly tell the AutoDetectParser to include your new + parser. This step is only needed if you want to use the AutoDetectParser functionality.
If you select the correct parser in a different way, this step isn't needed. + + List your new parser in: + {{{http://svn.apache.org/repos/asf/tika/trunk/tika-parsers/src/main/resources/META-INF/services/org.apache.tika.parser.Parser}tika-parsers/src/main/resources/META-INF/services/org.apache.tika.parser.Parser}} + + diff --git a/src/site/resources/css/site.css b/src/site/resources/css/site.css new file mode 100644 index 0000000..0d3b63b --- /dev/null +++ b/src/site/resources/css/site.css @@ -0,0 +1,324 @@ +/* + * Licensed to the Apache Software Foundation (ASF) under one + * or more contributor license agreements.  See the NOTICE file + * distributed with this work for additional information + * regarding copyright ownership.  The ASF licenses this file + * to you under the Apache License, Version 2.0 (the + * "License"); you may not use this file except in compliance + * with the License.  You may obtain a copy of the License at + * + *   http://www.apache.org/licenses/LICENSE-2.0 + * + * Unless required by applicable law or agreed to in writing, + * software distributed under the License is distributed on an + * "AS IS" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY + * KIND, either express or implied.  See the License for the + * specific language governing permissions and limitations + * under the License.
+ */ + +#search { + position: relative; + right: 10px; + width: 100%; + font-size: 70%; + white-space: nowrap; + text-align: right; + z-index:0; + + bottom: -1px; /* compensate for IE rendering issue */ +} + +#bookpromo { + position: relative; + top: 35px; + left: 10px; + width: 100%; + white-space: nowrap; + text-align: center; + z-index:0; + bottom: -1px; +} + +#searchform { +} + +body { + margin: 0px; + padding: 0px 0px 10px 0px; +} + +/* From maven-theme.css */ + +body, td, select, input, li { + font-family: Verdana, Helvetica, Arial, sans-serif; + font-size: 13px; +} + +code{ + font-family: Courier, monospace; + font-size: 13px; +} +a { + text-decoration: none; +} +a:link { + color:#36a; +} +a:visited { + color:#47a; +} +a:active, a:hover { + color:#69c; +} +#legend li.externalLink { + background: url(../images/external.png) left top no-repeat; + padding-left: 18px; +} +a.externalLink, a.externalLink:link, a.externalLink:visited, a.externalLink:active, a.externalLink:hover { + background: url(../images/external.png) right center no-repeat; + padding-right: 18px; +} +#legend li.newWindow { + background: url(../images/newwindow.png) left top no-repeat; + padding-left: 18px; +} +a.newWindow, a.newWindow:link, a.newWindow:visited, a.newWindow:active, a.newWindow:hover { + background: url(../images/newwindow.png) right center no-repeat; + padding-right: 18px; +} +h2 { + padding: 4px 4px 4px 6px; + border: 1px solid #999; + color: #900; + background-color: #ddd; + font-weight:900; + font-size: x-large; +} +h3 { + padding: 4px 4px 4px 6px; + border: 1px solid #aaa; + color: #900; + background-color: #eee; + font-weight: normal; + font-size: large; +} +h4 { + padding: 4px 4px 4px 6px; + border: 1px solid #bbb; + color: #900; + background-color: #fff; + font-weight: normal; + font-size: large; +} +h5 { + padding: 4px 4px 4px 6px; + color: #900; + font-size: normal; +} +p { + line-height: 1.3em; + font-size: small; +} +#breadcrumbs { + border-top: 1px solid #aaa; + 
border-bottom: 1px solid #aaa; + background-color: #ccc; +} +#leftColumn { + margin: 10px 0 0 5px; + border: 1px solid #999; + background-color: #eee; +} +#navcolumn h5 { + font-size: smaller; + border-bottom: 1px solid #aaaaaa; + padding-top: 2px; + color: #000; +} + +table.bodyTable th { + color: white; + background-color: #bbb; + text-align: left; + font-weight: bold; +} + +table.bodyTable th, table.bodyTable td { + font-size: 1em; +} + +table.bodyTable tr.a { + background-color: #ddd; +} + +table.bodyTable tr.b { + background-color: #eee; +} + +.source { + border: 1px solid #999; +} +dl { + padding: 4px 4px 4px 6px; + border: 1px solid #aaa; + background-color: #ffc; +} +dt { + color: #900; +} +#organizationLogo img, #projectLogo img, #projectLogo span{ + margin: 8px; +} +#banner { + border-bottom: 1px solid #fff; +} +.errormark, .warningmark, .donemark, .infomark { + background: url(../images/icon_error_sml.gif) no-repeat; +} + +.warningmark { + background-image: url(../images/icon_warning_sml.gif); +} + +.donemark { + background-image: url(../images/icon_success_sml.gif); +} + +.infomark { + background-image: url(../images/icon_info_sml.gif); +} + +/* From maven-base.css */ + +img { + border:none; +} +table { + padding:0px; + width: 100%; + margin-left: -2px; + margin-right: -2px; +} +acronym { + cursor: help; + border-bottom: 1px dotted #feb; +} +table.bodyTable th, table.bodyTable td { + padding: 2px 4px 2px 4px; + vertical-align: top; +} +div.clear{ + clear:both; + visibility: hidden; +} +div.clear hr{ + display: none; +} +#bannerLeft, #bannerRight { + font-size: xx-large; + font-weight: bold; +} +#bannerLeft img, #bannerRight img { + margin: 0px; +} +.xleft, #bannerLeft img { + float:left; + text-shadow: #7CFC00 1px 1px 1px; +} +.xright, #bannerRight { + float:right; + text-shadow: #7CFC00 1px 1px 1px; +} +#banner { + padding: 0px; +} +#banner img { + border: none; +} +#breadcrumbs { + padding: 3px 10px 3px 10px; +} +#leftColumn { + width: 170px; + 
float:left; + overflow: auto; +} +#bodyColumn { + margin-right: 1.5em; + margin-left: 197px; +} +#legend { + padding: 8px 0 8px 0; +} +#navcolumn { + padding: 8px 4px 0 8px; +} +#navcolumn h5 { + margin: 0; + padding: 0; + font-size: small; +} +#navcolumn ul { + margin: 0; + padding: 0; + font-size: small; +} +#navcolumn li { + list-style-type: none; + background-image: none; + background-repeat: no-repeat; + background-position: 0 0.4em; + padding-left: 16px; + list-style-position: outside; + line-height: 1.2em; + font-size: smaller; +} +#navcolumn li.expanded { + background-image: url(../images/expanded.gif); +} +#navcolumn li.collapsed { + background-image: url(../images/collapsed.gif); +} +#navcolumn img { + margin-top: 10px; + margin-bottom: 3px; +} +#search img { + margin: 0px; + display: block; +} +#search #q, #search #btnG { + border: 1px solid #999; + margin-bottom:10px; +} +#search form { + margin: 0px; +} +#lastPublished { + font-size: x-small; +} +.navSection { + margin-bottom: 2px; + padding: 8px; +} +.navSectionHead { + font-weight: bold; + font-size: x-small; +} +.section { + padding: 4px; +} +#footer p { + padding: 3px 10px 3px 10px; + font-size: x-small; + text-align: center; +} +.source { + padding: 12px; + margin: 1em 7px 1em 7px; +} +.source pre { + margin: 0px; + padding: 0px; +} diff --git a/src/site/resources/tika.png b/src/site/resources/tika.png new file mode 100644 index 0000000..1ed6844 Binary files /dev/null and b/src/site/resources/tika.png differ diff --git a/src/site/resources/tika.svg b/src/site/resources/tika.svg new file mode 100644 index 0000000..dd01d85 --- /dev/null +++ b/src/site/resources/tika.svg @@ -0,0 +1,5318 @@ + + + + + + + image/svg+xml + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + 
+ + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + generated by pstoedit version:3.45 from Z:/asf_logo_1999.eps + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + 
+ + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + +1010100101010101011010101110101011101011101010111101011100101011101011010011000010101010101111011000110101010101100101110111010110011101<Apache Tika/> + diff --git a/src/site/resources/tikaNoText.svg b/src/site/resources/tikaNoText.svg new file mode 100644 index 0000000..1ebadbd --- /dev/null +++ b/src/site/resources/tikaNoText.svg @@ -0,0 +1,5305 @@ + + + + + + + image/svg+xml + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + 
+ + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + generated by pstoedit version:3.45 from Z:/asf_logo_1999.eps + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + 
+ + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + +1010100101010101011010101110101011101011101010111101011100101011101011010011000010101010101111011000110101010101100101110111010110011101 + diff --git a/src/site/resources/tikaNoText16.png b/src/site/resources/tikaNoText16.png new file mode 100644 index 0000000..6f16549 Binary files /dev/null and b/src/site/resources/tikaNoText16.png differ diff --git a/src/site/resources/tikaNoText32.png b/src/site/resources/tikaNoText32.png new file mode 100644 index 0000000..caa7033 Binary files /dev/null and b/src/site/resources/tikaNoText32.png differ diff --git a/src/site/resources/tikaNoText64.png b/src/site/resources/tikaNoText64.png new file mode 100644 index 0000000..cf8eddc Binary files /dev/null and b/src/site/resources/tikaNoText64.png differ diff --git a/src/site/site.vm b/src/site/site.vm new file mode 100644 index 0000000..647d204 --- /dev/null +++ b/src/site/site.vm @@ -0,0 +1,283 @@ + + + + +#macro ( link $href $name ) + #if ( ( $href.toLowerCase().startsWith("http") || $href.toLowerCase().startsWith("https") ) ) + $name + #else + $name + #end +#end + +#macro ( banner $banner $id ) + #if ( $banner ) + #if( $banner.href ) + + #else + + #end + #end +#end + +#macro ( links $links ) + #set ( $counter = 0 ) + #foreach( $item in $links ) + #set ( $counter = $counter + 1 ) + #set ( $currentItemHref = $PathTool.calculateLink( $item.href, 
$relativePath ) ) + #set ( $currentItemHref = $currentItemHref.replaceAll( "\\", "/" ) ) + #link( $currentItemHref $item.name ) + #if ( $links.size() > $counter ) + | + #end + #end +#end + +#macro ( breadcrumbs $breadcrumbs ) + #set ( $counter = 0 ) + #foreach( $item in $breadcrumbs ) + #set ( $counter = $counter + 1 ) + #set ( $currentItemHref = $PathTool.calculateLink( $item.href, $relativePath ) ) + #set ( $currentItemHref = $currentItemHref.replaceAll( "\\", "/" ) ) + + #if ( $currentItemHref == $alignedFileName || $currentItemHref == "" ) + $item.name + #else + #link( $currentItemHref $item.name ) + #end + #if ( $breadcrumbs.size() > $counter ) + > + #end + #end +#end + +#macro ( displayTree $display $item ) + #if ( $item && $item.items && $item.items.size() > 0 ) + #foreach( $subitem in $item.items ) + #set ( $subitemHref = $PathTool.calculateLink( $subitem.href, $relativePath ) ) + #set ( $subitemHref = $subitemHref.replaceAll( "\\", "/" ) ) + #if ( $alignedFileName == $subitemHref ) + #set ( $display = true ) + #end + + #displayTree( $display $subitem ) + #end + #end +#end + +#macro ( menuItem $item ) + #set ( $collapse = "none" ) + #set ( $currentItemHref = $PathTool.calculateLink( $item.href, $relativePath ) ) + #set ( $currentItemHref = $currentItemHref.replaceAll( "\\", "/" ) ) + + #if ( $item && $item.items && $item.items.size() > 0 ) + #if ( $item.collapse == false ) + #set ( $collapse = "expanded" ) + #else + ## By default collapsed + #set ( $collapse = "collapsed" ) + #end + + #set ( $display = false ) + #displayTree( $display $item ) + + #if ( $alignedFileName == $currentItemHref || $display ) + #set ( $collapse = "expanded" ) + #end + #end +
  • + #if ( $item.img ) + #if ( ! ( $item.img.toLowerCase().startsWith("http") || $item.img.toLowerCase().startsWith("https") ) ) + #set ( $src = $PathTool.calculateLink( $item.img, $relativePath ) ) + #set ( $src = $src.replaceAll( "\\", "/" ) ) + + #else + + #end + #end + #if ( $alignedFileName == $currentItemHref ) + $item.name + #else + #link( $currentItemHref $item.name ) + #end + #if ( $item && $item.items && $item.items.size() > 0 ) + #if ( $collapse == "expanded" ) +
      + #foreach( $subitem in $item.items ) + #menuItem( $subitem ) + #end +
    + #end + #end +
  • +#end + +#macro ( mainMenu $menus ) + #foreach( $menu in $menus ) + #if ( $menu.name ) +
    $menu.name
    + #end + #if ( $menu.items && $menu.items.size() > 0 ) +
      + #foreach( $item in $menu.items ) + #menuItem( $item ) + #end +
    + #end + #end +#end + + + + + $title + + + + + + + +
    + +
    + +
    +
    +
    +
    + $bodyContent +
    +
    +
    +
    +
    + + + diff --git a/src/site/site.xml b/src/site/site.xml new file mode 100644 index 0000000..9d099c0 --- /dev/null +++ b/src/site/site.xml @@ -0,0 +1,47 @@ + + + + + Apache Tika + http://tika.apache.org/tika.png + http://tika.apache.org + + + Apache + http://www.apache.org/images/feather-small.gif + www.apache.org + + + + + + + + + + + + + + + + + + diff --git a/tika-app/pom.xml b/tika-app/pom.xml index 96fbb62..3efe302 100644 --- a/tika-app/pom.xml +++ b/tika-app/pom.xml @@ -25,7 +25,7 @@ org.apache.tika tika-parent - 1.11 + 1.5 ../tika-parent/pom.xml @@ -42,21 +42,6 @@ ${project.groupId} tika-parsers ${project.version} - - - commons-logging - commons-logging - - - commons-logging - commons-logging-api - - - - - ${project.groupId} - tika-serialization - ${project.version} ${project.groupId} @@ -64,37 +49,26 @@ ${project.version} - ${project.groupId} - tika-batch - ${project.version} - - - org.slf4j slf4j-log4j12 - - - org.slf4j - jul-to-slf4j - - - org.slf4j - jcl-over-slf4j - - - log4j - log4j - 1.2.17 - - + 1.5.6 + + + com.google.code.gson + gson + 1.7.1 + junit junit + test + 4.11 commons-io commons-io - ${commons.io.version} + 2.1 + test @@ -132,15 +106,10 @@ CHANGES README builddef.lst - - resources/grib1/nasa/README*.pdf - resources/grib1/**/readme*.txt - resources/grib2/**/readme*.txt ucar/nc2/iosp/fysat/Fysat*.class ucar/nc2/dataset/transform/VOceanSG1*class ucar/unidata/geoloc/vertical/OceanSG*.class - @@ -161,33 +130,10 @@ META-INF/DEPENDENCIES target/classes/META-INF/DEPENDENCIES - - META-INF/cxf/bus-extensions.txt - - - - org.apache.maven.plugins - maven-jar-plugin - - - - test-jar - - - - - - org.apache.rat - apache-rat-plugin - - - src/test/resources/test-data/** - - @@ -264,20 +210,20 @@ - The Apache Software Foundation - http://www.apache.org + The Apache Software Foundation + http://www.apache.org - http://svn.apache.org/viewvc/tika/tags/1.11-rc1/tika-app - scm:svn:http://svn.apache.org/repos/asf/tika/tags/1.11-rc1/tika-app - 
scm:svn:https://svn.apache.org/repos/asf/tika/tags/1.11-rc1/tika-app + http://svn.apache.org/viewvc/tika/tags/1.5/tika-app + scm:svn:http://svn.apache.org/repos/asf/tika/tags/1.5/tika-app + scm:svn:https://svn.apache.org/repos/asf/tika/tags/1.5/tika-app - JIRA - https://issues.apache.org/jira/browse/TIKA + JIRA + https://issues.apache.org/jira/browse/TIKA - Jenkins - https://builds.apache.org/job/Tika-trunk/ + Jenkins + https://builds.apache.org/job/Tika-trunk/ diff --git a/tika-app/src/main/appended-resources/META-INF/LICENSE b/tika-app/src/main/appended-resources/META-INF/LICENSE index f828d69..ad34fa2 100644 --- a/tika-app/src/main/appended-resources/META-INF/LICENSE +++ b/tika-app/src/main/appended-resources/META-INF/LICENSE @@ -1122,8 +1122,3 @@ this Agreement will bring a legal action under this Agreement more than one year after the cause of action arose. Each party waives its rights to a jury trial in any resulting litigation. - -Sqlite (included in the "provided" org.xerial's sqlite-jdbc) - Sqlite is in the Public Domain. For details - see: https://www.sqlite.org/copyright.html - diff --git a/tika-app/src/main/java/org/apache/tika/batch/DigestingAutoDetectParserFactory.java b/tika-app/src/main/java/org/apache/tika/batch/DigestingAutoDetectParserFactory.java deleted file mode 100644 index bb7a98c..0000000 --- a/tika-app/src/main/java/org/apache/tika/batch/DigestingAutoDetectParserFactory.java +++ /dev/null @@ -1,43 +0,0 @@ -/* - * Licensed to the Apache Software Foundation (ASF) under one or more - * contributor license agreements. See the NOTICE file distributed with - * this work for additional information regarding copyright ownership. - * The ASF licenses this file to You under the Apache License, Version 2.0 - * (the "License"); you may not use this file except in compliance with - * the License. 
You may obtain a copy of the License at - * - * http://www.apache.org/licenses/LICENSE-2.0 - * - * Unless required by applicable law or agreed to in writing, software - * distributed under the License is distributed on an "AS IS" BASIS, - * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. - * See the License for the specific language governing permissions and - * limitations under the License. - */ - -package org.apache.tika.batch; - -import org.apache.tika.config.TikaConfig; -import org.apache.tika.parser.AutoDetectParser; -import org.apache.tika.parser.DigestingParser; -import org.apache.tika.parser.Parser; - -public class DigestingAutoDetectParserFactory extends ParserFactory { - - private DigestingParser.Digester digester = null; - - - @Override - public Parser getParser(TikaConfig config) { - Parser p = new AutoDetectParser(config); - if (digester == null) { - return p; - } - DigestingParser d = new DigestingParser(p, digester); - return d; - } - - public void setDigester(DigestingParser.Digester digester) { - this.digester = digester; - } -} diff --git a/tika-app/src/main/java/org/apache/tika/batch/builders/AppParserFactoryBuilder.java b/tika-app/src/main/java/org/apache/tika/batch/builders/AppParserFactoryBuilder.java deleted file mode 100644 index 998f649..0000000 --- a/tika-app/src/main/java/org/apache/tika/batch/builders/AppParserFactoryBuilder.java +++ /dev/null @@ -1,76 +0,0 @@ -/* - * Licensed to the Apache Software Foundation (ASF) under one or more - * contributor license agreements. See the NOTICE file distributed with - * this work for additional information regarding copyright ownership. - * The ASF licenses this file to You under the Apache License, Version 2.0 - * (the "License"); you may not use this file except in compliance with - * the License. 
You may obtain a copy of the License at - * - * http://www.apache.org/licenses/LICENSE-2.0 - * - * Unless required by applicable law or agreed to in writing, software - * distributed under the License is distributed on an "AS IS" BASIS, - * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. - * See the License for the specific language governing permissions and - * limitations under the License. - */ - -package org.apache.tika.batch.builders; - -import java.util.Locale; -import java.util.Map; - -import org.apache.tika.batch.DigestingAutoDetectParserFactory; -import org.apache.tika.batch.ParserFactory; -import org.apache.tika.parser.DigestingParser; -import org.apache.tika.parser.utils.CommonsDigester; -import org.apache.tika.util.ClassLoaderUtil; -import org.apache.tika.util.XMLDOMUtil; -import org.w3c.dom.Node; - -public class AppParserFactoryBuilder implements IParserFactoryBuilder { - - @Override - public ParserFactory build(Node node, Map runtimeAttrs) { - Map localAttrs = XMLDOMUtil.mapifyAttrs(node, runtimeAttrs); - String className = localAttrs.get("class"); - ParserFactory pf = ClassLoaderUtil.buildClass(ParserFactory.class, className); - - if (localAttrs.containsKey("parseRecursively")) { - String bString = localAttrs.get("parseRecursively").toLowerCase(Locale.ENGLISH); - if (bString.equals("true")) { - pf.setParseRecursively(true); - } else if (bString.equals("false")) { - pf.setParseRecursively(false); - } else { - throw new RuntimeException("parseRecursively must have value of \"true\" or \"false\": "+ - bString); - } - } - if (pf instanceof DigestingAutoDetectParserFactory) { - DigestingParser.Digester d = buildDigester(localAttrs); - ((DigestingAutoDetectParserFactory)pf).setDigester(d); - } - return pf; - } - - private DigestingParser.Digester buildDigester(Map localAttrs) { - String digestString = localAttrs.get("digest"); - CommonsDigester.DigestAlgorithm[] algos = CommonsDigester.parse(digestString); - - String 
readLimitString = localAttrs.get("digestMarkLimit"); - if (readLimitString == null) { - throw new IllegalArgumentException("Must specify \"digestMarkLimit\" for "+ - "the DigestingAutoDetectParserFactory"); - } - int readLimit = -1; - - try { - readLimit = Integer.parseInt(readLimitString); - } catch (NumberFormatException e) { - throw new IllegalArgumentException("Parameter \"digestMarkLimit\" must be a parseable int: "+ - readLimitString); - } - return new CommonsDigester(readLimit, algos); - } -} diff --git a/tika-app/src/main/java/org/apache/tika/cli/BatchCommandLineBuilder.java b/tika-app/src/main/java/org/apache/tika/cli/BatchCommandLineBuilder.java deleted file mode 100644 index da44956..0000000 --- a/tika-app/src/main/java/org/apache/tika/cli/BatchCommandLineBuilder.java +++ /dev/null @@ -1,209 +0,0 @@ -/* - * Licensed to the Apache Software Foundation (ASF) under one or more - * contributor license agreements. See the NOTICE file distributed with - * this work for additional information regarding copyright ownership. - * The ASF licenses this file to You under the Apache License, Version 2.0 - * (the "License"); you may not use this file except in compliance with - * the License. You may obtain a copy of the License at - * - * http://www.apache.org/licenses/LICENSE-2.0 - * - * Unless required by applicable law or agreed to in writing, software - * distributed under the License is distributed on an "AS IS" BASIS, - * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. - * See the License for the specific language governing permissions and - * limitations under the License. 
- */ - -package org.apache.tika.cli; - -import java.io.IOException; -import java.nio.file.Files; -import java.nio.file.Path; -import java.nio.file.Paths; -import java.util.ArrayList; -import java.util.LinkedHashMap; -import java.util.List; -import java.util.Map; -import java.util.regex.Matcher; -import java.util.regex.Pattern; - -/** - * This takes a TikaCLI commandline and builds the full commandline for - * org.apache.tika.batch.fs.FSBatchProcessCLI. - *

    - * The "default" batch config file that this relies on - * if no batch config file is specified on the commandline - * is: tika-batch/src/main/resources/.../default-tika-batch-config.xml - */ -class BatchCommandLineBuilder { - - static Pattern JVM_OPTS_PATTERN = Pattern.compile("^(--?)J(.+)"); - - protected static String[] build(String[] args) throws IOException { - Map processArgs = new LinkedHashMap(); - Map jvmOpts = new LinkedHashMap(); - //take the args, and divide them into process args and options for - //the child jvm process (i.e. log files, etc) - mapifyArgs(args, processArgs, jvmOpts); - - //now modify processArgs in place - translateCommandLine(args, processArgs); - - //maybe the user specified a different classpath?! - if (! jvmOpts.containsKey("-cp") && ! jvmOpts.containsKey("--classpath")) { - String cp = System.getProperty("java.class.path"); - //need to test for " " on *nix, can't just add double quotes - //across platforms. - if (cp.contains(" ")){ - cp = "\""+cp+"\""; - } - jvmOpts.put("-cp", cp); - } - - boolean hasLog4j = false; - for (String k : jvmOpts.keySet()) { - if (k.startsWith("-Dlog4j.configuration=")) { - hasLog4j = true; - break; - } - } - //use the log4j config file inside the app /resources/log4j_batch_process.properties - if (! 
hasLog4j) { - jvmOpts.put("-Dlog4j.configuration=\"log4j_batch_process.properties\"", ""); - } - //now build the full command line - List fullCommand = new ArrayList(); - fullCommand.add("java"); - for (Map.Entry e : jvmOpts.entrySet()) { - fullCommand.add(e.getKey()); - if (e.getValue().length() > 0) { - fullCommand.add(e.getValue()); - } - } - fullCommand.add("org.apache.tika.batch.fs.FSBatchProcessCLI"); - //now add the process commands - for (Map.Entry e : processArgs.entrySet()) { - fullCommand.add(e.getKey()); - if (e.getValue().length() > 0) { - fullCommand.add(e.getValue()); - } - } - return fullCommand.toArray(new String[fullCommand.size()]); - } - - - /** - * Take the input args and separate them into args that belong on the commandline - * and those that belong as jvm args for the child process. - * @param args -- literal args from TikaCLI commandline - * @param commandLine args that should be part of the batch commandline - * @param jvmArgs args that belong as jvm arguments for the child process - */ - private static void mapifyArgs(final String[] args, - final Map commandLine, - final Map jvmArgs) { - - if (args.length == 0) { - return; - } - - Matcher matcher = JVM_OPTS_PATTERN.matcher(""); - for (int i = 0; i < args.length; i++) { - if (matcher.reset(args[i]).find()) { - String jvmArg = matcher.group(1)+matcher.group(2); - String v = ""; - if (i < args.length-1 && ! args[i+1].startsWith("-")){ - v = args[i+1]; - i++; - } - jvmArgs.put(jvmArg, v); - } else if (args[i].startsWith("-")) { - String k = args[i]; - String v = ""; - if (i < args.length-1 && ! args[i+1].startsWith("-")){ - v = args[i+1]; - i++; - } - commandLine.put(k, v); - } - } - } - - private static void translateCommandLine(String[] args, Map map) throws IOException { - //if there are only two args and they are both directories, treat the first - //as input and the second as output. - if (args.length == 2 && !args[0].startsWith("-") && ! 
args[1].startsWith("-")) { - Path candInput = Paths.get(args[0]); - Path candOutput = Paths.get(args[1]); - - if (Files.isRegularFile(candOutput)) { - throw new IllegalArgumentException("Can't specify an existing file as the "+ - "second argument for the output directory of a batch process"); - } - - if (Files.isDirectory(candInput)) { - map.put("-inputDir", args[0]); - map.put("-outputDir", args[1]); - } - } - //look for tikaConfig - for (String arg : args) { - if (arg.startsWith("--config=")) { - String configPath = arg.substring("--config=".length()); - map.put("-c", configPath); - break; - } - } - //now translate output types - if (map.containsKey("-h") || map.containsKey("--html")) { - map.remove("-h"); - map.remove("--html"); - map.put("-basicHandlerType", "html"); - map.put("-outputSuffix", "html"); - } else if (map.containsKey("-x") || map.containsKey("--xml")) { - map.remove("-x"); - map.remove("--xml"); - map.put("-basicHandlerType", "xml"); - map.put("-outputSuffix", "xml"); - } else if (map.containsKey("-t") || map.containsKey("--text")) { - map.remove("-t"); - map.remove("--text"); - map.put("-basicHandlerType", "text"); - map.put("-outputSuffix", "txt"); - } else if (map.containsKey("-m") || map.containsKey("--metadata")) { - map.remove("-m"); - map.remove("--metadata"); - map.put("-basicHandlerType", "ignore"); - map.put("-outputSuffix", "json"); - } else if (map.containsKey("-T") || map.containsKey("--text-main")) { - map.remove("-T"); - map.remove("--text-main"); - map.put("-basicHandlerType", "body"); - map.put("-outputSuffix", "txt"); - } - - if (map.containsKey("-J") || map.containsKey("--jsonRecursive")) { - map.remove("-J"); - map.remove("--jsonRecursive"); - map.put("-recursiveParserWrapper", "true"); - //overwrite outputSuffix - map.put("-outputSuffix", "json"); - } - - if (map.containsKey("--inputDir") || map.containsKey("-i")) { - String v1 = map.remove("--inputDir"); - String v2 = map.remove("-i"); - String v = (v1 == null) ? 
v2 : v1; - map.put("-inputDir", v); - } - - if (map.containsKey("--outputDir") || map.containsKey("-o")) { - String v1 = map.remove("--outputDir"); - String v2 = map.remove("-o"); - String v = (v1 == null) ? v2 : v1; - map.put("-outputDir", v); - } - - } -} diff --git a/tika-app/src/main/java/org/apache/tika/cli/TikaCLI.java b/tika-app/src/main/java/org/apache/tika/cli/TikaCLI.java index 50f3463..7a44060 100644 --- a/tika-app/src/main/java/org/apache/tika/cli/TikaCLI.java +++ b/tika-app/src/main/java/org/apache/tika/cli/TikaCLI.java @@ -16,20 +16,10 @@ */ package org.apache.tika.cli; -import static java.nio.charset.StandardCharsets.UTF_8; - -import javax.xml.transform.OutputKeys; -import javax.xml.transform.TransformerConfigurationException; -import javax.xml.transform.sax.SAXTransformerFactory; -import javax.xml.transform.sax.TransformerHandler; -import javax.xml.transform.stream.StreamResult; -import java.io.BufferedReader; import java.io.File; -import java.io.FileInputStream; import java.io.FileOutputStream; import java.io.IOException; import java.io.InputStream; -import java.io.InputStreamReader; import java.io.OutputStream; import java.io.OutputStreamWriter; import java.io.PrintStream; @@ -41,37 +31,34 @@ import java.net.Socket; import java.net.URI; import java.net.URL; -import java.nio.charset.Charset; -import java.nio.file.Files; -import java.nio.file.Path; -import java.nio.file.Paths; +import java.text.NumberFormat; +import java.text.ParsePosition; import java.util.Arrays; import java.util.Comparator; -import java.util.Enumeration; import java.util.HashMap; import java.util.HashSet; import java.util.List; -import java.util.Locale; +import java.util.Map.Entry; import java.util.Map; -import java.util.Map.Entry; import java.util.Set; -import java.util.TreeSet; - -import org.apache.commons.io.FilenameUtils; -import org.apache.commons.io.IOUtils; -import org.apache.commons.io.input.CloseShieldInputStream; +import javax.xml.transform.OutputKeys; +import 
javax.xml.transform.TransformerConfigurationException; +import javax.xml.transform.sax.SAXTransformerFactory; +import javax.xml.transform.sax.TransformerHandler; +import javax.xml.transform.stream.StreamResult; + import org.apache.commons.logging.Log; import org.apache.commons.logging.LogFactory; +import org.apache.log4j.BasicConfigurator; import org.apache.log4j.Level; -import org.apache.log4j.LogManager; import org.apache.log4j.Logger; -import org.apache.log4j.PropertyConfigurator; +import org.apache.log4j.SimpleLayout; +import org.apache.log4j.WriterAppender; import org.apache.poi.poifs.filesystem.DirectoryEntry; import org.apache.poi.poifs.filesystem.DocumentEntry; import org.apache.poi.poifs.filesystem.DocumentInputStream; import org.apache.poi.poifs.filesystem.POIFSFileSystem; import org.apache.tika.Tika; -import org.apache.tika.batch.BatchProcessDriverCLI; import org.apache.tika.config.TikaConfig; import org.apache.tika.detect.CompositeDetector; import org.apache.tika.detect.DefaultDetector; @@ -80,65 +67,46 @@ import org.apache.tika.extractor.EmbeddedDocumentExtractor; import org.apache.tika.fork.ForkParser; import org.apache.tika.gui.TikaGUI; +import org.apache.tika.io.CloseShieldInputStream; +import org.apache.tika.io.IOUtils; import org.apache.tika.io.TikaInputStream; import org.apache.tika.language.LanguageProfilerBuilder; import org.apache.tika.language.ProfilingHandler; import org.apache.tika.metadata.Metadata; -import org.apache.tika.metadata.serialization.JsonMetadata; -import org.apache.tika.metadata.serialization.JsonMetadataList; import org.apache.tika.mime.MediaType; import org.apache.tika.mime.MediaTypeRegistry; -import org.apache.tika.mime.MimeType; import org.apache.tika.mime.MimeTypeException; -import org.apache.tika.mime.MimeTypes; import org.apache.tika.parser.AutoDetectParser; import org.apache.tika.parser.CompositeParser; -import org.apache.tika.parser.DigestingParser; import org.apache.tika.parser.NetworkParser; import 
org.apache.tika.parser.ParseContext; import org.apache.tika.parser.Parser; import org.apache.tika.parser.ParserDecorator; import org.apache.tika.parser.PasswordProvider; -import org.apache.tika.parser.RecursiveParserWrapper; import org.apache.tika.parser.html.BoilerpipeContentHandler; -import org.apache.tika.parser.utils.CommonsDigester; -import org.apache.tika.sax.BasicContentHandlerFactory; import org.apache.tika.sax.BodyContentHandler; -import org.apache.tika.sax.ContentHandlerFactory; import org.apache.tika.sax.ExpandedTitleContentHandler; import org.apache.tika.xmp.XMPMetadata; import org.xml.sax.ContentHandler; import org.xml.sax.SAXException; import org.xml.sax.helpers.DefaultHandler; +import com.google.gson.Gson; +import org.apache.tika.io.FilenameUtils; /** * Simple command line interface for Apache Tika. */ public class TikaCLI { - - private final int MAX_MARK = 20*1024*1024;//20MB - private File extractDir = new File("."); private static final Log logger = LogFactory.getLog(TikaCLI.class); public static void main(String[] args) throws Exception { + BasicConfigurator.configure( + new WriterAppender(new SimpleLayout(), System.err)); + Logger.getRootLogger().setLevel(Level.INFO); TikaCLI cli = new TikaCLI(); - if (! 
isConfigured()) { - PropertyConfigurator.configure(cli.getClass().getResourceAsStream("/log4j.properties")); - } - - if (cli.testForHelp(args)) { - cli.usage(); - return; - } else if (cli.testForBatch(args)) { - String[] batchArgs = BatchCommandLineBuilder.build(args); - BatchProcessDriverCLI batchDriver = new BatchProcessDriverCLI(batchArgs); - batchDriver.execute(); - return; - } - if (args.length > 0) { for (int i = 0; i < args.length; i++) { cli.process(args[i]); @@ -161,22 +129,6 @@ } } - private static boolean isConfigured() { - //Borrowed from: http://wiki.apache.org/logging-log4j/UsefulCode - Enumeration appenders = LogManager.getRootLogger().getAllAppenders(); - if (appenders.hasMoreElements()) { - return true; - } - else { - Enumeration loggers = LogManager.getCurrentLoggers() ; - while (loggers.hasMoreElements()) { - Logger c = (Logger) loggers.nextElement(); - if (c.getAllAppenders().hasMoreElements()) - return true; - } - } - return false; - } private class OutputType { public void process( @@ -326,11 +278,7 @@ private Parser parser; - private String configFilePath; - private OutputType type = XML; - - private boolean recursiveJSON = false; private LanguageProfilerBuilder ngp = null; @@ -343,8 +291,6 @@ * Password for opening encrypted documents, or null. 
*/ private String password = System.getenv("TIKA_PASSWORD"); - - private DigestingParser.Digester digester = null; private boolean pipeMode = true; @@ -379,44 +325,27 @@ Logger.getRootLogger().setLevel(Level.DEBUG); } else if (arg.equals("-g") || arg.equals("--gui")) { pipeMode = false; - if (configFilePath != null){ - TikaGUI.main(new String[]{configFilePath}); - } else { - TikaGUI.main(new String[0]); - } + TikaGUI.main(new String[0]); } else if (arg.equals("--list-parser") || arg.equals("--list-parsers")) { pipeMode = false; - displayParsers(false, false); + displayParsers(false); } else if (arg.equals("--list-detector") || arg.equals("--list-detectors")) { pipeMode = false; displayDetectors(); } else if (arg.equals("--list-parser-detail") || arg.equals("--list-parser-details")) { pipeMode = false; - displayParsers(true, false); - } else if (arg.equals("--list-parser-detail-apt") || arg.equals("--list-parser-details-apt")) { - pipeMode = false; - displayParsers(true, true); + displayParsers(true); } else if(arg.equals("--list-met-models")){ pipeMode = false; displayMetModels(); } else if(arg.equals("--list-supported-types")){ pipeMode = false; displaySupportedTypes(); - } else if (arg.startsWith("--compare-file-magic=")) { - pipeMode = false; - compareFileMagic(arg.substring(arg.indexOf('=')+1)); } else if (arg.equals("--container-aware") || arg.equals("--container-aware-detector")) { // ignore, as container-aware detectors are now always used } else if (arg.equals("-f") || arg.equals("--fork")) { fork = true; - } else if (arg.startsWith("--config=")) { - configure(arg.substring("--config=".length())); - } else if (arg.startsWith("--digest=")) { - CommonsDigester.DigestAlgorithm[] algos = CommonsDigester.parse( - arg.substring("--digest=".length())); - digester = new CommonsDigester(MAX_MARK,algos); - parser = new DigestingParser(parser, digester); } else if (arg.startsWith("-e")) { encoding = arg.substring("-e".length()); } else if 
(arg.startsWith("--encoding=")) { @@ -427,9 +356,7 @@ password = arg.substring("--password=".length()); } else if (arg.equals("-j") || arg.equals("--json")) { type = JSON; - } else if (arg.equals("-J") || arg.equals("--jsonRecursive")) { - recursiveJSON = true; - } else if (arg.equals("-y") || arg.equals("--xmp")) { + } else if (arg.equals("-y") || arg.equals("--xmp")) { type = XMP; } else if (arg.equals("-x") || arg.equals("--xml")) { type = XML; @@ -470,9 +397,12 @@ if (serverMode) { new TikaServer(Integer.parseInt(arg)).start(); } else if (arg.equals("-")) { - try (InputStream stream = TikaInputStream.get( - new CloseShieldInputStream(System.in))) { + InputStream stream = + TikaInputStream.get(new CloseShieldInputStream(System.in)); + try { type.process(stream, System.out, new Metadata()); + } finally { + stream.close(); } } else { URL url; @@ -482,51 +412,18 @@ } else { url = new URL(arg); } - if (recursiveJSON) { - handleRecursiveJson(url, System.out); - } else { - Metadata metadata = new Metadata(); - try (InputStream input = - TikaInputStream.get(url, metadata)) { - type.process(input, System.out, metadata); - } finally { - System.out.flush(); - } - } - } - } - } - - private void handleRecursiveJson(URL url, OutputStream output) throws IOException, SAXException, TikaException { - Metadata metadata = new Metadata(); - RecursiveParserWrapper wrapper = new RecursiveParserWrapper(parser, getContentHandlerFactory(type)); - try (InputStream input = TikaInputStream.get(url, metadata)) { - wrapper.parse(input, null, metadata, context); - } - JsonMetadataList.setPrettyPrinting(prettyPrint); - Writer writer = getOutputWriter(output, encoding); - try { - JsonMetadataList.toJson(wrapper.getMetadata(), writer); - } finally { - writer.flush(); - } - } - - private ContentHandlerFactory getContentHandlerFactory(OutputType type) { - BasicContentHandlerFactory.HANDLER_TYPE handlerType = BasicContentHandlerFactory.HANDLER_TYPE.IGNORE; - if (type.equals(HTML)) { - handlerType = 
BasicContentHandlerFactory.HANDLER_TYPE.HTML; - } else if (type.equals(XML)) { - handlerType = BasicContentHandlerFactory.HANDLER_TYPE.XML; - } else if (type.equals(TEXT)) { - handlerType = BasicContentHandlerFactory.HANDLER_TYPE.TEXT; - } else if (type.equals(TEXT_MAIN)) { - handlerType = BasicContentHandlerFactory.HANDLER_TYPE.BODY; - } else if (type.equals(METADATA)) { - handlerType = BasicContentHandlerFactory.HANDLER_TYPE.IGNORE; - } - return new BasicContentHandlerFactory(handlerType, -1); - } + Metadata metadata = new Metadata(); + InputStream input = TikaInputStream.get(url, metadata); + try { + type.process(input, System.out, metadata); + } finally { + input.close(); + System.out.flush(); + } + } + } + } + private void usage() { PrintStream out = System.out; out.println("usage: java -jar tika-app.jar [option...] [file|port...]"); @@ -540,9 +437,6 @@ out.println(" -s or --server Start the Apache Tika server"); out.println(" -f or --fork Use Fork Mode for out-of-process extraction"); out.println(); - out.println(" --config="); - out.println(" TikaConfig file. 
Must be specified before -g, -s or -f!"); - out.println(""); out.println(" -x or --xml Output XHTML content (default)"); out.println(" -h or --html Output HTML content"); out.println(" -t or --text Output plain text content"); @@ -550,18 +444,13 @@ out.println(" -m or --metadata Output only metadata"); out.println(" -j or --json Output metadata in JSON"); out.println(" -y or --xmp Output metadata in XMP"); - out.println(" -J or --jsonRecursive Output metadata and content from all"); - out.println(" embedded files (choose content type"); - out.println(" with -x, -h, -t or -m; default is -x)"); out.println(" -l or --language Output only language"); out.println(" -d or --detect Detect document type"); - out.println(" --digest=X Include digest X (md2, md5, sha1,"); - out.println(" sha256, sha384, sha512"); out.println(" -eX or --encoding=X Use output encoding X"); out.println(" -pX or --password=X Use document password X"); out.println(" -z or --extract Extract all attachements into current directory"); out.println(" --extract-dir=

<dir>        Specify target directory for -z");
The server will listen to the"); out.println(" ports you specify as one or more arguments."); out.println(); - out.println("- Batch mode"); - out.println(); - out.println(" Simplest method."); - out.println(" Specify two directories as args with no other args:"); - out.println(" java -jar tika-app.jar "); - out.println(); - out.println("Batch Options:"); - out.println(" -i or --inputDir Input directory"); - out.println(" -o or --outputDir Output directory"); - out.println(" -numConsumers Number of processing threads"); - out.println(" -bc Batch config file"); - out.println(" -maxRestarts Maximum number of times the "); - out.println(" watchdog process will restart the child process."); - out.println(" -timeoutThresholdMillis Number of milliseconds allowed to a parse"); - out.println(" before the process is killed and restarted"); - out.println(" -fileList List of files to process, with"); - out.println(" paths relative to the input directory"); - out.println(" -includeFilePat Regular expression to determine which"); - out.println(" files to process, e.g. \"(?i)\\.pdf\""); - out.println(" -excludeFilePat Regular expression to determine which"); - out.println(" files to avoid processing, e.g. \"(?i)\\.pdf\""); - out.println(" -maxFileSizeBytes Skip files longer than this value"); - out.println(); - out.println(" Control the type of output with -x, -h, -t and/or -J."); - out.println(); - out.println(" To modify child process jvm args, prepend \"J\" as in:"); - out.println(" -JXmx4g or -JDlog4j.configuration=file:log4j.xml."); } private void version() { System.out.println(new Tika().toString()); - } - - private boolean testForHelp(String[] args) { - for (String s : args) { - if (s.equals("-?") || s.equals("--help")) { - return true; - } - } - return false; - } - - private boolean testForBatch(String[] args) { - if (args.length == 2 && ! args[0].startsWith("-") - && ! 
args[1].startsWith("-")) { - Path inputCand = Paths.get(args[0]); - Path outputCand = Paths.get(args[1]); - if (Files.isDirectory(inputCand) && - !Files.isRegularFile(outputCand)) { - return true; - } - } - - for (String s : args) { - if (s.equals("-inputDir") || s.equals("--inputDir") || s.equals("-i")) { - return true; - } - } - return false; - } - - - - private void configure(String configFilePath) throws Exception { - this.configFilePath = configFilePath; - TikaConfig config = new TikaConfig(new File(configFilePath)); - parser = new AutoDetectParser(config); - if (digester != null) { - parser = new DigestingParser(parser, digester); - } - detector = config.getDetector(); - context.set(Parser.class, parser); } private void displayMetModels(){ @@ -713,41 +529,26 @@ * If a parser is a composite parser, it will list the * sub parsers and their mime-types. */ - private void displayParsers(boolean includeMimeTypes, boolean aptListFormat) { - displayParser(parser, includeMimeTypes, aptListFormat, 3); + private void displayParsers(boolean includeMimeTypes) { + displayParser(parser, includeMimeTypes, 0); } - private void displayParser(Parser p, boolean includeMimeTypes, boolean apt, int i) { - String decorated = null; - if (p instanceof ParserDecorator) { - ParserDecorator pd = (ParserDecorator)p; - decorated = " (Wrapped by " + pd.getDecorationName() + ")"; - p = pd.getWrappedParser(); - } - + private void displayParser(Parser p, boolean includeMimeTypes, int i) { boolean isComposite = (p instanceof CompositeParser); - String name = p.getClass().getName(); - - if (apt) { - name = name.substring(0, name.lastIndexOf(".") + 1) + "{{{./api/" + name.replace(".", "/") + "}" + name.substring(name.lastIndexOf(".") + 1) + "}}"; - } else if (decorated != null) { - name += decorated; - } - if ((apt && !isComposite) || !apt) { // Don't display Composite parsers in the apt output. - System.out.println(indent(i) + ((apt) ? "* " : "") + name + (isComposite ? 
" (Composite Parser):" : "")); - if (apt) System.out.println(); - if (includeMimeTypes && !isComposite) { - for (MediaType mt : p.getSupportedTypes(context)) { - System.out.println(indent(i + 3) + ((apt) ? "* " : "") + mt); - if (apt) System.out.println(); - } + String name = (p instanceof ParserDecorator) ? + ((ParserDecorator) p).getWrappedParser().getClass().getName() : + p.getClass().getName(); + System.out.println(indent(i) + name + (isComposite ? " (Composite Parser):" : "")); + if (includeMimeTypes && !isComposite) { + for (MediaType mt : p.getSupportedTypes(context)) { + System.out.println(indent(i+2) + mt); } } if (isComposite) { Parser[] subParsers = sortParsers(invertMediaTypeMap(((CompositeParser) p).getParsers())); for(Parser sp : subParsers) { - displayParser(sp, includeMimeTypes, apt, i + ((apt) ? 0 : 3)); // Don't indent for Composites in apt. + displayParser(sp, includeMimeTypes, i+2); } } } @@ -820,133 +621,8 @@ } Parser p = parsers.get(type); if (p != null) { - if (p instanceof CompositeParser) { - p = ((CompositeParser)p).getParsers().get(type); - } System.out.println(" parser: " + p.getClass().getName()); } - } - } - - /** - * Compares our mime types registry with the File(1) tool's - * directory of (uncompiled) Magic entries. 
-     * (Well, those with mimetypes anyway)
-     * @param magicDir Path to the magic directory
-     */
-    private void compareFileMagic(String magicDir) throws Exception {
-        Set<String> tikaLacking = new TreeSet<String>();
-        Set<String> tikaNoMagic = new TreeSet<String>();
-
-        // Sanity check
-        File dir = new File(magicDir);
-        if ((new File(dir, "elf")).exists() &&
-            (new File(dir, "mime")).exists() &&
-            (new File(dir, "vorbis")).exists()) {
-            // Looks plausible
-        } else {
-            throw new IllegalArgumentException(
-                    magicDir + " doesn't seem to hold uncompressed file magic entries");
-        }
-
-        // Find all the mimetypes in the directory
-        Set<String> fileMimes = new HashSet<String>();
-        for (File mf : dir.listFiles()) {
-            if (mf.isFile()) {
-                BufferedReader r = new BufferedReader(new InputStreamReader(
-                        new FileInputStream(mf), UTF_8));
-                String line;
-                while ((line = r.readLine()) != null) {
-                    if (line.startsWith("!:mime") ||
-                        line.startsWith("#!:mime")) {
-                        String mime = line.substring(7).trim();
-                        fileMimes.add(mime);
-                    }
-                }
-                r.close();
-            }
-        }
-
-        // See how those compare to the Tika ones
-        TikaConfig config = TikaConfig.getDefaultConfig();
-        MimeTypes mimeTypes = config.getMimeRepository();
-        MediaTypeRegistry registry = config.getMediaTypeRegistry();
-        for (String mime : fileMimes) {
-            try {
-                final MimeType type = mimeTypes.getRegisteredMimeType(mime);
-
-                if (type == null) {
-                    // Tika doesn't know about this one
-                    tikaLacking.add(mime);
-                } else {
-                    // Tika knows about this one!
-
-                    // Does Tika have magic for it?
-                    boolean hasMagic = type.hasMagic();
-
-                    // How about the children?
-                    if (!hasMagic) {
-                        for (MediaType child : registry.getChildTypes(type.getType())) {
-                            MimeType childType = mimeTypes.getRegisteredMimeType(child.toString());
-                            if (childType != null && childType.hasMagic()) {
-                                hasMagic = true;
-                            }
-                        }
-                    }
-
-                    // How about the parents?
-                    MimeType parentType = type;
-                    while (parentType != null && !hasMagic) {
-                        if (parentType.hasMagic()) {
-                            // Has magic, fine
-                            hasMagic = true;
-                        } else {
-                            // Check the parent next
-                            MediaType parent = registry.getSupertype(type.getType());
-                            if (parent == MediaType.APPLICATION_XML ||
-                                parent == MediaType.TEXT_PLAIN ||
-                                parent == MediaType.OCTET_STREAM) {
-                                // Stop checking parents if we hit a top level type
-                                parent = null;
-                            }
-                            if (parent != null) {
-                                parentType = mimeTypes.getRegisteredMimeType(parent.toString());
-                            } else {
-                                parentType = null;
-                            }
-                        }
-                    }
-                    if (!hasMagic) {
-                        tikaNoMagic.add(mime);
-                    }
-                }
-            } catch (MimeTypeException e) {
-                // Broken entry in the file magic directory
-                // Silently skip
-            }
-        }
-
-        // Check how many tika knows about
-        int tikaTypes = 0;
-        int tikaAliases = 0;
-        for (MediaType type : registry.getTypes()) {
-            tikaTypes++;
-            tikaAliases += registry.getAliases(type).size();
-        }
-
-        // Report
-        System.out.println("Tika knows about " + tikaTypes + " unique mime types");
-        System.out.println("Tika knows about " + (tikaTypes+tikaAliases) + " mime types including aliases");
-        System.out.println("The File Magic directory knows about " + fileMimes.size() + " unique mime types");
-        System.out.println();
-        System.out.println("The following mime types are known to File but not Tika:");
-        for (String mime : tikaLacking) {
-            System.out.println("  " + mime);
-        }
-        System.out.println();
-        System.out.println("The following mime types from File have no Tika magic (but their children might):");
-        for (String mime : tikaNoMagic) {
-            System.out.println("  " + mime);
         }
     }
 
@@ -966,11 +642,11 @@
         if (encoding != null) {
             return new OutputStreamWriter(output, encoding);
         } else if (System.getProperty("os.name")
-                .toLowerCase(Locale.ROOT).startsWith("mac os x")) {
+                .toLowerCase().startsWith("mac os x")) {
             // TIKA-324: Override the default encoding on Mac OS X
-            return new OutputStreamWriter(output, UTF_8);
+            return new OutputStreamWriter(output, "UTF-8");
         } else {
-            return new OutputStreamWriter(output, Charset.defaultCharset());
+            return new OutputStreamWriter(output);
         }
     }
 
@@ -1046,7 +722,11 @@
             }
             System.out.println("Extracting '"+name+"' ("+contentType+") to " + outputFile);
 
-            try (FileOutputStream os = new FileOutputStream(outputFile)) {
+            FileOutputStream os = null;
+
+            try {
+                os = new FileOutputStream(outputFile);
+
                 if (inputStream instanceof TikaInputStream) {
                     TikaInputStream tin = (TikaInputStream) inputStream;
 
@@ -1065,13 +745,16 @@
                 // being a CLI program messages should go to the stderr too
                 //
                 String msg = String.format(
-                        Locale.ROOT,
                         "Ignoring unexpected exception trying to save embedded file %s (%s)",
                         name,
                         e.getMessage()
                 );
                 System.err.println(msg);
                 logger.warn(msg, e);
+            } finally {
+                if (os != null) {
+                    os.close();
+                }
             }
         }
 
@@ -1084,9 +767,11 @@
                 copy((DirectoryEntry) entry, newDir);
             } else {
                 // Copy entry
-                try (InputStream contents =
-                        new DocumentInputStream((DocumentEntry) entry)) {
+                InputStream contents = new DocumentInputStream((DocumentEntry) entry);
+                try {
                     destDir.createDocument(entry.getName(), contents);
+                } finally {
+                    contents.close();
                 }
             }
         }
@@ -1122,17 +807,13 @@
         @Override
         public void run() {
             try {
-                InputStream input = null;
                 try {
                     InputStream rawInput = socket.getInputStream();
                     OutputStream output = socket.getOutputStream();
-                    input = TikaInputStream.get(rawInput);
+                    InputStream input = TikaInputStream.get(rawInput);
                     type.process(input, output, new Metadata());
                     output.flush();
                 } finally {
-                    if (input != null) {
-                        input.close();
-                    }
                     socket.close();
                 }
             } catch (Exception e) {
@@ -1214,27 +895,68 @@
             }
         }
     }
-
-    private class NoDocumentJSONMetHandler extends DefaultHandler {
-
-        protected final Metadata metadata;
-
-        protected PrintWriter writer;
-
-        public NoDocumentJSONMetHandler(Metadata metadata, PrintWriter writer) {
-            this.metadata = metadata;
-            this.writer = writer;
-        }
-
-        @Override
-        public void endDocument() throws SAXException {
-            try {
-                JsonMetadata.setPrettyPrinting(prettyPrint);
-                JsonMetadata.toJson(metadata, writer);
-                writer.flush();
-            } catch (TikaException e) {
-                throw new SAXException(e);
-            }
-        }
+
+    /**
+     * Uses GSON to do the JSON escaping, but does
+     * the general JSON glueing ourselves.
+     */
+    private class NoDocumentJSONMetHandler extends NoDocumentMetHandler {
+        private NumberFormat formatter;
+        private Gson gson;
+
+        public NoDocumentJSONMetHandler(Metadata metadata, PrintWriter writer){
+            super(metadata, writer);
+
+            formatter = NumberFormat.getInstance();
+            gson = new Gson();
+        }
+
+        @Override
+        public void outputMetadata(String[] names) {
+            writer.print("{ ");
+            boolean first = true;
+            for (String name : names) {
+                if(! first) {
+                    writer.println(", ");
+                } else {
+                    first = false;
+                }
+                gson.toJson(name, writer);
+                writer.print(":");
+                outputValues(metadata.getValues(name));
+            }
+            writer.print(" }");
+        }
+
+        public void outputValues(String[] values) {
+            if(values.length > 1) {
+                writer.print("[");
+            }
+            for(int i=0; i<values.length; i++) {
+                String value = values[i];
+                if(i > 0) {
+                    writer.print(", ");
+                }
+
+                if(value == null || value.length() == 0) {
+                    writer.print("null");
+                } else {
+                    // Is it a number?
+                    ParsePosition pos = new ParsePosition(0);
+                    formatter.parse(value, pos);
+                    if(value.length() == pos.getIndex()) {
+                        // It's a number. Remove leading zeros and output
+                        value = value.replaceFirst("^0+(\\d)", "$1");
+                        writer.print(value);
+                    } else {
+                        // Not a number, escape it
+                        gson.toJson(value, writer);
+                    }
+                }
+            }
+            if(values.length > 1) {
+                writer.print("]");
+            }
+        }
     }
 }
diff --git a/tika-app/src/main/java/org/apache/tika/gui/TikaGUI.java b/tika-app/src/main/java/org/apache/tika/gui/TikaGUI.java
index f9a4e2d..87ce1e9 100644
--- a/tika-app/src/main/java/org/apache/tika/gui/TikaGUI.java
+++ b/tika-app/src/main/java/org/apache/tika/gui/TikaGUI.java
@@ -15,6 +15,28 @@
  * limitations under the License.
  */
 package org.apache.tika.gui;
+
+import java.awt.CardLayout;
+import java.awt.Color;
+import java.awt.Dimension;
+import java.awt.Toolkit;
+import java.awt.event.ActionEvent;
+import java.awt.event.ActionListener;
+import java.awt.event.KeyEvent;
+import java.awt.event.WindowEvent;
+import java.io.File;
+import java.io.FileOutputStream;
+import java.io.IOException;
+import java.io.InputStream;
+import java.io.PrintWriter;
+import java.io.StringWriter;
+import java.io.Writer;
+import java.net.MalformedURLException;
+import java.net.URL;
+import java.util.Arrays;
+import java.util.HashMap;
+import java.util.Map;
+import java.util.Set;
 
 import javax.swing.Box;
 import javax.swing.JDialog;
@@ -39,45 +61,18 @@
 import javax.xml.transform.sax.SAXTransformerFactory;
 import javax.xml.transform.sax.TransformerHandler;
 import javax.xml.transform.stream.StreamResult;
-import java.awt.CardLayout;
-import java.awt.Color;
-import java.awt.Dimension;
-import java.awt.Toolkit;
-import java.awt.event.ActionEvent;
-import java.awt.event.ActionListener;
-import java.awt.event.KeyEvent;
-import java.awt.event.WindowEvent;
-import java.io.File;
-import java.io.FileOutputStream;
-import java.io.IOException;
-import java.io.InputStream;
-import java.io.PrintWriter;
-import java.io.StringWriter;
-import java.io.Writer;
-import java.net.MalformedURLException;
-import java.net.URL;
-import java.util.Arrays;
-import java.util.HashMap;
-import java.util.Map;
-import java.util.Set;
-
-import org.apache.commons.io.IOUtils;
-import org.apache.tika.config.TikaConfig;
+
 import org.apache.tika.exception.TikaException;
 import org.apache.tika.extractor.DocumentSelector;
+import org.apache.tika.io.IOUtils;
 import org.apache.tika.io.TikaInputStream;
 import org.apache.tika.metadata.Metadata;
-import org.apache.tika.metadata.serialization.JsonMetadataList;
 import org.apache.tika.mime.MediaType;
 import org.apache.tika.parser.AbstractParser;
 import org.apache.tika.parser.AutoDetectParser;
-import org.apache.tika.parser.DigestingParser;
 import org.apache.tika.parser.ParseContext;
 import org.apache.tika.parser.Parser;
-import org.apache.tika.parser.RecursiveParserWrapper;
 import org.apache.tika.parser.html.BoilerpipeContentHandler;
-import org.apache.tika.parser.utils.CommonsDigester;
-import org.apache.tika.sax.BasicContentHandlerFactory;
 import org.apache.tika.sax.BodyContentHandler;
 import org.apache.tika.sax.ContentHandlerDecorator;
 import org.apache.tika.sax.TeeContentHandler;
@@ -87,8 +82,6 @@
 import org.xml.sax.SAXException;
 import org.xml.sax.helpers.AttributesImpl;
 
-import static java.nio.charset.StandardCharsets.UTF_8;
-
 /**
  * Simple Swing GUI for Apache Tika. You can drag and drop files on top
  * of the window to have them parsed.
@@ -96,9 +89,6 @@
 public class TikaGUI extends JFrame
         implements ActionListener, HyperlinkListener {
 
-    //maximum length to allow for mark for reparse to get JSON
-    private static final int MAX_MARK = 20*1024*1024;//20MB
-
     /**
      * Serial version UID.
      */
@@ -113,21 +103,10 @@
      * @throws Exception if an error occurs
      */
     public static void main(String[] args) throws Exception {
-        TikaConfig config = TikaConfig.getDefaultConfig();
-        if (args.length > 0) {
-            File configFile = new File(args[0]);
-            config = new TikaConfig(configFile);
-        }
         UIManager.setLookAndFeel(UIManager.getSystemLookAndFeelClassName());
-        final TikaConfig finalConfig = config;
         SwingUtilities.invokeLater(new Runnable() {
             public void run() {
-                new TikaGUI(new DigestingParser(
-                        new AutoDetectParser(finalConfig),
-                        new CommonsDigester(MAX_MARK,
-                                CommonsDigester.DigestAlgorithm.MD5,
-                                CommonsDigester.DigestAlgorithm.SHA256)
-                )).setVisible(true);
+                new TikaGUI(new AutoDetectParser()).setVisible(true);
             }
         });
     }
@@ -176,11 +155,6 @@
      * Raw XHTML source.
      */
     private final JEditorPane xml;
-
-    /**
-     * Raw JSON source.
-     */
-    private final JEditorPane json;
 
     /**
      * Document metadata.
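The TikaGUI hunks above and below all revolve around one Swing pattern: each output format ("text", "xhtml", "metadata", and the "json" card this patch removes) is registered as a named card, and menu actions call `layout.show(cards, command)` to flip views. A minimal sketch of that pattern, with illustrative names that are not Tika's API:

```java
// Sketch of CardLayout view switching as used in TikaGUI (names made up here).
import java.awt.CardLayout;
import javax.swing.JLabel;
import javax.swing.JPanel;

public class CardSwitchDemo {
    final CardLayout layout = new CardLayout();
    final JPanel cards = new JPanel(layout);
    final JLabel text = new JLabel("plain text view");
    final JLabel xhtml = new JLabel("structured text view");

    public CardSwitchDemo() {
        cards.add(text, "text");    // first card added starts out visible
        cards.add(xhtml, "xhtml");  // later cards start out hidden
    }

    // What a menu action such as "Structured text" ultimately triggers.
    public void show(String command) {
        layout.show(cards, command);
    }

    public static void main(String[] args) {
        CardSwitchDemo demo = new CardSwitchDemo();
        demo.show("xhtml");
        // CardLayout toggles component visibility: now only "xhtml" is visible
        System.out.println(demo.text.isVisible() + " " + demo.xhtml.isVisible());
    }
}
```

Removing a view, as this patch does for the "json" card, is then just deleting the `addCard`/`addMenuItem` call and the matching `actionPerformed` branch.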
@@ -205,7 +179,6 @@
         text = addCard(cards, "text/plain", "text");
         textMain = addCard(cards, "text/plain", "main");
         xml = addCard(cards, "text/plain", "xhtml");
-        json = addCard(cards, "text/plain", "json");
         add(cards);
         layout.show(cards, "welcome");
 
@@ -238,7 +211,6 @@
         addMenuItem(view, "Plain text", "text", KeyEvent.VK_P);
         addMenuItem(view, "Main content", "main", KeyEvent.VK_C);
         addMenuItem(view, "Structured text", "xhtml", KeyEvent.VK_S);
-        addMenuItem(view, "Recursive JSON", "json", KeyEvent.VK_J);
         bar.add(view);
 
         bar.add(Box.createHorizontalGlue());
@@ -289,8 +261,6 @@
             layout.show(cards, command);
         } else if ("metadata".equals(command)) {
             layout.show(cards, command);
-        } else if ("json".equals(command)) {
-            layout.show(cards, command);
         } else if ("about".equals(command)) {
             textDialog(
                     "About Apache Tika",
@@ -304,8 +274,11 @@
     public void openFile(File file) {
         try {
             Metadata metadata = new Metadata();
-            try (TikaInputStream stream = TikaInputStream.get(file, metadata)) {
+            TikaInputStream stream = TikaInputStream.get(file, metadata);
+            try {
                 handleStream(stream, metadata);
+            } finally {
+                stream.close();
             }
         } catch (Throwable t) {
             handleError(file.getPath(), t);
@@ -315,8 +288,11 @@
     public void openURL(URL url) {
         try {
             Metadata metadata = new Metadata();
-            try (TikaInputStream stream = TikaInputStream.get(url, metadata)) {
+            TikaInputStream stream = TikaInputStream.get(url, metadata);
+            try {
                 handleStream(stream, metadata);
+            } finally {
+                stream.close();
             }
         } catch (Throwable t) {
             handleError(url.toString(), t);
@@ -339,21 +315,8 @@
         context.set(DocumentSelector.class, new ImageDocumentSelector());
 
-        input = TikaInputStream.get(new ProgressMonitorInputStream(
-                this, "Parsing stream", input));
-
-        if (input.markSupported()) {
-            int mark = -1;
-            if (input instanceof TikaInputStream) {
-                if (((TikaInputStream)input).hasFile()) {
-                    mark = (int)((TikaInputStream)input).getLength();
-                }
-            }
-            if (mark == -1) {
-                mark = MAX_MARK;
-            }
-            input.mark(mark);
-        }
+        input = new ProgressMonitorInputStream(
+                this, "Parsing stream", input);
         parser.parse(input, handler, md, context);
 
         String[] names = md.names();
@@ -377,31 +340,6 @@
         setText(text, textBuffer.toString());
         setText(textMain, textMainBuffer.toString());
         setText(html, htmlBuffer.toString());
-        if (!input.markSupported()) {
-            setText(json, "InputStream does not support mark/reset for Recursive Parsing");
-            layout.show(cards, "metadata");
-            return;
-        }
-        boolean isReset = false;
-        try {
-            input.reset();
-            isReset = true;
-        } catch (IOException e) {
-            setText(json, "Error during stream reset.\n"+
-                    "There's a limit of "+MAX_MARK + " bytes for this type of processing in the GUI.\n"+
-                    "Try the app with command line argument of -J."
-            );
-        }
-        if (isReset) {
-            RecursiveParserWrapper wrapper = new RecursiveParserWrapper(parser,
-                    new BasicContentHandlerFactory(
-                            BasicContentHandlerFactory.HANDLER_TYPE.BODY, -1));
-            wrapper.parse(input, null, new Metadata(), new ParseContext());
-            StringWriter jsonBuffer = new StringWriter();
-            JsonMetadataList.setPrettyPrinting(true);
-            JsonMetadataList.toJson(wrapper.getMetadata(), jsonBuffer);
-            setText(json, jsonBuffer.toString());
-        }
         layout.show(cards, "metadata");
     }
 
@@ -471,9 +409,13 @@
         if (e.getEventType() == EventType.ACTIVATED) {
             try {
                 URL url = e.getURL();
-                try (InputStream stream = url.openStream()) {
+                InputStream stream = url.openStream();
+                try {
+                    StringWriter writer = new StringWriter();
+                    IOUtils.copy(stream, writer, "UTF-8");
+
                     JEditorPane editor =
-                            new JEditorPane("text/plain", IOUtils.toString(stream, UTF_8));
+                            new JEditorPane("text/plain", writer.toString());
                     editor.setEditable(false);
                     editor.setBackground(Color.WHITE);
                     editor.setCaretPosition(0);
@@ -486,6 +428,8 @@
                     dialog.add(new JScrollPane(editor));
                     dialog.pack();
                     dialog.setVisible(true);
+                } finally {
+                    stream.close();
                 }
             } catch (IOException exception) {
                 exception.printStackTrace();
diff --git a/tika-app/src/main/resources/log4j.properties b/tika-app/src/main/resources/log4j.properties
deleted file mode 100644
index 7d3b372..0000000
--- a/tika-app/src/main/resources/log4j.properties
+++ /dev/null
@@ -1,24 +0,0 @@
-# Licensed to the Apache Software Foundation (ASF) under one or more
-# contributor license agreements. See the NOTICE file distributed with
-# this work for additional information regarding copyright ownership.
-# The ASF licenses this file to You under the Apache License, Version 2.0
-# (the "License"); you may not use this file except in compliance with
-# the License. You may obtain a copy of the License at
-#
-# http://www.apache.org/licenses/LICENSE-2.0
-#
-# Unless required by applicable law or agreed to in writing, software
-# distributed under the License is distributed on an "AS IS" BASIS,
-# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
-# See the License for the specific language governing permissions and
-# limitations under the License.
-
-#info,debug, error,fatal ...
-log4j.rootLogger=info,stderr
-
-#console
-log4j.appender.stderr=org.apache.log4j.ConsoleAppender
-log4j.appender.stderr.layout=org.apache.log4j.PatternLayout
-log4j.appender.stderr.Target=System.err
-
-log4j.appender.stderr.layout.ConversionPattern= %-5p %m%n
diff --git a/tika-app/src/main/resources/log4j_batch_process.properties b/tika-app/src/main/resources/log4j_batch_process.properties
deleted file mode 100644
index 9fc74fd..0000000
--- a/tika-app/src/main/resources/log4j_batch_process.properties
+++ /dev/null
@@ -1,24 +0,0 @@
-# Licensed to the Apache Software Foundation (ASF) under one or more
-# contributor license agreements. See the NOTICE file distributed with
-# this work for additional information regarding copyright ownership.
-# The ASF licenses this file to You under the Apache License, Version 2.0
-# (the "License"); you may not use this file except in compliance with
-# the License. You may obtain a copy of the License at
-#
-# http://www.apache.org/licenses/LICENSE-2.0
-#
-# Unless required by applicable law or agreed to in writing, software
-# distributed under the License is distributed on an "AS IS" BASIS,
-# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
-# See the License for the specific language governing permissions and
-# limitations under the License.
-
-#info,debug, error,fatal ...
-log4j.rootLogger=info,stdout
-
-#console
-log4j.appender.stdout=org.apache.log4j.ConsoleAppender
-log4j.appender.stdout.layout=org.apache.log4j.PatternLayout
-
-
-log4j.appender.stdout.layout.ConversionPattern=%m%n
diff --git a/tika-app/src/main/resources/tika-app-batch-config.xml b/tika-app/src/main/resources/tika-app-batch-config.xml
deleted file mode 100644
index e2f1204..0000000
--- a/tika-app/src/main/resources/tika-app-batch-config.xml
+++ /dev/null
@@ -1,136 +0,0 @@
-
-
-
-
-
-
-
-
-
-
-
-
-
-
-
-
-
-
-
-
-
-
-
-
-
-
-
-
-
\ No newline at end of file
diff --git a/tika-app/src/test/java/org/apache/tika/cli/TikaCLIBatchCommandLineTest.java b/tika-app/src/test/java/org/apache/tika/cli/TikaCLIBatchCommandLineTest.java
deleted file mode 100644
index 260273e..0000000
--- a/tika-app/src/test/java/org/apache/tika/cli/TikaCLIBatchCommandLineTest.java
+++ /dev/null
@@ -1,207 +0,0 @@
-/*
- * Licensed to the Apache Software Foundation (ASF) under one or more
- * contributor license agreements. See the NOTICE file distributed with
- * this work for additional information regarding copyright ownership.
- * The ASF licenses this file to You under the Apache License, Version 2.0
- * (the "License"); you may not use this file except in compliance with
- * the License. You may obtain a copy of the License at
- *
- * http://www.apache.org/licenses/LICENSE-2.0
- *
- * Unless required by applicable law or agreed to in writing, software
- * distributed under the License is distributed on an "AS IS" BASIS,
- * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
- * See the License for the specific language governing permissions and
- * limitations under the License.
- */
-package org.apache.tika.cli;
-
-import static java.nio.charset.StandardCharsets.UTF_8;
-import static org.junit.Assert.assertEquals;
-import static org.junit.Assert.assertTrue;
-import static org.junit.Assert.fail;
-
-import java.io.IOException;
-import java.io.OutputStream;
-import java.nio.file.Files;
-import java.nio.file.Path;
-import java.nio.file.Paths;
-import java.util.LinkedHashMap;
-import java.util.Map;
-
-import org.apache.commons.io.FileUtils;
-import org.apache.commons.io.IOUtils;
-import org.junit.After;
-import org.junit.Before;
-import org.junit.Test;
-
-public class TikaCLIBatchCommandLineTest {
-
-    Path testInput = null;
-    Path testFile = null;
-
-    String testInputPathForCommandLine;
-
-    @Before
-    public void init() {
-        testInput = Paths.get("testInput");
-        try {
-            Files.createDirectories(testInput);
-        } catch (IOException e) {
-            throw new RuntimeException("Failed to open test input directory");
-        }
-        testFile = Paths.get("testFile.txt");
-        try (OutputStream os = Files.newOutputStream(testFile)) {
-            IOUtils.write("test output", os, UTF_8);
-        } catch (IOException e) {
-            throw new RuntimeException("Couldn't open testFile");
-        }
-        testInputPathForCommandLine = testInput.toAbsolutePath().toString();
-    }
-
-    @After
-    public void tearDown() {
-        try {
-            //TODO: refactor this to use our FileUtils.deleteDirectory(Path)
-            //when that is ready
-            FileUtils.deleteDirectory(testInput.toFile());
-        } catch (IOException e) {
-            throw new RuntimeException(e);
-        } finally {
-            try {
-                Files.deleteIfExists(testFile);
-            } catch (IOException e) {
-                throw new RuntimeException(e);
-            }
-
-        }
-    }
-
-    @Test
-    public void testJVMOpts() throws Exception {
-        String[] params = {"-JXmx1g", "-JDlog4j.configuration=batch_process_log4j.xml", "-inputDir",
-                testInputPathForCommandLine, "-outputDir", "testout-output"};
-
-
-        String[] commandLine = BatchCommandLineBuilder.build(params);
-        StringBuilder sb = new StringBuilder();
-
-        for (String s : commandLine) {
-            sb.append(s).append(" ");
-        }
-        String s = sb.toString();
-        int classInd = s.indexOf("org.apache.tika.batch.fs.FSBatchProcessCLI");
-        int xmx = s.indexOf("-Xmx1g");
-        int inputDir = s.indexOf("-inputDir");
-        int log = s.indexOf("-Dlog4j.configuration");
-        assertTrue(classInd > -1);
-        assertTrue(xmx > -1);
-        assertTrue(inputDir > -1);
-        assertTrue(log > -1);
-        assertTrue(xmx < classInd);
-        assertTrue(log < classInd);
-        assertTrue(inputDir > classInd);
-    }
-
-    @Test
-    public void testBasicMappingOfArgs() throws Exception {
-        String[] params = {"-JXmx1g", "-JDlog4j.configuration=batch_process_log4j.xml",
-                "-bc", "batch-config.xml",
-                "-J", "-h", "-inputDir", testInputPathForCommandLine};
-
-        String[] commandLine = BatchCommandLineBuilder.build(params);
-        Map<String, String> attrs = mapify(commandLine);
-        assertEquals("true", attrs.get("-recursiveParserWrapper"));
-        assertEquals("html", attrs.get("-basicHandlerType"));
-        assertEquals("json", attrs.get("-outputSuffix"));
-        assertEquals("batch-config.xml", attrs.get("-bc"));
-        assertEquals(testInputPathForCommandLine, attrs.get("-inputDir"));
-    }
-
-    @Test
-    public void testTwoDirsNoFlags() throws Exception {
-        String outputRoot = "outputRoot";
-
-        String[] params = {testInputPathForCommandLine, outputRoot};
-
-        String[] commandLine = BatchCommandLineBuilder.build(params);
-        Map<String, String> attrs = mapify(commandLine);
-        assertEquals(testInputPathForCommandLine, attrs.get("-inputDir"));
-        assertEquals(outputRoot, attrs.get("-outputDir"));
-    }
-
-    @Test
-    public void testTwoDirsVarious() throws Exception {
-        String outputRoot = "outputRoot";
-        String[] params = {"-i", testInputPathForCommandLine, "-o", outputRoot};
-
-        String[] commandLine = BatchCommandLineBuilder.build(params);
-        Map<String, String> attrs = mapify(commandLine);
-        assertEquals(testInputPathForCommandLine, attrs.get("-inputDir"));
-        assertEquals(outputRoot, attrs.get("-outputDir"));
-
-        params = new String[]{"--inputDir", testInputPathForCommandLine, "--outputDir", outputRoot};
-
-        commandLine = BatchCommandLineBuilder.build(params);
-        attrs = mapify(commandLine);
-        assertEquals(testInputPathForCommandLine, attrs.get("-inputDir"));
-        assertEquals(outputRoot, attrs.get("-outputDir"));
-
-        params = new String[]{"-inputDir", testInputPathForCommandLine, "-outputDir", outputRoot};
-
-        commandLine = BatchCommandLineBuilder.build(params);
-        attrs = mapify(commandLine);
-        assertEquals(testInputPathForCommandLine, attrs.get("-inputDir"));
-        assertEquals(outputRoot, attrs.get("-outputDir"));
-    }
-
-    @Test
-    public void testConfig() throws Exception {
-        String outputRoot = "outputRoot";
-        String configPath = "c:/somewhere/someConfig.xml";
-
-        String[] params = {"--inputDir", testInputPathForCommandLine, "--outputDir", outputRoot,
-                "--config="+configPath};
-        String[] commandLine = BatchCommandLineBuilder.build(params);
-        Map<String, String> attrs = mapify(commandLine);
-        assertEquals(testInputPathForCommandLine, attrs.get("-inputDir"));
-        assertEquals(outputRoot, attrs.get("-outputDir"));
-        assertEquals(configPath, attrs.get("-c"));
-
-    }
-
-    @Test
-    public void testOneDirOneFileException() throws Exception {
-        boolean ex = false;
-        try {
-            String path = testFile.toAbsolutePath().toString();
-            if (path.contains(" ")) {
-                path = "\"" + path + "\"";
-            }
-            String[] params = {testInputPathForCommandLine, path};
-
-            String[] commandLine = BatchCommandLineBuilder.build(params);
-            fail("Not allowed to have one dir and one file");
-        } catch (IllegalArgumentException e) {
-            ex = true;
-        }
-        assertTrue("exception on ", ex);
-    }
-
-    private Map<String, String> mapify(String[] args) {
-        Map<String, String> map = new LinkedHashMap<>();
-        for (int i = 0; i < args.length; i++) {
-            if (args[i].startsWith("-")) {
-                String k = args[i];
-                String v = "";
-                if (i < args.length - 1 && !args[i + 1].startsWith("-")) {
-                    v = args[i + 1];
-                    i++;
-                }
-                map.put(k, v);
-            }
-        }
-        return map;
-    }
-
-}
diff --git a/tika-app/src/test/java/org/apache/tika/cli/TikaCLIBatchIntegrationTest.java b/tika-app/src/test/java/org/apache/tika/cli/TikaCLIBatchIntegrationTest.java
deleted file mode 100644
index 1da91db..0000000
--- a/tika-app/src/test/java/org/apache/tika/cli/TikaCLIBatchIntegrationTest.java
+++ /dev/null
@@ -1,174 +0,0 @@
-/*
- * Licensed to the Apache Software Foundation (ASF) under one or more
- * contributor license agreements. See the NOTICE file distributed with
- * this work for additional information regarding copyright ownership.
- * The ASF licenses this file to You under the Apache License, Version 2.0
- * (the "License"); you may not use this file except in compliance with
- * the License. You may obtain a copy of the License at
- *
- * http://www.apache.org/licenses/LICENSE-2.0
- *
- * Unless required by applicable law or agreed to in writing, software
- * distributed under the License is distributed on an "AS IS" BASIS,
- * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
- * See the License for the specific language governing permissions and
- * limitations under the License.
- */
-
-package org.apache.tika.cli;
-
-import static java.nio.charset.StandardCharsets.UTF_8;
-import static org.junit.Assert.assertEquals;
-import static org.junit.Assert.assertNotNull;
-import static org.junit.Assert.assertTrue;
-
-import java.io.ByteArrayOutputStream;
-import java.io.OutputStream;
-import java.io.PrintStream;
-import java.io.Reader;
-import java.nio.file.Files;
-import java.nio.file.Path;
-import java.nio.file.Paths;
-import java.util.List;
-
-import org.apache.commons.io.FileUtils;
-import org.apache.tika.metadata.Metadata;
-import org.apache.tika.metadata.serialization.JsonMetadataList;
-import org.apache.tika.parser.RecursiveParserWrapper;
-import org.junit.After;
-import org.junit.Before;
-import org.junit.Test;
-
-public class TikaCLIBatchIntegrationTest {
-
-    private Path testInputDir = Paths.get("src/test/resources/test-data");
-    private String testInputDirForCommandLine;
-    private Path tempOutputDir;
-    private String tempOutputDirForCommandLine;
-    private OutputStream out = null;
-    private OutputStream err = null;
-    private ByteArrayOutputStream outBuffer = null;
-
-    @Before
-    public void setup() throws Exception {
-        tempOutputDir = Files.createTempDirectory("tika-cli-test-batch-");
-        outBuffer = new ByteArrayOutputStream();
-        PrintStream outWriter = new PrintStream(outBuffer, true, UTF_8.name());
-        ByteArrayOutputStream errBuffer = new ByteArrayOutputStream();
-        PrintStream errWriter = new PrintStream(errBuffer, true, UTF_8.name());
-        out = System.out;
-        err = System.err;
-        System.setOut(outWriter);
-        System.setErr(errWriter);
-        testInputDirForCommandLine = testInputDir.toAbsolutePath().toString();
-        tempOutputDirForCommandLine = tempOutputDir.toAbsolutePath().toString();
-    }
-
-    @After
-    public void tearDown() throws Exception {
-        System.setOut(new PrintStream(out, true, UTF_8.name()));
-        System.setErr(new PrintStream(err, true, UTF_8.name()));
-        //TODO: refactor to use our deleteDirectory with straight path
-        FileUtils.deleteDirectory(tempOutputDir.toFile());
-    }
-
-    @Test
-    public void testSimplestBatchIntegration() throws Exception {
-        String[] params = {testInputDirForCommandLine,
-                tempOutputDirForCommandLine};
-        TikaCLI.main(params);
-
-        assertFileExists(tempOutputDir.resolve("bad_xml.xml.xml"));
-        assertFileExists(tempOutputDir.resolve("coffee.xls.xml"));
-    }
-
-    @Test
-    public void testBasicBatchIntegration() throws Exception {
-        String[] params = {"-i", testInputDirForCommandLine,
-                "-o", tempOutputDirForCommandLine,
-                "-numConsumers", "2"
-        };
-        TikaCLI.main(params);
-
-        assertFileExists(tempOutputDir.resolve("bad_xml.xml.xml"));
-        assertFileExists(tempOutputDir.resolve("coffee.xls.xml"));
-    }
-
-    @Test
-    public void testJsonRecursiveBatchIntegration() throws Exception {
-        String[] params = {"-i", testInputDirForCommandLine,
-                "-o", tempOutputDirForCommandLine,
-                "-numConsumers", "10",
-                "-J", //recursive Json
-                "-t" //plain text in content
-        };
-        TikaCLI.main(params);
-
-        Path jsonFile = tempOutputDir.resolve("test_recursive_embedded.docx.json");
-        try (Reader reader = Files.newBufferedReader(jsonFile, UTF_8)) {
-            List<Metadata> metadataList = JsonMetadataList.fromJson(reader);
-            assertEquals(12, metadataList.size());
-            assertTrue(metadataList.get(6).get(RecursiveParserWrapper.TIKA_CONTENT).contains("human events"));
-        }
-    }
-
-    @Test
-    public void testProcessLogFileConfig() throws Exception {
-        String[] params = {"-i", testInputDirForCommandLine,
-                "-o", tempOutputDirForCommandLine,
-                "-numConsumers", "2",
-                "-JDlog4j.configuration=log4j_batch_process_test.properties"};
-        TikaCLI.main(params);
-
-        assertFileExists(tempOutputDir.resolve("bad_xml.xml.xml"));
-        assertFileExists(tempOutputDir.resolve("coffee.xls.xml"));
-        String sysOutString = new String(outBuffer.toByteArray(), UTF_8);
-        assertTrue(sysOutString.contains("MY_CUSTOM_LOG_CONFIG"));
-    }
-
-    @Test
-    public void testDigester() throws Exception {
-/*
-        try {
-            String[] params = {"-i", escape(testDataFile.getAbsolutePath()),
-                    "-o", escape(tempOutputDir.getAbsolutePath()),
-                    "-numConsumers", "10",
-                    "-J", //recursive Json
-                    "-t" //plain text in content
-            };
-            TikaCLI.main(params);
-            reader = new InputStreamReader(
-                    new FileInputStream(new File(tempOutputDir, "test_recursive_embedded.docx.json")), UTF_8);
-            List<Metadata> metadataList = JsonMetadataList.fromJson(reader);
-            assertEquals(12, metadataList.size());
-            assertEquals("59f626e09a8c16ab6dbc2800c685f772", metadataList.get(0).get("X-TIKA:digest:MD5"));
-            assertEquals("22e6e91f408d018417cd452d6de3dede", metadataList.get(5).get("X-TIKA:digest:MD5"));
-        } finally {
-            IOUtils.closeQuietly(reader);
-        }
-*/
-        String[] params = {"-i", testInputDirForCommandLine,
-                "-o", tempOutputDirForCommandLine,
-                "-numConsumers", "10",
-                "-J", //recursive Json
-                "-t", //plain text in content
-                "-digest", "sha512"
-        };
-        TikaCLI.main(params);
-        Path jsonFile = tempOutputDir.resolve("test_recursive_embedded.docx.json");
-        try (Reader reader = Files.newBufferedReader(jsonFile, UTF_8)) {
-
-            List<Metadata> metadataList = JsonMetadataList.fromJson(reader);
-            assertEquals(12, metadataList.size());
-            assertNotNull(metadataList.get(0).get("X-TIKA:digest:SHA512"));
-            assertTrue(metadataList.get(0).get("X-TIKA:digest:SHA512").startsWith("ee46d973ee1852c01858"));
-        }
-    }
-
-    private void assertFileExists(Path path) {
-        assertTrue("File doesn't exist: "+path.toAbsolutePath(),
-                Files.isRegularFile(path));
-    }
-
-
-}
diff --git a/tika-app/src/test/java/org/apache/tika/cli/TikaCLITest.java b/tika-app/src/test/java/org/apache/tika/cli/TikaCLITest.java
index f9d5a5d..d458d6d 100644
--- a/tika-app/src/test/java/org/apache/tika/cli/TikaCLITest.java
+++ b/tika-app/src/test/java/org/apache/tika/cli/TikaCLITest.java
@@ -16,18 +16,15 @@
  */
 package org.apache.tika.cli;
 
-import static java.nio.charset.StandardCharsets.UTF_8;
-import static org.junit.Assert.assertFalse;
-import static org.junit.Assert.assertTrue;
-
 import java.io.ByteArrayOutputStream;
import java.io.File; import java.io.PrintStream; import java.net.URI; import org.apache.commons.io.FileUtils; -import org.apache.tika.exception.TikaException; + import org.junit.After; +import static org.junit.Assert.assertTrue; import org.junit.Before; import org.junit.Test; @@ -40,17 +37,15 @@ private File profile = null; private ByteArrayOutputStream outContent = null; private PrintStream stdout = null; - private File testDataFile = new File("src/test/resources/test-data"); - private URI testDataURI = testDataFile.toURI(); - private String resourcePrefix; + private URI testDataURI = new File("src/test/resources/test-data/").toURI(); + private String resourcePrefix = testDataURI.toString(); @Before public void setUp() throws Exception { profile = new File("welsh.ngp"); outContent = new ByteArrayOutputStream(); - resourcePrefix = testDataURI.toString(); stdout = System.out; - System.setOut(new PrintStream(outContent, true, UTF_8.name())); + System.setOut(new PrintStream(outContent)); } /** @@ -74,7 +69,7 @@ public void testListParserDetail() throws Exception{ String[] params = {"--list-parser-detail"}; TikaCLI.main(params); - assertTrue(outContent.toString(UTF_8.name()).contains("application/vnd.oasis.opendocument.text-web")); + assertTrue(outContent.toString().contains("application/vnd.oasis.opendocument.text-web")); } /** @@ -87,7 +82,7 @@ String[] params = {"--list-parser"}; TikaCLI.main(params); //Assert was commented temporarily for finding the problem - // Assert.assertTrue(outContent != null && outContent.toString("UTF-8").contains("org.apache.tika.parser.iwork.IWorkPackageParser")); + // Assert.assertTrue(outContent != null && outContent.toString().contains("org.apache.tika.parser.iwork.IWorkPackageParser")); } /** @@ -99,13 +94,7 @@ public void testXMLOutput() throws Exception{ String[] params = {"-x", resourcePrefix + "alice.cli.test"}; TikaCLI.main(params); - assertTrue(outContent.toString(UTF_8.name()).contains("?xml version=\"1.0\" 
encoding=\"UTF-8\"?")); - - params = new String[]{"-x", "--digest=SHA256", resourcePrefix + "alice.cli.test"}; - TikaCLI.main(params); - assertTrue(outContent.toString(UTF_8.name()) - .contains("<meta name=\"X-TIKA:digest:SHA256\"")); - assertTrue("Expanded <title></title> element should be present", - outContent.toString(UTF_8.name()).contains("<title></title>")); - - params = new String[]{"-h", "--digest=SHA384", resourcePrefix + "alice.cli.test"}; - TikaCLI.main(params); - assertTrue(outContent.toString("UTF-8") - .contains("<meta name=\"X-TIKA:digest:SHA384\"")); } /** @@ -136,7 +120,7 @@ public void testTextOutput() throws Exception{ String[] params = {"-t", resourcePrefix + "alice.cli.test"}; TikaCLI.main(params); - assertTrue(outContent.toString(UTF_8.name()).contains("finished off the cake")); + assertTrue(outContent.toString().contains("finished off the cake")); } /** @@ -147,58 +131,7 @@ public void testMetadataOutput() throws Exception{ String[] params = {"-m", resourcePrefix + "alice.cli.test"}; TikaCLI.main(params); - assertTrue(outContent.toString(UTF_8.name()).contains("text/plain")); - - params = new String[]{"-m", "--digest=SHA512", resourcePrefix + "alice.cli.test"}; - TikaCLI.main(params); - assertTrue(outContent.toString(UTF_8.name()).contains("text/plain")); - assertTrue(outContent.toString(UTF_8.name()) - .contains("X-TIKA:digest:SHA512: dd459d99bc19ff78fd31fbae46e0")); - } - - /** - * Basic tests for -json option - * - * @throws Exception - */ - @Test - public void testJsonMetadataOutput() throws Exception { - String[] params = {"--json", "--digest=MD2", resourcePrefix + "testJsonMultipleInts.html"}; - TikaCLI.main(params); - String json = outContent.toString(UTF_8.name()); - //TIKA-1310 - assertTrue(json.contains("\"fb:admins\":\"1,2,3,4\",")); - - //test legacy alphabetic sort of keys - int enc = json.indexOf("\"Content-Encoding\""); - int fb = json.indexOf("fb:admins"); - int title = json.indexOf("\"title\""); - assertTrue(enc > -1 && fb > -1 && enc < fb); - assertTrue (fb > -1 && title > -1 && fb < title); - assertTrue(json.contains("\"X-TIKA:digest:MD2\":")); - } 
- - /** - * Test for -json with prettyprint option - * - * @throws Exception - */ - @Test - public void testJsonMetadataPrettyPrintOutput() throws Exception { - String[] params = {"--json", "-r", resourcePrefix + "testJsonMultipleInts.html"}; - TikaCLI.main(params); - String json = outContent.toString(UTF_8.name()); - - assertTrue(json.contains(" \"X-Parsed-By\": [\n" + - " \"org.apache.tika.parser.DefaultParser\",\n" + - " \"org.apache.tika.parser.html.HtmlParser\"\n" + - " ],\n")); - //test legacy alphabetic sort of keys - int enc = json.indexOf("\"Content-Encoding\""); - int fb = json.indexOf("fb:admins"); - int title = json.indexOf("\"title\""); - assertTrue(enc > -1 && fb > -1 && enc < fb); - assertTrue (fb > -1 && title > -1 && fb < title); + assertTrue(outContent.toString().contains("text/plain")); } /** @@ -210,7 +143,7 @@ public void testLanguageOutput() throws Exception{ String[] params = {"-l", resourcePrefix + "alice.cli.test"}; TikaCLI.main(params); - assertTrue(outContent.toString(UTF_8.name()).contains("en")); + assertTrue(outContent.toString().contains("en")); } /** @@ -222,7 +155,7 @@ public void testDetectOutput() throws Exception{ String[] params = {"-d", resourcePrefix + "alice.cli.test"}; TikaCLI.main(params); - assertTrue(outContent.toString(UTF_8.name()).contains("text/plain")); + assertTrue(outContent.toString().contains("text/plain")); } /** @@ -234,7 +167,7 @@ public void testListMetModels() throws Exception{ String[] params = {"--list-met-models", resourcePrefix + "alice.cli.test"}; TikaCLI.main(params); - assertTrue(outContent.toString(UTF_8.name()).contains("text/plain")); + assertTrue(outContent.toString().contains("text/plain")); } /** @@ -246,7 +179,7 @@ public void testListSupportedTypes() throws Exception{ String[] params = {"--list-supported-types", resourcePrefix + "alice.cli.test"}; TikaCLI.main(params); - assertTrue(outContent.toString(UTF_8.name()).contains("supertype: application/octet-stream")); + 
assertTrue(outContent.toString().contains("supertype: application/octet-stream")); } /** @@ -271,47 +204,24 @@ TikaCLI.main(params); - StringBuffer allFiles = new StringBuffer(); - for (String f : tempFile.list()) { - if (allFiles.length() > 0) allFiles.append(" : "); - allFiles.append(f); - } - // ChemDraw file - File expectedCDX = new File(tempFile, "MBD002B040A.cdx"); - // Image of the ChemDraw molecule - File expectedIMG = new File(tempFile, "file4.png"); + File expected1 = new File(tempFile, "MBD002B040A.cdx"); // OLE10Native - File expectedOLE10 = new File(tempFile, "MBD002B0FA6_file5.bin"); - // Something that really isnt a text file... Not sure what it is??? - File expected262FE3 = new File(tempFile, "MBD00262FE3.txt"); + File expected2 = new File(tempFile, "MBD002B0FA6_file5"); // Image of one of the embedded resources - File expectedEMF = new File(tempFile, "file0.emf"); - - assertExtracted(expectedCDX, allFiles.toString()); - assertExtracted(expectedIMG, allFiles.toString()); - assertExtracted(expectedOLE10, allFiles.toString()); - assertExtracted(expected262FE3, allFiles.toString()); - assertExtracted(expectedEMF, allFiles.toString()); + File expected3 = new File(tempFile, "file0.emf"); + + assertTrue(expected1.exists()); + assertTrue(expected2.exists()); + assertTrue(expected3.exists()); + + assertTrue(expected1.length()>0); + assertTrue(expected2.length()>0); + assertTrue(expected3.length()>0); } finally { FileUtils.deleteDirectory(tempFile); } - } - protected static void assertExtracted(File f, String allFiles) { - - assertTrue( - "File " + f.getName() + " not found in " + allFiles, - f.exists() - ); - - assertFalse( - "File " + f.getName() + " is a directory!", f.isDirectory() - ); - - assertTrue( - "File " + f.getName() + " wasn't extracted with contents", - f.length() > 0 - ); + } // TIKA-920 @@ -319,7 +229,7 @@ public void testMultiValuedMetadata() throws Exception { String[] params = {"-m", resourcePrefix + "testMultipleSheets.numbers"}; 
TikaCLI.main(params); - String content = outContent.toString(UTF_8.name()); + String content = outContent.toString(); assertTrue(content.contains("sheetNames: Checking")); assertTrue(content.contains("sheetNames: Secon sheet")); assertTrue(content.contains("sheetNames: Logical Sheet 3")); @@ -333,77 +243,10 @@ new File("subdir/foo.txt").delete(); new File("subdir").delete(); TikaCLI.main(params); - String content = outContent.toString(UTF_8.name()); + String content = outContent.toString(); assertTrue(content.contains("Extracting 'subdir/foo.txt'")); // clean up. TODO: These should be in target. new File("target/subdir/foo.txt").delete(); new File("target/subdir").delete(); } - - @Test - public void testDefaultConfigException() throws Exception { - //default xml parser will throw TikaException - //this and TestConfig() are broken into separate tests so that - //setUp and tearDown() are called each time - String[] params = {resourcePrefix + "bad_xml.xml"}; - boolean tikaEx = false; - try { - TikaCLI.main(params); - } catch (TikaException e) { - tikaEx = true; - } - assertTrue(tikaEx); - } - - @Test - public void testConfig() throws Exception { - String[] params = new String[]{"--config="+testDataFile.toString()+"/tika-config1.xml", resourcePrefix+"bad_xml.xml"}; - TikaCLI.main(params); - String content = outContent.toString(UTF_8.name()); - assertTrue(content.contains("apple")); - assertTrue(content.contains("org.apache.tika.parser.html.HtmlParser")); - } - - @Test - public void testJsonRecursiveMetadataParserMetadataOnly() throws Exception { - String[] params = new String[]{"-m", "-J", "-r", resourcePrefix+"test_recursive_embedded.docx"}; - TikaCLI.main(params); - String content = outContent.toString(UTF_8.name()); - assertTrue(content.contains("[\n" + - " {\n" + - " \"Application-Name\": \"Microsoft Office Word\",\n" + - " \"Application-Version\": \"15.0000\",\n" + - " \"Character Count\": \"28\",\n" + - " \"Character-Count-With-Spaces\": \"31\",")); - 
assertTrue(content.contains("\"X-TIKA:embedded_resource_path\": \"/embed1.zip\"")); - assertFalse(content.contains("X-TIKA:content")); - - } - - @Test - public void testJsonRecursiveMetadataParserDefault() throws Exception { - String[] params = new String[]{"-J", "-r", resourcePrefix+"test_recursive_embedded.docx"}; - TikaCLI.main(params); - String content = outContent.toString(UTF_8.name()); - assertTrue(content.contains("\"X-TIKA:content\": \"\\u003chtml xmlns\\u003d\\\"http://www.w3.org/1999/xhtml")); - } - - @Test - public void testJsonRecursiveMetadataParserText() throws Exception { - String[] params = new String[]{"-J", "-r", "-t", resourcePrefix+"test_recursive_embedded.docx"}; - TikaCLI.main(params); - String content = outContent.toString(UTF_8.name()); - assertTrue(content.contains("\\n\\nembed_4\\n")); - assertTrue(content.contains("\\n\\nembed_0")); - } - - @Test - public void testDigestInJson() throws Exception { - String[] params = new String[]{"-J", "-r", "-t", "--digest=MD5", resourcePrefix+"test_recursive_embedded.docx"}; - TikaCLI.main(params); - String content = outContent.toString(UTF_8.name()); - assertTrue(content.contains("\"X-TIKA:digest:MD5\": \"59f626e09a8c16ab6dbc2800c685f772\",")); - assertTrue(content.contains("\"X-TIKA:digest:MD5\": \"f9627095ef86c482e61d99f0cc1cf87d\"")); - } - } diff --git a/tika-batch/pom.xml b/tika-batch/pom.xml deleted file mode 100644 index f0ee2b9..0000000 --- a/tika-batch/pom.xml +++ /dev/null @@ -1,185 +0,0 @@ - - - - - - 4.0.0 - - - org.apache.tika - tika-parent - 1.11 - ../tika-parent/pom.xml - - - tika-batch - bundle - Apache Tika batch - http://tika.apache.org/ - - - 1.2 - - - - - ${project.groupId} - tika-core - ${project.version} - - - ${project.groupId} - tika-serialization - ${project.version} - - - org.apache.commons - commons-compress - ${commons.compress.version} - - - org.slf4j - slf4j-log4j12 - - - commons-cli - commons-cli - ${cli.version} - - - commons-io - commons-io - ${commons.io.version} - - 
- org.apache.tika - tika-core - ${project.version} - test-jar - test - - - org.apache.tika - tika-parsers - ${project.version} - test-jar - test - - - junit - junit - test - - - - - - - maven-remote-resources-plugin - 1.5 - - - - bundle - - - - - - **/*.xml - - - - - - org.apache.felix - maven-bundle-plugin - true - - - ${project.url} - - org.apache.tika.config.TikaActivator - - lazy - - - - - org.apache.rat - apache-rat-plugin - - - src/test/resources/org/apache/tika/** - src/test/resources/*.txt - - - - - org.apache.maven.plugins - maven-jar-plugin - - - - test-jar - - - - - - maven-failsafe-plugin - 2.10 - - - - ${project.build.directory}/${project.build.finalName}.jar - - - - - - - integration-test - verify - - - - - - - - - The Apache Software Foundation - http://www.apache.org - - - http://svn.apache.org/viewvc/tika/tags/1.11-rc1/tika-batch - scm:svn:http://svn.apache.org/repos/asf/tika/tags/1.11-rc1/tika-batch - scm:svn:https://svn.apache.org/repos/asf/tika/tags/1.11-rc1/tika-batch - - - JIRA - https://issues.apache.org/jira/browse/TIKA - - - Jenkins - https://builds.apache.org/job/Tika-trunk/ - - diff --git a/tika-batch/src/main/java/org/apache/tika/batch/AutoDetectParserFactory.java b/tika-batch/src/main/java/org/apache/tika/batch/AutoDetectParserFactory.java deleted file mode 100644 index c687f7c..0000000 --- a/tika-batch/src/main/java/org/apache/tika/batch/AutoDetectParserFactory.java +++ /dev/null @@ -1,35 +0,0 @@ -package org.apache.tika.batch; - -/* - * Licensed to the Apache Software Foundation (ASF) under one or more - * contributor license agreements. See the NOTICE file distributed with - * this work for additional information regarding copyright ownership. - * The ASF licenses this file to You under the Apache License, Version 2.0 - * (the "License"); you may not use this file except in compliance with - * the License. 
You may obtain a copy of the License at - * - * http://www.apache.org/licenses/LICENSE-2.0 - * - * Unless required by applicable law or agreed to in writing, software - * distributed under the License is distributed on an "AS IS" BASIS, - * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. - * See the License for the specific language governing permissions and - * limitations under the License. - */ - -import org.apache.tika.config.TikaConfig; -import org.apache.tika.parser.AutoDetectParser; -import org.apache.tika.parser.Parser; - -/** - * Simple class for AutoDetectParser - */ -public class AutoDetectParserFactory extends ParserFactory { - - @Override - public Parser getParser(TikaConfig config) { - return new AutoDetectParser(config); - } - - -} diff --git a/tika-batch/src/main/java/org/apache/tika/batch/BatchNoRestartError.java b/tika-batch/src/main/java/org/apache/tika/batch/BatchNoRestartError.java deleted file mode 100644 index 3c8c154..0000000 --- a/tika-batch/src/main/java/org/apache/tika/batch/BatchNoRestartError.java +++ /dev/null @@ -1,33 +0,0 @@ -package org.apache.tika.batch; -/* - * Licensed to the Apache Software Foundation (ASF) under one or more - * contributor license agreements. See the NOTICE file distributed with - * this work for additional information regarding copyright ownership. - * The ASF licenses this file to You under the Apache License, Version 2.0 - * (the "License"); you may not use this file except in compliance with - * the License. You may obtain a copy of the License at - * - * http://www.apache.org/licenses/LICENSE-2.0 - * - * Unless required by applicable law or agreed to in writing, software - * distributed under the License is distributed on an "AS IS" BASIS, - * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. - * See the License for the specific language governing permissions and - * limitations under the License. 
- */ - -/** - * FileResourceConsumers should throw this if something - * catastrophic has happened and the BatchProcess should shutdown - * and not be restarted. - * - */ -public class BatchNoRestartError extends Error { - - public BatchNoRestartError(Throwable t) { - super(t); - } - public BatchNoRestartError(String message) { - super(message); - } -} diff --git a/tika-batch/src/main/java/org/apache/tika/batch/BatchProcess.java b/tika-batch/src/main/java/org/apache/tika/batch/BatchProcess.java deleted file mode 100644 index d5c556b..0000000 --- a/tika-batch/src/main/java/org/apache/tika/batch/BatchProcess.java +++ /dev/null @@ -1,597 +0,0 @@ -package org.apache.tika.batch; - -/* - * Licensed to the Apache Software Foundation (ASF) under one or more - * contributor license agreements. See the NOTICE file distributed with - * this work for additional information regarding copyright ownership. - * The ASF licenses this file to You under the Apache License, Version 2.0 - * (the "License"); you may not use this file except in compliance with - * the License. You may obtain a copy of the License at - * - * http://www.apache.org/licenses/LICENSE-2.0 - * - * Unless required by applicable law or agreed to in writing, software - * distributed under the License is distributed on an "AS IS" BASIS, - * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. - * See the License for the specific language governing permissions and - * limitations under the License. 
- */ - -import java.io.IOException; -import java.io.PrintStream; -import java.util.Date; -import java.util.List; -import java.util.concurrent.ArrayBlockingQueue; -import java.util.concurrent.Callable; -import java.util.concurrent.CompletionService; -import java.util.concurrent.ExecutionException; -import java.util.concurrent.ExecutorCompletionService; -import java.util.concurrent.ExecutorService; -import java.util.concurrent.Executors; -import java.util.concurrent.Future; -import java.util.concurrent.TimeUnit; - -import org.slf4j.Logger; -import org.slf4j.LoggerFactory; - -import static java.nio.charset.StandardCharsets.UTF_8; - -/** - * This is the main processor class for a single process. - * This class can only be run once. - *
<p/>
    - * It requires a {@link FileResourceCrawler} and {@link FileResourceConsumer}s, and it can also - * support a {@link StatusReporter} and an {@link Interrupter}. - *
<p/>
    - * This is designed to shutdown if a parser has timed out or if there is - * an OutOfMemoryError. Consider using {@link BatchProcessDriverCLI} - * as a daemon/watchdog that monitors and can restart this batch process; - *
<p/>
    - * Note that this class redirects stderr to stdout so that it can - * communicate without interference with the parent process on stderr. - */ -public class BatchProcess implements Callable<ParallelFileProcessingResult> { - - public enum BATCH_CONSTANTS { - BATCH_PROCESS_EXCEEDED_MAX_ALIVE_TIME, - BATCH_PROCESS_FATAL_MUST_RESTART - } - - private enum CAUSE_FOR_TERMINATION { - COMPLETED_NORMALLY, - MAIN_LOOP_EXCEPTION_NO_RESTART, - CONSUMERS_MANAGER_DIDNT_INIT_IN_TIME_NO_RESTART, - MAIN_LOOP_EXCEPTION, - CRAWLER_TIMED_OUT, - TIMED_OUT_CONSUMER, - USER_INTERRUPTION, - BATCH_PROCESS_ALIVE_TOO_LONG, - } - - private static final Logger logger; - static { - logger = LoggerFactory.getLogger(BatchProcess.class); - } - - private PrintStream outputStreamWriter; - - // If a file hasn't been processed in this amount of time, - // report it to the console. When the directory crawler has stopped, the thread will - // be terminated and the file name will be logged - private long timeoutThresholdMillis = 5 * 60 * 1000; // 5 minutes - - private long timeoutCheckPulseMillis = 2 * 60 * 1000; //2 minutes - //if there was an early termination via the Interrupter - //or because of an uncaught runtime throwable, pause - //this long before shutting down to allow parsers to finish - private long pauseOnEarlyTerminationMillis = 30*1000; //30 seconds - - private final long consumersManagerMaxMillis; - - //maximum time that this process should stay alive - //to avoid potential memory leaks, not a bad idea to shutdown - //every hour or so. 
- private int maxAliveTimeSeconds = -1; - - private final FileResourceCrawler fileResourceCrawler; - - private final ConsumersManager consumersManager; - - private final StatusReporter reporter; - - private final Interrupter interrupter; - - private final ArrayBlockingQueue<FileStarted> timedOuts; - - private boolean alreadyExecuted = false; - - public BatchProcess(FileResourceCrawler fileResourceCrawler, - ConsumersManager consumersManager, - StatusReporter reporter, - Interrupter interrupter) { - this.fileResourceCrawler = fileResourceCrawler; - this.consumersManager = consumersManager; - this.reporter = reporter; - this.interrupter = interrupter; - timedOuts = new ArrayBlockingQueue<FileStarted>(consumersManager.getConsumers().size()); - this.consumersManagerMaxMillis = consumersManager.getConsumersManagerMaxMillis(); - } - - /** - * Runs main execution loop. - *
<p/>
    - * Redirects stdout to stderr to keep clean communications - * over stdout with parent process - * @return result of the processing - * @throws InterruptedException - */ - public ParallelFileProcessingResult call() - throws InterruptedException { - if (alreadyExecuted) { - throw new IllegalStateException("Can only execute BatchRunner once."); - } - //redirect streams; all organic warnings should go to System.err; - //System.err should be redirected to System.out - PrintStream sysErr = System.err; - try { - outputStreamWriter = new PrintStream(sysErr, true, UTF_8.toString()); - } catch (IOException e) { - throw new RuntimeException("Can't redirect streams"); - } - System.setErr(System.out); - - ParallelFileProcessingResult result = null; - try { - int numConsumers = consumersManager.getConsumers().size(); - // fileResourceCrawler, statusReporter, the Interrupter, timeoutChecker - int numNonConsumers = 4; - - ExecutorService ex = Executors.newFixedThreadPool(numConsumers - + numNonConsumers); - CompletionService<IFileProcessorFutureResult> completionService = - new ExecutorCompletionService<IFileProcessorFutureResult>( - ex); - TimeoutChecker timeoutChecker = new TimeoutChecker(); - - try { - startConsumersManager(); - } catch (BatchNoRestartError e) { - return new - ParallelFileProcessingResult(0, 0, 0, 0, - 0, BatchProcessDriverCLI.PROCESS_NO_RESTART_EXIT_CODE, - CAUSE_FOR_TERMINATION.CONSUMERS_MANAGER_DIDNT_INIT_IN_TIME_NO_RESTART.toString()); - - } - - State state = mainLoop(completionService, timeoutChecker); - result = shutdown(ex, completionService, timeoutChecker, state); - } finally { - shutdownConsumersManager(); - } - return result; - } - - - private State mainLoop(CompletionService<IFileProcessorFutureResult> completionService, - TimeoutChecker timeoutChecker) { - alreadyExecuted = true; - State state = new State(); - logger.info("BatchProcess starting up"); - - - state.start = new Date().getTime(); - completionService.submit(interrupter); - completionService.submit(fileResourceCrawler); - completionService.submit(reporter); - 
completionService.submit(timeoutChecker); - - - for (FileResourceConsumer consumer : consumersManager.getConsumers()) { - completionService.submit(consumer); - } - - state.numConsumers = consumersManager.getConsumers().size(); - CAUSE_FOR_TERMINATION causeForTermination = null; - //main processing loop - while (true) { - try { - Future<IFileProcessorFutureResult> futureResult = - completionService.poll(1, TimeUnit.SECONDS); - - if (futureResult != null) { - state.removed++; - IFileProcessorFutureResult result = futureResult.get(); - if (result instanceof FileConsumerFutureResult) { - state.consumersRemoved++; - } else if (result instanceof FileResourceCrawlerFutureResult) { - state.crawlersRemoved++; - if (fileResourceCrawler.wasTimedOut()) { - causeForTermination = CAUSE_FOR_TERMINATION.CRAWLER_TIMED_OUT; - break; - } - } else if (result instanceof InterrupterFutureResult) { - causeForTermination = CAUSE_FOR_TERMINATION.USER_INTERRUPTION; - break; - } else if (result instanceof TimeoutFutureResult) { - causeForTermination = CAUSE_FOR_TERMINATION.TIMED_OUT_CONSUMER; - break; - } //only thing left should be StatusReporterResult - } - - if (state.consumersRemoved >= state.numConsumers) { - causeForTermination = CAUSE_FOR_TERMINATION.COMPLETED_NORMALLY; - break; - } - if (aliveTooLong(state.start)) { - causeForTermination = CAUSE_FOR_TERMINATION.BATCH_PROCESS_ALIVE_TOO_LONG; - break; - } - } catch (Throwable e) { - if (isNonRestart(e)) { - causeForTermination = CAUSE_FOR_TERMINATION.MAIN_LOOP_EXCEPTION_NO_RESTART; - } else { - causeForTermination = CAUSE_FOR_TERMINATION.MAIN_LOOP_EXCEPTION; - } - logger.error("Main loop execution exception: " + e.getMessage()); - break; - } - } - state.causeForTermination = causeForTermination; - return state; - } - - private ParallelFileProcessingResult shutdown(ExecutorService ex, - CompletionService<IFileProcessorFutureResult> completionService, - TimeoutChecker timeoutChecker, State state) { - - reporter.setIsShuttingDown(true); - int added = fileResourceCrawler.getAdded(); - int 
considered = fileResourceCrawler.getConsidered(); - - //TODO: figure out safe way to shutdown resource crawler - //if it isn't. Does it need to add poison at this point? - //fileResourceCrawler.pleaseShutdown(); - - //Step 1: prevent uncalled threads from being started - ex.shutdown(); - - //Step 2: ask consumers to shutdown politely. - //Under normal circumstances, they should all have completed by now. - for (FileResourceConsumer consumer : consumersManager.getConsumers()) { - consumer.pleaseShutdown(); - } - //The resourceCrawler should shutdown now. No need for poison. - fileResourceCrawler.shutDownNoPoison(); - //if there are any active/asked to shutdown consumers, await termination - //this can happen if a user interrupts the process - //of if the crawler stops early, or ... - politelyAwaitTermination(state.causeForTermination); - - //Step 3: Gloves come off. We've tried to ask kindly before. - //Now it is time shut down. This will corrupt - //nio channels via thread interrupts! Hopefully, everything - //has shut down by now. 
- logger.trace("About to shutdownNow()"); - List<Runnable> neverCalled = ex.shutdownNow(); - logger.trace("TERMINATED " + ex.isTerminated() + " : " - + state.consumersRemoved + " : " + state.crawlersRemoved); - - int end = state.numConsumers + state.numNonConsumers - state.removed - neverCalled.size(); - - for (int t = 0; t < end; t++) { - Future<IFileProcessorFutureResult> future = null; - try { - future = completionService.poll(10, TimeUnit.MILLISECONDS); - } catch (InterruptedException e) { - logger.warn("thread interrupt while polling in final shutdown loop"); - break; - } - logger.trace("In while future==null loop in final shutdown loop"); - if (future == null) { - break; - } - try { - IFileProcessorFutureResult result = future.get(); - if (result instanceof FileConsumerFutureResult) { - FileConsumerFutureResult consumerResult = (FileConsumerFutureResult) result; - FileStarted fileStarted = consumerResult.getFileStarted(); - if (fileStarted != null - && fileStarted.getElapsedMillis() > timeoutThresholdMillis) { - logger.warn(fileStarted.getResourceId() - + "\t caused a file processor to hang or crash. You may need to remove " - + "this file from your input set and rerun."); - } - } else if (result instanceof FileResourceCrawlerFutureResult) { - FileResourceCrawlerFutureResult crawlerResult = (FileResourceCrawlerFutureResult) result; - considered += crawlerResult.getConsidered(); - added += crawlerResult.getAdded(); - } //else ...we don't care about anything else stopping at this point - } catch (ExecutionException e) { - logger.error("Execution exception trying to shutdown after shutdownNow:" + e.getMessage()); - } catch (InterruptedException e) { - logger.error("Interrupted exception trying to shutdown after shutdownNow:" + e.getMessage()); - } - } - //do we need to restart? - String restartMsg = null; - if (state.causeForTermination == CAUSE_FOR_TERMINATION.USER_INTERRUPTION - || state.causeForTermination == CAUSE_FOR_TERMINATION.MAIN_LOOP_EXCEPTION_NO_RESTART) { - //do not restart!!! 
- } else if (state.causeForTermination == CAUSE_FOR_TERMINATION.MAIN_LOOP_EXCEPTION) { - restartMsg = "Uncaught consumer throwable"; - } else if (state.causeForTermination == CAUSE_FOR_TERMINATION.TIMED_OUT_CONSUMER) { - if (areResourcesPotentiallyRemaining()) { - restartMsg = "Consumer timed out with resources remaining"; - } - } else if (state.causeForTermination == CAUSE_FOR_TERMINATION.BATCH_PROCESS_ALIVE_TOO_LONG) { - restartMsg = BATCH_CONSTANTS.BATCH_PROCESS_EXCEEDED_MAX_ALIVE_TIME.toString(); - } else if (state.causeForTermination == CAUSE_FOR_TERMINATION.CRAWLER_TIMED_OUT) { - restartMsg = "Crawler timed out."; - } else if (fileResourceCrawler.wasTimedOut()) { - restartMsg = "Crawler was timed out."; - } else if (fileResourceCrawler.isActive()) { - restartMsg = "Crawler is still active."; - } else if (! fileResourceCrawler.isQueueEmpty()) { - restartMsg = "Resources still exist for processing"; - } - - int exitStatus = getExitStatus(state.causeForTermination, restartMsg); - - //need to re-check, report, mark timed out consumers - timeoutChecker.checkForTimedOutConsumers(); - - for (FileStarted fs : timedOuts) { - logger.warn("A parser was still working on >" + fs.getResourceId() + - "< for " + fs.getElapsedMillis() + " milliseconds after it started." 
+ - " This exceeds the maxTimeoutMillis parameter"); - } - double elapsed = ((double) new Date().getTime() - (double) state.start) / 1000.0; - int processed = 0; - int numExceptions = 0; - for (FileResourceConsumer c : consumersManager.getConsumers()) { - processed += c.getNumResourcesConsumed(); - numExceptions += c.getNumHandledExceptions(); - } - return new - ParallelFileProcessingResult(considered, added, processed, numExceptions, - elapsed, exitStatus, state.causeForTermination.toString()); - } - - private class State { - long start = -1; - int numConsumers = 0; - int numNonConsumers = 0; - int removed = 0; - int consumersRemoved = 0; - int crawlersRemoved = 0; - CAUSE_FOR_TERMINATION causeForTermination = null; - } - - private void startConsumersManager() { - if (consumersManagerMaxMillis < 0) { - consumersManager.init(); - return; - } - Thread timed = new Thread() { - public void run() { - logger.trace("about to start consumers manager"); - consumersManager.init(); - logger.trace("finished starting consumers manager"); - } - }; - //don't allow this thread to keep process alive - timed.setDaemon(true); - timed.start(); - try { - timed.join(consumersManagerMaxMillis); - } catch (InterruptedException e) { - logger.warn("interruption exception during consumers manager shutdown"); - } - if (timed.isAlive()) { - logger.error("ConsumersManager did not start within " + consumersManagerMaxMillis + "ms"); - throw new BatchNoRestartError("ConsumersManager did not start within "+consumersManagerMaxMillis+"ms"); - } - } - - private void shutdownConsumersManager() { - if (consumersManagerMaxMillis < 0) { - consumersManager.shutdown(); - return; - } - Thread timed = new Thread() { - public void run() { - logger.trace("starting to shutdown consumers manager"); - consumersManager.shutdown(); - logger.trace("finished shutting down consumers manager"); - } - }; - timed.setDaemon(true); - timed.start(); - try { - timed.join(consumersManagerMaxMillis); - } catch 
(InterruptedException e) { - logger.warn("interruption exception during consumers manager shutdown"); - } - if (timed.isAlive()) { - logger.error("ConsumersManager was still alive during shutdown!"); - throw new BatchNoRestartError("ConsumersManager did not shutdown within: "+ - consumersManagerMaxMillis+"ms"); - } - } - - /** - * This is used instead of awaitTermination(), because that interrupts - * the thread and then waits for its termination. This politely waits. - * - * @param causeForTermination reason for termination. - */ - private void politelyAwaitTermination(CAUSE_FOR_TERMINATION causeForTermination) { - if (causeForTermination == CAUSE_FOR_TERMINATION.COMPLETED_NORMALLY) { - return; - } - long start = new Date().getTime(); - while (countActiveConsumers() > 0) { - try { - Thread.sleep(500); - } catch (InterruptedException e) { - logger.warn("Thread interrupted while trying to politelyAwaitTermination"); - return; - } - long elapsed = new Date().getTime()-start; - if (pauseOnEarlyTerminationMillis > -1 && - elapsed > pauseOnEarlyTerminationMillis) { - logger.warn("I waited after an early termination for "+ - elapsed + ", but there was at least one active consumer"); - return; - } - } - } - - private boolean isNonRestart(Throwable e) { - if (e instanceof BatchNoRestartError) { - return true; - } - Throwable cause = e.getCause(); - return cause != null && isNonRestart(cause); - } - - private int getExitStatus(CAUSE_FOR_TERMINATION causeForTermination, String restartMsg) { - if (causeForTermination == CAUSE_FOR_TERMINATION.MAIN_LOOP_EXCEPTION_NO_RESTART) { - logger.info(CAUSE_FOR_TERMINATION.MAIN_LOOP_EXCEPTION_NO_RESTART.name()); - return BatchProcessDriverCLI.PROCESS_NO_RESTART_EXIT_CODE; - } - - if (restartMsg != null) { - if (restartMsg.equals(BATCH_CONSTANTS.BATCH_PROCESS_EXCEEDED_MAX_ALIVE_TIME.toString())) { - logger.warn(restartMsg); - } else { - logger.error(restartMsg); - } - - //send over stdout wrapped in outputStreamWriter - 
outputStreamWriter.println( - BATCH_CONSTANTS.BATCH_PROCESS_FATAL_MUST_RESTART.toString() + - " >> " + restartMsg); - outputStreamWriter.flush(); - return BatchProcessDriverCLI.PROCESS_RESTART_EXIT_CODE; - } - return 0; - } - - //could new FileResources be consumed from the Queue? - //Because of race conditions, this can return a true - //when the real answer is false. - //This should never return false, though, if the answer is true! - private boolean areResourcesPotentiallyRemaining() { - if (fileResourceCrawler.isActive()) { - return true; - } - return !fileResourceCrawler.isQueueEmpty(); - } - - private boolean aliveTooLong(long started) { - if (maxAliveTimeSeconds < 0) { - return false; - } - double elapsedSeconds = (double) (new Date().getTime() - started) / (double) 1000; - return elapsedSeconds > (double) maxAliveTimeSeconds; - } - - //snapshot of non-retired consumers; actual number may be smaller by the time - //this returns a value! - private int countActiveConsumers() { - int active = 0; - for (FileResourceConsumer consumer : consumersManager.getConsumers()) { - if (consumer.isStillActive()) { - active++; - } - } - return active; - } - - /** - * If there is an early termination via an interrupt or too many timed out consumers - * or because a consumer or other Runnable threw a Throwable, pause this long - * before killing the consumers and other threads. - * - * Typically makes sense for this to be the same or slightly larger than - * timeoutThresholdMillis - * - * @param pauseOnEarlyTerminationMillis how long to pause if there is an early termination - */ - public void setPauseOnEarlyTerminationMillis(long pauseOnEarlyTerminationMillis) { - this.pauseOnEarlyTerminationMillis = pauseOnEarlyTerminationMillis; - } - - /** - * The amount of time allowed before a consumer should be timed out. 
- * - * @param timeoutThresholdMillis threshold in milliseconds before declaring a consumer timed out - */ - public void setTimeoutThresholdMillis(long timeoutThresholdMillis) { - this.timeoutThresholdMillis = timeoutThresholdMillis; - } - - public void setTimeoutCheckPulseMillis(long timeoutCheckPulseMillis) { - this.timeoutCheckPulseMillis = timeoutCheckPulseMillis; - } - - /** - * The maximum amount of time that this process can be alive. To avoid - * memory leaks, it is sometimes beneficial to shutdown (and restart) the - * process periodically. - *
    - * If the value is < 0, the process will run until completion, interruption or exception. - * - * @param maxAliveTimeSeconds maximum amount of time in seconds to remain alive - */ - public void setMaxAliveTimeSeconds(int maxAliveTimeSeconds) { - this.maxAliveTimeSeconds = maxAliveTimeSeconds; - } - - private class TimeoutChecker implements Callable { - - @Override - public TimeoutFutureResult call() throws Exception { - while (timedOuts.size() == 0) { - try { - Thread.sleep(timeoutCheckPulseMillis); - } catch (InterruptedException e) { - logger.debug("Thread interrupted exception in TimeoutChecker"); - break; - //just stop. - } - checkForTimedOutConsumers(); - if (countActiveConsumers() == 0) { - logger.info("No activeConsumers in TimeoutChecker"); - break; - } - } - logger.debug("TimeoutChecker quitting: " + timedOuts.size()); - return new TimeoutFutureResult(timedOuts.size()); - } - - private void checkForTimedOutConsumers() { - for (FileResourceConsumer consumer : consumersManager.getConsumers()) { - FileStarted fs = consumer.checkForTimedOutMillis(timeoutThresholdMillis); - if (fs != null) { - timedOuts.add(fs); - } - } - } - } - - private class TimeoutFutureResult implements IFileProcessorFutureResult { - //used to be used when more than one timeout was allowed - //TODO: get rid of this? - private final int timedOutCount; - - private TimeoutFutureResult(final int timedOutCount) { - this.timedOutCount = timedOutCount; - } - - protected int getTimedOutCount() { - return timedOutCount; - } - } -} diff --git a/tika-batch/src/main/java/org/apache/tika/batch/BatchProcessDriverCLI.java b/tika-batch/src/main/java/org/apache/tika/batch/BatchProcessDriverCLI.java deleted file mode 100644 index b27dd20..0000000 --- a/tika-batch/src/main/java/org/apache/tika/batch/BatchProcessDriverCLI.java +++ /dev/null @@ -1,403 +0,0 @@ -package org.apache.tika.batch; -/* - * Licensed to the Apache Software Foundation (ASF) under one or more - * contributor license agreements. 
See the NOTICE file distributed with - * this work for additional information regarding copyright ownership. - * The ASF licenses this file to You under the Apache License, Version 2.0 - * (the "License"); you may not use this file except in compliance with - * the License. You may obtain a copy of the License at - * - * http://www.apache.org/licenses/LICENSE-2.0 - * - * Unless required by applicable law or agreed to in writing, software - * distributed under the License is distributed on an "AS IS" BASIS, - * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. - * See the License for the specific language governing permissions and - * limitations under the License. - */ - -import static java.nio.charset.StandardCharsets.UTF_8; - -import java.io.BufferedInputStream; -import java.io.BufferedReader; -import java.io.IOException; -import java.io.InputStream; -import java.io.InputStreamReader; -import java.io.OutputStream; -import java.io.OutputStreamWriter; -import java.io.Writer; -import java.nio.file.Paths; -import java.util.ArrayList; -import java.util.List; -import java.util.Locale; - -import org.apache.commons.io.IOUtils; -import org.slf4j.Logger; -import org.slf4j.LoggerFactory; - -public class BatchProcessDriverCLI { - - /** - * This relies on an special exit values of 254 (do not restart), - * 0 ended correctly, 253 ended with exception (do restart) - */ - public static final int PROCESS_RESTART_EXIT_CODE = 253; - //make sure this is above 255 to avoid stopping on system errors - //that is, if there is a system error (e.g. 143), you - //should restart the process. 
- public static final int PROCESS_NO_RESTART_EXIT_CODE = 254; - public static final int PROCESS_COMPLETED_SUCCESSFULLY = 0; - private static Logger logger = LoggerFactory.getLogger(BatchProcessDriverCLI.class); - - private int maxProcessRestarts = -1; - private long pulseMillis = 1000; - - //how many times to wait pulseMillis milliseconds if a restart - //message has been received through stdout, but the - //child process has not yet exited - private int waitNumLoopsAfterRestartmessage = 60; - int loopsAfterRestartMessageReceived = 0; - - - - private volatile boolean userInterrupted = false; - private boolean receivedRestartMsg = false; - private Process process = null; - - private StreamGobbler errorWatcher = null; - private StreamGobbler outGobbler = null; - private InterruptWriter interruptWriter = null; - private final InterruptWatcher interruptWatcher = - new InterruptWatcher(System.in); - - private Thread errorWatcherThread = null; - private Thread outGobblerThread = null; - private Thread interruptWriterThread = null; - private final Thread interruptWatcherThread = new Thread(interruptWatcher); - - private final String[] commandLine; - private int numRestarts = 0; - private boolean redirectChildProcessToStdOut = true; - - public BatchProcessDriverCLI(String[] commandLine){ - this.commandLine = tryToReadMaxRestarts(commandLine); - } - - private String[] tryToReadMaxRestarts(String[] commandLine) { - List args = new ArrayList(); - for (int i = 0; i < commandLine.length; i++) { - String arg = commandLine[i]; - if (arg.equals("-maxRestarts")) { - if (i == commandLine.length-1) { - throw new IllegalArgumentException("Must specify an integer after \"-maxRestarts\""); - } - String restartNumString = commandLine[i+1]; - try { - maxProcessRestarts = Integer.parseInt(restartNumString); - } catch (NumberFormatException e) { - throw new IllegalArgumentException("Must specify an integer after \"-maxRestarts\" arg."); - } - i++; - } else { - args.add(arg); - } - } - 
return args.toArray(new String[args.size()]); - } - - public void execute() throws Exception { - - interruptWatcherThread.setDaemon(true); - interruptWatcherThread.start(); - logger.info("about to start driver"); - start(); - while (!userInterrupted) { - Integer exit = null; - try { - logger.trace("about to check exit value"); - exit = process.exitValue(); - logger.info("The child process has finished with an exit value of: "+exit); - stop(); - } catch (IllegalThreadStateException e) { - //hasn't exited - logger.trace("process has not exited; IllegalThreadStateException"); - } - - logger.trace("Before sleep:" + - " exit=" + exit + " receivedRestartMsg=" + receivedRestartMsg); - - //Even if the process has exited, - //wait just a little bit to make sure that - //mustRestart hasn't been set to true - try { - Thread.sleep(pulseMillis); - } catch (InterruptedException e) { - logger.trace("interrupted exception during sleep"); - } - logger.trace("After sleep:" + - " exit=" + exit + " receivedRestartMsg=" + receivedRestartMsg); - //if we've gotten the message via stdout to restart - //but the process hasn't exited yet, give it another - //chance - if (receivedRestartMsg && exit == null && - loopsAfterRestartMessageReceived <= waitNumLoopsAfterRestartmessage) { - loopsAfterRestartMessageReceived++; - logger.warn("Must restart, still not exited; loops after restart: " + - loopsAfterRestartMessageReceived); - continue; - } - if (loopsAfterRestartMessageReceived > waitNumLoopsAfterRestartmessage) { - logger.trace("About to try to restart because:" + - " exit=" + exit + " receivedRestartMsg=" + receivedRestartMsg); - logger.warn("Restarting after exceeded wait loops waiting for exit: "+ - loopsAfterRestartMessageReceived); - boolean restarted = restart(exit, receivedRestartMsg); - if (!restarted) { - break; - } - } else if (exit != null && exit != BatchProcessDriverCLI.PROCESS_NO_RESTART_EXIT_CODE - && exit != BatchProcessDriverCLI.PROCESS_COMPLETED_SUCCESSFULLY) { - 
logger.trace("About to try to restart because:" + - " exit=" + exit + " receivedRestartMsg=" + receivedRestartMsg); - - if (exit == BatchProcessDriverCLI.PROCESS_RESTART_EXIT_CODE) { - logger.info("Restarting on expected restart code"); - } else { - logger.warn("Restarting on unexpected restart code: "+exit); - } - boolean restarted = restart(exit, receivedRestartMsg); - if (!restarted) { - break; - } - } else if (exit != null && (exit == PROCESS_COMPLETED_SUCCESSFULLY - || exit == BatchProcessDriverCLI.PROCESS_NO_RESTART_EXIT_CODE)) { - logger.trace("Will not restart: "+exit); - break; - } - } - logger.trace("about to call shutdown driver now"); - shutdownDriverNow(); - logger.info("Process driver has completed"); - } - - private void shutdownDriverNow() { - if (process != null) { - for (int i = 0; i < 60; i++) { - - logger.trace("trying to shut down: "+i); - try { - int exit = process.exitValue(); - logger.trace("trying to stop:"+exit); - stop(); - interruptWatcherThread.interrupt(); - return; - } catch (IllegalThreadStateException e) { - //hasn't exited - } - try { - Thread.sleep(1000); - } catch (InterruptedException e) { - //swallow - } - } - logger.error("Process didn't stop after 60 seconds after shutdown. " + - "I am forcefully killing it."); - } - interruptWatcherThread.interrupt(); - } - - public int getNumRestarts() { - return numRestarts; - } - - public boolean getUserInterrupted() { - return userInterrupted; - } - - /** - * Tries to restart (stop and then start) the child process - * @return whether or not this was successful, will be false if numRestarts >= maxProcessRestarts - * @throws Exception - */ - private boolean restart(Integer exitValue, boolean receivedRestartMsg) throws Exception { - if (maxProcessRestarts > -1 && numRestarts >= maxProcessRestarts) { - logger.warn("Hit the maximum number of process restarts. 
Driver is shutting down now."); - stop(); - return false; - } - logger.warn("Must restart process (exitValue="+exitValue+" numRestarts="+numRestarts+ - " receivedRestartMessage="+receivedRestartMsg+")"); - stop(); - start(); - numRestarts++; - loopsAfterRestartMessageReceived = 0; - return true; - } - - private void stop() { - if (process != null) { - logger.trace("destroying a non-null process"); - process.destroy(); - } - - receivedRestartMsg = false; - //interrupt the writer thread first - interruptWriterThread.interrupt(); - - errorWatcher.stopGobblingAndDie(); - outGobbler.stopGobblingAndDie(); - errorWatcherThread.interrupt(); - outGobblerThread.interrupt(); - } - - private void start() throws Exception { - ProcessBuilder builder = new ProcessBuilder(commandLine); - builder.directory(Paths.get(".").toFile()); - process = builder.start(); - - errorWatcher = new StreamWatcher(process.getErrorStream()); - errorWatcherThread = new Thread(errorWatcher); - errorWatcherThread.start(); - - outGobbler = new StreamGobbler(process.getInputStream()); - outGobblerThread = new Thread(outGobbler); - outGobblerThread.start(); - - interruptWriter = new InterruptWriter(process.getOutputStream()); - interruptWriterThread = new Thread(interruptWriter); - interruptWriterThread.start(); - - } - - /** - * Typically only used for testing. This determines whether or not - * to redirect child process's stdOut to driver's stdout - * @param redirectChildProcessToStdOut should the driver redirect the child's stdout - */ - public void setRedirectChildProcessToStdOut(boolean redirectChildProcessToStdOut) { - this.redirectChildProcessToStdOut = redirectChildProcessToStdOut; - } - - /** - * Class to watch stdin from the driver for anything that is typed. - * This will currently cause an interrupt if anything followed by - * a return key is entered. We may want to add an "Are you sure?" dialogue. 
- */ - private class InterruptWatcher implements Runnable { - private BufferedReader reader; - - private InterruptWatcher(InputStream is) { - reader = new BufferedReader(new InputStreamReader(is, UTF_8)); - } - - @Override - public void run() { - try { - //this will block. - //as soon as it reads anything, - //set userInterrupted to true and stop - reader.readLine(); - userInterrupted = true; - } catch (IOException e) { - //swallow - } - } - } - - /** - * Class that writes to the child process - * to force an interrupt in the child process. - */ - private class InterruptWriter implements Runnable { - private final Writer writer; - - private InterruptWriter(OutputStream os) { - this.writer = new OutputStreamWriter(os, UTF_8); - } - - @Override - public void run() { - try { - while (true) { - Thread.sleep(500); - if (userInterrupted) { - writer.write(String.format(Locale.ENGLISH, "Ave atque vale!%n")); - writer.flush(); - } - } - } catch (IOException e) { - //swallow - } catch (InterruptedException e) { - //job is done, ok - } - } - } - - private class StreamGobbler implements Runnable { - //plagiarized from org.apache.oodt's StreamGobbler - protected final BufferedReader reader; - protected boolean running = true; - - private StreamGobbler(InputStream is) { - this.reader = new BufferedReader(new InputStreamReader(new BufferedInputStream(is), UTF_8)); - } - - @Override - public void run() { - String line = null; - try { - logger.trace("gobbler starting to read"); - while ((line = reader.readLine()) != null && this.running) { - if (redirectChildProcessToStdOut) { - System.out.println("BatchProcess:"+line); - } - } - } catch (IOException e) { - logger.trace("gobbler io exception"); - //swallow ioe - } - logger.trace("gobbler done"); - } - - private void stopGobblingAndDie() { - logger.trace("stop gobbling"); - running = false; - IOUtils.closeQuietly(reader); - } - } - - private class StreamWatcher extends StreamGobbler implements Runnable { - //plagiarized from 
org.apache.oodt's StreamGobbler - - private StreamWatcher(InputStream is){ - super(is); - } - - @Override - public void run() { - String line = null; - try { - logger.trace("watcher starting to read"); - while ((line = reader.readLine()) != null && this.running) { - if (line.startsWith(BatchProcess.BATCH_CONSTANTS.BATCH_PROCESS_FATAL_MUST_RESTART.toString())) { - receivedRestartMsg = true; - } - logger.info("BatchProcess: "+line); - } - } catch (IOException e) { - logger.trace("watcher io exception"); - //swallow ioe - } - logger.trace("watcher done"); - } - } - - - public static void main(String[] args) throws Exception { - - BatchProcessDriverCLI runner = new BatchProcessDriverCLI(args); - runner.execute(); - System.out.println("FSBatchProcessDriver has gracefully completed"); - System.exit(0); - } -} diff --git a/tika-batch/src/main/java/org/apache/tika/batch/ConsumersManager.java b/tika-batch/src/main/java/org/apache/tika/batch/ConsumersManager.java deleted file mode 100644 index a4f3b82..0000000 --- a/tika-batch/src/main/java/org/apache/tika/batch/ConsumersManager.java +++ /dev/null @@ -1,80 +0,0 @@ -package org.apache.tika.batch; - -/* - * Licensed to the Apache Software Foundation (ASF) under one or more - * contributor license agreements. See the NOTICE file distributed with - * this work for additional information regarding copyright ownership. - * The ASF licenses this file to You under the Apache License, Version 2.0 - * (the "License"); you may not use this file except in compliance with - * the License. You may obtain a copy of the License at - * - * http://www.apache.org/licenses/LICENSE-2.0 - * - * Unless required by applicable law or agreed to in writing, software - * distributed under the License is distributed on an "AS IS" BASIS, - * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. - * See the License for the specific language governing permissions and - * limitations under the License. 
- */ - -import java.util.Collections; -import java.util.List; - -/** - * Simple interface around a collection of consumers that allows - * for initializing and shutting shared resources (e.g. db connection, index, writer, etc.) - */ -public abstract class ConsumersManager { - - //maximum time to allow the ConsumersManager for either init() - //or shutdown() - private long consumersManagerMaxMillis = 60000; - private final List consumers; - - public ConsumersManager(List consumers) { - this.consumers = Collections.unmodifiableList(consumers); - } - /** - * Get the consumers - * @return consumers - */ - public List getConsumers() { - return consumers; - } - - /** - * This is called by BatchProcess before submitting the threads - */ - public void init(){ - - } - - /** - * This is called by BatchProcess immediately before closing. - * Beware! Some of the consumers may have hung or may not - * have completed. - */ - public void shutdown(){ - - } - - /** - * {@link org.apache.tika.batch.BatchProcess} will throw an exception - * if the ConsumersManager doesn't complete init() or shutdown() - * within this amount of time. 
- * @return the maximum time allowed for init() or shutdown() - */ - public long getConsumersManagerMaxMillis() { - return consumersManagerMaxMillis; - } - - /** - * {@see #getConsumersManagerMaxMillis()} - * - * @param consumersManagerMaxMillis maximum number of milliseconds - * to allow for init() or shutdown() - */ - public void setConsumersManagerMaxMillis(long consumersManagerMaxMillis) { - this.consumersManagerMaxMillis = consumersManagerMaxMillis; - } -} diff --git a/tika-batch/src/main/java/org/apache/tika/batch/FileConsumerFutureResult.java b/tika-batch/src/main/java/org/apache/tika/batch/FileConsumerFutureResult.java deleted file mode 100644 index be0b958..0000000 --- a/tika-batch/src/main/java/org/apache/tika/batch/FileConsumerFutureResult.java +++ /dev/null @@ -1,37 +0,0 @@ -package org.apache.tika.batch; - -/* - * Licensed to the Apache Software Foundation (ASF) under one or more - * contributor license agreements. See the NOTICE file distributed with - * this work for additional information regarding copyright ownership. - * The ASF licenses this file to You under the Apache License, Version 2.0 - * (the "License"); you may not use this file except in compliance with - * the License. You may obtain a copy of the License at - * - * http://www.apache.org/licenses/LICENSE-2.0 - * - * Unless required by applicable law or agreed to in writing, software - * distributed under the License is distributed on an "AS IS" BASIS, - * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. - * See the License for the specific language governing permissions and - * limitations under the License. 
- */ - -class FileConsumerFutureResult implements IFileProcessorFutureResult { - - private final FileStarted fileStarted; - private final int filesProcessed; - - public FileConsumerFutureResult(FileStarted fs, int filesProcessed) { - this.fileStarted = fs; - this.filesProcessed = filesProcessed; - } - - public FileStarted getFileStarted() { - return fileStarted; - } - - public int getFilesProcessed() { - return filesProcessed; - } -} diff --git a/tika-batch/src/main/java/org/apache/tika/batch/FileResource.java b/tika-batch/src/main/java/org/apache/tika/batch/FileResource.java deleted file mode 100644 index 02d5bdd..0000000 --- a/tika-batch/src/main/java/org/apache/tika/batch/FileResource.java +++ /dev/null @@ -1,68 +0,0 @@ -package org.apache.tika.batch; - -/* - * Licensed to the Apache Software Foundation (ASF) under one or more - * contributor license agreements. See the NOTICE file distributed with - * this work for additional information regarding copyright ownership. - * The ASF licenses this file to You under the Apache License, Version 2.0 - * (the "License"); you may not use this file except in compliance with - * the License. You may obtain a copy of the License at - * - * http://www.apache.org/licenses/LICENSE-2.0 - * - * Unless required by applicable law or agreed to in writing, software - * distributed under the License is distributed on an "AS IS" BASIS, - * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. - * See the License for the specific language governing permissions and - * limitations under the License. - */ - -import org.apache.tika.metadata.Metadata; -import org.apache.tika.metadata.Property; - -import java.io.IOException; -import java.io.InputStream; - - -/** - * This is a basic interface to handle a logical "file". - * This should enable code-agnostic handling of files from different - * sources: file system, database, etc. - * - */ -public interface FileResource { - - //The literal lowercased extension of a file. 
This may or may not - //have any relationship to the actual type of the file. - public static final Property FILE_EXTENSION = Property.internalText("tika:file_ext"); - - /** - * This is only used in logging to identify which file - * may have caused problems. While it is probably best - * to use unique ids for the sake of debugging, it is not - * necessary that the ids be unique. This id - * is never used as a hashkey by the batch processors, for example. - * - * @return an id for a FileResource - */ - public String getResourceId(); - - /** - * This gets the metadata available before the parsing of the file. - * This will typically be "external" metadata: file name, - * file size, file location, data stream, etc. That is, things - * that are known about the file from outside information, not - * file-internal metadata. - * - * @return Metadata - */ - public Metadata getMetadata(); - - /** - * - * @return an InputStream for the FileResource - * @throws java.io.IOException - */ - public InputStream openInputStream() throws IOException; - -} diff --git a/tika-batch/src/main/java/org/apache/tika/batch/FileResourceConsumer.java b/tika-batch/src/main/java/org/apache/tika/batch/FileResourceConsumer.java deleted file mode 100644 index f3ae6aa..0000000 --- a/tika-batch/src/main/java/org/apache/tika/batch/FileResourceConsumer.java +++ /dev/null @@ -1,429 +0,0 @@ -package org.apache.tika.batch; - -/* - * Licensed to the Apache Software Foundation (ASF) under one or more - * contributor license agreements. See the NOTICE file distributed with - * this work for additional information regarding copyright ownership. - * The ASF licenses this file to You under the Apache License, Version 2.0 - * (the "License"); you may not use this file except in compliance with - * the License. 
You may obtain a copy of the License at - * - * http://www.apache.org/licenses/LICENSE-2.0 - * - * Unless required by applicable law or agreed to in writing, software - * distributed under the License is distributed on an "AS IS" BASIS, - * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. - * See the License for the specific language governing permissions and - * limitations under the License. - */ - -import javax.xml.stream.XMLOutputFactory; -import javax.xml.stream.XMLStreamException; -import javax.xml.stream.XMLStreamWriter; -import java.io.Closeable; -import java.io.Flushable; -import java.io.IOException; -import java.io.InputStream; -import java.io.PrintWriter; -import java.io.StringWriter; -import java.util.Date; -import java.util.concurrent.ArrayBlockingQueue; -import java.util.concurrent.Callable; -import java.util.concurrent.TimeUnit; -import java.util.concurrent.atomic.AtomicInteger; - -import org.apache.tika.metadata.Metadata; -import org.apache.tika.parser.ParseContext; -import org.apache.tika.parser.Parser; -import org.slf4j.Logger; -import org.slf4j.LoggerFactory; -import org.xml.sax.ContentHandler; - - -/** - * This is a base class for file consumers. The - * goal of this class is to abstract out the multithreading - * and recordkeeping components. - *
    - */ -public abstract class FileResourceConsumer implements Callable { - - private enum STATE { - NOT_YET_STARTED, - ACTIVELY_CONSUMING, - SWALLOWED_POISON, - THREAD_INTERRUPTED, - EXCEEDED_MAX_CONSEC_WAIT_MILLIS, - ASKED_TO_SHUTDOWN, - TIMED_OUT, - CONSUMER_EXCEPTION, - CONSUMER_ERROR, - COMPLETED - } - - public static String TIMED_OUT = "timed_out"; - public static String OOM = "oom"; - public static String IO_IS = "io_on_inputstream"; - public static String IO_OS = "io_on_outputstream"; - public static String PARSE_ERR = "parse_err"; - public static String PARSE_EX = "parse_ex"; - - public static String ELAPSED_MILLIS = "elapsedMS"; - - private static AtomicInteger numConsumers = new AtomicInteger(-1); - protected static Logger logger = LoggerFactory.getLogger(FileResourceConsumer.class); - - private long maxConsecWaitInMillis = 10*60*1000;// 10 minutes - - private final ArrayBlockingQueue fileQueue; - - private final XMLOutputFactory xmlOutputFactory = XMLOutputFactory.newFactory(); - private final int consumerId; - - //used to lock checks on state to prevent - private final Object lock = new Object(); - - //this records the file that is currently - //being processed. It is null if no file is currently being processed. - //no need for volatile because of lock for checkForStales - private FileStarted currentFile = null; - - //total number of files consumed; volatile so that reporter - //sees the latest - private volatile int numResourcesConsumed = 0; - - //total number of exceptions that were handled by subclasses; - //volatile so that reporter sees the latest - private volatile int numHandledExceptions = 0; - - //after this has been set to ACTIVELY_CONSUMING, - //this should only be set by setEndedState. 
- private volatile STATE currentState = STATE.NOT_YET_STARTED; - - public FileResourceConsumer(ArrayBlockingQueue fileQueue) { - this.fileQueue = fileQueue; - consumerId = numConsumers.incrementAndGet(); - } - - public IFileProcessorFutureResult call() { - currentState = STATE.ACTIVELY_CONSUMING; - - try { - FileResource fileResource = getNextFileResource(); - while (fileResource != null) { - logger.debug("file consumer is about to process: " + fileResource.getResourceId()); - boolean consumed = _processFileResource(fileResource); - logger.debug("file consumer has finished processing: " + fileResource.getResourceId()); - - if (consumed) { - numResourcesConsumed++; - } - fileResource = getNextFileResource(); - } - } catch (InterruptedException e) { - setEndedState(STATE.THREAD_INTERRUPTED); - } - - setEndedState(STATE.COMPLETED); - return new FileConsumerFutureResult(currentFile, numResourcesConsumed); - } - - - /** - * Main piece of code that needs to be implemented. Clients - * are responsible for closing streams and handling the exceptions - * that they'd like to handle. - *
    - * Unchecked throwables can be thrown past this, of course. When an unchecked - * throwable is thrown, this logs the error, and then rethrows the exception. - * Clients/subclasses should make sure to catch and handle everything they can. - *
    - * The design goal is that the whole process should close up and shutdown soon after - * an unchecked exception or error is thrown. - *
    - * Make sure to call {@link #incrementHandledExceptions()} appropriately in - * your implementation of this method. - *
    - * - * @param fileResource resource to process - * @return whether or not a file was successfully processed - */ - public abstract boolean processFileResource(FileResource fileResource); - - - /** - * Make sure to call this appropriately! - */ - protected void incrementHandledExceptions() { - numHandledExceptions++; - } - - - /** - * Returns whether or not the consumer could still process - * a file or is still processing a file (ACTIVELY_CONSUMING or ASKED_TO_SHUTDOWN) - * @return whether this consumer is still active - */ - public boolean isStillActive() { - if (Thread.currentThread().isInterrupted()) { - return false; - } else if( currentState == STATE.NOT_YET_STARTED || - currentState == STATE.ACTIVELY_CONSUMING || - currentState == STATE.ASKED_TO_SHUTDOWN) { - return true; - } - return false; - } - - private boolean _processFileResource(FileResource fileResource) { - currentFile = new FileStarted(fileResource.getResourceId()); - boolean consumed = false; - try { - consumed = processFileResource(fileResource); - } catch (RuntimeException e) { - setEndedState(STATE.CONSUMER_EXCEPTION); - throw e; - } catch (Error e) { - setEndedState(STATE.CONSUMER_ERROR); - throw e; - } - //if anything is thrown from processFileResource, then the fileStarted - //will remain what it was right before the exception was thrown. - currentFile = null; - return consumed; - } - - /** - * This politely asks the consumer to shutdown. - * Before processing another file, the consumer will check to see - * if it has been asked to terminate. - *
    - * This offers another method for politely requesting - * that a FileResourceConsumer stop processing - * besides passing it {@link org.apache.tika.batch.PoisonFileResource}. - * - */ - public void pleaseShutdown() { - setEndedState(STATE.ASKED_TO_SHUTDOWN); - } - - /** - * Returns the name and start time of a file that is currently being processed. - * If no file is currently being processed, this will return null. - * - * @return FileStarted or null - */ - public FileStarted getCurrentFile() { - return currentFile; - } - - public int getNumResourcesConsumed() { - return numResourcesConsumed; - } - - public int getNumHandledExceptions() { - return numHandledExceptions; - } - - /** - * Checks to see if the currentFile being processed (if there is one) - * should be timed out (still being worked on after staleThresholdMillis). - *
    - * If the consumer should be timed out, this will return the currentFile and - * set the state to TIMED_OUT. - *
    - * If the consumer was already timed out earlier or - * is not processing a file or has been working on a file - * for less than #staleThresholdMillis, then this will return null. - *
    - * @param staleThresholdMillis threshold to determine whether the consumer has gone stale. - * @return null or the file started that triggered the stale condition - */ - public FileStarted checkForTimedOutMillis(long staleThresholdMillis) { - //if there isn't a current file, don't bother obtaining lock - if (currentFile == null) { - return null; - } - //if threshold is < 0, don't even look. - if (staleThresholdMillis < 0) { - return null; - } - synchronized(lock) { - //check again once the lock has been obtained - if (currentState != STATE.ACTIVELY_CONSUMING - && currentState != STATE.ASKED_TO_SHUTDOWN) { - return null; - } - FileStarted tmp = currentFile; - if (tmp == null) { - return null; - } - if (tmp.getElapsedMillis() > staleThresholdMillis) { - setEndedState(STATE.TIMED_OUT); - logger.error("{}", getXMLifiedLogMsg( - TIMED_OUT, - tmp.getResourceId(), - ELAPSED_MILLIS, Long.toString(tmp.getElapsedMillis()))); - return tmp; - } - } - return null; - } - - protected String getXMLifiedLogMsg(String type, String resourceId, String... attrs) { - return getXMLifiedLogMsg(type, resourceId, null, attrs); - } - - /** - * Use this for structured output that captures resourceId and other attributes. - * - * @param type entity name for exception - * @param resourceId resourceId string - * @param t throwable can be null - * @param attrs (array of key0, value0, key1, value1, etc.) - */ - protected String getXMLifiedLogMsg(String type, String resourceId, Throwable t, String... attrs) { - - StringWriter writer = new StringWriter(); - try { - XMLStreamWriter xml = xmlOutputFactory.createXMLStreamWriter(writer); - xml.writeStartDocument(); - xml.writeStartElement(type); - xml.writeAttribute("resourceId", resourceId); - if (attrs != null) { - //this assumes args has name value pairs alternating, name0 at 0, val0 at 1, name1 at 2, val2 at 3, etc. 
- for (int i = 0; i < attrs.length - 1; i++) { - xml.writeAttribute(attrs[i], attrs[i + 1]); - } - } - if (t != null) { - StringWriter stackWriter = new StringWriter(); - PrintWriter printWriter = new PrintWriter(stackWriter); - t.printStackTrace(printWriter); - printWriter.flush(); - stackWriter.flush(); - xml.writeCharacters(stackWriter.toString()); - } - xml.writeEndElement(); - xml.writeEndDocument(); - xml.flush(); - xml.close(); - } catch (XMLStreamException e) { - logger.error("error writing xml stream for: " + resourceId, t); - } - return writer.toString(); - } - - private FileResource getNextFileResource() throws InterruptedException { - FileResource fileResource = null; - long start = new Date().getTime(); - while (fileResource == null) { - //check to see if thread is interrupted before polling - if (Thread.currentThread().isInterrupted()) { - setEndedState(STATE.THREAD_INTERRUPTED); - logger.debug("Consumer thread was interrupted."); - break; - } - - synchronized(lock) { - //need to lock here to prevent race condition with other threads setting state - if (currentState != STATE.ACTIVELY_CONSUMING) { - logger.debug("Consumer already closed because of: "+ currentState.toString()); - break; - } - } - fileResource = fileQueue.poll(1L, TimeUnit.SECONDS); - if (fileResource != null) { - if (fileResource instanceof PoisonFileResource) { - setEndedState(STATE.SWALLOWED_POISON); - fileResource = null; - } - break; - } - logger.debug(consumerId + " is waiting for file and the queue size is: " + fileQueue.size()); - - long elapsed = new Date().getTime() - start; - if (maxConsecWaitInMillis > 0 && elapsed > maxConsecWaitInMillis) { - setEndedState(STATE.EXCEEDED_MAX_CONSEC_WAIT_MILLIS); - break; - } - } - return fileResource; - } - - protected void close(Closeable closeable) { - if (closeable != null) { - try { - closeable.close(); - } catch (IOException e){ - logger.error(e.getMessage()); - } - } - closeable = null; - } - - protected void flushAndClose(Closeable 
closeable) { - if (closeable == null) { - return; - } - if (closeable instanceof Flushable){ - try { - ((Flushable)closeable).flush(); - } catch (IOException e) { - logger.error(e.getMessage()); - } - } - close(closeable); - } - - //do not overwrite a finished state except if - //not yet started, actively consuming or shutting down. This should - //represent the initial cause; all subsequent calls - //to set will be ignored!!! - private void setEndedState(STATE cause) { - synchronized(lock) { - if (currentState == STATE.NOT_YET_STARTED || - currentState == STATE.ACTIVELY_CONSUMING || - currentState == STATE.ASKED_TO_SHUTDOWN) { - currentState = cause; - } - } - } - - /** - * Utility method to handle logging equivalently among all - * implementing classes. Use, override or avoid as desired. - * - * @param resourceId resourceId - * @param parser parser to use - * @param is inputStream (will be closed by this method!) - * @param handler handler for the content - * @param m metadata - * @param parseContext parse context - * @throws Throwable (logs and then throws whatever was thrown (if anything) - */ - protected void parse(final String resourceId, final Parser parser, InputStream is, - final ContentHandler handler, - final Metadata m, final ParseContext parseContext) throws Throwable { - - try { - parser.parse(is, handler, m, parseContext); - } catch (Throwable t) { - if (t instanceof OutOfMemoryError) { - logger.error(getXMLifiedLogMsg(OOM, - resourceId, t)); - } else if (t instanceof Error) { - logger.error(getXMLifiedLogMsg(PARSE_ERR, - resourceId, t)); - } else { - logger.warn(getXMLifiedLogMsg(PARSE_EX, - resourceId, t)); - incrementHandledExceptions(); - } - throw t; - } finally { - close(is); - } - } - -} diff --git a/tika-batch/src/main/java/org/apache/tika/batch/FileResourceCrawler.java b/tika-batch/src/main/java/org/apache/tika/batch/FileResourceCrawler.java deleted file mode 100644 index 4dc4f2f..0000000 --- 
a/tika-batch/src/main/java/org/apache/tika/batch/FileResourceCrawler.java +++ /dev/null @@ -1,270 +0,0 @@ -package org.apache.tika.batch; - -/* - * Licensed to the Apache Software Foundation (ASF) under one or more - * contributor license agreements. See the NOTICE file distributed with - * this work for additional information regarding copyright ownership. - * The ASF licenses this file to You under the Apache License, Version 2.0 - * (the "License"); you may not use this file except in compliance with - * the License. You may obtain a copy of the License at - * - * http://www.apache.org/licenses/LICENSE-2.0 - * - * Unless required by applicable law or agreed to in writing, software - * distributed under the License is distributed on an "AS IS" BASIS, - * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. - * See the License for the specific language governing permissions and - * limitations under the License. - */ - -import java.util.Date; -import java.util.concurrent.ArrayBlockingQueue; -import java.util.concurrent.Callable; -import java.util.concurrent.TimeUnit; - -import org.apache.tika.extractor.DocumentSelector; -import org.apache.tika.metadata.Metadata; -import org.slf4j.Logger; -import org.slf4j.LoggerFactory; - -public abstract class FileResourceCrawler implements Callable { - - protected final static int SKIPPED = 0; - protected final static int ADDED = 1; - protected final static int STOP_NOW = 2; - - private volatile boolean hasCompletedCrawling = false; - private volatile boolean shutDownNoPoison = false; - private volatile boolean isActive = true; - private volatile boolean timedOut = false; - - //how long to pause if can't add to queue - private static final long PAUSE_INCREMENT_MILLIS = 1000; - - protected static Logger logger = LoggerFactory.getLogger(FileResourceCrawler.class.toString()); - - private int maxFilesToAdd = -1; - private int maxFilesToConsider = -1; - - private final ArrayBlockingQueue queue; - private final int 
numConsumers; - - - private long maxConsecWaitInMillis = 300000;//300,000ms = 5 minutes - private DocumentSelector documentSelector = null; - - //number of files added to queue - private int added = 0; - //number of files considered including those that were rejected by documentSelector - private int considered = 0; - - /** - * @param queue shared queue - * @param numConsumers number of consumers (needs to know how many poisons to add when done) - */ - public FileResourceCrawler(ArrayBlockingQueue queue, int numConsumers) { - this.queue = queue; - this.numConsumers = numConsumers; - } - - /** - * Implement this to control the addition of FileResources. Call {@link #tryToAdd} - * to add FileResources to the queue. - * - * @throws InterruptedException - */ - public abstract void start() throws InterruptedException; - - public FileResourceCrawlerFutureResult call() { - try { - start(); - } catch (InterruptedException e) { - //this can be triggered by shutdownNow in BatchProcess - logger.info("InterruptedException in FileCrawler: " + e.getMessage()); - } catch (Exception e) { - logger.error("Exception in FileResourceCrawler: " + e.getMessage()); - } finally { - isActive = false; - } - - try { - shutdown(); - } catch (InterruptedException e) { - //swallow - } - - return new FileResourceCrawlerFutureResult(considered, added); - } - - /** - * - * @param fileResource resource to add - * @return int status of the attempt (SKIPPED, ADDED, STOP_NOW) to add the resource to the queue. 
- * @throws InterruptedException - */ - protected int tryToAdd(FileResource fileResource) throws InterruptedException { - - if (maxFilesToAdd > -1 && added >= maxFilesToAdd) { - return STOP_NOW; - } - - if (maxFilesToConsider > -1 && considered > maxFilesToConsider) { - return STOP_NOW; - } - - boolean isAdded = false; - if (select(fileResource.getMetadata())) { - long totalConsecutiveWait = 0; - while (queue.offer(fileResource, 1L, TimeUnit.SECONDS) == false) { - - logger.info("FileResourceCrawler is pausing. Queue is full: " + queue.size()); - Thread.sleep(PAUSE_INCREMENT_MILLIS); - totalConsecutiveWait += PAUSE_INCREMENT_MILLIS; - if (maxConsecWaitInMillis > -1 && totalConsecutiveWait > maxConsecWaitInMillis) { - timedOut = true; - logger.error("Crawler had to wait longer than max consecutive wait time."); - throw new InterruptedException("FileResourceCrawler had to wait longer than max consecutive wait time."); - } - if (Thread.currentThread().isInterrupted()) { - logger.info("FileResourceCrawler shutting down because of interrupted thread."); - throw new InterruptedException("FileResourceCrawler interrupted."); - } - } - isAdded = true; - added++; - } else { - logger.debug("crawler did not select: "+fileResource.getResourceId()); - } - considered++; - return (isAdded)?ADDED:SKIPPED; - } - - //Warning! Depending on the value of maxConsecWaitInMillis - //this could try forever in vain to add poison to the queue. 
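The tryToAdd loop above implements backpressure: it calls `offer` with a one-second timeout, pauses, and gives up once the total consecutive wait exceeds a cap. A minimal standalone sketch of that pattern (class and method names here are illustrative, not Tika's):

```java
import java.util.concurrent.ArrayBlockingQueue;
import java.util.concurrent.BlockingQueue;
import java.util.concurrent.TimeUnit;

// Illustrative sketch: offer to a bounded queue, bailing out once the
// accumulated consecutive wait exceeds maxWaitMillis, as tryToAdd does above.
class BoundedOffer {
    static boolean offerWithCap(BlockingQueue<String> queue, String item,
                                long pollMillis, long maxWaitMillis)
            throws InterruptedException {
        long waited = 0;
        while (!queue.offer(item, pollMillis, TimeUnit.MILLISECONDS)) {
            waited += pollMillis;
            if (maxWaitMillis > -1 && waited > maxWaitMillis) {
                return false; // timed out, like the crawler's timedOut flag
            }
            if (Thread.currentThread().isInterrupted()) {
                throw new InterruptedException("interrupted while offering");
            }
        }
        return true;
    }
}
```

The crawler throws InterruptedException on timeout rather than returning; the boolean here is only to keep the sketch testable.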
- private void shutdown() throws InterruptedException{ - logger.debug("FileResourceCrawler entering shutdown"); - if (hasCompletedCrawling || shutDownNoPoison) { - return; - } - int i = 0; - long start = new Date().getTime(); - while (queue.offer(new PoisonFileResource(), 1L, TimeUnit.SECONDS)) { - if (shutDownNoPoison) { - logger.debug("quitting the poison loop because shutDownNoPoison is now true"); - return; - } - if (Thread.currentThread().isInterrupted()) { - logger.debug("thread interrupted while trying to add poison"); - return; - } - long elapsed = new Date().getTime() - start; - if (maxConsecWaitInMillis > -1 && elapsed > maxConsecWaitInMillis) { - logger.error("Crawler timed out while trying to add poison"); - return; - } - logger.debug("added "+i+" number of PoisonFileResource(s)"); - if (i++ >= numConsumers) { - break; - } - - } - hasCompletedCrawling = true; - } - - /** - * If the crawler stops for any reason, it is no longer active. - * - * @return whether crawler is active or not - */ - public boolean isActive() { - return isActive; - } - - public void setMaxConsecWaitInMillis(long maxConsecWaitInMillis) { - this.maxConsecWaitInMillis = maxConsecWaitInMillis; - } - public void setDocumentSelector(DocumentSelector documentSelector) { - this.documentSelector = documentSelector; - } - - public int getConsidered() { - return considered; - } - - protected boolean select(Metadata m) { - return documentSelector.select(m); - } - - /** - * Maximum number of files to add. If {@link #maxFilesToAdd} < 0 (default), - * then this crawler will add all documents. - * - * @param maxFilesToAdd maximum number of files to add to the queue - */ - public void setMaxFilesToAdd(int maxFilesToAdd) { - this.maxFilesToAdd = maxFilesToAdd; - } - - - /** - * Maximum number of files to consider. A file is considered - * whether or not the DocumentSelector selects a document. - *

    - * If {@link #maxFilesToConsider} < 0 (default), then this crawler - * will add all documents. - * - * @param maxFilesToConsider maximum number of files to consider adding to the queue - */ - public void setMaxFilesToConsider(int maxFilesToConsider) { - this.maxFilesToConsider = maxFilesToConsider; - } - - /** - * Use sparingly. This synchronizes on the queue! - * @return whether this queue contains any non-poison file resources - */ - public boolean isQueueEmpty() { - int size= 0; - synchronized(queue) { - for (FileResource aQueue : queue) { - if (!(aQueue instanceof PoisonFileResource)) { - size++; - } - } - } - return size == 0; - } - - /** - * Returns whether the crawler timed out while trying to add a resource - * to the queue. - *

    - * If the crawler timed out while trying to add poison, this is not - * set to true. - * - * @return whether this was timed out or not - */ - public boolean wasTimedOut() { - return timedOut; - } - - /** - * - * @return number of files that this crawler added to the queue - */ - public int getAdded() { - return added; - } - - /** - * Set to true to shut down the FileResourceCrawler without - * adding poison. Do this only if you've already called another mechanism - * to request that consumers shut down. This prevents a potential deadlock issue - * where the crawler is trying to add to the queue, but it is full. - * - * @return - */ - public void shutDownNoPoison() { - this.shutDownNoPoison = true; - } -} diff --git a/tika-batch/src/main/java/org/apache/tika/batch/FileResourceCrawlerFutureResult.java b/tika-batch/src/main/java/org/apache/tika/batch/FileResourceCrawlerFutureResult.java deleted file mode 100644 index b8c696c..0000000 --- a/tika-batch/src/main/java/org/apache/tika/batch/FileResourceCrawlerFutureResult.java +++ /dev/null @@ -1,37 +0,0 @@ -package org.apache.tika.batch; - -/* - * Licensed to the Apache Software Foundation (ASF) under one or more - * contributor license agreements. See the NOTICE file distributed with - * this work for additional information regarding copyright ownership. - * The ASF licenses this file to You under the Apache License, Version 2.0 - * (the "License"); you may not use this file except in compliance with - * the License. You may obtain a copy of the License at - * - * http://www.apache.org/licenses/LICENSE-2.0 - * - * Unless required by applicable law or agreed to in writing, software - * distributed under the License is distributed on an "AS IS" BASIS, - * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. - * See the License for the specific language governing permissions and - * limitations under the License. 
- */ - -class FileResourceCrawlerFutureResult implements IFileProcessorFutureResult { - - private final int considered; - private final int added; - - protected FileResourceCrawlerFutureResult(int considered, int added) { - this.considered = considered; - this.added = added; - } - - protected int getConsidered() { - return considered; - } - - protected int getAdded() { - return added; - } -} diff --git a/tika-batch/src/main/java/org/apache/tika/batch/FileStarted.java b/tika-batch/src/main/java/org/apache/tika/batch/FileStarted.java deleted file mode 100644 index 3a8d4f4..0000000 --- a/tika-batch/src/main/java/org/apache/tika/batch/FileStarted.java +++ /dev/null @@ -1,113 +0,0 @@ -package org.apache.tika.batch; - -/* - * Licensed to the Apache Software Foundation (ASF) under one or more - * contributor license agreements. See the NOTICE file distributed with - * this work for additional information regarding copyright ownership. - * The ASF licenses this file to You under the Apache License, Version 2.0 - * (the "License"); you may not use this file except in compliance with - * the License. You may obtain a copy of the License at - * - * http://www.apache.org/licenses/LICENSE-2.0 - * - * Unless required by applicable law or agreed to in writing, software - * distributed under the License is distributed on an "AS IS" BASIS, - * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. - * See the License for the specific language governing permissions and - * limitations under the License. - */ - -import java.util.Date; - -/** - * Simple class to record the time when a FileResource's processing started. - */ -class FileStarted { - - private final String resourceId; - private final long started; - - /** - * Initializes a new FileStarted class with {@link #resourceId} - * and sets {@link #started} as new Date().getTime(). 
- * - * @param resourceId string for unique resource id - */ - public FileStarted(String resourceId) { - this(resourceId, new Date().getTime()); - } - - public FileStarted(String resourceId, long started) { - this.resourceId = resourceId; - this.started = started; - } - - - /** - * @return id of resource - */ - public String getResourceId() { - return resourceId; - } - - /** - * @return time at which processing on this file started - */ - public long getStarted() { - return started; - } - - /** - * @return elapsed milliseconds this the start of processing of this - * file resource - */ - public long getElapsedMillis() { - long now = new Date().getTime(); - return now - started; - } - - @Override - public int hashCode() { - final int prime = 31; - int result = 1; - result = prime * result - + ((resourceId == null) ? 0 : resourceId.hashCode()); - result = prime * result + (int) (started ^ (started >>> 32)); - return result; - } - - @Override - public boolean equals(Object obj) { - if (this == obj) { - return true; - } - if (obj == null) { - return false; - } - if (!(obj instanceof FileStarted)) { - return false; - } - FileStarted other = (FileStarted) obj; - if (resourceId == null) { - if (other.resourceId != null) { - return false; - } - } else if (!resourceId.equals(other.resourceId)) { - return false; - } - return started == other.started; - } - - @Override - public String toString() { - StringBuilder builder = new StringBuilder(); - builder.append("FileStarted [resourceId="); - builder.append(resourceId); - builder.append(", started="); - builder.append(started); - builder.append("]"); - return builder.toString(); - } - - -} diff --git a/tika-batch/src/main/java/org/apache/tika/batch/IFileProcessorFutureResult.java b/tika-batch/src/main/java/org/apache/tika/batch/IFileProcessorFutureResult.java deleted file mode 100644 index 258c36f..0000000 --- a/tika-batch/src/main/java/org/apache/tika/batch/IFileProcessorFutureResult.java +++ /dev/null @@ -1,26 +0,0 @@ 
-package org.apache.tika.batch; - -/* - * Licensed to the Apache Software Foundation (ASF) under one or more - * contributor license agreements. See the NOTICE file distributed with - * this work for additional information regarding copyright ownership. - * The ASF licenses this file to You under the Apache License, Version 2.0 - * (the "License"); you may not use this file except in compliance with - * the License. You may obtain a copy of the License at - * - * http://www.apache.org/licenses/LICENSE-2.0 - * - * Unless required by applicable law or agreed to in writing, software - * distributed under the License is distributed on an "AS IS" BASIS, - * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. - * See the License for the specific language governing permissions and - * limitations under the License. - */ - -/** - * stub interface to allow for different result types from different processors - * - */ -public interface IFileProcessorFutureResult { - -} diff --git a/tika-batch/src/main/java/org/apache/tika/batch/Interrupter.java b/tika-batch/src/main/java/org/apache/tika/batch/Interrupter.java deleted file mode 100644 index 6a0439d..0000000 --- a/tika-batch/src/main/java/org/apache/tika/batch/Interrupter.java +++ /dev/null @@ -1,59 +0,0 @@ -package org.apache.tika.batch; - -/* - * Licensed to the Apache Software Foundation (ASF) under one or more - * contributor license agreements. See the NOTICE file distributed with - * this work for additional information regarding copyright ownership. - * The ASF licenses this file to You under the Apache License, Version 2.0 - * (the "License"); you may not use this file except in compliance with - * the License. 
You may obtain a copy of the License at - * - * http://www.apache.org/licenses/LICENSE-2.0 - * - * Unless required by applicable law or agreed to in writing, software - * distributed under the License is distributed on an "AS IS" BASIS, - * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. - * See the License for the specific language governing permissions and - * limitations under the License. - */ - -import java.io.BufferedReader; -import java.io.IOException; -import java.io.InputStreamReader; -import java.util.concurrent.Callable; - -import org.slf4j.Logger; -import org.slf4j.LoggerFactory; - -import static java.nio.charset.StandardCharsets.UTF_8; - - -/** - * Class that waits for input on System.in. If the user enters a keystroke on - * System.in, this will send a signal to the FileResourceRunner to shutdown gracefully. - * - *

    - * In the future, this may implement a common IInterrupter interface for more flexibility. - */ -public class Interrupter implements Callable { - - private Logger logger = LoggerFactory.getLogger(Interrupter.class); - public IFileProcessorFutureResult call(){ - try{ - BufferedReader reader = new BufferedReader(new InputStreamReader(System.in, UTF_8)); - while (true){ - if (reader.ready()){ - reader.readLine(); - break; - } else { - Thread.sleep(1000); - } - } - } catch (InterruptedException e){ - //canceller was interrupted - } catch (IOException e){ - logger.error("IOException from STDIN in CommandlineInterrupter."); - } - return new InterrupterFutureResult(); - } -} diff --git a/tika-batch/src/main/java/org/apache/tika/batch/InterrupterFutureResult.java b/tika-batch/src/main/java/org/apache/tika/batch/InterrupterFutureResult.java deleted file mode 100644 index c4d3704..0000000 --- a/tika-batch/src/main/java/org/apache/tika/batch/InterrupterFutureResult.java +++ /dev/null @@ -1,22 +0,0 @@ -package org.apache.tika.batch; - -/* - * Licensed to the Apache Software Foundation (ASF) under one or more - * contributor license agreements. See the NOTICE file distributed with - * this work for additional information regarding copyright ownership. - * The ASF licenses this file to You under the Apache License, Version 2.0 - * (the "License"); you may not use this file except in compliance with - * the License. You may obtain a copy of the License at - * - * http://www.apache.org/licenses/LICENSE-2.0 - * - * Unless required by applicable law or agreed to in writing, software - * distributed under the License is distributed on an "AS IS" BASIS, - * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. - * See the License for the specific language governing permissions and - * limitations under the License. 
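The Interrupter above polls `BufferedReader.ready()` rather than blocking in `readLine()`, so the watcher thread can notice interruption between checks. A reduced sketch of that polling loop over an arbitrary Reader (the injectable reader and the timeout are assumptions added here for testability; Tika's class reads System.in and waits indefinitely):

```java
import java.io.BufferedReader;
import java.io.IOException;
import java.io.Reader;

// Illustrative: poll ready() so the thread stays responsive to interruption
// instead of blocking forever inside readLine().
class KeystrokeWatcher {
    static boolean awaitLine(Reader in, long pollMillis, long maxWaitMillis)
            throws IOException, InterruptedException {
        BufferedReader reader = new BufferedReader(in);
        long waited = 0;
        while (waited <= maxWaitMillis) {
            if (reader.ready()) {   // input available: consume one line and signal
                reader.readLine();
                return true;
            }
            Thread.sleep(pollMillis);
            waited += pollMillis;
        }
        return false;               // gave up waiting (sketch-only behavior)
    }
}
```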
- */ - -public class InterrupterFutureResult implements IFileProcessorFutureResult { - -} diff --git a/tika-batch/src/main/java/org/apache/tika/batch/OutputStreamFactory.java b/tika-batch/src/main/java/org/apache/tika/batch/OutputStreamFactory.java deleted file mode 100644 index acbf58e..0000000 --- a/tika-batch/src/main/java/org/apache/tika/batch/OutputStreamFactory.java +++ /dev/null @@ -1,29 +0,0 @@ -package org.apache.tika.batch; - -/* - * Licensed to the Apache Software Foundation (ASF) under one or more - * contributor license agreements. See the NOTICE file distributed with - * this work for additional information regarding copyright ownership. - * The ASF licenses this file to You under the Apache License, Version 2.0 - * (the "License"); you may not use this file except in compliance with - * the License. You may obtain a copy of the License at - * - * http://www.apache.org/licenses/LICENSE-2.0 - * - * Unless required by applicable law or agreed to in writing, software - * distributed under the License is distributed on an "AS IS" BASIS, - * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. - * See the License for the specific language governing permissions and - * limitations under the License. - */ - -import org.apache.tika.metadata.Metadata; - -import java.io.IOException; -import java.io.OutputStream; - -public interface OutputStreamFactory { - - public OutputStream getOutputStream(Metadata metadata) throws IOException; - -} diff --git a/tika-batch/src/main/java/org/apache/tika/batch/ParallelFileProcessingResult.java b/tika-batch/src/main/java/org/apache/tika/batch/ParallelFileProcessingResult.java deleted file mode 100644 index d446a80..0000000 --- a/tika-batch/src/main/java/org/apache/tika/batch/ParallelFileProcessingResult.java +++ /dev/null @@ -1,109 +0,0 @@ -package org.apache.tika.batch; - -/* - * Licensed to the Apache Software Foundation (ASF) under one or more - * contributor license agreements. 
See the NOTICE file distributed with - * this work for additional information regarding copyright ownership. - * The ASF licenses this file to You under the Apache License, Version 2.0 - * (the "License"); you may not use this file except in compliance with - * the License. You may obtain a copy of the License at - * - * http://www.apache.org/licenses/LICENSE-2.0 - * - * Unless required by applicable law or agreed to in writing, software - * distributed under the License is distributed on an "AS IS" BASIS, - * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. - * See the License for the specific language governing permissions and - * limitations under the License. - */ - -public class ParallelFileProcessingResult { - private final int considered; - private final int added; - private final int consumed; - private final int numberHandledExceptions; - private final double secondsElapsed; - private final int exitStatus; - private final String causeForTermination; - - public ParallelFileProcessingResult(int considered, int added, - int consumed, int numberHandledExceptions, - double secondsElapsed, - int exitStatus, - String causeForTermination) { - this.considered = considered; - this.added = added; - this.consumed = consumed; - this.numberHandledExceptions = numberHandledExceptions; - this.secondsElapsed = secondsElapsed; - this.exitStatus = exitStatus; - this.causeForTermination = causeForTermination; - } - - /** - * Returns the number of file resources considered. - * If a filter causes the crawler to ignore a number of resources, - * this number could be higher than that returned by {@link #getConsumed()}. - * - * @return number of file resources considered - */ - public int getConsidered() { - return considered; - } - - /** - * @return number of resources added to the queue - */ - public int getAdded() { - return added; - } - - /** - * @return number of resources that were tried to be consumed. There - * may have been an exception. 
- */ - public int getConsumed() { - return consumed; - } - - /** - * @return whether the {@link BatchProcess} was interrupted - * by an {@link Interrupter}. - */ - public String getCauseForTermination() { - return causeForTermination; - } - - /** - * - * @return seconds elapsed since the start of the batch processing - */ - public double secondsElapsed() { - return secondsElapsed; - } - - public int getNumberHandledExceptions() { - return numberHandledExceptions; - } - - /** - * - * @return intendedExitStatus - */ - public int getExitStatus() { - return exitStatus; - } - - @Override - public String toString() { - return "ParallelFileProcessingResult{" + - "considered=" + considered + - ", added=" + added + - ", consumed=" + consumed + - ", numberHandledExceptions=" + numberHandledExceptions + - ", secondsElapsed=" + secondsElapsed + - ", exitStatus=" + exitStatus + - ", causeForTermination='" + causeForTermination + '\'' + - '}'; - } -} diff --git a/tika-batch/src/main/java/org/apache/tika/batch/ParserFactory.java b/tika-batch/src/main/java/org/apache/tika/batch/ParserFactory.java deleted file mode 100644 index 6908a17..0000000 --- a/tika-batch/src/main/java/org/apache/tika/batch/ParserFactory.java +++ /dev/null @@ -1,37 +0,0 @@ -package org.apache.tika.batch; - -/* - * Licensed to the Apache Software Foundation (ASF) under one or more - * contributor license agreements. See the NOTICE file distributed with - * this work for additional information regarding copyright ownership. - * The ASF licenses this file to You under the Apache License, Version 2.0 - * (the "License"); you may not use this file except in compliance with - * the License. You may obtain a copy of the License at - * - * http://www.apache.org/licenses/LICENSE-2.0 - * - * Unless required by applicable law or agreed to in writing, software - * distributed under the License is distributed on an "AS IS" BASIS, - * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. 
- * See the License for the specific language governing permissions and - * limitations under the License. - */ - -import org.apache.tika.config.TikaConfig; -import org.apache.tika.parser.Parser; - -public abstract class ParserFactory { - - private boolean parseRecursively = true; - - public abstract Parser getParser(TikaConfig config); - - public boolean getParseRecursively() { - return parseRecursively; - } - - public void setParseRecursively(boolean parseRecursively) { - this.parseRecursively = parseRecursively; - } - -} diff --git a/tika-batch/src/main/java/org/apache/tika/batch/PoisonFileResource.java b/tika-batch/src/main/java/org/apache/tika/batch/PoisonFileResource.java deleted file mode 100644 index 6cfff16..0000000 --- a/tika-batch/src/main/java/org/apache/tika/batch/PoisonFileResource.java +++ /dev/null @@ -1,54 +0,0 @@ -package org.apache.tika.batch; - -/* - * Licensed to the Apache Software Foundation (ASF) under one or more - * contributor license agreements. See the NOTICE file distributed with - * this work for additional information regarding copyright ownership. - * The ASF licenses this file to You under the Apache License, Version 2.0 - * (the "License"); you may not use this file except in compliance with - * the License. You may obtain a copy of the License at - * - * http://www.apache.org/licenses/LICENSE-2.0 - * - * Unless required by applicable law or agreed to in writing, software - * distributed under the License is distributed on an "AS IS" BASIS, - * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. - * See the License for the specific language governing permissions and - * limitations under the License. - */ - -import org.apache.tika.metadata.Metadata; - -import java.io.InputStream; - -/** - * Sentinel class for the crawler to add to the queue to let - * the consumers know that they should shutdown. 
- */ -class PoisonFileResource implements FileResource { - - /** - * always returns null - */ - @Override - public Metadata getMetadata() { - return null; - } - - /** - * always returns null - */ - @Override - public InputStream openInputStream() { - return null; - } - - /** - * always returns null - */ - @Override - public String getResourceId() { - return null; - } - -} diff --git a/tika-batch/src/main/java/org/apache/tika/batch/StatusReporter.java b/tika-batch/src/main/java/org/apache/tika/batch/StatusReporter.java deleted file mode 100644 index 6ce1a17..0000000 --- a/tika-batch/src/main/java/org/apache/tika/batch/StatusReporter.java +++ /dev/null @@ -1,227 +0,0 @@ -package org.apache.tika.batch; - -/* - * Licensed to the Apache Software Foundation (ASF) under one or more - * contributor license agreements. See the NOTICE file distributed with - * this work for additional information regarding copyright ownership. - * The ASF licenses this file to You under the Apache License, Version 2.0 - * (the "License"); you may not use this file except in compliance with - * the License. You may obtain a copy of the License at - * - * http://www.apache.org/licenses/LICENSE-2.0 - * - * Unless required by applicable law or agreed to in writing, software - * distributed under the License is distributed on an "AS IS" BASIS, - * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. - * See the License for the specific language governing permissions and - * limitations under the License. - */ - -import java.text.NumberFormat; -import java.util.Date; -import java.util.Locale; -import java.util.concurrent.Callable; - -import org.apache.tika.util.DurationFormatUtils; -import org.slf4j.Logger; -import org.slf4j.LoggerFactory; - -/** - * Basic class to use for reporting status from both the crawler and the consumers. - * This wakes up roughly every {@link #sleepMillis} and log.info's a status report. 
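PoisonFileResource is a classic poison-pill sentinel: the crawler enqueues one per consumer at shutdown, and each consumer stops when it polls one. A generic sketch of the consumer side of that handshake (names are illustrative, not Tika's):

```java
import java.util.concurrent.BlockingQueue;

// Illustrative poison-pill handshake: the producer enqueues one sentinel per
// consumer; each consumer drains items until it takes its sentinel.
class PoisonPill {
    static final String POISON = "\u0000POISON";   // sentinel value

    static int drainUntilPoison(BlockingQueue<String> queue) throws InterruptedException {
        int consumed = 0;
        while (true) {
            String item = queue.take();
            if (POISON.equals(item)) {   // swallowed poison: stop consuming
                return consumed;
            }
            consumed++;
        }
    }
}
```

Tika uses an instanceof check on a dedicated sentinel class instead of a magic value, which avoids any collision with real payloads.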
- */ - -public class StatusReporter implements Callable { - - private final Logger logger = LoggerFactory.getLogger(StatusReporter.class); - - //require references to these so that the - //StatusReporter can query them when it wakes up - private final ConsumersManager consumersManager; - private final FileResourceCrawler crawler; - - //local time that the StatusReporter started - private final long start; - //how long to sleep between reporting intervals - private long sleepMillis = 1000; - - //how long before considering a parse "stale" (potentially hung forever) - private long staleThresholdMillis = 100000; - - private volatile boolean isShuttingDown = false; - - /** - * Initialize with the crawler and consumers - * - * @param crawler crawler to ping at intervals - * @param consumersManager consumers to ping at intervals - */ - public StatusReporter(FileResourceCrawler crawler, ConsumersManager consumersManager) { - this.consumersManager = consumersManager; - this.crawler = crawler; - start = new Date().getTime(); - } - - /** - * Override for different behavior. - *
<p/>
    - * This reports the string at the info level to this class' logger. - * - * @param s string to report - */ - protected void report(String s) { - logger.info(s); - } - - /** - * Startup the reporter. - */ - public IFileProcessorFutureResult call() { - NumberFormat numberFormat = NumberFormat.getNumberInstance(Locale.ROOT); - try { - while (true) { - Thread.sleep(sleepMillis); - int cnt = getRoughCountConsumed(); - int exceptions = getRoughCountExceptions(); - long elapsed = new Date().getTime() - start; - double elapsedSecs = (double) elapsed / (double) 1000; - int avg = (elapsedSecs > 5 || cnt > 100) ? (int) ((double) cnt / elapsedSecs) : -1; - - String elapsedString = DurationFormatUtils.formatMillis(new Date().getTime() - start); - String docsPerSec = avg > -1 ? String.format(Locale.ROOT, - " (%s docs per sec)", - numberFormat.format(avg)) : ""; - String msg = - String.format( - Locale.ROOT, - "Processed %s documents in %s%s.", - numberFormat.format(cnt), elapsedString, docsPerSec); - report(msg); - if (exceptions == 1){ - msg = "There has been one handled exception."; - } else { - msg = - String.format(Locale.ROOT, - "There have been %s handled exceptions.", - numberFormat.format(exceptions)); - } - report(msg); - - reportStale(); - - int stillAlive = getStillAlive(); - if (stillAlive == 1) { - msg = "There is one file processor still active."; - } else { - msg = "There are " + numberFormat.format(stillAlive) + " file processors still active."; - } - report(msg); - - int crawled = crawler.getConsidered(); - int added = crawler.getAdded(); - if (crawled == 1) { - msg = "The directory crawler has considered 1 file,"; - } else { - msg = "The directory crawler has considered " + - numberFormat.format(crawled) + " files, "; - } - if (added == 1) { - msg += "and it has added 1 file."; - } else { - msg += "and it has added " + - numberFormat.format(crawler.getAdded()) + " files."; - } - msg += "\n"; - report(msg); - - if (! 
crawler.isActive()) { - msg = "The directory crawler has completed its crawl.\n"; - report(msg); - } - if (isShuttingDown) { - msg = "Process is shutting down now."; - report(msg); - } - } - } catch (InterruptedException e) { - //swallow - } - return new StatusReporterFutureResult(); - } - - - /** - * Set the amount of time to sleep between reports. - * @param sleepMillis length to sleep btwn reports in milliseconds - */ - public void setSleepMillis(long sleepMillis) { - this.sleepMillis = sleepMillis; - } - - /** - * Set the amount of time in milliseconds to use as the threshold for determining - * a stale parse. - * - * @param staleThresholdMillis threshold for determining whether or not to report a stale - */ - public void setStaleThresholdMillis(long staleThresholdMillis) { - this.staleThresholdMillis = staleThresholdMillis; - } - - - private void reportStale() { - for (FileResourceConsumer consumer : consumersManager.getConsumers()) { - FileStarted fs = consumer.getCurrentFile(); - if (fs == null) { - continue; - } - long elapsed = fs.getElapsedMillis(); - if (elapsed > staleThresholdMillis) { - String elapsedString = Double.toString((double) elapsed / (double) 1000); - report("A thread has been working on " + fs.getResourceId() + - " for " + elapsedString + " seconds."); - } - } - } - - /* - * This returns a rough (unsynchronized) count of resources consumed. - */ - private int getRoughCountConsumed() { - int ret = 0; - for (FileResourceConsumer consumer : consumersManager.getConsumers()) { - ret += consumer.getNumResourcesConsumed(); - } - return ret; - } - - private int getStillAlive() { - int ret = 0; - for (FileResourceConsumer consumer : consumersManager.getConsumers()) { - if ( consumer.isStillActive()) { - ret++; - } - } - return ret; - } - - /** - * This returns a rough (unsynchronized) count of caught/handled exceptions. 
- * @return rough count of exceptions - */ - public int getRoughCountExceptions() { - int ret = 0; - for (FileResourceConsumer consumer : consumersManager.getConsumers()) { - ret += consumer.getNumHandledExceptions(); - } - return ret; - } - - /** - * Set whether the main process is in the process of shutting down. - * @param isShuttingDown - */ - public void setIsShuttingDown(boolean isShuttingDown){ - this.isShuttingDown = isShuttingDown; - } -} diff --git a/tika-batch/src/main/java/org/apache/tika/batch/StatusReporterFutureResult.java b/tika-batch/src/main/java/org/apache/tika/batch/StatusReporterFutureResult.java deleted file mode 100644 index d0b6944..0000000 --- a/tika-batch/src/main/java/org/apache/tika/batch/StatusReporterFutureResult.java +++ /dev/null @@ -1,23 +0,0 @@ -package org.apache.tika.batch; -/* - * Licensed to the Apache Software Foundation (ASF) under one or more - * contributor license agreements. See the NOTICE file distributed with - * this work for additional information regarding copyright ownership. - * The ASF licenses this file to You under the Apache License, Version 2.0 - * (the "License"); you may not use this file except in compliance with - * the License. You may obtain a copy of the License at - * - * http://www.apache.org/licenses/LICENSE-2.0 - * - * Unless required by applicable law or agreed to in writing, software - * distributed under the License is distributed on an "AS IS" BASIS, - * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. - * See the License for the specific language governing permissions and - * limitations under the License. - */ - -/** - * Empty class for what a StatusReporter returns when it finishes. 
- */ -public class StatusReporterFutureResult implements IFileProcessorFutureResult { -} diff --git a/tika-batch/src/main/java/org/apache/tika/batch/builders/AbstractConsumersBuilder.java b/tika-batch/src/main/java/org/apache/tika/batch/builders/AbstractConsumersBuilder.java deleted file mode 100644 index bcb8e0b..0000000 --- a/tika-batch/src/main/java/org/apache/tika/batch/builders/AbstractConsumersBuilder.java +++ /dev/null @@ -1,38 +0,0 @@ -package org.apache.tika.batch.builders; - -/* - * Licensed to the Apache Software Foundation (ASF) under one or more - * contributor license agreements. See the NOTICE file distributed with - * this work for additional information regarding copyright ownership. - * The ASF licenses this file to You under the Apache License, Version 2.0 - * (the "License"); you may not use this file except in compliance with - * the License. You may obtain a copy of the License at - * - * http://www.apache.org/licenses/LICENSE-2.0 - * - * Unless required by applicable law or agreed to in writing, software - * distributed under the License is distributed on an "AS IS" BASIS, - * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. - * See the License for the specific language governing permissions and - * limitations under the License. - */ - -import org.apache.tika.batch.ConsumersManager; -import org.apache.tika.batch.FileResource; -import org.w3c.dom.Node; - -import java.util.Map; -import java.util.concurrent.ArrayBlockingQueue; - -public abstract class AbstractConsumersBuilder { - - public static int getDefaultNumConsumers(){ - int n = Runtime.getRuntime().availableProcessors()-1; - return (n < 1) ? 
1 : n; - } - - public abstract ConsumersManager build(Node node, Map runtimeAttributes, - ArrayBlockingQueue queue); - - -} diff --git a/tika-batch/src/main/java/org/apache/tika/batch/builders/BatchProcessBuilder.java b/tika-batch/src/main/java/org/apache/tika/batch/builders/BatchProcessBuilder.java deleted file mode 100644 index 83afa78..0000000 --- a/tika-batch/src/main/java/org/apache/tika/batch/builders/BatchProcessBuilder.java +++ /dev/null @@ -1,295 +0,0 @@ -package org.apache.tika.batch.builders; - -/* - * Licensed to the Apache Software Foundation (ASF) under one or more - * contributor license agreements. See the NOTICE file distributed with - * this work for additional information regarding copyright ownership. - * The ASF licenses this file to You under the Apache License, Version 2.0 - * (the "License"); you may not use this file except in compliance with - * the License. You may obtain a copy of the License at - * - * http://www.apache.org/licenses/LICENSE-2.0 - * - * Unless required by applicable law or agreed to in writing, software - * distributed under the License is distributed on an "AS IS" BASIS, - * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. - * See the License for the specific language governing permissions and - * limitations under the License. 
- */
-
-import javax.xml.parsers.DocumentBuilder;
-import javax.xml.parsers.DocumentBuilderFactory;
-import javax.xml.parsers.ParserConfigurationException;
-import java.io.IOException;
-import java.io.InputStream;
-import java.util.Collections;
-import java.util.HashMap;
-import java.util.Map;
-import java.util.concurrent.ArrayBlockingQueue;
-
-import org.apache.tika.batch.BatchProcess;
-import org.apache.tika.batch.ConsumersManager;
-import org.apache.tika.batch.FileResource;
-import org.apache.tika.batch.FileResourceCrawler;
-import org.apache.tika.batch.Interrupter;
-import org.apache.tika.batch.StatusReporter;
-import org.apache.tika.util.ClassLoaderUtil;
-import org.apache.tika.util.XMLDOMUtil;
-import org.w3c.dom.Document;
-import org.w3c.dom.Node;
-import org.w3c.dom.NodeList;
-import org.xml.sax.SAXException;
-
-
-/**
- * Builds a BatchProcessor from a combination of runtime arguments and the
- * config file.
- */
-public class BatchProcessBuilder {
-
-    public final static int DEFAULT_MAX_QUEUE_SIZE = 1000;
-    public final static String MAX_QUEUE_SIZE_KEY = "maxQueueSize";
-    public final static String NUM_CONSUMERS_KEY = "numConsumers";
-
-    /**
-     * Builds a BatchProcess from runtime arguments and an
-     * input stream of a configuration file. With the exception of the QueueBuilder,
-     * the builders choose how to adjudicate between
-     * runtime arguments and the elements in the configuration file.
-     *
<p/>
    - * This does not close the InputStream! - * @param is inputStream - * @param runtimeAttributes incoming runtime attributes - * @return batch process - * @throws java.io.IOException - */ - public BatchProcess build(InputStream is, Map runtimeAttributes) throws IOException { - Document doc = null; - DocumentBuilderFactory fact = DocumentBuilderFactory.newInstance(); - DocumentBuilder docBuilder = null; - try { - docBuilder = fact.newDocumentBuilder(); - doc = docBuilder.parse(is); - } catch (ParserConfigurationException e) { - throw new IOException(e); - } catch (SAXException e) { - throw new IOException(e); - } - Node docElement = doc.getDocumentElement(); - return build(docElement, runtimeAttributes); - } - - /** - * Builds a FileResourceBatchProcessor from runtime arguments and a - * document node of a configuration file. With the exception of the QueueBuilder, - * the builders choose how to adjudicate between - * runtime arguments and the elements in the configuration file. - * - * @param docElement document element of the xml config file - * @param incomingRuntimeAttributes runtime arguments - * @return FileResourceBatchProcessor - */ - public BatchProcess build(Node docElement, Map incomingRuntimeAttributes) { - - //key components - long timeoutThresholdMillis = XMLDOMUtil.getLong("timeoutThresholdMillis", - incomingRuntimeAttributes, docElement); - long timeoutCheckPulseMillis = XMLDOMUtil.getLong("timeoutCheckPulseMillis", - incomingRuntimeAttributes, docElement); - long pauseOnEarlyTerminationMillis = XMLDOMUtil.getLong("pauseOnEarlyTerminationMillis", - incomingRuntimeAttributes, docElement); - int maxAliveTimeSeconds = XMLDOMUtil.getInt("maxAliveTimeSeconds", - incomingRuntimeAttributes, docElement); - - FileResourceCrawler crawler = null; - ConsumersManager consumersManager = null; - StatusReporter reporter = null; - Interrupter interrupter = null; - - /* - * TODO: This is a bit smelly. 
NumConsumers needs to be used by the crawler - * and the consumers. This copies the incomingRuntimeAttributes and then - * supplies the numConsumers from the commandline (if it exists) or from the config file - * At least this creates an unmodifiable defensive copy of incomingRuntimeAttributes... - */ - Map runtimeAttributes = setNumConsumersInRuntimeAttributes(docElement, incomingRuntimeAttributes); - - //build queue - ArrayBlockingQueue queue = buildQueue(docElement, runtimeAttributes); - - NodeList children = docElement.getChildNodes(); - Map keyNodes = new HashMap(); - for (int i = 0; i < children.getLength(); i++) { - Node child = children.item(i); - if (child.getNodeType() != Node.ELEMENT_NODE) { - continue; - } - String nodeName = child.getNodeName(); - keyNodes.put(nodeName, child); - } - //build consumers - consumersManager = buildConsumersManager(keyNodes.get("consumers"), runtimeAttributes, queue); - - //build crawler - crawler = buildCrawler(queue, keyNodes.get("crawler"), runtimeAttributes); - - reporter = buildReporter(crawler, consumersManager, keyNodes.get("reporter"), runtimeAttributes); - - interrupter = buildInterrupter(keyNodes.get("interrupter"), runtimeAttributes); - - BatchProcess proc = new BatchProcess( - crawler, consumersManager, reporter, interrupter); - - if (timeoutThresholdMillis > -1) { - proc.setTimeoutThresholdMillis(timeoutThresholdMillis); - } - - if (pauseOnEarlyTerminationMillis > -1) { - proc.setPauseOnEarlyTerminationMillis(pauseOnEarlyTerminationMillis); - } - - if (timeoutCheckPulseMillis > -1) { - proc.setTimeoutCheckPulseMillis(timeoutCheckPulseMillis); - } - proc.setMaxAliveTimeSeconds(maxAliveTimeSeconds); - return proc; - } - - private Interrupter buildInterrupter(Node node, Map runtimeAttributes) { - Map attrs = XMLDOMUtil.mapifyAttrs(node, runtimeAttributes); - String className = attrs.get("builderClass"); - if (className == null) { - throw new RuntimeException("Need to specify class name in interrupter element"); - 
} - InterrupterBuilder builder = ClassLoaderUtil.buildClass(InterrupterBuilder.class, className); - - return builder.build(node, runtimeAttributes); - - } - - private StatusReporter buildReporter(FileResourceCrawler crawler, ConsumersManager consumersManager, - Node node, Map runtimeAttributes) { - - Map attrs = XMLDOMUtil.mapifyAttrs(node, runtimeAttributes); - String className = attrs.get("builderClass"); - if (className == null) { - throw new RuntimeException("Need to specify class name in reporter element"); - } - StatusReporterBuilder builder = ClassLoaderUtil.buildClass(StatusReporterBuilder.class, className); - - return builder.build(crawler, consumersManager, node, runtimeAttributes); - - } - - /** - * numConsumers is needed by both the crawler and the consumers. This utility method - * is to be used to extract the number of consumers from a map of String key value pairs. - *
<p/>
-     * If the value is "default", not a parseable integer or has a value < 1,
-     * then AbstractConsumersBuilder's getDefaultNumConsumers() is used.
-     * @param attrs attributes from which to select the NUM_CONSUMERS_KEY
-     * @return number of consumers
-     */
-    public static int getNumConsumers(Map<String, String> attrs) {
-        String nString = attrs.get(BatchProcessBuilder.NUM_CONSUMERS_KEY);
-        if (nString == null || nString.equals("default")) {
-            return AbstractConsumersBuilder.getDefaultNumConsumers();
-        }
-        int n = -1;
-        try {
-            n = Integer.parseInt(nString);
-        } catch (NumberFormatException e) {
-            //swallow
-        }
-        if (n < 1) {
-            n = AbstractConsumersBuilder.getDefaultNumConsumers();
-        }
-        return n;
-    }
-
-    private Map<String, String> setNumConsumersInRuntimeAttributes(Node docElement, Map<String, String> incomingRuntimeAttributes) {
-        Map<String, String> runtimeAttributes = new HashMap<String, String>();
-
-        for (Map.Entry<String, String> e : incomingRuntimeAttributes.entrySet()) {
-            runtimeAttributes.put(e.getKey(), e.getValue());
-        }
-
-        //if this is set at runtime use that value
-        if (runtimeAttributes.containsKey(NUM_CONSUMERS_KEY)) {
-            return Collections.unmodifiableMap(runtimeAttributes);
-        }
-        Node ncNode = docElement.getAttributes().getNamedItem("numConsumers");
-        int numConsumers = -1;
-        //guard against a missing "numConsumers" attribute; parseInt(null) is caught below
-        String numConsumersString = (ncNode == null) ? null : ncNode.getNodeValue();
-        try {
-            numConsumers = Integer.parseInt(numConsumersString);
-        } catch (NumberFormatException e) {
-            //swallow and fall through to the default below
-        }
-        //TODO: should we have a max range check?
- if (numConsumers < 1) { - numConsumers = AbstractConsumersBuilder.getDefaultNumConsumers(); - } - runtimeAttributes.put(NUM_CONSUMERS_KEY, Integer.toString(numConsumers)); - return Collections.unmodifiableMap(runtimeAttributes); - } - - //tries to get maxQueueSize from main element - private ArrayBlockingQueue buildQueue(Node docElement, - Map runtimeAttributes) { - int maxQueueSize = DEFAULT_MAX_QUEUE_SIZE; - String szString = runtimeAttributes.get(MAX_QUEUE_SIZE_KEY); - - if (szString == null) { - Node szNode = docElement.getAttributes().getNamedItem(MAX_QUEUE_SIZE_KEY); - if (szNode != null) { - szString = szNode.getNodeValue(); - } - } - - if (szString != null) { - try { - maxQueueSize = Integer.parseInt(szString); - } catch (NumberFormatException e) { - //swallow - } - } - - if (maxQueueSize < 0) { - maxQueueSize = DEFAULT_MAX_QUEUE_SIZE; - } - - return new ArrayBlockingQueue(maxQueueSize); - } - - private ConsumersManager buildConsumersManager(Node node, - Map runtimeAttributes, ArrayBlockingQueue queue) { - - Map attrs = XMLDOMUtil.mapifyAttrs(node, runtimeAttributes); - String className = attrs.get("builderClass"); - if (className == null) { - throw new RuntimeException("Need to specify class name in consumers element"); - } - AbstractConsumersBuilder builder = ClassLoaderUtil.buildClass(AbstractConsumersBuilder.class, className); - - return builder.build(node, runtimeAttributes, queue); - } - - - private FileResourceCrawler buildCrawler(ArrayBlockingQueue queue, - Node node, Map runtimeAttributes) { - Map attrs = XMLDOMUtil.mapifyAttrs(node, runtimeAttributes); - String className = attrs.get("builderClass"); - if (className == null) { - throw new RuntimeException("Need to specify class name in crawler element"); - } - - ICrawlerBuilder builder = ClassLoaderUtil.buildClass(ICrawlerBuilder.class, className); - return builder.build(node, runtimeAttributes, queue); - } - - - - - -} diff --git 
a/tika-batch/src/main/java/org/apache/tika/batch/builders/CommandLineParserBuilder.java b/tika-batch/src/main/java/org/apache/tika/batch/builders/CommandLineParserBuilder.java deleted file mode 100644 index 9a32f12..0000000 --- a/tika-batch/src/main/java/org/apache/tika/batch/builders/CommandLineParserBuilder.java +++ /dev/null @@ -1,143 +0,0 @@ -package org.apache.tika.batch.builders; - -/* - * Licensed to the Apache Software Foundation (ASF) under one or more - * contributor license agreements. See the NOTICE file distributed with - * this work for additional information regarding copyright ownership. - * The ASF licenses this file to You under the Apache License, Version 2.0 - * (the "License"); you may not use this file except in compliance with - * the License. You may obtain a copy of the License at - * - * http://www.apache.org/licenses/LICENSE-2.0 - * - * Unless required by applicable law or agreed to in writing, software - * distributed under the License is distributed on an "AS IS" BASIS, - * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. - * See the License for the specific language governing permissions and - * limitations under the License. - */ - -import javax.xml.parsers.DocumentBuilder; -import javax.xml.parsers.DocumentBuilderFactory; -import javax.xml.parsers.ParserConfigurationException; -import java.io.IOException; -import java.io.InputStream; -import java.util.Locale; - -import org.apache.commons.cli.Option; -import org.apache.commons.cli.Options; -import org.w3c.dom.Document; -import org.w3c.dom.NamedNodeMap; -import org.w3c.dom.Node; -import org.w3c.dom.NodeList; -import org.xml.sax.SAXException; - -/** - * Reads configurable options from a config file and returns org.apache.commons.cli.Options - * object to be used in commandline parser. This allows users and developers to set - * which options should be made available via the commandline. 
- */ -public class CommandLineParserBuilder { - - public Options build(InputStream is) throws IOException { - Document doc = null; - DocumentBuilderFactory fact = DocumentBuilderFactory.newInstance(); - DocumentBuilder docBuilder = null; - try { - docBuilder = fact.newDocumentBuilder(); - doc = docBuilder.parse(is); - } catch (ParserConfigurationException e) { - throw new IOException(e); - } catch (SAXException e) { - throw new IOException(e); - } - Node docElement = doc.getDocumentElement(); - NodeList children = docElement.getChildNodes(); - Node commandlineNode = null; - for (int i = 0; i < children.getLength(); i++) { - Node child = children.item(i); - if (child.getNodeType() != Node.ELEMENT_NODE) { - continue; - } - String nodeName = child.getNodeName(); - if (nodeName.equals("commandline")) { - commandlineNode = child; - break; - } - } - Options options = new Options(); - if (commandlineNode == null) { - return options; - } - NodeList optionNodes = commandlineNode.getChildNodes(); - for (int i = 0; i < optionNodes.getLength(); i++) { - - Node optionNode = optionNodes.item(i); - if (optionNode.getNodeType() != Node.ELEMENT_NODE) { - continue; - } - Option opt = buildOption(optionNode); - if (opt != null) { - options.addOption(opt); - } - } - return options; - } - - private Option buildOption(Node optionNode) { - NamedNodeMap map = optionNode.getAttributes(); - String opt = getString(map, "opt", ""); - String description = getString(map, "description", ""); - String longOpt = getString(map, "longOpt", ""); - boolean isRequired = getBoolean(map, "required", false); - boolean hasArg = getBoolean(map, "hasArg", false); - if(opt.trim().length() == 0 || description.trim().length() == 0) { - throw new IllegalArgumentException( - "Must specify at least option and description"); - } - Option option = new Option(opt, description); - if (longOpt.trim().length() > 0) { - option.setLongOpt(longOpt); - } - if (isRequired) { - option.setRequired(true); - } - if (hasArg) { - 
option.setArgs(1); - } - return option; - } - - private boolean getBoolean(NamedNodeMap map, String opt, boolean defaultValue) { - Node n = map.getNamedItem(opt); - if (n == null) { - return defaultValue; - } - - if (n.getNodeValue() == null) { - return defaultValue; - } - - if (n.getNodeValue().toLowerCase(Locale.ROOT).equals("true")) { - return true; - } else if (n.getNodeValue().toLowerCase(Locale.ROOT).equals("false")) { - return false; - } - return defaultValue; - } - - private String getString(NamedNodeMap map, String opt, String defaultVal) { - Node n = map.getNamedItem(opt); - if (n == null) { - return defaultVal; - } - String value = n.getNodeValue(); - - if (value == null) { - return defaultVal; - } - return value; - } - - -} diff --git a/tika-batch/src/main/java/org/apache/tika/batch/builders/DefaultContentHandlerFactoryBuilder.java b/tika-batch/src/main/java/org/apache/tika/batch/builders/DefaultContentHandlerFactoryBuilder.java deleted file mode 100644 index 7c54c49..0000000 --- a/tika-batch/src/main/java/org/apache/tika/batch/builders/DefaultContentHandlerFactoryBuilder.java +++ /dev/null @@ -1,58 +0,0 @@ -package org.apache.tika.batch.builders; - -/* - * Licensed to the Apache Software Foundation (ASF) under one or more - * contributor license agreements. See the NOTICE file distributed with - * this work for additional information regarding copyright ownership. - * The ASF licenses this file to You under the Apache License, Version 2.0 - * (the "License"); you may not use this file except in compliance with - * the License. You may obtain a copy of the License at - * - * http://www.apache.org/licenses/LICENSE-2.0 - * - * Unless required by applicable law or agreed to in writing, software - * distributed under the License is distributed on an "AS IS" BASIS, - * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. - * See the License for the specific language governing permissions and - * limitations under the License. 
- */ - -import java.util.Map; - -import org.apache.tika.sax.BasicContentHandlerFactory; -import org.apache.tika.sax.ContentHandlerFactory; -import org.apache.tika.util.XMLDOMUtil; -import org.w3c.dom.Node; - -/** - * Builds BasicContentHandler with type defined by attribute "basicHandlerType" - * with possible values: xml, html, text, body, ignore. - * Default is text. - *
<p/>
-     * Sets the writeLimit to the value of "writeLimit".
-     * Default is -1.
- */
-public class DefaultContentHandlerFactoryBuilder implements IContentHandlerFactoryBuilder {
-
-    @Override
-    public ContentHandlerFactory build(Node node, Map<String, String> runtimeAttributes) {
-        Map<String, String> attributes = XMLDOMUtil.mapifyAttrs(node, runtimeAttributes);
-        BasicContentHandlerFactory.HANDLER_TYPE type = null;
-        String handlerTypeString = attributes.get("basicHandlerType");
-        type = BasicContentHandlerFactory.parseHandlerType(handlerTypeString,
-                BasicContentHandlerFactory.HANDLER_TYPE.TEXT);
-        int writeLimit = -1;
-        String writeLimitString = attributes.get("writeLimit");
-        if (writeLimitString != null) {
-            try {
-                writeLimit = Integer.parseInt(writeLimitString);
-            } catch (NumberFormatException e) {
-                //swallow and default to -1
-                //TODO: should we throw a RuntimeException?
-            }
-        }
-        return new BasicContentHandlerFactory(type, writeLimit);
-    }
-
-
-}
diff --git a/tika-batch/src/main/java/org/apache/tika/batch/builders/IContentHandlerFactoryBuilder.java b/tika-batch/src/main/java/org/apache/tika/batch/builders/IContentHandlerFactoryBuilder.java
deleted file mode 100644
index 644d6ba..0000000
--- a/tika-batch/src/main/java/org/apache/tika/batch/builders/IContentHandlerFactoryBuilder.java
+++ /dev/null
@@ -1,28 +0,0 @@
-/*
- * Licensed to the Apache Software Foundation (ASF) under one or more
- * contributor license agreements. See the NOTICE file distributed with
- * this work for additional information regarding copyright ownership.
- * The ASF licenses this file to You under the Apache License, Version 2.0
- * (the "License"); you may not use this file except in compliance with
- * the License.
You may obtain a copy of the License at - * - * http://www.apache.org/licenses/LICENSE-2.0 - * - * Unless required by applicable law or agreed to in writing, software - * distributed under the License is distributed on an "AS IS" BASIS, - * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. - * See the License for the specific language governing permissions and - * limitations under the License. - */ -package org.apache.tika.batch.builders; - -import java.util.Map; - -import org.apache.tika.sax.ContentHandlerFactory; -import org.w3c.dom.Node; - -public interface IContentHandlerFactoryBuilder extends ObjectFromDOMBuilder { - - public ContentHandlerFactory build(Node node, Map attributes); - -} diff --git a/tika-batch/src/main/java/org/apache/tika/batch/builders/ICrawlerBuilder.java b/tika-batch/src/main/java/org/apache/tika/batch/builders/ICrawlerBuilder.java deleted file mode 100644 index be411d1..0000000 --- a/tika-batch/src/main/java/org/apache/tika/batch/builders/ICrawlerBuilder.java +++ /dev/null @@ -1,31 +0,0 @@ -/* - * Licensed to the Apache Software Foundation (ASF) under one or more - * contributor license agreements. See the NOTICE file distributed with - * this work for additional information regarding copyright ownership. - * The ASF licenses this file to You under the Apache License, Version 2.0 - * (the "License"); you may not use this file except in compliance with - * the License. You may obtain a copy of the License at - * - * http://www.apache.org/licenses/LICENSE-2.0 - * - * Unless required by applicable law or agreed to in writing, software - * distributed under the License is distributed on an "AS IS" BASIS, - * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. - * See the License for the specific language governing permissions and - * limitations under the License. 
- */ -package org.apache.tika.batch.builders; - -import org.apache.tika.batch.FileResource; -import org.apache.tika.batch.FileResourceCrawler; -import org.w3c.dom.Node; - -import java.util.Map; -import java.util.concurrent.ArrayBlockingQueue; - -public interface ICrawlerBuilder extends ObjectFromDOMAndQueueBuilder{ - - public FileResourceCrawler build(Node node, Map attributes, - ArrayBlockingQueue queue); - -} diff --git a/tika-batch/src/main/java/org/apache/tika/batch/builders/IParserFactoryBuilder.java b/tika-batch/src/main/java/org/apache/tika/batch/builders/IParserFactoryBuilder.java deleted file mode 100644 index 6a1aae1..0000000 --- a/tika-batch/src/main/java/org/apache/tika/batch/builders/IParserFactoryBuilder.java +++ /dev/null @@ -1,26 +0,0 @@ -/* - * Licensed to the Apache Software Foundation (ASF) under one or more - * contributor license agreements. See the NOTICE file distributed with - * this work for additional information regarding copyright ownership. - * The ASF licenses this file to You under the Apache License, Version 2.0 - * (the "License"); you may not use this file except in compliance with - * the License. You may obtain a copy of the License at - * - * http://www.apache.org/licenses/LICENSE-2.0 - * - * Unless required by applicable law or agreed to in writing, software - * distributed under the License is distributed on an "AS IS" BASIS, - * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. - * See the License for the specific language governing permissions and - * limitations under the License. 
- */ -package org.apache.tika.batch.builders; - -import java.util.Map; - -import org.apache.tika.batch.ParserFactory; -import org.w3c.dom.Node; - -public interface IParserFactoryBuilder { - public ParserFactory build(Node node, Map runtimeAttrs); -} diff --git a/tika-batch/src/main/java/org/apache/tika/batch/builders/InterrupterBuilder.java b/tika-batch/src/main/java/org/apache/tika/batch/builders/InterrupterBuilder.java deleted file mode 100644 index d7223cd..0000000 --- a/tika-batch/src/main/java/org/apache/tika/batch/builders/InterrupterBuilder.java +++ /dev/null @@ -1,32 +0,0 @@ -/* - * Licensed to the Apache Software Foundation (ASF) under one or more - * contributor license agreements. See the NOTICE file distributed with - * this work for additional information regarding copyright ownership. - * The ASF licenses this file to You under the Apache License, Version 2.0 - * (the "License"); you may not use this file except in compliance with - * the License. You may obtain a copy of the License at - * - * http://www.apache.org/licenses/LICENSE-2.0 - * - * Unless required by applicable law or agreed to in writing, software - * distributed under the License is distributed on an "AS IS" BASIS, - * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. - * See the License for the specific language governing permissions and - * limitations under the License. 
- */
-package org.apache.tika.batch.builders;
-
-import java.util.Map;
-
-import org.apache.tika.batch.Interrupter;
-import org.w3c.dom.Node;
-
-/**
- * Builds an Interrupter
- */
-public class InterrupterBuilder {
-
-    public Interrupter build(Node n, Map<String, String> commandlineArguments) {
-        return new Interrupter();
-    }
-}
diff --git a/tika-batch/src/main/java/org/apache/tika/batch/builders/ObjectFromDOMAndQueueBuilder.java b/tika-batch/src/main/java/org/apache/tika/batch/builders/ObjectFromDOMAndQueueBuilder.java
deleted file mode 100644
index 2baefdf..0000000
--- a/tika-batch/src/main/java/org/apache/tika/batch/builders/ObjectFromDOMAndQueueBuilder.java
+++ /dev/null
@@ -1,36 +0,0 @@
-package org.apache.tika.batch.builders;
-
-/*
- * Licensed to the Apache Software Foundation (ASF) under one or more
- * contributor license agreements. See the NOTICE file distributed with
- * this work for additional information regarding copyright ownership.
- * The ASF licenses this file to You under the Apache License, Version 2.0
- * (the "License"); you may not use this file except in compliance with
- * the License. You may obtain a copy of the License at
- *
- * http://www.apache.org/licenses/LICENSE-2.0
- *
- * Unless required by applicable law or agreed to in writing, software
- * distributed under the License is distributed on an "AS IS" BASIS,
- * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
- * See the License for the specific language governing permissions and
- * limitations under the License.
- */
-
-import java.util.Map;
-import java.util.concurrent.ArrayBlockingQueue;
-
-import org.apache.tika.batch.FileResource;
-import org.w3c.dom.Node;
-
-/**
- * Same as {@link org.apache.tika.batch.builders.ObjectFromDOMBuilder},
- * but this is for objects that require access to the shared queue.
- * @param <T> - */ -public interface ObjectFromDOMAndQueueBuilder<T> { - - public T build(Node node, Map<String, String> runtimeAttributes, - ArrayBlockingQueue<FileResource> resourceQueue); - -} diff --git a/tika-batch/src/main/java/org/apache/tika/batch/builders/ObjectFromDOMBuilder.java b/tika-batch/src/main/java/org/apache/tika/batch/builders/ObjectFromDOMBuilder.java deleted file mode 100644 index 785c9da..0000000 --- a/tika-batch/src/main/java/org/apache/tika/batch/builders/ObjectFromDOMBuilder.java +++ /dev/null @@ -1,31 +0,0 @@ -package org.apache.tika.batch.builders; - -/* - * Licensed to the Apache Software Foundation (ASF) under one or more - * contributor license agreements. See the NOTICE file distributed with - * this work for additional information regarding copyright ownership. - * The ASF licenses this file to You under the Apache License, Version 2.0 - * (the "License"); you may not use this file except in compliance with - * the License. You may obtain a copy of the License at - * - * http://www.apache.org/licenses/LICENSE-2.0 - * - * Unless required by applicable law or agreed to in writing, software - * distributed under the License is distributed on an "AS IS" BASIS, - * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. - * See the License for the specific language governing permissions and - * limitations under the License.
- */ - -import org.w3c.dom.Node; - -import java.util.Map; - -/** - * Interface for things that build objects from a DOM Node and a map of runtime attributes - * @param <T> - */ -public interface ObjectFromDOMBuilder<T> { - - public T build(Node node, Map<String, String> runtimeAttributes); -} diff --git a/tika-batch/src/main/java/org/apache/tika/batch/builders/ParserFactoryBuilder.java b/tika-batch/src/main/java/org/apache/tika/batch/builders/ParserFactoryBuilder.java deleted file mode 100644 index 7008422..0000000 --- a/tika-batch/src/main/java/org/apache/tika/batch/builders/ParserFactoryBuilder.java +++ /dev/null @@ -1,49 +0,0 @@ -/* - * Licensed to the Apache Software Foundation (ASF) under one or more - * contributor license agreements. See the NOTICE file distributed with - * this work for additional information regarding copyright ownership. - * The ASF licenses this file to You under the Apache License, Version 2.0 - * (the "License"); you may not use this file except in compliance with - * the License. You may obtain a copy of the License at - * - * http://www.apache.org/licenses/LICENSE-2.0 - * - * Unless required by applicable law or agreed to in writing, software - * distributed under the License is distributed on an "AS IS" BASIS, - * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. - * See the License for the specific language governing permissions and - * limitations under the License.
- */ -package org.apache.tika.batch.builders; - -import java.util.Locale; -import java.util.Map; - -import org.apache.tika.batch.ParserFactory; -import org.apache.tika.util.ClassLoaderUtil; -import org.apache.tika.util.XMLDOMUtil; -import org.w3c.dom.Node; - -public class ParserFactoryBuilder implements IParserFactoryBuilder { - - - @Override - public ParserFactory build(Node node, Map<String, String> runtimeAttrs) { - Map<String, String> localAttrs = XMLDOMUtil.mapifyAttrs(node, runtimeAttrs); - String className = localAttrs.get("class"); - ParserFactory pf = ClassLoaderUtil.buildClass(ParserFactory.class, className); - - if (localAttrs.containsKey("parseRecursively")) { - String bString = localAttrs.get("parseRecursively").toLowerCase(Locale.ENGLISH); - if (bString.equals("true")) { - pf.setParseRecursively(true); - } else if (bString.equals("false")) { - pf.setParseRecursively(false); - } else { - throw new RuntimeException("parseRecursively must have value of \"true\" or \"false\": "+ - bString); - } - } - return pf; - } -} diff --git a/tika-batch/src/main/java/org/apache/tika/batch/builders/ReporterBuilder.java b/tika-batch/src/main/java/org/apache/tika/batch/builders/ReporterBuilder.java deleted file mode 100644 index 5f3f2c8..0000000 --- a/tika-batch/src/main/java/org/apache/tika/batch/builders/ReporterBuilder.java +++ /dev/null @@ -1,30 +0,0 @@ -package org.apache.tika.batch.builders; - -/* - * Licensed to the Apache Software Foundation (ASF) under one or more - * contributor license agreements. See the NOTICE file distributed with - * this work for additional information regarding copyright ownership. - * The ASF licenses this file to You under the Apache License, Version 2.0 - * (the "License"); you may not use this file except in compliance with - * the License.
You may obtain a copy of the License at - * - * http://www.apache.org/licenses/LICENSE-2.0 - * - * Unless required by applicable law or agreed to in writing, software - * distributed under the License is distributed on an "AS IS" BASIS, - * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. - * See the License for the specific language governing permissions and - * limitations under the License. - */ - -import java.util.Map; - -import org.apache.tika.batch.StatusReporter; -import org.w3c.dom.Node; - -/** - * Interface for reporter builders - */ -public interface ReporterBuilder extends ObjectFromDOMBuilder<StatusReporter> { - public StatusReporter build(Node n, Map<String, String> runtimeAttributes); -} diff --git a/tika-batch/src/main/java/org/apache/tika/batch/builders/SimpleLogReporterBuilder.java b/tika-batch/src/main/java/org/apache/tika/batch/builders/SimpleLogReporterBuilder.java deleted file mode 100644 index fd68ab9..0000000 --- a/tika-batch/src/main/java/org/apache/tika/batch/builders/SimpleLogReporterBuilder.java +++ /dev/null @@ -1,43 +0,0 @@ -package org.apache.tika.batch.builders; - -/* - * Licensed to the Apache Software Foundation (ASF) under one or more - * contributor license agreements. See the NOTICE file distributed with - * this work for additional information regarding copyright ownership. - * The ASF licenses this file to You under the Apache License, Version 2.0 - * (the "License"); you may not use this file except in compliance with - * the License. You may obtain a copy of the License at - * - * http://www.apache.org/licenses/LICENSE-2.0 - * - * Unless required by applicable law or agreed to in writing, software - * distributed under the License is distributed on an "AS IS" BASIS, - * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. - * See the License for the specific language governing permissions and - * limitations under the License.
- */ - -import java.util.Map; - -import org.apache.tika.batch.ConsumersManager; -import org.apache.tika.batch.FileResourceCrawler; -import org.apache.tika.batch.StatusReporter; -import org.apache.tika.util.PropsUtil; -import org.apache.tika.util.XMLDOMUtil; -import org.w3c.dom.Node; - -public class SimpleLogReporterBuilder implements StatusReporterBuilder { - - @Override - public StatusReporter build(FileResourceCrawler crawler, ConsumersManager consumersManager, - Node n, Map<String, String> commandlineArguments) { - - Map<String, String> attributes = XMLDOMUtil.mapifyAttrs(n, commandlineArguments); - long sleepMillis = PropsUtil.getLong(attributes.get("reporterSleepMillis"), 1000L); - long staleThresholdMillis = PropsUtil.getLong(attributes.get("reporterStaleThresholdMillis"), 500000L); - StatusReporter reporter = new StatusReporter(crawler, consumersManager); - reporter.setSleepMillis(sleepMillis); - reporter.setStaleThresholdMillis(staleThresholdMillis); - return reporter; - } -} diff --git a/tika-batch/src/main/java/org/apache/tika/batch/builders/StatusReporterBuilder.java b/tika-batch/src/main/java/org/apache/tika/batch/builders/StatusReporterBuilder.java deleted file mode 100644 index 172e20e..0000000 --- a/tika-batch/src/main/java/org/apache/tika/batch/builders/StatusReporterBuilder.java +++ /dev/null @@ -1,31 +0,0 @@ -package org.apache.tika.batch.builders; - -/* - * Licensed to the Apache Software Foundation (ASF) under one or more - * contributor license agreements. See the NOTICE file distributed with - * this work for additional information regarding copyright ownership. - * The ASF licenses this file to You under the Apache License, Version 2.0 - * (the "License"); you may not use this file except in compliance with - * the License.
You may obtain a copy of the License at - * - * http://www.apache.org/licenses/LICENSE-2.0 - * - * Unless required by applicable law or agreed to in writing, software - * distributed under the License is distributed on an "AS IS" BASIS, - * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. - * See the License for the specific language governing permissions and - * limitations under the License. - */ - -import java.util.Map; - -import org.apache.tika.batch.ConsumersManager; -import org.apache.tika.batch.FileResourceCrawler; -import org.apache.tika.batch.StatusReporter; -import org.w3c.dom.Node; - -public interface StatusReporterBuilder { - - public StatusReporter build(FileResourceCrawler crawler, ConsumersManager consumers, - Node n, Map<String, String> commandlineArguments); -} diff --git a/tika-batch/src/main/java/org/apache/tika/batch/fs/AbstractFSConsumer.java b/tika-batch/src/main/java/org/apache/tika/batch/fs/AbstractFSConsumer.java deleted file mode 100644 index 723b5e0..0000000 --- a/tika-batch/src/main/java/org/apache/tika/batch/fs/AbstractFSConsumer.java +++ /dev/null @@ -1,78 +0,0 @@ -package org.apache.tika.batch.fs; - -/* - * Licensed to the Apache Software Foundation (ASF) under one or more - * contributor license agreements. See the NOTICE file distributed with - * this work for additional information regarding copyright ownership. - * The ASF licenses this file to You under the Apache License, Version 2.0 - * (the "License"); you may not use this file except in compliance with - * the License. You may obtain a copy of the License at - * - * http://www.apache.org/licenses/LICENSE-2.0 - * - * Unless required by applicable law or agreed to in writing, software - * distributed under the License is distributed on an "AS IS" BASIS, - * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. - * See the License for the specific language governing permissions and - * limitations under the License.
- */ - -import java.io.IOException; -import java.io.InputStream; -import java.io.OutputStream; -import java.util.concurrent.ArrayBlockingQueue; - -import org.apache.tika.batch.BatchNoRestartError; -import org.apache.tika.batch.FileResource; -import org.apache.tika.batch.FileResourceConsumer; -import org.apache.tika.batch.OutputStreamFactory; - -public abstract class AbstractFSConsumer extends FileResourceConsumer { - - public AbstractFSConsumer(ArrayBlockingQueue<FileResource> fileQueue) { - super(fileQueue); - } - - /** - * Use this for consistent logging of exceptions. Clients must - * check for whether the os is null, which is the signal - * that the output file already exists and should be skipped. - * - * @param fsOSFactory factory that creates the outputstream - * @param fileResource used by the OSFactory to create the stream - * @return the OutputStream or null if the output file already exists - */ - protected OutputStream getOutputStream(OutputStreamFactory fsOSFactory, - FileResource fileResource) { - OutputStream os = null; - try { - os = fsOSFactory.getOutputStream(fileResource.getMetadata()); - } catch (IOException e) { - //This can happen if the disk has run out of space, - //or if there was a failure with mkdirs in fsOSFactory - logger.error("{}", getXMLifiedLogMsg(IO_OS, - fileResource.getResourceId(), e)); - throw new BatchNoRestartError("IOException trying to open output stream for " + - fileResource.getResourceId() + " :: " + e.getMessage()); - } - return os; - } - - /** - * - * @param fileResource - * @return inputStream, can be null if there is an exception opening IS - */ - protected InputStream getInputStream(FileResource fileResource) { - InputStream is = null; - try { - is = fileResource.openInputStream(); - } catch (IOException e) { - logger.warn("{}", getXMLifiedLogMsg(IO_IS, - fileResource.getResourceId(), e)); - flushAndClose(is); - } - return is; - } - -} diff --git a/tika-batch/src/main/java/org/apache/tika/batch/fs/BasicTikaFSConsumer.java
b/tika-batch/src/main/java/org/apache/tika/batch/fs/BasicTikaFSConsumer.java deleted file mode 100644 index 227a426..0000000 --- a/tika-batch/src/main/java/org/apache/tika/batch/fs/BasicTikaFSConsumer.java +++ /dev/null @@ -1,126 +0,0 @@ -package org.apache.tika.batch.fs; - -/* - * Licensed to the Apache Software Foundation (ASF) under one or more - * contributor license agreements. See the NOTICE file distributed with - * this work for additional information regarding copyright ownership. - * The ASF licenses this file to You under the Apache License, Version 2.0 - * (the "License"); you may not use this file except in compliance with - * the License. You may obtain a copy of the License at - * - * http://www.apache.org/licenses/LICENSE-2.0 - * - * Unless required by applicable law or agreed to in writing, software - * distributed under the License is distributed on an "AS IS" BASIS, - * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. - * See the License for the specific language governing permissions and - * limitations under the License. - */ - -import java.io.InputStream; -import java.io.OutputStream; -import java.io.UnsupportedEncodingException; -import java.util.concurrent.ArrayBlockingQueue; - -import org.apache.commons.io.IOUtils; -import org.apache.tika.batch.FileResource; -import org.apache.tika.batch.OutputStreamFactory; -import org.apache.tika.batch.ParserFactory; -import org.apache.tika.config.TikaConfig; -import org.apache.tika.parser.ParseContext; -import org.apache.tika.parser.Parser; -import org.apache.tika.sax.ContentHandlerFactory; -import org.xml.sax.ContentHandler; - -import static java.nio.charset.StandardCharsets.UTF_8; - -/** - * Basic FileResourceConsumer that reads files from an input - * directory and writes content to the output directory. - *
- * This catches and logs all exceptions; Errors, however, are re-thrown. - * - */ -public class BasicTikaFSConsumer extends AbstractFSConsumer { - - private boolean parseRecursively = true; - private final ParserFactory parserFactory; - private final ContentHandlerFactory contentHandlerFactory; - private final OutputStreamFactory fsOSFactory; - private final TikaConfig config; - private String outputEncoding = UTF_8.toString(); - - - public BasicTikaFSConsumer(ArrayBlockingQueue<FileResource> queue, - ParserFactory parserFactory, - ContentHandlerFactory contentHandlerFactory, - OutputStreamFactory fsOSFactory, - TikaConfig config) { - super(queue); - this.parserFactory = parserFactory; - this.contentHandlerFactory = contentHandlerFactory; - this.fsOSFactory = fsOSFactory; - this.config = config; - } - - @Override - public boolean processFileResource(FileResource fileResource) { - - Parser parser = parserFactory.getParser(config); - ParseContext context = new ParseContext(); - if (parseRecursively) { - context.set(Parser.class, parser); - } - - OutputStream os = getOutputStream(fsOSFactory, fileResource); - //os can be null if fsOSFactory is set to skip processing a file if the output - //file already exists - if (os == null) { - logger.debug("Skipping: " + fileResource.getMetadata().get(FSProperties.FS_REL_PATH)); - return false; - } - - InputStream is = getInputStream(fileResource); - if (is == null) { - IOUtils.closeQuietly(os); - return false; - } - ContentHandler handler; - try { - handler = contentHandlerFactory.getNewContentHandler(os, getOutputEncoding()); - } catch (UnsupportedEncodingException e) { - incrementHandledExceptions(); - logger.error(getXMLifiedLogMsg("output_encoding_ex", - fileResource.getResourceId(), e)); - flushAndClose(os); - throw new RuntimeException(e.getMessage()); - } - - //now actually call parse!
- Throwable thrown = null; - try { - parse(fileResource.getResourceId(), parser, is, handler, - fileResource.getMetadata(), context); - } catch (Error t) { - throw t; - } catch (Throwable t) { - thrown = t; - } finally { - flushAndClose(os); - } - - if (thrown != null) { - return false; - } - return true; - } - - public String getOutputEncoding() { - return outputEncoding; - } - - public void setOutputEncoding(String outputEncoding) { - this.outputEncoding = outputEncoding; - } -} diff --git a/tika-batch/src/main/java/org/apache/tika/batch/fs/FSBatchProcessCLI.java b/tika-batch/src/main/java/org/apache/tika/batch/fs/FSBatchProcessCLI.java deleted file mode 100644 index f358002..0000000 --- a/tika-batch/src/main/java/org/apache/tika/batch/fs/FSBatchProcessCLI.java +++ /dev/null @@ -1,160 +0,0 @@ -package org.apache.tika.batch.fs; - -/* - * Licensed to the Apache Software Foundation (ASF) under one or more - * contributor license agreements. See the NOTICE file distributed with - * this work for additional information regarding copyright ownership. - * The ASF licenses this file to You under the Apache License, Version 2.0 - * (the "License"); you may not use this file except in compliance with - * the License. You may obtain a copy of the License at - * - * http://www.apache.org/licenses/LICENSE-2.0 - * - * Unless required by applicable law or agreed to in writing, software - * distributed under the License is distributed on an "AS IS" BASIS, - * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. - * See the License for the specific language governing permissions and - * limitations under the License. 
- */ - -import java.io.IOException; -import java.net.URL; -import java.nio.file.Path; -import java.nio.file.Paths; -import java.util.HashMap; -import java.util.Map; -import java.util.concurrent.ExecutorService; -import java.util.concurrent.Executors; -import java.util.concurrent.Future; - -import org.apache.commons.cli.CommandLine; -import org.apache.commons.cli.CommandLineParser; -import org.apache.commons.cli.GnuParser; -import org.apache.commons.cli.HelpFormatter; -import org.apache.commons.cli.Option; -import org.apache.commons.cli.Options; -import org.apache.commons.io.IOUtils; -import org.apache.tika.batch.BatchProcess; -import org.apache.tika.batch.BatchProcessDriverCLI; -import org.apache.tika.batch.ParallelFileProcessingResult; -import org.apache.tika.batch.builders.BatchProcessBuilder; -import org.apache.tika.batch.builders.CommandLineParserBuilder; -import org.apache.tika.io.TikaInputStream; -import org.slf4j.Logger; -import org.slf4j.LoggerFactory; -import org.slf4j.MarkerFactory; - -public class FSBatchProcessCLI { - - public static String FINISHED_STRING = "Main thread in TikaFSBatchCLI has finished processing."; - - private static Logger logger = LoggerFactory.getLogger(FSBatchProcessCLI.class); - private final Options options; - - public FSBatchProcessCLI(String[] args) throws IOException { - TikaInputStream configIs = null; - try { - configIs = getConfigInputStream(args, true); - CommandLineParserBuilder builder = new CommandLineParserBuilder(); - options = builder.build(configIs); - } finally { - IOUtils.closeQuietly(configIs); - } - } - - public void usage() { - HelpFormatter helpFormatter = new HelpFormatter(); - helpFormatter.printHelp("tika filesystem batch", options); - } - - private TikaInputStream getConfigInputStream(String[] args, boolean logDefault) throws IOException { - TikaInputStream is = null; - Path batchConfigFile = getConfigFile(args); - if (batchConfigFile != null) { - //this will throw IOException if it can't find a specified 
config file - //better to throw an exception than silently back off to default. - is = TikaInputStream.get(batchConfigFile); - } else { - if (logDefault) { - logger.info("No config file set via -bc, relying on tika-app-batch-config.xml or default-tika-batch-config.xml"); - } - //test to see if there's a tika-app-batch-config.xml on the path - URL config = FSBatchProcessCLI.class.getResource("/tika-app-batch-config.xml"); - if (config != null) { - is = TikaInputStream.get( - FSBatchProcessCLI.class.getResourceAsStream("/tika-app-batch-config.xml")); - } else { - is = TikaInputStream.get( - FSBatchProcessCLI.class.getResourceAsStream("default-tika-batch-config.xml")); - } - } - return is; - } - - private void execute(String[] args) throws Exception { - - CommandLineParser cliParser = new GnuParser(); - CommandLine line = cliParser.parse(options, args); - - if (line.hasOption("help")) { - usage(); - System.exit(BatchProcessDriverCLI.PROCESS_NO_RESTART_EXIT_CODE); - } - - Map<String, String> mapArgs = new HashMap<String, String>(); - for (Option option : line.getOptions()) { - String v = option.getValue(); - if (v == null || v.equals("")) { - v = "true"; - } - mapArgs.put(option.getOpt(), v); - } - - BatchProcessBuilder b = new BatchProcessBuilder(); - TikaInputStream is = null; - BatchProcess process = null; - try { - is = getConfigInputStream(args, false); - process = b.build(is, mapArgs); - } finally { - IOUtils.closeQuietly(is); - } - final Thread mainThread = Thread.currentThread(); - - - ExecutorService executor = Executors.newSingleThreadExecutor(); - Future<ParallelFileProcessingResult> futureResult = executor.submit(process); - - ParallelFileProcessingResult result = futureResult.get(); - System.out.println(FINISHED_STRING); - System.out.println("\n"); - System.out.println(result.toString()); - System.exit(result.getExitStatus()); - } - - private Path getConfigFile(String[] args) { - Path configFile = null; - for (int i = 0; i < args.length; i++) { - if (args[i].equals("-bc") || args[i].equals("-batch-config")) { - if (i
< args.length-1) { - configFile = Paths.get(args[i+1]); - } - } - } - return configFile; - } - - public static void main(String[] args) throws Exception { - - try{ - FSBatchProcessCLI cli = new FSBatchProcessCLI(args); - cli.execute(args); - } catch (Throwable t) { - t.printStackTrace(); - logger.error(MarkerFactory.getMarker("FATAL"), - "Fatal exception from FSBatchProcessCLI: " + t.getMessage(), t); - System.exit(BatchProcessDriverCLI.PROCESS_NO_RESTART_EXIT_CODE); - } - } - -} diff --git a/tika-batch/src/main/java/org/apache/tika/batch/fs/FSConsumersManager.java b/tika-batch/src/main/java/org/apache/tika/batch/fs/FSConsumersManager.java deleted file mode 100644 index 1792f60..0000000 --- a/tika-batch/src/main/java/org/apache/tika/batch/fs/FSConsumersManager.java +++ /dev/null @@ -1,42 +0,0 @@ -package org.apache.tika.batch.fs; - -/* - * Licensed to the Apache Software Foundation (ASF) under one or more - * contributor license agreements. See the NOTICE file distributed with - * this work for additional information regarding copyright ownership. - * The ASF licenses this file to You under the Apache License, Version 2.0 - * (the "License"); you may not use this file except in compliance with - * the License. You may obtain a copy of the License at - * - * http://www.apache.org/licenses/LICENSE-2.0 - * - * Unless required by applicable law or agreed to in writing, software - * distributed under the License is distributed on an "AS IS" BASIS, - * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. - * See the License for the specific language governing permissions and - * limitations under the License. 
- */ - -import java.util.List; - -import org.apache.tika.batch.ConsumersManager; -import org.apache.tika.batch.FileResourceConsumer; - -public class FSConsumersManager extends ConsumersManager { - - - public FSConsumersManager(List<FileResourceConsumer> consumers) { - super(consumers); - } - - @Override - public void init() { - //noop - } - - @Override - public void shutdown() { - //noop - } - -} diff --git a/tika-batch/src/main/java/org/apache/tika/batch/fs/FSDirectoryCrawler.java b/tika-batch/src/main/java/org/apache/tika/batch/fs/FSDirectoryCrawler.java deleted file mode 100644 index c844de9..0000000 --- a/tika-batch/src/main/java/org/apache/tika/batch/fs/FSDirectoryCrawler.java +++ /dev/null @@ -1,165 +0,0 @@ -package org.apache.tika.batch.fs; -/* - * Licensed to the Apache Software Foundation (ASF) under one or more - * contributor license agreements. See the NOTICE file distributed with - * this work for additional information regarding copyright ownership. - * The ASF licenses this file to You under the Apache License, Version 2.0 - * (the "License"); you may not use this file except in compliance with - * the License. You may obtain a copy of the License at - * - * http://www.apache.org/licenses/LICENSE-2.0 - * - * Unless required by applicable law or agreed to in writing, software - * distributed under the License is distributed on an "AS IS" BASIS, - * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. - * See the License for the specific language governing permissions and - * limitations under the License.
- */ - -import java.io.IOException; -import java.nio.file.DirectoryStream; -import java.nio.file.Files; -import java.nio.file.Path; -import java.util.ArrayList; -import java.util.Collections; -import java.util.Comparator; -import java.util.Iterator; -import java.util.LinkedList; -import java.util.List; -import java.util.concurrent.ArrayBlockingQueue; - -import org.apache.tika.batch.FileResource; -import org.apache.tika.batch.FileResourceCrawler; - -public class FSDirectoryCrawler extends FileResourceCrawler { - - public enum CRAWL_ORDER - { - SORTED, //alphabetical order; necessary for cross-platform unit tests - RANDOM, //shuffle - OS_ORDER //operating system chooses - } - - private final Path root; - private final Path startDirectory; - private final Comparator<Path> pathComparator = new FileNameComparator(); - private CRAWL_ORDER crawlOrder; - - public FSDirectoryCrawler(ArrayBlockingQueue<FileResource> fileQueue, - int numConsumers, Path root, CRAWL_ORDER crawlOrder) { - super(fileQueue, numConsumers); - this.root = root; - this.startDirectory = root; - this.crawlOrder = crawlOrder; - if (!Files.isDirectory(startDirectory)) { - throw new RuntimeException("Crawler couldn't find this directory:" + - startDirectory.toAbsolutePath()); - } - - } - - public FSDirectoryCrawler(ArrayBlockingQueue<FileResource> fileQueue, - int numConsumers, Path root, Path startDirectory, - CRAWL_ORDER crawlOrder) { - super(fileQueue, numConsumers); - this.root = root; - this.startDirectory = startDirectory; - this.crawlOrder = crawlOrder; - assert(startDirectory.toAbsolutePath().startsWith(root.toAbsolutePath())); - - if (!Files.isDirectory(startDirectory)) { - throw new RuntimeException("Crawler couldn't find this directory:" + startDirectory.toAbsolutePath()); - } - } - - public void start() throws InterruptedException { - addFiles(startDirectory); - } - - private void addFiles(Path directory) throws InterruptedException { - - if (directory == null) { - logger.warn("FSFileAdder asked to process null directory?!"); - return; - } - - List<Path> files = new ArrayList<>(); - try (DirectoryStream<Path> ds = Files.newDirectoryStream(directory)){ - for (Path p : ds) { - files.add(p); - } - } catch (IOException e) { - logger.warn("FSFileAdder couldn't read "+directory.toAbsolutePath() + - ": "+e.getMessage()); - } - if (files.size() == 0) { - logger.info("Empty directory: " + directory.toAbsolutePath()); - return; - } - - - if (crawlOrder == CRAWL_ORDER.RANDOM) { - Collections.shuffle(files); - } else if (crawlOrder == CRAWL_ORDER.SORTED) { - Collections.sort(files, pathComparator); - } - - int numFiles = 0; - List<Path> directories = new LinkedList<>(); - for (Path f : files) { - if (Thread.currentThread().isInterrupted()) { - throw new InterruptedException("file adder interrupted"); - } - if (!Files.isReadable(f)) { - logger.warn("Skipping -- "+f.toAbsolutePath()+ - " -- file/directory is not readable"); - continue; - } - if (Files.isDirectory(f)) { - directories.add(f); - continue; - } - numFiles++; - if (numFiles == 1) { - handleFirstFileInDirectory(f); - } - int added = tryToAdd(new FSFileResource(root, f)); - if (added == FileResourceCrawler.STOP_NOW) { - logger.debug("crawler has hit a limit: "+f.toAbsolutePath() + " : " + added); - return; - } - logger.debug("trying to add: "+f.toAbsolutePath() + " : " + added); - } - - for (Path f : directories) { - addFiles(f); - } - }
For example, it might be handy to call - * mkdirs() on an output directory if your FileResourceConsumers - * are writing to a file. - * - * @param f file to handle - */ - public void handleFirstFileInDirectory(Path f) { - //no-op - } - - //simple lexical order for the file name, we don't really care about localization. - //we do want this, though, because file.compareTo behaves differently - //on different OS's. - private class FileNameComparator implements Comparator<Path> { - - @Override - public int compare(Path f1, Path f2) { - if (f1 == null || f2 == null) { - return 0; - } - return f1.getFileName().toString().compareTo(f2.getFileName().toString()); - } - } -} diff --git a/tika-batch/src/main/java/org/apache/tika/batch/fs/FSDocumentSelector.java b/tika-batch/src/main/java/org/apache/tika/batch/fs/FSDocumentSelector.java deleted file mode 100644 index 5db1a2d..0000000 --- a/tika-batch/src/main/java/org/apache/tika/batch/fs/FSDocumentSelector.java +++ /dev/null @@ -1,83 +0,0 @@ -package org.apache.tika.batch.fs; - -/* - * Licensed to the Apache Software Foundation (ASF) under one or more - * contributor license agreements. See the NOTICE file distributed with - * this work for additional information regarding copyright ownership. - * The ASF licenses this file to You under the Apache License, Version 2.0 - * (the "License"); you may not use this file except in compliance with - * the License. You may obtain a copy of the License at - * - * http://www.apache.org/licenses/LICENSE-2.0 - * - * Unless required by applicable law or agreed to in writing, software - * distributed under the License is distributed on an "AS IS" BASIS, - * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. - * See the License for the specific language governing permissions and - * limitations under the License.
- */ - -import java.util.regex.Matcher; -import java.util.regex.Pattern; - -import org.apache.tika.extractor.DocumentSelector; -import org.apache.tika.metadata.Metadata; -import org.apache.tika.util.PropsUtil; - -/** - * Selector that chooses files based on their file name - * and their size, as determined by Metadata.RESOURCE_NAME_KEY and Metadata.CONTENT_LENGTH. - *
    - * The {@link #excludeFileName} pattern is applied first (if it isn't null). - * Then the {@link #includeFileName} pattern is applied (if it isn't null), - * and finally, the size limit is applied if it is above 0. - */ -public class FSDocumentSelector implements DocumentSelector { - - //can be null! - private final Pattern includeFileName; - - //can be null! - private final Pattern excludeFileName; - private final long maxFileSizeBytes; - private final long minFileSizeBytes; - - public FSDocumentSelector(Pattern includeFileName, Pattern excludeFileName, long minFileSizeBytes, - long maxFileSizeBytes) { - this.includeFileName = includeFileName; - this.excludeFileName = excludeFileName; - this.minFileSizeBytes = minFileSizeBytes; - this.maxFileSizeBytes = maxFileSizeBytes; - } - - @Override - public boolean select(Metadata metadata) { - String fName = metadata.get(Metadata.RESOURCE_NAME_KEY); - long sz = PropsUtil.getLong(metadata.get(Metadata.CONTENT_LENGTH), -1L); - if (maxFileSizeBytes > -1 && sz > 0) { - if (sz > maxFileSizeBytes) { - return false; - } - } - - if (minFileSizeBytes > -1 && sz > 0) { - if (sz < minFileSizeBytes) { - return false; - } - } - - if (excludeFileName != null && fName != null) { - Matcher m = excludeFileName.matcher(fName); - if (m.find()) { - return false; - } - } - - if (includeFileName != null && fName != null) { - Matcher m = includeFileName.matcher(fName); - return m.find(); - } - return true; - } - -} diff --git a/tika-batch/src/main/java/org/apache/tika/batch/fs/FSFileResource.java b/tika-batch/src/main/java/org/apache/tika/batch/fs/FSFileResource.java deleted file mode 100644 index 327ba1b..0000000 --- a/tika-batch/src/main/java/org/apache/tika/batch/fs/FSFileResource.java +++ /dev/null @@ -1,130 +0,0 @@ -package org.apache.tika.batch.fs; - -/* - * Licensed to the Apache Software Foundation (ASF) under one or more - * contributor license agreements. 
See the NOTICE file distributed with - * this work for additional information regarding copyright ownership. - * The ASF licenses this file to You under the Apache License, Version 2.0 - * (the "License"); you may not use this file except in compliance with - * the License. You may obtain a copy of the License at - * - * http://www.apache.org/licenses/LICENSE-2.0 - * - * Unless required by applicable law or agreed to in writing, software - * distributed under the License is distributed on an "AS IS" BASIS, - * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. - * See the License for the specific language governing permissions and - * limitations under the License. - */ - -import java.io.File; -import java.io.IOException; -import java.io.InputStream; -import java.nio.file.Files; -import java.nio.file.Path; -import java.nio.file.Paths; -import java.util.Locale; - -import org.apache.tika.batch.FileResource; -import org.apache.tika.io.TikaInputStream; -import org.apache.tika.metadata.Metadata; - -/** - * FileSystem(FS)Resource wraps a file name. - *

    - * This class automatically sets the following keys in Metadata:
    - * <ul>
    - *     <li>Metadata.RESOURCE_NAME_KEY (file name)</li>
    - *     <li>Metadata.CONTENT_LENGTH</li>
    - *     <li>FSProperties.FS_REL_PATH</li>
    - *     <li>FileResource.FILE_EXTENSION</li>
    - * </ul>
    , - */ -public class FSFileResource implements FileResource { - - private final Path fullPath; - private final String relativePath; - private final Metadata metadata; - - /** - * - * @param inputRoot - * @param fullPath - * @see FSFileResource#FSFileResource(Path, Path) - * @deprecated to be removed in Tika 2.0 - */ - @Deprecated - public FSFileResource(File inputRoot, File fullPath) { - this(Paths.get(inputRoot.getAbsolutePath()), - Paths.get(fullPath.getAbsolutePath())); - } - - /** - * Constructor - * - * @param inputRoot the input root for the file - * @param fullPath the full path to the file - * @throws IllegalArgumentException if the fullPath is not - * a child of inputRoot - */ - public FSFileResource(Path inputRoot, Path fullPath) { - this.fullPath = fullPath; - this.metadata = new Metadata(); - //child path must actually be a child - assert(fullPath.toAbsolutePath().startsWith(inputRoot.toAbsolutePath())); - this.relativePath = inputRoot.relativize(fullPath).toString(); - - //need to set these now so that the filter can determine - //whether or not to crawl this file - metadata.set(Metadata.RESOURCE_NAME_KEY, fullPath.getFileName().toString()); - long sz = -1; - try { - sz = Files.size(fullPath); - } catch (IOException e) { - //swallow - //not existent file will be handled downstream - } - metadata.set(Metadata.CONTENT_LENGTH, Long.toString(sz)); - metadata.set(FSProperties.FS_REL_PATH, relativePath); - metadata.set(FileResource.FILE_EXTENSION, getExtension(fullPath)); - } - - /** - * Simple extension extractor that takes whatever comes after the - * last period in the path. It returns a lowercased version of the "extension." - *

    - * If there is no period, it returns an empty string. - * - * @param fullPath full path from which to try to find an extension - * @return the lowercased extension or an empty string - */ - private String getExtension(Path fullPath) { - String p = fullPath.getFileName().toString(); - int i = p.lastIndexOf("."); - if (i > -1) { - return p.substring(i + 1).toLowerCase(Locale.ROOT); - } - return ""; - } - - /** - * - * @return file's relativePath - */ - @Override - public String getResourceId() { - return relativePath; - } - - @Override - public Metadata getMetadata() { - return metadata; - } - - @Override - public InputStream openInputStream() throws IOException { - //no need to include Metadata because we already set the - //same information in the initializer - return TikaInputStream.get(fullPath); - } -} diff --git a/tika-batch/src/main/java/org/apache/tika/batch/fs/FSListCrawler.java b/tika-batch/src/main/java/org/apache/tika/batch/fs/FSListCrawler.java deleted file mode 100644 index 9bb31f6..0000000 --- a/tika-batch/src/main/java/org/apache/tika/batch/fs/FSListCrawler.java +++ /dev/null @@ -1,118 +0,0 @@ -package org.apache.tika.batch.fs; -/* - * Licensed to the Apache Software Foundation (ASF) under one or more - * contributor license agreements. See the NOTICE file distributed with - * this work for additional information regarding copyright ownership. - * The ASF licenses this file to You under the Apache License, Version 2.0 - * (the "License"); you may not use this file except in compliance with - * the License. You may obtain a copy of the License at - * - * http://www.apache.org/licenses/LICENSE-2.0 - * - * Unless required by applicable law or agreed to in writing, software - * distributed under the License is distributed on an "AS IS" BASIS, - * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. - * See the License for the specific language governing permissions and - * limitations under the License. 
- */ - -import java.io.BufferedReader; -import java.io.File; -import java.io.FileInputStream; -import java.io.FileNotFoundException; -import java.io.IOException; -import java.io.InputStreamReader; -import java.io.UnsupportedEncodingException; -import java.nio.charset.Charset; -import java.nio.file.Files; -import java.nio.file.Path; -import java.nio.file.Paths; -import java.util.concurrent.ArrayBlockingQueue; - -import org.apache.tika.batch.FileResource; -import org.apache.tika.batch.FileResourceCrawler; - -/** - * Class that "crawls" a list of files. - */ -public class FSListCrawler extends FileResourceCrawler { - - private final BufferedReader reader; - private final Path root; - - /** - * - * @param fileQueue - * @param numConsumers - * @param root - * @param list - * @param encoding - * @throws FileNotFoundException - * @throws UnsupportedEncodingException - * @deprecated - * @see #FSListCrawler(ArrayBlockingQueue, int, Path, Path, Charset) - */ - @Deprecated - public FSListCrawler(ArrayBlockingQueue fileQueue, - int numConsumers, File root, File list, String encoding) - throws FileNotFoundException, UnsupportedEncodingException { - super(fileQueue, numConsumers); - reader = new BufferedReader(new InputStreamReader(new FileInputStream(list), encoding)); - this.root = Paths.get(root.toURI()); - - } - - /** - * Constructor for a crawler that reads a list of files to process. - *

    - * The list should be paths relative to the root. - * - * @param fileQueue queue for batch - * @param numConsumers number of consumers - * @param root root input director - * @param list text file list (one file per line) of paths relative to - * the root for processing - * @param charset charset of the file - * @throws IOException - */ - public FSListCrawler(ArrayBlockingQueue fileQueue, - int numConsumers, Path root, Path list, Charset charset) - throws IOException { - super(fileQueue, numConsumers); - reader = Files.newBufferedReader(list, charset); - this.root = root; - } - - public void start() throws InterruptedException { - String line = nextLine(); - - while (line != null) { - if (Thread.currentThread().isInterrupted()) { - throw new InterruptedException("file adder interrupted"); - } - Path f = Paths.get(root.toString(), line); - if (! Files.exists(f)) { - logger.warn("File doesn't exist:"+f.toAbsolutePath()); - line = nextLine(); - continue; - } - if (Files.isDirectory(f)) { - logger.warn("File is a directory:"+f.toAbsolutePath()); - line = nextLine(); - continue; - } - tryToAdd(new FSFileResource(root, f)); - line = nextLine(); - } - } - - private String nextLine() { - String line = null; - try { - line = reader.readLine(); - } catch (IOException e) { - throw new RuntimeException(e); - } - return line; - } -} diff --git a/tika-batch/src/main/java/org/apache/tika/batch/fs/FSOutputStreamFactory.java b/tika-batch/src/main/java/org/apache/tika/batch/fs/FSOutputStreamFactory.java deleted file mode 100644 index 93f6008..0000000 --- a/tika-batch/src/main/java/org/apache/tika/batch/fs/FSOutputStreamFactory.java +++ /dev/null @@ -1,114 +0,0 @@ -package org.apache.tika.batch.fs; -/* - * Licensed to the Apache Software Foundation (ASF) under one or more - * contributor license agreements. See the NOTICE file distributed with - * this work for additional information regarding copyright ownership. 
- * The ASF licenses this file to You under the Apache License, Version 2.0 - * (the "License"); you may not use this file except in compliance with - * the License. You may obtain a copy of the License at - * - * http://www.apache.org/licenses/LICENSE-2.0 - * - * Unless required by applicable law or agreed to in writing, software - * distributed under the License is distributed on an "AS IS" BASIS, - * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. - * See the License for the specific language governing permissions and - * limitations under the License. - */ - -import java.io.BufferedOutputStream; -import java.io.File; -import java.io.IOException; -import java.io.OutputStream; -import java.nio.file.Files; -import java.nio.file.Path; -import java.nio.file.Paths; -import java.util.zip.GZIPOutputStream; - -import org.apache.commons.compress.archivers.zip.ZipArchiveOutputStream; -import org.apache.commons.compress.compressors.bzip2.BZip2CompressorOutputStream; -import org.apache.tika.batch.OutputStreamFactory; -import org.apache.tika.metadata.Metadata; - -public class FSOutputStreamFactory implements OutputStreamFactory { - - public enum COMPRESSION { - NONE, - BZIP2, - GZIP, - ZIP - } - - private final FSUtil.HANDLE_EXISTING handleExisting; - private final Path outputRoot; - private final String suffix; - private final COMPRESSION compression; - - /** - * - * @param outputRoot - * @param handleExisting - * @param compression - * @param suffix - * @see #FSOutputStreamFactory(Path, FSUtil.HANDLE_EXISTING, COMPRESSION, String) - */ - @Deprecated - public FSOutputStreamFactory(File outputRoot, FSUtil.HANDLE_EXISTING handleExisting, - COMPRESSION compression, String suffix) { - this(Paths.get(outputRoot.toURI()), - handleExisting, compression, suffix); - } - public FSOutputStreamFactory(Path outputRoot, FSUtil.HANDLE_EXISTING handleExisting, - COMPRESSION compression, String suffix) { - this.handleExisting = handleExisting; - this.outputRoot = 
outputRoot; - this.suffix = suffix; - this.compression = compression; - } - - /** - * This tries to create a file based on the {@link org.apache.tika.batch.fs.FSUtil.HANDLE_EXISTING} - * value that was passed in during initialization. - *

    - * If {@link #handleExisting} is set to "SKIP" and the output file already exists, - * this will return null. - *

    - * If an output file can be found, this will try to mkdirs for that output file. - * If mkdirs() fails, this will throw an IOException. - *

    - * Finally, this will open an output stream for the appropriate output file. - * @param metadata must have a value set for FSMetadataProperties.FS_ABSOLUTE_PATH or - * else NullPointerException will be thrown! - * @return OutputStream - * @throws java.io.IOException, NullPointerException - */ - @Override - public OutputStream getOutputStream(Metadata metadata) throws IOException { - String initialRelativePath = metadata.get(FSProperties.FS_REL_PATH); - Path outputPath = FSUtil.getOutputPath(outputRoot, initialRelativePath, handleExisting, suffix); - if (outputPath == null) { - return null; - } - if (!Files.isDirectory(outputPath.getParent())) { - Files.createDirectories(outputPath.getParent()); - //TODO: shouldn't need this any more in java 7, right? - if (! Files.isDirectory(outputPath.getParent())) { - throw new IOException("Couldn't create parent directory for:"+outputPath.toAbsolutePath()); - } - } - - OutputStream os = Files.newOutputStream(outputPath); - switch (compression) { - case BZIP2: - os = new BZip2CompressorOutputStream(os); - break; - case GZIP: - os = new GZIPOutputStream(os); - break; - case ZIP: - os = new ZipArchiveOutputStream(os); - break; - } - return new BufferedOutputStream(os); - } -} diff --git a/tika-batch/src/main/java/org/apache/tika/batch/fs/FSProperties.java b/tika-batch/src/main/java/org/apache/tika/batch/fs/FSProperties.java deleted file mode 100644 index f52a4d2..0000000 --- a/tika-batch/src/main/java/org/apache/tika/batch/fs/FSProperties.java +++ /dev/null @@ -1,28 +0,0 @@ -/* - * Licensed to the Apache Software Foundation (ASF) under one or more - * contributor license agreements. See the NOTICE file distributed with - * this work for additional information regarding copyright ownership. - * The ASF licenses this file to You under the Apache License, Version 2.0 - * (the "License"); you may not use this file except in compliance with - * the License. 
You may obtain a copy of the License at - * - * http://www.apache.org/licenses/LICENSE-2.0 - * - * Unless required by applicable law or agreed to in writing, software - * distributed under the License is distributed on an "AS IS" BASIS, - * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. - * See the License for the specific language governing permissions and - * limitations under the License. - */ -package org.apache.tika.batch.fs; - -import org.apache.tika.metadata.Property; - -public class FSProperties { - private final static String TIKA_BATCH_FS_NAMESPACE = "tika_batch_fs"; - - /** - * File's relative path (including file name) from a given source root - */ - public final static Property FS_REL_PATH = Property.internalText(TIKA_BATCH_FS_NAMESPACE+":relative_path"); -} diff --git a/tika-batch/src/main/java/org/apache/tika/batch/fs/FSUtil.java b/tika-batch/src/main/java/org/apache/tika/batch/fs/FSUtil.java deleted file mode 100644 index e758564..0000000 --- a/tika-batch/src/main/java/org/apache/tika/batch/fs/FSUtil.java +++ /dev/null @@ -1,211 +0,0 @@ -/* - * Licensed to the Apache Software Foundation (ASF) under one or more - * contributor license agreements. See the NOTICE file distributed with - * this work for additional information regarding copyright ownership. - * The ASF licenses this file to You under the Apache License, Version 2.0 - * (the "License"); you may not use this file except in compliance with - * the License. You may obtain a copy of the License at - * - * http://www.apache.org/licenses/LICENSE-2.0 - * - * Unless required by applicable law or agreed to in writing, software - * distributed under the License is distributed on an "AS IS" BASIS, - * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. - * See the License for the specific language governing permissions and - * limitations under the License. 
- */ - -package org.apache.tika.batch.fs; - -import java.io.File; -import java.io.IOException; -import java.nio.file.Files; -import java.nio.file.Path; -import java.nio.file.Paths; -import java.util.UUID; -import java.util.regex.Matcher; -import java.util.regex.Pattern; - -/** - * Utility class to handle some common issues when - * reading from and writing to a file system (FS). - */ -public class FSUtil { - - @Deprecated - public static boolean checkThisIsAncestorOfThat(File ancestor, File child) { - int ancLen = ancestor.getAbsolutePath().length(); - int childLen = child.getAbsolutePath().length(); - if (childLen <= ancLen) { - return false; - } - - String childBase = child.getAbsolutePath().substring(0, ancLen); - return childBase.equals(ancestor.getAbsolutePath()); - - } - - @Deprecated - public static boolean checkThisIsAncestorOfOrSameAsThat(File ancestor, File child) { - if (ancestor.equals(child)) { - return true; - } - return checkThisIsAncestorOfThat(ancestor, child); - } - - public enum HANDLE_EXISTING { - OVERWRITE, - RENAME, - SKIP - } - - private final static Pattern FILE_NAME_PATTERN = - Pattern.compile("\\A(.*?)(?:\\((\\d+)\\))?\\.([^\\.]+)\\Z"); - - /** - * Given an output root and an initial relative path, - * return the output file according to the HANDLE_EXISTING strategy - *

    - * In the most basic use case, given a root directory "input", - * a file's relative path "dir1/dir2/fileA.docx", and an output directory - * "output", the output file would be "output/dir1/dir2/fileA.docx." - *

    - * If HANDLE_EXISTING is set to OVERWRITE, this will not check to see if the output already exists, - * and the returned file could overwrite an existing file!!! - *

    - * If HANDLE_EXISTING is set to RENAME, this will try to increment a counter at the end of - * the file name (fileA(2).docx) until there is a file name that doesn't exist. - *

    - * This will return null if handleExisting == HANDLE_EXISTING.SKIP and - * the candidate file already exists. - *

    - * This will throw an IOException if HANDLE_EXISTING is set to - * RENAME, and a candidate cannot output file cannot be found - * after trying to increment the file count (e.g. fileA(2).docx) 10000 times - * and then after trying 20,000 UUIDs. - * - * @param outputRoot directory root for output - * @param initialRelativePath initial relative path (including file name, which may be renamed) - * @param handleExisting what to do if the output file exists - * @param suffix suffix to add to files, can be null - * @return output file or null if no output file should be created - * @throws java.io.IOException - * @see #getOutputPath(Path, String, HANDLE_EXISTING, String) - */ - @Deprecated - public static File getOutputFile(File outputRoot, String initialRelativePath, - HANDLE_EXISTING handleExisting, String suffix) throws IOException { - return getOutputPath(Paths.get(outputRoot.toURI()), - initialRelativePath, handleExisting, suffix).toFile(); - } - - /** - * Given an output root and an initial relative path, - * return the output file according to the HANDLE_EXISTING strategy - *

    - * In the most basic use case, given a root directory "input", - * a file's relative path "dir1/dir2/fileA.docx", and an output directory - * "output", the output file would be "output/dir1/dir2/fileA.docx." - *

    - * If HANDLE_EXISTING is set to OVERWRITE, this will not check to see if the output already exists, - * and the returned file could overwrite an existing file!!! - *

    - * If HANDLE_EXISTING is set to RENAME, this will try to increment a counter at the end of - * the file name (fileA(2).docx) until there is a file name that doesn't exist. - *

    - * This will return null if handleExisting == HANDLE_EXISTING.SKIP and - * the candidate file already exists. - *

    - * This will throw an IOException if HANDLE_EXISTING is set to - * RENAME, and a candidate cannot output file cannot be found - * after trying to increment the file count (e.g. fileA(2).docx) 10000 times - * and then after trying 20,000 UUIDs. - * - * @param outputRoot root directory into which to put the path - * @param initialRelativePath relative path including file ("somedir/subdir1/file.doc") - * @param handleExisting policy for what to do if the output path already exists - * @param suffix suffix to add to the output path - * @return can return null - * @throws IOException - */ - public static Path getOutputPath(Path outputRoot, String initialRelativePath, - HANDLE_EXISTING handleExisting, String suffix) throws IOException { - - String localSuffix = (suffix == null) ? "" : suffix; - Path cand = FSUtil.resolveRelative(outputRoot, - initialRelativePath + "." + localSuffix); - if (Files.exists(cand)) { - if (handleExisting.equals(HANDLE_EXISTING.OVERWRITE)) { - return cand; - } else if (handleExisting.equals(HANDLE_EXISTING.SKIP)) { - return null; - } - } - - //if we're here, the output file exists, and - //we must find a new name for it. - - //groups for "testfile(1).txt": - //group(1) is "testfile" - //group(2) is 1 - //group(3) is "txt" - //Note: group(2) can be null - int cnt = 0; - String fNameBase = null; - String fNameExt = ""; - //this doesn't include the addition of the localSuffix - Path candOnly = FSUtil.resolveRelative(outputRoot, - initialRelativePath); - Matcher m = FILE_NAME_PATTERN.matcher(candOnly.getFileName().toString()); - if (m.find()) { - fNameBase = m.group(1); - - if (m.group(2) != null) { - try { - cnt = Integer.parseInt(m.group(2)); - } catch (NumberFormatException e) { - //swallow - } - } - if (m.group(3) != null) { - fNameExt = m.group(3); - } - } - - Path outputParent = cand.getParent(); - while (fNameBase != null && Files.exists(cand) && ++cnt < 10000) { - String candFileName = fNameBase + "(" + cnt + ")." 
+ fNameExt + "" + localSuffix; - cand = FSUtil.resolveRelative(outputParent, candFileName); - } - //reset count to 0 and try 20000 times - cnt = 0; - while (Files.exists(cand) && cnt++ < 20000) { - UUID uid = UUID.randomUUID(); - cand = FSUtil.resolveRelative(outputParent, - uid.toString() + fNameExt + "" + localSuffix); - } - - if (Files.exists(cand)) { - throw new IOException("Couldn't find candidate output file after trying " + - "very, very hard"); - } - return cand; - } - - /** - * Convenience method to ensure that "other" is not an absolute path. - * One could imagine malicious use of this. - * - * @param p - * @param other - * @return resolved path - * @throws IllegalArgumentException if "other" is an absolute path - */ - public static Path resolveRelative(Path p, String other) { - Path op = Paths.get(other); - if (op.isAbsolute()) { - throw new IllegalArgumentException(other + " cannot be an absolute path!"); - } - return p.resolve(op); - } -} \ No newline at end of file diff --git a/tika-batch/src/main/java/org/apache/tika/batch/fs/RecursiveParserWrapperFSConsumer.java b/tika-batch/src/main/java/org/apache/tika/batch/fs/RecursiveParserWrapperFSConsumer.java deleted file mode 100644 index 3722873..0000000 --- a/tika-batch/src/main/java/org/apache/tika/batch/fs/RecursiveParserWrapperFSConsumer.java +++ /dev/null @@ -1,159 +0,0 @@ -package org.apache.tika.batch.fs; - -/* - * Licensed to the Apache Software Foundation (ASF) under one or more - * contributor license agreements. See the NOTICE file distributed with - * this work for additional information regarding copyright ownership. - * The ASF licenses this file to You under the Apache License, Version 2.0 - * (the "License"); you may not use this file except in compliance with - * the License. 
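The RENAME strategy in the deleted FSUtil.getOutputPath() — parse any existing "(n)" counter out of the file name and increment until the name is free — can be sketched filesystem-free; `RenameSketch` and the `exists` set are illustrative stand-ins for Files.exists():

```java
import java.util.Set;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Sketch of FSUtil's rename loop. Same regex as FSUtil.FILE_NAME_PATTERN:
// group(1) base, group(2) optional counter, group(3) extension.
public class RenameSketch {
    private static final Pattern NAME =
            Pattern.compile("\\A(.*?)(?:\\((\\d+)\\))?\\.([^\\.]+)\\Z");

    static String nextFreeName(String fileName, Set<String> exists) {
        Matcher m = NAME.matcher(fileName);
        if (!m.find()) {
            return fileName;                   // no parseable extension: leave as-is
        }
        String base = m.group(1);
        int cnt = (m.group(2) == null) ? 0 : Integer.parseInt(m.group(2));
        String ext = (m.group(3) == null) ? "" : m.group(3);
        String cand = fileName;
        while (exists.contains(cand) && ++cnt < 10000) {
            cand = base + "(" + cnt + ")." + ext;   // fileA.docx -> fileA(1).docx ...
        }
        return cand;
    }
}
```

The real method additionally falls back to 20,000 UUID-based names before giving up with an IOException.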
You may obtain a copy of the License at - * - * http://www.apache.org/licenses/LICENSE-2.0 - * - * Unless required by applicable law or agreed to in writing, software - * distributed under the License is distributed on an "AS IS" BASIS, - * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. - * See the License for the specific language governing permissions and - * limitations under the License. - */ - -import java.io.InputStream; -import java.io.OutputStream; -import java.io.OutputStreamWriter; -import java.io.Writer; -import java.util.LinkedList; -import java.util.List; -import java.util.concurrent.ArrayBlockingQueue; - -import org.apache.commons.io.IOUtils; -import org.apache.tika.batch.FileResource; -import org.apache.tika.batch.OutputStreamFactory; -import org.apache.tika.batch.ParserFactory; -import org.apache.tika.config.TikaConfig; -import org.apache.tika.metadata.Metadata; -import org.apache.tika.metadata.TikaCoreProperties; -import org.apache.tika.metadata.serialization.JsonMetadataList; -import org.apache.tika.parser.ParseContext; -import org.apache.tika.parser.Parser; -import org.apache.tika.parser.RecursiveParserWrapper; -import org.apache.tika.sax.ContentHandlerFactory; -import org.apache.tika.utils.ExceptionUtils; -import org.xml.sax.helpers.DefaultHandler; - -/** - * Basic FileResourceConsumer that reads files from an input - * directory and writes content to the output directory. - *

    - * This tries to catch most of the common exceptions, log them and - * store them in the metadata list output. - */ -public class RecursiveParserWrapperFSConsumer extends AbstractFSConsumer { - - - private final ParserFactory parserFactory; - private final ContentHandlerFactory contentHandlerFactory; - private final OutputStreamFactory fsOSFactory; - private final TikaConfig tikaConfig; - private String outputEncoding = "UTF-8"; - - - public RecursiveParserWrapperFSConsumer(ArrayBlockingQueue queue, - ParserFactory parserFactory, - ContentHandlerFactory contentHandlerFactory, - OutputStreamFactory fsOSFactory, TikaConfig tikaConfig) { - super(queue); - this.parserFactory = parserFactory; - this.contentHandlerFactory = contentHandlerFactory; - this.fsOSFactory = fsOSFactory; - this.tikaConfig = tikaConfig; - } - - @Override - public boolean processFileResource(FileResource fileResource) { - - Parser wrapped = parserFactory.getParser(tikaConfig); - RecursiveParserWrapper parser = new RecursiveParserWrapper(wrapped, contentHandlerFactory); - ParseContext context = new ParseContext(); - -// if (parseRecursively == true) { - context.set(Parser.class, parser); -// } - - //try to open outputstream first - OutputStream os = getOutputStream(fsOSFactory, fileResource); - - if (os == null) { - logger.debug("Skipping: " + fileResource.getMetadata().get(FSProperties.FS_REL_PATH)); - return false; - } - - //try to open the inputstream before the parse. - //if the parse hangs or throws a nasty exception, at least there will - //be a zero byte file there so that the batchrunner can skip that problematic - //file during the next run. 
- InputStream is = getInputStream(fileResource); - if (is == null) { - IOUtils.closeQuietly(os); - return false; - } - - Throwable thrown = null; - List metadataList = null; - Metadata containerMetadata = fileResource.getMetadata(); - try { - parse(fileResource.getResourceId(), parser, is, new DefaultHandler(), - containerMetadata, context); - metadataList = parser.getMetadata(); - } catch (Throwable t) { - thrown = t; - metadataList = parser.getMetadata(); - if (metadataList == null) { - metadataList = new LinkedList(); - } - Metadata m = null; - if (metadataList.size() == 0) { - m = containerMetadata; - } else { - //take the top metadata item - m = metadataList.remove(0); - } - String stackTrace = ExceptionUtils.getFilteredStackTrace(t); - m.add(TikaCoreProperties.TIKA_META_EXCEPTION_PREFIX+"runtime", stackTrace); - metadataList.add(0, m); - } finally { - IOUtils.closeQuietly(is); - } - - Writer writer = null; - - try { - writer = new OutputStreamWriter(os, getOutputEncoding()); - JsonMetadataList.toJson(metadataList, writer); - } catch (Exception e) { - //this is a stop the world kind of thing - logger.error("{}", getXMLifiedLogMsg(IO_OS+"json", - fileResource.getResourceId(), e)); - throw new RuntimeException(e); - } finally { - flushAndClose(writer); - } - - if (thrown != null) { - if (thrown instanceof Error) { - throw (Error) thrown; - } else { - return false; - } - } - - return true; - } - - public String getOutputEncoding() { - return outputEncoding; - } - - public void setOutputEncoding(String outputEncoding) { - this.outputEncoding = outputEncoding; - } -} diff --git a/tika-batch/src/main/java/org/apache/tika/batch/fs/builders/BasicTikaFSConsumersBuilder.java b/tika-batch/src/main/java/org/apache/tika/batch/fs/builders/BasicTikaFSConsumersBuilder.java deleted file mode 100644 index b65b046..0000000 --- a/tika-batch/src/main/java/org/apache/tika/batch/fs/builders/BasicTikaFSConsumersBuilder.java +++ /dev/null @@ -1,207 +0,0 @@ -/* - * Licensed to the 
Apache Software Foundation (ASF) under one or more - * contributor license agreements. See the NOTICE file distributed with - * this work for additional information regarding copyright ownership. - * The ASF licenses this file to You under the Apache License, Version 2.0 - * (the "License"); you may not use this file except in compliance with - * the License. You may obtain a copy of the License at - * - * http://www.apache.org/licenses/LICENSE-2.0 - * - * Unless required by applicable law or agreed to in writing, software - * distributed under the License is distributed on an "AS IS" BASIS, - * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. - * See the License for the specific language governing permissions and - * limitations under the License. - */ - -package org.apache.tika.batch.fs.builders; - -import java.io.InputStream; -import java.nio.file.Files; -import java.nio.file.Path; -import java.nio.file.Paths; -import java.util.LinkedList; -import java.util.List; -import java.util.Map; -import java.util.concurrent.ArrayBlockingQueue; - -import org.apache.tika.batch.ConsumersManager; -import org.apache.tika.batch.FileResource; -import org.apache.tika.batch.FileResourceConsumer; -import org.apache.tika.batch.OutputStreamFactory; -import org.apache.tika.batch.ParserFactory; -import org.apache.tika.batch.builders.AbstractConsumersBuilder; -import org.apache.tika.batch.builders.BatchProcessBuilder; -import org.apache.tika.batch.builders.IContentHandlerFactoryBuilder; -import org.apache.tika.batch.builders.IParserFactoryBuilder; -import org.apache.tika.batch.fs.BasicTikaFSConsumer; -import org.apache.tika.batch.fs.FSConsumersManager; -import org.apache.tika.batch.fs.FSOutputStreamFactory; -import org.apache.tika.batch.fs.FSUtil; -import org.apache.tika.batch.fs.RecursiveParserWrapperFSConsumer; -import org.apache.tika.config.TikaConfig; -import org.apache.tika.sax.ContentHandlerFactory; -import org.apache.tika.util.ClassLoaderUtil; -import 
org.apache.tika.util.PropsUtil; -import org.apache.tika.util.XMLDOMUtil; -import org.w3c.dom.Node; -import org.w3c.dom.NodeList; - -public class BasicTikaFSConsumersBuilder extends AbstractConsumersBuilder { - - @Override - public ConsumersManager build(Node node, Map runtimeAttributes, - ArrayBlockingQueue queue) { - - //figure out if we're building a recursiveParserWrapper - boolean recursiveParserWrapper = false; - String recursiveParserWrapperString = runtimeAttributes.get("recursiveParserWrapper"); - if (recursiveParserWrapperString != null){ - recursiveParserWrapper = PropsUtil.getBoolean(recursiveParserWrapperString, recursiveParserWrapper); - } else { - Node recursiveParserWrapperNode = node.getAttributes().getNamedItem("recursiveParserWrapper"); - if (recursiveParserWrapperNode != null) { - recursiveParserWrapper = PropsUtil.getBoolean(recursiveParserWrapperNode.getNodeValue(), recursiveParserWrapper); - } - } - - //how long to let the consumersManager run on init() and shutdown() - Long consumersManagerMaxMillis = null; - String consumersManagerMaxMillisString = runtimeAttributes.get("consumersManagerMaxMillis"); - if (consumersManagerMaxMillisString != null){ - consumersManagerMaxMillis = PropsUtil.getLong(consumersManagerMaxMillisString, null); - } else { - Node consumersManagerMaxMillisNode = node.getAttributes().getNamedItem("consumersManagerMaxMillis"); - if (consumersManagerMaxMillis == null && consumersManagerMaxMillisNode != null) { - consumersManagerMaxMillis = PropsUtil.getLong(consumersManagerMaxMillisNode.getNodeValue(), - null); - } - } - - TikaConfig config = null; - String tikaConfigPath = runtimeAttributes.get("c"); - - if( tikaConfigPath == null) { - Node tikaConfigNode = node.getAttributes().getNamedItem("tikaConfig"); - if (tikaConfigNode != null) { - tikaConfigPath = PropsUtil.getString(tikaConfigNode.getNodeValue(), null); - } - } - if (tikaConfigPath != null) { - try (InputStream is = Files.newInputStream(Paths.get(tikaConfigPath))) 
{ - config = new TikaConfig(is); - } catch (Exception e) { - throw new RuntimeException(e); - } - } else { - config = TikaConfig.getDefaultConfig(); - } - - List consumers = new LinkedList(); - int numConsumers = BatchProcessBuilder.getNumConsumers(runtimeAttributes); - - NodeList nodeList = node.getChildNodes(); - Node contentHandlerFactoryNode = null; - Node parserFactoryNode = null; - Node outputStreamFactoryNode = null; - - for (int i = 0; i < nodeList.getLength(); i++){ - Node child = nodeList.item(i); - String cn = child.getNodeName(); - if (cn.equals("parser")){ - parserFactoryNode = child; - } else if (cn.equals("contenthandler")) { - contentHandlerFactoryNode = child; - } else if (cn.equals("outputstream")) { - outputStreamFactoryNode = child; - } - } - - if (contentHandlerFactoryNode == null || parserFactoryNode == null - || outputStreamFactoryNode == null) { - throw new RuntimeException("You must specify a ContentHandlerFactory, "+ - "a ParserFactory and an OutputStreamFactory"); - } - ContentHandlerFactory contentHandlerFactory = getContentHandlerFactory(contentHandlerFactoryNode, runtimeAttributes); - ParserFactory parserFactory = getParserFactory(parserFactoryNode, runtimeAttributes); - OutputStreamFactory outputStreamFactory = getOutputStreamFactory(outputStreamFactoryNode, runtimeAttributes); - - if (recursiveParserWrapper) { - for (int i = 0; i < numConsumers; i++) { - FileResourceConsumer c = new RecursiveParserWrapperFSConsumer(queue, - parserFactory, contentHandlerFactory, outputStreamFactory, config); - consumers.add(c); - } - } else { - for (int i = 0; i < numConsumers; i++) { - FileResourceConsumer c = new BasicTikaFSConsumer(queue, - parserFactory, contentHandlerFactory, outputStreamFactory, config); - consumers.add(c); - } - } - ConsumersManager manager = new FSConsumersManager(consumers); - if (consumersManagerMaxMillis != null) { - manager.setConsumersManagerMaxMillis(consumersManagerMaxMillis); - } - return manager; - } - - - private 
ContentHandlerFactory getContentHandlerFactory(Node node, Map runtimeAttributes) { - - Map localAttrs = XMLDOMUtil.mapifyAttrs(node, runtimeAttributes); - String className = localAttrs.get("builderClass"); - if (className == null) { - throw new RuntimeException("Must specify builderClass for contentHandler"); - } - IContentHandlerFactoryBuilder builder = ClassLoaderUtil.buildClass(IContentHandlerFactoryBuilder.class, className); - return builder.build(node, runtimeAttributes); - } - - private ParserFactory getParserFactory(Node node, Map runtimeAttributes) { - Map localAttrs = XMLDOMUtil.mapifyAttrs(node, runtimeAttributes); - String className = localAttrs.get("builderClass"); - IParserFactoryBuilder builder = ClassLoaderUtil.buildClass(IParserFactoryBuilder.class, className); - return builder.build(node, runtimeAttributes); - } - - private OutputStreamFactory getOutputStreamFactory(Node node, Map runtimeAttributes) { - Map attrs = XMLDOMUtil.mapifyAttrs(node, runtimeAttributes); - - Path outputDir = PropsUtil.getPath(attrs.get("outputDir"), null); -/* FSUtil.HANDLE_EXISTING handleExisting = null; - String handleExistingString = attrs.get("handleExisting"); - if (handleExistingString == null) { - handleExistingException(); - } else if (handleExistingString.equals("overwrite")){ - handleExisting = FSUtil.HANDLE_EXISTING.OVERWRITE; - } else if (handleExistingString.equals("rename")) { - handleExisting = FSUtil.HANDLE_EXISTING.RENAME; - } else if (handleExistingString.equals("skip")) { - handleExisting = FSUtil.HANDLE_EXISTING.SKIP; - } else { - handleExistingException(); - } -*/ - String compressionString = attrs.get("compression"); - FSOutputStreamFactory.COMPRESSION compression = FSOutputStreamFactory.COMPRESSION.NONE; - if (compressionString == null) { - //do nothing - } else if (compressionString.contains("bz")) { - compression = FSOutputStreamFactory.COMPRESSION.BZIP2; - } else if (compressionString.contains("gz")) { - compression = 
FSOutputStreamFactory.COMPRESSION.GZIP; - } else if (compressionString.contains("zip")) { - compression = FSOutputStreamFactory.COMPRESSION.ZIP; - } - String suffix = attrs.get("outputSuffix"); - - //TODO: possibly open up the different handle-existings in the future - //but for now, lock it down to require skip. Too dangerous otherwise - //if the driver restarts and this is set to overwrite... - return new FSOutputStreamFactory(outputDir, FSUtil.HANDLE_EXISTING.SKIP, - compression, suffix); - } - -} diff --git a/tika-batch/src/main/java/org/apache/tika/batch/fs/builders/FSCrawlerBuilder.java b/tika-batch/src/main/java/org/apache/tika/batch/fs/builders/FSCrawlerBuilder.java deleted file mode 100644 index 53a3f96..0000000 --- a/tika-batch/src/main/java/org/apache/tika/batch/fs/builders/FSCrawlerBuilder.java +++ /dev/null @@ -1,141 +0,0 @@ -package org.apache.tika.batch.fs.builders; - -/* - * Licensed to the Apache Software Foundation (ASF) under one or more - * contributor license agreements. See the NOTICE file distributed with - * this work for additional information regarding copyright ownership. - * The ASF licenses this file to You under the Apache License, Version 2.0 - * (the "License"); you may not use this file except in compliance with - * the License. You may obtain a copy of the License at - * - * http://www.apache.org/licenses/LICENSE-2.0 - * - * Unless required by applicable law or agreed to in writing, software - * distributed under the License is distributed on an "AS IS" BASIS, - * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. - * See the License for the specific language governing permissions and - * limitations under the License. 
- */ - - -import java.io.FileNotFoundException; -import java.io.IOException; -import java.io.UnsupportedEncodingException; -import java.nio.charset.Charset; -import java.nio.file.Path; -import java.nio.file.Paths; -import java.util.Locale; -import java.util.Map; -import java.util.concurrent.ArrayBlockingQueue; -import java.util.regex.Pattern; - -import org.apache.tika.batch.FileResource; -import org.apache.tika.batch.FileResourceCrawler; -import org.apache.tika.batch.builders.BatchProcessBuilder; -import org.apache.tika.batch.builders.ICrawlerBuilder; -import org.apache.tika.batch.fs.FSDirectoryCrawler; -import org.apache.tika.batch.fs.FSDocumentSelector; -import org.apache.tika.batch.fs.FSListCrawler; -import org.apache.tika.extractor.DocumentSelector; -import org.apache.tika.util.PropsUtil; -import org.apache.tika.util.XMLDOMUtil; -import org.w3c.dom.Node; - -/** - * Builds either an FSDirectoryCrawler or an FSListCrawler. - */ -public class FSCrawlerBuilder implements ICrawlerBuilder { - - private final static String MAX_CONSEC_WAIT_MILLIS = "maxConsecWaitMillis"; - private final static String MAX_FILES_TO_ADD_ATTR = "maxFilesToAdd"; - private final static String MAX_FILES_TO_CONSIDER_ATTR = "maxFilesToConsider"; - - - private final static String CRAWL_ORDER = "crawlOrder"; - private final static String INPUT_DIR_ATTR = "inputDir"; - private final static String INPUT_START_DIR_ATTR = "startDir"; - private final static String MAX_FILE_SIZE_BYTES_ATTR = "maxFileSizeBytes"; - private final static String MIN_FILE_SIZE_BYTES_ATTR = "minFileSizeBytes"; - - - private final static String INCLUDE_FILE_PAT_ATTR = "includeFilePat"; - private final static String EXCLUDE_FILE_PAT_ATTR = "excludeFilePat"; - - @Override - public FileResourceCrawler build(Node node, Map runtimeAttributes, - ArrayBlockingQueue queue) { - - Map attributes = XMLDOMUtil.mapifyAttrs(node, runtimeAttributes); - - int numConsumers = BatchProcessBuilder.getNumConsumers(runtimeAttributes); - Path 
inputDir = PropsUtil.getPath(attributes.get(INPUT_DIR_ATTR), - Paths.get("input")); - FileResourceCrawler crawler = null; - if (attributes.containsKey("fileList")) { - String randomCrawlString = attributes.get(CRAWL_ORDER); - - if (randomCrawlString != null) { - //TODO: change to logger warn or throw RuntimeException? - System.err.println("randomCrawl attribute is ignored by FSListCrawler"); - } - Path fileList = PropsUtil.getPath(attributes.get("fileList"), null); - String encodingString = PropsUtil.getString(attributes.get("fileListEncoding"), "UTF-8"); - - try { - Charset encoding = Charset.forName(encodingString); - crawler = new FSListCrawler(queue, numConsumers, inputDir, fileList, encoding); - } catch (FileNotFoundException e) { - throw new RuntimeException("fileList file not found for FSListCrawler: " + - fileList.toAbsolutePath()); - } catch (UnsupportedEncodingException e) { - throw new RuntimeException("fileList encoding not supported: "+encodingString); - } catch (IOException e) { - throw new RuntimeException("IOException while trying to open fileList: " + e.getMessage()); - } - } else { - FSDirectoryCrawler.CRAWL_ORDER crawlOrder = getCrawlOrder(attributes.get(CRAWL_ORDER)); - Path startDir = PropsUtil.getPath(attributes.get(INPUT_START_DIR_ATTR), null); - if (startDir == null) { - crawler = new FSDirectoryCrawler(queue, numConsumers, inputDir, crawlOrder); - } else { - crawler = new FSDirectoryCrawler(queue, numConsumers, inputDir, startDir, crawlOrder); - } - } - - crawler.setMaxFilesToConsider(PropsUtil.getInt(attributes.get(MAX_FILES_TO_CONSIDER_ATTR), -1)); - crawler.setMaxFilesToAdd(PropsUtil.getInt(attributes.get(MAX_FILES_TO_ADD_ATTR), -1)); - - DocumentSelector selector = buildSelector(attributes); - if (selector != null) { - crawler.setDocumentSelector(selector); - } - - crawler.setMaxConsecWaitInMillis(PropsUtil.getLong(attributes.get(MAX_CONSEC_WAIT_MILLIS), 300000L));//5 minutes - return crawler; - } - - private 
FSDirectoryCrawler.CRAWL_ORDER getCrawlOrder(String s) { - if (s == null || s.trim().length() == 0 || s.equals("os")) { - return FSDirectoryCrawler.CRAWL_ORDER.OS_ORDER; - } else if (s.toLowerCase(Locale.ROOT).contains("rand")) { - return FSDirectoryCrawler.CRAWL_ORDER.RANDOM; - } else if (s.toLowerCase(Locale.ROOT).contains("sort")) { - return FSDirectoryCrawler.CRAWL_ORDER.SORTED; - } else { - return FSDirectoryCrawler.CRAWL_ORDER.OS_ORDER; - } - } - - private DocumentSelector buildSelector(Map attributes) { - String includeString = attributes.get(INCLUDE_FILE_PAT_ATTR); - String excludeString = attributes.get(EXCLUDE_FILE_PAT_ATTR); - long maxFileSize = PropsUtil.getLong(attributes.get(MAX_FILE_SIZE_BYTES_ATTR), -1L); - long minFileSize = PropsUtil.getLong(attributes.get(MIN_FILE_SIZE_BYTES_ATTR), -1L); - Pattern includePat = (includeString != null && includeString.length() > 0) ? Pattern.compile(includeString) : null; - Pattern excludePat = (excludeString != null && excludeString.length() > 0) ? Pattern.compile(excludeString) : null; - - return new FSDocumentSelector(includePat, excludePat, minFileSize, maxFileSize); - } - - -} diff --git a/tika-batch/src/main/java/org/apache/tika/batch/fs/strawman/StrawManTikaAppDriver.java b/tika-batch/src/main/java/org/apache/tika/batch/fs/strawman/StrawManTikaAppDriver.java deleted file mode 100644 index 3f0fdfe..0000000 --- a/tika-batch/src/main/java/org/apache/tika/batch/fs/strawman/StrawManTikaAppDriver.java +++ /dev/null @@ -1,254 +0,0 @@ -package org.apache.tika.batch.fs.strawman; - -/* - * Licensed to the Apache Software Foundation (ASF) under one or more - * contributor license agreements. See the NOTICE file distributed with - * this work for additional information regarding copyright ownership. - * The ASF licenses this file to You under the Apache License, Version 2.0 - * (the "License"); you may not use this file except in compliance with - * the License. 
You may obtain a copy of the License at - * - * http://www.apache.org/licenses/LICENSE-2.0 - * - * Unless required by applicable law or agreed to in writing, software - * distributed under the License is distributed on an "AS IS" BASIS, - * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. - * See the License for the specific language governing permissions and - * limitations under the License. - */ - -import java.io.IOException; -import java.io.InputStream; -import java.io.OutputStream; -import java.nio.file.FileVisitResult; -import java.nio.file.Files; -import java.nio.file.Path; -import java.nio.file.Paths; -import java.nio.file.SimpleFileVisitor; -import java.nio.file.attribute.BasicFileAttributes; -import java.util.ArrayList; -import java.util.Arrays; -import java.util.Date; -import java.util.List; -import java.util.concurrent.Callable; -import java.util.concurrent.ExecutionException; -import java.util.concurrent.ExecutorCompletionService; -import java.util.concurrent.ExecutorService; -import java.util.concurrent.Executors; -import java.util.concurrent.Future; -import java.util.concurrent.atomic.AtomicInteger; - -import org.apache.commons.io.IOUtils; -import org.slf4j.Logger; -import org.slf4j.LoggerFactory; -import org.slf4j.MarkerFactory; - -/** - * Simple single-threaded class that calls tika-app against every file in a directory. - * - * This is exceedingly robust. One file per process. - * - * However, you can use this to compare performance against tika-batch fs code. 
- * - * - */ -public class StrawManTikaAppDriver implements Callable { - - private static AtomicInteger threadCount = new AtomicInteger(0); - private final int totalThreads; - private final int threadNum; - private Path inputRoot = null; - private Path outputRoot = null; - private String[] args = null; - private Logger logger = LoggerFactory.getLogger(StrawManTikaAppDriver.class); - - - public StrawManTikaAppDriver(Path inputRoot, Path outputRoot, - int totalThreads, String[] args) { - this.inputRoot = inputRoot; - this.outputRoot = outputRoot; - this.args = args; - threadNum = threadCount.getAndIncrement(); - this.totalThreads = totalThreads; - } - - - private class TikaVisitor extends SimpleFileVisitor { - private int processed = 0; - - int getProcessed() { - return processed; - } - @Override - public FileVisitResult visitFile(Path file, - BasicFileAttributes attr) { - if (totalThreads > 1) { - int hashCode = file.toAbsolutePath().toString().hashCode(); - if (Math.abs(hashCode % totalThreads) != threadNum) { - return FileVisitResult.CONTINUE; - } - } - assert(file.startsWith(inputRoot)); - Path relPath = inputRoot.relativize(file); - Path outputFile = Paths.get(outputRoot.toAbsolutePath().toString(), - relPath.toString() + ".txt"); - try { - Files.createDirectories(outputFile.getParent()); - } catch (IOException e) { - logger.error(MarkerFactory.getMarker("FATAL"), - "parent directory for "+ outputFile + " was not made!"); - throw new RuntimeException("couldn't make parent file for " + outputFile); - } - List commandLine = new ArrayList<>(); - for (String arg : args) { - commandLine.add(arg); - } - commandLine.add("-t"); - commandLine.add("\""+outputFile.toAbsolutePath()+"\""); - ProcessBuilder builder = new ProcessBuilder(commandLine.toArray(new String[commandLine.size()])); - logger.info("about to process: "+file.toAbsolutePath()); - Process proc = null; - RedirectGobbler gobbler = null; - Thread gobblerThread = null; - try { - OutputStream os = 
Files.newOutputStream(outputFile); - proc = builder.start(); - gobbler = new RedirectGobbler(proc.getInputStream(), os); - gobblerThread = new Thread(gobbler); - gobblerThread.start(); - } catch (IOException e) { - logger.error(e.getMessage()); - return FileVisitResult.CONTINUE; - } - - boolean finished = false; - long totalTime = 180000;//3 minutes - long pulse = 100; - for (int i = 0; i < totalTime; i += pulse) { - try { - Thread.currentThread().sleep(pulse); - } catch (InterruptedException e) { - //swallow - } - try { - int exit = proc.exitValue(); - finished = true; - break; - } catch (IllegalThreadStateException e) { - //swallow - } - } - if (!finished) { - logger.warn("Had to kill process working on: " + file.toAbsolutePath()); - proc.destroy(); - } - gobbler.close(); - gobblerThread.interrupt(); - processed++; - return FileVisitResult.CONTINUE; - } - - } - - - - @Override - public Integer call() throws Exception { - long start = new Date().getTime(); - TikaVisitor v = new TikaVisitor(); - Files.walkFileTree(inputRoot, v); - int processed = v.getProcessed(); - double elapsedSecs = ((double)new Date().getTime()-(double)start)/(double)1000; - logger.info("Finished processing " + processed + " files in " + elapsedSecs + " seconds."); - return processed; - } - - private class RedirectGobbler implements Runnable { - private OutputStream redirectOs = null; - private InputStream redirectIs = null; - - private RedirectGobbler(InputStream is, OutputStream os) { - this.redirectIs = is; - this.redirectOs = os; - } - - private void close() { - if (redirectOs != null) { - try { - redirectOs.flush(); - } catch (IOException e) { - logger.error("can't flush"); - } - try { - redirectIs.close(); - } catch (IOException e) { - logger.error("can't close input in redirect gobbler"); - } - try { - redirectOs.close(); - } catch (IOException e) { - logger.error("can't close output in redirect gobbler"); - } - } - } - - @Override - public void run() { - try { - 
IOUtils.copy(redirectIs, redirectOs); - } catch (IOException e) { - logger.error("IOException while gobbling"); - } - } - } - - - - public static String usage() { - StringBuilder sb = new StringBuilder(); - sb.append("Example usage:\n"); - sb.append("java -cp org.apache.batch.fs.strawman.StrawManTikaAppDriver "); - sb.append(" "); - sb.append("java -jar tika-app-X.Xjar <...commandline arguments for tika-app>\n\n"); - return sb.toString(); - } - - public static void main(String[] args) { - long start = new Date().getTime(); - if (args.length < 6) { - System.err.println(StrawManTikaAppDriver.usage()); - } - Path inputDir = Paths.get(args[0]); - Path outputDir = Paths.get(args[1]); - int totalThreads = Integer.parseInt(args[2]); - - List commandLine = new ArrayList(); - commandLine.addAll(Arrays.asList(args).subList(3, args.length)); - totalThreads = (totalThreads < 1) ? 1 : totalThreads; - ExecutorService ex = Executors.newFixedThreadPool(totalThreads); - ExecutorCompletionService completionService = - new ExecutorCompletionService(ex); - - for (int i = 0; i < totalThreads; i++) { - StrawManTikaAppDriver driver = - new StrawManTikaAppDriver(inputDir, outputDir, totalThreads, commandLine.toArray(new String[commandLine.size()])); - completionService.submit(driver); - } - - int totalFilesProcessed = 0; - for (int i = 0; i < totalThreads; i++) { - try { - Future future = completionService.take(); - if (future != null) { - totalFilesProcessed += future.get(); - } - } catch (InterruptedException e) { - e.printStackTrace(); - } catch (ExecutionException e) { - e.printStackTrace(); - } - } - double elapsedSeconds = (double)(new Date().getTime()-start)/(double)1000; - System.out.println("Processed "+totalFilesProcessed + " in " + elapsedSeconds + " seconds"); - } -} diff --git a/tika-batch/src/main/java/org/apache/tika/util/ClassLoaderUtil.java b/tika-batch/src/main/java/org/apache/tika/util/ClassLoaderUtil.java deleted file mode 100644 index 80f618c..0000000 --- 
a/tika-batch/src/main/java/org/apache/tika/util/ClassLoaderUtil.java +++ /dev/null @@ -1,41 +0,0 @@ -package org.apache.tika.util; - -/* - * Licensed to the Apache Software Foundation (ASF) under one or more - * contributor license agreements. See the NOTICE file distributed with - * this work for additional information regarding copyright ownership. - * The ASF licenses this file to You under the Apache License, Version 2.0 - * (the "License"); you may not use this file except in compliance with - * the License. You may obtain a copy of the License at - * - * http://www.apache.org/licenses/LICENSE-2.0 - * - * Unless required by applicable law or agreed to in writing, software - * distributed under the License is distributed on an "AS IS" BASIS, - * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. - * See the License for the specific language governing permissions and - * limitations under the License. - */ -public class ClassLoaderUtil { - - @SuppressWarnings("unchecked") - public static T buildClass(Class iface, String className) { - - ClassLoader loader = ClassLoader.getSystemClassLoader(); - Class clazz; - try { - clazz = loader.loadClass(className); - if (iface.isAssignableFrom(clazz)) { - return (T) clazz.newInstance(); - } - throw new IllegalArgumentException(iface.toString() + " is not assignable from " + className); - } catch (ClassNotFoundException e) { - throw new RuntimeException(e); - } catch (InstantiationException e) { - throw new RuntimeException(e); - } catch (IllegalAccessException e) { - throw new RuntimeException(e); - } - - } -} diff --git a/tika-batch/src/main/java/org/apache/tika/util/DurationFormatUtils.java b/tika-batch/src/main/java/org/apache/tika/util/DurationFormatUtils.java deleted file mode 100644 index d61cac3..0000000 --- a/tika-batch/src/main/java/org/apache/tika/util/DurationFormatUtils.java +++ /dev/null @@ -1,66 +0,0 @@ -package org.apache.tika.util; - -/* - * Licensed to the Apache Software Foundation 
(ASF) under one or more - * contributor license agreements. See the NOTICE file distributed with - * this work for additional information regarding copyright ownership. - * The ASF licenses this file to You under the Apache License, Version 2.0 - * (the "License"); you may not use this file except in compliance with - * the License. You may obtain a copy of the License at - * - * http://www.apache.org/licenses/LICENSE-2.0 - * - * Unless required by applicable law or agreed to in writing, software - * distributed under the License is distributed on an "AS IS" BASIS, - * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. - * See the License for the specific language governing permissions and - * limitations under the License. - */ - -/** - * Functionality and naming conventions (roughly) copied from org.apache.commons.lang3 - * so that we didn't have to add another dependency. - */ -public class DurationFormatUtils { - - public static String formatMillis(long duration) { - duration = Math.abs(duration); - StringBuilder sb = new StringBuilder(); - int secs = (int) (duration / 1000) % 60; - int mins = (int) ((duration / (1000 * 60)) % 60); - int hrs = (int) ((duration / (1000 * 60 * 60)) % 24); - int days = (int) ((duration / (1000 * 60 * 60 * 24)) % 7); - - //sb.append(millis + " milliseconds"); - addUnitString(sb, days, "day"); - addUnitString(sb, hrs, "hour"); - addUnitString(sb, mins, "minute"); - addUnitString(sb, secs, "second"); - if (duration < 1000) { - addUnitString(sb, duration, "millisecond"); - } - - return sb.toString(); - } - - private static void addUnitString(StringBuilder sb, long unit, String unitString) { - //only add unit if >= 1 - if (unit == 1) { - addComma(sb); - sb.append("1 "); - sb.append(unitString); - } else if (unit > 1) { - addComma(sb); - sb.append(unit); - sb.append(" "); - sb.append(unitString); - sb.append("s"); - } - } - - private static void addComma(StringBuilder sb) { - if (sb.length() > 0) { - sb.append(", 
"); - } - } -} diff --git a/tika-batch/src/main/java/org/apache/tika/util/PropsUtil.java b/tika-batch/src/main/java/org/apache/tika/util/PropsUtil.java deleted file mode 100644 index 4238851..0000000 --- a/tika-batch/src/main/java/org/apache/tika/util/PropsUtil.java +++ /dev/null @@ -1,149 +0,0 @@ -package org.apache.tika.util; - -/* - * Licensed to the Apache Software Foundation (ASF) under one or more - * contributor license agreements. See the NOTICE file distributed with - * this work for additional information regarding copyright ownership. - * The ASF licenses this file to You under the Apache License, Version 2.0 - * (the "License"); you may not use this file except in compliance with - * the License. You may obtain a copy of the License at - * - * http://www.apache.org/licenses/LICENSE-2.0 - * - * Unless required by applicable law or agreed to in writing, software - * distributed under the License is distributed on an "AS IS" BASIS, - * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. - * See the License for the specific language governing permissions and - * limitations under the License. - */ - -import java.io.File; -import java.nio.file.Path; -import java.nio.file.Paths; -import java.util.Locale; - -/** - * Utility class to handle properties. If the value is null, - * or if there is a parser error, the defaultMissing value will be returned. - */ -public class PropsUtil { - - /** - * Parses v. If there is a problem, this returns defaultMissing. - * - * @param v string to parse - * @param defaultMissing value to return if value is null or unparseable - * @return parsed value - */ - public static Boolean getBoolean(String v, Boolean defaultMissing) { - if (v == null || v.length() == 0) { - return defaultMissing; - } - if (v.toLowerCase(Locale.ROOT).equals("true")) { - return true; - } - if (v.toLowerCase(Locale.ROOT).equals("false")) { - return false; - } - return defaultMissing; - } - - /** - * Parses v. 
If there is a problem, this returns defaultMissing. - * - * @param v string to parse - * @param defaultMissing value to return if value is null or unparseable - * @return parsed value - */ - public static Integer getInt(String v, Integer defaultMissing) { - if (v == null || v.length() == 0) { - return defaultMissing; - } - try { - return Integer.parseInt(v); - } catch (NumberFormatException e) { - //NO OP - } - return defaultMissing; - } - - /** - * Parses v. If there is a problem, this returns defaultMissing. - * - * @param v string to parse - * @param defaultMissing value to return if value is null or unparseable - * @return parsed value - */ - public static Long getLong(String v, Long defaultMissing) { - if (v == null || v.length() == 0) { - return defaultMissing; - } - try { - return Long.parseLong(v); - } catch (NumberFormatException e) { - //swallow - } - return defaultMissing; - } - - - /** - * Parses v. If there is a problem, this returns defaultMissing. - * - * @param v string to parse - * @param defaultMissing value to return if value is null or unparseable - * @return parsed value - * @see #getPath(String, Path) - */ - @Deprecated - public static File getFile(String v, File defaultMissing) { - if (v == null || v.length() == 0) { - return defaultMissing; - } - //trim initial and final " if they exist - if (v.startsWith("\"")) { - v = v.substring(1); - } - if (v.endsWith("\"")) { - v = v.substring(0, v.length()-1); - } - - return new File(v); - } - - /** - * Parses v. If v is null, this returns defaultMissing. - * - * @param v string to parse - * @param defaultMissing value to return if value is null - * @return parsed value - */ - public static String getString(String v, String defaultMissing) { - if (v == null) { - return defaultMissing; - } - return v; - } - - /** - * Parses v. If there is a problem, this returns defaultMissing. 
- * - * @param v string to parse - * @param defaultMissing value to return if value is null or unparseable - * @return parsed value - * @see #getPath(String, Path) - */ - public static Path getPath(String v, Path defaultMissing) { - if (v == null || v.length() == 0) { - return defaultMissing; - } - //trim initial and final " if they exist - if (v.startsWith("\"")) { - v = v.substring(1); - } - if (v.endsWith("\"")) { - v = v.substring(0, v.length()-1); - } - return Paths.get(v); - } -} diff --git a/tika-batch/src/main/java/org/apache/tika/util/XMLDOMUtil.java b/tika-batch/src/main/java/org/apache/tika/util/XMLDOMUtil.java deleted file mode 100644 index d930915..0000000 --- a/tika-batch/src/main/java/org/apache/tika/util/XMLDOMUtil.java +++ /dev/null @@ -1,109 +0,0 @@ -package org.apache.tika.util; - -/* - * Licensed to the Apache Software Foundation (ASF) under one or more - * contributor license agreements. See the NOTICE file distributed with - * this work for additional information regarding copyright ownership. - * The ASF licenses this file to You under the Apache License, Version 2.0 - * (the "License"); you may not use this file except in compliance with - * the License. You may obtain a copy of the License at - * - * http://www.apache.org/licenses/LICENSE-2.0 - * - * Unless required by applicable law or agreed to in writing, software - * distributed under the License is distributed on an "AS IS" BASIS, - * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. - * See the License for the specific language governing permissions and - * limitations under the License. - */ - -import java.util.HashMap; -import java.util.Map; - -import org.w3c.dom.NamedNodeMap; -import org.w3c.dom.Node; - -public class XMLDOMUtil { - - /** - * This grabs the attributes from a dom node and overwrites those values with those - * specified by the overwrite map. 
- * - * @param node node for building - * @param overwrite map of attributes to overwrite - * @return map of attributes - */ - public static Map mapifyAttrs(Node node, Map overwrite) { - Map map = new HashMap(); - NamedNodeMap nnMap = node.getAttributes(); - for (int i = 0; i < nnMap.getLength(); i++) { - Node attr = nnMap.item(i); - map.put(attr.getNodeName(), attr.getNodeValue()); - } - if (overwrite != null) { - for (Map.Entry e : overwrite.entrySet()) { - map.put(e.getKey(), e.getValue()); - } - } - return map; - } - - - /** - * Get an int value. Try the runtime attributes first and then back off to - * the document element. Throw a RuntimeException if the attribute is not - * found or if the value is not parseable as an int. - * - * @param attrName attribute name to find - * @param runtimeAttributes runtime attributes - * @param docElement correct element that should have specified attribute - * @return specified int value - */ - public static int getInt(String attrName, Map runtimeAttributes, Node docElement) { - String stringValue = getStringValue(attrName, runtimeAttributes, docElement); - if (stringValue != null) { - try { - return Integer.parseInt(stringValue); - } catch (NumberFormatException e) { - //swallow - } - } - throw new RuntimeException("Need to specify a parseable int value in -- " - +attrName+" -- in commandline or in config file!"); - } - - - /** - * Get a long value. Try the runtime attributes first and then back off to - * the document element. Throw a RuntimeException if the attribute is not - * found or if the value is not parseable as a long. 
- * - * @param attrName attribute name to find - * @param runtimeAttributes runtime attributes - * @param docElement correct element that should have specified attribute - * @return specified long value - */ - public static long getLong(String attrName, Map runtimeAttributes, Node docElement) { - String stringValue = getStringValue(attrName, runtimeAttributes, docElement); - if (stringValue != null) { - try { - return Long.parseLong(stringValue); - } catch (NumberFormatException e) { - //swallow - } - } - throw new RuntimeException("Need to specify a \"long\" value in -- " - +attrName+" -- in commandline or in config file!"); - } - - private static String getStringValue(String attrName, Map runtimeAttributes, Node docElement) { - String stringValue = runtimeAttributes.get(attrName); - if (stringValue == null) { - Node staleNode = docElement.getAttributes().getNamedItem(attrName); - if (staleNode != null) { - stringValue = staleNode.getNodeValue(); - } - } - return stringValue; - } -} diff --git a/tika-batch/src/main/java/overview.html b/tika-batch/src/main/java/overview.html deleted file mode 100644 index 6e34a60..0000000 --- a/tika-batch/src/main/java/overview.html +++ /dev/null @@ -1,41 +0,0 @@ - - - - - - Tika Batch Module - - - -

-    The Batch Module for Apache Tika
-
-    The batch module is new to Tika in 1.8. The goal is to enable robust
-    batch processing, with extensibility and logging.
-
-    This module currently enables file system directory to directory processing.
-    To build out other interfaces, follow the example of BasicTikaFSConsumer and
-    extend FileResourceConsumer.
-
-    NOTE: This package is new and experimental and is subject to suddenly change
-    in the next release.
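The wiring that the deleted builders set up (an ArrayBlockingQueue of FileResource objects, a PoisonFileResource per consumer, and an ExecutorCompletionService collecting per-thread counts, as in StrawManTikaAppDriver) follows a standard poison-pill shutdown pattern. A minimal self-contained sketch of that pattern, not using the Tika classes — the class and method names here are invented for illustration:

```java
import java.util.Arrays;
import java.util.List;
import java.util.concurrent.ArrayBlockingQueue;
import java.util.concurrent.BlockingQueue;
import java.util.concurrent.ExecutorCompletionService;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;

public class PoisonPillDemo {

    // Stand-in for PoisonFileResource: a sentinel that tells a consumer to stop.
    static final String POISON = "\u0000POISON";

    /** Drains the queue until the poison marker is taken; returns files "processed". */
    static int consume(BlockingQueue<String> queue) throws InterruptedException {
        int processed = 0;
        while (true) {
            String resource = queue.take();
            if (POISON.equals(resource)) {
                return processed; // clean shutdown signal
            }
            processed++; // a real consumer would parse the resource here
        }
    }

    /** Enqueues the files plus one poison marker per consumer, then sums the counts. */
    static int run(List<String> files, int numConsumers) throws Exception {
        BlockingQueue<String> queue =
                new ArrayBlockingQueue<>(files.size() + numConsumers);
        queue.addAll(files);
        for (int i = 0; i < numConsumers; i++) {
            queue.add(POISON); // each consumer swallows exactly one marker
        }
        ExecutorService pool = Executors.newFixedThreadPool(numConsumers);
        ExecutorCompletionService<Integer> completion =
                new ExecutorCompletionService<>(pool);
        for (int i = 0; i < numConsumers; i++) {
            completion.submit(() -> consume(queue));
        }
        int total = 0;
        for (int i = 0; i < numConsumers; i++) {
            total += completion.take().get(); // block until each consumer finishes
        }
        pool.shutdown();
        return total;
    }

    public static void main(String[] args) throws Exception {
        System.out.println("processed " + run(Arrays.asList("a.pdf", "b.doc", "c.xml"), 2));
    }
}
```

Adding exactly one marker per consumer is the important detail: a consumer exits only after taking its own poison, so no work item is left stranded and no consumer blocks forever on an empty queue.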

    - - - \ No newline at end of file diff --git a/tika-batch/src/main/resources/org/apache/tika/batch/fs/default-tika-batch-config.xml b/tika-batch/src/main/resources/org/apache/tika/batch/fs/default-tika-batch-config.xml deleted file mode 100644 index 394c458..0000000 --- a/tika-batch/src/main/resources/org/apache/tika/batch/fs/default-tika-batch-config.xml +++ /dev/null @@ -1,127 +0,0 @@ - - - - - - - - - - - - - - - - - - - - - - - - - - - - \ No newline at end of file diff --git a/tika-batch/src/test/java/org/apache/tika/batch/CommandLineParserBuilderTest.java b/tika-batch/src/test/java/org/apache/tika/batch/CommandLineParserBuilderTest.java deleted file mode 100644 index 12be8a8..0000000 --- a/tika-batch/src/test/java/org/apache/tika/batch/CommandLineParserBuilderTest.java +++ /dev/null @@ -1,39 +0,0 @@ -package org.apache.tika.batch; - -/* - * Licensed to the Apache Software Foundation (ASF) under one or more - * contributor license agreements. See the NOTICE file distributed with - * this work for additional information regarding copyright ownership. - * The ASF licenses this file to You under the Apache License, Version 2.0 - * (the "License"); you may not use this file except in compliance with - * the License. You may obtain a copy of the License at - * - * http://www.apache.org/licenses/LICENSE-2.0 - * - * Unless required by applicable law or agreed to in writing, software - * distributed under the License is distributed on an "AS IS" BASIS, - * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. - * See the License for the specific language governing permissions and - * limitations under the License. 
- */ - -import java.io.InputStream; - -import org.apache.commons.cli.Options; -import org.apache.tika.batch.builders.CommandLineParserBuilder; -import org.apache.tika.batch.fs.FSBatchTestBase; -import org.junit.Test; - - -public class CommandLineParserBuilderTest extends FSBatchTestBase { - - @Test - public void testBasic() throws Exception { - try (InputStream is = this.getClass().getResourceAsStream("/tika-batch-config-test.xml")) { - CommandLineParserBuilder builder = new CommandLineParserBuilder(); - Options options = builder.build(is); - //TODO: insert actual tests :) - } - - } -} diff --git a/tika-batch/src/test/java/org/apache/tika/batch/RecursiveParserWrapperFSConsumerTest.java b/tika-batch/src/test/java/org/apache/tika/batch/RecursiveParserWrapperFSConsumerTest.java deleted file mode 100644 index dd8e7a1..0000000 --- a/tika-batch/src/test/java/org/apache/tika/batch/RecursiveParserWrapperFSConsumerTest.java +++ /dev/null @@ -1,149 +0,0 @@ -/* - * Licensed to the Apache Software Foundation (ASF) under one or more - * contributor license agreements. See the NOTICE file distributed with - * this work for additional information regarding copyright ownership. - * The ASF licenses this file to You under the Apache License, Version 2.0 - * (the "License"); you may not use this file except in compliance with - * the License. You may obtain a copy of the License at - * - * http://www.apache.org/licenses/LICENSE-2.0 - * - * Unless required by applicable law or agreed to in writing, software - * distributed under the License is distributed on an "AS IS" BASIS, - * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. - * See the License for the specific language governing permissions and - * limitations under the License. 
- */ -package org.apache.tika.batch; - -import static java.nio.charset.StandardCharsets.UTF_8; -import static org.junit.Assert.assertEquals; - -import java.io.ByteArrayInputStream; -import java.io.ByteArrayOutputStream; -import java.io.IOException; -import java.io.InputStream; -import java.io.InputStreamReader; -import java.io.OutputStream; -import java.util.ArrayList; -import java.util.List; -import java.util.concurrent.ArrayBlockingQueue; - -import org.apache.tika.TikaTest; -import org.apache.tika.batch.fs.RecursiveParserWrapperFSConsumer; -import org.apache.tika.config.TikaConfig; -import org.apache.tika.metadata.Metadata; -import org.apache.tika.metadata.TikaCoreProperties; -import org.apache.tika.metadata.serialization.JsonMetadataList; -import org.apache.tika.parser.RecursiveParserWrapper; -import org.apache.tika.sax.BasicContentHandlerFactory; -import org.junit.Test; - -public class RecursiveParserWrapperFSConsumerTest extends TikaTest { - - - @Test - public void testEmbeddedWithNPE() throws Exception { - final String path = "/test-documents/embedded_with_npe.xml"; - final Metadata metadata = new Metadata(); - metadata.add(Metadata.RESOURCE_NAME_KEY, "embedded_with_npe.xml"); - - ArrayBlockingQueue<FileResource> queue = new ArrayBlockingQueue<FileResource>(2); - queue.add(new FileResource() { - - @Override - public String getResourceId() { - return "testFile"; - } - - @Override - public Metadata getMetadata() { - return metadata; - } - - @Override - public InputStream openInputStream() throws IOException { - return this.getClass().getResourceAsStream(path); - } - }); - queue.add(new PoisonFileResource()); - - MockOSFactory mockOSFactory = new MockOSFactory(); - RecursiveParserWrapperFSConsumer consumer = new RecursiveParserWrapperFSConsumer( - queue, new AutoDetectParserFactory(), new BasicContentHandlerFactory(BasicContentHandlerFactory.HANDLER_TYPE.TEXT, -1), - mockOSFactory, new TikaConfig()); - - IFileProcessorFutureResult result = consumer.call(); -
mockOSFactory.getStreams().get(0).flush(); - byte[] bytes = mockOSFactory.getStreams().get(0).toByteArray(); - List<Metadata> results = JsonMetadataList.fromJson(new InputStreamReader(new ByteArrayInputStream(bytes), UTF_8)); - - assertEquals(4, results.size()); - assertContains("another null pointer", - results.get(2).get(RecursiveParserWrapper.EMBEDDED_EXCEPTION)); - - assertEquals("Nikolai Lobachevsky", results.get(0).get("author")); - for (int i = 1; i < 4; i++) { - assertEquals("embeddedAuthor"+i, results.get(i).get("author")); - assertContains("some_embedded_content"+i, results.get(i).get(RecursiveParserWrapper.TIKA_CONTENT)); - } - } - - @Test - public void testEmbeddedThenNPE() throws Exception { - final String path = "/test-documents/embedded_then_npe.xml"; - final Metadata metadata = new Metadata(); - metadata.add(Metadata.RESOURCE_NAME_KEY, "embedded_then_npe.xml"); - - ArrayBlockingQueue<FileResource> queue = new ArrayBlockingQueue<FileResource>(2); - queue.add(new FileResource() { - - @Override - public String getResourceId() { - return "testFile"; - } - - @Override - public Metadata getMetadata() { - return metadata; - } - - @Override - public InputStream openInputStream() throws IOException { - return this.getClass().getResourceAsStream(path); - } - }); - queue.add(new PoisonFileResource()); - - MockOSFactory mockOSFactory = new MockOSFactory(); - RecursiveParserWrapperFSConsumer consumer = new RecursiveParserWrapperFSConsumer( - queue, new AutoDetectParserFactory(), new BasicContentHandlerFactory(BasicContentHandlerFactory.HANDLER_TYPE.TEXT, -1), - mockOSFactory, new TikaConfig()); - - IFileProcessorFutureResult result = consumer.call(); - mockOSFactory.getStreams().get(0).flush(); - byte[] bytes = mockOSFactory.getStreams().get(0).toByteArray(); - List<Metadata> results = JsonMetadataList.fromJson(new InputStreamReader(new ByteArrayInputStream(bytes), UTF_8)); - assertEquals(2, results.size()); - assertContains("another null pointer", -
results.get(0).get(TikaCoreProperties.TIKA_META_EXCEPTION_PREFIX + "runtime")); - assertEquals("Nikolai Lobachevsky", results.get(0).get("author")); - assertEquals("embeddedAuthor", results.get(1).get("author")); - assertContains("some_embedded_content", results.get(1).get(RecursiveParserWrapper.TIKA_CONTENT)); - } - - - - private class MockOSFactory implements OutputStreamFactory { - List<ByteArrayOutputStream> streams = new ArrayList<>(); - @Override - public OutputStream getOutputStream(Metadata metadata) throws IOException { - ByteArrayOutputStream bos = new ByteArrayOutputStream(); - streams.add(bos); - return bos; - } - public List<ByteArrayOutputStream> getStreams() { - return streams; - } - } -} diff --git a/tika-batch/src/test/java/org/apache/tika/batch/fs/BatchDriverTest.java b/tika-batch/src/test/java/org/apache/tika/batch/fs/BatchDriverTest.java deleted file mode 100644 index 8c85fb9..0000000 --- a/tika-batch/src/test/java/org/apache/tika/batch/fs/BatchDriverTest.java +++ /dev/null @@ -1,210 +0,0 @@ -package org.apache.tika.batch.fs; - -/* - * Licensed to the Apache Software Foundation (ASF) under one or more - * contributor license agreements. See the NOTICE file distributed with - * this work for additional information regarding copyright ownership. - * The ASF licenses this file to You under the Apache License, Version 2.0 - * (the "License"); you may not use this file except in compliance with - * the License. You may obtain a copy of the License at - * - * http://www.apache.org/licenses/LICENSE-2.0 - * - * Unless required by applicable law or agreed to in writing, software - * distributed under the License is distributed on an "AS IS" BASIS, - * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. - * See the License for the specific language governing permissions and - * limitations under the License.
- */ - -import static java.nio.charset.StandardCharsets.UTF_8; -import static org.junit.Assert.assertEquals; -import static org.junit.Assert.assertFalse; -import static org.junit.Assert.assertTrue; - -import java.nio.file.Files; -import java.nio.file.Path; -import java.util.Arrays; -import java.util.HashMap; -import java.util.Map; - -import org.apache.tika.batch.BatchProcessDriverCLI; -import org.junit.Test; - - -public class BatchDriverTest extends FSBatchTestBase { - - //for debugging, turn logging off/on via resources/log4j.properties for the driver - //and log4j_process.properties for the process. - - @Test(timeout = 15000) - public void oneHeavyHangTest() throws Exception { - //batch runner hits one heavy hang file, keep going - Path outputDir = getNewOutputDir("daemon-"); - assertTrue(Files.isDirectory(outputDir)); - //make sure output directory is empty! - assertEquals(0, countChildren(outputDir)); - - String[] args = getDefaultCommandLineArgsArr("one_heavy_hang", outputDir, null); - BatchProcessDriverCLI driver = getNewDriver("/tika-batch-config-test.xml", args); - driver.execute(); - - assertEquals(0, driver.getNumRestarts()); - assertFalse(driver.getUserInterrupted()); - assertEquals(5, countChildren(outputDir)); - - assertContains("first test file", - readFileToString(outputDir.resolve("test2_ok.xml.xml"), UTF_8)); - } - - @Test(timeout = 30000) - public void restartOnFullHangTest() throws Exception { - //batch runner hits more heavy hangs than threads; needs to restart - Path outputDir = getNewOutputDir("daemon-"); - - //make sure output directory is empty! 
- assertEquals(0, countChildren(outputDir)); - - String[] args = getDefaultCommandLineArgsArr("heavy_heavy_hangs", outputDir, null); - BatchProcessDriverCLI driver = getNewDriver("/tika-batch-config-test.xml", args); - driver.execute(); - //could be one or two depending on timing - assertTrue(driver.getNumRestarts() > 0); - assertFalse(driver.getUserInterrupted()); - assertContains("first test file", - readFileToString(outputDir.resolve("test6_ok.xml.xml"), UTF_8)); - } - - @Test(timeout = 15000) - public void noRestartTest() throws Exception { - Path outputDir = getNewOutputDir("daemon-"); - - //make sure output directory is empty! - assertEquals(0, countChildren(outputDir)); - - String[] args = getDefaultCommandLineArgsArr("no_restart", outputDir, null); - String[] mod = Arrays.copyOf(args, args.length + 2); - mod[args.length] = "-numConsumers"; - mod[args.length+1] = "1"; - - BatchProcessDriverCLI driver = getNewDriver("/tika-batch-config-test.xml", mod); - driver.execute(); - assertEquals(0, driver.getNumRestarts()); - assertFalse(driver.getUserInterrupted()); - assertEquals(2, countChildren(outputDir)); - Path test2 = outputDir.resolve("test2_norestart.xml.xml"); - assertTrue("test2_norestart.xml", Files.exists(test2)); - Path test3 = outputDir.resolve("test3_ok.xml.xml"); - assertFalse("test3_ok.xml", Files.exists(test3)); - } - - @Test(timeout = 15000) - public void restartOnOOMTest() throws Exception { - //batch runner hits more heavy hangs than threads; needs to restart - Path outputDir = getNewOutputDir("daemon-"); - - //make sure output directory is empty! 
- assertEquals(0, countChildren(outputDir)); - - String[] args = getDefaultCommandLineArgsArr("oom", outputDir, null); - BatchProcessDriverCLI driver = getNewDriver("/tika-batch-config-test.xml", args); - driver.execute(); - assertEquals(1, driver.getNumRestarts()); - assertFalse(driver.getUserInterrupted()); - assertContains("first test file", - readFileToString(outputDir.resolve("test2_ok.xml.xml"), UTF_8)); - } - - @Test(timeout = 30000) - public void allHeavyHangsTestWithStarvedCrawler() throws Exception { - //this tests that if all consumers are hung and the crawler is - //waiting to add to the queue, there isn't deadlock. The BatchProcess should - //just shutdown, and the driver should restart - Path outputDir = getNewOutputDir("allHeavyHangsStarvedCrawler-"); - Map args = new HashMap<>(); - args.put("-numConsumers", "2"); - args.put("-maxQueueSize", "2"); - String[] commandLine = getDefaultCommandLineArgsArr("heavy_heavy_hangs", outputDir, args); - BatchProcessDriverCLI driver = getNewDriver("/tika-batch-config-test.xml", commandLine); - driver.execute(); - assertEquals(3, driver.getNumRestarts()); - assertFalse(driver.getUserInterrupted()); - assertContains("first test file", - readFileToString(outputDir.resolve("test6_ok.xml.xml"), UTF_8)); - } - - @Test(timeout = 30000) - public void maxRestarts() throws Exception { - //tests that maxRestarts works - //if -maxRestarts is not correctly removed from the commandline, - //FSBatchProcessCLI's cli parser will throw an Unrecognized option exception - - Path outputDir = getNewOutputDir("allHeavyHangsStarvedCrawler-"); - Map args = new HashMap<>(); - args.put("-numConsumers", "1"); - args.put("-maxQueueSize", "10"); - args.put("-maxRestarts", "2"); - - String[] commandLine = getDefaultCommandLineArgsArr("max_restarts", outputDir, args); - - BatchProcessDriverCLI driver = getNewDriver("/tika-batch-config-test.xml", commandLine); - driver.execute(); - assertEquals(2, driver.getNumRestarts()); - 
assertFalse(driver.getUserInterrupted()); - assertEquals(3, countChildren(outputDir)); - } - - @Test(timeout = 30000) - public void maxRestartsBadParameter() throws Exception { - //tests that maxRestarts must be followed by an Integer - Path outputDir = getNewOutputDir("allHeavyHangsStarvedCrawler-"); - Map args = new HashMap<>(); - args.put("-numConsumers", "1"); - args.put("-maxQueueSize", "10"); - args.put("-maxRestarts", "zebra"); - - String[] commandLine = getDefaultCommandLineArgsArr("max_restarts", outputDir, args); - boolean ex = false; - try { - BatchProcessDriverCLI driver = getNewDriver("/tika-batch-config-test.xml", commandLine); - driver.execute(); - } catch (IllegalArgumentException e) { - ex = true; - } - assertTrue("IllegalArgumentException should have been thrown", ex); - } - - @Test(timeout = 30000) - public void testNoRestartIfProcessFails() throws Exception { - //tests that if something goes horribly wrong with FSBatchProcessCLI - //the driver will not restart it again and again - //this calls a bad xml file which should trigger a no restart exit. 
- Path outputDir = getNewOutputDir("nostart-norestart-"); - Map args = new HashMap<>(); - args.put("-numConsumers", "1"); - args.put("-maxQueueSize", "10"); - - String[] commandLine = getDefaultCommandLineArgsArr("basic", outputDir, args); - BatchProcessDriverCLI driver = getNewDriver("/tika-batch-config-broken.xml", commandLine); - driver.execute(); - assertEquals(0, countChildren(outputDir)); - assertEquals(0, driver.getNumRestarts()); - } - - @Test(timeout = 30000) - public void testNoRestartIfProcessFailsTake2() throws Exception { - Path outputDir = getNewOutputDir("nostart-norestart-"); - Map args = new HashMap<>(); - args.put("-numConsumers", "1"); - args.put("-maxQueueSize", "10"); - args.put("-somethingOrOther", "I don't Know"); - - String[] commandLine = getDefaultCommandLineArgsArr("basic", outputDir, args); - BatchProcessDriverCLI driver = getNewDriver("/tika-batch-config-test.xml", commandLine); - driver.execute(); - assertEquals(0, countChildren(outputDir)); - assertEquals(0, driver.getNumRestarts()); - } - - -} diff --git a/tika-batch/src/test/java/org/apache/tika/batch/fs/BatchProcessTest.java b/tika-batch/src/test/java/org/apache/tika/batch/fs/BatchProcessTest.java deleted file mode 100644 index 68d606d..0000000 --- a/tika-batch/src/test/java/org/apache/tika/batch/fs/BatchProcessTest.java +++ /dev/null @@ -1,369 +0,0 @@ -package org.apache.tika.batch.fs; -/* - * Licensed to the Apache Software Foundation (ASF) under one or more - * contributor license agreements. See the NOTICE file distributed with - * this work for additional information regarding copyright ownership. - * The ASF licenses this file to You under the Apache License, Version 2.0 - * (the "License"); you may not use this file except in compliance with - * the License. 
You may obtain a copy of the License at - * - * http://www.apache.org/licenses/LICENSE-2.0 - * - * Unless required by applicable law or agreed to in writing, software - * distributed under the License is distributed on an "AS IS" BASIS, - * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. - * See the License for the specific language governing permissions and - * limitations under the License. - */ - - -import static java.nio.charset.StandardCharsets.UTF_8; -import static org.junit.Assert.assertEquals; -import static org.junit.Assert.assertFalse; -import static org.junit.Assert.assertTrue; -import static org.junit.Assert.fail; - -import java.io.IOException; -import java.nio.file.Files; -import java.nio.file.Path; -import java.nio.file.Paths; -import java.util.List; -import java.util.Map; - -import org.apache.tika.batch.BatchProcess; -import org.apache.tika.batch.BatchProcessDriverCLI; -import org.junit.Test; - -public class BatchProcessTest extends FSBatchTestBase { - - @Test(timeout = 15000) - public void oneHeavyHangTest() throws Exception { - - Path outputDir = getNewOutputDir("one_heavy_hang-"); - - Map args = getDefaultArgs("one_heavy_hang", outputDir); - BatchProcessTestExecutor ex = new BatchProcessTestExecutor(args); - StreamStrings streamStrings = ex.execute(); - assertEquals(5, countChildren(outputDir)); - Path hvyHang = outputDir.resolve("test0_heavy_hang.xml.xml"); - assertTrue(Files.exists(hvyHang)); - assertEquals(0, Files.size(hvyHang)); - assertNotContained(BatchProcess.BATCH_CONSTANTS.BATCH_PROCESS_FATAL_MUST_RESTART.toString(), - streamStrings.getErrString()); - } - - - @Test(timeout = 15000) - public void allHeavyHangsTest() throws Exception { - //each of the three threads hits a heavy hang. The BatchProcess runs into - //all timedouts and shuts down. 
- Path outputDir = getNewOutputDir("allHeavyHangs-"); - Map args = getDefaultArgs("heavy_heavy_hangs", outputDir); - BatchProcessTestExecutor ex = new BatchProcessTestExecutor(args); - StreamStrings streamStrings = ex.execute(); - - assertEquals(3, countChildren(outputDir)); - for (Path hvyHang : listPaths(outputDir)){ - assertTrue(Files.exists(hvyHang)); - assertEquals("file length for "+hvyHang.getFileName()+" should be 0, but is: " + - Files.size(hvyHang), - 0, Files.size(hvyHang)); - } - assertContains(BatchProcess.BATCH_CONSTANTS.BATCH_PROCESS_FATAL_MUST_RESTART.toString(), - streamStrings.getErrString()); - } - - @Test(timeout = 30000) - public void allHeavyHangsTestWithCrazyNumberConsumersTest() throws Exception { - Path outputDir = getNewOutputDir("allHeavyHangsCrazyNumberConsumers-"); - Map args = getDefaultArgs("heavy_heavy_hangs", outputDir); - args.put("numConsumers", "100"); - BatchProcessTestExecutor ex = new BatchProcessTestExecutor(args); - StreamStrings streamStrings = ex.execute(); - assertEquals(7, countChildren(outputDir)); - - for (int i = 0; i < 6; i++){ - Path hvyHang = outputDir.resolve("test"+i+"_heavy_hang.xml.xml"); - assertTrue(Files.exists(hvyHang)); - assertEquals(0, Files.size(hvyHang)); - } - assertContains("This is tika-batch's first test file", - readFileToString(outputDir.resolve("test6_ok.xml.xml"), UTF_8)); - - //key that the process realize that there were no more processable files - //in the queue and does not ask for a restart! - assertNotContained(BatchProcess.BATCH_CONSTANTS.BATCH_PROCESS_FATAL_MUST_RESTART.toString(), - streamStrings.getErrString()); - } - - @Test(timeout = 30000) - public void allHeavyHangsTestWithStarvedCrawler() throws Exception { - //this tests that if all consumers are hung and the crawler is - //waiting to add to the queue, there isn't deadlock. The batchrunner should - //shutdown and ask to be restarted. 
- Path outputDir = getNewOutputDir("allHeavyHangsStarvedCrawler-"); - Map args = getDefaultArgs("heavy_heavy_hangs", outputDir); - args.put("numConsumers", "2"); - args.put("maxQueueSize", "2"); - args.put("timeoutThresholdMillis", "100000000");//make sure that the batch process doesn't time out - BatchProcessTestExecutor ex = new BatchProcessTestExecutor(args); - StreamStrings streamStrings = ex.execute(); - assertEquals(2, countChildren(outputDir)); - - for (int i = 0; i < 2; i++){ - Path hvyHang = outputDir.resolve("test"+i+"_heavy_hang.xml.xml"); - assertTrue(Files.exists(hvyHang)); - assertEquals(0, Files.size(hvyHang)); - } - assertContains(BatchProcess.BATCH_CONSTANTS.BATCH_PROCESS_FATAL_MUST_RESTART.toString(), - streamStrings.getErrString()); - assertContains("Crawler timed out", streamStrings.getErrString()); - } - - @Test(timeout = 15000) - public void outOfMemory() throws Exception { - //the first consumer should sleep for 10 seconds - //the second should be tied up in a heavy hang - //the third one should hit the oom after processing test2_ok.xml - //no consumers should process test2-4.txt! - //i.e. 
the first consumer will finish in 10 seconds and - //then otherwise would be looking for more, but the oom should prevent that - Path outputDir = getNewOutputDir("oom-"); - - Map args = getDefaultArgs("oom", outputDir); - args.put("numConsumers", "3"); - args.put("timeoutThresholdMillis", "30000"); - - BatchProcessTestExecutor ex = new BatchProcessTestExecutor(args); - StreamStrings streamStrings = ex.execute(); - - assertEquals(4, countChildren(outputDir)); - assertContains("This is tika-batch's first test file", - readFileToString(outputDir.resolve("test2_ok.xml.xml"), UTF_8)); - - assertContains(BatchProcess.BATCH_CONSTANTS.BATCH_PROCESS_FATAL_MUST_RESTART.toString(), - streamStrings.getErrString()); - } - - - - @Test(timeout = 15000) - public void noRestart() throws Exception { - Path outputDir = getNewOutputDir("no_restart"); - - Map args = getDefaultArgs("no_restart", outputDir); - args.put("numConsumers", "1"); - - BatchProcessTestExecutor ex = new BatchProcessTestExecutor(args); - - StreamStrings streamStrings = ex.execute(); - - Path test2 = outputDir.resolve("test2_norestart.xml.xml"); - assertTrue("test2_norestart.xml", Files.exists(test2)); - Path test3 = outputDir.resolve("test3_ok.xml.xml"); - assertFalse("test3_ok.xml", Files.exists(test3)); - assertContains("exitStatus="+ BatchProcessDriverCLI.PROCESS_NO_RESTART_EXIT_CODE, - streamStrings.getOutString()); - assertContains("causeForTermination='MAIN_LOOP_EXCEPTION_NO_RESTART'", - streamStrings.getOutString()); - } - - /** - * This tests to make sure that BatchProcess waits the appropriate - * amount of time on an early termination before stopping. - * - * If this fails, then interruptible parsers (e.g. those with - * nio channels) will be interrupted and there will be corrupted data. 
- */ - @Test(timeout = 60000) - public void testWaitAfterEarlyTermination() throws Exception { - Path outputDir = getNewOutputDir("wait_after_early_termination"); - - Map<String, String> args = getDefaultArgs("wait_after_early_termination", outputDir); - args.put("numConsumers", "1"); - args.put("maxAliveTimeSeconds", "5");//main process loop should stop after 5 seconds - args.put("timeoutThresholdMillis", "300000");//effectively never - args.put("pauseOnEarlyTerminationMillis", "20000");//let the parser have up to 20 seconds - - BatchProcessTestExecutor ex = new BatchProcessTestExecutor(args); - - StreamStrings streamStrings = ex.execute(); - assertEquals(1, countChildren(outputDir)); - assertContains("some content", - readFileToString(outputDir.resolve("test0_sleep.xml.xml"), UTF_8)); - - assertContains("exitStatus="+BatchProcessDriverCLI.PROCESS_RESTART_EXIT_CODE, - streamStrings.getOutString()); - assertContains("causeForTermination='BATCH_PROCESS_ALIVE_TOO_LONG'", - streamStrings.getOutString()); - } - - @Test(timeout = 60000) - public void testTimeOutAfterBeingAskedToShutdown() throws Exception { - Path outputDir = getNewOutputDir("timeout_after_early_termination"); - - Map<String, String> args = getDefaultArgs("timeout_after_early_termination", outputDir); - args.put("numConsumers", "1"); - args.put("maxAliveTimeSeconds", "5");//main process loop should stop after 5 seconds - args.put("timeoutThresholdMillis", "10000"); - args.put("pauseOnEarlyTerminationMillis", "20000");//let the parser have up to 20 seconds - - BatchProcessTestExecutor ex = new BatchProcessTestExecutor(args); - StreamStrings streamStrings = ex.execute(); - List<Path> paths = listPaths(outputDir); - assertEquals(1, paths.size()); - assertEquals(0, Files.size(paths.get(0))); - assertContains("exitStatus="+BatchProcessDriverCLI.PROCESS_RESTART_EXIT_CODE, streamStrings.getOutString()); - assertContains("causeForTermination='BATCH_PROCESS_ALIVE_TOO_LONG'", - streamStrings.getOutString()); - } - - @Test(timeout = 10000) - public void testRedirectionOfStreams() throws Exception { - //test redirection of system.err to system.out - Path outputDir = getNewOutputDir("noisy_parsers"); - - Map<String, String> args = getDefaultArgs("noisy_parsers", outputDir); - args.put("numConsumers", "1"); - args.put("maxAliveTimeSeconds", "20");//main process loop should stop after 20 seconds - - BatchProcessTestExecutor ex = new BatchProcessTestExecutor(args); - StreamStrings streamStrings = ex.execute(); - assertEquals(1, countChildren(outputDir)); - assertContains("System.out", streamStrings.getOutString()); - assertContains("System.err", streamStrings.getOutString()); - assertEquals(0, streamStrings.getErrString().length()); - - } - - @Test(timeout =
10000) - public void testConsumersManagerInitHang() throws Exception { - Path outputDir = getNewOutputDir("init_hang"); - - Map args = getDefaultArgs("noisy_parsers", outputDir); - args.put("numConsumers", "1"); - args.put("hangOnInit", "true"); - BatchProcessTestExecutor ex = new BatchProcessTestExecutor(args, "/tika-batch-config-MockConsumersBuilder.xml"); - StreamStrings streamStrings = ex.execute(); - assertEquals(BatchProcessDriverCLI.PROCESS_NO_RESTART_EXIT_CODE, ex.getExitValue()); - assertContains("causeForTermination='CONSUMERS_MANAGER_DIDNT_INIT_IN_TIME_NO_RESTART'", streamStrings.getOutString()); - } - - @Test(timeout = 10000) - public void testConsumersManagerShutdownHang() throws Exception { - Path outputDir = getNewOutputDir("shutdown_hang"); - - Map args = getDefaultArgs("noisy_parsers", outputDir); - args.put("numConsumers", "1"); - args.put("hangOnShutdown", "true"); - - BatchProcessTestExecutor ex = new BatchProcessTestExecutor(args, "/tika-batch-config-MockConsumersBuilder.xml"); - StreamStrings streamStrings = ex.execute(); - assertEquals(BatchProcessDriverCLI.PROCESS_NO_RESTART_EXIT_CODE, ex.getExitValue()); - assertContains("ConsumersManager did not shutdown within", streamStrings.getOutString()); - } - - @Test - public void testHierarchicalWFileList() throws Exception { - //tests to make sure that hierarchy is maintained when reading from - //file list - //also tests that list actually works. 
- Path outputDir = getNewOutputDir("hierarchical_file_list"); - - Map args = getDefaultArgs("hierarchical", outputDir); - args.put("numConsumers", "1"); - args.put("fileList", - Paths.get(this.getClass().getResource("/testFileList.txt").toURI()).toString()); - args.put("recursiveParserWrapper", "true"); - args.put("basicHandlerType", "text"); - args.put("outputSuffix", "json"); - BatchProcessTestExecutor ex = new BatchProcessTestExecutor(args, "/tika-batch-config-MockConsumersBuilder.xml"); - ex.execute(); - Path test1 = outputDir.resolve("test1.xml.json"); - Path test2 = outputDir.resolve("sub1a/test2.xml.json"); - Path test3 = outputDir.resolve("sub1a/sub2a/test3.xml.json"); - assertTrue("test1 exists", Files.exists(test1)); - assertTrue("test1 length > 10", Files.size(test1) > 10); - assertTrue(Files.exists(test3) && Files.size(test3) > 10); - Path test2Dir = outputDir.resolve("sub1a"); - //should be just the subdirectory, no actual test2 file - assertEquals(1, countChildren(test2Dir)); - assertFalse(Files.exists(test2)); - } - - private class BatchProcessTestExecutor { - private final Map args; - private final String configPath; - private int exitValue = Integer.MIN_VALUE; - - public BatchProcessTestExecutor(Map args) { - this(args, "/tika-batch-config-test.xml"); - } - - public BatchProcessTestExecutor(Map args, String configPath) { - this.args = args; - this.configPath = configPath; - } - - private StreamStrings execute() { - Process p = null; - try { - ProcessBuilder b = getNewBatchRunnerProcess(configPath, args); - p = b.start(); - StringStreamGobbler errorGobbler = new StringStreamGobbler(p.getErrorStream()); - StringStreamGobbler outGobbler = new StringStreamGobbler(p.getInputStream()); - Thread errorThread = new Thread(errorGobbler); - Thread outThread = new Thread(outGobbler); - errorThread.start(); - outThread.start(); - while (true) { - try { - exitValue = p.exitValue(); - break; - } catch (IllegalThreadStateException e) { - //still going; - } - } - 
errorGobbler.stopGobblingAndDie(); - outGobbler.stopGobblingAndDie(); - errorThread.interrupt(); - outThread.interrupt(); - return new StreamStrings(outGobbler.toString(), errorGobbler.toString()); - } catch (IOException e) { - fail(); - } finally { - destroyProcess(p); - } - return null; - } - - private int getExitValue() { - return exitValue; - } - - } - - private class StreamStrings { - private final String outString; - private final String errString; - - private StreamStrings(String outString, String errString) { - this.outString = outString; - this.errString = errString; - } - - private String getOutString() { - return outString; - } - - private String getErrString() { - return errString; - } - - @Override - public String toString() { - return "OUT>>"+outString+"<<\n"+ - "ERR>>"+errString+"<<\n"; - } - } -} diff --git a/tika-batch/src/test/java/org/apache/tika/batch/fs/FSBatchTestBase.java b/tika-batch/src/test/java/org/apache/tika/batch/fs/FSBatchTestBase.java deleted file mode 100644 index 9c56147..0000000 --- a/tika-batch/src/test/java/org/apache/tika/batch/fs/FSBatchTestBase.java +++ /dev/null @@ -1,301 +0,0 @@ -package org.apache.tika.batch.fs; - -/* - * Licensed to the Apache Software Foundation (ASF) under one or more - * contributor license agreements. See the NOTICE file distributed with - * this work for additional information regarding copyright ownership. - * The ASF licenses this file to You under the Apache License, Version 2.0 - * (the "License"); you may not use this file except in compliance with - * the License. You may obtain a copy of the License at - * - * http://www.apache.org/licenses/LICENSE-2.0 - * - * Unless required by applicable law or agreed to in writing, software - * distributed under the License is distributed on an "AS IS" BASIS, - * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. - * See the License for the specific language governing permissions and - * limitations under the License. 
- */ - -import java.io.BufferedReader; -import java.io.IOException; -import java.io.InputStream; -import java.net.URISyntaxException; -import java.nio.charset.Charset; -import java.nio.file.DirectoryStream; -import java.nio.file.FileVisitResult; -import java.nio.file.Files; -import java.nio.file.Path; -import java.nio.file.Paths; -import java.nio.file.SimpleFileVisitor; -import java.nio.file.attribute.BasicFileAttributes; -import java.util.ArrayList; -import java.util.HashMap; -import java.util.Iterator; -import java.util.List; -import java.util.Map; -import java.util.concurrent.ExecutorService; -import java.util.concurrent.Executors; -import java.util.concurrent.Future; -import java.util.concurrent.TimeUnit; - -import org.apache.commons.io.IOUtils; -import org.apache.tika.TikaTest; -import org.apache.tika.batch.BatchProcess; -import org.apache.tika.batch.BatchProcessDriverCLI; -import org.apache.tika.batch.ParallelFileProcessingResult; -import org.apache.tika.batch.builders.BatchProcessBuilder; -import org.junit.AfterClass; -import org.junit.BeforeClass; - -/** - * This is the base class for file-system batch tests. - *

- * There are a few areas for improvement in this test suite: - * 1. For the heavy load tests, the test cases leave behind files that - * cannot be deleted from within the same jvm. A thread is still actively writing to an - * OutputStream when tearDown() is called. The current solution is to create - * the temp dir within the target/tika-batch/test-classes so that they will at least - * be removed during each maven "clean". - * 2. The "mock" tests are time-based. This is not - * extremely reliable across different machines with different number/power of cpus. - */ -public abstract class FSBatchTestBase extends TikaTest { - - private static Path outputRoot = null; - - @BeforeClass - public static void setUp() throws Exception { - Path testOutput = Paths.get("target/test-classes/test-output"); - Files.createDirectories(testOutput); - outputRoot = Files.createTempDirectory(testOutput, "tika-batch-output-root-"); - } - - @AfterClass - public static void tearDown() throws Exception { - //not ideal, but should be ok for testing - //see caveat in TikaCLITest's textExtract - - try { - deleteDirectory(outputRoot); - } catch (IOException e) { - e.printStackTrace(); - } - } - - protected void destroyProcess(Process p) { - if (p == null) - return; - - try { - p.exitValue(); - } catch (IllegalThreadStateException e) { - p.destroy(); - } - } - - Path getNewOutputDir(String subdirPrefix) throws IOException { - Path outputDir = Files.createTempDirectory(outputRoot, subdirPrefix); - assert(countChildren(outputDir) == 0); - return outputDir; - } - - Map<String, String> getDefaultArgs(String inputSubDir, Path outputDir) throws Exception { - Map<String, String> args = new HashMap<>(); - - args.put("inputDir", "\""+getInputRoot(inputSubDir).toString()+"\""); - if (outputDir != null) { - args.put("outputDir", "\""+outputDir.toString()+"\""); - } - return args; - } - - public String[] getDefaultCommandLineArgsArr(String inputSubDir, - Path outputDir, Map<String, String> commandLine) throws Exception { - List<String> args = new ArrayList<>(); - //need to include "-" because these are going to the commandline!
- if (inputSubDir != null) { - args.add("-inputDir"); - args.add(getInputRoot(inputSubDir).toAbsolutePath().toString()); - } - if (outputDir != null) { - args.add("-outputDir"); - args.add(outputDir.toAbsolutePath().toString()); - } - if (commandLine != null) { - for (Map.Entry<String, String> e : commandLine.entrySet()) { - args.add(e.getKey()); - args.add(e.getValue()); - } - } - return args.toArray(new String[args.size()]); - } - - - public Path getInputRoot(String subdir) throws Exception { - String path = (subdir == null || subdir.length() == 0) ? "/test-input" : "/test-input/"+subdir; - return Paths.get(this.getClass().getResource(path).toURI()); - } - - BatchProcess getNewBatchRunner(String testConfig, - Map<String, String> args) throws IOException { - InputStream is = this.getClass().getResourceAsStream(testConfig); - BatchProcessBuilder b = new BatchProcessBuilder(); - BatchProcess runner = b.build(is, args); - - IOUtils.closeQuietly(is); - return runner; - } - - public ProcessBuilder getNewBatchRunnerProcess(String testConfig, Map<String, String> args) { - List<String> argList = new ArrayList<>(); - for (Map.Entry<String, String> e : args.entrySet()) { - argList.add("-"+e.getKey()); - argList.add(e.getValue()); - } - - String[] fullCommandLine = commandLine(testConfig, - argList.toArray(new String[argList.size()])); - return new ProcessBuilder(fullCommandLine); - } - - private String[] commandLine(String testConfig, String[] args) { - List<String> commandLine = new ArrayList<>(); - commandLine.add("java"); - commandLine.add("-Dlog4j.configuration=file:"+ - this.getClass().getResource("/log4j_process.properties").getFile()); - commandLine.add("-Xmx128m"); - commandLine.add("-cp"); - String cp = System.getProperty("java.class.path"); - //need to test for " " on *nix, can't just add double quotes - //across platforms.
-        if (cp.contains(" ")){
-            cp = "\""+cp+"\"";
-        }
-        commandLine.add(cp);
-        commandLine.add("org.apache.tika.batch.fs.FSBatchProcessCLI");
-
-        String configFile = null;
-        try {
-            configFile = Paths.get(this.getClass().getResource(testConfig).toURI()).toAbsolutePath().toString();
-        } catch (URISyntaxException e) {
-            e.printStackTrace();
-        }
-
-        commandLine.add("-bc");
-        commandLine.add(configFile);
-
-        for (String s : args) {
-            commandLine.add(s);
-        }
-        return commandLine.toArray(new String[commandLine.size()]);
-    }
-
-    public BatchProcessDriverCLI getNewDriver(String testConfig,
-            String[] args) throws Exception {
-        List<String> commandLine = new ArrayList<>();
-        commandLine.add("java");
-        commandLine.add("-Xmx128m");
-        commandLine.add("-cp");
-        String cp = System.getProperty("java.class.path");
-        //need to test for " " on *nix, can't just add double quotes
-        //across platforms.
-        if (cp.contains(" ")){
-            cp = "\""+cp+"\"";
-        }
-        commandLine.add(cp);
-        commandLine.add("org.apache.tika.batch.fs.FSBatchProcessCLI");
-
-        String configFile = Paths.get(
-                this.getClass().getResource(testConfig).toURI()).toAbsolutePath().toString();
-        commandLine.add("-bc");
-
-        commandLine.add(configFile);
-
-        for (String s : args) {
-            commandLine.add(s);
-        }
-
-        BatchProcessDriverCLI driver = new BatchProcessDriverCLI(
-                commandLine.toArray(new String[commandLine.size()]));
-        driver.setRedirectChildProcessToStdOut(false);
-        return driver;
-    }
-
-    protected ParallelFileProcessingResult run(BatchProcess process) throws Exception {
-        ExecutorService executor = Executors.newSingleThreadExecutor();
-        Future<ParallelFileProcessingResult> futureResult = executor.submit(process);
-        return futureResult.get(10, TimeUnit.SECONDS);
-    }
-
-    /**
-     * Counts immediate children only, does not work recursively
-     * @param p
-     * @return
-     * @throws IOException
-     */
-    public static int countChildren(Path p) throws IOException {
-        int i = 0;
-        try (DirectoryStream<Path> ds = Files.newDirectoryStream(p)) {
-            Iterator<Path> it = ds.iterator();
-            while (it.hasNext()) {
-                i++;
-                it.next();
-            }
-        }
-        return i;
-    }
-
-    //REMOVE THIS AND USE FileUtils, once a java 7 option has been added.
-    public static String readFileToString(Path p, Charset cs) throws IOException {
-        StringBuilder sb = new StringBuilder();
-        try (BufferedReader r = Files.newBufferedReader(p, cs)) {
-            String line = r.readLine();
-            while (line != null) {
-                sb.append(line).append("\n");
-                line = r.readLine();
-            }
-        }
-        return sb.toString();
-    }
-
-    //TODO: move this into FileUtils
-    public static void deleteDirectory(Path dir) throws IOException {
-        Files.walkFileTree(dir, new SimpleFileVisitor<Path>() {
-            @Override
-            public FileVisitResult visitFile(Path file,
-                    BasicFileAttributes attrs) throws IOException {
-                Files.delete(file);
-                return FileVisitResult.CONTINUE;
-            }
-
-            @Override
-            public FileVisitResult postVisitDirectory(Path dir,
-                    IOException exc) throws IOException {
-                Files.delete(dir);
-                return FileVisitResult.CONTINUE;
-            }
-
-        });
-    }
-
-    /**
-     * helper method equivalent to File#listFiles()
-     * grabs children only, does not walk recursively
-     * @param p
-     * @return
-     */
-    public static List<Path> listPaths(Path p) throws IOException {
-        List<Path> list = new ArrayList<>();
-        try (DirectoryStream<Path> ds = Files.newDirectoryStream(p)) {
-            Iterator<Path> it = ds.iterator();
-            while (it.hasNext()) {
-                list.add(it.next());
-            }
-        }
-        return list;
-    }
-}
diff --git a/tika-batch/src/test/java/org/apache/tika/batch/fs/FSFileResourceTest.java b/tika-batch/src/test/java/org/apache/tika/batch/fs/FSFileResourceTest.java
deleted file mode 100644
index e890249..0000000
--- a/tika-batch/src/test/java/org/apache/tika/batch/fs/FSFileResourceTest.java
+++ /dev/null
@@ -1,49 +0,0 @@
-/*
- * Licensed to the Apache Software Foundation (ASF) under one or more
- * contributor license agreements.  See the NOTICE file distributed with
- * this work for additional information regarding copyright ownership.
- * The ASF licenses this file to You under the Apache License, Version 2.0
- * (the "License"); you may not use this file except in compliance with
- * the License.  You may obtain a copy of the License at
- *
- *     http://www.apache.org/licenses/LICENSE-2.0
- *
- * Unless required by applicable law or agreed to in writing, software
- * distributed under the License is distributed on an "AS IS" BASIS,
- * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
- * See the License for the specific language governing permissions and
- * limitations under the License.
- */
-
-package org.apache.tika.batch.fs;
-
-import static org.junit.Assert.assertTrue;
-import static org.junit.Assert.fail;
-
-import java.nio.file.Path;
-import java.nio.file.Paths;
-
-import org.junit.Test;
-
-public class FSFileResourceTest {
-    @Test
-    public void testRelativization() throws Exception {
-        //test assertion error if alleged child is not actually child
-        Path root = Paths.get("root/abc/def").toAbsolutePath();
-        Path allegedChild = Paths.get(root.getParent().getParent().toAbsolutePath().toString());
-        try {
-            FSFileResource r = new FSFileResource(root, allegedChild);
-            fail("should have had assertion error: alleged child not actually child of root");
-        } catch (AssertionError e) {
-
-        }
-
-        //test regular workings
-        root = Paths.get("root/abc/def");
-        Path child = Paths.get(root.toString(), "ghi/jkl/lmnop.doc");
-        FSFileResource r = new FSFileResource(root, child);
-        String id = r.getResourceId();
-        assertTrue(id.startsWith("ghi"));
-        assertTrue(id.endsWith("lmnop.doc"));
-    }
-}
diff --git a/tika-batch/src/test/java/org/apache/tika/batch/fs/FSUtilTest.java b/tika-batch/src/test/java/org/apache/tika/batch/fs/FSUtilTest.java
deleted file mode 100644
index f0377c6..0000000
--- a/tika-batch/src/test/java/org/apache/tika/batch/fs/FSUtilTest.java
+++ /dev/null
@@ -1,49 +0,0 @@
-/*
- * Licensed to the Apache Software Foundation (ASF) under one or more
- * contributor license agreements.  See the NOTICE file distributed with
- * this work for additional information regarding copyright ownership.
- * The ASF licenses this file to You under the Apache License, Version 2.0
- * (the "License"); you may not use this file except in compliance with
- * the License.  You may obtain a copy of the License at
- *
- *     http://www.apache.org/licenses/LICENSE-2.0
- *
- * Unless required by applicable law or agreed to in writing, software
- * distributed under the License is distributed on an "AS IS" BASIS,
- * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
- * See the License for the specific language governing permissions and
- * limitations under the License.
- */
-package org.apache.tika.batch.fs;
-
-import static org.junit.Assert.assertTrue;
-
-import java.nio.file.Path;
-import java.nio.file.Paths;
-
-import org.junit.Test;
-
-public class FSUtilTest {
-
-    @Test
-    public void testSafeResolution() throws Exception {
-        Path cwd = Paths.get(".");
-        String windows = "C:/temp/file.txt";
-        String linux = "/root/dir/file.txt";
-        boolean ex = false;
-        try {
-            FSUtil.resolveRelative(cwd, windows);
-        } catch (IllegalArgumentException e) {
-            ex = true;
-        }
-
-        try {
-            FSUtil.resolveRelative(cwd, linux);
-        } catch (IllegalArgumentException e) {
-            ex = true;
-        }
-
-        assertTrue("IllegalArgumentException should have been thrown", ex);
-    }
-
-}
diff --git a/tika-batch/src/test/java/org/apache/tika/batch/fs/HandlerBuilderTest.java b/tika-batch/src/test/java/org/apache/tika/batch/fs/HandlerBuilderTest.java
deleted file mode 100644
index d8aecad..0000000
--- a/tika-batch/src/test/java/org/apache/tika/batch/fs/HandlerBuilderTest.java
+++ /dev/null
@@ -1,120 +0,0 @@
-package org.apache.tika.batch.fs;
-
-/*
- * Licensed to the Apache Software Foundation (ASF) under one or more
- * contributor license agreements.  See the NOTICE file distributed with
- * this work for additional information regarding copyright ownership.
- * The ASF licenses this file to You under the Apache License, Version 2.0
- * (the "License"); you may not use this file except in compliance with
- * the License.  You may obtain a copy of the License at
- *
- *     http://www.apache.org/licenses/LICENSE-2.0
- *
- * Unless required by applicable law or agreed to in writing, software
- * distributed under the License is distributed on an "AS IS" BASIS,
- * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
- * See the License for the specific language governing permissions and
- * limitations under the License.
- */
-
-import static java.nio.charset.StandardCharsets.UTF_8;
-import static org.junit.Assert.assertFalse;
-import static org.junit.Assert.assertTrue;
-
-import java.nio.file.Path;
-import java.util.Map;
-
-import org.apache.tika.batch.BatchProcess;
-import org.apache.tika.batch.ParallelFileProcessingResult;
-import org.junit.Test;
-
-public class HandlerBuilderTest extends FSBatchTestBase {
-
-    @Test
-    public void testXML() throws Exception {
-
-        Path outputDir = getNewOutputDir("handler-xml-");
-        Map<String, String> args = getDefaultArgs("basic", outputDir);
-        args.put("basicHandlerType", "xml");
-        args.put("outputSuffix", "xml");
-
-        BatchProcess runner = getNewBatchRunner("/tika-batch-config-test.xml", args);
-        ParallelFileProcessingResult result = run(runner);
-        Path outputFile = outputDir.resolve("test0.xml.xml");
-        String resultString = readFileToString(outputFile, UTF_8);
-        assertTrue(resultString.contains(""));
-        assertTrue(resultString.contains(""));
-        assertTrue(resultString.contains("This is tika-batch's first test file"));
-    }
-
-
-    @Test
-    public void testHTML() throws Exception {
-        Path outputDir = getNewOutputDir("handler-html-");
-
-        Map<String, String> args = getDefaultArgs("basic", outputDir);
-        args.put("basicHandlerType", "html");
-        args.put("outputSuffix", "html");
-        BatchProcess runner = getNewBatchRunner("/tika-batch-config-test.xml", args);
-        ParallelFileProcessingResult result = run(runner);
-        Path outputFile = outputDir.resolve("test0.xml.html");
-        String resultString = readFileToString(outputFile, UTF_8);
-        assertTrue(resultString.contains(""));
-        assertFalse(resultString.contains(""));
-        assertTrue(resultString.contains("This is tika-batch's first test file"));
-    }
-
-    @Test
-    public void testText() throws Exception {
-        Path outputDir = getNewOutputDir("handler-txt-");
-
-        Map<String, String> args = getDefaultArgs("basic", outputDir);
-        args.put("basicHandlerType", "txt");
-        args.put("outputSuffix", "txt");
-
-        BatchProcess runner = getNewBatchRunner("/tika-batch-config-test.xml", args);
-        ParallelFileProcessingResult result = run(runner);
-        Path outputFile = outputDir.resolve("test0.xml.txt");
-        String resultString = readFileToString(outputFile, UTF_8);
-        assertFalse(resultString.contains(""));
-        assertFalse(resultString.contains(""));
-        assertTrue(resultString.contains("This is tika-batch's first test file"));
-    }
-
-
-    @Test
-    public void testXMLWithWriteLimit() throws Exception {
-        Path outputDir = getNewOutputDir("handler-xml-write-limit-");
-
-        Map<String, String> args = getDefaultArgs("basic", outputDir);
-        args.put("writeLimit", "5");
-
-        BatchProcess runner = getNewBatchRunner("/tika-batch-config-test.xml", args);
-        ParallelFileProcessingResult result = run(runner);
-
-        Path outputFile = outputDir.resolve("test0.xml.xml");
-        String resultString = readFileToString(outputFile, UTF_8);
-        //this is not ideal. How can we change handlers to writeout whatever
-        //they've gotten so far, up to the writeLimit?
-        assertTrue(resultString.equals(""));
-    }
-
-    @Test
-    public void testRecursiveParserWrapper() throws Exception {
-        Path outputDir = getNewOutputDir("handler-recursive-parser");
-
-        Map<String, String> args = getDefaultArgs("basic", outputDir);
-        args.put("basicHandlerType", "txt");
-        args.put("outputSuffix", "json");
-        args.put("recursiveParserWrapper", "true");
-
-        BatchProcess runner = getNewBatchRunner("/tika-batch-config-test.xml", args);
-        ParallelFileProcessingResult result = run(runner);
-        Path outputFile = outputDir.resolve("test0.xml.json");
-        String resultString = readFileToString(outputFile, UTF_8);
-        assertTrue(resultString.contains("\"author\":\"Nikolai Lobachevsky\""));
-        assertTrue(resultString.contains("tika-batch\\u0027s first test file"));
-    }
-
-
-}
diff --git a/tika-batch/src/test/java/org/apache/tika/batch/fs/OutputStreamFactoryTest.java b/tika-batch/src/test/java/org/apache/tika/batch/fs/OutputStreamFactoryTest.java
deleted file mode 100644
index 5544a24..0000000
--- a/tika-batch/src/test/java/org/apache/tika/batch/fs/OutputStreamFactoryTest.java
+++ /dev/null
@@ -1,101 +0,0 @@
-package org.apache.tika.batch.fs;
-
-/*
- * Licensed to the Apache Software Foundation (ASF) under one or more
- * contributor license agreements.  See the NOTICE file distributed with
- * this work for additional information regarding copyright ownership.
- * The ASF licenses this file to You under the Apache License, Version 2.0
- * (the "License"); you may not use this file except in compliance with
- * the License.  You may obtain a copy of the License at
- *
- *     http://www.apache.org/licenses/LICENSE-2.0
- *
- * Unless required by applicable law or agreed to in writing, software
- * distributed under the License is distributed on an "AS IS" BASIS,
- * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
- * See the License for the specific language governing permissions and
- * limitations under the License.
- */
-
-import static org.junit.Assert.assertEquals;
-import static org.junit.Assert.assertTrue;
-
-import java.nio.file.Path;
-import java.util.Map;
-import java.util.concurrent.ExecutionException;
-
-import org.apache.tika.batch.BatchProcess;
-import org.apache.tika.batch.ParallelFileProcessingResult;
-import org.junit.Test;
-
-public class OutputStreamFactoryTest extends FSBatchTestBase {
-
-
-    @Test
-    public void testIllegalState() throws Exception {
-        Path outputDir = getNewOutputDir("os-factory-illegal-state-");
-        Map<String, String> args = getDefaultArgs("basic", outputDir);
-        BatchProcess runner = getNewBatchRunner("/tika-batch-config-test.xml", args);
-        run(runner);
-        assertEquals(1, countChildren(outputDir));
-
-        boolean illegalState = false;
-        try {
-            ParallelFileProcessingResult result = run(runner);
-        } catch (ExecutionException e) {
-            if (e.getCause() instanceof IllegalStateException) {
-                illegalState = true;
-            }
-        }
-        assertTrue("Should have been an illegal state exception", illegalState);
-    }
-
-    @Test
-    public void testSkip() throws Exception {
-        Path outputDir = getNewOutputDir("os-factory-skip-");
-        Map<String, String> args = getDefaultArgs("basic", outputDir);
-        args.put("handleExisting", "skip");
-        BatchProcess runner = getNewBatchRunner("/tika-batch-config-test.xml", args);
-        ParallelFileProcessingResult result = run(runner);
-        assertEquals(1, countChildren(outputDir));
-
-        runner = getNewBatchRunner("/tika-batch-config-test.xml", args);
-        result = run(runner);
-        assertEquals(1, countChildren(outputDir));
-    }
-
-    /* turn this back on if there is any need to add "handleExisting"
-    @Test
-    public void testRename() throws Exception {
-        File outputDir = getNewOutputDir("os-factory-rename-");
-        Map<String, String> args = getDefaultArgs("basic", outputDir);
-
-        args.put("handleExisting", "rename");
-        BatchProcess runner = getNewBatchRunner("/tika-batch-config-basic-test.xml", args);
-        ParallelFileProcessingResult result = runner.execute();
-        assertEquals(1, outputDir.listFiles().length);
-
-        runner = getNewBatchRunner("/tika-batch-config-basic-test.xml", args);
-        result = runner.execute();
-        assertEquals(2, outputDir.listFiles().length);
-
-        runner = getNewBatchRunner("/tika-batch-config-basic-test.xml", args);
-        result = runner.execute();
-        assertEquals(3, outputDir.listFiles().length);
-
-        int hits = 0;
-        for (File f : outputDir.listFiles()){
-            String name = f.getName();
-            if (name.equals("test2_ok.xml.xml")) {
-                hits++;
-            } else if (name.equals("test1(1).txt.xml")) {
-                hits++;
-            } else if (name.equals("test1(2).txt.xml")) {
-                hits++;
-            }
-        }
-        assertEquals(3, hits);
-    }
-    */
-
-}
diff --git a/tika-batch/src/test/java/org/apache/tika/batch/fs/StringStreamGobbler.java b/tika-batch/src/test/java/org/apache/tika/batch/fs/StringStreamGobbler.java
deleted file mode 100644
index f03e0e3..0000000
--- a/tika-batch/src/test/java/org/apache/tika/batch/fs/StringStreamGobbler.java
+++ /dev/null
@@ -1,64 +0,0 @@
-package org.apache.tika.batch.fs;
-
-/*
- * Licensed to the Apache Software Foundation (ASF) under one or more
- * contributor license agreements.  See the NOTICE file distributed with
- * this work for additional information regarding copyright ownership.
- * The ASF licenses this file to You under the Apache License, Version 2.0
- * (the "License"); you may not use this file except in compliance with
- * the License.  You may obtain a copy of the License at
- *
- *     http://www.apache.org/licenses/LICENSE-2.0
- *
- * Unless required by applicable law or agreed to in writing, software
- * distributed under the License is distributed on an "AS IS" BASIS,
- * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
- * See the License for the specific language governing permissions and
- * limitations under the License.
- */
-
-import org.apache.commons.io.IOUtils;
-
-import java.io.BufferedInputStream;
-import java.io.BufferedReader;
-import java.io.IOException;
-import java.io.InputStream;
-import java.io.InputStreamReader;
-
-import static java.nio.charset.StandardCharsets.UTF_8;
-
-public class StringStreamGobbler implements Runnable {
-
-    //plagiarized from org.apache.oodt's StreamGobbler
-    private final BufferedReader reader;
-    private volatile boolean running = true;
-    private final StringBuilder sb = new StringBuilder();
-
-    public StringStreamGobbler(InputStream is) {
-        this.reader = new BufferedReader(new InputStreamReader(new BufferedInputStream(is), UTF_8));
-    }
-
-    @Override
-    public void run() {
-        String line = null;
-        try {
-            while ((line = reader.readLine()) != null && this.running) {
-                sb.append(line);
-                sb.append("\n");
-            }
-        } catch (IOException e) {
-            //swallow ioe
-        }
-    }
-
-    public void stopGobblingAndDie() {
-        running = false;
-        IOUtils.closeQuietly(reader);
-    }
-
-    @Override
-    public String toString() {
-        return sb.toString();
-    }
-
-}
diff --git a/tika-batch/src/test/java/org/apache/tika/batch/fs/strawman/StrawmanTest.java b/tika-batch/src/test/java/org/apache/tika/batch/fs/strawman/StrawmanTest.java
deleted file mode 100644
index 56c5f10..0000000
--- a/tika-batch/src/test/java/org/apache/tika/batch/fs/strawman/StrawmanTest.java
+++ /dev/null
@@ -1,26 +0,0 @@
-package org.apache.tika.batch.fs.strawman;
-/*
- * Licensed to the Apache Software Foundation (ASF) under one or more
- * contributor license agreements.  See the NOTICE file distributed with
- * this work for additional information regarding copyright ownership.
- * The ASF licenses this file to You under the Apache License, Version 2.0
- * (the "License"); you may not use this file except in compliance with
- * the License.  You may obtain a copy of the License at
- *
- *     http://www.apache.org/licenses/LICENSE-2.0
- *
- * Unless required by applicable law or agreed to in writing, software
- * distributed under the License is distributed on an "AS IS" BASIS,
- * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
- * See the License for the specific language governing permissions and
- * limitations under the License.
- */
-import org.junit.Test;
-
-public class StrawmanTest {
-    //TODO: actually write some tests!!!
-    @Test
-    public void basicTest() {
-
-    }
-}
diff --git a/tika-batch/src/test/java/org/apache/tika/batch/mock/MockConsumersBuilder.java b/tika-batch/src/test/java/org/apache/tika/batch/mock/MockConsumersBuilder.java
deleted file mode 100644
index 169017f..0000000
--- a/tika-batch/src/test/java/org/apache/tika/batch/mock/MockConsumersBuilder.java
+++ /dev/null
@@ -1,38 +0,0 @@
-package org.apache.tika.batch.mock;
-
-/*
- * Licensed to the Apache Software Foundation (ASF) under one or more
- * contributor license agreements.  See the NOTICE file distributed with
- * this work for additional information regarding copyright ownership.
- * The ASF licenses this file to You under the Apache License, Version 2.0
- * (the "License"); you may not use this file except in compliance with
- * the License.  You may obtain a copy of the License at
- *
- *     http://www.apache.org/licenses/LICENSE-2.0
- *
- * Unless required by applicable law or agreed to in writing, software
- * distributed under the License is distributed on an "AS IS" BASIS,
- * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
- * See the License for the specific language governing permissions and
- * limitations under the License.
- */
-import java.util.Map;
-import java.util.concurrent.ArrayBlockingQueue;
-
-import org.apache.tika.batch.ConsumersManager;
-import org.apache.tika.batch.FileResource;
-import org.apache.tika.batch.fs.builders.BasicTikaFSConsumersBuilder;
-import org.w3c.dom.Node;
-
-public class MockConsumersBuilder extends BasicTikaFSConsumersBuilder {
-
-    @Override
-    public ConsumersManager build(Node node, Map<String, String> runtimeAttributes,
-            ArrayBlockingQueue<FileResource> queue) {
-        ConsumersManager manager = super.build(node, runtimeAttributes, queue);
-
-        boolean hangOnInit = runtimeAttributes.containsKey("hangOnInit");
-        boolean hangOnShutdown = runtimeAttributes.containsKey("hangOnShutdown");
-        return new MockConsumersManager(manager, hangOnInit, hangOnShutdown);
-    }
-}
diff --git a/tika-batch/src/test/java/org/apache/tika/batch/mock/MockConsumersManager.java b/tika-batch/src/test/java/org/apache/tika/batch/mock/MockConsumersManager.java
deleted file mode 100644
index acb659a..0000000
--- a/tika-batch/src/test/java/org/apache/tika/batch/mock/MockConsumersManager.java
+++ /dev/null
@@ -1,77 +0,0 @@
-package org.apache.tika.batch.mock;
-
-import org.apache.tika.batch.ConsumersManager;
-
-/*
- * Licensed to the Apache Software Foundation (ASF) under one or more
- * contributor license agreements.  See the NOTICE file distributed with
- * this work for additional information regarding copyright ownership.
- * The ASF licenses this file to You under the Apache License, Version 2.0
- * (the "License"); you may not use this file except in compliance with
- * the License.  You may obtain a copy of the License at
- *
- *     http://www.apache.org/licenses/LICENSE-2.0
- *
- * Unless required by applicable law or agreed to in writing, software
- * distributed under the License is distributed on an "AS IS" BASIS,
- * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
- * See the License for the specific language governing permissions and
- * limitations under the License.
- */
-
-public class MockConsumersManager extends ConsumersManager {
-
-    private final long HANG_MS = 30000;
-
-    private final ConsumersManager wrapped;
-    private final boolean hangOnInit;
-    private final boolean hangOnClose;
-
-    public MockConsumersManager(ConsumersManager wrapped, boolean hangOnInit,
-            boolean hangOnClose) {
-        super(wrapped.getConsumers());
-        this.wrapped = wrapped;
-        this.hangOnInit = hangOnInit;
-        this.hangOnClose = hangOnClose;
-    }
-
-
-    @Override
-    public void init() {
-        if (hangOnInit) {
-            //interruptible light hang
-            try {
-                Thread.sleep(HANG_MS);
-            } catch (InterruptedException e) {
-                return;
-            }
-            return;
-        }
-        super.init();
-    }
-
-    @Override
-    public void shutdown() {
-        if (hangOnClose) {
-            //interruptible light hang
-            try {
-                Thread.sleep(HANG_MS);
-            } catch (InterruptedException e) {
-                return;
-            }
-            return;
-        }
-        super.shutdown();
-    }
-
-    @Override
-    public long getConsumersManagerMaxMillis() {
-        return wrapped.getConsumersManagerMaxMillis();
-    }
-
-    @Override
-    public void setConsumersManagerMaxMillis(long millis) {
-        wrapped.setConsumersManagerMaxMillis(millis);
-    }
-
-}
diff --git a/tika-batch/src/test/java/org/apache/tika/parser/mock/MockParserFactory.java b/tika-batch/src/test/java/org/apache/tika/parser/mock/MockParserFactory.java
deleted file mode 100644
index f78de03..0000000
--- a/tika-batch/src/test/java/org/apache/tika/parser/mock/MockParserFactory.java
+++ /dev/null
@@ -1,31 +0,0 @@
-package org.apache.tika.parser.mock;
-
-/*
- * Licensed to the Apache Software Foundation (ASF) under one or more
- * contributor license agreements.  See the NOTICE file distributed with
- * this work for additional information regarding copyright ownership.
- * The ASF licenses this file to You under the Apache License, Version 2.0
- * (the "License"); you may not use this file except in compliance with
- * the License.  You may obtain a copy of the License at
- *
- *     http://www.apache.org/licenses/LICENSE-2.0
- *
- * Unless required by applicable law or agreed to in writing, software
- * distributed under the License is distributed on an "AS IS" BASIS,
- * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
- * See the License for the specific language governing permissions and
- * limitations under the License.
- */
-
-import org.apache.tika.batch.ParserFactory;
-import org.apache.tika.config.TikaConfig;
-import org.apache.tika.parser.Parser;
-
-public class MockParserFactory extends ParserFactory {
-
-    @Override
-    public Parser getParser(TikaConfig config) {
-        return new MockParser();
-    }
-
-}
diff --git a/tika-bundle/pom.xml b/tika-bundle/pom.xml
index f2de959..7257b69 100644
--- a/tika-bundle/pom.xml
+++ b/tika-bundle/pom.xml
@@ -25,7 +25,7 @@
   <parent>
     <groupId>org.apache.tika</groupId>
    <artifactId>tika-parent</artifactId>
-    <version>1.11</version>
+    <version>1.5</version>
    <relativePath>../tika-parent/pom.xml</relativePath>
   </parent>
@@ -41,7 +41,7 @@
   <url>http://tika.apache.org/</url>

  <properties>
-    <pax.exam.version>4.4.0</pax.exam.version>
+    <pax.exam.version>2.2.0</pax.exam.version>
  </properties>
@@ -61,6 +61,8 @@
    <dependency>
      <groupId>junit</groupId>
      <artifactId>junit</artifactId>
+      <scope>test</scope>
+      <version>4.11</version>
    </dependency>
    <dependency>
      <groupId>org.ops4j.pax.exam</groupId>
@@ -77,7 +79,7 @@
    <dependency>
      <groupId>org.apache.felix</groupId>
      <artifactId>org.apache.felix.framework</artifactId>
-      <version>4.6.0</version>
+      <version>4.0.1</version>
      <scope>test</scope>
    </dependency>
@@ -89,24 +91,13 @@
    <dependency>
      <groupId>org.ops4j.pax.url</groupId>
      <artifactId>pax-url-aether</artifactId>
-      <version>2.3.0</version>
+      <version>1.3.3</version>
      <scope>test</scope>
    </dependency>
    <dependency>
      <groupId>org.slf4j</groupId>
      <artifactId>slf4j-simple</artifactId>
-      <scope>test</scope>
-    </dependency>
-    <dependency>
-      <groupId>javax.inject</groupId>
-      <artifactId>javax.inject</artifactId>
-      <version>1</version>
-      <scope>test</scope>
-    </dependency>
-    <dependency>
-      <groupId>org.osgi</groupId>
-      <artifactId>org.osgi.core</artifactId>
-      <version>5.0.0</version>
+      <version>1.6.1</version>
      <scope>test</scope>
    </dependency>
@@ -119,99 +110,63 @@
            <unpackBundle>true</unpackBundle>
            <instructions>
-              <_runsystempackages>com.sun.xml.bind.marshaller, com.sun.xml.internal.bind.marshaller</_runsystempackages>
              <Bundle-Activator>org.apache.tika.parser.internal.Activator</Bundle-Activator>
              <Embed-Dependency>
                tika-parsers;inline=true,
-                commons-compress, xz, commons-codec, commons-csv,
-                commons-io, commons-exec, junrar,
-                pdfbox,fontbox,jempbox,bcmail-jdk15on,bcprov-jdk15on,bcpkix-jdk15on,
+                commons-compress, xz, commons-codec, commons-io,
+                pdfbox,fontbox,jempbox,bcmail-jdk15,bcprov-jdk15,
                poi,poi-scratchpad,poi-ooxml,poi-ooxml-schemas,
-                xmlbeans,
-                jackcess,
-                commons-lang,
+                xmlbeans,
                dom4j, tagsoup, asm, juniversalchardet,
                vorbis-java-core, vorbis-java-tika,
                isoparser, aspectjrt,
-                metadata-extractor, xmpcore, json-simple,
-                boilerpipe, rome, opennlp-tools, opennlp-maxent,
-                geoapi, sis-metadata, sis-netcdf, sis-utility,
-                sis-storage, apache-mime4j-core, apache-mime4j-dom,
-                jsr-275, jhighlight, java-libpst, jwnl,
-                netcdf4, grib, cdm, httpservices, jcip-annotations,
-                jmatio, guava
+                metadata-extractor,
+                boilerpipe, rome,
+                apache-mime4j-core, apache-mime4j-dom
              </Embed-Dependency>
              <Embed-Transitive>true</Embed-Transitive>
              <Bundle-DocURL>${project.url}</Bundle-DocURL>
              <Export-Package>
                !org.apache.tika.parser,
                !org.apache.tika.parser.external,
-                org.apache.tika.parser.*,
+                org.apache.tika.parser.*
              </Export-Package>
              <Import-Package>
-                !org.junit,
-                !org.junit.*,
-                !junit.*,
-                !org.apache.ctakes.*,
-                !org.apache.uima.*,
                *,
-                org.apache.tika.fork,
-                android.util;resolution:=optional,
                com.adobe.xmp;resolution:=optional,
                com.adobe.xmp.properties;resolution:=optional,
                com.google.protobuf;resolution:=optional,
                com.ibm.icu.text;resolution:=optional,
                com.sleepycat.je;resolution:=optional,
                com.sun.javadoc;resolution:=optional,
-                com.sun.xml.bind.marshaller;resolution:=optional,
-                com.sun.xml.internal.bind.marshaller;resolution:=optional,
                com.sun.msv.datatype;resolution:=optional,
                com.sun.msv.datatype.xsd;resolution:=optional,
                com.sun.tools.javadoc;resolution:=optional,
                edu.wisc.ssec.mcidas;resolution:=optional,
                edu.wisc.ssec.mcidas.adde;resolution:=optional,
                javax.activation;resolution:=optional,
-                javax.annotation;resolution:=optional,
                javax.mail;resolution:=optional,
                javax.mail.internet;resolution:=optional,
-                javax.servlet.annotation;resolution:=optional,
-                javax.servlet;resolution:=optional,
-                javax.servlet.http;resolution:=optional,
-                javax.measure.converter;resolution:=optional,
+                javax.xml.bind;resolution:=optional,
                javax.xml.stream;version="[1.0,2)";resolution:=optional,
                javax.xml.stream.events;version="[1.0,2)";resolution:=optional,
                javax.xml.stream.util;version="[1.0,2)";resolution:=optional,
-                javax.ws.rs.core;resolution:=optional,
+                junit.framework;resolution:=optional,
+                junit.textui;resolution:=optional,
                net.sf.ehcache;resolution:=optional,
                nu.xom;resolution:=optional,
-                opendap.dap.http;resolution:=optional,
                opendap.dap;resolution:=optional,
                opendap.dap.parser;resolution:=optional,
-                opennlp.maxent;resolution:=optional,
-                opennlp.tools.namefind;resolution:=optional,
-                net.didion.jwnl;resolution:=optional,
-                org.apache.cxf.jaxrs.client;resolution:=optional,
-                org.apache.cxf.jaxrs.ext.multipart;resolution:=optional,
-                org.apache.commons.exec;resolution:=optional,
-                org.apache.commons.io;resolution:=optional,
                org.apache.commons.httpclient;resolution:=optional,
                org.apache.commons.httpclient.auth;resolution:=optional,
                org.apache.commons.httpclient.methods;resolution:=optional,
                org.apache.commons.httpclient.params;resolution:=optional,
                org.apache.commons.httpclient.protocol;resolution:=optional,
-                org.apache.commons.httpclient.util;resolution:=optional,
-                org.apache.commons.vfs2;resolution:=optional,
-                org.apache.commons.vfs2.provider;resolution:=optional,
-                org.apache.commons.vfs2.util;resolution:=optional,
                org.apache.crimson.jaxp;resolution:=optional,
-                org.apache.jcp.xml.dsig.internal.dom;resolution:=optional,
-                org.apache.sis;resolution:=optional,
-                org.apache.sis.distance;resolution:=optional,
-                org.apache.sis.geometry;resolution:=optional,
                org.apache.tools.ant;resolution:=optional,
                org.apache.tools.ant.taskdefs;resolution:=optional,
                org.apache.tools.ant.types;resolution:=optional,
@@ -221,21 +176,9 @@
                org.apache.xerces.xni.parser;resolution:=optional,
                org.apache.xml.resolver;resolution:=optional,
                org.apache.xml.resolver.tools;resolution:=optional,
-                org.apache.xml.security;resolution:=optional,
-                org.apache.xml.security.c14n;resolution:=optional,
-                org.apache.xml.security.utils;resolution:=optional,
                org.apache.xmlbeans.impl.xpath.saxon;resolution:=optional,
                org.apache.xmlbeans.impl.xquery.saxon;resolution:=optional,
-                org.bouncycastle.cert;resolution:=optional,
-                org.bouncycastle.cert.jcajce;resolution:=optional,
-                org.bouncycastle.cert.ocsp;resolution:=optional,
-                org.bouncycastle.cms.bc;resolution:=optional,
-                org.bouncycastle.operator;resolution:=optional,
-                org.bouncycastle.operator.bc;resolution:=optional,
-                org.bouncycastle.tsp;resolution:=optional,
                org.cyberneko.html.xercesbridge;resolution:=optional,
-                org.etsi.uri.x01903.v14;resolution:=optional,
-                org.ibex.nestedvm;resolution:=optional,
                org.gjt.xpp;resolution:=optional,
                org.jaxen;resolution:=optional,
                org.jaxen.dom4j;resolution:=optional,
@@ -244,18 +187,9 @@
                org.jdom;resolution:=optional,
                org.jdom.input;resolution:=optional,
                org.jdom.output;resolution:=optional,
-                org.jdom2;resolution:=optional,
-                org.jdom2.input;resolution:=optional,
-                org.jdom2.output;resolution:=optional,
-                org.json.simple;resolution:=optional,
-                org.json;resolution:=optional,
                org.openxmlformats.schemas.officeDocument.x2006.math;resolution:=optional,
                org.openxmlformats.schemas.schemaLibrary.x2006.main;resolution:=optional,
                org.osgi.framework;resolution:=optional,
-                org.quartz;resolution:=optional,
-                org.quartz.impl;resolution:=optional,
-                org.slf4j;resolution:=optional,
-                org.sqlite;resolution:=optional,
                org.w3c.dom;resolution:=optional,
                org.relaxng.datatype;resolution:=optional,
                org.xml.sax;resolution:=optional,
@@ -264,91 +198,15 @@
                org.xmlpull.v1;resolution:=optional,
                schemasMicrosoftComOfficePowerpoint;resolution:=optional,
                schemasMicrosoftComOfficeWord;resolution:=optional,
-                sun.misc;resolution:=optional,
-                ucar.units;resolution:=optional,
-                ucar.httpservices;resolution:=optional,
-                ucar.nc2.util;resolution:=optional,
-                ucar.nc2.util.cache;resolution:=optional,
-                ucar.nc2.dataset;resolution:=optional,
-                ucar.nc2;resolution:=optional,
-                ucar.nc2.constants;resolution:=optional,
-                ucar.nc2.dt;resolution:=optional,
-                ucar.nc2.dt.grid;resolution:=optional,
-                ucar.nc2.ft;resolution:=optional,
-                ucar.nc2.iosp;resolution:=optional,
-                ucar.nc2.iosp.hdf4;resolution:=optional,
-                ucar.nc2.ncml;resolution:=optional,
-                ucar.nc2.stream;resolution:=optional,
-                ucar.nc2.time;resolution:=optional,
-                ucar.nc2.units;resolution:=optional,
-                ucar.nc2.wmo;resolution:=optional,
-                ucar.nc2.write;resolution:=optional,
-                ucar.ma2;resolution:=optional,
                ucar.grib;resolution:=optional,
                ucar.grib.grib1;resolution:=optional,
                ucar.grib.grib2;resolution:=optional,
                ucar.grid;resolution:=optional,
-                ucar.unidata.geoloc;resolution:=optional,
-                ucar.unidata.geoloc.projection;resolution:=optional,
-                ucar.unidata.geoloc.projection.proj4;resolution:=optional,
-                ucar.unidata.geoloc.projection.sat;resolution:=optional,
-                ucar.unidata.io;resolution:=optional,
-                ucar.unidata.util;resolution:=optional,
-                com.jmatio.io;resolution:=optional,
                visad;resolution:=optional,
                visad.data;resolution:=optional,
                visad.data.vis5d;resolution:=optional,
                visad.jmet;resolution:=optional,
-                visad.util;resolution:=optional,
-                colorspace;resolution:=optional,
-                com.sun.jna;resolution:=optional,
-                com.sun.jna.ptr;resolution:=optional,
-                icc;resolution:=optional,
-                jj2000.j2k.codestream;resolution:=optional,
-                jj2000.j2k.codestream.reader;resolution:=optional,
-                jj2000.j2k.decoder;resolution:=optional,
-                jj2000.j2k.entropy.decoder;resolution:=optional,
-                jj2000.j2k.fileformat.reader;resolution:=optional,
-                jj2000.j2k.image;resolution:=optional,
-                jj2000.j2k.image.invcomptransf;resolution:=optional,
-                jj2000.j2k.image.output;resolution:=optional,
-                jj2000.j2k.io;resolution:=optional,
-                jj2000.j2k.quantization.dequantizer;resolution:=optional,
-                jj2000.j2k.roi;resolution:=optional,
-                jj2000.j2k.util;resolution:=optional,
-                jj2000.j2k.wavelet.synthesis;resolution:=optional,
-                org.itadaki.bzip2;resolution:=optional,
-                org.jsoup;resolution:=optional,
-                org.jsoup.nodes;resolution:=optional,
-                org.jsoup.select;resolution:=optional,
-                thredds.featurecollection;resolution:=optional,
-                thredds.filesystem;resolution:=optional,
-                thredds.inventory;resolution:=optional,
-                thredds.inventory.filter;resolution:=optional,
-                thredds.inventory.partition;resolution:=optional,
-                com.beust.jcommander;resolution:=optional,
-                com.google.common.base;resolution:=optional,
-
com.google.common.math;resolution:=optional, - org.apache.http;resolution:=optional, - org.joda.time;resolution:=optional, - org.joda.time.chrono;resolution:=optional, - org.joda.time.field;resolution:=optional, - org.joda.time.format;resolution:=optional, - sun.reflect.generics.reflectiveObjects;resolution:=optional, - org.apache.http.auth;resolution:=optional, - org.apache.http.client;resolution:=optional, - org.apache.http.client.entity;resolution:=optional, - org.apache.http.client.methods;resolution:=optional, - org.apache.http.conn;resolution:=optional, - org.apache.http.conn.scheme;resolution:=optional, - org.apache.http.cookie;resolution:=optional, - org.apache.http.entity;resolution:=optional, - org.apache.http.impl.client;resolution:=optional, - org.apache.http.impl.conn;resolution:=optional, - org.apache.http.message;resolution:=optional, - org.apache.http.params;resolution:=optional, - org.apache.http.protocol;resolution:=optional, - org.apache.http.util;resolution:=optional + visad.util;resolution:=optional @@ -377,71 +235,72 @@ - - - - - de.thetaphi - forbiddenapis - - true - - - - - maven-assembly-plugin - - - pre-integration-test - - single - - - test-bundles.xml - test - false - - - - - - - maven-failsafe-plugin - 2.10 - - - - integration-test - verify - - - - - - - WARN - - - - + + + java6 + + [1.6,) + + + + + maven-assembly-plugin + + + pre-integration-test + + single + + + test-bundles.xml + test + false + + + + + + maven-failsafe-plugin + 2.10 + + + + integration-test + verify + + + + + + + WARN + + + + + + + + + - The Apache Software Founation - http://www.apache.org + The Apache Software Founation + http://www.apache.org - http://svn.apache.org/viewvc/tika/tags/1.11-rc1/tika-bundle - scm:svn:http://svn.apache.org/repos/asf/tika/tags/1.11-rc1/tika-bundle - scm:svn:https://svn.apache.org/repos/asf/tika/tags/1.11-rc1/tika-bundle + http://svn.apache.org/viewvc/tika/tags/1.5/tika-bundle + 
scm:svn:http://svn.apache.org/repos/asf/tika/tags/1.5/tika-bundle + scm:svn:https://svn.apache.org/repos/asf/tika/tags/1.5/tika-bundle - JIRA - https://issues.apache.org/jira/browse/TIKA + JIRA + https://issues.apache.org/jira/browse/TIKA - Jenkins - https://builds.apache.org/job/Tika-trunk/ + Jenkins + https://builds.apache.org/job/Tika-trunk/ diff --git a/tika-bundle/src/test/java/org/apache/tika/bundle/BundleIT.java b/tika-bundle/src/test/java/org/apache/tika/bundle/BundleIT.java index 78b4cdd..389a86e 100644 --- a/tika-bundle/src/test/java/org/apache/tika/bundle/BundleIT.java +++ b/tika-bundle/src/test/java/org/apache/tika/bundle/BundleIT.java @@ -16,229 +16,53 @@ */ package org.apache.tika.bundle; -import static java.nio.charset.StandardCharsets.UTF_8; import static org.junit.Assert.assertEquals; import static org.junit.Assert.assertTrue; -import static org.junit.Assert.assertFalse; import static org.ops4j.pax.exam.CoreOptions.bundle; import static org.ops4j.pax.exam.CoreOptions.junitBundles; -import static org.ops4j.pax.exam.CoreOptions.options; -import javax.inject.Inject; - -import java.io.ByteArrayInputStream; import java.io.File; import java.io.FileInputStream; import java.io.IOException; import java.io.InputStream; -import java.io.StringWriter; -import java.io.Writer; import java.net.URISyntaxException; -import java.util.HashSet; -import java.util.Set; -import java.util.jar.Attributes; -import java.util.jar.JarInputStream; -import java.util.jar.Manifest; import org.apache.tika.Tika; -import org.apache.tika.detect.DefaultDetector; -import org.apache.tika.detect.Detector; -import org.apache.tika.fork.ForkParser; import org.apache.tika.metadata.Metadata; -import org.apache.tika.mime.MediaType; -import org.apache.tika.parser.CompositeParser; -import org.apache.tika.parser.DefaultParser; import org.apache.tika.parser.ParseContext; import org.apache.tika.parser.Parser; -import org.apache.tika.parser.internal.Activator; import 
org.apache.tika.sax.BodyContentHandler; import org.junit.Test; -import org.junit.runner.RunWith; -import org.ops4j.pax.exam.Configuration; +import org.ops4j.pax.exam.CoreOptions; import org.ops4j.pax.exam.Option; -import org.ops4j.pax.exam.junit.PaxExam; -import org.ops4j.pax.exam.spi.reactors.ExamReactorStrategy; -import org.ops4j.pax.exam.spi.reactors.PerMethod; -import org.osgi.framework.Bundle; +import org.ops4j.pax.exam.junit.Configuration; import org.osgi.framework.BundleContext; -import org.osgi.framework.ServiceReference; import org.xml.sax.ContentHandler; -@RunWith(PaxExam.class) -@ExamReactorStrategy(PerMethod.class) public class BundleIT { private final File TARGET = new File("target"); - @Inject - private Parser defaultParser; - @Inject - private Detector contentTypeDetector; - @Inject - private BundleContext bc; - @Configuration public Option[] configuration() throws IOException, URISyntaxException { File base = new File(TARGET, "test-bundles"); - return options( + return CoreOptions.options( junitBundles(), bundle(new File(base, "tika-core.jar").toURI().toURL().toString()), bundle(new File(base, "tika-bundle.jar").toURI().toURL().toString())); } - - - @Test - public void testBundleLoaded() throws Exception { - boolean hasCore = false, hasBundle = false; - for (Bundle b : bc.getBundles()) { - if ("org.apache.tika.core".equals(b.getSymbolicName())) { - hasCore = true; - assertEquals("Core not activated", Bundle.ACTIVE, b.getState()); - } - if ("org.apache.tika.bundle".equals(b.getSymbolicName())) { - hasBundle = true; - assertEquals("Bundle not activated", Bundle.ACTIVE, b.getState()); - } - } - assertTrue("Core bundle not found", hasCore); - assertTrue("Bundle bundle not found", hasBundle); - } - - - @Test - public void testManifestNoJUnit() throws Exception { - File TARGET = new File("target"); - File base = new File(TARGET, "test-bundles"); - File tikaBundle = new File(base, "tika-bundle.jar"); - - JarInputStream jarIs = new JarInputStream(new 
FileInputStream(tikaBundle)); - Manifest mf = jarIs.getManifest(); - - Attributes main = mf.getMainAttributes(); - - String importPackage = main.getValue("Import-Package"); - - boolean containsJunit = importPackage.contains("junit"); - - assertFalse("The bundle should not import junit", containsJunit); - } - - - @Test - public void testBundleDetection() throws Exception { - Metadata metadataTXT = new Metadata(); - metadataTXT.set(Metadata.RESOURCE_NAME_KEY, "test.txt"); - - Metadata metadataPDF = new Metadata(); - metadataPDF.set(Metadata.RESOURCE_NAME_KEY, "test.pdf"); + + //@Test + public void testTikaBundle(BundleContext bc) throws Exception { + Tika tika = new Tika(); // Simple type detection - assertEquals(MediaType.TEXT_PLAIN, contentTypeDetector.detect(null, metadataTXT)); - assertEquals(MediaType.application("pdf"), contentTypeDetector.detect(null, metadataPDF)); - } - - - @Test - public void testForkParser() throws Exception { - ForkParser parser = new ForkParser(Activator.class.getClassLoader(), defaultParser); - String data = "\n
<html><body>
    test content
</body></html>
    "; - InputStream stream = new ByteArrayInputStream(data.getBytes(UTF_8)); - Writer writer = new StringWriter(); - ContentHandler contentHandler = new BodyContentHandler(writer); - Metadata metadata = new Metadata(); - MediaType type = contentTypeDetector.detect(stream, metadata); - assertEquals(type.toString(), "text/html"); - metadata.add(Metadata.CONTENT_TYPE, type.toString()); - ParseContext parseCtx = new ParseContext(); - parser.parse(stream, contentHandler, metadata, parseCtx); - writer.flush(); - String content = writer.toString(); - assertTrue(content.length() > 0); - assertEquals("test content", content.trim()); - } - - - @Test - public void testBundleSimpleText() throws Exception { - Tika tika = new Tika(); + assertEquals("text/plain", tika.detect("test.txt")); + assertEquals("application/pdf", tika.detect("test.pdf")); // Simple text extraction String xml = tika.parseToString(new File("pom.xml")); assertTrue(xml.contains("tika-bundle")); - } - - - @Test - public void testBundleDetectors() throws Exception { - //For some reason, the detector created by OSGi has a flat - //list of detectors, whereas the detector created by the traditional - //service loading method has children: DefaultDetector, MimeTypes. - //We have to flatten the service loaded DefaultDetector to get equivalence. - //Detection behavior should all be the same. - - // Get the classes found within OSGi - ServiceReference detectorRef = bc.getServiceReference(Detector.class); - DefaultDetector detectorService = (DefaultDetector)bc.getService(detectorRef); - - Set osgiDetectors = new HashSet(); - for (Detector d : detectorService.getDetectors()) { - osgiDetectors.add(d.getClass().getName()); - } - - // Check we did get a few, just in case... 
- assertTrue("Should have several Detector names, found " + osgiDetectors.size(), - osgiDetectors.size() > 3); - - // Get the raw detectors list from the traditional service loading mechanism - DefaultDetector detector = new DefaultDetector(); - Set rawDetectors = new HashSet(); - for (Detector d : detector.getDetectors()) { - if (d instanceof DefaultDetector) { - for (Detector dChild : ((DefaultDetector)d).getDetectors()) { - rawDetectors.add(dChild.getClass().getName()); - } - } else { - rawDetectors.add(d.getClass().getName()); - } - } - assertEquals(osgiDetectors, rawDetectors); - } - - - @Test - public void testBundleParsers() throws Exception { - // Get the classes found within OSGi - ServiceReference parserRef = bc.getServiceReference(Parser.class); - DefaultParser parserService = (DefaultParser)bc.getService(parserRef); - - Set osgiParsers = new HashSet(); - for (Parser p : parserService.getAllComponentParsers()) { - osgiParsers.add(p.getClass().getName()); - } - - // Check we did get a few, just in case... 
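The deleted testBundleDetectors above flattens the service-loaded DefaultDetector before comparing class-name sets with the OSGi-loaded one, because one path yields a flat component list and the other a one-level nested one. A self-contained sketch of that flatten-then-compare step (`Component`/`Composite` and the leaf classes are illustrative stand-ins for `Detector`/`DefaultDetector`, not Tika types):

```java
import java.util.Arrays;
import java.util.HashSet;
import java.util.List;
import java.util.Set;

public class FlattenDemo {

    public interface Component { }

    public static class Composite implements Component {
        final List<Component> children;
        public Composite(Component... children) {
            this.children = Arrays.asList(children);
        }
    }

    public static class MimeMagic implements Component { }
    public static class FileName implements Component { }

    // Flatten one level of Composite nesting into class-name strings,
    // as the deleted test does for DefaultDetector children.
    public static Set<String> flattenNames(List<Component> components) {
        Set<String> names = new HashSet<String>();
        for (Component c : components) {
            if (c instanceof Composite) {
                for (Component child : ((Composite) c).children) {
                    names.add(child.getClass().getName());
                }
            } else {
                names.add(c.getClass().getName());
            }
        }
        return names;
    }

    // A flat list and a nested list of the same leaves compare equal
    // once the nested one is flattened.
    public static boolean demo() {
        List<Component> flat = Arrays.<Component>asList(new MimeMagic(), new FileName());
        List<Component> nested = Arrays.<Component>asList(
                new Composite(new MimeMagic(), new FileName()));
        return flattenNames(flat).equals(flattenNames(nested));
    }

    public static void main(String[] args) {
        System.out.println(demo()); // prints true
    }
}
```

Comparing class names rather than instances sidesteps the fact that the two loading mechanisms construct distinct objects; only detection behavior needs to be equivalent.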
- assertTrue("Should have lots Parser names, found " + osgiParsers.size(), - osgiParsers.size() > 15); - - // Get the raw parsers list from the traditional service loading mechanism - CompositeParser parser = (CompositeParser)defaultParser; - Set rawParsers = new HashSet(); - for (Parser p : parser.getAllComponentParsers()) { - if (p instanceof DefaultParser) { - for (Parser pChild : ((DefaultParser)p).getAllComponentParsers()) { - rawParsers.add(pChild.getClass().getName()); - } - } else { - rawParsers.add(p.getClass().getName()); - } - } - assertEquals(rawParsers, osgiParsers); - } - - - @Test - public void testTikaBundle() throws Exception { - Tika tika = new Tika(); // Package extraction ContentHandler handler = new BodyContentHandler(); @@ -247,9 +71,12 @@ ParseContext context = new ParseContext(); context.set(Parser.class, parser); - try (InputStream stream = - new FileInputStream("src/test/resources/test-documents.zip")) { + InputStream stream = + new FileInputStream("src/test/resources/test-documents.zip"); + try { parser.parse(stream, handler, new Metadata(), context); + } finally { + stream.close(); } String content = handler.toString(); @@ -272,4 +99,5 @@ assertTrue(content.contains("testXML.xml")); assertTrue(content.contains("Rida Benjelloun")); } + } diff --git a/tika-core/pom.xml b/tika-core/pom.xml index 1ed5538..87e922d 100644 --- a/tika-core/pom.xml +++ b/tika-core/pom.xml @@ -25,7 +25,7 @@ org.apache.tika tika-parent - 1.11 + 1.5 ../tika-parent/pom.xml @@ -60,6 +60,8 @@ junit junit + test + 4.11 @@ -85,21 +87,8 @@ src/test/resources/org/apache/tika/** - src/main/resources/org/apache/tika/language/*.ngp - src/main/resources/org/apache/tika/detect/*.nnmodel - - - org.apache.maven.plugins - maven-jar-plugin - - - - test-jar - - - org.codehaus.mojo @@ -154,24 +143,22 @@ - This is the core Apache Tika™ toolkit library from which all other modules inherit functionality. It also - includes the core facades for the Tika API. 
- + This is the core Apache Tika™ toolkit library from which all other modules inherit functionality. It also includes the core facades for the Tika API. - The Apache Software Foundation - http://www.apache.org + The Apache Software Foundation + http://www.apache.org - http://svn.apache.org/viewvc/tika/tags/1.11-rc1/core - scm:svn:http://svn.apache.org/repos/asf/tika/tags/1.11-rc1/core - scm:svn:https://svn.apache.org/repos/asf/tika/tags/1.11-rc1/core + http://svn.apache.org/viewvc/tika/tags/1.5/core + scm:svn:http://svn.apache.org/repos/asf/tika/tags/1.5/core + scm:svn:https://svn.apache.org/repos/asf/tika/tags/1.5/core - JIRA - https://issues.apache.org/jira/browse/TIKA + JIRA + https://issues.apache.org/jira/browse/TIKA - Jenkins - https://builds.apache.org/job/Tika-trunk/ + Jenkins + https://builds.apache.org/job/Tika-trunk/ diff --git a/tika-core/src/main/java/org/apache/tika/Tika.java b/tika-core/src/main/java/org/apache/tika/Tika.java index 245dc57..022b83c 100644 --- a/tika-core/src/main/java/org/apache/tika/Tika.java +++ b/tika-core/src/main/java/org/apache/tika/Tika.java @@ -22,15 +22,12 @@ import java.io.InputStream; import java.io.Reader; import java.net.URL; -import java.nio.file.Path; import java.util.Properties; import org.apache.tika.config.TikaConfig; import org.apache.tika.detect.Detector; import org.apache.tika.exception.TikaException; -import org.apache.tika.io.IOUtils; import org.apache.tika.io.TikaInputStream; -import org.apache.tika.language.translate.Translator; import org.apache.tika.metadata.Metadata; import org.apache.tika.parser.AutoDetectParser; import org.apache.tika.parser.ParseContext; @@ -62,11 +59,6 @@ private final Parser parser; /** - * The Translator instance used by this facade. - */ - private final Translator translator; - - /** * Maximum length of the strings returned by the parseToString methods. * Used to prevent out of memory problems with huge input documents. * The default setting is 100k characters. 
@@ -74,7 +66,7 @@ private int maxStringLength = 100 * 1000; /** - * Creates a Tika facade using the given detector and parser instances, but the default Translator. + * Creates a Tika facade using the given detector and parser instances. * * @since Apache Tika 0.8 * @param detector type detector @@ -83,21 +75,6 @@ public Tika(Detector detector, Parser parser) { this.detector = detector; this.parser = parser; - this.translator = TikaConfig.getDefaultConfig().getTranslator(); - } - - /** - * Creates a Tika facade using the given detector, parser, and translator instances. - * - * @since Apache Tika 1.6 - * @param detector type detector - * @param parser document parser - * @param translator text translator - */ - public Tika(Detector detector, Parser parser, Translator translator) { - this.detector = detector; - this.parser = parser; - this.translator = translator; } /** @@ -106,7 +83,7 @@ * @param config Tika configuration */ public Tika(TikaConfig config) { - this(config.getDetector(), new AutoDetectParser(config), config.getTranslator()); + this(config.getDetector(), new AutoDetectParser(config)); } /** @@ -117,8 +94,8 @@ } /** - * Creates a Tika facade using the given detector instance, the - * default parser configuration, and the default Translator. + * Creates a Tika facade using the given detector instance and the + * default parser configuration. 
* * @since Apache Tika 0.8 * @param detector type detector @@ -219,8 +196,11 @@ */ public String detect(byte[] prefix, String name) { try { - try (InputStream stream = TikaInputStream.get(prefix)) { + InputStream stream = TikaInputStream.get(prefix); + try { return detect(stream, name); + } finally { + stream.close(); } } catch (IOException e) { throw new IllegalStateException("Unexpected IOException", e); @@ -242,8 +222,11 @@ */ public String detect(byte[] prefix) { try { - try (InputStream stream = TikaInputStream.get(prefix)) { + InputStream stream = TikaInputStream.get(prefix); + try { return detect(stream); + } finally { + stream.close(); } } catch (IOException e) { throw new IllegalStateException("Unexpected IOException", e); @@ -251,41 +234,18 @@ } /** - * Detects the media type of the file at the given path. The type - * detection is based on the document content and a potential known - * file extension. + * Detects the media type of the given file. The type detection is + * based on the document content and a potential known file extension. *
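The detect(byte[]) hunks above show the pattern applied throughout this patch: a Java 7 try-with-resources block is unrolled into an explicit try/finally so the code still compiles on Java 6, with any IOException wrapped in IllegalStateException as before. A minimal stdlib sketch of the rewritten shape (`detectFirstByte` is a hypothetical stand-in for the real detect calls):

```java
import java.io.ByteArrayInputStream;
import java.io.IOException;
import java.io.InputStream;

public class StreamCloseDemo {

    // Java 6 form used by this patch: the stream is closed in a finally
    // block, so close() runs whether or not the read throws.
    public static int detectFirstByte(byte[] prefix) {
        try {
            InputStream stream = new ByteArrayInputStream(prefix);
            try {
                return stream.read(); // stand-in for detect(stream, name)
            } finally {
                stream.close();
            }
        } catch (IOException e) {
            // Mirrors the surrounding Tika code: an in-memory stream
            // should never throw, so surface it as an unchecked error.
            throw new IllegalStateException("Unexpected IOException", e);
        }
    }

    public static void main(String[] args) {
        System.out.println(detectFirstByte(new byte[] {42})); // prints 42
    }
}
```

On Java 7+ the inner try/finally collapses back to `try (InputStream stream = ...) { ... }`, which is what the left-hand side of each hunk used.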

    * Use the {@link #detect(String)} method when you want to detect the * type of the document without actually accessing the file. * - * @param path the path of the file + * @param file the file * @return detected media type * @throws IOException if the file can not be read */ - public String detect(Path path) throws IOException { - Metadata metadata = new Metadata(); - try (InputStream stream = TikaInputStream.get(path, metadata)) { - return detect(stream, metadata); - } - } - - /** - * Detects the media type of the given file. The type detection is - * based on the document content and a potential known file extension. - *

    - * Use the {@link #detect(String)} method when you want to detect the - * type of the document without actually accessing the file. - * - * @param file the file - * @return detected media type - * @throws IOException if the file can not be read - * @see #detect(Path) - */ public String detect(File file) throws IOException { - Metadata metadata = new Metadata(); - try (InputStream stream = TikaInputStream.get(file, metadata)) { - return detect(stream, metadata); - } + return detect(file.toURI().toURL()); } /** @@ -302,8 +262,11 @@ */ public String detect(URL url) throws IOException { Metadata metadata = new Metadata(); - try (InputStream stream = TikaInputStream.get(url, metadata)) { + InputStream stream = TikaInputStream.get(url, metadata); + try { return detect(stream, metadata); + } finally { + stream.close(); } } @@ -322,69 +285,6 @@ return detect((InputStream) null, name); } catch (IOException e) { throw new IllegalStateException("Unexpected IOException", e); - } - } - - /** - * Translate the given text String to and from the given languages. - * @see org.apache.tika.language.translate.Translator - * @param text The text to translate. - * @param sourceLanguage The input text language (for example, "hi"). - * @param targetLanguage The desired output language (for example, "fr"). - * @return The translated text. If translation is unavailable (client keys not set), returns the same text back. - */ - public String translate(String text, String sourceLanguage, String targetLanguage){ - try { - return translator.translate(text, sourceLanguage, targetLanguage); - } catch (Exception e){ - throw new IllegalStateException("Error translating data.", e); - } - } - - /** - * Translate the given text String to the given language, attempting to auto-detect the source language. - * @see org.apache.tika.language.translate.Translator - * @param text The text to translate. - * @param targetLanguage The desired output language (for example, "en"). 
- * @return The translated text. If translation is unavailable (client keys not set), returns the same text back. - */ - public String translate(String text, String targetLanguage){ - try { - return translator.translate(text, targetLanguage); - } catch (Exception e){ - throw new IllegalStateException("Error translating data.", e); - } - } - - /** - * Translate the given text InputStream to and from the given languages. - * @see org.apache.tika.language.translate.Translator - * @param text The text to translate. - * @param sourceLanguage The input text language (for example, "hi"). - * @param targetLanguage The desired output language (for example, "fr"). - * @return The translated text. If translation is unavailable (client keys not set), returns the same text back. - */ - public String translate(InputStream text, String sourceLanguage, String targetLanguage){ - try { - return translator.translate(IOUtils.toString(text), sourceLanguage, targetLanguage); - } catch (Exception e){ - throw new IllegalStateException("Error translating data.", e); - } - } - - /** - * Translate the given text InputStream to the given language, attempting to auto-detect the source language. - * This does not close the stream, so the caller has the responsibility of closing it. - * @see org.apache.tika.language.translate.Translator - * @param text The text to translate. - * @param targetLanguage The desired output language (for example, "en"). - * @return The translated text. If translation is unavailable (client keys not set), returns the same text back. - */ - public String translate(InputStream text, String targetLanguage){ - try { - return translator.translate(IOUtils.toString(text), targetLanguage); - } catch (Exception e){ - throw new IllegalStateException("Error translating data.", e); } } @@ -426,30 +326,14 @@ } /** - * Parses the file at the given path and returns the extracted text content. 
- * - * @param path the path of the file to be parsed + * Parses the given file and returns the extracted text content. + * + * @param file the file to be parsed * @return extracted text content * @throws IOException if the file can not be read or parsed */ - public Reader parse(Path path) throws IOException { - Metadata metadata = new Metadata(); - InputStream stream = TikaInputStream.get(path, metadata); - return parse(stream, metadata); - } - - /** - * Parses the given file and returns the extracted text content. - * - * @param file the file to be parsed - * @return extracted text content - * @throws IOException if the file can not be read or parsed - * @see #parse(Path) - */ public Reader parse(File file) throws IOException { - Metadata metadata = new Metadata(); - InputStream stream = TikaInputStream.get(file, metadata); - return parse(stream, metadata); + return parse(file.toURI().toURL()); } /** @@ -572,53 +456,31 @@ } /** - * Parses the file at the given path and returns the extracted text content. + * Parses the given file and returns the extracted text content. *

    * To avoid unpredictable excess memory use, the returned string contains * only up to {@link #getMaxStringLength()} first characters extracted * from the input document. Use the {@link #setMaxStringLength(int)} * method to adjust this limitation. * - * @param path the path of the file to be parsed + * @param file the file to be parsed * @return extracted text content * @throws IOException if the file can not be read * @throws TikaException if the file can not be parsed */ - public String parseToString(Path path) throws IOException, TikaException { - Metadata metadata = new Metadata(); - InputStream stream = TikaInputStream.get(path, metadata); - return parseToString(stream, metadata); - } - - /** - * Parses the given file and returns the extracted text content. + public String parseToString(File file) throws IOException, TikaException { + return parseToString(file.toURI().toURL()); + } + + /** + * Parses the resource at the given URL and returns the extracted + * text content. *

    * To avoid unpredictable excess memory use, the returned string contains * only up to {@link #getMaxStringLength()} first characters extracted * from the input document. Use the {@link #setMaxStringLength(int)} * method to adjust this limitation. * - * @param file the file to be parsed - * @return extracted text content - * @throws IOException if the file can not be read - * @throws TikaException if the file can not be parsed - * @see #parseToString(Path) - */ - public String parseToString(File file) throws IOException, TikaException { - Metadata metadata = new Metadata(); - InputStream stream = TikaInputStream.get(file, metadata); - return parseToString(stream, metadata); - } - - /** - * Parses the resource at the given URL and returns the extracted - * text content. - *

    - * To avoid unpredictable excess memory use, the returned string contains - * only up to {@link #getMaxStringLength()} first characters extracted - * from the input document. Use the {@link #setMaxStringLength(int)} - * method to adjust this limitation. - * * @param url the URL of the resource to be parsed * @return extracted text content * @throws IOException if the resource can not be read @@ -673,27 +535,22 @@ return detector; } - /** - * Returns the translator instance used by this facade. - * - * @since Tika 1.6 - * @return translator instance - */ - public Translator getTranslator() { - return translator; - } - //--------------------------------------------------------------< Object > public String toString() { String version = null; - try (InputStream stream = Tika.class.getResourceAsStream( - "/META-INF/maven/org.apache.tika/tika-core/pom.properties")) { + try { + InputStream stream = Tika.class.getResourceAsStream( + "/META-INF/maven/org.apache.tika/tika-core/pom.properties"); if (stream != null) { - Properties properties = new Properties(); - properties.load(stream); - version = properties.getProperty("version"); + try { + Properties properties = new Properties(); + properties.load(stream); + version = properties.getProperty("version"); + } finally { + stream.close(); + } } } catch (Exception ignore) { } diff --git a/tika-core/src/main/java/org/apache/tika/concurrent/ConfigurableThreadPoolExecutor.java b/tika-core/src/main/java/org/apache/tika/concurrent/ConfigurableThreadPoolExecutor.java deleted file mode 100644 index 86f74a7..0000000 --- a/tika-core/src/main/java/org/apache/tika/concurrent/ConfigurableThreadPoolExecutor.java +++ /dev/null @@ -1,32 +0,0 @@ -/* - * Licensed to the Apache Software Foundation (ASF) under one or more - * contributor license agreements. See the NOTICE file distributed with - * this work for additional information regarding copyright ownership. 
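The toString() hunk above reads the Maven pom.properties packaged next to the class to report a version string. The lookup itself is plain java.util.Properties; here is a sketch with the resource stream replaced by an in-memory one (getResourceAsStream may return null when the class is not loaded from a jar, hence the null check):

```java
import java.io.ByteArrayInputStream;
import java.io.IOException;
import java.io.InputStream;
import java.util.Properties;

public class VersionLookupDemo {

    // Same shape as the patched Tika#toString(): load the properties,
    // read the "version" key, close the stream in a finally block, and
    // fall back to null on any failure.
    public static String readVersion(InputStream stream) {
        String version = null;
        try {
            if (stream != null) {
                try {
                    Properties properties = new Properties();
                    properties.load(stream);
                    version = properties.getProperty("version");
                } finally {
                    stream.close();
                }
            }
        } catch (IOException ignore) {
        }
        return version;
    }

    public static void main(String[] args) {
        byte[] pom = "version=1.5\ngroupId=org.apache.tika\n".getBytes();
        System.out.println(readVersion(new ByteArrayInputStream(pom))); // prints 1.5
    }
}
```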
- * The ASF licenses this file to You under the Apache License, Version 2.0 - * (the "License"); you may not use this file except in compliance with - * the License. You may obtain a copy of the License at - * - * http://www.apache.org/licenses/LICENSE-2.0 - * - * Unless required by applicable law or agreed to in writing, software - * distributed under the License is distributed on an "AS IS" BASIS, - * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. - * See the License for the specific language governing permissions and - * limitations under the License. - */ -package org.apache.tika.concurrent; - -import java.util.concurrent.ExecutorService; - -/** - * Allows Thread Pool to be Configurable. - * - * @since Apache Tika 1.11 - */ -public interface ConfigurableThreadPoolExecutor extends ExecutorService { - - public void setMaximumPoolSize(int threads); - - public void setCorePoolSize(int threads); - -} diff --git a/tika-core/src/main/java/org/apache/tika/concurrent/SimpleThreadPoolExecutor.java b/tika-core/src/main/java/org/apache/tika/concurrent/SimpleThreadPoolExecutor.java deleted file mode 100644 index a7e443f..0000000 --- a/tika-core/src/main/java/org/apache/tika/concurrent/SimpleThreadPoolExecutor.java +++ /dev/null @@ -1,40 +0,0 @@ -/* - * Licensed to the Apache Software Foundation (ASF) under one or more - * contributor license agreements. See the NOTICE file distributed with - * this work for additional information regarding copyright ownership. - * The ASF licenses this file to You under the Apache License, Version 2.0 - * (the "License"); you may not use this file except in compliance with - * the License. You may obtain a copy of the License at - * - * http://www.apache.org/licenses/LICENSE-2.0 - * - * Unless required by applicable law or agreed to in writing, software - * distributed under the License is distributed on an "AS IS" BASIS, - * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. 
- * See the License for the specific language governing permissions and - * limitations under the License. - */ -package org.apache.tika.concurrent; - -import java.util.concurrent.LinkedBlockingQueue; -import java.util.concurrent.ThreadFactory; -import java.util.concurrent.ThreadPoolExecutor; -import java.util.concurrent.TimeUnit; - -/** - * Simple Thread Pool Executor - * - * @since Apache Tika 1.11 - */ -public class SimpleThreadPoolExecutor extends ThreadPoolExecutor implements ConfigurableThreadPoolExecutor { - - public SimpleThreadPoolExecutor() { - super(1, 2, 0L, TimeUnit.SECONDS, new LinkedBlockingQueue(), new ThreadFactory() { - - @Override - public Thread newThread(Runnable r) { - return new Thread(r, "Tika Executor Thread"); - } - }); - } -} diff --git a/tika-core/src/main/java/org/apache/tika/config/LoadErrorHandler.java b/tika-core/src/main/java/org/apache/tika/config/LoadErrorHandler.java index 77b234a..8dcb892 100644 --- a/tika-core/src/main/java/org/apache/tika/config/LoadErrorHandler.java +++ b/tika-core/src/main/java/org/apache/tika/config/LoadErrorHandler.java @@ -46,10 +46,6 @@ LoadErrorHandler IGNORE = new LoadErrorHandler() { public void handleLoadError(String classname, Throwable throwable) { } - @Override - public String toString() { - return "IGNORE"; - } }; /** @@ -60,10 +56,6 @@ public void handleLoadError(String classname, Throwable throwable) { Logger.getLogger(classname).log( Level.WARNING, "Unable to load " + classname, throwable); - } - @Override - public String toString() { - return "WARN"; } }; @@ -76,9 +68,6 @@ public void handleLoadError(String classname, Throwable throwable) { throw new RuntimeException("Unable to load " + classname, throwable); } - @Override - public String toString() { - return "THROW"; - } }; + } diff --git a/tika-core/src/main/java/org/apache/tika/config/ServiceLoader.java b/tika-core/src/main/java/org/apache/tika/config/ServiceLoader.java index f98117e..d3d556f 100644 --- 
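The deleted SimpleThreadPoolExecutor above is a thin wrapper over the stdlib ThreadPoolExecutor: one core thread, at most two, an unbounded queue, and a ThreadFactory that names its threads for easier debugging. A self-contained sketch of the same construction (the demo class and method names are illustrative; the pool parameters and thread name mirror the deleted code):

```java
import java.util.concurrent.Callable;
import java.util.concurrent.Future;
import java.util.concurrent.LinkedBlockingQueue;
import java.util.concurrent.ThreadFactory;
import java.util.concurrent.ThreadPoolExecutor;
import java.util.concurrent.TimeUnit;

public class NamedPoolDemo {

    // 1 core thread, 2 max, unbounded FIFO queue, named worker threads.
    public static ThreadPoolExecutor newExecutor() {
        return new ThreadPoolExecutor(1, 2, 0L, TimeUnit.SECONDS,
                new LinkedBlockingQueue<Runnable>(),
                new ThreadFactory() {
                    @Override
                    public Thread newThread(Runnable r) {
                        return new Thread(r, "Tika Executor Thread");
                    }
                });
    }

    // Submit a task and return the name of the thread that ran it.
    public static String runAndReportThreadName() {
        ThreadPoolExecutor executor = newExecutor();
        try {
            Future<String> name = executor.submit(new Callable<String>() {
                @Override
                public String call() {
                    return Thread.currentThread().getName();
                }
            });
            return name.get();
        } catch (Exception e) {
            throw new IllegalStateException(e);
        } finally {
            executor.shutdown();
        }
    }

    public static void main(String[] args) {
        System.out.println(runAndReportThreadName()); // prints "Tika Executor Thread"
    }
}
```

Naming pool threads costs nothing and makes thread dumps far easier to read, which is presumably why the deleted class bothered with a custom factory.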
a/tika-core/src/main/java/org/apache/tika/config/ServiceLoader.java +++ b/tika-core/src/main/java/org/apache/tika/config/ServiceLoader.java @@ -30,8 +30,6 @@ import java.util.Map; import java.util.regex.Pattern; -import static java.nio.charset.StandardCharsets.UTF_8; - /** * Internal utility class that Tika uses to look up service providers. * @@ -122,7 +120,7 @@ private final LoadErrorHandler handler; private final boolean dynamic; - + public ServiceLoader( ClassLoader loader, LoadErrorHandler handler, boolean dynamic) { this.loader = loader; @@ -135,23 +133,11 @@ } public ServiceLoader(ClassLoader loader) { - this(loader, Boolean.getBoolean("org.apache.tika.service.error.warn") - ? LoadErrorHandler.WARN:LoadErrorHandler.IGNORE); + this(loader, LoadErrorHandler.IGNORE); } public ServiceLoader() { - this(getContextClassLoader(), Boolean.getBoolean("org.apache.tika.service.error.warn") - ? LoadErrorHandler.WARN:LoadErrorHandler.IGNORE, true); - } - - /** - * Returns if the service loader is static or dynamic - * - * @return dynamic or static loading - * @since Apache Tika 1.10 - */ - public boolean isDynamic() { - return dynamic; + this(getContextClassLoader(), LoadErrorHandler.IGNORE, true); } /** @@ -184,10 +170,6 @@ /** * Loads and returns the named service class that's expected to implement * the given interface. - * - * Note that this class does not use the {@link LoadErrorHandler}, a - * {@link ClassNotFoundException} is always returned for unknown - * classes or classes of the wrong type * * @param iface service interface * @param name service class name @@ -224,7 +206,7 @@ * service files. */ public Enumeration findServiceResources(String filePattern) { - try { + try { Enumeration resources = loader.getResources(filePattern); return resources; } catch (IOException ignore) { @@ -278,19 +260,22 @@ } /** - * Returns the defined static service providers of the given type, without - * attempting to load them. 
+ * Returns the available static service providers of the given type. * The providers are loaded using the service provider mechanism using - * the configured class loader (if any). - * - * @since Apache Tika 1.6 + * the configured class loader (if any). The returned list is newly + * allocated and may be freely modified by the caller. + * + * @since Apache Tika 1.2 * @param iface service provider interface - * @return static list of uninitialised service providers - */ - protected List identifyStaticServiceProviders(Class iface) { - List names = new ArrayList(); + * @return static service providers + */ + @SuppressWarnings("unchecked") + public List loadStaticServiceProviders(Class iface) { + List providers = new ArrayList(); if (loader != null) { + List names = new ArrayList(); + String serviceName = iface.getName(); Enumeration resources = findServiceResources("META-INF/services/" + serviceName); @@ -301,27 +286,6 @@ handler.handleLoadError(serviceName, e); } } - } - - return names; - } - - /** - * Returns the available static service providers of the given type. - * The providers are loaded using the service provider mechanism using - * the configured class loader (if any). The returned list is newly - * allocated and may be freely modified by the caller. 
- * - * @since Apache Tika 1.2 - * @param iface service provider interface - * @return static service providers - */ - @SuppressWarnings("unchecked") - public List loadStaticServiceProviders(Class iface) { - List providers = new ArrayList(); - - if (loader != null) { - List names = identifyStaticServiceProviders(iface); for (String name : names) { try { @@ -334,6 +298,7 @@ } } } + return providers; } @@ -343,9 +308,10 @@ private void collectServiceClassNames(URL resource, Collection names) throws IOException { - try (InputStream stream = resource.openStream()) { + InputStream stream = resource.openStream(); + try { BufferedReader reader = - new BufferedReader(new InputStreamReader(stream, UTF_8)); + new BufferedReader(new InputStreamReader(stream, "UTF-8")); String line = reader.readLine(); while (line != null) { line = COMMENT.matcher(line).replaceFirst(""); @@ -355,6 +321,8 @@ } line = reader.readLine(); } + } finally { + stream.close(); } } diff --git a/tika-core/src/main/java/org/apache/tika/config/TikaConfig.java b/tika-core/src/main/java/org/apache/tika/config/TikaConfig.java index 17f36e0..c1dca42 100644 --- a/tika-core/src/main/java/org/apache/tika/config/TikaConfig.java +++ b/tika-core/src/main/java/org/apache/tika/config/TikaConfig.java @@ -17,35 +17,24 @@ package org.apache.tika.config; import java.io.File; +import java.io.FileInputStream; import java.io.IOException; import java.io.InputStream; -import java.lang.reflect.Constructor; -import java.lang.reflect.InvocationTargetException; import java.net.URL; -import java.nio.file.Files; -import java.nio.file.Path; -import java.nio.file.Paths; import java.util.ArrayList; -import java.util.Collection; -import java.util.Collections; import java.util.HashSet; import java.util.List; import java.util.Set; -import java.util.concurrent.ExecutorService; import javax.imageio.spi.ServiceRegistry; import javax.xml.parsers.DocumentBuilder; import javax.xml.parsers.DocumentBuilderFactory; import 
javax.xml.parsers.ParserConfigurationException; -import org.apache.tika.concurrent.ConfigurableThreadPoolExecutor; -import org.apache.tika.concurrent.SimpleThreadPoolExecutor; import org.apache.tika.detect.CompositeDetector; import org.apache.tika.detect.DefaultDetector; import org.apache.tika.detect.Detector; import org.apache.tika.exception.TikaException; -import org.apache.tika.language.translate.DefaultTranslator; -import org.apache.tika.language.translate.Translator; import org.apache.tika.mime.MediaType; import org.apache.tika.mime.MediaTypeRegistry; import org.apache.tika.mime.MimeTypeException; @@ -71,7 +60,7 @@ return MimeTypes.getDefaultMimeTypes(loader); } - protected static CompositeDetector getDefaultDetector( + private static Detector getDefaultDetector( MimeTypes types, ServiceLoader loader) { return new DefaultDetector(types, loader); } @@ -81,54 +70,27 @@ return new DefaultParser(types.getMediaTypeRegistry(), loader); } - private static Translator getDefaultTranslator(ServiceLoader loader) { - return new DefaultTranslator(loader); - } - - private static ConfigurableThreadPoolExecutor getDefaultExecutorService() { - return new SimpleThreadPoolExecutor(); - } - - private final ServiceLoader serviceLoader; private final CompositeParser parser; - private final CompositeDetector detector; - private final Translator translator; + private final Detector detector; private final MimeTypes mimeTypes; - private final ExecutorService executorService; public TikaConfig(String file) throws TikaException, IOException, SAXException { - this(Paths.get(file)); - } - - public TikaConfig(Path path) - throws TikaException, IOException, SAXException { - this(path, new ServiceLoader()); - } - public TikaConfig(Path path, ServiceLoader loader) - throws TikaException, IOException, SAXException { - this(getBuilder().parse(path.toFile()), loader); + this(new File(file)); } public TikaConfig(File file) throws TikaException, IOException, SAXException { - this(file, new 
ServiceLoader()); - } - public TikaConfig(File file, ServiceLoader loader) - throws TikaException, IOException, SAXException { - this(getBuilder().parse(file), loader); + this(getBuilder().parse(file)); } public TikaConfig(URL url) throws TikaException, IOException, SAXException { this(url, ServiceLoader.getContextClassLoader()); } + public TikaConfig(URL url, ClassLoader loader) - throws TikaException, IOException, SAXException { - this(getBuilder().parse(url.toString()).getDocumentElement(), loader); - } - public TikaConfig(URL url, ServiceLoader loader) throws TikaException, IOException, SAXException { this(getBuilder().parse(url.toString()).getDocumentElement(), loader); } @@ -141,32 +103,21 @@ public TikaConfig(Document document) throws TikaException, IOException { this(document.getDocumentElement()); } - public TikaConfig(Document document, ServiceLoader loader) throws TikaException, IOException { - this(document.getDocumentElement(), loader); - } public TikaConfig(Element element) throws TikaException, IOException { - this(element, serviceLoaderFromDomElement(element, null)); + this(element, new ServiceLoader()); } public TikaConfig(Element element, ClassLoader loader) throws TikaException, IOException { - this(element, serviceLoaderFromDomElement(element, loader)); + this(element, new ServiceLoader(loader)); } private TikaConfig(Element element, ServiceLoader loader) throws TikaException, IOException { - ParserXmlLoader parserLoader = new ParserXmlLoader(); - DetectorXmlLoader detectorLoader = new DetectorXmlLoader(); - TranslatorXmlLoader translatorLoader = new TranslatorXmlLoader(); - ExecutorServiceXmlLoader executorLoader = new ExecutorServiceXmlLoader(); - this.mimeTypes = typesFromDomElement(element); - this.detector = detectorLoader.loadOverall(element, mimeTypes, loader); - this.parser = parserLoader.loadOverall(element, mimeTypes, loader); - this.translator = translatorLoader.loadOverall(element, mimeTypes, loader); - this.executorService = 
executorLoader.loadOverall(element, mimeTypes, loader); - this.serviceLoader = loader; + this.detector = detectorFromDomElement(element, mimeTypes, loader); + this.parser = parserFromDomElement(element, mimeTypes, loader); } /** @@ -183,12 +134,10 @@ */ public TikaConfig(ClassLoader loader) throws MimeTypeException, IOException { - this.serviceLoader = new ServiceLoader(loader); + ServiceLoader serviceLoader = new ServiceLoader(loader); this.mimeTypes = getDefaultMimeTypes(loader); this.detector = getDefaultDetector(mimeTypes, serviceLoader); this.parser = getDefaultParser(mimeTypes, serviceLoader); - this.translator = getDefaultTranslator(serviceLoader); - this.executorService = getDefaultExecutorService(); } /** @@ -209,7 +158,7 @@ * @throws TikaException if problem with MimeTypes or parsing XML config */ public TikaConfig() throws TikaException, IOException { - this.serviceLoader = new ServiceLoader(); + ServiceLoader loader = new ServiceLoader(); String config = System.getProperty("tika.config"); if (config == null) { @@ -218,52 +167,45 @@ if (config == null) { this.mimeTypes = getDefaultMimeTypes(ServiceLoader.getContextClassLoader()); - this.parser = getDefaultParser(mimeTypes, serviceLoader); - this.detector = getDefaultDetector(mimeTypes, serviceLoader); - this.translator = getDefaultTranslator(serviceLoader); - this.executorService = getDefaultExecutorService(); + this.parser = getDefaultParser(mimeTypes, loader); + this.detector = getDefaultDetector(mimeTypes, loader); } else { - try (InputStream stream = getConfigInputStream(config, serviceLoader)) { - Element element = getBuilder().parse(stream).getDocumentElement(); - ParserXmlLoader parserLoader = new ParserXmlLoader(); - DetectorXmlLoader detectorLoader = new DetectorXmlLoader(); - TranslatorXmlLoader translatorLoader = new TranslatorXmlLoader(); - ExecutorServiceXmlLoader executorLoader = new ExecutorServiceXmlLoader(); - + // Locate the given configuration file + InputStream stream = null; + File 
file = new File(config); + if (file.isFile()) { + stream = new FileInputStream(file); + } + if (stream == null) { + try { + stream = new URL(config).openStream(); + } catch (IOException ignore) { + } + } + if (stream == null) { + stream = loader.getResourceAsStream(config); + } + if (stream == null) { + throw new TikaException( + "Specified Tika configuration not found: " + config); + } + + try { + Element element = + getBuilder().parse(stream).getDocumentElement(); this.mimeTypes = typesFromDomElement(element); - this.parser = parserLoader.loadOverall(element, mimeTypes, serviceLoader); - this.detector = detectorLoader.loadOverall(element, mimeTypes, serviceLoader); - this.translator = translatorLoader.loadOverall(element, mimeTypes, serviceLoader); - this.executorService = executorLoader.loadOverall(element, mimeTypes, serviceLoader); + this.parser = + parserFromDomElement(element, mimeTypes, loader); + this.detector = + detectorFromDomElement(element, mimeTypes, loader); } catch (SAXException e) { throw new TikaException( "Specified Tika configuration has syntax errors: " + config, e); - } - } - } - - private static InputStream getConfigInputStream(String config, ServiceLoader serviceLoader) - throws TikaException, IOException { - InputStream stream = null; - try { - stream = new URL(config).openStream(); - } catch (IOException ignore) { - } - if (stream == null) { - stream = serviceLoader.getResourceAsStream(config); - } - if (stream == null) { - Path file = Paths.get(config); - if (Files.isRegularFile(file)) { - stream = Files.newInputStream(file); - } - } - if (stream == null) { - throw new TikaException( - "Specified Tika configuration not found: " + config); - } - return stream; + } finally { + stream.close(); + } + } } private static String getText(Node node) { @@ -306,29 +248,12 @@ return detector; } - /** - * Returns the configured translator instance. 
- * - * @return configured translator - */ - public Translator getTranslator() { - return translator; - } - - public ExecutorService getExecutorService() { - return executorService; - } - public MimeTypes getMimeRepository(){ return mimeTypes; } public MediaTypeRegistry getMediaTypeRegistry() { return mimeTypes.getMediaTypeRegistry(); - } - - public ServiceLoader getServiceLoader() { - return serviceLoader; } /** @@ -369,42 +294,6 @@ } return null; } - private static List getTopLevelElementChildren(Element element, - String parentName, String childrenName) throws TikaException { - Node parentNode = null; - if (parentName != null) { - // Should be only zero or one / etc tag - NodeList nodes = element.getElementsByTagName(parentName); - if (nodes.getLength() > 1) { - throw new TikaException("Properties may not contain multiple "+parentName+" entries"); - } - else if (nodes.getLength() == 1) { - parentNode = nodes.item(0); - } - } else { - // All children directly on the master element - parentNode = element; - } - - if (parentNode != null) { - // Find only the direct child parser/detector objects - NodeList nodes = parentNode.getChildNodes(); - List elements = new ArrayList(); - for (int i = 0; i < nodes.getLength(); i++) { - Node node = nodes.item(i); - if (node instanceof Element) { - Element nodeE = (Element)node; - if (childrenName.equals(nodeE.getTagName())) { - elements.add(nodeE); - } - } - } - return elements; - } else { - // No elements of this type - return Collections.emptyList(); - } - } private static MimeTypes typesFromDomElement(Element element) throws TikaException, IOException { @@ -415,481 +304,92 @@ return getDefaultMimeTypes(null); } } - - private static Set mediaTypesListFromDomElement( - Element node, String tag) + + private static CompositeParser parserFromDomElement( + Element element, MimeTypes mimeTypes, ServiceLoader loader) throws TikaException, IOException { - Set types = null; - NodeList children = node.getChildNodes(); - for (int i=0; 
i(); - types.add(type); - } else { - throw new TikaException( - "Invalid media type name: " + mime); - } + List parsers = new ArrayList(); + NodeList nodes = element.getElementsByTagName("parser"); + for (int i = 0; i < nodes.getLength(); i++) { + Element node = (Element) nodes.item(i); + String name = node.getAttribute("class"); + + try { + Class parserClass = + loader.getServiceClass(Parser.class, name); + // https://issues.apache.org/jira/browse/TIKA-866 + if (AutoDetectParser.class.isAssignableFrom(parserClass)) { + throw new TikaException( + "AutoDetectParser not supported in a " + + " configuration element: " + name); } - } - } - if (types != null) return types; - return Collections.emptySet(); - } - - private static ServiceLoader serviceLoaderFromDomElement(Element element, ClassLoader loader) { - Element serviceLoaderElement = getChild(element, "service-loader"); - ServiceLoader serviceLoader; - if (serviceLoaderElement != null) { - boolean dynamic = Boolean.parseBoolean(serviceLoaderElement.getAttribute("dynamic")); - LoadErrorHandler loadErrorHandler = LoadErrorHandler.IGNORE; - String loadErrorHandleConfig = serviceLoaderElement.getAttribute("loadErrorHandler"); - if(LoadErrorHandler.WARN.toString().equalsIgnoreCase(loadErrorHandleConfig)) { - loadErrorHandler = LoadErrorHandler.WARN; - } else if(LoadErrorHandler.THROW.toString().equalsIgnoreCase(loadErrorHandleConfig)) { - loadErrorHandler = LoadErrorHandler.THROW; - } - - serviceLoader = new ServiceLoader(loader, loadErrorHandler, dynamic); - } else if(loader != null) { - serviceLoader = new ServiceLoader(loader); - } else { - serviceLoader = new ServiceLoader(); - } - return serviceLoader; - } - - private static abstract class XmlLoader { - abstract boolean supportsComposite(); - abstract String getParentTagName(); // eg parsers - abstract String getLoaderTagName(); // eg parser - abstract Class getLoaderClass(); // Generics workaround - abstract boolean isComposite(T loaded); - abstract boolean 
isComposite(Class loadedClass); - abstract T preLoadOne(Class loadedClass, String classname, - MimeTypes mimeTypes) throws TikaException; - abstract CT createDefault(MimeTypes mimeTypes, ServiceLoader loader); - abstract CT createComposite(List loaded, MimeTypes mimeTypes, ServiceLoader loader); - abstract T createComposite(Class compositeClass, - List children, Set> excludeChildren, - MimeTypes mimeTypes, ServiceLoader loader) - throws InvocationTargetException, IllegalAccessException, InstantiationException; - abstract T decorate(T created, Element element) - throws IOException, TikaException; // eg explicit mime types - - @SuppressWarnings("unchecked") - CT loadOverall(Element element, MimeTypes mimeTypes, - ServiceLoader loader) throws TikaException, IOException { - List loaded = new ArrayList(); - - // Find the children of the parent tag, if any - for (Element le : getTopLevelElementChildren(element, getParentTagName(), getLoaderTagName())) { - T loadedChild = loadOne(le, mimeTypes, loader); - if (loadedChild != null) loaded.add(loadedChild); - } - - // Build the classes, and wrap as needed - if (loaded.isEmpty()) { - // Nothing defined, create a Default - return createDefault(mimeTypes, loader); - } else if (loaded.size() == 1) { - T single = loaded.get(0); - if (isComposite(single)) { - // Single Composite defined, use that - return (CT)single; - } - } else if (! 
supportsComposite()) { - // No composite support, just return the first one - return (CT)loaded.get(0); - } - // Wrap the defined parsers/detectors up in a Composite - return createComposite(loaded, mimeTypes, loader); - } - T loadOne(Element element, MimeTypes mimeTypes, ServiceLoader loader) - throws TikaException, IOException { - String name = element.getAttribute("class"); - T loaded = null; - - try { - Class loadedClass = - loader.getServiceClass(getLoaderClass(), name); - - // Do pre-load checks and short-circuits - loaded = preLoadOne(loadedClass, name, mimeTypes); - if (loaded != null) return loaded; - - // Is this a composite or decorated class? If so, support recursion - if (isComposite(loadedClass)) { - // Get the child objects for it - List children = new ArrayList(); - NodeList childNodes = element.getElementsByTagName(getLoaderTagName()); - if (childNodes.getLength() > 0) { - for (int i = 0; i < childNodes.getLength(); i++) { - T loadedChild = loadOne((Element)childNodes.item(i), - mimeTypes, loader); - if (loadedChild != null) children.add(loadedChild); + Parser parser = parserClass.newInstance(); + + NodeList mimes = node.getElementsByTagName("mime"); + if (mimes.getLength() > 0) { + Set types = new HashSet(); + for (int j = 0; j < mimes.getLength(); j++) { + String mime = getText(mimes.item(j)); + MediaType type = MediaType.parse(mime); + if (type != null) { + types.add(type); + } else { + throw new TikaException( + "Invalid media type name: " + mime); } } - - // Get the list of children to exclude - Set> excludeChildren = new HashSet>(); - NodeList excludeChildNodes = element.getElementsByTagName(getLoaderTagName()+"-exclude"); - if (excludeChildNodes.getLength() > 0) { - for (int i = 0; i < excludeChildNodes.getLength(); i++) { - Element excl = (Element)excludeChildNodes.item(i); - String exclName = excl.getAttribute("class"); - excludeChildren.add(loader.getServiceClass(getLoaderClass(), exclName)); - } - } - - // Create the Composite - loaded = 
createComposite(loadedClass, children, excludeChildren, mimeTypes, loader); - - // Default constructor fallback - if (loaded == null) { - loaded = loadedClass.newInstance(); - } - } else { - // Regular class, create as-is - loaded = loadedClass.newInstance(); - // TODO Support arguments, needed for Translators etc - // See the thread "Configuring parsers and translators" for details + parser = ParserDecorator.withTypes(parser, types); } - - // Have any decoration performed, eg explicit mimetypes - loaded = decorate(loaded, element); - - // All done with setup - return loaded; + + parsers.add(parser); } catch (ClassNotFoundException e) { - if (loader.getLoadErrorHandler() == LoadErrorHandler.THROW) { - // Use a different exception signature here - throw new TikaException( - "Unable to find a "+getLoaderTagName()+" class: " + name, e); - } - // Report the problem - loader.getLoadErrorHandler().handleLoadError(name, e); - return null; + throw new TikaException( + "Unable to find a parser class: " + name, e); } catch (IllegalAccessException e) { throw new TikaException( - "Unable to access a "+getLoaderTagName()+" class: " + name, e); - } catch (InvocationTargetException e) { - throw new TikaException( - "Unable to create a "+getLoaderTagName()+" class: " + name, e); + "Unable to access a parser class: " + name, e); } catch (InstantiationException e) { throw new TikaException( - "Unable to instantiate a "+getLoaderTagName()+" class: " + name, e); - } - } - } - private static class ParserXmlLoader extends XmlLoader { - boolean supportsComposite() { return true; } - String getParentTagName() { return "parsers"; } - String getLoaderTagName() { return "parser"; } - - @Override - Class getLoaderClass() { - return Parser.class; - } - @Override - Parser preLoadOne(Class loadedClass, String classname, - MimeTypes mimeTypes) throws TikaException { - // Check for classes which can't be set in config - if (AutoDetectParser.class.isAssignableFrom(loadedClass)) { - // 
https://issues.apache.org/jira/browse/TIKA-866 - throw new TikaException( - "AutoDetectParser not supported in a " - + " configuration element: " + classname); - } - // Continue with normal loading - return null; - } - @Override - boolean isComposite(Parser loaded) { - return loaded instanceof CompositeParser; - } - @Override - boolean isComposite(Class loadedClass) { - if (CompositeParser.class.isAssignableFrom(loadedClass) || - ParserDecorator.class.isAssignableFrom(loadedClass)) { - return true; - } - return false; - } - @Override - CompositeParser createDefault(MimeTypes mimeTypes, ServiceLoader loader) { + "Unable to instantiate a parser class: " + name, e); + } + } + if (parsers.isEmpty()) { return getDefaultParser(mimeTypes, loader); - } - @Override - CompositeParser createComposite(List parsers, MimeTypes mimeTypes, ServiceLoader loader) { + } else { MediaTypeRegistry registry = mimeTypes.getMediaTypeRegistry(); return new CompositeParser(registry, parsers); } - @Override - Parser createComposite(Class parserClass, - List childParsers, Set> excludeParsers, - MimeTypes mimeTypes, ServiceLoader loader) - throws InvocationTargetException, IllegalAccessException, InstantiationException { - Parser parser = null; - Constructor c = null; - MediaTypeRegistry registry = mimeTypes.getMediaTypeRegistry(); - - // Try the possible default and composite parser constructors - if (parser == null) { - try { - c = parserClass.getConstructor(MediaTypeRegistry.class, ServiceLoader.class, Collection.class); - parser = c.newInstance(registry, loader, excludeParsers); - } - catch (NoSuchMethodException me) {} - } - if (parser == null) { - try { - c = parserClass.getConstructor(MediaTypeRegistry.class, List.class, Collection.class); - parser = c.newInstance(registry, childParsers, excludeParsers); - } catch (NoSuchMethodException me) {} - } - if (parser == null) { - try { - c = parserClass.getConstructor(MediaTypeRegistry.class, List.class); - parser = c.newInstance(registry, 
childParsers); - } catch (NoSuchMethodException me) {} - } - - // Create as a Parser Decorator - if (parser == null && ParserDecorator.class.isAssignableFrom(parserClass)) { - try { - CompositeParser cp = null; - if (childParsers.size() == 1 && excludeParsers.size() == 0 && - childParsers.get(0) instanceof CompositeParser) { - cp = (CompositeParser)childParsers.get(0); - } else { - cp = new CompositeParser(registry, childParsers, excludeParsers); - } - c = parserClass.getConstructor(Parser.class); - parser = c.newInstance(cp); - } catch (NoSuchMethodException me) {} - } - return parser; - } - @Override - Parser decorate(Parser created, Element element) throws IOException, TikaException { - Parser parser = created; - - // Is there an explicit list of mime types for this to handle? - Set parserTypes = mediaTypesListFromDomElement(element, "mime"); - if (! parserTypes.isEmpty()) { - parser = ParserDecorator.withTypes(parser, parserTypes); - } - // Is there an explicit list of mime types this shouldn't handle? - Set parserExclTypes = mediaTypesListFromDomElement(element, "mime-exclude"); - if (! parserExclTypes.isEmpty()) { - parser = ParserDecorator.withoutTypes(parser, parserExclTypes); - } - - // All done with decoration - return parser; - } - } - private static class DetectorXmlLoader extends XmlLoader { - boolean supportsComposite() { return true; } - String getParentTagName() { return "detectors"; } - String getLoaderTagName() { return "detector"; } - - @Override - Class getLoaderClass() { - return Detector.class; - } - @Override - Detector preLoadOne(Class loadedClass, String classname, - MimeTypes mimeTypes) throws TikaException { - // If they asked for the mime types as a detector, give - // them the one we've already created. 
TIKA-1708 - if (MimeTypes.class.equals(loadedClass)) { - return mimeTypes; - } - // Continue with normal loading - return null; - } - @Override - boolean isComposite(Detector loaded) { - return loaded instanceof CompositeDetector; - } - @Override - boolean isComposite(Class loadedClass) { - return CompositeDetector.class.isAssignableFrom(loadedClass); - } - @Override - CompositeDetector createDefault(MimeTypes mimeTypes, ServiceLoader loader) { - return getDefaultDetector(mimeTypes, loader); - } - @Override - CompositeDetector createComposite(List detectors, MimeTypes mimeTypes, ServiceLoader loader) { - MediaTypeRegistry registry = mimeTypes.getMediaTypeRegistry(); - return new CompositeDetector(registry, detectors); - } - @Override - Detector createComposite(Class detectorClass, - List childDetectors, - Set> excludeDetectors, - MimeTypes mimeTypes, ServiceLoader loader) - throws InvocationTargetException, IllegalAccessException, - InstantiationException { - Detector detector = null; - Constructor c; - MediaTypeRegistry registry = mimeTypes.getMediaTypeRegistry(); - - // Try the possible default and composite detector constructors - if (detector == null) { - try { - c = detectorClass.getConstructor(MimeTypes.class, ServiceLoader.class, Collection.class); - detector = c.newInstance(mimeTypes, loader, excludeDetectors); - } - catch (NoSuchMethodException me) {} - } - if (detector == null) { - try { - c = detectorClass.getConstructor(MediaTypeRegistry.class, List.class, Collection.class); - detector = c.newInstance(registry, childDetectors, excludeDetectors); - } catch (NoSuchMethodException me) {} - } - if (detector == null) { - try { - c = detectorClass.getConstructor(MediaTypeRegistry.class, List.class); - detector = c.newInstance(registry, childDetectors); - } catch (NoSuchMethodException me) {} - } - if (detector == null) { - try { - c = detectorClass.getConstructor(List.class); - detector = c.newInstance(childDetectors); - } catch (NoSuchMethodException me) {} 
- } - - return detector; - } - @Override - Detector decorate(Detector created, Element element) { - return created; // No decoration of Detectors - } - } - private static class TranslatorXmlLoader extends XmlLoader { - boolean supportsComposite() { return false; } - String getParentTagName() { return null; } - String getLoaderTagName() { return "translator"; } - - @Override - Class getLoaderClass() { - return Translator.class; - } - @Override - Translator preLoadOne(Class loadedClass, String classname, - MimeTypes mimeTypes) throws TikaException { - // Continue with normal loading - return null; - } - @Override - boolean isComposite(Translator loaded) { return false; } - @Override - boolean isComposite(Class loadedClass) { return false; } - @Override - Translator createDefault(MimeTypes mimeTypes, ServiceLoader loader) { - return getDefaultTranslator(loader); - } - @Override - Translator createComposite(List loaded, - MimeTypes mimeTypes, ServiceLoader loader) { - return loaded.get(0); - } - @Override - Translator createComposite(Class compositeClass, - List children, - Set> excludeChildren, - MimeTypes mimeTypes, ServiceLoader loader) - throws InvocationTargetException, IllegalAccessException, - InstantiationException { - throw new InstantiationException("Only one translator supported"); - } - @Override - Translator decorate(Translator created, Element element) { - return created; // No decoration of Translators - } - } - - private static class ExecutorServiceXmlLoader extends XmlLoader { - @Override - ConfigurableThreadPoolExecutor createComposite( - Class compositeClass, - List children, - Set> excludeChildren, - MimeTypes mimeTypes, ServiceLoader loader) - throws InvocationTargetException, IllegalAccessException, - InstantiationException { - throw new InstantiationException("Only one executor service supported"); - } - - @Override - ConfigurableThreadPoolExecutor createComposite(List loaded, - MimeTypes mimeTypes, ServiceLoader loader) { - return loaded.get(0); 
- } - - @Override - ConfigurableThreadPoolExecutor createDefault(MimeTypes mimeTypes, ServiceLoader loader) { - return getDefaultExecutorService(); - } - - @Override - ConfigurableThreadPoolExecutor decorate(ConfigurableThreadPoolExecutor created, Element element) - throws IOException, TikaException { - Element coreThreadElement = getChild(element, "core-threads"); - if(coreThreadElement != null) - { - created.setCorePoolSize(Integer.parseInt(getText(coreThreadElement))); - } - Element maxThreadElement = getChild(element, "max-threads"); - if(maxThreadElement != null) - { - created.setMaximumPoolSize(Integer.parseInt(getText(maxThreadElement))); - } - return created; - } - - @Override - Class getLoaderClass() { - return ConfigurableThreadPoolExecutor.class; - } - - @Override - ConfigurableThreadPoolExecutor loadOne(Element element, MimeTypes mimeTypes, - ServiceLoader loader) throws TikaException, IOException { - return super.loadOne(element, mimeTypes, loader); - } - - @Override - boolean supportsComposite() {return false;} - - @Override - String getParentTagName() {return null;} - - @Override - String getLoaderTagName() {return "executor-service";} - - @Override - boolean isComposite(ConfigurableThreadPoolExecutor loaded) {return false;} - - @Override - boolean isComposite(Class loadedClass) {return false;} - - @Override - ConfigurableThreadPoolExecutor preLoadOne( - Class loadedClass, String classname, - MimeTypes mimeTypes) throws TikaException { - return null; - } + } + + private static Detector detectorFromDomElement( + Element element, MimeTypes mimeTypes, ServiceLoader loader) + throws TikaException, IOException { + List detectors = new ArrayList(); + NodeList nodes = element.getElementsByTagName("detector"); + for (int i = 0; i < nodes.getLength(); i++) { + Element node = (Element) nodes.item(i); + String name = node.getAttribute("class"); + + try { + Class detectorClass = + loader.getServiceClass(Detector.class, name); + 
detectors.add(detectorClass.newInstance()); + } catch (ClassNotFoundException e) { + throw new TikaException( + "Unable to find a detector class: " + name, e); + } catch (IllegalAccessException e) { + throw new TikaException( + "Unable to access a detector class: " + name, e); + } catch (InstantiationException e) { + throw new TikaException( + "Unable to instantiate a detector class: " + name, e); + } + } + if (detectors.isEmpty()) { + return getDefaultDetector(mimeTypes, loader); + } else { + MediaTypeRegistry registry = mimeTypes.getMediaTypeRegistry(); + return new CompositeDetector(registry, detectors); + } } } diff --git a/tika-core/src/main/java/org/apache/tika/detect/CompositeDetector.java b/tika-core/src/main/java/org/apache/tika/detect/CompositeDetector.java index c77a04a..4ea245e 100644 --- a/tika-core/src/main/java/org/apache/tika/detect/CompositeDetector.java +++ b/tika-core/src/main/java/org/apache/tika/detect/CompositeDetector.java @@ -18,9 +18,7 @@ import java.io.IOException; import java.io.InputStream; -import java.util.ArrayList; import java.util.Arrays; -import java.util.Collection; import java.util.Collections; import java.util.List; @@ -42,24 +40,10 @@ private final List detectors; - public CompositeDetector(MediaTypeRegistry registry, - List detectors, Collection> excludeDetectors) { - if (excludeDetectors == null || excludeDetectors.isEmpty()) { - this.detectors = detectors; - } else { - this.detectors = new ArrayList(); - for (Detector d : detectors) { - if (!isExcluded(excludeDetectors, d.getClass())) { - this.detectors.add(d); - } - } - } + public CompositeDetector( + MediaTypeRegistry registry, List detectors) { this.registry = registry; - } - - public CompositeDetector(MediaTypeRegistry registry, - List detectors) { - this(registry, detectors, null); + this.detectors = detectors; } public CompositeDetector(List detectors) { @@ -88,14 +72,4 @@ public List getDetectors() { return Collections.unmodifiableList(detectors); } - - private 
boolean isExcluded(Collection<Class<? extends Detector>> excludeDetectors, Class<? extends Detector> d) { - return excludeDetectors.contains(d) || assignableFrom(excludeDetectors, d); - } - private boolean assignableFrom(Collection<Class<? extends Detector>> excludeDetectors, Class<? extends Detector> d) { - for (Class<? extends Detector> e : excludeDetectors) { - if (e.isAssignableFrom(d)) return true; - } - return false; - } } diff --git a/tika-core/src/main/java/org/apache/tika/detect/DefaultDetector.java b/tika-core/src/main/java/org/apache/tika/detect/DefaultDetector.java index ab6e97c..c16d7bb 100644 --- a/tika-core/src/main/java/org/apache/tika/detect/DefaultDetector.java +++ b/tika-core/src/main/java/org/apache/tika/detect/DefaultDetector.java @@ -16,14 +16,14 @@ */ package org.apache.tika.detect; -import java.util.Collection; +import java.util.Collections; +import java.util.Comparator; import java.util.List; import javax.imageio.spi.ServiceRegistry; import org.apache.tika.config.ServiceLoader; import org.apache.tika.mime.MimeTypes; -import org.apache.tika.utils.ServiceLoaderUtils; /** * A composite detector based on all the {@link Detector} implementations @@ -53,9 +53,23 @@ */ private static List<Detector> getDefaultDetectors( MimeTypes types, ServiceLoader loader) { - List<Detector> detectors = loader.loadStaticServiceProviders(Detector.class); - ServiceLoaderUtils.sortLoadedClasses(detectors); - + List<Detector> detectors = + loader.loadStaticServiceProviders(Detector.class); + Collections.sort(detectors, new Comparator<Detector>() { + public int compare(Detector d1, Detector d2) { + String n1 = d1.getClass().getName(); + String n2 = d2.getClass().getName(); + boolean t1 = n1.startsWith("org.apache.tika."); + boolean t2 = n2.startsWith("org.apache.tika."); + if (t1 == t2) { + return n1.compareTo(n2); + } else if (t1) { + return 1; + } else { + return -1; + } + } + }); // Finally the Tika MimeTypes as a fallback detectors.add(types); return detectors; @@ -63,14 +77,9 @@ private transient final ServiceLoader loader; - public DefaultDetector(MimeTypes types, ServiceLoader loader, - Collection<Class<? extends Detector>>
excludeDetectors) { - super(types.getMediaTypeRegistry(), getDefaultDetectors(types, loader), excludeDetectors); + public DefaultDetector(MimeTypes types, ServiceLoader loader) { + super(types.getMediaTypeRegistry(), getDefaultDetectors(types, loader)); this.loader = loader; - } - - public DefaultDetector(MimeTypes types, ServiceLoader loader) { - this(types, loader, null); } public DefaultDetector(MimeTypes types, ClassLoader loader) { diff --git a/tika-core/src/main/java/org/apache/tika/detect/DefaultProbDetector.java b/tika-core/src/main/java/org/apache/tika/detect/DefaultProbDetector.java deleted file mode 100644 index 5a7b60a..0000000 --- a/tika-core/src/main/java/org/apache/tika/detect/DefaultProbDetector.java +++ /dev/null @@ -1,80 +0,0 @@ -/* - * Licensed to the Apache Software Foundation (ASF) under one or more - * contributor license agreements. See the NOTICE file distributed with - * this work for additional information regarding copyright ownership. - * The ASF licenses this file to You under the Apache License, Version 2.0 - * (the "License"); you may not use this file except in compliance with - * the License. You may obtain a copy of the License at - * - * http://www.apache.org/licenses/LICENSE-2.0 - * - * Unless required by applicable law or agreed to in writing, software - * distributed under the License is distributed on an "AS IS" BASIS, - * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. - * See the License for the specific language governing permissions and - * limitations under the License. 
- */ -package org.apache.tika.detect; - -import java.util.List; - -import org.apache.tika.config.ServiceLoader; -import org.apache.tika.mime.MimeTypes; -import org.apache.tika.mime.ProbabilisticMimeDetectionSelector; -import org.apache.tika.utils.ServiceLoaderUtils; - -/** - * A version of {@link DefaultDetector} for probabilistic mime - * detectors, which use statistical techniques to blend the - * results of differing underlying detectors when attempting - * to detect the type of a given file. - * TODO Link to documentation on configuring these probabilities - */ -public class DefaultProbDetector extends CompositeDetector { - private static final long serialVersionUID = -8836240060532323352L; - - private static List<Detector> getDefaultDetectors( - ProbabilisticMimeDetectionSelector sel, ServiceLoader loader) { - List<Detector> detectors = loader.loadStaticServiceProviders(Detector.class); - ServiceLoaderUtils.sortLoadedClasses(detectors); - detectors.add(sel); - return detectors; - } - - private transient final ServiceLoader loader; - - public DefaultProbDetector(ProbabilisticMimeDetectionSelector sel, - ServiceLoader loader) { - super(sel.getMediaTypeRegistry(), getDefaultDetectors(sel, loader)); - this.loader = loader; - } - - public DefaultProbDetector(ProbabilisticMimeDetectionSelector sel, - ClassLoader loader) { - this(sel, new ServiceLoader(loader)); - } - - public DefaultProbDetector(ClassLoader loader) { - this(new ProbabilisticMimeDetectionSelector(), loader); - } - - public DefaultProbDetector(MimeTypes types) { - this(new ProbabilisticMimeDetectionSelector(types), new ServiceLoader()); - } - - public DefaultProbDetector() { - this(MimeTypes.getDefaultMimeTypes()); - } - - @Override - public List<Detector> getDetectors() { - if (loader != null) { - List<Detector> detectors = loader - .loadDynamicServiceProviders(Detector.class); - detectors.addAll(super.getDetectors()); - return detectors; - } else { - return super.getDetectors(); - } - } -} diff --git
a/tika-core/src/main/java/org/apache/tika/detect/MagicDetector.java b/tika-core/src/main/java/org/apache/tika/detect/MagicDetector.java index 0574d29..d59715d 100644 --- a/tika-core/src/main/java/org/apache/tika/detect/MagicDetector.java +++ b/tika-core/src/main/java/org/apache/tika/detect/MagicDetector.java @@ -21,14 +21,12 @@ import java.io.InputStream; import java.nio.ByteBuffer; import java.nio.CharBuffer; -import java.util.Locale; +import java.nio.charset.Charset; import java.util.regex.Matcher; import java.util.regex.Pattern; + import org.apache.tika.metadata.Metadata; import org.apache.tika.mime.MediaType; - -import static java.nio.charset.StandardCharsets.ISO_8859_1; -import static java.nio.charset.StandardCharsets.UTF_8; /** * Content type detection based on magic bytes, i.e. type-specific patterns @@ -41,6 +39,8 @@ * @since Apache Tika 0.3 */ public class MagicDetector implements Detector { + + private static final Charset ISO_8859_1 = Charset.forName("ISO-8859-1"); public static MagicDetector parse( MediaType mediaType, @@ -95,9 +95,9 @@ || type.equals("unicodeBE")) { decoded = decodeString(value, type); } else if (type.equals("stringignorecase")) { - decoded = decodeString(value.toLowerCase(Locale.ROOT), type); + decoded = decodeString(value.toLowerCase(), type); } else if (type.equals("byte")) { - decoded = tmpVal.getBytes(UTF_8); + decoded = tmpVal.getBytes(); } else if (type.equals("host16") || type.equals("little16")) { int i = Integer.parseInt(tmpVal, radix); decoded = new byte[] { (byte) (i & 0x00FF), (byte) (i >> 8) }; @@ -393,7 +393,7 @@ flags = Pattern.CASE_INSENSITIVE; } - Pattern p = Pattern.compile(new String(this.pattern, UTF_8), flags); + Pattern p = Pattern.compile(new String(this.pattern), flags); ByteBuffer bb = ByteBuffer.wrap(buffer); CharBuffer result = ISO_8859_1.decode(bb); diff --git a/tika-core/src/main/java/org/apache/tika/detect/NNExampleModelDetector.java 
b/tika-core/src/main/java/org/apache/tika/detect/NNExampleModelDetector.java deleted file mode 100644 index 0274d50..0000000 --- a/tika-core/src/main/java/org/apache/tika/detect/NNExampleModelDetector.java +++ /dev/null @@ -1,160 +0,0 @@ -/* - * Licensed to the Apache Software Foundation (ASF) under one or more - * contributor license agreements. See the NOTICE file distributed with - * this work for additional information regarding copyright ownership. - * The ASF licenses this file to You under the Apache License, Version 2.0 - * (the "License"); you may not use this file except in compliance with - * the License. You may obtain a copy of the License at - * - * http://www.apache.org/licenses/LICENSE-2.0 - * - * Unless required by applicable law or agreed to in writing, software - * distributed under the License is distributed on an "AS IS" BASIS, - * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. - * See the License for the specific language governing permissions and - * limitations under the License. 
- */ - -package org.apache.tika.detect; - -import java.io.BufferedReader; -import java.io.File; -import java.io.IOException; -import java.io.InputStream; -import java.io.InputStreamReader; -import java.net.URL; -import java.nio.file.Path; -import java.util.Objects; -import java.util.logging.Level; -import java.util.logging.Logger; - -import org.apache.tika.mime.MediaType; - -import static java.nio.charset.StandardCharsets.UTF_8; - -public class NNExampleModelDetector extends TrainedModelDetector { - private static final String EXAMPLE_NNMODEL_FILE = "tika-example.nnmodel"; - - private static final long serialVersionUID = 1L; - - private static final Logger log = Logger.getLogger(NNExampleModelDetector.class.getName()); - - public NNExampleModelDetector() { - super(); - } - - public NNExampleModelDetector(final Path modelFile) { - loadDefaultModels(modelFile); - } - - public NNExampleModelDetector(final File modelFile) { - loadDefaultModels(modelFile); - } - - @Override - public void loadDefaultModels(InputStream modelStream) { - BufferedReader bReader = new BufferedReader(new InputStreamReader(modelStream, UTF_8)); - - NNTrainedModelBuilder nnBuilder = new NNTrainedModelBuilder(); - String line; - try { - while ((line = bReader.readLine()) != null) { - line = line.trim(); - if (line.startsWith("#")) { - readDescription(nnBuilder, line); - } else { - readNNParams(nnBuilder, line); - // add this model into map of trained models. 
- super.registerModels(nnBuilder.getType(), nnBuilder.build()); - } - - } - } catch (IOException e) { - throw new RuntimeException("Unable to read the default media type registry", e); - } - } - - /** - * this method gets overwritten to register load neural network models - */ - @Override - public void loadDefaultModels(ClassLoader classLoader) { - if (classLoader == null) { - classLoader = TrainedModelDetector.class.getClassLoader(); - } - - // This allows us to replicate class.getResource() when using - // the classloader directly - String classPrefix = TrainedModelDetector.class.getPackage().getName() - .replace('.', '/') - + "/"; - - // Get the core URL, and all the extensions URLs - URL modelURL = classLoader.getResource(classPrefix + EXAMPLE_NNMODEL_FILE); - Objects.requireNonNull(modelURL, "required resource " + classPrefix + EXAMPLE_NNMODEL_FILE + " not found"); - try (InputStream stream = modelURL.openStream()) { - loadDefaultModels(stream); - } catch (IOException e) { - throw new RuntimeException("Unable to read the default media type registry", e); - } - - } - - /** - * read the comments where the model configuration is written, e.g the - * number of inputs, hiddens and output please ensure the first char in the - * given string is # In this example grb model file, there are 4 elements 1) - * type 2) number of input units 3) number of hidden units. 4) number of - * output units. 
- */ - private void readDescription(final NNTrainedModelBuilder builder, - final String line) { - int numInputs; - int numHidden; - int numOutputs; - String[] sarr = line.split("\t"); - - try { - MediaType type = MediaType.parse(sarr[1]); - numInputs = Integer.parseInt(sarr[2]); - numHidden = Integer.parseInt(sarr[3]); - numOutputs = Integer.parseInt(sarr[4]); - builder.setNumOfInputs(numInputs); - builder.setNumOfHidden(numHidden); - builder.setNumOfOutputs(numOutputs); - builder.setType(type); - } catch (Exception e) { - if (log.isLoggable(Level.WARNING)) { - log.log(Level.WARNING, "Unable to parse the model configuration", e); - } - throw new RuntimeException("Unable to parse the model configuration", e); - } - } - - /** - * Read the next line for the model parameters and populate the build which - * later will be used to instantiate the instance of TrainedModel - * - * @param builder - * @param line - */ - private void readNNParams(final NNTrainedModelBuilder builder, - final String line) { - String[] sarr = line.split("\t"); - int n = sarr.length; - float[] params = new float[n]; - try { - int i = 0; - for (String fstr : sarr) { - params[i] = Float.parseFloat(fstr); - i++; - } - builder.setParams(params); - } catch (Exception e) { - if (log.isLoggable(Level.WARNING)) { - log.log(Level.WARNING, "Unable to parse the model configuration", e); - } - throw new RuntimeException("Unable to parse the model configuration", e); - } - } -} diff --git a/tika-core/src/main/java/org/apache/tika/detect/NNTrainedModel.java b/tika-core/src/main/java/org/apache/tika/detect/NNTrainedModel.java deleted file mode 100644 index 28dc1b1..0000000 --- a/tika-core/src/main/java/org/apache/tika/detect/NNTrainedModel.java +++ /dev/null @@ -1,103 +0,0 @@ -/* - * Licensed to the Apache Software Foundation (ASF) under one or more - * contributor license agreements. See the NOTICE file distributed with - * this work for additional information regarding copyright ownership. 
- * The ASF licenses this file to You under the Apache License, Version 2.0 - * (the "License"); you may not use this file except in compliance with - * the License. You may obtain a copy of the License at - * - * http://www.apache.org/licenses/LICENSE-2.0 - * - * Unless required by applicable law or agreed to in writing, software - * distributed under the License is distributed on an "AS IS" BASIS, - * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. - * See the License for the specific language governing permissions and - * limitations under the License. - */ - -package org.apache.tika.detect; - -public class NNTrainedModel extends TrainedModel { - - private int numOfInputs; - private int numOfHidden; - private int numOfOutputs; - private float[][] Theta1, Theta2; - - public NNTrainedModel(final int nInput, final int nHidden, - final int nOutput, final float[] nn_params) { - this.numOfInputs = nInput; - this.numOfHidden = nHidden; - this.numOfOutputs = nOutput; - this.Theta1 = new float[numOfHidden][numOfInputs + 1]; - this.Theta2 = new float[numOfOutputs][numOfHidden + 1]; - populateThetas(nn_params); - } - - // convert the vector params to the 2 thetas. 
- private void populateThetas(final float[] nn_params) { - int m = this.Theta1.length; - int n = this.Theta1[0].length; - int i, j, k = 0; - for (i = 0; i < n; i++) { - for (j = 0; j < m; j++) { - this.Theta1[j][i] = nn_params[k]; - k++; - } - } - - m = this.Theta2.length; - n = this.Theta2[0].length; - for (i = 0; i < n; i++) { - for (j = 0; j < m; j++) { - this.Theta2[j][i] = nn_params[k]; - k++; - } - } - } - - @Override - public double predict(double[] unseen) { - // TODO Auto-generated method stub - return 0; - } - - /** - * The given input vector of unseen is m=(256 + 1) * n= 1 this returns a - * prediction probability - */ - @Override - public float predict(float[] unseen) { - // please ensure the unseen in size consistent with theta1 - - int i, j; - int m = this.Theta1.length; - int n = this.Theta1[0].length; - float[] hh = new float[m + 1];// hidden unit summation - hh[0] = 1; - for (i = 0; i < m; i++) { - double h = 0; - for (j = 0; j < n; j++) { - h += this.Theta1[i][j] * unseen[j]; - } - // apply sigmoid - h = 1.0 / (1.0 + Math.exp(-h)); - hh[i+1] = (float)h; - } - - m = this.Theta2.length; - n = this.Theta2[0].length; - float[] oo = new float[m]; - for (i = 0; i < m; i++) { - double o = 0; - for (j = 0; j < n; j++) { - o += this.Theta2[i][j] * hh[j]; - } - // apply sigmoid - o = 1.0 / (1.0 + Math.exp(-o)); - oo[i] = (float)o; - } - - return oo[0]; - } -} diff --git a/tika-core/src/main/java/org/apache/tika/detect/NNTrainedModelBuilder.java b/tika-core/src/main/java/org/apache/tika/detect/NNTrainedModelBuilder.java deleted file mode 100644 index 6c80d7e..0000000 --- a/tika-core/src/main/java/org/apache/tika/detect/NNTrainedModelBuilder.java +++ /dev/null @@ -1,76 +0,0 @@ -/* - * Licensed to the Apache Software Foundation (ASF) under one or more - * contributor license agreements. See the NOTICE file distributed with - * this work for additional information regarding copyright ownership. 
- * The ASF licenses this file to You under the Apache License, Version 2.0 - * (the "License"); you may not use this file except in compliance with - * the License. You may obtain a copy of the License at - * - * http://www.apache.org/licenses/LICENSE-2.0 - * - * Unless required by applicable law or agreed to in writing, software - * distributed under the License is distributed on an "AS IS" BASIS, - * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. - * See the License for the specific language governing permissions and - * limitations under the License. - */ - - -package org.apache.tika.detect; - -import org.apache.tika.mime.MediaType; - -public class NNTrainedModelBuilder { - private MediaType type; - - private int numOfInputs; - private int numOfHidden; - private int numOfOutputs; - - private float[] params; - - public MediaType getType() { - return this.type; - } - - public int getNumOfInputs() { - return numOfInputs; - } - - public int getNumOfHidden() { - return numOfHidden; - } - - public int getNumOfOutputs() { - return numOfOutputs; - } - - public float[] getParams() { - return params; - } - - public void setType(final MediaType type) { - this.type = type; - } - - public void setNumOfInputs(final int numOfInputs) { - this.numOfInputs = numOfInputs; - } - - public void setNumOfHidden(final int numOfHidden) { - this.numOfHidden = numOfHidden; - } - - public void setNumOfOutputs(final int numOfOutputs) { - this.numOfOutputs = numOfOutputs; - } - - public void setParams(float[] params) { - this.params = params; - } - - public NNTrainedModel build() { - return new NNTrainedModel(numOfInputs, numOfHidden, numOfOutputs, - params); - } -} diff --git a/tika-core/src/main/java/org/apache/tika/detect/NameDetector.java b/tika-core/src/main/java/org/apache/tika/detect/NameDetector.java index 1638d50..18418a2 100644 --- a/tika-core/src/main/java/org/apache/tika/detect/NameDetector.java +++ 
b/tika-core/src/main/java/org/apache/tika/detect/NameDetector.java @@ -24,8 +24,6 @@ import org.apache.tika.metadata.Metadata; import org.apache.tika.mime.MediaType; - -import static java.nio.charset.StandardCharsets.UTF_8; /** * Content type detection based on the resource name. An instance of this @@ -121,7 +119,7 @@ int percent = name.indexOf('%'); if (percent != -1) { try { - name = URLDecoder.decode(name, UTF_8.name()); + name = URLDecoder.decode(name, "UTF-8"); } catch (UnsupportedEncodingException e) { throw new IllegalStateException("UTF-8 not supported", e); } diff --git a/tika-core/src/main/java/org/apache/tika/detect/TrainedModel.java b/tika-core/src/main/java/org/apache/tika/detect/TrainedModel.java deleted file mode 100644 index 16b4cd6..0000000 --- a/tika-core/src/main/java/org/apache/tika/detect/TrainedModel.java +++ /dev/null @@ -1,24 +0,0 @@ -/* - * Licensed to the Apache Software Foundation (ASF) under one or more - * contributor license agreements. See the NOTICE file distributed with - * this work for additional information regarding copyright ownership. - * The ASF licenses this file to You under the Apache License, Version 2.0 - * (the "License"); you may not use this file except in compliance with - * the License. You may obtain a copy of the License at - * - * http://www.apache.org/licenses/LICENSE-2.0 - * - * Unless required by applicable law or agreed to in writing, software - * distributed under the License is distributed on an "AS IS" BASIS, - * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. - * See the License for the specific language governing permissions and - * limitations under the License. 
- */ -package org.apache.tika.detect; - - -public abstract class TrainedModel { - - public abstract double predict(double[] input); - public abstract float predict(float[] input); -} diff --git a/tika-core/src/main/java/org/apache/tika/detect/TrainedModelDetector.java b/tika-core/src/main/java/org/apache/tika/detect/TrainedModelDetector.java deleted file mode 100644 index dc8f142..0000000 --- a/tika-core/src/main/java/org/apache/tika/detect/TrainedModelDetector.java +++ /dev/null @@ -1,176 +0,0 @@ -/* - * Licensed to the Apache Software Foundation (ASF) under one or more - * contributor license agreements. See the NOTICE file distributed with - * this work for additional information regarding copyright ownership. - * The ASF licenses this file to You under the Apache License, Version 2.0 - * (the "License"); you may not use this file except in compliance with - * the License. You may obtain a copy of the License at - * - * http://www.apache.org/licenses/LICENSE-2.0 - * - * Unless required by applicable law or agreed to in writing, software - * distributed under the License is distributed on an "AS IS" BASIS, - * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. - * See the License for the specific language governing permissions and - * limitations under the License. 
- */ - -package org.apache.tika.detect; - -import java.io.File; -import java.io.IOException; -import java.io.InputStream; -import java.io.Writer; -import java.nio.ByteBuffer; -import java.nio.channels.Channels; -import java.nio.channels.ReadableByteChannel; -import java.nio.file.Files; -import java.nio.file.Path; -import java.util.HashMap; -import java.util.Iterator; -import java.util.Map; - -import org.apache.tika.io.TemporaryResources; -import org.apache.tika.metadata.Metadata; -import org.apache.tika.mime.MediaType; - -import static java.nio.charset.StandardCharsets.UTF_8; - -public abstract class TrainedModelDetector implements Detector { - private final Map<MediaType, TrainedModel> MODEL_MAP = new HashMap<>(); - - private static final long serialVersionUID = 1L; - - public TrainedModelDetector() { - loadDefaultModels(getClass().getClassLoader()); - } - - public int getMinLength() { - return Integer.MAX_VALUE; - } - - public MediaType detect(InputStream input, Metadata metadata) - throws IOException { - // convert to byte-histogram - if (input != null) { - input.mark(getMinLength()); - float[] histogram = readByteFrequencies(input); - // writeHisto(histogram); //on testing purpose - /* - * iterate the map to find out the one that gives the higher - * prediction value.
- */ - Iterator<MediaType> iter = MODEL_MAP.keySet().iterator(); - float threshold = 0.5f;// probability threshold, any value below the - // threshold will be considered as - // MediaType.OCTET_STREAM - float maxprob = threshold; - MediaType maxType = MediaType.OCTET_STREAM; - while (iter.hasNext()) { - MediaType key = iter.next(); - TrainedModel model = MODEL_MAP.get(key); - float prob = model.predict(histogram); - if (maxprob < prob) { - maxprob = prob; - maxType = key; - } - } - input.reset(); - return maxType; - } - return null; - } - - /** - * Read the {@code inputstream} and build a byte frequency histogram - * - * @param input stream to read from - * @return byte frequencies array - * @throws IOException - */ - protected float[] readByteFrequencies(final InputStream input) - throws IOException { - ReadableByteChannel inputChannel; - // TODO: any reason to avoid closing of input & inputChannel? - try { - inputChannel = Channels.newChannel(input); - // long inSize = inputChannel.size(); - float histogram[] = new float[257]; - histogram[0] = 1; - - // create buffer with capacity of maxBufSize bytes - ByteBuffer buf = ByteBuffer.allocate(1024 * 5); - int bytesRead = inputChannel.read(buf); // read into buffer. - - float max = -1; - while (bytesRead != -1) { - - buf.flip(); // make buffer ready for read - - while (buf.hasRemaining()) { - byte byt = buf.get(); - int idx = byt; - idx++; - if (byt < 0) { - idx = 256 + idx; - histogram[idx]++; - } else { - histogram[idx]++; - } - max = max < histogram[idx] ? histogram[idx] : max; - } - - buf.clear(); // make buffer ready for writing - bytesRead = inputChannel.read(buf); - } - - int i; - for (i = 1; i < histogram.length; i++) { - histogram[i] /= max; - histogram[i] = (float) Math.sqrt(histogram[i]); - } - - return histogram; - } finally { - // inputChannel.close(); - } - } - - /** - * for testing purposes; this method write the histogram vector to a file.
- * - * @param histogram - * @throws IOException - */ - private void writeHisto(final float[] histogram) - throws IOException { - Path histPath = new TemporaryResources().createTempFile(); - try (Writer writer = Files.newBufferedWriter(histPath, UTF_8)) { - for (float bin : histogram) { - writer.write(String.valueOf(bin) + "\t"); - // writer.write(i + "\t"); - } - writer.write("\r\n"); - } - } - - public void loadDefaultModels(Path modelFile) { - try (InputStream in = Files.newInputStream(modelFile)) { - loadDefaultModels(in); - } catch (IOException e) { - throw new RuntimeException("Unable to read the default media type registry", e); - } - } - - public void loadDefaultModels(File modelFile) { - loadDefaultModels(modelFile.toPath()); - } - - public abstract void loadDefaultModels(final InputStream modelStream); - - public abstract void loadDefaultModels(final ClassLoader classLoader); - - protected void registerModels(MediaType type, TrainedModel model) { - MODEL_MAP.put(type, model); - } -} diff --git a/tika-core/src/main/java/org/apache/tika/detect/XmlRootExtractor.java b/tika-core/src/main/java/org/apache/tika/detect/XmlRootExtractor.java index 74d994d..88dec51 100644 --- a/tika-core/src/main/java/org/apache/tika/detect/XmlRootExtractor.java +++ b/tika-core/src/main/java/org/apache/tika/detect/XmlRootExtractor.java @@ -27,7 +27,6 @@ import org.apache.tika.sax.OfflineContentHandler; import org.xml.sax.Attributes; import org.xml.sax.SAXException; -import org.xml.sax.SAXNotRecognizedException; import org.xml.sax.helpers.DefaultHandler; /** @@ -51,14 +50,7 @@ SAXParserFactory factory = SAXParserFactory.newInstance(); factory.setNamespaceAware(true); factory.setValidating(false); - try { - factory.setFeature(XMLConstants.FEATURE_SECURE_PROCESSING, true); - } catch (SAXNotRecognizedException e) { - // TIKA-271 and TIKA-1000: Some XML parsers do not support the secure-processing - // feature, even though it's required by JAXP in Java 5. 
Ignoring - // the exception is fine here, deployments without this feature - // are inherently vulnerable to XML denial-of-service attacks. - } + factory.setFeature(XMLConstants.FEATURE_SECURE_PROCESSING, true); factory.newSAXParser().parse( new CloseShieldInputStream(stream), new OfflineContentHandler(handler)); diff --git a/tika-core/src/main/java/org/apache/tika/embedder/ExternalEmbedder.java b/tika-core/src/main/java/org/apache/tika/embedder/ExternalEmbedder.java index 84dc5da..8f069b1 100644 --- a/tika-core/src/main/java/org/apache/tika/embedder/ExternalEmbedder.java +++ b/tika-core/src/main/java/org/apache/tika/embedder/ExternalEmbedder.java @@ -38,8 +38,6 @@ import org.apache.tika.mime.MediaType; import org.apache.tika.parser.ParseContext; import org.apache.tika.parser.external.ExternalParser; - -import static java.nio.charset.StandardCharsets.UTF_8; /** * Embedder that uses an external program (like sed or exiftool) to embed text @@ -415,7 +413,7 @@ if (process.exitValue() != 0) { throw new TikaException("There was an error executing the command line" + "\nExecutable Command:\n\n" + cmd + - "\nExecutable Error:\n\n" + stdErrOutputStream.toString(UTF_8.name())); + "\nExecutable Error:\n\n" + stdErrOutputStream.toString("UTF-8")); } } } diff --git a/tika-core/src/main/java/org/apache/tika/exception/AccessPermissionException.java b/tika-core/src/main/java/org/apache/tika/exception/AccessPermissionException.java deleted file mode 100644 index b5f2136..0000000 --- a/tika-core/src/main/java/org/apache/tika/exception/AccessPermissionException.java +++ /dev/null @@ -1,40 +0,0 @@ -/* - * Licensed to the Apache Software Foundation (ASF) under one or more - * contributor license agreements. See the NOTICE file distributed with - * this work for additional information regarding copyright ownership. - * The ASF licenses this file to You under the Apache License, Version 2.0 - * (the "License"); you may not use this file except in compliance with - * the License. 
You may obtain a copy of the License at - * - * http://www.apache.org/licenses/LICENSE-2.0 - * - * Unless required by applicable law or agreed to in writing, software - * distributed under the License is distributed on an "AS IS" BASIS, - * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. - * See the License for the specific language governing permissions and - * limitations under the License. - */ -package org.apache.tika.exception; - -/** - * Exception to be thrown when a document does not allow content extraction. - * As of this writing, PDF documents are the only type of document that might - * cause this type of exception. - */ -public class AccessPermissionException extends TikaException { - public AccessPermissionException() { - super("Unable to process: content extraction is not allowed"); - } - - public AccessPermissionException(Throwable th) { - super("Unable to process: content extraction is not allowed", th); - } - - public AccessPermissionException(String info) { - super(info); - } - - public AccessPermissionException(String info, Throwable th) { - super(info, th); - } -} diff --git a/tika-core/src/main/java/org/apache/tika/extractor/ParserContainerExtractor.java b/tika-core/src/main/java/org/apache/tika/extractor/ParserContainerExtractor.java index f832c22..24f0d14 100644 --- a/tika-core/src/main/java/org/apache/tika/extractor/ParserContainerExtractor.java +++ b/tika-core/src/main/java/org/apache/tika/extractor/ParserContainerExtractor.java @@ -122,8 +122,11 @@ File file = tis.getFile(); // Let the handler process the embedded resource - try (InputStream input = TikaInputStream.get(file)) { + InputStream input = TikaInputStream.get(file); + try { handler.handle(filename, type, input); + } finally { + input.close(); } // Recurse diff --git a/tika-core/src/main/java/org/apache/tika/extractor/ParsingEmbeddedDocumentExtractor.java b/tika-core/src/main/java/org/apache/tika/extractor/ParsingEmbeddedDocumentExtractor.java index 
d67f086..c2e3cf3 100644 --- a/tika-core/src/main/java/org/apache/tika/extractor/ParsingEmbeddedDocumentExtractor.java +++ b/tika-core/src/main/java/org/apache/tika/extractor/ParsingEmbeddedDocumentExtractor.java @@ -21,7 +21,6 @@ import java.io.IOException; import java.io.InputStream; -import org.apache.tika.exception.EncryptedDocumentException; import org.apache.tika.exception.TikaException; import org.apache.tika.io.CloseShieldInputStream; import org.apache.tika.io.TemporaryResources; @@ -91,7 +90,8 @@ } // Use the delegate parser to parse this entry - try (TemporaryResources tmp = new TemporaryResources()) { + TemporaryResources tmp = new TemporaryResources(); + try { final TikaInputStream newStream = TikaInputStream.get(new CloseShieldInputStream(stream), tmp); if (stream instanceof TikaInputStream) { final Object container = ((TikaInputStream) stream).getOpenContainer(); @@ -103,12 +103,11 @@ newStream, new EmbeddedContentHandler(new BodyContentHandler(handler)), metadata, context); - } catch (EncryptedDocumentException ede) { - // TODO: can we log a warning that we lack the password? - // For now, just skip the content } catch (TikaException e) { // TODO: can we log a warning somehow? 
// Could not parse the entry, just skip the content + } finally { + tmp.close(); } if(outputHtml) { diff --git a/tika-core/src/main/java/org/apache/tika/fork/ForkClient.java b/tika-core/src/main/java/org/apache/tika/fork/ForkClient.java index 6a1fde9..930598e 100644 --- a/tika-core/src/main/java/org/apache/tika/fork/ForkClient.java +++ b/tika-core/src/main/java/org/apache/tika/fork/ForkClient.java @@ -24,17 +24,17 @@ import java.io.InputStream; import java.io.NotSerializableException; import java.util.ArrayList; +import java.util.Arrays; import java.util.List; import java.util.jar.JarEntry; import java.util.jar.JarOutputStream; import java.util.zip.ZipEntry; import org.apache.tika.exception.TikaException; +import org.apache.tika.io.IOExceptionWithCause; import org.apache.tika.io.IOUtils; import org.xml.sax.ContentHandler; -import static java.nio.charset.StandardCharsets.UTF_8; - class ForkClient { private final List<ForkResource> resources = new ArrayList<ForkResource>(); @@ -51,7 +51,7 @@ private final InputStream error; - public ForkClient(ClassLoader loader, Object object, List<String> java) + public ForkClient(ClassLoader loader, Object object, String java) throws IOException, TikaException { boolean ok = false; try { @@ -60,7 +60,7 @@ ProcessBuilder builder = new ProcessBuilder(); List<String> command = new ArrayList<String>(); - command.addAll(java); + command.addAll(Arrays.asList(java.split("\\s+"))); command.add("-jar"); command.add(jar.getPath()); builder.command(command); @@ -199,7 +199,7 @@ return (Throwable) ForkObjectInputStream.readObject( input, loader); } catch (ClassNotFoundException e) { - throw new IOException( + throw new IOExceptionWithCause( "Unable to deserialize an exception", e); } } else { @@ -258,12 +258,12 @@ * @throws IOException if the bootstrap archive could not be created */ private static void fillBootstrapJar(File file) throws IOException { - try (JarOutputStream jar = - new JarOutputStream(new FileOutputStream(file))) { + JarOutputStream jar = new JarOutputStream(new
FileOutputStream(file)); + try { String manifest = - "Main-Class: " + ForkServer.class.getName() + "\n"; + "Main-Class: " + ForkServer.class.getName() + "\n"; jar.putNextEntry(new ZipEntry("META-INF/MANIFEST.MF")); - jar.write(manifest.getBytes(UTF_8)); + jar.write(manifest.getBytes("UTF-8")); Class[] bootstrap = { ForkServer.class, ForkObjectInputStream.class, @@ -276,11 +276,16 @@ ClassLoader loader = ForkServer.class.getClassLoader(); for (Class klass : bootstrap) { String path = klass.getName().replace('.', '/') + ".class"; - try (InputStream input = loader.getResourceAsStream(path)) { + InputStream input = loader.getResourceAsStream(path); + try { jar.putNextEntry(new JarEntry(path)); IOUtils.copy(input, jar); + } finally { + input.close(); } } + } finally { + jar.close(); } } diff --git a/tika-core/src/main/java/org/apache/tika/fork/ForkParser.java b/tika-core/src/main/java/org/apache/tika/fork/ForkParser.java index 7727fe8..c45e0e1 100644 --- a/tika-core/src/main/java/org/apache/tika/fork/ForkParser.java +++ b/tika-core/src/main/java/org/apache/tika/fork/ForkParser.java @@ -18,11 +18,7 @@ import java.io.IOException; import java.io.InputStream; -import java.util.ArrayList; -import java.util.Arrays; -import java.util.Collections; import java.util.LinkedList; -import java.util.List; import java.util.Queue; import java.util.Set; @@ -47,7 +43,7 @@ private final Parser parser; /** Java command line */ - private List java = Arrays.asList("java", "-Xmx32m"); + private String java = "java -Xmx32m"; /** Process pool size */ private int poolSize = 5; @@ -99,55 +95,21 @@ * Returns the command used to start the forked server process. 
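The reverted setter in this hunk stores the java command as a single string, and ForkClient then splits it on whitespace before appending "-jar" and the bootstrap jar path. A minimal sketch of that argument-list construction (hypothetical class and method names, not the actual Tika code):

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;

// Sketch of how ForkClient turns the configured command string into a
// process argument list (hypothetical helper; the real logic is inline
// in the ForkClient constructor).
public class ForkCommandDemo {
    public static List<String> buildCommand(String java, String jarPath) {
        List<String> command = new ArrayList<String>();
        // Split the configured command line on runs of whitespace
        command.addAll(Arrays.asList(java.split("\\s+")));
        command.add("-jar");
        command.add(jarPath);
        return command;
    }

    public static void main(String[] args) {
        System.out.println(buildCommand("java -Xmx32m", "/tmp/bootstrap.jar"));
    }
}
```

Splitting on `\\s+` rather than a single space tolerates extra whitespace between arguments.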
* * @return java command line - * @deprecated since 1.8 - * @see ForkParser#getJavaCommandAsList() - */ - @Deprecated + */ public String getJavaCommand() { - StringBuilder sb = new StringBuilder(); - for (String part : getJavaCommandAsList()) { - sb.append(part).append(' '); - } - sb.deleteCharAt(sb.length() - 1); - return sb.toString(); - } - - /** - * Returns the command used to start the forked server process. - *

    - * Returned list is unmodifiable. - * @return java command line args - */ - public List getJavaCommandAsList() { - return Collections.unmodifiableList(java); - } - - /** - * Sets the command used to start the forked server process. - * The arguments "-jar" and "/path/to/bootstrap.jar" are - * appended to the given command when starting the process. - * The default setting is {"java", "-Xmx32m"}. - *

- * Creates a defensive copy. - * @param java java command line - */ - public void setJavaCommand(List java) { - this.java = new ArrayList(java); + return java; } /** * Sets the command used to start the forked server process. * The given command line is split on whitespace and the arguments - * "-jar" and "/path/to/bootstrap.jar" are appended to it when starting - * the process. The default setting is "java -Xmx32m". + * "-jar" and "/path/to/bootstrap.jar" are appended to it when starting + * the process. The default setting is "java -Xmx32m". * * @param java java command line - * @deprecated since 1.8 - * @see ForkParser#setJavaCommand(List) - */ - @Deprecated + */ public void setJavaCommand(String java) { - setJavaCommand(Arrays.asList(java.split(" "))); + this.java = java; } public Set getSupportedTypes(ParseContext context) { diff --git a/tika-core/src/main/java/org/apache/tika/io/EndianUtils.java b/tika-core/src/main/java/org/apache/tika/io/EndianUtils.java index 1e33986..2c5ed90 100644 --- a/tika-core/src/main/java/org/apache/tika/io/EndianUtils.java +++ b/tika-core/src/main/java/org/apache/tika/io/EndianUtils.java @@ -170,27 +170,6 @@ (ch8 << 0); } - /** - * Gets the integer value that is stored in UTF-8 like fashion, in Big Endian - * but with the high bit on each number indicating if it continues or not - */ - public static long readUE7(InputStream stream) throws IOException { - int i; - long v = 0; - while ((i = stream.read()) >= 0) { - v = v << 7; - if ((i & 128) == 128) { - // Continues - v += (i&127); - } else { - // Last value - v += i; - break; - } - } - return v; - } - /** * Get a LE short value from the beginning of a byte array diff --git a/tika-core/src/main/java/org/apache/tika/io/FilenameUtils.java b/tika-core/src/main/java/org/apache/tika/io/FilenameUtils.java index d84222b..2bdfd53 100644 --- a/tika-core/src/main/java/org/apache/tika/io/FilenameUtils.java +++ b/tika-core/src/main/java/org/apache/tika/io/FilenameUtils.java @@ -17,11
+17,9 @@ package org.apache.tika.io; import java.util.HashSet; -import java.util.Locale; public class FilenameUtils { - /** * Reserved characters @@ -36,13 +34,11 @@ private final static HashSet RESERVED = new HashSet(38); - static { for (int i=0; i - * The goal of this is to get a filename from a path. - * The package parsers and some other embedded doc - * extractors could put anything into Metadata.RESOURCE_NAME_KEY. - *

    - * If a careless client used that filename as if it were a - * filename and not a path when writing embedded files, - * bad things could happen. Consider: "../../../my_ppt.ppt". - *

    - * Consider using this in combination with {@link #normalize(String)}. - * - * @param path path to strip - * @return empty string or a filename, never null - */ - public static String getName(final String path) { - - if (path == null || path.length() == 0) { - return ""; - } - int unix = path.lastIndexOf("/"); - int windows = path.lastIndexOf("\\"); - //some macintosh file names are stored with : as the delimiter - //also necessary to properly handle C:somefilename - int colon = path.lastIndexOf(":"); - String cand = path.substring(Math.max(colon, Math.max(unix, windows))+1); - if (cand.equals("..") || cand.equals(".")){ - return ""; - } - return cand; - } } diff --git a/tika-core/src/main/java/org/apache/tika/io/IOUtils.java b/tika-core/src/main/java/org/apache/tika/io/IOUtils.java index 11d3bd3..1570983 100644 --- a/tika-core/src/main/java/org/apache/tika/io/IOUtils.java +++ b/tika-core/src/main/java/org/apache/tika/io/IOUtils.java @@ -30,7 +30,6 @@ import java.io.StringWriter; import java.io.Writer; import java.nio.channels.Channel; -import java.nio.charset.Charset; import java.util.ArrayList; import java.util.List; @@ -76,8 +75,6 @@ * @since Apache Tika 0.4, copied (partially) from Commons IO 1.4 */ public class IOUtils { - // TODO Remove this when we've finished TIKA-1706 and TIKA-1710 - public static final Charset UTF_8 = java.nio.charset.StandardCharsets.UTF_8; /** * The default buffer size to use. @@ -257,7 +254,7 @@ */ @Deprecated public static byte[] toByteArray(String input) throws IOException { - return input.getBytes(UTF_8); + return input.getBytes(); } // read char[] @@ -395,7 +392,7 @@ */ @Deprecated public static String toString(byte[] input) throws IOException { - return new String(input, UTF_8); + return new String(input); } /** @@ -415,9 +412,8 @@ @Deprecated public static String toString(byte[] input, String encoding) throws IOException { - // If no encoding is specified, default to UTF-8. 
if (encoding == null) { - return new String(input, UTF_8); + return new String(input); } else { return new String(input, encoding); } @@ -439,7 +435,7 @@ * @since Commons IO 1.1 */ public static List readLines(InputStream input) throws IOException { - InputStreamReader reader = new InputStreamReader(input, UTF_8); + InputStreamReader reader = new InputStreamReader(input); return readLines(reader); } @@ -533,7 +529,7 @@ * @since Commons IO 1.1 */ public static InputStream toInputStream(String input) { - byte[] bytes = input.getBytes(UTF_8); + byte[] bytes = input.getBytes(); return new ByteArrayInputStream(bytes); } @@ -551,7 +547,7 @@ * @since Commons IO 1.1 */ public static InputStream toInputStream(String input, String encoding) throws IOException { - byte[] bytes = encoding != null ? input.getBytes(encoding) : input.getBytes(UTF_8); + byte[] bytes = encoding != null ? input.getBytes(encoding) : input.getBytes(); return new ByteArrayInputStream(bytes); } @@ -589,7 +585,7 @@ */ public static void write(byte[] data, Writer output) throws IOException { if (data != null) { - output.write(new String(data, UTF_8)); + output.write(new String(data)); } } @@ -657,7 +653,7 @@ public static void write(char[] data, OutputStream output) throws IOException { if (data != null) { - output.write(new String(data).getBytes(UTF_8)); + output.write(new String(data).getBytes()); } } @@ -783,7 +779,7 @@ public static void write(String data, OutputStream output) throws IOException { if (data != null) { - output.write(data.getBytes(UTF_8)); + output.write(data.getBytes()); } } @@ -852,7 +848,7 @@ public static void write(StringBuffer data, OutputStream output) throws IOException { if (data != null) { - output.write(data.toString().getBytes(UTF_8)); + output.write(data.toString().getBytes()); } } @@ -958,7 +954,7 @@ */ public static void copy(InputStream input, Writer output) throws IOException { - InputStreamReader in = new InputStreamReader(input, UTF_8); + InputStreamReader in = new 
InputStreamReader(input); copy(in, output); } @@ -1065,7 +1061,7 @@ */ public static void copy(Reader input, OutputStream output) throws IOException { - OutputStreamWriter out = new OutputStreamWriter(output, UTF_8); + OutputStreamWriter out = new OutputStreamWriter(output); copy(input, out); // XXX Unless anyone is planning on rewriting OutputStreamWriter, we // have to flush here. diff --git a/tika-core/src/main/java/org/apache/tika/io/LookaheadInputStream.java b/tika-core/src/main/java/org/apache/tika/io/LookaheadInputStream.java index 18c80f4..18c7177 100644 --- a/tika-core/src/main/java/org/apache/tika/io/LookaheadInputStream.java +++ b/tika-core/src/main/java/org/apache/tika/io/LookaheadInputStream.java @@ -28,8 +28,11 @@ *

    * The recommended usage pattern of this class is: *

    - *     try (InputStream lookahead = new LookaheadInputStream(stream, n)) {
    + *     InputStream lookahead = new LookaheadInputStream(stream, n);
    + *     try {
      *         processStream(lookahead);
    + *     } finally {
    + *         lookahead.close();
      *     }
      * 
    *
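The rewritten usage pattern above, like the similar rewrites throughout this patch, backports try-with-resources to a Java 6-compatible try/finally. A standalone sketch of the guaranteed-close idiom, using a toy in-memory stream in place of LookaheadInputStream:

```java
import java.io.ByteArrayInputStream;
import java.io.IOException;
import java.io.InputStream;

// Java 6-compatible replacement for try-with-resources: the stream is
// closed in the finally block even when processing throws.
public class GuardedCloseDemo {
    static int process(InputStream in) throws IOException {
        int n = 0;
        while (in.read() >= 0) {
            n++;
        }
        return n;
    }

    public static int countBytes(byte[] data) throws IOException {
        InputStream in = new ByteArrayInputStream(data);
        try {
            return process(in);
        } finally {
            in.close(); // always runs, mirroring the implicit close of try-with-resources
        }
    }

    public static void main(String[] args) throws IOException {
        System.out.println(countBytes(new byte[] {1, 2, 3}));
    }
}
```

One behavioral difference from try-with-resources: if both the body and close() throw, the close() exception replaces the body's exception instead of being suppressed.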

    diff --git a/tika-core/src/main/java/org/apache/tika/io/TaggedInputStream.java b/tika-core/src/main/java/org/apache/tika/io/TaggedInputStream.java index a7d637e..a5810a3 100644 --- a/tika-core/src/main/java/org/apache/tika/io/TaggedInputStream.java +++ b/tika-core/src/main/java/org/apache/tika/io/TaggedInputStream.java @@ -77,7 +77,7 @@ /** * Casts or wraps the given stream to a TaggedInputStream instance. * - * @param proxy normal input stream + * @param stream normal input stream * @return a TaggedInputStream instance */ public static TaggedInputStream get(InputStream proxy) { diff --git a/tika-core/src/main/java/org/apache/tika/io/TemporaryResources.java b/tika-core/src/main/java/org/apache/tika/io/TemporaryResources.java index 2dad5bd..e03cb12 100644 --- a/tika-core/src/main/java/org/apache/tika/io/TemporaryResources.java +++ b/tika-core/src/main/java/org/apache/tika/io/TemporaryResources.java @@ -19,9 +19,8 @@ import java.io.Closeable; import java.io.File; import java.io.IOException; -import java.nio.file.Files; -import java.nio.file.Path; import java.util.LinkedList; +import java.util.List; import org.apache.tika.exception.TikaException; @@ -38,65 +37,43 @@ /** * Tracked resources in LIFO order. */ - private final LinkedList resources = new LinkedList<>(); + private final LinkedList resources = new LinkedList(); /** * Directory for temporary files, null for the system default. */ - private Path tempFileDir = null; + private File tmp = null; /** * Sets the directory to be used for the temporary files created by - * the {@link #createTempFile()} method. + * the {@link #createTemporaryFile()} method. 
* - * @param tempFileDir temporary file directory, - * or null for the system default + * @param tmp temporary file directory, + * or null for the system default */ - public void setTemporaryFileDirectory(Path tempFileDir) { - this.tempFileDir = tempFileDir; - } - - /** - * Sets the directory to be used for the temporary files created by - * the {@link #createTempFile()} method. - * - * @param tempFileDir temporary file directory, - * or null for the system default - * @see #setTemporaryFileDirectory(Path) - */ - public void setTemporaryFileDirectory(File tempFileDir) { - this.tempFileDir = tempFileDir == null ? null : tempFileDir.toPath(); - } - - /** - * Creates a temporary file that will automatically be deleted when - * the {@link #close()} method is called, returning its path. - * - * @return Path to created temporary file that will be deleted after closing - * @throws IOException - */ - public Path createTempFile() throws IOException { - final Path path = tempFileDir == null - ? Files.createTempFile("apache-tika-", ".tmp") - : Files.createTempFile(tempFileDir, "apache-tika-", ".tmp"); - addResource(new Closeable() { - public void close() throws IOException { - Files.delete(path); - } - }); - return path; + public void setTemporaryFileDirectory(File tmp) { + this.tmp = tmp; } /** * Creates and returns a temporary file that will automatically be * deleted when the {@link #close()} method is called. * - * @return Created temporary file that'll be deleted after closing + * @return * @throws IOException - * @see #createTempFile() */ public File createTemporaryFile() throws IOException { - return createTempFile().toFile(); + final File file = File.createTempFile("apache-tika-", ".tmp", tmp); + addResource(new Closeable() { + public void close() throws IOException { + if (!file.delete()) { + throw new IOException( + "Could not delete temporary file " + + file.getPath()); + } + } + }); + return file; } /** @@ -130,32 +107,33 @@ * Closes all tracked resources. 
The resources are closed in reverse order * from how they were added. *
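The close() rewrite in this hunk collects exceptions into a list instead of using Throwable.addSuppressed, which is unavailable before Java 7. A toy sketch of that exception-collecting close (hypothetical class name; the real method re-wraps multiple failures in IOExceptionWithCause):

```java
import java.io.Closeable;
import java.io.IOException;
import java.util.LinkedList;
import java.util.List;

// Sketch of the exception-collecting close(): every resource is closed
// even if earlier ones fail, and a failure is re-thrown only after all
// closes have been attempted.
public class CloseAllDemo {
    public static void closeAll(List<Closeable> resources) throws IOException {
        List<IOException> exceptions = new LinkedList<IOException>();
        for (Closeable resource : resources) {
            try {
                resource.close();
            } catch (IOException e) {
                exceptions.add(e); // remember the failure, keep closing
            }
        }
        resources.clear();
        if (!exceptions.isEmpty()) {
            // Pre-Java 7 there is no addSuppressed(); report the first failure
            throw exceptions.get(0);
        }
    }
}
```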

    - * Any suppressed exceptions from managed resources are collected and - * then added to the first thrown exception, which is re-thrown once - * all the resources have been closed. + * Any thrown exceptions from managed resources are collected and + * then re-thrown only once all the resources have been closed. * * @throws IOException if one or more of the tracked resources * could not be closed */ public void close() throws IOException { // Release all resources and keep track of any exceptions - IOException exception = null; + List exceptions = new LinkedList(); for (Closeable resource : resources) { try { resource.close(); } catch (IOException e) { - if (exception == null) { - exception = e; - } else { - exception.addSuppressed(e); - } + exceptions.add(e); } } resources.clear(); // Throw any exceptions that were captured from above - if (exception != null) { - throw exception; + if (!exceptions.isEmpty()) { + if (exceptions.size() == 1) { + throw exceptions.get(0); + } else { + throw new IOExceptionWithCause( + "Multiple IOExceptions" + exceptions, + exceptions.get(0)); + } } } diff --git a/tika-core/src/main/java/org/apache/tika/io/TikaInputStream.java b/tika-core/src/main/java/org/apache/tika/io/TikaInputStream.java index 29957ff..a254908 100644 --- a/tika-core/src/main/java/org/apache/tika/io/TikaInputStream.java +++ b/tika-core/src/main/java/org/apache/tika/io/TikaInputStream.java @@ -16,24 +16,21 @@ */ package org.apache.tika.io; -import static java.nio.file.StandardCopyOption.REPLACE_EXISTING; - import java.io.BufferedInputStream; import java.io.ByteArrayInputStream; import java.io.Closeable; import java.io.File; import java.io.FileInputStream; import java.io.FileNotFoundException; +import java.io.FileOutputStream; import java.io.IOException; import java.io.InputStream; +import java.io.OutputStream; import java.net.URI; import java.net.URISyntaxException; import java.net.URL; import java.net.URLConnection; import java.nio.channels.FileChannel; -import 
java.nio.file.Files; -import java.nio.file.Path; -import java.nio.file.Paths; import java.sql.Blob; import java.sql.SQLException; @@ -88,16 +85,19 @@ * when you don't explicitly close the returned stream. The * recommended access pattern is: *

    -     * try (TemporaryResources tmp = new TemporaryResources()) {
    +     * TemporaryResources tmp = new TemporaryResources();
    +     * try {
          *     TikaInputStream stream = TikaInputStream.get(..., tmp);
          *     // process stream but don't close it
    +     * } finally {
    +     *     tmp.close();
          * }
          * 
    *

    * The given stream instance will not be closed when the - * {@link TemporaryResources#close()} method is called by the - * try-with-resources statement. The caller is expected to explicitly - * close the original stream when it's no longer used. + * {@link TemporaryResources#close()} method is called. The caller + * is expected to explicitly close the original stream when it's no + * longer used. * * @since Apache Tika 0.10 * @param stream normal input stream @@ -131,14 +131,17 @@ * do explicitly close the returned stream. The recommended * access pattern is: *

    -     * try (TikaInputStream stream = TikaInputStream.get(...)) {
    +     * TikaInputStream stream = TikaInputStream.get(...);
    +     * try {
          *     // process stream
    +     * } finally {
    +     *     stream.close();
          * }
          * 
    *

    * The given stream instance will be closed along with any other resources * associated with the returned TikaInputStream instance when the - * {@link #close()} method is called by the try-with-resources statement. + * {@link #close()} method is called. * * @param stream normal input stream * @return a TikaInputStream instance @@ -196,71 +199,31 @@ } /** - * Creates a TikaInputStream from the file at the given path. + * Creates a TikaInputStream from the given file. *

    * Note that you must always explicitly close the returned stream to * prevent leaking open file handles. * - * @param path input file - * @return a TikaInputStream instance - * @throws IOException if an I/O error occurs - */ - public static TikaInputStream get(Path path) throws IOException { - return get(path, new Metadata()); - } - - /** - * Creates a TikaInputStream from the file at the given path. The file name - * and length are stored as input metadata in the given metadata instance. + * @param file input file + * @return a TikaInputStream instance + * @throws FileNotFoundException if the file does not exist + */ + public static TikaInputStream get(File file) throws FileNotFoundException { + return get(file, new Metadata()); + } + + /** + * Creates a TikaInputStream from the given file. The file name and + * length are stored as input metadata in the given metadata instance. *

    * Note that you must always explicitly close the returned stream to * prevent leaking open file handles. * - * @param path input file - * @param metadata metadata instance - * @return a TikaInputStream instance - * @throws IOException if an I/O error occurs - */ - public static TikaInputStream get(Path path, Metadata metadata) - throws IOException { - metadata.set(Metadata.RESOURCE_NAME_KEY, path.getFileName().toString()); - metadata.set(Metadata.CONTENT_LENGTH, Long.toString(Files.size(path))); - return new TikaInputStream(path); - } - - /** - * Creates a TikaInputStream from the given file. - *

    - * Note that you must always explicitly close the returned stream to - * prevent leaking open file handles. - * - * @param file input file - * @return a TikaInputStream instance - * @throws FileNotFoundException if the file does not exist - * @deprecated use {@link #get(Path)}. In Tika 2.0, this will be removed - * or modified to throw an IOException. - */ - @Deprecated - public static TikaInputStream get(File file) throws FileNotFoundException { - return get(file, new Metadata()); - } - - /** - * Creates a TikaInputStream from the given file. The file name and - * length are stored as input metadata in the given metadata instance. - *

    - * Note that you must always explicitly close the returned stream to - * prevent leaking open file handles. - * * @param file input file * @param metadata metadata instance * @return a TikaInputStream instance * @throws FileNotFoundException if the file does not exist - * or cannot be opened for reading - * @deprecated use {@link #get(Path, Metadata)}. In Tika 2.0, - * this will be removed or modified to throw an IOException. - */ - @Deprecated + */ public static TikaInputStream get(File file, Metadata metadata) throws FileNotFoundException { metadata.set(Metadata.RESOURCE_NAME_KEY, file.getName()); @@ -357,9 +320,9 @@ throws IOException { // Special handling for file:// URIs if ("file".equalsIgnoreCase(uri.getScheme())) { - Path path = Paths.get(uri); - if (Files.isRegularFile(path)) { - return get(path, metadata); + File file = new File(uri); + if (file.isFile()) { + return get(file, metadata); } } @@ -397,9 +360,9 @@ // Special handling for file:// URLs if ("file".equalsIgnoreCase(url.getProtocol())) { try { - Path path = Paths.get(url.toURI()); - if (Files.isRegularFile(path)) { - return get(path, metadata); + File file = new File(url.toURI()); + if (file.isFile()) { + return get(file, metadata); } } catch (URISyntaxException e) { // fall through @@ -435,13 +398,13 @@ } /** - * The path to the file that contains the contents of this stream. - * This is either the original file passed to the - * {@link #TikaInputStream(Path)} constructor or a temporary file created - * by a call to the {@link #getPath()} method. If neither has been called, - * then the value is null. - */ - private Path path; + * The file that contains the contents of this stream. This is either + * the original file passed to the {@link #TikaInputStream(File)} + * constructor or a temporary file created by a call to the + * {@link #getFile()} method. If neither has been called, then + * the value is null. + */ + private File file; /** * Tracker of temporary resources. 
@@ -474,28 +437,12 @@ * Creates a TikaInputStream instance. This private constructor is used * by the static factory methods based on the available information. * - * @param path the path to the file that contains the stream - * @throws IOException if an I/O error occurs - */ - private TikaInputStream(Path path) throws IOException { - super(new BufferedInputStream(Files.newInputStream(path))); - this.path = path; - this.tmp = new TemporaryResources(); - this.length = Files.size(path); - } - - /** - * Creates a TikaInputStream instance. This private constructor is used - * by the static factory methods based on the available information. - * * @param file the file that contains the stream * @throws FileNotFoundException if the file does not exist - * @deprecated use {@link #TikaInputStream(Path)} - */ - @Deprecated + */ private TikaInputStream(File file) throws FileNotFoundException { super(new BufferedInputStream(new FileInputStream(file))); - this.path = file.toPath(); + this.file = file; this.tmp = new TemporaryResources(); this.length = file.length(); } @@ -515,7 +462,7 @@ private TikaInputStream( InputStream stream, TemporaryResources tmp, long length) { super(stream); - this.path = null; + this.file = null; this.tmp = tmp; this.length = length; } @@ -574,20 +521,25 @@ } public boolean hasFile() { - return path != null; - } - - public Path getPath() throws IOException { - if (path == null) { + return file != null; + } + + public File getFile() throws IOException { + if (file == null) { if (position > 0) { throw new IOException("Stream is already being read"); } else { // Spool the entire stream into a temporary file - path = tmp.createTempFile(); - Files.copy(in, path, REPLACE_EXISTING); + file = tmp.createTemporaryFile(); + OutputStream out = new FileOutputStream(file); + try { + IOUtils.copy(in, out); + } finally { + out.close(); + } // Create a new input stream and make sure it'll get closed - InputStream newStream = Files.newInputStream(path); + 
FileInputStream newStream = new FileInputStream(file); tmp.addResource(newStream); // Replace the spooled stream with the new stream in a way @@ -602,21 +554,16 @@ } }; - length = Files.size(path); + length = file.length(); } } - return path; - } - - /** - * @see #getPath() - */ - public File getFile() throws IOException { - return getPath().toFile(); + return file; } public FileChannel getFileChannel() throws IOException { - FileChannel channel = FileChannel.open(getPath()); + FileInputStream fis = new FileInputStream(getFile()); + tmp.addResource(fis); + FileChannel channel = fis.getChannel(); tmp.addResource(channel); return channel; } @@ -628,7 +575,7 @@ /** * Returns the length (in bytes) of this stream. Note that if the length * was not available when this stream was instantiated, then this method - * will use the {@link #getPath()} method to buffer the entire stream to + * will use the {@link #getFile()} method to buffer the entire stream to * a temporary file in order to calculate the stream length. This case * will only work if the stream has not yet been consumed. 
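The getFile() rewrite above spools a not-yet-consumed stream into a temporary file using a plain FileOutputStream and a copy loop. A self-contained sketch of that spooling step (simplified; the real code registers the file with TemporaryResources for deferred deletion):

```java
import java.io.ByteArrayInputStream;
import java.io.File;
import java.io.FileOutputStream;
import java.io.IOException;
import java.io.InputStream;
import java.io.OutputStream;

// Simplified version of the spool-to-temp-file step in getFile():
// copy the stream to disk, closing the output in finally.
public class SpoolDemo {
    public static File spool(InputStream in) throws IOException {
        File file = File.createTempFile("apache-tika-", ".tmp");
        file.deleteOnExit();
        OutputStream out = new FileOutputStream(file);
        try {
            byte[] buffer = new byte[4096];
            int n;
            while ((n = in.read(buffer)) != -1) {
                out.write(buffer, 0, n); // manual copy, stand-in for IOUtils.copy
            }
        } finally {
            out.close();
        }
        return file;
    }

    public static void main(String[] args) throws IOException {
        File f = spool(new ByteArrayInputStream(new byte[] {1, 2, 3}));
        System.out.println(f.length());
    }
}
```

After spooling, the stream's length becomes known (`file.length()`), which is why getLength() in this hunk falls back to getFile().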
* @@ -637,7 +584,7 @@ */ public long getLength() throws IOException { if (length == -1) { - getPath(); // updates length internally + length = getFile().length(); } return length; } @@ -678,7 +625,7 @@ @Override public void close() throws IOException { - path = null; + file = null; mark = -1; // The close method was explicitly called, so we indeed @@ -700,7 +647,7 @@ public String toString() { String str = "TikaInputStream of "; if (hasFile()) { - str += path.toString(); + str += file.toString(); } else { str += in.toString(); } diff --git a/tika-core/src/main/java/org/apache/tika/language/LanguageIdentifier.java b/tika-core/src/main/java/org/apache/tika/language/LanguageIdentifier.java index 00f6d06..8a072b9 100644 --- a/tika-core/src/main/java/org/apache/tika/language/LanguageIdentifier.java +++ b/tika-core/src/main/java/org/apache/tika/language/LanguageIdentifier.java @@ -25,8 +25,6 @@ import java.util.Properties; import java.util.Set; -import static java.nio.charset.StandardCharsets.UTF_8; - /** * Identifier of the language that best matches a given content profile. 
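Several hunks in this patch replace the Java 7 constant java.nio.charset.StandardCharsets.UTF_8 with the charset name "UTF-8", which has been accepted by the String-based API since Java 1.1. For the name-based cases the two spellings select the same encoding (the hunks that fall back to the platform-default `getBytes()` are a different, locale-dependent matter). A quick equivalence sketch:

```java
import java.io.UnsupportedEncodingException;

// The Java 6-compatible String-based charset API used in this patch
// produces the same bytes as the Java 7 StandardCharsets constant.
public class CharsetDemo {
    public static byte[] utf8Bytes(String s) {
        try {
            return s.getBytes("UTF-8"); // name-based lookup
        } catch (UnsupportedEncodingException e) {
            // UTF-8 is guaranteed to be present on every JVM
            throw new RuntimeException(e);
        }
    }

    public static void main(String[] args) {
        System.out.println(utf8Bytes("Main-Class: ForkServer\n").length);
    }
}
```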
* The content profile is compared to generic language profiles based on @@ -46,6 +44,7 @@ private static final Map PROFILES = new HashMap(); private static final String PROFILE_SUFFIX = ".ngp"; + private static final String PROFILE_ENCODING = "UTF-8"; private static Properties props = new Properties(); private static String errors = ""; @@ -73,11 +72,11 @@ try { LanguageProfile profile = new LanguageProfile(); - try (InputStream stream = - LanguageIdentifier.class.getResourceAsStream( - language + PROFILE_SUFFIX)) { + InputStream stream = + LanguageIdentifier.class.getResourceAsStream(language + PROFILE_SUFFIX); + try { BufferedReader reader = - new BufferedReader(new InputStreamReader(stream, UTF_8)); + new BufferedReader(new InputStreamReader(stream, PROFILE_ENCODING)); String line = reader.readLine(); while (line != null) { if (line.length() > 0 && !line.startsWith("#")) { @@ -88,6 +87,8 @@ } line = reader.readLine(); } + } finally { + stream.close(); } addProfile(language, profile); diff --git a/tika-core/src/main/java/org/apache/tika/language/LanguageProfile.java b/tika-core/src/main/java/org/apache/tika/language/LanguageProfile.java index 9442920..3b01a58 100644 --- a/tika-core/src/main/java/org/apache/tika/language/LanguageProfile.java +++ b/tika-core/src/main/java/org/apache/tika/language/LanguageProfile.java @@ -16,15 +16,10 @@ */ package org.apache.tika.language; - import java.util.HashMap; import java.util.HashSet; import java.util.Map; import java.util.Set; -import java.util.List; -import java.util.ArrayList; -import java.util.Collections; -import java.util.Comparator; /** * Language profile based on ngram counts. @@ -42,12 +37,6 @@ */ private final Map ngrams = new HashMap(); - - /** - * Sorted ngram cache for faster distance calculation. - */ - private Interleaved interleaved = new Interleaved(); - public static boolean useInterleaved = true; // For testing purposes /** * The sum of all ngram counts in this profile. 
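The distance() method retained by this patch is the standard calculation: Euclidean distance over relative ngram frequencies of the two profiles. A toy sketch of that computation using plain maps of ngram to count (the real class wraps mutable counters and validates ngram length):

```java
import java.util.HashMap;
import java.util.HashSet;
import java.util.Map;
import java.util.Set;

// Sketch of the standard (non-interleaved) profile distance:
// Euclidean distance over relative ngram frequencies.
public class NgramDistanceDemo {
    public static double distance(Map<String, Integer> a, long countA,
                                  Map<String, Integer> b, long countB) {
        double sumOfSquares = 0.0;
        double ca = Math.max(countA, 1.0);
        double cb = Math.max(countB, 1.0);
        // Union of ngrams seen in either profile
        Set<String> ngrams = new HashSet<String>(a.keySet());
        ngrams.addAll(b.keySet());
        for (String ngram : ngrams) {
            double fa = a.containsKey(ngram) ? a.get(ngram) / ca : 0.0;
            double fb = b.containsKey(ngram) ? b.get(ngram) / cb : 0.0;
            double difference = fa - fb;
            sumOfSquares += difference * difference;
        }
        return Math.sqrt(sumOfSquares);
    }
}
```

The interleaved variant deleted below computed the same value over a sorted char-array cache; only the traversal strategy differed.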
@@ -134,10 +123,6 @@ * @return distance between the profiles */ public double distance(LanguageProfile that) { - return useInterleaved ? distanceInterleaved(that) : distanceStandard(that); - } - - private double distanceStandard(LanguageProfile that) { if (length != that.length) { throw new IllegalArgumentException( "Unable to calculage distance of language profiles" @@ -167,148 +152,4 @@ return ngrams.toString(); } - /* Code for interleaved distance calculation below */ - - private double distanceInterleaved(LanguageProfile that) { - if (length != that.length) { - throw new IllegalArgumentException( - "Unable to calculage distance of language profiles" - + " with different ngram lengths: " - + that.length + " != " + length); - } - - double sumOfSquares = 0.0; - double thisCount = Math.max(this.count, 1.0); - double thatCount = Math.max(that.count, 1.0); - - Interleaved.Entry thisEntry = updateInterleaved().firstEntry(); - Interleaved.Entry thatEntry = that.updateInterleaved().firstEntry(); - - // Iterate the lists in parallel, until both lists has been depleted - while (thisEntry.hasNgram() || thatEntry.hasNgram()) { - if (!thisEntry.hasNgram()) { // Depleted this - sumOfSquares += square(thatEntry.count / thatCount); - thatEntry.next(); - continue; - } - - if (!thatEntry.hasNgram()) { // Depleted that - sumOfSquares += square(thisEntry.count / thisCount); - thisEntry.next(); - continue; - } - - final int compare = thisEntry.compareTo(thatEntry); - - if (compare == 0) { // Term exists both in this and that - double difference = thisEntry.count/thisCount - thatEntry.count/thatCount; - sumOfSquares += square(difference); - thisEntry.next(); - thatEntry.next(); - } else if (compare < 0) { // Term exists only in this - sumOfSquares += square(thisEntry.count/thisCount); - thisEntry.next(); - } else { // Term exists only in that - sumOfSquares += square(thatEntry.count/thatCount); - thatEntry.next(); - } - } - return Math.sqrt(sumOfSquares); - } - private double 
square(double count) { - return count * count; - } - - private class Interleaved { - - private char[] entries = null; // * - private int size = 0; // Number of entries (one entry = length+2 chars) - private long entriesGeneratedAtCount = -1; // Keeps track of when the sequential structure was current - - /** - * Ensure that the entries array is in sync with the ngrams. - */ - public void update() { - if (count == entriesGeneratedAtCount) { // Already up to date - return; - } - size = ngrams.size(); - final int numChars = (length+2)*size; - if (entries == null || entries.length < numChars) { - entries = new char[numChars]; - } - int pos = 0; - for (Map.Entry entry: getSortedNgrams()) { - for (int l = 0 ; l < length ; l++) { - entries[pos + l] = entry.getKey().charAt(l); - } - entries[pos + length] = (char)(entry.getValue().count / 65536); // Upper 16 bit - entries[pos + length + 1] = (char)(entry.getValue().count % 65536); // lower 16 bit - pos += length + 2; - } - entriesGeneratedAtCount = count; - } - - public Entry firstEntry() { - Entry entry = new Entry(); - if (size > 0) { - entry.update(0); - } - return entry; - } - - private List> getSortedNgrams() { - List> entries = new ArrayList>(ngrams.size()); - entries.addAll(ngrams.entrySet()); - Collections.sort(entries, new Comparator>() { - @Override - public int compare(Map.Entry o1, Map.Entry o2) { - return o1.getKey().compareTo(o2.getKey()); - } - }); - return entries; - } - - private class Entry implements Comparable { - char[] ngram = new char[length]; - int count = 0; - int pos = 0; - - private void update(int pos) { - this.pos = pos; - if (pos >= size) { // Reached the end - return; - } - final int origo = pos*(length+2); - System.arraycopy(entries, origo, ngram, 0, length); - count = entries[origo+length] * 65536 + entries[origo+length+1]; - } - - @Override - public int compareTo(Entry other) { - for (int i = 0 ; i < ngram.length ; i++) { - if (ngram[i] != other.ngram[i]) { - return ngram[i] - 
other.ngram[i]; - } - } - return 0; - } - public boolean hasNext() { - return pos < size-1; - } - public boolean hasNgram() { - return pos < size; - } - public void next() { - update(pos+1); - } - public String toString() { - return new String(ngram) + "(" + count + ")"; - } - } - } - private Interleaved updateInterleaved() { - interleaved.update(); - return interleaved; - } } diff --git a/tika-core/src/main/java/org/apache/tika/language/LanguageProfilerBuilder.java b/tika-core/src/main/java/org/apache/tika/language/LanguageProfilerBuilder.java index bac1f97..0508388 100644 --- a/tika-core/src/main/java/org/apache/tika/language/LanguageProfilerBuilder.java +++ b/tika-core/src/main/java/org/apache/tika/language/LanguageProfilerBuilder.java @@ -33,10 +33,7 @@ import java.util.Iterator; import java.util.List; import java.util.Map; - import org.apache.tika.exception.TikaException; - -import static java.nio.charset.StandardCharsets.UTF_8; /** * This class runs a ngram analysis over submitted text, results might be used @@ -344,7 +341,7 @@ ngrams.clear(); ngramcounts = new int[maxLength + 1]; - BufferedReader reader = new BufferedReader(new InputStreamReader(is, UTF_8)); + BufferedReader reader = new BufferedReader(new InputStreamReader(is, "UTF-8")); String line = null; while ((line = reader.readLine()) != null) { @@ -408,7 +405,7 @@ */ public void save(OutputStream os) throws IOException { os.write(("# NgramProfile generated at " + new Date() + - " for Apache Tika Language Identification\n").getBytes(UTF_8)); + " for Apache Tika Language Identification\n").getBytes()); // And then each ngram @@ -435,7 +432,7 @@ for (int i = 0; i < list.size(); i++) { NGramEntry e = list.get(i); String line = e.toString() + " " + e.getCount() + "\n"; - os.write(line.getBytes(UTF_8)); + os.write(line.getBytes("UTF-8")); } os.flush(); } diff --git a/tika-core/src/main/java/org/apache/tika/language/translate/DefaultTranslator.java 
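The removed distanceInterleaved code above walks two sorted ngram lists in parallel and accumulates a sum of squared frequency differences, treating an ngram missing from one profile as having frequency zero. A stand-alone sketch of that merge pattern, using SortedMap iterators instead of the removed char-packed Interleaved array (class and parameter names here are illustrative, not Tika's):

```java
import java.util.Iterator;
import java.util.Map;
import java.util.SortedMap;
import java.util.TreeMap;

public final class ProfileDistance {

    /**
     * Euclidean distance between two sorted ngram->count profiles,
     * iterated in parallel the way the removed distanceInterleaved did.
     * aTotal/bTotal are the total ngram counts of each profile.
     */
    public static double distance(SortedMap<String, Integer> a, long aTotal,
                                  SortedMap<String, Integer> b, long bTotal) {
        double aCount = Math.max(aTotal, 1.0);
        double bCount = Math.max(bTotal, 1.0);
        Iterator<Map.Entry<String, Integer>> ai = a.entrySet().iterator();
        Iterator<Map.Entry<String, Integer>> bi = b.entrySet().iterator();
        Map.Entry<String, Integer> ae = ai.hasNext() ? ai.next() : null;
        Map.Entry<String, Integer> be = bi.hasNext() ? bi.next() : null;

        double sumOfSquares = 0.0;
        while (ae != null || be != null) {
            // A depleted side sorts after everything, so the other side advances.
            int cmp = ae == null ? 1 : be == null ? -1 : ae.getKey().compareTo(be.getKey());
            if (cmp == 0) {            // ngram present in both profiles
                double diff = ae.getValue() / aCount - be.getValue() / bCount;
                sumOfSquares += diff * diff;
                ae = ai.hasNext() ? ai.next() : null;
                be = bi.hasNext() ? bi.next() : null;
            } else if (cmp < 0) {      // ngram only in a
                sumOfSquares += square(ae.getValue() / aCount);
                ae = ai.hasNext() ? ai.next() : null;
            } else {                   // ngram only in b
                sumOfSquares += square(be.getValue() / bCount);
                be = bi.hasNext() ? bi.next() : null;
            }
        }
        return Math.sqrt(sumOfSquares);
    }

    private static double square(double d) { return d * d; }

    public static void main(String[] args) {
        SortedMap<String, Integer> a = new TreeMap<>(Map.of("ab", 2, "bc", 2));
        SortedMap<String, Integer> b = new TreeMap<>(Map.of("ab", 1, "cd", 3));
        // (2/4 - 1/4)^2 + (2/4)^2 + (3/4)^2 = 0.875, sqrt of which is ~0.9354
        System.out.println(distance(a, 4, b, 4));
    }
}
```

The char-packed array in the deleted code buys cache locality over this map-based version, at the cost of the update() resynchronisation step it also needed.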
b/tika-core/src/main/java/org/apache/tika/language/translate/DefaultTranslator.java deleted file mode 100644 index f3e0600..0000000 --- a/tika-core/src/main/java/org/apache/tika/language/translate/DefaultTranslator.java +++ /dev/null @@ -1,119 +0,0 @@ -/* - * Licensed to the Apache Software Foundation (ASF) under one or more - * contributor license agreements. See the NOTICE file distributed with - * this work for additional information regarding copyright ownership. - * The ASF licenses this file to You under the Apache License, Version 2.0 - * (the "License"); you may not use this file except in compliance with - * the License. You may obtain a copy of the License at - * - * http://www.apache.org/licenses/LICENSE-2.0 - * - * Unless required by applicable law or agreed to in writing, software - * distributed under the License is distributed on an "AS IS" BASIS, - * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. - * See the License for the specific language governing permissions and - * limitations under the License. - */ - -package org.apache.tika.language.translate; - -import java.io.IOException; -import java.util.Collections; -import java.util.Comparator; -import java.util.List; - -import org.apache.tika.config.ServiceLoader; -import org.apache.tika.exception.TikaException; - -/** - * A translator which picks the first available {@link Translator} - * implementation available through the - * {@link javax.imageio.spi.ServiceRegistry service provider mechanism}. - * - * @since Apache Tika 1.6 - */ -public class DefaultTranslator implements Translator { - private transient final ServiceLoader loader; - - public DefaultTranslator(ServiceLoader loader) { - this.loader = loader; - } - public DefaultTranslator() { - this(new ServiceLoader()); - } - - /** - * Finds all statically loadable translators and sorts the list by name, - * rather than discovery order. 
- * - * @param loader service loader - * @return ordered list of statically loadable translators - */ - private static List getDefaultTranslators(ServiceLoader loader) { - List translators = loader.loadStaticServiceProviders(Translator.class); - Collections.sort(translators, new Comparator() { - public int compare(Translator t1, Translator t2) { - String n1 = t1.getClass().getName(); - String n2 = t2.getClass().getName(); - boolean tika1 = n1.startsWith("org.apache.tika."); - boolean tika2 = n2.startsWith("org.apache.tika."); - if (tika1 == tika2) { - return n1.compareTo(n2); - } else if (tika1) { - return -1; - } else { - return 1; - } - } - }); - return translators; - } - /** - * Returns the first available translator, or null if none are - */ - private static Translator getFirstAvailable(ServiceLoader loader) { - for (Translator t : getDefaultTranslators(loader)) { - if (t.isAvailable()) return t; - } - return null; - } - - /** - * Translate, using the first available service-loaded translator - */ - public String translate(String text, String sourceLanguage, String targetLanguage) throws TikaException, IOException { - Translator t = getFirstAvailable(loader); - if (t != null) { - return t.translate(text, sourceLanguage, targetLanguage); - } - throw new TikaException("No translators currently available"); - } - - /** - * Translate, using the first available service-loaded translator - */ - public String translate(String text, String targetLanguage) throws TikaException, IOException { - Translator t = getFirstAvailable(loader); - if (t != null) { - return t.translate(text, targetLanguage); - } - throw new TikaException("No translators currently available"); - } - - /** - * Returns all available translators - */ - public List getTranslators() { - return getDefaultTranslators(loader); - } - /** - * Returns the current translator - */ - public Translator getTranslator() { - return getFirstAvailable(loader); - } - - public boolean isAvailable() { - return 
getFirstAvailable(loader) != null; - } -} diff --git a/tika-core/src/main/java/org/apache/tika/language/translate/EmptyTranslator.java b/tika-core/src/main/java/org/apache/tika/language/translate/EmptyTranslator.java deleted file mode 100644 index 8fac26e..0000000 --- a/tika-core/src/main/java/org/apache/tika/language/translate/EmptyTranslator.java +++ /dev/null @@ -1,36 +0,0 @@ -/* - * Licensed to the Apache Software Foundation (ASF) under one or more - * contributor license agreements. See the NOTICE file distributed with - * this work for additional information regarding copyright ownership. - * The ASF licenses this file to You under the Apache License, Version 2.0 - * (the "License"); you may not use this file except in compliance with - * the License. You may obtain a copy of the License at - * - * http://www.apache.org/licenses/LICENSE-2.0 - * - * Unless required by applicable law or agreed to in writing, software - * distributed under the License is distributed on an "AS IS" BASIS, - * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. - * See the License for the specific language governing permissions and - * limitations under the License. - */ -package org.apache.tika.language.translate; - -/** - * Dummy translator that always declines to give any text. Useful as a - * sentinel translator for when no others are available - * for unknown document types. 
- */ -public class EmptyTranslator implements Translator { - public String translate(String text, String sourceLanguage, String targetLanguage) { - return null; - } - - public String translate(String text, String targetLanguage) { - return null; - } - - public boolean isAvailable() { - return true; - } -} diff --git a/tika-core/src/main/java/org/apache/tika/language/translate/Translator.java b/tika-core/src/main/java/org/apache/tika/language/translate/Translator.java deleted file mode 100644 index f225565..0000000 --- a/tika-core/src/main/java/org/apache/tika/language/translate/Translator.java +++ /dev/null @@ -1,71 +0,0 @@ -/* - * Licensed to the Apache Software Foundation (ASF) under one or more - * contributor license agreements. See the NOTICE file distributed with - * this work for additional information regarding copyright ownership. - * The ASF licenses this file to You under the Apache License, Version 2.0 - * (the "License"); you may not use this file except in compliance with - * the License. You may obtain a copy of the License at - * - * http://www.apache.org/licenses/LICENSE-2.0 - * - * Unless required by applicable law or agreed to in writing, software - * distributed under the License is distributed on an "AS IS" BASIS, - * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. - * See the License for the specific language governing permissions and - * limitations under the License. - */ -package org.apache.tika.language.translate; - -import org.apache.tika.exception.TikaException; - -import java.io.IOException; - -/** - * Interface for Translator services. - * @since Tika 1.6 - */ -public interface Translator { - /** - * Translate text between given languages. 
The following languages are supported: - * Arabic("ar"), Bulgarian("bg"), Catalan("ca"), Chinese-Simplified("zh-CHS"), Chinese-Traditional("zh-CHT"), - * Czech("cs"), Danish("da"), Dutch("nl"), English("en"), Estonian("et"), Finnish("fi"), French("fr"), German("de"), - * Greek("el"), Haitian-Creole("ht"), Hebrew("he"), Hindi("hi"), Hmong-Daw("mww"), Hungarian("hu"), - * Indonesian("id"), Italian("it"), Japanese("ja"), Korean("ko"), Latvian("lv"), Lithuanian("lt"), Malay("ms"), - * Norwegian("no"), Persian("fa"), Polish("pl"), Portuguese("pt"), Romanian("ro"), Russian("ru"), Slovak("sk"), - * Slovenian("sl"), Spanish("es"), Swedish("sv"), Thai("th"), Turkish("tr"), Ukrainian("uk"), Urdu("ur"), - * Vietnamese("vi"). - * @param text The text to translate. - * @param sourceLanguage The input text language (for example, "en"). - * @param targetLanguage The desired language to translate to (for example, "fr"). - * @return The translation result. If translation is unavailable, returns the same text back. - * @throws TikaException When there is an error translating. - * @throws java.io.IOException - * @since Tika 1.6 - */ - public String translate(String text, String sourceLanguage, String targetLanguage) throws TikaException, IOException; - - /** - * Translate text to the given language. This method attempts to auto-detect the source language of the text. 
- * The following languages are supported: - * Arabic("ar"), Bulgarian("bg"), Catalan("ca"), Chinese-Simplified("zh-CHS"), Chinese-Traditional("zh-CHT"), - * Czech("cs"), Danish("da"), Dutch("nl"), English("en"), Estonian("et"), Finnish("fi"), French("fr"), German("de"), - * Greek("el"), Haitian-Creole("ht"), Hebrew("he"), Hindi("hi"), Hmong-Daw("mww"), Hungarian("hu"), - * Indonesian("id"), Italian("it"), Japanese("ja"), Korean("ko"), Latvian("lv"), Lithuanian("lt"), Malay("ms"), - * Norwegian("no"), Persian("fa"), Polish("pl"), Portuguese("pt"), Romanian("ro"), Russian("ru"), Slovak("sk"), - * Slovenian("sl"), Spanish("es"), Swedish("sv"), Thai("th"), Turkish("tr"), Ukrainian("uk"), Urdu("ur"), - * Vietnamese("vi"). - * @param text The text to translate. - * @param targetLanguage The desired language to translate to (for example, "hi"). - * @return The translation result. If translation is unavailable, returns the same text back. - * @throws TikaException When there is an error translating. - * @throws java.io.IOException - * @since Tika 1.6 - */ - public String translate(String text, String targetLanguage) throws TikaException, IOException; - - /** - * @return true if this Translator is probably able to translate right now. - * @since Tika 1.6 - */ - public boolean isAvailable(); -} diff --git a/tika-core/src/main/java/org/apache/tika/metadata/AccessPermissions.java b/tika-core/src/main/java/org/apache/tika/metadata/AccessPermissions.java deleted file mode 100644 index 12ac0e5..0000000 --- a/tika-core/src/main/java/org/apache/tika/metadata/AccessPermissions.java +++ /dev/null @@ -1,71 +0,0 @@ -package org.apache.tika.metadata; - -/* - * Licensed to the Apache Software Foundation (ASF) under one or more - * contributor license agreements. See the NOTICE file distributed with - * this work for additional information regarding copyright ownership. 
- * The ASF licenses this file to You under the Apache License, Version 2.0 - * (the "License"); you may not use this file except in compliance with - * the License. You may obtain a copy of the License at - * - * http://www.apache.org/licenses/LICENSE-2.0 - * - * Unless required by applicable law or agreed to in writing, software - * distributed under the License is distributed on an "AS IS" BASIS, - * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. - * See the License for the specific language governing permissions and - * limitations under the License. - */ - -/** - * Until we can find a common standard, we'll use these options. They - * were mostly derived from PDFBox's AccessPermission, but some can - * apply to other document formats, especially CAN_MODIFY and FILL_IN_FORM. - */ -public interface AccessPermissions { - - final static String PREFIX = "access_permission"+Metadata.NAMESPACE_PREFIX_DELIMITER; - - /** - * Can any modifications be made to the document - */ - Property CAN_MODIFY = Property.externalTextBag(PREFIX+"can_modify"); - - /** - * Should content be extracted, generally. - */ - Property EXTRACT_CONTENT = Property.externalText(PREFIX+"extract_content"); - - /** - * Should content be extracted for the purposes - * of accessibility. - */ - Property EXTRACT_FOR_ACCESSIBILITY = Property.externalText(PREFIX + "extract_for_accessibility"); - - /** - * Can the user insert/rotate/delete pages. - */ - Property ASSEMBLE_DOCUMENT = Property.externalText(PREFIX+"assemble_document"); - - - /** - * Can the user fill in a form - */ - Property FILL_IN_FORM = Property.externalText(PREFIX+"fill_in_form"); - - /** - * Can the user modify annotations - */ - Property CAN_MODIFY_ANNOTATIONS = Property.externalText(PREFIX+"modify_annotations"); - - /** - * Can the user print the document - */ - Property CAN_PRINT = Property.externalText(PREFIX+"can_print"); - - /** - * Can the user print an image-degraded version of the document. 
- */ - Property CAN_PRINT_DEGRADED = Property.externalText(PREFIX+"can_print_degraded"); - -} diff --git a/tika-core/src/main/java/org/apache/tika/metadata/Database.java b/tika-core/src/main/java/org/apache/tika/metadata/Database.java deleted file mode 100644 index 7f91a37..0000000 --- a/tika-core/src/main/java/org/apache/tika/metadata/Database.java +++ /dev/null @@ -1,25 +0,0 @@ -package org.apache.tika.metadata; - -/* - * Licensed to the Apache Software Foundation (ASF) under one or more - * contributor license agreements. See the NOTICE file distributed with - * this work for additional information regarding copyright ownership. - * The ASF licenses this file to You under the Apache License, Version 2.0 - * (the "License"); you may not use this file except in compliance with - * the License. You may obtain a copy of the License at - * - * http://www.apache.org/licenses/LICENSE-2.0 - * - * Unless required by applicable law or agreed to in writing, software - * distributed under the License is distributed on an "AS IS" BASIS, - * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. - * See the License for the specific language governing permissions and - * limitations under the License. - */ -public interface Database { - final static String PREFIX = "database"+Metadata.NAMESPACE_PREFIX_DELIMITER; - - Property TABLE_NAME = Property.externalTextBag(PREFIX+"table_name"); - Property COLUMN_COUNT = Property.externalText(PREFIX+"column_count"); - Property COLUMN_NAME = Property.externalTextBag(PREFIX+"column_name"); -} \ No newline at end of file diff --git a/tika-core/src/main/java/org/apache/tika/metadata/IPTC.java b/tika-core/src/main/java/org/apache/tika/metadata/IPTC.java index af51b51..3fc44f6 100644 --- a/tika-core/src/main/java/org/apache/tika/metadata/IPTC.java +++ b/tika-core/src/main/java/org/apache/tika/metadata/IPTC.java @@ -369,7 +369,7 @@ *

    * Maps to this IIM property: 2:110 Credit * - * @see Photoshop#CREDIT + * @see Photoshop#CREDIT_LINE */ Property CREDIT_LINE = Photoshop.CREDIT; diff --git a/tika-core/src/main/java/org/apache/tika/metadata/Metadata.java b/tika-core/src/main/java/org/apache/tika/metadata/Metadata.java index 08e4fa8..ff4b279 100644 --- a/tika-core/src/main/java/org/apache/tika/metadata/Metadata.java +++ b/tika-core/src/main/java/org/apache/tika/metadata/Metadata.java @@ -25,7 +25,6 @@ import java.text.DateFormatSymbols; import java.text.ParseException; import java.text.SimpleDateFormat; -import java.util.Calendar; import java.util.Date; import java.util.Enumeration; import java.util.HashMap; @@ -334,8 +333,7 @@ if (property.isMultiValuePermitted()) { set(property, appendedValues(values, value)); } else { - throw new PropertyTypeException(property.getName() + - " : " + property.getPropertyType()); + throw new PropertyTypeException(property.getPropertyType()); } } } @@ -462,27 +460,6 @@ * @param date property value */ public void set(Property property, Date date) { - if(property.getPrimaryProperty().getPropertyType() != Property.PropertyType.SIMPLE) { - throw new PropertyTypeException(Property.PropertyType.SIMPLE, property.getPrimaryProperty().getPropertyType()); - } - if(property.getPrimaryProperty().getValueType() != Property.ValueType.DATE) { - throw new PropertyTypeException(Property.ValueType.DATE, property.getPrimaryProperty().getValueType()); - } - String dateString = null; - if (date != null) { - dateString = formatDate(date); - } - set(property, dateString); - } - - /** - * Sets the date value of the identified metadata property. 
- * - * @since Apache Tika 0.8 - * @param property simple integer property definition - * @param date property value - */ - public void set(Property property, Calendar date) { if(property.getPrimaryProperty().getPropertyType() != Property.PropertyType.SIMPLE) { throw new PropertyTypeException(Property.PropertyType.SIMPLE, property.getPrimaryProperty().getPropertyType()); } diff --git a/tika-core/src/main/java/org/apache/tika/metadata/OfficeOpenXMLExtended.java b/tika-core/src/main/java/org/apache/tika/metadata/OfficeOpenXMLExtended.java index 5829339..b7b0264 100644 --- a/tika-core/src/main/java/org/apache/tika/metadata/OfficeOpenXMLExtended.java +++ b/tika-core/src/main/java/org/apache/tika/metadata/OfficeOpenXMLExtended.java @@ -38,10 +38,11 @@ Property TEMPLATE = Property.externalText( PREFIX + Metadata.NAMESPACE_PREFIX_DELIMITER + "Template"); - Property MANAGER = Property.externalTextBag( + Property MANAGER = Property.externalText( PREFIX + Metadata.NAMESPACE_PREFIX_DELIMITER + "Manager"); - Property COMPANY = Property.externalText( PREFIX + Metadata.NAMESPACE_PREFIX_DELIMITER + "Company"); + Property COMPANY = Property.externalText( + PREFIX + Metadata.NAMESPACE_PREFIX_DELIMITER + "Company"); Property PRESENTATION_FORMAT = Property.externalText( PREFIX + Metadata.NAMESPACE_PREFIX_DELIMITER + "PresentationFormat"); diff --git a/tika-core/src/main/java/org/apache/tika/metadata/PagedText.java b/tika-core/src/main/java/org/apache/tika/metadata/PagedText.java index b3241a2..32645f3 100644 --- a/tika-core/src/main/java/org/apache/tika/metadata/PagedText.java +++ b/tika-core/src/main/java/org/apache/tika/metadata/PagedText.java @@ -22,7 +22,7 @@ * properties defined in the XMP standard. 
* * @since Apache Tika 0.8 - * @see XMP Specification, Part 2: Standard Schemas */ public interface PagedText { diff --git a/tika-core/src/main/java/org/apache/tika/metadata/Photoshop.java b/tika-core/src/main/java/org/apache/tika/metadata/Photoshop.java index 76bd4d9..918bb37 100644 --- a/tika-core/src/main/java/org/apache/tika/metadata/Photoshop.java +++ b/tika-core/src/main/java/org/apache/tika/metadata/Photoshop.java @@ -38,14 +38,6 @@ Property AUTHORS_POSITION = Property.internalText( PREFIX_PHOTOSHOP + Metadata.NAMESPACE_PREFIX_DELIMITER + "AuthorsPosition"); - // TODO Replace this with proper indexed choices support - String[] _COLOR_MODE_CHOICES_INDEXED = { "Bitmap", "Greyscale", "Indexed Colour", - "RGB Color", "CMYK Colour", "Multi-Channel", "Duotone", "LAB Colour", - "reserved", "reserved", "YCbCr Colour", "YCgCo Colour", "YCbCrK Colour"}; - Property COLOR_MODE = Property.internalClosedChoise( - PREFIX_PHOTOSHOP + Metadata.NAMESPACE_PREFIX_DELIMITER + "ColorMode", - _COLOR_MODE_CHOICES_INDEXED); - Property CAPTION_WRITER = Property.internalText( PREFIX_PHOTOSHOP + Metadata.NAMESPACE_PREFIX_DELIMITER + "CaptionWriter"); diff --git a/tika-core/src/main/java/org/apache/tika/metadata/RTFMetadata.java b/tika-core/src/main/java/org/apache/tika/metadata/RTFMetadata.java deleted file mode 100644 index e2c1471..0000000 --- a/tika-core/src/main/java/org/apache/tika/metadata/RTFMetadata.java +++ /dev/null @@ -1,46 +0,0 @@ -package org.apache.tika.metadata; /* - * Licensed to the Apache Software Foundation (ASF) under one or more - * contributor license agreements. See the NOTICE file distributed with - * this work for additional information regarding copyright ownership. - * The ASF licenses this file to You under the Apache License, Version 2.0 - * (the "License"); you may not use this file except in compliance with - * the License. 
You may obtain a copy of the License at - * - * http://www.apache.org/licenses/LICENSE-2.0 - * - * Unless required by applicable law or agreed to in writing, software - * distributed under the License is distributed on an "AS IS" BASIS, - * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. - * See the License for the specific language governing permissions and - * limitations under the License. - */ import org.apache.tika.metadata.Metadata; import org.apache.tika.metadata.Property; public interface -RTFMetadata { - public static final String PREFIX_RTF_META = "rtf_meta"; - - - public static final String RTF_PICT_META_PREFIX = "rtf_pict:"; - - /** - * if set to true, this means that an image file is probably a "thumbnail" - * any time a pict/emf/wmf is in an object - */ - Property THUMBNAIL = Property.internalBoolean(PREFIX_RTF_META+ - Metadata.NAMESPACE_PREFIX_DELIMITER+"thumbnail"); - - /** - * if an application and version is given as part of the - * embedded object, this is the literal string - */ - Property EMB_APP_VERSION = Property.internalText(PREFIX_RTF_META+ - Metadata.NAMESPACE_PREFIX_DELIMITER+"emb_app_version"); - - Property EMB_CLASS = Property.internalText(PREFIX_RTF_META+ - Metadata.NAMESPACE_PREFIX_DELIMITER+"emb_class"); - - Property EMB_TOPIC = Property.internalText(PREFIX_RTF_META+ - Metadata.NAMESPACE_PREFIX_DELIMITER+"emb_topic"); - - Property EMB_ITEM = Property.internalText(PREFIX_RTF_META+ - Metadata.NAMESPACE_PREFIX_DELIMITER+"emb_item"); - -} diff --git a/tika-core/src/main/java/org/apache/tika/metadata/TIFF.java b/tika-core/src/main/java/org/apache/tika/metadata/TIFF.java index f4ecacc..af81ef0 100644 --- a/tika-core/src/main/java/org/apache/tika/metadata/TIFF.java +++ b/tika-core/src/main/java/org/apache/tika/metadata/TIFF.java @@ -22,7 +22,7 @@ * properties defined in the XMP standard. 
* * @since Apache Tika 0.8 - * @see XMP Specification, Part 2: Standard Schemas */ public interface TIFF { diff --git a/tika-core/src/main/java/org/apache/tika/metadata/TikaCoreProperties.java b/tika-core/src/main/java/org/apache/tika/metadata/TikaCoreProperties.java index 6f2ba2a..6084877 100644 --- a/tika-core/src/main/java/org/apache/tika/metadata/TikaCoreProperties.java +++ b/tika-core/src/main/java/org/apache/tika/metadata/TikaCoreProperties.java @@ -36,46 +36,6 @@ */ @SuppressWarnings("deprecation") public interface TikaCoreProperties { - - /** - * A file might contain different types of embedded documents. - * The most common is the ATTACHMENT. - * An INLINE embedded resource should be used for embedded image - * files that are used to render the page image (as in PDXObjImages in PDF files). - *

    - * Not all parsers have yet implemented this. - * - */ - public enum EmbeddedResourceType { - INLINE, - ATTACHMENT - }; - - /** - * Use this to prefix metadata properties that store information - * about the parsing process. Users should be able to distinguish - * between metadata that was contained within the document and - * metadata about the parsing process. - * In Tika 2.0 (or earlier?), let's change X-ParsedBy to X-TIKA-Parsed-By. - */ - public static String TIKA_META_PREFIX = "X-TIKA"+Metadata.NAMESPACE_PREFIX_DELIMITER; - - /** - * Use this to store parse exception information in the Metadata object. - */ - public static String TIKA_META_EXCEPTION_PREFIX = TIKA_META_PREFIX+"EXCEPTION"+ - Metadata.NAMESPACE_PREFIX_DELIMITER; - - /** - * This is currently used to identify Content-Type that may be - * included within a document, such as in html documents - * (e.g. ) - , or the value might come from outside the document. This information - * may be faulty and should be treated only as a hint. 
- */ - public static final Property CONTENT_TYPE_HINT = - Property.internalText(HttpHeaders.CONTENT_TYPE+"-Hint"); - /** * @see DublinCore#FORMAT */ @@ -286,13 +246,5 @@ @Deprecated public static final Property TRANSITION_SUBJECT_TO_OO_SUBJECT = Property.composite(OfficeOpenXMLCore.SUBJECT, new Property[] { Property.internalText(Metadata.SUBJECT) }); - - /** - * See {@link #EMBEDDED_RESOURCE_TYPE} - */ - public static final Property EMBEDDED_RESOURCE_TYPE = - Property.internalClosedChoise(TikaMetadataKeys.EMBEDDED_RESOURCE_TYPE, - new String[]{EmbeddedResourceType.ATTACHMENT.toString(), EmbeddedResourceType.INLINE.toString()}); - } diff --git a/tika-core/src/main/java/org/apache/tika/metadata/TikaMetadataKeys.java b/tika-core/src/main/java/org/apache/tika/metadata/TikaMetadataKeys.java index 0c18beb..0846e32 100644 --- a/tika-core/src/main/java/org/apache/tika/metadata/TikaMetadataKeys.java +++ b/tika-core/src/main/java/org/apache/tika/metadata/TikaMetadataKeys.java @@ -26,7 +26,4 @@ String PROTECTED = "protected"; String EMBEDDED_RELATIONSHIP_ID = "embeddedRelationshipId"; - - String EMBEDDED_RESOURCE_TYPE = "embeddedResourceType"; - } diff --git a/tika-core/src/main/java/org/apache/tika/metadata/XMPDM.java b/tika-core/src/main/java/org/apache/tika/metadata/XMPDM.java index ce78145..f4ca630 100644 --- a/tika-core/src/main/java/org/apache/tika/metadata/XMPDM.java +++ b/tika-core/src/main/java/org/apache/tika/metadata/XMPDM.java @@ -24,7 +24,7 @@ * properties defined in the XMP standard. * * @since Apache Tika 0.7 - * @see XMP Specification, Part 2: Standard Schemas */ public interface XMPDM { @@ -58,11 +58,6 @@ * "The name of the artist or artists." */ Property ARTIST = Property.externalText("xmpDM:artist"); - - /** - * "The name of the album artist or group for compilation albums." - */ - Property ALBUM_ARTIST = Property.externalText("xmpDM:albumArtist"); /** * "The date and time when the audio was last modified." 
@@ -147,11 +142,6 @@ // Property BEAT_SPLICE_PARAMS = "xmpDM:beatSpliceParams"; /** - * "An album created by various artists." - */ - Property COMPILATION = Property.externalInteger("xmpDM:compilation"); - - /** * "The composer's name." */ Property COMPOSER = Property.externalText("xmpDM:composer"); @@ -165,11 +155,6 @@ * "The copyright information." */ Property COPYRIGHT = Property.externalText("xmpDM:copyright"); - - /** - * "The disc number for part of an album set." - */ - Property DISC_NUMBER = Property.externalInteger("xmpDM:discNumber"); /** * "The duration of the media file." diff --git a/tika-core/src/main/java/org/apache/tika/mime/MediaTypeRegistry.java b/tika-core/src/main/java/org/apache/tika/mime/MediaTypeRegistry.java index 33dce8f..497d036 100644 --- a/tika-core/src/main/java/org/apache/tika/mime/MediaTypeRegistry.java +++ b/tika-core/src/main/java/org/apache/tika/mime/MediaTypeRegistry.java @@ -82,23 +82,6 @@ } return aliases; } - - /** - * Returns the set of known children of the given canonical media type - * - * @since Apache Tika 1.8 - * @param type canonical media type - * @return known children - */ - public SortedSet getChildTypes(MediaType type) { - SortedSet children = new TreeSet(); - for (Map.Entry entry : inheritance.entrySet()) { - if (entry.getValue().equals(type)) { - children.add(entry.getKey()); - } - } - return children; - } public void addType(MediaType type) { registry.put(type, type); @@ -170,12 +153,12 @@ } /** - * Returns the supertype of the given type. If the media type database - * has an explicit inheritance rule for the type, then that is used. - * Next, if the given type has any parameters, then the respective base - * type (parameter-less) is returned. Otherwise built-in heuristics like - * text/... -> text/plain and .../...+xml -> application/xml are used. - * Finally application/octet-stream is returned for all types for which no other + * Returns the supertype of the given type. 
If the given type has any + * parameters, then the respective base type is returned. Otherwise + * built-in heuristics like text/... -> text/plain and + * .../...+xml -> application/xml are used in addition to explicit + * type inheritance rules read from the media type database. Finally + * application/octet-stream is returned for all types for which no other * supertype is known, and the return value for application/octet-stream * is null. * @@ -186,10 +169,10 @@ public MediaType getSupertype(MediaType type) { if (type == null) { return null; + } else if (type.hasParameters()) { + return type.getBaseType(); } else if (inheritance.containsKey(type)) { return inheritance.get(type); - } else if (type.hasParameters()) { - return type.getBaseType(); } else if (type.getSubtype().endsWith("+xml")) { return MediaType.APPLICATION_XML; } else if (type.getSubtype().endsWith("+zip")) { diff --git a/tika-core/src/main/java/org/apache/tika/mime/MimeType.java b/tika-core/src/main/java/org/apache/tika/mime/MimeType.java index b4d651e..6b2b12c 100644 --- a/tika-core/src/main/java/org/apache/tika/mime/MimeType.java +++ b/tika-core/src/main/java/org/apache/tika/mime/MimeType.java @@ -19,6 +19,7 @@ import java.io.Serializable; import java.net.URI; import java.util.ArrayList; +import java.util.Arrays; import java.util.Collections; import java.util.List; @@ -190,7 +191,7 @@ /** * Get the UTI for this mime type. 
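The reordered getSupertype above resolves a supertype with a fixed precedence: a parameterised type falls back to its base type, then explicit inheritance rules from the media type database apply, then the built-in suffix heuristics (+xml, +zip, text/...), and finally application/octet-stream is the catch-all root. A simplified string-based sketch of that precedence (plain strings instead of Tika's MediaType, and a one-entry inheritance table, so all names here are illustrative):

```java
import java.util.Map;

public final class SupertypeResolver {

    /** Stand-in for explicit inheritance rules read from the media type database. */
    private static final Map<String, String> INHERITANCE =
            Map.of("text/html", "text/plain");

    /** Resolves a supertype with the same precedence as the patched getSupertype. */
    public static String supertype(String type) {
        if (type == null) {
            return null;
        } else if (type.contains(";")) {                 // has parameters: use base type
            return type.substring(0, type.indexOf(';')).trim();
        } else if (INHERITANCE.containsKey(type)) {      // explicit inheritance rule
            return INHERITANCE.get(type);
        } else if (type.endsWith("+xml")) {              // built-in suffix heuristics
            return "application/xml";
        } else if (type.endsWith("+zip")) {
            return "application/zip";
        } else if (type.startsWith("text/") && !type.equals("text/plain")) {
            return "text/plain";
        } else if (type.equals("application/octet-stream")) {
            return null;                                 // root of the hierarchy
        } else {
            return "application/octet-stream";           // fallback supertype
        }
    }
}
```

Putting the parameter check before the inheritance lookup, as the diff does, means `application/xml; charset=UTF-8` resolves to `application/xml` even if an explicit rule exists for the full parameterised form.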
* - * @see http://en.wikipedia.org/wiki/Uniform_Type_Identifier + * @see http://en.wikipedia.org/wiki/Uniform_Type_Identifier * * @return The Uniform Type Identifier */ diff --git a/tika-core/src/main/java/org/apache/tika/mime/MimeTypes.java b/tika-core/src/main/java/org/apache/tika/mime/MimeTypes.java index 05d959f..79842f9 100644 --- a/tika-core/src/main/java/org/apache/tika/mime/MimeTypes.java +++ b/tika-core/src/main/java/org/apache/tika/mime/MimeTypes.java @@ -71,7 +71,7 @@ public static final String PLAIN_TEXT = "text/plain"; /** - * Name of the {@link #xmlMimeType xml} type, application/xml. + * Name of the {@link #xml xml} type, application/xml. */ public static final String XML = "application/xml"; @@ -79,7 +79,6 @@ * Root type, application/octet-stream. */ private final MimeType rootMimeType; - private final List rootMimeTypeL; /** * Text type, text/plain. @@ -113,8 +112,6 @@ rootMimeType = new MimeType(MediaType.OCTET_STREAM); textMimeType = new MimeType(MediaType.TEXT_PLAIN); xmlMimeType = new MimeType(MediaType.APPLICATION_XML); - - rootMimeTypeL = Collections.singletonList(rootMimeType); add(rootMimeType); add(textMimeType); @@ -160,11 +157,7 @@ /** * Returns the MIME type that best matches the given first few bytes * of a document stream. Returns application/octet-stream if no better - * match is found. - *

    - * If multiple matches are found, the best (highest priority) matching - * type is returned. If multiple matches are found with the same priority, - * then all of these are returned. + * match is found. *

    * The given byte array is expected to be at least {@link #getMinLength()} * long, or shorter only if the document stream itself is shorter. @@ -172,52 +165,44 @@ * @param data first few bytes of a document stream * @return matching MIME type */ - List getMimeType(byte[] data) { + private MimeType getMimeType(byte[] data) { if (data == null) { throw new IllegalArgumentException("Data is missing"); } else if (data.length == 0) { // See https://issues.apache.org/jira/browse/TIKA-483 - return rootMimeTypeL; + return rootMimeType; } // Then, check for magic bytes - List result = new ArrayList(1); - int currentPriority = -1; + MimeType result = null; for (Magic magic : magics) { - if (currentPriority > 0 && currentPriority > magic.getPriority()) { + if (magic.eval(data)) { + result = magic.getType(); break; } - if (magic.eval(data)) { - result.add(magic.getType()); - currentPriority = magic.getPriority(); - } } - if (!result.isEmpty()) { - for (int i=0; iUnlike {@link #forName(String)}, this function will not create a - * new MimeType and register it. Instead, null will be returned if - * there is no definition available for the given name. - * - *
<p>
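The removed `getRegisteredMimeType` javadoc below describes a lookup that first tries the full normalized name and then retries with parameters stripped. A standalone sketch of that fallback, assuming a hypothetical `RegisteredTypeLookup` stand-in (this is not the Tika API itself):

```java
import java.util.HashMap;
import java.util.Map;

// Illustrative sketch only: mirrors the parameter-fallback lookup removed from
// MimeTypes.getRegisteredMimeType(String). RegisteredTypeLookup is a
// hypothetical stand-in, not a Tika class.
public class RegisteredTypeLookup {

    private final Map<String, String> types = new HashMap<>();

    public RegisteredTypeLookup() {
        // Only the base type is registered, without parameters.
        types.put("application/xml", "application/xml");
    }

    /** Strips any "; name=value" parameters, leaving the base type. */
    static String baseType(String name) {
        int semicolon = name.indexOf(';');
        String base = semicolon < 0 ? name : name.substring(0, semicolon);
        return base.trim().toLowerCase();
    }

    /**
     * Tries the full (normalized) name first; if nothing is registered under
     * it, retries with the parameters stripped, so a lookup for
     * "application/xml; charset=UTF-8" still finds "application/xml".
     */
    public String lookup(String name) {
        String hit = types.get(name.trim().toLowerCase());
        if (hit != null) {
            return hit;
        }
        return types.get(baseType(name));
    }
}
```

The reverted behavior in the hunk below drops this fallback and returns whatever `types.get(normalisedType)` yields directly.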
    Also, unlike {@link #forName(String)}, this function may return a - * mime type that has fewer parameters than were included in the supplied name. - * If the registered mime type has parameters (e.g. - * application/dita+xml;format=map), then those will be maintained. - * However, if the supplied name has paramenters that the registered mime - * type does not (e.g. application/xml; charset=UTF-8 as a name, - * compared to just application/xml for the type in the registry), - * then those parameters will not be included in the returned type. + * Unlike {@link #forName(String)}, this function will *not* create a new + * MimeType and register it * * @param name media type name (case-insensitive) - * @return the registered media type with the given name or alias, or null if not found + * @return the registered media type with the given name or alias * @throws MimeTypeException if the given media type name is invalid */ public MimeType getRegisteredMimeType(String name) throws MimeTypeException { MediaType type = MediaType.parse(name); if (type != null) { MediaType normalisedType = registry.normalize(type); - MimeType candidate = types.get(normalisedType); - if (candidate != null) { - return candidate; - } - if (normalisedType.hasParameters()) { - return types.get(normalisedType.getBaseType()); - } - return null; + return types.get(normalisedType); } else { throw new MimeTypeException("Invalid media type name: " + name); } @@ -462,14 +429,14 @@ */ public MediaType detect(InputStream input, Metadata metadata) throws IOException { - List possibleTypes = null; + MediaType type = MediaType.OCTET_STREAM; // Get type based on magic prefix if (input != null) { input.mark(getMinLength()); try { byte[] prefix = readMagicHeader(input); - possibleTypes = getMimeType(prefix); + type = getMimeType(prefix).getType(); } finally { input.reset(); } @@ -495,12 +462,10 @@ } if (name != null) { - MimeType hint = getMimeType(name); - - // If we have some types based on mime magic, try 
to specialise - // and/or select the type based on that - // Otherwise, use the type identified from the name - possibleTypes = applyHint(possibleTypes, hint); + MediaType hint = getMimeType(name).getType(); + if (registry.isSpecializationOf(hint, type)) { + type = hint; + } } } @@ -508,42 +473,16 @@ String typeName = metadata.get(Metadata.CONTENT_TYPE); if (typeName != null) { try { - MimeType hint = forName(typeName); - possibleTypes = applyHint(possibleTypes, hint); + MediaType hint = forName(typeName).getType(); + if (registry.isSpecializationOf(hint, type)) { + type = hint; + } } catch (MimeTypeException e) { // Malformed type name, ignore } } - if (possibleTypes == null || possibleTypes.isEmpty()) { - // Report that we don't know what it is - return MediaType.OCTET_STREAM; - } else { - return possibleTypes.get(0).getType(); - } - } - /** - * Use the MimeType hint to try to clarify or specialise the current - * possible types list. - * If the hint is a specialised form, use that instead - * If there are multiple possible types, use the hint to select one - */ - private List applyHint(List possibleTypes, MimeType hint) { - if (possibleTypes == null || possibleTypes.isEmpty()) { - return Collections.singletonList(hint); - } else { - for (int i=0; ihttp://freedesktop.org/wiki/Standards_2fshared_2dmime_2dinfo_2dspec + * @see http://freedesktop.org/wiki/Standards_2fshared_2dmime_2dinfo_2dspec */ public class MimeTypesReader extends DefaultHandler implements MimeTypesReaderMetKeys { protected final MimeTypes types; diff --git a/tika-core/src/main/java/org/apache/tika/mime/ProbabilisticMimeDetectionSelector.java b/tika-core/src/main/java/org/apache/tika/mime/ProbabilisticMimeDetectionSelector.java deleted file mode 100644 index c317735..0000000 --- a/tika-core/src/main/java/org/apache/tika/mime/ProbabilisticMimeDetectionSelector.java +++ /dev/null @@ -1,539 +0,0 @@ -/* - * Licensed to the Apache Software Foundation (ASF) under one or more - * contributor license 
agreements. See the NOTICE file distributed with - * this work for additional information regarding copyright ownership. - * The ASF licenses this file to You under the Apache License, Version 2.0 - * (the "License"); you may not use this file except in compliance with - * the License. You may obtain a copy of the License at - * - * http://www.apache.org/licenses/LICENSE-2.0 - * - * Unless required by applicable law or agreed to in writing, software - * distributed under the License is distributed on an "AS IS" BASIS, - * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. - * See the License for the specific language governing permissions and - * limitations under the License. - */ -package org.apache.tika.mime; - -import java.io.IOException; -import java.io.InputStream; -import java.net.URI; -import java.net.URISyntaxException; -import java.util.List; - -import org.apache.tika.detect.Detector; -import org.apache.tika.metadata.Metadata; -import org.apache.tika.mime.MediaType; -import org.apache.tika.mime.MediaTypeRegistry; -import org.apache.tika.mime.MimeType; -import org.apache.tika.mime.MimeTypeException; -import org.apache.tika.mime.MimeTypes; - -/** - * Selector for combining different mime detection results - * based on probability - */ -public class ProbabilisticMimeDetectionSelector implements Detector { - private static final long serialVersionUID = 224589862960269260L; - - private MimeTypes mimeTypes; - - private final MediaType rootMediaType; - - /** probability parameters default value */ - private static final float DEFAULT_MAGIC_TRUST = 0.9f; - private static final float DEFAULT_META_TRUST = 0.8f; - private static final float DEFAULT_EXTENSION_TRUST = 0.8f; - private float priorMagicFileType, priorExtensionFileType, - priorMetaFileType; - private float magic_trust, extension_trust, meta_trust; - private float magic_neg, extension_neg, meta_neg; - /* - * any posterior probability lower than the threshold, will be considered as - * 
an oct-stream type, the default value is 0.5 - */ - private float threshold; - - /* - * this change rate is used when there are multiple types predicted by - * magic-bytes. the first predicted type has the highest probability, and - * the probability for the next type predicted by magic-bytes will decay - * with this change rate. The idea is to have the first one to take - * precedence among the multiple possible types predicted by MAGIC-bytes. - */ - private float changeRate; - - /** ***********************/ - - public ProbabilisticMimeDetectionSelector() { - this(MimeTypes.getDefaultMimeTypes(), null); - } - - public ProbabilisticMimeDetectionSelector(final Builder builder) { - this(MimeTypes.getDefaultMimeTypes(), builder); - } - - public ProbabilisticMimeDetectionSelector(final MimeTypes mimeTypes) { - this(mimeTypes, null); - } - - public ProbabilisticMimeDetectionSelector(final MimeTypes mimeTypes, - final Builder builder) { - this.mimeTypes = mimeTypes; - rootMediaType = MediaType.OCTET_STREAM; - this.initializeDefaultProbabilityParameters(); - this.changeRate = 0.1f; - if (builder != null) { - priorMagicFileType = builder.priorMagicFileType == 0f ? - priorMagicFileType : builder.priorMagicFileType; - priorExtensionFileType = builder.priorExtensionFileType == 0f ? - priorExtensionFileType : builder.priorExtensionFileType; - priorMetaFileType = builder.priorMetaFileType == 0f ? - priorMetaFileType : builder.priorMetaFileType; - - magic_trust = builder.magic_trust == 0f ? magic_trust : builder.extension_neg; - extension_trust = builder.extension_trust == 0f ? extension_trust : builder.extension_trust; - meta_trust = builder.meta_trust == 0f ? meta_trust : builder.meta_trust; - - magic_neg = builder.magic_neg == 0f ? magic_neg : builder.magic_neg; - extension_neg = builder.extension_neg == 0f ? - extension_neg : builder.extension_neg; - meta_neg = builder.meta_neg == 0f ? meta_neg : builder.meta_neg; - threshold = builder.threshold == 0f ? 
threshold : builder.threshold; - } - } - - /** - * Initilize probability parameters with default values; - */ - private void initializeDefaultProbabilityParameters() { - priorMagicFileType = 0.5f; - priorExtensionFileType = 0.5f; - priorMetaFileType = 0.5f; - magic_trust = DEFAULT_MAGIC_TRUST; - extension_trust = DEFAULT_EXTENSION_TRUST; - meta_trust = DEFAULT_META_TRUST; - - // probability of the type detected by magic test given that the type is - // not the detected type. The default is taken by 1 - the magic trust - magic_neg = 1 - DEFAULT_MAGIC_TRUST; - // probability of the type detected by extension test given that the - // type is not the type detected by extension test - extension_neg = 1 - DEFAULT_EXTENSION_TRUST; - // same as above; but it could be customized to suffice different use. - meta_neg = 1 - DEFAULT_META_TRUST; - threshold = 0.5001f; - } - - public MediaType detect(InputStream input, Metadata metadata) - throws IOException { - List possibleTypes = null; - - // Get type based on magic prefix - if (input != null) { - input.mark(mimeTypes.getMinLength()); - try { - byte[] prefix = mimeTypes.readMagicHeader(input); - possibleTypes = mimeTypes.getMimeType(prefix); - } finally { - input.reset(); - } - } - - MimeType extHint = null; - // Get type based on resourceName hint (if available) - String resourceName = metadata.get(Metadata.RESOURCE_NAME_KEY); - if (resourceName != null) { - String name = null; - - // Deal with a URI or a path name in as the resource name - try { - URI uri = new URI(resourceName); - String path = uri.getPath(); - if (path != null) { - int slash = path.lastIndexOf('/'); - if (slash + 1 < path.length()) { - name = path.substring(slash + 1); - } - } - } catch (URISyntaxException e) { - name = resourceName; - } - - if (name != null) { - // MimeType hint = getMimeType(name); - extHint = mimeTypes.getMimeType(name); - // If we have some types based on mime magic, try to specialise - // and/or select the type based on that - // 
Otherwise, use the type identified from the name - // possibleTypes = applyHint(possibleTypes, hint); - } - } - - // Get type based on metadata hint (if available) - MimeType metaHint = null; - String typeName = metadata.get(Metadata.CONTENT_TYPE); - if (typeName != null) { - try { - // MimeType hint = forName(typeName); - metaHint = mimeTypes.forName(typeName); - // possibleTypes = applyHint(possibleTypes, hint); - } catch (MimeTypeException e) { - // Malformed type name, ignore - } - } - - /* - * the following calls the probability selection. - */ - return applyProbilities(possibleTypes, extHint, metaHint); - } - - private MediaType applyProbilities(final List possibleTypes, - final MimeType extMimeType, final MimeType metadataMimeType) { - - /* initialize some probability variables */ - MediaType extensionMediaType_ = extMimeType == null ? null : extMimeType.getType(); - MediaType metaMediaType_ = metadataMimeType == null ? null : metadataMimeType.getType(); - - int n = possibleTypes.size(); - float mag_trust = magic_trust; - float mag_neg = magic_neg; - float ext_trust = extension_trust; - float ext_neg = extension_neg; - float met_trust = meta_trust; - float met_neg = meta_neg; - /* ************************** */ - - /* pre-process some probability variables */ - if (extensionMediaType_ == null || extensionMediaType_.compareTo(rootMediaType) == 0) { - /* - * this is a root type, that means the extension method fails to - * identify any type. 
- */ - ext_trust = 1; - ext_neg = 1; - } - if (metaMediaType_ == null || metaMediaType_.compareTo(rootMediaType) == 0) { - met_trust = 1; - met_neg = 1; - } - - float maxProb = -1f; - MediaType bestEstimate = rootMediaType; - - if (possibleTypes != null && !possibleTypes.isEmpty()) { - int i; - for (i = 0; i < n; i++) { - MediaType magictype = possibleTypes.get(i).getType(); - MediaTypeRegistry registry = mimeTypes.getMediaTypeRegistry(); - if (magictype != null && magictype.equals(rootMediaType)) { - mag_trust = 1; - mag_neg = 1; - } else { - // check if each identified type belongs to the same class; - if (extensionMediaType_ != null) { - if (extensionMediaType_.equals(magictype) - || registry.isSpecializationOf( - extensionMediaType_, magictype)) { - // Use just this type - possibleTypes.set(i, extMimeType); - } else if (registry.isSpecializationOf(magictype, - extensionMediaType_)) { - extensionMediaType_ = magictype; - } - } - if (metaMediaType_ != null) { - if (metaMediaType_.equals(magictype) - || registry.isSpecializationOf(metaMediaType_, - magictype)) { - // Use just this type - possibleTypes.set(i, metadataMimeType); - } else if (registry.isSpecializationOf(magictype, - metaMediaType_)) { - metaMediaType_ = magictype; - } - } - } - - /* - * prepare the conditional probability for file type prediction. 
- */ - - float[] results = new float[3]; - float[] trust1 = new float[3]; - float[] negtrust1 = new float[3]; - magictype = possibleTypes.get(i).getType(); - - if (i > 0) { - /* - * decay as our trust goes down with next type predicted by - * magic - */ - mag_trust = mag_trust * (1 - changeRate); - /* - * grow as our trust goes down - */ - mag_neg = mag_neg * (1 + changeRate); - - } - - if (magictype != null && mag_trust != 1) { - trust1[0] = mag_trust; - negtrust1[0] = mag_neg; - if (metaMediaType_ != null && met_trust != 1) { - if (magictype.equals(metaMediaType_)) { - trust1[1] = met_trust; - negtrust1[1] = met_neg; - } else { - trust1[1] = 1 - met_trust; - negtrust1[1] = 1 - met_neg; - } - } else { - trust1[1] = 1; - negtrust1[1] = 1; - } - if (extensionMediaType_ != null && ext_trust != 1) { - if (magictype.equals(extensionMediaType_)) { - trust1[2] = ext_trust; - negtrust1[2] = ext_neg; - } else { - trust1[2] = 1 - ext_trust; - negtrust1[2] = 1 - ext_neg; - } - } else { - trust1[2] = 1; - negtrust1[2] = 1; - } - } else { - results[0] = 0.1f; - } - - float[] trust2 = new float[3]; - float[] negtrust2 = new float[3]; - if (metadataMimeType != null && met_trust != 1) { - trust2[1] = met_trust; - negtrust2[1] = met_neg; - if (magictype != null && mag_trust != 1) { - if (metaMediaType_.equals(magictype)) { - trust2[0] = mag_trust; - negtrust2[0] = mag_neg; - } else { - trust2[0] = 1 - mag_trust; - negtrust2[0] = 1 - mag_neg; - } - - } else { - trust2[0] = 1f; - negtrust2[0] = 1f; - } - if (extensionMediaType_ != null && ext_trust != 1) { - if (metaMediaType_.equals(extensionMediaType_)) { - trust2[2] = ext_trust; - negtrust2[2] = ext_neg; - } else { - trust2[2] = 1 - ext_trust; - negtrust2[2] = 1 - ext_neg; - } - } else { - trust2[2] = 1f; - negtrust2[2] = 1f; - } - } else { - results[1] = 0.1f; - } - - float[] trust3 = new float[3]; - float[] negtrust3 = new float[3]; - if (extensionMediaType_ != null && ext_trust != 1) { - trust3[2] = ext_trust; - negtrust3[2] = 
ext_neg; - if (magictype != null && mag_trust != 1) { - if (magictype.equals(extensionMediaType_)) { - trust3[0] = mag_trust; - negtrust3[0] = mag_neg; - } else { - trust3[0] = 1 - mag_trust; - negtrust3[0] = 1 - mag_neg; - } - } else { - trust3[0] = 1f; - negtrust3[0] = 1f; - } - - if (metaMediaType_ != null && met_trust != 1) { - if (metaMediaType_.equals(extensionMediaType_)) { - trust3[1] = met_trust; - negtrust3[1] = met_neg; - } else { - trust3[1] = 1 - met_trust; - negtrust3[1] = 1 - met_neg; - } - } else { - trust3[1] = 1f; - negtrust3[1] = 1f; - } - } else { - results[2] = 0.1f; - } - /* - * compute the posterior probability for each predicted file - * type and store them into the "results" array. - */ - float pPrime = priorMagicFileType; - float deno = 1 - priorMagicFileType; - int j; - - if (results[0] == 0) { - for (j = 0; j < trust1.length; j++) { - pPrime *= trust1[j]; - if (trust1[j] != 1) { - deno *= negtrust1[j]; - } - } - pPrime /= (pPrime + deno); - results[0] = pPrime; - - } - if (maxProb < results[0]) { - maxProb = results[0]; - bestEstimate = magictype; - } - - pPrime = priorMetaFileType; - deno = 1 - priorMetaFileType; - if (results[1] == 0) { - for (j = 0; j < trust2.length; j++) { - pPrime *= trust2[j]; - if (trust2[j] != 1) { - deno *= negtrust2[j]; - } - } - pPrime /= (pPrime + deno); - results[1] = pPrime; - - } - if (maxProb < results[1]) { - maxProb = results[1]; - bestEstimate = metaMediaType_; - } - - pPrime = priorExtensionFileType; - deno = 1 - priorExtensionFileType; - if (results[2] == 0) { - for (j = 0; j < trust3.length; j++) { - pPrime *= trust3[j]; - if (trust3[j] != 1) { - deno *= negtrust3[j]; - } - } - pPrime /= (pPrime + deno); - results[2] = pPrime; - } - if (maxProb < results[2]) { - maxProb = results[2]; - bestEstimate = extensionMediaType_; - } - /* - for (float r : results) { - System.out.print(r + "; "); - } - System.out.println(); - */ - } - - } - return maxProb < threshold ? 
this.rootMediaType : bestEstimate; - - } - - public MediaTypeRegistry getMediaTypeRegistry() { - return this.mimeTypes.getMediaTypeRegistry(); - } - - /** - * build class for probability parameters setting - * - * - */ - public static class Builder { - /* - * the following are the prior probabilities for the file type - * identified by each method. - */ - private float priorMagicFileType, priorExtensionFileType, - priorMetaFileType; - /* - * the following are the conditional probability for each method with - * positive conditions - */ - private float magic_trust, extension_trust, meta_trust; - - /* - * the following *_neg are the conditional probabilities with negative - * conditions - */ - private float magic_neg, extension_neg, meta_neg; - - private float threshold; - - public synchronized Builder priorMagicFileType(final float prior) { - this.priorMagicFileType = prior; - return this; - } - - public synchronized Builder priorExtensionFileType(final float prior) { - this.priorExtensionFileType = prior; - return this; - } - - public synchronized Builder priorMetaFileType(final float prior) { - this.priorMetaFileType = prior; - return this; - } - - public synchronized Builder magic_trust(final float trust) { - this.magic_trust = trust; - return this; - } - - public synchronized Builder extension_trust(final float trust) { - this.extension_trust = trust; - return this; - } - - public synchronized Builder meta_trust(final float trust) { - this.meta_trust = trust; - return this; - } - - public synchronized Builder magic_neg(final float trust) { - this.magic_neg = trust; - return this; - } - - public synchronized Builder extension_neg(final float trust) { - this.extension_neg = trust; - return this; - } - - public synchronized Builder meta_neg(final float trust) { - this.meta_neg = trust; - return this; - } - - public synchronized Builder threshold(final float threshold) { - this.threshold = threshold; - return this; - } - - /** - * Initialize the MimeTypes with this 
builder instance - */ - public ProbabilisticMimeDetectionSelector build2() { - return new ProbabilisticMimeDetectionSelector(this); - } - } - -} diff --git a/tika-core/src/main/java/org/apache/tika/parser/AutoDetectParser.java b/tika-core/src/main/java/org/apache/tika/parser/AutoDetectParser.java index 5e4cc7c..c10b132 100644 --- a/tika-core/src/main/java/org/apache/tika/parser/AutoDetectParser.java +++ b/tika-core/src/main/java/org/apache/tika/parser/AutoDetectParser.java @@ -62,6 +62,7 @@ * available parsers have their 3rd party jars included, as otherwise the * use of the default TikaConfig will throw various "ClassNotFound" exceptions. * + * @param detector Detector to use * @param parsers */ public AutoDetectParser(Parser...parsers) { @@ -113,8 +114,7 @@ metadata.set(Metadata.CONTENT_TYPE, type.toString()); // TIKA-216: Zip bomb prevention - SecureContentHandler sch = - handler != null ? new SecureContentHandler(handler, tis) : null; + SecureContentHandler sch = new SecureContentHandler(handler, tis); try { // Parse the document super.parse(tis, sch, metadata, context); diff --git a/tika-core/src/main/java/org/apache/tika/parser/CompositeParser.java b/tika-core/src/main/java/org/apache/tika/parser/CompositeParser.java index ea3968e..05d1b72 100644 --- a/tika-core/src/main/java/org/apache/tika/parser/CompositeParser.java +++ b/tika-core/src/main/java/org/apache/tika/parser/CompositeParser.java @@ -16,6 +16,16 @@ */ package org.apache.tika.parser; +import java.io.IOException; +import java.io.InputStream; +import java.util.ArrayList; +import java.util.Arrays; +import java.util.Collections; +import java.util.HashMap; +import java.util.List; +import java.util.Map; +import java.util.Set; + import org.apache.tika.exception.TikaException; import org.apache.tika.io.TemporaryResources; import org.apache.tika.io.TikaInputStream; @@ -26,17 +36,6 @@ import org.xml.sax.ContentHandler; import org.xml.sax.SAXException; -import java.io.IOException; -import java.io.InputStream; 
-import java.util.ArrayList; -import java.util.Arrays; -import java.util.Collection; -import java.util.Collections; -import java.util.HashMap; -import java.util.List; -import java.util.Map; -import java.util.Set; - /** * Composite parser that delegates parsing tasks to a component parser * based on the declared content type of the incoming document. A fallback @@ -63,22 +62,9 @@ */ private Parser fallback = new EmptyParser(); - public CompositeParser(MediaTypeRegistry registry, List parsers, - Collection> excludeParsers) { - if (excludeParsers == null || excludeParsers.isEmpty()) { - this.parsers = parsers; - } else { - this.parsers = new ArrayList(); - for (Parser p : parsers) { - if (!isExcluded(excludeParsers, p.getClass())) { - this.parsers.add(p); - } - } - } + public CompositeParser(MediaTypeRegistry registry, List parsers) { + this.parsers = parsers; this.registry = registry; - } - public CompositeParser(MediaTypeRegistry registry, List parsers) { - this(registry, parsers, null); } public CompositeParser(MediaTypeRegistry registry, Parser... parsers) { @@ -97,16 +83,6 @@ } } return map; - } - - private boolean isExcluded(Collection> excludeParsers, Class p) { - return excludeParsers.contains(p) || assignableFrom(excludeParsers, p); - } - private boolean assignableFrom(Collection> excludeParsers, Class p) { - for (Class e : excludeParsers) { - if (e.isAssignableFrom(p)) return true; - } - return false; } /** @@ -164,15 +140,6 @@ } /** - * Returns all parsers registered with the Composite Parser, - * including ones which may not currently be active. - * This won't include the Fallback Parser, if defined - */ - public List getAllComponentParsers() { - return Collections.unmodifiableList(parsers); - } - - /** * Returns the component parsers. 
* * @return component parsers, keyed by media type @@ -236,6 +203,7 @@ // We always work on the normalised, canonical form type = registry.normalize(type); } + while (type != null) { // Try finding a parser for the type Parser parser = map.get(type); @@ -265,17 +233,11 @@ InputStream stream, ContentHandler handler, Metadata metadata, ParseContext context) throws IOException, SAXException, TikaException { - Parser parser = getParser(metadata, context); + Parser parser = getParser(metadata); TemporaryResources tmp = new TemporaryResources(); try { TikaInputStream taggedStream = TikaInputStream.get(stream, tmp); - TaggedContentHandler taggedHandler = - handler != null ? new TaggedContentHandler(handler) : null; - if (parser instanceof ParserDecorator){ - metadata.add("X-Parsed-By", ((ParserDecorator) parser).getWrappedParser().getClass().getName()); - } else { - metadata.add("X-Parsed-By", parser.getClass().getName()); - } + TaggedContentHandler taggedHandler = new TaggedContentHandler(handler); try { parser.parse(taggedStream, taggedHandler, metadata, context); } catch (RuntimeException e) { @@ -286,7 +248,7 @@ throw new TikaException( "TIKA-198: Illegal IOException from " + parser, e); } catch (SAXException e) { - if (taggedHandler != null) taggedHandler.throwIfCauseOf(e); + taggedHandler.throwIfCauseOf(e); throw new TikaException( "TIKA-237: Illegal SAXException from " + parser, e); } diff --git a/tika-core/src/main/java/org/apache/tika/parser/DefaultParser.java b/tika-core/src/main/java/org/apache/tika/parser/DefaultParser.java index 769c0b3..09d844c 100644 --- a/tika-core/src/main/java/org/apache/tika/parser/DefaultParser.java +++ b/tika-core/src/main/java/org/apache/tika/parser/DefaultParser.java @@ -16,16 +16,14 @@ */ package org.apache.tika.parser; -import java.util.ArrayList; -import java.util.Collection; import java.util.Collections; +import java.util.Comparator; import java.util.List; import java.util.Map; import org.apache.tika.config.ServiceLoader; import 
org.apache.tika.mime.MediaType; import org.apache.tika.mime.MediaTypeRegistry; -import org.apache.tika.utils.ServiceLoaderUtils; /** * A composite parser based on all the {@link Parser} implementations @@ -49,21 +47,31 @@ * @return ordered list of statically loadable parsers */ private static List getDefaultParsers(ServiceLoader loader) { - List parsers = loader.loadStaticServiceProviders(Parser.class); - ServiceLoaderUtils.sortLoadedClasses(parsers); + List parsers = + loader.loadStaticServiceProviders(Parser.class); + Collections.sort(parsers, new Comparator() { + public int compare(Parser p1, Parser p2) { + String n1 = p1.getClass().getName(); + String n2 = p2.getClass().getName(); + boolean t1 = n1.startsWith("org.apache.tika."); + boolean t2 = n2.startsWith("org.apache.tika."); + if (t1 == t2) { + return n1.compareTo(n2); + } else if (t1) { + return -1; + } else { + return 1; + } + } + }); return parsers; } private transient final ServiceLoader loader; - public DefaultParser(MediaTypeRegistry registry, ServiceLoader loader, - Collection> excludeParsers) { - super(registry, getDefaultParsers(loader), excludeParsers); + public DefaultParser(MediaTypeRegistry registry, ServiceLoader loader) { + super(registry, getDefaultParsers(loader)); this.loader = loader; - } - - public DefaultParser(MediaTypeRegistry registry, ServiceLoader loader) { - this(registry, loader, null); } public DefaultParser(MediaTypeRegistry registry, ClassLoader loader) { @@ -102,13 +110,4 @@ return map; } - @Override - public List getAllComponentParsers() { - List parsers = super.getAllComponentParsers(); - if (loader != null) { - parsers = new ArrayList(parsers); - parsers.addAll(loader.loadDynamicServiceProviders(Parser.class)); - } - return parsers; - } } diff --git a/tika-core/src/main/java/org/apache/tika/parser/DigestingParser.java b/tika-core/src/main/java/org/apache/tika/parser/DigestingParser.java deleted file mode 100644 index 2115001..0000000 --- 
a/tika-core/src/main/java/org/apache/tika/parser/DigestingParser.java +++ /dev/null @@ -1,76 +0,0 @@ -/* - * Licensed to the Apache Software Foundation (ASF) under one or more - * contributor license agreements. See the NOTICE file distributed with - * this work for additional information regarding copyright ownership. - * The ASF licenses this file to You under the Apache License, Version 2.0 - * (the "License"); you may not use this file except in compliance with - * the License. You may obtain a copy of the License at - * - * http://www.apache.org/licenses/LICENSE-2.0 - * - * Unless required by applicable law or agreed to in writing, software - * distributed under the License is distributed on an "AS IS" BASIS, - * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. - * See the License for the specific language governing permissions and - * limitations under the License. - */ - -package org.apache.tika.parser; - - -import java.io.IOException; -import java.io.InputStream; - -import org.apache.tika.exception.TikaException; -import org.apache.tika.metadata.Metadata; -import org.xml.sax.ContentHandler; -import org.xml.sax.SAXException; - -public class DigestingParser extends ParserDecorator { - - /** - * Interface for optional digester, if specified during construction. - * See org.apache.parser.utils.CommonsDigester in tika-parsers for an - * implementation. - */ - public interface Digester { - /** - * Digests an InputStream and sets the appropriate value(s) in the metadata. - * The Digester is also responsible for marking and resetting the stream. - *
<p>
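The `Digester` javadoc here requires implementations to mark the stream before reading and reset it before returning. A minimal sketch of that contract, using plain `java.io` and `MessageDigest` rather than the Tika interface (the `DigestSketch` class is illustrative only):

```java
import java.io.IOException;
import java.io.InputStream;
import java.security.MessageDigest;
import java.security.NoSuchAlgorithmException;

// Sketch of the mark/reset contract described in the removed Digester javadoc:
// hash the stream's contents, then leave the stream positioned where it
// started, as a conforming Digester implementation must.
public class DigestSketch {

    static String sha256Hex(InputStream is, int maxBytes) {
        try {
            MessageDigest digest = MessageDigest.getInstance("SHA-256");
            is.mark(maxBytes);                // mark before reading any bytes
            byte[] buffer = new byte[8192];
            int n;
            while ((n = is.read(buffer)) != -1) {
                digest.update(buffer, 0, n);
            }
            is.reset();                       // reset before returning
            StringBuilder hex = new StringBuilder();
            for (byte b : digest.digest()) {
                hex.append(String.format("%02x", b));
            }
            return hex.toString();
        } catch (IOException | NoSuchAlgorithmException e) {
            throw new RuntimeException(e);
        }
    }
}
```

Calling the method twice on the same stream returns the same digest, which is exactly what the reset requirement guarantees for downstream parsers that read the stream afterwards.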
    - * The given stream is guaranteed to support the - * {@link InputStream#markSupported() mark feature} and the detector - * is expected to {@link InputStream#mark(int) mark} the stream before - * reading any bytes from it, and to {@link InputStream#reset() reset} - * the stream before returning. The stream must not be closed by the - * detector. - * - * @param is InputStream to digest - * @param m Metadata to set the values for - * @param parseContext ParseContext - * @throws IOException - */ - void digest(InputStream is, Metadata m, ParseContext parseContext) throws IOException; - - - }; - - private final Digester digester; - /** - * Creates a decorator for the given parser. - * - * @param parser the parser instance to be decorated - */ - public DigestingParser(Parser parser, Digester digester) { - super(parser); - this.digester = digester; - } - - @Override - public void parse(InputStream stream, ContentHandler handler, Metadata metadata, ParseContext context) throws IOException, SAXException, TikaException { - if (digester != null) { - digester.digest(stream, metadata, context); - } - super.parse(stream, handler, metadata, context); - } -} diff --git a/tika-core/src/main/java/org/apache/tika/parser/EmptyParser.java b/tika-core/src/main/java/org/apache/tika/parser/EmptyParser.java index 45d2eaa..3ca4188 100644 --- a/tika-core/src/main/java/org/apache/tika/parser/EmptyParser.java +++ b/tika-core/src/main/java/org/apache/tika/parser/EmptyParser.java @@ -32,6 +32,7 @@ * for unknown document types. */ public class EmptyParser extends AbstractParser { + /** * Serial version UID. 
*/ @@ -54,4 +55,5 @@ xhtml.startDocument(); xhtml.endDocument(); } + } diff --git a/tika-core/src/main/java/org/apache/tika/parser/ErrorParser.java b/tika-core/src/main/java/org/apache/tika/parser/ErrorParser.java index c115062..c8e69d1 100644 --- a/tika-core/src/main/java/org/apache/tika/parser/ErrorParser.java +++ b/tika-core/src/main/java/org/apache/tika/parser/ErrorParser.java @@ -31,8 +31,7 @@ * for unknown document types. */ public class ErrorParser extends AbstractParser { - private static final long serialVersionUID = 7727423956957641824L; - + /** * Singleton instance of this class. */ @@ -48,4 +47,5 @@ throws TikaException { throw new TikaException("Parse error"); } + } diff --git a/tika-core/src/main/java/org/apache/tika/parser/NetworkParser.java b/tika-core/src/main/java/org/apache/tika/parser/NetworkParser.java index 0130005..76bcb8f 100644 --- a/tika-core/src/main/java/org/apache/tika/parser/NetworkParser.java +++ b/tika-core/src/main/java/org/apache/tika/parser/NetworkParser.java @@ -78,7 +78,8 @@ Metadata metadata, ParseContext context) throws IOException, SAXException, TikaException { if ("telnet".equals(uri.getScheme())) { - try (Socket socket = new Socket(uri.getHost(), uri.getPort())) { + final Socket socket = new Socket(uri.getHost(), uri.getPort()); + try { new ParsingTask(stream, new FilterOutputStream(socket.getOutputStream()) { @Override public void close() throws IOException { @@ -86,16 +87,21 @@ } }).parse( socket.getInputStream(), handler, metadata, context); + } finally { + socket.close(); } } else { URL url = uri.toURL(); URLConnection connection = url.openConnection(); connection.setDoOutput(true); connection.connect(); - try (InputStream input = connection.getInputStream()) { + InputStream input = connection.getInputStream(); + try { new ParsingTask(stream, connection.getOutputStream()).parse( new CloseShieldInputStream(input), handler, metadata, context); + } finally { + input.close(); } } diff --git 
a/tika-core/src/main/java/org/apache/tika/parser/ParserDecorator.java b/tika-core/src/main/java/org/apache/tika/parser/ParserDecorator.java index f1865f3..6cc57ff 100644 --- a/tika-core/src/main/java/org/apache/tika/parser/ParserDecorator.java +++ b/tika-core/src/main/java/org/apache/tika/parser/ParserDecorator.java @@ -18,24 +18,19 @@ import java.io.IOException; import java.io.InputStream; -import java.util.Collection; -import java.util.HashSet; import java.util.Set; import org.apache.tika.exception.TikaException; -import org.apache.tika.io.TikaInputStream; import org.apache.tika.metadata.Metadata; import org.apache.tika.mime.MediaType; import org.xml.sax.ContentHandler; import org.xml.sax.SAXException; /** - * Decorator base class for the {@link Parser} interface. - * <p> - * This class simply delegates all parsing calls to an underlying decorated - * parser instance. Subclasses can provide extra decoration by overriding the + * Decorator base class for the {@link Parser} interface. This class + * simply delegates all parsing calls to an underlying decorated parser + * instance. Subclasses can provide extra decoration by overriding the * parse method. - * <p> - * To decorate several different parsers at the same time, wrap them in - * a {@link CompositeParser} instance first. */ public class ParserDecorator extends AbstractParser { @@ -58,87 +53,6 @@ public Set getSupportedTypes(ParseContext context) { return types; } - @Override - public String getDecorationName() { - return "With Types"; - } - }; - } - - /** - * Decorates the given parser so that it never claims to support - * parsing of the given media types, but will work for all others. - * - * @param parser the parser to be decorated - * @param excludeTypes excluded/ignored media types - * @return the decorated parser - */ - public static final Parser withoutTypes( - Parser parser, final Set excludeTypes) { - return new ParserDecorator(parser) { - private static final long serialVersionUID = 7979614774021768609L; - @Override - public Set getSupportedTypes(ParseContext context) { - // Get our own, writable copy of the types the parser supports - Set parserTypes = - new HashSet(super.getSupportedTypes(context)); - // Remove anything on our excludes list - parserTypes.removeAll(excludeTypes); - // Return whatever is left - return parserTypes; - } - @Override - public String getDecorationName() { - return "Without Types"; - } - }; - } - - /** - * Decorates the given parsers into a virtual parser, where they'll - * be tried in preference order until one works without error. - * TODO Is this the right name? - * TODO Is this the right place to put this? Should it be in CompositeParser? Elsewhere? - * TODO Should we reset the Metadata if we try another parser? - * TODO Should we reset the ContentHandler if we try another parser? - * TODO Should we log/report failures anywhere?
- * @deprecated Do not use until the TODOs are resolved, see TIKA-1509 - */ - public static final Parser withFallbacks( - final Collection parsers, final Set types) { - Parser parser = EmptyParser.INSTANCE; - if (!parsers.isEmpty()) parser = parsers.iterator().next(); - - return new ParserDecorator(parser) { - private static final long serialVersionUID = 1625187131782069683L; - @Override - public Set getSupportedTypes(ParseContext context) { - return types; - } - @Override - public void parse(InputStream stream, ContentHandler handler, - Metadata metadata, ParseContext context) - throws IOException, SAXException, TikaException { - // Must have a TikaInputStream, so we can re-use it if parsing fails - TikaInputStream tstream = TikaInputStream.get(stream); - tstream.getFile(); - // Try each parser in turn - for (Parser p : parsers) { - tstream.mark(-1); - try { - p.parse(tstream, handler, metadata, context); - return; - } catch (Exception e) { - // TODO How to log / record this failure? - } - // Prepare for the next parser, if present - tstream.reset(); - } - } - @Override - public String getDecorationName() { - return "With Fallback"; - } }; } @@ -177,18 +91,13 @@ parser.parse(stream, handler, metadata, context); } - /** - * @return A name/description of the decoration, or null if none available - */ - public String getDecorationName() { - return null; - } /** * Gets the parser wrapped by this ParserDecorator - * @return the parser wrapped by this ParserDecorator + * @return */ public Parser getWrappedParser() { return this.parser; } + } diff --git a/tika-core/src/main/java/org/apache/tika/parser/ParsingReader.java b/tika-core/src/main/java/org/apache/tika/parser/ParsingReader.java index 0f334e3..023a7d2 100644 --- a/tika-core/src/main/java/org/apache/tika/parser/ParsingReader.java +++ b/tika-core/src/main/java/org/apache/tika/parser/ParsingReader.java @@ -26,8 +26,6 @@ import java.io.PipedWriter; import java.io.Reader; import java.io.Writer; -import 
java.nio.file.Files; -import java.nio.file.Path; import java.util.concurrent.Executor; import org.apache.tika.metadata.Metadata; @@ -121,23 +119,11 @@ } /** - * Creates a reader for the text content of the file at the given path. - * - * @param path path - * @throws FileNotFoundException if the given file does not exist - * @throws IOException if the document can not be parsed - */ - public ParsingReader(Path path) throws IOException { - this(Files.newInputStream(path), path.getFileName().toString()); - } - - /** * Creates a reader for the text content of the given file. * * @param file file * @throws FileNotFoundException if the given file does not exist * @throws IOException if the document can not be parsed - * @see #ParsingReader(Path) */ public ParsingReader(File file) throws FileNotFoundException, IOException { this(new FileInputStream(file), file.getName()); diff --git a/tika-core/src/main/java/org/apache/tika/parser/RecursiveParserWrapper.java b/tika-core/src/main/java/org/apache/tika/parser/RecursiveParserWrapper.java deleted file mode 100644 index 1c179f3..0000000 --- a/tika-core/src/main/java/org/apache/tika/parser/RecursiveParserWrapper.java +++ /dev/null @@ -1,357 +0,0 @@ -package org.apache.tika.parser; - -/* - * Licensed to the Apache Software Foundation (ASF) under one or more - * contributor license agreements. See the NOTICE file distributed with - * this work for additional information regarding copyright ownership. - * The ASF licenses this file to You under the Apache License, Version 2.0 - * (the "License"); you may not use this file except in compliance with - * the License. You may obtain a copy of the License at - * - * http://www.apache.org/licenses/LICENSE-2.0 - * - * Unless required by applicable law or agreed to in writing, software - * distributed under the License is distributed on an "AS IS" BASIS, - * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. 
- * See the License for the specific language governing permissions and - * limitations under the License. - */ - -import java.io.IOException; -import java.io.InputStream; -import java.util.Date; -import java.util.LinkedList; -import java.util.List; -import java.util.Set; - -import org.apache.tika.exception.TikaException; -import org.apache.tika.io.FilenameUtils; -import org.apache.tika.metadata.Metadata; -import org.apache.tika.metadata.Property; -import org.apache.tika.metadata.TikaCoreProperties; -import org.apache.tika.metadata.TikaMetadataKeys; -import org.apache.tika.mime.MediaType; -import org.apache.tika.sax.ContentHandlerFactory; -import org.apache.tika.utils.ExceptionUtils; -import org.xml.sax.ContentHandler; -import org.xml.sax.SAXException; -import org.xml.sax.helpers.DefaultHandler; - -/** - * This is a helper class that wraps a parser in a recursive handler. - * It takes care of setting the embedded parser in the ParseContext - * and handling the embedded path calculations. - * <p> - * After parsing a document, call getMetadata() to retrieve a list of - * Metadata objects, one for each embedded resource. The first item - * in the list will contain the Metadata for the outer container file. - * <p> - * Content can also be extracted and stored in the {@link #TIKA_CONTENT} field - * of a Metadata object. Select the type of content to be stored - * at initialization. - * <p> - * If a WriteLimitReachedException is encountered, the wrapper will stop - * processing the current resource, and it will not process - * any of the child resources for the given resource. However, it will try to - * parse as much as it can. If a WLRE is reached in the parent document, - * no child resources will be parsed. - * <p> - * The implementation is based on Jukka's RecursiveMetadataParser - * and Nick's additions. See: - * RecursiveMetadataParser. - * <p> - * Note that this wrapper holds all data in memory and is not appropriate - * for files with content too large to be held in memory. - * <p> - * Note, too, that this wrapper is not thread safe because it stores state. - * The client must initialize a new wrapper for each thread, and the client - * is responsible for calling {@link #reset()} after each parse. - * <p> - * The unit tests for this class are in the tika-parsers module. - * <p> - */ -public class RecursiveParserWrapper implements Parser { - - /** - * Generated serial version - */ - private static final long serialVersionUID = 9086536568120690938L; - - //move this to TikaCoreProperties? - public final static Property TIKA_CONTENT = Property.internalText(TikaCoreProperties.TIKA_META_PREFIX+"content"); - public final static Property PARSE_TIME_MILLIS = Property.internalText(TikaCoreProperties.TIKA_META_PREFIX + "parse_time_millis"); - public final static Property WRITE_LIMIT_REACHED = - Property.internalBoolean(TikaCoreProperties.TIKA_META_EXCEPTION_PREFIX + "write_limit_reached"); - public final static Property EMBEDDED_RESOURCE_LIMIT_REACHED = - Property.internalBoolean(TikaCoreProperties.TIKA_META_EXCEPTION_PREFIX + "embedded_resource_limit_reached"); - - public final static Property EMBEDDED_EXCEPTION = - Property.internalText(TikaCoreProperties.TIKA_META_EXCEPTION_PREFIX + "embedded_exception"); - //move this to TikaCoreProperties? - public final static Property EMBEDDED_RESOURCE_PATH = - Property.internalText(TikaCoreProperties.TIKA_META_PREFIX+"embedded_resource_path"); - - private final Parser wrappedParser; - private final ContentHandlerFactory contentHandlerFactory; - private final List metadatas = new LinkedList<>(); - - private final boolean catchEmbeddedExceptions; - - //used in naming embedded resources that don't have a name. - private int unknownCount = 0; - private int maxEmbeddedResources = -1; - private boolean hitMaxEmbeddedResources = false; - - /** - * Initialize the wrapper with {@link #catchEmbeddedExceptions} set - * to true as default.
- * - * @param wrappedParser parser to use for the container documents and the embedded documents - * @param contentHandlerFactory factory to use to generate a new content handler for - * the container document and each embedded document - */ - public RecursiveParserWrapper(Parser wrappedParser, ContentHandlerFactory contentHandlerFactory) { - this(wrappedParser, contentHandlerFactory, true); - } - - /** - * Initialize the wrapper. - * - * @param wrappedParser parser to use for the container documents and the embedded documents - * @param contentHandlerFactory factory to use to generate a new content handler for - * the container document and each embedded document - * @param catchEmbeddedExceptions whether or not to catch the embedded exceptions. - * If set to true, the stack traces will be stored in - * the metadata object with key: {@link #EMBEDDED_EXCEPTION}. - */ - public RecursiveParserWrapper(Parser wrappedParser, - ContentHandlerFactory contentHandlerFactory, boolean catchEmbeddedExceptions) { - this.wrappedParser = wrappedParser; - this.contentHandlerFactory = contentHandlerFactory; - this.catchEmbeddedExceptions = catchEmbeddedExceptions; - } - - @Override - public Set getSupportedTypes(ParseContext context) { - return wrappedParser.getSupportedTypes(context); - } - - /** - * Acts like a regular parser except it ignores the ContentHandler - * and it automatically sets/overwrites the embedded Parser in the - * ParseContext object. - * <p> - * To retrieve the results of the parse, use {@link #getMetadata()}. - * <p> - * Make sure to call {@link #reset()} after each parse. - */ - @Override - public void parse(InputStream stream, ContentHandler ignore, - Metadata metadata, ParseContext context) throws IOException, - SAXException, TikaException { - - EmbeddedParserDecorator decorator = new EmbeddedParserDecorator("/"); - context.set(Parser.class, decorator); - ContentHandler localHandler = contentHandlerFactory.getNewContentHandler(); - long started = new Date().getTime(); - try { - wrappedParser.parse(stream, localHandler, metadata, context); - } catch (SAXException e) { - boolean wlr = isWriteLimitReached(e); - if (wlr == false) { - throw e; - } - metadata.set(WRITE_LIMIT_REACHED, "true"); - } finally { - long elapsedMillis = new Date().getTime() - started; - metadata.set(PARSE_TIME_MILLIS, Long.toString(elapsedMillis)); - addContent(localHandler, metadata); - - if (hitMaxEmbeddedResources) { - metadata.set(EMBEDDED_RESOURCE_LIMIT_REACHED, "true"); - } - metadatas.add(0, deepCopy(metadata)); - } - } - - /** - * - * The first element in the returned list represents the - * data from the outer container file. There is no guarantee - * about the ordering of the list after that. - * - * @return list of Metadata objects that were gathered during the parse - */ - public List getMetadata() { - return metadatas; - } - - /** - * Set the maximum number of embedded resources to store. - * If the max is hit during parsing, the {@link #EMBEDDED_RESOURCE_LIMIT_REACHED} - * property will be added to the container document's Metadata. - * - * <p> - * If this value is < 0 (the default), the wrapper will store all Metadata. - * - * @param max maximum number of embedded resources to store - */ - public void setMaxEmbeddedResources(int max) { - maxEmbeddedResources = max; - } - - - /** - * This clears the metadata list and resets {@link #unknownCount} and - * {@link #hitMaxEmbeddedResources} - */ - public void reset() { - metadatas.clear(); - unknownCount = 0; - hitMaxEmbeddedResources = false; - } - - /** - * Copied/modified from WriteOutContentHandler. Couldn't make that - * static, and we need to have something that will work - * with exceptions thrown from both BodyContentHandler and WriteOutContentHandler - * @param t - * @return - */ - private boolean isWriteLimitReached(Throwable t) { - if (t.getMessage() != null && - t.getMessage().indexOf("Your document contained more than") == 0) { - return true; - } else { - return t.getCause() != null && isWriteLimitReached(t.getCause()); - } - } - - //defensive copy - private Metadata deepCopy(Metadata m) { - Metadata clone = new Metadata(); - - for (String n : m.names()){ - if (!
m.isMultiValued(n)) { - clone.set(n, m.get(n)); - } else { - String[] vals = m.getValues(n); - for (int i = 0; i < vals.length; i++) { - clone.add(n, vals[i]); - } - } - } - return clone; - } - - private String getResourceName(Metadata metadata) { - String objectName = ""; - if (metadata.get(TikaMetadataKeys.RESOURCE_NAME_KEY) != null) { - objectName = metadata.get(TikaMetadataKeys.RESOURCE_NAME_KEY); - } else if (metadata.get(TikaMetadataKeys.EMBEDDED_RELATIONSHIP_ID) != null) { - objectName = metadata.get(TikaMetadataKeys.EMBEDDED_RELATIONSHIP_ID); - } else { - objectName = "embedded-" + (++unknownCount); - } - //make sure that there isn't any path info in the objectName - //some parsers can return paths, not just file names - objectName = FilenameUtils.getName(objectName); - return objectName; - } - - private void addContent(ContentHandler handler, Metadata metadata) { - - if (handler.getClass().equals(DefaultHandler.class)){ - //no-op: we can't rely on just testing for - //empty content because DefaultHandler's toString() - //returns e.g. "org.xml.sax.helpers.DefaultHandler@6c8b1edd" - } else { - String content = handler.toString(); - if (content != null && content.trim().length() > 0 ) { - metadata.add(TIKA_CONTENT, content); - } - } - - } - - - private class EmbeddedParserDecorator extends ParserDecorator { - - private static final long serialVersionUID = 207648200464263337L; - - private String location = null; - - - private EmbeddedParserDecorator(String location) { - super(wrappedParser); - this.location = location; - if (! 
this.location.endsWith("/")) { - this.location += "/"; - } - } - - @Override - public void parse(InputStream stream, ContentHandler ignore, - Metadata metadata, ParseContext context) throws IOException, - SAXException, TikaException { - //Test to see if we should avoid parsing - if (maxEmbeddedResources > -1 && - metadatas.size() >= maxEmbeddedResources) { - hitMaxEmbeddedResources = true; - return; - } - // Work out what this thing is - String objectName = getResourceName(metadata); - String objectLocation = this.location + objectName; - - metadata.add(EMBEDDED_RESOURCE_PATH, objectLocation); - - //ignore the content handler that is passed in - //and get a fresh handler - ContentHandler localHandler = contentHandlerFactory.getNewContentHandler(); - - Parser preContextParser = context.get(Parser.class); - context.set(Parser.class, new EmbeddedParserDecorator(objectLocation)); - long started = new Date().getTime(); - try { - super.parse(stream, localHandler, metadata, context); - } catch (SAXException e) { - boolean wlr = isWriteLimitReached(e); - if (wlr == true) { - metadata.add(WRITE_LIMIT_REACHED, "true"); - } else { - if (catchEmbeddedExceptions) { - String trace = ExceptionUtils.getStackTrace(e); - metadata.set(EMBEDDED_EXCEPTION, trace); - } else { - throw e; - } - } - } catch (IOException|TikaException e) { - if (catchEmbeddedExceptions) { - String trace = ExceptionUtils.getStackTrace(e); - metadata.set(EMBEDDED_EXCEPTION, trace); - } else { - throw e; - } - } finally { - context.set(Parser.class, preContextParser); - long elapsedMillis = new Date().getTime() - started; - metadata.set(PARSE_TIME_MILLIS, Long.toString(elapsedMillis)); - } - - //Because of recursion, we need - //to re-test to make sure that we limit the - //number of stored resources - if (maxEmbeddedResources > -1 && - metadatas.size() >= maxEmbeddedResources) { - hitMaxEmbeddedResources = true; - return; - } - addContent(localHandler, metadata); - metadatas.add(deepCopy(metadata)); - } - } - 
- -} diff --git a/tika-core/src/main/java/org/apache/tika/parser/external/ExternalParser.java b/tika-core/src/main/java/org/apache/tika/parser/external/ExternalParser.java index 1515de6..59b3e4c 100644 --- a/tika-core/src/main/java/org/apache/tika/parser/external/ExternalParser.java +++ b/tika-core/src/main/java/org/apache/tika/parser/external/ExternalParser.java @@ -44,8 +44,6 @@ import org.xml.sax.ContentHandler; import org.xml.sax.SAXException; -import static java.nio.charset.StandardCharsets.UTF_8; - /** * Parser that uses an external program (like catdoc or pdf2txt) to extract * text content and metadata from a given document. @@ -160,7 +158,8 @@ File output = null; // Build our command - String[] cmd = command[0].split(" "); + String[] cmd = new String[command.length]; + System.arraycopy(command, 0, cmd, 0, command.length); for(int i=0; i create(URL... urls) throws IOException, TikaException { List parsers = new ArrayList(); for(URL url : urls) { - try (InputStream stream = url.openStream()) { + InputStream stream = url.openStream(); + try { parsers.addAll( ExternalParsersConfigReader.read(stream) ); + } finally { + stream.close(); } } return parsers; diff --git a/tika-core/src/main/java/org/apache/tika/sax/BasicContentHandlerFactory.java b/tika-core/src/main/java/org/apache/tika/sax/BasicContentHandlerFactory.java deleted file mode 100644 index 810b72e..0000000 --- a/tika-core/src/main/java/org/apache/tika/sax/BasicContentHandlerFactory.java +++ /dev/null @@ -1,156 +0,0 @@ -package org.apache.tika.sax; -/* - * Licensed to the Apache Software Foundation (ASF) under one or more - * contributor license agreements. See the NOTICE file distributed with - * this work for additional information regarding copyright ownership. - * The ASF licenses this file to You under the Apache License, Version 2.0 - * (the "License"); you may not use this file except in compliance with - * the License. 
You may obtain a copy of the License at - * - * http://www.apache.org/licenses/LICENSE-2.0 - * - * Unless required by applicable law or agreed to in writing, software - * distributed under the License is distributed on an "AS IS" BASIS, - * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. - * See the License for the specific language governing permissions and - * limitations under the License. - */ - -import java.io.OutputStream; -import java.io.OutputStreamWriter; -import java.io.UnsupportedEncodingException; -import java.util.Locale; - -import org.xml.sax.ContentHandler; -import org.xml.sax.helpers.DefaultHandler; - -/** - * Basic factory for creating common types of ContentHandlers - */ -public class BasicContentHandlerFactory implements ContentHandlerFactory { - - /** - * Tries to parse string into handler type. Returns default if string is null or - * parse fails. - * <p> - * Options: xml, html, text, body, ignore (no content) - * - * @param handlerTypeName string to parse - * @param defaultType type to return if parse fails - * @return handler type - */ - public static HANDLER_TYPE parseHandlerType(String handlerTypeName, HANDLER_TYPE defaultType) { - if (handlerTypeName == null) { - return defaultType; - } - - String lcHandlerTypeName = handlerTypeName.toLowerCase(Locale.ROOT); - switch (lcHandlerTypeName) { - case "xml" : return HANDLER_TYPE.XML; - case "text" : return HANDLER_TYPE.TEXT; - case "txt" : return HANDLER_TYPE.TEXT; - case "html" : return HANDLER_TYPE.HTML; - case "body" : return HANDLER_TYPE.BODY; - case "ignore" : return HANDLER_TYPE.IGNORE; - default : return defaultType; - } - } - - /** - * Common handler types for content. - */ - public enum HANDLER_TYPE { - BODY, - IGNORE, //don't store content - TEXT, - HTML, - XML - } - - private final HANDLER_TYPE type; - private final int writeLimit; - - /** - * - * @param type basic type of handler - * @param writeLimit max number of characters to store; if < 0, the handler will store all characters - */ - public BasicContentHandlerFactory(HANDLER_TYPE type, int writeLimit) { - this.type = type; - this.writeLimit = writeLimit; - } - - @Override - public ContentHandler getNewContentHandler() { - - if (type == HANDLER_TYPE.BODY) { - return new BodyContentHandler(writeLimit); - } else if (type == HANDLER_TYPE.IGNORE) { - return new DefaultHandler(); - } - if (writeLimit > -1) { - switch(type) { - case TEXT: - return new WriteOutContentHandler(new ToTextContentHandler(), writeLimit); - case HTML: - return new WriteOutContentHandler(new ToHTMLContentHandler(), writeLimit); - case XML: - return new WriteOutContentHandler(new ToXMLContentHandler(), writeLimit); - default: - return new WriteOutContentHandler(new ToTextContentHandler(), writeLimit); - } - } else { - switch (type) { - case TEXT: - return new ToTextContentHandler(); - case HTML: - return new
ToHTMLContentHandler(); - case XML: - return new ToXMLContentHandler(); - default: - return new ToTextContentHandler(); - - } - } - } - - @Override - public ContentHandler getNewContentHandler(OutputStream os, String encoding) throws UnsupportedEncodingException { - - if (type == HANDLER_TYPE.IGNORE) { - return new DefaultHandler(); - } - - if (writeLimit > -1) { - switch(type) { - case BODY: - return new WriteOutContentHandler( - new BodyContentHandler( - new OutputStreamWriter(os, encoding)), writeLimit); - case TEXT: - return new WriteOutContentHandler(new ToTextContentHandler(os, encoding), writeLimit); - case HTML: - return new WriteOutContentHandler(new ToHTMLContentHandler(os, encoding), writeLimit); - case XML: - return new WriteOutContentHandler(new ToXMLContentHandler(os, encoding), writeLimit); - default: - return new WriteOutContentHandler(new ToTextContentHandler(os, encoding), writeLimit); - } - } else { - switch (type) { - case BODY: - return new BodyContentHandler(new OutputStreamWriter(os, encoding)); - case TEXT: - return new ToTextContentHandler(os, encoding); - case HTML: - return new ToHTMLContentHandler(os, encoding); - case XML: - return new ToXMLContentHandler(os, encoding); - default: - return new ToTextContentHandler(os, encoding); - - } - } - } - -} diff --git a/tika-core/src/main/java/org/apache/tika/sax/CleanPhoneText.java b/tika-core/src/main/java/org/apache/tika/sax/CleanPhoneText.java deleted file mode 100644 index e63fea5..0000000 --- a/tika-core/src/main/java/org/apache/tika/sax/CleanPhoneText.java +++ /dev/null @@ -1,286 +0,0 @@ -/* - * Licensed to the Apache Software Foundation (ASF) under one or more - * contributor license agreements. See the NOTICE file distributed with - * this work for additional information regarding copyright ownership. - * The ASF licenses this file to You under the Apache License, Version 2.0 - * (the "License"); you may not use this file except in compliance with - * the License. 
You may obtain a copy of the License at - * - * http://www.apache.org/licenses/LICENSE-2.0 - * - * Unless required by applicable law or agreed to in writing, software - * distributed under the License is distributed on an "AS IS" BASIS, - * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. - * See the License for the specific language governing permissions and - * limitations under the License. - */ - -package org.apache.tika.sax; - -import java.util.Locale; -import java.util.regex.Matcher; -import java.util.regex.Pattern; -import java.util.ArrayList; - -/** - * Class to help de-obfuscate phone numbers in text. - */ -public class CleanPhoneText { - // Regex to identify a phone number - static final String cleanPhoneRegex = "([2-9]\\d{2}[2-9]\\d{6})"; - - // Regex which attempts to ignore punctuation and other distractions. - static final String phoneRegex = "([{(<]{0,3}[2-9][\\W_]{0,3}\\d[\\W_]{0,3}\\d[\\W_]{0,6}[2-9][\\W_]{0,3}\\d[\\W_]{0,3}\\d[\\W_]{0,6}\\d[\\W_]{0,3}\\d[\\W_]{0,3}\\d[\\W_]{0,3}\\d)"; - - public static ArrayList extractPhoneNumbers(String text) { - text = clean(text); - int idx = 0; - Pattern p = Pattern.compile(cleanPhoneRegex); - Matcher m = p.matcher(text); - ArrayList phoneNumbers = new ArrayList(); - while (m.find(idx)) { - String digits = m.group(1); - int start = m.start(1); - int end = m.end(1); - String prefix = ""; - if (start > 0) { - prefix = text.substring(start-1, start); - } - if (digits.substring(0, 2).equals("82") && prefix.equals("*")) { - // this number overlaps with a *82 sequence - idx += 2; - } else { - // seems good - phoneNumbers.add(digits); - idx = end; - } - } - return phoneNumbers; - } - - public static String clean(String text) { - text = text.toLowerCase(Locale.ROOT); - for (String[][] group : cleanSubstitutions) { - for (String[] sub : group) { - text = text.replaceAll(sub[0], sub[1]); - } - } - // Delete all non-digits and white space. 
- text = text.replaceAll("[\\D+\\s]", ""); - return text; - } - - - public static final String[][][] cleanSubstitutions = new String[][][]{ - {{"&#\\d{1,3};", ""}}, // first simply remove numeric entities - {{"th0usand", "thousand"}, // handle common misspellings - {"th1rteen", "thirteen"}, - {"f0urteen", "fourteen"}, - {"e1ghteen", "eighteen"}, - {"n1neteen", "nineteen"}, - {"f1fteen", "fifteen"}, - {"s1xteen", "sixteen"}, - {"th1rty", "thirty"}, - {"e1ghty", "eighty"}, - {"n1nety", "ninety"}, - {"fourty", "forty"}, - {"f0urty", "forty"}, - {"e1ght", "eight"}, - {"f0rty", "forty"}, - {"f1fty", "fifty"}, - {"s1xty", "sixty"}, - {"zer0", "zero"}, - {"f0ur", "four"}, - {"f1ve", "five"}, - {"n1ne", "nine"}, - {"0ne", "one"}, - {"tw0", "two"}, - {"s1x", "six"}}, - // mixed compound numeral words - // consider 7teen, etc. - {{"twenty[\\W_]{0,3}1", "twenty-one"}, - {"twenty[\\W_]{0,3}2", "twenty-two"}, - {"twenty[\\W_]{0,3}3", "twenty-three"}, - {"twenty[\\W_]{0,3}4", "twenty-four"}, - {"twenty[\\W_]{0,3}5", "twenty-five"}, - {"twenty[\\W_]{0,3}6", "twenty-six"}, - {"twenty[\\W_]{0,3}7", "twenty-seven"}, - {"twenty[\\W_]{0,3}8", "twenty-eight"}, - {"twenty[\\W_]{0,3}9", "twenty-nine"}, - {"thirty[\\W_]{0,3}1", "thirty-one"}, - {"thirty[\\W_]{0,3}2", "thirty-two"}, - {"thirty[\\W_]{0,3}3", "thirty-three"}, - {"thirty[\\W_]{0,3}4", "thirty-four"}, - {"thirty[\\W_]{0,3}5", "thirty-five"}, - {"thirty[\\W_]{0,3}6", "thirty-six"}, - {"thirty[\\W_]{0,3}7", "thirty-seven"}, - {"thirty[\\W_]{0,3}8", "thirty-eight"}, - {"thirty[\\W_]{0,3}9", "thirty-nine"}, - {"forty[\\W_]{0,3}1", "forty-one"}, - {"forty[\\W_]{0,3}2", "forty-two"}, - {"forty[\\W_]{0,3}3", "forty-three"}, - {"forty[\\W_]{0,3}4", "forty-four"}, - {"forty[\\W_]{0,3}5", "forty-five"}, - {"forty[\\W_]{0,3}6", "forty-six"}, - {"forty[\\W_]{0,3}7", "forty-seven"}, - {"forty[\\W_]{0,3}8", "forty-eight"}, - {"forty[\\W_]{0,3}9", "forty-nine"}, - {"fifty[\\W_]{0,3}1", "fifty-one"}, - {"fifty[\\W_]{0,3}2", "fifty-two"}, - 
{"fifty[\\W_]{0,3}3", "fifty-three"}, - {"fifty[\\W_]{0,3}4", "fifty-four"}, - {"fifty[\\W_]{0,3}5", "fifty-five"}, - {"fifty[\\W_]{0,3}6", "fifty-six"}, - {"fifty[\\W_]{0,3}7", "fifty-seven"}, - {"fifty[\\W_]{0,3}8", "fifty-eight"}, - {"fifty[\\W_]{0,3}9", "fifty-nine"}, - {"sixty[\\W_]{0,3}1", "sixty-one"}, - {"sixty[\\W_]{0,3}2", "sixty-two"}, - {"sixty[\\W_]{0,3}3", "sixty-three"}, - {"sixty[\\W_]{0,3}4", "sixty-four"}, - {"sixty[\\W_]{0,3}5", "sixty-five"}, - {"sixty[\\W_]{0,3}6", "sixty-six"}, - {"sixty[\\W_]{0,3}7", "sixty-seven"}, - {"sixty[\\W_]{0,3}8", "sixty-eight"}, - {"sixty[\\W_]{0,3}9", "sixty-nine"}, - {"seventy[\\W_]{0,3}1", "seventy-one"}, - {"seventy[\\W_]{0,3}2", "seventy-two"}, - {"seventy[\\W_]{0,3}3", "seventy-three"}, - {"seventy[\\W_]{0,3}4", "seventy-four"}, - {"seventy[\\W_]{0,3}5", "seventy-five"}, - {"seventy[\\W_]{0,3}6", "seventy-six"}, - {"seventy[\\W_]{0,3}7", "seventy-seven"}, - {"seventy[\\W_]{0,3}8", "seventy-eight"}, - {"seventy[\\W_]{0,3}9", "seventy-nine"}, - {"eighty[\\W_]{0,3}1", "eighty-one"}, - {"eighty[\\W_]{0,3}2", "eighty-two"}, - {"eighty[\\W_]{0,3}3", "eighty-three"}, - {"eighty[\\W_]{0,3}4", "eighty-four"}, - {"eighty[\\W_]{0,3}5", "eighty-five"}, - {"eighty[\\W_]{0,3}6", "eighty-six"}, - {"eighty[\\W_]{0,3}7", "eighty-seven"}, - {"eighty[\\W_]{0,3}8", "eighty-eight"}, - {"eighty[\\W_]{0,3}9", "eighty-nine"}, - {"ninety[\\W_]{0,3}1", "ninety-one"}, - {"ninety[\\W_]{0,3}2", "ninety-two"}, - {"ninety[\\W_]{0,3}3", "ninety-three"}, - {"ninety[\\W_]{0,3}4", "ninety-four"}, - {"ninety[\\W_]{0,3}5", "ninety-five"}, - {"ninety[\\W_]{0,3}6", "ninety-six"}, - {"ninety[\\W_]{0,3}7", "ninety-seven"}, - {"ninety[\\W_]{0,3}8", "ninety-eight"}, - {"ninety[\\W_]{0,3}9", "ninety-nine"}}, - // now resolve compound numeral words - {{"twenty-one", "21"}, - {"twenty-two", "22"}, - {"twenty-three", "23"}, - {"twenty-four", "24"}, - {"twenty-five", "25"}, - {"twenty-six", "26"}, - {"twenty-seven", "27"}, - {"twenty-eight", "28"}, - 
{"twenty-nine", "29"}, - {"thirty-one", "31"}, - {"thirty-two", "32"}, - {"thirty-three", "33"}, - {"thirty-four", "34"}, - {"thirty-five", "35"}, - {"thirty-six", "36"}, - {"thirty-seven", "37"}, - {"thirty-eight", "38"}, - {"thirty-nine", "39"}, - {"forty-one", "41"}, - {"forty-two", "42"}, - {"forty-three", "43"}, - {"forty-four", "44"}, - {"forty-five", "45"}, - {"forty-six", "46"}, - {"forty-seven", "47"}, - {"forty-eight", "48"}, - {"forty-nine", "49"}, - {"fifty-one", "51"}, - {"fifty-two", "52"}, - {"fifty-three", "53"}, - {"fifty-four", "54"}, - {"fifty-five", "55"}, - {"fifty-six", "56"}, - {"fifty-seven", "57"}, - {"fifty-eight", "58"}, - {"fifty-nine", "59"}, - {"sixty-one", "61"}, - {"sixty-two", "62"}, - {"sixty-three", "63"}, - {"sixty-four", "64"}, - {"sixty-five", "65"}, - {"sixty-six", "66"}, - {"sixty-seven", "67"}, - {"sixty-eight", "68"}, - {"sixty-nine", "69"}, - {"seventy-one", "71"}, - {"seventy-two", "72"}, - {"seventy-three", "73"}, - {"seventy-four", "74"}, - {"seventy-five", "75"}, - {"seventy-six", "76"}, - {"seventy-seven", "77"}, - {"seventy-eight", "78"}, - {"seventy-nine", "79"}, - {"eighty-one", "81"}, - {"eighty-two", "82"}, - {"eighty-three", "83"}, - {"eighty-four", "84"}, - {"eighty-five", "85"}, - {"eighty-six", "86"}, - {"eighty-seven", "87"}, - {"eighty-eight", "88"}, - {"eighty-nine", "89"}, - {"ninety-one", "91"}, - {"ninety-two", "92"}, - {"ninety-three", "93"}, - {"ninety-four", "94"}, - {"ninety-five", "95"}, - {"ninety-six", "96"}, - {"ninety-seven", "97"}, - {"ninety-eight", "98"}, - {"ninety-nine", "99"}}, - // larger units function as suffixes now - // assume never have three hundred four, three hundred and four - {{"hundred", "00"}, - {"thousand", "000"}}, - // single numeral words now - // some would have been ambiguous - {{"seventeen", "17"}, - {"thirteen", "13"}, - {"fourteen", "14"}, - {"eighteen", "18"}, - {"nineteen", "19"}, - {"fifteen", "15"}, - {"sixteen", "16"}, - {"seventy", "70"}, - {"eleven", "11"}, - 
{"twelve", "12"}, - {"twenty", "20"}, - {"thirty", "30"}, - {"eighty", "80"}, - {"ninety", "90"}, - {"three", "3"}, - {"seven", "7"}, - {"eight", "8"}, - {"forty", "40"}, - {"fifty", "50"}, - {"sixty", "60"}, - {"zero", "0"}, - {"four", "4"}, - {"five", "5"}, - {"nine", "9"}, - {"one", "1"}, - {"two", "2"}, - {"six", "6"}, - {"ten", "10"}}, - // now do letter for digit substitutions - {{"oh", "0"}, - {"o", "0"}, - {"i", "1"}, - {"l", "1"}} - }; -} \ No newline at end of file diff --git a/tika-core/src/main/java/org/apache/tika/sax/ContentHandlerFactory.java b/tika-core/src/main/java/org/apache/tika/sax/ContentHandlerFactory.java deleted file mode 100644 index c69b980..0000000 --- a/tika-core/src/main/java/org/apache/tika/sax/ContentHandlerFactory.java +++ /dev/null @@ -1,32 +0,0 @@ -package org.apache.tika.sax; - -/* - * Licensed to the Apache Software Foundation (ASF) under one or more - * contributor license agreements. See the NOTICE file distributed with - * this work for additional information regarding copyright ownership. - * The ASF licenses this file to You under the Apache License, Version 2.0 - * (the "License"); you may not use this file except in compliance with - * the License. You may obtain a copy of the License at - * - * http://www.apache.org/licenses/LICENSE-2.0 - * - * Unless required by applicable law or agreed to in writing, software - * distributed under the License is distributed on an "AS IS" BASIS, - * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. - * See the License for the specific language governing permissions and - * limitations under the License. 
- */ - -import org.xml.sax.ContentHandler; - -import java.io.OutputStream; -import java.io.UnsupportedEncodingException; - -/** - * Interface to allow easier injection of code for getting a new ContentHandler - */ -public interface ContentHandlerFactory { - public ContentHandler getNewContentHandler(); - public ContentHandler getNewContentHandler(OutputStream os, String encoding) throws UnsupportedEncodingException; - -} diff --git a/tika-core/src/main/java/org/apache/tika/sax/DIFContentHandler.java b/tika-core/src/main/java/org/apache/tika/sax/DIFContentHandler.java deleted file mode 100644 index df29e5a..0000000 --- a/tika-core/src/main/java/org/apache/tika/sax/DIFContentHandler.java +++ /dev/null @@ -1,152 +0,0 @@ -/* - * Licensed to the Apache Software Foundation (ASF) under one or more - * contributor license agreements. See the NOTICE file distributed with - * this work for additional information regarding copyright ownership. - * The ASF licenses this file to You under the Apache License, Version 2.0 - * (the "License"); you may not use this file except in compliance with - * the License. You may obtain a copy of the License at - * - * http://www.apache.org/licenses/LICENSE-2.0 - * - * Unless required by applicable law or agreed to in writing, software - * distributed under the License is distributed on an "AS IS" BASIS, - * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. - * See the License for the specific language governing permissions and - * limitations under the License. 
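For readers skimming this hunk: the `ContentHandlerFactory` interface deleted above is small enough to exercise stand-alone. The sketch below is one plausible implementation, not Tika's own — the interface is re-declared locally since tika-core is assumed absent from the classpath, and `WriteOutHandler` is a hypothetical helper name.

```java
import java.io.ByteArrayOutputStream;
import java.io.OutputStream;
import java.io.OutputStreamWriter;
import java.io.StringWriter;
import java.io.UnsupportedEncodingException;
import java.io.Writer;

import org.xml.sax.ContentHandler;
import org.xml.sax.SAXException;
import org.xml.sax.helpers.DefaultHandler;

public class FactorySketch {

    /** Local stand-in for the deleted org.apache.tika.sax.ContentHandlerFactory. */
    interface ContentHandlerFactory {
        ContentHandler getNewContentHandler();
        ContentHandler getNewContentHandler(OutputStream os, String encoding)
                throws UnsupportedEncodingException;
    }

    /** A handler that writes character events straight to a Writer. */
    static class WriteOutHandler extends DefaultHandler {
        private final Writer writer;
        WriteOutHandler(Writer writer) { this.writer = writer; }
        @Override
        public void characters(char[] ch, int start, int len) throws SAXException {
            try {
                writer.write(ch, start, len);
                writer.flush();
            } catch (Exception e) {
                throw new SAXException(e);
            }
        }
    }

    /** One possible factory: text-only handlers, in-memory or stream-backed. */
    static final ContentHandlerFactory FACTORY = new ContentHandlerFactory() {
        @Override
        public ContentHandler getNewContentHandler() {
            return new WriteOutHandler(new StringWriter());
        }
        @Override
        public ContentHandler getNewContentHandler(OutputStream os, String encoding)
                throws UnsupportedEncodingException {
            return new WriteOutHandler(new OutputStreamWriter(os, encoding));
        }
    };

    public static void main(String[] args) throws Exception {
        ByteArrayOutputStream bos = new ByteArrayOutputStream();
        ContentHandler h = FACTORY.getNewContentHandler(bos, "UTF-8");
        char[] text = "hello".toCharArray();
        h.characters(text, 0, text.length);
        System.out.println(bos.toString("UTF-8")); // prints hello
    }
}
```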
- */ -package org.apache.tika.sax; - -import java.util.Stack; - -import org.apache.tika.metadata.Metadata; -import org.xml.sax.Attributes; -import org.xml.sax.ContentHandler; -import org.xml.sax.SAXException; -import org.xml.sax.helpers.AttributesImpl; -import org.xml.sax.helpers.DefaultHandler; - -public class DIFContentHandler extends DefaultHandler { - - private static final char[] NEWLINE = new char[] { '\n' }; - private static final char[] TABSPACE = new char[] { '\t' }; - private static final Attributes EMPTY_ATTRIBUTES = new AttributesImpl(); - - private Stack treeStack; - private Stack dataStack; - private final ContentHandler delegate; - private boolean isLeaf; - private Metadata metadata; - - public DIFContentHandler(ContentHandler delegate, Metadata metadata) { - this.delegate = delegate; - this.isLeaf = false; - this.metadata = metadata; - this.treeStack = new Stack(); - this.dataStack = new Stack(); - } - - @Override - public void setDocumentLocator(org.xml.sax.Locator locator) { - delegate.setDocumentLocator(locator); - } - - @Override - public void characters(char[] ch, int start, int length) - throws SAXException { - String value = (new String(ch, start, length)).toString(); - this.dataStack.push(value); - - if (this.treeStack.peek().equals("Entry_Title")) { - this.delegate.characters(NEWLINE, 0, NEWLINE.length); - this.delegate.characters(TABSPACE, 0, TABSPACE.length); - this.delegate.startElement("", "h3", "h3", EMPTY_ATTRIBUTES); - String title = "Title: "; - title = title + value; - this.delegate.characters(title.toCharArray(), 0, title.length()); - this.delegate.endElement("", "h3", "h3"); - } - if (this.treeStack.peek().equals("Southernmost_Latitude") - || this.treeStack.peek().equals("Northernmost_Latitude") - || this.treeStack.peek().equals("Westernmost_Longitude") - || this.treeStack.peek().equals("Easternmost_Longitude")) { - this.delegate.characters(NEWLINE, 0, NEWLINE.length); - this.delegate.characters(TABSPACE, 0, TABSPACE.length); - 
this.delegate.characters(TABSPACE, 0, TABSPACE.length); - this.delegate.startElement("", "tr", "tr", EMPTY_ATTRIBUTES); - this.delegate.startElement("", "td", "td", EMPTY_ATTRIBUTES); - String key = this.treeStack.peek() + " : "; - this.delegate.characters(key.toCharArray(), 0, key.length()); - this.delegate.endElement("", "td", "td"); - this.delegate.startElement("", "td", "td", EMPTY_ATTRIBUTES); - this.delegate.characters(value.toCharArray(), 0, value.length()); - this.delegate.endElement("", "td", "td"); - this.delegate.endElement("", "tr", "tr"); - } - } - - @Override - public void ignorableWhitespace(char[] ch, int start, int length) - throws SAXException { - delegate.ignorableWhitespace(ch, start, length); - } - - @Override - public void startElement(String uri, String localName, String qName, - Attributes attributes) throws SAXException { - this.isLeaf = true; - if (localName.equals("Spatial_Coverage")) { - this.delegate.characters(NEWLINE, 0, NEWLINE.length); - this.delegate.characters(TABSPACE, 0, TABSPACE.length); - this.delegate.startElement("", "h3", "h3", EMPTY_ATTRIBUTES); - String value = "Geographic Data: "; - this.delegate.characters(value.toCharArray(), 0, value.length()); - this.delegate.endElement("", "h3", "h3"); - this.delegate.characters(NEWLINE, 0, NEWLINE.length); - this.delegate.characters(TABSPACE, 0, TABSPACE.length); - this.delegate.startElement("", "table", "table", EMPTY_ATTRIBUTES); - } - this.treeStack.push(localName); - } - - @Override - public void endElement(String uri, String localName, String qName) - throws SAXException { - if (localName.equals("Spatial_Coverage")) { - this.delegate.characters(NEWLINE, 0, NEWLINE.length); - this.delegate.characters(TABSPACE, 0, TABSPACE.length); - this.delegate.endElement("", "table", "table"); - } - if (this.isLeaf) { - Stack tempStack = (Stack) this.treeStack.clone(); - String key = ""; - while (!tempStack.isEmpty()) { - if (key.length() == 0) { - key = tempStack.pop(); - } else { - key = 
tempStack.pop() + "-" + key; - } - } - String value = this.dataStack.peek(); - this.metadata.add(key, value); - this.isLeaf = false; - } - this.treeStack.pop(); - this.dataStack.pop(); - } - - @Override - public void startDocument() throws SAXException { - delegate.startDocument(); - } - - @Override - public void endDocument() throws SAXException { - delegate.endDocument(); - } - - @Override - public String toString() { - return delegate.toString(); - } - -} diff --git a/tika-core/src/main/java/org/apache/tika/sax/PhoneExtractingContentHandler.java b/tika-core/src/main/java/org/apache/tika/sax/PhoneExtractingContentHandler.java deleted file mode 100644 index 3bb5fb2..0000000 --- a/tika-core/src/main/java/org/apache/tika/sax/PhoneExtractingContentHandler.java +++ /dev/null @@ -1,111 +0,0 @@ -/* - * Licensed to the Apache Software Foundation (ASF) under one or more - * contributor license agreements. See the NOTICE file distributed with - * this work for additional information regarding copyright ownership. - * The ASF licenses this file to You under the Apache License, Version 2.0 - * (the "License"); you may not use this file except in compliance with - * the License. You may obtain a copy of the License at - * - * http://www.apache.org/licenses/LICENSE-2.0 - * - * Unless required by applicable law or agreed to in writing, software - * distributed under the License is distributed on an "AS IS" BASIS, - * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. - * See the License for the specific language governing permissions and - * limitations under the License. 
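The key-building trick in the deleted `DIFContentHandler.endElement` above — join the element stack bottom-to-top with `-` to form a metadata key for each leaf — can be demonstrated without Tika. This is a simplified re-implementation, assuming a plain `Map` in place of Tika's `Metadata` and using an `ArrayDeque` instead of the legacy `Stack`:

```java
import java.io.ByteArrayInputStream;
import java.nio.charset.StandardCharsets;
import java.util.ArrayDeque;
import java.util.Deque;
import java.util.Iterator;
import java.util.LinkedHashMap;
import java.util.Map;

import javax.xml.parsers.SAXParserFactory;

import org.xml.sax.Attributes;
import org.xml.sax.helpers.DefaultHandler;

public class DifKeySketch {

    /** Collects "Ancestor-Child" keys for leaf elements, the way the deleted
     *  DIFContentHandler builds metadata keys from its element stack. */
    static class KeyHandler extends DefaultHandler {
        final Deque<String> path = new ArrayDeque<>();
        final Map<String, String> metadata = new LinkedHashMap<>(); // Metadata stand-in
        private final StringBuilder text = new StringBuilder();
        private boolean isLeaf;

        @Override
        public void startElement(String uri, String local, String q, Attributes a) {
            path.push(local);
            isLeaf = true; // stays true only if no child element follows
            text.setLength(0);
        }

        @Override
        public void characters(char[] ch, int start, int len) {
            text.append(ch, start, len);
        }

        @Override
        public void endElement(String uri, String local, String q) {
            if (isLeaf) {
                // Join the ancestor stack bottom-to-top with '-'
                StringBuilder key = new StringBuilder();
                for (Iterator<String> it = path.descendingIterator(); it.hasNext(); ) {
                    if (key.length() > 0) key.append('-');
                    key.append(it.next());
                }
                metadata.put(key.toString(), text.toString().trim());
            }
            path.pop();
            isLeaf = false;
        }
    }

    static Map<String, String> parse(String xml) throws Exception {
        KeyHandler h = new KeyHandler();
        SAXParserFactory.newInstance().newSAXParser()
                .parse(new ByteArrayInputStream(xml.getBytes(StandardCharsets.UTF_8)), h);
        return h.metadata;
    }

    public static void main(String[] args) throws Exception {
        Map<String, String> m = parse("<DIF><Entry_Title>Ocean Data</Entry_Title></DIF>");
        System.out.println(m); // {DIF-Entry_Title=Ocean Data}
    }
}
```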
- */ - -package org.apache.tika.sax; - -import org.apache.tika.metadata.Metadata; -import org.apache.tika.sax.CleanPhoneText; -import org.apache.tika.sax.ContentHandlerDecorator; -import org.xml.sax.ContentHandler; -import org.xml.sax.SAXException; -import org.xml.sax.helpers.DefaultHandler; - -import java.util.Arrays; -import java.util.List; - -/** - * Class used to extract phone numbers while parsing. - * - * Every time a document is parsed in Tika, the content is split into SAX events. - * Those SAX events are handled by a ContentHandler. You can think of these events - * as marking a tag in an HTML file. Once you're finished parsing, you can call - * handler.toString(), for example, to get the text contents of the file. On the other - * hand, any of the metadata of the file will be added to the Metadata object passed - * in during the parse() call. So, the Parser class sends metadata to the Metadata - * object and content to the ContentHandler. - * - * This class is an example of how to combine a ContentHandler and a Metadata. - * As content is passed to the handler, we first check to see if it matches a - * textual pattern for a phone number. If the extracted content is a phone number, - * we add it to the metadata under the key "phonenumbers". So, if you used this - * ContentHandler when you parsed a document, then called - * metadata.getValues("phonenumbers"), you would get an array of Strings of phone - * numbers found in the document. - * - * Please see the PhoneExtractingContentHandlerTest for an example of how to use - * this class. - * - */ -public class PhoneExtractingContentHandler extends ContentHandlerDecorator { - private Metadata metadata; - private static final String PHONE_NUMBERS = "phonenumbers"; - private StringBuilder stringBuilder; - - /** - * Creates a decorator for the given SAX event handler and Metadata object. 
- * - * @param handler SAX event handler to be decorated - */ - public PhoneExtractingContentHandler(ContentHandler handler, Metadata metadata) { - super(handler); - this.metadata = metadata; - this.stringBuilder = new StringBuilder(); - } - - /** - * Creates a decorator that by default forwards incoming SAX events to - * a dummy content handler that simply ignores all the events. Subclasses - * should use the {@link #setContentHandler(ContentHandler)} method to - * switch to a more usable underlying content handler. - * Also creates a dummy Metadata object to store phone numbers in. - */ - protected PhoneExtractingContentHandler() { - this(new DefaultHandler(), new Metadata()); - } - - /** - * The characters method is called whenever a Parser wants to pass raw... - * characters to the ContentHandler. But, sometimes, phone numbers are split - * accross different calls to characters, depending on the specific Parser - * used. So, we simply add all characters to a StringBuilder and analyze it - * once the document is finished. - */ - @Override - public void characters(char[] ch, int start, int length) throws SAXException { - try { - String text = new String(Arrays.copyOfRange(ch, start, start + length)); - stringBuilder.append(text); - super.characters(ch, start, length); - } catch (SAXException e) { - handleException(e); - } - } - - - /** - * This method is called whenever the Parser is done parsing the file. So, - * we check the output for any phone numbers. 
- * - * @param handler SAX event handler to be decorated - */ - public PhoneExtractingContentHandler(ContentHandler handler, Metadata metadata) { - super(handler); - this.metadata = metadata; - this.stringBuilder = new StringBuilder(); - } - - /** - * Creates a decorator that by default forwards incoming SAX events to - * a dummy content handler that simply ignores all the events. Subclasses - * should use the {@link #setContentHandler(ContentHandler)} method to - * switch to a more usable underlying content handler. - * Also creates a dummy Metadata object to store phone numbers in. - */ - protected PhoneExtractingContentHandler() { - this(new DefaultHandler(), new Metadata()); - } - - /** - * The characters method is called whenever a Parser wants to pass raw - * characters to the ContentHandler. But, sometimes, phone numbers are split - * across different calls to characters, depending on the specific Parser - * used. So, we simply add all characters to a StringBuilder and analyze it - * once the document is finished. - */ - @Override - public void characters(char[] ch, int start, int length) throws SAXException { - try { - String text = new String(Arrays.copyOfRange(ch, start, start + length)); - stringBuilder.append(text); - super.characters(ch, start, length); - } catch (SAXException e) { - handleException(e); - } - } - - - /** - * This method is called whenever the Parser is done parsing the file. So, - * we check the output for any phone numbers. 
- */ - @Override - public void endDocument() throws SAXException { - super.endDocument(); - List numbers = CleanPhoneText.extractPhoneNumbers(stringBuilder.toString()); - for (String number : numbers) { - metadata.add(PHONE_NUMBERS, number); - } - } -} diff --git a/tika-core/src/main/java/org/apache/tika/sax/ToTextContentHandler.java b/tika-core/src/main/java/org/apache/tika/sax/ToTextContentHandler.java index 4fdeaf3..046f02b 100755 --- a/tika-core/src/main/java/org/apache/tika/sax/ToTextContentHandler.java +++ b/tika-core/src/main/java/org/apache/tika/sax/ToTextContentHandler.java @@ -22,7 +22,6 @@ import java.io.StringWriter; import java.io.UnsupportedEncodingException; import java.io.Writer; -import java.nio.charset.Charset; import org.xml.sax.SAXException; import org.xml.sax.helpers.DefaultHandler; @@ -58,7 +57,7 @@ * @param stream output stream */ public ToTextContentHandler(OutputStream stream) { - this(new OutputStreamWriter(stream, Charset.defaultCharset())); + this(new OutputStreamWriter(stream)); } /** diff --git a/tika-core/src/main/java/org/apache/tika/sax/WriteOutContentHandler.java b/tika-core/src/main/java/org/apache/tika/sax/WriteOutContentHandler.java index 90b98ac..684f135 100644 --- a/tika-core/src/main/java/org/apache/tika/sax/WriteOutContentHandler.java +++ b/tika-core/src/main/java/org/apache/tika/sax/WriteOutContentHandler.java @@ -21,7 +21,6 @@ import java.io.Serializable; import java.io.StringWriter; import java.io.Writer; -import java.nio.charset.Charset; import java.util.UUID; import org.xml.sax.ContentHandler; @@ -91,7 +90,7 @@ * @param stream output stream */ public WriteOutContentHandler(OutputStream stream) { - this(new OutputStreamWriter(stream, Charset.defaultCharset())); + this(new OutputStreamWriter(stream)); } /** diff --git a/tika-core/src/main/java/org/apache/tika/sax/XHTMLContentHandler.java b/tika-core/src/main/java/org/apache/tika/sax/XHTMLContentHandler.java index 79c948c..00fb928 100644 --- 
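The buffer-then-scan pattern of the deleted `PhoneExtractingContentHandler` — accumulate every `characters()` chunk and extract numbers only at `endDocument()`, so numbers split across SAX events still match — can be sketched in plain Java. The regex here is a hypothetical, much simpler stand-in for Tika's `CleanPhoneText`, matching only 10-digit North-American-style numbers:

```java
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

public class PhoneSketch {
    // Hypothetical simplification of CleanPhoneText: 10 digits, optionally
    // grouped 3-3-4 with '-', '.', or spaces.
    private static final Pattern PHONE =
            Pattern.compile("\\b\\d{3}[-. ]?\\d{3}[-. ]?\\d{4}\\b");

    /** Mimics endDocument(): scan the accumulated buffer once, at the end. */
    static List<String> extractPhoneNumbers(CharSequence buffered) {
        List<String> numbers = new ArrayList<>();
        Matcher m = PHONE.matcher(buffered);
        while (m.find()) {
            numbers.add(m.group());
        }
        return numbers;
    }

    public static void main(String[] args) {
        // Two characters() calls split a number across SAX events;
        // buffering first makes the match work anyway.
        StringBuilder buffer = new StringBuilder();
        buffer.append("Call 555-867");   // first SAX characters() chunk
        buffer.append("-5309 today");    // second chunk
        System.out.println(extractPhoneNumbers(buffer)); // [555-867-5309]
    }
}
```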
a/tika-core/src/main/java/org/apache/tika/sax/XHTMLContentHandler.java +++ b/tika-core/src/main/java/org/apache/tika/sax/XHTMLContentHandler.java @@ -60,7 +60,7 @@ * skip them if they get sent to startElement/endElement by mistake. */ private static final Set AUTO = - unmodifiableSet("html", "head", "frameset"); + unmodifiableSet("html", "head", "body", "frameset"); /** * The elements that get prepended with the {@link #TAB} character. diff --git a/tika-core/src/main/java/org/apache/tika/utils/ConcurrentUtils.java b/tika-core/src/main/java/org/apache/tika/utils/ConcurrentUtils.java deleted file mode 100644 index 5f4cd13..0000000 --- a/tika-core/src/main/java/org/apache/tika/utils/ConcurrentUtils.java +++ /dev/null @@ -1,57 +0,0 @@ -/** - * Licensed to the Apache Software Foundation (ASF) under one or more - * contributor license agreements. See the NOTICE file distributed with - * this work for additional information regarding copyright ownership. - * The ASF licenses this file to You under the Apache License, Version 2.0 - * (the "License"); you may not use this file except in compliance with - * the License. You may obtain a copy of the License at - * - * http://www.apache.org/licenses/LICENSE-2.0 - * - * Unless required by applicable law or agreed to in writing, software - * distributed under the License is distributed on an "AS IS" BASIS, - * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. - * See the License for the specific language governing permissions and - * limitations under the License. - */ -package org.apache.tika.utils; - -import java.util.concurrent.ExecutorService; -import java.util.concurrent.Future; -import java.util.concurrent.FutureTask; - -import org.apache.tika.parser.ParseContext; - -/** - * Utility Class for Concurrency in Tika - * - * @since Apache Tika 1.11 - */ -public class ConcurrentUtils { - - /** - * - * Execute a runnable using an ExecutorService from the ParseContext if possible. 
- * Otherwise fallback to individual threads. - * - * @param context - * @param runnable - * @return - */ - public static Future execute(ParseContext context, Runnable runnable) { - - Future future = null; - ExecutorService executorService = context.get(ExecutorService.class); - if(executorService == null) { - FutureTask task = new FutureTask<>(runnable, null); - Thread thread = new Thread(task, "Tika Thread"); - thread.start(); - future = task; - } - else { - future = executorService.submit(runnable); - } - - return future; - } -} diff --git a/tika-core/src/main/java/org/apache/tika/utils/DateUtils.java b/tika-core/src/main/java/org/apache/tika/utils/DateUtils.java index 6b764c8..6126722 100644 --- a/tika-core/src/main/java/org/apache/tika/utils/DateUtils.java +++ b/tika-core/src/main/java/org/apache/tika/utils/DateUtils.java @@ -50,46 +50,12 @@ * * @see TIKA-495 * @param date given date - * @return ISO 8601 date string, including timezone details + * @return ISO 8601 date string */ public static String formatDate(Date date) { Calendar calendar = GregorianCalendar.getInstance(UTC, Locale.US); calendar.setTime(date); - return doFormatDate(calendar); - } - /** - * Returns a ISO 8601 representation of the given date. This method - * is thread safe and non-blocking. - * - * @see TIKA-495 - * @param date given date - * @return ISO 8601 date string, including timezone details - */ - public static String formatDate(Calendar date) { - // Explicitly switch it into UTC before formatting - date.setTimeZone(UTC); - return doFormatDate(date); - } - /** - * Returns a ISO 8601 representation of the given date, which is - * in an unknown timezone. This method is thread safe and non-blocking. 
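The fallback logic of the deleted `ConcurrentUtils.execute()` above — submit to the context's `ExecutorService` when one is registered, otherwise wrap the `Runnable` in a `FutureTask` on a raw thread — runs fine outside Tika. The `Context` class below is a minimal, hypothetical stand-in for Tika's `ParseContext` (a type-keyed map):

```java
import java.util.HashMap;
import java.util.Map;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.Future;
import java.util.concurrent.FutureTask;

public class ExecuteSketch {

    /** Minimal stand-in for Tika's ParseContext: a type-keyed map. */
    static class Context {
        private final Map<Class<?>, Object> map = new HashMap<>();
        <T> void set(Class<T> type, T value) { map.put(type, value); }
        <T> T get(Class<T> type) { return type.cast(map.get(type)); }
    }

    /** Same fallback logic as the deleted ConcurrentUtils.execute():
     *  use the context's ExecutorService when present, else a raw Thread. */
    static Future<?> execute(Context context, Runnable runnable) {
        ExecutorService executorService = context.get(ExecutorService.class);
        if (executorService == null) {
            FutureTask<Void> task = new FutureTask<>(runnable, null);
            new Thread(task, "Tika Thread").start();
            return task;
        }
        return executorService.submit(runnable);
    }

    public static void main(String[] args) throws Exception {
        Context ctx = new Context();
        StringBuilder sb = new StringBuilder();
        execute(ctx, () -> sb.append("ran without executor")).get(); // Thread path
        ExecutorService pool = Executors.newSingleThreadExecutor();
        ctx.set(ExecutorService.class, pool);
        execute(ctx, () -> sb.append(", ran in pool")).get();        // pool path
        pool.shutdown();
        System.out.println(sb); // ran without executor, ran in pool
    }
}
```

Calling `get()` on the returned `Future` gives the caller a happens-before edge, so the `StringBuilder` mutations above are safely visible after each call.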
- * - * @see TIKA-495 - * @param date given date - * @return ISO 8601 date string, without timezone details - */ - public static String formatDateUnknownTimezone(Date date) { - // Create the Calendar object in the system timezone - Calendar calendar = GregorianCalendar.getInstance(TimeZone.getDefault(), Locale.US); - calendar.setTime(date); - // Have it formatted - String formatted = formatDate(calendar); - // Strip the timezone details before returning - return formatted.substring(0, formatted.length()-1); - } - private static String doFormatDate(Calendar calendar) { return String.format( - Locale.ROOT, "%04d-%02d-%02dT%02d:%02d:%02dZ", calendar.get(Calendar.YEAR), calendar.get(Calendar.MONTH) + 1, diff --git a/tika-core/src/main/java/org/apache/tika/utils/ExceptionUtils.java b/tika-core/src/main/java/org/apache/tika/utils/ExceptionUtils.java deleted file mode 100644 index 190d977..0000000 --- a/tika-core/src/main/java/org/apache/tika/utils/ExceptionUtils.java +++ /dev/null @@ -1,90 +0,0 @@ -/** - * Licensed to the Apache Software Foundation (ASF) under one or more - * contributor license agreements. See the NOTICE file distributed with - * this work for additional information regarding copyright ownership. - * The ASF licenses this file to You under the Apache License, Version 2.0 - * (the "License"); you may not use this file except in compliance with - * the License. You may obtain a copy of the License at - * - * http://www.apache.org/licenses/LICENSE-2.0 - * - * Unless required by applicable law or agreed to in writing, software - * distributed under the License is distributed on an "AS IS" BASIS, - * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. - * See the License for the specific language governing permissions and - * limitations under the License. 
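The `formatDate(Date)` method that survives this hunk formats via a UTC calendar and `String.format`; that exact shape is reproduced below as a self-contained class (the class name is ours, not Tika's):

```java
import java.util.Calendar;
import java.util.Date;
import java.util.GregorianCalendar;
import java.util.Locale;
import java.util.TimeZone;

public class DateSketch {
    private static final TimeZone UTC = TimeZone.getTimeZone("UTC");

    /** Same shape as the surviving DateUtils.formatDate(Date): convert to a
     *  UTC calendar, then print the fields in ISO 8601 form with a 'Z'. */
    static String formatDate(Date date) {
        Calendar calendar = GregorianCalendar.getInstance(UTC, Locale.US);
        calendar.setTime(date);
        return String.format(Locale.ROOT, "%04d-%02d-%02dT%02d:%02d:%02dZ",
                calendar.get(Calendar.YEAR),
                calendar.get(Calendar.MONTH) + 1, // Calendar.MONTH is 0-based
                calendar.get(Calendar.DAY_OF_MONTH),
                calendar.get(Calendar.HOUR_OF_DAY),
                calendar.get(Calendar.MINUTE),
                calendar.get(Calendar.SECOND));
    }

    public static void main(String[] args) {
        System.out.println(formatDate(new Date(0L))); // 1970-01-01T00:00:00Z
    }
}
```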
- */ -package org.apache.tika.utils; - - -import java.io.IOException; -import java.io.PrintWriter; -import java.io.StringWriter; -import java.io.Writer; -import java.util.regex.Matcher; -import java.util.regex.Pattern; - -import org.apache.tika.exception.TikaException; - -public class ExceptionUtils { - - private final static Pattern MSG_PATTERN = Pattern.compile(":[^\r\n]+"); - - /** - * Simple util to get stack trace. - *
<p/> - * This will unwrap a TikaException and return the cause if not null - * <p/>
    - * NOTE: If your stacktraces are truncated, make sure to start your jvm - * with: -XX:-OmitStackTraceInFastThrow - * - * @param t throwable - * @return - * @throws IOException - */ - public static String getFilteredStackTrace(Throwable t) { - Throwable cause = t; - if ((t instanceof TikaException) && - t.getCause() != null) { - cause = t.getCause(); - } - return getStackTrace(cause); - } - - /** - * Get the full stacktrace as a string - * @param t - * @return - */ - public static String getStackTrace(Throwable t) { - Writer result = new StringWriter(); - PrintWriter writer = new PrintWriter(result); - t.printStackTrace(writer); - try { - writer.flush(); - result.flush(); - writer.close(); - result.close(); - } catch (IOException e) { - //swallow - } - return result.toString(); - } - - /** - * Utility method to trim the message from a stack trace - * string. - *
<p/>
    - * E.g. java.lang.IllegalStateException: Potential loop detected - * will be trimmed to java.lang.IllegalStateException - * @param trace string view of stack trace - * @return trimmed stack trace - */ - public static String trimMessage(String trace) { - Matcher msgMatcher = MSG_PATTERN.matcher(trace); - if (msgMatcher.find()) { - return msgMatcher.replaceFirst(""); - } - return trace; - } -} diff --git a/tika-core/src/main/java/org/apache/tika/utils/ServiceLoaderUtils.java b/tika-core/src/main/java/org/apache/tika/utils/ServiceLoaderUtils.java deleted file mode 100644 index 5ee1fe8..0000000 --- a/tika-core/src/main/java/org/apache/tika/utils/ServiceLoaderUtils.java +++ /dev/null @@ -1,48 +0,0 @@ -/** - * Licensed to the Apache Software Foundation (ASF) under one or more - * contributor license agreements. See the NOTICE file distributed with - * this work for additional information regarding copyright ownership. - * The ASF licenses this file to You under the Apache License, Version 2.0 - * (the "License"); you may not use this file except in compliance with - * the License. You may obtain a copy of the License at - * - * http://www.apache.org/licenses/LICENSE-2.0 - * - * Unless required by applicable law or agreed to in writing, software - * distributed under the License is distributed on an "AS IS" BASIS, - * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. - * See the License for the specific language governing permissions and - * limitations under the License. 
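The `trimMessage` utility deleted above is a one-regex operation and easy to verify in isolation; this sketch reuses the same `:[^\r\n]+` pattern, which strips everything from the first `:` to the end of that line:

```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;

public class TrimSketch {
    // Same pattern as the deleted ExceptionUtils: everything from the first
    // ':' to the end of that line is treated as the exception message.
    private static final Pattern MSG_PATTERN = Pattern.compile(":[^\r\n]+");

    /** Trims the message from a stack-trace string, as in the deleted
     *  ExceptionUtils.trimMessage(). */
    static String trimMessage(String trace) {
        Matcher m = MSG_PATTERN.matcher(trace);
        return m.find() ? m.replaceFirst("") : trace;
    }

    public static void main(String[] args) {
        String trace = "java.lang.IllegalStateException: Potential loop detected\n"
                     + "\tat org.example.Foo.bar(Foo.java:42)";
        // Only the first match is replaced, so the ":42)" on the second
        // line of the trace is left alone.
        System.out.println(trimMessage(trace));
    }
}
```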
- */ -package org.apache.tika.utils; - -import java.util.Collections; -import java.util.Comparator; -import java.util.List; - -/** - * Service Loading and Ordering related utils - */ -public class ServiceLoaderUtils { - /** - * Sorts a list of loaded classes, so that non-Tika ones come - * before Tika ones, and otherwise in reverse alphabetical order - */ - public static void sortLoadedClasses(List loaded) { - Collections.sort(loaded, new Comparator() { - public int compare(T c1, T c2) { - String n1 = c1.getClass().getName(); - String n2 = c2.getClass().getName(); - boolean t1 = n1.startsWith("org.apache.tika."); - boolean t2 = n2.startsWith("org.apache.tika."); - if (t1 == t2) { - return n1.compareTo(n2); - } else if (t1) { - return -1; - } else { - return 1; - } - } - }); - } -} diff --git a/tika-core/src/main/resources/org/apache/tika/detect/tika-example.nnmodel b/tika-core/src/main/resources/org/apache/tika/detect/tika-example.nnmodel deleted file mode 100644 index 2c4a065..0000000 --- a/tika-core/src/main/resources/org/apache/tika/detect/tika-example.nnmodel +++ /dev/null @@ -1,2 +0,0 @@ -#nn application/x-grib 256 2 1 0.0208829625450076 --0.05145498 5.127922 -3.078527 -3.988763 -0.2107816 0.8629464 -0.0002683103 3.268235 -0.1322597 2.256234 0.1585416 2.448373 -0.181859 -2.230397 -0.1796804 -3.778733 0.04216435 -5.247452 -0.07137667 4.015641 0.3369519 -2.232862 0.9872297 3.434345 0.01537457 -2.99547 0.05939847 -0.4262142 0.6431364 4.015367 -0.1052946 -0.5597504 -0.02748792 1.406441 -0.00300846 1.756262 0.09294704 -1.096173 -0.006158207 -1.221678 -0.02583992 -3.162681 0.08686034 1.269717 -0.01359701 -4.19245 -0.1400172 -3.274052 -0.0260648 -2.134415 0.02505031 -0.7995273 -0.01464903 -5.054871 0.06927952 -1.541063 -0.1199119 -1.126015 0.1847894 0.3805432 0.1132536 -1.962234 0.1468857 -0.5322409 -0.1450888 -0.1371484 3.097646 7.200917 0.17761 -0.5256572 0.8017177 0.8039271 0.2925865 3.100624 0.1743965 4.369426 0.1317674 -0.833718 0.220852 -4.685311 0.220108 
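One editorial note on the `ServiceLoaderUtils` deletion above: its javadoc ("non-Tika ones come before Tika ones, and otherwise in reverse alphabetical order") and its comparator body (which returns -1 for `org.apache.tika.`-prefixed names, putting them first, and compares ascending otherwise) disagree. The sketch below follows the comparator as written, demonstrated on plain `String` class names rather than loaded classes:

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.Comparator;
import java.util.List;

public class SortSketch {

    /** Same ordering rule as the deleted sortLoadedClasses comparator:
     *  Tika-prefixed names first, then the rest, ascending alphabetical
     *  within each group. */
    static void sortClassNames(List<String> names) {
        names.sort(new Comparator<String>() {
            @Override
            public int compare(String n1, String n2) {
                boolean t1 = n1.startsWith("org.apache.tika.");
                boolean t2 = n2.startsWith("org.apache.tika.");
                if (t1 == t2) {
                    return n1.compareTo(n2); // same group: alphabetical
                }
                return t1 ? -1 : 1;          // Tika names sort to the front
            }
        });
    }

    public static void main(String[] args) {
        List<String> names = new ArrayList<>(Arrays.asList(
                "com.example.ZParser",
                "org.apache.tika.parser.AParser",
                "com.example.AParser"));
        sortClassNames(names);
        System.out.println(names);
        // [org.apache.tika.parser.AParser, com.example.AParser, com.example.ZParser]
    }
}
```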
1.81378 0.283003 2.38208 0.3301778 -3.338501 0.2043622 2.858757 0.1801032 -2.616795 0.5628204 1.599731 0.6208351 1.161804 0.8223767 -2.614317 0.9187943 -2.446297 1.350831 2.777917 0.8146963 -0.9438954 0.8172006 -0.3445128 0.6723926 0.9863181 0.5705485 2.406452 0.6185437 -1.202793 0.6450134 1.808502 0.4259292 -7.941289 0.6099652 3.878562 0.4132371 0.98963 0.3298724 -2.366126 0.3319441 -0.8838808 0.6854118 3.12115 0.525016 -1.727541 0.7185858 1.830994 0.03694244 -3.686851 -0.09525806 3.558239 0.4775859 1.931869 0.293615 -3.693917 0.4294216 -5.695489 0.3452209 0.6723521 0.4919632 -0.7649217 0.3818015 2.153055 0.3088684 -0.655546 0.2931553 -1.343124 0.4890785 -0.7017448 0.1028709 -2.578852 0.1097999 -0.3793636 0.3154411 -2.476895 0.3301697 -1.578032 0.4468783 0.9338115 0.3876319 0.1819231 0.3975481 2.377385 0.07344178 -0.5816987 0.4968894 -3.409108 0.4824005 -0.2253791 0.538568 0.3602099 0.1838869 0.9865417 0.2429371 -0.2590284 0.1997649 -2.825891 0.1690305 -2.876223 0.1508797 -2.025365 0.0745675 1.403832 0.1730641 -3.619998 0.02000108 -1.345008 0.1982651 -4.596162 0.08308789 -0.08156122 0.2391607 0.06438381 -0.01053187 -3.724878 1.342193 0.8912752 0.6741032 -2.038227 1.02059 1.314822 1.046652 -0.09340967 1.746544 1.748925 0.7121063 2.712003 0.7263532 -0.4646987 0.8260612 -1.657031 1.326838 2.972081 0.3091417 -0.0007440437 0.4395537 0.2231857 1.106737 -1.463802 0.912596 2.458881 1.365523 2.234517 1.385396 -0.4650663 0.7839583 2.282071 0.2386854 0.9784987 1.273242 0.6254676 1.276118 2.447344 1.575805 4.515861 0.8364591 0.6976352 0.5248962 -2.899523 0.5404893 -1.394514 0.3592697 -0.9920228 0.6932542 -0.180825 0.2744942 -0.6203262 0.2417071 -1.224851 0.1811886 -0.8900446 0.226887 -3.733405 0.1735546 -1.547938 0.112908 -8.991437 0.1115473 -4.406975 0.1178586 2.36029 0.1271653 -2.011151 0.08239845 -0.1951623 0.04392552 -3.046913 0.05452365 -1.516051 0.1397218 1.094332 0.1264289 1.933882 0.1752298 -1.989354 0.1306208 -0.9046116 0.1277338 2.303931 0.1250716 -1.277528 
0.1262729 2.467187 0.1100633 -0.01313957 0.2080954 0.3508277 0.2128501 0.8266836 0.1570878 -3.358276 0.1242704 0.5336581 0.2453244 1.62488 0.1280062 -2.309668 0.1366247 -2.19346 0.1809374 -1.296981 0.1988072 1.969289 0.108966 1.910149 0.1144941 1.112991 0.1896516 3.369206 0.1071244 -1.499693 0.2437382 -0.4268651 0.2060418 2.251279 0.1191459 -0.03434586 0.173683 2.421464 0.1856368 0.8361177 0.1358361 -1.145463 0.1938676 -0.5617768 0.2014952 0.9808323 0.1261575 3.778169 0.2198753 -2.851743 0.2107957 1.588071 0.2140176 -1.231923 0.2426582 3.296147 0.1962984 1.245569 0.2223071 1.98822 0.1688889 0.5244448 0.1442813 0.5712083 0.1423984 1.580774 0.1360883 3.032353 0.1284363 0.2361942 0.2330592 0.7644741 0.1345135 -0.2284832 0.177801 0.9684574 0.1846398 -0.8217517 0.1690076 -0.1556177 0.2140797 0.9469408 0.1869065 1.540043 0.2084914 1.695791 0.1887449 0.8586018 0.2197045 2.042192 0.1317169 -0.8347201 0.1562218 -1.021943 0.1533783 1.968931 0.1491463 0.9810904 0.1098978 1.055438 0.2106167 1.022286 0.199176 0.9168212 0.03894011 -0.3426502 0.1671793 -3.521834 0.1534467 -0.2374157 0.2013961 1.556842 0.1250643 2.738925 0.1591011 1.886823 0.1626787 -0.2158393 0.2056642 3.486286 0.1987337 2.471281 0.1744332 1.152822 0.1951747 -0.989242 0.1155556 0.5867982 0.08321315 2.675081 0.1303141 -2.584626 0.1609869 -1.254465 0.122685 -2.27439 0.1415207 1.642059 0.1458468 -2.672928 0.1453881 -5.031017 0.1312302 0.2756784 0.1731334 0.8576683 0.1351911 -2.385648 0.2008109 -0.2893286 0.2062319 2.538498 0.191823 -1.739293 0.1671868 -1.622128 0.1850325 2.469388 0.2514207 2.117658 0.2225232 2.403703 0.1996269 -1.039887 0.122257 -0.6789801 0.1378511 -4.054801 0.08070226 0.2997747 0.1447362 2.234673 0.1876959 2.499773 0.1622439 0.5804301 0.1345898 0.07422866 0.09716025 -0.8710974 0.1657741 -1.370054 0.1313368 -1.671961 0.1861922 -0.9497792 0.2265888 -1.500299 0.2406933 -0.639602 0.2041045 -0.02993134 0.1440134 1.211941 0.1917053 -2.045351 0.1995012 0.4286667 0.2419866 1.319268 0.05873292 3.849816 
0.1642288 3.828327 0.2147195 -0.1579602 0.2119238 -1.37588 0.142787 1.026216 0.2192122 -0.106043 0.2299229 0.5020428 0.1666804 -1.020357 0.03175325 -4.244395 0.1417808 1.535539 0.2701716 2.742496 0.2501491 1.629031 0.08584132 1.70437 0.1870297 6.293977 0.1240707 6.969766 -1.00584 -2.592914 8.495434 8.769945 -32.94919 \ No newline at end of file diff --git a/tika-core/src/main/resources/org/apache/tika/language/fa.ngp b/tika-core/src/main/resources/org/apache/tika/language/fa.ngp deleted file mode 100644 index e86e27e..0000000 --- a/tika-core/src/main/resources/org/apache/tika/language/fa.ngp +++ /dev/null @@ -1,1001 +0,0 @@ -# NgramProfile generated at Sun Jun 15 15:17:44 IRDT 2014 for Apache Tika Language Identification -ان_ 34459 -ند_ 15773 -نی_ 12790 -ری_ 12406 -_با 11822 -ین_ 11501 -های 11147 -انی 10770 -ای_ 10340 -ران 10094 -است 9846 -مان 9238 -_ال 9163 -ها_ 8765 -_بر 8513 -_کا 8460 -ید_ 8306 -یم_ 8287 -ار_ 8283 -ون_ 8244 -یان 8215 -ده_ 8205 -اری 7849 -یی_ 7834 -_در 7787 -_اس 7775 -ایی 7489 -_ما 7405 -ست_ 7374 -نه_ 7180 -اند 7113 -دار 7111 -تی_ 7097 -دی_ 6826 -کار 6646 -شان 6424 -رها 6258 -لی_ 6098 -_نا 6066 -_فر 6059 -ات_ 6000 -_سا 5901 -تان 5833 -_دا 5738 -وان 5616 -_پر 5565 -_خو 5537 -ستا 5387 -دان 5346 -ارا 5333 -_رو 5290 -انه 5223 -_ای 5222 -_می 5145 -ور_ 5145 -_ان 5134 -_مو 4986 -_سر 4836 -_دو 4813 -ال_ 4755 -_پا 4738 -_را 4721 -_وا 4659 -_بی 4653 -می_ 4589 -الی 4532 -یه_ 4488 -_کو 4476 -رد_ 4389 -_تا 4291 -وری 4233 -ردا 4167 -_سی 4144 -ندا 4126 -یست 4086 -_او 4080 -رای 4050 -خوا 4041 -یت_ 4039 -_تو 3989 -ارد 3964 -اد_ 3960 -ره_ 3931 -_نو 3882 -_نی 3874 -الا 3845 -_پی 3818 -نى_ 3813 -وار 3808 -رى_ 3743 -_دی 3737 -ام_ 3689 -یک_ 3689 -_بو 3673 -_سو 3637 -امی 3604 -یا_ 3533 -_گر 3529 -_ام 3528 -یرا 3512 -رین 3483 -ته_ 3466 -ورد 3462 -این 3460 -_مر 3446 -یر_ 3430 -نیا 3371 -اره 3336 -انت 3332 -ورا 3302 -ونی 3301 -ریا 3288 -بار 3286 -باز 3271 -گرا 3258 -تری 3242 -وی_ 3241 -زی_ 3238 -کی_ 3230 -مار 3226 -سی_ 3224 -_تر 3218 -بان 3196 -تر_ 3195 -را_ 3185 -دن_ 
3181 -_ها 3156 -_کر 3151 -دها 3148 -_از 3144 -ود_ 3138 -_مت 3108 -نده 3089 -ندی 3089 -گی_ 3071 -ادی 3068 -ستی 3065 -ینی 3063 -له_ 3060 -ارت 3049 -_من 3034 -ولی 3012 -گان 3001 -برا 2969 -سان 2967 -بی_ 2950 -_مح 2939 -_ار 2903 -ازی 2896 -غیر 2883 -رند 2879 -رما 2864 -دید 2852 -یون 2848 -_مس 2818 -_غی 2810 -اتی 2795 -زار 2792 -نند 2778 -_پو 2775 -نگ_ 2774 -مای 2753 -یش_ 2753 -اور 2750 -اه_ 2737 -نها 2737 -ارو 2721 -_مع 2690 -انو 2681 -اها 2664 -تور 2662 -وس_ 2649 -انگ 2639 -_شا 2631 -_هم 2628 -فرا 2617 -ترا 2616 -انش 2605 -ایش 2605 -لان 2602 -میر 2594 -انس 2570 -_یا 2560 -مین 2556 -یل_ 2548 -دند 2522 -داد 2520 -انا 2515 -_لا 2514 -نا_ 2513 -_کن 2507 -اسی 2499 -یار 2496 -لا_ 2493 -کان 2491 -مه_ 2488 -از_ 2471 -_وی 2468 -یس_ 2448 -نان 2445 -رات 2434 -یری 2433 -راس 2430 -_کل 2422 -تار 2415 -سال 2415 -تى_ 2408 -_گو 2401 -نام 2401 -اما 2395 -وست 2389 -وند 2387 -گاه 2376 -ردی 2354 -رون 2336 -رم_ 2327 -یلی 2327 -ولا 2324 -نما 2322 -روز 2318 -اده 2275 -به_ 2274 -ساز 2271 -ستر 2253 -مون 2243 -دم_ 2239 -یند 2238 -ارن 2233 -_جا 2230 -یما 2223 -رو_ 2215 -یى_ 2206 -کرد 2196 -اس_ 2195 -گیر 2193 -_خا 2191 -گرد 2182 -_اب 2162 -اخت 2159 -ناس 2154 -اشت 2142 -هان 2142 -_ری 2140 -الم 2134 -خود 2134 -روی 2133 -اى_ 2126 -ربا 2117 -شی_ 2117 -تا_ 2115 -هار 2112 -انى 2106 -تون 2100 -رام 2091 -خان 2090 -گار 2089 -ول_ 2084 -_آن 2083 -_جو 2083 -یدا 2083 -_شو 2080 -در_ 2078 -_آر 2069 -اتو 2067 -رید 2057 -ینا 2049 -ابی 2048 -رت_ 2041 -لین 2032 -ما_ 2024 -ادا 2017 -_زی 2015 -مند 2015 -_اف 2006 -اب_ 2005 -رسا 2000 -پور 1998 -_بن 1995 -وال 1995 -بال 1988 -ادر 1982 -دى_ 1982 -شنا 1972 -_لو 1970 -تند 1964 -نش_ 1963 -اید 1962 -بود 1950 -_بل 1944 -راد 1943 -_لی 1931 -سیا 1926 -روا 1921 -درا 1914 -ابر 1912 -_شی 1911 -رست 1908 -وم_ 1907 -بیا 1904 -ینگ 1901 -_هو 1899 -یکا 1898 -_کی 1889 -یرو 1889 -_گا 1884 -ولو 1860 -دین 1856 -سید 1855 -ایت 1851 -_آم 1849 -رسی 1838 -زاد 1838 -یده 1836 -ارم 1827 -ابا 1825 -ایا 1825 -مور 1821 -_فا 1816 -اهی 1810 -سین 1800 -بین 1784 -نم_ 1784 -_مه 1782 -یدن 1782 -بری 1777 -تن_ 
1777 -ردن 1776 -_به 1775 -ایى 1774 -یدی 1773 -ارک 1770 -وها 1769 -هی_ 1768 -ریم 1767 -_بد 1764 -زان 1764 -دال 1755 -نگی 1753 -_شه 1750 -ایر 1745 -ریک 1742 -_فی 1741 -فرو 1738 -لى_ 1733 -شهر 1728 -من_ 1728 -تها 1721 -ویی 1720 -_مد 1714 -ارس 1714 -ارش 1711 -ـــ 1710 -ورت 1709 -لیس 1704 -لیا 1700 -واه 1700 -نگا 1693 -اسا 1692 -لات 1687 -تما 1684 -رده 1682 -رفت 1682 -ودی 1679 -ترو 1678 -خور 1676 -رال 1676 -گری 1675 -باد 1673 -رش_ 1670 -حمد 1669 -لام 1664 -گر_ 1664 -دری 1663 -دگا 1656 -ورو 1652 -روس 1648 -نور 1647 -دور 1646 -لو_ 1646 -پار 1646 -ریس 1644 -یو_ 1642 -لای 1640 -مال 1637 -ایم 1634 -فری 1633 -سم_ 1632 -ندگ 1628 -زند 1624 -دا_ 1620 -وره 1619 -دست 1613 -ازا 1608 -انم 1607 -افت 1606 -یز_ 1605 -ندر 1604 -دای 1603 -رود 1603 -_کش 1595 -راب 1593 -تاب 1589 -تم_ 1574 -بای 1568 -_هی 1564 -وسی 1562 -بند 1561 -ستو 1561 -نگر 1560 -نوا 1554 -وای 1554 -تین 1550 -جان 1550 -ارگ 1549 -_هر 1548 -شور 1547 -رگر 1545 -_مش 1544 -مات 1544 -_آل 1542 -راه 1542 -یرم 1540 -بور 1534 -یای 1531 -الو 1529 -توا 1527 -شت_ 1524 -_یو 1512 -امو 1505 -درو 1505 -فی_ 1501 -_خر 1500 -ندو 1499 -نو_ 1496 -یها 1495 -_نم 1489 -_گل 1488 -یات 1483 -ورن 1475 -وما 1475 -نت_ 1470 -_ور 1465 -ماد 1464 -ونا 1464 -نوی 1463 -سون 1462 -_دل 1452 -کور 1451 -پرو 1445 -دیم 1444 -یشا 1444 -رار 1440 -اوی 1438 -نیم 1438 -برگ 1435 -_قا 1434 -مرد 1434 -راف 1430 -وا_ 1430 -_تی 1429 -یتا 1427 -_گی 1426 -یلا 1418 -نیک 1415 -ابو 1411 -شید 1400 -اک_ 1398 -مید 1398 -کرا 1395 -وز_ 1394 -برو 1390 -ستن 1390 -نیس 1390 -یاد 1389 -روش 1387 -فت_ 1386 -لار 1386 -نید 1386 -_نگ 1380 -انن 1377 -ومی 1377 -نای 1376 -وین 1375 -یسم 1373 -که_ 1372 -قی_ 1371 -_تن 1369 -داش 1363 -برد 1361 -میل 1357 -باش 1355 -محم 1352 -دو_ 1351 -کلا 1351 -ویا 1347 -_فو 1336 -الت 1336 -تو_ 1333 -نشا 1333 -_مج 1328 -نار 1328 -وزی 1328 -_اش 1325 -مدا 1323 -زه_ 1317 -اتر 1312 -کا_ 1310 -انک 1306 -_بخ 1304 -اله 1299 -بها 1298 -_سن 1296 -اسک 1296 -ائی 1295 -مى_ 1294 -دیا 1292 -ویس 1291 -گذا 1288 -نتی 1284 -راک 1280 -بر_ 1278 -کام 1277 -رور 1275 -وف_ 1272 -زیر 1266 -_و_ 
1265 -واس 1264 -لند 1262 -یگر 1261 -ودا 1259 -راز 1258 -_اک 1257 -یلو 1255 -آور 1254 -تاد 1254 -تش_ 1254 -رتر 1254 -_اع 1252 -اکا 1251 -دام 1247 -_چا 1244 -پرس 1243 -_مق 1242 -کو_ 1242 -مرا 1238 -امه 1237 -داز 1237 -لیو 1237 -ارى 1235 -_آب 1232 -_مل 1231 -بدا 1228 -رنا 1228 -ماه 1226 -مست 1220 -نجا 1220 -دگی 1218 -سرا 1218 -برن 1217 -تای 1214 -یتی 1214 -وش_ 1211 -_ات 1205 -دون 1204 -یاب 1204 -تیک 1199 -ریز 1196 -ونه 1196 -_بگ 1194 -زها 1193 -تیا 1192 -عی_ 1190 -وشی 1190 -یزی 1185 -_زا 1184 -اوا 1184 -وت_ 1184 -_نش 1183 -هرا 1180 -با_ 1173 -ینو 1173 -اگر 1172 -میا 1172 -ورم 1172 -کال 1172 -ادو 1171 -سیو 1171 -_شر 1169 -دیو 1166 -_کم 1165 -سای 1165 -ارب 1163 -یسی 1156 -روم 1155 -_قر 1153 -اعت 1153 -ریو 1153 -_هن 1150 -رکا 1149 -وکا 1148 -تیم 1147 -شیر 1147 -_اح 1145 -یور 1145 -_رس 1144 -اشی 1143 -الد 1142 -امر 1142 -لاس 1142 -هم_ 1141 -یکو 1141 -کری 1140 -ازم 1139 -یاس 1139 -لما 1138 -نیو 1137 -یمی 1137 -تال 1134 -علی 1132 -_دس 1131 -رمی 1126 -رتو 1125 -الب 1124 -_ند 1123 -اکو 1122 -گوی 1120 -یوا 1118 -متر 1117 -رزا 1115 -نین 1115 -کند 1115 -_حا 1114 -سکو 1108 -پای 1108 -اکس 1107 -مدی 1104 -ندن 1103 -فتا 1100 -بات 1097 -_بس 1096 -رخو 1096 -پیر 1095 -اول 1092 -شد_ 1086 -کس_ 1086 -_عل 1085 -ذار 1081 -ربی 1081 -هر_ 1080 -هاى 1078 -_زن 1077 -رنگ 1077 -ویر 1076 -_تع 1073 -_شک 1073 -ردو 1071 -شین 1070 -نتر 1069 -هاس 1068 -نس_ 1067 -هند 1066 -یچ_ 1066 -_عا 1064 -الک 1059 -رتی 1058 -ازن 1057 -جوا 1056 -رتا 1055 -وید 1054 -مری 1052 -_آو 1050 -مسا 1047 -رشا 1045 -لیت 1045 -ینه 1043 -افی 1042 -نتا 1040 -_شن 1039 -_فل 1039 -افر 1039 -_جن 1038 -کنن 1037 -_اد 1035 -_پس 1035 -وى_ 1035 -نات 1032 -واد 1031 -وبا 1029 -فیل 1028 -وتو 1028 -دنی 1026 -دیر 1025 -_ول 1023 -ازد 1023 -باب 1021 -شاه 1020 -گون 1018 -واب 1017 -وبی 1016 -رنی 1014 -سار 1014 -واز 1014 -یره 1014 -لوی 1013 -کنی 1013 -کول 1013 -اتا 1011 -اهر 1011 -یال 1011 -یام 1011 -ودن 1010 -رگ_ 1009 -یزا 1009 -_اخ 1007 -رمن 1007 -ریت 1000 -ریه 999 -مت_ 999 -کرو 999 -یبا 998 -سى_ 997 -اسپ 996 -یرن 996 -_ده 995 -_سل 995 -_عم 995 -_صد 
992 -تول 990 -زای 990 -اش_ 989 -اون 988 -وئی 987 -ماس 986 -_مخ 984 -ایس 984 -ایل 984 -ربر 983 -_عب 982 -سته 982 -نون 982 -پول 982 -_تم 981 -_شم 981 -یدم 976 -سکی 973 -شار 973 -پیش 970 -وتی 964 -لت_ 963 -دوس 959 -کلی 959 -منا 958 -سن_ 957 -لم_ 957 -بیر 955 -کاس 953 -وزا 951 -وه_ 950 -دش_ 949 -_نف 948 -رک_ 947 -_بش 945 -کات 945 -ستم 944 -هاد 944 -رس_ 943 -روب 943 -ودر 943 -وسا 943 -ویل 943 -نتو 938 -سلا 937 -_ضد 936 -_نب 935 -کتر 934 -درس 931 -جی_ 930 -فر_ 930 -ارز 929 -یف_ 929 -بى_ 928 -ونت 928 -باس 927 -_چن 926 -شما 925 -گى_ 923 -_م_ 918 -امت 917 -_تک 915 -اف_ 913 -داو 913 -زما 912 -نفر 909 -اکی 907 -درم 907 -خت_ 906 -ناب 906 -هام 906 -سه_ 903 -موز 903 -رن_ 902 -_پل 901 -روت 901 -سند 901 -_آی 900 -لوا 900 -ذیر 898 -دیگ 896 -سیم 896 -شه_ 895 -کى_ 894 -نست 893 -هزا 893 -پان 892 -دما 891 -ورس 890 -دود 888 -شتی 888 -_وم 887 -_بز 886 -هما 886 -ائو 884 -میز 883 -_پن 882 -_چی 882 -ونو 882 -کین 882 -پرد 879 -زى_ 878 -یوس 877 -تشا 876 -چه_ 876 -ایو 875 -رگا 874 -اسم 871 -راو 870 -پذی 870 -یب_ 870 -_آس 868 -قرا 868 -نیت 868 -فور 867 -اهد 866 -بلا 865 -ردم 864 -_اص 863 -پرا 860 -آمد 859 -ادگ 857 -وده 856 -گفت 856 -شکا 855 -لید 854 -ازه 853 -یگا 851 -هری 850 -ابه 847 -رز_ 847 -زین 847 -قه_ 847 -لور 847 -چی_ 847 -_بب 846 -ویت 846 -یکی 846 -اق_ 844 -شای 844 -الع 843 -دول 843 -شون 843 -فرم 842 -وجو 842 -مول 840 -_خی 835 -خدا 833 -زد_ 833 -مهر 833 -لون 832 -مام 830 -وتا 829 -جار 828 -مد_ 828 -ینت 828 -_اق 826 -وتر 826 -_آز 825 -سرو 823 -میت 823 -سیر 822 -آبا 819 -اشا 815 -درب 815 -رخا 815 -ناه 814 -تبا 813 -سوا 813 -ملا 812 -_مک 809 -وک_ 808 -کتا 808 -رضا 807 -سور 807 -ناک 807 -دوا 806 -پری 806 -ادم 805 -رسو 805 -ورز 804 -موا 801 -نال 801 -وب_ 801 -وشا 800 -_خد 799 -امب 798 -مود 798 -رمو 797 -امل 795 -مشا 795 -الح 793 -همی 793 -بست 792 -لیم 791 -ونگ 791 -توم 790 -فان 790 -_نظ 789 -بید 789 -داس 789 -قان 788 -کشو 788 -_گذ 785 -کر_ 785 -_ون 784 -عه_ 782 -کای 782 -یرد 782 -هنگ 780 -ورش 780 -ورک 780 -اکر 779 -نظر 779 -یدر 778 -_بک 777 -_شد 776 -چند 776 -تقا 773 -نبا 773 -_وس 772 -رکو 
772 -لد_ 772 -مها 772 -کن_ 772 -الن 771 -ردر 771 -موس 770 -نیز 770 -یاز 769 -_ه_ 767 -مش_ 767 -یتو 767 -زیا 765 -کسی 764 -اپی 763 -شتر 763 -نوش 763 -هست 763 -_چر 761 -_مب 760 -ستگ 760 -_ا_ 759 -ابل 759 -_تق 758 -لوم 758 -ناد 758 -یمو 758 -_جی 757 -وکو 756 -_تح 755 -امپ 754 -_رف 753 -بخش 753 -برت 753 -_زو 752 -شود 752 -نسی 752 -امد 748 -دگر 748 -_یک 747 -اجر 747 -او_ 746 -فار 746 -_مص 745 -توس 745 -_چه 744 -اسل 744 -هور 744 -هوا 742 -الس 741 -اکت 741 -یوی 741 diff --git a/tika-core/src/main/resources/org/apache/tika/language/tika.language.properties b/tika-core/src/main/resources/org/apache/tika/language/tika.language.properties index c6b1880..bb73ff5 100644 --- a/tika-core/src/main/resources/org/apache/tika/language/tika.language.properties +++ b/tika-core/src/main/resources/org/apache/tika/language/tika.language.properties @@ -23,7 +23,7 @@ # If there exists an ISO 639-1 2-letter code it should be used # If not, you can choose an ISO 639-2 3-letter code # See http://www.loc.gov/standards/iso639-2/php/code_list.php -languages=be,ca,da,de,eo,et,el,en,es,fi,fr,fa,gl,hu,is,it,lt,nl,no,pl,pt,ro,ru,sk,sl,sv,th,uk +languages=be,ca,da,de,eo,et,el,en,es,fi,fr,gl,hu,is,it,lt,nl,no,pl,pt,ro,ru,sk,sl,sv,th,uk # List of language names in english name.be=Belarusian @@ -37,7 +37,6 @@ name.es=Spanish name.fi=Finnish name.fr=French -name.fa=Persian name.gl=Galician name.hu=Hungarian name.is=Icelandic diff --git a/tika-core/src/main/resources/org/apache/tika/mime/tika-mimetypes.xml b/tika-core/src/main/resources/org/apache/tika/mime/tika-mimetypes.xml index d0f7df9..d4ad91a 100644 --- a/tika-core/src/main/resources/org/apache/tika/mime/tika-mimetypes.xml +++ b/tika-core/src/main/resources/org/apache/tika/mime/tika-mimetypes.xml @@ -19,22 +19,6 @@ Description: This xml file defines the valid mime types used by Tika. The mime type data within this file is based on information from various sources like Apache Nutch, Apache HTTP Server, the file(1) command, etc. 
- - Notes: - * Tika supports a wider range of match types than Freedesktop does - * Glob patterns must be unique, if there's a clash assign to the most - popular format - * The main mime type should be the canonical one, use aliases for any - other widely used forms - * Where there's a hierarchy in the types, list it via a parent - * Highly specific magic matches get a high priority - * General magic matches which could trigger a false-positive need - a low one - * The priority for containers normally need to be higher than for - the things they contain, so they don't accidently get detected - as what's in them - * For logic too complex to be expressed in a magic match, do the best - you can here, then provide a Custom Detector for the rest --> @@ -47,17 +31,8 @@ - - - - - - - - - @@ -79,17 +54,6 @@ - - - CBOR - <_comment>Concise Binary Object Representation container - http://tools.ietf.org/html/rfc7049 - - - - - - @@ -111,13 +75,7 @@ - - - <_comment>DICOM medical imaging data - - - - + @@ -140,7 +98,7 @@ - + <_comment>DITA Task Topic @@ -187,11 +145,10 @@ - + FITS <_comment>Flexible Image Transport System - http://www.digitalpreservation.gov/formats/fdd/fdd000317.shtml @@ -227,19 +184,6 @@ - - - <_comment>Windows setup INFormation - http://msdn.microsoft.com/en-us/library/windows/hardware/ff549520(v=vs.85).aspx - - - - - - - - - @@ -298,10 +242,7 @@ <_comment>Java Native Library for OSX - - - - + @@ -383,6 +324,8 @@ + + @@ -435,6 +378,8 @@ + + @@ -458,9 +403,6 @@ - - - @@ -481,16 +423,7 @@ http://www.adobe.com/devnet/pdf/pdf_reference_archive.html com.adobe.pdf - - - - - - - - - @@ -645,37 +578,6 @@ - - - <_comment>Sereal binary serialization format - https://github.com/Sereal/Sereal/blob/master/sereal_spec.pod - - - - - - - - - - - - - - - - - - - - - - - - - - - @@ -1013,14 +915,6 @@ - FDF - <_comment>Forms Data Format - http://en.wikipedia.org/wiki/Forms_Data_Format - http://www.adobe.com/devnet/acrobat/fdftoolkit.html - com.adobe.fdf - - - @@ -1105,9 +999,6 
@@ - - - KML <_comment>Keyhole Markup Language @@ -1473,6 +1364,7 @@ + @@ -1506,60 +1398,6 @@ - - <_comment>Microsoft Excel 4 Worksheet - - - - - - - - - - - <_comment>Microsoft Excel 4 Workspace - - - - - - - - - - <_comment>Microsoft Excel 3 Worksheet - - - - - - - - - - - <_comment>Microsoft Excel 3 Workspace - - - - - - - - - - <_comment>Microsoft Excel 2 Worksheet - - - - - - - - - - @@ -1580,15 +1418,6 @@ <_comment>Microsoft Outlook Message - - - - <_comment>Outlook Personal Folders File Format - - - - - @@ -1700,10 +1529,7 @@ - - <_comment>Open XML Paper Specification - @@ -2214,9 +2040,8 @@ - - + @@ -2225,7 +2050,7 @@ - + @@ -2234,7 +2059,7 @@ - + @@ -2246,7 +2071,7 @@ - + @@ -2255,13 +2080,12 @@ - + - @@ -2398,44 +2222,12 @@ - <_comment>Microsoft Visio Diagram - - - - <_comment>Office Open XML Visio Drawing (macro-free) - - - - - <_comment>Office Open XML Visio Template (macro-free) - - - - - <_comment>Office Open XML Visio Stencil (macro-free) - - - - - <_comment>Office Open XML Visio Drawing (macro-enabled) - - - - - <_comment>Office Open XML Visio Template (macro-enabled) - - - - - <_comment>Office Open XML Visio Stencil (macro-enabled) - - @@ -2558,17 +2350,6 @@ - - <_comment>AxCrypt - - - - - - - - - INDD <_comment>Adobe InDesign document @@ -2586,16 +2367,6 @@ - - - - - - - - - - @@ -2634,157 +2405,24 @@ - <_comment>Berkeley DB - - - - <_comment>Berkeley DB Hash Database - - + - - - - - - <_comment>Berkeley DB BTree Database - + + + + + + + + + - - - - - - - - <_comment>Berkeley DB Queue Database - - - - - - - - - <_comment>Berkeley DB Log Database - - - - - - - - - - <_comment>Berkeley DB Version 2 Hash Database - - - - - - - - - - - - - - - <_comment>Berkeley DB Version 3 Hash Database - - - - - - - - - - - - - - - <_comment>Berkeley DB Version 4 Hash Database - - - - - - - - - - - - - - - <_comment>Berkeley DB Version 5 Hash Database - - - - - - - - - - - - - - - - <_comment>Berkeley DB Version 2 BTree Database - - - - - - - - - - - - - - - 
<_comment>Berkeley DB Version 3 BTree Database - - - - - - - - - - - - - - - <_comment>Berkeley DB Version 4 and 5 BTree Database - - - - - - - - - - - - + @@ -2793,38 +2431,20 @@ - - - - - - - - - - - - - - - - - - - - - - - - - - - - + + + + + + + + + + + - @@ -2833,12 +2453,12 @@ - + - @@ -2872,19 +2492,8 @@ - - - - - CRX - <_comment>Chrome Extension Package - https://developer.chrome.com/extensions/crx - - - - @@ -2898,7 +2507,7 @@ - + <_comment>UNIX CPIO Archive @@ -2917,26 +2526,12 @@ - - - - - - DEX - <_comment>Dalvik Executable Format - http://source.android.com/devices/tech/dalvik/dex-format.html - - - - - - @@ -3007,7 +2602,6 @@ - @@ -3061,12 +2655,9 @@ EMF <_comment>Extended Metafile - https://msdn.microsoft.com/en-us/library/cc230711.aspx - - - + @@ -3124,7 +2715,7 @@ - @@ -3176,21 +2767,9 @@ - - GRIB - <_comment>General Regularly-distributed Information in Binary form - http://en.wikipedia.org/wiki/GRIB - - - - - - - - <_comment>GNU tar Compressed File Archive (GNU Tape Archive) - + @@ -3198,20 +2777,14 @@ - + <_comment>Gzip Compressed Archive - - - - - - - + + - @@ -3230,19 +2803,13 @@ - <_comment>Hangul Word Processor File - - - <_comment>Hangul Word Processor File v5 - @@ -3257,38 +2824,11 @@ - - <_comment>ISA-Tab Investigation file - - - - - - - - - <_comment>ISA-Tab Study file - - - - - - - - <_comment>ISA-Tab Assay file - - - - - - ISO <_comment>ISO 9660 CD-ROM filesystem data - - @@ -3383,33 +2923,19 @@ + - - - - <_comment>Microsoft Windows Installer - - - - - - - - - - - - - - + + + @@ -3486,9 +3012,6 @@ - - - @@ -3502,46 +3025,6 @@ - - - - - <_comment>MySQL Table Definition (Format) - - - - - - - - - - - - - <_comment>MySQL MISAM Index - - - - - - - - <_comment>MySQL MISAM Compressed Index - - - - - - - - - <_comment>MySQL MISAM Data - - - - - @@ -3564,7 +3047,7 @@ - + @@ -3572,13 +3055,6 @@ - - - <_comment>XQuery source code - - - - <_comment>RAR archive @@ -3588,11 +3064,6 @@ - - - - - @@ -3720,20 +3191,6 @@ - - - - <_comment>Snappy Framed - 
- - - - - - - - - @@ -3775,7 +3232,6 @@ - @@ -3796,7 +3252,7 @@ - + @@ -3824,10 +3280,6 @@ - - <_comment>Pre-OLE2 (Old) Microsoft Excel Worksheets - - @@ -3848,16 +3300,6 @@ <_comment>Password Protected OOXML File - - - <_comment>Visio OOXML File - - - - - - - @@ -3867,23 +3309,6 @@ - - - VHD - <_comment>Virtual PC Virtual Hard Disk - http://en.wikipedia.org/wiki/VHD_%28file_format%29 - - - - - - - VMDK - <_comment>Virtual Disk Format - http://en.wikipedia.org/wiki/VMDK - - - @@ -3937,9 +3362,7 @@ - - - + @@ -3955,10 +3378,10 @@ http://en.wikipedia.org/wiki/Xml public.xml - + @@ -3966,12 +3389,6 @@ - - - - @@ -3992,16 +3409,6 @@ - - - - - XSLFO - <_comment>XSL Format - - - @@ -4014,13 +3421,8 @@ - XSPF - <_comment>XML Shareable Playlist Format - - @@ -4033,28 +3435,10 @@ http://en.wikipedia.org/wiki/ZIP_(file_format) com.pkware.zip-archive - + - - - - - - - <_comment>ZLIB Compressed Data Format - http://tools.ietf.org/html/rfc1950 - - - - - - - - - - @@ -4228,71 +3612,20 @@ - <_comment>Ogg Vorbis Audio - - - - - <_comment>Ogg Vorbis Codec Compressed WAV File - - + + + - - - - - <_comment>Ogg Packaged Free Lossless Audio Codec - - - - - - - - - - <_comment>Ogg Packaged Unompressed WAV File - - - - - - - - - - <_comment>Ogg Opus Codec Compressed WAV File - - - - - - - - - - <_comment>Ogg Speex Codec Compressed WAV File - - - - - - + @@ -4375,6 +3708,7 @@ + @@ -4554,16 +3888,6 @@ - - BPG - <_comment>Better Portable Graphics - - - - - - - CGM <_comment>Computer Graphics Metafile @@ -4580,11 +3904,7 @@ - - - - - + @@ -4715,17 +4035,16 @@ - PSD <_comment>Photoshop Image + - @@ -4751,6 +4070,7 @@ + @@ -4805,13 +4125,8 @@ - - <_comment>Microsoft Document Imaging - - - - - + + @@ -4826,17 +4141,6 @@ - - WEBP - http://en.wikipedia.org/wiki/WebP - - - - - - - @@ -4846,35 +4150,12 @@ - - <_comment>FreeHand image - - - - - - - - - - - - - - - - - - - - - @@ -5121,14 +4402,12 @@ - - @@ -5148,7 +4427,6 @@ - @@ -5174,12 +4452,6 @@ - - <_comment>AutoCAD Design Web Format - - - - @@ 
-5219,13 +4491,6 @@ - - - - - - - <_comment>ActionScript source code @@ -5286,7 +4551,6 @@ - @@ -5409,6 +4673,7 @@ + @@ -5540,17 +4805,6 @@ - - <_comment>Web Video Text Tracks Format - WebVTT - - - - - - - - <_comment>AWK script @@ -5609,9 +4863,6 @@ <_comment>C source code header - - - @@ -5630,9 +4881,6 @@ <_comment>C source code - - - @@ -5762,14 +5010,6 @@ - - <_comment>Java Properties - - - - - - <_comment>Java Server Page @@ -5813,36 +5053,11 @@ <_comment>Matlab source code - - - - - - - - - - - - - - - - - - - - MATLAB data file - - - - - @@ -5993,10 +5208,6 @@ - - <_comment>Text-based (non-binary) Message - - @@ -6032,9 +5243,6 @@ - - - @@ -6163,89 +5371,8 @@ - <_comment>Ogg Vorbis Video - - - - <_comment>Ogg Daala Video - - - - - - - - - - <_comment>Ogg Theora Video - - - - - - - - - - <_comment>Ogg Packaged Dirac Video - - - - - - - - - <_comment>Ogg Packaged OGM Video - - - - - - - - - - <_comment>Ogg Packaged Raw UVS Video - - - - - - - - - <_comment>Ogg Packaged Raw YUV Video - - - - - - - - - <_comment>Ogg Packaged Raw RGB Video - - - - - @@ -6385,7 +5512,7 @@ - + <_comment>Matroska Media Container @@ -6459,4 +5586,11 @@ + + <_comment>XQuery source code + + + + + diff --git a/tika-core/src/test/java/org/apache/tika/TikaDetectionTest.java b/tika-core/src/test/java/org/apache/tika/TikaDetectionTest.java index 799f977..f81c731 100644 --- a/tika-core/src/test/java/org/apache/tika/TikaDetectionTest.java +++ b/tika-core/src/test/java/org/apache/tika/TikaDetectionTest.java @@ -77,8 +77,8 @@ assertEquals("application/octet-stream", tika.detect("x.lrf")); assertEquals("application/octet-stream", tika.detect("x.lzh")); assertEquals("application/octet-stream", tika.detect("x.so")); - assertEquals("application/x-iso9660-image", tika.detect("x.iso")); - assertEquals("application/x-apple-diskimage", tika.detect("x.dmg")); + assertEquals("application/octet-stream", tika.detect("x.iso")); + assertEquals("application/octet-stream", tika.detect("x.dmg")); 
assertEquals("application/octet-stream", tika.detect("x.dist")); assertEquals("application/octet-stream", tika.detect("x.distz")); assertEquals("application/octet-stream", tika.detect("x.pkg")); @@ -584,8 +584,7 @@ assertEquals("application/x-msdownload", tika.detect("x.dll")); assertEquals("application/x-msdownload", tika.detect("x.com")); assertEquals("application/x-msdownload", tika.detect("x.bat")); - // Differ from httpd - MSI is different from normal windows executables - //assertEquals("application/x-msdownload", tika.detect("x.msi")); + assertEquals("application/x-msdownload", tika.detect("x.msi")); assertEquals("application/x-msmediaview", tika.detect("x.mvb")); assertEquals("application/x-msmediaview", tika.detect("x.m13")); assertEquals("application/x-msmediaview", tika.detect("x.m14")); @@ -652,10 +651,8 @@ assertEquals("audio/mpeg", tika.detect("x.m2a")); assertEquals("audio/mpeg", tika.detect("x.m3a")); assertEquals("audio/ogg", tika.detect("x.oga")); - // Differ from httpd - Use a dedicated mimetype of Vorbis - //assertEquals("audio/ogg", tika.detect("x.ogg")); - // Differ from httpd - Speex more commonly uses its own mimetype - //assertEquals("audio/ogg", tika.detect("x.spx")); + assertEquals("audio/ogg", tika.detect("x.ogg")); + assertEquals("audio/ogg", tika.detect("x.spx")); assertEquals("audio/vnd.digital-winds", tika.detect("x.eol")); assertEquals("audio/vnd.dts", tika.detect("x.dts")); assertEquals("audio/vnd.dts.hd", tika.detect("x.dtshd")); @@ -841,11 +838,6 @@ assertEquals("video/x-msvideo", tika.detect("x.avi")); assertEquals("video/x-sgi-movie", tika.detect("x.movie")); assertEquals("x-conference/x-cooltalk", tika.detect("x.ice")); - - assertEquals("application/x-grib", tika.detect("x.grb")); - assertEquals("application/x-grib", tika.detect("x.grb1")); - assertEquals("application/x-grib", tika.detect("x.grb2")); - assertEquals("application/dif+xml", tika.detect("x.dif")); } } diff --git 
a/tika-core/src/test/java/org/apache/tika/TikaTest.java b/tika-core/src/test/java/org/apache/tika/TikaTest.java deleted file mode 100644 index 2c6f21f..0000000 --- a/tika-core/src/test/java/org/apache/tika/TikaTest.java +++ /dev/null @@ -1,214 +0,0 @@ -/* - * Licensed to the Apache Software Foundation (ASF) under one or more - * contributor license agreements. See the NOTICE file distributed with - * this work for additional information regarding copyright ownership. - * The ASF licenses this file to You under the Apache License, Version 2.0 - * (the "License"); you may not use this file except in compliance with - * the License. You may obtain a copy of the License at - * - * http://www.apache.org/licenses/LICENSE-2.0 - * - * Unless required by applicable law or agreed to in writing, software - * distributed under the License is distributed on an "AS IS" BASIS, - * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. - * See the License for the specific language governing permissions and - * limitations under the License. 
- */ -package org.apache.tika; - -import static org.junit.Assert.assertFalse; -import static org.junit.Assert.assertTrue; -import static org.junit.Assert.fail; - -import java.io.ByteArrayOutputStream; -import java.io.File; -import java.io.IOException; -import java.io.InputStream; -import java.net.URISyntaxException; -import java.net.URL; -import java.util.ArrayList; -import java.util.Collection; -import java.util.HashSet; -import java.util.List; -import java.util.Set; - -import org.apache.tika.extractor.EmbeddedResourceHandler; -import org.apache.tika.io.IOUtils; -import org.apache.tika.io.TikaInputStream; -import org.apache.tika.metadata.Metadata; -import org.apache.tika.mime.MediaType; -import org.apache.tika.parser.AutoDetectParser; -import org.apache.tika.parser.ParseContext; -import org.apache.tika.parser.Parser; -import org.apache.tika.sax.BodyContentHandler; -import org.apache.tika.sax.ToXMLContentHandler; -import org.xml.sax.ContentHandler; - -/** - * Parent class of Tika tests - */ -public abstract class TikaTest { - /** - * This method will give you back the filename incl. the absolute path name - * to the resource. If the resource does not exist it will give you back the - * resource name incl. the path. - * - * @param name - * The named resource to search for. - * @return an absolute path incl. the name which is in the same directory as - * the the class you've called it from. 
- */ - public File getResourceAsFile(String name) throws URISyntaxException { - URL url = this.getClass().getResource(name); - if (url != null) { - return new File(url.toURI()); - } else { - // We have a file which does not exists - // We got the path - url = this.getClass().getResource("."); - File file = new File(new File(url.toURI()), name); - if (file == null) { - fail("Unable to find requested file " + name); - } - return file; - } - } - - public InputStream getResourceAsStream(String name) { - InputStream stream = this.getClass().getResourceAsStream(name); - if (stream == null) { - fail("Unable to find requested resource " + name); - } - return stream; - } - - public static void assertContains(String needle, String haystack) { - assertTrue(needle + " not found in:\n" + haystack, haystack.contains(needle)); - } - public static void assertContains(T needle, Collection haystack) { - assertTrue(needle + " not found in:\n" + haystack, haystack.contains(needle)); - } - - public static void assertNotContained(String needle, String haystack) { - assertFalse(needle + " unexpectedly found in:\n" + haystack, haystack.contains(needle)); - } - public static void assertNotContained(T needle, Collection haystack) { - assertFalse(needle + " unexpectedly found in:\n" + haystack, haystack.contains(needle)); - } - - protected static class XMLResult { - public final String xml; - public final Metadata metadata; - - public XMLResult(String xml, Metadata metadata) { - this.xml = xml; - this.metadata = metadata; - } - } - - protected XMLResult getXML(String filePath, Parser parser, Metadata metadata) throws Exception { - return getXML(getResourceAsStream("/test-documents/" + filePath), parser, metadata); - } - - protected XMLResult getXML(String filePath, Metadata metadata) throws Exception { - return getXML(getResourceAsStream("/test-documents/" + filePath), new AutoDetectParser(), metadata); - } - - protected XMLResult getXML(String filePath) throws Exception { - return 
getXML(getResourceAsStream("/test-documents/" + filePath), new AutoDetectParser(), new Metadata()); - } - - protected XMLResult getXML(InputStream input, Parser parser, Metadata metadata) throws Exception { - ParseContext context = new ParseContext(); - context.set(Parser.class, parser); - - try { - ContentHandler handler = new ToXMLContentHandler(); - parser.parse(input, handler, metadata, context); - return new XMLResult(handler.toString(), metadata); - } finally { - input.close(); - } - } - - /** - * Basic text extraction. - *

    - * Tries to close input stream after processing. - */ - public String getText(InputStream is, Parser parser, ParseContext context, Metadata metadata) throws Exception{ - ContentHandler handler = new BodyContentHandler(1000000); - try { - parser.parse(is, handler, metadata, context); - } finally { - is.close(); - } - return handler.toString(); - } - - public String getText(InputStream is, Parser parser, Metadata metadata) throws Exception{ - return getText(is, parser, new ParseContext(), metadata); - } - - public String getText(InputStream is, Parser parser, ParseContext context) throws Exception{ - return getText(is, parser, context, new Metadata()); - } - - public String getText(InputStream is, Parser parser) throws Exception{ - return getText(is, parser, new ParseContext(), new Metadata()); - } - - /** - * Keeps track of media types and file names recursively. - * - */ - public static class TrackingHandler implements EmbeddedResourceHandler { - public List filenames = new ArrayList(); - public List mediaTypes = new ArrayList(); - - private final Set skipTypes; - - public TrackingHandler() { - skipTypes = new HashSet(); - } - - public TrackingHandler(Set skipTypes) { - this.skipTypes = skipTypes; - } - - @Override - public void handle(String filename, MediaType mediaType, - InputStream stream) { - if (skipTypes.contains(mediaType)) { - return; - } - mediaTypes.add(mediaType); - filenames.add(filename); - } - } - - /** - * Copies byte[] of embedded documents into a List. - */ - public static class ByteCopyingHandler implements EmbeddedResourceHandler { - - public List bytes = new ArrayList(); - - @Override - public void handle(String filename, MediaType mediaType, - InputStream stream) { - ByteArrayOutputStream os = new ByteArrayOutputStream(); - if (! 
stream.markSupported()) { - stream = TikaInputStream.get(stream); - } - stream.mark(0); - try { - IOUtils.copy(stream, os); - bytes.add(os.toByteArray()); - stream.reset(); - } catch (IOException e) { - //swallow - } - } - } -} diff --git a/tika-core/src/test/java/org/apache/tika/TypeDetectionBenchmark.java b/tika-core/src/test/java/org/apache/tika/TypeDetectionBenchmark.java index 550c8fe..25f5d15 100644 --- a/tika-core/src/test/java/org/apache/tika/TypeDetectionBenchmark.java +++ b/tika-core/src/test/java/org/apache/tika/TypeDetectionBenchmark.java @@ -20,7 +20,6 @@ import java.io.File; import java.io.FileInputStream; import java.io.InputStream; -import java.util.Locale; import org.apache.tika.io.IOUtils; @@ -36,7 +35,7 @@ } } else { benchmark(new File( - "../tika-parsers/src/test/resources/test-documents")); + "../tika-parsers/src/test/resources/test-documents")); } System.out.println( "Total benchmark time: " @@ -47,18 +46,20 @@ if (file.isHidden()) { // ignore } else if (file.isFile()) { - try (InputStream input = new FileInputStream(file)) { + InputStream input = new FileInputStream(file); + try { byte[] content = IOUtils.toByteArray(input); String type = - tika.detect(new ByteArrayInputStream(content)); + tika.detect(new ByteArrayInputStream(content)); long start = System.currentTimeMillis(); for (int i = 0; i < 1000; i++) { tika.detect(new ByteArrayInputStream(content)); } System.out.printf( - Locale.ROOT, "%6dns per Tika.detect(%s) = %s%n", System.currentTimeMillis() - start, file, type); + } finally { + input.close(); } } else if (file.isDirectory()) { for (File child : file.listFiles()) { diff --git a/tika-core/src/test/java/org/apache/tika/config/AbstractTikaConfigTest.java b/tika-core/src/test/java/org/apache/tika/config/AbstractTikaConfigTest.java deleted file mode 100644 index f817ef0..0000000 --- a/tika-core/src/test/java/org/apache/tika/config/AbstractTikaConfigTest.java +++ /dev/null @@ -1,50 +0,0 @@ -/* - * Licensed to the Apache Software 
Foundation (ASF) under one or more - * contributor license agreements. See the NOTICE file distributed with - * this work for additional information regarding copyright ownership. - * The ASF licenses this file to You under the Apache License, Version 2.0 - * (the "License"); you may not use this file except in compliance with - * the License. You may obtain a copy of the License at - * - * http://www.apache.org/licenses/LICENSE-2.0 - * - * Unless required by applicable law or agreed to in writing, software - * distributed under the License is distributed on an "AS IS" BASIS, - * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. - * See the License for the specific language governing permissions and - * limitations under the License. - */ -package org.apache.tika.config; - -import static org.junit.Assert.assertNotNull; - -import java.net.URL; - -import org.apache.tika.TikaTest; -import org.apache.tika.parser.ParseContext; -import org.junit.After; - -/** - * Parent of Junit test classes for {@link TikaConfig}, including - * Tika Core based ones, and ones in Tika Parsers that do things - * that {@link TikaConfigTest} can't, do due to a need for the - * full set of "real" classes of parsers / detectors - */ -public abstract class AbstractTikaConfigTest extends TikaTest { - protected static ParseContext context = new ParseContext(); - - protected static String getConfigPath(String config) throws Exception { - URL url = TikaConfig.class.getResource(config); - assertNotNull("Test Tika Config not found: " + config, url); - return url.toExternalForm(); - } - protected static TikaConfig getConfig(String config) throws Exception { - System.setProperty("tika.config", getConfigPath(config)); - return new TikaConfig(); - } - - @After - public void resetConfig() { - System.clearProperty("tika.config"); - } -} diff --git a/tika-core/src/test/java/org/apache/tika/config/DummyExecutor.java b/tika-core/src/test/java/org/apache/tika/config/DummyExecutor.java 
deleted file mode 100644 index c9b5dec..0000000 --- a/tika-core/src/test/java/org/apache/tika/config/DummyExecutor.java +++ /dev/null @@ -1,30 +0,0 @@ -/* - * Licensed to the Apache Software Foundation (ASF) under one or more - * contributor license agreements. See the NOTICE file distributed with - * this work for additional information regarding copyright ownership. - * The ASF licenses this file to You under the Apache License, Version 2.0 - * (the "License"); you may not use this file except in compliance with - * the License. You may obtain a copy of the License at - * - * http://www.apache.org/licenses/LICENSE-2.0 - * - * Unless required by applicable law or agreed to in writing, software - * distributed under the License is distributed on an "AS IS" BASIS, - * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. - * See the License for the specific language governing permissions and - * limitations under the License. - */ -package org.apache.tika.config; - -import java.util.concurrent.LinkedBlockingQueue; -import java.util.concurrent.ThreadPoolExecutor; -import java.util.concurrent.TimeUnit; - -import org.apache.tika.concurrent.ConfigurableThreadPoolExecutor; - -public class DummyExecutor extends ThreadPoolExecutor implements ConfigurableThreadPoolExecutor { - public DummyExecutor() - { - super(1,1, 0L, TimeUnit.SECONDS, new LinkedBlockingQueue()); - } -} diff --git a/tika-core/src/test/java/org/apache/tika/config/DummyParser.java b/tika-core/src/test/java/org/apache/tika/config/DummyParser.java deleted file mode 100644 index 78caa5c..0000000 --- a/tika-core/src/test/java/org/apache/tika/config/DummyParser.java +++ /dev/null @@ -1,38 +0,0 @@ -/* - * Licensed to the Apache Software Foundation (ASF) under one or more - * contributor license agreements. See the NOTICE file distributed with - * this work for additional information regarding copyright ownership. 
- * The ASF licenses this file to You under the Apache License, Version 2.0 - * (the "License"); you may not use this file except in compliance with - * the License. You may obtain a copy of the License at - * - * http://www.apache.org/licenses/LICENSE-2.0 - * - * Unless required by applicable law or agreed to in writing, software - * distributed under the License is distributed on an "AS IS" BASIS, - * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. - * See the License for the specific language governing permissions and - * limitations under the License. - */ -package org.apache.tika.config; - -import java.util.Collection; - -import org.apache.tika.mime.MediaTypeRegistry; -import org.apache.tika.parser.CompositeParser; -import org.apache.tika.parser.Parser; - -public class DummyParser extends CompositeParser implements Parser { - private static final long serialVersionUID = 7179782154785528555L; - - private ServiceLoader loader; - - public DummyParser(MediaTypeRegistry registry, ServiceLoader loader, - Collection> excludeParsers) { - this.loader = loader; - } - - public ServiceLoader getLoader() { - return loader; - } -} \ No newline at end of file diff --git a/tika-core/src/test/java/org/apache/tika/config/TikaConfigTest.java b/tika-core/src/test/java/org/apache/tika/config/TikaConfigTest.java index fed0968..59e9dcf 100644 --- a/tika-core/src/test/java/org/apache/tika/config/TikaConfigTest.java +++ b/tika-core/src/test/java/org/apache/tika/config/TikaConfigTest.java @@ -16,26 +16,14 @@ */ package org.apache.tika.config; -import java.net.URI; import java.net.URL; -import java.nio.file.Path; -import java.nio.file.Paths; import java.util.List; import java.util.Map; -import java.util.concurrent.ThreadPoolExecutor; import org.apache.tika.ResourceLoggingClassLoader; -import org.apache.tika.config.DummyExecutor; -import org.apache.tika.config.TikaConfig; -import org.apache.tika.config.TikaConfigTest; import 
org.apache.tika.exception.TikaException; import org.apache.tika.parser.AutoDetectParser; -import org.apache.tika.parser.CompositeParser; import org.apache.tika.parser.DefaultParser; -import org.apache.tika.parser.EmptyParser; -import org.apache.tika.parser.ErrorParser; -import org.apache.tika.parser.Parser; -import org.apache.tika.parser.ParserDecorator; import org.junit.Test; import static org.junit.Assert.assertEquals; @@ -43,14 +31,8 @@ import static org.junit.Assert.assertTrue; import static org.junit.Assert.fail; -/** - * Tests for the Tika Config, which don't require real parsers / - * detectors / etc. - * There's also {@link TikaParserConfigTest} and {@link TikaDetectorConfigTest} - * over in the Tika Parsers project, which do further Tika Config - * testing using real parsers and detectors. - */ -public class TikaConfigTest extends AbstractTikaConfigTest { +public class TikaConfigTest { + /** * Make sure that a configuration file can't reference the * {@link AutoDetectParser} class a <parser> configuration element. @@ -58,42 +40,16 @@ * @see TIKA-866 */ @Test - public void withInvalidParser() throws Exception { + public void testInvalidParser() throws Exception { + URL url = TikaConfigTest.class.getResource("TIKA-866-invalid.xml"); + System.setProperty("tika.config", url.toExternalForm()); try { - getConfig("TIKA-866-invalid.xml"); + new TikaConfig(); fail("AutoDetectParser allowed in a element"); - } catch (TikaException expected) {} - } - - /** - * Make sure that with a service loader given, we can - * get different configurable behaviour on parser classes - * which can't be found. 
- */ - @Test - public void testUnknownParser() throws Exception { - ServiceLoader ignoreLoader = new ServiceLoader( - getClass().getClassLoader(), LoadErrorHandler.IGNORE); - ServiceLoader warnLoader = new ServiceLoader( - getClass().getClassLoader(), LoadErrorHandler.WARN); - ServiceLoader throwLoader = new ServiceLoader( - getClass().getClassLoader(), LoadErrorHandler.THROW); - Path configPath = Paths.get(new URI(getConfigPath("TIKA-1700-unknown-parser.xml"))); - - TikaConfig ignore = new TikaConfig(configPath, ignoreLoader); - assertNotNull(ignore); - assertNotNull(ignore.getParser()); - assertEquals(1, ((CompositeParser)ignore.getParser()).getAllComponentParsers().size()); - - TikaConfig warn = new TikaConfig(configPath, warnLoader); - assertNotNull(warn); - assertNotNull(warn.getParser()); - assertEquals(1, ((CompositeParser)warn.getParser()).getAllComponentParsers().size()); - - try { - new TikaConfig(configPath, throwLoader); - fail("Shouldn't get here, invalid parser class"); - } catch (TikaException expected) {} + } catch (TikaException expected) { + } finally { + System.clearProperty("tika.config"); + } } /** @@ -103,12 +59,15 @@ * * @see TIKA-866 */ - @Test - public void asCompositeParser() throws Exception { + public void testCompositeParser() throws Exception { + URL url = TikaConfigTest.class.getResource("TIKA-866-composite.xml"); + System.setProperty("tika.config", url.toExternalForm()); try { - getConfig("TIKA-866-composite.xml"); + new TikaConfig(); } catch (TikaException e) { fail("Unexpected TikaException: " + e); + } finally { + System.clearProperty("tika.config"); } } @@ -118,12 +77,15 @@ * * @see TIKA-866 */ - @Test - public void onlyValidParser() throws Exception { + public void testValidParser() throws Exception { + URL url = TikaConfigTest.class.getResource("TIKA-866-valid.xml"); + System.setProperty("tika.config", url.toExternalForm()); try { - getConfig("TIKA-866-valid.xml"); + new TikaConfig(); } catch (TikaException e) { 
fail("Unexpected TikaException: " + e); + } finally { + System.clearProperty("tika.config"); } } @@ -132,8 +94,7 @@ * that should be used when loading the mimetypes and when * discovering services */ - @Test - public void ensureClassLoaderUsedEverywhere() throws Exception { + public void testClassLoaderUsedEverywhere() throws Exception { ResourceLoggingClassLoader customLoader = new ResourceLoggingClassLoader(getClass().getClassLoader()); TikaConfig config; @@ -166,99 +127,4 @@ // - Custom Mimetypes assertNotNull(resources.get("org/apache/tika/mime/custom-mimetypes.xml")); } - - /** - * TIKA-1445 It should be possible to exclude DefaultParser from - * certain types, so another parser explicitly listed will take them - */ - @Test - public void defaultParserWithExcludes() throws Exception { - try { - TikaConfig config = getConfig("TIKA-1445-default-except.xml"); - - CompositeParser cp = (CompositeParser)config.getParser(); - List parsers = cp.getAllComponentParsers(); - Parser p; - - // Will be the three parsers defined in the xml - assertEquals(3, parsers.size()); - - // Should have a wrapped DefaultParser, not the main DefaultParser, - // as it is excluded from handling certain classes - p = parsers.get(0); - assertTrue(p.toString(), p instanceof ParserDecorator); - assertEquals(DefaultParser.class, ((ParserDecorator)p).getWrappedParser().getClass()); - - // Should have two others which claim things, which they wouldn't - // otherwise handle - p = parsers.get(1); - assertTrue(p.toString(), p instanceof ParserDecorator); - assertEquals(EmptyParser.class, ((ParserDecorator)p).getWrappedParser().getClass()); - assertEquals("hello/world", p.getSupportedTypes(null).iterator().next().toString()); - - p = parsers.get(2); - assertTrue(p.toString(), p instanceof ParserDecorator); - assertEquals(ErrorParser.class, ((ParserDecorator)p).getWrappedParser().getClass()); - assertEquals("fail/world", p.getSupportedTypes(null).iterator().next().toString()); - } catch (TikaException 
e) { - fail("Unexpected TikaException: " + e); - } - } - - /** - * TIKA-1653 If one parser has child parsers, those child parsers shouldn't - * show up at the top level as well - */ - @Test - public void parserWithChildParsers() throws Exception { - try { - TikaConfig config = getConfig("TIKA-1653-norepeat.xml"); - - CompositeParser cp = (CompositeParser)config.getParser(); - List parsers = cp.getAllComponentParsers(); - Parser p; - - // Just 2 top level parsers - assertEquals(2, parsers.size()); - - // Should have a CompositeParser with 2 child ones, and - // and a wrapped empty parser - p = parsers.get(0); - assertTrue(p.toString(), p instanceof CompositeParser); - assertEquals(2, ((CompositeParser)p).getAllComponentParsers().size()); - - p = parsers.get(1); - assertTrue(p.toString(), p instanceof ParserDecorator); - assertEquals(EmptyParser.class, ((ParserDecorator)p).getWrappedParser().getClass()); - assertEquals("hello/world", p.getSupportedTypes(null).iterator().next().toString()); - } catch (TikaException e) { - fail("Unexpected TikaException: " + e); - } - } - - @Test - public void testDynamicServiceLoaderFromConfig() throws Exception { - URL url = TikaConfigTest.class.getResource("TIKA-1700-dynamic.xml"); - TikaConfig config = new TikaConfig(url); - - DummyParser parser = (DummyParser)config.getParser(); - - ServiceLoader loader = parser.getLoader(); - boolean dynamicValue = loader.isDynamic(); - - assertTrue("Dynamic Service Loading Should be true", dynamicValue); - } - - @Test - public void testTikaExecutorServiceFromConfig() throws Exception { - URL url = TikaConfigTest.class.getResource("TIKA-1762-executors.xml"); - - TikaConfig config = new TikaConfig(url); - - ThreadPoolExecutor executorService = (ThreadPoolExecutor)config.getExecutorService(); - - assertTrue("Should use Dummy Executor", (executorService instanceof DummyExecutor)); - assertEquals("Should have configured Core Threads", 3, executorService.getCorePoolSize()); - assertEquals("Should have 
configured Max Threads", 10, executorService.getMaximumPoolSize()); - } }diff --git a/tika-core/src/test/java/org/apache/tika/detect/MagicDetectorTest.java b/tika-core/src/test/java/org/apache/tika/detect/MagicDetectorTest.java index 3b9d373..67f21b5 100644 --- a/tika-core/src/test/java/org/apache/tika/detect/MagicDetectorTest.java +++ b/tika-core/src/test/java/org/apache/tika/detect/MagicDetectorTest.java @@ -24,9 +24,6 @@ import org.apache.tika.mime.MediaType; import org.junit.Test; -import static java.nio.charset.StandardCharsets.US_ASCII; -import static java.nio.charset.StandardCharsets.UTF_16BE; -import static java.nio.charset.StandardCharsets.UTF_16LE; import static org.junit.Assert.assertEquals; import static org.junit.Assert.fail; @@ -38,7 +35,7 @@ @Test public void testDetectNull() throws Exception { MediaType html = new MediaType("text", "html"); - Detector detector = new MagicDetector(html, ""); @@ -61,7 +58,7 @@ public void testDetectOffsetRange() throws Exception { MediaType html = new MediaType("text", "html"); Detector detector = new MagicDetector( - html, ""); @@ -114,7 +111,7 @@ public void testDetectRegExPDF() throws Exception { MediaType pdf = new MediaType("application", "pdf"); Detector detector = new MagicDetector( - pdf, "(?s)\\A.{0,144}%PDF-".getBytes(US_ASCII), null, true, 0, 0); + pdf, "(?s)\\A.{0,144}%PDF-".getBytes("ASCII"), null, true, 0, 0); assertDetect(detector, pdf, "%PDF-1.0"); assertDetect( @@ -139,7 +136,7 @@ + "\".*\\x3ctitle\\x3e.*\\x3c/title\\x3e"; MediaType xhtml = new MediaType("application", "xhtml+xml"); Detector detector = new MagicDetector(xhtml, - pattern.getBytes(US_ASCII), null, + pattern.getBytes("ASCII"), null, true, 0, 8192); assertDetect(detector, xhtml, @@ -174,7 +171,7 @@ MediaType html = new MediaType("text", "html"); Detector detector = new MagicDetector( - html, pattern.getBytes(US_ASCII), null, true, 0, 0); + html, pattern.getBytes("ASCII"), null, true, 0, 0); assertDetect(detector, html, data); 
assertDetect(detector, html, data1); @@ -183,7 +180,7 @@ @Test public void testDetectStreamReadProblems() throws Exception { - byte[] data = "abcdefghijklmnopqrstuvwxyz0123456789".getBytes(US_ASCII); + byte[] data = "abcdefghijklmnopqrstuvwxyz0123456789".getBytes("ASCII"); MediaType testMT = new MediaType("application", "test"); Detector detector = new MagicDetector(testMT, data, null, false, 0, 0); // Deliberately prevent InputStream.read(...) from reading the entire @@ -200,24 +197,28 @@ // Check regular String matching detector = MagicDetector.parse(testMT, "string", "0:20", "abcd", null); - assertDetect(detector, testMT, data.getBytes(US_ASCII)); + assertDetect(detector, testMT, data.getBytes("ASCII")); detector = MagicDetector.parse(testMT, "string", "0:20", "cdEFGh", null); - assertDetect(detector, testMT, data.getBytes(US_ASCII)); + assertDetect(detector, testMT, data.getBytes("ASCII")); // Check Little Endian and Big Endian utf-16 strings detector = MagicDetector.parse(testMT, "unicodeLE", "0:20", "cdEFGh", null); - assertDetect(detector, testMT, data.getBytes(UTF_16LE)); + assertDetect(detector, testMT, data.getBytes("UTF-16LE")); detector = MagicDetector.parse(testMT, "unicodeBE", "0:20", "cdEFGh", null); - assertDetect(detector, testMT, data.getBytes(UTF_16BE)); + assertDetect(detector, testMT, data.getBytes("UTF-16BE")); // Check case ignoring String matching detector = MagicDetector.parse(testMT, "stringignorecase", "0:20", "BcDeFgHiJKlm", null); - assertDetect(detector, testMT, data.getBytes(US_ASCII)); + assertDetect(detector, testMT, data.getBytes("ASCII")); } private void assertDetect(Detector detector, MediaType type, String data) { - byte[] bytes = data.getBytes(US_ASCII); - assertDetect(detector, type, bytes); + try { + byte[] bytes = data.getBytes("ASCII"); + assertDetect(detector, type, bytes); + } catch (IOException e) { + fail("Unexpected exception from MagicDetector"); + } } private void assertDetect(Detector detector, MediaType type, 
byte[] bytes) { try { diff --git a/tika-core/src/test/java/org/apache/tika/detect/MimeDetectionWithNNTest.java b/tika-core/src/test/java/org/apache/tika/detect/MimeDetectionWithNNTest.java deleted file mode 100644 index c815607..0000000 --- a/tika-core/src/test/java/org/apache/tika/detect/MimeDetectionWithNNTest.java +++ /dev/null @@ -1,140 +0,0 @@ -/* - * Licensed to the Apache Software Foundation (ASF) under one or more - * contributor license agreements. See the NOTICE file distributed with - * this work for additional information regarding copyright ownership. - * The ASF licenses this file to You under the Apache License, Version 2.0 - * (the "License"); you may not use this file except in compliance with - * the License. You may obtain a copy of the License at - * - * http://www.apache.org/licenses/LICENSE-2.0 - * - * Unless required by applicable law or agreed to in writing, software - * distributed under the License is distributed on an "AS IS" BASIS, - * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. - * See the License for the specific language governing permissions and - * limitations under the License. 
- */ -package org.apache.tika.detect; - -import static org.junit.Assert.assertEquals; - -import java.io.ByteArrayInputStream; -import java.io.IOException; -import java.io.InputStream; - -import org.apache.tika.metadata.Metadata; -import org.apache.tika.mime.MediaType; -import org.apache.tika.mime.MimeDetectionTest; -import org.junit.Before; -import org.junit.Test; - -public class MimeDetectionWithNNTest { - - private Detector detector; - - /** @inheritDoc */ - @Before - public void setUp() { - detector = new NNExampleModelDetector(); - } - - /** - * The test case only works on the detector that only has grb model as - * currently the grb model is used as an example; if more models are added - * in the TrainedModelDetector, the following tests will need to modified to reflect - * the corresponding type instead of test-equal with the "OCTET_STREAM"; - * - * @throws Exception - */ - @Test - public void testDetection() throws Exception { - String octetStream_str = MediaType.OCTET_STREAM.toString(); - String grb_str = "application/x-grib"; - - testFile(grb_str, "gdas1.forecmwf.2014062612.grib2"); - testFile(grb_str, "GLDAS_CLM10SUBP_3H.A19790202.0000.001.grb"); - - testFile(octetStream_str, "circles.svg"); - testFile(octetStream_str, "circles-with-prefix.svg"); - testFile(octetStream_str, "datamatrix.png"); - testFile(octetStream_str, "test.html"); - testFile(octetStream_str, "test-iso-8859-1.xml"); - testFile(octetStream_str, "test-utf8.xml"); - testFile(octetStream_str, "test-utf8-bom.xml"); - testFile(octetStream_str, "test-utf16le.xml"); - testFile(octetStream_str, "test-utf16be.xml"); - testFile(octetStream_str, "test-long-comment.xml"); - testFile(octetStream_str, "stylesheet.xsl"); - testUrl(octetStream_str, - "http://www.ai.sri.com/daml/services/owl-s/1.2/Process.owl", - "test-difficult-rdf1.xml"); - testUrl(octetStream_str, "http://www.w3.org/2002/07/owl#", - "test-difficult-rdf2.xml"); - // add evil test from TIKA-327 - testFile(octetStream_str, 
"test-tika-327.html"); - // add another evil html test from TIKA-357 - testFile(octetStream_str, "testlargerbuffer.html"); - // test fragment of HTML with
<div>

    (TIKA-1102) - testFile(octetStream_str, "htmlfragment"); - // test binary CGM detection (TIKA-1170) - testFile(octetStream_str, "plotutils-bin-cgm-v3.cgm"); - // test HTML detection of malformed file, previously identified as - // image/cgm (TIKA-1170) - testFile(octetStream_str, "test-malformed-header.html.bin"); - - // test GCMD Directory Interchange Format (.dif) TIKA-1561 - testFile(octetStream_str, "brwNIMS_2014.dif"); - } - - private void testUrl(String expected, String url, String file) - throws IOException { - InputStream in = MimeDetectionTest.class.getResourceAsStream(file); - testStream(expected, url, in); - } - - private void testFile(String expected, String filename) throws IOException { - - InputStream in = MimeDetectionTest.class.getResourceAsStream(filename); - testStream(expected, filename, in); - } - - private void testStream(String expected, String urlOrFileName, - InputStream in) throws IOException { - assertNotNull("Test stream: [" + urlOrFileName + "] is null!", in); - if (!in.markSupported()) { - in = new java.io.BufferedInputStream(in); - } - try { - Metadata metadata = new Metadata(); - String mime = this.detector.detect(in, metadata).toString(); - assertEquals( - urlOrFileName + " is not properly detected: detected.", - expected, mime); - - // Add resource name and test again - // metadata.set(Metadata.RESOURCE_NAME_KEY, urlOrFileName); - mime = this.detector.detect(in, metadata).toString(); - assertEquals(urlOrFileName - + " is not properly detected after adding resource name.", - expected, mime); - } finally { - in.close(); - } - } - - private void assertNotNull(String string, InputStream in) { - // TODO Auto-generated method stub - - } - - /** - * Test for type detection of empty documents. 
- */ - @Test - public void testEmptyDocument() throws IOException { - assertEquals(MediaType.OCTET_STREAM, detector.detect( - new ByteArrayInputStream(new byte[0]), new Metadata())); - - } - -} diff --git a/tika-core/src/test/java/org/apache/tika/detect/TextDetectorTest.java b/tika-core/src/test/java/org/apache/tika/detect/TextDetectorTest.java index 6bf46d3..c50fc14 100644 --- a/tika-core/src/test/java/org/apache/tika/detect/TextDetectorTest.java +++ b/tika-core/src/test/java/org/apache/tika/detect/TextDetectorTest.java @@ -25,7 +25,6 @@ import org.apache.tika.mime.MediaType; import org.junit.Test; -import static java.nio.charset.StandardCharsets.UTF_8; import static org.junit.Assert.assertEquals; import static org.junit.Assert.fail; @@ -55,8 +54,8 @@ @Test public void testDetectText() throws Exception { - assertText("Hello, World!".getBytes(UTF_8)); - assertText(" \t\r\n".getBytes(UTF_8)); + assertText("Hello, World!".getBytes("UTF-8")); + assertText(" \t\r\n".getBytes("UTF-8")); assertNotText(new byte[] { -1, -2, -3, 0x09, 0x0A, 0x0C, 0x0D, 0x1B }); assertNotText(new byte[] { 0 }); assertNotText(new byte[] { 'H', 'e', 'l', 'l', 'o', 0 }); diff --git a/tika-core/src/test/java/org/apache/tika/io/EndianUtilsTest.java b/tika-core/src/test/java/org/apache/tika/io/EndianUtilsTest.java deleted file mode 100644 index 7f10cdd..0000000 --- a/tika-core/src/test/java/org/apache/tika/io/EndianUtilsTest.java +++ /dev/null @@ -1,40 +0,0 @@ -/* - * Licensed to the Apache Software Foundation (ASF) under one or more - * contributor license agreements. See the NOTICE file distributed with - * this work for additional information regarding copyright ownership. - * The ASF licenses this file to You under the Apache License, Version 2.0 - * (the "License"); you may not use this file except in compliance with - * the License. 
You may obtain a copy of the License at - * - * http://www.apache.org/licenses/LICENSE-2.0 - * - * Unless required by applicable law or agreed to in writing, software - * distributed under the License is distributed on an "AS IS" BASIS, - * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. - * See the License for the specific language governing permissions and - * limitations under the License. - */ - -package org.apache.tika.io; - -import static org.junit.Assert.assertEquals; - -import java.io.ByteArrayInputStream; - -import org.junit.Test; - -public class EndianUtilsTest { - @Test - public void testReadUE7() throws Exception { - byte[] data; - - data = new byte[] { 0x08 }; - assertEquals((long)8, EndianUtils.readUE7(new ByteArrayInputStream(data))); - - data = new byte[] { (byte)0x84, 0x1e }; - assertEquals((long)542, EndianUtils.readUE7(new ByteArrayInputStream(data))); - - data = new byte[] { (byte)0xac, (byte)0xbe, 0x17 }; - assertEquals((long)728855, EndianUtils.readUE7(new ByteArrayInputStream(data))); - } -} diff --git a/tika-core/src/test/java/org/apache/tika/io/FilenameUtilsTest.java b/tika-core/src/test/java/org/apache/tika/io/FilenameUtilsTest.java index 3a4769c..03452b3 100644 --- a/tika-core/src/test/java/org/apache/tika/io/FilenameUtilsTest.java +++ b/tika-core/src/test/java/org/apache/tika/io/FilenameUtilsTest.java @@ -17,11 +17,9 @@ package org.apache.tika.io; -import static org.junit.Assert.assertEquals; -import static org.junit.Assert.assertTrue; -import static org.junit.Assert.fail; import org.junit.Test; +import static org.junit.Assert.*; public class FilenameUtilsTest { @@ -49,8 +47,8 @@ FilenameUtils.normalize(null); fail("missing check for null parameters"); } catch (IllegalArgumentException x) { - assertTrue(x.getMessage() != null && x.getMessage().contains("name")); - assertTrue(x.getMessage() != null && x.getMessage().contains("not be null")); + assertTrue(x.getMessage().contains("name")); + 
assertTrue(x.getMessage().contains("not be null")); } } @@ -96,24 +94,5 @@ assertEquals(EXPECTED_NAME, FilenameUtils.normalize(TEST_NAME)); } - @Test - public void testGetName() throws Exception { - testFilenameEquality("quick.ppt", "C:\\the\\quick.ppt"); - testFilenameEquality("quick.ppt", "/the/quick.ppt"); - testFilenameEquality("", "/the/quick/"); - testFilenameEquality("", "~/the/quick////\\\\//"); - testFilenameEquality("~~quick", "~~quick"); - testFilenameEquality("quick.ppt", "quick.ppt"); - testFilenameEquality("", "////"); - testFilenameEquality("", "C:////"); - testFilenameEquality("", ".."); - testFilenameEquality("quick", "C:////../the/D:/quick"); - testFilenameEquality("file.ppt", "path:to:file.ppt" ); - testFilenameEquality("HW.txt", "_1457338542/HW.txt" ); - } - - private void testFilenameEquality(String expected, String path) { - assertEquals(expected, FilenameUtils.getName(path)); - } } diff --git a/tika-core/src/test/java/org/apache/tika/io/TailStreamTest.java b/tika-core/src/test/java/org/apache/tika/io/TailStreamTest.java index ba0dd2e..8232257 100644 --- a/tika-core/src/test/java/org/apache/tika/io/TailStreamTest.java +++ b/tika-core/src/test/java/org/apache/tika/io/TailStreamTest.java @@ -16,7 +16,6 @@ */ package org.apache.tika.io; -import static java.nio.charset.StandardCharsets.UTF_8; import static org.junit.Assert.assertEquals; import static org.junit.Assert.assertTrue; @@ -69,7 +68,7 @@ */ private static InputStream generateStream(int from, int length) { - return new ByteArrayInputStream(generateText(from, length).getBytes(UTF_8)); + return new ByteArrayInputStream(generateText(from, length).getBytes()); } /** @@ -124,7 +123,7 @@ TailStream stream = new TailStream(generateStream(0, 2 * count), count); readStream(stream); assertEquals("Wrong buffer", generateText(count, count), new String( - stream.getTail(), UTF_8)); + stream.getTail())); } /** @@ -145,7 +144,7 @@ read = stream.read(buf); } assertEquals("Wrong buffer", generateText(count 
- tailSize, tailSize), - new String(stream.getTail(), UTF_8)); + new String(stream.getTail())); stream.close(); } @@ -165,7 +164,7 @@ stream.reset(); readStream(stream); assertEquals("Wrong buffer", generateText(tailSize, tailSize), - new String(stream.getTail(), UTF_8)); + new String(stream.getTail())); } /** @@ -181,7 +180,7 @@ byte[] buf = new byte[count]; stream.read(buf); assertEquals("Wrong buffer", generateText(count - tailSize, tailSize), - new String(stream.getTail(), UTF_8)); + new String(stream.getTail())); stream.close(); } @@ -198,7 +197,7 @@ assertEquals("Wrong skip result", skipCount, stream.skip(skipCount)); assertEquals("Wrong buffer", generateText(skipCount - tailSize, tailSize), - new String(stream.getTail(), UTF_8)); + new String(stream.getTail())); stream.close(); } @@ -212,7 +211,7 @@ TailStream stream = new TailStream(generateStream(0, count), 2 * count); assertEquals("Wrong skip result", count, stream.skip(2 * count)); assertEquals("Wrong buffer", generateText(0, count), - new String(stream.getTail(), UTF_8)); + new String(stream.getTail())); stream.close(); } diff --git a/tika-core/src/test/java/org/apache/tika/io/TemporaryResourcesTest.java b/tika-core/src/test/java/org/apache/tika/io/TemporaryResourcesTest.java deleted file mode 100644 index 9d76641..0000000 --- a/tika-core/src/test/java/org/apache/tika/io/TemporaryResourcesTest.java +++ /dev/null @@ -1,39 +0,0 @@ -/* - * Licensed to the Apache Software Foundation (ASF) under one or more - * contributor license agreements. See the NOTICE file distributed with - * this work for additional information regarding copyright ownership. - * The ASF licenses this file to You under the Apache License, Version 2.0 - * (the "License"); you may not use this file except in compliance with - * the License. 
You may obtain a copy of the License at - * - * http://www.apache.org/licenses/LICENSE-2.0 - * - * Unless required by applicable law or agreed to in writing, software - * distributed under the License is distributed on an "AS IS" BASIS, - * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. - * See the License for the specific language governing permissions and - * limitations under the License. - */ -package org.apache.tika.io; - -import static org.junit.Assert.assertTrue; - -import java.io.IOException; -import java.nio.file.Files; -import java.nio.file.Path; - -import org.junit.Test; - -public class TemporaryResourcesTest { - - @Test - public void testFileDeletion() throws IOException { - Path tempFile; - try (TemporaryResources tempResources = new TemporaryResources()) { - tempFile = tempResources.createTempFile(); - assertTrue("Temp file should exist while TempResources is used", Files.exists(tempFile)); - } - assertTrue("Temp file should not exist after TempResources is closed", Files.notExists(tempFile)); - } - -} diff --git a/tika-core/src/test/java/org/apache/tika/io/TikaInputStreamTest.java b/tika-core/src/test/java/org/apache/tika/io/TikaInputStreamTest.java index 229d310..abef52b 100644 --- a/tika-core/src/test/java/org/apache/tika/io/TikaInputStreamTest.java +++ b/tika-core/src/test/java/org/apache/tika/io/TikaInputStreamTest.java @@ -16,32 +16,34 @@ */ package org.apache.tika.io; -import static java.nio.charset.StandardCharsets.UTF_8; +import java.io.ByteArrayInputStream; +import java.io.ByteArrayOutputStream; +import java.io.File; +import java.io.FileInputStream; +import java.io.FileOutputStream; +import java.io.IOException; +import java.io.InputStream; +import java.io.OutputStream; +import java.net.URL; + +import org.apache.tika.metadata.Metadata; + +import org.junit.Test; import static org.junit.Assert.assertEquals; import static org.junit.Assert.assertFalse; import static org.junit.Assert.assertTrue; - -import 
java.io.IOException;
-import java.io.InputStream;
-import java.net.URL;
-import java.nio.file.Files;
-import java.nio.file.Path;
-import java.nio.file.Paths;
-
-import org.apache.tika.metadata.Metadata;
-import org.junit.Test;

 public class TikaInputStreamTest {

     @Test
     public void testFileBased() throws IOException {
-        Path path = createTempFile("Hello, World!");
-        InputStream stream = TikaInputStream.get(path);
+        File file = createTempFile("Hello, World!");
+        InputStream stream = TikaInputStream.get(file);

         assertEquals(
                 "The file returned by the getFile() method should"
                 + " be the file used to instantiate a TikaInputStream",
-                path, TikaInputStream.get(stream).getPath());
+                file, TikaInputStream.get(stream).getFile());

         assertEquals(
                 "The contents of the TikaInputStream should equal the"
@@ -51,19 +53,20 @@
         stream.close();
         assertTrue(
                 "The close() method must not remove the file used to"
-                + " instantiate a TikaInputStream",
-                Files.exists(path));
+                + " instantiate a TikaInputStream",
+                file.exists());

-        Files.delete(path);
+        file.delete();
     }

     @Test
     public void testStreamBased() throws IOException {
-        InputStream input = IOUtils.toInputStream("Hello, World!", UTF_8.name());
+        InputStream input =
+            new ByteArrayInputStream("Hello, World!".getBytes("UTF-8"));
         InputStream stream = TikaInputStream.get(input);
-        Path file = TikaInputStream.get(stream).getPath();
-        assertTrue(file != null && Files.isRegularFile(file));
+        File file = TikaInputStream.get(stream).getFile();
+        assertTrue(file != null && file.isFile());

         assertEquals(
                 "The contents of the file returned by the getFile method"
@@ -79,21 +82,33 @@
         assertFalse(
                 "The close() method must remove the temporary file created"
                 + " by a TikaInputStream",
-                Files.exists(file));
+                file.exists());
     }

-    private Path createTempFile(String data) throws IOException {
-        Path file = Files.createTempFile("tika-", ".tmp");
-        Files.write(file, data.getBytes(UTF_8));
+    private File createTempFile(String data) throws IOException {
+        File file = File.createTempFile("tika-", ".tmp");
+        OutputStream stream = new FileOutputStream(file);
+        try {
+            stream.write(data.getBytes("UTF-8"));
+        } finally {
+            stream.close();
+        }
         return file;
     }

-    private String readFile(Path file) throws IOException {
-        return new String(Files.readAllBytes(file), UTF_8);
+    private String readFile(File file) throws IOException {
+        InputStream stream = new FileInputStream(file);
+        try {
+            return readStream(stream);
+        } finally {
+            stream.close();
+        }
     }

     private String readStream(InputStream stream) throws IOException {
-        return IOUtils.toString(stream, UTF_8.name());
+        ByteArrayOutputStream buffer = new ByteArrayOutputStream();
+        IOUtils.copy(stream, buffer);
+        return buffer.toString("UTF-8");
     }

     @Test
@@ -103,7 +118,7 @@
         TikaInputStream.get(url, metadata).close();
         assertEquals("test.txt", metadata.get(Metadata.RESOURCE_NAME_KEY));
         assertEquals(
-                Long.toString(Files.size(Paths.get(url.toURI()))),
+                Long.toString(new File(url.toURI()).length()),
                 metadata.get(Metadata.CONTENT_LENGTH));
     }

diff --git a/tika-core/src/test/java/org/apache/tika/language/LanguageIdentifierTest.java b/tika-core/src/test/java/org/apache/tika/language/LanguageIdentifierTest.java
index 0c5834b..9748d29 100644
--- a/tika-core/src/test/java/org/apache/tika/language/LanguageIdentifierTest.java
+++ b/tika-core/src/test/java/org/apache/tika/language/LanguageIdentifierTest.java
@@ -16,17 +16,15 @@
  */
 package org.apache.tika.language;

-import static java.nio.charset.StandardCharsets.UTF_8;
-import static org.junit.Assert.assertEquals;
-import static org.junit.Assert.assertFalse;
-import static org.junit.Assert.assertTrue;
-
 import java.io.IOException;
 import java.io.InputStream;
 import java.io.InputStreamReader;
 import java.io.Writer;
 import java.util.HashMap;
-import java.util.Locale;
+
+import static org.junit.Assert.assertEquals;
+import static org.junit.Assert.assertFalse;
+import static org.junit.Assert.assertTrue;

 import org.apache.tika.io.IOUtils;
 import org.junit.Before;
@@ -51,7 +49,7 @@
     public void setUp() {
         LanguageIdentifier.initProfiles();
     }
-
+
     @Test
     public void testLanguageDetection() throws IOException {
         for (String language : languages) {
@@ -107,43 +105,8 @@
         assertTrue(identifier.isReasonablyCertain());
     }

-    // Enable this to compare performance
-    public void testPerformance() throws IOException {
-        final int MRUNS = 8;
-        final int IRUNS = 10;
-        int detected = 0; // To avoid code removal by JVM or compiler
-        String lastResult = null;
-        for (int m = 0 ; m < MRUNS ; m++) {
-            LanguageProfile.useInterleaved = (m & 1) == 1; // Alternate between standard and interleaved
-            String currentResult = "";
-            final long start = System.nanoTime();
-            for (int i = 0 ; i < IRUNS ; i++) {
-                for (String language : languages) {
-                    ProfilingWriter writer = new ProfilingWriter();
-                    writeTo(language, writer);
-                    LanguageIdentifier identifier = new LanguageIdentifier(writer.getProfile());
-                    if (identifier.isReasonablyCertain()) {
-                        currentResult += identifier.getLanguage();
-                        detected++;
-                    }
-                }
-            }
-            System.out.println(String.format(Locale.ROOT,
-                    "Performed %d detections at %2d ms/test with interleaved=%b",
-                    languages.length*IRUNS, (System.nanoTime()-start)/1000000/(languages.length*IRUNS),
-                    LanguageProfile.useInterleaved));
-            if (lastResult != null) { // Might as well test that they behave the same while we're at it
-                assertEquals("This result should be equal to the last", lastResult, currentResult);
-            }
-            lastResult = currentResult;
-        }
-        if (detected == -1) {
-            System.out.println("Never encountered but keep it to guard against over-eager optimization");
-        }
-    }
-
-    @Test
-    public void testMixedLanguages() throws IOException {
+    @Test
+    public void testMixedLanguages() throws IOException {
         for (String language : languages) {
             for (String other : languages) {
                 if (!language.equals(other)) {
@@ -173,10 +136,12 @@
     }

     private void writeTo(String language, Writer writer) throws IOException {
-        try (InputStream stream =
-                LanguageIdentifierTest.class.getResourceAsStream(
-                        language + ".test")) {
-            IOUtils.copy(new InputStreamReader(stream, UTF_8), writer);
+        InputStream stream =
+                LanguageIdentifierTest.class.getResourceAsStream(language + ".test");
+        try {
+            IOUtils.copy(new InputStreamReader(stream, "UTF-8"), writer);
+        } finally {
+            stream.close();
         }
     }

diff --git a/tika-core/src/test/java/org/apache/tika/language/LanguageProfilerBuilderTest.java b/tika-core/src/test/java/org/apache/tika/language/LanguageProfilerBuilderTest.java
index 39ba686..c5409fc 100644
--- a/tika-core/src/test/java/org/apache/tika/language/LanguageProfilerBuilderTest.java
+++ b/tika-core/src/test/java/org/apache/tika/language/LanguageProfilerBuilderTest.java
@@ -17,10 +17,6 @@
 package org.apache.tika.language;

-import static java.nio.charset.StandardCharsets.UTF_8;
-import static org.junit.Assert.assertEquals;
-import static org.junit.Assert.assertTrue;
-
 import java.io.BufferedReader;
 import java.io.File;
 import java.io.FileInputStream;
@@ -34,6 +30,9 @@
 import org.junit.After;
 import org.junit.Test;

+import static org.junit.Assert.assertEquals;
+import static org.junit.Assert.assertTrue;
+
 public class LanguageProfilerBuilderTest {
     /* Test members */
     private LanguageProfilerBuilder ngramProfile = null;
@@ -41,14 +40,19 @@
     private final String profileName = "../tika-core/src/test/resources/org/apache/tika/language/langbuilder/"
             + LanguageProfilerBuilderTest.class.getName();
     private final String corpusName = "langbuilder/welsh_corpus.txt";
+    private final String encoding = "UTF-8";
     private final String FILE_EXTENSION = "ngp";
     private final String LANGUAGE = "welsh";
     private final int maxlen = 1000;

     @Test
     public void testCreateProfile() throws TikaException, IOException, URISyntaxException {
-        try (InputStream is = LanguageProfilerBuilderTest.class.getResourceAsStream(corpusName)) {
-            ngramProfile = LanguageProfilerBuilder.create(profileName, is, UTF_8.name());
+        InputStream is =
+                LanguageProfilerBuilderTest.class.getResourceAsStream(corpusName);
+        try {
+            ngramProfile = LanguageProfilerBuilder.create(profileName, is , encoding);
+        } finally {
+            is.close();
         }

         File f = new File(profileName + "." + FILE_EXTENSION);
@@ -74,9 +78,11 @@
         langProfile = new LanguageProfile();

-        try (InputStream stream = new FileInputStream(new File(profileName + "." + FILE_EXTENSION))) {
+        InputStream stream = new FileInputStream(new File(profileName + "."
+                + FILE_EXTENSION));
+        try {
             BufferedReader reader = new BufferedReader(new InputStreamReader(
-                    stream, UTF_8));
+                    stream, encoding));
             String line = reader.readLine();
             while (line != null) {
                 if (line.length() > 0 && !line.startsWith("#")) {// skips the
@@ -88,6 +94,8 @@
                 }
                 line = reader.readLine();
             }
+        } finally {
+            stream.close();
+        }
     }

diff --git a/tika-core/src/test/java/org/apache/tika/metadata/TestMetadata.java b/tika-core/src/test/java/org/apache/tika/metadata/TestMetadata.java
index b9bb14f..791c4ff 100644
--- a/tika-core/src/test/java/org/apache/tika/metadata/TestMetadata.java
+++ b/tika-core/src/test/java/org/apache/tika/metadata/TestMetadata.java
@@ -20,9 +20,8 @@
 import java.util.Date;
 import java.util.Properties;

-import org.apache.tika.utils.DateUtils;
+
 import org.junit.Test;
-
 //Junit imports
 import static org.junit.Assert.assertEquals;
@@ -338,15 +337,9 @@
     public void testGetSetDateUnspecifiedTimezone() {
         Metadata meta = new Metadata();

-        // Set explictly without a timezone
         meta.set(TikaCoreProperties.CREATED, "1970-01-01T00:00:01");
         assertEquals("should return string without time zone specifier because zone is not known",
                 "1970-01-01T00:00:01", meta.get(TikaCoreProperties.CREATED));
-
-        // Now ask DateUtils to format for us without one
-        meta.set(TikaCoreProperties.CREATED, DateUtils.formatDateUnknownTimezone(new Date(1000)));
-        assertEquals("should return string without time zone specifier because zone is not known",
-                "1970-01-01T00:00:01", meta.get(TikaCoreProperties.CREATED));
     }

     /**
diff --git a/tika-core/src/test/java/org/apache/tika/mime/MimeDetectionTest.java b/tika-core/src/test/java/org/apache/tika/mime/MimeDetectionTest.java
index 1f986da..181612e 100644
--- a/tika-core/src/test/java/org/apache/tika/mime/MimeDetectionTest.java
+++ b/tika-core/src/test/java/org/apache/tika/mime/MimeDetectionTest.java
@@ -16,9 +16,6 @@
  */
 package org.apache.tika.mime;

-import static java.nio.charset.StandardCharsets.UTF_16BE;
-import static java.nio.charset.StandardCharsets.UTF_16LE;
-import static java.nio.charset.StandardCharsets.UTF_8;
 import static org.junit.Assert.assertEquals;
 import static org.junit.Assert.assertTrue;
@@ -29,6 +26,7 @@
 import org.apache.tika.config.TikaConfig;
 import org.apache.tika.metadata.Metadata;
+
 import org.junit.Before;
 import org.junit.Test;
@@ -76,21 +74,18 @@
         testFile("image/cgm", "plotutils-bin-cgm-v3.cgm");
         // test HTML detection of malformed file, previously identified as image/cgm (TIKA-1170)
         testFile("text/html", "test-malformed-header.html.bin");
-
-        //test GCMD Directory Interchange Format (.dif) TIKA-1561
-        testFile("application/dif+xml", "brwNIMS_2014.dif");
     }

     @Test
     public void testByteOrderMark() throws Exception {
         assertEquals(MediaType.TEXT_PLAIN, mimeTypes.detect(
-                new ByteArrayInputStream("\ufefftest".getBytes(UTF_16LE)),
-                new Metadata()));
-        assertEquals(MediaType.TEXT_PLAIN, mimeTypes.detect(
-                new ByteArrayInputStream("\ufefftest".getBytes(UTF_16BE)),
-                new Metadata()));
-        assertEquals(MediaType.TEXT_PLAIN, mimeTypes.detect(
-                new ByteArrayInputStream("\ufefftest".getBytes(UTF_8)),
+                new ByteArrayInputStream("\ufefftest".getBytes("UTF-16LE")),
+                new Metadata()));
+        assertEquals(MediaType.TEXT_PLAIN, mimeTypes.detect(
+                new ByteArrayInputStream("\ufefftest".getBytes("UTF-16BE")),
+                new Metadata()));
+        assertEquals(MediaType.TEXT_PLAIN, mimeTypes.detect(
+                new ByteArrayInputStream("\ufefftest".getBytes("UTF-8")),
                 new Metadata()));
     }
@@ -200,7 +195,7 @@
     @Test
     public void testNotXML() throws IOException {
         assertEquals(MediaType.TEXT_PLAIN, mimeTypes.detect(
-                new ByteArrayInputStream("".getBytes(UTF_8)),
+                new ByteArrayInputStream("".getBytes("UTF-8")),
                 new Metadata()));
     }
@@ -216,31 +211,4 @@
         }
     }

-    /**
-     * Tests that when two magic matches both apply, and both
-     * have the same priority, we use the name to pick the
-     * right one based on the glob, or the first one we
-     * come across if not. See TIKA-1292 for more details.
-     */
-    @Test
-    public void testMimeMagicClashSamePriority() throws IOException {
-        byte[] helloWorld = "Hello, World!".getBytes(UTF_8);
-        MediaType helloType = MediaType.parse("hello/world-file");
-        MediaType helloXType = MediaType.parse("hello/x-world-hello");
-        Metadata metadata;
-
-        // With a filename, picks the right one
-        metadata = new Metadata();
-        metadata.set(Metadata.RESOURCE_NAME_KEY, "test.hello.world");
-        assertEquals(helloType, mimeTypes.detect(new ByteArrayInputStream(helloWorld), metadata));
-
-        metadata = new Metadata();
-        metadata.set(Metadata.RESOURCE_NAME_KEY, "test.x-hello-world");
-        assertEquals(helloXType, mimeTypes.detect(new ByteArrayInputStream(helloWorld), metadata));
-
-        // Without, goes for the one that sorts last
-        metadata = new Metadata();
-        metadata.set(Metadata.RESOURCE_NAME_KEY, "testingTESTINGtesting");
-        assertEquals(helloXType, mimeTypes.detect(new ByteArrayInputStream(helloWorld), metadata));
-    }
 }
diff --git a/tika-core/src/test/java/org/apache/tika/mime/MimeTypesReaderTest.java b/tika-core/src/test/java/org/apache/tika/mime/MimeTypesReaderTest.java
index a5681b3..7c2f829 100644
--- a/tika-core/src/test/java/org/apache/tika/mime/MimeTypesReaderTest.java
+++ b/tika-core/src/test/java/org/apache/tika/mime/MimeTypesReaderTest.java
@@ -16,22 +16,20 @@
  */
 package org.apache.tika.mime;

-import static java.nio.charset.StandardCharsets.US_ASCII;
+import java.io.ByteArrayInputStream;
+import java.lang.reflect.Field;
+import java.util.ArrayList;
+import java.util.List;
+
+import org.apache.tika.config.TikaConfig;
+import org.apache.tika.metadata.Metadata;
+
+import org.junit.Before;
+import org.junit.Test;
 import static org.junit.Assert.assertEquals;
 import static org.junit.Assert.assertNotNull;
 import static org.junit.Assert.assertTrue;
 import static org.junit.Assert.fail;
-
-import java.io.ByteArrayInputStream;
-import java.lang.reflect.Field;
-import java.util.ArrayList;
-import java.util.List;
-import java.util.Set;
-
-import org.apache.tika.config.TikaConfig;
-import org.apache.tika.metadata.Metadata;
-import org.junit.Before;
-import org.junit.Test;

 /**
  * These tests try to ensure that the MimeTypesReader
@@ -144,49 +142,18 @@
                 mime.getLinks().get(0).toString());
     }

-    @Test
-    public void testReadParameterHierarchy() throws Exception {
-        MimeType mimeBTree4 = this.mimeTypes.forName("application/x-berkeley-db;format=btree;version=4");
-        MediaType mtBTree4 = mimeBTree4.getType();
-
-        // Canonicalised with spaces
-        assertEquals("application/x-berkeley-db; format=btree; version=4", mimeBTree4.toString());
-        assertEquals("application/x-berkeley-db; format=btree; version=4", mtBTree4.toString());
-
-        // Parent has one parameter
-        MediaType mtBTree = this.mimeTypes.getMediaTypeRegistry().getSupertype(mtBTree4);
-        assertEquals("application/x-berkeley-db; format=btree", mtBTree.toString());
-
-        // Parent has several children, for versions 2 through 4
-        Set mtBTreeChildren = this.mimeTypes.getMediaTypeRegistry().getChildTypes(mtBTree);
-        assertTrue(mtBTreeChildren.toString(), mtBTreeChildren.size() >= 3);
-        assertTrue(mtBTreeChildren.toString(), mtBTreeChildren.contains(mtBTree4));
-
-        // Parent of that has none
-        MediaType mtBD = this.mimeTypes.getMediaTypeRegistry().getSupertype(mtBTree);
-        assertEquals("application/x-berkeley-db", mtBD.toString());
-
-        // If we use one with parameters not known in the media registry,
-        // getting the parent will return the non-parameter version
-        MediaType mtAlt = MediaType.application("x-berkeley-db; format=unknown; version=42");
-        MediaType mtAltP = this.mimeTypes.getMediaTypeRegistry().getSupertype(mtAlt);
-        assertEquals("application/x-berkeley-db", mtAltP.toString());
-    }
-
     /**
      * TIKA-746 Ensures that the custom mimetype maps were also
      * loaded and used
      */
     @Test
     public void testCustomMimeTypes() {
-        // Check that it knows about our three special ones
+        // Check that it knows about our two special ones
         String helloWorld = "hello/world";
         String helloWorldFile = "hello/world-file";
-        String helloXWorld = "hello/x-world-hello";
         try {
             assertNotNull(this.mimeTypes.forName(helloWorld));
             assertNotNull(this.mimeTypes.forName(helloWorldFile));
-            assertNotNull(this.mimeTypes.forName(helloXWorld));
         } catch (Exception e) {
             fail(e.getMessage());
         }
@@ -195,23 +162,15 @@
         try {
             MimeType hw = this.mimeTypes.forName(helloWorld);
             MimeType hwf = this.mimeTypes.forName(helloWorldFile);
-            MimeType hxw = this.mimeTypes.forName(helloXWorld);

-            // The parent has no comments, globs, magic etc
+            // The parent has no comments, globs etc
             assertEquals("", hw.getDescription());
             assertEquals("", hw.getExtension());
             assertEquals(0, hw.getExtensions().size());
-            assertEquals(0, hw.getMagics().size());

             // The file one does
             assertEquals("A \"Hello World\" file", hwf.getDescription());
             assertEquals(".hello.world", hwf.getExtension());
-            assertEquals(1, hwf.getMagics().size());
-
-            // The alternate one has most
-            assertEquals("", hxw.getDescription());
-            assertEquals(".x-hello-world", hxw.getExtension());
-            assertEquals(1, hxw.getMagics().size());

             // Check that we can correct detect with the file one:
             // By name
@@ -219,15 +178,11 @@
             m.add(Metadata.RESOURCE_NAME_KEY, "test.hello.world");
             assertEquals(hwf.toString(), this.mimeTypes.detect(null, m).toString());

-            m = new Metadata();
-            m.add(Metadata.RESOURCE_NAME_KEY, "test.x-hello-world");
-            assertEquals(hxw.toString(), this.mimeTypes.detect(null, m).toString());
-
-            // By contents - picks the x one as that sorts later
+            // By contents
             m = new Metadata();
             ByteArrayInputStream s = new ByteArrayInputStream(
-                    "Hello, World!".getBytes(US_ASCII));
-            assertEquals(hxw.toString(), this.mimeTypes.detect(s, m).toString());
+                    "Hello, World!".getBytes("ASCII"));
+            assertEquals(hwf.toString(), this.mimeTypes.detect(s, m).toString());
         } catch (Exception e) {
             fail(e.getMessage());
         }
@@ -240,27 +195,4 @@
         assertEquals(".ppt",ext);
         assertEquals(".ppt",mt.getExtensions().get(0));
     }
-
-    @Test
-    public void testGetRegisteredMimesWithParameters() throws Exception {
-        //TIKA-1692
-
-        // Media Type always keeps details / parameters
-        String name = "application/xml; charset=UTF-8";
-        MediaType mt = MediaType.parse(name);
-        assertEquals(name, mt.toString());
-
-        // Mime type loses details not in the file
-        MimeType mimeType = this.mimeTypes.getRegisteredMimeType(name);
-        assertEquals("application/xml", mimeType.toString());
-        assertEquals(".xml", mimeType.getExtension());
-
-        // But on well-known parameters stays
-        name = "application/dita+xml; format=map";
-        mt = MediaType.parse(name);
-        assertEquals(name, mt.toString());
-        mimeType = this.mimeTypes.getRegisteredMimeType(name);
-        assertEquals(name, mimeType.toString());
-        assertEquals(".ditamap", mimeType.getExtension());
-    }
 }
diff --git a/tika-core/src/test/java/org/apache/tika/mime/ProbabilisticMimeDetectionTest.java b/tika-core/src/test/java/org/apache/tika/mime/ProbabilisticMimeDetectionTest.java
deleted file mode 100644
index 35c75b7..0000000
--- a/tika-core/src/test/java/org/apache/tika/mime/ProbabilisticMimeDetectionTest.java
+++ /dev/null
@@ -1,249 +0,0 @@
-/*
- * Licensed to the Apache Software Foundation (ASF) under one or more
- * contributor license agreements.  See the NOTICE file distributed with
- * this work for additional information regarding copyright ownership.
- * The ASF licenses this file to You under the Apache License, Version 2.0
- * (the "License"); you may not use this file except in compliance with
- * the License.  You may obtain a copy of the License at
- *
- *     http://www.apache.org/licenses/LICENSE-2.0
- *
- * Unless required by applicable law or agreed to in writing, software
- * distributed under the License is distributed on an "AS IS" BASIS,
- * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
- * See the License for the specific language governing permissions and
- * limitations under the License.
- */
-package org.apache.tika.mime;
-
-import static java.nio.charset.StandardCharsets.UTF_16BE;
-import static java.nio.charset.StandardCharsets.UTF_16LE;
-import static java.nio.charset.StandardCharsets.UTF_8;
-import static org.junit.Assert.assertEquals;
-import static org.junit.Assert.assertTrue;
-
-import java.io.ByteArrayInputStream;
-import java.io.IOException;
-import java.io.InputStream;
-import java.net.URL;
-
-import org.apache.tika.metadata.Metadata;
-import org.junit.Before;
-import org.junit.Test;
-
-public class ProbabilisticMimeDetectionTest {
-
-    private ProbabilisticMimeDetectionSelector proDetector;
-
-    private MediaTypeRegistry registry;
-
-    /** @inheritDoc */
-    @Before
-    public void setUp() {
-        proDetector = new ProbabilisticMimeDetectionSelector();
-        this.registry = proDetector.getMediaTypeRegistry();
-    }
-
-    @Test
-    public void testDetection() throws Exception {
-        testFile("image/svg+xml", "circles.svg");
-        testFile("image/svg+xml", "circles-with-prefix.svg");
-        testFile("image/png", "datamatrix.png");
-        testFile("text/html", "test.html");
-        testFile("application/xml", "test-iso-8859-1.xml");
-        testFile("application/xml", "test-utf8.xml");
-        testFile("application/xml", "test-utf8-bom.xml");
-        testFile("application/xml", "test-utf16le.xml");
-        testFile("application/xml", "test-utf16be.xml");
-        testFile("application/xml", "test-long-comment.xml");
-        testFile("application/xslt+xml", "stylesheet.xsl");
-        testUrl("application/rdf+xml",
-                "http://www.ai.sri.com/daml/services/owl-s/1.2/Process.owl",
-                "test-difficult-rdf1.xml");
-        testUrl("application/rdf+xml", "http://www.w3.org/2002/07/owl#",
-                "test-difficult-rdf2.xml");
-        // add evil test from TIKA-327
-        testFile("text/html", "test-tika-327.html");
-        // add another evil html test from TIKA-357
-        testFile("text/html", "testlargerbuffer.html");
-        // test fragment of HTML with (TIKA-1102)
-        testFile("text/html", "htmlfragment");
-        // test binary CGM detection (TIKA-1170)
-        testFile("image/cgm", "plotutils-bin-cgm-v3.cgm");
-        // test HTML detection of malformed file, previously identified as
-        // image/cgm (TIKA-1170)
-        testFile("text/html", "test-malformed-header.html.bin");
-    }
-
-    @Test
-    public void testByteOrderMark() throws Exception {
-        assertEquals(MediaType.TEXT_PLAIN, proDetector.detect(
-                new ByteArrayInputStream("\ufefftest".getBytes(UTF_16LE)),
-                new Metadata()));
-        assertEquals(MediaType.TEXT_PLAIN, proDetector.detect(
-                new ByteArrayInputStream("\ufefftest".getBytes(UTF_16BE)),
-                new Metadata()));
-
-        assertEquals(MediaType.TEXT_PLAIN, proDetector.detect(
-                new ByteArrayInputStream("\ufefftest".getBytes(UTF_8)),
-                new Metadata()));
-    }
-
-    @Test
-    public void testSuperTypes() {
-        assertTrue(registry.isSpecializationOf(
-                MediaType.parse("text/something; charset=UTF-8"),
-                MediaType.parse("text/something")));
-
-        assertTrue(registry.isSpecializationOf(
-                MediaType.parse("text/something; charset=UTF-8"),
-                MediaType.TEXT_PLAIN));
-
-        assertTrue(registry.isSpecializationOf(
-                MediaType.parse("text/something; charset=UTF-8"),
-                MediaType.OCTET_STREAM));
-
-        assertTrue(registry.isSpecializationOf(
-                MediaType.parse("text/something"), MediaType.TEXT_PLAIN));
-
-        assertTrue(registry.isSpecializationOf(
-                MediaType.parse("application/something+xml"),
-                MediaType.APPLICATION_XML));
-
-        assertTrue(registry.isSpecializationOf(
-                MediaType.parse("application/something+zip"),
-                MediaType.APPLICATION_ZIP));
-
-        assertTrue(registry.isSpecializationOf(MediaType.APPLICATION_XML,
-                MediaType.TEXT_PLAIN));
-
-        assertTrue(registry.isSpecializationOf(
-                MediaType.parse("application/vnd.apple.iwork"),
-                MediaType.APPLICATION_ZIP));
-    }
-
-    @SuppressWarnings("unused")
-    private void testUrlOnly(String expected, String url) throws IOException {
-        InputStream in = new URL(url).openStream();
-        testStream(expected, url, in);
-    }
-
-    private void testUrl(String expected, String url, String file)
-            throws IOException {
-        InputStream in = getClass().getResourceAsStream(file);
-        testStream(expected, url, in);
-    }
-
-    private void testFile(String expected, String filename) throws IOException {
-        InputStream in = getClass().getResourceAsStream(filename);
-        testStream(expected, filename, in);
-    }
-
-    private void testStream(String expected, String urlOrFileName,
-            InputStream in) throws IOException {
-        assertNotNull("Test stream: [" + urlOrFileName + "] is null!", in);
-        if (!in.markSupported()) {
-            in = new java.io.BufferedInputStream(in);
-        }
-        try {
-            Metadata metadata = new Metadata();
-            String mime = this.proDetector.detect(in, metadata).toString();
-            assertEquals(
-                    urlOrFileName + " is not properly detected: detected.",
-                    expected, mime);
-
-            // Add resource name and test again
-            metadata.set(Metadata.RESOURCE_NAME_KEY, urlOrFileName);
-            mime = this.proDetector.detect(in, metadata).toString();
-            assertEquals(urlOrFileName
-                    + " is not properly detected after adding resource name.",
-                    expected, mime);
-        } finally {
-            in.close();
-        }
-    }
-
-    private void assertNotNull(String string, InputStream in) {
-        // TODO Auto-generated method stub
-
-    }
-
-    /**
-     * Test for type detection of empty documents.
-     *
-     * @see TIKA-483
-     */
-    @Test
-    public void testEmptyDocument() throws IOException {
-        assertEquals(MediaType.OCTET_STREAM, proDetector.detect(
-                new ByteArrayInputStream(new byte[0]), new Metadata()));
-
-        Metadata namehint = new Metadata();
-        namehint.set(Metadata.RESOURCE_NAME_KEY, "test.txt");
-        assertEquals(MediaType.TEXT_PLAIN, proDetector.detect(
-                new ByteArrayInputStream(new byte[0]), namehint));
-
-        Metadata typehint = new Metadata();
-        typehint.set(Metadata.CONTENT_TYPE, "text/plain");
-        assertEquals(MediaType.TEXT_PLAIN, proDetector.detect(
-                new ByteArrayInputStream(new byte[0]), typehint));
-
-    }
-
-    /**
-     * Test for things like javascript files whose content is enclosed in XML
-     * comment delimiters, but that aren't actually XML.
-     *
-     * @see TIKA-426
-     */
-    @Test
-    public void testNotXML() throws IOException {
-        assertEquals(MediaType.TEXT_PLAIN, proDetector.detect(
-                new ByteArrayInputStream("".getBytes(UTF_8)),
-                new Metadata()));
-    }
-
-    /**
-     * Tests that when we repeatedly test the detection of a document that can
-     * be detected with Mime Magic, that we consistently detect it correctly.
-     * See TIKA-391 for more details.
-     */
-    @Test
-    public void testMimeMagicStability() throws IOException {
-        for (int i = 0; i < 100; i++) {
-            testFile("application/vnd.ms-excel", "test.xls");
-        }
-    }
-
-    /**
-     * Tests that when two magic matches both apply, and both have the same
-     * priority, we use the name to pick the right one based on the glob, or the
-     * first one we come across if not. See TIKA-1292 for more details.
-     */
-    @Test
-    public void testMimeMagicClashSamePriority() throws IOException {
-        byte[] helloWorld = "Hello, World!".getBytes(UTF_8);
-        MediaType helloType = MediaType.parse("hello/world-file");
-        MediaType helloXType = MediaType.parse("hello/x-world-hello");
-        Metadata metadata;
-
-        // With a filename, picks the right one
-        metadata = new Metadata();
-        metadata.set(Metadata.RESOURCE_NAME_KEY, "test.hello.world");
-        assertEquals(helloType, proDetector.detect(
-                new ByteArrayInputStream(helloWorld), metadata));
-
-        metadata = new Metadata();
-        metadata.set(Metadata.RESOURCE_NAME_KEY, "test.x-hello-world");
-        assertEquals(helloXType, proDetector.detect(
-                new ByteArrayInputStream(helloWorld), metadata));
-
-        // Without, goes for the one that sorts last
-        metadata = new Metadata();
-        metadata.set(Metadata.RESOURCE_NAME_KEY, "testingTESTINGtesting");
-        assertEquals(helloXType, proDetector.detect(
-                new ByteArrayInputStream(helloWorld), metadata));
-    }
-}
diff --git a/tika-core/src/test/java/org/apache/tika/mime/ProbabilisticMimeDetectionTestWithTika.java b/tika-core/src/test/java/org/apache/tika/mime/ProbabilisticMimeDetectionTestWithTika.java
deleted file mode 100644
index 5605300..0000000
--- a/tika-core/src/test/java/org/apache/tika/mime/ProbabilisticMimeDetectionTestWithTika.java
+++ /dev/null
@@ -1,267 +0,0 @@
-/*
- * Licensed to the Apache Software Foundation (ASF) under one or more
- * contributor license agreements.  See the NOTICE file distributed with
- * this work for additional information regarding copyright ownership.
- * The ASF licenses this file to You under the Apache License, Version 2.0
- * (the "License"); you may not use this file except in compliance with
- * the License.  You may obtain a copy of the License at
- *
- *     http://www.apache.org/licenses/LICENSE-2.0
- *
- * Unless required by applicable law or agreed to in writing, software
- * distributed under the License is distributed on an "AS IS" BASIS,
- * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
- * See the License for the specific language governing permissions and
- * limitations under the License.
- */
-package org.apache.tika.mime;
-
-import static java.nio.charset.StandardCharsets.UTF_16BE;
-import static java.nio.charset.StandardCharsets.UTF_16LE;
-import static java.nio.charset.StandardCharsets.UTF_8;
-import static org.junit.Assert.assertEquals;
-import static org.junit.Assert.assertNotNull;
-import static org.junit.Assert.assertTrue;
-
-import java.io.ByteArrayInputStream;
-import java.io.IOException;
-import java.io.InputStream;
-import java.net.URL;
-
-import org.apache.tika.Tika;
-import org.apache.tika.config.ServiceLoader;
-import org.apache.tika.detect.DefaultProbDetector;
-import org.apache.tika.metadata.Metadata;
-import org.apache.tika.mime.ProbabilisticMimeDetectionSelector.Builder;
-import org.junit.Before;
-import org.junit.Test;
-
-public class ProbabilisticMimeDetectionTestWithTika {
-
-    private ProbabilisticMimeDetectionSelector proSelector;
-    private MediaTypeRegistry registry;
-    private Tika tika;
-
-    /** @inheritDoc */
-    @Before
-    public void setUp() {
-        MimeTypes types = MimeTypes.getDefaultMimeTypes();
-        ServiceLoader loader = new ServiceLoader();
-        registry = types.getMediaTypeRegistry();
-
-        /*
-         * here is an example with the use of the builder to
-         * instantiate the object.
-         */
-        Builder builder = new ProbabilisticMimeDetectionSelector.Builder();
-        proSelector = new ProbabilisticMimeDetectionSelector(
-                types, builder.priorMagicFileType(0.5f)
-                        .priorExtensionFileType(0.5f)
-                        .priorMetaFileType(0.5f));
-        DefaultProbDetector detector = new DefaultProbDetector(proSelector, loader);
-
-        // Use a default Tika, except for our different detector
-        tika = new Tika(detector);
-    }
-
-    @Test
-    public void testDetection() throws Exception {
-        testFile("image/svg+xml", "circles.svg");
-        testFile("image/svg+xml", "circles-with-prefix.svg");
-        testFile("image/png", "datamatrix.png");
-        testFile("text/html", "test.html");
-        testFile("application/xml", "test-iso-8859-1.xml");
-        testFile("application/xml", "test-utf8.xml");
-        testFile("application/xml", "test-utf8-bom.xml");
-        testFile("application/xml", "test-utf16le.xml");
-        testFile("application/xml", "test-utf16be.xml");
-        testFile("application/xml", "test-long-comment.xml");
-        testFile("application/xslt+xml", "stylesheet.xsl");
-        testUrl("application/rdf+xml",
-                "http://www.ai.sri.com/daml/services/owl-s/1.2/Process.owl",
-                "test-difficult-rdf1.xml");
-        testUrl("application/rdf+xml", "http://www.w3.org/2002/07/owl#",
-                "test-difficult-rdf2.xml");
-        // add evil test from TIKA-327
-        testFile("text/html", "test-tika-327.html");
-        // add another evil html test from TIKA-357
-        testFile("text/html", "testlargerbuffer.html");
-        // test fragment of HTML with (TIKA-1102)
-        testFile("text/html", "htmlfragment");
-        // test binary CGM detection (TIKA-1170)
-        testFile("image/cgm", "plotutils-bin-cgm-v3.cgm");
-        // test HTML detection of malformed file, previously identified as
-        // image/cgm (TIKA-1170)
-        testFile("text/html", "test-malformed-header.html.bin");
-    }
-
-    @Test
-    public void testByteOrderMark() throws Exception {
-        assertEquals(MediaType.TEXT_PLAIN.toString(), tika.detect(
-                new ByteArrayInputStream("\ufefftest".getBytes(UTF_16LE)),
-                new Metadata()));
-        assertEquals(MediaType.TEXT_PLAIN.toString(), tika.detect(
-                new ByteArrayInputStream("\ufefftest".getBytes(UTF_16BE)),
-                new Metadata()));
-
-        assertEquals(MediaType.TEXT_PLAIN.toString(), tika.detect(
-                new ByteArrayInputStream("\ufefftest".getBytes(UTF_8)),
-                new Metadata()));
-    }
-
-    @Test
-    public void testSuperTypes() {
-
-        assertTrue(registry.isSpecializationOf(
-                MediaType.parse("text/something; charset=UTF-8"),
-                MediaType.parse("text/something")));
-
-        assertTrue(registry.isSpecializationOf(
-                MediaType.parse("text/something; charset=UTF-8"),
-                MediaType.TEXT_PLAIN));
-
-        assertTrue(registry.isSpecializationOf(
-                MediaType.parse("text/something; charset=UTF-8"),
-                MediaType.OCTET_STREAM));
-
-        assertTrue(registry.isSpecializationOf(
-                MediaType.parse("text/something"), MediaType.TEXT_PLAIN));
-
-        assertTrue(registry.isSpecializationOf(
-                MediaType.parse("application/something+xml"),
-                MediaType.APPLICATION_XML));
-
-        assertTrue(registry.isSpecializationOf(
-                MediaType.parse("application/something+zip"),
-                MediaType.APPLICATION_ZIP));
-
-        assertTrue(registry.isSpecializationOf(MediaType.APPLICATION_XML,
-                MediaType.TEXT_PLAIN));
-
-        assertTrue(registry.isSpecializationOf(
-                MediaType.parse("application/vnd.apple.iwork"),
-                MediaType.APPLICATION_ZIP));
-    }
-
-    @SuppressWarnings("unused")
-    private void testUrlOnly(String expected, String url) throws IOException {
-        InputStream in = new URL(url).openStream();
-        testStream(expected, url, in);
-    }
-
-    private void testUrl(String expected, String url, String file)
-            throws IOException {
-        InputStream in = getClass().getResourceAsStream(file);
-        testStream(expected, url, in);
-    }
-
-    private void testFile(String expected, String filename) throws IOException {
-        InputStream in = getClass().getResourceAsStream(filename);
-        testStream(expected, filename, in);
-    }
-
-    private void testStream(String expected, String urlOrFileName,
-            InputStream in) throws IOException {
-        assertNotNull("Test stream: [" + urlOrFileName + "] is null!", in);
-        if (!in.markSupported()) {
-            in = new java.io.BufferedInputStream(in);
-        }
-        try {
-            Metadata metadata = new Metadata();
-            // String mime = this.proDetector.detect(in, metadata).toString();
-            String mime = tika.detect(in, metadata).toString();
-            assertEquals(
-                    urlOrFileName + " is not properly detected: detected.",
-                    expected, mime);

-            // Add resource name and test again
-            metadata.set(Metadata.RESOURCE_NAME_KEY, urlOrFileName);
-            // mime = this.proDetector.detect(in, metadata).toString();
-            mime = tika.detect(in, metadata).toString();
-            assertEquals(urlOrFileName
-                    + " is not properly detected after adding resource name.",
-                    expected, mime);
-        } finally {
-            in.close();
-        }
-    }
-
-    /**
-     * Test for type detection of empty documents.
-     *
-     * @see TIKA-483
-     */
-    @Test
-    public void testEmptyDocument() throws IOException {
-        assertEquals(MediaType.OCTET_STREAM.toString(), tika.detect(
-                new ByteArrayInputStream(new byte[0]), new Metadata()));
-
-        Metadata namehint = new Metadata();
-        namehint.set(Metadata.RESOURCE_NAME_KEY, "test.txt");
-        assertEquals(MediaType.TEXT_PLAIN.toString(),
-                tika.detect(new ByteArrayInputStream(new byte[0]), namehint));
-
-        Metadata typehint = new Metadata();
-        typehint.set(Metadata.CONTENT_TYPE, "text/plain");
-        assertEquals(MediaType.TEXT_PLAIN.toString(),
-                tika.detect(new ByteArrayInputStream(new byte[0]), typehint));
-
-    }
-
-    /**
-     * Test for things like javascript files whose content is enclosed in XML
-     * comment delimiters, but that aren't actually XML.
-     *
-     * @see TIKA-426
-     */
-    @Test
-    public void testNotXML() throws IOException {
-        assertEquals(MediaType.TEXT_PLAIN.toString(), tika.detect(
-                new ByteArrayInputStream("".getBytes(UTF_8)),
-                new Metadata()));
-    }
-
-    /**
-     * Tests that when we repeatedly test the detection of a document that can
-     * be detected with Mime Magic, that we consistently detect it correctly.
-     * See TIKA-391 for more details.
-     */
-    @Test
-    public void testMimeMagicStability() throws IOException {
-        for (int i = 0; i < 100; i++) {
-            testFile("application/vnd.ms-excel", "test.xls");
-        }
-    }
-
-    /**
-     * Tests that when two magic matches both apply, and both have the same
-     * priority, we use the name to pick the right one based on the glob, or the
-     * first one we come across if not. See TIKA-1292 for more details.
-     */
-    @Test
-    public void testMimeMagicClashSamePriority() throws IOException {
-        byte[] helloWorld = "Hello, World!".getBytes(UTF_8);
-        MediaType helloType = MediaType.parse("hello/world-file");
-        MediaType helloXType = MediaType.parse("hello/x-world-hello");
-        Metadata metadata;
-
-        // With a filename, picks the right one
-        metadata = new Metadata();
-        metadata.set(Metadata.RESOURCE_NAME_KEY, "test.hello.world");
-        assertEquals(helloType.toString(),
-                tika.detect(new ByteArrayInputStream(helloWorld), metadata));
-
-        metadata = new Metadata();
-        metadata.set(Metadata.RESOURCE_NAME_KEY, "test.x-hello-world");
-        assertEquals(helloXType.toString(),
-                tika.detect(new ByteArrayInputStream(helloWorld), metadata));
-
-        // Without, goes for the one that sorts last
-        metadata = new Metadata();
-        metadata.set(Metadata.RESOURCE_NAME_KEY, "testingTESTINGtesting");
-        assertEquals(helloXType.toString(),
-                tika.detect(new ByteArrayInputStream(helloWorld), metadata));
-    }
-}
diff --git a/tika-core/src/test/java/org/apache/tika/parser/CompositeParserTest.java b/tika-core/src/test/java/org/apache/tika/parser/CompositeParserTest.java
index 6a2d52d..fda0883 100644
--- a/tika-core/src/test/java/org/apache/tika/parser/CompositeParserTest.java
+++ b/tika-core/src/test/java/org/apache/tika/parser/CompositeParserTest.java
@@ -39,7 +39,6 @@
 public class CompositeParserTest {

     @Test
-    @SuppressWarnings("serial")
     public void testFindDuplicateParsers() {
         Parser a = new EmptyParser() {
             public Set getSupportedTypes(ParseContext context) {
diff --git a/tika-core/src/test/java/org/apache/tika/parser/DummyParser.java b/tika-core/src/test/java/org/apache/tika/parser/DummyParser.java
index 1e6a377..04dca58 100644
--- a/tika-core/src/test/java/org/apache/tika/parser/DummyParser.java
+++ b/tika-core/src/test/java/org/apache/tika/parser/DummyParser.java
@@ -25,14 +25,11 @@
 import org.apache.tika.exception.TikaException;
 import org.apache.tika.metadata.Metadata;
 import org.apache.tika.mime.MediaType;
-import org.apache.tika.sax.XHTMLContentHandler; import org.xml.sax.ContentHandler; import org.xml.sax.SAXException; /** * A Dummy Parser for use with unit tests. - *

    - * See also {@link org.apache.tika.parser.mock.MockParser}. */ public class DummyParser extends AbstractParser { private Set types; @@ -57,12 +54,11 @@ metadata.add(m.getKey(), m.getValue()); } - XHTMLContentHandler xhtml = new XHTMLContentHandler(handler, metadata); - xhtml.startDocument(); + handler.startDocument(); if (xmlText != null) { - xhtml.characters(xmlText.toCharArray(), 0, xmlText.length()); + handler.characters(xmlText.toCharArray(), 0, xmlText.length()); } - xhtml.endDocument(); + handler.endDocument(); } } diff --git a/tika-core/src/test/java/org/apache/tika/parser/ParserDecoratorTest.java b/tika-core/src/test/java/org/apache/tika/parser/ParserDecoratorTest.java deleted file mode 100644 index 8fc1e11..0000000 --- a/tika-core/src/test/java/org/apache/tika/parser/ParserDecoratorTest.java +++ /dev/null @@ -1,120 +0,0 @@ -/* - * Licensed to the Apache Software Foundation (ASF) under one or more - * contributor license agreements. See the NOTICE file distributed with - * this work for additional information regarding copyright ownership. - * The ASF licenses this file to You under the Apache License, Version 2.0 - * (the "License"); you may not use this file except in compliance with - * the License. You may obtain a copy of the License at - * - * http://www.apache.org/licenses/LICENSE-2.0 - * - * Unless required by applicable law or agreed to in writing, software - * distributed under the License is distributed on an "AS IS" BASIS, - * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. - * See the License for the specific language governing permissions and - * limitations under the License. 
- */ -package org.apache.tika.parser; - -import static org.junit.Assert.assertEquals; - -import java.io.ByteArrayInputStream; -import java.util.Arrays; -import java.util.Collections; -import java.util.HashMap; -import java.util.HashSet; -import java.util.Set; - -import org.apache.tika.metadata.Metadata; -import org.apache.tika.mime.MediaType; -import org.apache.tika.sax.BodyContentHandler; -import org.junit.Test; - -public class ParserDecoratorTest { - @Test - public void withAndWithoutTypes() { - Set onlyTxt = Collections.singleton(MediaType.TEXT_PLAIN); - Set onlyOct = Collections.singleton(MediaType.OCTET_STREAM); - Set both = new HashSet(); - both.addAll(onlyOct); - both.addAll(onlyTxt); - - Parser p; - Set types; - ParseContext context = new ParseContext(); - - - // With a parser of no types, get the decorated type - p = ParserDecorator.withTypes(EmptyParser.INSTANCE, onlyTxt); - types = p.getSupportedTypes(context); - assertEquals(1, types.size()); - assertEquals(types.toString(), true, types.contains(MediaType.TEXT_PLAIN)); - - // With a parser with other types, still just the decorated type - p = ParserDecorator.withTypes( - new DummyParser(onlyOct, new HashMap(), ""), onlyTxt); - types = p.getSupportedTypes(context); - assertEquals(1, types.size()); - assertEquals(types.toString(), true, types.contains(MediaType.TEXT_PLAIN)); - - - // Exclude will remove if there - p = ParserDecorator.withoutTypes(EmptyParser.INSTANCE, onlyTxt); - types = p.getSupportedTypes(context); - assertEquals(0, types.size()); - - p = ParserDecorator.withoutTypes( - new DummyParser(onlyOct, new HashMap(), ""), onlyTxt); - types = p.getSupportedTypes(context); - assertEquals(1, types.size()); - assertEquals(types.toString(), true, types.contains(MediaType.OCTET_STREAM)); - - p = ParserDecorator.withoutTypes( - new DummyParser(both, new HashMap(), ""), onlyTxt); - types = p.getSupportedTypes(context); - assertEquals(1, types.size()); - assertEquals(types.toString(), true, 
types.contains(MediaType.OCTET_STREAM)); - } - - /** - * Testing one proposed implementation for TIKA-1509 - */ - @Test - public void withFallback() throws Exception { - Set onlyOct = Collections.singleton(MediaType.OCTET_STREAM); - Set octAndText = new HashSet(Arrays.asList( - MediaType.OCTET_STREAM, MediaType.TEXT_PLAIN)); - - ParseContext context = new ParseContext(); - BodyContentHandler handler; - Metadata metadata; - - ErrorParser pFail = new ErrorParser(); - DummyParser pWork = new DummyParser(onlyOct, new HashMap(), "Fell back!"); - EmptyParser pNothing = new EmptyParser(); - - // Create a combination which will fail first - @SuppressWarnings("deprecation") - Parser p = ParserDecorator.withFallbacks(Arrays.asList(pFail, pWork), octAndText); - - // Will claim to support the types given, not those on the child parsers - Set types = p.getSupportedTypes(context); - assertEquals(2, types.size()); - assertEquals(types.toString(), true, types.contains(MediaType.TEXT_PLAIN)); - assertEquals(types.toString(), true, types.contains(MediaType.OCTET_STREAM)); - - // Parsing will make it to the second one - metadata = new Metadata(); - handler = new BodyContentHandler(); - p.parse(new ByteArrayInputStream(new byte[] {0,1,2,3,4}), handler, metadata, context); - assertEquals("Fell back!", handler.toString()); - - - // With a parser that will work with no output, will get nothing - p = ParserDecorator.withFallbacks(Arrays.asList(pNothing, pWork), octAndText); - metadata = new Metadata(); - handler = new BodyContentHandler(); - p.parse(new ByteArrayInputStream(new byte[] {0,1,2,3,4}), handler, metadata, context); - assertEquals("", handler.toString()); - } -} diff --git a/tika-core/src/test/java/org/apache/tika/parser/mock/MockParser.java b/tika-core/src/test/java/org/apache/tika/parser/mock/MockParser.java deleted file mode 100644 index d7ac5fe..0000000 --- a/tika-core/src/test/java/org/apache/tika/parser/mock/MockParser.java +++ /dev/null @@ -1,359 +0,0 @@ -package 
org.apache.tika.parser.mock; - -/* - * Licensed to the Apache Software Foundation (ASF) under one or more - * contributor license agreements. See the NOTICE file distributed with - * this work for additional information regarding copyright ownership. - * The ASF licenses this file to You under the Apache License, Version 2.0 - * (the "License"); you may not use this file except in compliance with - * the License. You may obtain a copy of the License at - * - * http://www.apache.org/licenses/LICENSE-2.0 - * - * Unless required by applicable law or agreed to in writing, software - * distributed under the License is distributed on an "AS IS" BASIS, - * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. - * See the License for the specific language governing permissions and - * limitations under the License. - */ - - -import javax.xml.parsers.DocumentBuilder; -import javax.xml.parsers.DocumentBuilderFactory; -import javax.xml.parsers.ParserConfigurationException; -import java.io.ByteArrayInputStream; -import java.io.IOException; -import java.io.InputStream; -import java.lang.reflect.Constructor; -import java.util.ArrayList; -import java.util.Date; -import java.util.HashSet; -import java.util.List; -import java.util.Set; - -import org.apache.tika.exception.TikaException; -import org.apache.tika.extractor.EmbeddedDocumentExtractor; -import org.apache.tika.extractor.ParsingEmbeddedDocumentExtractor; -import org.apache.tika.metadata.Metadata; -import org.apache.tika.metadata.TikaMetadataKeys; -import org.apache.tika.mime.MediaType; -import org.apache.tika.parser.AbstractParser; -import org.apache.tika.parser.ParseContext; -import org.apache.tika.parser.Parser; -import org.apache.tika.sax.EmbeddedContentHandler; -import org.apache.tika.sax.XHTMLContentHandler; -import org.w3c.dom.Document; -import org.w3c.dom.NamedNodeMap; -import org.w3c.dom.Node; -import org.w3c.dom.NodeList; -import org.xml.sax.ContentHandler; -import org.xml.sax.SAXException; - 
-import static java.nio.charset.StandardCharsets.UTF_8; - -/** - * This class enables mocking of parser behavior for use in testing - * wrappers and drivers of parsers. - *

- * See resources/test-documents/mock/example.xml in tika-parsers/test for the documentation - * of all the options for this MockParser. - * <p>

- * Tests for this class are in tika-parsers. - * <p>

    - * See also {@link org.apache.tika.parser.DummyParser} for another option. - */ - -public class MockParser extends AbstractParser { - - private static final long serialVersionUID = 1L; - - @Override - public Set getSupportedTypes(ParseContext context) { - Set types = new HashSet(); - MediaType type = MediaType.application("mock+xml"); - types.add(type); - return types; - } - - @Override - public void parse(InputStream stream, ContentHandler handler, - Metadata metadata, ParseContext context) throws IOException, - SAXException, TikaException { - Document doc = null; - DocumentBuilderFactory fact = DocumentBuilderFactory.newInstance(); - DocumentBuilder docBuilder = null; - try { - docBuilder = fact.newDocumentBuilder(); - doc = docBuilder.parse(stream); - } catch (ParserConfigurationException e) { - throw new IOException(e); - } catch (SAXException e) { - throw new IOException(e); - } - Node root = doc.getDocumentElement(); - NodeList actions = root.getChildNodes(); - XHTMLContentHandler xhtml = new XHTMLContentHandler(handler, metadata); - xhtml.startDocument(); - for (int i = 0; i < actions.getLength(); i++) { - executeAction(actions.item(i), metadata, context, xhtml); - } - xhtml.endDocument(); - } - - private void executeAction(Node action, Metadata metadata, ParseContext context, - XHTMLContentHandler xhtml) throws SAXException, - IOException, TikaException { - - if (action.getNodeType() != 1) { - return; - } - - String name = action.getNodeName(); - if ("metadata".equals(name)) { - metadata(action, metadata); - } else if("write".equals(name)) { - write(action, xhtml); - } else if ("throw".equals(name)) { - throwIt(action); - } else if ("hang".equals(name)) { - hang(action); - } else if ("oom".equals(name)) { - kabOOM(); - } else if ("print_out".equals(name) || "print_err".equals(name)){ - print(action, name); - } else if ("embedded".equals(name)) { - handleEmbedded(action, xhtml, context); - } else { - throw new IllegalArgumentException("Didn't recognize 
mock action: "+name); - } - } - - private void handleEmbedded(Node action, XHTMLContentHandler handler, ParseContext context) - throws TikaException, SAXException, IOException { - String fileName = ""; - String contentType = ""; - NamedNodeMap attrs = action.getAttributes(); - if (attrs != null) { - Node n = attrs.getNamedItem("filename"); - if (n != null) { - fileName = n.getNodeValue(); - } - n = attrs.getNamedItem("content-type"); - if (n != null) { - contentType = n.getNodeValue(); - } - } - - String embeddedText = action.getTextContent(); - EmbeddedDocumentExtractor extractor = getEmbeddedDocumentExtractor(context); - Metadata m = new Metadata(); - m.set(TikaMetadataKeys.RESOURCE_NAME_KEY, fileName); - if (! "".equals(contentType)) { - m.set(Metadata.CONTENT_TYPE, contentType); - } - InputStream is = new ByteArrayInputStream(embeddedText.getBytes(UTF_8)); - - extractor.parseEmbedded( - is, - new EmbeddedContentHandler(handler), - m, true); - - - } - - protected EmbeddedDocumentExtractor getEmbeddedDocumentExtractor(ParseContext context) { - EmbeddedDocumentExtractor extractor = - context.get(EmbeddedDocumentExtractor.class); - if (extractor == null) { - Parser p = context.get(Parser.class); - if (p == null) { - context.set(Parser.class, new MockParser()); - } - extractor = new ParsingEmbeddedDocumentExtractor(context); - } - return extractor; - } - - private void print(Node action, String name) { - String content = action.getTextContent(); - if ("print_out".equals(name)) { - System.out.println(content); - } else if ("print_err".equals(name)) { - System.err.println(content); - } else { - throw new IllegalArgumentException("must be print_out or print_err"); - } - } - private void hang(Node action) { - boolean interruptible = true; - boolean heavy = false; - long millis = -1; - long pulseMillis = -1; - NamedNodeMap attrs = action.getAttributes(); - Node iNode = attrs.getNamedItem("interruptible"); - if (iNode != null) { - interruptible = 
("true".equals(iNode.getNodeValue())); - } - Node hNode = attrs.getNamedItem("heavy"); - if (hNode != null) { - heavy = ("true".equals(hNode.getNodeValue())); - } - - Node mNode = attrs.getNamedItem("millis"); - if (mNode == null) { - throw new RuntimeException("Must specify \"millis\" attribute for hang."); - } - String millisString = mNode.getNodeValue(); - try { - millis = Long.parseLong(millisString); - } catch (NumberFormatException e) { - throw new RuntimeException("Value for \"millis\" attribute must be a long."); - } - - if (heavy) { - Node pNode = attrs.getNamedItem("pulse_millis"); - if (pNode == null) { - throw new RuntimeException("Must specify attribute \"pulse_millis\" if the hang is \"heavy\""); - } - String pulseMillisString = mNode.getNodeValue(); - try { - pulseMillis = Long.parseLong(pulseMillisString); - } catch (NumberFormatException e) { - throw new RuntimeException("Value for \"millis\" attribute must be a long."); - } - } - if (heavy) { - hangHeavy(millis, pulseMillis, interruptible); - } else { - sleep(millis, interruptible); - } - } - - private void throwIt(Node action) throws IOException, - SAXException, TikaException { - NamedNodeMap attrs = action.getAttributes(); - String className = attrs.getNamedItem("class").getNodeValue(); - String msg = action.getTextContent(); - throwIt(className, msg); - } - - private void metadata(Node action, Metadata metadata) { - NamedNodeMap attrs = action.getAttributes(); - //throws npe unless there is a name - String name = attrs.getNamedItem("name").getNodeValue(); - String value = action.getTextContent(); - Node actionType = attrs.getNamedItem("action"); - if (actionType == null) { - metadata.add(name, value); - } else { - if ("set".equals(actionType.getNodeValue())) { - metadata.set(name, value); - } else { - metadata.add(name, value); - } - } - } - - private void write(Node action, XHTMLContentHandler xhtml) throws SAXException { - NamedNodeMap attrs = action.getAttributes(); - Node eNode = 
attrs.getNamedItem("element"); - String elementType = "p"; - if (eNode != null) { - elementType = eNode.getTextContent(); - } - String text = action.getTextContent(); - xhtml.startElement(elementType); - xhtml.characters(text); - xhtml.endElement(elementType); - } - - - private void throwIt(String className, String msg) throws IOException, - SAXException, TikaException { - Throwable t = null; - if (msg == null || msg.equals("")) { - try { - t = (Throwable) Class.forName(className).newInstance(); - } catch (Exception e) { - throw new RuntimeException("couldn't create throwable class:"+className, e); - } - } else { - try { - Class clazz = Class.forName(className); - Constructor con = clazz.getConstructor(String.class); - t = (Throwable) con.newInstance(msg); - } catch (Exception e) { - throw new RuntimeException("couldn't create throwable class:" + className, e); - } - } - if (t instanceof SAXException) { - throw (SAXException)t; - } else if (t instanceof IOException) { - throw (IOException) t; - } else if (t instanceof TikaException) { - throw (TikaException) t; - } else if (t instanceof Error) { - throw (Error) t; - } else if (t instanceof RuntimeException) { - throw (RuntimeException) t; - } else { - //wrap the throwable in a RuntimeException - throw new RuntimeException(t); - } - } - - private void kabOOM() { - List ints = new ArrayList(); - - while (true) { - int[] intArr = new int[32000]; - ints.add(intArr); - } - } - - private void hangHeavy(long maxMillis, long pulseCheckMillis, boolean interruptible) { - //do some heavy computation and occasionally check for - //whether time has exceeded maxMillis (see TIKA-1132 for inspiration) - //or whether the thread was interrupted - long start = new Date().getTime(); - int lastChecked = 0; - while (true) { - for (int i = 1; i < Integer.MAX_VALUE; i++) { - for (int j = 1; j < Integer.MAX_VALUE; j++) { - double div = (double) i / (double) j; - lastChecked++; - if (lastChecked > pulseCheckMillis) { - lastChecked = 0; - if 
(interruptible && Thread.currentThread().isInterrupted()) { - return; - } - long elapsed = new Date().getTime()-start; - if (elapsed > maxMillis) { - return; - } - } - } - } - } - } - - private void sleep(long maxMillis, boolean isInterruptible) { - long start = new Date().getTime(); - long millisRemaining = maxMillis; - while (true) { - try { - Thread.sleep(millisRemaining); - } catch (InterruptedException e) { - if (isInterruptible) { - return; - } - } - long elapsed = new Date().getTime()-start; - millisRemaining = maxMillis - elapsed; - if (millisRemaining <= 0) { - break; - } - } - } - -} diff --git a/tika-core/src/test/java/org/apache/tika/sax/BasicContentHandlerFactoryTest.java b/tika-core/src/test/java/org/apache/tika/sax/BasicContentHandlerFactoryTest.java deleted file mode 100644 index 249b9fd..0000000 --- a/tika-core/src/test/java/org/apache/tika/sax/BasicContentHandlerFactoryTest.java +++ /dev/null @@ -1,341 +0,0 @@ -/* - * Licensed to the Apache Software Foundation (ASF) under one or more - * contributor license agreements. See the NOTICE file distributed with - * this work for additional information regarding copyright ownership. - * The ASF licenses this file to You under the Apache License, Version 2.0 - * (the "License"); you may not use this file except in compliance with - * the License. You may obtain a copy of the License at - * - * http://www.apache.org/licenses/LICENSE-2.0 - * - * Unless required by applicable law or agreed to in writing, software - * distributed under the License is distributed on an "AS IS" BASIS, - * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. - * See the License for the specific language governing permissions and - * limitations under the License. 
- */ -package org.apache.tika.sax; - -import static java.nio.charset.StandardCharsets.UTF_8; -import static org.junit.Assert.assertEquals; -import static org.junit.Assert.assertFalse; -import static org.junit.Assert.assertTrue; - -import java.io.ByteArrayOutputStream; -import java.io.IOException; -import java.io.InputStream; -import java.io.UnsupportedEncodingException; -import java.util.Set; - -import org.apache.tika.exception.TikaException; -import org.apache.tika.metadata.Metadata; -import org.apache.tika.mime.MediaType; -import org.apache.tika.parser.ParseContext; -import org.apache.tika.parser.Parser; -import org.junit.Test; -import org.xml.sax.Attributes; -import org.xml.sax.ContentHandler; -import org.xml.sax.SAXException; -import org.xml.sax.helpers.AttributesImpl; -import org.xml.sax.helpers.DefaultHandler; - -/** - * Test cases for the {@link org.apache.tika.sax.BodyContentHandler} class. - */ -public class BasicContentHandlerFactoryTest { - - private static final String ENCODING = UTF_8.name(); - //default max char len (at least in WriteOutContentHandler is 100k) - private static final int OVER_DEFAULT = 120000; - - @Test - public void testIgnore() throws Exception { - Parser p = new MockParser(OVER_DEFAULT); - ContentHandler handler = - new BasicContentHandlerFactory(BasicContentHandlerFactory.HANDLER_TYPE.IGNORE, -1).getNewContentHandler(); - assertTrue(handler instanceof DefaultHandler); - p.parse(null, handler, null, null); - //unfortunatley, the DefaultHandler does not return "", - assertContains("org.xml.sax.helpers.DefaultHandler", handler.toString()); - - //tests that no write limit exception is thrown - p = new MockParser(100); - handler = - new BasicContentHandlerFactory(BasicContentHandlerFactory.HANDLER_TYPE.IGNORE, 5).getNewContentHandler(); - assertTrue(handler instanceof DefaultHandler); - p.parse(null, handler, null, null); - assertContains("org.xml.sax.helpers.DefaultHandler", handler.toString()); - } - - @Test - public void testText() 
throws Exception { - Parser p = new MockParser(OVER_DEFAULT); - BasicContentHandlerFactory.HANDLER_TYPE type = - BasicContentHandlerFactory.HANDLER_TYPE.TEXT; - ContentHandler handler = - new BasicContentHandlerFactory(type, -1).getNewContentHandler(); - - assertTrue(handler instanceof ToTextContentHandler); - p.parse(null, handler, null, null); - String extracted = handler.toString(); - assertContains("This is the title", extracted); - assertContains("aaaaaaaaaa", extracted); - assertNotContains("<body", extracted); - assertNotContains("<html", extracted); - assertTrue(extracted.length() > 110000); - - //now test write limit - p = new MockParser(10); - handler = new BasicContentHandlerFactory(type, 5).getNewContentHandler(); - assertTrue(handler instanceof WriteOutContentHandler); - assertWriteLimitReached(p, (WriteOutContentHandler) handler); - extracted = handler.toString(); - assertContains("This ", extracted); - assertNotContains("aaaa", extracted); - - //now test outputstream call - p = new MockParser(OVER_DEFAULT); - ByteArrayOutputStream os = new ByteArrayOutputStream(); - handler = new BasicContentHandlerFactory(type, -1).getNewContentHandler(os, ENCODING); - assertTrue(handler instanceof ToTextContentHandler); - p.parse(null, handler, null, null); - assertContains("This is the title", os.toByteArray()); - assertContains("aaaaaaaaaa", os.toByteArray()); - assertTrue(os.toByteArray().length > 110000); - assertNotContains("<body", os.toByteArray()); - assertNotContains("<html", os.toByteArray()); - - p = new MockParser(10); - os = new ByteArrayOutputStream(); - handler = new BasicContentHandlerFactory(type, 5).getNewContentHandler(os, ENCODING); - assertTrue(handler instanceof WriteOutContentHandler); - assertWriteLimitReached(p, (WriteOutContentHandler) handler); - assertEquals(0, os.toByteArray().length); - } - - @Test - public void testHTML() throws Exception { - Parser p = new MockParser(OVER_DEFAULT); - BasicContentHandlerFactory.HANDLER_TYPE type = - BasicContentHandlerFactory.HANDLER_TYPE.HTML; - ContentHandler handler = - new BasicContentHandlerFactory(type, -1).getNewContentHandler(); - - assertTrue(handler instanceof ToHTMLContentHandler); - p.parse(null, handler, null, null); - String extracted = handler.toString(); - assertContains("<head><title>This is the title", extracted); - assertContains("aaaaaaaaaa", extracted); - assertTrue(extracted.length() > 110000); - - //now test write limit - p = new MockParser(10); - handler = new BasicContentHandlerFactory(type, 5).getNewContentHandler(); - assertTrue(handler instanceof WriteOutContentHandler); - assertWriteLimitReached(p, (WriteOutContentHandler) handler); - extracted = handler.toString(); - assertContains("This ", extracted); - assertNotContains("aaaa", extracted); - - //now test outputstream call - p = new MockParser(OVER_DEFAULT); - ByteArrayOutputStream os = new ByteArrayOutputStream(); - handler = new 
BasicContentHandlerFactory(type, -1).getNewContentHandler(os, ENCODING); - assertTrue(handler instanceof ToHTMLContentHandler); - p.parse(null, handler, null, null); - assertContains("This is the title", os.toByteArray()); - assertContains("aaaaaaaaaa", os.toByteArray()); - assertContains("<body", os.toByteArray()); - assertContains("<html", os.toByteArray()); - assertTrue(os.toByteArray().length > 110000); - - - p = new MockParser(10); - os = new ByteArrayOutputStream(); - handler = new BasicContentHandlerFactory(type, 5).getNewContentHandler(os, ENCODING); - assertTrue(handler instanceof WriteOutContentHandler); - assertWriteLimitReached(p, (WriteOutContentHandler) handler); - assertEquals(0, os.toByteArray().length); - } - - @Test - public void testXML() throws Exception { - Parser p = new MockParser(OVER_DEFAULT); - BasicContentHandlerFactory.HANDLER_TYPE type = - BasicContentHandlerFactory.HANDLER_TYPE.HTML; - ContentHandler handler = - new BasicContentHandlerFactory(type, -1).getNewContentHandler(); - - assertTrue(handler instanceof ToXMLContentHandler); - p.parse(null, handler, new Metadata(), null); - String extracted = handler.toString(); - assertContains("<head><title>This is the title", extracted); - assertContains("aaaaaaaaaa", extracted); - assertTrue(handler.toString().length() > 110000); - - //now test write limit - p = new MockParser(10); - handler = new BasicContentHandlerFactory(type, 5).getNewContentHandler(); - assertTrue(handler instanceof WriteOutContentHandler); - assertWriteLimitReached(p, (WriteOutContentHandler) handler); - extracted = handler.toString(); - assertContains("This ", extracted); - assertNotContains("aaaa", extracted); - - //now test outputstream call - p = new MockParser(OVER_DEFAULT); - ByteArrayOutputStream os = new ByteArrayOutputStream(); - handler = new BasicContentHandlerFactory(type, -1).getNewContentHandler(os, ENCODING); - assertTrue(handler instanceof ToXMLContentHandler); - p.parse(null, handler, null, null); - - 
assertContains("This is the title", os.toByteArray()); - assertContains("aaaaaaaaaa", os.toByteArray()); - assertContains("<body", os.toByteArray()); - assertContains("<html", os.toByteArray()); - assertTrue(os.toByteArray().length > 110000); - - - p = new MockParser(10); - os = new ByteArrayOutputStream(); - handler = new BasicContentHandlerFactory(type, 5).getNewContentHandler(os, ENCODING); - assertTrue(handler instanceof WriteOutContentHandler); - assertWriteLimitReached(p, (WriteOutContentHandler) handler); - assertEquals(0, os.toByteArray().length); - } - - - @Test - public void testBody() throws Exception { - Parser p = new MockParser(OVER_DEFAULT); - BasicContentHandlerFactory.HANDLER_TYPE type = - BasicContentHandlerFactory.HANDLER_TYPE.BODY; - ContentHandler handler = - new BasicContentHandlerFactory(type, -1).getNewContentHandler(); - - assertTrue(handler instanceof BodyContentHandler); - - p.parse(null, handler, null, null); - String extracted = handler.toString(); - assertNotContains("title", extracted); - assertContains("aaaaaaaaaa", extracted); - assertTrue(extracted.length() > 110000); - - //now test write limit - p = new MockParser(10); - handler = new BasicContentHandlerFactory(type, 5).getNewContentHandler(); - assertTrue(handler instanceof BodyContentHandler); - assertWriteLimitReached(p, (BodyContentHandler)handler); - extracted = handler.toString(); - assertNotContains("This ", extracted); - assertContains("aaaa", extracted); - - //now test outputstream call - p = new MockParser(OVER_DEFAULT); - ByteArrayOutputStream os = new ByteArrayOutputStream(); - handler = new BasicContentHandlerFactory(type, -1).getNewContentHandler(os, ENCODING); - assertTrue(handler instanceof BodyContentHandler); - p.parse(null, handler, null, null); - assertNotContains("title", os.toByteArray()); - assertContains("aaaaaaaaaa", os.toByteArray()); - assertNotContains("<body", os.toByteArray()); - assertNotContains("<html", os.toByteArray()); - 
assertTrue(os.toByteArray().length > 110000); - - p = new MockParser(10); - os = new ByteArrayOutputStream(); - handler = new BasicContentHandlerFactory(type, 5).getNewContentHandler(os, ENCODING); - assertTrue(handler instanceof WriteOutContentHandler); - assertWriteLimitReached(p, (WriteOutContentHandler) handler); - assertEquals(0, os.toByteArray().length); - } - - private void assertWriteLimitReached(Parser p, WriteOutContentHandler handler) throws Exception { - boolean wlr = false; - try { - p.parse(null, handler, null, null); - } catch (SAXException e) { - if (! handler.isWriteLimitReached(e)) { - throw e; - } - wlr = true; - } - assertTrue("WriteLimitReached", wlr); - } - //TODO: is there a better way than to repeat this with diff signature? - private void assertWriteLimitReached(Parser p, BodyContentHandler handler) throws Exception { - boolean wlr = false; - try { - p.parse(null, handler, null, null); - } catch (SAXException e) { - if (! e.getClass().toString().contains("org.apache.tika.sax.WriteOutContentHandler$WriteLimitReachedException")){ - throw e; - } - - wlr = true; - } - assertTrue("WriteLimitReached", wlr); - } - - - //copied from TikaTest in tika-parsers package - public static void assertNotContains(String needle, String haystack) { - assertFalse(needle + " found in:\n" + haystack, haystack.contains(needle)); - } - - public static void assertNotContains(String needle, byte[] hayStack) - throws UnsupportedEncodingException { - assertNotContains(needle, new String(hayStack, ENCODING)); - } - - public static void assertContains(String needle, String haystack) { - assertTrue(needle + " not found in:\n" + haystack, haystack.contains(needle)); - } - - public static void assertContains(String needle, byte[] hayStack) - throws UnsupportedEncodingException { - assertContains(needle, new String(hayStack, ENCODING)); - } - - //Simple mockparser that writes a title - //and charsToWrite number of 'a' - private class MockParser implements Parser { - private 
final String XHTML = "http://www.w3.org/1999/xhtml"; - private final Attributes EMPTY_ATTRIBUTES = new AttributesImpl(); - private final char[] TITLE = "This is the title".toCharArray(); - - private final int charsToWrite; - public MockParser(int charsToWrite) { - this.charsToWrite = charsToWrite; - } - - @Override - public Set<MediaType> getSupportedTypes(ParseContext context) { - return null; - } - - @Override - public void parse(InputStream stream, ContentHandler handler, Metadata metadata, ParseContext context) throws IOException, SAXException, TikaException { - handler.startDocument(); - handler.startPrefixMapping("", XHTML); - handler.startElement(XHTML, "html", "html", EMPTY_ATTRIBUTES); - handler.startElement(XHTML, "head", "head", EMPTY_ATTRIBUTES); - handler.startElement(XHTML, "title", "head", EMPTY_ATTRIBUTES); - handler.characters(TITLE, 0, TITLE.length); - handler.endElement(XHTML, "title", "head"); - - handler.endElement(XHTML, "head", "head"); - handler.startElement(XHTML, "body", "body", EMPTY_ATTRIBUTES); - char[] body = new char[charsToWrite]; - for (int i = 0; i < charsToWrite; i++) { - body[i] = 'a'; - } - handler.characters(body, 0, body.length); - handler.endElement(XHTML, "body", "body"); - handler.endElement(XHTML, "html", "html"); - handler.endDocument(); - } - } -} diff --git a/tika-core/src/test/java/org/apache/tika/sax/BodyContentHandlerTest.java b/tika-core/src/test/java/org/apache/tika/sax/BodyContentHandlerTest.java index bf42706..920e026 100644 --- a/tika-core/src/test/java/org/apache/tika/sax/BodyContentHandlerTest.java +++ b/tika-core/src/test/java/org/apache/tika/sax/BodyContentHandlerTest.java @@ -16,7 +16,6 @@ */ package org.apache.tika.sax; -import static java.nio.charset.StandardCharsets.UTF_8; import static org.junit.Assert.assertEquals; import java.io.ByteArrayOutputStream; @@ -46,7 +45,7 @@ xhtml.element("p", "Test text"); xhtml.endDocument(); - assertEquals("Test text\n", buffer.toString(UTF_8.name())); + 
assertEquals("Test text\n", buffer.toString()); } } diff --git a/tika-core/src/test/java/org/apache/tika/sax/XHTMLContentHandlerTest.java b/tika-core/src/test/java/org/apache/tika/sax/XHTMLContentHandlerTest.java index 88f753a..e48e5a4 100644 --- a/tika-core/src/test/java/org/apache/tika/sax/XHTMLContentHandlerTest.java +++ b/tika-core/src/test/java/org/apache/tika/sax/XHTMLContentHandlerTest.java @@ -17,12 +17,10 @@ package org.apache.tika.sax; import static org.junit.Assert.assertEquals; -import static org.junit.Assert.assertTrue; import java.util.ArrayList; import java.util.List; -import org.apache.tika.config.TikaConfigTest; import org.apache.tika.metadata.Metadata; import org.junit.Before; @@ -30,7 +28,6 @@ import org.xml.sax.ContentHandler; import org.xml.sax.SAXException; -import org.xml.sax.helpers.AttributesImpl; /** * Unit tests for the {@link XHTMLContentHandler} class. @@ -124,23 +121,6 @@ assertEquals("two", words[1]); } - @Test - public void testAttributesOnBody() throws Exception { - ToHTMLContentHandler toHTMLContentHandler = new ToHTMLContentHandler(); - XHTMLContentHandler xhtmlContentHandler = new XHTMLContentHandler(toHTMLContentHandler, new Metadata()); - AttributesImpl attributes = new AttributesImpl(); - - attributes.addAttribute(XHTMLContentHandler.XHTML, "itemscope", "itemscope", "", ""); - attributes.addAttribute(XHTMLContentHandler.XHTML, "itemtype", "itemtype", "", "http://schema.org/Event"); - - xhtmlContentHandler.startDocument(); - xhtmlContentHandler.startElement(XHTMLContentHandler.XHTML, "body", "body", attributes); - xhtmlContentHandler.endElement("body"); - xhtmlContentHandler.endDocument(); - - assertTrue(toHTMLContentHandler.toString().contains("itemscope")); - } - /** * Return array of non-zerolength words. Splitting on whitespace will get us * empty words for emptylines. 
diff --git a/tika-core/src/test/java/org/apache/tika/utils/ConcurrentUtilsTest.java b/tika-core/src/test/java/org/apache/tika/utils/ConcurrentUtilsTest.java
deleted file mode 100644
index ea0d195..0000000
--- a/tika-core/src/test/java/org/apache/tika/utils/ConcurrentUtilsTest.java
+++ /dev/null
@@ -1,63 +0,0 @@
-/*
- * Licensed to the Apache Software Foundation (ASF) under one or more
- * contributor license agreements. See the NOTICE file distributed with
- * this work for additional information regarding copyright ownership.
- * The ASF licenses this file to You under the Apache License, Version 2.0
- * (the "License"); you may not use this file except in compliance with
- * the License. You may obtain a copy of the License at
- *
- * http://www.apache.org/licenses/LICENSE-2.0
- *
- * Unless required by applicable law or agreed to in writing, software
- * distributed under the License is distributed on an "AS IS" BASIS,
- * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
- * See the License for the specific language governing permissions and
- * limitations under the License.
- */
-package org.apache.tika.utils;
-
-import static org.junit.Assert.*;
-
-import java.util.concurrent.ExecutorService;
-import java.util.concurrent.Future;
-
-import org.apache.tika.config.TikaConfig;
-import org.apache.tika.parser.ParseContext;
-import org.junit.Before;
-import org.junit.Test;
-
-public class ConcurrentUtilsTest {
-
-    @Test
-    public void testExecuteThread() throws Exception {
-        ParseContext context = new ParseContext();
-        Future result = ConcurrentUtils.execute(context, new Runnable() {
-
-            @Override
-            public void run() {
-                //Do nothing
-
-            }
-        });
-
-        assertNull(result.get());
-    }
-
-    @Test
-    public void testExecuteExecutor() throws Exception {
-        TikaConfig config = TikaConfig.getDefaultConfig();
-        ParseContext context = new ParseContext();
-        context.set(ExecutorService.class, config.getExecutorService());
-        Future result = ConcurrentUtils.execute(context, new Runnable() {
-
-            @Override
-            public void run() {
-                //Do nothing
-
-            }
-        });
-
-        assertNull(result.get());
-    }
-
-}
diff --git a/tika-dotnet/pom.xml b/tika-dotnet/pom.xml
index b2ed78b..1f5ad97 100644
--- a/tika-dotnet/pom.xml
+++ b/tika-dotnet/pom.xml
@@ -19,14 +19,13 @@ under the License.
--> -<project xmlns="http://maven.apache.org/POM/4.0.0" xmlns:xsi="http://www.w3.org/2001/XMLSchema-instance" - xsi:schemaLocation="http://maven.apache.org/POM/4.0.0 http://maven.apache.org/maven-v4_0_0.xsd"> +<project xmlns="http://maven.apache.org/POM/4.0.0" xmlns:xsi="http://www.w3.org/2001/XMLSchema-instance" xsi:schemaLocation="http://maven.apache.org/POM/4.0.0 http://maven.apache.org/maven-v4_0_0.xsd"> <modelVersion>4.0.0</modelVersion> <parent> <groupId>org.apache.tika</groupId> <artifactId>tika-parent</artifactId> - <version>1.11-SNAPSHOT</version> + <version>1.5</version> <relativePath>../tika-parent/pom.xml</relativePath> </parent> @@ -93,22 +92,22 @@ <configuration> <target> <exec executable="${ikvm}/bin/ikvmc.exe"> - <arg value="-nowarn:0100"/> - <arg value="-nowarn:0105"/> - <arg value="-nowarn:0109"/> - <arg value="-nowarn:0111"/> - <arg value="-nowarn:0112"/> - <arg value="-reference:${ikvm}/bin/IKVM.OpenJDK.Charsets.dll"/> - <arg value="-reference:${ikvm}/bin/IKVM.OpenJDK.Core.dll"/> - <arg value="-reference:${ikvm}/bin/IKVM.OpenJDK.Text.dll"/> - <arg value="-reference:${ikvm}/bin/IKVM.OpenJDK.Util.dll"/> - <arg value="-reference:${ikvm}/bin/IKVM.OpenJDK.XML.API.dll"/> - <arg value="-reference:${ikvm}/bin/IKVM.OpenJDK.XML.Transform.dll"/> - <arg value="-target:library"/> - <arg value="-compressresources"/> - <arg value="-out:${project.build.directory}/${project.build.finalName}.dll"/> - <arg value="-recurse:${project.build.directory}\*.class"/> - <arg value="${project.build.directory}/dependency/tika-app.jar"/> + <arg value="-nowarn:0100" /> + <arg value="-nowarn:0105" /> + <arg value="-nowarn:0109" /> + <arg value="-nowarn:0111" /> + <arg value="-nowarn:0112" /> + <arg value="-reference:${ikvm}/bin/IKVM.OpenJDK.Charsets.dll" /> + <arg value="-reference:${ikvm}/bin/IKVM.OpenJDK.Core.dll" /> + <arg value="-reference:${ikvm}/bin/IKVM.OpenJDK.Text.dll" /> + <arg value="-reference:${ikvm}/bin/IKVM.OpenJDK.Util.dll" /> + <arg 
value="-reference:${ikvm}/bin/IKVM.OpenJDK.XML.API.dll" /> + <arg value="-reference:${ikvm}/bin/IKVM.OpenJDK.XML.Transform.dll" /> + <arg value="-target:library" /> + <arg value="-compressresources" /> + <arg value="-out:${project.build.directory}/${project.build.finalName}.dll" /> + <arg value="-recurse:${project.build.directory}\*.class" /> + <arg value="${project.build.directory}/dependency/tika-app.jar" /> </exec> </target> </configuration> @@ -171,20 +170,20 @@ <description>A .NET port of Tika functionality.</description> <organization> - <name>The Apache Software Foundation</name> - <url>http://www.apache.org</url> + <name>The Apache Software Foundation</name> + <url>http://www.apache.org</url> </organization> <scm> - <url>http://svn.apache.org/viewvc/tika/trunk/tika-dotnet</url> - <connection>scm:svn:http://svn.apache.org/repos/asf/tika/trunk/tika-dotnet</connection> - <developerConnection>scm:svn:https://svn.apache.org/repos/asf/tika/trunk/tika-dotnet</developerConnection> + <url>http://svn.apache.org/viewvc/tika/tags/1.5/tika-dotnet</url> + <connection>scm:svn:http://svn.apache.org/repos/asf/tika/tags/1.5/tika-dotnet</connection> + <developerConnection>scm:svn:https://svn.apache.org/repos/asf/tika/tags/1.5/tika-dotnet</developerConnection> </scm> <issueManagement> - <system>JIRA</system> - <url>https://issues.apache.org/jira/browse/TIKA</url> + <system>JIRA</system> + <url>https://issues.apache.org/jira/browse/TIKA</url> </issueManagement> <ciManagement> - <system>Jenkins</system> - <url>https://builds.apache.org/job/Tika-trunk/</url> + <system>Jenkins</system> + <url>https://builds.apache.org/job/Tika-trunk/</url> </ciManagement> </project> diff --git a/tika-example/pom.xml b/tika-example/pom.xml deleted file mode 100644 index 9bd6afb..0000000 --- a/tika-example/pom.xml +++ /dev/null @@ -1,134 +0,0 @@ -<?xml version="1.0" encoding="UTF-8"?> - -<!-- - Licensed to the Apache Software Foundation (ASF) under one - or more contributor license agreements. 
See the NOTICE file - distributed with this work for additional information - regarding copyright ownership. The ASF licenses this file - to you under the Apache License, Version 2.0 (the - "License"); you may not use this file except in compliance - with the License. You may obtain a copy of the License at - - http://www.apache.org/licenses/LICENSE-2.0 - - Unless required by applicable law or agreed to in writing, - software distributed under the License is distributed on an - "AS IS" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY - KIND, either express or implied. See the License for the - specific language governing permissions and limitations - under the License. ---> - -<project xmlns="http://maven.apache.org/POM/4.0.0" xmlns:xsi="http://www.w3.org/2001/XMLSchema-instance" xsi:schemaLocation="http://maven.apache.org/POM/4.0.0 http://maven.apache.org/xsd/maven-4.0.0.xsd"> - <modelVersion>4.0.0</modelVersion> - - <parent> - <groupId>org.apache.tika</groupId> - <artifactId>tika-parent</artifactId> - <version>1.11</version> - <relativePath>../tika-parent/pom.xml</relativePath> - </parent> - - <artifactId>tika-example</artifactId> - <name>Apache Tika examples</name> - <url>http://tika.apache.org/</url> - - <!-- List of dependencies that we depend on for the examples. 
See the full list of Tika - modules and how to use them at http://mvnrepository.com/artifact/org.apache.tika.--> - <dependencies> - <dependency> - <groupId>org.apache.tika</groupId> - <artifactId>tika-app</artifactId> - <version>${project.version}</version> - <exclusions> - <exclusion> - <artifactId>tika-parsers</artifactId> - <groupId>org.apache.tika</groupId> - </exclusion> - </exclusions> - </dependency> - <dependency> - <groupId>org.apache.tika</groupId> - <artifactId>tika-parsers</artifactId> - <version>${project.version}</version> - </dependency> - <dependency> - <groupId>org.apache.tika</groupId> - <artifactId>tika-serialization</artifactId> - <version>${project.version}</version> - </dependency> - <dependency> - <groupId>org.apache.tika</groupId> - <artifactId>tika-translate</artifactId> - <version>${project.version}</version> - </dependency> - <dependency> - <groupId>org.apache.tika</groupId> - <artifactId>tika-core</artifactId> - <version>${project.version}</version> - <type>test-jar</type> - <scope>test</scope> - </dependency> - <dependency> - <groupId>org.apache.tika</groupId> - <artifactId>tika-parsers</artifactId> - <version>${project.version}</version> - <type>test-jar</type> - <scope>test</scope> - </dependency> - <dependency> - <groupId>javax.jcr</groupId> - <artifactId>jcr</artifactId> - <version>2.0</version> - </dependency> - <dependency> - <groupId>org.apache.jackrabbit</groupId> - <artifactId>jackrabbit-jcr-server</artifactId> - <version>2.3.6</version> - </dependency> - <dependency> - <groupId>org.apache.jackrabbit</groupId> - <artifactId>jackrabbit-core</artifactId> - <version>2.3.6</version> - </dependency> - <dependency> - <groupId>org.apache.lucene</groupId> - <artifactId>lucene-core</artifactId> - <version>3.5.0</version> - </dependency> - <dependency> - <groupId>commons-io</groupId> - <artifactId>commons-io</artifactId> - <version>${commons.io.version}</version> - </dependency> - <dependency> - <groupId>org.springframework</groupId> - 
<artifactId>spring-context</artifactId> - <version>3.0.2.RELEASE</version> - </dependency> - <dependency> - <groupId>junit</groupId> - <artifactId>junit</artifactId> - <scope>test</scope> - </dependency> - </dependencies> - - <description>This module contains examples of how to use Apache Tika.</description> - <organization> - <name>The Apache Software Foundation</name> - <url>http://www.apache.org</url> - </organization> - <scm> - <url>http://svn.apache.org/viewvc/tika/tags/1.11-rc1/tika-example</url> - <connection>scm:svn:http://svn.apache.org/repos/asf/tika/tags/1.11-rc1/tika-example</connection> - <developerConnection>scm:svn:https://svn.apache.org/repos/asf/tika/tags/1.11-rc1/tika-example</developerConnection> - </scm> - <issueManagement> - <system>JIRA</system> - <url>https://issues.apache.org/jira/browse/TIKA</url> - </issueManagement> - <ciManagement> - <system>Jenkins</system> - <url>https://builds.apache.org/job/Tika-trunk/</url> - </ciManagement> -</project> diff --git a/tika-example/src/main/java/org/apache/tika/example/AdvancedTypeDetector.java b/tika-example/src/main/java/org/apache/tika/example/AdvancedTypeDetector.java deleted file mode 100755 index d78199f..0000000 --- a/tika-example/src/main/java/org/apache/tika/example/AdvancedTypeDetector.java +++ /dev/null @@ -1,56 +0,0 @@ -/* - * Licensed to the Apache Software Foundation (ASF) under one or more - * contributor license agreements. See the NOTICE file distributed with - * this work for additional information regarding copyright ownership. - * The ASF licenses this file to You under the Apache License, Version 2.0 - * (the "License"); you may not use this file except in compliance with - * the License. 
You may obtain a copy of the License at - * - * http://www.apache.org/licenses/LICENSE-2.0 - * - * Unless required by applicable law or agreed to in writing, software - * distributed under the License is distributed on an "AS IS" BASIS, - * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. - * See the License for the specific language governing permissions and - * limitations under the License. - */ - -package org.apache.tika.example; - -import java.io.InputStream; - -import org.apache.tika.Tika; -import org.apache.tika.detect.CompositeDetector; -import org.apache.tika.detect.Detector; -import org.apache.tika.metadata.Metadata; -import org.apache.tika.mime.MediaType; -import org.apache.tika.mime.MimeTypesFactory; - -public class AdvancedTypeDetector { - public static String detectWithCustomConfig(String name) throws Exception { - String config = "/org/apache/tika/mime/tika-mimetypes.xml"; - Tika tika = new Tika(MimeTypesFactory.create(config)); - return tika.detect(name); - } - - public static String detectWithCustomDetector(String name) throws Exception { - String config = "/org/apache/tika/mime/tika-mimetypes.xml"; - Detector detector = MimeTypesFactory.create(config); - - Detector custom = new Detector() { - private static final long serialVersionUID = -5420638839201540749L; - - public MediaType detect(InputStream input, Metadata metadata) { - String type = metadata.get("my-custom-type-override"); - if (type != null) { - return MediaType.parse(type); - } else { - return MediaType.OCTET_STREAM; - } - } - }; - - Tika tika = new Tika(new CompositeDetector(custom, detector)); - return tika.detect(name); - } -} diff --git a/tika-example/src/main/java/org/apache/tika/example/ContentHandlerExample.java b/tika-example/src/main/java/org/apache/tika/example/ContentHandlerExample.java deleted file mode 100644 index 9021369..0000000 --- a/tika-example/src/main/java/org/apache/tika/example/ContentHandlerExample.java +++ /dev/null @@ -1,137 +0,0 @@ -/* 
- * Licensed to the Apache Software Foundation (ASF) under one or more - * contributor license agreements. See the NOTICE file distributed with - * this work for additional information regarding copyright ownership. - * The ASF licenses this file to You under the Apache License, Version 2.0 - * (the "License"); you may not use this file except in compliance with - * the License. You may obtain a copy of the License at - * - * http://www.apache.org/licenses/LICENSE-2.0 - * - * Unless required by applicable law or agreed to in writing, software - * distributed under the License is distributed on an "AS IS" BASIS, - * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. - * See the License for the specific language governing permissions and - * limitations under the License. - */ - -package org.apache.tika.example; - -import java.io.IOException; -import java.io.InputStream; -import java.util.ArrayList; -import java.util.List; - -import org.apache.tika.exception.TikaException; -import org.apache.tika.metadata.Metadata; -import org.apache.tika.parser.AutoDetectParser; -import org.apache.tika.sax.BodyContentHandler; -import org.apache.tika.sax.ContentHandlerDecorator; -import org.apache.tika.sax.ToXMLContentHandler; -import org.apache.tika.sax.XHTMLContentHandler; -import org.apache.tika.sax.xpath.Matcher; -import org.apache.tika.sax.xpath.MatchingContentHandler; -import org.apache.tika.sax.xpath.XPathParser; -import org.xml.sax.ContentHandler; -import org.xml.sax.SAXException; - -/** - * Examples of using different Content Handlers to - * get different parts of the file's contents - */ -public class ContentHandlerExample { - /** - * Example of extracting the plain text of the contents. 
- * Will return only the "body" part of the document - */ - public String parseToPlainText() throws IOException, SAXException, TikaException { - BodyContentHandler handler = new BodyContentHandler(); - - AutoDetectParser parser = new AutoDetectParser(); - Metadata metadata = new Metadata(); - try (InputStream stream = ContentHandlerExample.class.getResourceAsStream("test.doc")) { - parser.parse(stream, handler, metadata); - return handler.toString(); - } - } - - /** - * Example of extracting the contents as HTML, as a string. - */ - public String parseToHTML() throws IOException, SAXException, TikaException { - ContentHandler handler = new ToXMLContentHandler(); - - AutoDetectParser parser = new AutoDetectParser(); - Metadata metadata = new Metadata(); - try (InputStream stream = ContentHandlerExample.class.getResourceAsStream("test.doc")) { - parser.parse(stream, handler, metadata); - return handler.toString(); - } - } - - /** - * Example of extracting just the body as HTML, without the - * head part, as a string - */ - public String parseBodyToHTML() throws IOException, SAXException, TikaException { - ContentHandler handler = new BodyContentHandler( - new ToXMLContentHandler()); - - AutoDetectParser parser = new AutoDetectParser(); - Metadata metadata = new Metadata(); - try (InputStream stream = ContentHandlerExample.class.getResourceAsStream("test.doc")) { - parser.parse(stream, handler, metadata); - return handler.toString(); - } - } - - /** - * Example of extracting just one part of the document's body, - * as HTML as a string, excluding the rest - */ - public String parseOnePartToHTML() throws IOException, SAXException, TikaException { - // Only get things under html -> body -> div (class=header) - XPathParser xhtmlParser = new XPathParser("xhtml", XHTMLContentHandler.XHTML); - Matcher divContentMatcher = xhtmlParser.parse("/xhtml:html/xhtml:body/xhtml:div/descendant::node()"); - ContentHandler handler = new MatchingContentHandler( - new 
ToXMLContentHandler(), divContentMatcher); - - AutoDetectParser parser = new AutoDetectParser(); - Metadata metadata = new Metadata(); - try (InputStream stream = ContentHandlerExample.class.getResourceAsStream("test2.doc")) { - parser.parse(stream, handler, metadata); - return handler.toString(); - } - } - - protected final int MAXIMUM_TEXT_CHUNK_SIZE = 40; - - /** - * Example of extracting the plain text in chunks, with each chunk - * of no more than a certain maximum size - */ - public List<String> parseToPlainTextChunks() throws IOException, SAXException, TikaException { - final List<String> chunks = new ArrayList<>(); - chunks.add(""); - ContentHandlerDecorator handler = new ContentHandlerDecorator() { - @Override - public void characters(char[] ch, int start, int length) { - String lastChunk = chunks.get(chunks.size() - 1); - String thisStr = new String(ch, start, length); - - if (lastChunk.length() + length > MAXIMUM_TEXT_CHUNK_SIZE) { - chunks.add(thisStr); - } else { - chunks.set(chunks.size() - 1, lastChunk + thisStr); - } - } - }; - - AutoDetectParser parser = new AutoDetectParser(); - Metadata metadata = new Metadata(); - try (InputStream stream = ContentHandlerExample.class.getResourceAsStream("test2.doc")) { - parser.parse(stream, handler, metadata); - return chunks; - } - } -} diff --git a/tika-example/src/main/java/org/apache/tika/example/CustomMimeInfo.java b/tika-example/src/main/java/org/apache/tika/example/CustomMimeInfo.java deleted file mode 100755 index b6ed05c..0000000 --- a/tika-example/src/main/java/org/apache/tika/example/CustomMimeInfo.java +++ /dev/null @@ -1,49 +0,0 @@ -/* - * Licensed to the Apache Software Foundation (ASF) under one or more - * contributor license agreements. See the NOTICE file distributed with - * this work for additional information regarding copyright ownership. 
- * The ASF licenses this file to You under the Apache License, Version 2.0 - * (the "License"); you may not use this file except in compliance with - * the License. You may obtain a copy of the License at - * - * http://www.apache.org/licenses/LICENSE-2.0 - * - * Unless required by applicable law or agreed to in writing, software - * distributed under the License is distributed on an "AS IS" BASIS, - * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. - * See the License for the specific language governing permissions and - * limitations under the License. - */ - -package org.apache.tika.example; - -import java.net.URL; - -import org.apache.tika.Tika; -import org.apache.tika.detect.CompositeDetector; -import org.apache.tika.mime.MimeTypes; -import org.apache.tika.mime.MimeTypesFactory; - -public class CustomMimeInfo { - public static String customMimeInfo() throws Exception { - String path = "file:///path/to/prescription-type.xml"; - MimeTypes typeDatabase = MimeTypesFactory.create(new URL(path)); - Tika tika = new Tika(typeDatabase); - String type = tika.detect("/path/to/prescription.xpd"); - return type; - } - - public static String customCompositeDetector() throws Exception { - String path = "file:///path/to/prescription-type.xml"; - MimeTypes typeDatabase = MimeTypesFactory.create(new URL(path)); - Tika tika = new Tika(new CompositeDetector(typeDatabase, - new EncryptedPrescriptionDetector())); - String type = tika.detect("/path/to/tmp/prescription.xpd"); - return type; - } - - public static void main(String[] args) throws Exception { - System.out.println("customMimeInfo=" + customMimeInfo()); - System.out.println("customCompositeDetector=" + customCompositeDetector()); - } -} diff --git a/tika-example/src/main/java/org/apache/tika/example/DescribeMetadata.java b/tika-example/src/main/java/org/apache/tika/example/DescribeMetadata.java deleted file mode 100755 index 50f7840..0000000 --- 
a/tika-example/src/main/java/org/apache/tika/example/DescribeMetadata.java +++ /dev/null @@ -1,29 +0,0 @@ -/* - * Licensed to the Apache Software Foundation (ASF) under one or more - * contributor license agreements. See the NOTICE file distributed with - * this work for additional information regarding copyright ownership. - * The ASF licenses this file to You under the Apache License, Version 2.0 - * (the "License"); you may not use this file except in compliance with - * the License. You may obtain a copy of the License at - * - * http://www.apache.org/licenses/LICENSE-2.0 - * - * Unless required by applicable law or agreed to in writing, software - * distributed under the License is distributed on an "AS IS" BASIS, - * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. - * See the License for the specific language governing permissions and - * limitations under the License. - */ - -package org.apache.tika.example; - -import org.apache.tika.cli.TikaCLI; - -/** - * Print the supported Tika Metadata models and their fields. - */ -public class DescribeMetadata { - public static void main(String[] args) throws Exception { - TikaCLI.main(new String[]{"--list-met-models"}); - } -} diff --git a/tika-example/src/main/java/org/apache/tika/example/DirListParser.java b/tika-example/src/main/java/org/apache/tika/example/DirListParser.java deleted file mode 100755 index 76dcb2d..0000000 --- a/tika-example/src/main/java/org/apache/tika/example/DirListParser.java +++ /dev/null @@ -1,143 +0,0 @@ -/* - * Licensed to the Apache Software Foundation (ASF) under one or more - * contributor license agreements. See the NOTICE file distributed with - * this work for additional information regarding copyright ownership. - * The ASF licenses this file to You under the Apache License, Version 2.0 - * (the "License"); you may not use this file except in compliance with - * the License. 
You may obtain a copy of the License at - * - * http://www.apache.org/licenses/LICENSE-2.0 - * - * Unless required by applicable law or agreed to in writing, software - * distributed under the License is distributed on an "AS IS" BASIS, - * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. - * See the License for the specific language governing permissions and - * limitations under the License. - */ - -package org.apache.tika.example; - -import java.io.IOException; -import java.io.InputStream; -import java.util.Collections; -import java.util.HashSet; -import java.util.List; -import java.util.Set; - -import org.apache.commons.io.FileUtils; -import org.apache.tika.exception.TikaException; -import org.apache.tika.io.TikaInputStream; -import org.apache.tika.metadata.Metadata; -import org.apache.tika.mime.MediaType; -import org.apache.tika.parser.ParseContext; -import org.apache.tika.parser.Parser; -import org.apache.tika.sax.BodyContentHandler; -import org.xml.sax.ContentHandler; -import org.xml.sax.SAXException; - -import static java.nio.charset.StandardCharsets.UTF_8; - -/** - * Parses the output of /bin/ls and counts the number of files and the number of - * executables using Tika. 
- */ -public class DirListParser implements Parser { - - private static final long serialVersionUID = 2717930544410610735L; - - private static Set<MediaType> SUPPORTED_TYPES = new HashSet<>( - Collections.singletonList(MediaType.TEXT_PLAIN)); - - /* - * (non-Javadoc) - * - * @see org.apache.tika.parser.Parser#getSupportedTypes( - * org.apache.tika.parser.ParseContext) - */ - public Set<MediaType> getSupportedTypes(ParseContext context) { - return SUPPORTED_TYPES; - } - - /* - * (non-Javadoc) - * - * @see org.apache.tika.parser.Parser#parse(java.io.InputStream, - * org.xml.sax.ContentHandler, org.apache.tika.metadata.Metadata) - */ - public void parse(InputStream is, ContentHandler handler, Metadata metadata) - throws IOException, SAXException, TikaException { - this.parse(is, handler, metadata, new ParseContext()); - } - - /* - * (non-Javadoc) - * - * @see org.apache.tika.parser.Parser#parse(java.io.InputStream, - * org.xml.sax.ContentHandler, org.apache.tika.metadata.Metadata, - * org.apache.tika.parser.ParseContext) - */ - public void parse(InputStream is, ContentHandler handler, - Metadata metadata, ParseContext context) throws IOException, - SAXException, TikaException { - - List<String> lines = FileUtils.readLines(TikaInputStream.get(is).getFile(), UTF_8); - for (String line : lines) { - String[] fileToks = line.split("\\s+"); - if (fileToks.length < 8) - continue; - String filePermissions = fileToks[0]; - String numHardLinks = fileToks[1]; - String fileOwner = fileToks[2]; - String fileOwnerGroup = fileToks[3]; - String fileSize = fileToks[4]; - StringBuilder lastModDate = new StringBuilder(); - lastModDate.append(fileToks[5]); - lastModDate.append(" "); - lastModDate.append(fileToks[6]); - lastModDate.append(" "); - lastModDate.append(fileToks[7]); - StringBuilder fileName = new StringBuilder(); - for (int i = 8; i < fileToks.length; i++) { - fileName.append(fileToks[i]); - fileName.append(" "); - } - fileName.deleteCharAt(fileName.length() - 1); - 
this.addMetadata(metadata, filePermissions, numHardLinks, - fileOwner, fileOwnerGroup, fileSize, - lastModDate.toString(), fileName.toString()); - } - } - - public static void main(String[] args) throws IOException, SAXException, - TikaException { - DirListParser parser = new DirListParser(); - Metadata met = new Metadata(); - parser.parse(System.in, new BodyContentHandler(), met); - - System.out.println("Num files: " + met.getValues("Filename").length); - System.out.println("Num executables: " + met.get("NumExecutables")); - } - - private void addMetadata(Metadata metadata, String filePerms, - String numHardLinks, String fileOwner, String fileOwnerGroup, - String fileSize, String lastModDate, String fileName) { - metadata.add("FilePermissions", filePerms); - metadata.add("NumHardLinks", numHardLinks); - metadata.add("FileOwner", fileOwner); - metadata.add("FileOwnerGroup", fileOwnerGroup); - metadata.add("FileSize", fileSize); - metadata.add("LastModifiedDate", lastModDate); - metadata.add("Filename", fileName); - - if (filePerms.indexOf("x") != -1 && filePerms.indexOf("d") == -1) { - if (metadata.get("NumExecutables") != null) { - int numExecs = Integer.valueOf(metadata.get("NumExecutables")); - numExecs++; - metadata.set("NumExecutables", String.valueOf(numExecs)); - } else { - metadata.set("NumExecutables", "1"); - } - } - } - -} diff --git a/tika-example/src/main/java/org/apache/tika/example/DisplayMetInstance.java b/tika-example/src/main/java/org/apache/tika/example/DisplayMetInstance.java deleted file mode 100755 index ddeb339..0000000 --- a/tika-example/src/main/java/org/apache/tika/example/DisplayMetInstance.java +++ /dev/null @@ -1,46 +0,0 @@ -/* - * Licensed to the Apache Software Foundation (ASF) under one or more - * contributor license agreements. See the NOTICE file distributed with - * this work for additional information regarding copyright ownership. 
- * The ASF licenses this file to You under the Apache License, Version 2.0 - * (the "License"); you may not use this file except in compliance with - * the License. You may obtain a copy of the License at - * - * http://www.apache.org/licenses/LICENSE-2.0 - * - * Unless required by applicable law or agreed to in writing, software - * distributed under the License is distributed on an "AS IS" BASIS, - * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. - * See the License for the specific language governing permissions and - * limitations under the License. - */ - -package org.apache.tika.example; - -import java.io.IOException; -import java.net.URL; - -import org.apache.tika.exception.TikaException; -import org.apache.tika.metadata.Metadata; -import org.apache.tika.parser.ParseContext; -import org.apache.tika.parser.pdf.PDFParser; -import org.apache.tika.sax.BodyContentHandler; -import org.xml.sax.SAXException; - -/** - * Grabs a PDF file from a URL and prints its {@link Metadata} - */ -public class DisplayMetInstance { - public static Metadata getMet(URL url) throws IOException, SAXException, - TikaException { - Metadata met = new Metadata(); - PDFParser parser = new PDFParser(); - parser.parse(url.openStream(), new BodyContentHandler(), met, new ParseContext()); - return met; - } - - public static void main(String[] args) throws Exception { - Metadata met = DisplayMetInstance.getMet(new URL(args[0])); - System.out.println(met); - } -} diff --git a/tika-example/src/main/java/org/apache/tika/example/DumpTikaConfigExample.java b/tika-example/src/main/java/org/apache/tika/example/DumpTikaConfigExample.java deleted file mode 100644 index 0c51634..0000000 --- a/tika-example/src/main/java/org/apache/tika/example/DumpTikaConfigExample.java +++ /dev/null @@ -1,314 +0,0 @@ -/* - * Licensed to the Apache Software Foundation (ASF) under one or more - * contributor license agreements. 
See the NOTICE file distributed with
- * this work for additional information regarding copyright ownership.
- * The ASF licenses this file to You under the Apache License, Version 2.0
- * (the "License"); you may not use this file except in compliance with
- * the License. You may obtain a copy of the License at
- *
- *     http://www.apache.org/licenses/LICENSE-2.0
- *
- * Unless required by applicable law or agreed to in writing, software
- * distributed under the License is distributed on an "AS IS" BASIS,
- * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
- * See the License for the specific language governing permissions and
- * limitations under the License.
- */
-
-package org.apache.tika.example;
-
-import static java.nio.charset.StandardCharsets.UTF_8;
-
-import java.io.FileOutputStream;
-import java.io.OutputStreamWriter;
-import java.io.StringWriter;
-import java.io.Writer;
-import java.nio.charset.Charset;
-import java.util.Collections;
-import java.util.List;
-import java.util.Set;
-import java.util.TreeSet;
-
-import javax.xml.parsers.DocumentBuilder;
-import javax.xml.parsers.DocumentBuilderFactory;
-import javax.xml.transform.OutputKeys;
-import javax.xml.transform.Transformer;
-import javax.xml.transform.TransformerFactory;
-import javax.xml.transform.dom.DOMSource;
-import javax.xml.transform.stream.StreamResult;
-
-import org.apache.tika.config.LoadErrorHandler;
-import org.apache.tika.config.ServiceLoader;
-import org.apache.tika.config.TikaConfig;
-import org.apache.tika.detect.CompositeDetector;
-import org.apache.tika.detect.DefaultDetector;
-import org.apache.tika.detect.Detector;
-import org.apache.tika.language.translate.DefaultTranslator;
-import org.apache.tika.language.translate.Translator;
-import org.apache.tika.mime.MediaType;
-import org.apache.tika.parser.CompositeParser;
-import org.apache.tika.parser.DefaultParser;
-import org.apache.tika.parser.ParseContext;
-import org.apache.tika.parser.Parser;
-import org.apache.tika.parser.ParserDecorator;
-import org.w3c.dom.Document;
-import org.w3c.dom.Element;
-import org.w3c.dom.Node;
-
-
-/**
- * This class shows how to dump a TikaConfig object to a configuration file.
- * This allows users to easily dump the default TikaConfig as a base from which
- * to start if they want to modify the default configuration file.
- * <p>
- * For those who want to modify the mimes file, take a look at
- * tika-core/src/main/resources/org/apache/tika/mime/tika-mimetypes.xml
- * for inspiration. Consider adding org/apache/tika/mime/custom-mimetypes.xml
- * for your custom mime types.
- */
-public class DumpTikaConfigExample {
-    /**
-     * @param config config file to dump
-     * @param writer writer to which to write
-     * @throws Exception
-     */
-    public void dump(TikaConfig config, Mode mode, Writer writer, String encoding) throws Exception {
-        DocumentBuilderFactory docFactory = DocumentBuilderFactory.newInstance();
-        DocumentBuilder docBuilder = docFactory.newDocumentBuilder();
-
-        // root elements
-        Document doc = docBuilder.newDocument();
-        Element rootElement = doc.createElement("properties");
-
-        doc.appendChild(rootElement);
-        addMimeComment(mode, rootElement, doc);
-        addServiceLoader(mode, rootElement, doc, config);
-        addTranslator(mode, rootElement, doc, config);
-        addDetectors(mode, rootElement, doc, config);
-        addParsers(mode, rootElement, doc, config);
-        // TODO Service Loader section
-
-        // now write
-        TransformerFactory transformerFactory = TransformerFactory.newInstance();
-        Transformer transformer = transformerFactory.newTransformer();
-        transformer.setOutputProperty(OutputKeys.INDENT, "yes");
-        transformer.setOutputProperty("{http://xml.apache.org/xslt}indent-amount", "2");
-        transformer.setOutputProperty(OutputKeys.ENCODING, encoding);
-        DOMSource source = new DOMSource(doc);
-        StreamResult result = new StreamResult(writer);
-
-        transformer.transform(source, result);
-    }
-
-    private void addServiceLoader(Mode mode, Element rootElement, Document doc, TikaConfig config) {
-        ServiceLoader loader = config.getServiceLoader();
-
-        if (mode == Mode.MINIMAL) {
-            // Is this the default?
-            if (loader.isDynamic() && loader.getLoadErrorHandler() == LoadErrorHandler.IGNORE) {
-                // Default config, no need to output anything
-                return;
-            }
-        }
-
-        Element dslEl = doc.createElement("service-loader");
-        dslEl.setAttribute("dynamic", Boolean.toString(loader.isDynamic()));
-        dslEl.setAttribute("loadErrorHandler", loader.getLoadErrorHandler().toString());
-        rootElement.appendChild(dslEl);
-    }
-
-    private void addTranslator(Mode mode, Element rootElement, Document doc, TikaConfig config) {
-        // Unlike the other entries, TikaConfig only wants one of
-        // these, and no outer <translators> list
-        Translator translator = config.getTranslator();
-        if (mode == Mode.MINIMAL && translator instanceof DefaultTranslator) {
-            Node mimeComment = doc.createComment(
-                    "for example: <translator class=\"org.apache.tika.language.translate.GoogleTranslator\"/>");
-            rootElement.appendChild(mimeComment);
-        } else {
-            if (translator instanceof DefaultTranslator && mode == Mode.STATIC) {
-                translator = ((DefaultTranslator)translator).getTranslator();
-            }
-            if (translator != null) {
-                Element translatorElement = doc.createElement("translator");
-                translatorElement.setAttribute("class", translator.getClass().getCanonicalName());
-                rootElement.appendChild(translatorElement);
-            } else {
-                rootElement.appendChild(doc.createComment("No translators available"));
-            }
-        }
-    }
-
-    private void addMimeComment(Mode mode, Element rootElement, Document doc) {
-        Node mimeComment = doc.createComment(
-                "for example: <mimeTypeRepository resource=\"/org/apache/tika/mime/tika-mimetypes.xml\"/>");
-        rootElement.appendChild(mimeComment);
-    }
-
-    private void addDetectors(Mode mode, Element rootElement, Document doc, TikaConfig config) throws Exception {
-        Detector detector = config.getDetector();
-
-        if (mode == Mode.MINIMAL && detector instanceof DefaultDetector) {
-            // Don't output anything, all using defaults
-            Node detComment = doc.createComment(
-                    "for example: <detectors><detector class=\"org.apache.tika.detector.MimeTypes\"></detectors>");
-            rootElement.appendChild(detComment);
-            return;
-        }
-
-        Element detectorsElement = doc.createElement("detectors");
-        if (mode == Mode.CURRENT && detector instanceof DefaultDetector ||
-                ! (detector instanceof CompositeDetector)) {
-            Element detectorElement = doc.createElement("detector");
-            detectorElement.setAttribute("class", detector.getClass().getCanonicalName());
-            detectorsElement.appendChild(detectorElement);
-        } else {
-            List<Detector> children = ((CompositeDetector)detector).getDetectors();
-            for (Detector d : children) {
-                Element detectorElement = doc.createElement("detector");
-                detectorElement.setAttribute("class", d.getClass().getCanonicalName());
-                detectorsElement.appendChild(detectorElement);
-            }
-        }
-        rootElement.appendChild(detectorsElement);
-    }
-
-    private void addParsers(Mode mode, Element rootElement, Document doc, TikaConfig config) throws Exception {
-        Parser parser = config.getParser();
-        if (mode == Mode.MINIMAL && parser instanceof DefaultParser) {
-            // Don't output anything, all using defaults
-            return;
-        } else if (mode == Mode.MINIMAL) {
-            mode = Mode.CURRENT;
-        }
-
-        Element parsersElement = doc.createElement("parsers");
-        rootElement.appendChild(parsersElement);
-
-        addParser(mode, parsersElement, doc, parser);
-    }
-    private void addParser(Mode mode, Element rootElement, Document doc, Parser parser) throws Exception {
-        // If the parser is decorated, is it a kind where we output the parser inside?
-        ParserDecorator decoration = null;
-        if (parser instanceof ParserDecorator) {
-            if (parser.getClass().getName().startsWith(ParserDecorator.class.getName()+"$")) {
-                decoration = ((ParserDecorator)parser);
-                parser = decoration.getWrappedParser();
-            }
-        }
-
-        boolean outputParser = true;
-        List<Parser> children = Collections.emptyList();
-        if (mode == Mode.CURRENT && parser instanceof DefaultParser) {
-            // Only output the parser, not the children
-        } else if (parser instanceof CompositeParser) {
-            children = ((CompositeParser)parser).getAllComponentParsers();
-            // Special case for a naked composite
-            if (parser.getClass().equals(CompositeParser.class)) {
-                outputParser = false;
-            }
-            // Special case for making Default to static
-            if (mode == Mode.STATIC && parser instanceof DefaultParser) {
-                outputParser = false;
-            }
-        }
-
-        if (outputParser) {
-            rootElement = addParser(rootElement, doc, parser, decoration);
-        }
-        for (Parser childParser : children) {
-            addParser(mode, rootElement, doc, childParser);
-        }
-        // TODO Parser Exclusions
-    }
-    private Element addParser(Element rootElement, Document doc, Parser parser, ParserDecorator decorator) throws Exception {
-        ParseContext context = new ParseContext();
-
-        Set<MediaType> addedTypes = new TreeSet<>();
-        Set<MediaType> excludedTypes = new TreeSet<>();
-        if (decorator != null) {
-            Set<MediaType> types = new TreeSet<>();
-            types.addAll(decorator.getSupportedTypes(context));
-            addedTypes.addAll(types);
-
-            for (MediaType type : parser.getSupportedTypes(context)) {
-                if (! types.contains(type)) {
-                    excludedTypes.add(type);
-                }
-                addedTypes.remove(type);
-            }
-        }
-
-        String className = parser.getClass().getCanonicalName();
-        Element parserElement = doc.createElement("parser");
-        parserElement.setAttribute("class", className);
-        rootElement.appendChild(parserElement);
-
-        for (MediaType type : addedTypes) {
-            Element mimeElement = doc.createElement("mime");
-            mimeElement.appendChild(doc.createTextNode(type.toString()));
-            parserElement.appendChild(mimeElement);
-        }
-        for (MediaType type : excludedTypes) {
-            Element mimeElement = doc.createElement("mime-exclude");
-            mimeElement.appendChild(doc.createTextNode(type.toString()));
-            parserElement.appendChild(mimeElement);
-        }
-
-        return parserElement;
-    }
-
-    /**
-     * @param args outputFile, outputEncoding, if args is empty, this prints to console
-     * @throws Exception
-     */
-    public static void main(String[] args) throws Exception {
-        Charset encoding = UTF_8;
-        Mode mode = Mode.CURRENT;
-        String filename = null;
-
-        for (String arg : args) {
-            if (arg.startsWith("-")) {
-                if (arg.contains("-dump-minimal")) {
-                    mode = Mode.MINIMAL;
-                } else if (arg.contains("-dump-current")) {
-                    mode = Mode.CURRENT;
-                } else if (arg.contains("-dump-static")) {
-                    mode = Mode.STATIC;
-                } else {
-                    System.out.println("Use:");
-                    System.out.println(" DumpTikaConfig [--dump-minimal] [--dump-current] [--dump-static] [filename] [encoding]");
-                    System.out.println("");
-                    System.out.println("--dump-minimal Produce the minimal config file");
-                    System.out.println("--dump-current The current (with defaults) config file");
-                    System.out.println("--dump-static Convert dynamic parts to static");
-                    return;
-                }
-            } else if (filename == null) {
-                filename = arg;
-            } else {
-                encoding = Charset.forName(arg);
-            }
-        }
-
-        Writer writer = null;
-        if (filename != null) {
-            writer = new OutputStreamWriter(new FileOutputStream(filename), encoding);
-        } else {
-            writer = new StringWriter();
-        }
-
-        DumpTikaConfigExample ex = new DumpTikaConfigExample();
-        ex.dump(TikaConfig.getDefaultConfig(), mode, writer, encoding.name());
-
-        writer.flush();
-
-        if (writer instanceof StringWriter) {
-            System.out.println(writer.toString());
-        }
-        writer.close();
-    }
-    protected enum Mode {
-        MINIMAL, CURRENT, STATIC;
-    }
-}
diff --git a/tika-example/src/main/java/org/apache/tika/example/EncryptedPrescriptionDetector.java b/tika-example/src/main/java/org/apache/tika/example/EncryptedPrescriptionDetector.java
deleted file mode 100755
index 9f28a78..0000000
--- a/tika-example/src/main/java/org/apache/tika/example/EncryptedPrescriptionDetector.java
+++ /dev/null
@@ -1,59 +0,0 @@
-/*
- * Licensed to the Apache Software Foundation (ASF) under one or more
- * contributor license agreements. See the NOTICE file distributed with
- * this work for additional information regarding copyright ownership.
- * The ASF licenses this file to You under the Apache License, Version 2.0
- * (the "License"); you may not use this file except in compliance with
- * the License. You may obtain a copy of the License at
- *
- *     http://www.apache.org/licenses/LICENSE-2.0
- *
- * Unless required by applicable law or agreed to in writing, software
- * distributed under the License is distributed on an "AS IS" BASIS,
- * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
- * See the License for the specific language governing permissions and
- * limitations under the License.
- */
-
-package org.apache.tika.example;
-
-import java.io.IOException;
-import java.io.InputStream;
-import java.security.GeneralSecurityException;
-import java.security.Key;
-import javax.crypto.Cipher;
-import javax.crypto.CipherInputStream;
-import javax.xml.namespace.QName;
-
-import org.apache.tika.detect.Detector;
-import org.apache.tika.detect.XmlRootExtractor;
-import org.apache.tika.io.LookaheadInputStream;
-import org.apache.tika.metadata.Metadata;
-import org.apache.tika.mime.MediaType;
-
-public class EncryptedPrescriptionDetector implements Detector {
-    private static final long serialVersionUID = -1709652690773421147L;
-
-    public MediaType detect(InputStream stream, Metadata metadata)
-            throws IOException {
-        Key key = Pharmacy.getKey();
-        MediaType type = MediaType.OCTET_STREAM;
-
-        try (InputStream lookahead = new LookaheadInputStream(stream, 1024)) {
-            Cipher cipher = Cipher.getInstance("RSA");
-
-            cipher.init(Cipher.DECRYPT_MODE, key);
-            InputStream decrypted = new CipherInputStream(lookahead, cipher);
-
-            QName name = new XmlRootExtractor().extractRootElement(decrypted);
-            if (name != null
-                    && "http://example.com/xpd".equals(name.getNamespaceURI())
-                    && "prescription".equals(name.getLocalPart())) {
-                type = MediaType.application("x-prescription");
-            }
-        } catch (GeneralSecurityException e) {
-            // unable to decrypt, fall through
-        }
-        return type;
-    }
-}
diff --git a/tika-example/src/main/java/org/apache/tika/example/EncryptedPrescriptionParser.java b/tika-example/src/main/java/org/apache/tika/example/EncryptedPrescriptionParser.java
deleted file mode 100755
index 8fda1f9..0000000
--- a/tika-example/src/main/java/org/apache/tika/example/EncryptedPrescriptionParser.java
+++ /dev/null
@@ -1,60 +0,0 @@
-/*
- * Licensed to the Apache Software Foundation (ASF) under one or more
- * contributor license agreements. See the NOTICE file distributed with
- * this work for additional information regarding copyright ownership.
- * The ASF licenses this file to You under the Apache License, Version 2.0
- * (the "License"); you may not use this file except in compliance with
- * the License. You may obtain a copy of the License at
- *
- *     http://www.apache.org/licenses/LICENSE-2.0
- *
- * Unless required by applicable law or agreed to in writing, software
- * distributed under the License is distributed on an "AS IS" BASIS,
- * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
- * See the License for the specific language governing permissions and
- * limitations under the License.
- */
-
-package org.apache.tika.example;
-
-import java.io.IOException;
-import java.io.InputStream;
-import java.security.GeneralSecurityException;
-import java.security.Key;
-import java.util.Collections;
-import java.util.Set;
-import javax.crypto.Cipher;
-import javax.crypto.CipherInputStream;
-
-import org.apache.tika.exception.TikaException;
-import org.apache.tika.metadata.Metadata;
-import org.apache.tika.mime.MediaType;
-import org.apache.tika.parser.AbstractParser;
-import org.apache.tika.parser.ParseContext;
-import org.xml.sax.ContentHandler;
-import org.xml.sax.SAXException;
-
-public class EncryptedPrescriptionParser extends AbstractParser {
-    private static final long serialVersionUID = -7816987249611278541L;
-
-    public void parse(InputStream stream, ContentHandler handler,
-            Metadata metadata, ParseContext context) throws IOException,
-            SAXException, TikaException {
-        try {
-            Key key = Pharmacy.getKey();
-            Cipher cipher = Cipher.getInstance("RSA");
-            cipher.init(Cipher.DECRYPT_MODE, key);
-            InputStream decrypted = new CipherInputStream(stream, cipher);
-
-            new PrescriptionParser().parse(decrypted, handler, metadata,
-                    context);
-        } catch (GeneralSecurityException e) {
-            throw new TikaException("Unable to decrypt a digital prescription",
-                    e);
-        }
-    }
-
-    public Set<MediaType> getSupportedTypes(ParseContext context) {
-        return Collections.singleton(MediaType.application("x-prescription"));
-    }
-}
diff --git a/tika-example/src/main/java/org/apache/tika/example/ExtractEmbeddedFiles.java b/tika-example/src/main/java/org/apache/tika/example/ExtractEmbeddedFiles.java
deleted file mode 100644
index fe35bcb..0000000
--- a/tika-example/src/main/java/org/apache/tika/example/ExtractEmbeddedFiles.java
+++ /dev/null
@@ -1,106 +0,0 @@
-/*
- * Licensed to the Apache Software Foundation (ASF) under one or more
- * contributor license agreements. See the NOTICE file distributed with
- * this work for additional information regarding copyright ownership.
- * The ASF licenses this file to You under the Apache License, Version 2.0
- * (the "License"); you may not use this file except in compliance with
- * the License. You may obtain a copy of the License at
- *
- *     http://www.apache.org/licenses/LICENSE-2.0
- *
- * Unless required by applicable law or agreed to in writing, software
- * distributed under the License is distributed on an "AS IS" BASIS,
- * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
- * See the License for the specific language governing permissions and
- * limitations under the License.
- */
-
-package org.apache.tika.example;
-
-import java.io.IOException;
-import java.io.InputStream;
-import java.nio.file.Files;
-import java.nio.file.Path;
-
-import org.apache.commons.io.FilenameUtils;
-import org.apache.tika.config.TikaConfig;
-import org.apache.tika.detect.Detector;
-import org.apache.tika.exception.TikaException;
-import org.apache.tika.extractor.EmbeddedDocumentExtractor;
-import org.apache.tika.extractor.ParsingEmbeddedDocumentExtractor;
-import org.apache.tika.metadata.Metadata;
-import org.apache.tika.mime.MediaType;
-import org.apache.tika.mime.MimeTypeException;
-import org.apache.tika.parser.AutoDetectParser;
-import org.apache.tika.parser.ParseContext;
-import org.apache.tika.parser.Parser;
-import org.apache.tika.sax.BodyContentHandler;
-import org.xml.sax.ContentHandler;
-import org.xml.sax.SAXException;
-
-public class ExtractEmbeddedFiles {
-    private Parser parser = new AutoDetectParser();
-    private Detector detector = ((AutoDetectParser) parser).getDetector();
-    private TikaConfig config = TikaConfig.getDefaultConfig();
-
-    public void extract(InputStream is, Path outputDir) throws SAXException, TikaException, IOException {
-        Metadata m = new Metadata();
-        ParseContext c = new ParseContext();
-        ContentHandler h = new BodyContentHandler(-1);
-
-        c.set(Parser.class, parser);
-        EmbeddedDocumentExtractor ex = new MyEmbeddedDocumentExtractor(outputDir, c);
-        c.set(EmbeddedDocumentExtractor.class, ex);
-
-        parser.parse(is, h, m, c);
-    }
-
-    private class MyEmbeddedDocumentExtractor extends ParsingEmbeddedDocumentExtractor {
-        private final Path outputDir;
-        private int fileCount = 0;
-
-        private MyEmbeddedDocumentExtractor(Path outputDir, ParseContext context) {
-            super(context);
-            this.outputDir = outputDir;
-        }
-
-        @Override
-        public boolean shouldParseEmbedded(Metadata metadata) {
-            return true;
-        }
-
-        @Override
-        public void parseEmbedded(InputStream stream, ContentHandler handler, Metadata metadata, boolean outputHtml)
-                throws SAXException, IOException {
-
-            //try to get the name of the embedded file from the metadata
-            String name = metadata.get(Metadata.RESOURCE_NAME_KEY);
-
-            if (name == null) {
-                name = "file_" + fileCount++;
-            } else {
-                //make sure to select only the file name (not any directory paths
-                //that might be included in the name) and make sure
-                //to normalize the name
-                name = FilenameUtils.normalize(FilenameUtils.getName(name));
-            }
-
-            //now try to figure out the right extension for the embedded file
-            MediaType contentType = detector.detect(stream, metadata);
-
-            if (name.indexOf('.') == -1 && contentType != null) {
-                try {
-                    name += config.getMimeRepository().forName(
-                            contentType.toString()).getExtension();
-                } catch (MimeTypeException e) {
-                    e.printStackTrace();
-                }
-            }
-            //should add check to make sure that you aren't overwriting a file
-            Path outputFile = outputDir.resolve(name);
-            //do a better job than this of checking
-            Files.createDirectories(outputFile.getParent());
-            Files.copy(stream, outputFile);
-        }
-    }
-}
diff --git a/tika-example/src/main/java/org/apache/tika/example/GrabPhoneNumbersExample.java b/tika-example/src/main/java/org/apache/tika/example/GrabPhoneNumbersExample.java
deleted file mode 100644
index e63ed9b..0000000
--- a/tika-example/src/main/java/org/apache/tika/example/GrabPhoneNumbersExample.java
+++ /dev/null
@@ -1,103 +0,0 @@
-/*
- * Licensed to the Apache Software Foundation (ASF) under one or more
- * contributor license agreements. See the NOTICE file distributed with
- * this work for additional information regarding copyright ownership.
- * The ASF licenses this file to You under the Apache License, Version 2.0
- * (the "License"); you may not use this file except in compliance with
- * the License. You may obtain a copy of the License at
- *
- *     http://www.apache.org/licenses/LICENSE-2.0
- *
- * Unless required by applicable law or agreed to in writing, software
- * distributed under the License is distributed on an "AS IS" BASIS,
- * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
- * See the License for the specific language governing permissions and
- * limitations under the License.
- */
-
-package org.apache.tika.example;
-
-import java.io.BufferedInputStream;
-import java.io.IOException;
-import java.io.InputStream;
-import java.nio.file.FileVisitResult;
-import java.nio.file.Files;
-import java.nio.file.Path;
-import java.nio.file.Paths;
-import java.nio.file.SimpleFileVisitor;
-import java.nio.file.attribute.BasicFileAttributes;
-import java.util.Collections;
-import java.util.HashSet;
-
-import org.apache.tika.metadata.Metadata;
-import org.apache.tika.parser.AutoDetectParser;
-import org.apache.tika.parser.ParseContext;
-import org.apache.tika.parser.Parser;
-import org.apache.tika.sax.BodyContentHandler;
-import org.apache.tika.sax.PhoneExtractingContentHandler;
-
-/**
- * Class to demonstrate how to use the {@link org.apache.tika.sax.PhoneExtractingContentHandler}
- * to get a list of all of the phone numbers from every file in a directory.
- * <p>
- * You can run this main method by running
- * <code>
- * mvn exec:java -Dexec.mainClass="org.apache.tika.example.GrabPhoneNumbersExample" -Dexec.args="/path/to/directory"
- * </code>
- * from the tika-example directory.
- */
-public class GrabPhoneNumbersExample {
-    private static HashSet<String> phoneNumbers = new HashSet<>();
-    private static int failedFiles, successfulFiles = 0;
-
-    public static void main(String[] args) {
-        if (args.length != 1) {
-            System.err.println("Usage `java GrabPhoneNumbers [corpus]");
-            return;
-        }
-        Path folder = Paths.get(args[0]);
-        System.out.println("Searching " + folder.toAbsolutePath() + "...");
-        processFolder(folder);
-        System.out.println(phoneNumbers.toString());
-        System.out.println("Parsed " + successfulFiles + "/" + (successfulFiles + failedFiles));
-    }
-
-    public static void processFolder(Path folder) {
-        try {
-            Files.walkFileTree(folder, new SimpleFileVisitor<Path>() {
-                @Override
-                public FileVisitResult visitFile(Path file, BasicFileAttributes attrs) throws IOException {
-                    try {
-                        process(file);
-                        successfulFiles++;
-                    } catch (Exception e) {
-                        failedFiles++;
-                        // ignore this file
-                    }
-                    return FileVisitResult.CONTINUE;
-                }
-
-                @Override
-                public FileVisitResult visitFileFailed(Path file, IOException exc) throws IOException {
-                    failedFiles++;
-                    return FileVisitResult.CONTINUE;
-                }
-            });
-        } catch (IOException e) {
-            // ignore failure
-        }
-    }
-
-    public static void process(Path path) throws Exception {
-        Parser parser = new AutoDetectParser();
-        Metadata metadata = new Metadata();
-        // The PhoneExtractingContentHandler will examine any characters for phone numbers before passing them
-        // to the underlying Handler.
-        PhoneExtractingContentHandler handler = new PhoneExtractingContentHandler(new BodyContentHandler(), metadata);
-        try (InputStream stream = new BufferedInputStream(Files.newInputStream(path))) {
-            parser.parse(stream, handler, metadata, new ParseContext());
-        }
-        String[] numbers = metadata.getValues("phonenumbers");
-        Collections.addAll(phoneNumbers, numbers);
-    }
-}
diff --git a/tika-example/src/main/java/org/apache/tika/example/ImportContextImpl.java b/tika-example/src/main/java/org/apache/tika/example/ImportContextImpl.java
deleted file mode 100755
index 5cbd6e8..0000000
--- a/tika-example/src/main/java/org/apache/tika/example/ImportContextImpl.java
+++ /dev/null
@@ -1,235 +0,0 @@
-/*
- * Licensed to the Apache Software Foundation (ASF) under one or more
- * contributor license agreements. See the NOTICE file distributed with
- * this work for additional information regarding copyright ownership.
- * The ASF licenses this file to You under the Apache License, Version 2.0
- * (the "License"); you may not use this file except in compliance with
- * the License. You may obtain a copy of the License at
- *
- *     http://www.apache.org/licenses/LICENSE-2.0
- *
- * Unless required by applicable law or agreed to in writing, software
- * distributed under the License is distributed on an "AS IS" BASIS,
- * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
- * See the License for the specific language governing permissions and
- * limitations under the License.
- */
-
-package org.apache.tika.example;
-
-import java.io.BufferedInputStream;
-import java.io.File;
-import java.io.FileInputStream;
-import java.io.IOException;
-import java.io.InputStream;
-import java.util.Date;
-import javax.jcr.Item;
-
-import org.apache.jackrabbit.server.io.DefaultIOListener;
-import org.apache.jackrabbit.server.io.IOListener;
-import org.apache.jackrabbit.server.io.IOUtil;
-import org.apache.jackrabbit.server.io.ImportContext;
-import org.apache.jackrabbit.webdav.io.InputContext;
-import org.apache.tika.detect.Detector;
-import org.apache.tika.metadata.Metadata;
-import org.apache.tika.mime.MediaType;
-import org.slf4j.Logger;
-import org.slf4j.LoggerFactory;
-
-/**
- * <code>ImportContextImpl</code>...
- */
-public class ImportContextImpl implements ImportContext {
-    private static Logger log = LoggerFactory.getLogger(ImportContextImpl.class);
-
-    private final IOListener ioListener;
-    private final Item importRoot;
-    private final String systemId;
-    private final File inputFile;
-
-    private InputContext inputCtx;
-    private boolean completed;
-
-    private final Detector detector;
-
-    private final MediaType type;
-
-    /**
-     * Creates a new item import context. The specified InputStream is written
-     * to a temporary file in order to avoid problems with multiple IOHandlers
-     * that try to run the import but fail. The temporary file is deleted as
-     * soon as this context is informed that the import has been completed and
-     * it will not be used any more.
-     *
-     * @param importRoot
-     * @param systemId
-     * @param ctx input context, or <code>null</code>
-     * @param stream document input stream, or <code>null</code>
-     * @param ioListener
-     * @param detector content type detector
-     * @throws IOException
-     * @see ImportContext#informCompleted(boolean)
-     */
-    public ImportContextImpl(Item importRoot, String systemId,
-            InputContext ctx, InputStream stream, IOListener ioListener,
-            Detector detector) throws IOException {
-        this.importRoot = importRoot;
-        this.systemId = systemId;
-        this.inputCtx = ctx;
-        this.ioListener = (ioListener != null) ? ioListener
-                : new DefaultIOListener(log);
-        this.detector = detector;
-
-        Metadata metadata = new Metadata();
-        if (ctx != null && ctx.getContentType() != null) {
-            metadata.set(Metadata.CONTENT_TYPE, ctx.getContentType());
-        }
-        if (systemId != null) {
-            metadata.set(Metadata.RESOURCE_NAME_KEY, systemId);
-        }
-        if (stream != null && !stream.markSupported()) {
-            stream = new BufferedInputStream(stream);
-        }
-        type = detector.detect(stream, metadata);
-
-        this.inputFile = IOUtil.getTempFile(stream);
-    }
-
-    /**
-     * @see ImportContext#getIOListener()
-     */
-    public IOListener getIOListener() {
-        return ioListener;
-    }
-
-    /**
-     * @see ImportContext#getImportRoot()
-     */
-    public Item getImportRoot() {
-        return importRoot;
-    }
-
-    /**
-     * @see ImportContext#getDetector()
-     */
-    public Detector getDetector() {
-        return detector;
-    }
-
-    /**
-     * @see ImportContext#hasStream()
-     */
-    public boolean hasStream() {
-        return inputFile != null;
-    }
-
-    /**
-     * Returns a new <code>InputStream</code> to the temporary file created
-     * during instanciation or <code>null</code>, if this context does not
-     * provide a stream.
-     *
-     * @see ImportContext#getInputStream()
-     * @see #hasStream()
-     */
-    public InputStream getInputStream() {
-        checkCompleted();
-        InputStream in = null;
-        if (inputFile != null) {
-            try {
-                in = new FileInputStream(inputFile);
-            } catch (IOException e) {
-                // unexpected error... ignore and return null
-            }
-        }
-        return in;
-    }
-
-    /**
-     * @see ImportContext#getSystemId()
-     */
-    public String getSystemId() {
-        return systemId;
-    }
-
-    /**
-     * @see ImportContext#getModificationTime()
-     */
-    public long getModificationTime() {
-        return (inputCtx != null) ? inputCtx.getModificationTime() : new Date().getTime();
-    }
-
-    /**
-     * @see ImportContext#getContentLanguage()
-     */
-    public String getContentLanguage() {
-        return (inputCtx != null) ? inputCtx.getContentLanguage() : null;
-    }
-
-    /**
-     * @see ImportContext#getContentLength()
-     */
-    public long getContentLength() {
-        long length = IOUtil.UNDEFINED_LENGTH;
-        if (inputCtx != null) {
-            length = inputCtx.getContentLength();
-        }
-        if (length < 0 && inputFile != null) {
-            length = inputFile.length();
-        }
-        if (length < 0) {
-            log.debug("Unable to determine content length -> default value = "
-                    + IOUtil.UNDEFINED_LENGTH);
-        }
-        return length;
-    }
-
-    /**
-     * @see ImportContext#getMimeType()
-     */
-    public String getMimeType() {
-        return IOUtil.getMimeType(type.toString());
-    }
-
-    /**
-     * @see ImportContext#getEncoding()
-     */
-    public String getEncoding() {
-        return IOUtil.getEncoding(type.toString());
-    }
-
-    /**
-     * @see ImportContext#getProperty(Object)
-     */
-    public Object getProperty(Object propertyName) {
-        return (inputCtx != null) ? inputCtx.getProperty(propertyName.toString()) : null;
-    }
-
-    /**
-     * @see ImportContext#informCompleted(boolean)
-     */
-    public void informCompleted(boolean success) {
-        checkCompleted();
-        completed = true;
-        if (inputFile != null) {
-            inputFile.delete();
-        }
-    }
-
-    /**
-     * @see ImportContext#isCompleted()
-     */
-    public boolean isCompleted() {
-        return completed;
-    }
-
-    /**
-     * @throws IllegalStateException if the context is already completed.
-     * @see #isCompleted()
-     * @see #informCompleted(boolean)
-     */
-    private void checkCompleted() {
-        if (completed) {
-            throw new IllegalStateException("ImportContext has already been consumed.");
-        }
-    }
-}
diff --git a/tika-example/src/main/java/org/apache/tika/example/InterruptableParsingExample.java b/tika-example/src/main/java/org/apache/tika/example/InterruptableParsingExample.java
deleted file mode 100644
index 9aadf58..0000000
--- a/tika-example/src/main/java/org/apache/tika/example/InterruptableParsingExample.java
+++ /dev/null
@@ -1,92 +0,0 @@
-/*
- * Licensed to the Apache Software Foundation (ASF) under one or more
- * contributor license agreements. See the NOTICE file distributed with
- * this work for additional information regarding copyright ownership.
- * The ASF licenses this file to You under the Apache License, Version 2.0
- * (the "License"); you may not use this file except in compliance with
- * the License. You may obtain a copy of the License at
- *
- *     http://www.apache.org/licenses/LICENSE-2.0
- *
- * Unless required by applicable law or agreed to in writing, software
- * distributed under the License is distributed on an "AS IS" BASIS,
- * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
- * See the License for the specific language governing permissions and
- * limitations under the License.
- */
-
-package org.apache.tika.example;
-
-import java.io.BufferedInputStream;
-import java.io.IOException;
-import java.io.InputStream;
-import java.nio.file.Files;
-import java.nio.file.Path;
-import java.util.Locale;
-
-import org.apache.tika.Tika;
-import org.apache.tika.exception.TikaException;
-import org.apache.tika.metadata.Metadata;
-import org.apache.tika.parser.ParseContext;
-import org.apache.tika.parser.Parser;
-import org.xml.sax.SAXException;
-import org.xml.sax.helpers.DefaultHandler;
-
-/**
- * This example demonstrates how to interrupt document parsing if
- * some condition is met.
- * <p>
- * {@link InterruptingContentHandler} throws special exception as soon as
- * find {@code query} string in parsed file.
- *
- * See also http://stackoverflow.com/questions/31939851
- */
-public class InterruptableParsingExample {
-    private Tika tika = new Tika(); // for default autodetect parser
-
-    public boolean findInFile(String query, Path path) {
-        InterruptingContentHandler handler = new InterruptingContentHandler(query);
-
-        Metadata metadata = new Metadata();
-        ParseContext context = new ParseContext();
-        context.set(Parser.class, tika.getParser());
-
-        try (InputStream is = new BufferedInputStream(Files.newInputStream(path))) {
-            tika.getParser().parse(is, handler, metadata, context);
-        } catch (QueryMatchedException e) {
-            return true;
-        } catch (SAXException | TikaException | IOException e) {
-            // something went wrong with parsing...
-            e.printStackTrace();
-        }
-        return false;
-    }
-
-    class QueryMatchedException extends SAXException {
-    }
-
-    /**
-     * Trivial content handler that searched for {@code query} in characters send to it.
-     * <p>
-     * Throws {@link QueryMatchedException} when query string is found.
-     */
-    class InterruptingContentHandler extends DefaultHandler {
-        private String query;
-        private StringBuilder sb = new StringBuilder();
-
-        InterruptingContentHandler(String query) {
-            this.query = query;
-        }
-
-        @Override
-        public void characters(char[] ch, int start, int length) throws SAXException {
-            sb.append(new String(ch, start, length).toLowerCase(Locale.getDefault()));
-
-            if (sb.toString().contains(query))
-                throw new QueryMatchedException();
-
-            if (sb.length() > 2 * query.length())
-                sb.delete(0, sb.length() - query.length()); // keep tail with query.length() chars
-        }
-    }
-}
diff --git a/tika-example/src/main/java/org/apache/tika/example/Language.java b/tika-example/src/main/java/org/apache/tika/example/Language.java
deleted file mode 100755
index d3da55d..0000000
--- a/tika-example/src/main/java/org/apache/tika/example/Language.java
+++ /dev/null
@@ -1,58 +0,0 @@
-/*
- * Licensed to the Apache Software Foundation (ASF) under one or more
- * contributor license agreements. See the NOTICE file distributed with
- * this work for additional information regarding copyright ownership.
- * The ASF licenses this file to You under the Apache License, Version 2.0
- * (the "License"); you may not use this file except in compliance with
- * the License. You may obtain a copy of the License at
- *
- *     http://www.apache.org/licenses/LICENSE-2.0
- *
- * Unless required by applicable law or agreed to in writing, software
- * distributed under the License is distributed on an "AS IS" BASIS,
- * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
- * See the License for the specific language governing permissions and
- * limitations under the License.
- */ - -package org.apache.tika.example; - -import java.io.IOException; - -import org.apache.tika.language.LanguageIdentifier; -import org.apache.tika.language.LanguageProfile; -import org.apache.tika.language.ProfilingHandler; -import org.apache.tika.language.ProfilingWriter; -import org.apache.tika.metadata.Metadata; -import org.apache.tika.parser.AutoDetectParser; -import org.apache.tika.parser.ParseContext; - -public class Language { - public static void languageDetection() throws IOException { - LanguageProfile profile = new LanguageProfile( - "Alla människor är födda fria och lika i värde och rättigheter."); - - LanguageIdentifier identifier = new LanguageIdentifier(profile); - System.out.println(identifier.getLanguage()); - } - - public static void languageDetectionWithWriter() throws IOException { - ProfilingWriter writer = new ProfilingWriter(); - writer.append("Minden emberi lény"); - writer.append(" szabadon születik és"); - writer.append(" egyenlő méltósága és"); - writer.append(" joga van."); - - LanguageIdentifier identifier = writer.getLanguage(); - System.out.println(identifier.getLanguage()); - writer.close(); - } - - public static void languageDetectionWithHandler() throws Exception { - ProfilingHandler handler = new ProfilingHandler(); - new AutoDetectParser().parse(System.in, handler, new Metadata(), new ParseContext()); - - LanguageIdentifier identifier = handler.getLanguage(); - System.out.println(identifier.getLanguage()); - } -} diff --git a/tika-example/src/main/java/org/apache/tika/example/LanguageDetectingParser.java b/tika-example/src/main/java/org/apache/tika/example/LanguageDetectingParser.java deleted file mode 100755 index a6945cd..0000000 --- a/tika-example/src/main/java/org/apache/tika/example/LanguageDetectingParser.java +++ /dev/null @@ -1,50 +0,0 @@ -/* - * Licensed to the Apache Software Foundation (ASF) under one or more - * contributor license agreements. 
See the NOTICE file distributed with - * this work for additional information regarding copyright ownership. - * The ASF licenses this file to You under the Apache License, Version 2.0 - * (the "License"); you may not use this file except in compliance with - * the License. You may obtain a copy of the License at - * - * http://www.apache.org/licenses/LICENSE-2.0 - * - * Unless required by applicable law or agreed to in writing, software - * distributed under the License is distributed on an "AS IS" BASIS, - * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. - * See the License for the specific language governing permissions and - * limitations under the License. - */ - -package org.apache.tika.example; - -import java.io.IOException; -import java.io.InputStream; - -import org.apache.tika.exception.TikaException; -import org.apache.tika.language.LanguageIdentifier; -import org.apache.tika.language.ProfilingHandler; -import org.apache.tika.metadata.Metadata; -import org.apache.tika.parser.DelegatingParser; -import org.apache.tika.parser.ParseContext; -import org.apache.tika.sax.TeeContentHandler; -import org.xml.sax.ContentHandler; -import org.xml.sax.SAXException; - -@SuppressWarnings("deprecation") -public class LanguageDetectingParser extends DelegatingParser { - private static final long serialVersionUID = 4291320409396502774L; - - public void parse(InputStream stream, ContentHandler handler, - final Metadata metadata, ParseContext context) throws SAXException, - IOException, TikaException { - ProfilingHandler profiler = new ProfilingHandler(); - ContentHandler tee = new TeeContentHandler(handler, profiler); - - super.parse(stream, tee, metadata, context); - - LanguageIdentifier identifier = profiler.getLanguage(); - if (identifier.isReasonablyCertain()) { - metadata.set(Metadata.LANGUAGE, identifier.getLanguage()); - } - } -} diff --git a/tika-example/src/main/java/org/apache/tika/example/LanguageIdentifierExample.java 
b/tika-example/src/main/java/org/apache/tika/example/LanguageIdentifierExample.java deleted file mode 100644 index 537d5ab..0000000 --- a/tika-example/src/main/java/org/apache/tika/example/LanguageIdentifierExample.java +++ /dev/null @@ -1,27 +0,0 @@ -/* - * Licensed to the Apache Software Foundation (ASF) under one or more - * contributor license agreements. See the NOTICE file distributed with - * this work for additional information regarding copyright ownership. - * The ASF licenses this file to You under the Apache License, Version 2.0 - * (the "License"); you may not use this file except in compliance with - * the License. You may obtain a copy of the License at - * - * http://www.apache.org/licenses/LICENSE-2.0 - * - * Unless required by applicable law or agreed to in writing, software - * distributed under the License is distributed on an "AS IS" BASIS, - * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. - * See the License for the specific language governing permissions and - * limitations under the License. - */ - -package org.apache.tika.example; - -import org.apache.tika.language.LanguageIdentifier; - -public class LanguageIdentifierExample { - public String identifyLanguage(String text) { - LanguageIdentifier identifier = new LanguageIdentifier(text); - return identifier.getLanguage(); - } -} diff --git a/tika-example/src/main/java/org/apache/tika/example/LazyTextExtractorField.java b/tika-example/src/main/java/org/apache/tika/example/LazyTextExtractorField.java deleted file mode 100755 index 4ce5f71..0000000 --- a/tika-example/src/main/java/org/apache/tika/example/LazyTextExtractorField.java +++ /dev/null @@ -1,210 +0,0 @@ -/* - * Licensed to the Apache Software Foundation (ASF) under one or more - * contributor license agreements. See the NOTICE file distributed with - * this work for additional information regarding copyright ownership. 
- * The ASF licenses this file to You under the Apache License, Version 2.0 - * (the "License"); you may not use this file except in compliance with - * the License. You may obtain a copy of the License at - * - * http://www.apache.org/licenses/LICENSE-2.0 - * - * Unless required by applicable law or agreed to in writing, software - * distributed under the License is distributed on an "AS IS" BASIS, - * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. - * See the License for the specific language governing permissions and - * limitations under the License. - */ - -package org.apache.tika.example; - -import java.io.InputStream; -import java.io.Reader; -import java.util.concurrent.Executor; - -import org.apache.jackrabbit.core.query.lucene.FieldNames; -import org.apache.jackrabbit.core.value.InternalValue; -import org.apache.lucene.analysis.TokenStream; -import org.apache.lucene.document.AbstractField; -import org.apache.lucene.document.Field; -import org.apache.lucene.document.Field.Store; -import org.apache.lucene.document.Field.TermVector; -import org.apache.tika.metadata.Metadata; -import org.apache.tika.parser.ParseContext; -import org.apache.tika.parser.Parser; -import org.slf4j.Logger; -import org.slf4j.LoggerFactory; -import org.xml.sax.ContentHandler; -import org.xml.sax.SAXException; -import org.xml.sax.helpers.DefaultHandler; - -/** - * <code>LazyTextExtractorField</code> implements a Lucene field with a String - * value that is lazily initialized from a given {@link Reader}. In addition - * this class provides a method to find out whether the purpose of the reader is - * to extract text and whether the extraction process is already finished. - * - * @see #isExtractorFinished() - */ -@SuppressWarnings("serial") -public class LazyTextExtractorField extends AbstractField { - /** - * The logger instance for this class. 
- */ - private static final Logger log = LoggerFactory.getLogger(LazyTextExtractorField.class); - - /** - * The exception used to forcibly terminate the extraction process when the - * maximum field length is reached. - * <p> - * This shared exception instance shouldn't be logged, since its stack trace is meaningless. - */ - private static final SAXException STOP = new SAXException("max field length reached"); - - /** - * The extracted text content of the given binary value. Set to non-null - * when the text extraction task finishes. - */ - private volatile String extract = null; - - /** - * Creates a new <code>LazyTextExtractorField</code> for the given binary - * <code>value</code>. - * - * @param parser the parser used to extract text from the binary value - * @param value the binary value from which to extract the text - * @param metadata the document metadata passed to the parser - * @param executor the executor that runs the background extraction task - * @param highlighting set to <code>true</code> to enable result highlighting support - * @param maxFieldLength the maximum number of characters to extract - */ - public LazyTextExtractorField(Parser parser, InternalValue value, - Metadata metadata, Executor executor, boolean highlighting, - int maxFieldLength) { - super(FieldNames.FULLTEXT, highlighting ? Store.YES : Store.NO, - Field.Index.ANALYZED, highlighting ? TermVector.WITH_OFFSETS - : TermVector.NO); - executor.execute(new ParsingTask(parser, value, metadata, - maxFieldLength)); - } - - /** - * Returns the extracted text. This method blocks until the text extraction - * task has been completed.
- * - * @return the string value of this field - */ - public synchronized String stringValue() { - try { - while (!isExtractorFinished()) { - wait(); - } - return extract; - } catch (InterruptedException e) { - log.error("Text extraction thread was interrupted", e); - return ""; - } - } - - /** - * @return always <code>null</code> - */ - public Reader readerValue() { - return null; - } - - /** - * @return always <code>null</code> - */ - public byte[] binaryValue() { - return null; - } - - /** - * @return always <code>null</code> - */ - public TokenStream tokenStreamValue() { - return null; - } - - /** - * Checks whether the text extraction task has finished. - * - * @return <code>true</code> if the extracted text is available - */ - public boolean isExtractorFinished() { - return extract != null; - } - - private synchronized void setExtractedText(String value) { - extract = value; - notify(); - } - - /** - * Releases all resources associated with this field. - */ - public void dispose() { - // TODO: Cause the ContentHandler below to throw an exception - } - - /** - * The background task for extracting text from a binary value. 
- */ - private class ParsingTask extends DefaultHandler implements Runnable { - private final Parser parser; - - private final InternalValue value; - - private final Metadata metadata; - - private final int maxFieldLength; - - private final StringBuilder builder = new StringBuilder(); - - private final ParseContext context = new ParseContext(); - - public ParsingTask(Parser parser, InternalValue value, - Metadata metadata, int maxFieldLength) { - this.parser = parser; - this.value = value; - this.metadata = metadata; - this.maxFieldLength = maxFieldLength; - } - - public void run() { - try { - try (InputStream stream = value.getStream()) { - // this task is itself the ContentHandler; characters() below collects the text - parser.parse(stream, this, metadata, context); - } - } catch (LinkageError e) { - // Capture and ignore - } catch (Throwable t) { - if (t != STOP) { - log.debug("Failed to extract text.", t); - setExtractedText("TextExtractionError"); - return; - } - } finally { - value.discard(); - } - setExtractedText(builder.toString()); - } - - @Override - public void characters(char[] ch, int start, int length) - throws SAXException { - builder.append(ch, start, - Math.min(length, maxFieldLength - builder.length())); - if (builder.length() >= maxFieldLength) { - throw STOP; - } - } - - @Override - public void ignorableWhitespace(char[] ch, int start, int length) - throws SAXException { - characters(ch, start, length); - } - } -} diff --git a/tika-example/src/main/java/org/apache/tika/example/LuceneIndexer.java b/tika-example/src/main/java/org/apache/tika/example/LuceneIndexer.java deleted file mode 100755 index 2f7cd31..0000000 --- a/tika-example/src/main/java/org/apache/tika/example/LuceneIndexer.java +++ /dev/null @@ -1,45 +0,0 @@ -/* - * Licensed to the Apache Software Foundation (ASF) under one or more - * contributor license agreements.
See the NOTICE file distributed with - * this work for additional information regarding copyright ownership. - * The ASF licenses this file to You under the Apache License, Version 2.0 - * (the "License"); you may not use this file except in compliance with - * the License. You may obtain a copy of the License at - * - * http://www.apache.org/licenses/LICENSE-2.0 - * - * Unless required by applicable law or agreed to in writing, software - * distributed under the License is distributed on an "AS IS" BASIS, - * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. - * See the License for the specific language governing permissions and - * limitations under the License. - */ - -package org.apache.tika.example; - -import java.io.File; - -import org.apache.lucene.document.Document; -import org.apache.lucene.document.Field; -import org.apache.lucene.document.Field.Index; -import org.apache.lucene.document.Field.Store; -import org.apache.lucene.index.IndexWriter; -import org.apache.tika.Tika; - -public class LuceneIndexer { - private final Tika tika; - - private final IndexWriter writer; - - public LuceneIndexer(Tika tika, IndexWriter writer) { - this.tika = tika; - this.writer = writer; - } - - public void indexDocument(File file) throws Exception { - Document document = new Document(); - document.add(new Field("filename", file.getName(), Store.YES, Index.ANALYZED)); - document.add(new Field("fulltext", tika.parseToString(file), Store.NO, Index.ANALYZED)); - writer.addDocument(document); - } -} diff --git a/tika-example/src/main/java/org/apache/tika/example/LuceneIndexerExtended.java b/tika-example/src/main/java/org/apache/tika/example/LuceneIndexerExtended.java deleted file mode 100755 index 2a7fd13..0000000 --- a/tika-example/src/main/java/org/apache/tika/example/LuceneIndexerExtended.java +++ /dev/null @@ -1,65 +0,0 @@ -/* - * Licensed to the Apache Software Foundation (ASF) under one or more - * contributor license agreements. 
See the NOTICE file distributed with - * this work for additional information regarding copyright ownership. - * The ASF licenses this file to You under the Apache License, Version 2.0 - * (the "License"); you may not use this file except in compliance with - * the License. You may obtain a copy of the License at - * - * http://www.apache.org/licenses/LICENSE-2.0 - * - * Unless required by applicable law or agreed to in writing, software - * distributed under the License is distributed on an "AS IS" BASIS, - * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. - * See the License for the specific language governing permissions and - * limitations under the License. - */ - -package org.apache.tika.example; - -import java.io.File; -import java.io.Reader; - -import org.apache.lucene.analysis.standard.StandardAnalyzer; -import org.apache.lucene.document.Document; -import org.apache.lucene.document.Field; -import org.apache.lucene.document.Field.Index; -import org.apache.lucene.document.Field.Store; -import org.apache.lucene.index.IndexWriter; -import org.apache.lucene.index.IndexWriter.MaxFieldLength; -import org.apache.lucene.store.SimpleFSDirectory; -import org.apache.lucene.util.Version; -import org.apache.tika.Tika; - -@SuppressWarnings("deprecation") -public class LuceneIndexerExtended { - private final Tika tika; - - private final IndexWriter writer; - - public LuceneIndexerExtended(IndexWriter writer, Tika tika) { - this.writer = writer; - this.tika = tika; - } - - public static void main(String[] args) throws Exception { - try (IndexWriter writer = new IndexWriter( - new SimpleFSDirectory(new File(args[0])), - new StandardAnalyzer(Version.LUCENE_30), - MaxFieldLength.UNLIMITED)) { - LuceneIndexer indexer = new LuceneIndexer(new Tika(), writer); - for (int i = 1; i < args.length; i++) { - indexer.indexDocument(new File(args[i])); - } - } - } - - public void indexDocument(File file) throws Exception { - try (Reader fulltext = 
tika.parse(file)) { - Document document = new Document(); - document.add(new Field("filename", file.getName(), Store.YES, Index.ANALYZED)); - document.add(new Field("fulltext", fulltext)); - writer.addDocument(document); - } - } -} diff --git a/tika-example/src/main/java/org/apache/tika/example/MediaTypeExample.java b/tika-example/src/main/java/org/apache/tika/example/MediaTypeExample.java deleted file mode 100755 index 91823e5..0000000 --- a/tika-example/src/main/java/org/apache/tika/example/MediaTypeExample.java +++ /dev/null @@ -1,58 +0,0 @@ -/* - * Licensed to the Apache Software Foundation (ASF) under one or more - * contributor license agreements. See the NOTICE file distributed with - * this work for additional information regarding copyright ownership. - * The ASF licenses this file to You under the Apache License, Version 2.0 - * (the "License"); you may not use this file except in compliance with - * the License. You may obtain a copy of the License at - * - * http://www.apache.org/licenses/LICENSE-2.0 - * - * Unless required by applicable law or agreed to in writing, software - * distributed under the License is distributed on an "AS IS" BASIS, - * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. - * See the License for the specific language governing permissions and - * limitations under the License. 
- */ - -package org.apache.tika.example; - -import java.util.Map; -import java.util.Set; - -import org.apache.tika.mime.MediaType; -import org.apache.tika.mime.MediaTypeRegistry; - -public class MediaTypeExample { - public static void describeMediaType() { - MediaType type = MediaType.parse("text/plain; charset=UTF-8"); - - System.out.println("type: " + type.getType()); - System.out.println("subtype: " + type.getSubtype()); - - Map<String, String> parameters = type.getParameters(); - System.out.println("parameters:"); - for (String name : parameters.keySet()) { - System.out.println(" " + name + "=" + parameters.get(name)); - } - } - - public static void listAllTypes() { - MediaTypeRegistry registry = MediaTypeRegistry.getDefaultRegistry(); - - for (MediaType type : registry.getTypes()) { - Set<MediaType> aliases = registry.getAliases(type); - System.out.println(type + ", also known as " + aliases); - } - } - - public static void main(String[] args) throws Exception { - MediaTypeRegistry registry = MediaTypeRegistry.getDefaultRegistry(); - - MediaType type = MediaType.parse("image/svg+xml"); - while (type != null) { - System.out.println(type); - type = registry.getSupertype(type); - } - } -} diff --git a/tika-example/src/main/java/org/apache/tika/example/MetadataAwareLuceneIndexer.java b/tika-example/src/main/java/org/apache/tika/example/MetadataAwareLuceneIndexer.java deleted file mode 100755 index 5c6a9d4..0000000 --- a/tika-example/src/main/java/org/apache/tika/example/MetadataAwareLuceneIndexer.java +++ /dev/null @@ -1,88 +0,0 @@ -/* - * Licensed to the Apache Software Foundation (ASF) under one or more - * contributor license agreements. See the NOTICE file distributed with - * this work for additional information regarding copyright ownership. - * The ASF licenses this file to You under the Apache License, Version 2.0 - * (the "License"); you may not use this file except in compliance with - * the License. 
You may obtain a copy of the License at - * - * http://www.apache.org/licenses/LICENSE-2.0 - * - * Unless required by applicable law or agreed to in writing, software - * distributed under the License is distributed on an "AS IS" BASIS, - * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. - * See the License for the specific language governing permissions and - * limitations under the License. - */ - -package org.apache.tika.example; - -import java.io.File; -import java.io.FileInputStream; -import java.io.InputStream; -import java.util.Date; - -import org.apache.lucene.document.Document; -import org.apache.lucene.document.Field; -import org.apache.lucene.document.Field.Index; -import org.apache.lucene.document.Field.Store; -import org.apache.lucene.index.IndexWriter; -import org.apache.tika.Tika; -import org.apache.tika.metadata.DublinCore; -import org.apache.tika.metadata.Metadata; -import org.apache.tika.metadata.Property; - -/** - * Builds on the LuceneIndexer from Chapter 5 and adds indexing of Metadata. 
- */ -@SuppressWarnings("deprecation") -public class MetadataAwareLuceneIndexer { - private Tika tika; - - private IndexWriter writer; - - public MetadataAwareLuceneIndexer(IndexWriter writer, Tika tika) { - this.writer = writer; - this.tika = tika; - } - - public void indexContentSpecificMet(File file) throws Exception { - Metadata met = new Metadata(); - try (InputStream is = new FileInputStream(file)) { - tika.parse(is, met); - Document document = new Document(); - for (String key : met.names()) { - String[] values = met.getValues(key); - for (String val : values) { - document.add(new Field(key, val, Store.YES, Index.ANALYZED)); - } - writer.addDocument(document); - } - } - } - - public void indexWithDublinCore(File file) throws Exception { - Metadata met = new Metadata(); - met.add(Metadata.CREATOR, "Manning"); - met.add(Metadata.CREATOR, "Tika in Action"); - met.set(Metadata.DATE, new Date()); - met.set(Metadata.FORMAT, tika.detect(file)); - met.set(DublinCore.SOURCE, file.toURI().toURL().toString()); - met.add(Metadata.SUBJECT, "File"); - met.add(Metadata.SUBJECT, "Indexing"); - met.add(Metadata.SUBJECT, "Metadata"); - met.set(Property.externalClosedChoise(Metadata.RIGHTS, "public", - "private"), "public"); - try (InputStream is = new FileInputStream(file)) { - tika.parse(is, met); - Document document = new Document(); - for (String key : met.names()) { - String[] values = met.getValues(key); - for (String val : values) { - document.add(new Field(key, val, Store.YES, Index.ANALYZED)); - } - writer.addDocument(document); - } - } - } -} diff --git a/tika-example/src/main/java/org/apache/tika/example/MyFirstTika.java b/tika-example/src/main/java/org/apache/tika/example/MyFirstTika.java deleted file mode 100755 index 5f2c8aa..0000000 --- a/tika-example/src/main/java/org/apache/tika/example/MyFirstTika.java +++ /dev/null @@ -1,116 +0,0 @@ -/* - * Licensed to the Apache Software Foundation (ASF) under one or more - * contributor license agreements. 
See the NOTICE file distributed with - * this work for additional information regarding copyright ownership. - * The ASF licenses this file to You under the Apache License, Version 2.0 - * (the "License"); you may not use this file except in compliance with - * the License. You may obtain a copy of the License at - * - * http://www.apache.org/licenses/LICENSE-2.0 - * - * Unless required by applicable law or agreed to in writing, software - * distributed under the License is distributed on an "AS IS" BASIS, - * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. - * See the License for the specific language governing permissions and - * limitations under the License. - */ - -package org.apache.tika.example; - -import java.io.File; -import java.io.InputStream; - -import org.apache.commons.io.FileUtils; -import org.apache.tika.config.TikaConfig; -import org.apache.tika.detect.Detector; -import org.apache.tika.io.TikaInputStream; -import org.apache.tika.language.LanguageIdentifier; -import org.apache.tika.language.LanguageProfile; -import org.apache.tika.metadata.Metadata; -import org.apache.tika.mime.MediaType; -import org.apache.tika.mime.MimeTypes; -import org.apache.tika.parser.AutoDetectParser; -import org.apache.tika.parser.ParseContext; -import org.apache.tika.parser.Parser; -import org.apache.tika.sax.BodyContentHandler; -import org.xml.sax.ContentHandler; - -import static java.nio.charset.StandardCharsets.UTF_8; - -/** - * Demonstrates how to call the different components within Tika: its - * {@link Detector} framework (aka MIME identification and repository), its - * {@link Parser} interface, its {@link LanguageIdentifier} and other goodies. 
- * <p> - * It also shows the "easy way" via {@link AutoDetectParser} - */ -public class MyFirstTika { - public static void main(String[] args) throws Exception { - String filename = args[0]; - TikaConfig tikaConfig = TikaConfig.getDefaultConfig(); - - Metadata metadata = new Metadata(); - String text = parseUsingComponents(filename, tikaConfig, metadata); - System.out.println("Parsed Metadata: "); - System.out.println(metadata); - System.out.println("Parsed Text: "); - System.out.println(text); - - System.out.println("-------------------------"); - - metadata = new Metadata(); - text = parseUsingAutoDetect(filename, tikaConfig, metadata); - System.out.println("Parsed Metadata: "); - System.out.println(metadata); - System.out.println("Parsed Text: "); - System.out.println(text); - } - - public static String parseUsingAutoDetect(String filename, TikaConfig tikaConfig, - Metadata metadata) throws Exception { - System.out.println("Handling using AutoDetectParser: [" + filename + "]"); - - AutoDetectParser parser = new AutoDetectParser(tikaConfig); - ContentHandler handler = new BodyContentHandler(); - TikaInputStream stream = TikaInputStream.get(new File(filename), metadata); - parser.parse(stream, handler, metadata, new ParseContext()); - return handler.toString(); - } - - public static String parseUsingComponents(String filename, TikaConfig tikaConfig, - Metadata metadata) throws Exception { - MimeTypes mimeRegistry = tikaConfig.getMimeRepository(); - - System.out.println("Examining: [" + filename + "]"); - - metadata.set(Metadata.RESOURCE_NAME_KEY, filename); - System.out.println("The MIME type (based on filename) is: [" - + mimeRegistry.detect(null, metadata) + "]"); - - InputStream stream = TikaInputStream.get(new File(filename)); - System.out.println("The MIME type (based on MAGIC) is: [" - + mimeRegistry.detect(stream, metadata) + "]"); - - stream = TikaInputStream.get(new File(filename)); - Detector detector = tikaConfig.getDetector(); - 
System.out.println("The MIME type (based on the Detector interface) is: [" - + detector.detect(stream, metadata) + "]"); - - LanguageIdentifier lang = new LanguageIdentifier(new LanguageProfile( - FileUtils.readFileToString(new File(filename), UTF_8))); - - System.out.println("The language of this content is: [" - + lang.getLanguage() + "]"); - - // Get a non-detecting parser that handles all the types it can - Parser parser = tikaConfig.getParser(); - // Tell it what we think the content is - MediaType type = detector.detect(stream, metadata); - metadata.set(Metadata.CONTENT_TYPE, type.toString()); - // Have the file parsed to get the content and metadata - ContentHandler handler = new BodyContentHandler(); - parser.parse(stream, handler, metadata, new ParseContext()); - - return handler.toString(); - } -} diff --git a/tika-example/src/main/java/org/apache/tika/example/ParsingExample.java b/tika-example/src/main/java/org/apache/tika/example/ParsingExample.java deleted file mode 100644 index b98e428..0000000 --- a/tika-example/src/main/java/org/apache/tika/example/ParsingExample.java +++ /dev/null @@ -1,217 +0,0 @@ -/* - * Licensed to the Apache Software Foundation (ASF) under one or more - * contributor license agreements. See the NOTICE file distributed with - * this work for additional information regarding copyright ownership. - * The ASF licenses this file to You under the Apache License, Version 2.0 - * (the "License"); you may not use this file except in compliance with - * the License. You may obtain a copy of the License at - * - * http://www.apache.org/licenses/LICENSE-2.0 - * - * Unless required by applicable law or agreed to in writing, software - * distributed under the License is distributed on an "AS IS" BASIS, - * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. - * See the License for the specific language governing permissions and - * limitations under the License. 
- */ - -package org.apache.tika.example; - -import java.io.IOException; -import java.io.InputStream; -import java.io.StringWriter; -import java.nio.file.DirectoryStream; -import java.nio.file.Files; -import java.nio.file.Path; -import java.util.ArrayList; -import java.util.List; - -import org.apache.tika.Tika; -import org.apache.tika.exception.TikaException; -import org.apache.tika.metadata.Metadata; -import org.apache.tika.metadata.serialization.JsonMetadataList; -import org.apache.tika.parser.AutoDetectParser; -import org.apache.tika.parser.ParseContext; -import org.apache.tika.parser.Parser; -import org.apache.tika.parser.RecursiveParserWrapper; -import org.apache.tika.sax.BasicContentHandlerFactory; -import org.apache.tika.sax.BodyContentHandler; -import org.apache.tika.sax.ContentHandlerFactory; -import org.xml.sax.SAXException; -import org.xml.sax.helpers.DefaultHandler; - -public class ParsingExample { - - /** - * Example of how to use Tika's parseToString method to parse the content of a file, - * and return any text found. - * <p> - * Note: Tika.parseToString() will extract content from the outer container - * document and any embedded/attached documents. - * - * @return The content of a file. - */ - public String parseToStringExample() throws IOException, SAXException, TikaException { - Tika tika = new Tika(); - try (InputStream stream = ParsingExample.class.getResourceAsStream("test.doc")) { - return tika.parseToString(stream); - } - } - - /** - * Example of how to use Tika to parse a file when you do not know its file type - * ahead of time. - * <p> - * AutoDetectParser attempts to discover the file's type automatically, then call - * the exact Parser built for that file type. - * <p> - * The stream to be parsed by the Parser. In this case, we get a file from the - * resources folder of this project. - * <p> - * Handlers are used to get the exact information you want out of the host of - * information gathered by Parsers. 
The body content handler, intuitively, extracts - * everything that would go between HTML body tags. - * <p> - * The Metadata object will be filled by the Parser with Metadata discovered about - * the file being parsed. - * <p> - * Note: This example will extract content from the outer document and all - * embedded documents. However, if you choose to use a {@link ParseContext}, - * make sure to set a {@link Parser} or else embedded content will not be - * parsed. - * - * @return The content of a file. - */ - public String parseExample() throws IOException, SAXException, TikaException { - AutoDetectParser parser = new AutoDetectParser(); - BodyContentHandler handler = new BodyContentHandler(); - Metadata metadata = new Metadata(); - try (InputStream stream = ParsingExample.class.getResourceAsStream("test.doc")) { - parser.parse(stream, handler, metadata); - return handler.toString(); - } - } - - /** - * If you don't want content from embedded documents, send in - * a {@link org.apache.tika.parser.ParseContext} that does not contain a - * {@link Parser}. - * - * @return The content of a file. - */ - public String parseNoEmbeddedExample() throws IOException, SAXException, TikaException { - AutoDetectParser parser = new AutoDetectParser(); - BodyContentHandler handler = new BodyContentHandler(); - Metadata metadata = new Metadata(); - try (InputStream stream = ParsingExample.class.getResourceAsStream("test_recursive_embedded.docx")) { - parser.parse(stream, handler, metadata, new ParseContext()); - return handler.toString(); - } - } - - - /** - * This example shows how to extract content from the outer document and all - * embedded documents. The key is to specify a {@link Parser} in the {@link ParseContext}. 
- * - * @return content, including from embedded documents - * @throws IOException - * @throws SAXException - * @throws TikaException - */ - public String parseEmbeddedExample() throws IOException, SAXException, TikaException { - AutoDetectParser parser = new AutoDetectParser(); - BodyContentHandler handler = new BodyContentHandler(); - Metadata metadata = new Metadata(); - ParseContext context = new ParseContext(); - context.set(Parser.class, parser); - try (InputStream stream = ParsingExample.class.getResourceAsStream("test_recursive_embedded.docx")) { - parser.parse(stream, handler, metadata, context); - return handler.toString(); - } - } - - /** - * For documents that may contain embedded documents, it might be helpful - * to create list of metadata objects, one for the container document and - * one for each embedded document. This allows easy access to both the - * extracted content and the metadata of each embedded document. - * Note that many document formats can contain embedded documents, - * including traditional container formats -- zip, tar and others -- but also - * common office document formats including: MSWord, MSExcel, - * MSPowerPoint, RTF, PDF, MSG and several others. - * <p> - * The "content" format is determined by the ContentHandlerFactory, and - * the content is stored in {@link org.apache.tika.parser.RecursiveParserWrapper#TIKA_CONTENT} - * <p> - * The drawback to the RecursiveParserWrapper is that it caches metadata and contents - * in memory. This should not be used on files whose contents are too big to be handled - * in memory. 
- * - * @return a list of metadata objects, one each for the container file and each embedded file - * @throws IOException - * @throws SAXException - * @throws TikaException - */ - public List<Metadata> recursiveParserWrapperExample() throws IOException, - SAXException, TikaException { - Parser p = new AutoDetectParser(); - ContentHandlerFactory factory = new BasicContentHandlerFactory( - BasicContentHandlerFactory.HANDLER_TYPE.HTML, -1); - - RecursiveParserWrapper wrapper = new RecursiveParserWrapper(p, factory); - Metadata metadata = new Metadata(); - metadata.set(Metadata.RESOURCE_NAME_KEY, "test_recursive_embedded.docx"); - ParseContext context = new ParseContext(); - - try (InputStream stream = ParsingExample.class.getResourceAsStream("test_recursive_embedded.docx")) { - wrapper.parse(stream, new DefaultHandler(), metadata, context); - } - return wrapper.getMetadata(); - } - - /** - * We include a simple JSON serializer for a list of metadata with - * {@link org.apache.tika.metadata.serialization.JsonMetadataList}. - * That class also includes a deserializer to convert from JSON - * back to a List<Metadata>. - * <p> - * This functionality is also available in tika-app's GUI, and - * with the -J option on tika-app's commandline. For tika-server - * users, there is the "rmeta" service that will return this format.
- * - * @return a JSON representation of a list of Metadata objects - * @throws IOException - * @throws SAXException - * @throws TikaException - */ - public String serializedRecursiveParserWrapperExample() throws IOException, - SAXException, TikaException { - List<Metadata> metadataList = recursiveParserWrapperExample(); - StringWriter writer = new StringWriter(); - JsonMetadataList.toJson(metadataList, writer); - return writer.toString(); - } - - - /** - * @param outputPath -- output directory to place files - * @return list of files created - * @throws IOException - * @throws SAXException - * @throws TikaException - */ - public List<Path> extractEmbeddedDocumentsExample(Path outputPath) throws IOException, - SAXException, TikaException { - InputStream stream = ParsingExample.class.getResourceAsStream("test_recursive_embedded.docx"); - ExtractEmbeddedFiles ex = new ExtractEmbeddedFiles(); - ex.extract(stream, outputPath); - List<Path> ret = new ArrayList<>(); - try (DirectoryStream<Path> dirStream = Files.newDirectoryStream(outputPath)) { - for (Path entry : dirStream) { - ret.add(entry); - } - } - return ret; - } -} diff --git a/tika-example/src/main/java/org/apache/tika/example/Pharmacy.java b/tika-example/src/main/java/org/apache/tika/example/Pharmacy.java deleted file mode 100755 index e993599..0000000 --- a/tika-example/src/main/java/org/apache/tika/example/Pharmacy.java +++ /dev/null @@ -1,32 +0,0 @@ -/* - * Licensed to the Apache Software Foundation (ASF) under one or more - * contributor license agreements. See the NOTICE file distributed with - * this work for additional information regarding copyright ownership. - * The ASF licenses this file to You under the Apache License, Version 2.0 - * (the "License"); you may not use this file except in compliance with - * the License. 
You may obtain a copy of the License at - * - * http://www.apache.org/licenses/LICENSE-2.0 - * - * Unless required by applicable law or agreed to in writing, software - * distributed under the License is distributed on an "AS IS" BASIS, - * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. - * See the License for the specific language governing permissions and - * limitations under the License. - */ - -package org.apache.tika.example; - -import java.security.Key; - -public class Pharmacy { - private static Key key = null; - - public static Key getKey() { - return key; - } - - public static void setKey(Key key) { - Pharmacy.key = key; - } -} diff --git a/tika-example/src/main/java/org/apache/tika/example/PrescriptionParser.java b/tika-example/src/main/java/org/apache/tika/example/PrescriptionParser.java deleted file mode 100755 index 0688955..0000000 --- a/tika-example/src/main/java/org/apache/tika/example/PrescriptionParser.java +++ /dev/null @@ -1,52 +0,0 @@ -/* - * Licensed to the Apache Software Foundation (ASF) under one or more - * contributor license agreements. See the NOTICE file distributed with - * this work for additional information regarding copyright ownership. - * The ASF licenses this file to You under the Apache License, Version 2.0 - * (the "License"); you may not use this file except in compliance with - * the License. You may obtain a copy of the License at - * - * http://www.apache.org/licenses/LICENSE-2.0 - * - * Unless required by applicable law or agreed to in writing, software - * distributed under the License is distributed on an "AS IS" BASIS, - * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. - * See the License for the specific language governing permissions and - * limitations under the License. 
- */ - -package org.apache.tika.example; - -import java.util.Collections; -import java.util.Set; - -import org.apache.tika.metadata.Metadata; -import org.apache.tika.mime.MediaType; -import org.apache.tika.parser.ParseContext; -import org.apache.tika.parser.xml.ElementMetadataHandler; -import org.apache.tika.parser.xml.XMLParser; -import org.apache.tika.sax.TeeContentHandler; -import org.xml.sax.ContentHandler; - -public class PrescriptionParser extends XMLParser { - private static final long serialVersionUID = 7690682277511967388L; - - @Override - protected ContentHandler getContentHandler(ContentHandler handler, - Metadata metadata, ParseContext context) { - String xpd = "http://example.com/2011/xpd"; - - ContentHandler doctor = new ElementMetadataHandler(xpd, "doctor", - metadata, "xpd:doctor"); - ContentHandler patient = new ElementMetadataHandler(xpd, "patient", - metadata, "xpd:patient"); - - return new TeeContentHandler(super.getContentHandler(handler, metadata, - context), doctor, patient); - } - - @Override - public Set<MediaType> getSupportedTypes(ParseContext context) { - return Collections.singleton(MediaType.application("x-prescription+xml")); - } -} diff --git a/tika-example/src/main/java/org/apache/tika/example/RecentFiles.java b/tika-example/src/main/java/org/apache/tika/example/RecentFiles.java deleted file mode 100755 index d6a259b..0000000 --- a/tika-example/src/main/java/org/apache/tika/example/RecentFiles.java +++ /dev/null @@ -1,145 +0,0 @@ -/* - * Licensed to the Apache Software Foundation (ASF) under one or more - * contributor license agreements. See the NOTICE file distributed with - * this work for additional information regarding copyright ownership. - * The ASF licenses this file to You under the Apache License, Version 2.0 - * (the "License"); you may not use this file except in compliance with - * the License. 
You may obtain a copy of the License at - * - * http://www.apache.org/licenses/LICENSE-2.0 - * - * Unless required by applicable law or agreed to in writing, software - * distributed under the License is distributed on an "AS IS" BASIS, - * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. - * See the License for the specific language governing permissions and - * limitations under the License. - */ - -package org.apache.tika.example; - -import java.io.File; -import java.io.IOException; -import java.text.SimpleDateFormat; -import java.util.Date; -import java.util.GregorianCalendar; -import java.util.Locale; -import java.util.TimeZone; - -import org.apache.jackrabbit.util.ISO8601; -import org.apache.lucene.document.Document; -import org.apache.lucene.index.CorruptIndexException; -import org.apache.lucene.index.IndexReader; -import org.apache.lucene.search.IndexSearcher; -import org.apache.lucene.search.ScoreDoc; -import org.apache.lucene.search.TermRangeQuery; -import org.apache.lucene.search.TopScoreDocCollector; -import org.apache.lucene.store.SimpleFSDirectory; -import org.apache.tika.metadata.DublinCore; -import org.apache.tika.metadata.Metadata; - -/** - * Builds on top of the LuceneIndexer and the Metadata discussions in Chapter 6 - * to output an RSS (or RDF) feed of files crawled by the LuceneIndexer within - * the last N minutes. 
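The RecentFiles feed below assembles its RSS markup by plain string concatenation through a small `emitTag` helper, which writes element text verbatim. A hedged sketch of that tag-emitting idea, using only the JDK, that additionally escapes the predefined XML entities in element text and attribute values (something naive concatenation misses); class and method names here are illustrative, not part of the original example:

```java
// EmitTagSketch.java -- illustrative sketch, not the example's exact code.
// Builds <tagName attr="value">text</tagName> strings while escaping the
// five predefined XML entities, which plain concatenation would not do.
public class EmitTagSketch {

    /** Escape the five predefined XML entities in a text fragment. */
    static String escapeXml(String s) {
        StringBuilder out = new StringBuilder();
        for (int i = 0; i < s.length(); i++) {
            char c = s.charAt(i);
            switch (c) {
                case '<': out.append("&lt;"); break;
                case '>': out.append("&gt;"); break;
                case '&': out.append("&amp;"); break;
                case '"': out.append("&quot;"); break;
                case '\'': out.append("&apos;"); break;
                default: out.append(c);
            }
        }
        return out.toString();
    }

    /** Emit one element, with an optional single attribute, text escaped. */
    static String emitTag(String tagName, String value,
                          String attributeName, String attributeValue) {
        StringBuilder out = new StringBuilder();
        out.append('<').append(tagName);
        if (attributeName != null) {
            out.append(' ').append(attributeName)
               .append("=\"").append(escapeXml(attributeValue)).append('"');
        }
        out.append('>').append(escapeXml(value))
           .append("</").append(tagName).append('>');
        return out.toString();
    }
}
```

For feeds built from arbitrary document metadata, escaping matters: a title containing `&` or `<` would otherwise produce malformed RSS.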
- */ -@SuppressWarnings("deprecation") -public class RecentFiles { - private IndexReader reader; - - private SimpleDateFormat rssDateFormat = new SimpleDateFormat( - "E, dd MMM yyyy HH:mm:ss z", Locale.getDefault()); - - public String generateRSS(File indexFile) throws CorruptIndexException, - IOException { - StringBuffer output = new StringBuffer(); - output.append(getRSSHeaders()); - IndexSearcher searcher = null; - try { - reader = IndexReader.open(new SimpleFSDirectory(indexFile)); - searcher = new IndexSearcher(reader); - GregorianCalendar gc = new java.util.GregorianCalendar(TimeZone.getDefault(), Locale.getDefault()); - gc.setTime(new Date()); - String nowDateTime = ISO8601.format(gc); - gc.add(java.util.GregorianCalendar.MINUTE, -5); - String fiveMinsAgo = ISO8601.format(gc); - TermRangeQuery query = new TermRangeQuery(Metadata.DATE.toString(), - fiveMinsAgo, nowDateTime, true, true); - TopScoreDocCollector collector = TopScoreDocCollector.create(20, - true); - searcher.search(query, collector); - ScoreDoc[] hits = collector.topDocs().scoreDocs; - for (int i = 0; i < hits.length; i++) { - Document doc = searcher.doc(hits[i].doc); - output.append(getRSSItem(doc)); - } - - } finally { - if (reader != null) reader.close(); - if (searcher != null) searcher.close(); - } - - output.append(getRSSFooters()); - return output.toString(); - } - - public String getRSSItem(Document doc) { - StringBuilder output = new StringBuilder(); - output.append("<item>"); - output.append(emitTag("guid", doc.get(DublinCore.SOURCE.getName()), - "isPermalink", "true")); - output.append(emitTag("title", doc.get(Metadata.TITLE), null, null)); - output.append(emitTag("link", doc.get(DublinCore.SOURCE.getName()), - null, null)); - output.append(emitTag("author", doc.get(Metadata.CREATOR), null, null)); - for (String topic : doc.getValues(Metadata.SUBJECT)) { - output.append(emitTag("category", topic, null, null)); - } - output.append(emitTag("pubDate", 
rssDateFormat.format(ISO8601.parse(doc - .get(Metadata.DATE.toString()))), null, null)); - output.append(emitTag("description", doc.get(Metadata.TITLE), null, - null)); - output.append("</item>"); - return output.toString(); - } - - public String getRSSHeaders() { - StringBuilder output = new StringBuilder(); - output.append("<?xml version=\"1.0\" encoding=\"utf-8\"?>"); - output.append("<rss version=\"2.0\">"); - output.append(" <channel>"); - output.append(" <title>Tika in Action: Recent Files Feed.</title>"); - output.append(" <description>Chapter 6 Examples demonstrating " - + "use of Tika Metadata for RSS.</description>"); - output.append(" <link>tikainaction.rss</link>"); - output.append(" <lastBuildDate>"); - output.append(rssDateFormat.format(new Date())); - output.append("</lastBuildDate>"); - output.append(" <generator>Manning Publications: Tika in Action</generator>"); - output.append(" <copyright>All Rights Reserved</copyright>"); - return output.toString(); - } - - public String getRSSFooters() { - return " </channel></rss>"; - } - - private String emitTag(String tagName, String value, String attributeName, - String attributeValue) { - StringBuilder output = new StringBuilder(); - output.append("<"); - output.append(tagName); - if (attributeName != null) { - output.append(" "); - output.append(attributeName); - output.append("=\""); - output.append(attributeValue); - output.append("\""); - } - output.append(">"); - output.append(value); - output.append("</"); - output.append(tagName); - output.append(">"); - return output.toString(); - } -} diff --git a/tika-example/src/main/java/org/apache/tika/example/RollbackSoftware.java b/tika-example/src/main/java/org/apache/tika/example/RollbackSoftware.java deleted file mode 100755 index 6890b75..0000000 --- a/tika-example/src/main/java/org/apache/tika/example/RollbackSoftware.java +++ /dev/null @@ -1,137 +0,0 @@ -/* - * Licensed to the Apache Software Foundation (ASF) under one or more - * contributor license agreements. See the NOTICE file distributed with - * this work for additional information regarding copyright ownership.
- * The ASF licenses this file to You under the Apache License, Version 2.0 - * (the "License"); you may not use this file except in compliance with - * the License. You may obtain a copy of the License at - * - * http://www.apache.org/licenses/LICENSE-2.0 - * - * Unless required by applicable law or agreed to in writing, software - * distributed under the License is distributed on an "AS IS" BASIS, - * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. - * See the License for the specific language governing permissions and - * limitations under the License. - */ - -package org.apache.tika.example; - -import java.io.File; -import java.io.FileFilter; -import java.io.IOException; -import java.io.InputStream; -import java.util.Arrays; -import java.util.Collections; -import java.util.Comparator; -import java.util.HashSet; -import java.util.List; -import java.util.Set; - -import org.apache.commons.io.IOUtils; -import org.apache.tika.exception.TikaException; -import org.apache.tika.metadata.Metadata; -import org.apache.tika.mime.MediaType; -import org.apache.tika.parser.ParseContext; -import org.apache.tika.parser.Parser; -import org.apache.tika.sax.Link; -import org.apache.tika.sax.LinkContentHandler; -import org.apache.tika.sax.XHTMLContentHandler; -import org.xml.sax.ContentHandler; -import org.xml.sax.SAXException; - -import static java.nio.charset.StandardCharsets.UTF_8; - -/** - * Demonstrates Tika and its ability to sense symlinks. 
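RollbackSoftware's symlink check below compares a file's absolute path with its canonical path. A hedged sketch of that pre-NIO heuristic next to the direct `java.nio.file` query (class and method names are illustrative; note the heuristic can also report true for ordinary files whose path contains `.`/`..` components or a symlinked parent directory, which is why the NIO check is usually preferable on Java 7+):

```java
import java.io.File;
import java.io.IOException;
import java.nio.file.Files;
import java.nio.file.Path;

// SymlinkCheck.java -- illustrative sketch of two ways to detect symlinks.
public class SymlinkCheck {

    /** Pre-NIO heuristic: a path whose canonical form differs from its
     *  absolute form has been rewritten somewhere -- possibly by a symlink,
     *  but also by "." or ".." components or a symlinked ancestor. */
    static boolean isSymlinkLegacy(File f) throws IOException {
        return !f.getAbsolutePath().equals(f.getCanonicalPath());
    }

    /** Java 7+ check that asks the filesystem about this path itself. */
    static boolean isSymlinkNio(Path p) {
        return Files.isSymbolicLink(p);
    }
}
```

The NIO variant distinguishes "this entry is a link" from "something on the way here was a link", which the canonical-path comparison cannot.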
- */ -public class RollbackSoftware { - public static void main(String[] args) throws Exception { - RollbackSoftware r = new RollbackSoftware(); - r.rollback(new File(args[0])); - } - - public void rollback(File deployArea) throws IOException, SAXException, - TikaException { - LinkContentHandler handler = new LinkContentHandler(); - Metadata met = new Metadata(); - DeploymentAreaParser parser = new DeploymentAreaParser(); - parser.parse(IOUtils.toInputStream(deployArea.getAbsolutePath(), UTF_8), - handler, met); - List links = handler.getLinks(); - if (links.size() < 2) - throw new IOException("Must have installed at least 2 versions!"); - Collections.sort(links, new Comparator() { - public int compare(Link o1, Link o2) { - return o1.getText().compareTo(o2.getText()); - } - }); - - this.updateVersion(links.get(links.size() - 2).getText()); - } - - private void updateVersion(String version) { - System.out.println("Rolling back to version: [" + version + "]"); - } - - class DeploymentAreaParser implements Parser { - private static final long serialVersionUID = -2356647405087933468L; - - /* - * (non-Javadoc) - * - * @see org.apache.tika.parser.Parser#getSupportedTypes( - * org.apache.tika.parser.ParseContext) - */ - public Set getSupportedTypes(ParseContext context) { - return Collections.unmodifiableSet(new HashSet(Arrays - .asList(MediaType.TEXT_PLAIN))); - } - - /* - * (non-Javadoc) - * - * @see org.apache.tika.parser.Parser#parse(java.io.InputStream, - * org.xml.sax.ContentHandler, org.apache.tika.metadata.Metadata) - */ - public void parse(InputStream is, ContentHandler handler, - Metadata metadata) throws IOException, SAXException, - TikaException { - parse(is, handler, metadata, new ParseContext()); - } - - /* - * (non-Javadoc) - * - * @see org.apache.tika.parser.Parser#parse(java.io.InputStream, - * org.xml.sax.ContentHandler, org.apache.tika.metadata.Metadata, - * org.apache.tika.parser.ParseContext) - */ - public void parse(InputStream is, ContentHandler 
handler, - Metadata metadata, ParseContext context) throws IOException, - SAXException, TikaException { - - File deployArea = new File(IOUtils.toString(is, UTF_8)); - File[] versions = deployArea.listFiles(new FileFilter() { - public boolean accept(File pathname) { - return !pathname.getName().startsWith("current"); - } - }); - - XHTMLContentHandler xhtml = new XHTMLContentHandler(handler, - metadata); - xhtml.startDocument(); - for (File v : versions) { - if (isSymlink(v)) - continue; - xhtml.startElement("a", "href", v.toURI().toURL().toExternalForm()); - xhtml.characters(v.getName()); - xhtml.endElement("a"); - } - } - } - - private boolean isSymlink(File f) throws IOException { - return !f.getAbsolutePath().equals(f.getCanonicalPath()); - } -} diff --git a/tika-example/src/main/java/org/apache/tika/example/SimpleTextExtractor.java b/tika-example/src/main/java/org/apache/tika/example/SimpleTextExtractor.java deleted file mode 100755 index 0035d02..0000000 --- a/tika-example/src/main/java/org/apache/tika/example/SimpleTextExtractor.java +++ /dev/null @@ -1,36 +0,0 @@ -/* - * Licensed to the Apache Software Foundation (ASF) under one or more - * contributor license agreements. See the NOTICE file distributed with - * this work for additional information regarding copyright ownership. - * The ASF licenses this file to You under the Apache License, Version 2.0 - * (the "License"); you may not use this file except in compliance with - * the License. You may obtain a copy of the License at - * - * http://www.apache.org/licenses/LICENSE-2.0 - * - * Unless required by applicable law or agreed to in writing, software - * distributed under the License is distributed on an "AS IS" BASIS, - * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. - * See the License for the specific language governing permissions and - * limitations under the License. 
- */ - -package org.apache.tika.example; - -import java.io.File; - -import org.apache.tika.Tika; - -public class SimpleTextExtractor { - public static void main(String[] args) throws Exception { - // Create a Tika instance with the default configuration - Tika tika = new Tika(); - - // Parse all given files and print out the extracted - // text content - for (String file : args) { - String text = tika.parseToString(new File(file)); - System.out.print(text); - } - } -} diff --git a/tika-example/src/main/java/org/apache/tika/example/SimpleTypeDetector.java b/tika-example/src/main/java/org/apache/tika/example/SimpleTypeDetector.java deleted file mode 100755 index 2b242cc..0000000 --- a/tika-example/src/main/java/org/apache/tika/example/SimpleTypeDetector.java +++ /dev/null @@ -1,33 +0,0 @@ -/* - * Licensed to the Apache Software Foundation (ASF) under one or more - * contributor license agreements. See the NOTICE file distributed with - * this work for additional information regarding copyright ownership. - * The ASF licenses this file to You under the Apache License, Version 2.0 - * (the "License"); you may not use this file except in compliance with - * the License. You may obtain a copy of the License at - * - * http://www.apache.org/licenses/LICENSE-2.0 - * - * Unless required by applicable law or agreed to in writing, software - * distributed under the License is distributed on an "AS IS" BASIS, - * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. - * See the License for the specific language governing permissions and - * limitations under the License. 
- */ - -package org.apache.tika.example; - -import java.io.File; - -import org.apache.tika.Tika; - -public class SimpleTypeDetector { - public static void main(String[] args) throws Exception { - Tika tika = new Tika(); - - for (String file : args) { - String type = tika.detect(new File(file)); - System.out.println(file + ": " + type); - } - } -} diff --git a/tika-example/src/main/java/org/apache/tika/example/SpringExample.java b/tika-example/src/main/java/org/apache/tika/example/SpringExample.java deleted file mode 100755 index c1bafbf..0000000 --- a/tika-example/src/main/java/org/apache/tika/example/SpringExample.java +++ /dev/null @@ -1,40 +0,0 @@ -/* - * Licensed to the Apache Software Foundation (ASF) under one or more - * contributor license agreements. See the NOTICE file distributed with - * this work for additional information regarding copyright ownership. - * The ASF licenses this file to You under the Apache License, Version 2.0 - * (the "License"); you may not use this file except in compliance with - * the License. You may obtain a copy of the License at - * - * http://www.apache.org/licenses/LICENSE-2.0 - * - * Unless required by applicable law or agreed to in writing, software - * distributed under the License is distributed on an "AS IS" BASIS, - * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. - * See the License for the specific language governing permissions and - * limitations under the License. 
- */ - -package org.apache.tika.example; - -import java.io.ByteArrayInputStream; - -import org.apache.tika.metadata.Metadata; -import org.apache.tika.parser.ParseContext; -import org.apache.tika.parser.Parser; -import org.apache.tika.sax.WriteOutContentHandler; -import org.springframework.context.ApplicationContext; -import org.springframework.context.support.ClassPathXmlApplicationContext; - -import static java.nio.charset.StandardCharsets.UTF_8; - -public class SpringExample { - public static void main(String[] args) throws Exception { - ApplicationContext context = new ClassPathXmlApplicationContext( - new String[]{"org/apache/tika/example/spring.xml"}); - Parser parser = context.getBean("tika", Parser.class); - parser.parse(new ByteArrayInputStream("Hello, World!".getBytes(UTF_8)), - new WriteOutContentHandler(System.out), new Metadata(), - new ParseContext()); - } -} diff --git a/tika-example/src/main/java/org/apache/tika/example/TIAParsingExample.java b/tika-example/src/main/java/org/apache/tika/example/TIAParsingExample.java deleted file mode 100755 index 333763c..0000000 --- a/tika-example/src/main/java/org/apache/tika/example/TIAParsingExample.java +++ /dev/null @@ -1,201 +0,0 @@ -/* - * Licensed to the Apache Software Foundation (ASF) under one or more - * contributor license agreements. See the NOTICE file distributed with - * this work for additional information regarding copyright ownership. - * The ASF licenses this file to You under the Apache License, Version 2.0 - * (the "License"); you may not use this file except in compliance with - * the License. You may obtain a copy of the License at - * - * http://www.apache.org/licenses/LICENSE-2.0 - * - * Unless required by applicable law or agreed to in writing, software - * distributed under the License is distributed on an "AS IS" BASIS, - * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. 
- * See the License for the specific language governing permissions and - * limitations under the License. - */ - -package org.apache.tika.example; - -import java.io.ByteArrayInputStream; -import java.io.File; -import java.io.FileInputStream; -import java.io.FileOutputStream; -import java.io.IOException; -import java.io.InputStream; -import java.io.OutputStream; -import java.io.Reader; -import java.net.URL; -import java.nio.CharBuffer; -import java.util.HashMap; -import java.util.Locale; -import java.util.Map; -import java.util.zip.GZIPInputStream; - -import org.apache.tika.Tika; -import org.apache.tika.exception.TikaException; -import org.apache.tika.io.TikaInputStream; -import org.apache.tika.metadata.Metadata; -import org.apache.tika.mime.MediaType; -import org.apache.tika.parser.AutoDetectParser; -import org.apache.tika.parser.CompositeParser; -import org.apache.tika.parser.ParseContext; -import org.apache.tika.parser.Parser; -import org.apache.tika.parser.ParserDecorator; -import org.apache.tika.parser.html.HtmlMapper; -import org.apache.tika.parser.html.HtmlParser; -import org.apache.tika.parser.html.IdentityHtmlMapper; -import org.apache.tika.parser.txt.TXTParser; -import org.apache.tika.parser.xml.XMLParser; -import org.apache.tika.sax.BodyContentHandler; -import org.apache.tika.sax.LinkContentHandler; -import org.apache.tika.sax.TeeContentHandler; -import org.xml.sax.ContentHandler; -import org.xml.sax.SAXException; -import org.xml.sax.helpers.DefaultHandler; - -public class TIAParsingExample { - public static String parseToStringExample() throws Exception { - File document = new File("example.doc"); - String content = new Tika().parseToString(document); - System.out.print(content); - return content; - } - - public static void parseToReaderExample() throws Exception { - File document = new File("example.doc"); - try (Reader reader = new Tika().parse(document)) { - char[] buffer = new char[1000]; - int n = reader.read(buffer); - while (n != -1) { - 
System.out.append(CharBuffer.wrap(buffer, 0, n)); - n = reader.read(buffer); - } - } - } - - public static void parseFileInputStream(String filename) throws Exception { - Parser parser = new AutoDetectParser(); - ContentHandler handler = new DefaultHandler(); - Metadata metadata = new Metadata(); - ParseContext context = new ParseContext(); - try (InputStream stream = new FileInputStream(new File(filename))) { - parser.parse(stream, handler, metadata, context); - } - } - - public static void parseURLStream(String address) throws Exception { - Parser parser = new AutoDetectParser(); - ContentHandler handler = new DefaultHandler(); - Metadata metadata = new Metadata(); - ParseContext context = new ParseContext(); - try (InputStream stream = new GZIPInputStream(new URL(address).openStream())) { - parser.parse(stream, handler, metadata, context); - } - } - - public static void parseTikaInputStream(String filename) throws Exception { - Parser parser = new AutoDetectParser(); - ContentHandler handler = new DefaultHandler(); - Metadata metadata = new Metadata(); - ParseContext context = new ParseContext(); - try (InputStream stream = TikaInputStream.get(new File(filename))) { - parser.parse(stream, handler, metadata, context); - } - } - - public static File tikaInputStreamGetFile(String filename) throws Exception { - try (InputStream stream = TikaInputStream.get(new File(filename))) { - TikaInputStream tikaInputStream = TikaInputStream.get(stream); - File file = tikaInputStream.getFile(); - return file; - } - } - - public static void useHtmlParser() throws Exception { - InputStream stream = new ByteArrayInputStream(new byte[0]); - ContentHandler handler = new DefaultHandler(); - Metadata metadata = new Metadata(); - ParseContext context = new ParseContext(); - Parser parser = new HtmlParser(); - parser.parse(stream, handler, metadata, context); - } - - public static void useCompositeParser() throws Exception { - InputStream stream = new ByteArrayInputStream(new byte[0]); 
- ContentHandler handler = new DefaultHandler(); - ParseContext context = new ParseContext(); - Map parsersByType = new HashMap(); - parsersByType.put(MediaType.parse("text/html"), new HtmlParser()); - parsersByType.put(MediaType.parse("application/xml"), new XMLParser()); - - CompositeParser parser = new CompositeParser(); - parser.setParsers(parsersByType); - parser.setFallback(new TXTParser()); - - Metadata metadata = new Metadata(); - metadata.set(Metadata.CONTENT_TYPE, "text/html"); - parser.parse(stream, handler, metadata, context); - } - - public static void useAutoDetectParser() throws Exception { - InputStream stream = new ByteArrayInputStream(new byte[0]); - ContentHandler handler = new DefaultHandler(); - Metadata metadata = new Metadata(); - ParseContext context = new ParseContext(); - Parser parser = new AutoDetectParser(); - parser.parse(stream, handler, metadata, context); - } - - public static void testTeeContentHandler(String filename) throws Exception { - InputStream stream = new ByteArrayInputStream(new byte[0]); - Metadata metadata = new Metadata(); - ParseContext context = new ParseContext(); - Parser parser = new AutoDetectParser(); - LinkContentHandler linkCollector = new LinkContentHandler(); - try (OutputStream output = new FileOutputStream(new File(filename))) { - ContentHandler handler = new TeeContentHandler( - new BodyContentHandler(output), linkCollector); - parser.parse(stream, handler, metadata, context); - } - } - - public static void testLocale() throws Exception { - InputStream stream = new ByteArrayInputStream(new byte[0]); - ContentHandler handler = new DefaultHandler(); - Metadata metadata = new Metadata(); - Parser parser = new AutoDetectParser(); - ParseContext context = new ParseContext(); - context.set(Locale.class, Locale.ENGLISH); - parser.parse(stream, handler, metadata, context); - } - - public static void testHtmlMapper() throws Exception { - InputStream stream = new ByteArrayInputStream(new byte[0]); - ContentHandler 
handler = new DefaultHandler(); - Metadata metadata = new Metadata(); - Parser parser = new AutoDetectParser(); - ParseContext context = new ParseContext(); - context.set(HtmlMapper.class, new IdentityHtmlMapper()); - parser.parse(stream, handler, metadata, context); - } - - public static void testCompositeDocument() throws Exception { - InputStream stream = new ByteArrayInputStream(new byte[0]); - ContentHandler handler = new DefaultHandler(); - Metadata metadata = new Metadata(); - Parser parser = new AutoDetectParser(); - ParseContext context = new ParseContext(); - context.set(Parser.class, new ParserDecorator(parser) { - private static final long serialVersionUID = 4424210691523343833L; - - @Override - public void parse(InputStream stream, ContentHandler handler, - Metadata metadata, ParseContext context) - throws IOException, SAXException, TikaException { - // custom processing of the component document - } - }); - parser.parse(stream, handler, metadata, context); - } -} diff --git a/tika-example/src/main/java/org/apache/tika/example/TranslatorExample.java b/tika-example/src/main/java/org/apache/tika/example/TranslatorExample.java deleted file mode 100644 index f7906e8..0000000 --- a/tika-example/src/main/java/org/apache/tika/example/TranslatorExample.java +++ /dev/null @@ -1,34 +0,0 @@ -/* - * Licensed to the Apache Software Foundation (ASF) under one or more - * contributor license agreements. See the NOTICE file distributed with - * this work for additional information regarding copyright ownership. - * The ASF licenses this file to You under the Apache License, Version 2.0 - * (the "License"); you may not use this file except in compliance with - * the License. 
You may obtain a copy of the License at - * - * http://www.apache.org/licenses/LICENSE-2.0 - * - * Unless required by applicable law or agreed to in writing, software - * distributed under the License is distributed on an "AS IS" BASIS, - * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. - * See the License for the specific language governing permissions and - * limitations under the License. - */ - -package org.apache.tika.example; - -import org.apache.tika.language.translate.MicrosoftTranslator; - -public class TranslatorExample { - public String microsoftTranslateToFrench(String text) { - MicrosoftTranslator translator = new MicrosoftTranslator(); - // Change the id and secret! See http://msdn.microsoft.com/en-us/library/hh454950.aspx. - translator.setId("dummy-id"); - translator.setSecret("dummy-secret"); - try { - return translator.translate(text, "fr"); - } catch (Exception e) { - return "Error while translating."; - } - } -} diff --git a/tika-example/src/main/java/org/apache/tika/example/TrecDocumentGenerator.java b/tika-example/src/main/java/org/apache/tika/example/TrecDocumentGenerator.java deleted file mode 100755 index d263822..0000000 --- a/tika-example/src/main/java/org/apache/tika/example/TrecDocumentGenerator.java +++ /dev/null @@ -1,107 +0,0 @@ -/* - * Licensed to the Apache Software Foundation (ASF) under one or more - * contributor license agreements. See the NOTICE file distributed with - * this work for additional information regarding copyright ownership. - * The ASF licenses this file to You under the Apache License, Version 2.0 - * (the "License"); you may not use this file except in compliance with - * the License. 
You may obtain a copy of the License at - * - * http://www.apache.org/licenses/LICENSE-2.0 - * - * Unless required by applicable law or agreed to in writing, software - * distributed under the License is distributed on an "AS IS" BASIS, - * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. - * See the License for the specific language governing permissions and - * limitations under the License. - */ - -package org.apache.tika.example; - -import java.io.File; -import java.io.FileInputStream; -import java.io.FileNotFoundException; -import java.io.IOException; -import java.util.Date; - -import org.apache.tika.Tika; -import org.apache.tika.exception.TikaException; -import org.apache.tika.metadata.Metadata; - -/** - * Generates document summaries for corpus analysis in the Open Relevance - * project. - */ -@SuppressWarnings("deprecation") -public class TrecDocumentGenerator { - public TrecDocument summarize(File file) throws FileNotFoundException, - IOException, TikaException { - Tika tika = new Tika(); - Metadata met = new Metadata(); - - String contents = tika.parseToString(new FileInputStream(file), met); - return new TrecDocument(met.get(Metadata.RESOURCE_NAME_KEY), contents, - met.getDate(Metadata.DATE)); - - } - - // copied from - // http://svn.apache.org/repos/asf/lucene/openrelevance/trunk/src/java/org/ - // apache/orp/util/TrecDocument.java - // since the ORP jars aren't published anywhere - class TrecDocument { - private CharSequence docname; - private CharSequence body; - private Date date; - - public TrecDocument(CharSequence docname, CharSequence body, Date date) { - this.docname = docname; - this.body = body; - this.date = date; - } - - public TrecDocument() { - } - - /** - * @return the docname - */ - public CharSequence getDocname() { - return docname; - } - - /** - * @param docname the docname to set - */ - public void setDocname(CharSequence docname) { - this.docname = docname; - } - - /** - * @return the body - */ - public 
CharSequence getBody() { - return body; - } - - /** - * @param body the body to set - */ - public void setBody(CharSequence body) { - this.body = body; - } - - /** - * @return the date - */ - public Date getDate() { - return date; - } - - /** - * @param date the date to set - */ - public void setDate(Date date) { - this.date = date; - } - } -} diff --git a/tika-example/src/main/java/org/apache/tika/example/ZipListFiles.java b/tika-example/src/main/java/org/apache/tika/example/ZipListFiles.java deleted file mode 100755 index b19bac8..0000000 --- a/tika-example/src/main/java/org/apache/tika/example/ZipListFiles.java +++ /dev/null @@ -1,45 +0,0 @@ -/* - * Licensed to the Apache Software Foundation (ASF) under one or more - * contributor license agreements. See the NOTICE file distributed with - * this work for additional information regarding copyright ownership. - * The ASF licenses this file to You under the Apache License, Version 2.0 - * (the "License"); you may not use this file except in compliance with - * the License. You may obtain a copy of the License at - * - * http://www.apache.org/licenses/LICENSE-2.0 - * - * Unless required by applicable law or agreed to in writing, software - * distributed under the License is distributed on an "AS IS" BASIS, - * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. - * See the License for the specific language governing permissions and - * limitations under the License. - */ - -package org.apache.tika.example; - -import java.io.IOException; -import java.util.Collections; -import java.util.zip.ZipEntry; -import java.util.zip.ZipFile; - -/** - * Example code listing from Chapter 1. Lists a zip file's entries using JDK's - * standard APIs. 
- */ -public class ZipListFiles { - public static void main(String[] args) throws Exception { - if (args.length > 0) { - for (String file : args) { - System.out.println("Files in " + file + " file:"); - listZipEntries(file); - } - } - } - - public static void listZipEntries(String path) throws IOException { - ZipFile zip = new ZipFile(path); - for (ZipEntry entry : Collections.list(zip.entries())) { - System.out.println(entry.getName()); - } - } -} diff --git a/tika-example/src/main/resources/org/apache/tika/example/spring.xml b/tika-example/src/main/resources/org/apache/tika/example/spring.xml deleted file mode 100755 index 6711351..0000000 --- a/tika-example/src/main/resources/org/apache/tika/example/spring.xml +++ /dev/null @@ -1,36 +0,0 @@ - - - - - - - - - - - - - - - - - - - \ No newline at end of file diff --git a/tika-example/src/main/resources/org/apache/tika/example/test.doc b/tika-example/src/main/resources/org/apache/tika/example/test.doc deleted file mode 100644 index 93198c8..0000000 Binary files a/tika-example/src/main/resources/org/apache/tika/example/test.doc and /dev/null differ diff --git a/tika-example/src/main/resources/org/apache/tika/example/test2.doc b/tika-example/src/main/resources/org/apache/tika/example/test2.doc deleted file mode 100644 index 3956ab1..0000000 Binary files a/tika-example/src/main/resources/org/apache/tika/example/test2.doc and /dev/null differ diff --git a/tika-example/src/main/resources/org/apache/tika/example/test_recursive_embedded.docx b/tika-example/src/main/resources/org/apache/tika/example/test_recursive_embedded.docx deleted file mode 100644 index cd562cb..0000000 Binary files a/tika-example/src/main/resources/org/apache/tika/example/test_recursive_embedded.docx and /dev/null differ diff --git a/tika-example/src/test/java/org/apache/tika/example/AdvancedTypeDetectorTest.java b/tika-example/src/test/java/org/apache/tika/example/AdvancedTypeDetectorTest.java deleted file mode 100755 index f302db5..0000000 --- 
a/tika-example/src/test/java/org/apache/tika/example/AdvancedTypeDetectorTest.java +++ /dev/null @@ -1,30 +0,0 @@ -/* - * Licensed to the Apache Software Foundation (ASF) under one or more - * contributor license agreements. See the NOTICE file distributed with - * this work for additional information regarding copyright ownership. - * The ASF licenses this file to You under the Apache License, Version 2.0 - * (the "License"); you may not use this file except in compliance with - * the License. You may obtain a copy of the License at - * - * http://www.apache.org/licenses/LICENSE-2.0 - * - * Unless required by applicable law or agreed to in writing, software - * distributed under the License is distributed on an "AS IS" BASIS, - * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. - * See the License for the specific language governing permissions and - * limitations under the License. - */ - -package org.apache.tika.example; - -import org.junit.Test; - -import static org.junit.Assert.assertEquals; - -@SuppressWarnings("deprecation") -public class AdvancedTypeDetectorTest { - @Test - public void testDetectWithCustomConfig() throws Exception { - assertEquals("application/xml", AdvancedTypeDetector.detectWithCustomConfig("pom.xml")); - } -} diff --git a/tika-example/src/test/java/org/apache/tika/example/ContentHandlerExampleTest.java b/tika-example/src/test/java/org/apache/tika/example/ContentHandlerExampleTest.java deleted file mode 100644 index 5b46612..0000000 --- a/tika-example/src/test/java/org/apache/tika/example/ContentHandlerExampleTest.java +++ /dev/null @@ -1,105 +0,0 @@ -/* - * Licensed to the Apache Software Foundation (ASF) under one or more - * contributor license agreements. See the NOTICE file distributed with - * this work for additional information regarding copyright ownership. 
- * The ASF licenses this file to You under the Apache License, Version 2.0 - * (the "License"); you may not use this file except in compliance with - * the License. You may obtain a copy of the License at - * - * http://www.apache.org/licenses/LICENSE-2.0 - * - * Unless required by applicable law or agreed to in writing, software - * distributed under the License is distributed on an "AS IS" BASIS, - * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. - * See the License for the specific language governing permissions and - * limitations under the License. - */ - -package org.apache.tika.example; - -import org.apache.tika.exception.TikaException; -import org.junit.Before; -import org.junit.Test; -import org.xml.sax.SAXException; - -import java.io.IOException; -import java.util.List; - -import static org.apache.tika.TikaTest.assertContains; -import static org.apache.tika.TikaTest.assertNotContained; -import static org.junit.Assert.assertEquals; -import static org.junit.Assert.assertTrue; - -public class ContentHandlerExampleTest { - ContentHandlerExample example; - - @Before - public void setUp() { - example = new ContentHandlerExample(); - } - - @Test - public void testParseToPlainText() throws IOException, SAXException, TikaException { - String result = example.parseToPlainText().trim(); - assertEquals("Expected 'test', but got '" + result + "'", "test", result); - } - - @Test - public void testParseToHTML() throws IOException, SAXException, TikaException { - String result = example.parseToHTML().trim(); - - assertContains("", result); - assertContains("", result); - assertContains("", result); - assertContains(">test", result); - } - - @Test - public void testParseBodyToHTML() throws IOException, SAXException, TikaException { - String result = example.parseBodyToHTML().trim(); - - assertNotContained("", result); - assertNotContained("", result); - assertNotContained("", result); - assertContains(">test", result); - } - - @Test - public 
void testParseOnePartToHTML() throws IOException, SAXException, TikaException { - String result = example.parseOnePartToHTML().trim(); - - assertNotContained("", result); - assertNotContained("", result); - assertNotContained("", result); - assertContains("

    Test Document", result); - assertNotContained("

    1 2 3", result); - } - - - @Test - public void testParseToPlainTextChunks() throws IOException, SAXException, TikaException { - List result = example.parseToPlainTextChunks(); - - assertEquals(3, result.size()); - for (String chunk : result) { - assertTrue("Chunk under max size", chunk.length() <= example.MAXIMUM_TEXT_CHUNK_SIZE); - } - - assertContains("This is in the header", result.get(0)); - assertContains("Test Document", result.get(0)); - - assertContains("Testing", result.get(1)); - assertContains("1 2 3", result.get(1)); - assertContains("TestTable", result.get(1)); - - assertContains("Testing 123", result.get(2)); - } -} diff --git a/tika-example/src/test/java/org/apache/tika/example/DumpTikaConfigExampleTest.java b/tika-example/src/test/java/org/apache/tika/example/DumpTikaConfigExampleTest.java deleted file mode 100644 index 29acfab..0000000 --- a/tika-example/src/test/java/org/apache/tika/example/DumpTikaConfigExampleTest.java +++ /dev/null @@ -1,90 +0,0 @@ -package org.apache.tika.example; - -/* - * Licensed to the Apache Software Foundation (ASF) under one or more - * contributor license agreements. See the NOTICE file distributed with - * this work for additional information regarding copyright ownership. - * The ASF licenses this file to You under the Apache License, Version 2.0 - * (the "License"); you may not use this file except in compliance with - * the License. You may obtain a copy of the License at - * - * http://www.apache.org/licenses/LICENSE-2.0 - * - * Unless required by applicable law or agreed to in writing, software - * distributed under the License is distributed on an "AS IS" BASIS, - * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. - * See the License for the specific language governing permissions and - * limitations under the License. 
- */ - -import static java.nio.charset.StandardCharsets.UTF_16LE; -import static java.nio.charset.StandardCharsets.UTF_8; -import static org.junit.Assert.assertNotNull; -import static org.junit.Assert.assertTrue; - -import java.io.File; -import java.io.FileOutputStream; -import java.io.IOException; -import java.io.OutputStreamWriter; -import java.io.Writer; -import java.nio.charset.Charset; - -import org.apache.tika.config.TikaConfig; -import org.apache.tika.detect.CompositeDetector; -import org.apache.tika.example.DumpTikaConfigExample.Mode; -import org.apache.tika.parser.AutoDetectParser; -import org.apache.tika.parser.CompositeParser; -import org.apache.tika.parser.Parser; -import org.junit.After; -import org.junit.Before; -import org.junit.Test; - -public class DumpTikaConfigExampleTest { - private File configFile; - @Before - public void setUp() { - try { - configFile = File.createTempFile("tmp", ".xml"); - } catch (IOException e) { - throw new RuntimeException("Failed to create tmp file"); - } - } - - @After - public void tearDown() { - if (configFile != null && configFile.exists()) { - configFile.delete(); - } - if (configFile != null && configFile.exists()) { - throw new RuntimeException("Failed to clean up: "+configFile.getAbsolutePath()); - } - } - - @Test - public void testDump() throws Exception { - DumpTikaConfigExample ex = new DumpTikaConfigExample(); - for (Charset charset : new Charset[]{UTF_8, UTF_16LE}) { - for (Mode mode : Mode.values()) { - Writer writer = new OutputStreamWriter(new FileOutputStream(configFile), charset); - ex.dump(TikaConfig.getDefaultConfig(), mode, writer, charset.name()); - writer.flush(); - writer.close(); - - TikaConfig c = new TikaConfig(configFile); - assertTrue(c.getParser().toString(), c.getParser() instanceof CompositeParser); - assertTrue(c.getDetector().toString(), c.getDetector() instanceof CompositeDetector); - - CompositeParser p = (CompositeParser) c.getParser(); - assertTrue("enough parsers?", 
p.getParsers().size() > 130); - - CompositeDetector d = (CompositeDetector) c.getDetector(); - assertTrue("enough detectors?", d.getDetectors().size() > 3); - - //just try to load it into autodetect to make sure no errors are thrown - Parser auto = new AutoDetectParser(c); - assertNotNull(auto); - } - } - } - -} diff --git a/tika-example/src/test/java/org/apache/tika/example/ExtractEmbeddedFilesTest.java b/tika-example/src/test/java/org/apache/tika/example/ExtractEmbeddedFilesTest.java deleted file mode 100644 index 22b5a42..0000000 --- a/tika-example/src/test/java/org/apache/tika/example/ExtractEmbeddedFilesTest.java +++ /dev/null @@ -1,62 +0,0 @@ -/* - * Licensed to the Apache Software Foundation (ASF) under one or more - * contributor license agreements. See the NOTICE file distributed with - * this work for additional information regarding copyright ownership. - * The ASF licenses this file to You under the Apache License, Version 2.0 - * (the "License"); you may not use this file except in compliance with - * the License. You may obtain a copy of the License at - * - * http://www.apache.org/licenses/LICENSE-2.0 - * - * Unless required by applicable law or agreed to in writing, software - * distributed under the License is distributed on an "AS IS" BASIS, - * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. - * See the License for the specific language governing permissions and - * limitations under the License. 
- */ - -package org.apache.tika.example; - -import static org.junit.Assert.assertEquals; - -import java.io.IOException; -import java.nio.file.DirectoryStream; -import java.nio.file.Files; -import java.nio.file.Path; -import java.util.List; - -import org.junit.After; -import org.junit.Before; -import org.junit.Test; - -public class ExtractEmbeddedFilesTest { - - ParsingExample parsingExample; - Path outputPath; - - @Before - public void setUp() throws IOException { - parsingExample = new ParsingExample(); - outputPath = Files.createTempDirectory("tika-ext-emb-example-"); - } - - @After - public void tearDown() throws IOException { - //this does not act recursively, this only assumes single level directory - try (DirectoryStream dirStream = Files.newDirectoryStream(outputPath)) { - for (Path entry: dirStream) { - Files.delete(entry); - } - } - Files.delete(outputPath); - - } - - @Test - public void testExtractEmbeddedFiles() throws Exception { - List extracted = parsingExample.extractEmbeddedDocumentsExample(outputPath); - //this number should be bigger!!! - assertEquals(2, extracted.size()); - } - -} diff --git a/tika-example/src/test/java/org/apache/tika/example/LanguageIdentifierExampleTest.java b/tika-example/src/test/java/org/apache/tika/example/LanguageIdentifierExampleTest.java deleted file mode 100644 index 2a1717e..0000000 --- a/tika-example/src/test/java/org/apache/tika/example/LanguageIdentifierExampleTest.java +++ /dev/null @@ -1,37 +0,0 @@ -/* - * Licensed to the Apache Software Foundation (ASF) under one or more - * contributor license agreements. See the NOTICE file distributed with - * this work for additional information regarding copyright ownership. - * The ASF licenses this file to You under the Apache License, Version 2.0 - * (the "License"); you may not use this file except in compliance with - * the License. 
You may obtain a copy of the License at - * - * http://www.apache.org/licenses/LICENSE-2.0 - * - * Unless required by applicable law or agreed to in writing, software - * distributed under the License is distributed on an "AS IS" BASIS, - * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. - * See the License for the specific language governing permissions and - * limitations under the License. - */ - -package org.apache.tika.example; - -import org.junit.Before; -import org.junit.Test; - -import static org.junit.Assert.assertEquals; - -public class LanguageIdentifierExampleTest { - LanguageIdentifierExample languageIdentifierExample; - @Before - public void setUp() { - languageIdentifierExample = new LanguageIdentifierExample(); - } - - @Test - public void testIdentifyLanguage() { - String text = "This is some text that should be identified as English."; - assertEquals("en", languageIdentifierExample.identifyLanguage(text)); - } -} diff --git a/tika-example/src/test/java/org/apache/tika/example/SimpleTextExtractorTest.java b/tika-example/src/test/java/org/apache/tika/example/SimpleTextExtractorTest.java deleted file mode 100755 index 33d00b4..0000000 --- a/tika-example/src/test/java/org/apache/tika/example/SimpleTextExtractorTest.java +++ /dev/null @@ -1,48 +0,0 @@ -/** - * Licensed under the Apache License, Version 2.0 (the "License"); - * you may not use this file except in compliance with the License. - * You may obtain a copy of the License at - * - * http://www.apache.org/licenses/LICENSE-2.0 - * - * Unless required by applicable law or agreed to in writing, software - * distributed under the License is distributed on an "AS IS" BASIS, - * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. - * See the License for the specific language governing permissions and - * limitations under the License. 
- */ - -package org.apache.tika.example; - -import static java.nio.charset.StandardCharsets.UTF_8; -import static org.junit.Assert.assertEquals; - -import java.io.ByteArrayOutputStream; -import java.io.File; -import java.io.PrintStream; - -import org.apache.commons.io.FileUtils; -import org.junit.Test; - -public class SimpleTextExtractorTest { - @Test - public void testSimpleTextExtractor() throws Exception { - String message = - "This is Tika - Hello, World! This is simple UTF-8 text" - + " content written in English to test autodetection of" - + " the character encoding of the input stream."; - ByteArrayOutputStream buffer = new ByteArrayOutputStream(); - - PrintStream out = System.out; - System.setOut(new PrintStream(buffer, true, UTF_8.name())); - - File file = new File("target", "test.txt"); - FileUtils.writeStringToFile(file, message, UTF_8); - SimpleTextExtractor.main(new String[] { file.getPath() }); - file.delete(); - - System.setOut(out); - - assertEquals(message, buffer.toString(UTF_8.name()).trim()); - } -} diff --git a/tika-example/src/test/java/org/apache/tika/example/SimpleTypeDetectorTest.java b/tika-example/src/test/java/org/apache/tika/example/SimpleTypeDetectorTest.java deleted file mode 100755 index 552564a..0000000 --- a/tika-example/src/test/java/org/apache/tika/example/SimpleTypeDetectorTest.java +++ /dev/null @@ -1,43 +0,0 @@ -/** - * Licensed under the Apache License, Version 2.0 (the "License"); - * you may not use this file except in compliance with the License. - * You may obtain a copy of the License at - * - * http://www.apache.org/licenses/LICENSE-2.0 - * - * Unless required by applicable law or agreed to in writing, software - * distributed under the License is distributed on an "AS IS" BASIS, - * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. - * See the License for the specific language governing permissions and - * limitations under the License. 
- */ - -package org.apache.tika.example; - -import static java.nio.charset.StandardCharsets.UTF_8; -import static org.junit.Assert.assertEquals; - -import java.io.ByteArrayOutputStream; -import java.io.PrintStream; - -import org.junit.Test; - -@SuppressWarnings("deprecation") -public class SimpleTypeDetectorTest { - - @Test - public void testSimpleTypeDetector() throws Exception { - ByteArrayOutputStream buffer = new ByteArrayOutputStream(); - - PrintStream out = System.out; - System.setOut(new PrintStream(buffer, true, UTF_8.name())); - - SimpleTypeDetector.main(new String[] { "pom.xml" }); - - System.setOut(out); - - assertEquals("pom.xml: application/xml", - buffer.toString(UTF_8.name()).trim()); - } - -} diff --git a/tika-example/src/test/java/org/apache/tika/example/TestParsingExample.java b/tika-example/src/test/java/org/apache/tika/example/TestParsingExample.java deleted file mode 100644 index 027b318..0000000 --- a/tika-example/src/test/java/org/apache/tika/example/TestParsingExample.java +++ /dev/null @@ -1,102 +0,0 @@ -/* - * Licensed to the Apache Software Foundation (ASF) under one or more - * contributor license agreements. See the NOTICE file distributed with - * this work for additional information regarding copyright ownership. - * The ASF licenses this file to You under the Apache License, Version 2.0 - * (the "License"); you may not use this file except in compliance with - * the License. You may obtain a copy of the License at - * - * http://www.apache.org/licenses/LICENSE-2.0 - * - * Unless required by applicable law or agreed to in writing, software - * distributed under the License is distributed on an "AS IS" BASIS, - * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. - * See the License for the specific language governing permissions and - * limitations under the License. 
- */ - -package org.apache.tika.example; - -import static org.junit.Assert.assertEquals; -import static org.junit.Assert.assertFalse; -import static org.junit.Assert.assertTrue; - -import java.io.IOException; -import java.io.StringReader; -import java.util.List; - -import org.apache.tika.exception.TikaException; -import org.apache.tika.metadata.Metadata; -import org.apache.tika.metadata.serialization.JsonMetadataList; -import org.junit.Before; -import org.junit.Test; -import org.xml.sax.SAXException; - -public class TestParsingExample { - ParsingExample parsingExample; - @Before - public void setUp() { - parsingExample = new ParsingExample(); - } - - @Test - public void testParseToStringExample() throws IOException, SAXException, TikaException { - String result = parsingExample.parseToStringExample().trim(); - assertEquals("Expected 'test', but got '" + result + "'", "test", result); - } - - @Test - public void testParseExample() throws IOException, SAXException, TikaException { - String result = parsingExample.parseExample().trim(); - assertEquals("Expected 'test', but got '" + result + "'", "test", result); - } - - @Test - public void testNoEmbeddedExample() throws IOException, SAXException, TikaException { - String result = parsingExample.parseNoEmbeddedExample(); - assertContains("embed_0", result); - assertNotContains("embed1/embed1a.txt", result); - assertNotContains("embed3/embed3.txt", result); - assertNotContains("When in the Course", result); - } - - - @Test - public void testRecursiveParseExample() throws IOException, SAXException, TikaException { - String result = parsingExample.parseEmbeddedExample(); - assertContains("embed_0", result); - assertContains("embed1/embed1a.txt", result); - assertContains("embed3/embed3.txt", result); - assertContains("When in the Course", result); - } - - @Test - public void testRecursiveParserWrapperExample() throws IOException, SAXException, TikaException { - List metadataList = 
parsingExample.recursiveParserWrapperExample(); - assertEquals("Number of embedded documents + 1 for the container document", 12, metadataList.size()); - Metadata m = metadataList.get(6); - //this is the location the embed3.txt text file within the outer .docx - assertEquals("/embed1.zip/embed2.zip/embed3.zip/embed3.txt", - m.get("X-TIKA:embedded_resource_path")); - //it contains some html encoded content - assertContains("When in the Course", m.get("X-TIKA:content")); - } - - @Test - public void testSerializedRecursiveParserWrapperExample() throws IOException, SAXException, TikaException { - String json = parsingExample.serializedRecursiveParserWrapperExample(); - assertTrue(json.indexOf("When in the Course") > -1); - //now try deserializing the JSON - List metadataList = JsonMetadataList.fromJson(new StringReader(json)); - assertEquals(12, metadataList.size()); - } - - public static void assertContains(String needle, String haystack) { - assertTrue("Should have found " + needle + " in: " + haystack, haystack.contains(needle)); - } - - public static void assertNotContains(String needle, String haystack) { - assertFalse("Should not have found " + needle + " in: " + haystack, haystack.contains(needle)); - } - -} diff --git a/tika-example/src/test/java/org/apache/tika/example/TranslatorExampleTest.java b/tika-example/src/test/java/org/apache/tika/example/TranslatorExampleTest.java deleted file mode 100644 index dee6275..0000000 --- a/tika-example/src/test/java/org/apache/tika/example/TranslatorExampleTest.java +++ /dev/null @@ -1,44 +0,0 @@ -/* - * Licensed to the Apache Software Foundation (ASF) under one or more - * contributor license agreements. See the NOTICE file distributed with - * this work for additional information regarding copyright ownership. - * The ASF licenses this file to You under the Apache License, Version 2.0 - * (the "License"); you may not use this file except in compliance with - * the License. 
You may obtain a copy of the License at - * - * http://www.apache.org/licenses/LICENSE-2.0 - * - * Unless required by applicable law or agreed to in writing, software - * distributed under the License is distributed on an "AS IS" BASIS, - * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. - * See the License for the specific language governing permissions and - * limitations under the License. - */ - -package org.apache.tika.example; - -import org.junit.Before; -import org.junit.Test; - -import java.util.Locale; - -import static org.junit.Assert.assertEquals; - -public class TranslatorExampleTest { - TranslatorExample translatorExample; - - @Before - public void setUp() { - translatorExample = new TranslatorExample(); - } - - @Test - public void testMicrosoftTranslateToFrench() { - String text = "hello"; - String expected = "salut"; - String translated = translatorExample.microsoftTranslateToFrench(text); - // The user may not have set the id and secret. So, we have to check if we just - // got the same text back. 
- if (!translated.equals(text)) assertEquals(expected, translated.toLowerCase(Locale.ROOT)); - } -} diff --git a/tika-java7/pom.xml b/tika-java7/pom.xml index bf43c7b..06f3b4b 100644 --- a/tika-java7/pom.xml +++ b/tika-java7/pom.xml @@ -25,19 +25,15 @@ org.apache.tika tika-parent - 1.11 + 1.5 ../tika-parent/pom.xml tika-java7 bundle + Apache Tika Java-7 Components - http://tika.apache.org/ - - - 1.7 - 1.7 - + Java-7 reliant components, including FileTypeDetector implementations @@ -61,13 +57,12 @@ - org.apache.rat - apache-rat-plugin + org.apache.maven.plugins + maven-compiler-plugin + 3.1 - - src/main/resources/META-INF/services/java.nio.file.spi.FileTypeDetector - src/test/resources/test-documents/* - + 1.7 + 1.7 @@ -92,25 +87,27 @@ junit junit + test + 4.11 - Java-7 reliant components, including FileTypeDetector implementations + http://tika.apache.org/ - The Apache Software Foundation - http://www.apache.org + The Apache Software Foundation + http://www.apache.org - http://svn.apache.org/viewvc/tika/tags/1.11-rc1/tika-java7 - scm:svn:http://svn.apache.org/repos/asf/tika/tags/1.11-rc1/tika-java7 - scm:svn:https://svn.apache.org/repos/asf/tika/tags/1.11-rc1/tika-java7 + http://svn.apache.org/viewvc/tika/tags/1.5/tika-java7 + scm:svn:http://svn.apache.org/repos/asf/tika/tags/1.5/tika-java7 + scm:svn:https://svn.apache.org/repos/asf/tika/tags/1.5/tika-java7 - JIRA - https://issues.apache.org/jira/browse/TIKA + JIRA + https://issues.apache.org/jira/browse/TIKA - Jenkins - https://builds.apache.org/job/Tika-trunk/ + Jenkins + https://builds.apache.org/job/Tika-trunk/ diff --git a/tika-java7/src/main/java/org/apache/tika/filetypedetector/TikaFileTypeDetector.java b/tika-java7/src/main/java/org/apache/tika/filetypedetector/TikaFileTypeDetector.java index c479ade..ed340db 100644 --- a/tika-java7/src/main/java/org/apache/tika/filetypedetector/TikaFileTypeDetector.java +++ b/tika-java7/src/main/java/org/apache/tika/filetypedetector/TikaFileTypeDetector.java @@ -18,6 +18,7 
@@ package org.apache.tika.filetypedetector; import java.io.IOException; +import java.nio.file.Files; import java.nio.file.Path; import java.nio.file.spi.FileTypeDetector; @@ -40,7 +41,7 @@ } // Then check the file content if necessary - String fileContentDetect = tika.detect(path); + String fileContentDetect = tika.detect(path.toFile()); if(!fileContentDetect.equals(MimeTypes.OCTET_STREAM)) { return fileContentDetect; } diff --git a/tika-parent/pom.xml b/tika-parent/pom.xml index 2d525c9..441e07b 100644 --- a/tika-parent/pom.xml +++ b/tika-parent/pom.xml @@ -25,13 +25,13 @@ org.apache apache - 17 + 10 org.apache.tika tika-parent - 1.11 + 1.5 pom Apache Tika parent @@ -187,7 +187,7 @@ +3 - + Oleg Tikhonov oleg @@ -201,30 +201,6 @@ Alfresco http://alfresco.com -5 - - committer - - - - Tyler Palsulich - tpalsulich - -8 - - committer - - - - Tim Allison - tallison - -5 - - committer - - - - Konstantin Gribov - grossws - +3 committer @@ -266,104 +242,47 @@ junit junit - 4.11 + 4.10 test - - - org.slf4j - slf4j-api - ${slf4j.version} - - - org.slf4j - slf4j-log4j12 - ${slf4j.version} - - - org.slf4j - slf4j-simple - ${slf4j.version} - - - org.slf4j - jul-to-slf4j - ${slf4j.version} - - - org.slf4j - jcl-over-slf4j - ${slf4j.version} - 1.7 - 1.7 + 1.6 + 1.6 ${project.build.sourceEncoding} - 1.10 - 2.4 - 1.7.12 maven-compiler-plugin - 3.2 - ${maven.compiler.source} - ${maven.compiler.target} + ${maven.compile.source} + ${maven.compile.target} - - de.thetaphi - forbiddenapis - 2.0 - - ${maven.compiler.target} - true - false - false - - jdk-unsafe - jdk-deprecated - commons-io-unsafe-${commons.io.version} - - - - - - check - testCheck - - - - - - org.apache.felix - maven-bundle-plugin - 2.3.4 - - - org.apache.maven.plugins - maven-surefire-plugin - 2.18.1 - - -Xmx2048m - - - - org.apache.maven.plugins - maven-shade-plugin - 2.3 - - - org.apache.maven.plugins - maven-release-plugin - 2.3.2 - + + + + org.apache.felix + maven-bundle-plugin + 2.3.4 + + + 
org.apache.maven.plugins + maven-surefire-plugin + 2.12 + + + org.apache.maven.plugins + maven-shade-plugin + 1.6 + + + @@ -403,10 +322,4 @@ - - - scm:svn:http://svn.apache.org/repos/asf/maven/pom/tags/1.11-rc1/tika-parent - scm:svn:https://svn.apache.org/repos/asf/maven/pom/tags/1.11-rc1/tika-parent - http://svn.apache.org/viewvc/maven/pom/tags/1.11-rc1/tika-parent - diff --git a/tika-parsers/pom.xml b/tika-parsers/pom.xml index 9557a3d..1092980 100644 --- a/tika-parsers/pom.xml +++ b/tika-parsers/pom.xml @@ -25,7 +25,7 @@ org.apache.tika tika-parent - 1.11 + 1.5 ../tika-parent/pom.xml @@ -35,16 +35,10 @@ http://tika.apache.org/ - 3.13 - - 1.9 - - 1.5 + 3.10-beta2 + 1.5 0.7.2 - 0.6 - 1.8.10 - 4.5.5 - 3.0.3 + 0.1 @@ -63,30 +57,12 @@ ${project.version} - - ${project.groupId} - tika-core - ${project.version} - test-jar - test - - org.gagravarr vorbis-java-tika ${vorbis.version} - - com.healthmarketscience.jackcess - jackcess - 2.1.2 - - - com.healthmarketscience.jackcess - jackcess-encrypt - 2.1.1 - @@ -97,9 +73,9 @@ - net.sourceforge.jmatio - jmatio - 1.0 + edu.ucar + netcdf + 4.2-min org.apache.james @@ -114,14 +90,8 @@ org.apache.commons commons-compress - ${commons.compress.version} - - - org.tukaani - xz - ${tukaani.version} - - + 1.5 + commons-codec commons-codec @@ -130,20 +100,20 @@ org.apache.pdfbox pdfbox - ${pdfbox.version} + 1.8.4 org.bouncycastle - bcmail-jdk15on - 1.52 + bcmail-jdk15 + 1.45 org.bouncycastle - bcprov-jdk15on - 1.52 + bcprov-jdk15 + 1.45 org.apache.poi @@ -171,24 +141,29 @@ + org.apache.geronimo.specs + geronimo-stax-api_1.0_spec + 1.0.1 + + org.ccil.cowan.tagsoup tagsoup 1.2.1 org.ow2.asm - asm - 5.0.4 + asm-debug-all + 4.1 com.googlecode.mp4parser isoparser - 1.0.2 - - - com.drewnoakes - metadata-extractor - 2.8.0 + 1.0-RC-1 + + + com.drewnoakes + metadata-extractor + 2.6.2 de.l3s.boilerpipe @@ -198,7 +173,7 @@ rome rome - 1.0 + 0.9 org.gagravarr @@ -211,149 +186,34 @@ 1.0.3 - org.codelibs + com.uwyn jhighlight - 1.0.2 - - - com.pff - 
java-libpst - 0.8.1 - - - com.github.junrar - junrar - 0.7 - - - org.apache.cxf - cxf-rt-rs-client - ${cxf.version} - - - - - - org.xerial - sqlite-jdbc - 3.8.10.1 - provided - - - - org.apache.opennlp - opennlp-tools - 1.5.3 - - - - commons-io - commons-io - ${commons.io.version} - - - - org.apache.commons - commons-exec - 1.3 - - - - com.googlecode.json-simple - json-simple - 1.1.1 + 1.0 - junit - junit + javax.servlet + servlet-api - - - org.json - json - 20140107 - - junit junit - - - org.mockito - mockito-core - 1.7 test + + + org.mockito + mockito-core + 1.7 + test org.slf4j slf4j-log4j12 + 1.5.6 test - - - - - edu.ucar - netcdf4 - ${netcdf-java.version} - - - edu.ucar - grib - ${netcdf-java.version} - - - edu.ucar - cdm - ${netcdf-java.version} - - - org.slf4j - jcl-over-slf4j - - - - - edu.ucar - httpservices - ${netcdf-java.version} - - - - org.apache.commons - commons-csv - 1.0 - - - - org.apache.sis.core - sis-utility - 0.5 - - - org.apache.sis.storage - sis-netcdf - 0.5 - - - org.apache.sis.core - sis-metadata - 0.5 - - - org.opengis - geoapi - 3.0.0 - - - - org.apache.ctakes - ctakes-core - 3.2.2 - provided @@ -370,9 +230,9 @@ org.apache.tika.parser.internal.Activator - org.w3c.dom, - org.apache.tika.*, - *;resolution:=optional + org.w3c.dom, + org.apache.tika.*, + *;resolution:=optional @@ -384,20 +244,8 @@ src/main/java/org/apache/tika/parser/txt/Charset*.java src/test/resources/test-documents/** - src/test/resources/META-INF/services/org.apache.tika.parser.Parser - - - org.apache.maven.plugins - maven-jar-plugin - - - - test-jar - - - @@ -434,20 +282,20 @@ - The Apache Software Foundation - http://www.apache.org + The Apache Software Foundation + http://www.apache.org - http://svn.apache.org/viewvc/tika/tags/1.11-rc1/tika-parsers - scm:svn:http://svn.apache.org/repos/asf/tika/tags/1.11-rc1/tika-parsers - scm:svn:https://svn.apache.org/repos/asf/tika/tags/1.11-rc1/tika-parsers + http://svn.apache.org/viewvc/tika/tags/1.5/tika-parsers + 
scm:svn:http://svn.apache.org/repos/asf/tika/tags/1.5/tika-parsers + scm:svn:https://svn.apache.org/repos/asf/tika/tags/1.5/tika-parsers - JIRA - https://issues.apache.org/jira/browse/TIKA + JIRA + https://issues.apache.org/jira/browse/TIKA - Jenkins - https://builds.apache.org/job/Tika-trunk/ + Jenkins + https://builds.apache.org/job/Tika-trunk/ diff --git a/tika-parsers/src/main/appended-resources/META-INF/LICENSE b/tika-parsers/src/main/appended-resources/META-INF/LICENSE index bd54624..9dfa0ab 100644 --- a/tika-parsers/src/main/appended-resources/META-INF/LICENSE +++ b/tika-parsers/src/main/appended-resources/META-INF/LICENSE @@ -35,60 +35,3 @@ not be used in advertising or otherwise to promote the sale, use or other dealings in this Software without prior written authorization of the copyright holder. - - -JUnRAR (https://github.com/edmund-wagner/junrar/) - - JUnRAR is based on the UnRAR tool, and covered by the same license - It was formerly available from http://java-unrar.svn.sourceforge.net/ - - ****** ***** ****** UnRAR - free utility for RAR archives - ** ** ** ** ** ** ~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~ - ****** ******* ****** License for use and distribution of - ** ** ** ** ** ** ~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~ - ** ** ** ** ** ** FREE portable version - ~~~~~~~~~~~~~~~~~~~~~ - - The source code of UnRAR utility is freeware. This means: - - 1. All copyrights to RAR and the utility UnRAR are exclusively - owned by the author - Alexander Roshal. - - 2. The UnRAR sources may be used in any software to handle RAR - archives without limitations free of charge, but cannot be used - to re-create the RAR compression algorithm, which is proprietary. - Distribution of modified UnRAR sources in separate form or as a - part of other software is permitted, provided that it is clearly - stated in the documentation and source comments that the code may - not be used to develop a RAR (WinRAR) compatible archiver. - - 3. 
The UnRAR utility may be freely distributed. It is allowed - to distribute UnRAR inside of other software packages. - - 4. THE RAR ARCHIVER AND THE UnRAR UTILITY ARE DISTRIBUTED "AS IS". - NO WARRANTY OF ANY KIND IS EXPRESSED OR IMPLIED. YOU USE AT - YOUR OWN RISK. THE AUTHOR WILL NOT BE LIABLE FOR DATA LOSS, - DAMAGES, LOSS OF PROFITS OR ANY OTHER KIND OF LOSS WHILE USING - OR MISUSING THIS SOFTWARE. - - 5. Installing and using the UnRAR utility signifies acceptance of - these terms and conditions of the license. - - 6. If you don't agree with terms of the license you must remove - UnRAR files from your storage devices and cease to use the - utility. - - Thank you for your interest in RAR and UnRAR. Alexander L. Roshal - -Sqlite (included in the "provided" org.xerial's sqlite-jdbc) - Sqlite is in the Public Domain. For details - see: https://www.sqlite.org/copyright.html - -Two photos in test-documents (testWebp_Alpha_Lossy.webp and testWebp_Alpha_Lossless.webp) - are in the public domain. 
These files were retrieved from: - https://github.com/drewnoakes/metadata-extractor-images/tree/master/webp - These photos are also available here: - https://developers.google.com/speed/webp/gallery2#webp_links - Credits for the photo: - "Free Stock Photo in High Resolution - Yellow Rose 3 - Flowers" - Image Author: Jon Sullivan diff --git a/tika-parsers/src/main/java/org/apache/tika/parser/asm/XHTMLClassVisitor.java b/tika-parsers/src/main/java/org/apache/tika/parser/asm/XHTMLClassVisitor.java index c8ea317..e567550 100644 --- a/tika-parsers/src/main/java/org/apache/tika/parser/asm/XHTMLClassVisitor.java +++ b/tika-parsers/src/main/java/org/apache/tika/parser/asm/XHTMLClassVisitor.java @@ -49,7 +49,7 @@ private String packageName; public XHTMLClassVisitor(ContentHandler handler, Metadata metadata) { - super(Opcodes.ASM5); + super(Opcodes.ASM4); this.xhtml = new XHTMLContentHandler(handler, metadata); this.metadata = metadata; } diff --git a/tika-parsers/src/main/java/org/apache/tika/parser/audio/MidiParser.java b/tika-parsers/src/main/java/org/apache/tika/parser/audio/MidiParser.java index 656d1aa..2b1ac82 100644 --- a/tika-parsers/src/main/java/org/apache/tika/parser/audio/MidiParser.java +++ b/tika-parsers/src/main/java/org/apache/tika/parser/audio/MidiParser.java @@ -40,8 +40,6 @@ import org.apache.tika.sax.XHTMLContentHandler; import org.xml.sax.ContentHandler; import org.xml.sax.SAXException; - -import static java.nio.charset.StandardCharsets.ISO_8859_1; public class MidiParser extends AbstractParser { @@ -103,7 +101,7 @@ if (meta.getType() >= 1 && meta.getType() <= 15) { // FIXME: What's the encoding? 
xhtml.characters( - new String(meta.getData(), ISO_8859_1)); + new String(meta.getData(), "ISO-8859-1")); } } } diff --git a/tika-parsers/src/main/java/org/apache/tika/parser/chm/ChmParser.java b/tika-parsers/src/main/java/org/apache/tika/parser/chm/ChmParser.java index 3b4e00f..65dfe96 100644 --- a/tika-parsers/src/main/java/org/apache/tika/parser/chm/ChmParser.java +++ b/tika-parsers/src/main/java/org/apache/tika/parser/chm/ChmParser.java @@ -22,6 +22,7 @@ import java.util.Arrays; import java.util.Collections; import java.util.HashSet; +import java.util.Iterator; import java.util.Set; import org.apache.tika.exception.TikaException; @@ -33,7 +34,6 @@ import org.apache.tika.parser.chm.core.ChmExtractor; import org.apache.tika.parser.html.HtmlParser; import org.apache.tika.sax.BodyContentHandler; -import org.apache.tika.sax.EmbeddedContentHandler; import org.apache.tika.sax.XHTMLContentHandler; import org.xml.sax.ContentHandler; import org.xml.sax.SAXException; @@ -49,12 +49,10 @@ MediaType.application("chm"), MediaType.application("x-chm")))); - @Override public Set getSupportedTypes(ParseContext context) { return SUPPORTED_TYPES; } - @Override public void parse(InputStream stream, ContentHandler handler, Metadata metadata, ParseContext context) throws IOException, SAXException, TikaException { @@ -67,41 +65,40 @@ XHTMLContentHandler xhtml = new XHTMLContentHandler(handler, metadata); xhtml.startDocument(); - for (DirectoryListingEntry entry : chmExtractor.getChmDirList().getDirectoryListingEntryList()) { - final String entryName = entry.getName(); - if (entryName.endsWith(".html") - || entryName.endsWith(".htm") - ) { -// AttributesImpl attrs = new AttributesImpl(); -// attrs.addAttribute("", "name", "name", "String", entryName); -// xhtml.startElement("", "document", "document", attrs); - - byte[] data = chmExtractor.extractChmEntry(entry); - - parsePage(data, xhtml); - -// xhtml.endElement("", "", "document"); + Iterator it = + 
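The MidiParser hunk above swaps the `StandardCharsets.ISO_8859_1` constant for the string literal `"ISO-8859-1"`. The constant form (Java 7+) is preferable: it cannot throw `UnsupportedEncodingException` and a misspelled charset name is caught at compile time rather than at runtime. A minimal sketch of the idea (hypothetical class and method names, not Tika's actual API):

```java
import java.nio.charset.StandardCharsets;

public class CharsetDemo {
    // Decoding text bytes from a MIDI meta event: with the
    // StandardCharsets constant there is no checked exception and
    // no runtime charset-name lookup that could fail on a typo.
    static String decode(byte[] data) {
        return new String(data, StandardCharsets.ISO_8859_1);
    }

    public static void main(String[] args) {
        byte[] raw = {0x54, 0x69, 0x6b, 0x61}; // ASCII for "Tika"
        System.out.println(decode(raw));       // prints "Tika"
    }
}
```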
chmExtractor.getChmDirList().getDirectoryListingEntryList().iterator(); + while (it.hasNext()) { + DirectoryListingEntry entry = it.next(); + if (entry.getName().endsWith(".html") || entry.getName().endsWith(".htm")) { + xhtml.characters(extract(chmExtractor.extractChmEntry(entry))); } } xhtml.endDocument(); } - - private void parsePage(byte[] byteObject, ContentHandler xhtml) throws TikaException {// throws IOException + /** + * Extracts data from byte[] + */ + private String extract(byte[] byteObject) throws TikaException {// throws IOException + StringBuilder wBuf = new StringBuilder(); InputStream stream = null; Metadata metadata = new Metadata(); HtmlParser htmlParser = new HtmlParser(); - ContentHandler handler = new EmbeddedContentHandler(new BodyContentHandler(xhtml));// -1 + BodyContentHandler handler = new BodyContentHandler(-1);// -1 ParseContext parser = new ParseContext(); try { stream = new ByteArrayInputStream(byteObject); htmlParser.parse(stream, handler, metadata, parser); + wBuf.append(handler.toString() + + System.getProperty("line.separator")); } catch (SAXException e) { throw new RuntimeException(e); } catch (IOException e) { // Pushback overflow from tagsoup } + return wBuf.toString(); } - + + } diff --git a/tika-parsers/src/main/java/org/apache/tika/parser/chm/accessor/ChmDirectoryListingSet.java b/tika-parsers/src/main/java/org/apache/tika/parser/chm/accessor/ChmDirectoryListingSet.java index 9d0a2f0..a3e9bd7 100644 --- a/tika-parsers/src/main/java/org/apache/tika/parser/chm/accessor/ChmDirectoryListingSet.java +++ b/tika-parsers/src/main/java/org/apache/tika/parser/chm/accessor/ChmDirectoryListingSet.java @@ -19,12 +19,10 @@ import java.math.BigInteger; import java.util.ArrayList; import java.util.List; + import org.apache.tika.exception.TikaException; import org.apache.tika.parser.chm.core.ChmCommons; import org.apache.tika.parser.chm.core.ChmConstants; -import org.apache.tika.parser.chm.exception.ChmParsingException; - -import static 
java.nio.charset.StandardCharsets.UTF_8; /** * Holds chm listing entries @@ -103,6 +101,15 @@ } /** + * Gets place holder + * + * @return place holder + */ + private int getPlaceHolder() { + return placeHolder; + } + + /** * Sets place holder * * @param placeHolder @@ -111,14 +118,13 @@ this.placeHolder = placeHolder; } - private ChmPmglHeader PMGLheader; /** * Enumerates chm directory listing entries * * @param chmItsHeader - * chm itsf PMGLheader + * chm itsf header * @param chmItspHeader - * chm itsp PMGLheader + * chm itsp header */ private void enumerateChmDirectoryListingList(ChmItsfHeader chmItsHeader, ChmItspHeader chmItspHeader) { @@ -130,19 +136,33 @@ setDataOffset(chmItsHeader.getDataOffset()); /* loops over all pmgls */ + int previous_index = 0; byte[] dir_chunk = null; - for (int i = startPmgl; i>=0; ) { - dir_chunk = new byte[(int) chmItspHeader.getBlock_len()]; - int start = i * (int) chmItspHeader.getBlock_len() + dir_offset; - dir_chunk = ChmCommons - .copyOfRange(getData(), start, - start +(int) chmItspHeader.getBlock_len()); - - PMGLheader = new ChmPmglHeader(); - PMGLheader.parse(dir_chunk, PMGLheader); + for (int i = startPmgl; i <= stopPmgl; i++) { + int data_copied = ((1 + i) * (int) chmItspHeader.getBlock_len()) + + dir_offset; + if (i == 0) { + dir_chunk = new byte[(int) chmItspHeader.getBlock_len()]; + // dir_chunk = Arrays.copyOfRange(getData(), dir_offset, + // (((1+i) * (int)chmItspHeader.getBlock_len()) + + // dir_offset)); + dir_chunk = ChmCommons + .copyOfRange(getData(), dir_offset, + (((1 + i) * (int) chmItspHeader + .getBlock_len()) + dir_offset)); + previous_index = data_copied; + } else { + dir_chunk = new byte[(int) chmItspHeader.getBlock_len()]; + // dir_chunk = Arrays.copyOfRange(getData(), previous_index, + // (((1+i) * (int)chmItspHeader.getBlock_len()) + + // dir_offset)); + dir_chunk = ChmCommons + .copyOfRange(getData(), previous_index, + (((1 + i) * (int) chmItspHeader + .getBlock_len()) + dir_offset)); + previous_index 
= data_copied; + } enumerateOneSegment(dir_chunk); - - i=PMGLheader.getBlockNext(); dir_chunk = null; } } catch (Exception e) { @@ -182,139 +202,112 @@ } } - public static final boolean startsWith(byte[] data, String prefix) { - for (int i=0; i 0 + && dir_chunk[getPlaceHolder() - 1] != 115) {// #{ + do { + if (dir_chunk[getPlaceHolder() - 1] > 0) { + DirectoryListingEntry dle = new DirectoryListingEntry(); + + // two cases: 1. when dir_chunk[getPlaceHolder() - + // 1] == 0x73 + // 2. when dir_chunk[getPlaceHolder() + 1] == 0x2f + doNameCheck(dir_chunk, dle); + + // dle.setName(new + // String(Arrays.copyOfRange(dir_chunk, + // getPlaceHolder(), (getPlaceHolder() + + // dle.getNameLength())))); + dle.setName(new String(ChmCommons.copyOfRange( + dir_chunk, getPlaceHolder(), + (getPlaceHolder() + dle.getNameLength())))); + checkControlData(dle); + checkResetTable(dle); + setPlaceHolder(getPlaceHolder() + + dle.getNameLength()); + + /* Sets entry type */ + if (getPlaceHolder() < dir_chunk.length + && dir_chunk[getPlaceHolder()] == 0) + dle.setEntryType(ChmCommons.EntryType.UNCOMPRESSED); + else + dle.setEntryType(ChmCommons.EntryType.COMPRESSED); + + setPlaceHolder(getPlaceHolder() + 1); + dle.setOffset(getEncint(dir_chunk)); + dle.setLength(getEncint(dir_chunk)); + getDirectoryListingEntryList().add(dle); + } else + setPlaceHolder(getPlaceHolder() + 1); + + } while (hasNext(dir_chunk)); } - else if (startsWith(dir_chunk, ChmConstants.PMGL)) { - header_len = ChmConstants.CHM_PMGL_LEN; - } - else { - throw new ChmParsingException("Bad dir entry block."); - } - - placeHolder = header_len; - //setPlaceHolder(header_len); - while (placeHolder > 0 && placeHolder < dir_chunk.length - PMGLheader.getFreeSpace() - /*&& dir_chunk[placeHolder - 1] != 115*/) - { - //get entry name length - int strlen = 0;// = getEncint(data); - byte temp; - while ((temp=dir_chunk[placeHolder++]) >= 0x80) - { - strlen <<= 7; - strlen += temp & 0x7f; - } - - strlen = (strlen << 7) + temp & 0x7f; - - 
if (strlen>dir_chunk.length) { - throw new ChmParsingException("Bad data of a string length."); - } - - DirectoryListingEntry dle = new DirectoryListingEntry(); - dle.setNameLength(strlen); - dle.setName(new String(ChmCommons.copyOfRange( - dir_chunk, placeHolder, - (placeHolder + dle.getNameLength())), UTF_8)); - - checkControlData(dle); - checkResetTable(dle); - setPlaceHolder(placeHolder - + dle.getNameLength()); - - /* Sets entry type */ - if (placeHolder < dir_chunk.length - && dir_chunk[placeHolder] == 0) - dle.setEntryType(ChmCommons.EntryType.UNCOMPRESSED); - else - dle.setEntryType(ChmCommons.EntryType.COMPRESSED); - - setPlaceHolder(placeHolder + 1); - dle.setOffset(getEncint(dir_chunk)); - dle.setLength(getEncint(dir_chunk)); - getDirectoryListingEntryList().add(dle); - } - -// int indexWorkData = ChmCommons.indexOf(dir_chunk, -// "::".getBytes(UTF_8)); -// int indexUserData = ChmCommons.indexOf(dir_chunk, -// "/".getBytes(UTF_8)); -// -// if (indexUserData>=0 && indexUserData < indexWorkData) -// setPlaceHolder(indexUserData); -// else if (indexWorkData>=0) { -// setPlaceHolder(indexWorkData); -// } -// else { -// setPlaceHolder(indexUserData); -// } -// -// if (placeHolder > 0 && placeHolder < dir_chunk.length - PMGLheader.getFreeSpace() -// && dir_chunk[placeHolder - 1] != 115) {// #{ -// do { -// if (dir_chunk[placeHolder - 1] > 0) { -// DirectoryListingEntry dle = new DirectoryListingEntry(); -// -// // two cases: 1. when dir_chunk[placeHolder - -// // 1] == 0x73 -// // 2. 
when dir_chunk[placeHolder + 1] == 0x2f -// doNameCheck(dir_chunk, dle); -// -// // dle.setName(new -// // String(Arrays.copyOfRange(dir_chunk, -// // placeHolder, (placeHolder + -// // dle.getNameLength())))); -// dle.setName(new String(ChmCommons.copyOfRange( -// dir_chunk, placeHolder, -// (placeHolder + dle.getNameLength())), UTF_8)); -// checkControlData(dle); -// checkResetTable(dle); -// setPlaceHolder(placeHolder -// + dle.getNameLength()); -// -// /* Sets entry type */ -// if (placeHolder < dir_chunk.length -// && dir_chunk[placeHolder] == 0) -// dle.setEntryType(ChmCommons.EntryType.UNCOMPRESSED); -// else -// dle.setEntryType(ChmCommons.EntryType.COMPRESSED); -// -// setPlaceHolder(placeHolder + 1); -// dle.setOffset(getEncint(dir_chunk)); -// dle.setLength(getEncint(dir_chunk)); -// getDirectoryListingEntryList().add(dle); -// } else -// setPlaceHolder(placeHolder + 1); -// -// } while (nextEntry(dir_chunk)); -// } - } - -// } catch (Exception e) { -// e.printStackTrace(); -// } - } - + } + + } catch (Exception e) { + e.printStackTrace(); + } + } + + /** + * Checks if a name and name length are correct. If not then handles it as + * follows: 1. when dir_chunk[getPlaceHolder() - 1] == 0x73 ('/') 2. 
when + * dir_chunk[getPlaceHolder() + 1] == 0x2f ('s') + * + * @param dir_chunk + * @param dle + */ + private void doNameCheck(byte[] dir_chunk, DirectoryListingEntry dle) { + if (dir_chunk[getPlaceHolder() - 1] == 0x73) { + dle.setNameLength(dir_chunk[getPlaceHolder() - 1] & 0x21); + } else if (dir_chunk[getPlaceHolder() + 1] == 0x2f) { + dle.setNameLength(dir_chunk[getPlaceHolder()]); + setPlaceHolder(getPlaceHolder() + 1); + } else { + dle.setNameLength(dir_chunk[getPlaceHolder() - 1]); + } + } + + /** + * Checks if it's possible move further on byte[] + * + * @param dir_chunk + * + * @return boolean + */ + private boolean hasNext(byte[] dir_chunk) { + while (getPlaceHolder() < dir_chunk.length) { + if (dir_chunk[getPlaceHolder()] == 47 + && dir_chunk[getPlaceHolder() + 1] != ':') { + setPlaceHolder(getPlaceHolder()); + return true; + } else if (dir_chunk[getPlaceHolder()] == ':' + && dir_chunk[getPlaceHolder() + 1] == ':') { + setPlaceHolder(getPlaceHolder()); + return true; + } else + setPlaceHolder(getPlaceHolder() + 1); + } + return false; + } /** * Returns encrypted integer @@ -328,17 +321,23 @@ BigInteger bi = BigInteger.ZERO; byte[] nb = new byte[1]; - if (placeHolder < data_chunk.length) { - while ((ob = data_chunk[placeHolder]) < 0) { + if (getPlaceHolder() < data_chunk.length) { + while ((ob = data_chunk[getPlaceHolder()]) < 0) { nb[0] = (byte) ((ob & 0x7f)); bi = bi.shiftLeft(7).add(new BigInteger(nb)); - setPlaceHolder(placeHolder + 1); + setPlaceHolder(getPlaceHolder() + 1); } nb[0] = (byte) ((ob & 0x7f)); bi = bi.shiftLeft(7).add(new BigInteger(nb)); - setPlaceHolder(placeHolder + 1); + setPlaceHolder(getPlaceHolder() + 1); } return bi.intValue(); + } + + /** + * @param args + */ + public static void main(String[] args) { } /** diff --git a/tika-parsers/src/main/java/org/apache/tika/parser/chm/accessor/ChmItsfHeader.java b/tika-parsers/src/main/java/org/apache/tika/parser/chm/accessor/ChmItsfHeader.java index 2c4dc4e..fbe3a43 100644 --- 
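The `getEncint` hunk above and the `strlen` loop in the newer code both decode a CHM ENCINT: a big-endian base-128 integer in which every byte except the last has its high bit set. Note that in Java a signed byte with the high bit set is negative, so the continuation test must be `< 0` or an explicit `& 0x80` mask (a bare `>= 0x80` comparison on a byte can never be true). A self-contained sketch under that reading of the format, with hypothetical names:

```java
public class EncintDemo {
    // Decodes a CHM ENCINT starting at offset: accumulate 7 bits per
    // byte while the continuation (high) bit is set, then fold in the
    // final byte whose high bit is clear.
    static int decodeEncint(byte[] data, int offset) {
        int value = 0;
        int b;
        while (((b = data[offset++]) & 0x80) != 0) {
            value = (value << 7) | (b & 0x7f);
        }
        return (value << 7) | (b & 0x7f);
    }

    public static void main(String[] args) {
        // 0x81 0x23 -> (1 << 7) | 0x23 = 163
        System.out.println(decodeEncint(new byte[]{(byte) 0x81, 0x23}, 0));
    }
}
```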
a/tika-parsers/src/main/java/org/apache/tika/parser/chm/accessor/ChmItsfHeader.java +++ b/tika-parsers/src/main/java/org/apache/tika/parser/chm/accessor/ChmItsfHeader.java @@ -22,8 +22,6 @@ import org.apache.tika.parser.chm.assertion.ChmAssert; import org.apache.tika.parser.chm.core.ChmConstants; import org.apache.tika.parser.chm.exception.ChmParsingException; - -import static java.nio.charset.StandardCharsets.UTF_8; /** * The Header 0000: char[4] 'ITSF' 0004: DWORD 3 (Version number) 0008: DWORD @@ -44,7 +42,7 @@ /* structure of ITSF headers */ public class ChmItsfHeader implements ChmAccessor { private static final long serialVersionUID = 2215291838533213826L; - private byte[] signature; + private byte[] signature = new String("ITSF").getBytes(); /* 0 (ITSF) */ private int version; /* 4 */ private int header_len; /* 8 */ private int unknown_000c; /* c */ @@ -62,16 +60,12 @@ private int dataRemained; private int currentPlace = 0; - public ChmItsfHeader() { - signature = ChmConstants.ITSF.getBytes(UTF_8); /* 0 (ITSF) */ - } - /** * Prints the values of ChmfHeader */ public String toString() { StringBuilder sb = new StringBuilder(); - sb.append(new String(getSignature(), UTF_8) + " "); + sb.append(new String(getSignature()) + " "); sb.append(getVersion() + " "); sb.append(getHeaderLen() + " "); sb.append(getUnknown_000c() + " "); @@ -382,10 +376,10 @@ if (4 > this.getDataRemained()) throw new TikaException("4 > dataLenght"); - dest = (data[this.getCurrentPlace()] & 0xff) - | (data[this.getCurrentPlace() + 1] & 0xff) << 8 - | (data[this.getCurrentPlace() + 2] & 0xff) << 16 - | (data[this.getCurrentPlace() + 3] & 0xff) << 24; + dest = data[this.getCurrentPlace()] + | data[this.getCurrentPlace() + 1] << 8 + | data[this.getCurrentPlace() + 2] << 16 + | data[this.getCurrentPlace() + 3] << 24; this.setCurrentPlace(this.getCurrentPlace() + 4); this.setDataRemained(this.getDataRemained() - 4); @@ -464,7 +458,8 @@ 
chmItsfHeader.setUnknownLen(chmItsfHeader.unmarshalUint64(data, chmItsfHeader.getUnknownLen())); chmItsfHeader.setDirOffset(chmItsfHeader.unmarshalUint64(data, chmItsfHeader.getDirOffset())); chmItsfHeader.setDirLen(chmItsfHeader.unmarshalUint64(data, chmItsfHeader.getDirLen())); - if (!new String(chmItsfHeader.getSignature(), UTF_8).equals(ChmConstants.ITSF)) + + if (!new String(chmItsfHeader.getSignature()).equals(ChmConstants.ITSF)) throw new TikaException("seems not valid file"); if (chmItsfHeader.getVersion() == ChmConstants.CHM_VER_2) { if (chmItsfHeader.getHeaderLen() < ChmConstants.CHM_ITSF_V2_LEN) diff --git a/tika-parsers/src/main/java/org/apache/tika/parser/chm/accessor/ChmItspHeader.java b/tika-parsers/src/main/java/org/apache/tika/parser/chm/accessor/ChmItspHeader.java index 10b00ae..288f524 100644 --- a/tika-parsers/src/main/java/org/apache/tika/parser/chm/accessor/ChmItspHeader.java +++ b/tika-parsers/src/main/java/org/apache/tika/parser/chm/accessor/ChmItspHeader.java @@ -21,8 +21,6 @@ import org.apache.tika.parser.chm.core.ChmCommons; import org.apache.tika.parser.chm.core.ChmConstants; import org.apache.tika.parser.chm.exception.ChmParsingException; - -import static java.nio.charset.StandardCharsets.UTF_8; /** * Directory header The directory starts with a header; its format is as @@ -47,7 +45,11 @@ public class ChmItspHeader implements ChmAccessor { // TODO: refactor all unmarshals private static final long serialVersionUID = 1962394421998181341L; - private byte[] signature; + private byte[] signature = new String(ChmConstants.ITSP).getBytes(); /* + * 0 + * (ITSP + * ) + */ private int version; /* 4 */ private int header_len; /* 8 */ private int unknown_000c; /* c */ @@ -67,17 +69,9 @@ private int dataRemained; private int currentPlace = 0; - public ChmItspHeader() { - signature = ChmConstants.ITSP.getBytes(UTF_8); /* - * 0 - * (ITSP - * ) - */ - } - public String toString() { StringBuilder sb = new StringBuilder(); - sb.append("[ signature:=" + 
new String(getSignature(), UTF_8) + sb.append("[ signature:=" + new String(getSignature()) + System.getProperty("line.separator")); sb.append("version:=\t" + getVersion() + System.getProperty("line.separator")); @@ -139,10 +133,10 @@ ChmAssert.assertByteArrayNotNull(data); if (4 > this.getDataRemained()) throw new TikaException("4 > dataLenght"); - dest = (data[this.getCurrentPlace()] & 0xff) - | (data[this.getCurrentPlace() + 1] & 0xff) << 8 - | (data[this.getCurrentPlace() + 2] & 0xff) << 16 - | (data[this.getCurrentPlace() + 3] & 0xff) << 24; + dest = data[this.getCurrentPlace()] + | data[this.getCurrentPlace() + 1] << 8 + | data[this.getCurrentPlace() + 2] << 16 + | data[this.getCurrentPlace() + 3] << 24; this.setCurrentPlace(this.getCurrentPlace() + 4); this.setDataRemained(this.getDataRemained() - 4); @@ -153,10 +147,10 @@ ChmAssert.assertByteArrayNotNull(data); if (4 > dataLenght) throw new TikaException("4 > dataLenght"); - dest = (data[this.getCurrentPlace()] & 0xff) - | (data[this.getCurrentPlace() + 1] & 0xff) << 8 - | (data[this.getCurrentPlace() + 2] & 0xff) << 16 - | (data[this.getCurrentPlace() + 3] & 0xff) << 24; + dest = data[this.getCurrentPlace()] + | data[this.getCurrentPlace() + 1] << 8 + | data[this.getCurrentPlace() + 2] << 16 + | data[this.getCurrentPlace() + 3] << 24; setDataRemained(this.getDataRemained() - 4); this.setCurrentPlace(this.getCurrentPlace() + 4); @@ -536,8 +530,8 @@ ChmConstants.BYTE_ARRAY_LENGHT)); /* Checks validity of the itsp header */ - if (!new String(chmItspHeader.getSignature(), UTF_8).equals(ChmConstants.ITSP)) - throw new ChmParsingException("seems not valid signature"); + if (!new String(chmItspHeader.getSignature()).equals(ChmConstants.ITSP)) + throw new ChmParsingException("seems not valid signature"); if (chmItspHeader.getVersion() != ChmConstants.CHM_VER_1) throw new ChmParsingException("!=ChmConstants.CHM_VER_1"); @@ -545,4 +539,10 @@ if (chmItspHeader.getHeader_len() != ChmConstants.CHM_ITSP_V1_LEN) throw new 
ChmParsingException("!= ChmConstants.CHM_ITSP_V1_LEN"); } + + /** + * @param args + */ + public static void main(String[] args) { + } } diff --git a/tika-parsers/src/main/java/org/apache/tika/parser/chm/accessor/ChmLzxcControlData.java b/tika-parsers/src/main/java/org/apache/tika/parser/chm/accessor/ChmLzxcControlData.java index 17a2e2f..bd4b53b 100644 --- a/tika-parsers/src/main/java/org/apache/tika/parser/chm/accessor/ChmLzxcControlData.java +++ b/tika-parsers/src/main/java/org/apache/tika/parser/chm/accessor/ChmLzxcControlData.java @@ -20,8 +20,6 @@ import org.apache.tika.parser.chm.assertion.ChmAssert; import org.apache.tika.parser.chm.core.ChmConstants; import org.apache.tika.parser.chm.exception.ChmParsingException; - -import static java.nio.charset.StandardCharsets.UTF_8; /** * @@ -42,7 +40,11 @@ private static final long serialVersionUID = -7897854774939631565L; /* class' members */ private long size; /* 0 */ - private byte[] signature; + private byte[] signature = new String(ChmConstants.LZXC).getBytes(); /* + * 4 + * (LZXC + * ) + */ private long version; /* 8 */ private long resetInterval; /* c */ private long windowSize; /* 10 */ @@ -52,14 +54,6 @@ /* local usage */ private int dataRemained; private int currentPlace = 0; - - public ChmLzxcControlData() { - signature = ChmConstants.LZXC.getBytes(UTF_8); /* - * 4 - * (LZXC - * ) - */ - } /** * Returns a remained data @@ -254,7 +248,7 @@ StringBuilder sb = new StringBuilder(); sb.append("size(unknown):=" + this.getSize() + ", "); sb.append("signature(Compression type identifier):=" - + new String(this.getSignature(), UTF_8) + ", "); + + new String(this.getSignature()) + ", "); sb.append("version(Possibly numeric code for LZX):=" + this.getVersion() + System.getProperty("line.separator")); sb.append("resetInterval(The Huffman reset interval):=" @@ -305,7 +299,7 @@ "window size / resetInterval should be more than 1"); /* checks a signature */ - if (!new String(chmLzxcControlData.getSignature(), UTF_8) + if 
(!new String(chmLzxcControlData.getSignature()) .equals(ChmConstants.LZXC)) throw new ChmParsingException( "the signature does not seem to be correct"); diff --git a/tika-parsers/src/main/java/org/apache/tika/parser/chm/accessor/ChmLzxcResetTable.java b/tika-parsers/src/main/java/org/apache/tika/parser/chm/accessor/ChmLzxcResetTable.java index 5823f67..fdd90db 100644 --- a/tika-parsers/src/main/java/org/apache/tika/parser/chm/accessor/ChmLzxcResetTable.java +++ b/tika-parsers/src/main/java/org/apache/tika/parser/chm/accessor/ChmLzxcResetTable.java @@ -158,10 +158,10 @@ private long unmarshalUInt32(byte[] data, long dest) throws TikaException { ChmAssert.assertByteArrayNotNull(data); - dest = (data[this.getCurrentPlace()] & 0xff) - | (data[this.getCurrentPlace() + 1] & 0xff) << 8 - | (data[this.getCurrentPlace() + 2] & 0xff) << 16 - | (data[this.getCurrentPlace() + 3] & 0xff) << 24; + dest = data[this.getCurrentPlace()] + | data[this.getCurrentPlace() + 1] << 8 + | data[this.getCurrentPlace() + 2] << 16 + | data[this.getCurrentPlace() + 3] << 24; setDataRemained(this.getDataRemained() - 4); this.setCurrentPlace(this.getCurrentPlace() + 4); @@ -316,6 +316,13 @@ */ public void setBlockLlen(long block_len) { this.block_len = block_len; + } + + /** + * @param args + */ + public static void main(String[] args) { + } // @Override diff --git a/tika-parsers/src/main/java/org/apache/tika/parser/chm/accessor/ChmPmgiHeader.java b/tika-parsers/src/main/java/org/apache/tika/parser/chm/accessor/ChmPmgiHeader.java index 97eaf46..5cf2fb9 100644 --- a/tika-parsers/src/main/java/org/apache/tika/parser/chm/accessor/ChmPmgiHeader.java +++ b/tika-parsers/src/main/java/org/apache/tika/parser/chm/accessor/ChmPmgiHeader.java @@ -24,8 +24,6 @@ import org.apache.tika.parser.chm.core.ChmConstants; import org.apache.tika.parser.chm.exception.ChmParsingException; -import static java.nio.charset.StandardCharsets.UTF_8; - /** * Description Note: not always exists An index chunk has the following 
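The `unmarshalUInt32`/`unmarshalInt32` hunks repeated throughout this patch all toggle the same detail: whether each byte is masked with `& 0xff` before being OR-ed into the word. Without the mask, Java sign-extends any byte >= 0x80 to a negative int, and the extension bits corrupt the assembled value. A standalone demonstration of both variants (illustrative names, not Tika's classes):

```java
public class UnmarshalDemo {
    // Little-endian 32-bit read, as in the masked variant of
    // unmarshalUInt32: each byte is confined to its 8 bits first.
    static long readUInt32LE(byte[] d, int p) {
        return (d[p] & 0xffL)
             | (d[p + 1] & 0xffL) << 8
             | (d[p + 2] & 0xffL) << 16
             | (d[p + 3] & 0xffL) << 24;
    }

    // The unmasked variant: a byte >= 0x80 sign-extends to a negative
    // int, so its extension bits OR 1s across the whole result.
    static long readUInt32Buggy(byte[] d, int p) {
        return d[p] | d[p + 1] << 8 | d[p + 2] << 16 | d[p + 3] << 24;
    }

    public static void main(String[] args) {
        byte[] d = {(byte) 0xff, 0x00, 0x00, 0x00};
        System.out.println(readUInt32LE(d, 0));    // prints 255
        System.out.println(readUInt32Buggy(d, 0)); // prints -1
    }
}
```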
format: * 0000: char[4] 'PMGI' 0004: DWORD Length of quickref/free area at end of @@ -41,22 +39,20 @@ *

    * Note: This class is not in use * - * {@link http://translated.by/you/microsoft-s-html-help-chm-format-incomplete/original/?show-translation-form=1 } + * {@link http + * ://translated.by/you/microsoft-s-html-help-chm-format-incomplete/original + * /?show-translation-form=1 } * * */ public class ChmPmgiHeader implements ChmAccessor { private static final long serialVersionUID = -2092282339894303701L; - private byte[] signature; + private byte[] signature = new String(ChmConstants.CHM_PMGI_MARKER).getBytes(); /* 0 (PMGI) */ private long free_space; /* 4 */ /* local usage */ private int dataRemained; private int currentPlace = 0; - - public ChmPmgiHeader() { - signature = ChmConstants.CHM_PMGI_MARKER.getBytes(UTF_8); /* 0 (PMGI) */ - } private int getDataRemained() { return dataRemained; @@ -81,9 +77,8 @@ ChmAssert.assertChmAccessorNotNull(chmPmgiHeader); ChmAssert.assertPositiveInt(count); this.setDataRemained(data.length); - index = ChmCommons.indexOf(data, - ChmConstants.CHM_PMGI_MARKER.getBytes(UTF_8)); - + index = ChmCommons.indexOf(data, + ChmConstants.CHM_PMGI_MARKER.getBytes()); if (index >= 0) System.arraycopy(data, index, chmPmgiHeader.getSignature(), 0, count); else{ @@ -99,10 +94,10 @@ if (4 > getDataRemained()) throw new ChmParsingException("4 > dataLenght"); - dest = (data[this.getCurrentPlace()] & 0xff) - | (data[this.getCurrentPlace() + 1] & 0xff) << 8 - | (data[this.getCurrentPlace() + 2] & 0xff) << 16 - | (data[this.getCurrentPlace() + 3] & 0xff) << 24; + dest = data[this.getCurrentPlace()] + | data[this.getCurrentPlace() + 1] << 8 + | data[this.getCurrentPlace() + 2] << 16 + | data[this.getCurrentPlace() + 3] << 24; setDataRemained(this.getDataRemained() - 4); this.setCurrentPlace(this.getCurrentPlace() + 4); @@ -150,7 +145,7 @@ */ public String toString() { StringBuilder sb = new StringBuilder(); - sb.append("signature:=" + new String(getSignature(), UTF_8) + ", "); + sb.append("signature:=" + new String(getSignature()) + ", "); 
sb.append("free space:=" + getFreeSpace() + System.getProperty("line.separator")); return sb.toString(); @@ -168,9 +163,16 @@ /* check structure */ if (!Arrays.equals(chmPmgiHeader.getSignature(), - ChmConstants.CHM_PMGI_MARKER.getBytes(UTF_8))) + ChmConstants.CHM_PMGI_MARKER.getBytes())) throw new TikaException( "it does not seem to be valid a PMGI signature, check ChmItsp index_root if it was -1, means no PMGI, use PMGL insted"); } + + /** + * @param args + */ + public static void main(String[] args) { + + } } diff --git a/tika-parsers/src/main/java/org/apache/tika/parser/chm/accessor/ChmPmglHeader.java b/tika-parsers/src/main/java/org/apache/tika/parser/chm/accessor/ChmPmglHeader.java index abb7175..29af04f 100644 --- a/tika-parsers/src/main/java/org/apache/tika/parser/chm/accessor/ChmPmglHeader.java +++ b/tika-parsers/src/main/java/org/apache/tika/parser/chm/accessor/ChmPmglHeader.java @@ -20,8 +20,6 @@ import org.apache.tika.parser.chm.assertion.ChmAssert; import org.apache.tika.parser.chm.core.ChmConstants; import org.apache.tika.parser.chm.exception.ChmParsingException; - -import static java.nio.charset.StandardCharsets.UTF_8; /** * Description There are two types of directory chunks -- index chunks, and @@ -57,7 +55,11 @@ */ public class ChmPmglHeader implements ChmAccessor { private static final long serialVersionUID = -6139486487475923593L; - private byte[] signature; + private byte[] signature = new String(ChmConstants.PMGL).getBytes(); /* + * 0 + * (PMGL + * ) + */ private long free_space; /* 4 */ private long unknown_0008; /* 8 */ private int block_prev; /* c */ @@ -67,14 +69,6 @@ private int dataRemained; private int currentPlace = 0; - public ChmPmglHeader() { - signature = ChmConstants.PMGL.getBytes(UTF_8); /* - * 0 - * (PMGL - * ) - */ - } - private int getDataRemained() { return dataRemained; } @@ -95,16 +89,13 @@ return free_space; } - public void setFreeSpace(long free_space) throws TikaException { - if (free_space < 0) { - throw new 
TikaException("Bad PMGLheader.FreeSpace="+free_space); - } + public void setFreeSpace(long free_space) { this.free_space = free_space; } public String toString() { StringBuilder sb = new StringBuilder(); - sb.append("signatute:=" + new String(getSignature(), UTF_8) + ", "); + sb.append("signatute:=" + new String(getSignature()) + ", "); sb.append("free space:=" + getFreeSpace() + ", "); sb.append("unknown0008:=" + getUnknown0008() + ", "); sb.append("prev block:=" + getBlockPrev() + ", "); @@ -122,30 +113,28 @@ this.setDataRemained(this.getDataRemained() - count); } - private int unmarshalInt32(byte[] data) throws TikaException { + private int unmarshalInt32(byte[] data, int dest) throws TikaException { ChmAssert.assertByteArrayNotNull(data); - int dest; if (4 > this.getDataRemained()) throw new TikaException("4 > dataLenght"); - dest = (data[this.getCurrentPlace()] & 0xff) - | (data[this.getCurrentPlace() + 1] & 0xff) << 8 - | (data[this.getCurrentPlace() + 2] & 0xff) << 16 - | (data[this.getCurrentPlace() + 3] & 0xff) << 24; + dest = data[this.getCurrentPlace()] + | data[this.getCurrentPlace() + 1] << 8 + | data[this.getCurrentPlace() + 2] << 16 + | data[this.getCurrentPlace() + 3] << 24; this.setCurrentPlace(this.getCurrentPlace() + 4); this.setDataRemained(this.getDataRemained() - 4); return dest; } - private long unmarshalUInt32(byte[] data) throws ChmParsingException { + private long unmarshalUInt32(byte[] data, long dest) throws ChmParsingException { ChmAssert.assertByteArrayNotNull(data); - long dest; if (4 > getDataRemained()) throw new ChmParsingException("4 > dataLenght"); - dest = (data[this.getCurrentPlace()] & 0xff) - | (data[this.getCurrentPlace() + 1] & 0xff) << 8 - | (data[this.getCurrentPlace() + 2] & 0xff) << 16 - | (data[this.getCurrentPlace() + 3] & 0xff) << 24; + dest = data[this.getCurrentPlace()] + | data[this.getCurrentPlace() + 1] << 8 + | data[this.getCurrentPlace() + 2] << 16 + | data[this.getCurrentPlace() + 3] << 24; 
setDataRemained(this.getDataRemained() - 4); this.setCurrentPlace(this.getCurrentPlace() + 4); @@ -161,15 +150,20 @@ /* unmarshal fields */ chmPmglHeader.unmarshalCharArray(data, chmPmglHeader, ChmConstants.CHM_SIGNATURE_LEN); - chmPmglHeader.setFreeSpace(chmPmglHeader.unmarshalUInt32(data)); - chmPmglHeader.setUnknown0008(chmPmglHeader.unmarshalUInt32(data)); - chmPmglHeader.setBlockPrev(chmPmglHeader.unmarshalInt32(data)); - chmPmglHeader.setBlockNext(chmPmglHeader.unmarshalInt32(data)); + chmPmglHeader.setFreeSpace(chmPmglHeader.unmarshalUInt32(data, + chmPmglHeader.getFreeSpace())); + chmPmglHeader.setUnknown0008(chmPmglHeader.unmarshalUInt32(data, + chmPmglHeader.getUnknown0008())); + chmPmglHeader.setBlockPrev(chmPmglHeader.unmarshalInt32(data, + chmPmglHeader.getBlockPrev())); + chmPmglHeader.setBlockNext(chmPmglHeader.unmarshalInt32(data, + chmPmglHeader.getBlockNext())); /* check structure */ - if (!new String(chmPmglHeader.getSignature(), UTF_8).equals(ChmConstants.PMGL)) + if (!new String(chmPmglHeader.getSignature()).equals(ChmConstants.PMGL)) throw new ChmParsingException(ChmPmglHeader.class.getName() + " pmgl != pmgl.signature"); + } public byte[] getSignature() { @@ -203,4 +197,11 @@ protected void setBlockNext(int block_next) { this.block_next = block_next; } + + /** + * @param args + */ + public static void main(String[] args) { + + } } diff --git a/tika-parsers/src/main/java/org/apache/tika/parser/chm/accessor/DirectoryListingEntry.java b/tika-parsers/src/main/java/org/apache/tika/parser/chm/accessor/DirectoryListingEntry.java index c413e07..ebba7c1 100644 --- a/tika-parsers/src/main/java/org/apache/tika/parser/chm/accessor/DirectoryListingEntry.java +++ b/tika-parsers/src/main/java/org/apache/tika/parser/chm/accessor/DirectoryListingEntry.java @@ -81,7 +81,7 @@ sb.append("length:=" + getLength()); return sb.toString(); } - + /** * Returns an entry name length * @@ -148,4 +148,7 @@ protected void setLength(int length) { this.length = length; } + + 
public static void main(String[] args) { + } } diff --git a/tika-parsers/src/main/java/org/apache/tika/parser/chm/core/ChmCommons.java b/tika-parsers/src/main/java/org/apache/tika/parser/chm/core/ChmCommons.java index cded7f2..09ba0eb 100644 --- a/tika-parsers/src/main/java/org/apache/tika/parser/chm/core/ChmCommons.java +++ b/tika-parsers/src/main/java/org/apache/tika/parser/chm/core/ChmCommons.java @@ -19,6 +19,7 @@ import java.io.FileNotFoundException; import java.io.FileOutputStream; import java.io.IOException; +import java.util.Iterator; import java.util.List; import org.apache.tika.exception.TikaException; @@ -210,9 +211,10 @@ && !ChmCommons.isEmpty(fileToBeSaved)) { try { output = new FileOutputStream(fileToBeSaved); - for (byte[] bufferEntry : buffer) { - output.write(bufferEntry); - } + if (output != null) + for (int i = 0; i < buffer.length; i++) { + output.write(buffer[i]); + } } catch (FileNotFoundException e) { throw new TikaException(e.getMessage()); } catch (IOException e) { @@ -322,9 +324,12 @@ */ public static int indexOf(List list, String pattern) { int place = 0; - for (DirectoryListingEntry directoryListingEntry : list) { - if (directoryListingEntry.toString().contains(pattern)) return place; - ++place; + for (Iterator iterator = list.iterator(); iterator.hasNext();) { + DirectoryListingEntry directoryListingEntry = iterator.next(); + if (directoryListingEntry.toString().contains(pattern)) { + return place; + } else + ++place; } return -1;// not found } @@ -358,4 +363,10 @@ return str == null || str.length() == 0; } + /** + * @param args + */ + public static void main(String[] args) { + } + } diff --git a/tika-parsers/src/main/java/org/apache/tika/parser/chm/core/ChmConstants.java b/tika-parsers/src/main/java/org/apache/tika/parser/chm/core/ChmConstants.java index e423871..1fa31c9 100644 --- a/tika-parsers/src/main/java/org/apache/tika/parser/chm/core/ChmConstants.java +++ 
b/tika-parsers/src/main/java/org/apache/tika/parser/chm/core/ChmConstants.java @@ -16,14 +16,12 @@ */ package org.apache.tika.parser.chm.core; -import static java.nio.charset.StandardCharsets.UTF_8; - public class ChmConstants { /* Prevents instantiation */ private ChmConstants() { } - public static final String DEFAULT_CHARSET = UTF_8.name(); + public static final String DEFAULT_CHARSET = "UTF-8"; public static final String ITSF = "ITSF"; public static final String ITSP = "ITSP"; public static final String PMGL = "PMGL"; diff --git a/tika-parsers/src/main/java/org/apache/tika/parser/chm/core/ChmExtractor.java b/tika-parsers/src/main/java/org/apache/tika/parser/chm/core/ChmExtractor.java index 454c1c4..cad4495 100644 --- a/tika-parsers/src/main/java/org/apache/tika/parser/chm/core/ChmExtractor.java +++ b/tika-parsers/src/main/java/org/apache/tika/parser/chm/core/ChmExtractor.java @@ -20,10 +20,11 @@ import java.io.IOException; import java.io.InputStream; import java.util.ArrayList; +import java.util.Iterator; import java.util.List; -import org.apache.commons.io.IOUtils; import org.apache.tika.exception.TikaException; +import org.apache.tika.io.IOUtils; import org.apache.tika.parser.chm.accessor.ChmDirectoryListingSet; import org.apache.tika.parser.chm.accessor.ChmItsfHeader; import org.apache.tika.parser.chm.accessor.ChmItspHeader; @@ -34,8 +35,6 @@ import org.apache.tika.parser.chm.core.ChmCommons.EntryType; import org.apache.tika.parser.chm.lzx.ChmBlockInfo; import org.apache.tika.parser.chm.lzx.ChmLzxBlock; - -import static java.nio.charset.StandardCharsets.UTF_8; /** * Extracts text from chm file. Enumerates chm entries. 
@@ -97,7 +96,7 @@ } /** - * Returns lzxc hit_cache length + * Returns lzxc block length * * @return lzxBlockLength */ @@ -106,7 +105,7 @@ } /** - * Sets lzxc hit_cache length + * Sets lzxc block length * * @param lzxBlockLength */ @@ -115,7 +114,7 @@ } /** - * Returns lzxc hit_cache offset + * Returns lzxc block offset * * @return lzxBlockOffset */ @@ -124,7 +123,7 @@ } /** - * Sets lzxc hit_cache offset + * Sets lzxc block offset */ private void setLzxBlockOffset(long lzxBlockOffset) { this.lzxBlockOffset = lzxBlockOffset; @@ -175,7 +174,7 @@ int indexOfControlData = getChmDirList().getControlDataIndex(); int indexOfResetData = ChmCommons.indexOfResetTableBlock(getData(), - ChmConstants.LZXC.getBytes(UTF_8)); + ChmConstants.LZXC.getBytes()); byte[] dir_chunk = null; if (indexOfResetData > 0) dir_chunk = ChmCommons.copyOfRange( getData(), indexOfResetData, indexOfResetData @@ -216,7 +215,7 @@ setLzxBlocksCache(new ArrayList()); } catch (IOException e) { - e.printStackTrace(); + // ignore } } @@ -227,8 +226,8 @@ */ public List enumerateChm() { List listOfEntries = new ArrayList(); - for (DirectoryListingEntry directoryListingEntry : getChmDirList().getDirectoryListingEntryList()) { - listOfEntries.add(directoryListingEntry.getName()); + for (Iterator it = getChmDirList().getDirectoryListingEntryList().iterator(); it.hasNext();) { + listOfEntries.add(it.next().getName()); } return listOfEntries; } @@ -258,37 +257,34 @@ dataOffset + directoryListingEntry.getLength())); } else if (directoryListingEntry.getEntryType() == EntryType.COMPRESSED && !ChmCommons.hasSkip(directoryListingEntry)) { - /* Gets a chm hit_cache info */ + /* Gets a chm block info */ ChmBlockInfo bb = ChmBlockInfo.getChmBlockInfoInstance( directoryListingEntry, (int) getChmLzxcResetTable() .getBlockLen(), getChmLzxcControlData()); - int i = 0, start = 0, hit_cache = 0; + int i = 0, start = 0, block = 0; if ((getLzxBlockLength() < Integer.MAX_VALUE) && (getLzxBlockOffset() < Integer.MAX_VALUE)) { // 
TODO: Improve the caching // caching ... = O(n^2) - depends on startBlock and endBlock - start = -1; - if (!getLzxBlocksCache().isEmpty()) { + if (getLzxBlocksCache().size() != 0) { for (i = 0; i < getLzxBlocksCache().size(); i++) { - //lzxBlock = getLzxBlocksCache().get(i); - int bn = getLzxBlocksCache().get(i).getBlockNumber(); - for (int j = bb.getIniBlock(); j <= bb.getStartBlock(); j++) { - if (bn == j) { + lzxBlock = getLzxBlocksCache().get(i); + for (int j = bb.getIniBlock(); j <= bb + .getStartBlock(); j++) { + if (lzxBlock.getBlockNumber() == j) if (j > start) { start = j; - hit_cache = i; + block = i; } - } + if (start == bb.getStartBlock()) + break; } - if (start == bb.getStartBlock()) - break; } } -// if (i == getLzxBlocksCache().size() && i == 0) { - if (start<0) { + if (i == getLzxBlocksCache().size() && i == 0) { start = bb.getIniBlock(); byte[] dataSegment = ChmCommons.getChmBlockSegment( @@ -302,7 +298,7 @@ getLzxBlocksCache().add(lzxBlock); } else { - lzxBlock = getLzxBlocksCache().get(hit_cache); + lzxBlock = getLzxBlocksCache().get(block); } for (i = start; i <= bb.getEndBlock();) { @@ -353,12 +349,8 @@ .getBlockCount()) { getLzxBlocksCache().clear(); } - } //end of if - - if (buffer.size() != directoryListingEntry.getLength()) { - throw new TikaException("CHM file extract error: extracted Length is wrong."); } - } //end of if compressed + } } catch (Exception e) { throw new TikaException(e.getMessage()); } diff --git a/tika-parsers/src/main/java/org/apache/tika/parser/chm/lzx/ChmBlockInfo.java b/tika-parsers/src/main/java/org/apache/tika/parser/chm/lzx/ChmBlockInfo.java index cda829c..7dea007 100644 --- a/tika-parsers/src/main/java/org/apache/tika/parser/chm/lzx/ChmBlockInfo.java +++ b/tika-parsers/src/main/java/org/apache/tika/parser/chm/lzx/ChmBlockInfo.java @@ -72,10 +72,8 @@ % bytesPerBlock); // potential problem with casting long to int chmBlockInfo - .setIniBlock(chmBlockInfo.startBlock - - chmBlockInfo.startBlock % (int) 
clcd.getResetInterval()); -// .setIniBlock((chmBlockInfo.startBlock - chmBlockInfo.startBlock) -// % (int) clcd.getResetInterval()); + .setIniBlock((chmBlockInfo.startBlock - chmBlockInfo.startBlock) + % (int) clcd.getResetInterval()); return chmBlockInfo; } @@ -91,10 +89,8 @@ (dle.getOffset() + dle.getLength()) % bytesPerBlock); // potential problem with casting long to int getChmBlockInfo().setIniBlock( - getChmBlockInfo().startBlock - getChmBlockInfo().startBlock + (getChmBlockInfo().startBlock - getChmBlockInfo().startBlock) % (int) clcd.getResetInterval()); -// (getChmBlockInfo().startBlock - getChmBlockInfo().startBlock) -// % (int) clcd.getResetInterval()); return getChmBlockInfo(); } diff --git a/tika-parsers/src/main/java/org/apache/tika/parser/chm/lzx/ChmLzxBlock.java b/tika-parsers/src/main/java/org/apache/tika/parser/chm/lzx/ChmLzxBlock.java index 9ca3595..d042183 100644 --- a/tika-parsers/src/main/java/org/apache/tika/parser/chm/lzx/ChmLzxBlock.java +++ b/tika-parsers/src/main/java/org/apache/tika/parser/chm/lzx/ChmLzxBlock.java @@ -47,7 +47,7 @@ private int previousBlockType = -1; public ChmLzxBlock(int blockNumber, byte[] dataSegment, long blockLength, - ChmLzxBlock prevBlock) throws TikaException { + ChmLzxBlock prevBlock) { try { if (validateConstructorParams(blockNumber, dataSegment, blockLength)) { setBlockNumber(blockNumber); @@ -55,7 +55,7 @@ if (prevBlock != null && prevBlock.getState().getBlockLength() > prevBlock .getState().getBlockRemaining()) - setChmSection(new ChmSection(dataSegment, prevBlock.getContent())); + setChmSection(new ChmSection(prevBlock.getContent())); else setChmSection(new ChmSection(dataSegment)); @@ -65,12 +65,10 @@ // we need to take care of previous context // ============================================ checkLzxBlock(prevBlock); + setContent((int) blockLength); if (prevBlock == null - || blockLength < (int) getBlockLength()) { + || getContent().length < (int) getBlockLength()) { setContent((int) getBlockLength()); - 
} - else { - setContent((int) blockLength); } if (prevBlock != null && prevBlock.getState() != null) @@ -79,8 +77,8 @@ extractContent(); } else throw new TikaException("Check your chm lzx block parameters"); - } catch (TikaException e) { - throw e; + } catch (Exception e) { + // TODO: handle exception } } @@ -138,41 +136,40 @@ } switch (getState().getBlockType()) { - case ChmCommons.ALIGNED_OFFSET: - createAlignedTreeTable(); - //fall through - case ChmCommons.VERBATIM: - /* Creates mainTreeTable */ - createMainTreeTable(); - createLengthTreeTable(); - if (getState().getMainTreeLengtsTable()[0xe8] != 0) - getState().setIntelState(IntelState.STARTED); - break; - case ChmCommons.UNCOMPRESSED: + case ChmCommons.ALIGNED_OFFSET: + createAlignedTreeTable(); + case ChmCommons.VERBATIM: + /* Creates mainTreeTable */ + createMainTreeTable(); + createLengthTreeTable(); + if (getState().getMainTreeLengtsTable()[0xe8] != 0) getState().setIntelState(IntelState.STARTED); - if (getChmSection().getTotal() > 16) - getChmSection().setSwath( - getChmSection().getSwath() - 1); - getState().setR0( - (new BigInteger(getChmSection() - .reverseByteOrder( - getChmSection().unmarshalBytes( - 4))).longValue())); - getState().setR1( - (new BigInteger(getChmSection() - .reverseByteOrder( - getChmSection().unmarshalBytes( - 4))).longValue())); - getState().setR2( - (new BigInteger(getChmSection() - .reverseByteOrder( - getChmSection().unmarshalBytes( - 4))).longValue())); - break; - default: - break; - } - } //end of if BlockRemaining == 0 + break; + case ChmCommons.UNCOMPRESSED: + getState().setIntelState(IntelState.STARTED); + if (getChmSection().getTotal() > 16) + getChmSection().setSwath( + getChmSection().getSwath() - 1); + getState().setR0( + (new BigInteger(getChmSection() + .reverseByteOrder( + getChmSection().unmarshalBytes( + 4))).longValue())); + getState().setR1( + (new BigInteger(getChmSection() + .reverseByteOrder( + getChmSection().unmarshalBytes( + 4))).longValue())); + 
getState().setR2( + (new BigInteger(getChmSection() + .reverseByteOrder( + getChmSection().unmarshalBytes( + 4))).longValue())); + break; + default: + break; + } + } int tempLen; @@ -191,13 +188,14 @@ switch (getState().getBlockType()) { case ChmCommons.ALIGNED_OFFSET: // if(prevblock.lzxState.length>prevblock.lzxState.remaining) - decompressAlignedBlock(tempLen, getChmSection().getPrevContent() == null ? getChmSection().getData() : getChmSection().getPrevContent());// prevcontext + decompressAlignedBlock(tempLen, getChmSection().getData());// prevcontext break; case ChmCommons.VERBATIM: - decompressVerbatimBlock(tempLen, getChmSection().getPrevContent() == null ? getChmSection().getData() : getChmSection().getPrevContent()); + decompressVerbatimBlock(tempLen, getChmSection().getData()); break; case ChmCommons.UNCOMPRESSED: - decompressUncompressedBlock(tempLen, getChmSection().getPrevContent() == null ? getChmSection().getData() : getChmSection().getPrevContent()); + decompressUncompressedBlock(tempLen, getChmSection() + .getData()); break; } getState().increaseFramesRead(); @@ -256,7 +254,6 @@ } private void createLengthTreeTable() throws TikaException { - //Read Pre Tree Table short[] prelentable = createPreLenTable(); if (prelentable == null) { @@ -273,15 +270,14 @@ throw new ChmParsingException("pretreetable is null"); } - //Build Length Tree createLengthTreeLenTable(0, ChmConstants.LZX_NUM_SECONDARY_LENGTHS, pretreetable, prelentable); getState().setLengthTreeTable( createTreeTable2(getState().getLengthTreeLengtsTable(), - (1 << ChmConstants.LZX_LENGTH_TABLEBITS) + (1 << ChmConstants.LZX_MAINTREE_TABLEBITS) + (ChmConstants.LZX_LENGTH_MAXSYMBOLS << 1), - ChmConstants.LZX_LENGTH_TABLEBITS, + ChmConstants.LZX_MAINTREE_TABLEBITS, ChmConstants.LZX_NUM_SECONDARY_LENGTHS)); } @@ -316,15 +312,13 @@ int matchoffset = 0; for (i = getContentLength(); i < len; i++) { /* new code */ - //read huffman tree from main tree - border = getChmSection().peekBits( - 
ChmConstants.LZX_MAINTREE_TABLEBITS); + border = getChmSection().getDesyncBits( + ChmConstants.LZX_MAINTREE_TABLEBITS, 0); if (border >= getState().mainTreeTable.length) - throw new ChmParsingException("error decompressing aligned block."); - //break; + break; /* end new code */ - s = getState().mainTreeTable[getChmSection().peekBits( - ChmConstants.LZX_MAINTREE_TABLEBITS)]; + s = getState().mainTreeTable[getChmSection().getDesyncBits( + ChmConstants.LZX_MAINTREE_TABLEBITS, 0)]; if (s >= getState().getMainTreeElements()) { x = ChmConstants.LZX_MAINTREE_TABLEBITS; do { @@ -334,9 +328,7 @@ } while ((s = getState().mainTreeTable[s]) >= getState() .getMainTreeElements()); } - //System.out.printf("%d,", s); - //?getChmSection().getSyncBits(getState().mainTreeTable[s]); - getChmSection().getSyncBits(getState().getMainTreeLengtsTable()[s]); + getChmSection().getSyncBits(getState().mainTreeTable[s]); if (s < ChmConstants.LZX_NUM_CHARS) { content[i] = (byte) s; } else { @@ -344,9 +336,10 @@ matchlen = s & ChmConstants.LZX_NUM_PRIMARY_LENGTHS; if (matchlen == ChmConstants.LZX_NUM_PRIMARY_LENGTHS) { matchfooter = getState().lengthTreeTable[getChmSection() - .peekBits(ChmConstants.LZX_LENGTH_TABLEBITS)];//.LZX_MAINTREE_TABLEBITS)]; - if (matchfooter >= ChmConstants.LZX_LENGTH_MAXSYMBOLS/*?LZX_LENGTH_TABLEBITS*/) { - x = ChmConstants.LZX_LENGTH_TABLEBITS; + .getDesyncBits(ChmConstants.LZX_MAINTREE_TABLEBITS, + 0)]; + if (matchfooter >= ChmConstants.LZX_MAINTREE_TABLEBITS) { + x = ChmConstants.LZX_MAINTREE_TABLEBITS; do { x++; matchfooter <<= 1; @@ -364,14 +357,13 @@ matchoffset = (ChmConstants.POSITION_BASE[matchoffset] - 2); if (extra > 3) { extra -= 3; - long verbatim_bits = getChmSection().getSyncBits(extra); - matchoffset += (verbatim_bits << 3); - //READ HUFF SYM in Aligned Tree - int aligned_bits = getChmSection().peekBits( - ChmConstants.LZX_NUM_PRIMARY_LENGTHS); - int t = getState().getAlignedTreeTable()[aligned_bits]; + long l = getChmSection().getSyncBits(extra); + 
matchoffset += (l << 3); + int g = getChmSection().getDesyncBits( + ChmConstants.LZX_NUM_PRIMARY_LENGTHS, 0); + int t = getState().getAlignedTreeTable()[g]; if (t >= getState().getMainTreeElements()) { - x = ChmConstants.LZX_ALIGNED_TABLEBITS; //?LZX_MAINTREE_TABLEBITS; //?LZX_ALIGNED_TABLEBITS + x = ChmConstants.LZX_MAINTREE_TABLEBITS; do { x++; t <<= 1; @@ -380,14 +372,14 @@ .getMainTreeElements()); } getChmSection().getSyncBits( - getState().getAlignedLenTable()[t]); + getState().getAlignedTreeTable()[t]); matchoffset += t; } else if (extra == 3) { - int g = getChmSection().peekBits( - ChmConstants.LZX_NUM_PRIMARY_LENGTHS); + int g = (int) getChmSection().getDesyncBits( + ChmConstants.LZX_NUM_PRIMARY_LENGTHS, 0); int t = getState().getAlignedTreeTable()[g]; if (t >= getState().getMainTreeElements()) { - x = ChmConstants.LZX_ALIGNED_TABLEBITS; //?LZX_MAINTREE_TABLEBITS; + x = ChmConstants.LZX_MAINTREE_TABLEBITS; do { x++; t <<= 1; @@ -396,7 +388,7 @@ .getMainTreeElements()); } getChmSection().getSyncBits( - getState().getAlignedLenTable()[t]); + getState().getAlignedTreeTable()[t]); matchoffset += t; } else if (extra > 0) { long l = getChmSection().getSyncBits(extra); @@ -465,8 +457,8 @@ int matchlen = 0, matchfooter = 0, extra, rundest, runsrc; int matchoffset = 0; for (i = getContentLength(); i < len; i++) { - int f = getChmSection().peekBits( - ChmConstants.LZX_MAINTREE_TABLEBITS); + int f = (int) getChmSection().getDesyncBits( + ChmConstants.LZX_MAINTREE_TABLEBITS, 0); assertShortArrayNotNull(getState().getMainTreeTable()); s = getState().getMainTreeTable()[f]; if (s >= ChmConstants.LZX_MAIN_MAXSYMBOLS) { @@ -484,8 +476,8 @@ s -= ChmConstants.LZX_NUM_CHARS; matchlen = s & ChmConstants.LZX_NUM_PRIMARY_LENGTHS; if (matchlen == ChmConstants.LZX_NUM_PRIMARY_LENGTHS) { - matchfooter = getState().getLengthTreeTable()[getChmSection() - .peekBits(ChmConstants.LZX_LENGTH_TABLEBITS)]; + matchfooter = getState().getLengthTreeTable()[(int) getChmSection() + 
.getDesyncBits(ChmConstants.LZX_LENGTH_TABLEBITS, 0)]; if (matchfooter >= ChmConstants.LZX_NUM_SECONDARY_LENGTHS) { x = ChmConstants.LZX_LENGTH_TABLEBITS; do { @@ -577,9 +569,8 @@ int i = offset; // represents offset int z, y, x;// local counters while (i < tablelen) { - //Read HUFF sym to z - z = pretreetable[getChmSection().peekBits( - ChmConstants.LZX_PRETREE_TABLEBITS)]; + z = pretreetable[(int) getChmSection().getDesyncBits( + ChmConstants.LZX_PRETREE_TABLEBITS, 0)]; if (z >= ChmConstants.LZX_PRETREE_NUM_ELEMENTS) {// 1 bug, should be // 20 x = ChmConstants.LZX_PRETREE_TABLEBITS; @@ -590,7 +581,6 @@ } while ((z = pretreetable[z]) >= ChmConstants.LZX_PRETREE_NUM_ELEMENTS); } getChmSection().getSyncBits(prelentable[z]); - if (z < 17) { z = getState().getLengthTreeLengtsTable()[i] - z; if (z < 0) @@ -598,29 +588,29 @@ getState().getLengthTreeLengtsTable()[i] = (short) z; i++; } else if (z == 17) { - y = getChmSection().getSyncBits(4); + y = (int) getChmSection().getSyncBits(4); y += 4; for (int j = 0; j < y; j++) if (i < getState().getLengthTreeLengtsTable().length) getState().getLengthTreeLengtsTable()[i++] = 0; } else if (z == 18) { - y = getChmSection().getSyncBits(5); + y = (int) getChmSection().getSyncBits(5); y += 20; for (int j = 0; j < y; j++) - //no tolerate //if (i < getState().getLengthTreeLengtsTable().length) + if (i < getState().getLengthTreeLengtsTable().length) getState().getLengthTreeLengtsTable()[i++] = 0; } else if (z == 19) { y = getChmSection().getSyncBits(1); y += 4; - z = pretreetable[getChmSection().peekBits( - ChmConstants.LZX_PRETREE_TABLEBITS)]; + z = pretreetable[(int) getChmSection().getDesyncBits( + ChmConstants.LZX_PRETREE_TABLEBITS, 0)]; if (z >= ChmConstants.LZX_PRETREE_NUM_ELEMENTS) {// 20 x = ChmConstants.LZX_PRETREE_TABLEBITS;// 6 do { x++; z <<= 1; z += getChmSection().checkBit(x); - } while ((z = pretreetable[z]) >= ChmConstants.LZX_PRETREE_NUM_ELEMENTS);//LZX_MAINTREE_TABLEBITS); + } while ((z = pretreetable[z]) >= 
ChmConstants.LZX_MAINTREE_TABLEBITS); } getChmSection().getSyncBits(prelentable[z]); z = getState().getLengthTreeLengtsTable()[i] - z; @@ -633,25 +623,20 @@ } private void createMainTreeTable() throws TikaException { - //Read Pre Tree Table short[] prelentable = createPreLenTable(); short[] pretreetable = createTreeTable2(prelentable, (1 << ChmConstants.LZX_PRETREE_TABLEBITS) + (ChmConstants.LZX_PRETREE_MAXSYMBOLS << 1), ChmConstants.LZX_PRETREE_TABLEBITS, ChmConstants.LZX_PRETREE_MAXSYMBOLS); - createMainTreeLenTable(0, ChmConstants.LZX_NUM_CHARS, pretreetable, prelentable); - - //Read Pre Tree Table prelentable = createPreLenTable(); pretreetable = createTreeTable2(prelentable, (1 << ChmConstants.LZX_PRETREE_TABLEBITS) + (ChmConstants.LZX_PRETREE_MAXSYMBOLS << 1), ChmConstants.LZX_PRETREE_TABLEBITS, ChmConstants.LZX_PRETREE_MAXSYMBOLS); - createMainTreeLenTable(ChmConstants.LZX_NUM_CHARS, getState().mainTreeLengtsTable.length, pretreetable, prelentable); @@ -662,6 +647,7 @@ + (ChmConstants.LZX_MAINTREE_MAXSYMBOLS << 1), ChmConstants.LZX_MAINTREE_TABLEBITS, getState() .getMainTreeElements())); + } private void createMainTreeLenTable(int offset, int tablelen, @@ -671,8 +657,8 @@ int i = offset; int z, y, x; while (i < tablelen) { - int f = getChmSection().peekBits( - ChmConstants.LZX_PRETREE_TABLEBITS); + int f = getChmSection().getDesyncBits( + ChmConstants.LZX_PRETREE_TABLEBITS, 0); z = pretreetable[f]; if (z >= ChmConstants.LZX_PRETREE_MAXSYMBOLS) { x = ChmConstants.LZX_PRETREE_TABLEBITS; @@ -706,8 +692,8 @@ } else if (z == 19) { y = getChmSection().getSyncBits(1); y += 4; - z = pretreetable[getChmSection().peekBits( - ChmConstants.LZX_PRETREE_TABLEBITS)]; + z = pretreetable[getChmSection().getDesyncBits( + ChmConstants.LZX_PRETREE_TABLEBITS, 0)]; if (z >= ChmConstants.LZX_PRETREE_MAXSYMBOLS) { x = ChmConstants.LZX_PRETREE_TABLEBITS; do { @@ -734,7 +720,7 @@ } private short[] createAlignedLenTable() { - int tablelen = 
ChmConstants.LZX_ALIGNED_NUM_ELEMENTS;//LZX_BLOCKTYPE_UNCOMPRESSED;// + int tablelen = ChmConstants.LZX_BLOCKTYPE_UNCOMPRESSED; int bits = ChmConstants.LZX_BLOCKTYPE_UNCOMPRESSED; short[] tmp = new short[tablelen]; for (int i = 0; i < tablelen; i++) { @@ -743,9 +729,9 @@ return tmp; } - private void createAlignedTreeTable() throws ChmParsingException { + private void createAlignedTreeTable() { getState().setAlignedLenTable(createAlignedLenTable()); - getState().setAlignedTreeTable(//setAlignedLenTable( + getState().setAlignedLenTable( createTreeTable2(getState().getAlignedLenTable(), (1 << ChmConstants.LZX_NUM_PRIMARY_LENGTHS) + (ChmConstants.LZX_ALIGNED_MAXSYMBOLS << 1), @@ -754,7 +740,7 @@ } private short[] createTreeTable2(short[] lentable, int tablelen, int bits, - int maxsymbol) throws ChmParsingException { + int maxsymbol) { short[] tmp = new short[tablelen]; short sym; int leaf; @@ -770,12 +756,10 @@ while (bit_num <= bits) { for (sym = 0; sym < maxsymbol; sym++) { if (lentable.length > sym && lentable[sym] == bit_num) { - leaf = pos; - - if ((pos += bit_mask) > table_mask) { - /* table overflow */ - throw new ChmParsingException("Table overflow"); - } + leaf = pos;// pos=0 + + if ((pos += bit_mask) > table_mask) + return null; fill = bit_mask; while (fill-- > 0) @@ -824,10 +808,11 @@ } tmp[leaf] = sym; - if ((pos += bit_mask) > table_mask) { - /* table overflow */ - throw new ChmParsingException("Table overflow"); - } + if ((pos += bit_mask) > table_mask) + return null; + /* table overflow */ + } else { + // return null; } } bit_mask >>= 1; @@ -847,13 +832,18 @@ } public byte[] getContent(int startOffset, int endOffset) { + int length = endOffset - startOffset; + // return (getContent() != null) ? Arrays.copyOfRange(getContent(), + // startOffset, (startOffset + length)) : new byte[1]; return (getContent() != null) ? 
ChmCommons.copyOfRange(getContent(), - startOffset, endOffset) : new byte[1]; + startOffset, (startOffset + length)) : new byte[1]; } public byte[] getContent(int start) { + // return (getContent() != null) ? Arrays.copyOfRange(getContent(), + // start, (getContent().length + start)) : new byte[1]; return (getContent() != null) ? ChmCommons.copyOfRange(getContent(), - start, getContent().length) : new byte[1]; + start, (getContent().length + start)) : new byte[1]; } private void setContent(int contentLength) { @@ -864,8 +854,7 @@ if (chmPrevLzxBlock == null && getBlockLength() < Integer.MAX_VALUE) setState(new ChmLzxState((int) getBlockLength())); else - //use clone to avoid changing a cached or to be cached block - setState(chmPrevLzxBlock.getState().clone()); + setState(chmPrevLzxBlock.getState()); } private boolean validateConstructorParams(int blockNumber, @@ -910,4 +899,12 @@ private void setState(ChmLzxState state) { this.state = state; } + + /** + * @param args + */ + public static void main(String[] args) { + // TODO Auto-generated method stub + + } } diff --git a/tika-parsers/src/main/java/org/apache/tika/parser/chm/lzx/ChmLzxState.java b/tika-parsers/src/main/java/org/apache/tika/parser/chm/lzx/ChmLzxState.java index 51dc5a5..c4f5cfa 100644 --- a/tika-parsers/src/main/java/org/apache/tika/parser/chm/lzx/ChmLzxState.java +++ b/tika-parsers/src/main/java/org/apache/tika/parser/chm/lzx/ChmLzxState.java @@ -17,14 +17,15 @@ package org.apache.tika.parser.chm.lzx; import java.util.concurrent.CancellationException; + import org.apache.tika.exception.TikaException; import org.apache.tika.parser.chm.core.ChmCommons; +import org.apache.tika.parser.chm.core.ChmConstants; import org.apache.tika.parser.chm.core.ChmCommons.IntelState; import org.apache.tika.parser.chm.core.ChmCommons.LzxState; -import org.apache.tika.parser.chm.core.ChmConstants; import org.apache.tika.parser.chm.exception.ChmParsingException; -public class ChmLzxState implements Cloneable { +public 
class ChmLzxState { /* Class' members */ private int window; /* the actual decoding window */ private long window_size; /* window size (32Kb through 2Mb) */ @@ -52,22 +53,6 @@ protected short[] alignedLenTable; protected short[] alignedTreeTable; - @Override - public ChmLzxState clone() { - try { - ChmLzxState clone = (ChmLzxState)super.clone(); - clone.mainTreeLengtsTable = arrayClone(mainTreeLengtsTable); - clone.mainTreeTable = arrayClone(mainTreeTable); - clone.lengthTreeTable = arrayClone(lengthTreeTable); - clone.lengthTreeLengtsTable = arrayClone(lengthTreeLengtsTable); - clone.alignedLenTable = arrayClone(alignedLenTable); - clone.alignedTreeTable = arrayClone(alignedTreeTable); - return clone; - } catch (CloneNotSupportedException ex) { - return null; - } - } - protected short[] getMainTreeTable() { return mainTreeTable; } @@ -162,7 +147,7 @@ position_slots = 50; else position_slots = win << 1; - //TODO: position_slots is not used ? + setR0(1); setR1(1); setR2(1); @@ -305,6 +290,9 @@ return R2; } + public static void main(String[] args) { + } + public void setMainTreeLengtsTable(short[] mainTreeLengtsTable) { this.mainTreeLengtsTable = mainTreeLengtsTable; } @@ -320,8 +308,4 @@ public short[] getLengthTreeLengtsTable() { return lengthTreeLengtsTable; } - - private static short[] arrayClone(short[] a) { - return a==null ? 
null : (short[]) a.clone(); - } } diff --git a/tika-parsers/src/main/java/org/apache/tika/parser/chm/lzx/ChmSection.java b/tika-parsers/src/main/java/org/apache/tika/parser/chm/lzx/ChmSection.java index 77f9b3a..b2967d4 100644 --- a/tika-parsers/src/main/java/org/apache/tika/parser/chm/lzx/ChmSection.java +++ b/tika-parsers/src/main/java/org/apache/tika/parser/chm/lzx/ChmSection.java @@ -23,23 +23,16 @@ import org.apache.tika.parser.chm.core.ChmCommons; public class ChmSection { - final private byte[] data; - final private byte[] prevcontent; + private byte[] data; private int swath;// kiks private int total;// remains private int buffer;// val public ChmSection(byte[] data) throws TikaException { - this(data, null); - } - - public ChmSection(byte[] data, byte[] prevconent) throws TikaException { ChmCommons.assertByteArrayNotNull(data); - this.data = data; - this.prevcontent = prevconent; - //setData(data); - } - + setData(data); + } + /* Utilities */ public byte[] reverseByteOrder(byte[] toBeReversed) throws TikaException { ChmCommons.assertByteArrayNotNull(toBeReversed); @@ -55,11 +48,7 @@ return getDesyncBits(bit, bit); } - public int peekBits(int bit) { - return getDesyncBits(bit, 0); - } - - private int getDesyncBits(int bit, int removeBit) { + public int getDesyncBits(int bit, int removeBit) { while (getTotal() < 16) { setBuffer((getBuffer() << 16) + unmarshalUByte() + (unmarshalUByte() << 8)); @@ -72,7 +61,7 @@ } public int unmarshalUByte() { - return getByte() & 255; + return (int) (getByte() & 255); } public byte getByte() { @@ -91,10 +80,6 @@ return data; } - public byte[] getPrevContent() { - return prevcontent; - } - public BigInteger getBigInteger(int i) { if (getData() == null) return BigInteger.ZERO; @@ -181,9 +166,9 @@ } } -// private void setData(byte[] data) { -// this.data = data; -// } + private void setData(byte[] data) { + this.data = data; + } public int getSwath() { return swath; diff --git 
a/tika-parsers/src/main/java/org/apache/tika/parser/code/SourceCodeParser.java b/tika-parsers/src/main/java/org/apache/tika/parser/code/SourceCodeParser.java index d17bde7..0858e8e 100644 --- a/tika-parsers/src/main/java/org/apache/tika/parser/code/SourceCodeParser.java +++ b/tika-parsers/src/main/java/org/apache/tika/parser/code/SourceCodeParser.java @@ -22,7 +22,6 @@ import java.io.IOException; import java.io.InputStream; -import java.io.StringReader; import java.nio.charset.Charset; import java.util.HashMap; import java.util.Map; @@ -30,26 +29,22 @@ import java.util.regex.Matcher; import java.util.regex.Pattern; -import org.apache.commons.io.input.CloseShieldInputStream; import org.apache.tika.config.ServiceLoader; import org.apache.tika.detect.AutoDetectReader; import org.apache.tika.exception.TikaException; +import org.apache.tika.io.CloseShieldInputStream; import org.apache.tika.metadata.Metadata; import org.apache.tika.metadata.TikaCoreProperties; import org.apache.tika.mime.MediaType; import org.apache.tika.parser.ParseContext; import org.apache.tika.parser.Parser; -import org.ccil.cowan.tagsoup.HTMLSchema; -import org.ccil.cowan.tagsoup.Schema; import org.xml.sax.ContentHandler; -import org.xml.sax.InputSource; import org.xml.sax.SAXException; import com.uwyn.jhighlight.renderer.Renderer; import com.uwyn.jhighlight.renderer.XhtmlRendererFactory; /** - * Generic Source code parser for Java, Groovy, C++. 
- * Aware: This parser uses JHightlight library (https://github.com/codelibs/jhighlight) under CDDL/LGPL dual license + * Generic Source code parser for Java, Groovy, C++ * * @author Hong-Thai.Nguyen * @since 1.6 @@ -70,10 +65,7 @@ }; private static final ServiceLoader LOADER = new ServiceLoader(SourceCodeParser.class.getClassLoader()); - - //Parse the HTML document - private static final Schema HTML_SCHEMA = new HTMLSchema(); - + @Override public Set getSupportedTypes(ParseContext context) { return TYPES_TO_RENDERER.keySet(); @@ -83,9 +75,9 @@ public void parse(InputStream stream, ContentHandler handler, Metadata metadata, ParseContext context) throws IOException, SAXException, TikaException { - try (AutoDetectReader reader = new AutoDetectReader( - new CloseShieldInputStream(stream), metadata, - context.get(ServiceLoader.class, LOADER))) { + AutoDetectReader reader = new AutoDetectReader(new CloseShieldInputStream(stream), metadata, context.get(ServiceLoader.class, LOADER)); + + try { Charset charset = reader.getCharset(); String mediaType = metadata.get(Metadata.CONTENT_TYPE); String name = metadata.get(Metadata.RESOURCE_NAME_KEY); @@ -98,7 +90,7 @@ String line; int nbLines = 0; while ((line = reader.readLine()) != null) { - out.append(line + System.getProperty("line.separator")); + out.append(line); String author = parserAuthor(line); if (author != null) { metadata.add(TikaCoreProperties.CREATOR, author); @@ -106,17 +98,16 @@ nbLines ++; } metadata.set("LoC", String.valueOf(nbLines)); + Renderer renderer = getRenderer(type.toString()); - String codeAsHtml = renderer.highlight(name, out.toString(), charset.name(), false); - - Schema schema = context.get(Schema.class, HTML_SCHEMA); - - org.ccil.cowan.tagsoup.Parser parser = new org.ccil.cowan.tagsoup.Parser(); - parser.setProperty(org.ccil.cowan.tagsoup.Parser.schemaProperty, schema); - parser.setContentHandler(handler); - parser.parse(new InputSource(new StringReader(codeAsHtml))); + char[] charArray = 
codeAsHtml.toCharArray(); + handler.startDocument(); + handler.characters(charArray, 0, charArray.length); + handler.endDocument(); } + } finally { + reader.close(); } } diff --git a/tika-parsers/src/main/java/org/apache/tika/parser/crypto/Pkcs7Parser.java b/tika-parsers/src/main/java/org/apache/tika/parser/crypto/Pkcs7Parser.java index e84023c..d66b95e 100644 --- a/tika-parsers/src/main/java/org/apache/tika/parser/crypto/Pkcs7Parser.java +++ b/tika-parsers/src/main/java/org/apache/tika/parser/crypto/Pkcs7Parser.java @@ -20,8 +20,8 @@ import java.io.InputStream; import java.util.Set; -import org.apache.commons.io.input.CloseShieldInputStream; import org.apache.tika.exception.TikaException; +import org.apache.tika.io.CloseShieldInputStream; import org.apache.tika.metadata.Metadata; import org.apache.tika.mime.MediaType; import org.apache.tika.parser.AbstractParser; @@ -31,9 +31,6 @@ import org.bouncycastle.cms.CMSException; import org.bouncycastle.cms.CMSSignedDataParser; import org.bouncycastle.cms.CMSTypedStream; -import org.bouncycastle.operator.DigestCalculatorProvider; -import org.bouncycastle.operator.OperatorCreationException; -import org.bouncycastle.operator.jcajce.JcaDigestCalculatorProviderBuilder; import org.xml.sax.ContentHandler; import org.xml.sax.SAXException; @@ -60,25 +57,24 @@ Metadata metadata, ParseContext context) throws IOException, SAXException, TikaException { try { - DigestCalculatorProvider digestCalculatorProvider = - new JcaDigestCalculatorProviderBuilder().setProvider("BC").build(); CMSSignedDataParser parser = - new CMSSignedDataParser(digestCalculatorProvider, new CloseShieldInputStream(stream)); + new CMSSignedDataParser(new CloseShieldInputStream(stream)); try { - CMSTypedStream content = parser.getSignedContent(); + CMSTypedStream content = parser.getSignedContent(); if (content == null) { - throw new TikaException("cannot parse detached pkcs7 signature (no signed data to parse)"); + throw new TikaException("cannot parse detached 
pkcs7 signature (no signed data to parse)"); } - try (InputStream input = content.getContentStream()) { + InputStream input = content.getContentStream(); + try { Parser delegate = context.get(Parser.class, EmptyParser.INSTANCE); delegate.parse(input, handler, metadata, context); + } finally { + input.close(); } } finally { parser.close(); } - } catch (OperatorCreationException e) { - throw new TikaException("Unable to create DigestCalculatorProvider", e); } catch (CMSException e) { throw new TikaException("Unable to parse pkcs7 signed data", e); } diff --git a/tika-parsers/src/main/java/org/apache/tika/parser/ctakes/CTAKESAnnotationProperty.java b/tika-parsers/src/main/java/org/apache/tika/parser/ctakes/CTAKESAnnotationProperty.java deleted file mode 100644 index e6d261d..0000000 --- a/tika-parsers/src/main/java/org/apache/tika/parser/ctakes/CTAKESAnnotationProperty.java +++ /dev/null @@ -1,46 +0,0 @@ -/* - * Licensed to the Apache Software Foundation (ASF) under one or more - * contributor license agreements. See the NOTICE file distributed with - * this work for additional information regarding copyright ownership. - * The ASF licenses this file to You under the Apache License, Version 2.0 - * (the "License"); you may not use this file except in compliance with - * the License. You may obtain a copy of the License at - * - * http://www.apache.org/licenses/LICENSE-2.0 - * - * Unless required by applicable law or agreed to in writing, software - * distributed under the License is distributed on an "AS IS" BASIS, - * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. - * See the License for the specific language governing permissions and - * limitations under the License. - */ -package org.apache.tika.parser.ctakes; - -import org.apache.ctakes.typesystem.type.textsem.IdentifiedAnnotation; - -/** - * This enumeration includes the properties that an {@see IdentifiedAnnotation} object can provide. 
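The removed CTAKESAnnotationProperty enum pairs each constant with a serialized property name and exposes it via getName(). As a self-contained illustration of that enum-with-label pattern (the class name and constant subset here are hypothetical, not part of the deleted file):

```java
// Illustrative enum-with-label pattern, mirroring the removed
// CTAKESAnnotationProperty: each constant carries a display name,
// and Enum.valueOf() maps config strings back to constants.
public enum PropertySketch {
    BEGIN("start"),
    END("end"),
    POLARITY("polarity");

    private final String name;

    PropertySketch(String name) {
        this.name = name;
    }

    public String getName() {
        return name;
    }
}
```

This is the same mapping setAnnotationProps(String[]) relies on: it calls valueOf() on each config token, so config files use the constant names (e.g. "BEGIN"), while metadata output uses the labels (e.g. "start").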
- * - */ -public enum CTAKESAnnotationProperty { - BEGIN("start"), - END("end"), - CONDITIONAL("conditional"), - CONFIDENCE("confidence"), - DISCOVERY_TECNIQUE("discoveryTechnique"), - GENERIC("generic"), - HISTORY_OF("historyOf"), - ID("id"), - ONTOLOGY_CONCEPT_ARR("ontologyConceptArr"), - POLARITY("polarity"); - - private String name; - - CTAKESAnnotationProperty(String name) { - this.name = name; - } - - public String getName() { - return name; - } -} \ No newline at end of file diff --git a/tika-parsers/src/main/java/org/apache/tika/parser/ctakes/CTAKESConfig.java b/tika-parsers/src/main/java/org/apache/tika/parser/ctakes/CTAKESConfig.java deleted file mode 100644 index 67ba993..0000000 --- a/tika-parsers/src/main/java/org/apache/tika/parser/ctakes/CTAKESConfig.java +++ /dev/null @@ -1,336 +0,0 @@ -/* - * Licensed to the Apache Software Foundation (ASF) under one or more - * contributor license agreements. See the NOTICE file distributed with - * this work for additional information regarding copyright ownership. - * The ASF licenses this file to You under the Apache License, Version 2.0 - * (the "License"); you may not use this file except in compliance with - * the License. You may obtain a copy of the License at - * - * http://www.apache.org/licenses/LICENSE-2.0 - * - * Unless required by applicable law or agreed to in writing, software - * distributed under the License is distributed on an "AS IS" BASIS, - * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. - * See the License for the specific language governing permissions and - * limitations under the License. - */ -package org.apache.tika.parser.ctakes; - -import java.io.IOException; -import java.io.InputStream; -import java.io.OutputStream; -import java.io.Serializable; -import java.util.Properties; - -import static org.apache.commons.io.output.NullOutputStream.NULL_OUTPUT_STREAM; - -/** - * Configuration for {@see CTAKESContentHandler}. 
- * - * This class allows to enable cTAKES and set its parameters. - */ -public class CTAKESConfig implements Serializable { - /** - * Serial version UID - */ - private static final long serialVersionUID = -1599741171775528923L; - - // Path to XML descriptor for AnalysisEngine - private String aeDescriptorPath = "/ctakes-core/desc/analysis_engine/SentencesAndTokensAggregate.xml"; - - // UMLS username - private String UMLSUser = ""; - - // UMLS password - private String UMLSPass = ""; - - // Enables formatted output - private boolean prettyPrint = true; - - // Type of cTAKES (UIMA) serializer - private CTAKESSerializer serializerType = CTAKESSerializer.XMI; - - // OutputStream object used for CAS serialization - private OutputStream stream = NULL_OUTPUT_STREAM; - - // Enables CAS serialization - private boolean serialize = false; - - // Enables text analysis using cTAKES - private boolean text = true; - - // List of metadata to analyze using cTAKES - private String[] metadata = null; - - // List of annotation properties to add to metadata in addition to text covered by an annotation - private CTAKESAnnotationProperty[] annotationProps = null; - - // Character used to separate the annotation properties into metadata - private char separatorChar = ':'; - - /** - * Default constructor. - */ - public CTAKESConfig() { - init(this.getClass().getResourceAsStream("CTAKESConfig.properties")); - } - - /** - * Loads properties from InputStream and then tries to close InputStream. - * @param stream {@see InputStream} object used to read properties. 
- */ - public CTAKESConfig(InputStream stream) { - init(stream); - } - - private void init(InputStream stream) { - if (stream == null) { - return; - } - Properties props = new Properties(); - - try { - props.load(stream); - } catch (IOException e) { - // TODO warning - } finally { - if (stream != null) { - try { - stream.close(); - } catch (IOException ioe) { - // TODO warning - } - } - } - - setAeDescriptorPath(props.getProperty("aeDescriptorPath", getAeDescriptorPath())); - setUMLSUser(props.getProperty("UMLSUser", getUMLSUser())); - setUMLSPass(props.getProperty("UMLSPass", getUMLSPass())); - setText(Boolean.valueOf(props.getProperty("text", Boolean.toString(isText())))); - setMetadata(props.getProperty("metadata", getMetadataAsString()).split(",")); - setAnnotationProps(props.getProperty("annotationProps", getAnnotationPropsAsString()).split(",")); - setSeparatorChar(props.getProperty("separatorChar", Character.toString(getSeparatorChar())).charAt(0)); - } - - /** - * Returns the path to XML descriptor for AnalysisEngine. - * @return the path to XML descriptor for AnalysisEngine. - */ - public String getAeDescriptorPath() { - return aeDescriptorPath; - } - - /** - * Returns the UMLS username. - * @return the UMLS username. - */ - public String getUMLSUser() { - return UMLSUser; - } - - /** - * Returns the UMLS password. - * @return the UMLS password. - */ - public String getUMLSPass() { - return UMLSPass; - } - - /** - * Returns {@code true} if formatted output is enabled, {@code false} otherwise. - * @return {@code true} if formatted output is enabled, {@code false} otherwise. - */ - public boolean isPrettyPrint() { - return prettyPrint; - } - - /** - * Returns the type of cTAKES (UIMA) serializer used to write the CAS. - * @return the type of cTAKES serializer. - */ - public CTAKESSerializer getSerializerType() { - return serializerType; - } - - /** - * Returns an {@see OutputStream} object used write the CAS. 
- * @return {@see OutputStream} object used write the CAS. - */ - public OutputStream getOutputStream() { - return stream; - } - - /** - * Returns {@code true} if CAS serialization is enabled, {@code false} otherwise. - * @return {@code true} if CAS serialization output is enabled, {@code false} otherwise. - */ - public boolean isSerialize() { - return serialize; - } - - /** - * Returns {@code true} if content text analysis is enabled {@code false} otherwise. - * @return {@code true} if content text analysis is enabled {@code false} otherwise. - */ - public boolean isText() { - return text; - } - - /** - * Returns an array of metadata whose values will be analyzed using cTAKES. - * @return an array of metadata whose values will be analyzed using cTAKES. - */ - public String[] getMetadata() { - return metadata; - } - - /** - * Returns a string containing a comma-separated list of metadata whose values will be analyzed using cTAKES. - * @return a string containing a comma-separated list of metadata whose values will be analyzed using cTAKES. - */ - public String getMetadataAsString() { - if (metadata == null) { - return ""; - } - StringBuilder sb = new StringBuilder(); - for (int i = 0; i < metadata.length; i++) { - sb.append(metadata[i]); - if (i < metadata.length-1) { - sb.append(","); - } - } - return sb.toString(); - } - - /** - * Returns an array of {@see CTAKESAnnotationProperty}'s that will be included into cTAKES metadata. - * @return an array of {@see CTAKESAnnotationProperty}'s that will be included into cTAKES metadata. - */ - public CTAKESAnnotationProperty[] getAnnotationProps() { - return annotationProps; - } - - /** - * Returns a string containing a comma-separated list of {@see CTAKESAnnotationProperty} names that will be included into cTAKES metadata. 
- * @return - */ - public String getAnnotationPropsAsString() { - StringBuilder sb = new StringBuilder(); - sb.append("coveredText"); - if (annotationProps != null) { - for (CTAKESAnnotationProperty property : annotationProps) { - sb.append(separatorChar); - sb.append(property.getName()); - } - } - return sb.toString(); - } - - /** - * Returns the separator character used for annotation properties. - * @return the separator character used for annotation properties. - */ - public char getSeparatorChar() { - return separatorChar; - } - - /** - * Sets the path to XML descriptor for AnalysisEngine. - * @param aeDescriptorPath the path to XML descriptor for AnalysisEngine. - */ - public void setAeDescriptorPath(String aeDescriptorPath) { - this.aeDescriptorPath = aeDescriptorPath; - } - - /** - * Sets the UMLS username. - * @param uMLSUser the UMLS username. - */ - public void setUMLSUser(String uMLSUser) { - this.UMLSUser = uMLSUser; - } - - /** - * Sets the UMLS password. - * @param uMLSPass the UMLS password. - */ - public void setUMLSPass(String uMLSPass) { - this.UMLSPass = uMLSPass; - } - - /** - * Enables the formatted output for serializer. - * @param prettyPrint {@true} to enable formatted output, {@code false} otherwise. - */ - public void setPrettyPrint(boolean prettyPrint) { - this.prettyPrint = prettyPrint; - } - - /** - * Sets the type of cTAKES (UIMA) serializer used to write CAS. - * @param serializerType the type of cTAKES serializer. - */ - public void setSerializerType(CTAKESSerializer serializerType) { - this.serializerType = serializerType; - } - - /** - * Sets the {@see OutputStream} object used to write the CAS. - * @param stream the {@see OutputStream} object used to write the CAS. - */ - public void setOutputStream(OutputStream stream) { - this.stream = stream; - } - - /** - * Enables CAS serialization. - * @param serialize {@true} to enable CAS serialization, {@code false} otherwise. 
- */ - public void setSerialize(boolean serialize) { - this.serialize = serialize; - } - - /** - * Enables content text analysis using cTAKES. - * @param text {@true} to enable content text analysis, {@code false} otherwise. - */ - public void setText(boolean text) { - this.text = text; - } - - /** - * Sets the metadata whose values will be analyzed using cTAKES. - * @param metadata the metadata whose values will be analyzed using cTAKES. - */ - public void setMetadata(String[] metadata) { - this.metadata = metadata; - } - - /** - * Sets the {@see CTAKESAnnotationProperty}'s that will be included into cTAKES metadata. - * @param annotationProps the {@see CTAKESAnnotationProperty}'s that will be included into cTAKES metadata. - */ - public void setAnnotationProps(CTAKESAnnotationProperty[] annotationProps) { - this.annotationProps = annotationProps; - } - - /** - * ets the {@see CTAKESAnnotationProperty}'s that will be included into cTAKES metadata. - * @param annotationProps the {@see CTAKESAnnotationProperty}'s that will be included into cTAKES metadata. - */ - public void setAnnotationProps(String[] annotationProps) { - CTAKESAnnotationProperty[] properties = new CTAKESAnnotationProperty[annotationProps.length]; - for (int i = 0; i < annotationProps.length; i++) { - properties[i] = CTAKESAnnotationProperty.valueOf(annotationProps[i]); - } - setAnnotationProps(properties); - } - - /** - * Sets the separator character used for annotation properties. - * @param separatorChar the separator character used for annotation properties. 
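The removed getMetadataAsString() builds its comma-separated list with a manual StringBuilder loop, and getAnnotationPropsAsString() prefixes "coveredText" before appending separator-delimited property names. A minimal stand-alone sketch of both joining behaviors (the class is hypothetical; method names mirror the originals):

```java
public class ConfigJoinSketch {
    // Mirrors getMetadataAsString(): comma-separated, empty string for null.
    static String metadataAsString(String[] metadata) {
        if (metadata == null) {
            return "";
        }
        return String.join(",", metadata);
    }

    // Mirrors getAnnotationPropsAsString(): always starts with "coveredText",
    // then appends each property name after the configured separator char.
    static String annotationPropsAsString(String[] propNames, char separator) {
        StringBuilder sb = new StringBuilder("coveredText");
        if (propNames != null) {
            for (String name : propNames) {
                sb.append(separator).append(name);
            }
        }
        return sb.toString();
    }
}
```

Since Java 8, String.join produces the same result as the original index-based loop while avoiding the off-by-one separator handling.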
- */ - public void setSeparatorChar(char separatorChar) { - this.separatorChar = separatorChar; - } -} \ No newline at end of file diff --git a/tika-parsers/src/main/java/org/apache/tika/parser/ctakes/CTAKESContentHandler.java b/tika-parsers/src/main/java/org/apache/tika/parser/ctakes/CTAKESContentHandler.java deleted file mode 100644 index 38326e3..0000000 --- a/tika-parsers/src/main/java/org/apache/tika/parser/ctakes/CTAKESContentHandler.java +++ /dev/null @@ -1,176 +0,0 @@ -/* - * Licensed to the Apache Software Foundation (ASF) under one or more - * contributor license agreements. See the NOTICE file distributed with - * this work for additional information regarding copyright ownership. - * The ASF licenses this file to You under the Apache License, Version 2.0 - * (the "License"); you may not use this file except in compliance with - * the License. You may obtain a copy of the License at - * - * http://www.apache.org/licenses/LICENSE-2.0 - * - * Unless required by applicable law or agreed to in writing, software - * distributed under the License is distributed on an "AS IS" BASIS, - * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. - * See the License for the specific language governing permissions and - * limitations under the License. - */ -package org.apache.tika.parser.ctakes; - -import java.util.Collection; -import java.util.Iterator; - -import org.apache.ctakes.typesystem.type.textsem.IdentifiedAnnotation; -import org.apache.tika.metadata.Metadata; -import org.apache.tika.sax.ContentHandlerDecorator; -import org.apache.uima.analysis_engine.AnalysisEngine; -import org.apache.uima.fit.util.JCasUtil; -import org.apache.uima.jcas.JCas; -import org.xml.sax.ContentHandler; -import org.xml.sax.SAXException; -import org.xml.sax.helpers.DefaultHandler; - -/** - * Class used to extract biomedical information while parsing. - * - *

- * <p>
- * This class relies on Apache cTAKES - * that is a natural language processing system for extraction of information - * from electronic medical record clinical free-text. - * </p>
    - */ -public class CTAKESContentHandler extends ContentHandlerDecorator { - // Prefix used for metadata including cTAKES annotations - public static String CTAKES_META_PREFIX = "ctakes:"; - - // Configuration object for CTAKESContentHandler - private CTAKESConfig config = null; - - // StringBuilder object used to build the clinical free-text for cTAKES - private StringBuilder sb = null; - - // Metadata object used for cTAKES annotations - private Metadata metadata = null; - - // UIMA Analysis Engine - private AnalysisEngine ae = null; - - // JCas object for working with the CAS (Common Analysis System) - private JCas jcas = null; - - /** - * Creates a new {@see CTAKESContentHandler} for the given {@see - * ContentHandler} and Metadata objects. - * - * @param handler - * the {@see ContentHandler} object to be decorated. - * @param metadata - * the {@see Metadata} object that will be populated using - * biomedical information extracted by cTAKES. - * @param config - * the {@see CTAKESConfig} object used to configure the handler. - */ - public CTAKESContentHandler(ContentHandler handler, Metadata metadata, - CTAKESConfig config) { - super(handler); - this.metadata = metadata; - this.config = config; - this.sb = new StringBuilder(); - } - - /** - * Creates a new {@see CTAKESContentHandler} for the given {@see - * ContentHandler} and Metadata objects. - * - * @param handler - * the {@see ContentHandler} object to be decorated. - * @param metadata - * the {@see Metadata} object that will be populated using - * biomedical information extracted by cTAKES. - */ - public CTAKESContentHandler(ContentHandler handler, Metadata metadata) { - this(handler, metadata, new CTAKESConfig()); - } - - /** - * Default constructor. 
- */ - public CTAKESContentHandler() { - this(new DefaultHandler(), new Metadata()); - } - - @Override - public void characters(char[] ch, int start, int length) - throws SAXException { - if (config.isText()) { - sb.append(ch, start, length); - } - super.characters(ch, start, length); - } - - @Override - public void endDocument() throws SAXException { - try { - // create an Analysis Engine - if (ae == null) { - ae = CTAKESUtils.getAnalysisEngine(config.getAeDescriptorPath(), config.getUMLSUser(), config.getUMLSPass()); - } - - // create a JCas, given an AE - if (jcas == null) { - jcas = CTAKESUtils.getJCas(ae); - } - - // get metadata to process - StringBuilder metaText = new StringBuilder(); - String[] metadataToProcess = config.getMetadata(); - if (metadataToProcess != null) { - for (String name : config.getMetadata()) { - for (String value : metadata.getValues(name)) { - metaText.append(value); - metaText.append(System.lineSeparator()); - } - } - } - - // analyze text - jcas.setDocumentText(metaText.toString() + sb.toString()); - ae.process(jcas); - - // add annotations to metadata - metadata.add(CTAKES_META_PREFIX + "schema", config.getAnnotationPropsAsString()); - CTAKESAnnotationProperty[] annotationPros = config.getAnnotationProps(); - Collection collection = JCasUtil.select(jcas, IdentifiedAnnotation.class); - Iterator iterator = collection.iterator(); - while (iterator.hasNext()) { - IdentifiedAnnotation annotation = iterator.next(); - StringBuilder annotationBuilder = new StringBuilder(); - annotationBuilder.append(annotation.getCoveredText()); - if (annotationPros != null) { - for (CTAKESAnnotationProperty property : annotationPros) { - annotationBuilder.append(config.getSeparatorChar()); - annotationBuilder.append(CTAKESUtils.getAnnotationProperty(annotation, property)); - } - } - metadata.add(CTAKES_META_PREFIX + annotation.getType().getShortName(), annotationBuilder.toString()); - } - - if (config.isSerialize()) { - // serialize data - 
CTAKESUtils.serialize(jcas, config.getSerializerType(), config.isPrettyPrint(), config.getOutputStream()); - } - } catch (Exception e) { - throw new SAXException(e.getMessage()); - } finally { - CTAKESUtils.resetCAS(jcas); - } - } - - /** - * Returns metadata that includes cTAKES annotations. - * - * @return {@Metadata} object that includes cTAKES annotations. - */ - public Metadata getMetadata() { - return metadata; - } -} diff --git a/tika-parsers/src/main/java/org/apache/tika/parser/ctakes/CTAKESParser.java b/tika-parsers/src/main/java/org/apache/tika/parser/ctakes/CTAKESParser.java deleted file mode 100644 index acd1965..0000000 --- a/tika-parsers/src/main/java/org/apache/tika/parser/ctakes/CTAKESParser.java +++ /dev/null @@ -1,92 +0,0 @@ -/* - * Licensed to the Apache Software Foundation (ASF) under one or more - * contributor license agreements. See the NOTICE file distributed with - * this work for additional information regarding copyright ownership. - * The ASF licenses this file to You under the Apache License, Version 2.0 - * (the "License"); you may not use this file except in compliance with - * the License. You may obtain a copy of the License at - * - * http://www.apache.org/licenses/LICENSE-2.0 - * - * Unless required by applicable law or agreed to in writing, software - * distributed under the License is distributed on an "AS IS" BASIS, - * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. - * See the License for the specific language governing permissions and - * limitations under the License. 
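CTAKESContentHandler's core pattern is to buffer characters() into a StringBuilder and defer all analysis to endDocument(). That buffer-then-analyze shape can be sketched with only the JDK's SAX classes; the "analysis" below is a placeholder stand-in for the cTAKES AnalysisEngine call, and the class name is illustrative:

```java
import org.xml.sax.helpers.DefaultHandler;

// Minimal sketch of the buffer-then-analyze pattern used by the removed
// CTAKESContentHandler: characters() only accumulates text, and the real
// work happens exactly once, in endDocument().
public class BufferingHandlerSketch extends DefaultHandler {
    private final StringBuilder text = new StringBuilder();
    private String result;

    @Override
    public void characters(char[] ch, int start, int length) {
        text.append(ch, start, length); // collect the document's free text
    }

    @Override
    public void endDocument() {
        // Placeholder for running the analysis engine over the full text.
        result = "analyzed:" + text.length() + " chars";
    }

    public String getResult() {
        return result;
    }
}
```

Deferring the work to endDocument() matters because SAX delivers text in arbitrary-sized chunks; an annotator that needs sentence context cannot run per-chunk.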
- */ -package org.apache.tika.parser.ctakes; - -import java.io.IOException; -import java.io.InputStream; - -import org.apache.tika.config.TikaConfig; -import org.apache.tika.exception.TikaException; -import org.apache.tika.metadata.Metadata; -import org.apache.tika.parser.AutoDetectParser; -import org.apache.tika.parser.ParseContext; -import org.apache.tika.parser.Parser; -import org.apache.tika.parser.ParserDecorator; -import org.xml.sax.ContentHandler; -import org.xml.sax.SAXException; - -/** - * CTAKESParser decorates a {@see Parser} and leverages on - * {@see CTAKESContentHandler} to extract biomedical information from - * clinical text using Apache cTAKES. - *

- * <p>It is normally called by supplying an instance to - * {@link AutoDetectParser}, such as: - * AutoDetectParser parser = new AutoDetectParser(new CTAKESParser()); - *

- * <p>It can also be used by giving a Tika Config file similar to: - * - * - * - * - * - * - * - * - *

    Because this is a Parser Decorator, and not a normal Parser in - * it's own right, it isn't normally selected via the Parser Service Loader. - */ -public class CTAKESParser extends ParserDecorator { - /** - * Serial version UID - */ - private static final long serialVersionUID = -2313482748027097961L; - - /** - * Wraps the default Parser - */ - public CTAKESParser() { - this(TikaConfig.getDefaultConfig()); - } - /** - * Wraps the default Parser for this Config - */ - public CTAKESParser(TikaConfig config) { - this(config.getParser()); - } - /** - * Wraps the specified Parser - */ - public CTAKESParser(Parser parser) { - super(parser); - } - - @Override - public void parse(InputStream stream, ContentHandler handler, - Metadata metadata, ParseContext context) throws IOException, - SAXException, TikaException { - CTAKESConfig config = context.get(CTAKESConfig.class, - new CTAKESConfig()); - CTAKESContentHandler ctakesHandler = new CTAKESContentHandler(handler, - metadata, config); - super.parse(stream, ctakesHandler, metadata, context); - } - - //@Override - public String getDecorationName() { - return "CTakes"; - } -} diff --git a/tika-parsers/src/main/java/org/apache/tika/parser/ctakes/CTAKESSerializer.java b/tika-parsers/src/main/java/org/apache/tika/parser/ctakes/CTAKESSerializer.java deleted file mode 100644 index 4d4e4e2..0000000 --- a/tika-parsers/src/main/java/org/apache/tika/parser/ctakes/CTAKESSerializer.java +++ /dev/null @@ -1,42 +0,0 @@ -/* - * Licensed to the Apache Software Foundation (ASF) under one or more - * contributor license agreements. See the NOTICE file distributed with - * this work for additional information regarding copyright ownership. - * The ASF licenses this file to You under the Apache License, Version 2.0 - * (the "License"); you may not use this file except in compliance with - * the License. 
You may obtain a copy of the License at - * - * http://www.apache.org/licenses/LICENSE-2.0 - * - * Unless required by applicable law or agreed to in writing, software - * distributed under the License is distributed on an "AS IS" BASIS, - * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. - * See the License for the specific language governing permissions and - * limitations under the License. - */ -package org.apache.tika.parser.ctakes; - -import org.apache.uima.cas.impl.XCASSerializer; -import org.apache.uima.cas.impl.XmiCasSerializer; -import org.apache.uima.util.XmlCasSerializer; - -/** - * Enumeration for types of cTAKES (UIMA) CAS serializer supported by cTAKES. - * - * A CAS serializer writes a CAS in the given format. - */ -public enum CTAKESSerializer { - XCAS(XCASSerializer.class.getName()), - XMI(XmiCasSerializer.class.getName()), - XML(XmlCasSerializer.class.getName()); - - private final String className; - - private CTAKESSerializer(String className) { - this.className = className; - } - - public String getClassName() { - return className; - } -} diff --git a/tika-parsers/src/main/java/org/apache/tika/parser/ctakes/CTAKESUtils.java b/tika-parsers/src/main/java/org/apache/tika/parser/ctakes/CTAKESUtils.java deleted file mode 100644 index 23f281a..0000000 --- a/tika-parsers/src/main/java/org/apache/tika/parser/ctakes/CTAKESUtils.java +++ /dev/null @@ -1,265 +0,0 @@ -/* - * Licensed to the Apache Software Foundation (ASF) under one or more - * contributor license agreements. See the NOTICE file distributed with - * this work for additional information regarding copyright ownership. - * The ASF licenses this file to You under the Apache License, Version 2.0 - * (the "License"); you may not use this file except in compliance with - * the License. 
You may obtain a copy of the License at - * - * http://www.apache.org/licenses/LICENSE-2.0 - * - * Unless required by applicable law or agreed to in writing, software - * distributed under the License is distributed on an "AS IS" BASIS, - * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. - * See the License for the specific language governing permissions and - * limitations under the License. - */ -package org.apache.tika.parser.ctakes; - -import java.io.IOException; -import java.io.OutputStream; -import java.net.URISyntaxException; - -import org.apache.ctakes.typesystem.type.refsem.UmlsConcept; -import org.apache.ctakes.typesystem.type.textsem.IdentifiedAnnotation; -import org.apache.uima.UIMAFramework; -import org.apache.uima.analysis_engine.AnalysisEngine; -import org.apache.uima.cas.impl.XCASSerializer; -import org.apache.uima.cas.impl.XmiCasSerializer; -import org.apache.uima.cas.impl.XmiSerializationSharedData; -import org.apache.uima.jcas.JCas; -import org.apache.uima.jcas.cas.FSArray; -import org.apache.uima.resource.ResourceInitializationException; -import org.apache.uima.resource.ResourceSpecifier; -import org.apache.uima.util.InvalidXMLException; -import org.apache.uima.util.XMLInputSource; -import org.apache.uima.util.XmlCasSerializer; -import org.xml.sax.SAXException; - -/** - * This class provides methods to extract biomedical information from plain text - * using {@see CTAKESContentHandler} that relies on Apache cTAKES. - * - *

- * <p>
- * Apache cTAKES is built on top of Apache - * UIMA framework and OpenNLP - * toolkit. - * </p>
    - */ -public class CTAKESUtils { - // UMLS username property - private final static String CTAKES_UMLS_USER = "ctakes.umlsuser"; - - // UMLS password property - private final static String CTAKES_UMLS_PASS = "ctakes.umlspw"; - - /** - * Returns a new UIMA Analysis Engine (AE). This method ensures that only - * one instance of an AE is created. - * - *

- * <p>
- * An Analysis Engine is a component responsible for analyzing unstructured - * information, discovering and representing semantic content. Unstructured - * information includes, but is not restricted to, text documents. - * </p>
    - * - * @param aeDescriptor - * pathname for XML file including an AnalysisEngineDescription - * that contains all of the information needed to instantiate and - * use an AnalysisEngine. - * @param umlsUser - * UMLS username for NLM database - * @param umlsPass - * UMLS password for NLM database - * @return an Analysis Engine for analyzing unstructured information. - * @throws IOException - * if any I/O error occurs. - * @throws InvalidXMLException - * if the input XML is not valid or does not specify a valid - * ResourceSpecifier. - * @throws ResourceInitializationException - * if a failure occurred during production of the resource. - * @throws URISyntaxException - * if URL of the resource is not formatted strictly according to - * to RFC2396 and cannot be converted to a URI. - */ - public static AnalysisEngine getAnalysisEngine(String aeDescriptor, - String umlsUser, String umlsPass) throws IOException, - InvalidXMLException, ResourceInitializationException, - URISyntaxException { - // UMLS user ID and password. - String aeDescriptorPath = CTAKESUtils.class.getResource(aeDescriptor) - .toURI().getPath(); - - // get Resource Specifier from XML - XMLInputSource aeIputSource = new XMLInputSource(aeDescriptorPath); - ResourceSpecifier aeSpecifier = UIMAFramework.getXMLParser() - .parseResourceSpecifier(aeIputSource); - - // UMLS user ID and password - if ((umlsUser != null) && (!umlsUser.isEmpty()) && (umlsPass != null) - && (!umlsPass.isEmpty())) { - /* - * It is highly recommended that you change UMLS credentials in the - * XML configuration file instead of giving user and password using - * CTAKESConfig. - */ - System.setProperty(CTAKES_UMLS_USER, umlsUser); - System.setProperty(CTAKES_UMLS_PASS, umlsPass); - } - - // create AE - AnalysisEngine ae = UIMAFramework.produceAnalysisEngine(aeSpecifier); - - return ae; - } - - /** - * Returns a new JCas () appropriate for the given Analysis Engine. 
This - * method ensures that only one instance of a JCas is created. A Jcas is a - * Java Cover Classes based Object-oriented CAS (Common Analysis System) - * API. - * - *

    - * Important: It is highly recommended that you reuse CAS objects rather - * than creating new CAS objects prior to each analysis. This is because CAS - * objects may be expensive to create and may consume a significant amount - * of memory. - *

    - * - * @param ae - * AnalysisEngine used to create an appropriate JCas object. - * @return a JCas object appropriate for the given AnalysisEngine. - * @throws ResourceInitializationException - * if a CAS could not be created because this AnalysisEngine's - * CAS metadata (type system, type priorities, or FS indexes) - * are invalid. - */ - public static JCas getJCas(AnalysisEngine ae) - throws ResourceInitializationException { - JCas jcas = ae.newJCas(); - - return jcas; - } - - /** - * Serializes a CAS in the given format. - * - * @param jcas - * CAS (Common Analysis System) to be serialized. - * @param type - * type of cTAKES (UIMA) serializer used to write CAS. - * @param prettyPrint - * {@code true} to do pretty printing of output. - * @param stream - * {@see OutputStream} object used to print out information - * extracted by using cTAKES. - * @throws SAXException - * if there was a SAX exception. - * @throws IOException - * if any I/O error occurs. - */ - public static void serialize(JCas jcas, CTAKESSerializer type, boolean prettyPrint, - OutputStream stream) throws SAXException, IOException { - if (type == CTAKESSerializer.XCAS) { - XCASSerializer.serialize(jcas.getCas(), stream, prettyPrint); - } else if (type == CTAKESSerializer.XMI) { - XmiCasSerializer.serialize(jcas.getCas(), jcas.getTypeSystem(), - stream, prettyPrint, new XmiSerializationSharedData()); - } else { - XmlCasSerializer.serialize(jcas.getCas(), jcas.getTypeSystem(), - stream); - } - } - - /** - * Returns the annotation value based on the given annotation type. - * - * @param annotation - * {@see IdentifiedAnnotation} object. - * @param property - * {@see CTAKESAnnotationProperty} enum used to identify the - * annotation type. - * @return the annotation value. 
-     */
-    public static String getAnnotationProperty(IdentifiedAnnotation annotation,
-            CTAKESAnnotationProperty property) {
-        String value = null;
-        if (property == CTAKESAnnotationProperty.BEGIN) {
-            value = Integer.toString(annotation.getBegin());
-        } else if (property == CTAKESAnnotationProperty.END) {
-            value = Integer.toString(annotation.getEnd());
-        } else if (property == CTAKESAnnotationProperty.CONDITIONAL) {
-            value = Boolean.toString(annotation.getConditional());
-        } else if (property == CTAKESAnnotationProperty.CONFIDENCE) {
-            value = Float.toString(annotation.getConfidence());
-        } else if (property == CTAKESAnnotationProperty.DISCOVERY_TECNIQUE) {
-            value = Integer.toString(annotation.getDiscoveryTechnique());
-        } else if (property == CTAKESAnnotationProperty.GENERIC) {
-            value = Boolean.toString(annotation.getGeneric());
-        } else if (property == CTAKESAnnotationProperty.HISTORY_OF) {
-            value = Integer.toString(annotation.getHistoryOf());
-        } else if (property == CTAKESAnnotationProperty.ID) {
-            value = Integer.toString(annotation.getId());
-        } else if (property == CTAKESAnnotationProperty.ONTOLOGY_CONCEPT_ARR) {
-            FSArray mentions = annotation.getOntologyConceptArr();
-            StringBuilder sb = new StringBuilder();
-            if (mentions != null) {
-                for (int i = 0; i < mentions.size(); i++) {
-                    if (mentions.get(i) instanceof UmlsConcept) {
-                        UmlsConcept concept = (UmlsConcept) mentions.get(i);
-                        sb.append(concept.getCui());
-                        if (i < mentions.size() - 1) {
-                            sb.append(",");
-                        }
-                    }
-                }
-            }
-            value = sb.toString();
-        } else if (property == CTAKESAnnotationProperty.POLARITY) {
-            value = Integer.toString(annotation.getPolarity());
-        }
-        return value;
-    }
-
-    /**
-     * Resets cTAKES objects, if created. This method ensures that new cTAKES
-     * objects (a.k.a., Analysis Engine and JCas) will be created if getters of
-     * this class are called.
-     *
-     * @param ae UIMA Analysis Engine
-     * @param jcas JCas object
-     */
-    public static void reset(AnalysisEngine ae, JCas jcas) {
-        // Analysis Engine
-        resetAE(ae);
-
-        // JCas
-        resetCAS(jcas);
-        jcas = null;
-    }
-
-    /**
-     * Resets the CAS (Common Analysis System), emptying it of all content.
-     *
-     * @param jcas JCas object
-     */
-    public static void resetCAS(JCas jcas) {
-        if (jcas != null) {
-            jcas.reset();
-        }
-    }
-
-    /**
-     * Resets the AE (AnalysisEngine), releasing all resources held by the
-     * current AE.
-     *
-     * @param ae UIMA Analysis Engine
-     */
-    public static void resetAE(AnalysisEngine ae) {
-        if (ae != null) {
-            ae.destroy();
-            ae = null;
-        }
-    }
-}
diff --git a/tika-parsers/src/main/java/org/apache/tika/parser/dif/DIFContentHandler.java b/tika-parsers/src/main/java/org/apache/tika/parser/dif/DIFContentHandler.java
deleted file mode 100644
index cc11316..0000000
--- a/tika-parsers/src/main/java/org/apache/tika/parser/dif/DIFContentHandler.java
+++ /dev/null
@@ -1,152 +0,0 @@
-/*
- * Licensed to the Apache Software Foundation (ASF) under one or more
- * contributor license agreements. See the NOTICE file distributed with
- * this work for additional information regarding copyright ownership.
- * The ASF licenses this file to You under the Apache License, Version 2.0
- * (the "License"); you may not use this file except in compliance with
- * the License. You may obtain a copy of the License at
- *
- *     http://www.apache.org/licenses/LICENSE-2.0
- *
- * Unless required by applicable law or agreed to in writing, software
- * distributed under the License is distributed on an "AS IS" BASIS,
- * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
- * See the License for the specific language governing permissions and
- * limitations under the License.
- */
-package org.apache.tika.parser.dif;
-
-import java.util.Stack;
-
-import org.apache.tika.metadata.Metadata;
-import org.xml.sax.Attributes;
-import org.xml.sax.ContentHandler;
-import org.xml.sax.SAXException;
-import org.xml.sax.helpers.AttributesImpl;
-import org.xml.sax.helpers.DefaultHandler;
-
-public class DIFContentHandler extends DefaultHandler {
-
-    private static final char[] NEWLINE = new char[] { '\n' };
-    private static final char[] TABSPACE = new char[] { '\t' };
-    private static final Attributes EMPTY_ATTRIBUTES = new AttributesImpl();
-
-    private Stack<String> treeStack;
-    private Stack<String> dataStack;
-    private final ContentHandler delegate;
-    private boolean isLeaf;
-    private Metadata metadata;
-
-    public DIFContentHandler(ContentHandler delegate, Metadata metadata) {
-        this.delegate = delegate;
-        this.isLeaf = false;
-        this.metadata = metadata;
-        this.treeStack = new Stack<String>();
-        this.dataStack = new Stack<String>();
-    }
-
-    @Override
-    public void setDocumentLocator(org.xml.sax.Locator locator) {
-        delegate.setDocumentLocator(locator);
-    }
-
-    @Override
-    public void characters(char[] ch, int start, int length)
-            throws SAXException {
-        String value = (new String(ch, start, length)).toString();
-        this.dataStack.push(value);
-
-        if (this.treeStack.peek().equals("Entry_Title")) {
-            this.delegate.characters(NEWLINE, 0, NEWLINE.length);
-            this.delegate.characters(TABSPACE, 0, TABSPACE.length);
-            this.delegate.startElement("", "h3", "h3", EMPTY_ATTRIBUTES);
-            String title = "Title: ";
-            title = title + value;
-            this.delegate.characters(title.toCharArray(), 0, title.length());
-            this.delegate.endElement("", "h3", "h3");
-        }
-        if (this.treeStack.peek().equals("Southernmost_Latitude")
-                || this.treeStack.peek().equals("Northernmost_Latitude")
-                || this.treeStack.peek().equals("Westernmost_Longitude")
-                || this.treeStack.peek().equals("Easternmost_Longitude")) {
-            this.delegate.characters(NEWLINE, 0, NEWLINE.length);
-            this.delegate.characters(TABSPACE, 0, TABSPACE.length);
-            this.delegate.characters(TABSPACE, 0, TABSPACE.length);
-            this.delegate.startElement("", "tr", "tr", EMPTY_ATTRIBUTES);
-            this.delegate.startElement("", "td", "td", EMPTY_ATTRIBUTES);
-            String key = this.treeStack.peek() + " : ";
-            this.delegate.characters(key.toCharArray(), 0, key.length());
-            this.delegate.endElement("", "td", "td");
-            this.delegate.startElement("", "td", "td", EMPTY_ATTRIBUTES);
-            this.delegate.characters(value.toCharArray(), 0, value.length());
-            this.delegate.endElement("", "td", "td");
-            this.delegate.endElement("", "tr", "tr");
-        }
-    }
-
-    @Override
-    public void ignorableWhitespace(char[] ch, int start, int length)
-            throws SAXException {
-        delegate.ignorableWhitespace(ch, start, length);
-    }
-
-    @Override
-    public void startElement(String uri, String localName, String qName,
-            Attributes attributes) throws SAXException {
-        this.isLeaf = true;
-        if (localName.equals("Spatial_Coverage")) {
-            this.delegate.characters(NEWLINE, 0, NEWLINE.length);
-            this.delegate.characters(TABSPACE, 0, TABSPACE.length);
-            this.delegate.startElement("", "h3", "h3", EMPTY_ATTRIBUTES);
-            String value = "Geographic Data: ";
-            this.delegate.characters(value.toCharArray(), 0, value.length());
-            this.delegate.endElement("", "h3", "h3");
-            this.delegate.characters(NEWLINE, 0, NEWLINE.length);
-            this.delegate.characters(TABSPACE, 0, TABSPACE.length);
-            this.delegate.startElement("", "table", "table", EMPTY_ATTRIBUTES);
-        }
-        this.treeStack.push(localName);
-    }
-
-    @Override
-    public void endElement(String uri, String localName, String qName)
-            throws SAXException {
-        if (localName.equals("Spatial_Coverage")) {
-            this.delegate.characters(NEWLINE, 0, NEWLINE.length);
-            this.delegate.characters(TABSPACE, 0, TABSPACE.length);
-            this.delegate.endElement("", "table", "table");
-        }
-        if (this.isLeaf) {
-            Stack<String> tempStack = (Stack<String>) this.treeStack.clone();
-            String key = "";
-            while (!tempStack.isEmpty()) {
-                if (key.length() == 0) {
-                    key = tempStack.pop();
-                } else {
-                    key = tempStack.pop() + "-" + key;
-                }
-            }
-            String value = this.dataStack.peek();
-            this.metadata.add(key, value);
-            this.isLeaf = false;
-        }
-        this.treeStack.pop();
-        this.dataStack.pop();
-    }
-
-    @Override
-    public void startDocument() throws SAXException {
-        delegate.startDocument();
-    }
-
-    @Override
-    public void endDocument() throws SAXException {
-        delegate.endDocument();
-    }
-
-    @Override
-    public String toString() {
-        return delegate.toString();
-    }
-
-}
diff --git a/tika-parsers/src/main/java/org/apache/tika/parser/dif/DIFParser.java b/tika-parsers/src/main/java/org/apache/tika/parser/dif/DIFParser.java
deleted file mode 100644
index 5957508..0000000
--- a/tika-parsers/src/main/java/org/apache/tika/parser/dif/DIFParser.java
+++ /dev/null
@@ -1,86 +0,0 @@
-/*
- * Licensed to the Apache Software Foundation (ASF) under one or more
- * contributor license agreements. See the NOTICE file distributed with
- * this work for additional information regarding copyright ownership.
- * The ASF licenses this file to You under the Apache License, Version 2.0
- * (the "License"); you may not use this file except in compliance with
- * the License. You may obtain a copy of the License at
- *
- *     http://www.apache.org/licenses/LICENSE-2.0
- *
- * Unless required by applicable law or agreed to in writing, software
- * distributed under the License is distributed on an "AS IS" BASIS,
- * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
- * See the License for the specific language governing permissions and
- * limitations under the License.
- */
-package org.apache.tika.parser.dif;
-
-import java.io.IOException;
-import java.io.InputStream;
-import java.util.Arrays;
-import java.util.Collections;
-import java.util.HashSet;
-import java.util.Set;
-
-import org.apache.commons.io.input.CloseShieldInputStream;
-import org.apache.tika.exception.TikaException;
-import org.apache.tika.metadata.Metadata;
-import org.apache.tika.mime.MediaType;
-import org.apache.tika.parser.AbstractParser;
-import org.apache.tika.parser.ParseContext;
-import org.apache.tika.sax.EmbeddedContentHandler;
-import org.apache.tika.sax.OfflineContentHandler;
-import org.apache.tika.sax.TaggedContentHandler;
-import org.apache.tika.sax.XHTMLContentHandler;
-import org.xml.sax.ContentHandler;
-import org.xml.sax.SAXException;
-
-public class DIFParser extends AbstractParser {
-
-    /**
-     *
-     */
-    private static final long serialVersionUID = 971505521275777826L;
-    private static final Set<MediaType> SUPPORTED_TYPES = Collections
-            .unmodifiableSet(new HashSet<MediaType>(Arrays.asList(MediaType.application("dif+xml"))));
-
-    @Override
-    public Set<MediaType> getSupportedTypes(ParseContext context) {
-        // TODO Auto-generated method stub
-        return SUPPORTED_TYPES;
-    }
-
-    @Override
-    public void parse(InputStream stream, ContentHandler handler,
-            Metadata metadata, ParseContext context) throws IOException,
-            SAXException, TikaException {
-        // TODO Auto-generated method stub
-        final XHTMLContentHandler xhtml = new XHTMLContentHandler(handler,
-                metadata);
-        xhtml.startDocument();
-        xhtml.startElement("p");
-        TaggedContentHandler tagged = new TaggedContentHandler(handler);
-        try {
-            context.getSAXParser().parse(
-                    new CloseShieldInputStream(stream),
-                    new OfflineContentHandler(new EmbeddedContentHandler(
-                            getContentHandler(tagged, metadata, context))));
-        } catch (SAXException e) {
-            tagged.throwIfCauseOf(e);
-            throw new TikaException("XML parse error", e);
-        } finally {
-            xhtml.endElement("p");
-            xhtml.endDocument();
-        }
-
-    }
-
-    protected ContentHandler getContentHandler(ContentHandler handler,
-            Metadata metadata, ParseContext context) {
-
-        return new DIFContentHandler(handler, metadata);
-
-    }
-
-}
\ No newline at end of file
diff --git a/tika-parsers/src/main/java/org/apache/tika/parser/envi/EnviHeaderParser.java b/tika-parsers/src/main/java/org/apache/tika/parser/envi/EnviHeaderParser.java
deleted file mode 100644
index e3410b3..0000000
--- a/tika-parsers/src/main/java/org/apache/tika/parser/envi/EnviHeaderParser.java
+++ /dev/null
@@ -1,84 +0,0 @@
-/*
- * Licensed to the Apache Software Foundation (ASF) under one or more
- * contributor license agreements. See the NOTICE file distributed with
- * this work for additional information regarding copyright ownership.
- * The ASF licenses this file to You under the Apache License, Version 2.0
- * (the "License"); you may not use this file except in compliance with
- * the License. You may obtain a copy of the License at
- *
- *     http://www.apache.org/licenses/LICENSE-2.0
- *
- * Unless required by applicable law or agreed to in writing, software
- * distributed under the License is distributed on an "AS IS" BASIS,
- * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
- * See the License for the specific language governing permissions and
- * limitations under the License.
- *
- */
-package org.apache.tika.parser.envi;
-
-import java.io.IOException;
-import java.io.InputStream;
-import java.util.Collections;
-import java.util.Set;
-import java.nio.charset.Charset;
-
-import org.apache.commons.io.input.CloseShieldInputStream;
-import org.apache.tika.detect.AutoDetectReader;
-import org.apache.tika.exception.TikaException;
-import org.apache.tika.metadata.Metadata;
-import org.apache.tika.mime.MediaType;
-import org.apache.tika.parser.ParseContext;
-import org.apache.tika.parser.AbstractParser;
-import org.apache.tika.sax.XHTMLContentHandler;
-
-import org.xml.sax.ContentHandler;
-import org.xml.sax.SAXException;
-
-public class EnviHeaderParser extends AbstractParser {
-
-    private static final long serialVersionUID = -1479368523072408091L;
-
-    public static final String ENVI_MIME_TYPE = "application/envi.hdr";
-
-    private static final Set<MediaType> SUPPORTED_TYPES = Collections
-            .singleton(MediaType.application("envi.hdr"));
-
-    public Set<MediaType> getSupportedTypes(ParseContext context) {
-        return SUPPORTED_TYPES;
-    }
-
-    public void parse(InputStream stream, ContentHandler handler,
-            Metadata metadata, ParseContext context) throws IOException,
-            SAXException, TikaException {
-
-        // Only outputting the MIME type as metadata
-        metadata.set(Metadata.CONTENT_TYPE, ENVI_MIME_TYPE);
-
-        // The following code was taken from the TXTParser
-        // Automatically detect the character encoding
-
-        try (AutoDetectReader reader = new AutoDetectReader(
-                new CloseShieldInputStream(stream), metadata)) {
-            Charset charset = reader.getCharset();
-            MediaType type = new MediaType(MediaType.TEXT_PLAIN, charset);
-            // deprecated, see TIKA-431
-            metadata.set(Metadata.CONTENT_ENCODING, charset.name());
-
-            XHTMLContentHandler xhtml = new XHTMLContentHandler(handler,
-                    metadata);
-
-            xhtml.startDocument();
-
-            // text contents of the xhtml
-            String line;
-            while ((line = reader.readLine()) != null) {
-                xhtml.startElement("p");
-                xhtml.characters(line);
-                xhtml.endElement("p");
-            }
-
-            xhtml.endDocument();
-        }
-    }
-}
diff --git a/tika-parsers/src/main/java/org/apache/tika/parser/epub/EpubContentParser.java b/tika-parsers/src/main/java/org/apache/tika/parser/epub/EpubContentParser.java
index 4debbf0..29765e5 100644
--- a/tika-parsers/src/main/java/org/apache/tika/parser/epub/EpubContentParser.java
+++ b/tika-parsers/src/main/java/org/apache/tika/parser/epub/EpubContentParser.java
@@ -26,8 +26,8 @@
 import javax.xml.parsers.SAXParser;
 import javax.xml.parsers.SAXParserFactory;
 
-import org.apache.commons.io.input.CloseShieldInputStream;
 import org.apache.tika.exception.TikaException;
+import org.apache.tika.io.CloseShieldInputStream;
 import org.apache.tika.metadata.Metadata;
 import org.apache.tika.mime.MediaType;
 import org.apache.tika.parser.AbstractParser;
diff --git a/tika-parsers/src/main/java/org/apache/tika/parser/epub/EpubParser.java b/tika-parsers/src/main/java/org/apache/tika/parser/epub/EpubParser.java
index edf4614..9cf954c 100644
--- a/tika-parsers/src/main/java/org/apache/tika/parser/epub/EpubParser.java
+++ b/tika-parsers/src/main/java/org/apache/tika/parser/epub/EpubParser.java
@@ -25,8 +25,8 @@
 import java.util.zip.ZipEntry;
 import java.util.zip.ZipInputStream;
 
-import org.apache.commons.io.IOUtils;
 import org.apache.tika.exception.TikaException;
+import org.apache.tika.io.IOUtils;
 import org.apache.tika.metadata.Metadata;
 import org.apache.tika.mime.MediaType;
 import org.apache.tika.parser.AbstractParser;
@@ -39,8 +39,6 @@
 import org.xml.sax.ContentHandler;
 import org.xml.sax.SAXException;
 import org.xml.sax.helpers.DefaultHandler;
-
-import static java.nio.charset.StandardCharsets.UTF_8;
 
 /**
  * Epub parser
@@ -95,7 +93,7 @@
         ZipEntry entry = zip.getNextEntry();
         while (entry != null) {
             if (entry.getName().equals("mimetype")) {
-                String type = IOUtils.toString(zip, UTF_8);
+                String type = IOUtils.toString(zip, "UTF-8");
                 metadata.set(Metadata.CONTENT_TYPE, type);
             } else if (entry.getName().equals("metadata.xml")) {
                 meta.parse(zip, new DefaultHandler(), metadata, context);
diff --git a/tika-parsers/src/main/java/org/apache/tika/parser/executable/MachineMetadata.java b/tika-parsers/src/main/java/org/apache/tika/parser/executable/MachineMetadata.java
index 06a4d16..f860dfd 100644
--- a/tika-parsers/src/main/java/org/apache/tika/parser/executable/MachineMetadata.java
+++ b/tika-parsers/src/main/java/org/apache/tika/parser/executable/MachineMetadata.java
@@ -26,7 +26,7 @@
     public static final String PREFIX = "machine:";
 
     public static Property ARCHITECTURE_BITS = Property.internalClosedChoise(PREFIX+"architectureBits",
-            "8", "16", "32", "64");
+            new String[] { "8", "16", "32", "64" });
 
     public static final String PLATFORM_SYSV = "System V";
     public static final String PLATFORM_HPUX = "HP-UX";
@@ -42,9 +42,9 @@
     public static final String PLATFORM_WINDOWS = "Windows";
 
     public static Property PLATFORM = Property.internalClosedChoise(PREFIX+"platform",
-            PLATFORM_SYSV, PLATFORM_HPUX, PLATFORM_NETBSD, PLATFORM_LINUX,
+            new String[] { PLATFORM_SYSV, PLATFORM_HPUX, PLATFORM_NETBSD, PLATFORM_LINUX,
             PLATFORM_SOLARIS, PLATFORM_AIX, PLATFORM_IRIX, PLATFORM_FREEBSD, PLATFORM_TRU64,
-            PLATFORM_ARM, PLATFORM_EMBEDDED, PLATFORM_WINDOWS);
+            PLATFORM_ARM, PLATFORM_EMBEDDED, PLATFORM_WINDOWS });
 
     public static final String MACHINE_x86_32 = "x86-32";
     public static final String MACHINE_x86_64 = "x86-64";
@@ -67,25 +67,23 @@
     public static final String MACHINE_UNKNOWN = "Unknown";
 
     public static Property MACHINE_TYPE = Property.internalClosedChoise(PREFIX+"machineType",
-            MACHINE_x86_32, MACHINE_x86_64, MACHINE_IA_64, MACHINE_SPARC,
-            MACHINE_M68K, MACHINE_M88K, MACHINE_MIPS, MACHINE_PPC,
-            MACHINE_S370, MACHINE_S390,
-            MACHINE_ARM, MACHINE_VAX, MACHINE_ALPHA, MACHINE_EFI, MACHINE_M32R,
-            MACHINE_SH3, MACHINE_SH4, MACHINE_SH5, MACHINE_UNKNOWN);
+            new String[] { MACHINE_x86_32, MACHINE_x86_64, MACHINE_IA_64, MACHINE_SPARC,
+            MACHINE_M68K, MACHINE_M88K, MACHINE_MIPS, MACHINE_PPC,
+            MACHINE_S370, MACHINE_S390,
+            MACHINE_ARM, MACHINE_VAX, MACHINE_ALPHA, MACHINE_EFI, MACHINE_M32R,
+            MACHINE_SH3, MACHINE_SH4, MACHINE_SH5, MACHINE_UNKNOWN });
 
     public static final class Endian {
-        private String name;
-        private boolean msb;
-        public String getName() { return name; }
-        @SuppressWarnings("unused")
-        public boolean isMSB() { return msb; }
-        @SuppressWarnings("unused")
-        public String getMSB() { if(msb) { return "MSB"; } else { return "LSB"; } }
-        private Endian(String name, boolean msb) { this.name = name; this.msb = msb; }
+        private String name;
+        private boolean msb;
+        public String getName() { return name; }
+        public boolean isMSB() { return msb; }
+        public String getMSB() { if(msb) { return "MSB"; } else { return "LSB"; } }
+        private Endian(String name, boolean msb) { this.name = name; this.msb = msb; }
 
-        public static final Endian LITTLE = new Endian("Little", false);
-        public static final Endian BIG = new Endian("Big", true);
+        public static final Endian LITTLE = new Endian("Little", false);
+        public static final Endian BIG = new Endian("Big", true);
     }
 
     public static Property ENDIAN = Property.internalClosedChoise(PREFIX+"endian",
-            Endian.LITTLE.name, Endian.BIG.name);
+            new String[] { Endian.LITTLE.name, Endian.BIG.name });
 }
diff --git a/tika-parsers/src/main/java/org/apache/tika/parser/feed/FeedParser.java b/tika-parsers/src/main/java/org/apache/tika/parser/feed/FeedParser.java
index 7f597a4..fccb659 100644
--- a/tika-parsers/src/main/java/org/apache/tika/parser/feed/FeedParser.java
+++ b/tika-parsers/src/main/java/org/apache/tika/parser/feed/FeedParser.java
@@ -23,8 +23,8 @@
 import java.util.HashSet;
 import java.util.Set;
 
-import org.apache.commons.io.input.CloseShieldInputStream;
 import org.apache.tika.exception.TikaException;
+import org.apache.tika.io.CloseShieldInputStream;
 import org.apache.tika.metadata.Metadata;
 import org.apache.tika.metadata.TikaCoreProperties;
 import org.apache.tika.mime.MediaType;
@@ -96,7 +96,7 @@
                 SyndContent content = entry.getDescription();
                 if (content != null) {
                     xhtml.newline();
-                    xhtml.characters(stripTags(content));
+                    xhtml.characters(content.getValue());
                 }
                 xhtml.endElement("li");
             }
diff --git a/tika-parsers/src/main/java/org/apache/tika/parser/font/AdobeFontMetricParser.java b/tika-parsers/src/main/java/org/apache/tika/parser/font/AdobeFontMetricParser.java
index e4bdca7..fa81feb 100644
--- a/tika-parsers/src/main/java/org/apache/tika/parser/font/AdobeFontMetricParser.java
+++ b/tika-parsers/src/main/java/org/apache/tika/parser/font/AdobeFontMetricParser.java
@@ -46,20 +46,7 @@
           MediaType.application( "x-font-adobe-metric" );
 
     private static final Set<MediaType> SUPPORTED_TYPES = Collections.singleton(AFM_TYPE);
-
-    // TIKA-1325 Replace these with properties, from a well known standard
-    static final String MET_AVG_CHAR_WIDTH = "AvgCharacterWidth";
-    static final String MET_DOC_VERSION = "DocVersion";
-    static final String MET_PS_NAME = "PSName";
-    static final String MET_FONT_NAME = "FontName";
-    static final String MET_FONT_FULL_NAME = "FontFullName";
-    static final String MET_FONT_FAMILY_NAME = "FontFamilyName";
-    static final String MET_FONT_SUB_FAMILY_NAME = "FontSubFamilyName";
-    static final String MET_FONT_VERSION = "FontVersion";
-    static final String MET_FONT_WEIGHT = "FontWeight";
-    static final String MET_FONT_NOTICE = "FontNotice";
-    static final String MET_FONT_UNDERLINE_THICKNESS = "FontUnderlineThickness";
-
+    
     public Set<MediaType> getSupportedTypes( ParseContext context ) {
         return SUPPORTED_TYPES;
     }
@@ -84,15 +71,15 @@
       metadata.set( TikaCoreProperties.TITLE, fontMetrics.getFullName() );
 
       // Add metadata associated with the font type
-      addMetadataByString( metadata, MET_AVG_CHAR_WIDTH, Float.toString( fontMetrics.getAverageCharacterWidth() ) );
-      addMetadataByString( metadata, MET_DOC_VERSION, Float.toString( fontMetrics.getAFMVersion() ) );
-      addMetadataByString( metadata, MET_FONT_NAME, fontMetrics.getFontName() );
-      addMetadataByString( metadata, MET_FONT_FULL_NAME, fontMetrics.getFullName() );
-      addMetadataByString( metadata, MET_FONT_FAMILY_NAME, fontMetrics.getFamilyName() );
-      addMetadataByString( metadata, MET_FONT_VERSION, fontMetrics.getFontVersion() );
-      addMetadataByString( metadata, MET_FONT_WEIGHT, fontMetrics.getWeight() );
-      addMetadataByString( metadata, MET_FONT_NOTICE, fontMetrics.getNotice() );
-      addMetadataByString( metadata, MET_FONT_UNDERLINE_THICKNESS, Float.toString( fontMetrics.getUnderlineThickness() ) );
+      addMetadataByString( metadata, "AvgCharacterWidth", Float.toString( fontMetrics.getAverageCharacterWidth() ) );
+      addMetadataByString( metadata, "DocVersion", Float.toString( fontMetrics.getAFMVersion() ) );
+      addMetadataByString( metadata, "FontName", fontMetrics.getFontName() );
+      addMetadataByString( metadata, "FontFullName", fontMetrics.getFullName() );
+      addMetadataByString( metadata, "FontFamilyName", fontMetrics.getFamilyName() );
+      addMetadataByString( metadata, "FontVersion", fontMetrics.getFontVersion() );
+      addMetadataByString( metadata, "FontWeight", fontMetrics.getWeight() );
+      addMetadataByString( metadata, "FontNotice", fontMetrics.getNotice() );
+      addMetadataByString( metadata, "FontUnderlineThickness", Float.toString( fontMetrics.getUnderlineThickness() ) );
 
       // Output the remaining comments as text
       XHTMLContentHandler xhtml = new XHTMLContentHandler( handler, metadata );
diff --git a/tika-parsers/src/main/java/org/apache/tika/parser/font/TrueTypeParser.java b/tika-parsers/src/main/java/org/apache/tika/parser/font/TrueTypeParser.java
index 26c1368..f6d442c 100644
--- a/tika-parsers/src/main/java/org/apache/tika/parser/font/TrueTypeParser.java
+++ b/tika-parsers/src/main/java/org/apache/tika/parser/font/TrueTypeParser.java
@@ -16,13 +16,13 @@
  */
 package org.apache.tika.parser.font;
 
+import java.awt.Font;
+import java.awt.FontFormatException;
 import java.io.IOException;
 import java.io.InputStream;
 import java.util.Collections;
 import java.util.Set;
 
-import org.apache.fontbox.ttf.NameRecord;
-import org.apache.fontbox.ttf.NamingTable;
 import org.apache.fontbox.ttf.TTFParser;
 import org.apache.fontbox.ttf.TrueTypeFont;
 import org.apache.tika.exception.TikaException;
@@ -60,6 +60,23 @@
             throws IOException, SAXException, TikaException {
         TikaInputStream tis = TikaInputStream.cast(stream);
 
+        // Until PDFBOX-1749 is fixed, if we can, use AWT to verify
+        // that the file is valid (otherwise FontBox could hang)
+        // See TIKA-1182 for details
+        if (tis != null) {
+            try {
+                if (tis.hasFile()) {
+                    Font.createFont(Font.TRUETYPE_FONT, tis.getFile());
+                } else {
+                    tis.mark(0);
+                    Font.createFont(Font.TRUETYPE_FONT, stream);
+                    tis.reset();
+                }
+            } catch (FontFormatException ex) {
+                throw new TikaException("Bad TrueType font.");
+            }
+        }
+
         // Ask FontBox to parse the file for us
         TrueTypeFont font;
         TTFParser parser = new TTFParser();
@@ -71,38 +88,11 @@
 
         // Report the details of the font
         metadata.set(Metadata.CONTENT_TYPE, TYPE.toString());
-        metadata.set(TikaCoreProperties.CREATED,
-                font.getHeader().getCreated());
-        metadata.set(TikaCoreProperties.MODIFIED,
-                font.getHeader().getModified());
-        metadata.set(AdobeFontMetricParser.MET_DOC_VERSION,
-                Float.toString(font.getHeader().getVersion()));
-
-        // Pull out the naming info
-        NamingTable fontNaming = font.getNaming();
-        for (NameRecord nr : fontNaming.getNameRecords()) {
-            if (nr.getNameId() == NameRecord.NAME_FONT_FAMILY_NAME) {
-                metadata.set(AdobeFontMetricParser.MET_FONT_FAMILY_NAME, nr.getString());
-            }
-            if (nr.getNameId() == NameRecord.NAME_FONT_SUB_FAMILY_NAME) {
-                metadata.set(AdobeFontMetricParser.MET_FONT_SUB_FAMILY_NAME, nr.getString());
-            }
-            if (nr.getNameId() == NameRecord.NAME_FULL_FONT_NAME) {
-                metadata.set(AdobeFontMetricParser.MET_FONT_NAME, nr.getString());
-                metadata.set(TikaCoreProperties.TITLE, nr.getString());
-            }
-            if (nr.getNameId() == NameRecord.NAME_POSTSCRIPT_NAME) {
-                metadata.set(AdobeFontMetricParser.MET_PS_NAME, nr.getString());
-            }
-            if (nr.getNameId() == NameRecord.NAME_COPYRIGHT) {
-                metadata.set("Copyright", nr.getString());
-            }
-            if (nr.getNameId() == NameRecord.NAME_TRADEMARK) {
-                metadata.set("Trademark", nr.getString());
-            }
-        }
-
-        // For now, we only output metadata, no textual contents
+        metadata.set(TikaCoreProperties.CREATED, font.getHeader().getCreated().getTime());
+        metadata.set(
+                TikaCoreProperties.MODIFIED,
+                font.getHeader().getModified().getTime());
+
         XHTMLContentHandler xhtml = new XHTMLContentHandler(handler, metadata);
         xhtml.startDocument();
         xhtml.endDocument();
diff --git a/tika-parsers/src/main/java/org/apache/tika/parser/gdal/GDALParser.java b/tika-parsers/src/main/java/org/apache/tika/parser/gdal/GDALParser.java
deleted file mode 100644
index aba00fa..0000000
--- a/tika-parsers/src/main/java/org/apache/tika/parser/gdal/GDALParser.java
+++ /dev/null
@@ -1,415 +0,0 @@
-/**
- * Licensed to the Apache Software Foundation (ASF) under one or more
- * contributor license agreements. See the NOTICE file distributed with
- * this work for additional information regarding copyright ownership.
- * The ASF licenses this file to You under the Apache License, Version 2.0
- * (the "License"); you may not use this file except in compliance with
- * the License. You may obtain a copy of the License at
- *
- *     http://www.apache.org/licenses/LICENSE-2.0
- *
- * Unless required by applicable law or agreed to in writing, software
- * distributed under the License is distributed on an "AS IS" BASIS,
- * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
- * See the License for the specific language governing permissions and
- * limitations under the License.
- */ - -package org.apache.tika.parser.gdal; - -//JDK imports -import java.io.ByteArrayInputStream; -import java.io.IOException; -import java.io.InputStream; -import java.io.InputStreamReader; -import java.io.Reader; -import java.util.HashMap; -import java.util.HashSet; -import java.util.Map; -import java.util.Scanner; -import java.util.Set; -import java.util.regex.Matcher; -import java.util.regex.Pattern; -import org.apache.tika.exception.TikaException; -import org.apache.tika.io.TemporaryResources; -import org.apache.tika.io.TikaInputStream; -import org.apache.tika.metadata.Metadata; -import org.apache.tika.mime.MediaType; -import org.apache.tika.parser.AbstractParser; -import org.apache.tika.parser.ParseContext; -import org.apache.tika.parser.external.ExternalParser; -import org.apache.tika.sax.XHTMLContentHandler; -import org.xml.sax.ContentHandler; -import org.xml.sax.SAXException; - -import static java.nio.charset.StandardCharsets.UTF_8; -import static org.apache.tika.parser.external.ExternalParser.INPUT_FILE_TOKEN; - -//Tika imports -//SAX imports - -/** - * Wraps execution of the Geospatial Data Abstraction - * Library (GDAL) gdalinfo tool used to extract geospatial - * information out of hundreds of geo file formats. - *

    - * The parser requires the installation of GDAL and for gdalinfo to - * be located on the path. - *

    - * Basic information (Size, Coordinate System, Bounding Box, Driver, and - * resource info) are extracted as metadata, and the remaining metadata patterns - * are extracted and added. - *

    - * The output of the command is available from the provided - * {@link ContentHandler} in the - * {@link #parse(InputStream, ContentHandler, Metadata, ParseContext)} method. - */ -public class GDALParser extends AbstractParser { - - private static final long serialVersionUID = -3869130527323941401L; - - private String command; - - public GDALParser() { - setCommand("gdalinfo ${INPUT}"); - } - - public void setCommand(String command) { - this.command = command; - } - - public String getCommand() { - return this.command; - } - - public String processCommand(InputStream stream) { - TikaInputStream tis = (TikaInputStream) stream; - String pCommand = this.command; - try { - if (this.command.contains(INPUT_FILE_TOKEN)) { - pCommand = this.command.replace(INPUT_FILE_TOKEN, tis.getFile() - .getPath()); - } - } catch (Exception e) { - e.printStackTrace(); - } - - return pCommand; - } - - @Override - public Set getSupportedTypes(ParseContext context) { - Set types = new HashSet(); - types.add(MediaType.application("x-netcdf")); - types.add(MediaType.application("vrt")); - types.add(MediaType.image("geotiff")); - types.add(MediaType.image("nitf")); - types.add(MediaType.application("x-rpf-toc")); - types.add(MediaType.application("x-ecrg-toc")); - types.add(MediaType.image("hfa")); - types.add(MediaType.image("sar-ceos")); - types.add(MediaType.image("ceos")); - types.add(MediaType.application("jaxa-pal-sar")); - types.add(MediaType.application("gff")); - types.add(MediaType.application("elas")); - types.add(MediaType.application("aig")); - types.add(MediaType.application("aaigrid")); - types.add(MediaType.application("grass-ascii-grid")); - types.add(MediaType.application("sdts-raster")); - types.add(MediaType.application("dted")); - types.add(MediaType.image("png")); - types.add(MediaType.image("jpeg")); - types.add(MediaType.image("raster")); - types.add(MediaType.application("jdem")); - types.add(MediaType.image("gif")); - types.add(MediaType.image("big-gif")); - 
types.add(MediaType.image("envisat")); - types.add(MediaType.image("fits")); - types.add(MediaType.application("fits")); - types.add(MediaType.image("bsb")); - types.add(MediaType.application("xpm")); - types.add(MediaType.image("bmp")); - types.add(MediaType.image("x-dimap")); - types.add(MediaType.image("x-airsar")); - types.add(MediaType.application("x-rs2")); - types.add(MediaType.application("x-pcidsk")); - types.add(MediaType.application("pcisdk")); - types.add(MediaType.image("x-pcraster")); - types.add(MediaType.image("ilwis")); - types.add(MediaType.image("sgi")); - types.add(MediaType.application("x-srtmhgt")); - types.add(MediaType.application("leveller")); - types.add(MediaType.application("terragen")); - types.add(MediaType.application("x-gmt")); - types.add(MediaType.application("x-isis3")); - types.add(MediaType.application("x-isis2")); - types.add(MediaType.application("x-pds")); - types.add(MediaType.application("x-til")); - types.add(MediaType.application("x-ers")); - types.add(MediaType.application("x-l1b")); - types.add(MediaType.image("fit")); - types.add(MediaType.application("x-grib")); - types.add(MediaType.image("jp2")); - types.add(MediaType.application("x-rmf")); - types.add(MediaType.application("x-wcs")); - types.add(MediaType.application("x-wms")); - types.add(MediaType.application("x-msgn")); - types.add(MediaType.application("x-rst")); - types.add(MediaType.application("x-ingr")); - types.add(MediaType.application("x-gsag")); - types.add(MediaType.application("x-gsbg")); - types.add(MediaType.application("x-gs7bg")); - types.add(MediaType.application("x-cosar")); - types.add(MediaType.application("x-tsx")); - types.add(MediaType.application("x-coasp")); - types.add(MediaType.application("x-r")); - types.add(MediaType.application("x-map")); - types.add(MediaType.application("x-pnm")); - types.add(MediaType.application("x-doq1")); -
types.add(MediaType.application("x-doq2")); - types.add(MediaType.application("x-envi")); - types.add(MediaType.application("x-envi-hdr")); - types.add(MediaType.application("x-generic-bin")); - types.add(MediaType.application("x-p-aux")); - types.add(MediaType.image("x-mff")); - types.add(MediaType.image("x-mff2")); - types.add(MediaType.image("x-fujibas")); - types.add(MediaType.application("x-gsc")); - types.add(MediaType.application("x-fast")); - types.add(MediaType.application("x-bt")); - types.add(MediaType.application("x-lan")); - types.add(MediaType.application("x-cpg")); - types.add(MediaType.image("ida")); - types.add(MediaType.application("x-ndf")); - types.add(MediaType.image("eir")); - types.add(MediaType.application("x-dipex")); - types.add(MediaType.application("x-lcp")); - types.add(MediaType.application("x-gtx")); - types.add(MediaType.application("x-los-las")); - types.add(MediaType.application("x-ntv2")); - types.add(MediaType.application("x-ctable2")); - types.add(MediaType.application("x-ace2")); - types.add(MediaType.application("x-snodas")); - types.add(MediaType.application("x-kro")); - types.add(MediaType.image("arg")); - types.add(MediaType.application("x-rik")); - types.add(MediaType.application("x-usgs-dem")); - types.add(MediaType.application("x-gxf")); - types.add(MediaType.application("x-dods")); - types.add(MediaType.application("x-http")); - types.add(MediaType.application("x-bag")); - types.add(MediaType.application("x-hdf")); - types.add(MediaType.image("x-hdf5-image")); - types.add(MediaType.application("x-nwt-grd")); - types.add(MediaType.application("x-nwt-grc")); - types.add(MediaType.image("adrg")); - types.add(MediaType.image("x-srp")); - types.add(MediaType.application("x-blx")); - types.add(MediaType.application("x-rasterlite")); - types.add(MediaType.application("x-epsilon")); - types.add(MediaType.application("x-sdat")); - types.add(MediaType.application("x-kml")); - types.add(MediaType.application("x-xyz")); - 
types.add(MediaType.application("x-geo-pdf")); - types.add(MediaType.image("x-ozi")); - types.add(MediaType.application("x-ctg")); - types.add(MediaType.application("x-e00-grid")); - types.add(MediaType.application("x-zmap")); - types.add(MediaType.application("x-webp")); - types.add(MediaType.application("x-ngs-geoid")); - types.add(MediaType.application("x-mbtiles")); - types.add(MediaType.application("x-ppi")); - types.add(MediaType.application("x-cappi")); - return types; - } - - @Override - public void parse(InputStream stream, ContentHandler handler, - Metadata metadata, ParseContext context) throws IOException, - SAXException, TikaException { - - if (!ExternalParser.check("gdalinfo")) { - return; - } - - // first set up and run GDAL - // process the command - TemporaryResources tmp = new TemporaryResources(); - TikaInputStream tis = TikaInputStream.get(stream, tmp); - - String runCommand = processCommand(tis); - String output = execCommand(new String[]{runCommand}); - - // now extract the actual metadata params - // from the GDAL output in the content stream - // to do this, we need to literally process the output - // from the invoked command b/c we can't read metadata and - // output text from the handler in ExternalParser - // at the same time, so for now, we can't use the - // ExternalParser to do this and I've had to bring some of - // that functionality directly into this class - // TODO: investigate a way to do both using ExternalParser - - extractMetFromOutput(output, metadata); - applyPatternsToOutput(output, metadata, getPatterns()); - - // make the content handler and provide output there - // now that we have metadata - processOutput(handler, metadata, output); - } - - private Map<Pattern, String> getPatterns() { - Map<Pattern, String> patterns = new HashMap<Pattern, String>(); - this.addPatternWithColon("Driver", patterns); - this.addPatternWithColon("Files", patterns); - this.addPatternWithIs("Size", patterns); - this.addPatternWithIs("Coordinate System", patterns); -
this.addBoundingBoxPattern("Upper Left", patterns); - this.addBoundingBoxPattern("Lower Left", patterns); - this.addBoundingBoxPattern("Upper Right", patterns); - this.addBoundingBoxPattern("Lower Right", patterns); - return patterns; - } - - private void addPatternWithColon(String name, Map<Pattern, String> patterns) { - patterns.put( - Pattern.compile(name + "\\:\\s*([A-Za-z0-9/ _\\-\\.]+)\\s*"), - name); - } - - private void addPatternWithIs(String name, Map<Pattern, String> patterns) { - patterns.put(Pattern.compile(name + " is ([A-Za-z0-9\\.,\\s`']+)"), - name); - } - - private void addBoundingBoxPattern(String name, - Map<Pattern, String> patterns) { - patterns.put( - Pattern.compile(name - + "\\s*\\(\\s*([0-9]+\\.[0-9]+\\s*,\\s*[0-9]+\\.[0-9]+\\s*)\\)\\s*"), - name); - } - - private void extractMetFromOutput(String output, Metadata met) { - Scanner scanner = new Scanner(output); - String currentKey = null; - String[] headings = {"Subdatasets", "Corner Coordinates"}; - StringBuilder metVal = new StringBuilder(); - while (scanner.hasNextLine()) { - String line = scanner.nextLine(); - if (line.contains("=") || hasHeadings(line, headings)) { - if (currentKey != null) { - // time to flush this key and met val - met.add(currentKey, metVal.toString()); - } - metVal.setLength(0); - - String[] lineToks = line.split("="); - currentKey = lineToks[0].trim(); - if (lineToks.length == 2) { - metVal.append(lineToks[1]); - } else { - metVal.append(""); - } - } else { - metVal.append(line); - } - - } - } - - private boolean hasHeadings(String line, String[] headings) { - if (headings != null && headings.length > 0) { - for (String heading : headings) { - if (line.contains(heading)) { - return true; - } - } - return false; - } else return false; - } - - private void applyPatternsToOutput(String output, Metadata metadata, - Map<Pattern, String> metadataPatterns) { - Scanner scanner = new Scanner(output); - while (scanner.hasNextLine()) { - String line = scanner.nextLine(); - for (Pattern p : metadataPatterns.keySet()) { - Matcher m =
p.matcher(line); - if (m.find()) { - if (metadataPatterns.get(p) != null - && !metadataPatterns.get(p).equals("")) { - metadata.add(metadataPatterns.get(p), m.group(1)); - } else { - metadata.add(m.group(1), m.group(2)); - } - } - } - } - - } - - private String execCommand(String[] cmd) throws IOException { - // Execute - Process process; - String output = null; - if (cmd.length == 1) { - process = Runtime.getRuntime().exec(cmd[0]); - } else { - process = Runtime.getRuntime().exec(cmd); - } - - try { - InputStream out = process.getInputStream(); - - try { - output = extractOutput(out); - } catch (Exception e) { - e.printStackTrace(); - output = ""; - } - - } finally { - try { - process.waitFor(); - } catch (InterruptedException ignore) { - } - } - return output; - - } - - private String extractOutput(InputStream stream) throws SAXException, - IOException { - StringBuilder sb = new StringBuilder(); - try (Reader reader = new InputStreamReader(stream, UTF_8)) { - char[] buffer = new char[1024]; - for (int n = reader.read(buffer); n != -1; n = reader.read(buffer)) { - sb.append(buffer, 0, n); - } - } - return sb.toString(); - } - - private void processOutput(ContentHandler handler, Metadata metadata, - String output) throws SAXException, IOException { - XHTMLContentHandler xhtml = new XHTMLContentHandler(handler, metadata); - InputStream stream = new ByteArrayInputStream(output.getBytes(UTF_8)); - try (Reader reader = new InputStreamReader(stream, UTF_8)) { - xhtml.startDocument(); - xhtml.startElement("p"); - char[] buffer = new char[1024]; - for (int n = reader.read(buffer); n != -1; n = reader.read(buffer)) { - xhtml.characters(buffer, 0, n); - } - xhtml.endElement("p"); - - } finally { - xhtml.endDocument(); - } - - } - -} diff --git a/tika-parsers/src/main/java/org/apache/tika/parser/geo/topic/GeoParser.java b/tika-parsers/src/main/java/org/apache/tika/parser/geo/topic/GeoParser.java deleted file mode 100644 index 6d07220..0000000 --- 
a/tika-parsers/src/main/java/org/apache/tika/parser/geo/topic/GeoParser.java +++ /dev/null @@ -1,155 +0,0 @@ -/* - * Licensed to the Apache Software Foundation (ASF) under one or more - * contributor license agreements.  See the NOTICE file distributed with - * this work for additional information regarding copyright ownership. - * The ASF licenses this file to You under the Apache License, Version 2.0 - * (the "License"); you may not use this file except in compliance with - * the License.  You may obtain a copy of the License at - * - *     http://www.apache.org/licenses/LICENSE-2.0 - * - * Unless required by applicable law or agreed to in writing, software - * distributed under the License is distributed on an "AS IS" BASIS, - * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. - * See the License for the specific language governing permissions and - * limitations under the License. - */ - -package org.apache.tika.parser.geo.topic; - -import java.io.ByteArrayOutputStream; -import java.io.IOException; -import java.io.InputStream; -import java.util.ArrayList; -import java.util.Collections; -import java.util.HashMap; -import java.util.Set; -import java.util.logging.Logger; - -import org.apache.commons.exec.CommandLine; -import org.apache.commons.exec.DefaultExecutor; -import org.apache.commons.exec.ExecuteException; -import org.apache.commons.exec.ExecuteWatchdog; -import org.apache.commons.exec.PumpStreamHandler; -import org.apache.commons.exec.environment.EnvironmentUtils; -import org.apache.tika.exception.TikaException; -import org.apache.tika.metadata.Metadata; -import org.apache.tika.mime.MediaType; -import org.apache.tika.parser.AbstractParser; -import org.apache.tika.parser.ParseContext; -import org.apache.tika.parser.external.ExternalParser; -import org.json.simple.JSONArray; -import org.json.simple.JSONObject; -import org.json.simple.JSONValue; -import org.xml.sax.ContentHandler; -import org.xml.sax.SAXException; - -public
class GeoParser extends AbstractParser { - - private static final long serialVersionUID = -2241391757440215491L; - private static final MediaType MEDIA_TYPE = MediaType - .application("geotopic"); - private static final Set<MediaType> SUPPORTED_TYPES = Collections - .singleton(MEDIA_TYPE); - private GeoParserConfig config = new GeoParserConfig(); - private static final Logger LOG = Logger.getLogger(GeoParser.class - .getName()); - - @Override - public Set<MediaType> getSupportedTypes(ParseContext parseContext) { - return SUPPORTED_TYPES; - } - - @Override - public void parse(InputStream stream, ContentHandler handler, - Metadata metadata, ParseContext context) throws IOException, - SAXException, TikaException { - - /*----------------configure this parser by ParseContext Object---------------------*/ - config = context.get(GeoParserConfig.class, config); - String nerModelPath = config.getNERPath(); - - if (!isAvailable()) { - return; - } - - /*----------------get locationNameEntities and best nameEntity for the input stream---------------------*/ - NameEntityExtractor extractor = new NameEntityExtractor(nerModelPath); - extractor.getAllNameEntitiesfromInput(stream); - extractor.getBestNameEntity(); - ArrayList<String> locationNameEntities = extractor.locationNameEntities; - String bestner = extractor.bestNameEntity; - - /*------------------------resolve geonames for each ner, store results in a hashmap---------------------*/ - HashMap<String, ArrayList<String>> resolvedGeonames = searchGeoNames(locationNameEntities); - - /*----------------store locationNameEntities and their geonames in a geotag, each input has one geotag---------------------*/ - GeoTag geotag = new GeoTag(); - geotag.toGeoTag(resolvedGeonames, bestner); - - /* add resolved entities in metadata */ - - metadata.add("Geographic_NAME", geotag.Geographic_NAME); - metadata.add("Geographic_LONGITUDE", geotag.Geographic_LONGTITUDE); - metadata.add("Geographic_LATITUDE", geotag.Geographic_LATITUDE); - for (int i = 0; i < geotag.alternatives.size(); ++i) { -
GeoTag alter = (GeoTag) geotag.alternatives.get(i); - metadata.add("Optional_NAME" + (i + 1), alter.Geographic_NAME); - metadata.add("Optional_LONGITUDE" + (i + 1), - alter.Geographic_LONGTITUDE); - metadata.add("Optional_LATITUDE" + (i + 1), - alter.Geographic_LATITUDE); - } - } - - public HashMap<String, ArrayList<String>> searchGeoNames( - ArrayList<String> locationNameEntities) throws ExecuteException, - IOException { - CommandLine cmdLine = new CommandLine("lucene-geo-gazetteer"); - ByteArrayOutputStream outputStream = new ByteArrayOutputStream(); - cmdLine.addArgument("-s"); - for (String name : locationNameEntities) { - cmdLine.addArgument(name); - } - - LOG.fine("Executing: " + cmdLine); - DefaultExecutor exec = new DefaultExecutor(); - exec.setExitValue(0); - ExecuteWatchdog watchdog = new ExecuteWatchdog(60000); - exec.setWatchdog(watchdog); - PumpStreamHandler streamHandler = new PumpStreamHandler(outputStream); - exec.setStreamHandler(streamHandler); - int exitValue = exec.execute(cmdLine, - EnvironmentUtils.getProcEnvironment()); - String outputJson = outputStream.toString("UTF-8"); - JSONArray json = (JSONArray) JSONValue.parse(outputJson); - - HashMap<String, ArrayList<String>> returnHash = new HashMap<String, ArrayList<String>>(); - for (int i = 0; i < json.size(); i++) { - JSONObject obj = (JSONObject) json.get(i); - for (Object key : obj.keySet()) { - String theKey = (String) key; - JSONArray vals = (JSONArray) obj.get(theKey); - ArrayList<String> stringVals = new ArrayList<String>( - vals.size()); - for (int j = 0; j < vals.size(); j++) { - String val = (String) vals.get(j); - stringVals.add(val); - } - - returnHash.put(theKey, stringVals); - } - } - - return returnHash; - - } - - public boolean isAvailable() { - return ExternalParser.check(new String[] { "lucene-geo-gazetteer", - "--help" }, -1) - && config.getNERPath() != null - && !config.getNERPath().equals(""); - } - -} diff --git a/tika-parsers/src/main/java/org/apache/tika/parser/geo/topic/GeoParserConfig.java
b/tika-parsers/src/main/java/org/apache/tika/parser/geo/topic/GeoParserConfig.java deleted file mode 100644 index c8bfa2a..0000000 --- a/tika-parsers/src/main/java/org/apache/tika/parser/geo/topic/GeoParserConfig.java +++ /dev/null @@ -1,54 +0,0 @@ -/* - * Licensed to the Apache Software Foundation (ASF) under one or more - * contributor license agreements. See the NOTICE file distributed with - * this work for additional information regarding copyright ownership. - * The ASF licenses this file to You under the Apache License, Version 2.0 - * (the "License"); you may not use this file except in compliance with - * the License. You may obtain a copy of the License at - * - * http://www.apache.org/licenses/LICENSE-2.0 - * - * Unless required by applicable law or agreed to in writing, software - * distributed under the License is distributed on an "AS IS" BASIS, - * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. - * See the License for the specific language governing permissions and - * limitations under the License. 
- */ - -package org.apache.tika.parser.geo.topic; - -import java.io.File; -import java.io.Serializable; -import java.net.URISyntaxException; - -public class GeoParserConfig implements Serializable { - - private static final long serialVersionUID = 1L; - private String nerModelPath = null; - - public GeoParserConfig() { - try { - if (GeoParserConfig.class.getResource("en-ner-location.bin") != null) { - this.nerModelPath = new File(GeoParserConfig.class.getResource( - "en-ner-location.bin").toURI()).getAbsolutePath(); - } - } catch (URISyntaxException e) { - e.printStackTrace(); - } - } - - public void setNERModelPath(String path) { - if (path == null) - return; - File file = new File(path); - if (file.isDirectory() || !file.exists()) { - return; - } - nerModelPath = path; - } - - public String getNERPath() { - return nerModelPath; - } - -} diff --git a/tika-parsers/src/main/java/org/apache/tika/parser/geo/topic/GeoTag.java b/tika-parsers/src/main/java/org/apache/tika/parser/geo/topic/GeoTag.java deleted file mode 100644 index bccaef1..0000000 --- a/tika-parsers/src/main/java/org/apache/tika/parser/geo/topic/GeoTag.java +++ /dev/null @@ -1,65 +0,0 @@ -/* - * Licensed to the Apache Software Foundation (ASF) under one or more - * contributor license agreements. See the NOTICE file distributed with - * this work for additional information regarding copyright ownership. - * The ASF licenses this file to You under the Apache License, Version 2.0 - * (the "License"); you may not use this file except in compliance with - * the License. You may obtain a copy of the License at - * - * http://www.apache.org/licenses/LICENSE-2.0 - * - * Unless required by applicable law or agreed to in writing, software - * distributed under the License is distributed on an "AS IS" BASIS, - * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. - * See the License for the specific language governing permissions and - * limitations under the License. 
- */ - -package org.apache.tika.parser.geo.topic; - -import java.util.ArrayList; -import java.util.HashMap; - -public class GeoTag { - String Geographic_NAME; - String Geographic_LONGTITUDE; - String Geographic_LATITUDE; - ArrayList<GeoTag> alternatives = new ArrayList<GeoTag>(); - - public void setMain(String name, String longitude, String latitude) { - Geographic_NAME = name; - Geographic_LONGTITUDE = longitude; - Geographic_LATITUDE = latitude; - } - - public void addAlternative(GeoTag geotag) { - alternatives.add(geotag); - } - - /* - * Store resolved geoName entities in a GeoTag - * - * @param resolvedGeonames resolved entities - * - * @param bestNER best name entity among all the extracted entities for the - * input stream - */ - public void toGeoTag(HashMap<String, ArrayList<String>> resolvedGeonames, - String bestNER) { - - for (String key : resolvedGeonames.keySet()) { - ArrayList<String> cur = resolvedGeonames.get(key); - if (key.equals(bestNER)) { - this.Geographic_NAME = cur.get(0); - this.Geographic_LONGTITUDE = cur.get(1); - this.Geographic_LATITUDE = cur.get(2); - } else { - GeoTag alter = new GeoTag(); - alter.Geographic_NAME = cur.get(0); - alter.Geographic_LONGTITUDE = cur.get(1); - alter.Geographic_LATITUDE = cur.get(2); - this.addAlternative(alter); - } - } - } -} diff --git a/tika-parsers/src/main/java/org/apache/tika/parser/geo/topic/NameEntityExtractor.java b/tika-parsers/src/main/java/org/apache/tika/parser/geo/topic/NameEntityExtractor.java deleted file mode 100644 index e7435d1..0000000 --- a/tika-parsers/src/main/java/org/apache/tika/parser/geo/topic/NameEntityExtractor.java +++ /dev/null @@ -1,127 +0,0 @@ -/* - * Licensed to the Apache Software Foundation (ASF) under one or more - * contributor license agreements.  See the NOTICE file distributed with - * this work for additional information regarding copyright ownership. - * The ASF licenses this file to You under the Apache License, Version 2.0 - * (the "License"); you may not use this file except in compliance with - * the License.
You may obtain a copy of the License at - * - *     http://www.apache.org/licenses/LICENSE-2.0 - * - * Unless required by applicable law or agreed to in writing, software - * distributed under the License is distributed on an "AS IS" BASIS, - * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. - * See the License for the specific language governing permissions and - * limitations under the License. - */ - -package org.apache.tika.parser.geo.topic; - -import java.io.FileInputStream; -import java.io.IOException; -import java.io.InputStream; -import java.util.ArrayList; -import java.util.Arrays; -import java.util.Collections; -import java.util.Comparator; -import java.util.HashMap; -import java.util.List; -import java.util.Map; - -import opennlp.tools.namefind.NameFinderME; -import opennlp.tools.namefind.TokenNameFinderModel; -import opennlp.tools.util.InvalidFormatException; -import opennlp.tools.util.Span; - -import org.apache.commons.io.IOUtils; - -import static java.nio.charset.StandardCharsets.UTF_8; - -public class NameEntityExtractor { - private String nerModelPath = null; - ArrayList<String> locationNameEntities; - String bestNameEntity; - private HashMap<String, Integer> tf; - - public NameEntityExtractor(String nerModelpath) { - this.locationNameEntities = new ArrayList<String>(); - this.bestNameEntity = null; - this.nerModelPath = nerModelpath; - tf = new HashMap<String, Integer>(); - - } - - /* - * Use OpenNLP to extract location names that appear in the stream. - * OpenNLP's default Name Finder accuracy is not very good; please refer to - * its documentation.
- * - * @param stream stream passed from this.parse() - */ - - public void getAllNameEntitiesfromInput(InputStream stream) - throws InvalidFormatException, IOException { - - InputStream modelIn = new FileInputStream(nerModelPath); - TokenNameFinderModel model = new TokenNameFinderModel(modelIn); - NameFinderME nameFinder = new NameFinderME(model); - String[] in = IOUtils.toString(stream, UTF_8).split(" "); - - Span nameE[] = nameFinder.find(in); - - String spanNames = Arrays.toString(Span.spansToStrings(nameE, in)); - spanNames = spanNames.substring(1, spanNames.length() - 1); - modelIn.close(); - String[] tmp = spanNames.split(","); - - for (String name : tmp) { - name = name.trim(); - this.locationNameEntities.add(name); - } - - } - - /* - * Get the best location entity extracted from the input stream. Simply - * return the most frequent entity; if several entities are equally - * frequent, pick one randomly. May not be the optimal solution, but works. - * - * @param locationNameEntities OpenNLP name finder's results, stored in - * ArrayList - */ - public void getBestNameEntity() { - if (this.locationNameEntities.size() == 0) - return; - - for (int i = 0; i < this.locationNameEntities.size(); ++i) { - if (tf.containsKey(this.locationNameEntities.get(i))) - tf.put(this.locationNameEntities.get(i), - tf.get(this.locationNameEntities.get(i)) + 1); - else - tf.put(this.locationNameEntities.get(i), 1); - } - int max = 0; - List<Map.Entry<String, Integer>> list = new ArrayList<Map.Entry<String, Integer>>( - tf.entrySet()); - Collections.shuffle(list); - Collections.sort(list, new Comparator<Map.Entry<String, Integer>>() { - public int compare(Map.Entry<String, Integer> o1, - Map.Entry<String, Integer> o2) { - return o2.getValue().compareTo(o1.getValue()); // descending - // order - - } - }); - - this.locationNameEntities.clear();// update so that they are in - // descending order - for (Map.Entry<String, Integer> entry : list) { - this.locationNameEntities.add(entry.getKey()); - if (entry.getValue() > max) { - max = entry.getValue(); - this.bestNameEntity = entry.getKey(); - } - } - } - -}
diff --git a/tika-parsers/src/main/java/org/apache/tika/parser/geoinfo/GeographicInformationParser.java b/tika-parsers/src/main/java/org/apache/tika/parser/geoinfo/GeographicInformationParser.java deleted file mode 100644 index 2ebba14..0000000 --- a/tika-parsers/src/main/java/org/apache/tika/parser/geoinfo/GeographicInformationParser.java +++ /dev/null @@ -1,391 +0,0 @@ -/* - * Licensed to the Apache Software Foundation (ASF) under one or more - * contributor license agreements. See the NOTICE file distributed with - * this work for additional information regarding copyright ownership. - * The ASF licenses this file to You under the Apache License, Version 2.0 - * (the "License"); you may not use this file except in compliance with - * the License. You may obtain a copy of the License at - * - * http://www.apache.org/licenses/LICENSE-2.0 - * - * Unless required by applicable law or agreed to in writing, software - * distributed under the License is distributed on an "AS IS" BASIS, - * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. - * See the License for the specific language governing permissions and - * limitations under the License. 
- */ - -package org.apache.tika.parser.geoinfo; - -import org.apache.sis.internal.util.CheckedArrayList; -import org.apache.sis.internal.util.CheckedHashSet; -import org.apache.sis.metadata.iso.DefaultMetadata; -import org.apache.sis.metadata.iso.DefaultMetadataScope; -import org.apache.sis.metadata.iso.constraint.DefaultLegalConstraints; -import org.apache.sis.metadata.iso.extent.DefaultGeographicBoundingBox; -import org.apache.sis.metadata.iso.extent.DefaultGeographicDescription; -import org.apache.sis.metadata.iso.identification.DefaultDataIdentification; -import org.apache.sis.storage.DataStore; -import org.apache.sis.storage.DataStoreException; -import org.apache.sis.storage.DataStores; -import org.apache.sis.storage.UnsupportedStorageException; -import org.apache.sis.util.collection.CodeListSet; -import org.apache.tika.exception.TikaException; -import org.apache.tika.io.TemporaryResources; -import org.apache.tika.io.TikaInputStream; -import org.apache.tika.metadata.Metadata; -import org.apache.tika.mime.MediaType; -import org.apache.tika.parser.AbstractParser; -import org.apache.tika.parser.ParseContext; -import org.apache.tika.sax.XHTMLContentHandler; -import org.opengis.metadata.Identifier; -import org.opengis.metadata.citation.Citation; -import org.opengis.metadata.citation.CitationDate; -import org.opengis.metadata.citation.OnlineResource; -import org.opengis.metadata.citation.ResponsibleParty; -import org.opengis.metadata.constraint.Restriction; -import org.opengis.metadata.distribution.DigitalTransferOptions; -import org.opengis.metadata.distribution.Distribution; -import org.opengis.metadata.distribution.Distributor; -import org.opengis.metadata.distribution.Format; -import org.opengis.metadata.extent.Extent; -import org.opengis.metadata.extent.GeographicExtent; -import org.opengis.metadata.identification.Identification; -import org.opengis.metadata.identification.Keywords; -import org.opengis.metadata.identification.Progress; -import 
org.opengis.metadata.identification.TopicCategory; -import org.opengis.util.InternationalString; -import org.xml.sax.ContentHandler; -import org.xml.sax.SAXException; - -import java.io.File; -import java.io.IOException; -import java.io.InputStream; -import java.nio.charset.Charset; -import java.util.*; - - -public class GeographicInformationParser extends AbstractParser{ - - public static final String geoInfoType="text/iso19139+xml"; - private final Set<MediaType> SUPPORTED_TYPES = - Collections.singleton(MediaType.text("iso19139+xml")); - - - @Override - public Set<MediaType> getSupportedTypes(ParseContext parseContext) { - return SUPPORTED_TYPES; - } - - @Override - public void parse(InputStream inputStream, ContentHandler contentHandler, Metadata metadata, ParseContext parseContext) throws IOException, SAXException, TikaException { - metadata.set(Metadata.CONTENT_TYPE,geoInfoType); - DataStore dataStore= null; - DefaultMetadata defaultMetadata=null; - XHTMLContentHandler xhtmlContentHandler=new XHTMLContentHandler(contentHandler,metadata); - - try { - TemporaryResources tmp = new TemporaryResources(); - TikaInputStream tikaInputStream=TikaInputStream.get(inputStream,tmp); - File file= tikaInputStream.getFile(); - dataStore = DataStores.open(file); - defaultMetadata=new DefaultMetadata(dataStore.getMetadata()); - if(defaultMetadata!=null) - extract(xhtmlContentHandler, metadata, defaultMetadata); - - }catch (UnsupportedStorageException e) { - throw new TikaException("UnsupportedStorageException",e); - } - catch (DataStoreException e) { - throw new TikaException("DataStoreException",e); - } - } - - private void extract(XHTMLContentHandler xhtmlContentHandler,Metadata metadata, DefaultMetadata defaultMetadata) throws SAXException{ - try { - getMetaDataCharacterSet(metadata, defaultMetadata); - getMetaDataContact(metadata, defaultMetadata); - getMetaDataIdentificationInfo(metadata, defaultMetadata); - getMetaDataDistributionInfo(metadata, defaultMetadata); - getMetaDataDateInfo(metadata,
defaultMetadata); - getMetaDataResourceScope(metadata, defaultMetadata); - getMetaDataParentMetaDataTitle(metadata, defaultMetadata); - getMetaDataIdetifierCode(metadata, defaultMetadata); - getMetaDataStandard(metadata, defaultMetadata); - extractContent(xhtmlContentHandler, defaultMetadata); - } - catch(Exception e){ - e.printStackTrace(); - } - } - - private void extractContent(XHTMLContentHandler xhtmlContentHandler, DefaultMetadata defaultMetadata) throws SAXException{ - xhtmlContentHandler.startDocument(); - xhtmlContentHandler.newline(); - - xhtmlContentHandler.newline(); - ArrayList<Identification> identifications= (ArrayList<Identification>) defaultMetadata.getIdentificationInfo(); - for(Identification i:identifications) { - xhtmlContentHandler.startElement("h1"); - xhtmlContentHandler.characters(i.getCitation().getTitle().toString()); - xhtmlContentHandler.endElement("h1"); - xhtmlContentHandler.newline(); - - ArrayList<ResponsibleParty> responsiblePartyArrayList = (ArrayList<ResponsibleParty>) i.getCitation().getCitedResponsibleParties(); - for (ResponsibleParty r : responsiblePartyArrayList) { - xhtmlContentHandler.startElement("h3"); - xhtmlContentHandler.newline(); - xhtmlContentHandler.characters("CitedResponsiblePartyRole " + r.getRole().toString()); - xhtmlContentHandler.characters("CitedResponsiblePartyName " + r.getIndividualName().toString()); - xhtmlContentHandler.endElement("h3"); - xhtmlContentHandler.newline(); - } - - xhtmlContentHandler.startElement("p"); - xhtmlContentHandler.newline(); - xhtmlContentHandler.characters("IdentificationInfoAbstract " + i.getAbstract().toString()); - xhtmlContentHandler.endElement("p"); - xhtmlContentHandler.newline(); - Collection<Extent> extentList=((DefaultDataIdentification) i).getExtents(); - for(Extent e:extentList){ - ArrayList<GeographicExtent> geoElements= (ArrayList<GeographicExtent>) e.getGeographicElements(); - for(GeographicExtent g:geoElements) { - - if (g instanceof DefaultGeographicBoundingBox) { - xhtmlContentHandler.startElement("tr"); - xhtmlContentHandler.startElement("td"); -
xhtmlContentHandler.characters("GeographicElementWestBoundLongitude"); - xhtmlContentHandler.endElement("td"); - xhtmlContentHandler.startElement("td"); - xhtmlContentHandler.characters(String.valueOf(((DefaultGeographicBoundingBox) g).getWestBoundLongitude())); - xhtmlContentHandler.endElement("td"); - xhtmlContentHandler.endElement("tr"); - xhtmlContentHandler.startElement("tr"); - xhtmlContentHandler.startElement("td"); - xhtmlContentHandler.characters("GeographicElementEastBoundLongitude"); - xhtmlContentHandler.endElement("td"); - xhtmlContentHandler.startElement("td"); - xhtmlContentHandler.characters(String.valueOf(((DefaultGeographicBoundingBox) g).getEastBoundLongitude())); - xhtmlContentHandler.endElement("td"); - xhtmlContentHandler.endElement("tr"); - xhtmlContentHandler.startElement("tr"); - xhtmlContentHandler.startElement("td"); - xhtmlContentHandler.characters("GeographicElementNorthBoundLatitude"); - xhtmlContentHandler.endElement("td"); - xhtmlContentHandler.startElement("td"); - xhtmlContentHandler.characters(String.valueOf(((DefaultGeographicBoundingBox) g).getNorthBoundLatitude())); - xhtmlContentHandler.endElement("td"); - xhtmlContentHandler.endElement("tr"); - xhtmlContentHandler.startElement("tr"); - xhtmlContentHandler.startElement("td"); - xhtmlContentHandler.characters("GeographicElementSouthBoundLatitude"); - xhtmlContentHandler.endElement("td"); - xhtmlContentHandler.startElement("td"); - xhtmlContentHandler.characters(String.valueOf(((DefaultGeographicBoundingBox) g).getSouthBoundLatitude())); - xhtmlContentHandler.endElement("td"); - xhtmlContentHandler.endElement("tr"); - } - } - } - } - xhtmlContentHandler.newline(); - xhtmlContentHandler.endDocument(); - } - - private void getMetaDataCharacterSet(Metadata metadata, DefaultMetadata defaultMetaData){ - CheckedHashSet<Charset> charSetList= (CheckedHashSet<Charset>) defaultMetaData.getCharacterSets(); - for(Charset c:charSetList){ - metadata.add("CharacterSet",c.name()); - } - } - - - private void
getMetaDataContact(Metadata metadata, DefaultMetadata defaultMetaData){ - CheckedArrayList contactSet= (CheckedArrayList) defaultMetaData.getContacts(); - for(ResponsibleParty rparty:contactSet){ - if(rparty.getRole()!=null) - metadata.add("ContactRole",rparty.getRole().name()); - if(rparty.getOrganisationName()!=null) - metadata.add("ContactPartyName-",rparty.getOrganisationName().toString()); - } - } - - private void getMetaDataIdentificationInfo(Metadata metadata, DefaultMetadata defaultMetaData){ - ArrayList identifications= (ArrayList) defaultMetaData.getIdentificationInfo(); - for(Identification i:identifications){ - DefaultDataIdentification defaultDataIdentification= (DefaultDataIdentification) i; - if(i.getCitation()!=null && i.getCitation().getTitle()!=null) - metadata.add("IdentificationInfoCitationTitle ",i.getCitation().getTitle().toString()); - - ArrayList dateArrayList= (ArrayList) i.getCitation().getDates(); - for (CitationDate d:dateArrayList){ - if(d.getDateType()!=null) - metadata.add("CitationDate ",d.getDateType().name()+"-->"+d.getDate()); - } - ArrayList responsiblePartyArrayList= (ArrayList) i.getCitation().getCitedResponsibleParties(); - for(ResponsibleParty r:responsiblePartyArrayList){ - if(r.getRole()!=null) - metadata.add("CitedResponsiblePartyRole ",r.getRole().toString()); - if(r.getIndividualName()!=null) - metadata.add("CitedResponsiblePartyName ",r.getIndividualName().toString()); - if(r.getOrganisationName()!=null) - metadata.add("CitedResponsiblePartyOrganizationName ", r.getOrganisationName().toString()); - if(r.getPositionName()!=null) - metadata.add("CitedResponsiblePartyPositionName ",r.getPositionName().toString()); - - if(r.getContactInfo()!=null){ - for(String s:r.getContactInfo().getAddress().getElectronicMailAddresses()) { - metadata.add("CitedResponsiblePartyEMail ",s.toString()); - } - } - } - if(i.getAbstract()!=null) - metadata.add("IdentificationInfoAbstract ",i.getAbstract().toString()); - for(Progress 
p:i.getStatus()) { - metadata.add("IdentificationInfoStatus ",p.name()); - } - ArrayList formatArrayList= (ArrayList) i.getResourceFormats(); - for(Format f:formatArrayList){ - if(f.getName()!=null) - metadata.add("ResourceFormatSpecificationAlternativeTitle ",f.getName().toString()); - } - CheckedHashSet localeCheckedHashSet= (CheckedHashSet) defaultDataIdentification.getLanguages(); - for(Locale l:localeCheckedHashSet){ - metadata.add("IdentificationInfoLanguage-->",l.getDisplayLanguage(Locale.ENGLISH)); - } - CodeListSet categoryList= (CodeListSet) defaultDataIdentification.getTopicCategories(); - for(TopicCategory t:categoryList){ - metadata.add("IdentificationInfoTopicCategory-->",t.name()); - } - ArrayList keywordList= (ArrayList) i.getDescriptiveKeywords(); - int j=1; - for(Keywords k:keywordList){ - j++; - ArrayList stringList= (ArrayList) k.getKeywords(); - for(InternationalString s:stringList){ - metadata.add("Keywords "+j ,s.toString()); - } - if(k.getType()!=null) - metadata.add("KeywordsType "+j,k.getType().name()); - if(k.getThesaurusName()!=null && k.getThesaurusName().getTitle()!=null) - metadata.add("ThesaurusNameTitle "+j,k.getThesaurusName().getTitle().toString()); - if(k.getThesaurusName()!=null && k.getThesaurusName().getAlternateTitles()!=null) - metadata.add("ThesaurusNameAlternativeTitle "+j,k.getThesaurusName().getAlternateTitles().toString()); - - ArrayListcitationDates= (ArrayList) k.getThesaurusName().getDates(); - for(CitationDate cd:citationDates) { - if(cd.getDateType()!=null) - metadata.add("ThesaurusNameDate ",cd.getDateType().name() +"-->" + cd.getDate()); - } - } - ArrayList constraintList= (ArrayList) i.getResourceConstraints(); - - for(DefaultLegalConstraints c:constraintList){ - for(Restriction r:c.getAccessConstraints()){ - metadata.add("AccessContraints ",r.name()); - } - for(InternationalString s:c.getOtherConstraints()){ - metadata.add("OtherConstraints ",s.toString()); - } - for(Restriction r:c.getUseConstraints()) { - 
metadata.add("UserConstraints ",r.name()); - } - - } - Collection extentList=((DefaultDataIdentification) i).getExtents(); - for(Extent e:extentList){ - ArrayList geoElements= (ArrayList) e.getGeographicElements(); - for(GeographicExtent g:geoElements){ - - if(g instanceof DefaultGeographicDescription){ - if(((DefaultGeographicDescription) g).getGeographicIdentifier()!=null && ((DefaultGeographicDescription) g).getGeographicIdentifier().getCode()!=null ) - metadata.add("GeographicIdentifierCode ",((DefaultGeographicDescription) g).getGeographicIdentifier().getCode().toString()); - if(((DefaultGeographicDescription) g).getGeographicIdentifier()!=null && ((DefaultGeographicDescription) g).getGeographicIdentifier().getAuthority()!=null && ((DefaultGeographicDescription) g).getGeographicIdentifier().getAuthority().getTitle()!=null ) - metadata.add("GeographicIdentifierAuthorityTitle ",((DefaultGeographicDescription) g).getGeographicIdentifier().getAuthority().getTitle().toString()); - - for(InternationalString s:((DefaultGeographicDescription) g).getGeographicIdentifier().getAuthority().getAlternateTitles()) { - metadata.add("GeographicIdentifierAuthorityAlternativeTitle ",s.toString()); - } - for(CitationDate cd:((DefaultGeographicDescription) g).getGeographicIdentifier().getAuthority().getDates()){ - if(cd.getDateType()!=null && cd.getDate()!=null) - metadata.add("GeographicIdentifierAuthorityDate ",cd.getDateType().name()+" "+cd.getDate().toString()); - } - } - } - } - } - } - - private void getMetaDataDistributionInfo(Metadata metadata, DefaultMetadata defaultMetaData){ - Distribution distribution=defaultMetaData.getDistributionInfo(); - ArrayList distributionFormat= (ArrayList) distribution.getDistributionFormats(); - for(Format f:distributionFormat){ - if(f.getName()!=null) - metadata.add("DistributionFormatSpecificationAlternativeTitle ",f.getName().toString()); - } - ArrayList distributorList= (ArrayList) distribution.getDistributors(); - for(Distributor 
d:distributorList){ - if(d!=null && d.getDistributorContact()!=null && d.getDistributorContact().getRole()!=null) - metadata.add("Distributor Contact ",d.getDistributorContact().getRole().name()); - if(d!=null && d.getDistributorContact()!=null && d.getDistributorContact().getOrganisationName()!=null) - metadata.add("Distributor Organization Name ",d.getDistributorContact().getOrganisationName().toString()); - } - ArrayList transferOptionsList= (ArrayList) distribution.getTransferOptions(); - for(DigitalTransferOptions d:transferOptionsList){ - ArrayList onlineResourceList= (ArrayList) d.getOnLines(); - for(OnlineResource or:onlineResourceList){ - if(or.getLinkage()!=null) - metadata.add("TransferOptionsOnlineLinkage ",or.getLinkage().toString()); - if(or.getProtocol()!=null) - metadata.add("TransferOptionsOnlineProtocol ",or.getProtocol()); - if(or.getApplicationProfile()!=null) - metadata.add("TransferOptionsOnlineProfile ",or.getApplicationProfile()); - if(or.getName()!=null) - metadata.add("TransferOptionsOnlineName ",or.getName()); - if(or.getDescription()!=null) - metadata.add("TransferOptionsOnlineDescription ",or.getDescription().toString()); - if(or.getFunction()!=null) - metadata.add("TransferOptionsOnlineFunction ",or.getFunction().name()); - - } - } - } - - private void getMetaDataDateInfo(Metadata metadata, DefaultMetadata defaultMetaData){ - ArrayList citationDateList= (ArrayList) defaultMetaData.getDateInfo(); - for(CitationDate c:citationDateList){ - if(c.getDateType()!=null) - metadata.add("DateInfo ",c.getDateType().name()+" "+c.getDate()); - } - } - - private void getMetaDataResourceScope(Metadata metadata, DefaultMetadata defaultMetaData){ - ArrayList scopeList= (ArrayList) defaultMetaData.getMetadataScopes(); - for(DefaultMetadataScope d:scopeList){ - if(d.getResourceScope()!=null) - metadata.add("MetaDataResourceScope ",d.getResourceScope().name()); - } - } - - private void getMetaDataParentMetaDataTitle(Metadata metadata, DefaultMetadata 
defaultMetaData){
-        Citation parentMetaData=defaultMetaData.getParentMetadata();
-        if(parentMetaData!=null && parentMetaData.getTitle()!=null)
-            metadata.add("ParentMetaDataTitle",parentMetaData.getTitle().toString());
-    }
-
-    private void getMetaDataIdetifierCode(Metadata metadata, DefaultMetadata defaultMetaData){
-        Identifier identifier= defaultMetaData.getMetadataIdentifier();
-        if(identifier!=null)
-            metadata.add("MetaDataIdentifierCode",identifier.getCode());
-    }
-
-    private void getMetaDataStandard(Metadata metadata, DefaultMetadata defaultMetaData){
-        ArrayList citationList= (ArrayList) defaultMetaData.getMetadataStandards();
-        for(Citation c:citationList){
-            if(c.getTitle()!=null)
-                metadata.add("MetaDataStandardTitle ",c.getTitle().toString());
-            if(c.getEdition()!=null)
-                metadata.add("MetaDataStandardEdition ",c.getEdition().toString());
-        }
-    }
-}
diff --git a/tika-parsers/src/main/java/org/apache/tika/parser/grib/GribParser.java b/tika-parsers/src/main/java/org/apache/tika/parser/grib/GribParser.java
deleted file mode 100644
index 6f8756d..0000000
--- a/tika-parsers/src/main/java/org/apache/tika/parser/grib/GribParser.java
+++ /dev/null
@@ -1,121 +0,0 @@
-/*
- * Licensed to the Apache Software Foundation (ASF) under one or more
- * contributor license agreements. See the NOTICE file distributed with
- * this work for additional information regarding copyright ownership.
- * The ASF licenses this file to You under the Apache License, Version 2.0
- * (the "License"); you may not use this file except in compliance with
- * the License. You may obtain a copy of the License at
- *
- *     http://www.apache.org/licenses/LICENSE-2.0
- *
- * Unless required by applicable law or agreed to in writing, software
- * distributed under the License is distributed on an "AS IS" BASIS,
- * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
- * See the License for the specific language governing permissions and
- * limitations under the License.
- */
-
-package org.apache.tika.parser.grib;
-
-import java.io.IOException;
-import java.io.InputStream;
-import java.io.File;
-import java.util.Collections;
-import java.util.Set;
-import org.apache.tika.exception.TikaException;
-import org.apache.tika.io.TemporaryResources;
-import org.apache.tika.io.TikaInputStream;
-import org.apache.tika.metadata.Metadata;
-import org.apache.tika.metadata.Property;
-import org.apache.tika.metadata.TikaCoreProperties;
-import org.apache.tika.mime.MediaType;
-import org.apache.tika.parser.AbstractParser;
-import org.apache.tika.parser.ParseContext;
-import org.apache.tika.sax.XHTMLContentHandler;
-import org.xml.sax.ContentHandler;
-import org.xml.sax.SAXException;
-import ucar.nc2.Attribute;
-import ucar.nc2.Dimension;
-import ucar.nc2.NetcdfFile;
-import ucar.nc2.Variable;
-import ucar.nc2.dataset.NetcdfDataset;
-
-public class GribParser extends AbstractParser {
-
-    private static final long serialVersionUID = 7855458954474247655L;
-
-    public static final String GRIB_MIME_TYPE = "application/x-grib2";
-
-    private final Set<MediaType> SUPPORTED_TYPES =
-            Collections.singleton(MediaType.application("x-grib2"));
-
-    public Set<MediaType> getSupportedTypes(ParseContext context) {
-        return SUPPORTED_TYPES;
-    }
-
-    public void parse(InputStream stream, ContentHandler handler,
-                      Metadata metadata, ParseContext context) throws IOException,
-            SAXException, TikaException {
-
-        //Set MIME type as grib2
-        metadata.set(Metadata.CONTENT_TYPE, GRIB_MIME_TYPE);
-
-        TikaInputStream tis = TikaInputStream.get(stream, new TemporaryResources());
-        File gribFile = tis.getFile();
-
-        try {
-            NetcdfFile ncFile = NetcdfDataset.openFile(gribFile.getAbsolutePath(), null);
-
-            // first parse out the set of global attributes
-            for (Attribute attr : ncFile.getGlobalAttributes()) {
-                Property property = resolveMetadataKey(attr.getFullName());
-                if (attr.getDataType().isString()) {
-                    metadata.add(property, attr.getStringValue());
-                } else if (attr.getDataType().isNumeric()) {
-                    int value = attr.getNumericValue().intValue();
-                    metadata.add(property, String.valueOf(value));
-                }
-            }
-
-            XHTMLContentHandler xhtml = new XHTMLContentHandler(handler, metadata);
-
-            xhtml.startDocument();
-
-            xhtml.newline();
-            xhtml.startElement("ul");
-            xhtml.characters("dimensions:");
-            xhtml.newline();
-
-            for (Dimension dim : ncFile.getDimensions()){
-                xhtml.element("li", dim.getFullName() + "=" + String.valueOf(dim.getLength()) + ";");
-                xhtml.newline();
-            }
-
-            xhtml.startElement("ul");
-            xhtml.characters("variables:");
-            xhtml.newline();
-
-            for (Variable var : ncFile.getVariables()){
-                xhtml.element("p", String.valueOf(var.getDataType()) + var.getNameAndDimensions() + ";");
-                for(Attribute element : var.getAttributes()){
-                    xhtml.element("li", " :" + element + ";");
-                    xhtml.newline();
-                }
-            }
-            xhtml.endElement("ul");
-            xhtml.endElement("ul");
-            xhtml.endDocument();
-
-        } catch (IOException e) {
-            throw new TikaException("NetCDF parse error", e);
-        }
-    }
-
-    private Property resolveMetadataKey(String localName) {
-        if ("title".equals(localName)) {
-            return TikaCoreProperties.TITLE;
-        }
-        return Property.internalText(localName);
-    }
-
-}
\ No newline at end of file
diff --git a/tika-parsers/src/main/java/org/apache/tika/parser/hdf/HDFParser.java b/tika-parsers/src/main/java/org/apache/tika/parser/hdf/HDFParser.java
index 821493b..1cbaa71 100644
--- a/tika-parsers/src/main/java/org/apache/tika/parser/hdf/HDFParser.java
+++ b/tika-parsers/src/main/java/org/apache/tika/parser/hdf/HDFParser.java
@@ -24,8 +24,8 @@
 import java.util.Collections;
 import java.util.Set;
 
-import org.apache.commons.io.IOUtils;
 import org.apache.tika.exception.TikaException;
+import org.apache.tika.io.IOUtils;
 import org.apache.tika.metadata.Metadata;
 import org.apache.tika.mime.MediaType;
 import org.apache.tika.parser.AbstractParser;
@@ -101,15 +101,13 @@
             group = ncFile.getRootGroup();
         }
-        // get file type
-        met.set("File-Type-Description", ncFile.getFileTypeDescription());
         // unravel
its string attrs for (Attribute attribute : group.getAttributes()) { if (attribute.isString()) { - met.add(attribute.getFullName(), attribute.getStringValue()); + met.add(attribute.getName(), attribute.getStringValue()); } else { // try and cast its value to a string - met.add(attribute.getFullName(), String.valueOf(attribute + met.add(attribute.getName(), String.valueOf(attribute .getNumericValue())); } } diff --git a/tika-parsers/src/main/java/org/apache/tika/parser/html/BoilerpipeContentHandler.java b/tika-parsers/src/main/java/org/apache/tika/parser/html/BoilerpipeContentHandler.java index 4d5cc46..7d1c523 100644 --- a/tika-parsers/src/main/java/org/apache/tika/parser/html/BoilerpipeContentHandler.java +++ b/tika-parsers/src/main/java/org/apache/tika/parser/html/BoilerpipeContentHandler.java @@ -20,7 +20,14 @@ import java.util.ArrayList; import java.util.BitSet; import java.util.List; -import java.util.Locale; + +import org.apache.tika.metadata.Metadata; +import org.apache.tika.sax.WriteOutContentHandler; +import org.apache.tika.sax.XHTMLContentHandler; +import org.xml.sax.Attributes; +import org.xml.sax.ContentHandler; +import org.xml.sax.SAXException; +import org.xml.sax.helpers.AttributesImpl; import de.l3s.boilerpipe.BoilerpipeExtractor; import de.l3s.boilerpipe.BoilerpipeProcessingException; @@ -29,40 +36,102 @@ import de.l3s.boilerpipe.extractors.ArticleExtractor; import de.l3s.boilerpipe.extractors.DefaultExtractor; import de.l3s.boilerpipe.sax.BoilerpipeHTMLContentHandler; -import org.apache.tika.metadata.Metadata; -import org.apache.tika.sax.WriteOutContentHandler; -import org.apache.tika.sax.XHTMLContentHandler; -import org.xml.sax.Attributes; -import org.xml.sax.ContentHandler; -import org.xml.sax.SAXException; -import org.xml.sax.helpers.AttributesImpl; /** * Uses the boilerpipe * library to automatically extract the main content from a web page. - *

    + * * Use this as a {@link ContentHandler} object passed to * {@link HtmlParser#parse(java.io.InputStream, ContentHandler, Metadata, org.apache.tika.parser.ParseContext)} */ public class BoilerpipeContentHandler extends BoilerpipeHTMLContentHandler { + private static class RecordedElement { + public enum ElementType { + START, + END, + CONTINUE + } + + private String uri; + private String localName; + private String qName; + private Attributes attrs; + private List characters; + private ElementType elementType; + + public RecordedElement(String uri, String localName, String qName, Attributes attrs) { + this(uri, localName, qName, attrs, ElementType.START); + } + + public RecordedElement(String uri, String localName, String qName) { + this(uri, localName, qName, null, ElementType.END); + } + + public RecordedElement() { + this(null, null, null, null, ElementType.CONTINUE); + } + + protected RecordedElement(String uri, String localName, String qName, Attributes attrs, RecordedElement.ElementType elementType) { + this.uri = uri; + this.localName = localName; + this.qName = qName; + this.attrs = attrs; + this.elementType = elementType; + this.characters = new ArrayList(); + } + + @Override + public String toString() { + return String.format("<%s> of type %s", localName, elementType); + }; + + public String getUri() { + return uri; + } + + public String getLocalName() { + return localName; + } + + public String getQName() { + return qName; + } + + public Attributes getAttrs() { + return attrs; + } + + public List getCharacters() { + return characters; + } + + public RecordedElement.ElementType getElementType() { + return elementType; + } + } + /** * The newline character that gets inserted after block elements. 
*/ - private static final char[] NL = new char[]{'\n'}; + private static final char[] NL = new char[] { '\n' }; + private ContentHandler delegate; private BoilerpipeExtractor extractor; + private boolean includeMarkup; private boolean inHeader; private boolean inFooter; private int headerCharOffset; private List elements; private TextDocument td; + /** * Creates a new boilerpipe-based content extractor, using the * {@link DefaultExtractor} extraction rules and "delegate" as the content handler. * - * @param delegate The {@link ContentHandler} object + * @param delegate + * The {@link ContentHandler} object */ public BoilerpipeContentHandler(ContentHandler delegate) { this(delegate, DefaultExtractor.INSTANCE); @@ -83,8 +152,10 @@ * extraction rules. The extracted main content will be passed to the * content handler. * - * @param delegate The {@link ContentHandler} object - * @param extractor Extraction rules to use, e.g. {@link ArticleExtractor} + * @param delegate + * The {@link ContentHandler} object + * @param extractor + * Extraction rules to use, e.g. {@link ArticleExtractor} */ public BoilerpipeContentHandler(ContentHandler delegate, BoilerpipeExtractor extractor) { this.td = null; @@ -92,12 +163,12 @@ this.extractor = extractor; } + public void setIncludeMarkup(boolean includeMarkup) { + this.includeMarkup = includeMarkup; + } + public boolean isIncludeMarkup() { return includeMarkup; - } - - public void setIncludeMarkup(boolean includeMarkup) { - this.includeMarkup = includeMarkup; } /** @@ -122,15 +193,13 @@ if (includeMarkup) { elements = new ArrayList(); } - } + }; @Override public void startPrefixMapping(String prefix, String uri) throws SAXException { super.startPrefixMapping(prefix, uri); delegate.startPrefixMapping(prefix, uri); - } - - ; + }; @Override public void startElement(String uri, String localName, String qName, Attributes atts) throws SAXException { @@ -146,9 +215,7 @@ // This happens for the element, if we're not doing markup. 
delegate.startElement(uri, localName, qName, atts); } - } - - ; + }; @Override public void characters(char[] chars, int offset, int length) throws SAXException { @@ -166,9 +233,7 @@ System.arraycopy(chars, offset, characters, 0, length); element.getCharacters().add(characters); } - } - - ; + }; @Override public void endElement(String uri, String localName, String qName) throws SAXException { @@ -186,9 +251,7 @@ elements.add(new RecordedElement(uri, localName, qName)); elements.add(new RecordedElement()); } - } - - ; + }; @Override public void endDocument() throws SAXException { @@ -278,70 +341,4 @@ delegate.endDocument(); } - - ; - - private static class RecordedElement { - private String uri; - private String localName; - private String qName; - private Attributes attrs; - private List characters; - private ElementType elementType; - public RecordedElement(String uri, String localName, String qName, Attributes attrs) { - this(uri, localName, qName, attrs, ElementType.START); - } - - public RecordedElement(String uri, String localName, String qName) { - this(uri, localName, qName, null, ElementType.END); - } - - public RecordedElement() { - this(null, null, null, null, ElementType.CONTINUE); - } - - protected RecordedElement(String uri, String localName, String qName, Attributes attrs, RecordedElement.ElementType elementType) { - this.uri = uri; - this.localName = localName; - this.qName = qName; - this.attrs = attrs; - this.elementType = elementType; - this.characters = new ArrayList(); - } - - @Override - public String toString() { - return String.format(Locale.ROOT, "<%s> of type %s", localName, elementType); - } - - public String getUri() { - return uri; - } - - public String getLocalName() { - return localName; - } - - public String getQName() { - return qName; - } - - public Attributes getAttrs() { - return attrs; - } - - public List getCharacters() { - return characters; - } - - public RecordedElement.ElementType getElementType() { - return elementType; - } 
- - public enum ElementType { - START, - END, - CONTINUE - } - } } diff --git a/tika-parsers/src/main/java/org/apache/tika/parser/html/DefaultHtmlMapper.java b/tika-parsers/src/main/java/org/apache/tika/parser/html/DefaultHtmlMapper.java index 4217ac5..a93d776 100644 --- a/tika-parsers/src/main/java/org/apache/tika/parser/html/DefaultHtmlMapper.java +++ b/tika-parsers/src/main/java/org/apache/tika/parser/html/DefaultHtmlMapper.java @@ -29,10 +29,6 @@ @SuppressWarnings("serial") public class DefaultHtmlMapper implements HtmlMapper { - /** - * @since Apache Tika 0.8 - */ - public static final HtmlMapper INSTANCE = new DefaultHtmlMapper(); // Based on http://www.w3.org/TR/xhtml1/DTD/xhtml1-strict.dtd private static final Map SAFE_ELEMENTS = new HashMap() {{ put("H1", "h1"); @@ -46,7 +42,7 @@ put("PRE", "pre"); put("BLOCKQUOTE", "blockquote"); put("Q", "q"); - + put("UL", "ul"); put("OL", "ol"); put("MENU", "ul"); @@ -63,10 +59,10 @@ put("TD", "td"); put("ADDRESS", "address"); - + // TIKA-460 - add anchors put("A", "a"); - + // TIKA-463 - add additional elements that contain URLs (and their sub-elements) put("MAP", "map"); put("AREA", "area"); @@ -79,10 +75,12 @@ put("INS", "ins"); put("DEL", "del"); }}; + private static final Set DISCARDABLE_ELEMENTS = new HashSet() {{ add("STYLE"); add("SCRIPT"); }}; + // For information on tags & attributes, see: // http://www.w3.org/TR/2002/REC-xhtml1-20020801/dtds.html#a_dtd_XHTML-1.0-Strict // http://www.w3schools.com/TAGS/ @@ -94,17 +92,17 @@ put("link", attrSet("charset", "href", "hreflang", "type", "rel", "rev", "media")); put("map", attrSet("id", "class", "style", "title", "name")); put("area", attrSet("shape", "coords", "href", "nohref", "alt")); - put("object", attrSet("declare", "classid", "codebase", "data", "type", "codetype", "archive", "standby", "height", + put("object", attrSet("declare", "classid", "codebase", "data", "type", "codetype", "archive", "standby", "height", "width", "usemap", "name", "tabindex", "align", 
"border", "hspace", "vspace")); put("param", attrSet("id", "name", "value", "valuetype", "type")); put("blockquote", attrSet("cite")); put("ins", attrSet("cite", "datetime")); put("del", attrSet("cite", "datetime")); put("q", attrSet("cite")); - + // TODO - fill out this set. Include core, i18n, etc sets where appropriate. }}; - + private static Set attrSet(String... attrs) { Set result = new HashSet(); for (String attr : attrs) { @@ -112,14 +110,18 @@ } return result; } + + /** + * @since Apache Tika 0.8 + */ + public static final HtmlMapper INSTANCE = new DefaultHtmlMapper(); public String mapSafeElement(String name) { return SAFE_ELEMENTS.get(name); } - /** - * Normalizes an attribute name. Assumes that the element name - * is valid and normalized + /** Normalizes an attribute name. Assumes that the element name + * is valid and normalized */ public String mapSafeAttribute(String elementName, String attributeName) { Set safeAttrs = SAFE_ATTRIBUTES.get(elementName); @@ -129,7 +131,7 @@ return null; } } - + public boolean isDiscardElement(String name) { return DISCARDABLE_ELEMENTS.contains(name); } diff --git a/tika-parsers/src/main/java/org/apache/tika/parser/html/HtmlEncodingDetector.java b/tika-parsers/src/main/java/org/apache/tika/parser/html/HtmlEncodingDetector.java index edb014b..4787830 100644 --- a/tika-parsers/src/main/java/org/apache/tika/parser/html/HtmlEncodingDetector.java +++ b/tika-parsers/src/main/java/org/apache/tika/parser/html/HtmlEncodingDetector.java @@ -25,6 +25,7 @@ import org.apache.tika.detect.EncodingDetector; import org.apache.tika.metadata.Metadata; +import org.apache.tika.mime.MediaType; import org.apache.tika.utils.CharsetUtils; /** @@ -41,11 +42,11 @@ // TIKA-357 - use bigger buffer for meta tag sniffing (was 4K) private static final int META_TAG_BUFFER_SIZE = 8192; - + private static final Pattern HTTP_META_PATTERN = Pattern.compile( - "(?is)<\\s*meta\\s+([^<>]+)" - ); - + "(?is)<\\s*meta\\s+([^<>]+)" + ); + //this should match 
both the older: // //and @@ -57,9 +58,9 @@ //For a more general "not" matcher, try: //("(?is)charset\\s*=\\s*['\\\"]?\\s*([^<>\\s'\\\";]+)") private static final Pattern FLEXIBLE_CHARSET_ATTR_PATTERN = Pattern.compile( - ("(?is)charset\\s*=\\s*(?:['\\\"]\\s*)?([-_:\\.a-z0-9]+)") - ); - + ("(?is)charset\\s*=\\s*(?:['\\\"]\\s*)?([-_:\\.a-z0-9]+)") + ); + private static final Charset ASCII = Charset.forName("US-ASCII"); public Charset detect(InputStream input, Metadata metadata) @@ -81,27 +82,27 @@ // Interpret the head as ASCII and try to spot a meta tag with // a possible character encoding hint - + String head = ASCII.decode(ByteBuffer.wrap(buffer, 0, n)).toString(); Matcher equiv = HTTP_META_PATTERN.matcher(head); Matcher charsetMatcher = FLEXIBLE_CHARSET_ATTR_PATTERN.matcher(""); //iterate through meta tags while (equiv.find()) { - String attrs = equiv.group(1); - charsetMatcher.reset(attrs); - //iterate through charset= and return the first match - //that is valid - while (charsetMatcher.find()) { - String candCharset = charsetMatcher.group(1); - if (CharsetUtils.isSupported(candCharset)) { - try { - return CharsetUtils.forName(candCharset); - } catch (Exception e) { - //ignore - } - } - } + String attrs = equiv.group(1); + charsetMatcher.reset(attrs); + //iterate through charset= and return the first match + //that is valid + while (charsetMatcher.find()){ + String candCharset = charsetMatcher.group(1); + if (CharsetUtils.isSupported(candCharset)){ + try{ + return CharsetUtils.forName(candCharset); + } catch (Exception e){ + //ignore + } + } + } } return null; } diff --git a/tika-parsers/src/main/java/org/apache/tika/parser/html/HtmlHandler.java b/tika-parsers/src/main/java/org/apache/tika/parser/html/HtmlHandler.java index d5dfaa6..e1b21f2 100644 --- a/tika-parsers/src/main/java/org/apache/tika/parser/html/HtmlHandler.java +++ b/tika-parsers/src/main/java/org/apache/tika/parser/html/HtmlHandler.java @@ -39,17 +39,21 @@ // List of attributes that need to be 
resolved. private static final Set URI_ATTRIBUTES = - new HashSet(Arrays.asList("src", "href", "longdesc", "cite")); - private static final Pattern ICBM = - Pattern.compile("\\s*(-?\\d+\\.\\d+)[,\\s]+(-?\\d+\\.\\d+)\\s*"); + new HashSet(Arrays.asList("src", "href", "longdesc", "cite")); + private final HtmlMapper mapper; + private final XHTMLContentHandler xhtml; + private final Metadata metadata; + + private int bodyLevel = 0; + + private int discardLevel = 0; + + private int titleLevel = 0; + private final StringBuilder title = new StringBuilder(); - private int bodyLevel = 0; - private int discardLevel = 0; - private int titleLevel = 0; - private boolean isTitleSetToMetadata = false; private HtmlHandler( HtmlMapper mapper, XHTMLContentHandler xhtml, Metadata metadata) { @@ -134,6 +138,9 @@ title.setLength(0); } + private static final Pattern ICBM = + Pattern.compile("\\s*(-?\\d+\\.\\d+)[,\\s]+(-?\\d+\\.\\d+)\\s*"); + /** * Adds a metadata setting from the HTML to the Tika metadata * object. The name and value are normalized where possible. @@ -150,16 +157,15 @@ } else { metadata.set("ICBM", value); } - } else if (name.equalsIgnoreCase(Metadata.CONTENT_TYPE)) { - //don't overwrite Metadata.CONTENT_TYPE! 
+ } else if (name.equalsIgnoreCase(Metadata.CONTENT_TYPE)){ MediaType type = MediaType.parse(value); if (type != null) { - metadata.set(TikaCoreProperties.CONTENT_TYPE_HINT, type.toString()); - } else { - metadata.set(TikaCoreProperties.CONTENT_TYPE_HINT, value); + metadata.set(Metadata.CONTENT_TYPE, type.toString()); + } else { + metadata.set(Metadata.CONTENT_TYPE, value); } } else { - metadata.add(name, value); + metadata.set(name, value); } } @@ -199,7 +205,7 @@ newAttributes.setValue(att, codebase); } else if (isObject && ("data".equals(normAttrName) - || "classid".equals(normAttrName))) { + || "classid".equals(normAttrName))) { newAttributes.setValue( att, resolve(codebase, newAttributes.getValue(att))); @@ -232,9 +238,8 @@ if (titleLevel > 0) { titleLevel--; - if (titleLevel == 0 && !isTitleSetToMetadata) { + if (titleLevel == 0) { metadata.set(TikaCoreProperties.TITLE, title.toString().trim()); - isTitleSetToMetadata = true; } } if (bodyLevel > 0) { diff --git a/tika-parsers/src/main/java/org/apache/tika/parser/html/HtmlMapper.java b/tika-parsers/src/main/java/org/apache/tika/parser/html/HtmlMapper.java index 1ca7434..f4b9c73 100644 --- a/tika-parsers/src/main/java/org/apache/tika/parser/html/HtmlMapper.java +++ b/tika-parsers/src/main/java/org/apache/tika/parser/html/HtmlMapper.java @@ -37,7 +37,7 @@ * * @param name HTML element name (upper case) * @return XHTML element name (lower case), or - * null if the element is unsafe + * null if the element is unsafe */ String mapSafeElement(String name); @@ -47,22 +47,22 @@ * * @param name HTML element name (upper case) * @return true if content inside the named element - * should be ignored, false otherwise + * should be ignored, false otherwise */ boolean isDiscardElement(String name); - - + + /** * Maps "safe" HTML attribute names to semantic XHTML equivalents. 
If the * given attribute is unknown or deemed unsafe for inclusion in the parse * output, then this method returns null and the attribute - * will be ignored. This method assumes that the element name + * will be ignored. This method assumes that the element name * is valid and normalised. * - * @param elementName HTML element name (lower case) + * @param elementName HTML element name (lower case) * @param attributeName HTML attribute name (lower case) * @return XHTML attribute name (lower case), or - * null if the element is unsafe + * null if the element is unsafe */ String mapSafeAttribute(String elementName, String attributeName); diff --git a/tika-parsers/src/main/java/org/apache/tika/parser/html/HtmlParser.java b/tika-parsers/src/main/java/org/apache/tika/parser/html/HtmlParser.java index a9a8aa0..e940457 100644 --- a/tika-parsers/src/main/java/org/apache/tika/parser/html/HtmlParser.java +++ b/tika-parsers/src/main/java/org/apache/tika/parser/html/HtmlParser.java @@ -24,10 +24,10 @@ import java.util.HashSet; import java.util.Set; -import org.apache.commons.io.input.CloseShieldInputStream; import org.apache.tika.config.ServiceLoader; import org.apache.tika.detect.AutoDetectReader; import org.apache.tika.exception.TikaException; +import org.apache.tika.io.CloseShieldInputStream; import org.apache.tika.metadata.Metadata; import org.apache.tika.mime.MediaType; import org.apache.tika.parser.AbstractParser; @@ -44,21 +44,15 @@ */ public class HtmlParser extends AbstractParser { - /** - * Serial version UID - */ + /** Serial version UID */ private static final long serialVersionUID = 7895315240498733128L; - private static final MediaType XHTML = MediaType.application("xhtml+xml"); - private static final MediaType WAP_XHTML = MediaType.application("vnd.wap.xhtml+xml"); - private static final MediaType X_ASP = MediaType.application("x-asp"); - private static final Set SUPPORTED_TYPES = - Collections.unmodifiableSet(new HashSet(Arrays.asList( - MediaType.text("html"), 
- XHTML, - WAP_XHTML, - X_ASP))); + Collections.unmodifiableSet(new HashSet(Arrays.asList( + MediaType.text("html"), + MediaType.application("xhtml+xml"), + MediaType.application("vnd.wap.xhtml+xml"), + MediaType.application("x-asp")))); private static final ServiceLoader LOADER = new ServiceLoader(HtmlParser.class.getClassLoader()); @@ -67,7 +61,6 @@ * HTML schema singleton used to amortise the heavy instantiation time. */ private static final Schema HTML_SCHEMA = new HTMLSchema(); - public Set getSupportedTypes(ParseContext context) { return SUPPORTED_TYPES; @@ -78,22 +71,15 @@ Metadata metadata, ParseContext context) throws IOException, SAXException, TikaException { // Automatically detect the character encoding - try (AutoDetectReader reader = new AutoDetectReader(new CloseShieldInputStream(stream), - metadata,context.get(ServiceLoader.class, LOADER))) { + AutoDetectReader reader = new AutoDetectReader( + new CloseShieldInputStream(stream), metadata, + context.get(ServiceLoader.class, LOADER)); + try { Charset charset = reader.getCharset(); String previous = metadata.get(Metadata.CONTENT_TYPE); - MediaType contentType = null; if (previous == null || previous.startsWith("text/html")) { - contentType = new MediaType(MediaType.TEXT_HTML, charset); - } else if (previous.startsWith("application/xhtml+xml")) { - contentType = new MediaType(XHTML, charset); - } else if (previous.startsWith("application/vnd.wap.xhtml+xml")) { - contentType = new MediaType(WAP_XHTML, charset); - } else if (previous.startsWith("application/x-asp")) { - contentType = new MediaType(X_ASP, charset); - } - if (contentType != null) { - metadata.set(Metadata.CONTENT_TYPE, contentType.toString()); + MediaType type = new MediaType(MediaType.TEXT_HTML, charset); + metadata.set(Metadata.CONTENT_TYPE, type.toString()); } // deprecated, see TIKA-431 metadata.set(Metadata.CONTENT_ENCODING, charset.name()); @@ -120,6 +106,8 @@ new HtmlHandler(mapper, handler, metadata))); 
parser.parse(reader.asInputSource()); + } finally { + reader.close(); } } @@ -130,15 +118,15 @@ * will be ignored but the content inside it is still processed. See * the {@link #isDiscardElement(String)} method for a way to discard * the entire contents of an element. - *
    + *
    * Subclasses can override this method to customize the default mapping. * + * @deprecated Use the {@link HtmlMapper} mechanism to customize + * the HTML mapping. This method will be removed in Tika 1.0. + * @since Apache Tika 0.5 * @param name HTML element name (upper case) * @return XHTML element name (lower case), or - * null if the element is unsafe - * @since Apache Tika 0.5 - * @deprecated Use the {@link HtmlMapper} mechanism to customize - * the HTML mapping. This method will be removed in Tika 1.0. + * null if the element is unsafe */ protected String mapSafeElement(String name) { return DefaultHtmlMapper.INSTANCE.mapSafeElement(name); @@ -149,25 +137,25 @@ * discarded instead of including it in the parse output. Subclasses * can override this method to customize the set of discarded elements. * + * @deprecated Use the {@link HtmlMapper} mechanism to customize + * the HTML mapping. This method will be removed in Tika 1.0. + * @since Apache Tika 0.5 * @param name HTML element name (upper case) * @return true if content inside the named element - * should be ignored, false otherwise - * @since Apache Tika 0.5 - * @deprecated Use the {@link HtmlMapper} mechanism to customize - * the HTML mapping. This method will be removed in Tika 1.0. + * should be ignored, false otherwise */ protected boolean isDiscardElement(String name) { return DefaultHtmlMapper.INSTANCE.isDiscardElement(name); } /** - * @deprecated Use the {@link HtmlMapper} mechanism to customize - * the HTML mapping. This method will be removed in Tika 1.0. - */ + * @deprecated Use the {@link HtmlMapper} mechanism to customize + * the HTML mapping. This method will be removed in Tika 1.0. 
+ **/ public String mapSafeAttribute(String elementName, String attributeName) { - return DefaultHtmlMapper.INSTANCE.mapSafeAttribute(elementName, attributeName); - } - + return DefaultHtmlMapper.INSTANCE.mapSafeAttribute(elementName,attributeName) ; + } + /** * Adapter class that maintains backwards compatibility with the * protected HtmlParser methods. Making HtmlParser implement HtmlMapper @@ -175,19 +163,17 @@ * backwards compatibility with subclasses. * * @deprecated Use the {@link HtmlMapper} mechanism to customize - * the HTML mapping. This class will be removed in Tika 1.0. + * the HTML mapping. This class will be removed in Tika 1.0. */ private class HtmlParserMapper implements HtmlMapper { public String mapSafeElement(String name) { return HtmlParser.this.mapSafeElement(name); } - public boolean isDiscardElement(String name) { return HtmlParser.this.isDiscardElement(name); } - - public String mapSafeAttribute(String elementName, String attributeName) { - return HtmlParser.this.mapSafeAttribute(elementName, attributeName); + public String mapSafeAttribute(String elementName, String attributeName){ + return HtmlParser.this.mapSafeAttribute(elementName,attributeName); } } diff --git a/tika-parsers/src/main/java/org/apache/tika/parser/html/IdentityHtmlMapper.java b/tika-parsers/src/main/java/org/apache/tika/parser/html/IdentityHtmlMapper.java index da046aa..b4e22dd 100644 --- a/tika-parsers/src/main/java/org/apache/tika/parser/html/IdentityHtmlMapper.java +++ b/tika-parsers/src/main/java/org/apache/tika/parser/html/IdentityHtmlMapper.java @@ -21,7 +21,7 @@ /** * Alternative HTML mapping rules that pass the input HTML as-is without any * modifications. 
- * + * * @since Apache Tika 0.8 */ public class IdentityHtmlMapper implements HtmlMapper { diff --git a/tika-parsers/src/main/java/org/apache/tika/parser/html/XHTMLDowngradeHandler.java b/tika-parsers/src/main/java/org/apache/tika/parser/html/XHTMLDowngradeHandler.java index 221a87a..de25800 100644 --- a/tika-parsers/src/main/java/org/apache/tika/parser/html/XHTMLDowngradeHandler.java +++ b/tika-parsers/src/main/java/org/apache/tika/parser/html/XHTMLDowngradeHandler.java @@ -16,8 +16,9 @@ */ package org.apache.tika.parser.html; +import java.util.Locale; + import javax.xml.XMLConstants; -import java.util.Locale; import org.apache.tika.sax.ContentHandlerDecorator; import org.xml.sax.Attributes; diff --git a/tika-parsers/src/main/java/org/apache/tika/parser/image/BPGParser.java b/tika-parsers/src/main/java/org/apache/tika/parser/image/BPGParser.java deleted file mode 100644 index 2a48a55..0000000 --- a/tika-parsers/src/main/java/org/apache/tika/parser/image/BPGParser.java +++ /dev/null @@ -1,177 +0,0 @@ -/* - * Licensed to the Apache Software Foundation (ASF) under one or more - * contributor license agreements. See the NOTICE file distributed with - * this work for additional information regarding copyright ownership. - * The ASF licenses this file to You under the Apache License, Version 2.0 - * (the "License"); you may not use this file except in compliance with - * the License. You may obtain a copy of the License at - * - * http://www.apache.org/licenses/LICENSE-2.0 - * - * Unless required by applicable law or agreed to in writing, software - * distributed under the License is distributed on an "AS IS" BASIS, - * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. - * See the License for the specific language governing permissions and - * limitations under the License. 
- */ -package org.apache.tika.parser.image; - -import java.io.IOException; -import java.io.InputStream; -import java.util.Arrays; -import java.util.Collections; -import java.util.HashSet; -import java.util.Set; - -import org.apache.poi.util.IOUtils; -import org.apache.tika.exception.TikaException; -import org.apache.tika.io.EndianUtils; -import org.apache.tika.metadata.Metadata; -import org.apache.tika.metadata.Photoshop; -import org.apache.tika.metadata.TIFF; -import org.apache.tika.mime.MediaType; -import org.apache.tika.parser.AbstractParser; -import org.apache.tika.parser.ParseContext; -import org.apache.tika.sax.XHTMLContentHandler; -import org.xml.sax.ContentHandler; -import org.xml.sax.SAXException; - -/** - * Parser for the Better Portable Graphics )BPG) File Format. - *
    - * Documentation on the file format is available from - * http://bellard.org/bpg/bpg_spec.txt - */ -public class BPGParser extends AbstractParser { - protected static final int EXTENSION_TAG_EXIF = 1; - protected static final int EXTENSION_TAG_ICC_PROFILE = 2; - protected static final int EXTENSION_TAG_XMP = 3; - protected static final int EXTENSION_TAG_THUMBNAIL = 4; - private static final long serialVersionUID = -161736541253892772L; - private static final Set SUPPORTED_TYPES = - Collections.unmodifiableSet(new HashSet(Arrays.asList( - MediaType.image("x-bpg"), MediaType.image("bpg")))); - - public Set getSupportedTypes(ParseContext context) { - return SUPPORTED_TYPES; - } - - public void parse( - InputStream stream, ContentHandler handler, - Metadata metadata, ParseContext context) - throws IOException, SAXException, TikaException { - // Check for the magic header signature - byte[] signature = new byte[4]; - IOUtils.readFully(stream, signature); - if (signature[0] == (byte) 'B' && signature[1] == (byte) 'P' && - signature[2] == (byte) 'G' && signature[3] == (byte) 0xfb) { - // Good, signature found - } else { - throw new TikaException("BPG magic signature invalid"); - } - - // Grab and decode the first byte - int pdf = stream.read(); - - // Pixel format: Greyscale / 4:2:0 / 4:2:2 / 4:4:4 - int pixelFormat = pdf & 0x7; - // TODO Identify a suitable metadata key for this - - // Is there an alpha plane as well as a colour plane? 
- boolean hasAlphaPlane1 = (pdf & 0x8) == 0x8; - // TODO Identify a suitable metadata key for this+hasAlphaPlane2 - - // Bit depth minus 8 - int bitDepth = (pdf >> 4) + 8; - metadata.set(TIFF.BITS_PER_SAMPLE, Integer.toString(bitDepth)); - - // Grab and decode the second byte - int cer = stream.read(); - - // Colour Space: YCbCr / RGB / YCgCo / YCbCrK / CMYK - int colourSpace = cer & 0x15; - switch (colourSpace) { - case 0: - metadata.set(Photoshop.COLOR_MODE, "YCbCr Colour"); - break; - case 1: - metadata.set(Photoshop.COLOR_MODE, "RGB Colour"); - break; - case 2: - metadata.set(Photoshop.COLOR_MODE, "YCgCo Colour"); - break; - case 3: - metadata.set(Photoshop.COLOR_MODE, "YCbCrK Colour"); - break; - case 4: - metadata.set(Photoshop.COLOR_MODE, "CMYK Colour"); - break; - } - - // Are there extensions or not? - boolean hasExtensions = (cer & 16) == 16; - - // Is the Alpha Plane 2 flag set? - boolean hasAlphaPlane2 = (cer & 32) == 32; - - // cer then holds 2 more booleans - limited range, reserved - - // Width and height next - int width = (int) EndianUtils.readUE7(stream); - int height = (int) EndianUtils.readUE7(stream); - metadata.set(TIFF.IMAGE_LENGTH, height); - metadata.set(TIFF.IMAGE_WIDTH, width); - - // Picture Data length - EndianUtils.readUE7(stream); - - // Extension Data Length, if extensions present - long extensionDataLength = 0; - if (hasExtensions) - extensionDataLength = EndianUtils.readUE7(stream); - - // Alpha Data Length, if alpha used - long alphaDataLength = 0; - if (hasAlphaPlane1 || hasAlphaPlane2) - alphaDataLength = EndianUtils.readUE7(stream); - - // Extension Data - if (hasExtensions) { - long extensionsDataSeen = 0; - ImageMetadataExtractor metadataExtractor = - new ImageMetadataExtractor(metadata); - - while (extensionsDataSeen < extensionDataLength) { - int extensionType = (int) EndianUtils.readUE7(stream); - int extensionLength = (int) EndianUtils.readUE7(stream); - switch (extensionType) { - case EXTENSION_TAG_EXIF: - 
metadataExtractor.parseRawExif(stream, extensionLength, true); - break; - case EXTENSION_TAG_XMP: - handleXMP(stream, extensionLength, metadataExtractor); - break; - default: - stream.skip(extensionLength); - } - extensionsDataSeen += extensionLength; - } - } - - // HEVC Header + Data - // Alpha HEVC Header + Data - // We can't do anything with these parts - - // We don't have any helpful text, sorry... - XHTMLContentHandler xhtml = new XHTMLContentHandler(handler, metadata); - xhtml.startDocument(); - xhtml.endDocument(); - } - - protected void handleXMP(InputStream stream, int xmpLength, - ImageMetadataExtractor extractor) throws IOException, TikaException, SAXException { - byte[] xmp = new byte[xmpLength]; - IOUtils.readFully(stream, xmp); - extractor.parseRawXMP(xmp); - } -} diff --git a/tika-parsers/src/main/java/org/apache/tika/parser/image/ImageMetadataExtractor.java b/tika-parsers/src/main/java/org/apache/tika/parser/image/ImageMetadataExtractor.java index dd732f4..2a0aa7a 100644 --- a/tika-parsers/src/main/java/org/apache/tika/parser/image/ImageMetadataExtractor.java +++ b/tika-parsers/src/main/java/org/apache/tika/parser/image/ImageMetadataExtractor.java @@ -18,7 +18,6 @@ import java.io.File; import java.io.IOException; -import java.io.InputStream; import java.text.DecimalFormat; import java.text.DecimalFormatSymbols; import java.text.SimpleDateFormat; @@ -28,62 +27,55 @@ import java.util.regex.Matcher; import java.util.regex.Pattern; +import org.apache.tika.exception.TikaException; +import org.apache.tika.metadata.IPTC; +import org.apache.tika.metadata.Metadata; +import org.apache.tika.metadata.Property; +import org.apache.tika.metadata.TikaCoreProperties; +import org.xml.sax.SAXException; + import com.drew.imaging.jpeg.JpegMetadataReader; import com.drew.imaging.jpeg.JpegProcessingException; -import com.drew.imaging.riff.RiffProcessingException; import com.drew.imaging.tiff.TiffMetadataReader; -import com.drew.imaging.tiff.TiffProcessingException; 
-import com.drew.imaging.webp.WebpMetadataReader; -import com.drew.lang.ByteArrayReader; import com.drew.lang.GeoLocation; import com.drew.lang.Rational; import com.drew.metadata.Directory; import com.drew.metadata.MetadataException; import com.drew.metadata.Tag; import com.drew.metadata.exif.ExifIFD0Directory; -import com.drew.metadata.exif.ExifReader; import com.drew.metadata.exif.ExifSubIFDDirectory; import com.drew.metadata.exif.ExifThumbnailDirectory; import com.drew.metadata.exif.GpsDirectory; import com.drew.metadata.iptc.IptcDirectory; import com.drew.metadata.jpeg.JpegCommentDirectory; import com.drew.metadata.jpeg.JpegDirectory; -import com.drew.metadata.xmp.XmpReader; -import org.apache.poi.util.IOUtils; -import org.apache.tika.exception.TikaException; -import org.apache.tika.metadata.IPTC; -import org.apache.tika.metadata.Metadata; -import org.apache.tika.metadata.Property; -import org.apache.tika.metadata.TikaCoreProperties; -import org.xml.sax.SAXException; /** * Uses the Metadata Extractor library * to read EXIF and IPTC image metadata and map to Tika fields. - *
    + * * As of 2.4.0 the library supports jpeg and tiff. - * As of 2.8.0 the library supports webp. */ public class ImageMetadataExtractor { - private static final String GEO_DECIMAL_FORMAT_STRING = "#.######"; // 6 dp seems to be reasonable private final Metadata metadata; private DirectoryHandler[] handlers; + private static final String GEO_DECIMAL_FORMAT_STRING = "#.######"; // 6 dp seems to be reasonable /** * @param metadata to extract to, using default directory handlers */ public ImageMetadataExtractor(Metadata metadata) { this(metadata, - new CopyUnknownFieldsHandler(), - new JpegCommentHandler(), - new ExifHandler(), - new DimensionsHandler(), - new GeotagHandler(), - new IptcHandler() + new CopyUnknownFieldsHandler(), + new JpegCommentHandler(), + new ExifHandler(), + new DimensionsHandler(), + new GeotagHandler(), + new IptcHandler() ); } - + /** * @param metadata to extract to * @param handlers handlers in order, note that handlers may override values from earlier handlers @@ -91,15 +83,6 @@ public ImageMetadataExtractor(Metadata metadata, DirectoryHandler... 
handlers) { this.metadata = metadata; this.handlers = handlers; - } - - private static String trimPixels(String s) { - //if height/width appears as "100 pixels", trim " pixels" - if (s != null) { - int i = s.lastIndexOf(" pixels"); - s = s.substring(0, i); - } - return s; } public void parseJpeg(File file) @@ -121,92 +104,30 @@ handle(tiffMetadata); } catch (MetadataException e) { throw new TikaException("Can't read TIFF metadata", e); - } catch (TiffProcessingException e) { - throw new TikaException("Can't read TIFF metadata", e); - } - } - - public void parseWebP(File file) throws IOException, TikaException { - - try { - com.drew.metadata.Metadata webPMetadata = new com.drew.metadata.Metadata(); - webPMetadata = WebpMetadataReader.readMetadata(file); - handle(webPMetadata); - } catch (IOException e) { - throw e; - } catch (RiffProcessingException e) { - throw new TikaException("Can't process Riff data", e); - } catch (MetadataException e) { - throw new TikaException("Can't process Riff data", e); - } - } - - public void parseRawExif(InputStream stream, int length, boolean needsExifHeader) - throws IOException, SAXException, TikaException { - byte[] exif; - if (needsExifHeader) { - exif = new byte[length + 6]; - exif[0] = (byte) 'E'; - exif[1] = (byte) 'x'; - exif[2] = (byte) 'i'; - exif[3] = (byte) 'f'; - IOUtils.readFully(stream, exif, 6, length); - } else { - exif = new byte[length]; - IOUtils.readFully(stream, exif, 0, length); - } - parseRawExif(exif); - } - - public void parseRawExif(byte[] exifData) - throws IOException, SAXException, TikaException { - com.drew.metadata.Metadata metadata = new com.drew.metadata.Metadata(); - ExifReader reader = new ExifReader(); - reader.extract(new ByteArrayReader(exifData), metadata, ExifReader.JPEG_SEGMENT_PREAMBLE.length()); - - try { - handle(metadata); - } catch (MetadataException e) { - throw new TikaException("Can't process the EXIF Data", e); - } - } - - public void parseRawXMP(byte[] xmpData) - throws IOException, 
SAXException, TikaException { - com.drew.metadata.Metadata metadata = new com.drew.metadata.Metadata(); - XmpReader reader = new XmpReader(); - reader.extract(xmpData, metadata); - - try { - handle(metadata); - } catch (MetadataException e) { - throw new TikaException("Can't process the XMP Data", e); } } /** * Copies extracted tags to tika metadata using registered handlers. - * * @param metadataExtractor Tag directories from a Metadata Extractor "reader" * @throws MetadataException This method does not handle exceptions from Metadata Extractor */ - protected void handle(com.drew.metadata.Metadata metadataExtractor) + protected void handle(com.drew.metadata.Metadata metadataExtractor) throws MetadataException { handle(metadataExtractor.getDirectories().iterator()); } /** * Copies extracted tags to tika metadata using registered handlers. - * * @param directories Metadata Extractor {@link com.drew.metadata.Directory} instances. * @throws MetadataException This method does not handle exceptions from Metadata Extractor - */ + */ protected void handle(Iterator directories) throws MetadataException { while (directories.hasNext()) { Directory directory = directories.next(); - for (DirectoryHandler handler : handlers) { - if (handler.supports(directory.getClass())) { - handler.handle(directory, metadata); + for (int i = 0; i < handlers.length; i++) { + if (handlers[i].supports(directory.getClass())) { + handlers[i].handle(directory, metadata); } } } @@ -221,13 +142,12 @@ * @return true if the directory type is supported by this handler */ boolean supports(Class directoryType); - /** * @param directory extracted tags - * @param metadata current tika metadata + * @param metadata current tika metadata * @throws MetadataException typically field extraction error, aborts all further extraction */ - void handle(Directory directory, Metadata metadata) + void handle(Directory directory, Metadata metadata) throws MetadataException; } @@ -239,17 +159,18 @@ public boolean 
supports(Class directoryType) { return true; } - public void handle(Directory directory, Metadata metadata) throws MetadataException { if (directory.getTags() != null) { - for (Tag tag : directory.getTags()) { + Iterator tags = directory.getTags().iterator(); + while (tags.hasNext()) { + Tag tag = (Tag) tags.next(); metadata.set(tag.getTagName(), tag.getDescription()); } } } - } - + } + /** * Copies all fields regardless of directory, if the tag name * is not identical to a known Metadata field name. @@ -259,89 +180,80 @@ public boolean supports(Class directoryType) { return true; } - public void handle(Directory directory, Metadata metadata) throws MetadataException { if (directory.getTags() != null) { - for (Tag tag : directory.getTags()) { + Iterator tags = directory.getTags().iterator(); + while (tags.hasNext()) { + Tag tag = (Tag) tags.next(); String name = tag.getTagName(); if (!MetadataFields.isMetadataField(name) && tag.getDescription() != null) { - String value = tag.getDescription().trim(); - if (Boolean.TRUE.toString().equalsIgnoreCase(value)) { - value = Boolean.TRUE.toString(); - } else if (Boolean.FALSE.toString().equalsIgnoreCase(value)) { - value = Boolean.FALSE.toString(); - } - metadata.set(name, value); + String value = tag.getDescription().trim(); + if (Boolean.TRUE.toString().equalsIgnoreCase(value)) { + value = Boolean.TRUE.toString(); + } else if (Boolean.FALSE.toString().equalsIgnoreCase(value)) { + value = Boolean.FALSE.toString(); + } + metadata.set(name, value); } } } } } - + /** * Basic image properties for TIFF and JPEG, at least. 
*/ static class DimensionsHandler implements DirectoryHandler { private final Pattern LEADING_NUMBERS = Pattern.compile("(\\d+)\\s*.*"); - - public boolean supports(Class directoryType) { - return directoryType == JpegDirectory.class || - directoryType == ExifSubIFDDirectory.class || - directoryType == ExifThumbnailDirectory.class || - directoryType == ExifIFD0Directory.class; - } - + public boolean supports(Class directoryType) { + return directoryType == JpegDirectory.class || + directoryType == ExifSubIFDDirectory.class || + directoryType == ExifThumbnailDirectory.class || + directoryType == ExifIFD0Directory.class; + } public void handle(Directory directory, Metadata metadata) throws MetadataException { // The test TIFF has width and height stored as follows according to exiv2 //Exif.Image.ImageWidth Short 1 100 //Exif.Image.ImageLength Short 1 75 // and the values are found in "Thumbnail Image Width" (and Height) from Metadata Extractor - set(directory, metadata, JpegDirectory.TAG_IMAGE_WIDTH, Metadata.IMAGE_WIDTH); - set(directory, metadata, JpegDirectory.TAG_IMAGE_HEIGHT, Metadata.IMAGE_LENGTH); + set(directory, metadata, ExifThumbnailDirectory.TAG_THUMBNAIL_IMAGE_WIDTH, Metadata.IMAGE_WIDTH); + set(directory, metadata, JpegDirectory.TAG_JPEG_IMAGE_WIDTH, Metadata.IMAGE_WIDTH); + set(directory, metadata, ExifThumbnailDirectory.TAG_THUMBNAIL_IMAGE_HEIGHT, Metadata.IMAGE_LENGTH); + set(directory, metadata, JpegDirectory.TAG_JPEG_IMAGE_HEIGHT, Metadata.IMAGE_LENGTH); // Bits per sample, two methods of extracting, exif overrides jpeg - set(directory, metadata, JpegDirectory.TAG_DATA_PRECISION, Metadata.BITS_PER_SAMPLE); + set(directory, metadata, JpegDirectory.TAG_JPEG_DATA_PRECISION, Metadata.BITS_PER_SAMPLE); set(directory, metadata, ExifSubIFDDirectory.TAG_BITS_PER_SAMPLE, Metadata.BITS_PER_SAMPLE); // Straightforward set(directory, metadata, ExifSubIFDDirectory.TAG_SAMPLES_PER_PIXEL, Metadata.SAMPLES_PER_PIXEL); } - private void set(Directory directory, 
Metadata metadata, int extractTag, Property metadataField) { if (directory.containsTag(extractTag)) { Matcher m = LEADING_NUMBERS.matcher(directory.getString(extractTag)); - if (m.matches()) { + if(m.matches()) { metadata.set(metadataField, m.group(1)); } } } } - + static class JpegCommentHandler implements DirectoryHandler { public boolean supports(Class directoryType) { return directoryType == JpegCommentDirectory.class; } - public void handle(Directory directory, Metadata metadata) throws MetadataException { - if (directory.containsTag(JpegCommentDirectory.TAG_COMMENT)) { - metadata.add(TikaCoreProperties.COMMENTS, directory.getString(JpegCommentDirectory.TAG_COMMENT)); - } - } - } - + if (directory.containsTag(JpegCommentDirectory.TAG_JPEG_COMMENT)) { + metadata.add(TikaCoreProperties.COMMENTS, directory.getString(JpegCommentDirectory.TAG_JPEG_COMMENT)); + } + } + } + static class ExifHandler implements DirectoryHandler { - // There's a new ExifHandler for each file processed, so this is thread safe - private static final ThreadLocal DATE_UNSPECIFIED_TZ = new ThreadLocal() { - @Override - protected SimpleDateFormat initialValue() { - return new SimpleDateFormat("yyyy-MM-dd'T'HH:mm:ss", Locale.US); - } - }; - - public boolean supports(Class directoryType) { - return directoryType == ExifIFD0Directory.class || + private static final SimpleDateFormat DATE_UNSPECIFIED_TZ = new SimpleDateFormat("yyyy-MM-dd'T'HH:mm:ss"); + public boolean supports(Class directoryType) { + return directoryType == ExifIFD0Directory.class || directoryType == ExifSubIFDDirectory.class; } - public void handle(Directory directory, Metadata metadata) { try { handleDateTags(directory, metadata); @@ -351,7 +263,6 @@ // ignore date parse errors and proceed with other tags } } - /** * EXIF may contain image description, although with undefined encoding. * Use IPTC for other annotation fields, and XMP for unicode support. 
@@ -359,107 +270,105 @@ public void handleCommentTags(Directory directory, Metadata metadata) { if (metadata.get(TikaCoreProperties.DESCRIPTION) == null && directory.containsTag(ExifIFD0Directory.TAG_IMAGE_DESCRIPTION)) { - metadata.set(TikaCoreProperties.DESCRIPTION, + metadata.set(TikaCoreProperties.DESCRIPTION, directory.getString(ExifIFD0Directory.TAG_IMAGE_DESCRIPTION)); } } - /** * Maps common TIFF and EXIF tags onto the Tika - * TIFF image metadata namespace. - */ + * TIFF image metadata namespace. + */ public void handlePhotoTags(Directory directory, Metadata metadata) { - if (directory.containsTag(ExifSubIFDDirectory.TAG_EXPOSURE_TIME)) { - Object exposure = directory.getObject(ExifSubIFDDirectory.TAG_EXPOSURE_TIME); - if (exposure instanceof Rational) { - metadata.set(Metadata.EXPOSURE_TIME, ((Rational) exposure).doubleValue()); - } else { - metadata.set(Metadata.EXPOSURE_TIME, directory.getString(ExifSubIFDDirectory.TAG_EXPOSURE_TIME)); - } - } - - if (directory.containsTag(ExifSubIFDDirectory.TAG_FLASH)) { - String flash = directory.getDescription(ExifSubIFDDirectory.TAG_FLASH); - if (flash.contains("Flash fired")) { - metadata.set(Metadata.FLASH_FIRED, Boolean.TRUE.toString()); - } else if (flash.contains("Flash did not fire")) { - metadata.set(Metadata.FLASH_FIRED, Boolean.FALSE.toString()); - } else { - metadata.set(Metadata.FLASH_FIRED, flash); - } - } - - if (directory.containsTag(ExifSubIFDDirectory.TAG_FNUMBER)) { - Object fnumber = directory.getObject(ExifSubIFDDirectory.TAG_FNUMBER); - if (fnumber instanceof Rational) { - metadata.set(Metadata.F_NUMBER, ((Rational) fnumber).doubleValue()); - } else { - metadata.set(Metadata.F_NUMBER, directory.getString(ExifSubIFDDirectory.TAG_FNUMBER)); - } - } - - if (directory.containsTag(ExifSubIFDDirectory.TAG_FOCAL_LENGTH)) { - Object length = directory.getObject(ExifSubIFDDirectory.TAG_FOCAL_LENGTH); - if (length instanceof Rational) { - metadata.set(Metadata.FOCAL_LENGTH, ((Rational) 
length).doubleValue()); - } else { - metadata.set(Metadata.FOCAL_LENGTH, directory.getString(ExifSubIFDDirectory.TAG_FOCAL_LENGTH)); - } - } - - if (directory.containsTag(ExifSubIFDDirectory.TAG_ISO_EQUIVALENT)) { - metadata.set(Metadata.ISO_SPEED_RATINGS, directory.getString(ExifSubIFDDirectory.TAG_ISO_EQUIVALENT)); - } - - if (directory.containsTag(ExifIFD0Directory.TAG_MAKE)) { - metadata.set(Metadata.EQUIPMENT_MAKE, directory.getString(ExifIFD0Directory.TAG_MAKE)); - } - if (directory.containsTag(ExifIFD0Directory.TAG_MODEL)) { - metadata.set(Metadata.EQUIPMENT_MODEL, directory.getString(ExifIFD0Directory.TAG_MODEL)); - } - - if (directory.containsTag(ExifIFD0Directory.TAG_ORIENTATION)) { - Object length = directory.getObject(ExifIFD0Directory.TAG_ORIENTATION); - if (length instanceof Integer) { - metadata.set(Metadata.ORIENTATION, Integer.toString((Integer) length)); - } else { - metadata.set(Metadata.ORIENTATION, directory.getString(ExifIFD0Directory.TAG_ORIENTATION)); - } - } - - if (directory.containsTag(ExifIFD0Directory.TAG_SOFTWARE)) { - metadata.set(Metadata.SOFTWARE, directory.getString(ExifIFD0Directory.TAG_SOFTWARE)); - } - - if (directory.containsTag(ExifIFD0Directory.TAG_X_RESOLUTION)) { - Object resolution = directory.getObject(ExifIFD0Directory.TAG_X_RESOLUTION); - if (resolution instanceof Rational) { - metadata.set(Metadata.RESOLUTION_HORIZONTAL, ((Rational) resolution).doubleValue()); - } else { - metadata.set(Metadata.RESOLUTION_HORIZONTAL, directory.getString(ExifIFD0Directory.TAG_X_RESOLUTION)); - } - } - if (directory.containsTag(ExifIFD0Directory.TAG_Y_RESOLUTION)) { - Object resolution = directory.getObject(ExifIFD0Directory.TAG_Y_RESOLUTION); - if (resolution instanceof Rational) { - metadata.set(Metadata.RESOLUTION_VERTICAL, ((Rational) resolution).doubleValue()); - } else { - metadata.set(Metadata.RESOLUTION_VERTICAL, directory.getString(ExifIFD0Directory.TAG_Y_RESOLUTION)); - } - } - if 
(directory.containsTag(ExifIFD0Directory.TAG_RESOLUTION_UNIT)) { - metadata.set(Metadata.RESOLUTION_UNIT, directory.getDescription(ExifIFD0Directory.TAG_RESOLUTION_UNIT)); - } - if (directory.containsTag(ExifThumbnailDirectory.TAG_IMAGE_WIDTH)) { - metadata.set(Metadata.IMAGE_WIDTH, - trimPixels(directory.getDescription(ExifThumbnailDirectory.TAG_IMAGE_WIDTH))); - } - if (directory.containsTag(ExifThumbnailDirectory.TAG_IMAGE_HEIGHT)) { - metadata.set(Metadata.IMAGE_LENGTH, - trimPixels(directory.getDescription(ExifThumbnailDirectory.TAG_IMAGE_HEIGHT))); - } - } - + if(directory.containsTag(ExifSubIFDDirectory.TAG_EXPOSURE_TIME)) { + Object exposure = directory.getObject(ExifSubIFDDirectory.TAG_EXPOSURE_TIME); + if(exposure instanceof Rational) { + metadata.set(Metadata.EXPOSURE_TIME, ((Rational)exposure).doubleValue()); + } else { + metadata.set(Metadata.EXPOSURE_TIME, directory.getString(ExifSubIFDDirectory.TAG_EXPOSURE_TIME)); + } + } + + if(directory.containsTag(ExifSubIFDDirectory.TAG_FLASH)) { + String flash = directory.getDescription(ExifSubIFDDirectory.TAG_FLASH); + if(flash.indexOf("Flash fired") > -1) { + metadata.set(Metadata.FLASH_FIRED, Boolean.TRUE.toString()); + } + else if(flash.indexOf("Flash did not fire") > -1) { + metadata.set(Metadata.FLASH_FIRED, Boolean.FALSE.toString()); + } + else { + metadata.set(Metadata.FLASH_FIRED, flash); + } + } + + if(directory.containsTag(ExifSubIFDDirectory.TAG_FNUMBER)) { + Object fnumber = directory.getObject(ExifSubIFDDirectory.TAG_FNUMBER); + if(fnumber instanceof Rational) { + metadata.set(Metadata.F_NUMBER, ((Rational)fnumber).doubleValue()); + } else { + metadata.set(Metadata.F_NUMBER, directory.getString(ExifSubIFDDirectory.TAG_FNUMBER)); + } + } + + if(directory.containsTag(ExifSubIFDDirectory.TAG_FOCAL_LENGTH)) { + Object length = directory.getObject(ExifSubIFDDirectory.TAG_FOCAL_LENGTH); + if(length instanceof Rational) { + metadata.set(Metadata.FOCAL_LENGTH, ((Rational)length).doubleValue()); + } else { 
+ metadata.set(Metadata.FOCAL_LENGTH, directory.getString(ExifSubIFDDirectory.TAG_FOCAL_LENGTH)); + } + } + + if(directory.containsTag(ExifSubIFDDirectory.TAG_ISO_EQUIVALENT)) { + metadata.set(Metadata.ISO_SPEED_RATINGS, directory.getString(ExifSubIFDDirectory.TAG_ISO_EQUIVALENT)); + } + + if(directory.containsTag(ExifIFD0Directory.TAG_MAKE)) { + metadata.set(Metadata.EQUIPMENT_MAKE, directory.getString(ExifIFD0Directory.TAG_MAKE)); + } + if(directory.containsTag(ExifIFD0Directory.TAG_MODEL)) { + metadata.set(Metadata.EQUIPMENT_MODEL, directory.getString(ExifIFD0Directory.TAG_MODEL)); + } + + if(directory.containsTag(ExifIFD0Directory.TAG_ORIENTATION)) { + Object length = directory.getObject(ExifIFD0Directory.TAG_ORIENTATION); + if(length instanceof Integer) { + metadata.set(Metadata.ORIENTATION, Integer.toString( ((Integer)length).intValue() )); + } else { + metadata.set(Metadata.ORIENTATION, directory.getString(ExifIFD0Directory.TAG_ORIENTATION)); + } + } + + if(directory.containsTag(ExifIFD0Directory.TAG_SOFTWARE)) { + metadata.set(Metadata.SOFTWARE, directory.getString(ExifIFD0Directory.TAG_SOFTWARE)); + } + + if(directory.containsTag(ExifIFD0Directory.TAG_X_RESOLUTION)) { + Object resolution = directory.getObject(ExifIFD0Directory.TAG_X_RESOLUTION); + if(resolution instanceof Rational) { + metadata.set(Metadata.RESOLUTION_HORIZONTAL, ((Rational)resolution).doubleValue()); + } else { + metadata.set(Metadata.RESOLUTION_HORIZONTAL, directory.getString(ExifIFD0Directory.TAG_X_RESOLUTION)); + } + } + if(directory.containsTag(ExifIFD0Directory.TAG_Y_RESOLUTION)) { + Object resolution = directory.getObject(ExifIFD0Directory.TAG_Y_RESOLUTION); + if(resolution instanceof Rational) { + metadata.set(Metadata.RESOLUTION_VERTICAL, ((Rational)resolution).doubleValue()); + } else { + metadata.set(Metadata.RESOLUTION_VERTICAL, directory.getString(ExifIFD0Directory.TAG_Y_RESOLUTION)); + } + } + if(directory.containsTag(ExifIFD0Directory.TAG_RESOLUTION_UNIT)) { + 
metadata.set(Metadata.RESOLUTION_UNIT, directory.getDescription(ExifIFD0Directory.TAG_RESOLUTION_UNIT)); + } + if(directory.containsTag(ExifThumbnailDirectory.TAG_THUMBNAIL_IMAGE_WIDTH)) { + metadata.set(Metadata.IMAGE_WIDTH, directory.getDescription(ExifThumbnailDirectory.TAG_THUMBNAIL_IMAGE_WIDTH)); + } + if(directory.containsTag(ExifThumbnailDirectory.TAG_THUMBNAIL_IMAGE_HEIGHT)) { + metadata.set(Metadata.IMAGE_LENGTH, directory.getDescription(ExifThumbnailDirectory.TAG_THUMBNAIL_IMAGE_HEIGHT)); + } + } /** * Maps exif dates to metadata fields. */ @@ -472,7 +381,7 @@ // Unless we have GPS time we don't know the time zone so date must be set // as ISO 8601 datetime without timezone suffix (no Z or +/-) if (original != null) { - String datetimeNoTimeZone = DATE_UNSPECIFIED_TZ.get().format(original); // Same time zone as Metadata Extractor uses + String datetimeNoTimeZone = DATE_UNSPECIFIED_TZ.format(original); // Same time zone as Metadata Extractor uses metadata.set(TikaCoreProperties.CREATED, datetimeNoTimeZone); metadata.set(Metadata.ORIGINAL_DATE, datetimeNoTimeZone); } @@ -480,7 +389,7 @@ if (directory.containsTag(ExifIFD0Directory.TAG_DATETIME)) { Date datetime = directory.getDate(ExifIFD0Directory.TAG_DATETIME); if (datetime != null) { - String datetimeNoTimeZone = DATE_UNSPECIFIED_TZ.get().format(datetime); + String datetimeNoTimeZone = DATE_UNSPECIFIED_TZ.format(datetime); metadata.set(TikaCoreProperties.MODIFIED, datetimeNoTimeZone); // If Date/Time Original does not exist this might be creation date if (metadata.get(TikaCoreProperties.CREATED) == null) { @@ -490,7 +399,7 @@ } } } - + /** * Reads image comments, originally TIKA-472. 
* Metadata Extractor does not read XMP so we need to use the values from Iptc or EXIF @@ -499,7 +408,6 @@ public boolean supports(Class directoryType) { return directoryType == IptcDirectory.class; } - public void handle(Directory directory, Metadata metadata) throws MetadataException { if (directory.containsTag(IptcDirectory.TAG_KEYWORDS)) { @@ -533,14 +441,13 @@ public boolean supports(Class directoryType) { return directoryType == GpsDirectory.class; } - public void handle(Directory directory, Metadata metadata) throws MetadataException { GeoLocation geoLocation = ((GpsDirectory) directory).getGeoLocation(); if (geoLocation != null) { DecimalFormat geoDecimalFormat = new DecimalFormat(GEO_DECIMAL_FORMAT_STRING, new DecimalFormatSymbols(Locale.ENGLISH)); - metadata.set(TikaCoreProperties.LATITUDE, geoDecimalFormat.format(geoLocation.getLatitude())); - metadata.set(TikaCoreProperties.LONGITUDE, geoDecimalFormat.format(geoLocation.getLongitude())); + metadata.set(TikaCoreProperties.LATITUDE, geoDecimalFormat.format(new Double(geoLocation.getLatitude()))); + metadata.set(TikaCoreProperties.LONGITUDE, geoDecimalFormat.format(new Double(geoLocation.getLongitude()))); } } } diff --git a/tika-parsers/src/main/java/org/apache/tika/parser/image/ImageParser.java b/tika-parsers/src/main/java/org/apache/tika/parser/image/ImageParser.java index 8fd23eb..d84c5ac 100644 --- a/tika-parsers/src/main/java/org/apache/tika/parser/image/ImageParser.java +++ b/tika-parsers/src/main/java/org/apache/tika/parser/image/ImageParser.java @@ -16,11 +16,6 @@ */ package org.apache.tika.parser.image; -import javax.imageio.IIOException; -import javax.imageio.ImageIO; -import javax.imageio.ImageReader; -import javax.imageio.metadata.IIOMetadata; -import javax.imageio.stream.ImageInputStream; import java.io.IOException; import java.io.InputStream; import java.util.Arrays; @@ -29,8 +24,14 @@ import java.util.Iterator; import java.util.Set; -import org.apache.commons.io.input.CloseShieldInputStream; 
+import javax.imageio.IIOException; +import javax.imageio.ImageIO; +import javax.imageio.ImageReader; +import javax.imageio.metadata.IIOMetadata; +import javax.imageio.stream.ImageInputStream; + import org.apache.tika.exception.TikaException; +import org.apache.tika.io.CloseShieldInputStream; import org.apache.tika.metadata.Metadata; import org.apache.tika.metadata.Property; import org.apache.tika.metadata.TikaCoreProperties; @@ -45,38 +46,96 @@ public class ImageParser extends AbstractParser { - /** - * Serial version UID - */ + /** Serial version UID */ private static final long serialVersionUID = 7852529269245520335L; private static final MediaType CANONICAL_BMP_TYPE = MediaType.image("x-ms-bmp"); private static final MediaType JAVA_BMP_TYPE = MediaType.image("bmp"); - + private static final Set SUPPORTED_TYPES = - Collections.unmodifiableSet(new HashSet(Arrays.asList( - CANONICAL_BMP_TYPE, - JAVA_BMP_TYPE, - MediaType.image("gif"), - MediaType.image("png"), - MediaType.image("vnd.wap.wbmp"), - MediaType.image("x-icon"), - MediaType.image("x-xcf")))); + Collections.unmodifiableSet(new HashSet(Arrays.asList( + CANONICAL_BMP_TYPE, + JAVA_BMP_TYPE, + MediaType.image("gif"), + MediaType.image("png"), + MediaType.image("vnd.wap.wbmp"), + MediaType.image("x-icon"), + MediaType.image("x-xcf")))); + + public Set getSupportedTypes(ParseContext context) { + return SUPPORTED_TYPES; + } + + public void parse( + InputStream stream, ContentHandler handler, + Metadata metadata, ParseContext context) + throws IOException, SAXException, TikaException { + String type = metadata.get(Metadata.CONTENT_TYPE); + if (type != null) { + // Java has a different idea of the BMP mime type to + // what the canonical one is, fix this up. 
+ if (CANONICAL_BMP_TYPE.toString().equals(type)) { + type = JAVA_BMP_TYPE.toString(); + } + + try { + Iterator iterator = + ImageIO.getImageReadersByMIMEType(type); + if (iterator.hasNext()) { + ImageReader reader = iterator.next(); + try { + ImageInputStream imageStream = ImageIO.createImageInputStream( + new CloseShieldInputStream(stream)); + try { + reader.setInput(imageStream); + + metadata.set(Metadata.IMAGE_WIDTH, Integer.toString(reader.getWidth(0))); + metadata.set(Metadata.IMAGE_LENGTH, Integer.toString(reader.getHeight(0))); + metadata.set("height", Integer.toString(reader.getHeight(0))); + metadata.set("width", Integer.toString(reader.getWidth(0))); + + loadMetadata(reader.getImageMetadata(0), metadata); + } finally { + imageStream.close(); + } + } finally { + reader.dispose(); + } + } + + // Translate certain Metadata tags from the ImageIO + // specific namespace into the general Tika one + setIfPresent(metadata, "CommentExtensions CommentExtension", TikaCoreProperties.COMMENTS); + setIfPresent(metadata, "markerSequence com", TikaCoreProperties.COMMENTS); + setIfPresent(metadata, "Data BitsPerSample", Metadata.BITS_PER_SAMPLE); + } catch (IIOException e) { + // TIKA-619: There is a known bug in the Sun API when dealing with GIF images + // which Tika will just ignore. 
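As context for the hunk above: the parser remaps the canonical BMP MIME type before asking ImageIO for a reader, because the JDK registers its BMP plug-in under a different type name. A minimal standalone sketch of that lookup pattern — the class and method names here are hypothetical, not part of the patch:

```java
import java.util.Iterator;

import javax.imageio.ImageIO;
import javax.imageio.ImageReader;

public class ImageIoLookupSketch {

    // Java registers its BMP plug-in under image/bmp, not the canonical
    // image/x-ms-bmp, so remap before asking ImageIO for a reader.
    static String remap(String type) {
        return "image/x-ms-bmp".equals(type) ? "image/bmp" : type;
    }

    public static void main(String[] args) {
        Iterator<ImageReader> readers =
                ImageIO.getImageReadersByMIMEType(remap("image/x-ms-bmp"));
        System.out.println(readers.hasNext()); // a stock JDK finds its BMP reader
    }
}
```

Without the remap, `getImageReadersByMIMEType("image/x-ms-bmp")` returns an empty iterator on a standard JDK and no dimensions or metadata would be extracted.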
+ if (!(e.getMessage().equals("Unexpected block type 0!") && type.equals("image/gif"))) { + throw new TikaException(type + " parse error", e); + } + } + } + + XHTMLContentHandler xhtml = new XHTMLContentHandler(handler, metadata); + xhtml.startDocument(); + xhtml.endDocument(); + } + private static void setIfPresent(Metadata metadata, String imageIOkey, String tikaKey) { - if (metadata.get(imageIOkey) != null) { - metadata.set(tikaKey, metadata.get(imageIOkey)); - } - } - + if(metadata.get(imageIOkey) != null) { + metadata.set(tikaKey, metadata.get(imageIOkey)); + } + } private static void setIfPresent(Metadata metadata, String imageIOkey, Property tikaProp) { - if (metadata.get(imageIOkey) != null) { - String v = metadata.get(imageIOkey); - if (v.endsWith(" ")) { - v = v.substring(0, v.lastIndexOf(' ')); - } - metadata.set(tikaProp, v); - } + if(metadata.get(imageIOkey) != null) { + String v = metadata.get(imageIOkey); + if(v.endsWith(" ")) { + v = v.substring(0, v.lastIndexOf(' ')); + } + metadata.set(tikaProp, v); + } } private static void loadMetadata(IIOMetadata imageMetadata, Metadata metadata) { @@ -84,8 +143,9 @@ if (names == null) { return; } - for (String name : names) { - loadNode(metadata, imageMetadata.getAsTree(name), "", false); + int length = names.length; + for (int i = 0; i < length; i++) { + loadNode(metadata, imageMetadata.getAsTree(names[i]), "", false); } } @@ -141,63 +201,4 @@ return value; } - public Set getSupportedTypes(ParseContext context) { - return SUPPORTED_TYPES; - } - - public void parse( - InputStream stream, ContentHandler handler, - Metadata metadata, ParseContext context) - throws IOException, SAXException, TikaException { - String type = metadata.get(Metadata.CONTENT_TYPE); - if (type != null) { - // Java has a different idea of the BMP mime type to - // what the canonical one is, fix this up. 
- if (CANONICAL_BMP_TYPE.toString().equals(type)) { - type = JAVA_BMP_TYPE.toString(); - } - - try { - Iterator iterator = - ImageIO.getImageReadersByMIMEType(type); - if (iterator.hasNext()) { - ImageReader reader = iterator.next(); - try { - try (ImageInputStream imageStream = ImageIO.createImageInputStream( - new CloseShieldInputStream(stream))) { - reader.setInput(imageStream); - - metadata.set(Metadata.IMAGE_WIDTH, Integer.toString(reader.getWidth(0))); - metadata.set(Metadata.IMAGE_LENGTH, Integer.toString(reader.getHeight(0))); - metadata.set("height", Integer.toString(reader.getHeight(0))); - metadata.set("width", Integer.toString(reader.getWidth(0))); - - loadMetadata(reader.getImageMetadata(0), metadata); - } - } finally { - reader.dispose(); - } - } - - // Translate certain Metadata tags from the ImageIO - // specific namespace into the general Tika one - setIfPresent(metadata, "CommentExtensions CommentExtension", TikaCoreProperties.COMMENTS); - setIfPresent(metadata, "markerSequence com", TikaCoreProperties.COMMENTS); - setIfPresent(metadata, "Data BitsPerSample", Metadata.BITS_PER_SAMPLE); - } catch (IIOException e) { - // TIKA-619: There is a known bug in the Sun API when dealing with GIF images - // which Tika will just ignore. 
- if (!(e.getMessage() != null && - e.getMessage().equals("Unexpected block type 0!") && - type.equals("image/gif"))) { - throw new TikaException(type + " parse error", e); - } - } - } - - XHTMLContentHandler xhtml = new XHTMLContentHandler(handler, metadata); - xhtml.startDocument(); - xhtml.endDocument(); - } - } diff --git a/tika-parsers/src/main/java/org/apache/tika/parser/image/MetadataFields.java b/tika-parsers/src/main/java/org/apache/tika/parser/image/MetadataFields.java index 5238751..611b1e0 100644 --- a/tika-parsers/src/main/java/org/apache/tika/parser/image/MetadataFields.java +++ b/tika-parsers/src/main/java/org/apache/tika/parser/image/MetadataFields.java @@ -30,15 +30,9 @@ * ImageMetadataExtractor, but it can be generalized. */ public abstract class MetadataFields { - + private static HashSet known; - - static { - known = new HashSet(); - setKnownForClass(TikaCoreProperties.class); - setKnownForClass(Metadata.class); - } - + private static void setKnownForClass(Class clazz) { Field[] fields = clazz.getFields(); for (Field f : fields) { @@ -72,13 +66,19 @@ } } } - + + static { + known = new HashSet(); + setKnownForClass(TikaCoreProperties.class); + setKnownForClass(Metadata.class); + } + public static boolean isMetadataField(String name) { return known.contains(name); } - + public static boolean isMetadataField(Property property) { return known.contains(property.getName()); } - + } diff --git a/tika-parsers/src/main/java/org/apache/tika/parser/image/PSDParser.java b/tika-parsers/src/main/java/org/apache/tika/parser/image/PSDParser.java index 9eb6eea..1bbdbc0 100644 --- a/tika-parsers/src/main/java/org/apache/tika/parser/image/PSDParser.java +++ b/tika-parsers/src/main/java/org/apache/tika/parser/image/PSDParser.java @@ -18,6 +18,7 @@ import java.io.IOException; import java.io.InputStream; +import java.io.UnsupportedEncodingException; import java.util.Arrays; import java.util.Collections; import java.util.HashSet; @@ -27,7 +28,6 @@ import 
org.apache.tika.exception.TikaException; import org.apache.tika.io.EndianUtils; import org.apache.tika.metadata.Metadata; -import org.apache.tika.metadata.Photoshop; import org.apache.tika.metadata.TIFF; import org.apache.tika.metadata.TikaCoreProperties; import org.apache.tika.mime.MediaType; @@ -37,24 +37,20 @@ import org.xml.sax.ContentHandler; import org.xml.sax.SAXException; -import static java.nio.charset.StandardCharsets.US_ASCII; - /** * Parser for the Adobe Photoshop PSD File Format. - *
    + * * Documentation on the file format is available from * http://www.adobe.com/devnet-apps/photoshop/fileformatashtml/PhotoshopFileFormats.htm */ public class PSDParser extends AbstractParser { - /** - * Serial version UID - */ + /** Serial version UID */ private static final long serialVersionUID = 883387734607994914L; private static final Set SUPPORTED_TYPES = - Collections.unmodifiableSet(new HashSet(Arrays.asList( - MediaType.image("vnd.adobe.photoshop")))); + Collections.unmodifiableSet(new HashSet(Arrays.asList( + MediaType.image("vnd.adobe.photoshop")))); public Set getSupportedTypes(ParseContext context) { return SUPPORTED_TYPES; @@ -67,24 +63,24 @@ // Check for the magic header signature byte[] signature = new byte[4]; IOUtils.readFully(stream, signature); - if (signature[0] == (byte) '8' && signature[1] == (byte) 'B' && - signature[2] == (byte) 'P' && signature[3] == (byte) 'S') { - // Good, signature found + if(signature[0] == (byte)'8' && signature[1] == (byte)'B' && + signature[2] == (byte)'P' && signature[3] == (byte)'S') { + // Good, signature found } else { - throw new TikaException("PSD/PSB magic signature invalid"); + throw new TikaException("PSD/PSB magic signature invalid"); } - + // Check the version int version = EndianUtils.readUShortBE(stream); - if (version == 1 || version == 2) { - // Good, we support these two + if(version == 1 || version == 2) { + // Good, we support these two } else { - throw new TikaException("Invalid PSD/PSB version " + version); + throw new TikaException("Invalid PSD/PSB version " + version); } - + // Skip the reserved block IOUtils.readFully(stream, new byte[6]); - + // Number of channels in the image int numChannels = EndianUtils.readUShortBE(stream); // TODO Identify a suitable metadata key for this @@ -94,15 +90,16 @@ int width = EndianUtils.readIntBE(stream); metadata.set(TIFF.IMAGE_LENGTH, height); metadata.set(TIFF.IMAGE_WIDTH, width); - + // Depth (bits per channel) int depth = 
EndianUtils.readUShortBE(stream); metadata.set(TIFF.BITS_PER_SAMPLE, Integer.toString(depth)); - - // Colour mode, eg Bitmap or RGB + + // Colour mode + // Bitmap = 0; Grayscale = 1; Indexed = 2; RGB = 3; CMYK = 4; Multichannel = 7; Duotone = 8; Lab = 9. int colorMode = EndianUtils.readUShortBE(stream); - metadata.set(Photoshop.COLOR_MODE, Photoshop._COLOR_MODE_CHOICES_INDEXED[colorMode]); - + // TODO Identify a suitable metadata key for this + // Next is the Color Mode section // We don't care about this bit long colorModeSectionSize = EndianUtils.readIntBE(stream); @@ -112,89 +109,92 @@ // Check for certain interesting keys here long imageResourcesSectionSize = EndianUtils.readIntBE(stream); long read = 0; - while (read < imageResourcesSectionSize) { - ResourceBlock rb = new ResourceBlock(stream); - read += rb.totalLength; - - // Is it one we can do something useful with? - if (rb.id == ResourceBlock.ID_CAPTION) { - metadata.add(TikaCoreProperties.DESCRIPTION, rb.getDataAsString()); - } else if (rb.id == ResourceBlock.ID_EXIF_1) { - // TODO Parse the EXIF info via ImageMetadataExtractor - } else if (rb.id == ResourceBlock.ID_EXIF_3) { - // TODO Parse the EXIF info via ImageMetadataExtractor - } else if (rb.id == ResourceBlock.ID_XMP) { - // TODO Parse the XMP info via ImageMetadataExtractor - } + while(read < imageResourcesSectionSize) { + ResourceBlock rb = new ResourceBlock(stream); + read += rb.totalLength; + + // Is it one we can do something useful with? + if(rb.id == ResourceBlock.ID_CAPTION) { + metadata.add(TikaCoreProperties.DESCRIPTION, rb.getDataAsString()); + } else if(rb.id == ResourceBlock.ID_EXIF_1) { + // TODO Parse the EXIF info + } else if(rb.id == ResourceBlock.ID_EXIF_3) { + // TODO Parse the EXIF info + } else if(rb.id == ResourceBlock.ID_XMP) { + // TODO Parse the XMP info + } } - + // Next is the Layer and Mask Info // Finally we have Image Data // We can't do anything with these parts - + // We don't have any helpful text, sorry... 
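The PSD hunk above reads fixed big-endian header fields after a "8BPS" magic check, and later pads odd resource-block data lengths to an even byte count. A self-contained sketch of those two details using plain JDK I/O in place of Tika's EndianUtils — class and method names are illustrative only:

```java
import java.io.ByteArrayInputStream;
import java.io.DataInputStream;
import java.io.IOException;

public class PsdHeaderSketch {

    // Resource block data lengths are padded to an even byte count, which is
    // why the code above bumps odd values of dataLen by one.
    static int padToEven(int dataLen) {
        return (dataLen % 2 == 1) ? dataLen + 1 : dataLen;
    }

    // Checks the "8BPS" magic and returns the big-endian version field
    // (1 = PSD, 2 = PSB); DataInputStream reads big-endian like
    // EndianUtils.readUShortBE.
    static int readVersion(byte[] header) throws IOException {
        DataInputStream in = new DataInputStream(new ByteArrayInputStream(header));
        byte[] sig = new byte[4];
        in.readFully(sig);
        if (sig[0] != '8' || sig[1] != 'B' || sig[2] != 'P' || sig[3] != 'S') {
            throw new IOException("PSD/PSB magic signature invalid");
        }
        return in.readUnsignedShort();
    }

    public static void main(String[] args) throws IOException {
        System.out.println(readVersion(new byte[]{'8', 'B', 'P', 'S', 0, 2})); // prints 2
        System.out.println(padToEven(7)); // prints 8
    }
}
```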
XHTMLContentHandler xhtml = new XHTMLContentHandler(handler, metadata); xhtml.startDocument(); xhtml.endDocument(); } - + private static class ResourceBlock { - private static final long SIGNATURE = 0x3842494d; // 8BIM - private static final int ID_CAPTION = 0x03F0; - private static final int ID_URL = 0x040B; - private static final int ID_EXIF_1 = 0x0422; - private static final int ID_EXIF_3 = 0x0423; - private static final int ID_XMP = 0x0424; - - private int id; - private String name; - private byte[] data; - private int totalLength; - - private ResourceBlock(InputStream stream) throws IOException, TikaException { - // Verify the signature - long sig = EndianUtils.readIntBE(stream); - if (sig != SIGNATURE) { - throw new TikaException("Invalid Image Resource Block Signature Found, got " + - sig + " 0x" + Long.toHexString(sig) + " but the spec defines " + SIGNATURE); - } - - // Read the block - id = EndianUtils.readUShortBE(stream); - - StringBuffer nameB = new StringBuffer(); - int nameLen = 0; - while (true) { - int v = stream.read(); - nameLen++; - - if (v == 0) { - // The name length is padded to be even - if (nameLen % 2 == 1) { - stream.read(); - nameLen++; - } - break; - } else { - nameB.append((char) v); + private static final long SIGNATURE = 0x3842494d; // 8BIM + private static final int ID_CAPTION = 0x03F0; + private static final int ID_URL = 0x040B; + private static final int ID_EXIF_1 = 0x0422; + private static final int ID_EXIF_3 = 0x0423; + private static final int ID_XMP = 0x0424; + + private int id; + private String name; + private byte[] data; + private int totalLength; + private ResourceBlock(InputStream stream) throws IOException, TikaException { + // Verify the signature + long sig = EndianUtils.readIntBE(stream); + if(sig != SIGNATURE) { + throw new TikaException("Invalid Image Resource Block Signature Found, got " + + sig + " 0x" + Long.toHexString(sig) + " but the spec defines " + SIGNATURE); + } + + // Read the block + id = 
EndianUtils.readUShortBE(stream); + + StringBuffer nameB = new StringBuffer(); + int nameLen = 0; + while(true) { + int v = stream.read(); + nameLen++; + + if(v == 0) { + // The name length is padded to be even + if(nameLen % 2 == 1) { + stream.read(); + nameLen++; } - name = nameB.toString(); - } - - int dataLen = EndianUtils.readIntBE(stream); - if (dataLen % 2 == 1) { - // Data Length is even padded - dataLen = dataLen + 1; - } - totalLength = 4 + 2 + nameLen + 4 + dataLen; - - data = new byte[dataLen]; - IOUtils.readFully(stream, data); - } - - private String getDataAsString() { - // Will be null padded - return new String(data, 0, data.length - 1, US_ASCII); - } + break; + } else { + nameB.append((char)v); + } + name = nameB.toString(); + } + + int dataLen = EndianUtils.readIntBE(stream); + if(dataLen %2 == 1) { + // Data Length is even padded + dataLen = dataLen + 1; + } + totalLength = 4 + 2 + nameLen + 4 + dataLen; + + data = new byte[dataLen]; + IOUtils.readFully(stream, data); + } + + private String getDataAsString() { + // Will be null padded + try { + return new String(data, 0, data.length-1, "ASCII"); + } catch(UnsupportedEncodingException e) { + throw new RuntimeException("Something is very broken in your JVM!"); + } + } } } diff --git a/tika-parsers/src/main/java/org/apache/tika/parser/image/TiffParser.java b/tika-parsers/src/main/java/org/apache/tika/parser/image/TiffParser.java index d9b047e..94e935e 100644 --- a/tika-parsers/src/main/java/org/apache/tika/parser/image/TiffParser.java +++ b/tika-parsers/src/main/java/org/apache/tika/parser/image/TiffParser.java @@ -35,13 +35,11 @@ public class TiffParser extends AbstractParser { - /** - * Serial version UID - */ + /** Serial version UID */ private static final long serialVersionUID = -3941143576535464926L; private static final Set SUPPORTED_TYPES = - Collections.singleton(MediaType.image("tiff")); + Collections.singleton(MediaType.image("tiff")); public Set getSupportedTypes(ParseContext context) { 
return SUPPORTED_TYPES; diff --git a/tika-parsers/src/main/java/org/apache/tika/parser/image/WebPParser.java b/tika-parsers/src/main/java/org/apache/tika/parser/image/WebPParser.java deleted file mode 100644 index df2a2aa..0000000 --- a/tika-parsers/src/main/java/org/apache/tika/parser/image/WebPParser.java +++ /dev/null @@ -1,66 +0,0 @@ -/* - * Licensed to the Apache Software Foundation (ASF) under one or more - * contributor license agreements. See the NOTICE file distributed with - * this work for additional information regarding copyright ownership. - * The ASF licenses this file to You under the Apache License, Version 2.0 - * (the "License"); you may not use this file except in compliance with - * the License. You may obtain a copy of the License at - * - * http://www.apache.org/licenses/LICENSE-2.0 - * - * Unless required by applicable law or agreed to in writing, software - * distributed under the License is distributed on an "AS IS" BASIS, - * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. - * See the License for the specific language governing permissions and - * limitations under the License. 
- */ -package org.apache.tika.parser.image; - -import java.io.IOException; -import java.io.InputStream; -import java.util.Collections; -import java.util.Set; - -import org.apache.tika.exception.TikaException; -import org.apache.tika.io.TemporaryResources; -import org.apache.tika.io.TikaInputStream; -import org.apache.tika.metadata.Metadata; -import org.apache.tika.mime.MediaType; -import org.apache.tika.parser.AbstractParser; -import org.apache.tika.parser.ParseContext; -import org.apache.tika.sax.XHTMLContentHandler; -import org.xml.sax.ContentHandler; -import org.xml.sax.SAXException; - - -public class WebPParser extends AbstractParser { - - /** - * Serial version UID - */ - private static final long serialVersionUID = -3941143576535464926L; - - private static final Set SUPPORTED_TYPES = - Collections.singleton(MediaType.image("webp")); - - public Set getSupportedTypes(ParseContext context) { - return SUPPORTED_TYPES; - } - - public void parse( - InputStream stream, ContentHandler handler, - Metadata metadata, ParseContext context) - throws IOException, SAXException, TikaException { - TemporaryResources tmp = new TemporaryResources(); - try { - TikaInputStream tis = TikaInputStream.get(stream, tmp); - new ImageMetadataExtractor(metadata).parseWebP(tis.getFile()); - } finally { - tmp.dispose(); - } - - XHTMLContentHandler xhtml = new XHTMLContentHandler(handler, metadata); - xhtml.startDocument(); - xhtml.endDocument(); - } -} diff --git a/tika-parsers/src/main/java/org/apache/tika/parser/image/xmp/JempboxExtractor.java b/tika-parsers/src/main/java/org/apache/tika/parser/image/xmp/JempboxExtractor.java index 10692b8..d3a2f53 100644 --- a/tika-parsers/src/main/java/org/apache/tika/parser/image/xmp/JempboxExtractor.java +++ b/tika-parsers/src/main/java/org/apache/tika/parser/image/xmp/JempboxExtractor.java @@ -22,6 +22,7 @@ import java.io.InputStream; import java.io.InputStreamReader; import java.io.Reader; +import java.util.Iterator; import java.util.List; import 
org.apache.jempbox.xmp.XMPMetadata; @@ -31,14 +32,14 @@ import org.apache.tika.metadata.TikaCoreProperties; import org.xml.sax.InputSource; -import static java.nio.charset.StandardCharsets.UTF_8; - public class JempboxExtractor { + private XMPPacketScanner scanner = new XMPPacketScanner(); + + private Metadata metadata; + // The XMP spec says it must be unicode, but for most file formats it specifies "must be encoded in UTF-8" - private static final String DEFAULT_XMP_CHARSET = UTF_8.name(); - private XMPPacketScanner scanner = new XMPPacketScanner(); - private Metadata metadata; + private static final String DEFAULT_XMP_CHARSET = "UTF-8"; public JempboxExtractor(Metadata metadata) { this.metadata = metadata; @@ -67,8 +68,9 @@ metadata.set(TikaCoreProperties.CREATOR, joinCreators(dc.getCreators())); } if (dc.getSubjects() != null && dc.getSubjects().size() > 0) { - for (String keyword : dc.getSubjects()) { - metadata.add(TikaCoreProperties.KEYWORDS, keyword); + Iterator keywords = dc.getSubjects().iterator(); + while (keywords.hasNext()) { + metadata.add(TikaCoreProperties.KEYWORDS, keywords.next()); } // TODO should we set KEYWORDS too? // All tested photo managers set the same in Iptc.Application2.Keywords and Xmp.dc.subject diff --git a/tika-parsers/src/main/java/org/apache/tika/parser/image/xmp/XMPPacketScanner.java b/tika-parsers/src/main/java/org/apache/tika/parser/image/xmp/XMPPacketScanner.java index 3e23485..039ebb1 100644 --- a/tika-parsers/src/main/java/org/apache/tika/parser/image/xmp/XMPPacketScanner.java +++ b/tika-parsers/src/main/java/org/apache/tika/parser/image/xmp/XMPPacketScanner.java @@ -22,16 +22,15 @@ import java.io.IOException; import java.io.InputStream; import java.io.OutputStream; - -import static java.nio.charset.StandardCharsets.US_ASCII; +import java.io.UnsupportedEncodingException; /** * This class is a parser for XMP packets. By default, it tries to locate the first XMP packet * it finds and parses it. - *
    + *
    * Important: Before you use this class to look for an XMP packet in some random file, please read * the chapter on "Scanning Files for XMP Packets" in the XMP specification! - *
    + *
    * This class was branched from http://xmlgraphics.apache.org/ XMPPacketParser. * See also org.semanticdesktop.aperture.extractor.xmp.XMPExtractor, a variant. */ @@ -42,9 +41,47 @@ private static final byte[] PACKET_TRAILER; static { - PACKET_HEADER = "<?xpacket begin=".getBytes(US_ASCII); - PACKET_HEADER_END = "?>".getBytes(US_ASCII); - PACKET_TRAILER = "<?xpacket".getBytes(US_ASCII); + try { + PACKET_HEADER = "<?xpacket begin=".getBytes("US-ASCII"); + PACKET_HEADER_END = "?>".getBytes("US-ASCII"); + PACKET_TRAILER = "<?xpacket".getBytes("US-ASCII"); + } catch (UnsupportedEncodingException e) { + throw new RuntimeException("Incompatible JVM! US-ASCII charset is not supported."); + } } + + /** + * Locates an XMP packet in a stream, parses it and returns the XMP metadata. If no + * XMP packet is found until the stream ends, null is returned. Note: This method + * only finds the first XMP packet in a stream. And it cannot determine whether it + * has found the right XMP packet if there are multiple packets. + * + * Does not close the stream. + * If XMP block was found reading can continue below the block. + * + * @param in the InputStream to search + * @param xmlOut to write the XMP packet to + * @return true if XMP packet is found, false otherwise + * @throws IOException if an I/O error occurs + * @throws TransformerException if an error occurs while parsing the XMP packet + */ + public boolean parse(InputStream in, OutputStream xmlOut) throws IOException { + if (!in.markSupported()) { + in = new java.io.BufferedInputStream(in); + } + boolean foundXMP = skipAfter(in, PACKET_HEADER); + if (!foundXMP) { + return false; + } + //TODO Inspect "begin" attribute! + if (!skipAfter(in, PACKET_HEADER_END)) { + throw new IOException("Invalid XMP packet header!"); + } + //TODO Do with TeeInputStream when Commons IO 1.4 is available + if (!skipAfter(in, PACKET_TRAILER, xmlOut)) { + throw new IOException("XMP packet not properly terminated!"); + } + return true; } private static boolean skipAfter(InputStream in, byte[] match) throws IOException { @@ -75,39 +112,5 @@ return false; } - /** - * Locates an XMP packet in a stream, parses it and returns the XMP metadata. If no - * XMP packet is found until the stream ends, null is returned. Note: This method - * only finds the first XMP packet in a stream. And it cannot determine whether it - * has found the right XMP packet if there are multiple packets. - *
    - * Does not close the stream. - * If XMP block was found reading can continue below the block. - * - * @param in the InputStream to search - * @param xmlOut to write the XMP packet to - * @return true if XMP packet is found, false otherwise - * @throws IOException if an I/O error occurs - * @throws TransformerException if an error occurs while parsing the XMP packet - */ - public boolean parse(InputStream in, OutputStream xmlOut) throws IOException { - if (!in.markSupported()) { - in = new java.io.BufferedInputStream(in); - } - boolean foundXMP = skipAfter(in, PACKET_HEADER); - if (!foundXMP) { - return false; - } - //TODO Inspect "begin" attribute! - if (!skipAfter(in, PACKET_HEADER_END)) { - throw new IOException("Invalid XMP packet header!"); - } - //TODO Do with TeeInputStream when Commons IO 1.4 is available - if (!skipAfter(in, PACKET_TRAILER, xmlOut)) { - throw new IOException("XMP packet not properly terminated!"); - } - return true; - } - } diff --git a/tika-parsers/src/main/java/org/apache/tika/parser/internal/Activator.java b/tika-parsers/src/main/java/org/apache/tika/parser/internal/Activator.java index a884d3a..48d909f 100644 --- a/tika-parsers/src/main/java/org/apache/tika/parser/internal/Activator.java +++ b/tika-parsers/src/main/java/org/apache/tika/parser/internal/Activator.java @@ -32,20 +32,17 @@ private ServiceRegistration parserService; - @Override public void start(BundleContext context) throws Exception { detectorService = context.registerService( Detector.class.getName(), new DefaultDetector(Activator.class.getClassLoader()), new Properties()); - Parser parser = new DefaultParser(Activator.class.getClassLoader()); parserService = context.registerService( Parser.class.getName(), - parser, + new DefaultParser(Activator.class.getClassLoader()), new Properties()); } - @Override public void stop(BundleContext context) throws Exception { parserService.unregister(); detectorService.unregister(); diff --git 
a/tika-parsers/src/main/java/org/apache/tika/parser/iptc/IptcAnpaParser.java b/tika-parsers/src/main/java/org/apache/tika/parser/iptc/IptcAnpaParser.java index ec436e0..652f3db 100644 --- a/tika-parsers/src/main/java/org/apache/tika/parser/iptc/IptcAnpaParser.java +++ b/tika-parsers/src/main/java/org/apache/tika/parser/iptc/IptcAnpaParser.java @@ -23,7 +23,6 @@ import java.util.Collections; import java.util.Date; import java.util.HashMap; -import java.util.Locale; import java.util.Set; import java.util.TimeZone; @@ -36,8 +35,6 @@ import org.apache.tika.sax.XHTMLContentHandler; import org.xml.sax.ContentHandler; import org.xml.sax.SAXException; - -import static java.nio.charset.StandardCharsets.UTF_8; /** * Parser for IPTC ANPA New Wire Feeds @@ -163,7 +160,7 @@ } int msgsize = is.read(buf); // read in at least the full data - String message = (new String(buf, UTF_8)).toLowerCase(Locale.ROOT); + String message = (new String(buf)).toLowerCase(); // these are not if-then-else, because we want to go from most common // and fall through to least. this is imperfect, as these tags could // show up in other agency stories, but i can't find a spec or any @@ -593,7 +590,7 @@ --read; } } - if (tmp_line.toLowerCase(Locale.ROOT).startsWith("by") || longline.equals("bdy_author")) { + if (tmp_line.toLowerCase().startsWith("by") || longline.equals("bdy_author")) { longkey = "bdy_author"; // prepend a space to subsequent line, so it gets parsed consistent with the lead line @@ -601,30 +598,30 @@ // we have an author candidate int term = tmp_line.length(); - term = Math.min(term, (tmp_line.contains("<") ? tmp_line.indexOf("<") : term)); - term = Math.min(term, (tmp_line.contains("=") ? tmp_line.indexOf("=") : term)); - term = Math.min(term, (tmp_line.contains("\n") ? tmp_line.indexOf("\n") : term)); + term = Math.min(term, (tmp_line.indexOf("<") > -1 ? tmp_line.indexOf("<") : term)); + term = Math.min(term, (tmp_line.indexOf("=") > -1 ? 
tmp_line.indexOf("=") : term)); + term = Math.min(term, (tmp_line.indexOf("\n") > -1 ? tmp_line.indexOf("\n") : term)); term = (term > 0 ) ? term : tmp_line.length(); bdy_author += tmp_line.substring(tmp_line.indexOf(" "), term); metastarted = true; - longline = ((tmp_line.contains("=")) && (!longline.equals(longkey)) ? longkey : ""); + longline = ((tmp_line.indexOf("=") > -1) && (!longline.equals(longkey)) ? longkey : ""); } else if (FORMAT == this.FMT_IPTC_BLM) { String byline = " by "; - if (tmp_line.toLowerCase(Locale.ROOT).contains(byline)) { + if (tmp_line.toLowerCase().contains(byline)) { longkey = "bdy_author"; int term = tmp_line.length(); - term = Math.min(term, (tmp_line.contains("<") ? tmp_line.indexOf("<") : term)); - term = Math.min(term, (tmp_line.contains("=") ? tmp_line.indexOf("=") : term)); - term = Math.min(term, (tmp_line.contains("\n") ? tmp_line.indexOf("\n") : term)); + term = Math.min(term, (tmp_line.indexOf("<") > -1 ? tmp_line.indexOf("<") : term)); + term = Math.min(term, (tmp_line.indexOf("=") > -1 ? tmp_line.indexOf("=") : term)); + term = Math.min(term, (tmp_line.indexOf("\n") > -1 ? tmp_line.indexOf("\n") : term)); term = (term > 0 ) ? term : tmp_line.length(); // for bloomberg, the author line sits below their copyright statement - bdy_author += tmp_line.substring(tmp_line.toLowerCase(Locale.ROOT).indexOf(byline) + byline.length(), term) + " "; + bdy_author += tmp_line.substring(tmp_line.toLowerCase().indexOf(byline) + byline.length(), term) + " "; metastarted = true; - longline = ((tmp_line.contains("=")) && (!longline.equals(longkey)) ? longkey : ""); - } - else if(tmp_line.toLowerCase(Locale.ROOT).startsWith("c.")) { + longline = ((tmp_line.indexOf("=") > -1) && (!longline.equals(longkey)) ? 
longkey : ""); + } + else if(tmp_line.toLowerCase().startsWith("c.")) { // the author line for bloomberg is a multiline starting with c.2011 Bloomberg News // then containing the author info on the next line if (val_next == TB) { @@ -632,7 +629,7 @@ continue; } } - else if(tmp_line.toLowerCase(Locale.ROOT).trim().startsWith("(") && tmp_line.toLowerCase(Locale.ROOT).trim().endsWith(")")) { + else if(tmp_line.toLowerCase().trim().startsWith("(") && tmp_line.toLowerCase().trim().endsWith(")")) { // the author line may have one or more comment lines between the copyright // statement, and the By AUTHORNAME line if (val_next == TB) { @@ -642,15 +639,15 @@ } } - else if (tmp_line.toLowerCase(Locale.ROOT).startsWith("eds") || longline.equals("bdy_source")) { + else if (tmp_line.toLowerCase().startsWith("eds") || longline.equals("bdy_source")) { longkey = "bdy_source"; // prepend a space to subsequent line, so it gets parsed consistent with the lead line tmp_line = (longline.equals(longkey) ? " " : "") + tmp_line; // we have a source candidate int term = tmp_line.length(); - term = Math.min(term, (tmp_line.contains("<") ? tmp_line.indexOf("<") : term)); - term = Math.min(term, (tmp_line.contains("=") ? tmp_line.indexOf("=") : term)); + term = Math.min(term, (tmp_line.indexOf("<") > -1 ? tmp_line.indexOf("<") : term)); + term = Math.min(term, (tmp_line.indexOf("=") > -1 ? tmp_line.indexOf("=") : term)); // term = Math.min(term, (tmp_line.indexOf("\n") > -1 ? tmp_line.indexOf("\n") : term)); term = (term > 0 ) ? 
term : tmp_line.length(); bdy_source += tmp_line.substring(tmp_line.indexOf(" ") + 1, term) + " "; @@ -739,14 +736,14 @@ // standard reuters format format_in = "HH:mm MM-dd-yy"; } - SimpleDateFormat dfi = new SimpleDateFormat(format_in, Locale.ROOT); + SimpleDateFormat dfi = new SimpleDateFormat(format_in); dfi.setTimeZone(TimeZone.getTimeZone("UTC")); dateunix = dfi.parse(ftr_datetime); } catch (ParseException ep) { // failed, but this will just fall through to setting the date to now } - SimpleDateFormat dfo = new SimpleDateFormat(format_out, Locale.ROOT); + SimpleDateFormat dfo = new SimpleDateFormat(format_out); dfo.setTimeZone(TimeZone.getTimeZone("UTC")); ftr_datetime = dfo.format(dateunix); } diff --git a/tika-parsers/src/main/java/org/apache/tika/parser/isatab/ISATabUtils.java b/tika-parsers/src/main/java/org/apache/tika/parser/isatab/ISATabUtils.java deleted file mode 100644 index fc4f699..0000000 --- a/tika-parsers/src/main/java/org/apache/tika/parser/isatab/ISATabUtils.java +++ /dev/null @@ -1,209 +0,0 @@ -/* - * Licensed to the Apache Software Foundation (ASF) under one or more - * contributor license agreements. See the NOTICE file distributed with - * this work for additional information regarding copyright ownership. - * The ASF licenses this file to You under the Apache License, Version 2.0 - * (the "License"); you may not use this file except in compliance with - * the License. You may obtain a copy of the License at - * - * http://www.apache.org/licenses/LICENSE-2.0 - * - * Unless required by applicable law or agreed to in writing, software - * distributed under the License is distributed on an "AS IS" BASIS, - * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. - * See the License for the specific language governing permissions and - * limitations under the License. 
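The IptcAnpaParser hunks above drop the explicit `Locale.ROOT`, `UTF_8`, and `SimpleDateFormat` locale arguments, falling back to JVM defaults. A minimal standalone sketch (hypothetical class name, not Tika code) of why the explicit forms are deterministic across environments:

```java
import java.nio.charset.StandardCharsets;
import java.text.SimpleDateFormat;
import java.util.Date;
import java.util.Locale;
import java.util.TimeZone;

public class DefaultLocalePitfalls {
    public static void main(String[] args) {
        // Case mapping is locale-sensitive: under a Turkish locale the
        // uppercase 'I' lowercases to dotless 'ı' (U+0131), so a check like
        // message.toLowerCase().startsWith("by") can silently fail.
        Locale turkish = new Locale("tr", "TR");
        System.out.println("TITLE".toLowerCase(turkish));     // tıtle (dotless i)
        System.out.println("TITLE".toLowerCase(Locale.ROOT)); // title

        // new String(buf) decodes with the platform default charset;
        // naming the charset makes decoding reproducible everywhere.
        byte[] buf = "café".getBytes(StandardCharsets.UTF_8);
        System.out.println(new String(buf, StandardCharsets.UTF_8)); // café

        // SimpleDateFormat without a Locale formats month/day names in the
        // default locale; Locale.ROOT pins the pattern's symbols.
        SimpleDateFormat df = new SimpleDateFormat("EEE, d MMM yyyy", Locale.ROOT);
        df.setTimeZone(TimeZone.getTimeZone("UTC"));
        System.out.println(df.format(new Date(0L))); // Thu, 1 Jan 1970
    }
}
```

The same reasoning applies to the `tmp_line.contains("<")` forms being replaced with `tmp_line.indexOf("<") > -1`: the two are behaviorally equivalent, but `contains` states the intent directly.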
- */ -package org.apache.tika.parser.isatab; - -import java.io.IOException; -import java.io.InputStream; -import java.io.Reader; -import java.util.Arrays; -import java.util.HashMap; -import java.util.Iterator; -import java.util.Locale; -import java.util.Map; - -import org.apache.commons.csv.CSVFormat; -import org.apache.commons.csv.CSVParser; -import org.apache.commons.csv.CSVRecord; -import org.apache.commons.io.input.CloseShieldInputStream; -import org.apache.tika.config.ServiceLoader; -import org.apache.tika.detect.AutoDetectReader; -import org.apache.tika.exception.TikaException; -import org.apache.tika.io.TikaInputStream; -import org.apache.tika.metadata.Metadata; -import org.apache.tika.parser.ParseContext; -import org.apache.tika.sax.XHTMLContentHandler; -import org.xml.sax.SAXException; - -public class ISATabUtils { - - private static final ServiceLoader LOADER = new ServiceLoader(ISATabUtils.class.getClassLoader()); - - /** - * INVESTIGATION - */ - - // Investigation section. - private static final String[] sections = { - "ONTOLOGY SOURCE REFERENCE", - "INVESTIGATION", - "INVESTIGATION PUBLICATIONS", - "INVESTIGATION CONTACTS" - }; - - // STUDY section (inside the Study section) - private static final String studySectionField = "STUDY"; - - // Study File Name (inside the STUDY section) - private static final String studyFileNameField = "Study File Name"; - - public static void parseInvestigation(InputStream stream, XHTMLContentHandler handler, Metadata metadata, ParseContext context, String studyFileName) throws IOException, TikaException, SAXException { - // Automatically detect the character encoding - try (AutoDetectReader reader = new AutoDetectReader(new CloseShieldInputStream(stream), - metadata, context.get(ServiceLoader.class, LOADER))) { - extractMetadata(reader, metadata, studyFileName); - } - } - - public static void parseInvestigation(InputStream stream, XHTMLContentHandler handler, Metadata metadata, ParseContext context) throws IOException, 
TikaException, SAXException { - parseInvestigation(stream, handler, metadata, context, null); - } - - public static void parseStudy(InputStream stream, XHTMLContentHandler xhtml, Metadata metadata, ParseContext context) throws IOException, TikaException, SAXException { - TikaInputStream tis = TikaInputStream.get(stream); - // Automatically detect the character encoding - - try (AutoDetectReader reader = new AutoDetectReader(new CloseShieldInputStream(tis), - metadata, context.get(ServiceLoader.class, LOADER)); - CSVParser csvParser = new CSVParser(reader, CSVFormat.TDF)) { - Iterator iterator = csvParser.iterator(); - - xhtml.startElement("table"); - - xhtml.startElement("thead"); - if (iterator.hasNext()) { - CSVRecord record = iterator.next(); - for (int i = 0; i < record.size(); i++) { - xhtml.startElement("th"); - xhtml.characters(record.get(i)); - xhtml.endElement("th"); - } - } - xhtml.endElement("thead"); - - xhtml.startElement("tbody"); - while (iterator.hasNext()) { - CSVRecord record = iterator.next(); - xhtml.startElement("tr"); - for (int j = 0; j < record.size(); j++) { - xhtml.startElement("td"); - xhtml.characters(record.get(j)); - xhtml.endElement("td"); - } - xhtml.endElement("tr"); - } - xhtml.endElement("tbody"); - - xhtml.endElement("table"); - } - } - - public static void parseAssay(InputStream stream, XHTMLContentHandler xhtml, Metadata metadata, ParseContext context) throws IOException, TikaException, SAXException { - TikaInputStream tis = TikaInputStream.get(stream); - - // Automatically detect the character encoding - - try (AutoDetectReader reader = new AutoDetectReader(new CloseShieldInputStream(tis), - metadata, context.get(ServiceLoader.class, LOADER)); - CSVParser csvParser = new CSVParser(reader, CSVFormat.TDF)) { - xhtml.startElement("table"); - - Iterator iterator = csvParser.iterator(); - - xhtml.startElement("thead"); - if (iterator.hasNext()) { - CSVRecord record = iterator.next(); - for (int i = 0; i < record.size(); i++) { - 
xhtml.startElement("th"); - xhtml.characters(record.get(i)); - xhtml.endElement("th"); - } - } - xhtml.endElement("thead"); - - xhtml.startElement("tbody"); - while (iterator.hasNext()) { - CSVRecord record = iterator.next(); - xhtml.startElement("tr"); - for (int j = 0; j < record.size(); j++) { - xhtml.startElement("td"); - xhtml.characters(record.get(j)); - xhtml.endElement("td"); - } - xhtml.endElement("tr"); - } - xhtml.endElement("tbody"); - - xhtml.endElement("table"); - } - } - - private static void extractMetadata(Reader reader, Metadata metadata, String studyFileName) throws IOException { - boolean investigationSection = false; - boolean studySection = false; - boolean studyTarget = false; - - Map map = new HashMap(); - - try (CSVParser csvParser = new CSVParser(reader, CSVFormat.TDF)) { - Iterator iterator = csvParser.iterator(); - - while (iterator.hasNext()) { - CSVRecord record = iterator.next(); - String field = record.get(0); - if ((field.toUpperCase(Locale.ENGLISH).equals(field)) && (record.size() == 1)) { - investigationSection = Arrays.asList(sections).contains(field); - studySection = (studyFileName != null) && (field.equals(studySectionField)); - } else { - if (investigationSection) { - addMetadata(field, record, metadata); - } else if (studySection) { - if (studyTarget) { - break; - } - String value = record.get(1); - map.put(field, value); - studyTarget = (field.equals(studyFileNameField)) && (value.equals(studyFileName)); - if (studyTarget) { - mapStudyToMetadata(map, metadata); - studySection = false; - } - } else if (studyTarget) { - addMetadata(field, record, metadata); - } - } - } - } catch (IOException ioe) { - throw ioe; - } - } - - private static void addMetadata(String field, CSVRecord record, Metadata metadata) { - if ((record ==null) || (record.size() <= 1)) { - return; - } - - for (int i = 1; i < record.size(); i++) { - metadata.add(field, record.get(i)); - } - } - - private static void mapStudyToMetadata(Map map, Metadata 
metadata) { - for (Map.Entry entry : map.entrySet()) { - metadata.add(entry.getKey(), entry.getValue()); - } - } -} diff --git a/tika-parsers/src/main/java/org/apache/tika/parser/isatab/ISArchiveParser.java b/tika-parsers/src/main/java/org/apache/tika/parser/isatab/ISArchiveParser.java deleted file mode 100644 index 4398999..0000000 --- a/tika-parsers/src/main/java/org/apache/tika/parser/isatab/ISArchiveParser.java +++ /dev/null @@ -1,136 +0,0 @@ -/* - * Licensed to the Apache Software Foundation (ASF) under one or more - * contributor license agreements. See the NOTICE file distributed with - * this work for additional information regarding copyright ownership. - * The ASF licenses this file to You under the Apache License, Version 2.0 - * (the "License"); you may not use this file except in compliance with - * the License. You may obtain a copy of the License at - * - * http://www.apache.org/licenses/LICENSE-2.0 - * - * Unless required by applicable law or agreed to in writing, software - * distributed under the License is distributed on an "AS IS" BASIS, - * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. - * See the License for the specific language governing permissions and - * limitations under the License. 
- */ -package org.apache.tika.parser.isatab; - -import java.io.File; -import java.io.FilenameFilter; -import java.io.IOException; -import java.io.InputStream; -import java.util.Collections; -import java.util.Set; - -import org.apache.tika.exception.TikaException; -import org.apache.tika.io.TikaInputStream; -import org.apache.tika.metadata.Metadata; -import org.apache.tika.mime.MediaType; -import org.apache.tika.parser.ParseContext; -import org.apache.tika.parser.Parser; -import org.apache.tika.sax.XHTMLContentHandler; -import org.xml.sax.ContentHandler; -import org.xml.sax.SAXException; - -public class ISArchiveParser implements Parser { - - /** - * Serial version UID - */ - private static final long serialVersionUID = 3640809327541300229L; - - private final Set SUPPORTED_TYPES = Collections.singleton(MediaType.application("x-isatab")); - - private static String studyAssayFileNameField = "Study Assay File Name"; - - private String location = null; - - private String studyFileName = null; - - /** - * Default constructor. - */ - public ISArchiveParser() { - this(null); - } - - /** - * Constructor that accepts the pathname of ISArchive folder. 
- * @param location pathname of ISArchive folder including ISA-Tab files - */ - public ISArchiveParser(String location) { - if (location != null && !location.endsWith(File.separator)) { - location += File.separator; - } - this.location = location; - } - - @Override - public Set getSupportedTypes(ParseContext context) { - return SUPPORTED_TYPES; - } - - @Override - public void parse(InputStream stream, ContentHandler handler, Metadata metadata, - ParseContext context) throws IOException, SAXException, TikaException { - - TikaInputStream tis = TikaInputStream.get(stream); - if (this.location == null) { - this.location = tis.getFile().getParent() + File.separator; - } - this.studyFileName = tis.getFile().getName(); - - File locationFile = new File(location); - String[] investigationList = locationFile.list(new FilenameFilter() { - - @Override - public boolean accept(File dir, String name) { - return name.matches("i_.+\\.txt"); - } - }); - - XHTMLContentHandler xhtml = new XHTMLContentHandler(handler, metadata); - xhtml.startDocument(); - - parseInvestigation(investigationList, xhtml, metadata, context); - parseStudy(stream, xhtml, metadata, context); - parseAssay(xhtml, metadata, context); - - xhtml.endDocument(); - } - - private void parseInvestigation(String[] investigationList, XHTMLContentHandler xhtml, Metadata metadata, ParseContext context) throws IOException, SAXException, TikaException { - if ((investigationList == null) || (investigationList.length == 0)) { - // TODO warning - return; - } - if (investigationList.length > 1) { - // TODO warning - return; - } - - String investigation = investigationList[0]; // TODO add to metadata? 
- InputStream stream = TikaInputStream.get(new File(this.location + investigation)); - - ISATabUtils.parseInvestigation(stream, xhtml, metadata, context, this.studyFileName); - - xhtml.element("h1", "INVESTIGATION " + metadata.get("Investigation Identifier")); - } - - private void parseStudy(InputStream stream, XHTMLContentHandler xhtml, Metadata metadata, ParseContext context) throws IOException, SAXException, TikaException { - xhtml.element("h2", "STUDY " + metadata.get("Study Identifier")); - - ISATabUtils.parseStudy(stream, xhtml, metadata, context); - } - - private void parseAssay(XHTMLContentHandler xhtml, Metadata metadata, ParseContext context) throws IOException, SAXException, TikaException { - for (String assayFileName : metadata.getValues(studyAssayFileNameField)) { - xhtml.startElement("div"); - xhtml.element("h3", "ASSAY " + assayFileName); - InputStream stream = TikaInputStream.get(new File(this.location + assayFileName)); - ISATabUtils.parseAssay(stream, xhtml, metadata, context); - xhtml.endElement("div"); - } - } -} diff --git a/tika-parsers/src/main/java/org/apache/tika/parser/iwork/AutoPageNumberUtils.java b/tika-parsers/src/main/java/org/apache/tika/parser/iwork/AutoPageNumberUtils.java index 4143932..0dda1c0 100644 --- a/tika-parsers/src/main/java/org/apache/tika/parser/iwork/AutoPageNumberUtils.java +++ b/tika-parsers/src/main/java/org/apache/tika/parser/iwork/AutoPageNumberUtils.java @@ -15,8 +15,6 @@ * limitations under the License. 
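The deleted ISATab code above manages its readers and CSV parsers with try-with-resources, while the IWorkPackageParser hunk that follows replaces a try-with-resources block with an explicit try/finally. For a single resource the two close on the same paths (try-with-resources additionally records a failed `close()` as a suppressed exception instead of losing the body's exception); a minimal sketch with a stand-in resource:

```java
public class TryWithResourcesSketch {
    /** Stand-in AutoCloseable that just records whether close() ran. */
    static class Resource implements AutoCloseable {
        boolean closed = false;
        @Override public void close() { closed = true; }
    }

    /** try-with-resources form, as in the deleted ISATab utilities. */
    static boolean withResources() {
        Resource res = new Resource();
        try (Resource r = res) {
            // ... use r ...
        }
        return res.closed;
    }

    /** explicit try/finally form, as in the IWorkPackageParser hunk. */
    static boolean tryFinally() {
        Resource res = new Resource();
        try {
            // ... use res ...
        } finally {
            res.close();
        }
        return res.closed;
    }

    public static void main(String[] args) {
        System.out.println(withResources() + " " + tryFinally()); // true true
    }
}
```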
*/ package org.apache.tika.parser.iwork; - -import java.util.Locale; /** * Utility class to allow for conversion from an integer to Roman numerals @@ -46,7 +44,7 @@ } public static String asAlphaNumericLower(int i) { - return asAlphaNumeric(i).toLowerCase(Locale.ROOT); + return asAlphaNumeric(i).toLowerCase(); } /* @@ -75,7 +73,7 @@ } public static String asRomanNumeralsLower(int i) { - return asRomanNumerals(i).toLowerCase(Locale.ROOT); + return asRomanNumerals(i).toLowerCase(); } private static int i2r(StringBuffer sbuff, int i, diff --git a/tika-parsers/src/main/java/org/apache/tika/parser/iwork/IWorkPackageParser.java b/tika-parsers/src/main/java/org/apache/tika/parser/iwork/IWorkPackageParser.java index 79d82e8..2df1b90 100644 --- a/tika-parsers/src/main/java/org/apache/tika/parser/iwork/IWorkPackageParser.java +++ b/tika-parsers/src/main/java/org/apache/tika/parser/iwork/IWorkPackageParser.java @@ -30,9 +30,9 @@ import org.apache.commons.compress.archivers.zip.ZipArchiveEntry; import org.apache.commons.compress.archivers.zip.ZipArchiveInputStream; import org.apache.commons.compress.archivers.zip.ZipFile; -import org.apache.commons.io.input.CloseShieldInputStream; import org.apache.tika.detect.XmlRootExtractor; import org.apache.tika.exception.TikaException; +import org.apache.tika.io.CloseShieldInputStream; import org.apache.tika.metadata.Metadata; import org.apache.tika.mime.MediaType; import org.apache.tika.parser.AbstractParser; @@ -103,9 +103,12 @@ return null; } - try (InputStream stream = zip.getInputStream(entry)) { - return detectType(stream); - } + InputStream stream = zip.getInputStream(entry); + try { + return detectType(stream); + } finally { + stream.close(); + } } catch (IOException e) { return null; } @@ -213,7 +216,7 @@ entry = zip.getNextZipEntry(); } - // Don't close the zip InputStream (TIKA-1117). 
+ zip.close(); } } diff --git a/tika-parsers/src/main/java/org/apache/tika/parser/iwork/PagesContentHandler.java b/tika-parsers/src/main/java/org/apache/tika/parser/iwork/PagesContentHandler.java index 9b45769..134bc90 100644 --- a/tika-parsers/src/main/java/org/apache/tika/parser/iwork/PagesContentHandler.java +++ b/tika-parsers/src/main/java/org/apache/tika/parser/iwork/PagesContentHandler.java @@ -28,6 +28,7 @@ import java.util.HashMap; import java.util.List; import java.util.Map; +import java.util.regex.Pattern; class PagesContentHandler extends DefaultHandler { diff --git a/tika-parsers/src/main/java/org/apache/tika/parser/jdbc/AbstractDBParser.java b/tika-parsers/src/main/java/org/apache/tika/parser/jdbc/AbstractDBParser.java deleted file mode 100644 index e3065a2..0000000 --- a/tika-parsers/src/main/java/org/apache/tika/parser/jdbc/AbstractDBParser.java +++ /dev/null @@ -1,189 +0,0 @@ -package org.apache.tika.parser.jdbc; -/* - * Licensed to the Apache Software Foundation (ASF) under one or more - * contributor license agreements. See the NOTICE file distributed with - * this work for additional information regarding copyright ownership. - * The ASF licenses this file to You under the Apache License, Version 2.0 - * (the "License"); you may not use this file except in compliance with - * the License. You may obtain a copy of the License at - * - * http://www.apache.org/licenses/LICENSE-2.0 - * - * Unless required by applicable law or agreed to in writing, software - * distributed under the License is distributed on an "AS IS" BASIS, - * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. - * See the License for the specific language governing permissions and - * limitations under the License. 
- */ - -import java.io.IOException; -import java.io.InputStream; -import java.sql.Connection; -import java.sql.DriverManager; -import java.sql.SQLException; -import java.util.List; -import java.util.Set; - -import org.apache.commons.io.IOExceptionWithCause; -import org.apache.tika.exception.TikaException; -import org.apache.tika.extractor.EmbeddedDocumentExtractor; -import org.apache.tika.extractor.ParsingEmbeddedDocumentExtractor; -import org.apache.tika.metadata.Database; -import org.apache.tika.metadata.Metadata; -import org.apache.tika.mime.MediaType; -import org.apache.tika.parser.AbstractParser; -import org.apache.tika.parser.ParseContext; -import org.apache.tika.sax.XHTMLContentHandler; -import org.xml.sax.ContentHandler; -import org.xml.sax.SAXException; - -/** - * Abstract class that handles iterating through tables within a database. - */ -abstract class AbstractDBParser extends AbstractParser { - - private final static byte[] EMPTY_BYTE_ARR = new byte[0]; - - private Connection connection; - - protected static EmbeddedDocumentExtractor getEmbeddedDocumentExtractor(ParseContext context) { - return context.get(EmbeddedDocumentExtractor.class, - new ParsingEmbeddedDocumentExtractor(context)); - } - - @Override - public Set getSupportedTypes(ParseContext context) { - return null; - } - - @Override - public void parse(InputStream stream, ContentHandler handler, Metadata metadata, ParseContext context) throws IOException, SAXException, TikaException { - connection = getConnection(stream, metadata, context); - XHTMLContentHandler xHandler = null; - List tableNames = null; - try { - tableNames = getTableNames(connection, metadata, context); - } catch (SQLException e) { - throw new IOExceptionWithCause(e); - } - for (String tableName : tableNames) { - //add table names to parent metadata - metadata.add(Database.TABLE_NAME, tableName); - } - xHandler = new XHTMLContentHandler(handler, metadata); - xHandler.startDocument(); - - try { - for (String tableName : 
tableNames) { - JDBCTableReader tableReader = getTableReader(connection, tableName, context); - xHandler.startElement("table", "name", tableReader.getTableName()); - xHandler.startElement("thead"); - xHandler.startElement("tr"); - for (String header : tableReader.getHeaders()) { - xHandler.startElement("th"); - xHandler.characters(header); - xHandler.endElement("th"); - } - xHandler.endElement("tr"); - xHandler.endElement("thead"); - xHandler.startElement("tbody"); - while (tableReader.nextRow(xHandler, context)) { - //no-op - } - xHandler.endElement("tbody"); - xHandler.endElement("table"); - } - } finally { - if (xHandler != null) { - xHandler.endDocument(); - } - try { - close(); - } catch (SQLException e) { - //swallow - } - } - } - - /** - * Override this for any special handling of closing the connection. - * - * @throws java.sql.SQLException - * @throws java.io.IOException - */ - protected void close() throws SQLException, IOException { - connection.close(); - } - - /** - * Override this for special configuration of the connection, such as limiting - * the number of rows to be held in memory. 
- * - * @param stream stream to use - * @param metadata metadata that could be used in parameterizing the connection - * @param context parsecontext that could be used in parameterizing the connection - * @return connection - * @throws java.io.IOException - * @throws org.apache.tika.exception.TikaException - */ - protected Connection getConnection(InputStream stream, Metadata metadata, ParseContext context) throws IOException, TikaException { - String connectionString = getConnectionString(stream, metadata, context); - - Connection connection = null; - try { - Class.forName(getJDBCClassName()); - } catch (ClassNotFoundException e) { - throw new TikaException(e.getMessage()); - } - try { - connection = DriverManager.getConnection(connectionString); - } catch (SQLException e) { - throw new IOExceptionWithCause(e); - } - return connection; - } - - /** - * Implement for db specific connection information, e.g. "jdbc:sqlite:/docs/mydb.db" - *
    - * Include any optimization settings, user name, password, etc. - *
    - * - * @param stream stream for processing - * @param metadata metadata might be useful in determining connection info - * @param parseContext context to use to help create connectionString - * @return connection string to be used by {@link #getConnection}. - * @throws java.io.IOException - */ - abstract protected String getConnectionString(InputStream stream, - Metadata metadata, ParseContext parseContext) throws IOException; - - /** - * JDBC class name, e.g. org.sqlite.JDBC - * - * @return jdbc class name - */ - abstract protected String getJDBCClassName(); - - /** - * Returns the names of the tables to process - * - * @param connection Connection to use to make the sql call(s) to get the names of the tables - * @param metadata Metadata to use (potentially) in decision about which tables to extract - * @param context ParseContext to use (potentially) in decision about which tables to extract - * @return - * @throws java.sql.SQLException - */ - abstract protected List getTableNames(Connection connection, Metadata metadata, - ParseContext context) throws SQLException; - - /** - * Given a connection and a table name, return the JDBCTableReader for this db. - * - * @param connection - * @param tableName - * @return - */ - abstract protected JDBCTableReader getTableReader(Connection connection, String tableName, ParseContext parseContext); - -} diff --git a/tika-parsers/src/main/java/org/apache/tika/parser/jdbc/JDBCTableReader.java b/tika-parsers/src/main/java/org/apache/tika/parser/jdbc/JDBCTableReader.java deleted file mode 100644 index 5b10d97..0000000 --- a/tika-parsers/src/main/java/org/apache/tika/parser/jdbc/JDBCTableReader.java +++ /dev/null @@ -1,302 +0,0 @@ -package org.apache.tika.parser.jdbc; -/* - * Licensed to the Apache Software Foundation (ASF) under one or more - * contributor license agreements. See the NOTICE file distributed with - * this work for additional information regarding copyright ownership. 
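The deleted `getConnection` above follows the classic JDBC bootstrap: load the driver class reflectively, then hand a connection string to `DriverManager`, converting `ClassNotFoundException` into a checked domain exception. A standalone sketch of that shape (stand-in exception type in place of `TikaException`; no real driver is loaded):

```java
public class DriverLoadSketch {
    /** Stand-in for TikaException in this sketch. */
    static class ParserSetupException extends Exception {
        ParserSetupException(String msg) { super(msg); }
    }

    /** Mirrors the driver-loading step of AbstractDBParser.getConnection. */
    static void loadDriverClass(String className) throws ParserSetupException {
        try {
            Class.forName(className);
        } catch (ClassNotFoundException e) {
            // surface a missing driver jar as a parse-time error
            throw new ParserSetupException(e.getMessage());
        }
    }

    public static void main(String[] args) {
        try {
            loadDriverClass("java.util.ArrayList"); // present on every JVM
            loadDriverClass("org.example.NoSuchDriver");
        } catch (ParserSetupException expected) {
            System.out.println("missing driver reported: " + expected.getMessage());
        }
    }
}
```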
- * The ASF licenses this file to You under the Apache License, Version 2.0 - * (the "License"); you may not use this file except in compliance with - * the License. You may obtain a copy of the License at - * - * http://www.apache.org/licenses/LICENSE-2.0 - * - * Unless required by applicable law or agreed to in writing, software - * distributed under the License is distributed on an "AS IS" BASIS, - * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. - * See the License for the specific language governing permissions and - * limitations under the License. - */ - - -import java.io.ByteArrayInputStream; -import java.io.IOException; -import java.io.InputStream; -import java.sql.Blob; -import java.sql.Clob; -import java.sql.Connection; -import java.sql.ResultSet; -import java.sql.ResultSetMetaData; -import java.sql.SQLException; -import java.sql.Statement; -import java.sql.Types; -import java.util.LinkedList; -import java.util.List; - -import org.apache.commons.io.FilenameUtils; -import org.apache.commons.io.IOExceptionWithCause; -import org.apache.commons.io.IOUtils; -import org.apache.tika.config.TikaConfig; -import org.apache.tika.detect.Detector; -import org.apache.tika.extractor.EmbeddedDocumentExtractor; -import org.apache.tika.io.TikaInputStream; -import org.apache.tika.metadata.Database; -import org.apache.tika.metadata.Metadata; -import org.apache.tika.metadata.TikaMetadataKeys; -import org.apache.tika.mime.MediaType; -import org.apache.tika.mime.MimeType; -import org.apache.tika.mime.MimeTypeException; -import org.apache.tika.mime.MimeTypes; -import org.apache.tika.parser.ParseContext; -import org.apache.tika.sax.XHTMLContentHandler; -import org.xml.sax.Attributes; -import org.xml.sax.ContentHandler; -import org.xml.sax.SAXException; -import org.xml.sax.helpers.AttributesImpl; - -import static java.nio.charset.StandardCharsets.UTF_8; - -/** - * General base class to iterate through rows of a JDBC table - */ -class JDBCTableReader { - 
- private final static Attributes EMPTY_ATTRIBUTES = new AttributesImpl(); - private final Connection connection; - private final String tableName; - int maxClobLength = 1000000; - ResultSet results = null; - int rows = 0; - private TikaConfig tikaConfig = null; - private Detector detector = null; - private MimeTypes mimeTypes = null; - - public JDBCTableReader(Connection connection, String tableName, ParseContext context) { - this.connection = connection; - this.tableName = tableName; - this.tikaConfig = context.get(TikaConfig.class); - } - - public boolean nextRow(ContentHandler handler, ParseContext context) throws IOException, SAXException { - //lazy initialization - if (results == null) { - reset(); - } - try { - if (!results.next()) { - return false; - } - } catch (SQLException e) { - throw new IOExceptionWithCause(e); - } - try { - ResultSetMetaData meta = results.getMetaData(); - handler.startElement(XHTMLContentHandler.XHTML, "tr", "tr", EMPTY_ATTRIBUTES); - for (int i = 1; i <= meta.getColumnCount(); i++) { - handler.startElement(XHTMLContentHandler.XHTML, "td", "td", EMPTY_ATTRIBUTES); - handleCell(meta, i, handler, context); - handler.endElement(XHTMLContentHandler.XHTML, "td", "td"); - } - handler.endElement(XHTMLContentHandler.XHTML, "tr", "tr"); - } catch (SQLException e) { - throw new IOExceptionWithCause(e); - } - rows++; - return true; - } - - private void handleCell(ResultSetMetaData rsmd, int i, ContentHandler handler, ParseContext context) throws SQLException, IOException, SAXException { - switch (rsmd.getColumnType(i)) { - case Types.BLOB: - handleBlob(tableName, rsmd.getColumnName(i), rows, results, i, handler, context); - break; - case Types.CLOB: - handleClob(tableName, rsmd.getColumnName(i), rows, results, i, handler, context); - break; - case Types.BOOLEAN: - handleBoolean(results.getBoolean(i), handler); - break; - case Types.DATE: - handleDate(results, i, handler); - break; - case Types.TIMESTAMP: - handleTimeStamp(results, i, handler); 
- break; - case Types.INTEGER: - handleInteger(rsmd.getColumnTypeName(i), results, i, handler); - break; - case Types.FLOAT: - //this is necessary to handle rounding issues in presentation - //Should we just use getString(i)? - addAllCharacters(Float.toString(results.getFloat(i)), handler); - break; - case Types.DOUBLE: - addAllCharacters(Double.toString(results.getDouble(i)), handler); - break; - default: - addAllCharacters(results.getString(i), handler); - break; - } - } - - public List getHeaders() throws IOException { - List headers = new LinkedList(); - //lazy initialization - if (results == null) { - reset(); - } - try { - ResultSetMetaData meta = results.getMetaData(); - for (int i = 1; i <= meta.getColumnCount(); i++) { - headers.add(meta.getColumnName(i)); - } - } catch (SQLException e) { - throw new IOExceptionWithCause(e); - } - return headers; - } - - protected void handleInteger(String columnTypeName, ResultSet rs, int columnIndex, ContentHandler handler) throws SQLException, SAXException { - addAllCharacters(Integer.toString(rs.getInt(columnIndex)), handler); - } - - private void handleBoolean(boolean aBoolean, ContentHandler handler) throws SAXException { - addAllCharacters(Boolean.toString(aBoolean), handler); - } - - - protected void handleClob(String tableName, String columnName, int rowNum, - ResultSet resultSet, int columnIndex, - ContentHandler handler, ParseContext context) throws SQLException, IOException, SAXException { - Clob clob = resultSet.getClob(columnIndex); - boolean truncated = clob.length() > Integer.MAX_VALUE || clob.length() > maxClobLength; - - int readSize = (clob.length() < maxClobLength ? 
(int) clob.length() : maxClobLength); - Metadata m = new Metadata(); - m.set(Database.TABLE_NAME, tableName); - m.set(Database.COLUMN_NAME, columnName); - m.set(Database.PREFIX + "ROW_NUM", Integer.toString(rowNum)); - m.set(Database.PREFIX + "IS_CLOB", "true"); - m.set(Database.PREFIX + "CLOB_LENGTH", Long.toString(clob.length())); - m.set(Database.PREFIX + "IS_CLOB_TRUNCATED", Boolean.toString(truncated)); - m.set(Metadata.CONTENT_TYPE, "text/plain; charset=UTF-8"); - m.set(Metadata.CONTENT_LENGTH, Integer.toString(readSize)); - m.set(TikaMetadataKeys.RESOURCE_NAME_KEY, - //just in case something screwy is going on with the column name - FilenameUtils.normalize(FilenameUtils.getName(columnName + "_" + rowNum + ".txt"))); - - - //is there a more efficient way to go from a Reader to an InputStream? - String s = clob.getSubString(0, readSize); - EmbeddedDocumentExtractor ex = AbstractDBParser.getEmbeddedDocumentExtractor(context); - ex.parseEmbedded(new ByteArrayInputStream(s.getBytes(UTF_8)), handler, m, true); - } - - protected void handleBlob(String tableName, String columnName, int rowNum, ResultSet resultSet, int columnIndex, - ContentHandler handler, ParseContext context) throws SQLException, IOException, SAXException { - Metadata m = new Metadata(); - m.set(Database.TABLE_NAME, tableName); - m.set(Database.COLUMN_NAME, columnName); - m.set(Database.PREFIX + "ROW_NUM", Integer.toString(rowNum)); - m.set(Database.PREFIX + "IS_BLOB", "true"); - Blob blob = null; - InputStream is = null; - EmbeddedDocumentExtractor ex = AbstractDBParser.getEmbeddedDocumentExtractor(context); - try { - is = TikaInputStream.get(getInputStreamFromBlob(resultSet, columnIndex, blob, m)); - - Attributes attrs = new AttributesImpl(); - ((AttributesImpl) attrs).addAttribute("", "type", "type", "CDATA", "blob"); - ((AttributesImpl) attrs).addAttribute("", "column_name", "column_name", "CDATA", columnName); - ((AttributesImpl) attrs).addAttribute("", "row_number", "row_number", "CDATA", 
Integer.toString(rowNum)); - handler.startElement("", "span", "span", attrs); - MediaType mediaType = getDetector().detect(is, new Metadata()); - String extension = ""; - try { - MimeType mimeType = getMimeTypes().forName(mediaType.toString()); - m.set(Metadata.CONTENT_TYPE, mimeType.toString()); - extension = mimeType.getExtension(); - } catch (MimeTypeException e) { - //swallow - } - m.set(TikaMetadataKeys.RESOURCE_NAME_KEY, - //just in case something screwy is going on with the column name - FilenameUtils.normalize(FilenameUtils.getName(columnName + "_" + rowNum + extension))); - - ex.parseEmbedded(is, handler, m, true); - - } finally { - if (blob != null) { - try { - blob.free(); - } catch (SQLException e) { - //swallow - } - } - IOUtils.closeQuietly(is); - } - handler.endElement("", "span", "span"); - } - - protected InputStream getInputStreamFromBlob(ResultSet resultSet, int columnIndex, Blob blob, Metadata metadata) throws SQLException { - return TikaInputStream.get(blob, metadata); - } - - protected void handleDate(ResultSet resultSet, int columnIndex, ContentHandler handler) throws SAXException, SQLException { - addAllCharacters(resultSet.getString(columnIndex), handler); - } - - protected void handleTimeStamp(ResultSet resultSet, int columnIndex, ContentHandler handler) throws SAXException, SQLException { - addAllCharacters(resultSet.getString(columnIndex), handler); - } - - protected void addAllCharacters(String s, ContentHandler handler) throws SAXException { - char[] chars = s.toCharArray(); - handler.characters(chars, 0, chars.length); - } - - void reset() throws IOException { - - if (results != null) { - try { - results.close(); - } catch (SQLException e) { - //swallow - } - } - - String sql = "SELECT * from " + tableName; - try { - Statement st = connection.createStatement(); - results = st.executeQuery(sql); - } catch (SQLException e) { - throw new IOExceptionWithCause(e); - } - rows = 0; - } - - public String getTableName() { - return tableName; - 
} - - - protected TikaConfig getTikaConfig() { - if (tikaConfig == null) { - tikaConfig = TikaConfig.getDefaultConfig(); - } - return tikaConfig; - } - - protected Detector getDetector() { - if (detector != null) return detector; - - detector = getTikaConfig().getDetector(); - return detector; - } - - protected MimeTypes getMimeTypes() { - if (mimeTypes != null) return mimeTypes; - - mimeTypes = getTikaConfig().getMimeRepository(); - return mimeTypes; - } - -} diff --git a/tika-parsers/src/main/java/org/apache/tika/parser/jdbc/SQLite3DBParser.java b/tika-parsers/src/main/java/org/apache/tika/parser/jdbc/SQLite3DBParser.java deleted file mode 100644 index 4ea8f30..0000000 --- a/tika-parsers/src/main/java/org/apache/tika/parser/jdbc/SQLite3DBParser.java +++ /dev/null @@ -1,110 +0,0 @@ -package org.apache.tika.parser.jdbc; -/* - * Licensed to the Apache Software Foundation (ASF) under one or more - * contributor license agreements. See the NOTICE file distributed with - * this work for additional information regarding copyright ownership. - * The ASF licenses this file to You under the Apache License, Version 2.0 - * (the "License"); you may not use this file except in compliance with - * the License. You may obtain a copy of the License at - * - * http://www.apache.org/licenses/LICENSE-2.0 - * - * Unless required by applicable law or agreed to in writing, software - * distributed under the License is distributed on an "AS IS" BASIS, - * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. - * See the License for the specific language governing permissions and - * limitations under the License. 
- */ - -import java.io.File; -import java.io.IOException; -import java.io.InputStream; -import java.sql.Connection; -import java.sql.ResultSet; -import java.sql.SQLException; -import java.sql.Statement; -import java.util.LinkedList; -import java.util.List; -import java.util.Set; - -import org.apache.commons.io.IOExceptionWithCause; -import org.apache.tika.io.TikaInputStream; -import org.apache.tika.metadata.Metadata; -import org.apache.tika.mime.MediaType; -import org.apache.tika.parser.ParseContext; -import org.sqlite.SQLiteConfig; - -/** - * This is the implementation of the db parser for SQLite. - *
    - * This parser is internal only; it should not be registered in the services - * file or configured in the TikaConfig xml file. - */ -class SQLite3DBParser extends AbstractDBParser { - - protected static final String SQLITE_CLASS_NAME = "org.sqlite.JDBC"; - - /** - * @param context context - * @return null (always) - */ - @Override - public Set getSupportedTypes(ParseContext context) { - return null; - } - - @Override - protected Connection getConnection(InputStream stream, Metadata metadata, ParseContext context) throws IOException { - String connectionString = getConnectionString(stream, metadata, context); - - Connection connection = null; - try { - Class.forName(getJDBCClassName()); - } catch (ClassNotFoundException e) { - throw new IOExceptionWithCause(e); - } - try { - SQLiteConfig config = new SQLiteConfig(); - - //good habit, but effectively meaningless here - config.setReadOnly(true); - connection = config.createConnection(connectionString); - - } catch (SQLException e) { - throw new IOException(e.getMessage()); - } - return connection; - } - - @Override - protected String getConnectionString(InputStream is, Metadata metadata, ParseContext context) throws IOException { - File dbFile = TikaInputStream.get(is).getFile(); - return "jdbc:sqlite:" + dbFile.getAbsolutePath(); - } - - @Override - protected String getJDBCClassName() { - return SQLITE_CLASS_NAME; - } - - @Override - protected List getTableNames(Connection connection, Metadata metadata, - ParseContext context) throws SQLException { - List tableNames = new LinkedList(); - - try (Statement st = connection.createStatement()) { - String sql = "SELECT name FROM sqlite_master WHERE type='table'"; - ResultSet rs = st.executeQuery(sql); - - while (rs.next()) { - tableNames.add(rs.getString(1)); - } - } - return tableNames; - } - - @Override - public JDBCTableReader getTableReader(Connection connection, String tableName, ParseContext context) { - return new SQLite3TableReader(connection, tableName, 
context); - } -} diff --git a/tika-parsers/src/main/java/org/apache/tika/parser/jdbc/SQLite3Parser.java b/tika-parsers/src/main/java/org/apache/tika/parser/jdbc/SQLite3Parser.java deleted file mode 100644 index ef2fb93..0000000 --- a/tika-parsers/src/main/java/org/apache/tika/parser/jdbc/SQLite3Parser.java +++ /dev/null @@ -1,80 +0,0 @@ -package org.apache.tika.parser.jdbc; -/* - * Licensed to the Apache Software Foundation (ASF) under one or more - * contributor license agreements. See the NOTICE file distributed with - * this work for additional information regarding copyright ownership. - * The ASF licenses this file to You under the Apache License, Version 2.0 - * (the "License"); you may not use this file except in compliance with - * the License. You may obtain a copy of the License at - * - * http://www.apache.org/licenses/LICENSE-2.0 - * - * Unless required by applicable law or agreed to in writing, software - * distributed under the License is distributed on an "AS IS" BASIS, - * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. - * See the License for the specific language governing permissions and - * limitations under the License. - */ - -import java.io.IOException; -import java.io.InputStream; -import java.util.Collections; -import java.util.Set; - -import org.apache.tika.exception.TikaException; -import org.apache.tika.metadata.Metadata; -import org.apache.tika.mime.MediaType; -import org.apache.tika.parser.AbstractParser; -import org.apache.tika.parser.ParseContext; -import org.xml.sax.ContentHandler; -import org.xml.sax.SAXException; - -/** - * This is the main class for parsing SQLite3 files. When {@link #parse} is called, - * this creates a new {@link org.apache.tika.parser.jdbc.SQLite3DBParser}. - *
    - * Given potential conflicts of native libraries in web servers, users will - * need to add org.xerial's sqlite-jdbc jar to the class path for this parser - * to work. For development and testing, this jar is specified in tika-parsers' - * pom.xml, but it is currently set to "provided." - *
    - * Note that this family of jdbc parsers is designed to treat each CLOB and each BLOB - * as embedded documents. - */ -public class SQLite3Parser extends AbstractParser { - /** - * Serial version UID - */ - private static final long serialVersionUID = -752276948656079347L; - - private static final MediaType MEDIA_TYPE = MediaType.application("x-sqlite3"); - - private final Set SUPPORTED_TYPES; - - /** - * Checks to see if class is available for org.sqlite.JDBC. - *
    - * If not, this class will return an EMPTY_SET for getSupportedTypes() - */ - public SQLite3Parser() { - Set tmp; - try { - Class.forName(SQLite3DBParser.SQLITE_CLASS_NAME); - tmp = Collections.singleton(MEDIA_TYPE); - } catch (ClassNotFoundException e) { - tmp = Collections.EMPTY_SET; - } - SUPPORTED_TYPES = tmp; - } - - @Override - public Set getSupportedTypes(ParseContext context) { - return SUPPORTED_TYPES; - } - - @Override - public void parse(InputStream stream, ContentHandler handler, Metadata metadata, ParseContext context) throws IOException, SAXException, TikaException { - SQLite3DBParser p = new SQLite3DBParser(); - p.parse(stream, handler, metadata, context); - } -} diff --git a/tika-parsers/src/main/java/org/apache/tika/parser/jdbc/SQLite3TableReader.java b/tika-parsers/src/main/java/org/apache/tika/parser/jdbc/SQLite3TableReader.java deleted file mode 100644 index 8671d09..0000000 --- a/tika-parsers/src/main/java/org/apache/tika/parser/jdbc/SQLite3TableReader.java +++ /dev/null @@ -1,109 +0,0 @@ -package org.apache.tika.parser.jdbc; -/* - * Licensed to the Apache Software Foundation (ASF) under one or more - * contributor license agreements. See the NOTICE file distributed with - * this work for additional information regarding copyright ownership. - * The ASF licenses this file to You under the Apache License, Version 2.0 - * (the "License"); you may not use this file except in compliance with - * the License. You may obtain a copy of the License at - * - * http://www.apache.org/licenses/LICENSE-2.0 - * - * Unless required by applicable law or agreed to in writing, software - * distributed under the License is distributed on an "AS IS" BASIS, - * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. - * See the License for the specific language governing permissions and - * limitations under the License. 
- */ - -import java.io.IOException; -import java.io.InputStream; -import java.sql.Blob; -import java.sql.Connection; -import java.sql.ResultSet; -import java.sql.SQLException; -import java.text.DateFormat; -import java.text.SimpleDateFormat; -import java.util.Locale; - -import org.apache.tika.io.TikaInputStream; -import org.apache.tika.metadata.Metadata; -import org.apache.tika.parser.ParseContext; -import org.xml.sax.ContentHandler; -import org.xml.sax.SAXException; - - -/** - * Concrete class for SQLLite table parsing. This overrides - * column type handling from JDBCRowHandler. - *
    - * This class is not designed to be thread safe (because of DateFormat)! - * Need to call a new instance for each parse, as AbstractDBParser does. - *
    - * For now, this silently skips cells of type CLOB, because xerial's jdbc connector - * does not currently support them. - */ -class SQLite3TableReader extends JDBCTableReader { - - - DateFormat dateFormat = new SimpleDateFormat("yyyy-MM-dd", Locale.ROOT); - - public SQLite3TableReader(Connection connection, String tableName, ParseContext context) { - super(connection, tableName, context); - } - - - /** - * No-op for now in {@link SQLite3TableReader}. - * - * @param tableName - * @param fieldName - * @param rowNum - * @param resultSet - * @param columnIndex - * @param handler - * @param context - * @throws java.sql.SQLException - * @throws java.io.IOException - * @throws org.xml.sax.SAXException - */ - @Override - protected void handleClob(String tableName, String fieldName, int rowNum, - ResultSet resultSet, int columnIndex, - ContentHandler handler, ParseContext context) throws SQLException, IOException, SAXException { - //no-op for now. - } - - /** - * The jdbc connection to Sqlite does not yet implement blob, have to getBytes(). - * - * @param resultSet resultSet - * @param columnIndex columnIndex for blob - * @return - * @throws java.sql.SQLException - */ - @Override - protected InputStream getInputStreamFromBlob(ResultSet resultSet, int columnIndex, Blob blob, Metadata m) throws SQLException { - return TikaInputStream.get(resultSet.getBytes(columnIndex), m); - } - - @Override - protected void handleInteger(String columnTypeName, ResultSet rs, int columnIndex, - ContentHandler handler) throws SQLException, SAXException { - //As of this writing, with xerial's sqlite jdbc connector, a timestamp is - //stored as a column of type Integer, but the columnTypeName is TIMESTAMP, and the - //value is a string representing a Long. 
- if (columnTypeName.equals("TIMESTAMP")) { - addAllCharacters(parseDateFromLongString(rs.getString(columnIndex)), handler); - } else { - addAllCharacters(Integer.toString(rs.getInt(columnIndex)), handler); - } - - } - - private String parseDateFromLongString(String longString) throws SAXException { - java.sql.Date d = new java.sql.Date(Long.parseLong(longString)); - return dateFormat.format(d); - - } -} diff --git a/tika-parsers/src/main/java/org/apache/tika/parser/journal/GrobidRESTParser.java b/tika-parsers/src/main/java/org/apache/tika/parser/journal/GrobidRESTParser.java deleted file mode 100644 index 05b09fc..0000000 --- a/tika-parsers/src/main/java/org/apache/tika/parser/journal/GrobidRESTParser.java +++ /dev/null @@ -1,112 +0,0 @@ -/** - * Licensed to the Apache Software Foundation (ASF) under one or more - * contributor license agreements. See the NOTICE file distributed with - * this work for additional information regarding copyright ownership. - * The ASF licenses this file to You under the Apache License, Version 2.0 - * (the "License"); you may not use this file except in compliance with - * the License. You may obtain a copy of the License at - * - * http://www.apache.org/licenses/LICENSE-2.0 - * - * Unless required by applicable law or agreed to in writing, software - * distributed under the License is distributed on an "AS IS" BASIS, - * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. - * See the License for the specific language governing permissions and - * limitations under the License. 
- */ - -package org.apache.tika.parser.journal; - -import java.io.File; -import java.io.FileInputStream; -import java.io.FileNotFoundException; -import java.io.IOException; -import java.util.Properties; - -import javax.ws.rs.core.MediaType; -import javax.ws.rs.core.Response; - -import org.apache.cxf.jaxrs.client.WebClient; -import org.apache.cxf.jaxrs.ext.multipart.Attachment; -import org.apache.cxf.jaxrs.ext.multipart.ContentDisposition; -import org.apache.cxf.jaxrs.ext.multipart.MultipartBody; -import org.apache.tika.metadata.Metadata; -import org.apache.tika.parser.ParseContext; -import org.xml.sax.ContentHandler; - -public class GrobidRESTParser { - - private static final String GROBID_REST_HOST = "http://localhost:8080"; - - private static final String GROBID_ISALIVE_PATH = "/grobid"; // isalive - // doesn't work - // nfc why - - private static final String GROBID_PROCESSHEADER_PATH = "/processHeaderDocument"; - - private String restHostUrlStr; - - public GrobidRESTParser() { - String restHostUrlStr = null; - try { - restHostUrlStr = readRestUrl(); - } catch (IOException e) { - e.printStackTrace(); - } - - if (restHostUrlStr == null - || (restHostUrlStr != null && restHostUrlStr.equals(""))) { - this.restHostUrlStr = GROBID_REST_HOST; - } else { - this.restHostUrlStr = restHostUrlStr; - } - } - - public void parse(String filePath, ContentHandler handler, Metadata metadata, - ParseContext context) throws FileNotFoundException { - - File pdfFile = new File(filePath); - ContentDisposition cd = new ContentDisposition( - "form-data; name=\"input\"; filename=\"" + pdfFile.getName() + "\""); - Attachment att = new Attachment("input", new FileInputStream(pdfFile), cd); - MultipartBody body = new MultipartBody(att); - - Response response = WebClient - .create(restHostUrlStr + GROBID_PROCESSHEADER_PATH) - .accept(MediaType.APPLICATION_XML).type(MediaType.MULTIPART_FORM_DATA) - .post(body); - - try { - String resp = response.readEntity(String.class); - Metadata teiMet = 
new TEIParser().parse(resp); - for (String key : teiMet.names()) { - metadata.add("grobid:header_" + key, teiMet.get(key)); - } - } catch (Exception e) { - e.printStackTrace(); - } - } - - private static String readRestUrl() throws IOException { - Properties grobidProperties = new Properties(); - grobidProperties.load(GrobidRESTParser.class - .getResourceAsStream("GrobidExtractor.properties")); - - return grobidProperties.getProperty("grobid.server.url"); - } - - protected static boolean canRun() { - Response response = null; - - try { - response = WebClient.create(readRestUrl() + GROBID_ISALIVE_PATH) - .accept(MediaType.TEXT_HTML).get(); - String resp = response.readEntity(String.class); - return resp != null && !resp.equals("") && resp.startsWith("
    "); - } catch (Exception e) { - e.printStackTrace(); - return false; - } - } - -} diff --git a/tika-parsers/src/main/java/org/apache/tika/parser/journal/JournalParser.java b/tika-parsers/src/main/java/org/apache/tika/parser/journal/JournalParser.java deleted file mode 100644 index 04fc237..0000000 --- a/tika-parsers/src/main/java/org/apache/tika/parser/journal/JournalParser.java +++ /dev/null @@ -1,65 +0,0 @@ -/** - * Licensed to the Apache Software Foundation (ASF) under one or more - * contributor license agreements. See the NOTICE file distributed with - * this work for additional information regarding copyright ownership. - * The ASF licenses this file to You under the Apache License, Version 2.0 - * (the "License"); you may not use this file except in compliance with - * the License. You may obtain a copy of the License at - * - * http://www.apache.org/licenses/LICENSE-2.0 - * - * Unless required by applicable law or agreed to in writing, software - * distributed under the License is distributed on an "AS IS" BASIS, - * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. - * See the License for the specific language governing permissions and - * limitations under the License. 
- */ - -package org.apache.tika.parser.journal; - -import java.io.File; -import java.io.FileInputStream; -import java.io.IOException; -import java.io.InputStream; -import java.util.Collections; -import java.util.Set; -import org.apache.tika.exception.TikaException; -import org.apache.tika.io.TemporaryResources; -import org.apache.tika.io.TikaInputStream; -import org.apache.tika.metadata.Metadata; -import org.apache.tika.mime.MediaType; -import org.apache.tika.parser.AbstractParser; -import org.apache.tika.parser.ParseContext; -import org.apache.tika.parser.pdf.PDFParser; -import org.xml.sax.ContentHandler; -import org.xml.sax.SAXException; - -public class JournalParser extends AbstractParser { - - /** - * Generated serial ID - */ - private static final long serialVersionUID = 4664255544154296438L; - - private static final MediaType TYPE = MediaType.application("pdf"); - - private static final Set SUPPORTED_TYPES = Collections - .singleton(TYPE); - - public Set getSupportedTypes(ParseContext context) { - return SUPPORTED_TYPES; - } - - public void parse(InputStream stream, ContentHandler handler, - Metadata metadata, ParseContext context) throws IOException, - SAXException, TikaException { - TikaInputStream tis = TikaInputStream.get(stream, new TemporaryResources()); - File tmpFile = tis.getFile(); - - GrobidRESTParser grobidParser = new GrobidRESTParser(); - grobidParser.parse(tmpFile.getAbsolutePath(), handler, metadata, context); - - PDFParser parser = new PDFParser(); - parser.parse(new FileInputStream(tmpFile), handler, metadata, context); - } -} diff --git a/tika-parsers/src/main/java/org/apache/tika/parser/journal/TEIParser.java b/tika-parsers/src/main/java/org/apache/tika/parser/journal/TEIParser.java deleted file mode 100644 index 04d5195..0000000 --- a/tika-parsers/src/main/java/org/apache/tika/parser/journal/TEIParser.java +++ /dev/null @@ -1,893 +0,0 @@ -/** - * Licensed to the Apache Software Foundation (ASF) under one or more - * contributor license 
agreements. See the NOTICE file distributed with - * this work for additional information regarding copyright ownership. - * The ASF licenses this file to You under the Apache License, Version 2.0 - * (the "License"); you may not use this file except in compliance with - * the License. You may obtain a copy of the License at - * - * http://www.apache.org/licenses/LICENSE-2.0 - * - * Unless required by applicable law or agreed to in writing, software - * distributed under the License is distributed on an "AS IS" BASIS, - * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. - * See the License for the specific language governing permissions and - * limitations under the License. - */ - -package org.apache.tika.parser.journal; - -import java.util.ArrayList; -import java.util.List; - -import org.apache.tika.metadata.Metadata; -import org.json.JSONArray; -import org.json.JSONObject; -import org.json.XML; - -public class TEIParser { - - public TEIParser() { - } - - public Metadata parse(String source) { - JSONObject obj = XML.toJSONObject(source); - Metadata metadata = new Metadata(); - createGrobidMetadata(source, obj, metadata); - return metadata; - } - - private void createGrobidMetadata(String source, JSONObject obj, - Metadata metadata) { - if (obj != null) { - JSONObject teiHeader = obj.getJSONObject("TEI") - .getJSONObject("teiHeader"); - if (teiHeader.has("text")) { - parseText(teiHeader.getJSONObject("text"), metadata); - } - - if (teiHeader.has("fileDesc")) { - parseFileDesc(teiHeader.getJSONObject("fileDesc"), metadata); - - } - if (teiHeader.has("profileDesc")) { - parseProfileDesc(teiHeader.getJSONObject("profileDesc"), metadata); - } - } - - addStaticMet(source, obj, metadata); - } - - private void addStaticMet(String source, JSONObject obj, Metadata metadata) { - metadata.add("Class", Metadata.class.getName()); - metadata.add("TEIJSONSource", obj.toString()); - metadata.add("TEIXMLSource", source); - } - - private void 
parseText(JSONObject text, Metadata metadata) { - if (text.has("xml:lang")) { - metadata.add("Language", text.getString("xml:lang")); - } - } - - private void parseFileDesc(JSONObject fileDesc, Metadata metadata) { - if (fileDesc.has("titleStmt")) { - parseTitleStmt(fileDesc.getJSONObject("titleStmt"), metadata); - } - - if (fileDesc.has("sourceDesc")) { - parseSourceDesc(fileDesc.getJSONObject("sourceDesc"), metadata); - } - } - - private void parseTitleStmt(JSONObject titleStmt, Metadata metadata) { - if (titleStmt.has("title")) { - JSONObject title = titleStmt.getJSONObject("title"); - if (title.has("content")) { - metadata.add("Title", title.getString("content")); - } - } - } - - private void parseSourceDesc(JSONObject sourceDesc, Metadata metadata) { - if (sourceDesc.has("biblStruct")) { - parseBiblStruct(sourceDesc.getJSONObject("biblStruct"), metadata); - } - } - - private void parseBiblStruct(JSONObject biblStruct, Metadata metadata) { - if (biblStruct.has("analytic") - && biblStruct.get("analytic") instanceof JSONObject) { - JSONObject analytic = biblStruct.getJSONObject("analytic"); - if (analytic.has("author")) { - Object authorObj = analytic.get("author"); - - List authorList = new ArrayList(); - if (authorObj instanceof JSONObject) { - parseAuthor((JSONObject) authorObj, authorList); - } else if (authorObj instanceof JSONArray) { - JSONArray authors = (JSONArray) authorObj; - if (authors.length() > 0) { - for (int i = 0; i < authors.length(); i++) { - JSONObject author = authors.getJSONObject(i); - parseAuthor(author, authorList); - } - } - - metadata.add("Address", getMetadataAddresses(authorList)); - metadata.add("Affiliation", getMetadataAffiliations(authorList)); - metadata.add("Authors", getMetadataAuthors(authorList)); - metadata.add("FullAffiliations", - getMetadataFullAffiliations(authorList)); - } - - } - } else { - metadata.add("Error", "Unable to parse: no analytic section in JSON"); - } - - } - - private String 
getMetadataFullAffiliations(List authorList) { - List unique = new ArrayList(); - StringBuilder metAffils = new StringBuilder(); - - for (Author a : authorList) { - for (Affiliation af : a.getAffiliations()) { - if (!unique.contains(af)) { - unique.add(af); - } - } - } - metAffils.append("["); - for (Affiliation af : unique) { - metAffils.append(af.toString()); - metAffils.append(","); - } - metAffils.append(metAffils.deleteCharAt(metAffils.length() - 1)); - metAffils.append("]"); - return metAffils.toString(); - } - - private String getMetadataAuthors(List authorList) { - // generates Chris A. Mattmann 1, 2 Daniel J. Crichton 1 Nenad Medvidovic 2 - // Steve Hughes 1 - List unique = new ArrayList(); - StringBuilder metAuthors = new StringBuilder(); - - for (Author a : authorList) { - for (Affiliation af : a.getAffiliations()) { - if (!unique.contains(af)) { - unique.add(af); - } - } - } - - for (Author a : authorList) { - metAuthors.append(printOrBlank(a.getFirstName())); - metAuthors.append(printOrBlank(a.getMiddleName())); - metAuthors.append(printOrBlank(a.getSurName())); - - StringBuilder affilBuilder = new StringBuilder(); - for (int idx = 0; idx < unique.size(); idx++) { - Affiliation af = unique.get(idx); - if (a.getAffiliations().contains(af)) { - affilBuilder.append((idx + 1)); - affilBuilder.append(","); - } - } - - if (affilBuilder.length() > 0) - affilBuilder.deleteCharAt(affilBuilder.length() - 1); - - metAuthors.append(affilBuilder.toString()); - metAuthors.append(" "); - } - - return metAuthors.toString(); - } - - private String getMetadataAffiliations(List authorList) { - // generates 1 Jet Propulsion Laboratory California Institute of Technology - // ; 2 Computer Science Department University of Southern California - List unique = new ArrayList(); - StringBuilder metAffil = new StringBuilder(); - - for (Author a : authorList) { - for (Affiliation af : a.getAffiliations()) { - if (!unique.contains(af)) { - unique.add(af); - } - } - } - - int count = 
1; - for (Affiliation a : unique) { - metAffil.append(count); - metAffil.append(" "); - metAffil.append(a.getOrgName().toString()); - metAffil.deleteCharAt(metAffil.length() - 1); - metAffil.append("; "); - count++; - } - - if (count > 1) { - metAffil.deleteCharAt(metAffil.length() - 1); - metAffil.deleteCharAt(metAffil.length() - 1); - } - - return metAffil.toString(); - } - - private String getMetadataAddresses(List authorList) { - // generates: "Pasadena, CA 91109, USA Los Angeles, CA 90089, USA", - List
<Address> unique = new ArrayList<Address>
    (); - StringBuilder metAddress = new StringBuilder(); - - for (Author a : authorList) { - for (Affiliation af : a.getAffiliations()) { - if (!unique.contains(af.getAddress())) { - unique.add(af.getAddress()); - } - } - } - - for (Address ad : unique) { - metAddress.append(ad.toString()); - metAddress.append(" "); - } - - return metAddress.toString(); - } - - private void parseAuthor(JSONObject authorObj, List authorList) { - Author author = new Author(); - - if (authorObj.has("persName")) { - JSONObject persName = authorObj.getJSONObject("persName"); - - if (persName.has("forename")) { - - Object foreNameObj = persName.get("forename"); - - if (foreNameObj instanceof JSONObject) { - parseNamePart((JSONObject) foreNameObj, author); - } else if (foreNameObj instanceof JSONArray) { - JSONArray foreName = persName.getJSONArray("forename"); - - if (foreName.length() > 0) { - for (int i = 0; i < foreName.length(); i++) { - JSONObject namePart = foreName.getJSONObject(i); - parseNamePart(namePart, author); - } - } - } - } - - if (persName.has("surname")) { - author.setSurName(persName.getString("surname")); - } - - if (authorObj.has("affiliation")) { - parseAffiliation(authorObj.get("affiliation"), author); - } - - } - - authorList.add(author); - } - - private void parseNamePart(JSONObject namePart, Author author) { - if (namePart.has("type") && namePart.has("content")) { - String type = namePart.getString("type"); - String content = namePart.getString("content"); - - if (type.equals("first")) { - author.setFirstName(content); - } - - if (type.equals("middle")) { - author.setMiddleName(content); - } - } - } - - private void parseAffiliation(Object affiliationJSON, Author author) { - if (affiliationJSON instanceof JSONObject) { - parseOneAffiliation((JSONObject) affiliationJSON, author); - } else if (affiliationJSON instanceof JSONArray) { - JSONArray affiliationArray = (JSONArray) affiliationJSON; - if (affiliationArray != null && affiliationArray.length() > 0) { - for 
(int i = 0; i < affiliationArray.length(); i++) { - JSONObject affiliationObj = affiliationArray.getJSONObject(i); - parseOneAffiliation(affiliationObj, author); - } - } - } - } - - private void parseOneAffiliation(JSONObject affiliationObj, Author author) { - - Affiliation affiliation = new Affiliation(); - if (affiliationObj.has("address")) { - parseAddress(affiliationObj.getJSONObject("address"), affiliation); - } - - if (affiliationObj.has("orgName")) { - OrgName orgName = new OrgName(); - Object orgObject = affiliationObj.get("orgName"); - if (orgObject instanceof JSONObject) { - parseOrgName((JSONObject) orgObject, orgName); - } else if (orgObject instanceof JSONArray) { - JSONArray orgNames = (JSONArray) orgObject; - if (orgNames != null && orgNames.length() > 0) { - for (int i = 0; i < orgNames.length(); i++) { - parseOrgName(orgNames.getJSONObject(i), orgName); - } - } - - affiliation.setOrgName(orgName); - } - - } - - author.getAffiliations().add(affiliation); - } - - private void parseAddress(JSONObject addressObj, Affiliation affiliation) { - Address address = new Address(); - - if (addressObj.has("region")) { - address.setRegion(addressObj.getString("region")); - } - - if (addressObj.has("postCode")) { - address.setPostCode(JSONObject.valueToString(addressObj.get("postCode"))); - } - - if (addressObj.has("settlement")) { - address.setSettlment(addressObj.getString("settlement")); - } - - if (addressObj.has("country")) { - Country country = new Country(); - Object countryObj = addressObj.get("country"); - - if (countryObj instanceof JSONObject) { - JSONObject countryJson = addressObj.getJSONObject("country"); - - if (countryJson.has("content")) { - country.setContent(countryJson.getString("content")); - } - - if (countryJson.has("key")) { - country.setKey(countryJson.getString("key")); - } - } else if (countryObj instanceof String) { - country.setContent((String) countryObj); - } - address.setCountry(country); - } - - affiliation.setAddress(address); - 
} - - private void parseOrgName(JSONObject orgObj, OrgName orgName) { - OrgTypeName typeName = new OrgTypeName(); - if (orgObj.has("content")) { - typeName.setName(orgObj.getString("content")); - } - - if (orgObj.has("type")) { - typeName.setType(orgObj.getString("type")); - } - - orgName.getTypeNames().add(typeName); - } - - private void parseProfileDesc(JSONObject profileDesc, Metadata metadata) { - if (profileDesc.has("abstract")) { - if (profileDesc.has("p")) { - metadata.add("Abstract", profileDesc.getString("p")); - } - } - - if (profileDesc.has("textClass")) { - JSONObject textClass = profileDesc.getJSONObject("textClass"); - - if (textClass.has("keywords")) { - Object keywordsObj = textClass.get("keywords"); - // test AJ15.pdf - if (keywordsObj instanceof String) { - metadata.add("Keyword", (String) keywordsObj); - } else if (keywordsObj instanceof JSONObject) { - JSONObject keywords = textClass.getJSONObject("keywords"); - if (keywords.has("term")) { - JSONArray termArr = keywords.getJSONArray("term"); - for (int i = 0; i < termArr.length(); i++) { - metadata.add("Keyword", JSONObject.valueToString(termArr.get(i))); - } - } - } - - } - } - - } - - private String printOrBlank(String val) { - if (val != null && !val.equals("")) { - return val + " "; - } else - return " "; - } - - class Author { - - private String surName; - - private String middleName; - - private String firstName; - - private List affiliations; - - public Author() { - this.surName = null; - this.middleName = null; - this.firstName = null; - this.affiliations = new ArrayList(); - } - - /** - * @return the surName - */ - public String getSurName() { - return surName; - } - - /** - * @param surName - * the surName to set - */ - public void setSurName(String surName) { - this.surName = surName; - } - - /** - * @return the middleName - */ - public String getMiddleName() { - return middleName; - } - - /** - * @param middleName - * the middleName to set - */ - public void setMiddleName(String 
middleName) { - this.middleName = middleName; - } - - /** - * @return the firstName - */ - public String getFirstName() { - return firstName; - } - - /** - * @param firstName - * the firstName to set - */ - public void setFirstName(String firstName) { - this.firstName = firstName; - } - - /** - * @return the affiliations - */ - public List getAffiliations() { - return affiliations; - } - - /** - * @param affiliations - * the affiliations to set - */ - public void setAffiliations(List affiliations) { - this.affiliations = affiliations; - } - - /* - * (non-Javadoc) - * - * @see java.lang.Object#toString() - */ - @Override - public String toString() { - return "Author [surName=" + surName + ", middleName=" + middleName != null ? middleName - : "" + ", firstName=" + firstName + ", affiliations=" + affiliations - + "]"; - } - - } - - class Affiliation { - - private OrgName orgName; - - private Address address; - - public Affiliation() { - this.orgName = new OrgName(); - this.address = new Address(); - } - - /** - * @return the orgName - */ - public OrgName getOrgName() { - return orgName; - } - - /** - * @param orgName - * the orgName to set - */ - public void setOrgName(OrgName orgName) { - this.orgName = orgName; - } - - /** - * @return the address - */ - public Address getAddress() { - return address; - } - - /** - * @param address - * the address to set - */ - public void setAddress(Address address) { - this.address = address; - } - - /* - * (non-Javadoc) - * - * @see java.lang.Object#equals(java.lang.Object) - */ - @Override - public boolean equals(Object obj) { - Affiliation otherA = (Affiliation) obj; - return this.getAddress().equals(otherA.getAddress()) - && this.getOrgName().equals(otherA.getOrgName()); - - } - - /* - * (non-Javadoc) - * - * @see java.lang.Object#toString() - */ - @Override - public String toString() { - return "Affiliation {orgName=" + orgName + ", address=" + address + "}"; - } - - } - - class OrgName { - private List typeNames; - - public 
OrgName() { - this.typeNames = new ArrayList(); - } - - /** - * @return the typeNames - */ - public List getTypeNames() { - return typeNames; - } - - /** - * @param typeNames - * the typeNames to set - */ - public void setTypeNames(List typeNames) { - this.typeNames = typeNames; - } - - /* - * (non-Javadoc) - * - * @see java.lang.Object#toString() - */ - - @Override - public String toString() { - StringBuilder builder = new StringBuilder(); - for (OrgTypeName on : this.typeNames) { - builder.append(on.getName()); - builder.append(" "); - } - return builder.toString(); - } - - /* - * (non-Javadoc) - * - * @see java.lang.Object#equals(java.lang.Object) - */ - @Override - public boolean equals(Object obj) { - OrgName otherA = (OrgName) obj; - - if (otherA.getTypeNames() != null) { - if (this.typeNames == null) { - return false; - } else { - return this.typeNames.size() == otherA.getTypeNames().size(); - } - } else { - if (this.typeNames == null) { - return true; - } else - return false; - } - - } - - } - - class OrgTypeName { - private String name; - private String type; - - public OrgTypeName() { - this.name = null; - this.type = null; - } - - /** - * @return the name - */ - public String getName() { - return name; - } - - /** - * @param name - * the name to set - */ - public void setName(String name) { - this.name = name; - } - - /** - * @return the type - */ - public String getType() { - return type; - } - - /** - * @param type - * the type to set - */ - public void setType(String type) { - this.type = type; - } - - /* - * (non-Javadoc) - * - * @see java.lang.Object#equals(java.lang.Object) - */ - @Override - public boolean equals(Object obj) { - OrgTypeName otherOrgName = (OrgTypeName) obj; - return this.type.equals(otherOrgName.getType()) - && this.name.equals(otherOrgName.getName()); - } - - } - - private class Address { - - private String region; - private String postCode; - private String settlment; - private Country country; - - public Address() { - 
this.region = null; - this.postCode = null; - this.settlment = null; - this.country = new Country(); - } - - /** - * @return the region - */ - public String getRegion() { - return region; - } - - /** - * @param region - * the region to set - */ - public void setRegion(String region) { - this.region = region; - } - - /** - * @return the postCode - */ - public String getPostCode() { - return postCode; - } - - /** - * @param postCode - * the postCode to set - */ - public void setPostCode(String postCode) { - this.postCode = postCode; - } - - /** - * @return the settlment - */ - public String getSettlment() { - return settlment; - } - - /** - * @param settlment - * the settlment to set - */ - public void setSettlment(String settlment) { - this.settlment = settlment; - } - - /** - * @return the country - */ - public Country getCountry() { - return country; - } - - /** - * @param country - * the country to set - */ - public void setCountry(Country country) { - this.country = country; - } - - /* - * (non-Javadoc) - * - * @see java.lang.Object#equals(java.lang.Object) - */ - @Override - public boolean equals(Object obj) { - Address otherA = (Address) obj; - if (this.settlment == null) { - return otherA.getSettlment() == null; - } else if (this.country == null) { - return otherA.getCountry() == null; - } else if (this.postCode == null) { - return otherA.getPostCode() == null; - } else if (this.region == null) { - return otherA.getRegion() == null; - } - - return this.settlment.equals(otherA.getSettlment()) - && this.country.equals(otherA.getCountry()) - && this.postCode.equals(otherA.getPostCode()) - && this.region.equals(otherA.getRegion()); - } - - /* - * (non-Javadoc) - * - * @see java.lang.Object#toString() - */ - @Override - public String toString() { - StringBuilder builder = new StringBuilder(); - builder.append(settlment); - builder.append(", "); - builder.append(region); - builder.append(" "); - builder.append(postCode); - builder.append(" "); - 
builder.append(country.getContent()); - return builder.toString(); - } - } - - private class Country { - private String key; - private String content; - - public Country() { - this.key = null; - this.content = null; - } - - /** - * @return the key - */ - public String getKey() { - return key; - } - - /** - * @param key - * the key to set - */ - public void setKey(String key) { - this.key = key; - } - - /** - * @return the content - */ - public String getContent() { - return content; - } - - /** - * @param content - * the content to set - */ - public void setContent(String content) { - this.content = content; - } - - /* - * (non-Javadoc) - * - * @see java.lang.Object#equals(java.lang.Object) - */ - @Override - public boolean equals(Object obj) { - Country otherC = (Country) obj; - - if (this.key == null) { - if (otherC.getKey() != null) { - return false; - } else { - if (this.content == null) { - if (otherC.getContent() != null) { - return false; - } else { - return true; - } - } else { - return content.equals(otherC.getContent()); - } - } - } else { - if (this.content == null) { - if (otherC.getContent() != null) { - return false; - } else { - return this.key.equals(otherC.getKey()); - } - } else { - return this.key.equals(otherC.getKey()) - && this.content.equals(otherC.getContent()); - } - } - } - - } -} diff --git a/tika-parsers/src/main/java/org/apache/tika/parser/jpeg/JpegParser.java b/tika-parsers/src/main/java/org/apache/tika/parser/jpeg/JpegParser.java index 4e76b5c..788e367 100644 --- a/tika-parsers/src/main/java/org/apache/tika/parser/jpeg/JpegParser.java +++ b/tika-parsers/src/main/java/org/apache/tika/parser/jpeg/JpegParser.java @@ -36,13 +36,11 @@ public class JpegParser extends AbstractParser { - /** - * Serial version UID - */ + /** Serial version UID */ private static final long serialVersionUID = -1355028253756234603L; private static final Set SUPPORTED_TYPES = - Collections.singleton(MediaType.image("jpeg")); + 
Collections.singleton(MediaType.image("jpeg")); public Set getSupportedTypes(ParseContext context) { return SUPPORTED_TYPES; diff --git a/tika-parsers/src/main/java/org/apache/tika/parser/mail/MailContentHandler.java b/tika-parsers/src/main/java/org/apache/tika/parser/mail/MailContentHandler.java index 5369c1d..a1b829d 100644 --- a/tika-parsers/src/main/java/org/apache/tika/parser/mail/MailContentHandler.java +++ b/tika-parsers/src/main/java/org/apache/tika/parser/mail/MailContentHandler.java @@ -36,13 +36,14 @@ import org.apache.james.mime4j.stream.BodyDescriptor; import org.apache.james.mime4j.stream.Field; import org.apache.tika.config.TikaConfig; -import org.apache.tika.extractor.EmbeddedDocumentExtractor; -import org.apache.tika.extractor.ParsingEmbeddedDocumentExtractor; +import org.apache.tika.exception.TikaException; import org.apache.tika.metadata.Metadata; import org.apache.tika.metadata.TikaCoreProperties; import org.apache.tika.parser.AutoDetectParser; import org.apache.tika.parser.ParseContext; import org.apache.tika.parser.Parser; +import org.apache.tika.sax.BodyContentHandler; +import org.apache.tika.sax.EmbeddedContentHandler; import org.apache.tika.sax.XHTMLContentHandler; import org.xml.sax.SAXException; @@ -56,47 +57,38 @@ private boolean strictParsing = false; private XHTMLContentHandler handler; + private ParseContext context; private Metadata metadata; - private EmbeddedDocumentExtractor extractor; + private TikaConfig tikaConfig = null; private boolean inPart = false; - + MailContentHandler(XHTMLContentHandler xhtml, Metadata metadata, ParseContext context, boolean strictParsing) { this.handler = xhtml; + this.context = context; this.metadata = metadata; this.strictParsing = strictParsing; - - // Fetch / Build an EmbeddedDocumentExtractor with which - // to handle/process the parts/attachments - - // Was an EmbeddedDocumentExtractor explicitly supplied? 
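The constructor logic removed here resolves a parser by probing the ParseContext in priority order: an explicitly supplied extractor, then an AutoDetectParser, then any Parser, and finally a default built from TikaConfig. That class-keyed lookup-with-fallback pattern can be sketched standalone; `FakeContext`, `AutoDetectParser`, and `Parser` below are hypothetical stand-ins named after Tika's classes, not the real API:

```java
import java.util.HashMap;
import java.util.Map;

// Minimal stand-in for Tika's ParseContext: a class-keyed registry.
class FakeContext {
    private final Map<Class<?>, Object> entries = new HashMap<>();

    <T> void set(Class<T> key, T value) {
        entries.put(key, value);
    }

    <T> T get(Class<T> key) {
        return key.cast(entries.get(key));
    }
}

public class FallbackDemo {
    interface Parser {
        String name();
    }

    // Stand-in for the more specific parser type checked first.
    interface AutoDetectParser extends Parser {
    }

    // Probe the context in priority order, falling back to a default,
    // the way MailContentHandler checks AutoDetectParser, then Parser,
    // then TikaConfig.getDefaultConfig().
    static Parser choose(FakeContext ctx) {
        Parser p = ctx.get(AutoDetectParser.class); // most specific first
        if (p == null) {
            p = ctx.get(Parser.class);              // then any parser
        }
        return p != null ? p : () -> "default";     // finally a built-in default
    }

    public static void main(String[] args) {
        FakeContext ctx = new FakeContext();
        System.out.println(choose(ctx).name()); // default
        ctx.set(Parser.class, () -> "plain");
        System.out.println(choose(ctx).name()); // plain
        ctx.set(AutoDetectParser.class, () -> "auto");
        System.out.println(choose(ctx).name()); // auto
    }
}
```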
- this.extractor = context.get(EmbeddedDocumentExtractor.class); - - // If there's no EmbeddedDocumentExtractor, then try using a normal parser - // This will ensure that the contents are made available to the user, so - // the see the text, but without fine-grained control/extraction - // (This also maintains backward compatibility with older versions!) - if (this.extractor == null) { - // If the user gave a parser, use that, if not the default - Parser parser = context.get(AutoDetectParser.class); - if (parser == null) { - parser = context.get(Parser.class); - } - if (parser == null) { - TikaConfig tikaConfig = context.get(TikaConfig.class); - if (tikaConfig == null) { - tikaConfig = TikaConfig.getDefaultConfig(); - } - parser = new AutoDetectParser(tikaConfig.getParser()); - } - ParseContext ctx = new ParseContext(); - ctx.set(Parser.class, parser); - extractor = new ParsingEmbeddedDocumentExtractor(ctx); - } } public void body(BodyDescriptor body, InputStream is) throws MimeException, IOException { + // Work out the best underlying parser for the part + // Check first for a specified AutoDetectParser (which may have a + // specific Config), then a recursing parser, and finally the default + Parser parser = context.get(AutoDetectParser.class); + if (parser == null) { + parser = context.get(Parser.class); + } + if (parser == null) { + if (tikaConfig == null) { + tikaConfig = context.get(TikaConfig.class); + if (tikaConfig == null) { + tikaConfig = TikaConfig.getDefaultConfig(); + } + } + parser = tikaConfig.getParser(); + } + // use a different metadata object // in order to specify the mime type of the // sub part without damaging the main metadata @@ -106,10 +98,11 @@ submd.set(Metadata.CONTENT_ENCODING, body.getCharset()); try { - if (extractor.shouldParseEmbedded(submd)) { - extractor.parseEmbedded(is, handler, submd, false); - } - } catch (SAXException e) { + BodyContentHandler bch = new BodyContentHandler(handler); + parser.parse(is, new 
EmbeddedContentHandler(bch), submd, context); + } catch (SAXException e) { + throw new MimeException(e); + } catch (TikaException e) { throw new MimeException(e); } } @@ -151,10 +144,11 @@ /** * Header for the whole message or its parts - * - * @see http://james.apache.org/mime4j/apidocs/org/apache/james/mime4j/parser/ - * Field.html - */ + * + * @see http + * ://james.apache.org/mime4j/apidocs/org/apache/james/mime4j/parser/ + * Field.html + **/ public void field(Field field) throws MimeException { // inPart indicates whether these metadata correspond to the // whole message or its parts @@ -170,8 +164,8 @@ MailboxListField fromField = (MailboxListField) parsedField; MailboxList mailboxList = fromField.getMailboxList(); if (fromField.isValidField() && mailboxList != null) { - for (Address address : mailboxList) { - String from = getDisplayString(address); + for (int i = 0; i < mailboxList.size(); i++) { + String from = getDisplayString(mailboxList.get(i)); metadata.add(Metadata.MESSAGE_FROM, from); metadata.add(TikaCoreProperties.CREATOR, from); } @@ -207,12 +201,12 @@ } private void processAddressList(ParsedField field, String addressListType, - String metadataField) throws MimeException { + String metadataField) throws MimeException { AddressListField toField = (AddressListField) field; if (toField.isValidField()) { AddressList addressList = toField.getAddressList(); - for (Address address : addressList) { - metadata.add(metadataField, getDisplayString(address)); + for (int i = 0; i < addressList.size(); ++i) { + metadata.add(metadataField, getDisplayString(addressList.get(i))); } } else { String to = stripOutFieldPrefix(field, @@ -265,7 +259,7 @@ private String stripOutFieldPrefix(Field field, String fieldname) { String temp = field.getRaw().toString(); int loc = fieldname.length(); - while (temp.charAt(loc) == ' ') { + while (temp.charAt(loc) ==' ') { loc++; } return temp.substring(loc); diff --git 
a/tika-parsers/src/main/java/org/apache/tika/parser/mail/RFC822Parser.java b/tika-parsers/src/main/java/org/apache/tika/parser/mail/RFC822Parser.java index 6299d3f..4fdfc06 100644 --- a/tika-parsers/src/main/java/org/apache/tika/parser/mail/RFC822Parser.java +++ b/tika-parsers/src/main/java/org/apache/tika/parser/mail/RFC822Parser.java @@ -25,7 +25,7 @@ import org.apache.james.mime4j.parser.MimeStreamParser; import org.apache.james.mime4j.stream.MimeConfig; import org.apache.tika.exception.TikaException; -import org.apache.tika.io.TikaInputStream; +import org.apache.tika.io.TaggedInputStream; import org.apache.tika.metadata.Metadata; import org.apache.tika.mime.MediaType; import org.apache.tika.parser.AbstractParser; @@ -37,16 +37,15 @@ /** * Uses apache-mime4j to parse emails. Each part is treated with the * corresponding parser and displayed within elements. - *

    + * <p>
    * A {@link MimeEntityConfig} object can be passed in the parsing context * to better control the parsing process. * * @author jnioche@digitalpebble.com */ public class RFC822Parser extends AbstractParser { - /** - * Serial version UID - */ + + /** Serial version UID */ private static final long serialVersionUID = -5504243905998074168L; private static final Set SUPPORTED_TYPES = Collections @@ -57,7 +56,7 @@ } public void parse(InputStream stream, ContentHandler handler, - Metadata metadata, ParseContext context) throws IOException, + Metadata metadata, ParseContext context) throws IOException, SAXException, TikaException { // Get the mime4j configuration, or use a default one MimeConfig config = new MimeConfig(); @@ -72,12 +71,11 @@ xhtml, metadata, context, config.isStrictParsing()); parser.setContentHandler(mch); parser.setContentDecoding(true); - - TikaInputStream tstream = TikaInputStream.get(stream); + TaggedInputStream tagged = TaggedInputStream.get(stream); try { - parser.parse(tstream); + parser.parse(tagged); } catch (IOException e) { - tstream.throwIfCauseOf(e); + tagged.throwIfCauseOf(e); throw new TikaException("Failed to parse an email message", e); } catch (MimeException e) { // Unwrap the exception in case it was not thrown by mime4j diff --git a/tika-parsers/src/main/java/org/apache/tika/parser/mat/MatParser.java b/tika-parsers/src/main/java/org/apache/tika/parser/mat/MatParser.java deleted file mode 100644 index 354356a..0000000 --- a/tika-parsers/src/main/java/org/apache/tika/parser/mat/MatParser.java +++ /dev/null @@ -1,133 +0,0 @@ -/* - * Licensed to the Apache Software Foundation (ASF) under one or more - * contributor license agreements. See the NOTICE file distributed with - * this work for additional information regarding copyright ownership. - * The ASF licenses this file to You under the Apache License, Version 2.0 - * (the "License"); you may not use this file except in compliance with - * the License. 
You may obtain a copy of the License at - * - * http://www.apache.org/licenses/LICENSE-2.0 - * - * Unless required by applicable law or agreed to in writing, software - * distributed under the License is distributed on an "AS IS" BASIS, - * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. - * See the License for the specific language governing permissions and - * limitations under the License. - */ -package org.apache.tika.parser.mat; - -//JDK imports -import java.io.IOException; -import java.io.InputStream; -import java.util.Collections; -import java.util.Set; -import java.util.Map; - -import org.apache.tika.exception.TikaException; -import org.apache.tika.io.TikaInputStream; -import org.apache.tika.metadata.Metadata; -import org.apache.tika.parser.AbstractParser; -import org.apache.tika.parser.ParseContext; -import org.apache.tika.mime.MediaType; -import org.apache.tika.sax.XHTMLContentHandler; -import org.xml.sax.ContentHandler; -import org.xml.sax.SAXException; - -//JMatIO imports -import com.jmatio.io.MatFileHeader; -import com.jmatio.io.MatFileReader; -import com.jmatio.types.MLArray; -import com.jmatio.types.MLStructure; - -import static java.nio.charset.StandardCharsets.UTF_8; - - -public class MatParser extends AbstractParser { - - public static final String MATLAB_MIME_TYPE = - "application/x-matlab-data"; - - private final Set SUPPORTED_TYPES = - Collections.singleton(MediaType.application("x-matlab-data")); - - public Set getSupportedTypes(ParseContext context){ - return SUPPORTED_TYPES; - } - - public void parse(InputStream stream, ContentHandler handler, Metadata metadata, ParseContext context) - throws IOException, SAXException, TikaException { - - //Set MIME type as Matlab - metadata.set(Metadata.CONTENT_TYPE, MATLAB_MIME_TYPE); - - try { - // Use TIS so we can spool a temp file for parsing. 
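The MatParser being deleted in this hunk reads metadata out of a single comma-separated MAT-file description line; its own comment gives the sample header "MATLAB 5.0 MAT-file, Platform: MACI64, Created on: Sun Mar 2 23:41:57 2014". The field extraction can be exercised without JMatIO; this sketch mirrors the split-and-substring logic:

```java
import java.util.HashMap;
import java.util.Map;

public class MatHeaderDemo {
    // Split a MAT-file description the way the deleted MatParser does:
    // part 0 is the file type, part 1 the platform, part 2 the creation date.
    static Map<String, String> parseHeader(String description) {
        Map<String, String> meta = new HashMap<>();
        String[] parts = description.split(",");
        if (parts[0].contains("MATLAB")) {
            meta.put("fileType", parts[0].trim());
        }
        if (parts.length > 1 && parts[1].contains("Platform:")) {
            int at = parts[1].lastIndexOf("Platform:") + "Platform:".length();
            meta.put("platform", parts[1].substring(at).trim());
        }
        if (parts.length > 2 && parts[2].contains("Created on:")) {
            int at = parts[2].lastIndexOf("Created on:") + "Created on:".length();
            meta.put("createdOn", parts[2].substring(at).trim());
        }
        return meta;
    }

    public static void main(String[] args) {
        Map<String, String> m = parseHeader(
                "MATLAB 5.0 MAT-file, Platform: MACI64, Created on: Sun Mar 2 23:41:57 2014");
        System.out.println(m.get("fileType"));  // MATLAB 5.0 MAT-file
        System.out.println(m.get("platform"));  // MACI64
        System.out.println(m.get("createdOn")); // Sun Mar 2 23:41:57 2014
    }
}
```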
- TikaInputStream tis = TikaInputStream.get(stream); - - //Extract information from header file - MatFileReader mfr = new MatFileReader(tis.getFile()); //input .mat file - MatFileHeader hdr = mfr.getMatFileHeader(); //.mat header information - - // Example header: "MATLAB 5.0 MAT-file, Platform: MACI64, Created on: Sun Mar 2 23:41:57 2014" - String[] parts = hdr.getDescription().split(","); // Break header information into its parts - - if (parts[2].contains("Created")) { - int lastIndex1 = parts[2].lastIndexOf("Created on:"); - String dateCreated = parts[2].substring(lastIndex1 + "Created on:".length()).trim(); - metadata.set("createdOn", dateCreated); - } - - if (parts[1].contains("Platform")) { - int lastIndex2 = parts[1].lastIndexOf("Platform:"); - String platform = parts[1].substring(lastIndex2 + "Platform:".length()).trim(); - metadata.set("platform" , platform); - } - - if (parts[0].contains("MATLAB")) { - metadata.set("fileType", parts[0]); - } - - // Get endian indicator from header file - String endianBytes = new String(hdr.getEndianIndicator(), UTF_8); // Retrieve endian bytes and convert to string - String endianCode = String.valueOf(endianBytes.toCharArray()); // Convert bytes to characters to string - metadata.set("endian", endianCode); - - //Text output - XHTMLContentHandler xhtml = new XHTMLContentHandler(handler, metadata); - xhtml.startDocument(); - xhtml.newline(); - //Loop through each variable - for (Map.Entry entry : mfr.getContent().entrySet()) { - String varName = entry.getKey(); - MLArray varData = entry.getValue(); - - xhtml.element("p", varName + ":" + String.valueOf(varData)); - - // If the variable is a structure, extract variable info from structure - if (varData.isStruct()){ - MLStructure mlStructure = (MLStructure) mfr.getMLArray(varName); - xhtml.startElement("ul"); - xhtml.newline(); - for (MLArray element : mlStructure.getAllFields()){ - xhtml.startElement("li"); - xhtml.characters(String.valueOf(element)); - - // If there is an 
embedded structure, extract variable info. - if (element.isStruct()){ - xhtml.startElement("ul"); - // Should this actually be a recursive call? - xhtml.element("li", element.contentToString()); - xhtml.endElement("ul"); - } - - xhtml.endElement("li"); - } - xhtml.endElement("ul"); - } - } - xhtml.endDocument(); - } catch (IOException e) { - throw new TikaException("Error parsing Matlab file with MatParser", e); - } - } -} \ No newline at end of file diff --git a/tika-parsers/src/main/java/org/apache/tika/parser/mbox/MboxParser.java b/tika-parsers/src/main/java/org/apache/tika/parser/mbox/MboxParser.java index 86b1dd4..fb8b41a 100644 --- a/tika-parsers/src/main/java/org/apache/tika/parser/mbox/MboxParser.java +++ b/tika-parsers/src/main/java/org/apache/tika/parser/mbox/MboxParser.java @@ -17,27 +17,20 @@ package org.apache.tika.parser.mbox; import java.io.BufferedReader; -import java.io.ByteArrayInputStream; -import java.io.ByteArrayOutputStream; import java.io.IOException; import java.io.InputStream; import java.io.InputStreamReader; +import java.io.UnsupportedEncodingException; import java.text.ParseException; import java.text.SimpleDateFormat; import java.util.Collections; import java.util.Date; -import java.util.HashMap; -import java.util.LinkedList; import java.util.Locale; -import java.util.Map; -import java.util.Queue; import java.util.Set; import java.util.regex.Matcher; import java.util.regex.Pattern; import org.apache.tika.exception.TikaException; -import org.apache.tika.extractor.EmbeddedDocumentExtractor; -import org.apache.tika.extractor.ParsingEmbeddedDocumentExtractor; import org.apache.tika.metadata.Metadata; import org.apache.tika.metadata.TikaCoreProperties; import org.apache.tika.mime.MediaType; @@ -48,143 +41,185 @@ import org.xml.sax.SAXException; /** - * Mbox (mailbox) parser. This version extracts each mail from Mbox and uses the - * DelegatingParser to process each mail. + * Mbox (mailbox) parser. 
This version returns the headers for the first email + * via metadata, which means headers from subsequent emails will be lost. */ public class MboxParser extends AbstractParser { + /** Serial version UID */ + private static final long serialVersionUID = -1762689436731160661L; + + private static final Set SUPPORTED_TYPES = + Collections.singleton(MediaType.application("mbox")); + public static final String MBOX_MIME_TYPE = "application/mbox"; public static final String MBOX_RECORD_DIVIDER = "From "; - public static final int MAIL_MAX_SIZE = 50000000; - /** - * Serial version UID - */ - private static final long serialVersionUID = -1762689436731160661L; - private static final Set SUPPORTED_TYPES = Collections.singleton(MediaType.application("mbox")); private static final Pattern EMAIL_HEADER_PATTERN = Pattern.compile("([^ ]+):[ \t]*(.*)"); private static final Pattern EMAIL_ADDRESS_PATTERN = Pattern.compile("<(.*@.*)>"); private static final String EMAIL_HEADER_METADATA_PREFIX = "MboxParser-"; private static final String EMAIL_FROMLINE_METADATA = EMAIL_HEADER_METADATA_PREFIX + "from"; - private final Map trackingMetadata = new HashMap(); - private boolean tracking = false; - - public static Date parseDate(String headerContent) throws ParseException { - SimpleDateFormat dateFormat = new SimpleDateFormat("EEE, d MMM yyyy HH:mm:ss Z", Locale.US); - return dateFormat.parse(headerContent); + + private enum ParseStates { + START, IN_HEADER, IN_CONTENT } public Set getSupportedTypes(ParseContext context) { return SUPPORTED_TYPES; } - public void parse(InputStream stream, ContentHandler handler, Metadata metadata, ParseContext context) + public void parse( + InputStream stream, ContentHandler handler, + Metadata metadata, ParseContext context) throws IOException, TikaException, SAXException { - EmbeddedDocumentExtractor extractor = context.get(EmbeddedDocumentExtractor.class, - new ParsingEmbeddedDocumentExtractor(context)); - - String charsetName = "windows-1252"; + 
InputStreamReader isr; + try { + // Headers are going to be 7-bit ascii + isr = new InputStreamReader(stream, "US-ASCII"); + } catch (UnsupportedEncodingException e) { + throw new TikaException("US-ASCII is not supported!", e); + } + + BufferedReader reader = new BufferedReader(isr); metadata.set(Metadata.CONTENT_TYPE, MBOX_MIME_TYPE); - metadata.set(Metadata.CONTENT_ENCODING, charsetName); + metadata.set(Metadata.CONTENT_ENCODING, "us-ascii"); XHTMLContentHandler xhtml = new XHTMLContentHandler(handler, metadata); xhtml.startDocument(); - InputStreamReader isr = new InputStreamReader(stream, charsetName); - try (BufferedReader reader = new BufferedReader(isr)) { - String curLine = reader.readLine(); - int mailItem = 0; - do { - if (curLine.startsWith(MBOX_RECORD_DIVIDER)) { - Metadata mailMetadata = new Metadata(); - Queue multiline = new LinkedList(); - mailMetadata.add(EMAIL_FROMLINE_METADATA, curLine.substring(MBOX_RECORD_DIVIDER.length())); - mailMetadata.set(Metadata.CONTENT_TYPE, "message/rfc822"); - curLine = reader.readLine(); - - ByteArrayOutputStream message = new ByteArrayOutputStream(100000); - do { - if (curLine.startsWith(" ") || curLine.startsWith("\t")) { - String latestLine = multiline.poll(); - latestLine += " " + curLine.trim(); - multiline.add(latestLine); - } else { - multiline.add(curLine); + ParseStates parseState = ParseStates.START; + String multiLine = null; + boolean inQuote = false; + int numEmails = 0; + + // We're going to scan, line-by-line, for a line that starts with + // "From " + for (String curLine = reader.readLine(); curLine != null; curLine = reader.readLine()) { + boolean newMessage = curLine.startsWith(MBOX_RECORD_DIVIDER); + if (newMessage) { + numEmails += 1; + } + + switch (parseState) { + case START: + if (newMessage) { + parseState = ParseStates.IN_HEADER; + newMessage = false; + // Fall through to IN_HEADER + } else { + break; + } + + case IN_HEADER: + if (newMessage) { + saveHeaderInMetadata(numEmails, metadata, 
multiLine); + multiLine = curLine; + } else if (curLine.length() == 0) { + // Blank line is signal that we're transitioning to the content. + saveHeaderInMetadata(numEmails, metadata, multiLine); + parseState = ParseStates.IN_CONTENT; + + // Mimic what PackageParser does between entries. + xhtml.startElement("div", "class", "email-entry"); + xhtml.startElement("p"); + inQuote = false; + } else if (curLine.startsWith(" ") || curLine.startsWith("\t")) { + multiLine += " " + curLine.trim(); + } else { + saveHeaderInMetadata(numEmails, metadata, multiLine); + multiLine = curLine; + } + + break; + + // TODO - use real email parsing support so we can correctly handle + // things like multipart messages and quoted-printable encoding. + // We'd also want this for charset handling, where content isn't 7-bit + // ascii. + case IN_CONTENT: + if (newMessage) { + endMessage(xhtml, inQuote); + parseState = ParseStates.IN_HEADER; + multiLine = curLine; + } else { + boolean quoted = curLine.startsWith(">"); + if (inQuote) { + if (!quoted) { + xhtml.endElement("q"); + inQuote = false; } - - message.write(curLine.getBytes(charsetName)); - message.write(0x0A); - curLine = reader.readLine(); + } else if (quoted) { + xhtml.startElement("q"); + inQuote = true; } - while (curLine != null && !curLine.startsWith(MBOX_RECORD_DIVIDER) && message.size() < MAIL_MAX_SIZE); - - for (String item : multiline) { - saveHeaderInMetadata(mailMetadata, item); - } - - ByteArrayInputStream messageStream = new ByteArrayInputStream(message.toByteArray()); - message = null; - - if (extractor.shouldParseEmbedded(mailMetadata)) { - extractor.parseEmbedded(messageStream, xhtml, mailMetadata, true); - } - - if (tracking) { - getTrackingMetadata().put(mailItem++, mailMetadata); - } - } else { - curLine = reader.readLine(); + + xhtml.characters(curLine); + + // For plain text email, each line is a real break position. 
+ xhtml.element("br", ""); } - - } while (curLine != null && !Thread.currentThread().isInterrupted()); + } + } + + if (parseState == ParseStates.IN_HEADER) { + saveHeaderInMetadata(numEmails, metadata, multiLine); + } else if (parseState == ParseStates.IN_CONTENT) { + endMessage(xhtml, inQuote); } xhtml.endDocument(); } - public boolean isTracking() { - return tracking; - } - - public void setTracking(boolean tracking) { - this.tracking = tracking; - } - - public Map getTrackingMetadata() { - return trackingMetadata; - } - - private void saveHeaderInMetadata(Metadata metadata, String curLine) { + private void endMessage(XHTMLContentHandler xhtml, boolean inQuote) throws SAXException { + if (inQuote) { + xhtml.endElement("q"); + } + + xhtml.endElement("p"); + xhtml.endElement("div"); + } + + private void saveHeaderInMetadata(int numEmails, Metadata metadata, String curLine) { + if ((curLine == null) || (numEmails > 1)) { + return; + } else if (curLine.startsWith(MBOX_RECORD_DIVIDER)) { + metadata.add(EMAIL_FROMLINE_METADATA, curLine.substring(MBOX_RECORD_DIVIDER.length())); + return; + } + Matcher headerMatcher = EMAIL_HEADER_PATTERN.matcher(curLine); if (!headerMatcher.matches()) { return; // ignore malformed header lines } - String headerTag = headerMatcher.group(1).toLowerCase(Locale.ROOT); + String headerTag = headerMatcher.group(1).toLowerCase(); String headerContent = headerMatcher.group(2); if (headerTag.equalsIgnoreCase("From")) { metadata.set(TikaCoreProperties.CREATOR, headerContent); - } else if (headerTag.equalsIgnoreCase("To") || headerTag.equalsIgnoreCase("Cc") - || headerTag.equalsIgnoreCase("Bcc")) { + } else if (headerTag.equalsIgnoreCase("To") || + headerTag.equalsIgnoreCase("Cc") || + headerTag.equalsIgnoreCase("Bcc")) { Matcher address = EMAIL_ADDRESS_PATTERN.matcher(headerContent); - if (address.find()) { - metadata.add(Metadata.MESSAGE_RECIPIENT_ADDRESS, address.group(1)); - } else if (headerContent.indexOf('@') > -1) { - 
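The rewritten saveHeaderInMetadata relies on two regexes, `([^ ]+):[ \t]*(.*)` for `Name: value` headers and `<(.*@.*)>` for bracketed addresses, plus the earlier IN_HEADER logic that unfolds RFC 5322 continuation lines (leading space or tab) into `multiLine`. Both steps in isolation, as a sketch:

```java
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

public class HeaderDemo {
    // Same patterns as MboxParser's EMAIL_HEADER_PATTERN / EMAIL_ADDRESS_PATTERN.
    static final Pattern HEADER = Pattern.compile("([^ ]+):[ \t]*(.*)");
    static final Pattern ADDRESS = Pattern.compile("<(.*@.*)>");

    // Unfold RFC 5322 headers: a line starting with space or tab continues
    // the previous header line.
    static List<String> unfold(List<String> raw) {
        List<String> out = new ArrayList<>();
        for (String line : raw) {
            if ((line.startsWith(" ") || line.startsWith("\t")) && !out.isEmpty()) {
                out.set(out.size() - 1, out.get(out.size() - 1) + " " + line.trim());
            } else {
                out.add(line);
            }
        }
        return out;
    }

    public static void main(String[] args) {
        List<String> unfolded = unfold(List.of(
                "To: Alice Example",
                "\t<alice@example.com>"));
        Matcher h = HEADER.matcher(unfolded.get(0));
        if (h.matches()) {
            System.out.println(h.group(1)); // To
            Matcher a = ADDRESS.matcher(h.group(2));
            if (a.find()) {
                System.out.println(a.group(1)); // alice@example.com
            }
        }
    }
}
```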
metadata.add(Metadata.MESSAGE_RECIPIENT_ADDRESS, headerContent); - } - + if(address.find()) { + metadata.add(Metadata.MESSAGE_RECIPIENT_ADDRESS, address.group(1)); + } else if(headerContent.indexOf('@') > -1) { + metadata.add(Metadata.MESSAGE_RECIPIENT_ADDRESS, headerContent); + } + String property = Metadata.MESSAGE_TO; if (headerTag.equalsIgnoreCase("Cc")) { - property = Metadata.MESSAGE_CC; + property = Metadata.MESSAGE_CC; } else if (headerTag.equalsIgnoreCase("Bcc")) { - property = Metadata.MESSAGE_BCC; + property = Metadata.MESSAGE_BCC; } metadata.add(property, headerContent); } else if (headerTag.equalsIgnoreCase("Subject")) { - metadata.add(Metadata.SUBJECT, headerContent); + // TODO Move to title in Tika 2.0 + metadata.add(TikaCoreProperties.TRANSITION_SUBJECT_TO_DC_TITLE, + headerContent); } else if (headerTag.equalsIgnoreCase("Date")) { try { Date date = parseDate(headerContent); @@ -206,4 +241,10 @@ metadata.add(EMAIL_HEADER_METADATA_PREFIX + headerTag, headerContent); } } + + public static Date parseDate(String headerContent) throws ParseException { + SimpleDateFormat dateFormat = new SimpleDateFormat("EEE, d MMM yyyy HH:mm:ss Z", Locale.US); + return dateFormat.parse(headerContent); + } + } diff --git a/tika-parsers/src/main/java/org/apache/tika/parser/mbox/OutlookPSTParser.java b/tika-parsers/src/main/java/org/apache/tika/parser/mbox/OutlookPSTParser.java deleted file mode 100644 index 5883bd5..0000000 --- a/tika-parsers/src/main/java/org/apache/tika/parser/mbox/OutlookPSTParser.java +++ /dev/null @@ -1,203 +0,0 @@ -/* - * Licensed to the Apache Software Foundation (ASF) under one or more - * contributor license agreements. See the NOTICE file distributed with - * this work for additional information regarding copyright ownership. - * The ASF licenses this file to You under the Apache License, Version 2.0 - * (the "License"); you may not use this file except in compliance with - * the License. 
You may obtain a copy of the License at - * - * http://www.apache.org/licenses/LICENSE-2.0 - * - * Unless required by applicable law or agreed to in writing, software - * distributed under the License is distributed on an "AS IS" BASIS, - * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. - * See the License for the specific language governing permissions and - * limitations under the License. - */ -package org.apache.tika.parser.mbox; - -import static java.lang.String.valueOf; -import static java.nio.charset.StandardCharsets.UTF_8; -import static java.util.Collections.singleton; - -import java.io.ByteArrayInputStream; -import java.io.File; -import java.io.IOException; -import java.io.InputStream; -import java.util.Set; - -import com.pff.PSTAttachment; -import com.pff.PSTFile; -import com.pff.PSTFolder; -import com.pff.PSTMessage; -import org.apache.tika.exception.TikaException; -import org.apache.tika.extractor.EmbeddedDocumentExtractor; -import org.apache.tika.extractor.ParsingEmbeddedDocumentExtractor; -import org.apache.tika.io.TemporaryResources; -import org.apache.tika.io.TikaInputStream; -import org.apache.tika.metadata.Metadata; -import org.apache.tika.metadata.TikaCoreProperties; -import org.apache.tika.mime.MediaType; -import org.apache.tika.parser.AbstractParser; -import org.apache.tika.parser.ParseContext; -import org.apache.tika.sax.XHTMLContentHandler; -import org.xml.sax.ContentHandler; -import org.xml.sax.SAXException; -import org.xml.sax.helpers.AttributesImpl; - -/** - * Parser for MS Outlook PST email storage files - */ -public class OutlookPSTParser extends AbstractParser { - - private static final long serialVersionUID = 620998217748364063L; - - public static final MediaType MS_OUTLOOK_PST_MIMETYPE = MediaType.application("vnd.ms-outlook-pst"); - private static final Set SUPPORTED_TYPES = singleton(MS_OUTLOOK_PST_MIMETYPE); - - private static AttributesImpl createAttribute(String attName, String attValue) { - 
AttributesImpl attributes = new AttributesImpl(); - attributes.addAttribute("", attName, attName, "CDATA", attValue); - return attributes; - } - - public Set getSupportedTypes(ParseContext context) { - return SUPPORTED_TYPES; - } - - public void parse(InputStream stream, ContentHandler handler, Metadata metadata, ParseContext context) - throws IOException, SAXException, TikaException { - - // Use the delegate parser to parse the contained document - EmbeddedDocumentExtractor embeddedExtractor = context.get(EmbeddedDocumentExtractor.class, - new ParsingEmbeddedDocumentExtractor(context)); - - metadata.set(Metadata.CONTENT_TYPE, MS_OUTLOOK_PST_MIMETYPE.toString()); - - XHTMLContentHandler xhtml = new XHTMLContentHandler(handler, metadata); - xhtml.startDocument(); - - TikaInputStream in = TikaInputStream.get(stream); - PSTFile pstFile = null; - try { - pstFile = new PSTFile(in.getFile().getPath()); - metadata.set(Metadata.CONTENT_LENGTH, valueOf(pstFile.getFileHandle().length())); - boolean isValid = pstFile.getFileHandle().getFD().valid(); - metadata.set("isValid", valueOf(isValid)); - if (isValid) { - parseFolder(xhtml, pstFile.getRootFolder(), embeddedExtractor); - } - } catch (Exception e) { - throw new TikaException(e.getMessage(), e); - } finally { - if (pstFile != null && pstFile.getFileHandle() != null) { - try { - pstFile.getFileHandle().close(); - } catch (IOException e) { - //swallow closing exception - } - } - } - - xhtml.endDocument(); - } - - private void parseFolder(XHTMLContentHandler handler, PSTFolder pstFolder, EmbeddedDocumentExtractor embeddedExtractor) - throws Exception { - if (pstFolder.getContentCount() > 0) { - PSTMessage pstMail = (PSTMessage) pstFolder.getNextChild(); - while (pstMail != null) { - AttributesImpl attributes = new AttributesImpl(); - attributes.addAttribute("", "class", "class", "CDATA", "embedded"); - attributes.addAttribute("", "id", "id", "CDATA", pstMail.getInternetMessageId()); - handler.startElement("div", attributes); 
- handler.element("h1", pstMail.getSubject()); - - parserMailItem(handler, pstMail, embeddedExtractor); - parseMailAttachments(handler, pstMail, embeddedExtractor); - - handler.endElement("div"); - - pstMail = (PSTMessage) pstFolder.getNextChild(); - } - } - - if (pstFolder.hasSubfolders()) { - for (PSTFolder pstSubFolder : pstFolder.getSubFolders()) { - handler.startElement("div", createAttribute("class", "email-folder")); - handler.element("h1", pstSubFolder.getDisplayName()); - parseFolder(handler, pstSubFolder, embeddedExtractor); - handler.endElement("div"); - } - } - } - - private void parserMailItem(XHTMLContentHandler handler, PSTMessage pstMail, EmbeddedDocumentExtractor embeddedExtractor) throws SAXException, IOException { - Metadata mailMetadata = new Metadata(); - mailMetadata.set(Metadata.RESOURCE_NAME_KEY, pstMail.getInternetMessageId()); - mailMetadata.set(Metadata.EMBEDDED_RELATIONSHIP_ID, pstMail.getInternetMessageId()); - mailMetadata.set(TikaCoreProperties.IDENTIFIER, pstMail.getInternetMessageId()); - mailMetadata.set(TikaCoreProperties.TITLE, pstMail.getSubject()); - mailMetadata.set(Metadata.MESSAGE_FROM, pstMail.getSenderName()); - mailMetadata.set(TikaCoreProperties.CREATOR, pstMail.getSenderName()); - mailMetadata.set(TikaCoreProperties.CREATED, pstMail.getCreationTime()); - mailMetadata.set(TikaCoreProperties.MODIFIED, pstMail.getLastModificationTime()); - mailMetadata.set(TikaCoreProperties.COMMENTS, pstMail.getComment()); - mailMetadata.set("descriptorNodeId", valueOf(pstMail.getDescriptorNodeId())); - mailMetadata.set("senderEmailAddress", pstMail.getSenderEmailAddress()); - mailMetadata.set("recipients", pstMail.getRecipientsString()); - mailMetadata.set("displayTo", pstMail.getDisplayTo()); - mailMetadata.set("displayCC", pstMail.getDisplayCC()); - mailMetadata.set("displayBCC", pstMail.getDisplayBCC()); - mailMetadata.set("importance", valueOf(pstMail.getImportance())); - mailMetadata.set("priority", valueOf(pstMail.getPriority())); 
- mailMetadata.set("flagged", valueOf(pstMail.isFlagged())); - - byte[] mailContent = pstMail.getBody().getBytes(UTF_8); - embeddedExtractor.parseEmbedded(new ByteArrayInputStream(mailContent), handler, mailMetadata, true); - } - - private void parseMailAttachments(XHTMLContentHandler xhtml, PSTMessage email, EmbeddedDocumentExtractor embeddedExtractor) - throws TikaException { - int numberOfAttachments = email.getNumberOfAttachments(); - for (int i = 0; i < numberOfAttachments; i++) { - File tempFile = null; - try { - PSTAttachment attach = email.getAttachment(i); - - // Get the filename; both long and short filenames can be used for attachments - String filename = attach.getLongFilename(); - if (filename.isEmpty()) { - filename = attach.getFilename(); - } - - xhtml.element("p", filename); - - Metadata attachMeta = new Metadata(); - attachMeta.set(Metadata.RESOURCE_NAME_KEY, filename); - attachMeta.set(Metadata.EMBEDDED_RELATIONSHIP_ID, filename); - AttributesImpl attributes = new AttributesImpl(); - attributes.addAttribute("", "class", "class", "CDATA", "embedded"); - attributes.addAttribute("", "id", "id", "CDATA", filename); - xhtml.startElement("div", attributes); - if (embeddedExtractor.shouldParseEmbedded(attachMeta)) { - TemporaryResources tmp = new TemporaryResources(); - try { - TikaInputStream tis = TikaInputStream.get(attach.getFileInputStream(), tmp); - embeddedExtractor.parseEmbedded(tis, xhtml, attachMeta, true); - } finally { - tmp.dispose(); - } - } - xhtml.endElement("div"); - - } catch (Exception e) { - throw new TikaException("Unable to unpack document stream", e); - } finally { - if (tempFile != null) - tempFile.delete(); - } - } - } - -} diff --git a/tika-parsers/src/main/java/org/apache/tika/parser/microsoft/AbstractListManager.java b/tika-parsers/src/main/java/org/apache/tika/parser/microsoft/AbstractListManager.java deleted file mode 100644 index 2c02dfc..0000000 --- 
a/tika-parsers/src/main/java/org/apache/tika/parser/microsoft/AbstractListManager.java +++ /dev/null @@ -1,269 +0,0 @@ -/* - * Licensed to the Apache Software Foundation (ASF) under one or more - * contributor license agreements. See the NOTICE file distributed with - * this work for additional information regarding copyright ownership. - * The ASF licenses this file to You under the Apache License, Version 2.0 - * (the "License"); you may not use this file except in compliance with - * the License. You may obtain a copy of the License at - * - * http://www.apache.org/licenses/LICENSE-2.0 - * - * Unless required by applicable law or agreed to in writing, software - * distributed under the License is distributed on an "AS IS" BASIS, - * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. - * See the License for the specific language governing permissions and - * limitations under the License. - */ -package org.apache.tika.parser.microsoft; - -import java.util.ArrayList; -import java.util.HashMap; -import java.util.List; -import java.util.Map; -import java.util.regex.Matcher; -import java.util.regex.Pattern; - -import org.apache.poi.hwpf.converter.NumberFormatter; - -public abstract class AbstractListManager { - private final static String BULLET = "\u00b7"; - - protected Map listLevelMap = new HashMap(); - protected Map overrideTupleMap = new HashMap(); - - //helper class that is docx/doc format agnostic - protected class ParagraphLevelCounter { - - //counts can == 0 if the format is decimal, make sure - //that flag values are < 0 - private final Integer NOT_SEEN_YET = -1; - private final Integer FIRST_SKIPPED = -2; - private final LevelTuple[] levelTuples; - Pattern LEVEL_INTERPOLATOR = Pattern.compile("%(\\d+)"); - private List counts = new ArrayList(); - private int lastLevel = -1; - - public ParagraphLevelCounter(LevelTuple[] levelTuples) { - this.levelTuples = levelTuples; - } - - public int getNumberOfLevels() { - return 
levelTuples.length; - } - - /** - * Apply this to every numbered paragraph in order. - * - * @param levelNumber level number that is being incremented - * @return the new formatted number string for this level - */ - public String incrementLevel(int levelNumber, LevelTuple[] overrideLevelTuples) { - - for (int i = lastLevel + 1; i < levelNumber; i++) { - if (i >= counts.size()) { - int val = getStart(i, overrideLevelTuples); - counts.add(i, val); - } else { - int count = counts.get(i); - if (count == NOT_SEEN_YET) { - count = getStart(i, overrideLevelTuples); - counts.set(i, count); - } - } - } - - if (levelNumber < counts.size()) { - resetAfter(levelNumber, overrideLevelTuples); - int count = counts.get(levelNumber); - if (count == NOT_SEEN_YET) { - count = getStart(levelNumber, overrideLevelTuples); - } else { - count++; - } - counts.set(levelNumber, count); - lastLevel = levelNumber; - return format(levelNumber, overrideLevelTuples); - } - - counts.add(levelNumber, getStart(levelNumber, overrideLevelTuples)); - lastLevel = levelNumber; - return format(levelNumber, overrideLevelTuples); - } - - /** - * @param level which level to format - * @return the string that represents the number and the surrounding text for this paragraph - */ - private String format(int level, LevelTuple[] overrideLevelTuples) { - if (level < 0 || level >= levelTuples.length) { - //log? - return ""; - } - boolean isLegal = (overrideLevelTuples != null) ? overrideLevelTuples[level].isLegal : levelTuples[level].isLegal; - //short circuit bullet - String numFmt = getNumFormat(level, isLegal, overrideLevelTuples); - if ("bullet".equals(numFmt)) { - return BULLET + " "; - } - - String lvlText = (overrideLevelTuples == null || overrideLevelTuples[level].lvlText == null) ? 
- levelTuples[level].lvlText : overrideLevelTuples[level].lvlText; - StringBuilder sb = new StringBuilder(); - Matcher m = LEVEL_INTERPOLATOR.matcher(lvlText); - int last = 0; - while (m.find()) { - sb.append(lvlText.substring(last, m.start())); - String lvlString = m.group(1); - int lvlNum = -1; - try { - lvlNum = Integer.parseInt(lvlString); - } catch (NumberFormatException e) { - //swallow - } - String numString = ""; - //need to subtract 1 because, e.g. %1 is the format - //for the number at array offset 0 - numString = formatNum(lvlNum - 1, isLegal, overrideLevelTuples); - - sb.append(numString); - last = m.end(); - } - sb.append(lvlText.substring(last)); - if (sb.length() > 0) { - //TODO: add in character after number - sb.append(" "); - } - return sb.toString(); - } - - //actual level number; can return empty string if numberformatter fails - private String formatNum(int lvlNum, boolean isLegal, LevelTuple[] overrideLevelTuples) { - - int numFmtStyle = 0; - String numFmt = getNumFormat(lvlNum, isLegal, overrideLevelTuples); - - int count = getCount(lvlNum); - if (count < 0) { - count = 1; - } - if ("lowerLetter".equals(numFmt)) { - numFmtStyle = 4; - } else if ("lowerRoman".equals(numFmt)) { - numFmtStyle = 2; - } else if ("decimal".equals(numFmt)) { - numFmtStyle = 0; - } else if ("upperLetter".equals(numFmt)) { - numFmtStyle = 3; - } else if ("upperRoman".equals(numFmt)) { - numFmtStyle = 1; - } else if ("bullet".equals(numFmt)) { - return ""; - //not yet handled by NumberFormatter...TODO: add to NumberFormatter? 
- } else if ("ordinal".equals(numFmt)) { - return ordinalize(count); - } else if ("decimalZero".equals(numFmt)) { - return "0" + NumberFormatter.getNumber(count, 0); - } else if ("none".equals(numFmt)) { - return ""; - } - try { - return NumberFormatter.getNumber(count, numFmtStyle); - } catch (IllegalArgumentException e) { - return ""; - } - } - - private String ordinalize(int count) { - //this is only good for locale == English - String countString = Integer.toString(count); - if (countString.endsWith("1")) { - return countString + "st"; - } else if (countString.endsWith("2")) { - return countString + "nd"; - } else if (countString.endsWith("3")) { - return countString + "rd"; - } - return countString + "th"; - } - - private String getNumFormat(int lvlNum, boolean isLegal, LevelTuple[] overrideLevelTuples) { - if (lvlNum < 0 || lvlNum >= levelTuples.length) { - //log? - return "decimal"; - } - if (isLegal) { - //return decimal no matter the level if isLegal is true - return "decimal"; - } - return (overrideLevelTuples == null || overrideLevelTuples[lvlNum].numFmt == null) ? - levelTuples[lvlNum].numFmt : overrideLevelTuples[lvlNum].numFmt; - } - - private int getCount(int lvlNum) { - if (lvlNum < 0 || lvlNum >= counts.size()) { - //log? - return 1; - } - return counts.get(lvlNum); - } - - private void resetAfter(int startlevelNumber, LevelTuple[] overrideLevelTuples) { - for (int levelNumber = startlevelNumber + 1; levelNumber < counts.size(); levelNumber++) { - int cnt = counts.get(levelNumber); - if (cnt == NOT_SEEN_YET) { - //do nothing - } else if (cnt == FIRST_SKIPPED) { - //do nothing - } else if (levelTuples.length > levelNumber) { - //never reset if restarts == 0 - int restart = (overrideLevelTuples == null || overrideLevelTuples[levelNumber].restart < 0) ? 
- levelTuples[levelNumber].restart : overrideLevelTuples[levelNumber].restart; - if (restart == 0) { - return; - } else if (restart == -1 || - startlevelNumber <= restart - 1) { - counts.set(levelNumber, NOT_SEEN_YET); - } else { - //do nothing/don't reset - } - } else { - //reset! - counts.set(levelNumber, NOT_SEEN_YET); - } - } - } - - private int getStart(int levelNumber, LevelTuple[] overrideLevelTuples) { - if (levelNumber >= levelTuples.length) { - return 1; - } else { - return (overrideLevelTuples == null || overrideLevelTuples[levelNumber].start < 0) ? - levelTuples[levelNumber].start : overrideLevelTuples[levelNumber].start; - } - } - } - - protected class LevelTuple { - private final int start; - private final int restart; - private final String lvlText; - private final String numFmt; - private final boolean isLegal; - - public LevelTuple(String lvlText) { - this.lvlText = lvlText; - start = 1; - restart = -1; - numFmt = "decimal"; - isLegal = false; - } - - public LevelTuple(int start, int restart, String lvlText, String numFmt, boolean isLegal) { - this.start = start; - this.restart = restart; - this.lvlText = lvlText; - this.numFmt = numFmt; - this.isLegal = isLegal; - } - } -} diff --git a/tika-parsers/src/main/java/org/apache/tika/parser/microsoft/AbstractPOIFSExtractor.java b/tika-parsers/src/main/java/org/apache/tika/parser/microsoft/AbstractPOIFSExtractor.java index 5526c99..ab36bf7 100644 --- a/tika-parsers/src/main/java/org/apache/tika/parser/microsoft/AbstractPOIFSExtractor.java +++ b/tika-parsers/src/main/java/org/apache/tika/parser/microsoft/AbstractPOIFSExtractor.java @@ -40,97 +40,75 @@ import org.apache.tika.mime.MimeTypeException; import org.apache.tika.mime.MimeTypes; import org.apache.tika.parser.ParseContext; -import org.apache.tika.parser.PasswordProvider; import org.apache.tika.parser.microsoft.OfficeParser.POIFSDocumentType; import org.apache.tika.parser.pkg.ZipContainerDetector; import org.apache.tika.sax.XHTMLContentHandler; 
import org.xml.sax.SAXException; abstract class AbstractPOIFSExtractor { - private static final Log logger = LogFactory.getLog(AbstractPOIFSExtractor.class); private final EmbeddedDocumentExtractor extractor; - private PasswordProvider passwordProvider; private TikaConfig tikaConfig; private MimeTypes mimeTypes; private Detector detector; - private Metadata metadata; + private static final Log logger = LogFactory.getLog(AbstractPOIFSExtractor.class); protected AbstractPOIFSExtractor(ParseContext context) { - this(context, null); - } - - protected AbstractPOIFSExtractor(ParseContext context, Metadata metadata) { EmbeddedDocumentExtractor ex = context.get(EmbeddedDocumentExtractor.class); - if (ex == null) { + if (ex==null) { this.extractor = new ParsingEmbeddedDocumentExtractor(context); } else { this.extractor = ex; } - - this.passwordProvider = context.get(PasswordProvider.class); - this.tikaConfig = context.get(TikaConfig.class); - this.mimeTypes = context.get(MimeTypes.class); - this.detector = context.get(Detector.class); - this.metadata = metadata; - } - + + tikaConfig = context.get(TikaConfig.class); + mimeTypes = context.get(MimeTypes.class); + detector = context.get(Detector.class); + } + // Note - these cache, but avoid creating the default TikaConfig if not needed protected TikaConfig getTikaConfig() { - if (tikaConfig == null) { - tikaConfig = TikaConfig.getDefaultConfig(); - } - return tikaConfig; - } - + if (tikaConfig == null) { + tikaConfig = TikaConfig.getDefaultConfig(); + } + return tikaConfig; + } protected Detector getDetector() { - if (detector != null) return detector; - - detector = getTikaConfig().getDetector(); - return detector; - } - + if (detector != null) return detector; + + detector = getTikaConfig().getDetector(); + return detector; + } protected MimeTypes getMimeTypes() { - if (mimeTypes != null) return mimeTypes; - - mimeTypes = getTikaConfig().getMimeRepository(); - return mimeTypes; - } - - /** - * Returns the password to be used 
for this file, or null - * if no / default password should be used - */ - protected String getPassword() { - if (passwordProvider != null) { - return passwordProvider.getPassword(metadata); - } - return null; - } - + if (mimeTypes != null) return mimeTypes; + + mimeTypes = getTikaConfig().getMimeRepository(); + return mimeTypes; + } + protected void handleEmbeddedResource(TikaInputStream resource, String filename, String relationshipID, String mediaType, XHTMLContentHandler xhtml, boolean outputHtml) - throws IOException, SAXException, TikaException { - try { - Metadata metadata = new Metadata(); - if (filename != null) { - metadata.set(Metadata.TIKA_MIME_FILE, filename); - metadata.set(Metadata.RESOURCE_NAME_KEY, filename); - } - if (relationshipID != null) { - metadata.set(Metadata.EMBEDDED_RELATIONSHIP_ID, relationshipID); - } - if (mediaType != null) { - metadata.set(Metadata.CONTENT_TYPE, mediaType); - } - - if (extractor.shouldParseEmbedded(metadata)) { - extractor.parseEmbedded(resource, xhtml, metadata, outputHtml); - } - } finally { - resource.close(); - } + throws IOException, SAXException, TikaException { + try { + Metadata metadata = new Metadata(); + if(filename != null) { + metadata.set(Metadata.TIKA_MIME_FILE, filename); + metadata.set(Metadata.RESOURCE_NAME_KEY, filename); + } + if (relationshipID != null) { + metadata.set(Metadata.EMBEDDED_RELATIONSHIP_ID, relationshipID); + } + if(mediaType != null) { + metadata.set(Metadata.CONTENT_TYPE, mediaType); + } + + if (extractor.shouldParseEmbedded(metadata)) { + extractor.parseEmbedded(resource, xhtml, metadata, outputHtml); + } + } finally { + resource.close(); + } } /** @@ -146,12 +124,15 @@ // It's OOXML (has a ZipFile): Entry ooxml = dir.getEntry("Package"); - try (TikaInputStream stream = TikaInputStream.get( - new DocumentInputStream((DocumentEntry) ooxml))) { + TikaInputStream stream = TikaInputStream.get( + new DocumentInputStream((DocumentEntry) ooxml)); + try { ZipContainerDetector detector = 
new ZipContainerDetector(); MediaType type = detector.detect(stream, new Metadata()); handleEmbeddedResource(stream, null, dir.getName(), type.toString(), xhtml, true); return; + } finally { + stream.close(); } } @@ -167,10 +148,9 @@ if (type == POIFSDocumentType.OLE10_NATIVE) { try { // Try to un-wrap the OLE10Native record: - Ole10Native ole = Ole10Native.createFromEmbeddedOleObject((DirectoryNode) dir); - if (ole.getLabel() != null) { - metadata.set(Metadata.RESOURCE_NAME_KEY, dir.getName() + '/' + ole.getLabel()); - } + Ole10Native ole = Ole10Native.createFromEmbeddedOleObject((DirectoryNode)dir); + metadata.set(Metadata.RESOURCE_NAME_KEY, dir.getName() + '/' + ole.getLabel()); + byte[] data = ole.getDataBuffer(); embedded = TikaInputStream.get(data); } catch (Ole10NativeException ex) { @@ -180,33 +160,33 @@ } } else if (type == POIFSDocumentType.COMP_OBJ) { try { - // Grab the contents and process - DocumentEntry contentsEntry; - try { - contentsEntry = (DocumentEntry) dir.getEntry("CONTENTS"); - } catch (FileNotFoundException ioe) { - contentsEntry = (DocumentEntry) dir.getEntry("Contents"); - } - DocumentInputStream inp = new DocumentInputStream(contentsEntry); - byte[] contents = new byte[contentsEntry.getSize()]; - inp.readFully(contents); - embedded = TikaInputStream.get(contents); - - // Try to work out what it is - MediaType mediaType = getDetector().detect(embedded, new Metadata()); - String extension = type.getExtension(); - try { - MimeType mimeType = getMimeTypes().forName(mediaType.toString()); - extension = mimeType.getExtension(); - } catch (MimeTypeException mte) { - // No details on this type are known - } - - // Record what we can do about it - metadata.set(Metadata.CONTENT_TYPE, mediaType.getType().toString()); - metadata.set(Metadata.RESOURCE_NAME_KEY, dir.getName() + extension); - } catch (Exception e) { - throw new TikaException("Invalid embedded resource", e); + // Grab the contents and process + DocumentEntry contentsEntry; + try { + 
contentsEntry = (DocumentEntry)dir.getEntry("CONTENTS"); + } catch (FileNotFoundException ioe) { + contentsEntry = (DocumentEntry)dir.getEntry("Contents"); + } + DocumentInputStream inp = new DocumentInputStream(contentsEntry); + byte[] contents = new byte[contentsEntry.getSize()]; + inp.readFully(contents); + embedded = TikaInputStream.get(contents); + + // Try to work out what it is + MediaType mediaType = getDetector().detect(embedded, new Metadata()); + String extension = type.getExtension(); + try { + MimeType mimeType = getMimeTypes().forName(mediaType.toString()); + extension = mimeType.getExtension(); + } catch(MimeTypeException mte) { + // No details on this type are known + } + + // Record what we can do about it + metadata.set(Metadata.CONTENT_TYPE, mediaType.getType().toString()); + metadata.set(Metadata.RESOURCE_NAME_KEY, dir.getName() + extension); + } catch(Exception e) { + throw new TikaException("Invalid embedded resource", e); } } else { metadata.set(Metadata.CONTENT_TYPE, type.getType().toString()); diff --git a/tika-parsers/src/main/java/org/apache/tika/parser/microsoft/ExcelExtractor.java b/tika-parsers/src/main/java/org/apache/tika/parser/microsoft/ExcelExtractor.java index 87f395c..61e6f64 100644 --- a/tika-parsers/src/main/java/org/apache/tika/parser/microsoft/ExcelExtractor.java +++ b/tika-parsers/src/main/java/org/apache/tika/parser/microsoft/ExcelExtractor.java @@ -16,7 +16,7 @@ */ package org.apache.tika.parser.microsoft; -import java.awt.*; +import java.awt.Point; import java.io.IOException; import java.text.NumberFormat; import java.util.ArrayList; @@ -34,7 +34,6 @@ import org.apache.poi.hssf.eventusermodel.HSSFEventFactory; import org.apache.poi.hssf.eventusermodel.HSSFListener; import org.apache.poi.hssf.eventusermodel.HSSFRequest; -import org.apache.poi.hssf.extractor.OldExcelExtractor; import org.apache.poi.hssf.record.BOFRecord; import org.apache.poi.hssf.record.BoundSheetRecord; import 
org.apache.poi.hssf.record.CellValueRecordInterface; @@ -43,10 +42,8 @@ import org.apache.poi.hssf.record.DrawingGroupRecord; import org.apache.poi.hssf.record.EOFRecord; import org.apache.poi.hssf.record.ExtendedFormatRecord; -import org.apache.poi.hssf.record.FooterRecord; import org.apache.poi.hssf.record.FormatRecord; import org.apache.poi.hssf.record.FormulaRecord; -import org.apache.poi.hssf.record.HeaderRecord; import org.apache.poi.hssf.record.HyperlinkRecord; import org.apache.poi.hssf.record.LabelRecord; import org.apache.poi.hssf.record.LabelSSTRecord; @@ -58,7 +55,6 @@ import org.apache.poi.hssf.record.TextObjectRecord; import org.apache.poi.hssf.record.chart.SeriesTextRecord; import org.apache.poi.hssf.record.common.UnicodeString; -import org.apache.poi.hssf.record.crypto.Biff8EncryptionKey; import org.apache.poi.hssf.usermodel.HSSFPictureData; import org.apache.poi.poifs.filesystem.DirectoryEntry; import org.apache.poi.poifs.filesystem.DirectoryNode; @@ -68,7 +64,6 @@ import org.apache.tika.exception.EncryptedDocumentException; import org.apache.tika.exception.TikaException; import org.apache.tika.io.TikaInputStream; -import org.apache.tika.metadata.Metadata; import org.apache.tika.parser.ParseContext; import org.apache.tika.sax.XHTMLContentHandler; import org.xml.sax.SAXException; @@ -76,11 +71,11 @@ /** * Excel parser implementation which uses POI's Event API * to handle the contents of a Workbook. - *
    + * <p/>
    * The Event API uses a much smaller memory footprint than * HSSFWorkbook when processing excel files * but at the cost of more complexity. - *
    + * <p/>
    * With the Event API a listener is registered for * specific record types and those records are created, * fired off to the listener and then discarded as the stream @@ -92,8 +87,6 @@ */ public class ExcelExtractor extends AbstractPOIFSExtractor { - private static final String WORKBOOK_ENTRY = "Workbook"; - private static final String BOOK_ENTRY = "Book"; /** * true if the HSSFListener should be registered * to listen for all records or false (the default) @@ -101,9 +94,11 @@ * records. */ private boolean listenForAllRecords = false; - - public ExcelExtractor(ParseContext context, Metadata metadata) { - super(context, metadata); + + private static final String WORKBOOK_ENTRY = "Workbook"; + + public ExcelExtractor(ParseContext context) { + super(context); } /** @@ -117,14 +112,14 @@ /** * Specifies whether this parser should to listen for all * records or just for the specified few. - *
    + * <p/>
    * Note: Under normal operation this setting should * be false (the default), but you can experiment with * this setting for testing and debugging purposes. * * @param listenForAllRecords true if the HSSFListener - * should be registered to listen for all records or false - * if the listener should be configured to only receive specified records. + * should be registered to listen for all records or false + * if the listener should be configured to only receive specified records. */ public void setListenForAllRecords(boolean listenForAllRecords) { this.listenForAllRecords = listenForAllRecords; @@ -136,7 +131,7 @@ * * @param filesystem POI file system * @throws IOException if an error occurs processing the workbook - * or writing the extracted content + * or writing the extracted content */ protected void parse( NPOIFSFileSystem filesystem, XHTMLContentHandler xhtml, @@ -147,24 +142,11 @@ protected void parse( DirectoryNode root, XHTMLContentHandler xhtml, Locale locale) throws IOException, SAXException, TikaException { - if (!root.hasEntry(WORKBOOK_ENTRY)) { - if (root.hasEntry(BOOK_ENTRY)) { - // Excel 5 / Excel 95 file - // Records are in a different structure so needs a - // different parser to process them - OldExcelExtractor extractor = new OldExcelExtractor(root); - OldExcelParser.parse(extractor, xhtml); - return; - } else { - // Corrupt file / very old file, just skip text extraction - return; - } - } - - // If a password was supplied, use it, otherwise the default - Biff8EncryptionKey.setCurrentUserPassword(getPassword()); - - // Have the file processed in event mode + if (! 
root.hasEntry(WORKBOOK_ENTRY)) { + // Corrupt file / very old file, just skip + return; + } + TikaHSSFListener listener = new TikaHSSFListener(xhtml, locale, this); listener.processFile(root, isListenForAllRecords()); listener.throwStoredException(); @@ -178,7 +160,7 @@ // ignore parse errors from embedded documents } } - } + } } // ====================================================================== @@ -192,11 +174,54 @@ * XHTML content handler to which the document content is rendered. */ private final XHTMLContentHandler handler; - + /** * The POIFS Extractor, used for embeded resources. */ private final AbstractPOIFSExtractor extractor; + + /** + * Potential exception thrown by the content handler. When set to + * non-null, causes all subsequent HSSF records to be + * ignored and the stored exception to be thrown when + * {@link #throwStoredException()} is invoked. + */ + private Exception exception = null; + + private SSTRecord sstRecord; + private FormulaRecord stringFormulaRecord; + + private short previousSid; + + /** + * Internal FormatTrackingHSSFListener to handle cell + * formatting within the extraction. + */ + private FormatTrackingHSSFListener formatListener; + + /** + * List of worksheet names. + */ + private List sheetNames = new ArrayList(); + + /** + * Index of the current worksheet within the workbook. + * Used to find the worksheet name in the {@link #sheetNames} list. + */ + private short currentSheetIndex; + + /** + * Content of the current worksheet, or null if no + * worksheet is currently active. + */ + private SortedMap currentSheet = null; + + /** + * Extra text or cells that crops up, typically as part of a + * worksheet but not always. + */ + private List extraTextCells = new ArrayList(); + /** * Format for rendering numbers in the worksheet. Currently we just * use the platform default formatting. @@ -204,44 +229,11 @@ * @see TIKA-103 */ private final NumberFormat format; - /** - * Potential exception thrown by the content handler. 
When set to - * non-null, causes all subsequent HSSF records to be - * ignored and the stored exception to be thrown when - * {@link #throwStoredException()} is invoked. - */ - private Exception exception = null; - private SSTRecord sstRecord; - private FormulaRecord stringFormulaRecord; - private short previousSid; - /** - * Internal FormatTrackingHSSFListener to handle cell - * formatting within the extraction. - */ - private FormatTrackingHSSFListener formatListener; - /** - * List of worksheet names. - */ - private List sheetNames = new ArrayList(); - /** - * Index of the current worksheet within the workbook. - * Used to find the worksheet name in the {@link #sheetNames} list. - */ - private short currentSheetIndex; - /** - * Content of the current worksheet, or null if no - * worksheet is currently active. - */ - private SortedMap currentSheet = null; - /** - * Extra text or cells that crops up, typically as part of a - * worksheet but not always. - */ - private List extraTextCells = new ArrayList(); + /** * These aren't complete when we first see them, as the - * depend on continue records that aren't always - * contiguous. Collect them for later processing. + * depend on continue records that aren't always + * contiguous. Collect them for later processing. */ private List drawingGroups = new ArrayList(); @@ -261,21 +253,21 @@ /** * Entry point to listener to start the processing of a file. * - * @param filesystem POI file system. + * @param filesystem POI file system. * @param listenForAllRecords sets whether the listener is configured to listen - * for all records types or not. - * @throws IOException on any IO errors. + * for all records types or not. + * @throws IOException on any IO errors. * @throws SAXException on any SAX parsing errors. 
*/ - public void processFile(NPOIFSFileSystem filesystem, boolean listenForAllRecords) - throws IOException, SAXException, TikaException { + public void processFile(NPOIFSFileSystem filesystem, boolean listenForAllRecords) + throws IOException, SAXException, TikaException { processFile(filesystem.getRoot(), listenForAllRecords); } - public void processFile(DirectoryNode root, boolean listenForAllRecords) - throws IOException, SAXException, TikaException { - - // Set up listener and register the records we want to process + public void processFile(DirectoryNode root, boolean listenForAllRecords) + throws IOException, SAXException, TikaException { + + // Set up listener and register the records we want to process HSSFRequest hssfRequest = new HSSFRequest(); if (listenForAllRecords) { hssfRequest.addListenerForAllRecords(formatListener); @@ -298,8 +290,6 @@ hssfRequest.addListener(formatListener, FormatRecord.sid); hssfRequest.addListener(formatListener, ExtendedFormatRecord.sid); hssfRequest.addListener(formatListener, DrawingGroupRecord.sid); - hssfRequest.addListener(formatListener, HeaderRecord.sid); - hssfRequest.addListener(formatListener, FooterRecord.sid); } // Create event factory and process Workbook (fire events) @@ -310,17 +300,17 @@ } catch (org.apache.poi.EncryptedDocumentException e) { throw new EncryptedDocumentException(e); } - + // Output any extra text that came after all the sheets - processExtraText(); - + processExtraText(); + // Look for embeded images, now that the drawing records // have been fully matched with their continue data - for (DrawingGroupRecord dgr : drawingGroups) { - dgr.decode(); - findPictures(dgr.getEscherRecords()); - } - } + for(DrawingGroupRecord dgr : drawingGroups) { + dgr.decode(); + findPictures(dgr.getEscherRecords()); + } + } /** * Process a HSSF record. 
@@ -332,7 +322,7 @@ try { internalProcessRecord(record); } catch (TikaException te) { - exception = te; + exception = te; } catch (IOException ie) { exception = ie; } catch (SAXException se) { @@ -343,152 +333,142 @@ public void throwStoredException() throws TikaException, SAXException, IOException { if (exception != null) { - if (exception instanceof IOException) - throw (IOException) exception; - if (exception instanceof SAXException) - throw (SAXException) exception; - if (exception instanceof TikaException) - throw (TikaException) exception; + if(exception instanceof IOException) + throw (IOException)exception; + if(exception instanceof SAXException) + throw (SAXException)exception; + if(exception instanceof TikaException) + throw (TikaException)exception; throw new TikaException(exception.getMessage()); } } private void internalProcessRecord(Record record) throws SAXException, TikaException, IOException { switch (record.getSid()) { - case BOFRecord.sid: // start of workbook, worksheet etc. records - BOFRecord bof = (BOFRecord) record; - if (bof.getType() == BOFRecord.TYPE_WORKBOOK) { - currentSheetIndex = -1; - } else if (bof.getType() == BOFRecord.TYPE_CHART) { - if (previousSid == EOFRecord.sid) { - // This is a sheet which contains only a chart + case BOFRecord.sid: // start of workbook, worksheet etc. records + BOFRecord bof = (BOFRecord) record; + if (bof.getType() == BOFRecord.TYPE_WORKBOOK) { + currentSheetIndex = -1; + } else if (bof.getType() == BOFRecord.TYPE_CHART) { + if(previousSid == EOFRecord.sid) { + // This is a sheet which contains only a chart + newSheet(); + } else { + // This is a chart within a normal sheet + // Handling of this is a bit hacky... + if (currentSheet != null) { + processSheet(); + currentSheetIndex--; newSheet(); - } else { - // This is a chart within a normal sheet - // Handling of this is a bit hacky... 
- if (currentSheet != null) { - processSheet(); - currentSheetIndex--; - newSheet(); - } - } - } else if (bof.getType() == BOFRecord.TYPE_WORKSHEET) { - newSheet(); - } - break; - - case EOFRecord.sid: // end of workbook, worksheet etc. records - if (currentSheet != null) { - processSheet(); - } - currentSheet = null; - break; - - case BoundSheetRecord.sid: // Worksheet index record - BoundSheetRecord boundSheetRecord = (BoundSheetRecord) record; - sheetNames.add(boundSheetRecord.getSheetname()); - break; - - case SSTRecord.sid: // holds all the strings for LabelSSTRecords - sstRecord = (SSTRecord) record; - break; - - case FormulaRecord.sid: // Cell value from a formula - FormulaRecord formula = (FormulaRecord) record; - if (formula.hasCachedResultString()) { - // The String itself should be the next record - stringFormulaRecord = formula; - } else { - addTextCell(record, formatListener.formatNumberDateCell(formula)); - } - break; - - case StringRecord.sid: - if (previousSid == FormulaRecord.sid) { - // Cached string value of a string formula - StringRecord sr = (StringRecord) record; - addTextCell(stringFormulaRecord, sr.getString()); - } else { - // Some other string not associated with a cell, skip - } - break; - - case LabelRecord.sid: // strings stored directly in the cell - LabelRecord label = (LabelRecord) record; - addTextCell(record, label.getValue()); - break; - - case LabelSSTRecord.sid: // Ref. 
a string in the shared string table - LabelSSTRecord sst = (LabelSSTRecord) record; - UnicodeString unicode = sstRecord.getString(sst.getSSTIndex()); - addTextCell(record, unicode.getString()); - break; - - case NumberRecord.sid: // Contains a numeric cell value - NumberRecord number = (NumberRecord) record; - addTextCell(record, formatListener.formatNumberDateCell(number)); - break; - - case RKRecord.sid: // Excel internal number record - RKRecord rk = (RKRecord) record; - addCell(record, new NumberCell(rk.getRKNumber(), format)); - break; - - case HyperlinkRecord.sid: // holds a URL associated with a cell - if (currentSheet != null) { - HyperlinkRecord link = (HyperlinkRecord) record; - Point point = - new Point(link.getFirstColumn(), link.getFirstRow()); - Cell cell = currentSheet.get(point); - if (cell != null) { - String address = link.getAddress(); - if (address != null) { - addCell(record, new LinkedCell(cell, address)); - } else { - addCell(record, cell); - } } } - break; - - case TextObjectRecord.sid: - TextObjectRecord tor = (TextObjectRecord) record; - addTextCell(record, tor.getStr().getString()); - break; - - case SeriesTextRecord.sid: // Chart label or title - SeriesTextRecord str = (SeriesTextRecord) record; - addTextCell(record, str.getText()); - break; - - case DrawingGroupRecord.sid: - // Collect this now, we'll process later when all - // the continue records are in - drawingGroups.add((DrawingGroupRecord) record); - break; - - case HeaderRecord.sid: - HeaderRecord headerRecord = (HeaderRecord) record; - addTextCell(record, headerRecord.getText()); - break; - - case FooterRecord.sid: - FooterRecord footerRecord = (FooterRecord) record; - addTextCell(record, footerRecord.getText()); - break; + } else if (bof.getType() == BOFRecord.TYPE_WORKSHEET) { + newSheet(); + } + break; + + case EOFRecord.sid: // end of workbook, worksheet etc. 
records + if (currentSheet != null) { + processSheet(); + } + currentSheet = null; + break; + + case BoundSheetRecord.sid: // Worksheet index record + BoundSheetRecord boundSheetRecord = (BoundSheetRecord) record; + sheetNames.add(boundSheetRecord.getSheetname()); + break; + + case SSTRecord.sid: // holds all the strings for LabelSSTRecords + sstRecord = (SSTRecord) record; + break; + + case FormulaRecord.sid: // Cell value from a formula + FormulaRecord formula = (FormulaRecord) record; + if (formula.hasCachedResultString()) { + // The String itself should be the next record + stringFormulaRecord = formula; + } else { + addTextCell(record, formatListener.formatNumberDateCell(formula)); + } + break; + + case StringRecord.sid: + if (previousSid == FormulaRecord.sid) { + // Cached string value of a string formula + StringRecord sr = (StringRecord) record; + addTextCell(stringFormulaRecord, sr.getString()); + } else { + // Some other string not associated with a cell, skip + } + break; + + case LabelRecord.sid: // strings stored directly in the cell + LabelRecord label = (LabelRecord) record; + addTextCell(record, label.getValue()); + break; + + case LabelSSTRecord.sid: // Ref. 
a string in the shared string table + LabelSSTRecord sst = (LabelSSTRecord) record; + UnicodeString unicode = sstRecord.getString(sst.getSSTIndex()); + addTextCell(record, unicode.getString()); + break; + + case NumberRecord.sid: // Contains a numeric cell value + NumberRecord number = (NumberRecord) record; + addTextCell(record, formatListener.formatNumberDateCell(number)); + break; + + case RKRecord.sid: // Excel internal number record + RKRecord rk = (RKRecord) record; + addCell(record, new NumberCell(rk.getRKNumber(), format)); + break; + + case HyperlinkRecord.sid: // holds a URL associated with a cell + if (currentSheet != null) { + HyperlinkRecord link = (HyperlinkRecord) record; + Point point = + new Point(link.getFirstColumn(), link.getFirstRow()); + Cell cell = currentSheet.get(point); + if (cell != null) { + String address = link.getAddress(); + if (address != null) { + addCell(record, new LinkedCell(cell, address)); + } else { + addCell(record, cell); + } + } + } + break; + + case TextObjectRecord.sid: + TextObjectRecord tor = (TextObjectRecord) record; + addTextCell(record, tor.getStr().getString()); + break; + + case SeriesTextRecord.sid: // Chart label or title + SeriesTextRecord str = (SeriesTextRecord) record; + addTextCell(record, str.getText()); + break; + + case DrawingGroupRecord.sid: + // Collect this now, we'll process later when all + // the continue records are in + drawingGroups.add( (DrawingGroupRecord)record ); + break; } previousSid = record.getSid(); - + if (stringFormulaRecord != record) { - stringFormulaRecord = null; + stringFormulaRecord = null; } } private void processExtraText() throws SAXException { - if (extraTextCells.size() > 0) { - for (Cell cell : extraTextCells) { + if(extraTextCells.size() > 0) { + for(Cell cell : extraTextCells) { handler.startElement("div", "class", "outside"); cell.render(handler); handler.endElement("div"); @@ -504,7 +484,7 @@ * worksheet (if any) at the position (if any) of the given record. 
* * @param record record that holds the cell value - * @param cell cell value (or null) + * @param cell cell value (or null) */ private void addCell(Record record, Cell cell) throws SAXException { if (cell == null) { @@ -513,7 +493,7 @@ && record instanceof CellValueRecordInterface) { // Normal cell inside a worksheet CellValueRecordInterface value = - (CellValueRecordInterface) record; + (CellValueRecordInterface) record; Point point = new Point(value.getColumn(), value.getRow()); currentSheet.put(point, cell); } else { @@ -527,7 +507,7 @@ * is trimmed, and ignored if null or empty. * * @param record record that holds the text value - * @param text text content, may be null + * @param text text content, may be null * @throws SAXException */ private void addTextCell(Record record, String text) throws SAXException { @@ -587,32 +567,32 @@ // Sheet End handler.endElement("tbody"); handler.endElement("table"); - + // Finish up processExtraText(); handler.endElement("div"); } private void findPictures(List records) throws IOException, SAXException, TikaException { - for (EscherRecord escherRecord : records) { - if (escherRecord instanceof EscherBSERecord) { - EscherBlipRecord blip = ((EscherBSERecord) escherRecord).getBlipRecord(); - if (blip != null) { - HSSFPictureData picture = new HSSFPictureData(blip); - String mimeType = picture.getMimeType(); - TikaInputStream stream = TikaInputStream.get(picture.getData()); - - // Handle the embeded resource - extractor.handleEmbeddedResource( - stream, null, null, mimeType, - handler, true - ); - } - } - - // Recursive call. 
- findPictures(escherRecord.getChildRecords()); - } + for(EscherRecord escherRecord : records) { + if (escherRecord instanceof EscherBSERecord) { + EscherBlipRecord blip = ((EscherBSERecord) escherRecord).getBlipRecord(); + if (blip != null) { + HSSFPictureData picture = new HSSFPictureData(blip); + String mimeType = picture.getMimeType(); + TikaInputStream stream = TikaInputStream.get(picture.getData()); + + // Handle the embeded resource + extractor.handleEmbeddedResource( + stream, null, null, mimeType, + handler, true + ); + } + } + + // Recursive call. + findPictures(escherRecord.getChildRecords()); + } } } @@ -630,4 +610,5 @@ } } + } diff --git a/tika-parsers/src/main/java/org/apache/tika/parser/microsoft/HSLFExtractor.java b/tika-parsers/src/main/java/org/apache/tika/parser/microsoft/HSLFExtractor.java index dedb135..3b7f252 100644 --- a/tika-parsers/src/main/java/org/apache/tika/parser/microsoft/HSLFExtractor.java +++ b/tika-parsers/src/main/java/org/apache/tika/parser/microsoft/HSLFExtractor.java @@ -18,23 +18,12 @@ import java.io.IOException; import java.util.HashSet; -import java.util.List; - -import org.apache.poi.hslf.model.Comment; -import org.apache.poi.hslf.model.HeadersFooters; -import org.apache.poi.hslf.model.OLEShape; -import org.apache.poi.hslf.usermodel.HSLFMasterSheet; -import org.apache.poi.hslf.usermodel.HSLFNotes; -import org.apache.poi.hslf.usermodel.HSLFObjectData; -import org.apache.poi.hslf.usermodel.HSLFPictureData; -import org.apache.poi.hslf.usermodel.HSLFShape; -import org.apache.poi.hslf.usermodel.HSLFSlide; -import org.apache.poi.hslf.usermodel.HSLFSlideShow; -import org.apache.poi.hslf.usermodel.HSLFTable; -import org.apache.poi.hslf.usermodel.HSLFTableCell; -import org.apache.poi.hslf.usermodel.HSLFTextParagraph; -import org.apache.poi.hslf.usermodel.HSLFTextRun; -import org.apache.poi.hslf.usermodel.HSLFTextShape; + +import org.apache.poi.hslf.HSLFSlideShow; +import org.apache.poi.hslf.model.*; +import 
org.apache.poi.hslf.usermodel.ObjectData; +import org.apache.poi.hslf.usermodel.PictureData; +import org.apache.poi.hslf.usermodel.SlideShow; import org.apache.poi.poifs.filesystem.DirectoryNode; import org.apache.poi.poifs.filesystem.NPOIFSFileSystem; import org.apache.tika.exception.TikaException; @@ -45,294 +34,278 @@ import org.xml.sax.helpers.AttributesImpl; public class HSLFExtractor extends AbstractPOIFSExtractor { - public HSLFExtractor(ParseContext context) { - super(context); - } - - protected void parse( - NPOIFSFileSystem filesystem, XHTMLContentHandler xhtml) - throws IOException, SAXException, TikaException { - parse(filesystem.getRoot(), xhtml); - } - - protected void parse( - DirectoryNode root, XHTMLContentHandler xhtml) - throws IOException, SAXException, TikaException { - HSLFSlideShow ss = new HSLFSlideShow(root); - List _slides = ss.getSlides(); - - xhtml.startElement("div", "class", "slideShow"); + public HSLFExtractor(ParseContext context) { + super(context); + } + + protected void parse( + NPOIFSFileSystem filesystem, XHTMLContentHandler xhtml) + throws IOException, SAXException, TikaException { + parse(filesystem.getRoot(), xhtml); + } + + protected void parse( + DirectoryNode root, XHTMLContentHandler xhtml) + throws IOException, SAXException, TikaException { + HSLFSlideShow ss = new HSLFSlideShow(root); + SlideShow _show = new SlideShow(ss); + Slide[] _slides = _show.getSlides(); + + xhtml.startElement("div", "class", "slideShow"); /* Iterate over slides and extract text */ - for (HSLFSlide slide : _slides) { - xhtml.startElement("div", "class", "slide"); - - // Slide header, if present - HeadersFooters hf = slide.getHeadersFooters(); - if (hf != null && hf.isHeaderVisible() && hf.getHeaderText() != null) { - xhtml.startElement("p", "class", "slide-header"); - - xhtml.characters(hf.getHeaderText()); - - xhtml.endElement("p"); - } - - // Slide master, if present - extractMaster(xhtml, slide.getMasterSheet()); - - // Slide text - { - 
xhtml.startElement("div", "class", "slide-content"); - - textRunsToText(xhtml, slide.getTextParagraphs()); - - xhtml.endElement("div"); - } - - // Table text - for (HSLFShape shape : slide.getShapes()) { - if (shape instanceof HSLFTable) { - extractTableText(xhtml, (HSLFTable) shape); - } - } - - // Slide footer, if present - if (hf != null && hf.isFooterVisible() && hf.getFooterText() != null) { - xhtml.startElement("p", "class", "slide-footer"); - - xhtml.characters(hf.getFooterText()); - - xhtml.endElement("p"); - } - - // Comments, if present - StringBuilder authorStringBuilder = new StringBuilder(); - for (Comment comment : slide.getComments()) { - authorStringBuilder.setLength(0); - xhtml.startElement("p", "class", "slide-comment"); - - if (comment.getAuthor() != null) { - authorStringBuilder.append(comment.getAuthor()); - } - if (comment.getAuthorInitials() != null) { - if (authorStringBuilder.length() > 0) { - authorStringBuilder.append(" "); - } - authorStringBuilder.append("("+comment.getAuthorInitials()+")"); - } - if (authorStringBuilder.length() > 0) { - if (comment.getText() != null) { - authorStringBuilder.append(" - "); - } - xhtml.startElement("b"); - xhtml.characters(authorStringBuilder.toString()); - xhtml.endElement("b"); - } - if (comment.getText() != null) { - xhtml.characters(comment.getText()); - } - xhtml.endElement("p"); - } - - // Now any embedded resources - handleSlideEmbeddedResources(slide, xhtml); - - // TODO Find the Notes for this slide and extract inline - - // Slide complete - xhtml.endElement("div"); - } - - // All slides done - xhtml.endElement("div"); + for( Slide slide : _slides ) { + xhtml.startElement("div", "class", "slide"); + + // Slide header, if present + HeadersFooters hf = slide.getHeadersFooters(); + if (hf != null && hf.isHeaderVisible() && hf.getHeaderText() != null) { + xhtml.startElement("p", "class", "slide-header"); + + xhtml.characters( hf.getHeaderText() ); + + xhtml.endElement("p"); + } + + // Slide master, 
if present + extractMaster(xhtml, slide.getMasterSheet()); + + // Slide text + { + xhtml.startElement("p", "class", "slide-content"); + + textRunsToText(xhtml, slide.getTextRuns()); + + xhtml.endElement("p"); + } + + // Table text + for (Shape shape: slide.getShapes()){ + if (shape instanceof Table){ + extractTableText(xhtml, (Table)shape); + } + } + + // Slide footer, if present + if (hf != null && hf.isFooterVisible() && hf.getFooterText() != null) { + xhtml.startElement("p", "class", "slide-footer"); + + xhtml.characters( hf.getFooterText() ); + + xhtml.endElement("p"); + } + + // Comments, if present + for( Comment comment : slide.getComments() ) { + xhtml.startElement("p", "class", "slide-comment"); + if (comment.getAuthor() != null) { + xhtml.startElement("b"); + xhtml.characters( comment.getAuthor() ); + xhtml.endElement("b"); + + if (comment.getText() != null) { + xhtml.characters( " - "); + } + } + if (comment.getText() != null) { + xhtml.characters( comment.getText() ); + } + xhtml.endElement("p"); + } + + // Now any embedded resources + handleSlideEmbeddedResources(slide, xhtml); + + // TODO Find the Notes for this slide and extract inline + + // Slide complete + xhtml.endElement("div"); + } + + // All slides done + xhtml.endElement("div"); /* notes */ - xhtml.startElement("div", "class", "slide-notes"); - HashSet seenNotes = new HashSet<>(); - HeadersFooters hf = ss.getNotesHeadersFooters(); - - for (HSLFSlide slide : _slides) { - HSLFNotes notes = slide.getNotes(); - if (notes == null) { - continue; - } - Integer id = notes._getSheetNumber(); - if (seenNotes.contains(id)) { - continue; - } - seenNotes.add(id); - - // Repeat the Notes header, if set - if (hf != null && hf.isHeaderVisible() && hf.getHeaderText() != null) { - xhtml.startElement("p", "class", "slide-note-header"); - xhtml.characters(hf.getHeaderText()); - xhtml.endElement("p"); - } - - // Notes text - textRunsToText(xhtml, notes.getTextParagraphs()); - - // Repeat the notes footer, if set 
- if (hf != null && hf.isFooterVisible() && hf.getFooterText() != null) { - xhtml.startElement("p", "class", "slide-note-footer"); - xhtml.characters(hf.getFooterText()); - xhtml.endElement("p"); - } - } - - handleSlideEmbeddedPictures(ss, xhtml); - - xhtml.endElement("div"); - } - - private void extractMaster(XHTMLContentHandler xhtml, HSLFMasterSheet master) throws SAXException { - if (master == null) { - return; - } - List<HSLFShape> shapes = master.getShapes(); - if (shapes == null || shapes.isEmpty()) { - return; - } - - xhtml.startElement("div", "class", "slide-master-content"); - for (HSLFShape shape : shapes) { - if (shape != null && !HSLFMasterSheet.isPlaceholder(shape)) { - if (shape instanceof HSLFTextShape) { - HSLFTextShape tsh = (HSLFTextShape) shape; - String text = tsh.getText(); - if (text != null) { - xhtml.element("p", text); - } - } - } - } - xhtml.endElement("div"); - } - - private void extractTableText(XHTMLContentHandler xhtml, HSLFTable shape) throws SAXException { - xhtml.startElement("table"); - for (int row = 0; row < shape.getNumberOfRows(); row++) { - xhtml.startElement("tr"); - for (int col = 0; col < shape.getNumberOfColumns(); col++) { - HSLFTableCell cell = shape.getCell(row, col); - //insert empty string for empty cell if cell is null - String txt = ""; - if (cell != null) { - txt = cell.getText(); - } - xhtml.element("td", txt); - } - xhtml.endElement("tr"); - } - xhtml.endElement("table"); - } - - private void textRunsToText(XHTMLContentHandler xhtml, List<List<HSLFTextParagraph>> paragraphsList) throws SAXException { - if (paragraphsList == null) { - return; - } - - for (List<HSLFTextParagraph> run : paragraphsList) { - // Leaving in wisdom from TIKA-712 for easy revert.
- // Avoid boiler-plate text on the master slide (0 - // = TextHeaderAtom.TITLE_TYPE, 1 = TextHeaderAtom.BODY_TYPE): - //if (!isMaster || (run.getRunType() != 0 && run.getRunType() != 1)) { - - for (HSLFTextParagraph htp : run) { - xhtml.startElement("p"); - - for (HSLFTextRun htr : htp.getTextRuns()) { - String line = htr.getRawText(); - if (line != null) { - boolean isfirst = true; - for (String fragment : line.split("\\u000b")){ - if (!isfirst) { - xhtml.startElement("br"); - xhtml.endElement("br"); - } - isfirst = false; - xhtml.characters(fragment.trim()); - } - } - } - xhtml.endElement("p"); - - } - + xhtml.startElement("div", "class", "slideNotes"); + HashSet seenNotes = new HashSet(); + HeadersFooters hf = _show.getNotesHeadersFooters(); + + for (Slide slide : _slides) { + Notes notes = slide.getNotesSheet(); + if (notes == null) { + continue; + } + Integer id = Integer.valueOf(notes._getSheetNumber()); + if (seenNotes.contains(id)) { + continue; + } + seenNotes.add(id); + + // Repeat the Notes header, if set + if (hf != null && hf.isHeaderVisible() && hf.getHeaderText() != null) { + xhtml.startElement("p", "class", "slide-note-header"); + xhtml.characters( hf.getHeaderText() ); + xhtml.endElement("p"); + } + + // Notes text + textRunsToText(xhtml, notes.getTextRuns()); + + // Repeat the notes footer, if set + if (hf != null && hf.isFooterVisible() && hf.getFooterText() != null) { + xhtml.startElement("p", "class", "slide-note-footer"); + xhtml.characters( hf.getFooterText() ); + xhtml.endElement("p"); + } + } + + handleSlideEmbeddedPictures(_show, xhtml); + + xhtml.endElement("div"); + } + + private void extractMaster(XHTMLContentHandler xhtml, MasterSheet master) throws SAXException { + if (master == null){ + return; + } + Shape[] shapes = master.getShapes(); + if (shapes == null || shapes.length == 0){ + return; + } + + xhtml.startElement("div", "class", "slide-master-content"); + for (int i = 0; i < shapes.length; i++){ + Shape sh = shapes[i]; + if (sh 
!= null && ! MasterSheet.isPlaceholder(sh)){ + if (sh instanceof TextShape){ + TextShape tsh = (TextShape)sh; + String text = tsh.getText(); + if (text != null){ + xhtml.element("p", text); + } + } + } + } + xhtml.endElement("div"); + } + + private void extractTableText(XHTMLContentHandler xhtml, Table shape) throws SAXException { + xhtml.startElement("table"); + for (int row = 0; row < shape.getNumberOfRows(); row++){ + xhtml.startElement("tr"); + for (int col = 0; col < shape.getNumberOfColumns(); col++){ + TableCell cell = shape.getCell(row, col); + //insert empty string for empty cell if cell is null + String txt = ""; + if (cell != null){ + txt = cell.getText(); + } + xhtml.element("td", txt); + } + xhtml.endElement("tr"); + } + xhtml.endElement("table"); + } + + private void textRunsToText(XHTMLContentHandler xhtml, TextRun[] runs) throws SAXException { + if (runs==null) { + return; + } + + for (TextRun run : runs) { + if (run != null) { + // Leaving in wisdom from TIKA-712 for easy revert. 
+ // Avoid boiler-plate text on the master slide (0 + // = TextHeaderAtom.TITLE_TYPE, 1 = TextHeaderAtom.BODY_TYPE): + //if (!isMaster || (run.getRunType() != 0 && run.getRunType() != 1)) { + String txt = run.getText(); + if (txt != null){ + xhtml.characters(txt); + xhtml.startElement("br"); + xhtml.endElement("br"); + } + } + } + } + + private void handleSlideEmbeddedPictures(SlideShow slideshow, XHTMLContentHandler xhtml) + throws TikaException, SAXException, IOException { + for (PictureData pic : slideshow.getPictureData()) { + String mediaType = null; + + switch (pic.getType()) { + case Picture.EMF: + mediaType = "application/x-emf"; + break; + case Picture.JPEG: + mediaType = "image/jpeg"; + break; + case Picture.PNG: + mediaType = "image/png"; + break; + case Picture.WMF: + mediaType = "application/x-msmetafile"; + break; + case Picture.DIB: + mediaType = "image/bmp"; + break; + } + + handleEmbeddedResource( + TikaInputStream.get(pic.getData()), null, null, + mediaType, xhtml, false); } } - private void handleSlideEmbeddedPictures(HSLFSlideShow slideshow, XHTMLContentHandler xhtml) - throws TikaException, SAXException, IOException { - for (HSLFPictureData pic : slideshow.getPictureData()) { - String mediaType; - - switch (pic.getType()) { - case EMF: - mediaType = "application/x-emf"; - break; - case WMF: - mediaType = "application/x-msmetafile"; - break; - case DIB: - mediaType = "image/bmp"; - break; - default: - mediaType = pic.getContentType(); - break; - } - - handleEmbeddedResource( - TikaInputStream.get(pic.getData()), null, null, - mediaType, xhtml, false); - } - } - - private void handleSlideEmbeddedResources(HSLFSlide slide, XHTMLContentHandler xhtml) - throws TikaException, SAXException, IOException { - List shapes; - try { - shapes = slide.getShapes(); - } catch (NullPointerException e) { - // Sometimes HSLF hits problems - // Please open POI bugs for any you come across! 
- return; - } - - for (HSLFShape shape : shapes) { - if (shape instanceof OLEShape) { - OLEShape oleShape = (OLEShape) shape; - HSLFObjectData data = null; - try { - data = oleShape.getObjectData(); - } catch (NullPointerException e) { + private void handleSlideEmbeddedResources(Slide slide, XHTMLContentHandler xhtml) + throws TikaException, SAXException, IOException { + Shape[] shapes; + try { + shapes = slide.getShapes(); + } catch(NullPointerException e) { + // Sometimes HSLF hits problems + // Please open POI bugs for any you come across! + return; + } + + for( Shape shape : shapes ) { + if( shape instanceof OLEShape ) { + OLEShape oleShape = (OLEShape)shape; + ObjectData data = null; + try { + data = oleShape.getObjectData(); + } catch( NullPointerException e ) { /* getObjectData throws NPE some times. */ - } - - if (data != null) { - String objID = Integer.toString(oleShape.getObjectID()); - - // Embedded Object: add a
<div>
    so consumer can see where - // in the main text each embedded document - // occurred: - AttributesImpl attributes = new AttributesImpl(); - attributes.addAttribute("", "class", "class", "CDATA", "embedded"); - attributes.addAttribute("", "id", "id", "CDATA", objID); - xhtml.startElement("div", attributes); - xhtml.endElement("div"); - - try (TikaInputStream stream = TikaInputStream.get(data.getData())) { - String mediaType = null; - if ("Excel.Chart.8".equals(oleShape.getProgID())) { - mediaType = "application/vnd.ms-excel"; - } - handleEmbeddedResource( - stream, objID, objID, - mediaType, xhtml, false); - } - } - } - } - } + } + + if (data != null) { + String objID = Integer.toString(oleShape.getObjectID()); + + // Embedded Object: add a
<div>
    so consumer can see where + // in the main text each embedded document + // occurred: + AttributesImpl attributes = new AttributesImpl(); + attributes.addAttribute("", "class", "class", "CDATA", "embedded"); + attributes.addAttribute("", "id", "id", "CDATA", objID); + xhtml.startElement("div", attributes); + xhtml.endElement("div"); + + TikaInputStream stream = + TikaInputStream.get(data.getData()); + try { + String mediaType = null; + if ("Excel.Chart.8".equals(oleShape.getProgID())) { + mediaType = "application/vnd.ms-excel"; + } + handleEmbeddedResource( + stream, objID, objID, + mediaType, xhtml, false); + } finally { + stream.close(); + } + } + } + } + } } diff --git a/tika-parsers/src/main/java/org/apache/tika/parser/microsoft/JackcessExtractor.java b/tika-parsers/src/main/java/org/apache/tika/parser/microsoft/JackcessExtractor.java deleted file mode 100644 index e224d54..0000000 --- a/tika-parsers/src/main/java/org/apache/tika/parser/microsoft/JackcessExtractor.java +++ /dev/null @@ -1,345 +0,0 @@ -/* - * Licensed to the Apache Software Foundation (ASF) under one or more - * contributor license agreements. See the NOTICE file distributed with - * this work for additional information regarding copyright ownership. - * The ASF licenses this file to You under the Apache License, Version 2.0 - * (the "License"); you may not use this file except in compliance with - * the License. You may obtain a copy of the License at - * - * http://www.apache.org/licenses/LICENSE-2.0 - * - * Unless required by applicable law or agreed to in writing, software - * distributed under the License is distributed on an "AS IS" BASIS, - * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. - * See the License for the specific language governing permissions and - * limitations under the License. 
- */ - -package org.apache.tika.parser.microsoft; - - -import static java.nio.charset.StandardCharsets.UTF_8; - -import java.io.ByteArrayInputStream; -import java.io.IOException; -import java.math.BigDecimal; -import java.text.DateFormat; -import java.text.NumberFormat; -import java.util.Date; -import java.util.HashSet; -import java.util.Iterator; -import java.util.List; -import java.util.Locale; -import java.util.Set; - -import com.healthmarketscience.jackcess.Column; -import com.healthmarketscience.jackcess.DataType; -import com.healthmarketscience.jackcess.Database; -import com.healthmarketscience.jackcess.PropertyMap; -import com.healthmarketscience.jackcess.Row; -import com.healthmarketscience.jackcess.Table; -import com.healthmarketscience.jackcess.query.Query; -import com.healthmarketscience.jackcess.util.OleBlob; -import org.apache.poi.poifs.filesystem.NPOIFSFileSystem; -import org.apache.tika.exception.TikaException; -import org.apache.tika.io.TikaInputStream; -import org.apache.tika.metadata.Metadata; -import org.apache.tika.metadata.OfficeOpenXMLExtended; -import org.apache.tika.metadata.TikaCoreProperties; -import org.apache.tika.parser.ParseContext; -import org.apache.tika.parser.html.HtmlParser; -import org.apache.tika.sax.BodyContentHandler; -import org.apache.tika.sax.XHTMLContentHandler; -import org.xml.sax.SAXException; - -/** - * Internal class. 
Needs to be instantiated for each parse because of - * the lack of thread safety with the dateTimeFormatter - */ -class JackcessExtractor extends AbstractPOIFSExtractor { - - final static String TITLE_PROP_KEY = "Title"; - final static String AUTHOR_PROP_KEY = "Author"; - final static String COMPANY_PROP_KEY = "Company"; - - final static String TEXT_FORMAT_KEY = "TextFormat"; - final static String CURRENCY_FORMAT_KEY = "Format"; - final static byte TEXT_FORMAT = 0; - final static byte RICH_TEXT_FORMAT = 1; - final static ParseContext EMPTY_PARSE_CONTEXT = new ParseContext(); - - final NumberFormat currencyFormatter; - final DateFormat shortDateTimeFormatter; - - final HtmlParser htmlParser = new HtmlParser(); - - protected JackcessExtractor(ParseContext context, Locale locale) { - super(context); - currencyFormatter = NumberFormat.getCurrencyInstance(locale); - shortDateTimeFormatter = DateFormat.getDateInstance(DateFormat.SHORT, locale); - } - - public void parse(Database db, XHTMLContentHandler xhtml, Metadata metadata) throws IOException, SAXException, TikaException { - - - String pw = db.getDatabasePassword(); - if (pw != null) { - metadata.set(JackcessParser.MDB_PW, pw); - } - - PropertyMap dbp = db.getDatabaseProperties(); - for (PropertyMap.Property p : dbp) { - metadata.add(JackcessParser.MDB_PROPERTY_PREFIX + p.getName(), - toString(p.getValue(), p.getType())); - } - - PropertyMap up = db.getUserDefinedProperties(); - for (PropertyMap.Property p : up) { - metadata.add(JackcessParser.USER_DEFINED_PROPERTY_PREFIX+ p.getName(), - toString(p.getValue(), p.getType())); - } - - Set found = new HashSet<>(); - PropertyMap summaryProperties = db.getSummaryProperties(); - if (summaryProperties != null) { - //try to get core properties - PropertyMap.Property title = summaryProperties.get(TITLE_PROP_KEY); - if (title != null) { - metadata.set(TikaCoreProperties.TITLE, toString(title.getValue(), title.getType())); - found.add(title.getName()); - } - 
PropertyMap.Property author = summaryProperties.get(AUTHOR_PROP_KEY); - if (author != null && author.getValue() != null) { - String authorString = toString(author.getValue(), author.getType()); - SummaryExtractor.addMulti(metadata, TikaCoreProperties.CREATOR, authorString); - found.add(author.getName()); - } - PropertyMap.Property company = summaryProperties.get(COMPANY_PROP_KEY); - if (company != null) { - metadata.set(OfficeOpenXMLExtended.COMPANY, toString(company.getValue(), company.getType())); - found.add(company.getName()); - } - - for (PropertyMap.Property p : db.getSummaryProperties()) { - if (! found.contains(p.getName())) { - metadata.add(JackcessParser.SUMMARY_PROPERTY_PREFIX + p.getName(), - toString(p.getValue(), p.getType())); - } - } - - } - - Iterator<Table> it = db.newIterable(). - setIncludeLinkedTables(false). - setIncludeSystemTables(false).iterator(); - - while (it.hasNext()) { - Table table = it.next(); - String tableName = table.getName(); - List<? extends Column> columns = table.getColumns(); - xhtml.startElement("table", "name", tableName); - addHeaders(columns, xhtml); - xhtml.startElement("tbody"); - - Row r = table.getNextRow(); - - while (r != null) { - xhtml.startElement("tr"); - for (Column c : columns) { - handleCell(r, c, xhtml); - } - xhtml.endElement("tr"); - r = table.getNextRow(); - } - xhtml.endElement("tbody"); - xhtml.endElement("table"); - } - - for (Query q : db.getQueries()) { - xhtml.startElement("div", "type", "sqlQuery"); - xhtml.characters(q.toSQLString()); - xhtml.endElement("div"); - } - } - - private void addHeaders(List<? extends Column> columns, XHTMLContentHandler xhtml) throws SAXException { - xhtml.startElement("thead"); - xhtml.startElement("tr"); - for (Column c : columns) { - xhtml.startElement("th"); - xhtml.characters(c.getName()); - xhtml.endElement("th"); - } - xhtml.endElement("tr"); - xhtml.endElement("thead"); - - } - - private void handleCell(Row r, Column c, XHTMLContentHandler handler) - throws SAXException, IOException, TikaException { - 
- handler.startElement("td"); - if (c.getType().equals(DataType.OLE)) { - handleOLE(r, c.getName(), handler); - } else if (c.getType().equals(DataType.BINARY)) { - Object obj = r.get(c.getName()); - if (obj != null) { - byte[] bytes = (byte[])obj; - handleEmbeddedResource( - TikaInputStream.get(bytes), - null,//filename - null,//relationshipId - null,//mediatype - handler, false); - } - } else { - Object obj = r.get(c.getName()); - String v = toString(obj, c.getType()); - if (isRichText(c)) { - BodyContentHandler h = new BodyContentHandler(); - Metadata m = new Metadata(); - m.set(Metadata.CONTENT_TYPE, "text/html; charset=UTF-8"); - try { - htmlParser.parse(new ByteArrayInputStream(v.getBytes(UTF_8)), - h, - m, EMPTY_PARSE_CONTEXT); - handler.characters(h.toString()); - } catch (SAXException e) { - //if something went wrong in htmlparser, just append the characters - handler.characters(v); - } - } else { - handler.characters(v); - } - } - handler.endElement("td"); - } - - private boolean isRichText(Column c) throws IOException { - - if (c == null) { - return false; - } - - PropertyMap m = c.getProperties(); - if (m == null) { - return false; - } - if (c.getType() == null || ! c.getType().equals(DataType.MEMO)) { - return false; - } - Object b = m.getValue(TEXT_FORMAT_KEY); - if (b instanceof Byte) { - if (((Byte)b).byteValue() == RICH_TEXT_FORMAT) { - return true; - } - } - return false; - } - - private String toString(Object value, DataType type) { - if (value == null) { - return ""; - } - if (type == null) { - //this shouldn't happen - return value.toString(); - } - switch (type) { - case LONG: - return Integer.toString((Integer)value); - case TEXT: - return (String)value; - case MONEY: - //TODO: consider getting parsing "Format" field from - //field properties. 
- return formatCurrency(((BigDecimal)value).doubleValue(), type); - case SHORT_DATE_TIME: - return formatShortDateTime((Date)value); - case BOOLEAN: - return Boolean.toString((Boolean) value); - case MEMO: - return (String)value; - case INT: - return Short.toString((Short)value); - case DOUBLE: - return Double.toString((Double)value); - case FLOAT: - return Float.toString((Float)value); - case NUMERIC: - return value.toString(); - case BYTE: - return Byte.toString((Byte)value); - case GUID: - return value.toString(); - case COMPLEX_TYPE: //skip all these - case UNKNOWN_0D: - case UNKNOWN_11: - case UNSUPPORTED_FIXEDLEN: - case UNSUPPORTED_VARLEN: - default: - return ""; - - } - } - - private void handleOLE(Row row, String cName, XHTMLContentHandler xhtml) throws IOException, SAXException, TikaException { - OleBlob blob = row.getBlob(cName); - //lifted shamelessly from Jackcess's OleBlobTest - if (blob == null) - return; - - OleBlob.Content content = blob.getContent(); - if (content == null) - return; - - switch (content.getType()) { - case LINK: - xhtml.characters(((OleBlob.LinkContent) content).getLinkPath()); - break; - case SIMPLE_PACKAGE: - OleBlob.SimplePackageContent spc = (OleBlob.SimplePackageContent) content; - - handleEmbeddedResource( - TikaInputStream.get(spc.getStream()), - spc.getFileName(),//filename - null,//relationshipId - spc.getTypeName(),//mediatype - xhtml, false); - break; - case OTHER: - OleBlob.OtherContent oc = (OleBlob.OtherContent) content; - handleEmbeddedResource( - TikaInputStream.get(oc.getStream()), - null,//filename - null,//relationshipId - oc.getTypeName(),//mediatype - xhtml, false); - break; - case COMPOUND_STORAGE: - OleBlob.CompoundContent cc = (OleBlob.CompoundContent) content; - handleCompoundContent(cc, xhtml); - break; - } - } - - private void handleCompoundContent(OleBlob.CompoundContent cc, XHTMLContentHandler xhtml) throws IOException, SAXException, TikaException { - NPOIFSFileSystem nfs = new 
NPOIFSFileSystem(cc.getStream()); - handleEmbeddedOfficeDoc(nfs.getRoot(), xhtml); - } - - String formatCurrency(Double d, DataType type) { - if (d == null) { - return ""; - } - return currencyFormatter.format(d); - } - - String formatShortDateTime(Date d) { - if (d == null) { - return ""; - } - return shortDateTimeFormatter.format(d); - } -} - diff --git a/tika-parsers/src/main/java/org/apache/tika/parser/microsoft/JackcessParser.java b/tika-parsers/src/main/java/org/apache/tika/parser/microsoft/JackcessParser.java deleted file mode 100644 index 9704fbb..0000000 --- a/tika-parsers/src/main/java/org/apache/tika/parser/microsoft/JackcessParser.java +++ /dev/null @@ -1,129 +0,0 @@ -/* - * Licensed to the Apache Software Foundation (ASF) under one or more - * contributor license agreements. See the NOTICE file distributed with - * this work for additional information regarding copyright ownership. - * The ASF licenses this file to You under the Apache License, Version 2.0 - * (the "License"); you may not use this file except in compliance with - * the License. You may obtain a copy of the License at - * - * http://www.apache.org/licenses/LICENSE-2.0 - * - * Unless required by applicable law or agreed to in writing, software - * distributed under the License is distributed on an "AS IS" BASIS, - * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. - * See the License for the specific language governing permissions and - * limitations under the License. 
- */ - -package org.apache.tika.parser.microsoft; - - -import java.io.IOException; -import java.io.InputStream; -import java.util.Collections; -import java.util.Locale; -import java.util.Set; - -import com.healthmarketscience.jackcess.CryptCodecProvider; -import com.healthmarketscience.jackcess.Database; -import com.healthmarketscience.jackcess.DatabaseBuilder; -import com.healthmarketscience.jackcess.util.LinkResolver; -import org.apache.tika.exception.EncryptedDocumentException; -import org.apache.tika.exception.TikaException; -import org.apache.tika.io.TikaInputStream; -import org.apache.tika.metadata.Metadata; -import org.apache.tika.metadata.Property; -import org.apache.tika.mime.MediaType; -import org.apache.tika.parser.AbstractParser; -import org.apache.tika.parser.ParseContext; -import org.apache.tika.parser.PasswordProvider; -import org.apache.tika.sax.XHTMLContentHandler; -import org.xml.sax.ContentHandler; -import org.xml.sax.SAXException; - -/** - * Parser that handles Microsoft Access files via - * SUPPORTED_TYPES = Collections.singleton(MEDIA_TYPE); - - private Locale locale = Locale.ROOT; - - @Override - public Set getSupportedTypes(ParseContext context) { - return SUPPORTED_TYPES; - } - - @Override - public void parse(InputStream stream, ContentHandler handler, Metadata metadata, - ParseContext context) throws IOException, SAXException, TikaException { - TikaInputStream tis = TikaInputStream.get(stream); - Database db = null; - XHTMLContentHandler xhtml = new XHTMLContentHandler(handler, metadata); - xhtml.startDocument(); - - String password = null; - PasswordProvider passwordProvider = context.get(PasswordProvider.class); - if (passwordProvider != null) { - password = passwordProvider.getPassword(metadata); - } - try { - if (password == null) { - //do this to ensure encryption/wrong password exception vs. more generic - //"need right codec" error message. 
- db = new DatabaseBuilder(tis.getFile()) - .setCodecProvider(new CryptCodecProvider()) - .setReadOnly(true).open(); - } else { - db = new DatabaseBuilder(tis.getFile()) - .setCodecProvider(new CryptCodecProvider(password)) - .setReadOnly(true).open(); - } - db.setLinkResolver(IGNORE_LINK_RESOLVER);//just in case - JackcessExtractor ex = new JackcessExtractor(context, locale); - ex.parse(db, xhtml, metadata); - } catch (IllegalStateException e) { - if (e.getMessage() != null && e.getMessage().contains("Incorrect password")) { - throw new EncryptedDocumentException(e); - } - throw e; - } finally { - if (db != null) { - try { - db.close(); - } catch (IOException e) { - //swallow = silent close - } - } - } - xhtml.endDocument(); - } - - private static final class IgnoreLinkResolver implements LinkResolver { - //If links are resolved, Jackcess might try to open and process - //any file on the current system that is specified as a linked db. - //This could be a nasty security issue. - @Override - public Database resolveLinkedDatabase(Database database, String s) throws IOException { - throw new AssertionError("DO NOT ALLOW RESOLVING OF LINKS!!!"); - } - } -} diff --git a/tika-parsers/src/main/java/org/apache/tika/parser/microsoft/ListManager.java b/tika-parsers/src/main/java/org/apache/tika/parser/microsoft/ListManager.java deleted file mode 100644 index a211de5..0000000 --- a/tika-parsers/src/main/java/org/apache/tika/parser/microsoft/ListManager.java +++ /dev/null @@ -1,190 +0,0 @@ -/* - * Licensed to the Apache Software Foundation (ASF) under one or more - * contributor license agreements. See the NOTICE file distributed with - * this work for additional information regarding copyright ownership. - * The ASF licenses this file to You under the Apache License, Version 2.0 - * (the "License"); you may not use this file except in compliance with - * the License. 
You may obtain a copy of the License at - * - * http://www.apache.org/licenses/LICENSE-2.0 - * - * Unless required by applicable law or agreed to in writing, software - * distributed under the License is distributed on an "AS IS" BASIS, - * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. - * See the License for the specific language governing permissions and - * limitations under the License. - */ -package org.apache.tika.parser.microsoft; - -import java.util.NoSuchElementException; - -import org.apache.commons.logging.Log; -import org.apache.commons.logging.LogFactory; -import org.apache.poi.hwpf.HWPFDocument; -import org.apache.poi.hwpf.model.ListData; -import org.apache.poi.hwpf.model.ListFormatOverrideLevel; -import org.apache.poi.hwpf.model.ListLevel; -import org.apache.poi.hwpf.model.ListTables; -import org.apache.poi.hwpf.usermodel.Paragraph; - -/** - * Computes the number text which goes at the beginning of each list paragraph - *
- * Note: This class only handles the raw number text and does not apply any further formatting as described in [MS-DOC], v20140721, 2.4.6.3, Part 3 to it.
- * Note 2: The {@code tplc}, a visual override for the appearance of list levels, as defined in [MS-DOC], v20140721, 2.9.328 is not taken care of in this class.
- * Further, this class does not yet handle overrides
    - */ -public class ListManager extends AbstractListManager { - - private static final Log logger = LogFactory.getLog(ListManager.class); - private final ListTables listTables; - - /** - * Ordinary constructor for a new list reader - * - * @param document Document to process - */ - public ListManager(final HWPFDocument document) { - this.listTables = document.getListTables(); - } - - /** - * Get the formatted number for a given paragraph - *
- * Note: This only works correctly if called subsequently for all paragraphs in a valid selection (main document, text field, ...) which are part of a list.
    - * - * @param paragraph list paragraph to process - * @return String which represents the numbering of this list paragraph; never {@code null}, can be empty string, though, - * if something goes wrong in getList() - * @throws IllegalArgumentException If the given paragraph is {@code null} or is not part of a list - */ - public String getFormattedNumber(final Paragraph paragraph) { - if (paragraph == null) throw new IllegalArgumentException("Given paragraph cannot be null."); - if (!paragraph.isInList()) throw new IllegalArgumentException("Can only process list paragraphs."); - //lsid is equivalent to docx's abnum - //ilfo is equivalent to docx's num - int currAbNumId = -1; - try{ - currAbNumId = paragraph.getList().getLsid(); - } catch (NoSuchElementException e) { - //somewhat frequent exception when initializing HWPFList - return ""; - } catch (IllegalArgumentException e) { - return ""; - } catch (NullPointerException e) { - return ""; - } - - int currNumId = paragraph.getIlfo(); - ParagraphLevelCounter lc = listLevelMap.get(currAbNumId); - LevelTuple[] overrideTuples = overrideTupleMap.get(currNumId); - - if (lc == null) { - ListData listData = listTables.getListData(paragraph.getList().getLsid()); - LevelTuple[] levelTuples = new LevelTuple[listData.getLevels().length]; - for (int i = 0; i < listData.getLevels().length; i++) { - levelTuples[i] = buildTuple(i, listData.getLevels()[i]); - } - lc = new ParagraphLevelCounter(levelTuples); - } - if (overrideTuples == null) { - overrideTuples = buildOverrideTuples(paragraph, lc.getNumberOfLevels()); - } - String formattedString = lc.incrementLevel(paragraph.getIlvl(), overrideTuples); - - listLevelMap.put(currAbNumId, lc); - overrideTupleMap.put(currNumId, overrideTuples); - return formattedString; - } - - private LevelTuple buildTuple(int i, ListLevel listLevel) { - boolean isLegal = false; - int start = 1; - int restart = -1; - String lvlText = "%" + i + "."; - String numFmt = "decimal"; - - start = 
listLevel.getStartAt(); - restart = listLevel.getRestart(); - isLegal = listLevel.isLegalNumbering(); - numFmt = convertToNewNumFormat(listLevel.getNumberFormat()); - lvlText = convertToNewNumberText(listLevel.getNumberText(), listLevel.getLevelNumberingPlaceholderOffsets()); - return new LevelTuple(start, restart, lvlText, numFmt, isLegal); - } - - private LevelTuple[] buildOverrideTuples(Paragraph par, int length) { - ListFormatOverrideLevel overrideLevel; - // find the override for this level - if (listTables.getLfoData(par.getIlfo()).getRgLfoLvl().length == 0) { - return null; - } - overrideLevel = listTables.getLfoData(par.getIlfo()).getRgLfoLvl()[0]; - if (overrideLevel == null) { - return null; - } - LevelTuple[] levelTuples = new LevelTuple[length]; - ListLevel listLevel = overrideLevel.getLevel(); - if (listLevel == null) { - return null; - } - for (int i = 0; i < length; i++) { - levelTuples[i] = buildTuple(i, listLevel); - } - - return levelTuples; - - } - - private String convertToNewNumberText(String numberText, byte[] numberOffsets) { - - StringBuilder sb = new StringBuilder(); - int last = 0; - for (int i = 0; i < numberOffsets.length; i++) { - int offset = (int) numberOffsets[i]; - - if (offset == 0) { - break; - } - sb.append(numberText.substring(last, offset - 1)); - //need to add one because newer format - //adds one. 
In .doc, this was the array index; - //but in .docx, this is the level number - int lvlNum = (int) numberText.charAt(offset - 1) + 1; - sb.append("%" + lvlNum); - last = offset; - } - if (last < numberText.length()) { - sb.append(numberText.substring(last)); - } - return sb.toString(); - } - - private String convertToNewNumFormat(int numberFormat) { - switch (numberFormat) { - case -1: - return "none"; - case 0: - return "decimal"; - case 1: - return "upperRoman"; - case 2: - return "lowerRoman"; - case 3: - return "upperLetter"; - case 4: - return "lowerLetter"; - case 5: - return "ordinal"; - case 22: - return "decimalZero"; - case 23: - return "bullet"; - case 47: - return "none"; - default: - //do we really want to silently swallow these uncovered cases? - //throw new RuntimeException("NOT COVERED: " + numberFormat); - return "decimal"; - } - } -} diff --git a/tika-parsers/src/main/java/org/apache/tika/parser/microsoft/OfficeParser.java b/tika-parsers/src/main/java/org/apache/tika/parser/microsoft/OfficeParser.java index 96dfd4b..407bd45 100644 --- a/tika-parsers/src/main/java/org/apache/tika/parser/microsoft/OfficeParser.java +++ b/tika-parsers/src/main/java/org/apache/tika/parser/microsoft/OfficeParser.java @@ -25,7 +25,6 @@ import java.util.Locale; import java.util.Set; -import org.apache.commons.io.input.CloseShieldInputStream; import org.apache.poi.hdgf.extractor.VisioTextExtractor; import org.apache.poi.hpbf.extractor.PublisherTextExtractor; import org.apache.poi.poifs.crypt.Decryptor; @@ -37,6 +36,7 @@ import org.apache.poi.poifs.filesystem.POIFSFileSystem; import org.apache.tika.exception.EncryptedDocumentException; import org.apache.tika.exception.TikaException; +import org.apache.tika.io.CloseShieldInputStream; import org.apache.tika.io.TikaInputStream; import org.apache.tika.metadata.Metadata; import org.apache.tika.mime.MediaType; @@ -55,9 +55,7 @@ */ public class OfficeParser extends AbstractParser { - /** - * Serial version UID - */ + /** Serial 
version UID */ private static final long serialVersionUID = 7393462244028653479L; private static final Set SUPPORTED_TYPES = @@ -77,141 +75,7 @@ POIFSDocumentType.SOLIDWORKS_PART.type, POIFSDocumentType.SOLIDWORKS_ASSEMBLY.type, POIFSDocumentType.SOLIDWORKS_DRAWING.type - ))); - - public Set getSupportedTypes(ParseContext context) { - return SUPPORTED_TYPES; - } - - /** - * Extracts properties and text from an MS Document input stream - */ - public void parse( - InputStream stream, ContentHandler handler, - Metadata metadata, ParseContext context) - throws IOException, SAXException, TikaException { - XHTMLContentHandler xhtml = new XHTMLContentHandler(handler, metadata); - xhtml.startDocument(); - - final DirectoryNode root; - TikaInputStream tstream = TikaInputStream.cast(stream); - if (tstream == null) { - root = new NPOIFSFileSystem(new CloseShieldInputStream(stream)).getRoot(); - } else { - final Object container = tstream.getOpenContainer(); - if (container instanceof NPOIFSFileSystem) { - root = ((NPOIFSFileSystem) container).getRoot(); - } else if (container instanceof DirectoryNode) { - root = (DirectoryNode) container; - } else { - NPOIFSFileSystem fs; - if (tstream.hasFile()) { - fs = new NPOIFSFileSystem(tstream.getFile(), true); - } else { - fs = new NPOIFSFileSystem(new CloseShieldInputStream(tstream)); - } - tstream.setOpenContainer(fs); - root = fs.getRoot(); - } - } - parse(root, context, metadata, xhtml); - xhtml.endDocument(); - } - - protected void parse( - DirectoryNode root, ParseContext context, Metadata metadata, XHTMLContentHandler xhtml) - throws IOException, SAXException, TikaException { - - // Parse summary entries first, to make metadata available early - new SummaryExtractor(metadata).parseSummaries(root); - - // Parse remaining document entries - POIFSDocumentType type = POIFSDocumentType.detectType(root); - - if (type != POIFSDocumentType.UNKNOWN) { - setType(metadata, type.getType()); - } - - switch (type) { - case SOLIDWORKS_PART: - 
case SOLIDWORKS_ASSEMBLY: - case SOLIDWORKS_DRAWING: - break; - case PUBLISHER: - PublisherTextExtractor publisherTextExtractor = - new PublisherTextExtractor(root); - xhtml.element("p", publisherTextExtractor.getText()); - break; - case WORDDOCUMENT: - new WordExtractor(context).parse(root, xhtml); - break; - case POWERPOINT: - new HSLFExtractor(context).parse(root, xhtml); - break; - case WORKBOOK: - case XLR: - Locale locale = context.get(Locale.class, Locale.getDefault()); - new ExcelExtractor(context, metadata).parse(root, xhtml, locale); - break; - case PROJECT: - // We currently can't do anything beyond the metadata - break; - case VISIO: - VisioTextExtractor visioTextExtractor = - new VisioTextExtractor(root); - for (String text : visioTextExtractor.getAllText()) { - xhtml.element("p", text); - } - break; - case OUTLOOK: - OutlookExtractor extractor = - new OutlookExtractor(root, context); - - extractor.parse(xhtml, metadata); - break; - case ENCRYPTED: - EncryptionInfo info = new EncryptionInfo(root); - Decryptor d = Decryptor.getInstance(info); - - try { - // By default, use the default Office Password - String password = Decryptor.DEFAULT_PASSWORD; - - // If they supplied a Password Provider, ask that for the password, - // and use the provider given one if available (stick with default if not) - PasswordProvider passwordProvider = context.get(PasswordProvider.class); - if (passwordProvider != null) { - String suppliedPassword = passwordProvider.getPassword(metadata); - if (suppliedPassword != null) { - password = suppliedPassword; - } - } - - // Check if we've the right password or not - if (!d.verifyPassword(password)) { - throw new EncryptedDocumentException(); - } - - // Decrypt the OLE2 stream, and delegate the resulting OOXML - // file to the regular OOXML parser for normal handling - OOXMLParser parser = new OOXMLParser(); - - parser.parse(d.getDataStream(root), new EmbeddedContentHandler( - new BodyContentHandler(xhtml)), - metadata, context); - 
} catch (GeneralSecurityException ex) { - throw new EncryptedDocumentException(ex); - } - default: - // For unsupported / unhandled types, just the metadata - // is extracted, which happened above - break; - } - } - - private void setType(Metadata metadata, MediaType type) { - metadata.set(Metadata.CONTENT_TYPE, type.toString()); - } + ))); public enum POIFSDocumentType { WORKBOOK("xls", MediaType.application("vnd.ms-excel")), @@ -239,13 +103,21 @@ this.type = type; } + public String getExtension() { + return extension; + } + + public MediaType getType() { + return type; + } + public static POIFSDocumentType detectType(POIFSFileSystem fs) { return detectType(fs.getRoot()); } public static POIFSDocumentType detectType(NPOIFSFileSystem fs) { - return detectType(fs.getRoot()); - } + return detectType(fs.getRoot()); + } public static POIFSDocumentType detectType(DirectoryEntry node) { Set names = new HashSet(); @@ -254,20 +126,140 @@ } MediaType type = POIFSContainerDetector.detect(names, node); for (POIFSDocumentType poifsType : values()) { - if (type.equals(poifsType.type)) { - return poifsType; - } + if (type.equals(poifsType.type)) { + return poifsType; + } } return UNKNOWN; } - - public String getExtension() { - return extension; - } - - public MediaType getType() { - return type; - } + } + + public Set getSupportedTypes(ParseContext context) { + return SUPPORTED_TYPES; + } + + /** + * Extracts properties and text from an MS Document input stream + */ + public void parse( + InputStream stream, ContentHandler handler, + Metadata metadata, ParseContext context) + throws IOException, SAXException, TikaException { + XHTMLContentHandler xhtml = new XHTMLContentHandler(handler, metadata); + xhtml.startDocument(); + + final DirectoryNode root; + TikaInputStream tstream = TikaInputStream.cast(stream); + if (tstream == null) { + root = new NPOIFSFileSystem(new CloseShieldInputStream(stream)).getRoot(); + } else { + final Object container = tstream.getOpenContainer(); + if 
(container instanceof NPOIFSFileSystem) { + root = ((NPOIFSFileSystem) container).getRoot(); + } else if (container instanceof DirectoryNode) { + root = (DirectoryNode) container; + } else if (tstream.hasFile()) { + root = new NPOIFSFileSystem(tstream.getFileChannel()).getRoot(); + } else { + root = new NPOIFSFileSystem(new CloseShieldInputStream(tstream)).getRoot(); + } + } + parse(root, context, metadata, xhtml); + xhtml.endDocument(); + } + + protected void parse( + DirectoryNode root, ParseContext context, Metadata metadata, XHTMLContentHandler xhtml) + throws IOException, SAXException, TikaException { + + // Parse summary entries first, to make metadata available early + new SummaryExtractor(metadata).parseSummaries(root); + + // Parse remaining document entries + POIFSDocumentType type = POIFSDocumentType.detectType(root); + + if (type!=POIFSDocumentType.UNKNOWN) { + setType(metadata, type.getType()); + } + + switch (type) { + case SOLIDWORKS_PART: +// new SolidworksExtractor(context).parse(root, xhtml); + break; + case SOLIDWORKS_ASSEMBLY: + break; + case SOLIDWORKS_DRAWING: + break; + case PUBLISHER: + PublisherTextExtractor publisherTextExtractor = + new PublisherTextExtractor(root); + xhtml.element("p", publisherTextExtractor.getText()); + break; + case WORDDOCUMENT: + new WordExtractor(context).parse(root, xhtml); + break; + case POWERPOINT: + new HSLFExtractor(context).parse(root, xhtml); + break; + case WORKBOOK: + case XLR: + Locale locale = context.get(Locale.class, Locale.getDefault()); + new ExcelExtractor(context).parse(root, xhtml, locale); + break; + case PROJECT: + // We currently can't do anything beyond the metadata + break; + case VISIO: + VisioTextExtractor visioTextExtractor = + new VisioTextExtractor(root); + for (String text : visioTextExtractor.getAllText()) { + xhtml.element("p", text); + } + break; + case OUTLOOK: + OutlookExtractor extractor = + new OutlookExtractor(root, context); + + extractor.parse(xhtml, metadata); + break; + 
case ENCRYPTED: + EncryptionInfo info = new EncryptionInfo(root); + Decryptor d = Decryptor.getInstance(info); + + try { + // By default, use the default Office Password + String password = Decryptor.DEFAULT_PASSWORD; + + // If they supplied a Password Provider, ask that for the password, + // and use the provider given one if available (stick with default if not) + PasswordProvider passwordProvider = context.get(PasswordProvider.class); + if (passwordProvider != null) { + String suppliedPassword = passwordProvider.getPassword(metadata); + if (suppliedPassword != null) { + password = suppliedPassword; + } + } + + // Check if we've the right password or not + if (!d.verifyPassword(password)) { + throw new EncryptedDocumentException(); + } + + // Decrypt the OLE2 stream, and delegate the resulting OOXML + // file to the regular OOXML parser for normal handling + OOXMLParser parser = new OOXMLParser(); + + parser.parse(d.getDataStream(root), new EmbeddedContentHandler( + new BodyContentHandler(xhtml)), + metadata, context); + } catch (GeneralSecurityException ex) { + throw new EncryptedDocumentException(ex); + } + } + } + + private void setType(Metadata metadata, MediaType type) { + metadata.set(Metadata.CONTENT_TYPE, type.toString()); } } diff --git a/tika-parsers/src/main/java/org/apache/tika/parser/microsoft/OldExcelParser.java b/tika-parsers/src/main/java/org/apache/tika/parser/microsoft/OldExcelParser.java deleted file mode 100644 index 446eea9..0000000 --- a/tika-parsers/src/main/java/org/apache/tika/parser/microsoft/OldExcelParser.java +++ /dev/null @@ -1,97 +0,0 @@ -/* - * Licensed to the Apache Software Foundation (ASF) under one or more - * contributor license agreements. See the NOTICE file distributed with - * this work for additional information regarding copyright ownership. - * The ASF licenses this file to You under the Apache License, Version 2.0 - * (the "License"); you may not use this file except in compliance with - * the License. 
You may obtain a copy of the License at - * - * http://www.apache.org/licenses/LICENSE-2.0 - * - * Unless required by applicable law or agreed to in writing, software - * distributed under the License is distributed on an "AS IS" BASIS, - * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. - * See the License for the specific language governing permissions and - * limitations under the License. - */ -package org.apache.tika.parser.microsoft; - -import java.io.BufferedReader; -import java.io.IOException; -import java.io.InputStream; -import java.io.StringReader; -import java.util.Arrays; -import java.util.Collections; -import java.util.HashSet; -import java.util.Set; - -import org.apache.poi.hssf.extractor.OldExcelExtractor; -import org.apache.tika.exception.TikaException; -import org.apache.tika.metadata.Metadata; -import org.apache.tika.mime.MediaType; -import org.apache.tika.parser.AbstractParser; -import org.apache.tika.parser.ParseContext; -import org.apache.tika.sax.XHTMLContentHandler; -import org.xml.sax.ContentHandler; -import org.xml.sax.SAXException; - -/** - * A POI-powered Tika Parser for very old versions of Excel, from - * pre-OLE2 days, such as Excel 4. 
- */ -public class OldExcelParser extends AbstractParser { - private static final long serialVersionUID = 4611820730372823452L; - - private static final Set SUPPORTED_TYPES = - Collections.unmodifiableSet(new HashSet(Arrays.asList( - MediaType.application("vnd.ms-excel.sheet.4"), - MediaType.application("vnd.ms-excel.workspace.4"), - MediaType.application("vnd.ms-excel.sheet.3"), - MediaType.application("vnd.ms-excel.workspace.3"), - MediaType.application("vnd.ms-excel.sheet.2") - ))); - - protected static void parse(OldExcelExtractor extractor, - XHTMLContentHandler xhtml) throws TikaException, IOException, SAXException { - // Get the whole text, as a single string - String text = extractor.getText(); - - // Split and output - xhtml.startDocument(); - - String line; - BufferedReader reader = new BufferedReader(new StringReader(text)); - while ((line = reader.readLine()) != null) { - xhtml.startElement("p"); - xhtml.characters(line); - xhtml.endElement("p"); - } - - xhtml.endDocument(); - } - - public Set getSupportedTypes(ParseContext context) { - return SUPPORTED_TYPES; - } - - /** - * Extracts properties and text from an MS Document input stream - */ - public void parse( - InputStream stream, ContentHandler handler, - Metadata metadata, ParseContext context) - throws IOException, SAXException, TikaException { - // Open the POI provided extractor - OldExcelExtractor extractor = new OldExcelExtractor(stream); - - // We can't do anything about metadata, as these old formats - // didn't have any stored with them - - // Set the content type - // TODO Get the version and type, to set as the Content Type - - // Have the text extracted and given to our Content Handler - XHTMLContentHandler xhtml = new XHTMLContentHandler(handler, metadata); - parse(extractor, xhtml); - } -} diff --git a/tika-parsers/src/main/java/org/apache/tika/parser/microsoft/OutlookExtractor.java b/tika-parsers/src/main/java/org/apache/tika/parser/microsoft/OutlookExtractor.java index 
14397b9..3de5305 100644 --- a/tika-parsers/src/main/java/org/apache/tika/parser/microsoft/OutlookExtractor.java +++ b/tika-parsers/src/main/java/org/apache/tika/parser/microsoft/OutlookExtractor.java @@ -18,38 +18,25 @@ import java.io.ByteArrayInputStream; import java.io.IOException; -import java.io.UnsupportedEncodingException; -import java.nio.charset.Charset; -import java.nio.charset.IllegalCharsetNameException; -import java.nio.charset.UnsupportedCharsetException; import java.text.ParseException; import java.util.Date; -import java.util.List; -import java.util.Locale; -import java.util.Map; -import java.util.regex.Matcher; -import java.util.regex.Pattern; import org.apache.poi.hmef.attribute.MAPIRtfAttribute; import org.apache.poi.hsmf.MAPIMessage; import org.apache.poi.hsmf.datatypes.AttachmentChunks; import org.apache.poi.hsmf.datatypes.ByteChunk; import org.apache.poi.hsmf.datatypes.Chunk; -import org.apache.poi.hsmf.datatypes.Chunks; import org.apache.poi.hsmf.datatypes.MAPIProperty; -import org.apache.poi.hsmf.datatypes.PropertyValue; import org.apache.poi.hsmf.datatypes.StringChunk; import org.apache.poi.hsmf.datatypes.Types; import org.apache.poi.hsmf.exceptions.ChunkNotFoundException; import org.apache.poi.poifs.filesystem.DirectoryNode; import org.apache.poi.poifs.filesystem.NPOIFSFileSystem; -import org.apache.poi.util.CodePageUtil; import org.apache.tika.exception.TikaException; import org.apache.tika.io.TikaInputStream; import org.apache.tika.metadata.Metadata; import org.apache.tika.metadata.TikaCoreProperties; import org.apache.tika.parser.ParseContext; -import org.apache.tika.parser.html.HtmlEncodingDetector; import org.apache.tika.parser.html.HtmlParser; import org.apache.tika.parser.mbox.MboxParser; import org.apache.tika.parser.rtf.RTFParser; @@ -60,15 +47,10 @@ import org.apache.tika.sax.XHTMLContentHandler; import org.xml.sax.SAXException; -import static java.nio.charset.StandardCharsets.UTF_8; - /** * Outlook Message Parser. 
*/ public class OutlookExtractor extends AbstractPOIFSExtractor { - private static final Metadata EMPTY_METADATA = new Metadata(); - HtmlEncodingDetector detector = new HtmlEncodingDetector(); - private final MAPIMessage msg; public OutlookExtractor(NPOIFSFileSystem filesystem, ParseContext context) throws TikaException { @@ -77,7 +59,7 @@ public OutlookExtractor(DirectoryNode root, ParseContext context) throws TikaException { super(context); - + try { this.msg = new MAPIMessage(root); } catch (IOException e) { @@ -88,172 +70,185 @@ public void parse(XHTMLContentHandler xhtml, Metadata metadata) throws TikaException, SAXException, IOException { try { - msg.setReturnNullOnMissingChunk(true); - - // If the message contains strings that aren't stored - // as Unicode, try to sort out an encoding for them - if (msg.has7BitEncodingStrings()) { - guess7BitEncoding(msg); - } - - // Start with the metadata - String subject = msg.getSubject(); - String from = msg.getDisplayFrom(); - - metadata.set(TikaCoreProperties.CREATOR, from); - metadata.set(Metadata.MESSAGE_FROM, from); - metadata.set(Metadata.MESSAGE_TO, msg.getDisplayTo()); - metadata.set(Metadata.MESSAGE_CC, msg.getDisplayCC()); - metadata.set(Metadata.MESSAGE_BCC, msg.getDisplayBCC()); - - metadata.set(TikaCoreProperties.TITLE, subject); - // TODO: Move to description in Tika 2.0 - metadata.set(TikaCoreProperties.TRANSITION_SUBJECT_TO_DC_DESCRIPTION, - msg.getConversationTopic()); - - try { - for (String recipientAddress : msg.getRecipientEmailAddressList()) { - if (recipientAddress != null) - metadata.add(Metadata.MESSAGE_RECIPIENT_ADDRESS, recipientAddress); - } - } catch (ChunkNotFoundException he) { - } // Will be fixed in POI 3.7 Final - - // Date - try two ways to find it - // First try via the proper chunk - if (msg.getMessageDate() != null) { - metadata.set(TikaCoreProperties.CREATED, msg.getMessageDate().getTime()); - metadata.set(TikaCoreProperties.MODIFIED, msg.getMessageDate().getTime()); - } else { - 
try { - // Failing that try via the raw headers - String[] headers = msg.getHeaders(); - if (headers != null && headers.length > 0) { - for (String header : headers) { - if (header.toLowerCase(Locale.ROOT).startsWith("date:")) { - String date = header.substring(header.indexOf(':') + 1).trim(); - - // See if we can parse it as a normal mail date - try { - Date d = MboxParser.parseDate(date); - metadata.set(TikaCoreProperties.CREATED, d); - metadata.set(TikaCoreProperties.MODIFIED, d); - } catch (ParseException e) { - // Store it as-is, and hope for the best... - metadata.set(TikaCoreProperties.CREATED, date); - metadata.set(TikaCoreProperties.MODIFIED, date); - } - break; + msg.setReturnNullOnMissingChunk(true); + + // If the message contains strings that aren't stored + // as Unicode, try to sort out an encoding for them + if(msg.has7BitEncodingStrings()) { + if(msg.getHeaders() != null) { + // There's normally something in the headers + msg.guess7BitEncoding(); + } else { + // Nothing in the header, try encoding detection + // on the message body + StringChunk text = msg.getMainChunks().textBodyChunk; + if(text != null) { + CharsetDetector detector = new CharsetDetector(); + detector.setText( text.getRawValue() ); + CharsetMatch match = detector.detect(); + if(match.getConfidence() > 35) { + msg.set7BitEncoding( match.getName() ); + } + } + } + } + + // Start with the metadata + String subject = msg.getSubject(); + String from = msg.getDisplayFrom(); + + metadata.set(TikaCoreProperties.CREATOR, from); + metadata.set(Metadata.MESSAGE_FROM, from); + metadata.set(Metadata.MESSAGE_TO, msg.getDisplayTo()); + metadata.set(Metadata.MESSAGE_CC, msg.getDisplayCC()); + metadata.set(Metadata.MESSAGE_BCC, msg.getDisplayBCC()); + + metadata.set(TikaCoreProperties.TITLE, subject); + // TODO: Move to description in Tika 2.0 + metadata.set(TikaCoreProperties.TRANSITION_SUBJECT_TO_DC_DESCRIPTION, + msg.getConversationTopic()); + + try { + for(String recipientAddress : 
msg.getRecipientEmailAddressList()) { + if(recipientAddress != null) + metadata.add(Metadata.MESSAGE_RECIPIENT_ADDRESS, recipientAddress); + } + } catch(ChunkNotFoundException he) {} // Will be fixed in POI 3.7 Final + + // Date - try two ways to find it + // First try via the proper chunk + if(msg.getMessageDate() != null) { + metadata.set(TikaCoreProperties.CREATED, msg.getMessageDate().getTime()); + metadata.set(TikaCoreProperties.MODIFIED, msg.getMessageDate().getTime()); + } else { + try { + // Failing that try via the raw headers + String[] headers = msg.getHeaders(); + if(headers != null && headers.length > 0) { + for(String header: headers) { + if(header.toLowerCase().startsWith("date:")) { + String date = header.substring(header.indexOf(':')+1).trim(); + + // See if we can parse it as a normal mail date + try { + Date d = MboxParser.parseDate(date); + metadata.set(TikaCoreProperties.CREATED, d); + metadata.set(TikaCoreProperties.MODIFIED, d); + } catch(ParseException e) { + // Store it as-is, and hope for the best... + metadata.set(TikaCoreProperties.CREATED, date); + metadata.set(TikaCoreProperties.MODIFIED, date); } + break; } - } - } catch (ChunkNotFoundException he) { - // We can't find the date, sorry... - } - } - - - xhtml.element("h1", subject); - - // Output the from and to details in text, as you - // often want them in text form for searching - xhtml.startElement("dl"); - if (from != null) { - header(xhtml, "From", from); - } - header(xhtml, "To", msg.getDisplayTo()); - header(xhtml, "Cc", msg.getDisplayCC()); - header(xhtml, "Bcc", msg.getDisplayBCC()); - try { - header(xhtml, "Recipients", msg.getRecipientEmailAddress()); - } catch (ChunkNotFoundException e) { - } - xhtml.endElement("dl"); - - // Get the message body. 
Preference order is: html, rtf, text - Chunk htmlChunk = null; - Chunk rtfChunk = null; - Chunk textChunk = null; - for (Chunk chunk : msg.getMainChunks().getChunks()) { - if (chunk.getChunkId() == MAPIProperty.BODY_HTML.id) { - htmlChunk = chunk; - } - if (chunk.getChunkId() == MAPIProperty.RTF_COMPRESSED.id) { - rtfChunk = chunk; - } - if (chunk.getChunkId() == MAPIProperty.BODY.id) { - textChunk = chunk; - } - } - - boolean doneBody = false; - xhtml.startElement("div", "class", "message-body"); - if (htmlChunk != null) { - byte[] data = null; - if (htmlChunk instanceof ByteChunk) { - data = ((ByteChunk) htmlChunk).getValue(); - } else if (htmlChunk instanceof StringChunk) { - data = ((StringChunk) htmlChunk).getRawValue(); - } - if (data != null) { - HtmlParser htmlParser = new HtmlParser(); - htmlParser.parse( - new ByteArrayInputStream(data), - new EmbeddedContentHandler(new BodyContentHandler(xhtml)), - new Metadata(), new ParseContext() - ); - doneBody = true; - } - } - if (rtfChunk != null && !doneBody) { - ByteChunk chunk = (ByteChunk) rtfChunk; - MAPIRtfAttribute rtf = new MAPIRtfAttribute( - MAPIProperty.RTF_COMPRESSED, Types.BINARY.getId(), chunk.getValue() - ); - RTFParser rtfParser = new RTFParser(); - rtfParser.parse( - new ByteArrayInputStream(rtf.getData()), - new EmbeddedContentHandler(new BodyContentHandler(xhtml)), - new Metadata(), new ParseContext()); - doneBody = true; - } - if (textChunk != null && !doneBody) { - xhtml.element("p", ((StringChunk) textChunk).getValue()); - } - xhtml.endElement("div"); - - // Process the attachments - for (AttachmentChunks attachment : msg.getAttachmentFiles()) { - xhtml.startElement("div", "class", "attachment-entry"); - - String filename = null; - if (attachment.attachLongFileName != null) { - filename = attachment.attachLongFileName.getValue(); - } else if (attachment.attachFileName != null) { - filename = attachment.attachFileName.getValue(); - } - if (filename != null && filename.length() > 0) { - 
xhtml.element("h1", filename); - } - - if (attachment.attachData != null) { - handleEmbeddedResource( - TikaInputStream.get(attachment.attachData.getValue()), - filename, null, - null, xhtml, true - ); - } - if (attachment.attachmentDirectory != null) { - handleEmbeddedOfficeDoc( - attachment.attachmentDirectory.getDirectory(), - xhtml - ); - } - - xhtml.endElement("div"); - } - } catch (ChunkNotFoundException e) { - throw new TikaException("POI MAPIMessage broken - didn't return null on missing chunk", e); + } + } + } catch(ChunkNotFoundException he) { + // We can't find the date, sorry... + } + } + + + xhtml.element("h1", subject); + + // Output the from and to details in text, as you + // often want them in text form for searching + xhtml.startElement("dl"); + if (from!=null) { + header(xhtml, "From", from); + } + header(xhtml, "To", msg.getDisplayTo()); + header(xhtml, "Cc", msg.getDisplayCC()); + header(xhtml, "Bcc", msg.getDisplayBCC()); + try { + header(xhtml, "Recipients", msg.getRecipientEmailAddress()); + } catch(ChunkNotFoundException e) {} + xhtml.endElement("dl"); + + // Get the message body. 
Preference order is: html, rtf, text + Chunk htmlChunk = null; + Chunk rtfChunk = null; + Chunk textChunk = null; + for(Chunk chunk : msg.getMainChunks().getChunks()) { + if(chunk.getChunkId() == MAPIProperty.BODY_HTML.id) { + htmlChunk = chunk; + } + if(chunk.getChunkId() == MAPIProperty.RTF_COMPRESSED.id) { + rtfChunk = chunk; + } + if(chunk.getChunkId() == MAPIProperty.BODY.id) { + textChunk = chunk; + } + } + + boolean doneBody = false; + xhtml.startElement("div", "class", "message-body"); + if(htmlChunk != null) { + byte[] data = null; + if(htmlChunk instanceof ByteChunk) { + data = ((ByteChunk)htmlChunk).getValue(); + } else if(htmlChunk instanceof StringChunk) { + data = ((StringChunk)htmlChunk).getRawValue(); + } + if(data != null) { + HtmlParser htmlParser = new HtmlParser(); + htmlParser.parse( + new ByteArrayInputStream(data), + new EmbeddedContentHandler(new BodyContentHandler(xhtml)), + new Metadata(), new ParseContext() + ); + doneBody = true; + } + } + if(rtfChunk != null && !doneBody) { + ByteChunk chunk = (ByteChunk)rtfChunk; + MAPIRtfAttribute rtf = new MAPIRtfAttribute( + MAPIProperty.RTF_COMPRESSED, Types.BINARY.getId(), chunk.getValue() + ); + RTFParser rtfParser = new RTFParser(); + rtfParser.parse( + new ByteArrayInputStream(rtf.getData()), + new EmbeddedContentHandler(new BodyContentHandler(xhtml)), + new Metadata(), new ParseContext()); + doneBody = true; + } + if(textChunk != null && !doneBody) { + xhtml.element("p", ((StringChunk)textChunk).getValue()); + } + xhtml.endElement("div"); + + // Process the attachments + for (AttachmentChunks attachment : msg.getAttachmentFiles()) { + xhtml.startElement("div", "class", "attachment-entry"); + + String filename = null; + if (attachment.attachLongFileName != null) { + filename = attachment.attachLongFileName.getValue(); + } else if (attachment.attachFileName != null) { + filename = attachment.attachFileName.getValue(); + } + if (filename != null && filename.length() > 0) { + xhtml.element("h1", 
filename); + } + + if(attachment.attachData != null) { + handleEmbeddedResource( + TikaInputStream.get(attachment.attachData.getValue()), + filename, null, + null, xhtml, true + ); + } + if(attachment.attachmentDirectory != null) { + handleEmbeddedOfficeDoc( + attachment.attachmentDirectory.getDirectory(), + xhtml + ); + } + + xhtml.endElement("div"); + } + } catch(ChunkNotFoundException e) { + throw new TikaException("POI MAPIMessage broken - didn't return null on missing chunk", e); } } @@ -264,123 +259,4 @@ xhtml.element("dd", value); } } - - /** - * Tries to identify the correct encoding for 7-bit (non-unicode) - * strings in the file. - *

- * Many messages store their strings as unicode, which is
- * nice and easy. Some use one-byte encodings for their
- * strings, but don't always store the encoding anywhere
- * helpful in the file.
- * <p/>
- * This method checks for codepage properties, and failing that
- * looks at the headers for the message, and uses these to
- * guess the correct encoding for your file.
- * <p/>
- * Bug #49441 has more on why this is needed
- * <p/>
- * This is taken verbatim from POI (TIKA-1238)
- * as a temporary workaround to prevent unsupported encoding exceptions

    - */ - private void guess7BitEncoding(MAPIMessage msg) { - Chunks mainChunks = msg.getMainChunks(); - //sanity check - if (mainChunks == null) { - return; - } - - Map> props = mainChunks.getProperties(); - if (props != null) { - // First choice is a codepage property - for (MAPIProperty prop : new MAPIProperty[]{ - MAPIProperty.MESSAGE_CODEPAGE, - MAPIProperty.INTERNET_CPID - }) { - List val = props.get(prop); - if (val != null && val.size() > 0) { - int codepage = ((PropertyValue.LongPropertyValue) val.get(0)).getValue(); - String encoding = null; - try { - encoding = CodePageUtil.codepageToEncoding(codepage, true); - } catch (UnsupportedEncodingException e) { - //swallow - } - if (tryToSet7BitEncoding(msg, encoding)) { - return; - } - } - } - } - - // Second choice is a charset on a content type header - try { - String[] headers = msg.getHeaders(); - if(headers != null && headers.length > 0) { - // Look for a content type with a charset - Pattern p = Pattern.compile("Content-Type:.*?charset=[\"']?([^;'\"]+)[\"']?", Pattern.CASE_INSENSITIVE); - - for(String header : headers) { - if(header.startsWith("Content-Type")) { - Matcher m = p.matcher(header); - if(m.matches()) { - // Found it! Tell all the string chunks - String charset = m.group(1); - if (tryToSet7BitEncoding(msg, charset)) { - return; - } - } - } - } - } - } catch(ChunkNotFoundException e) {} - - // Nothing suitable in the headers, try HTML - // TODO: do we need to replicate this in Tika? If we wind up - // parsing the html version of the email, this is duplicative?? - // Or do we need to reset the header strings based on the html - // meta header if there is no other information? 
- try { - String html = msg.getHtmlBody(); - if(html != null && html.length() > 0) { - Charset charset = null; - try { - charset = detector.detect(new ByteArrayInputStream( - html.getBytes(UTF_8)), EMPTY_METADATA); - } catch (IOException e) { - //swallow - } - if (charset != null && tryToSet7BitEncoding(msg, charset.name())) { - return; - } - } - } catch(ChunkNotFoundException e) {} - - //absolute last resort, try charset detector - StringChunk text = mainChunks.textBodyChunk; - if (text != null) { - CharsetDetector detector = new CharsetDetector(); - detector.setText(text.getRawValue()); - CharsetMatch match = detector.detect(); - if (match != null && match.getConfidence() > 35 && - tryToSet7BitEncoding(msg, match.getName())) { - return; - } - } - } - - private boolean tryToSet7BitEncoding(MAPIMessage msg, String charsetName) { - if (charsetName == null) { - return false; - } - - if (charsetName.equalsIgnoreCase("utf-8")) { - return false; - } - try { - if (Charset.isSupported(charsetName)) { - msg.set7BitEncoding(charsetName); - return true; - } - } catch (IllegalCharsetNameException | UnsupportedCharsetException e) { - //swallow - } - return false; - } } diff --git a/tika-parsers/src/main/java/org/apache/tika/parser/microsoft/POIFSContainerDetector.java b/tika-parsers/src/main/java/org/apache/tika/parser/microsoft/POIFSContainerDetector.java index 02a7330..ae7eee9 100644 --- a/tika-parsers/src/main/java/org/apache/tika/parser/microsoft/POIFSContainerDetector.java +++ b/tika-parsers/src/main/java/org/apache/tika/parser/microsoft/POIFSContainerDetector.java @@ -18,15 +18,14 @@ import static org.apache.tika.mime.MediaType.application; -import java.io.File; import java.io.IOException; import java.io.InputStream; +import java.nio.channels.FileChannel; import java.util.Collections; import java.util.HashSet; import java.util.Set; import java.util.regex.Pattern; -import org.apache.commons.io.IOUtils; import org.apache.poi.poifs.filesystem.DirectoryEntry; import 
org.apache.poi.poifs.filesystem.DirectoryNode; import org.apache.poi.poifs.filesystem.DocumentInputStream; @@ -34,145 +33,169 @@ import org.apache.poi.poifs.filesystem.Entry; import org.apache.poi.poifs.filesystem.NPOIFSFileSystem; import org.apache.tika.detect.Detector; +import org.apache.tika.io.IOUtils; import org.apache.tika.io.TikaInputStream; import org.apache.tika.metadata.Metadata; import org.apache.tika.mime.MediaType; /** * A detector that works on a POIFS OLE2 document - * to figure out exactly what the file is. + * to figure out exactly what the file is. * This should work for all OLE2 documents, whether - * they are ones supported by POI or not. + * they are ones supported by POI or not. */ public class POIFSContainerDetector implements Detector { - /** - * The OLE base file format - */ + /** Serial version UID */ + private static final long serialVersionUID = -3028021741663605293L; + + /** An ASCII String "StarImpress" */ + private static final byte [] STAR_IMPRESS = new byte [] { + 0x53, 0x74, 0x61, 0x72, 0x49, 0x6d, 0x70, 0x72, 0x65, 0x73, 0x73 + }; + + /** An ASCII String "StarDraw" */ + private static final byte [] STAR_DRAW = new byte [] { + 0x53, 0x74, 0x61, 0x72, 0x44, 0x72, 0x61, 0x77 + }; + + /** An ASCII String "Quill96" for Works Files */ + private static final byte [] WORKS_QUILL96 = new byte[] { + 0x51, 0x75, 0x69, 0x6c, 0x6c, 0x39, 0x36 + }; + + /** The OLE base file format */ public static final MediaType OLE = application("x-tika-msoffice"); - /** - * The protected OOXML base file format - */ + + /** The protected OOXML base file format */ public static final MediaType OOXML_PROTECTED = application("x-tika-ooxml-protected"); - /** - * General embedded document type within an OLE2 container - */ + + /** General embedded document type within an OLE2 container */ public static final MediaType GENERAL_EMBEDDED = application("x-tika-msoffice-embedded"); - /** - * An OLE10 Native embedded document within another OLE2 document - */ + + /** An 
OLE10 Native embedded document within another OLE2 document */ public static final MediaType OLE10_NATIVE = new MediaType(GENERAL_EMBEDDED, "format", "ole10_native"); - /** - * Some other kind of embedded document, in a CompObj container within another OLE2 document - */ + + /** Some other kind of embedded document, in a CompObj container within another OLE2 document */ public static final MediaType COMP_OBJ = new MediaType(GENERAL_EMBEDDED, "format", "comp_obj"); - /** - * Microsoft Excel - */ + + /** Microsoft Excel */ public static final MediaType XLS = application("vnd.ms-excel"); - /** - * Microsoft Word - */ + + /** Microsoft Word */ public static final MediaType DOC = application("msword"); - /** - * Microsoft PowerPoint - */ + + /** Microsoft PowerPoint */ public static final MediaType PPT = application("vnd.ms-powerpoint"); - /** - * Microsoft Publisher - */ + + /** Microsoft Publisher */ public static final MediaType PUB = application("x-mspublisher"); - /** - * Microsoft Visio - */ + + /** Microsoft Visio */ public static final MediaType VSD = application("vnd.visio"); - /** - * Microsoft Works - */ + + /** Microsoft Works */ public static final MediaType WPS = application("vnd.ms-works"); - /** - * Microsoft Works Spreadsheet 7.0 - */ + + /** Microsoft Works Spreadsheet 7.0 */ public static final MediaType XLR = application("x-tika-msworks-spreadsheet"); - /** - * Microsoft Outlook - */ + + /** Microsoft Outlook */ public static final MediaType MSG = application("vnd.ms-outlook"); - /** - * Microsoft Project - */ + + /** Microsoft Project */ public static final MediaType MPP = application("vnd.ms-project"); - /** - * StarOffice Calc - */ + + /** StarOffice Calc */ public static final MediaType SDC = application("vnd.stardivision.calc"); - /** - * StarOffice Draw - */ + + /** StarOffice Draw */ public static final MediaType SDA = application("vnd.stardivision.draw"); - /** - * StarOffice Impress - */ + + /** StarOffice Impress */ public static final 
MediaType SDD = application("vnd.stardivision.impress"); - /** - * StarOffice Writer - */ + + /** StarOffice Writer */ public static final MediaType SDW = application("vnd.stardivision.writer"); - /** - * SolidWorks CAD file - */ + + /** SolidWorks CAD file */ public static final MediaType SLDWORKS = application("sldworks"); - /** - * Hangul Word Processor (Korean) - */ - public static final MediaType HWP = application("x-hwp-v5"); - /** - * Serial version UID - */ - private static final long serialVersionUID = -3028021741663605293L; - /** - * An ASCII String "StarImpress" - */ - private static final byte[] STAR_IMPRESS = new byte[]{ - 0x53, 0x74, 0x61, 0x72, 0x49, 0x6d, 0x70, 0x72, 0x65, 0x73, 0x73 - }; - /** - * An ASCII String "StarDraw" - */ - private static final byte[] STAR_DRAW = new byte[]{ - 0x53, 0x74, 0x61, 0x72, 0x44, 0x72, 0x61, 0x77 - }; - /** - * An ASCII String "Quill96" for Works Files - */ - private static final byte[] WORKS_QUILL96 = new byte[]{ - 0x51, 0x75, 0x69, 0x6c, 0x6c, 0x39, 0x36 - }; - /** - * Regexp for matching the MPP Project Data stream - */ + + /** Regexp for matching the MPP Project Data stream */ private static final Pattern mppDataMatch = Pattern.compile("\\s\\s\\s\\d+"); + + public MediaType detect(InputStream input, Metadata metadata) + throws IOException { + // Check if we have access to the document + if (input == null) { + return MediaType.OCTET_STREAM; + } + + // If this is a TikaInputStream wrapping an already + // parsed NPOIFileSystem/DirectoryNode, just get the + // names from the root: + TikaInputStream tis = TikaInputStream.cast(input); + Set names = null; + if (tis != null) { + Object container = tis.getOpenContainer(); + if (container instanceof NPOIFSFileSystem) { + names = getTopLevelNames(((NPOIFSFileSystem) container).getRoot()); + } else if (container instanceof DirectoryNode) { + names = getTopLevelNames((DirectoryNode) container); + } + } + + if (names == null) { + // Check if the document starts with the OLE 
header + input.mark(8); + try { + if (input.read() != 0xd0 || input.read() != 0xcf + || input.read() != 0x11 || input.read() != 0xe0 + || input.read() != 0xa1 || input.read() != 0xb1 + || input.read() != 0x1a || input.read() != 0xe1) { + return MediaType.OCTET_STREAM; + } + } finally { + input.reset(); + } + } + + // We can only detect the exact type when given a TikaInputStream + if (names == null && tis != null) { + // Look for known top level entry names to detect the document type + names = getTopLevelNames(tis); + } + + // Detect based on the names (as available) + if (tis != null && + tis.getOpenContainer() != null && + tis.getOpenContainer() instanceof NPOIFSFileSystem) { + return detect(names, ((NPOIFSFileSystem)tis.getOpenContainer()).getRoot()); + } else { + return detect(names, null); + } + } /** * Internal detection of the specific kind of OLE2 document, based on the * names of the top level streams within the file. - * + * * @deprecated Use {@link #detect(Set, DirectoryEntry)} and pass the root - * entry of the filesystem whose type is to be detected, as a - * second argument. + * entry of the filesystem whose type is to be detected, as a + * second argument. */ protected static MediaType detect(Set names) { return detect(names, null); } - + /** * Internal detection of the specific kind of OLE2 document, based on the * names of the top-level streams within the file. In some cases the * detection may need access to the root {@link DirectoryEntry} of that file * for best results. The entry can be given as a second, optional argument. 
- * + * * @param names * @param root * @return @@ -200,27 +223,24 @@ } else { return processCompObjFormatType(root); } - } else if (names.contains("\u0005HwpSummaryInformation")) { - // Hangul Word Processor v5+ (previous aren't OLE2-based) - return HWP; } else if (names.contains("WksSSWorkBook")) { // This check has to be before names.contains("Workbook") // Works 7.0 spreadsheet files contain both // we want to avoid classifying this as Excel - return XLR; + return XLR; } else if (names.contains("Workbook") || names.contains("WORKBOOK")) { return XLS; } else if (names.contains("Book")) { - // Excel 95 or older, we won't be able to parse this.... - return XLS; - } else if (names.contains("EncryptedPackage") && + // Excel 95 or older, we won't be able to parse this.... + return XLS; + } else if (names.contains("EncryptedPackage") && names.contains("EncryptionInfo") && names.contains("\u0006DataSpaces")) { // This is a protected OOXML document, which is an OLE2 file // with an Encrypted Stream which holds the OOXML data // Without decrypting the stream, we can't tell what kind of // OOXML file we have. Return a general OOXML Protected type, - // and hope the name based detection can guess the rest! + // and hope the name based detection can guess the rest! 
return OOXML_PROTECTED; } else if (names.contains("EncryptedPackage")) { return OLE; @@ -243,33 +263,33 @@ } else if (names.contains("Contents") && names.contains("\u0003ObjInfo")) { return COMP_OBJ; } else if (names.contains("CONTENTS") && names.contains("\u0001CompObj")) { - // CompObj is a general kind of OLE2 embedding, but this may be an old Works file - // If we have the Directory, check - if (root != null) { - MediaType type = processCompObjFormatType(root); - if (type == WPS) { - return WPS; - } else { - // Assume it's a general CompObj embedded resource - return COMP_OBJ; - } - } else { - // Assume it's a general CompObj embedded resource - return COMP_OBJ; - } + // CompObj is a general kind of OLE2 embedding, but this may be an old Works file + // If we have the Directory, check + if (root != null) { + MediaType type = processCompObjFormatType(root); + if (type == WPS) { + return WPS; + } else { + // Assume it's a general CompObj embedded resource + return COMP_OBJ; + } + } else { + // Assume it's a general CompObj embedded resource + return COMP_OBJ; + } } else if (names.contains("CONTENTS")) { - // CONTENTS without SPELLING nor CompObj normally means some sort - // of embedded non-office file inside an OLE2 document - // This is most commonly triggered on nested directories - return OLE; + // CONTENTS without SPELLING nor CompObj normally means some sort + // of embedded non-office file inside an OLE2 document + // This is most commonly triggered on nested directories + return OLE; } else if (names.contains("\u0001CompObj") && - (names.contains("Props") || names.contains("Props9") || names.contains("Props12"))) { - // Could be Project, look for common name patterns - for (String name : names) { - if (mppDataMatch.matcher(name).matches()) { - return MPP; - } - } + (names.contains("Props") || names.contains("Props9") || names.contains("Props12"))) { + // Could be Project, look for common name patterns + for (String name : names) { + if 
(mppDataMatch.matcher(name).matches()) { + return MPP; + } + } } else if (names.contains("PerfectOffice_MAIN")) { if (names.contains("SlideShow")) { return MediaType.application("x-corelpresentations"); // .shw @@ -293,36 +313,36 @@ /** * Is this one of the kinds of formats which uses CompObj to - * store all of their data, eg Star Draw, Star Impress or - * (older) Works? + * store all of their data, eg Star Draw, Star Impress or + * (older) Works? * If not, it's likely an embedded resource */ private static MediaType processCompObjFormatType(DirectoryEntry root) { try { Entry e = root.getEntry("\u0001CompObj"); if (e != null && e.isDocumentEntry()) { - DocumentNode dn = (DocumentNode) e; + DocumentNode dn = (DocumentNode)e; DocumentInputStream stream = new DocumentInputStream(dn); - byte[] bytes = IOUtils.toByteArray(stream); + byte [] bytes = IOUtils.toByteArray(stream); /* * This array contains a string with a normal ASCII name of the * application used to create this file. We want to search for that * name. */ - if (arrayContains(bytes, STAR_DRAW)) { + if ( arrayContains(bytes, STAR_DRAW) ) { return SDA; } else if (arrayContains(bytes, STAR_IMPRESS)) { return SDD; } else if (arrayContains(bytes, WORKS_QUILL96)) { - return WPS; - } - } + return WPS; + } + } } catch (Exception e) { /* * "root.getEntry" can throw FileNotFoundException. The code inside * "if" can throw IOExceptions. Theoretically. Practically no * exceptions will likely ever appear. - * + * * Swallow all of them. 
If any occur, we just assume that we can't * distinguish between Draw and Impress and return something safe: * x-tika-msoffice @@ -330,10 +350,10 @@ } return OLE; } - + // poor man's search for byte arrays, replace with some library call if // you know one without adding new dependencies - private static boolean arrayContains(byte[] larger, byte[] smaller) { + private static boolean arrayContains(byte [] larger, byte [] smaller) { int largerCounter = 0; int smallerCounter = 0; while (largerCounter < larger.length) { @@ -345,7 +365,7 @@ } } else { largerCounter = largerCounter - smallerCounter + 1; - smallerCounter = 0; + smallerCounter=0; } } return false; @@ -355,10 +375,10 @@ throws IOException { // Force the document stream to a (possibly temporary) file // so we don't modify the current position of the stream - File file = stream.getFile(); + FileChannel channel = stream.getFileChannel(); try { - NPOIFSFileSystem fs = new NPOIFSFileSystem(file, true); + NPOIFSFileSystem fs = new NPOIFSFileSystem(channel); // Optimize a possible later parsing process by keeping // a reference to the already opened POI file system @@ -381,56 +401,4 @@ } return names; } - - public MediaType detect(InputStream input, Metadata metadata) - throws IOException { - // Check if we have access to the document - if (input == null) { - return MediaType.OCTET_STREAM; - } - - // If this is a TikaInputStream wrapping an already - // parsed NPOIFileSystem/DirectoryNode, just get the - // names from the root: - TikaInputStream tis = TikaInputStream.cast(input); - Set names = null; - if (tis != null) { - Object container = tis.getOpenContainer(); - if (container instanceof NPOIFSFileSystem) { - names = getTopLevelNames(((NPOIFSFileSystem) container).getRoot()); - } else if (container instanceof DirectoryNode) { - names = getTopLevelNames((DirectoryNode) container); - } - } - - if (names == null) { - // Check if the document starts with the OLE header - input.mark(8); - try { - if (input.read() != 
0xd0 || input.read() != 0xcf - || input.read() != 0x11 || input.read() != 0xe0 - || input.read() != 0xa1 || input.read() != 0xb1 - || input.read() != 0x1a || input.read() != 0xe1) { - return MediaType.OCTET_STREAM; - } - } finally { - input.reset(); - } - } - - // We can only detect the exact type when given a TikaInputStream - if (names == null && tis != null) { - // Look for known top level entry names to detect the document type - names = getTopLevelNames(tis); - } - - // Detect based on the names (as available) - if (tis != null && - tis.getOpenContainer() != null && - tis.getOpenContainer() instanceof NPOIFSFileSystem) { - return detect(names, ((NPOIFSFileSystem) tis.getOpenContainer()).getRoot()); - } else { - return detect(names, null); - } - } } diff --git a/tika-parsers/src/main/java/org/apache/tika/parser/microsoft/SummaryExtractor.java b/tika-parsers/src/main/java/org/apache/tika/parser/microsoft/SummaryExtractor.java index ce9b2fa..fdab90c 100644 --- a/tika-parsers/src/main/java/org/apache/tika/parser/microsoft/SummaryExtractor.java +++ b/tika-parsers/src/main/java/org/apache/tika/parser/microsoft/SummaryExtractor.java @@ -19,8 +19,6 @@ import java.io.FileNotFoundException; import java.io.IOException; import java.util.Date; -import java.util.HashSet; -import java.util.Set; import org.apache.commons.logging.Log; import org.apache.commons.logging.LogFactory; @@ -52,10 +50,10 @@ private static final Log logger = LogFactory.getLog(AbstractPOIFSExtractor.class); private static final String SUMMARY_INFORMATION = - SummaryInformation.DEFAULT_STREAM_NAME; + SummaryInformation.DEFAULT_STREAM_NAME; private static final String DOCUMENT_SUMMARY_INFORMATION = - DocumentSummaryInformation.DEFAULT_STREAM_NAME; + DocumentSummaryInformation.DEFAULT_STREAM_NAME; private final Metadata metadata; @@ -79,9 +77,9 @@ throws IOException, TikaException { try { DocumentEntry entry = - (DocumentEntry) root.getEntry(entryName); + (DocumentEntry) root.getEntry(entryName); 
PropertySet properties = - new PropertySet(new DocumentInputStream(entry)); + new PropertySet(new DocumentInputStream(entry)); if (properties.isSummaryInformation()) { parse(new SummaryInformation(properties)); } @@ -103,7 +101,7 @@ private void parse(SummaryInformation summary) { set(TikaCoreProperties.TITLE, summary.getTitle()); - addMulti(metadata, TikaCoreProperties.CREATOR, summary.getAuthor()); + set(TikaCoreProperties.CREATOR, summary.getAuthor()); set(TikaCoreProperties.KEYWORDS, summary.getKeywords()); // TODO Move to OO subject in Tika 2.0 set(TikaCoreProperties.TRANSITION_SUBJECT_TO_OO_SUBJECT, summary.getSubject()); @@ -117,7 +115,7 @@ set(TikaCoreProperties.PRINT_DATE, summary.getLastPrinted()); set(Metadata.EDIT_TIME, summary.getEditTime()); set(OfficeOpenXMLExtended.DOC_SECURITY, summary.getSecurity()); - + // New style counts set(Office.WORD_COUNT, summary.getWordCount()); set(Office.CHARACTER_COUNT, summary.getCharCount()); @@ -125,7 +123,7 @@ if (summary.getPageCount() > 0) { metadata.set(PagedText.N_PAGES, summary.getPageCount()); } - + // Old style, Tika 1.0 properties // TODO Remove these in Tika 2.0 set(Metadata.TEMPLATE, summary.getTemplate()); @@ -139,10 +137,10 @@ private void parse(DocumentSummaryInformation summary) { set(OfficeOpenXMLExtended.COMPANY, summary.getCompany()); - addMulti(metadata, OfficeOpenXMLExtended.MANAGER, summary.getManager()); + set(OfficeOpenXMLExtended.MANAGER, summary.getManager()); set(TikaCoreProperties.LANGUAGE, getLanguage(summary)); set(OfficeOpenXMLCore.CATEGORY, summary.getCategory()); - + // New style counts set(Office.SLIDE_COUNT, summary.getSlideCount()); if (summary.getSlideCount() > 0) { @@ -154,7 +152,7 @@ set(Metadata.MANAGER, summary.getManager()); set(MSOffice.SLIDE_COUNT, summary.getSlideCount()); set(Metadata.CATEGORY, summary.getCategory()); - + parse(summary.getCustomProperties()); } @@ -171,7 +169,6 @@ /** * Attempt to parse custom document properties and add to the collection of metadata - * 
* @param customProperties */ private void parse(CustomProperties customProperties) { @@ -182,23 +179,23 @@ // Get, convert and save property value Object value = customProperties.get(name); - if (value instanceof String) { - set(key, (String) value); + if (value instanceof String){ + set(key, (String)value); } else if (value instanceof Date) { Property prop = Property.externalDate(key); - metadata.set(prop, (Date) value); + metadata.set(prop, (Date)value); } else if (value instanceof Boolean) { Property prop = Property.externalBoolean(key); - metadata.set(prop, value.toString()); + metadata.set(prop, ((Boolean)value).toString()); } else if (value instanceof Long) { Property prop = Property.externalInteger(key); - metadata.set(prop, ((Long) value).intValue()); + metadata.set(prop, ((Long)value).intValue()); } else if (value instanceof Double) { Property prop = Property.externalReal(key); - metadata.set(prop, (Double) value); + metadata.set(prop, ((Double)value).doubleValue()); } else if (value instanceof Integer) { Property prop = Property.externalInteger(key); - metadata.set(prop, ((Integer) value).intValue()); + metadata.set(prop, ((Integer)value).intValue()); } } } @@ -209,7 +206,7 @@ metadata.set(name, value); } } - + private void set(Property property, String value) { if (value != null) { metadata.set(property, value); @@ -233,28 +230,4 @@ metadata.set(name, Long.toString(value)); } } - - //MS stores values that should be multiple values (e.g. dc:creator) - //as a semicolon-delimited list. We need to split - //on semicolon to add each value. - public static void addMulti(Metadata metadata, Property property, String string) { - if (string == null) { - return; - } - String[] parts = string.split(";"); - String[] current = metadata.getValues(property); - Set seen = new HashSet<>(); - if (current != null) { - for (String val : current) { - seen.add(val); - } - } - for (String part : parts) { - if (! 
seen.contains(part)) { - metadata.add(property, part); - seen.add(part); - } - } - } - } diff --git a/tika-parsers/src/main/java/org/apache/tika/parser/microsoft/TNEFParser.java b/tika-parsers/src/main/java/org/apache/tika/parser/microsoft/TNEFParser.java index 879546b..48c48be 100644 --- a/tika-parsers/src/main/java/org/apache/tika/parser/microsoft/TNEFParser.java +++ b/tika-parsers/src/main/java/org/apache/tika/parser/microsoft/TNEFParser.java @@ -43,17 +43,17 @@ /** * A POI-powered Tika Parser for TNEF (Transport Neutral - * Encoding Format) messages, aka winmail.dat + * Encoding Format) messages, aka winmail.dat */ public class TNEFParser extends AbstractParser { - private static final long serialVersionUID = 4611820730372823452L; - - private static final Set SUPPORTED_TYPES = - Collections.unmodifiableSet(new HashSet(Arrays.asList( - MediaType.application("vnd.ms-tnef"), - MediaType.application("ms-tnef"), - MediaType.application("x-tnef") - ))); + private static final long serialVersionUID = 4611820730372823452L; + + private static final Set SUPPORTED_TYPES = + Collections.unmodifiableSet(new HashSet(Arrays.asList( + MediaType.application("vnd.ms-tnef"), + MediaType.application("ms-tnef"), + MediaType.application("x-tnef") + ))); public Set getSupportedTypes(ParseContext context) { return SUPPORTED_TYPES; @@ -66,70 +66,70 @@ InputStream stream, ContentHandler handler, Metadata metadata, ParseContext context) throws IOException, SAXException, TikaException { + + // We work by recursing, so get the appropriate bits + EmbeddedDocumentExtractor ex = context.get(EmbeddedDocumentExtractor.class); + EmbeddedDocumentExtractor embeddedExtractor; + if (ex==null) { + embeddedExtractor = new ParsingEmbeddedDocumentExtractor(context); + } else { + embeddedExtractor = ex; + } + + // Ask POI to process the file for us + HMEFMessage msg = new HMEFMessage(stream); + + // Set the message subject if known + String subject = msg.getSubject(); + if(subject != null && 
subject.length() > 0) { + // TODO: Move to title in Tika 2.0 + metadata.set(TikaCoreProperties.TRANSITION_SUBJECT_TO_DC_TITLE, subject); + } + + // Recurse into the message body RTF + MAPIAttribute attr = msg.getMessageMAPIAttribute(MAPIProperty.RTF_COMPRESSED); + if(attr != null && attr instanceof MAPIRtfAttribute) { + MAPIRtfAttribute rtf = (MAPIRtfAttribute)attr; + handleEmbedded( + "message.rtf", "application/rtf", + rtf.getData(), + embeddedExtractor, handler + ); + } + + // Recurse into each attachment in turn + for(Attachment attachment : msg.getAttachments()) { + String name = attachment.getLongFilename(); + if(name == null || name.length() == 0) { + name = attachment.getFilename(); + } + if(name == null || name.length() == 0) { + String ext = attachment.getExtension(); + if(ext != null) { + name = "unknown" + ext; + } + } + handleEmbedded( + name, null, attachment.getContents(), + embeddedExtractor, handler + ); + } + } + + private void handleEmbedded(String name, String type, byte[] contents, + EmbeddedDocumentExtractor embeddedExtractor, ContentHandler handler) + throws IOException, SAXException, TikaException { + Metadata metadata = new Metadata(); + if(name != null) + metadata.set(Metadata.RESOURCE_NAME_KEY, name); + if(type != null) + metadata.set(Metadata.CONTENT_TYPE, type); - // We work by recursing, so get the appropriate bits - EmbeddedDocumentExtractor ex = context.get(EmbeddedDocumentExtractor.class); - EmbeddedDocumentExtractor embeddedExtractor; - if (ex == null) { - embeddedExtractor = new ParsingEmbeddedDocumentExtractor(context); - } else { - embeddedExtractor = ex; - } - - // Ask POI to process the file for us - HMEFMessage msg = new HMEFMessage(stream); - - // Set the message subject if known - String subject = msg.getSubject(); - if (subject != null && subject.length() > 0) { - // TODO: Move to title in Tika 2.0 - metadata.set(TikaCoreProperties.TRANSITION_SUBJECT_TO_DC_TITLE, subject); - } - - // Recurse into the message body RTF - 
MAPIAttribute attr = msg.getMessageMAPIAttribute(MAPIProperty.RTF_COMPRESSED); - if (attr != null && attr instanceof MAPIRtfAttribute) { - MAPIRtfAttribute rtf = (MAPIRtfAttribute) attr; - handleEmbedded( - "message.rtf", "application/rtf", - rtf.getData(), - embeddedExtractor, handler - ); - } - - // Recurse into each attachment in turn - for (Attachment attachment : msg.getAttachments()) { - String name = attachment.getLongFilename(); - if (name == null || name.length() == 0) { - name = attachment.getFilename(); - } - if (name == null || name.length() == 0) { - String ext = attachment.getExtension(); - if (ext != null) { - name = "unknown" + ext; - } - } - handleEmbedded( - name, null, attachment.getContents(), - embeddedExtractor, handler - ); - } - } - - private void handleEmbedded(String name, String type, byte[] contents, - EmbeddedDocumentExtractor embeddedExtractor, ContentHandler handler) - throws IOException, SAXException, TikaException { - Metadata metadata = new Metadata(); - if (name != null) - metadata.set(Metadata.RESOURCE_NAME_KEY, name); - if (type != null) - metadata.set(Metadata.CONTENT_TYPE, type); - - if (embeddedExtractor.shouldParseEmbedded(metadata)) { - embeddedExtractor.parseEmbedded( - TikaInputStream.get(contents), - new EmbeddedContentHandler(handler), - metadata, false); - } + if (embeddedExtractor.shouldParseEmbedded(metadata)) { + embeddedExtractor.parseEmbedded( + TikaInputStream.get(contents), + new EmbeddedContentHandler(handler), + metadata, false); + } } } diff --git a/tika-parsers/src/main/java/org/apache/tika/parser/microsoft/WordExtractor.java b/tika-parsers/src/main/java/org/apache/tika/parser/microsoft/WordExtractor.java index 15984fe..65e7876 100644 --- a/tika-parsers/src/main/java/org/apache/tika/parser/microsoft/WordExtractor.java +++ b/tika-parsers/src/main/java/org/apache/tika/parser/microsoft/WordExtractor.java @@ -22,7 +22,6 @@ import java.util.HashMap; import java.util.HashSet; import java.util.List; -import 
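The attachment loop in `TNEFParser` above picks a resource name by falling through three sources: the long filename, the short (8.3) filename, and finally `"unknown"` plus the extension. A minimal sketch of that fallback chain (the helper name `chooseName` is invented for illustration; the original inlines this logic in the loop):

```java
public class TnefNames {
    // Fallback chain used when naming a TNEF attachment: prefer the long
    // filename, then the short (8.3) filename, then "unknown" + extension.
    static String chooseName(String longName, String shortName, String ext) {
        if (longName != null && longName.length() > 0) {
            return longName;
        }
        if (shortName != null && shortName.length() > 0) {
            return shortName;
        }
        if (ext != null) {
            return "unknown" + ext;
        }
        return null; // no usable name at all
    }
}
```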
java.util.Locale; import java.util.Map; import java.util.Set; @@ -53,82 +52,19 @@ import org.xml.sax.SAXException; import org.xml.sax.helpers.AttributesImpl; -import static java.nio.charset.StandardCharsets.UTF_8; - public class WordExtractor extends AbstractPOIFSExtractor { private static final char UNICODECHAR_NONBREAKING_HYPHEN = '\u2011'; private static final char UNICODECHAR_ZERO_WIDTH_SPACE = '\u200b'; - // could be improved by using the real delimiter in xchFollow [MS-DOC], v20140721, 2.4.6.3, Part 3, Step 3 - private static final String LIST_DELIMITER = " "; - private static final Map fixedParagraphStyles = new HashMap(); - private static final TagAndStyle defaultParagraphStyle = new TagAndStyle("p", null); - - static { - fixedParagraphStyles.put("Default", defaultParagraphStyle); - fixedParagraphStyles.put("Normal", defaultParagraphStyle); - fixedParagraphStyles.put("heading", new TagAndStyle("h1", null)); - fixedParagraphStyles.put("Heading", new TagAndStyle("h1", null)); - fixedParagraphStyles.put("Title", new TagAndStyle("h1", "title")); - fixedParagraphStyles.put("Subtitle", new TagAndStyle("h2", "subtitle")); - fixedParagraphStyles.put("HTML Preformatted", new TagAndStyle("pre", null)); + + public WordExtractor(ParseContext context) { + super(context); } // True if we are currently in the named style tag: private boolean curStrikeThrough; private boolean curBold; private boolean curItalic; - - public WordExtractor(ParseContext context) { - super(context); - } - - private static int countParagraphs(Range... ranges) { - int count = 0; - for (Range r : ranges) { - if (r != null) { - count += r.numParagraphs(); - } - } - return count; - } - - /** - * Given a style name, return what tag should be used, and - * what style should be applied to it. 
- */ - public static TagAndStyle buildParagraphTagAndStyle(String styleName, boolean isTable) { - TagAndStyle tagAndStyle = fixedParagraphStyles.get(styleName); - if (tagAndStyle != null) { - return tagAndStyle; - } - - if (styleName.equals("Table Contents") && isTable) { - return defaultParagraphStyle; - } - - String tag = "p"; - String styleClass = null; - - if (styleName.startsWith("heading") || styleName.startsWith("Heading")) { - // "Heading 3" or "Heading2" or "heading 4" - int num = 1; - try { - num = Integer.parseInt( - styleName.substring(styleName.length() - 1) - ); - } catch (NumberFormatException e) { - } - // Turn it into a H1 - H6 (H7+ isn't valid!) - tag = "h" + Math.min(num, 6); - } else { - styleClass = styleName.replace(' ', '_'); - styleClass = styleClass.substring(0, 1).toLowerCase(Locale.ROOT) + - styleClass.substring(1); - } - - return new TagAndStyle(tag, styleClass); - } protected void parse( NPOIFSFileSystem filesystem, XHTMLContentHandler xhtml) @@ -142,12 +78,12 @@ HWPFDocument document; try { document = new HWPFDocument(root); - } catch (OldWordFileFormatException e) { + } catch(OldWordFileFormatException e) { parseWord6(root, xhtml); return; } org.apache.poi.hwpf.extractor.WordExtractor wordExtractor = - new org.apache.poi.hwpf.extractor.WordExtractor(document); + new org.apache.poi.hwpf.extractor.WordExtractor(document); HeaderStories headerFooter = new HeaderStories(document); // Grab the list of pictures. 
As far as we can tell, @@ -157,24 +93,23 @@ PicturesSource pictures = new PicturesSource(document); // Do any headers, if present - Range[] headers = new Range[]{headerFooter.getFirstHeaderSubrange(), - headerFooter.getEvenHeaderSubrange(), headerFooter.getOddHeaderSubrange()}; + Range[] headers = new Range[] { headerFooter.getFirstHeaderSubrange(), + headerFooter.getEvenHeaderSubrange(), headerFooter.getOddHeaderSubrange() }; handleHeaderFooter(headers, "header", document, pictures, pictureTable, xhtml); // Do the main paragraph text Range r = document.getRange(); - ListManager listManager = new ListManager(document); - for (int i = 0; i < r.numParagraphs(); i++) { - Paragraph p = r.getParagraph(i); - i += handleParagraph(p, 0, r, document, FieldsDocumentPart.MAIN, pictures, pictureTable, listManager, xhtml); + for(int i=0; i 0) { xhtml.startElement("div", "class", type); - ListManager listManager = new ListManager(document); for (Range r : ranges) { if (r != null) { - for (int i = 0; i < r.numParagraphs(); i++) { + for(int i=0; i parentTableLevel && parentTableLevel == 0) { - Table t = r.getTable(p); - xhtml.startElement("table"); - xhtml.startElement("tbody"); - for (int rn = 0; rn < t.numRows(); rn++) { - TableRow row = t.getRow(rn); - xhtml.startElement("tr"); - for (int cn = 0; cn < row.numCells(); cn++) { - TableCell cell = row.getCell(cn); - xhtml.startElement("td"); - - for (int pn = 0; pn < cell.numParagraphs(); pn++) { - Paragraph cellP = cell.getParagraph(pn); - handleParagraph(cellP, p.getTableLevel(), cell, document, docPart, pictures, pictureTable, listManager, xhtml); - } - xhtml.endElement("td"); + + private int handleParagraph(Paragraph p, int parentTableLevel, Range r, HWPFDocument document, + FieldsDocumentPart docPart, PicturesSource pictures, PicturesTable pictureTable, + XHTMLContentHandler xhtml) throws SAXException, IOException, TikaException { + // Note - a poi bug means we can't currently properly recurse + // into nested tables, so 
currently we don't + if(p.isInTable() && p.getTableLevel() > parentTableLevel && parentTableLevel==0) { + Table t = r.getTable(p); + xhtml.startElement("table"); + xhtml.startElement("tbody"); + for(int rn=0; rn p.getStyleIndex()) { - StyleDescription style = - document.getStyleSheet().getStyleDescription(p.getStyleIndex()); - if (style != null && style.getName() != null && style.getName().length() > 0) { - if (p.isInList()) { - numbering = listManager.getFormattedNumber(p); - } - tas = buildParagraphTagAndStyle(style.getName(), (parentTableLevel > 0)); - } else { - tas = new TagAndStyle("p", null); - } - } else { - tas = new TagAndStyle("p", null); - } - - if (tas.getStyleClass() != null) { - xhtml.startElement(tas.getTag(), "class", tas.getStyleClass()); - } else { - xhtml.startElement(tas.getTag()); - } - - if (numbering != null) { - xhtml.characters(numbering); - } - - for (int j = 0; j < p.numCharacterRuns(); j++) { - CharacterRun cr = p.getCharacterRun(j); - - // FIELD_BEGIN_MARK: - if (cr.text().getBytes(UTF_8)[0] == 0x13) { - Field field = document.getFields().getFieldByStartOffset(docPart, cr.getStartOffset()); - // 58 is an embedded document - // 56 is a document link - if (field != null && (field.getType() == 58 || field.getType() == 56)) { - // Embedded Object: add a
    so consumer can see where - // in the main text each embedded document - // occurred: - String id = "_" + field.getMarkSeparatorCharacterRun(r).getPicOffset(); - AttributesImpl attributes = new AttributesImpl(); - attributes.addAttribute("", "class", "class", "CDATA", "embedded"); - attributes.addAttribute("", "id", "id", "CDATA", id); - xhtml.startElement("div", attributes); - xhtml.endElement("div"); - } - } - - if (cr.text().equals("\u0013")) { - j += handleSpecialCharacterRuns(p, j, tas.isHeading(), pictures, xhtml); - } else if (cr.text().startsWith("\u0008")) { - // Floating Picture(s) - for (int pn = 0; pn < cr.text().length(); pn++) { - // Assume they're in the order from the unclaimed list... - Picture picture = pictures.nextUnclaimed(); - - // Output - handlePictureCharacterRun(cr, picture, pictures, xhtml); - } - } else if (pictureTable.hasPicture(cr)) { - // Inline Picture - Picture picture = pictures.getFor(cr); + xhtml.endElement("td"); + } + xhtml.endElement("tr"); + } + xhtml.endElement("tbody"); + xhtml.endElement("table"); + return (t.numParagraphs()-1); + } + + TagAndStyle tas; + + if (document.getStyleSheet().numStyles()>p.getStyleIndex()) { + StyleDescription style = + document.getStyleSheet().getStyleDescription(p.getStyleIndex()); + if (style != null && style.getName() != null && style.getName().length() > 0) { + tas = buildParagraphTagAndStyle(style.getName(), (parentTableLevel>0)); + } else { + tas = new TagAndStyle("p", null); + } + } else { + tas = new TagAndStyle("p", null); + } + + if(tas.getStyleClass() != null) { + xhtml.startElement(tas.getTag(), "class", tas.getStyleClass()); + } else { + xhtml.startElement(tas.getTag()); + } + + for(int j=0; j so consumer can see where + // in the main text each embedded document + // occurred: + String id = "_" + field.getMarkSeparatorCharacterRun(r).getPicOffset(); + AttributesImpl attributes = new AttributesImpl(); + attributes.addAttribute("", "class", "class", "CDATA", "embedded"); + 
attributes.addAttribute("", "id", "id", "CDATA", id); + xhtml.startElement("div", attributes); + xhtml.endElement("div"); + } + } + + if(cr.text().equals("\u0013")) { + j += handleSpecialCharacterRuns(p, j, tas.isHeading(), pictures, xhtml); + } else if(cr.text().startsWith("\u0008")) { + // Floating Picture(s) + for(int pn=0; pn controls = new ArrayList(); - List texts = new ArrayList(); - boolean has14 = false; - - // Split it into before and after the 14 - int i; - for (i = index + 1; i < p.numCharacterRuns(); i++) { - CharacterRun cr = p.getCharacterRun(i); - if (cr.text().equals("\u0013")) { - // Nested, oh joy... - int increment = handleSpecialCharacterRuns(p, i + 1, skipStyling, pictures, xhtml); - i += increment; - } else if (cr.text().equals("\u0014")) { - has14 = true; - } else if (cr.text().equals("\u0015")) { - if (!has14) { - texts = controls; - controls = new ArrayList(); + private int handleSpecialCharacterRuns(Paragraph p, int index, boolean skipStyling, + PicturesSource pictures, XHTMLContentHandler xhtml) throws SAXException, TikaException, IOException { + List controls = new ArrayList(); + List texts = new ArrayList(); + boolean has14 = false; + + // Split it into before and after the 14 + int i; + for(i=index+1; i(); + } + break; + } else { + if(has14) { + texts.add(cr); + } else { + controls.add(cr); + } + } + } + + // Do we need to do something special with this? 
+ if(controls.size() > 0) { + String text = controls.get(0).text(); + for(int j=1; j -1) { + String url = text.substring( + text.indexOf('"') + 1, + text.lastIndexOf('"') + ); + xhtml.startElement("a", "href", url); + for(CharacterRun cr : texts) { + handleCharacterRun(cr, skipStyling, xhtml); + } + xhtml.endElement("a"); + } else { + // Just output the text ones + for(CharacterRun cr : texts) { + if(pictures.hasPicture(cr)) { + Picture picture = pictures.getFor(cr); + handlePictureCharacterRun(cr, picture, pictures, xhtml); + } else { + handleCharacterRun(cr, skipStyling, xhtml); } - break; - } else { - if (has14) { - texts.add(cr); - } else { - controls.add(cr); - } - } - } - - // Do we need to do something special with this? - if (controls.size() > 0) { - String text = controls.get(0).text(); - for (int j = 1; j < controls.size(); j++) { - text += controls.get(j).text(); - } - - if ((text.startsWith("HYPERLINK") || text.startsWith(" HYPERLINK")) - && text.indexOf('"') > -1) { - int start = text.indexOf('"') + 1; - int end = findHyperlinkEnd(text, start); - String url = ""; - if (start >= 0 && start < end && end <= text.length()) { - url = text.substring(start, end); - } - - xhtml.startElement("a", "href", url); - for (CharacterRun cr : texts) { - handleCharacterRun(cr, skipStyling, xhtml); - } - xhtml.endElement("a"); - } else { - // Just output the text ones - for (CharacterRun cr : texts) { - if (pictures.hasPicture(cr)) { - Picture picture = pictures.getFor(cr); - handlePictureCharacterRun(cr, picture, pictures, xhtml); - } else { - handleCharacterRun(cr, skipStyling, xhtml); - } - } - } - } else { - // We only had text - // Output as-is - for (CharacterRun cr : texts) { - handleCharacterRun(cr, skipStyling, xhtml); - } - } - - // Tell them how many to skip over - return i - index; - } - - //temporary work around for TIKA-1512 - private int findHyperlinkEnd(String text, int start) { - int end = text.lastIndexOf('"'); - if (end > start) { - return end; - } - 
end = text.lastIndexOf('\u201D');//smart right double quote - if (end > start) { - return end; - } - end = text.lastIndexOf('\r'); - if (end > start) { - return end; - } - //if nothing so far, take the full length of the string - //If the full string is > 256 characters, it appears - //that the url is truncated in the .doc file. This - //will return the value as it is in the file, which - //may be incorrect; but it is the same behavior as opening - //the link in MSWord. - //This code does not currently check that length is actually >= 256. - //we might want to add that? - return text.length(); - } - - private void handlePictureCharacterRun(CharacterRun cr, Picture picture, PicturesSource pictures, XHTMLContentHandler xhtml) - throws SAXException, IOException, TikaException { - if (!isRendered(cr) || picture == null) { - // Oh dear, we've run out... - // Probably caused by multiple \u0008 images referencing - // the same real image - return; - } - - // Which one is it? - String extension = picture.suggestFileExtension(); - int pictureNumber = pictures.pictureNumber(picture); - - // Make up a name for the picture - // There isn't one in the file, but we need to be able to reference - // the picture from the img tag and the embedded resource - String filename = "image" + pictureNumber + (extension.length() > 0 ? "." + extension : ""); - - // Grab the mime type for the picture - String mimeType = picture.getMimeType(); - - // Output the img tag - AttributesImpl attr = new AttributesImpl(); - attr.addAttribute("", "src", "src", "CDATA", "embedded:" + filename); - attr.addAttribute("", "alt", "alt", "CDATA", filename); - xhtml.startElement("img", attr); - xhtml.endElement("img"); - - // Have we already output this one? 
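`findHyperlinkEnd` (the TIKA-1512 workaround shown above) probes three possible terminators for the quoted HYPERLINK URL before giving up and taking the whole string. Restated as a standalone method, directly following the version in the hunk:

```java
public class HyperlinkEnd {
    // Find where a quoted HYPERLINK field URL ends: try a straight double
    // quote, then a smart right double quote, then a carriage return; if
    // none occurs after `start`, fall back to the full string length
    // (URLs over 256 chars appear truncated in the .doc file anyway, so
    // this matches what MSWord itself would open).
    static int findHyperlinkEnd(String text, int start) {
        int end = text.lastIndexOf('"');
        if (end > start) {
            return end;
        }
        end = text.lastIndexOf('\u201D'); // smart right double quote
        if (end > start) {
            return end;
        }
        end = text.lastIndexOf('\r');
        if (end > start) {
            return end;
        }
        return text.length();
    }
}
```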
- // (Only expose each individual image once) - if (!pictures.hasOutput(picture)) { - TikaInputStream stream = TikaInputStream.get(picture.getContent()); - handleEmbeddedResource(stream, filename, null, mimeType, xhtml, false); - pictures.recordOutput(picture); - } - } - + } + } + } else { + // We only had text + // Output as-is + for(CharacterRun cr : texts) { + handleCharacterRun(cr, skipStyling, xhtml); + } + } + + // Tell them how many to skip over + return i-index; + } + + private void handlePictureCharacterRun(CharacterRun cr, Picture picture, PicturesSource pictures, XHTMLContentHandler xhtml) + throws SAXException, IOException, TikaException { + if(!isRendered(cr) || picture == null) { + // Oh dear, we've run out... + // Probably caused by multiple \u0008 images referencing + // the same real image + return; + } + + // Which one is it? + String extension = picture.suggestFileExtension(); + int pictureNumber = pictures.pictureNumber(picture); + + // Make up a name for the picture + // There isn't one in the file, but we need to be able to reference + // the picture from the img tag and the embedded resource + String filename = "image"+pictureNumber+(extension.length()>0 ? "."+extension : ""); + + // Grab the mime type for the picture + String mimeType = picture.getMimeType(); + + // Output the img tag + AttributesImpl attr = new AttributesImpl(); + attr.addAttribute("", "src", "src", "CDATA", "embedded:" + filename); + attr.addAttribute("", "alt", "alt", "CDATA", filename); + xhtml.startElement("img", attr); + xhtml.endElement("img"); + + // Have we already output this one? + // (Only expose each individual image once) + if(! pictures.hasOutput(picture)) { + TikaInputStream stream = TikaInputStream.get(picture.getContent()); + handleEmbeddedResource(stream, filename, null, mimeType, xhtml, false); + pictures.recordOutput(picture); + } + } + /** * Outputs a section of text if the given text is non-empty. 
* - * @param xhtml XHTML content handler + * @param xhtml XHTML content handler * @param section the class of the <div/> section emitted - * @param text text to be emitted, if any + * @param text text to be emitted, if any * @throws SAXException if an error occurs */ private void addTextIfAny( @@ -584,7 +489,7 @@ xhtml.endElement("div"); } } - + protected void parseWord6( NPOIFSFileSystem filesystem, XHTMLContentHandler xhtml) throws IOException, SAXException, TikaException { @@ -596,116 +501,160 @@ throws IOException, SAXException, TikaException { HWPFOldDocument doc = new HWPFOldDocument(root); Word6Extractor extractor = new Word6Extractor(doc); - - for (String p : extractor.getParagraphText()) { + + for(String p : extractor.getParagraphText()) { xhtml.element("p", p); } } + private static final Map fixedParagraphStyles = new HashMap(); + private static final TagAndStyle defaultParagraphStyle = new TagAndStyle("p", null); + static { + fixedParagraphStyles.put("Default", defaultParagraphStyle); + fixedParagraphStyles.put("Normal", defaultParagraphStyle); + fixedParagraphStyles.put("heading", new TagAndStyle("h1", null)); + fixedParagraphStyles.put("Heading", new TagAndStyle("h1", null)); + fixedParagraphStyles.put("Title", new TagAndStyle("h1", "title")); + fixedParagraphStyles.put("Subtitle", new TagAndStyle("h2", "subtitle")); + fixedParagraphStyles.put("HTML Preformatted", new TagAndStyle("pre", null)); + } + + /** + * Given a style name, return what tag should be used, and + * what style should be applied to it. 
+ */ + public static TagAndStyle buildParagraphTagAndStyle(String styleName, boolean isTable) { + TagAndStyle tagAndStyle = fixedParagraphStyles.get(styleName); + if (tagAndStyle != null) { + return tagAndStyle; + } + + if (styleName.equals("Table Contents") && isTable) { + return defaultParagraphStyle; + } + + String tag = "p"; + String styleClass = null; + + if(styleName.startsWith("heading") || styleName.startsWith("Heading")) { + // "Heading 3" or "Heading2" or "heading 4" + int num = 1; + try { + num = Integer.parseInt( + styleName.substring(styleName.length()-1) + ); + } catch(NumberFormatException e) {} + // Turn it into a H1 - H6 (H7+ isn't valid!) + tag = "h" + Math.min(num, 6); + } else { + styleClass = styleName.replace(' ', '_'); + styleClass = styleClass.substring(0,1).toLowerCase() + + styleClass.substring(1); + } + + return new TagAndStyle(tag,styleClass); + } + + public static class TagAndStyle { + private String tag; + private String styleClass; + public TagAndStyle(String tag, String styleClass) { + this.tag = tag; + this.styleClass = styleClass; + } + public String getTag() { + return tag; + } + public String getStyleClass() { + return styleClass; + } + public boolean isHeading() { + return tag.length()==2 && tag.startsWith("h"); + } + } + /** * Determines if character run should be included in the extraction. - * + * * @param cr character run. * @return true if character run should be included in extraction. 
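`buildParagraphTagAndStyle` maps Word heading style names onto HTML heading tags by parsing the trailing digit and clamping at `h6`. The heading branch in isolation (the helper name `headingTag` is invented for illustration; the full method also handles the fixed style table and the class-name fallback):

```java
public class HeadingStyles {
    // Heading branch of buildParagraphTagAndStyle: "Heading 3", "Heading2"
    // or "heading 4" all take their level from the trailing digit, clamped
    // to h6 since h7+ isn't valid HTML. Styles with no trailing digit fall
    // back to h1. (Only single-digit levels are read, as in the original.)
    static String headingTag(String styleName) {
        int num = 1;
        try {
            num = Integer.parseInt(styleName.substring(styleName.length() - 1));
        } catch (NumberFormatException e) {
            // no trailing digit: keep the h1 default
        }
        return "h" + Math.min(num, 6);
    }
}
```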
*/ private boolean isRendered(final CharacterRun cr) { - return cr == null || !cr.isMarkedDeleted(); - } - - public static class TagAndStyle { - private String tag; - private String styleClass; - - public TagAndStyle(String tag, String styleClass) { - this.tag = tag; - this.styleClass = styleClass; - } - - public String getTag() { - return tag; - } - - public String getStyleClass() { - return styleClass; - } - - public boolean isHeading() { - return tag.length() == 2 && tag.startsWith("h"); - } - } - + return cr == null || !cr.isMarkedDeleted(); + } + + /** * Provides access to the pictures both by offset, iteration - * over the un-claimed, and peeking forward + * over the un-claimed, and peeking forward */ private static class PicturesSource { - private PicturesTable picturesTable; - private Set output = new HashSet(); - private Map lookup; - private List nonU1based; - private List all; - private int pn = 0; - - private PicturesSource(HWPFDocument doc) { - picturesTable = doc.getPicturesTable(); - all = picturesTable.getAllPictures(); - - // Build the Offset-Picture lookup map - lookup = new HashMap(); - for (Picture p : all) { - lookup.put(p.getStartOffset(), p); - } - - // Work out which Pictures aren't referenced by - // a \u0001 in the main text - // These are \u0008 escher floating ones, ones - // found outside the normal text, and who - // knows what else... 
- nonU1based = new ArrayList(); - nonU1based.addAll(all); - Range r = doc.getRange(); - for (int i = 0; i < r.numCharacterRuns(); i++) { - CharacterRun cr = r.getCharacterRun(i); - if (picturesTable.hasPicture(cr)) { - Picture p = getFor(cr); - int at = nonU1based.indexOf(p); - nonU1based.set(at, null); - } - } - } - - private boolean hasPicture(CharacterRun cr) { - return picturesTable.hasPicture(cr); - } - - private void recordOutput(Picture picture) { - output.add(picture); - } - - private boolean hasOutput(Picture picture) { - return output.contains(picture); - } - - private int pictureNumber(Picture picture) { - return all.indexOf(picture) + 1; - } - - private Picture getFor(CharacterRun cr) { - return lookup.get(cr.getPicOffset()); - } - - /** - * Return the next unclaimed one, used towards - * the end - */ - private Picture nextUnclaimed() { - Picture p = null; - while (pn < nonU1based.size()) { - p = nonU1based.get(pn); - pn++; - if (p != null) return p; - } - return null; - } + private PicturesTable picturesTable; + private Set output = new HashSet(); + private Map lookup; + private List nonU1based; + private List all; + private int pn = 0; + + private PicturesSource(HWPFDocument doc) { + picturesTable = doc.getPicturesTable(); + all = picturesTable.getAllPictures(); + + // Build the Offset-Picture lookup map + lookup = new HashMap(); + for(Picture p : all) { + lookup.put(p.getStartOffset(), p); + } + + // Work out which Pictures aren't referenced by + // a \u0001 in the main text + // These are \u0008 escher floating ones, ones + // found outside the normal text, and who + // knows what else... + nonU1based = new ArrayList(); + nonU1based.addAll(all); + Range r = doc.getRange(); + for(int i=0; i + * * Tika extractors decorate POI extractors so that the parsed content of * documents is returned as a sequence of XHTML SAX events. 
Subclasses must * implement the buildXHTML method {@link #buildXHTML(XHTMLContentHandler)} that @@ -67,15 +60,17 @@ private static final String TYPE_OLE_OBJECT = "application/vnd.openxmlformats-officedocument.oleObject"; + + protected POIXMLTextExtractor extractor; + private final EmbeddedDocumentExtractor embeddedExtractor; - protected POIXMLTextExtractor extractor; public AbstractOOXMLExtractor(ParseContext context, POIXMLTextExtractor extractor) { this.extractor = extractor; EmbeddedDocumentExtractor ex = context.get(EmbeddedDocumentExtractor.class); - if (ex == null) { + if (ex==null) { embeddedExtractor = new ParsingEmbeddedDocumentExtractor(context); } else { embeddedExtractor = ex; @@ -99,7 +94,7 @@ /** * @see org.apache.tika.parser.microsoft.ooxml.OOXMLExtractor#getXHTML(org.xml.sax.ContentHandler, - * org.apache.tika.metadata.Metadata) + * org.apache.tika.metadata.Metadata) */ public void getXHTML( ContentHandler handler, Metadata metadata, ParseContext context) @@ -112,54 +107,20 @@ // Now do any embedded parts handleEmbeddedParts(handler); - // thumbnail - handleThumbnail(handler); - xhtml.endDocument(); } - + protected String getJustFileName(String desc) { - int idx = desc.lastIndexOf('/'); - if (idx != -1) { - desc = desc.substring(idx + 1); - } - idx = desc.lastIndexOf('.'); - if (idx != -1) { - desc = desc.substring(0, idx); - } - - return desc; - } - - private void handleThumbnail(ContentHandler handler) { - try { - OPCPackage opcPackage = extractor.getPackage(); - for (PackageRelationship rel : opcPackage.getRelationshipsByType(PackageRelationshipTypes.THUMBNAIL)) { - PackagePart tPart = opcPackage.getPart(rel); - InputStream tStream = tPart.getInputStream(); - Metadata thumbnailMetadata = new Metadata(); - String thumbName = tPart.getPartName().getName(); - thumbnailMetadata.set(Metadata.RESOURCE_NAME_KEY, thumbName); - - AttributesImpl attributes = new AttributesImpl(); - attributes.addAttribute(XHTML, "class", "class", "CDATA", "embedded"); - 
attributes.addAttribute(XHTML, "id", "id", "CDATA", thumbName); - handler.startElement(XHTML, "div", "div", attributes); - handler.endElement(XHTML, "div", "div"); - - thumbnailMetadata.set(Metadata.EMBEDDED_RELATIONSHIP_ID, thumbName); - thumbnailMetadata.set(Metadata.CONTENT_TYPE, tPart.getContentType()); - thumbnailMetadata.set(TikaCoreProperties.TITLE, tPart.getPartName().getName()); - - if (embeddedExtractor.shouldParseEmbedded(thumbnailMetadata)) { - embeddedExtractor.parseEmbedded(TikaInputStream.get(tStream), new EmbeddedContentHandler(handler), thumbnailMetadata, false); - } - - tStream.close(); - } - } catch (Exception ex) { - - } + int idx = desc.lastIndexOf('/'); + if (idx != -1) { + desc = desc.substring(idx+1); + } + idx = desc.lastIndexOf('.'); + if (idx != -1) { + desc = desc.substring(0, idx); + } + + return desc; } private void handleEmbeddedParts(ContentHandler handler) @@ -173,9 +134,9 @@ if (sourceURI != null) { sourceDesc = getJustFileName(sourceURI.getPath()); if (sourceDesc.startsWith("slide")) { - sourceDesc += "_"; + sourceDesc += "_"; } else { - sourceDesc = ""; + sourceDesc = ""; } } else { sourceDesc = ""; @@ -213,11 +174,12 @@ private void handleEmbeddedOLE(PackagePart part, ContentHandler handler, String rel) throws IOException, SAXException { // A POIFSFileSystem needs to be at least 3 blocks big to be valid - if (part.getSize() >= 0 && part.getSize() < 512 * 3) { - // Too small, skip - return; - } - + // TODO: TIKA-1118 Upgrade to POI 4.0 then enable this block of code +// if (part.getSize() >= 0 && part.getSize() < 512*3) { +// // Too small, skip +// return; +// } + // Open the POIFS (OLE2) structure and process POIFSFileSystem fs = new POIFSFileSystem(part.getInputStream()); try { @@ -227,26 +189,24 @@ DirectoryNode root = fs.getRoot(); POIFSDocumentType type = POIFSDocumentType.detectType(root); - + if (root.hasEntry("CONTENTS") - && root.hasEntry("\u0001Ole") - && root.hasEntry("\u0001CompObj") - && 
root.hasEntry("\u0003ObjInfo")) { - // TIKA-704: OLE 2.0 embedded non-Office document? - stream = TikaInputStream.get( - fs.createDocumentInputStream("CONTENTS")); - if (embeddedExtractor.shouldParseEmbedded(metadata)) { - embeddedExtractor.parseEmbedded( - stream, new EmbeddedContentHandler(handler), - metadata, false); - } + && root.hasEntry("\u0001Ole") + && root.hasEntry("\u0001CompObj") + && root.hasEntry("\u0003ObjInfo")) { + // TIKA-704: OLE 2.0 embedded non-Office document? + stream = TikaInputStream.get( + fs.createDocumentInputStream("CONTENTS")); + if (embeddedExtractor.shouldParseEmbedded(metadata)) { + embeddedExtractor.parseEmbedded( + stream, new EmbeddedContentHandler(handler), + metadata, false); + } } else if (POIFSDocumentType.OLE10_NATIVE == type) { // TIKA-704: OLE 1.0 embedded document Ole10Native ole = Ole10Native.createFromEmbeddedOleObject(fs); - if (ole.getLabel() != null) { - metadata.set(Metadata.RESOURCE_NAME_KEY, ole.getLabel()); - } + metadata.set(Metadata.RESOURCE_NAME_KEY, ole.getLabel()); byte[] data = ole.getDataBuffer(); if (data != null) { stream = TikaInputStream.get(data); @@ -300,12 +260,12 @@ */ protected abstract void buildXHTML(XHTMLContentHandler xhtml) throws SAXException, XmlException, IOException; - + /** * Return a list of the main parts of the document, used - * when searching for embedded resources. + * when searching for embedded resources. * This should be all the parts of the document that end - * up with things embedded into them. + * up with things embedded into them. 
*/ protected abstract List getMainDocumentParts() throws TikaException; diff --git a/tika-parsers/src/main/java/org/apache/tika/parser/microsoft/ooxml/MetadataExtractor.java b/tika-parsers/src/main/java/org/apache/tika/parser/microsoft/ooxml/MetadataExtractor.java index 25d3596..e7344bb 100644 --- a/tika-parsers/src/main/java/org/apache/tika/parser/microsoft/ooxml/MetadataExtractor.java +++ b/tika-parsers/src/main/java/org/apache/tika/parser/microsoft/ooxml/MetadataExtractor.java @@ -19,10 +19,10 @@ import java.math.BigDecimal; import java.util.Date; +import org.apache.poi.POIXMLTextExtractor; import org.apache.poi.POIXMLProperties.CoreProperties; import org.apache.poi.POIXMLProperties.CustomProperties; import org.apache.poi.POIXMLProperties.ExtendedProperties; -import org.apache.poi.POIXMLTextExtractor; import org.apache.poi.openxml4j.opc.internal.PackagePropertiesPart; import org.apache.poi.openxml4j.util.Nullable; import org.apache.poi.xssf.extractor.XSSFEventBasedExcelExtractor; @@ -35,15 +35,14 @@ import org.apache.tika.metadata.PagedText; import org.apache.tika.metadata.Property; import org.apache.tika.metadata.TikaCoreProperties; -import org.apache.tika.parser.microsoft.SummaryExtractor; import org.openxmlformats.schemas.officeDocument.x2006.customProperties.CTProperty; import org.openxmlformats.schemas.officeDocument.x2006.extendedProperties.CTProperties; /** * OOXML metadata extractor. - *

    + * * Currently POI doesn't support metadata extraction for OOXML. - * + * * @see OOXMLExtractor#getMetadataExtractor() */ public class MetadataExtractor { @@ -56,8 +55,8 @@ public void extract(Metadata metadata) throws TikaException { if (extractor.getDocument() != null || - (extractor instanceof XSSFEventBasedExcelExtractor && - extractor.getPackage() != null)) { + (extractor instanceof XSSFEventBasedExcelExtractor && + extractor.getPackage() != null)) { extractMetadata(extractor.getCoreProperties(), metadata); extractMetadata(extractor.getExtendedProperties(), metadata); extractMetadata(extractor.getCustomProperties(), metadata); @@ -73,7 +72,7 @@ .getContentStatusProperty()); addProperty(metadata, TikaCoreProperties.CREATED, propsHolder .getCreatedProperty()); - addMultiProperty(metadata, TikaCoreProperties.CREATOR, propsHolder + addProperty(metadata, TikaCoreProperties.CREATOR, propsHolder .getCreatorProperty()); addProperty(metadata, TikaCoreProperties.DESCRIPTION, propsHolder .getDescriptionProperty()); @@ -90,15 +89,15 @@ addProperty(metadata, Metadata.LAST_MODIFIED, propsHolder .getModifiedProperty()); addProperty(metadata, TikaCoreProperties.MODIFIED, propsHolder - .getModifiedProperty()); + .getModifiedProperty()); addProperty(metadata, OfficeOpenXMLCore.REVISION, propsHolder .getRevisionProperty()); // TODO: Move to OO subject in Tika 2.0 - addProperty(metadata, TikaCoreProperties.TRANSITION_SUBJECT_TO_OO_SUBJECT, + addProperty(metadata, TikaCoreProperties.TRANSITION_SUBJECT_TO_OO_SUBJECT, propsHolder.getSubjectProperty()); addProperty(metadata, TikaCoreProperties.TITLE, propsHolder.getTitleProperty()); addProperty(metadata, OfficeOpenXMLCore.VERSION, propsHolder.getVersionProperty()); - + // Legacy Tika-1.0 style stats // TODO Remove these in Tika 2.0 addProperty(metadata, Metadata.CATEGORY, propsHolder.getCategoryProperty()); @@ -110,23 +109,23 @@ } private void extractMetadata(ExtendedProperties properties, - Metadata metadata) { + Metadata 
metadata) { CTProperties propsHolder = properties.getUnderlyingProperties(); addProperty(metadata, OfficeOpenXMLExtended.APPLICATION, propsHolder.getApplication()); addProperty(metadata, OfficeOpenXMLExtended.APP_VERSION, propsHolder.getAppVersion()); addProperty(metadata, TikaCoreProperties.PUBLISHER, propsHolder.getCompany()); addProperty(metadata, OfficeOpenXMLExtended.COMPANY, propsHolder.getCompany()); - SummaryExtractor.addMulti(metadata, OfficeOpenXMLExtended.MANAGER, propsHolder.getManager()); + addProperty(metadata, OfficeOpenXMLExtended.MANAGER, propsHolder.getManager()); addProperty(metadata, OfficeOpenXMLExtended.NOTES, propsHolder.getNotes()); addProperty(metadata, OfficeOpenXMLExtended.PRESENTATION_FORMAT, propsHolder.getPresentationFormat()); addProperty(metadata, OfficeOpenXMLExtended.TEMPLATE, propsHolder.getTemplate()); addProperty(metadata, OfficeOpenXMLExtended.TOTAL_TIME, propsHolder.getTotalTime()); if (propsHolder.getPages() > 0) { - metadata.set(PagedText.N_PAGES, propsHolder.getPages()); + metadata.set(PagedText.N_PAGES, propsHolder.getPages()); } else if (propsHolder.getSlides() > 0) { - metadata.set(PagedText.N_PAGES, propsHolder.getSlides()); + metadata.set(PagedText.N_PAGES, propsHolder.getSlides()); } // Process the document statistics @@ -137,7 +136,7 @@ addProperty(metadata, Office.WORD_COUNT, propsHolder.getWords()); addProperty(metadata, Office.CHARACTER_COUNT, propsHolder.getCharacters()); addProperty(metadata, Office.CHARACTER_COUNT_WITH_SPACES, propsHolder.getCharactersWithSpaces()); - + // Legacy Tika-1.0 style stats // TODO Remove these in Tika 2.0 addProperty(metadata, Metadata.APPLICATION_NAME, propsHolder.getApplication()); @@ -157,89 +156,113 @@ } private void extractMetadata(CustomProperties properties, - Metadata metadata) { - org.openxmlformats.schemas.officeDocument.x2006.customProperties.CTProperties - props = properties.getUnderlyingProperties(); - for (int i = 0; i < props.sizeOfPropertyArray(); i++) { - CTProperty 
property = props.getPropertyArray(i); - String val = null; - Date date = null; - - if (property.isSetLpwstr()) { - val = property.getLpwstr(); - } else if (property.isSetLpstr()) { - val = property.getLpstr(); - } else if (property.isSetDate()) { - date = property.getDate().getTime(); - } else if (property.isSetFiletime()) { - date = property.getFiletime().getTime(); - } else if (property.isSetBool()) { - val = Boolean.toString(property.getBool()); - } - - // Integers - else if (property.isSetI1()) { - val = Integer.toString(property.getI1()); - } else if (property.isSetI2()) { - val = Integer.toString(property.getI2()); - } else if (property.isSetI4()) { - val = Integer.toString(property.getI4()); - } else if (property.isSetI8()) { - val = Long.toString(property.getI8()); - } else if (property.isSetInt()) { - val = Integer.toString(property.getInt()); - } - - // Unsigned Integers - else if (property.isSetUi1()) { - val = Integer.toString(property.getUi1()); - } else if (property.isSetUi2()) { - val = Integer.toString(property.getUi2()); - } else if (property.isSetUi4()) { - val = Long.toString(property.getUi4()); - } else if (property.isSetUi8()) { - val = property.getUi8().toString(); - } else if (property.isSetUint()) { - val = Long.toString(property.getUint()); - } - - // Reals - else if (property.isSetR4()) { - val = Float.toString(property.getR4()); - } else if (property.isSetR8()) { - val = Double.toString(property.getR8()); - } else if (property.isSetDecimal()) { - BigDecimal d = property.getDecimal(); - if (d == null) { - val = null; - } else { - val = d.toPlainString(); - } - } else if (property.isSetArray()) { - // TODO Fetch the array values and output - } else if (property.isSetVector()) { - // TODO Fetch the vector values and output - } else if (property.isSetBlob() || property.isSetOblob()) { - // TODO Decode, if possible - } else if (property.isSetStream() || property.isSetOstream() || - property.isSetVstream()) { - // TODO Decode, if possible - } 
else if (property.isSetStorage() || property.isSetOstorage()) { - // TODO Decode, if possible - } else { - // This type isn't currently supported yet, skip the property - } - - String propName = "custom:" + property.getName(); - if (date != null) { - Property tikaProp = Property.externalDate(propName); - metadata.set(tikaProp, date); - } else if (val != null) { - metadata.set(propName, val); - } - } - } - + Metadata metadata) { + org.openxmlformats.schemas.officeDocument.x2006.customProperties.CTProperties + props = properties.getUnderlyingProperties(); + + for(CTProperty property : props.getPropertyList()) { + String val = null; + Date date = null; + + if (property.isSetLpwstr()) { + val = property.getLpwstr(); + } + else if (property.isSetLpstr()) { + val = property.getLpstr(); + } + else if (property.isSetDate()) { + date = property.getDate().getTime(); + } + else if (property.isSetFiletime()) { + date = property.getFiletime().getTime(); + } + + else if (property.isSetBool()) { + val = Boolean.toString( property.getBool() ); + } + + // Integers + else if (property.isSetI1()) { + val = Integer.toString(property.getI1()); + } + else if (property.isSetI2()) { + val = Integer.toString(property.getI2()); + } + else if (property.isSetI4()) { + val = Integer.toString(property.getI4()); + } + else if (property.isSetI8()) { + val = Long.toString(property.getI8()); + } + else if (property.isSetInt()) { + val = Integer.toString( property.getInt() ); + } + + // Unsigned Integers + else if (property.isSetUi1()) { + val = Integer.toString(property.getUi1()); + } + else if (property.isSetUi2()) { + val = Integer.toString(property.getUi2()); + } + else if (property.isSetUi4()) { + val = Long.toString(property.getUi4()); + } + else if (property.isSetUi8()) { + val = property.getUi8().toString(); + } + else if (property.isSetUint()) { + val = Long.toString(property.getUint()); + } + + // Reals + else if (property.isSetR4()) { + val = Float.toString( property.getR4() ); + } + else 
if (property.isSetR8()) { + val = Double.toString( property.getR8() ); + } + else if (property.isSetDecimal()) { + BigDecimal d = property.getDecimal(); + if (d == null) { + val = null; + } else { + val = d.toPlainString(); + } + } + + else if (property.isSetArray()) { + // TODO Fetch the array values and output + } + else if (property.isSetVector()) { + // TODO Fetch the vector values and output + } + + else if (property.isSetBlob() || property.isSetOblob()) { + // TODO Decode, if possible + } + else if (property.isSetStream() || property.isSetOstream() || + property.isSetVstream()) { + // TODO Decode, if possible + } + else if (property.isSetStorage() || property.isSetOstorage()) { + // TODO Decode, if possible + } + + else { + // This type isn't currently supported yet, skip the property + } + + String propName = "custom:" + property.getName(); + if (date != null) { + Property tikaProp = Property.externalDate(propName); + metadata.set(tikaProp, date); + } else if (val != null) { + metadata.set(propName, val); + } + } + } + private void addProperty(Metadata metadata, Property property, Nullable nullableValue) { T value = nullableValue.getValue(); if (value != null) { @@ -260,7 +283,7 @@ addProperty(metadata, name, value.getValue().toString()); } } - + private void addProperty(Metadata metadata, Property property, String value) { if (value != null) { metadata.set(property, value); @@ -274,22 +297,14 @@ } private void addProperty(Metadata metadata, Property property, int value) { - if (value > 0) { - metadata.set(property, value); - } - } - + if (value > 0) { + metadata.set(property, value); + } + } + private void addProperty(Metadata metadata, String name, int value) { if (value > 0) { metadata.set(name, Integer.toString(value)); } } - - private void addMultiProperty(Metadata metadata, Property property, Nullable value) { - if (value == null) { - return; - } - SummaryExtractor.addMulti(metadata, property, value.getValue()); - } - } diff --git 
a/tika-parsers/src/main/java/org/apache/tika/parser/microsoft/ooxml/OOXMLExtractor.java b/tika-parsers/src/main/java/org/apache/tika/parser/microsoft/ooxml/OOXMLExtractor.java index f52e52d..bb46a21 100644 --- a/tika-parsers/src/main/java/org/apache/tika/parser/microsoft/ooxml/OOXMLExtractor.java +++ b/tika-parsers/src/main/java/org/apache/tika/parser/microsoft/ooxml/OOXMLExtractor.java @@ -29,14 +29,14 @@ /** * Interface implemented by all Tika OOXML extractors. - * + * * @see org.apache.poi.POIXMLTextExtractor */ public interface OOXMLExtractor { /** * Returns the opened document. - * + * * @see POIXMLTextExtractor#getDocument() */ POIXMLDocument getDocument(); diff --git a/tika-parsers/src/main/java/org/apache/tika/parser/microsoft/ooxml/OOXMLExtractorFactory.java b/tika-parsers/src/main/java/org/apache/tika/parser/microsoft/ooxml/OOXMLExtractorFactory.java index e2c7717..da0df28 100644 --- a/tika-parsers/src/main/java/org/apache/tika/parser/microsoft/ooxml/OOXMLExtractorFactory.java +++ b/tika-parsers/src/main/java/org/apache/tika/parser/microsoft/ooxml/OOXMLExtractorFactory.java @@ -20,7 +20,6 @@ import java.io.InputStream; import java.util.Locale; -import org.apache.commons.io.input.CloseShieldInputStream; import org.apache.poi.POIXMLDocument; import org.apache.poi.POIXMLTextExtractor; import org.apache.poi.extractor.ExtractorFactory; @@ -34,6 +33,7 @@ import org.apache.poi.xwpf.extractor.XWPFWordExtractor; import org.apache.poi.xwpf.usermodel.XWPFDocument; import org.apache.tika.exception.TikaException; +import org.apache.tika.io.CloseShieldInputStream; import org.apache.tika.io.TikaInputStream; import org.apache.tika.metadata.Metadata; import org.apache.tika.mime.MediaType; @@ -56,7 +56,7 @@ throws IOException, SAXException, TikaException { Locale locale = context.get(Locale.class, Locale.getDefault()); ExtractorFactory.setThreadPrefersEventExtractors(true); - + try { OOXMLExtractor extractor; OPCPackage pkg; @@ -66,34 +66,34 @@ if (tis != null && 
tis.getOpenContainer() instanceof OPCPackage) { pkg = (OPCPackage) tis.getOpenContainer(); } else if (tis != null && tis.hasFile()) { - pkg = OPCPackage.open(tis.getFile().getPath(), PackageAccess.READ); + pkg = OPCPackage.open( tis.getFile().getPath(), PackageAccess.READ ); tis.setOpenContainer(pkg); } else { InputStream shield = new CloseShieldInputStream(stream); - pkg = OPCPackage.open(shield); + pkg = OPCPackage.open(shield); } - + // Get the type, and ensure it's one we handle MediaType type = ZipContainerDetector.detectOfficeOpenXML(pkg); if (type == null || OOXMLParser.UNSUPPORTED_OOXML_TYPES.contains(type)) { - // Not a supported type, delegate to Empty Parser - EmptyParser.INSTANCE.parse(stream, baseHandler, metadata, context); - return; + // Not a supported type, delegate to Empty Parser + EmptyParser.INSTANCE.parse(stream, baseHandler, metadata, context); + return; } metadata.set(Metadata.CONTENT_TYPE, type.toString()); // Have the appropriate OOXML text extractor picked POIXMLTextExtractor poiExtractor = ExtractorFactory.createExtractor(pkg); - + POIXMLDocument document = poiExtractor.getDocument(); if (poiExtractor instanceof XSSFEventBasedExcelExtractor) { - extractor = new XSSFExcelExtractorDecorator( - context, (XSSFEventBasedExcelExtractor) poiExtractor, locale); + extractor = new XSSFExcelExtractorDecorator( + context, (XSSFEventBasedExcelExtractor)poiExtractor, locale); } else if (document == null) { - throw new TikaException( - "Expecting UserModel based POI OOXML extractor with a document, but none found. " + - "The extractor returned was a " + poiExtractor - ); + throw new TikaException( + "Expecting UserModel based POI OOXML extractor with a document, but none found. 
" + + "The extractor returned was a " + poiExtractor + ); } else if (document instanceof XMLSlideShow) { extractor = new XSLFPowerPointExtractorDecorator( context, (XSLFPowerPointExtractor) poiExtractor); @@ -103,19 +103,18 @@ } else { extractor = new POIXMLTextExtractorDecorator(context, poiExtractor); } - + // Get the bulk of the metadata first, so that it's accessible during // parsing if desired by the client (see TIKA-1109) extractor.getMetadataExtractor().extract(metadata); - + // Extract the text, along with any in-document metadata extractor.getXHTML(baseHandler, metadata, context); } catch (IllegalArgumentException e) { - if (e.getMessage() != null && - e.getMessage().startsWith("No supported documents found")) { + if (e.getMessage().startsWith("No supported documents found")) { throw new TikaException( "TIKA-418: RuntimeException while getting content" - + " for thmx and xps file types", e); + + " for thmx and xps file types", e); } else { throw new TikaException("Error creating OOXML extractor", e); } diff --git a/tika-parsers/src/main/java/org/apache/tika/parser/microsoft/ooxml/OOXMLParser.java b/tika-parsers/src/main/java/org/apache/tika/parser/microsoft/ooxml/OOXMLParser.java index 22f2cac..2acfd16 100644 --- a/tika-parsers/src/main/java/org/apache/tika/parser/microsoft/ooxml/OOXMLParser.java +++ b/tika-parsers/src/main/java/org/apache/tika/parser/microsoft/ooxml/OOXMLParser.java @@ -23,7 +23,6 @@ import java.util.HashSet; import java.util.Set; -import org.apache.poi.openxml4j.util.ZipSecureFile; import org.apache.tika.exception.TikaException; import org.apache.tika.metadata.Metadata; import org.apache.tika.mime.MediaType; @@ -36,44 +35,40 @@ * Office Open XML (OOXML) parser. 
*/ public class OOXMLParser extends AbstractParser { - static { - //turn off POI's zip bomb detection because we have our own - ZipSecureFile.setMinInflateRatio(-1.0d); - } + /** Serial version UID */ + private static final long serialVersionUID = 6535995710857776481L; + protected static final Set SUPPORTED_TYPES = - Collections.unmodifiableSet(new HashSet(Arrays.asList( - MediaType.application("x-tika-ooxml"), - MediaType.application("vnd.openxmlformats-officedocument.presentationml.presentation"), - MediaType.application("vnd.ms-powerpoint.presentation.macroenabled.12"), - MediaType.application("vnd.openxmlformats-officedocument.presentationml.template"), - MediaType.application("vnd.openxmlformats-officedocument.presentationml.slideshow"), - MediaType.application("vnd.ms-powerpoint.slideshow.macroenabled.12"), - MediaType.application("vnd.ms-powerpoint.addin.macroenabled.12"), - MediaType.application("vnd.openxmlformats-officedocument.spreadsheetml.sheet"), - MediaType.application("vnd.ms-excel.sheet.macroenabled.12"), - MediaType.application("vnd.openxmlformats-officedocument.spreadsheetml.template"), - MediaType.application("vnd.ms-excel.template.macroenabled.12"), - MediaType.application("vnd.ms-excel.addin.macroenabled.12"), - MediaType.application("vnd.openxmlformats-officedocument.wordprocessingml.document"), - MediaType.application("vnd.ms-word.document.macroenabled.12"), - MediaType.application("vnd.openxmlformats-officedocument.wordprocessingml.template"), - MediaType.application("vnd.ms-word.template.macroenabled.12")))); + Collections.unmodifiableSet(new HashSet(Arrays.asList( + MediaType.application("x-tika-ooxml"), + MediaType.application("vnd.openxmlformats-officedocument.presentationml.presentation"), + MediaType.application("vnd.ms-powerpoint.presentation.macroenabled.12"), + MediaType.application("vnd.openxmlformats-officedocument.presentationml.template"), + MediaType.application("vnd.openxmlformats-officedocument.presentationml.slideshow"), + 
MediaType.application("vnd.ms-powerpoint.slideshow.macroenabled.12"), + MediaType.application("vnd.ms-powerpoint.addin.macroenabled.12"), + MediaType.application("vnd.openxmlformats-officedocument.spreadsheetml.sheet"), + MediaType.application("vnd.ms-excel.sheet.macroenabled.12"), + MediaType.application("vnd.openxmlformats-officedocument.spreadsheetml.template"), + MediaType.application("vnd.ms-excel.template.macroenabled.12"), + MediaType.application("vnd.ms-excel.addin.macroenabled.12"), + MediaType.application("vnd.openxmlformats-officedocument.wordprocessingml.document"), + MediaType.application("vnd.ms-word.document.macroenabled.12"), + MediaType.application("vnd.openxmlformats-officedocument.wordprocessingml.template"), + MediaType.application("vnd.ms-word.template.macroenabled.12")))); + /** * We claim to support all OOXML files, but we actually don't support a small - * number of them. + * number of them. * This list is used to decline certain formats that are not yet supported - * by Tika and/or POI. + * by Tika and/or POI. 
*/ - protected static final Set UNSUPPORTED_OOXML_TYPES = - Collections.unmodifiableSet(new HashSet(Arrays.asList( - MediaType.application("vnd.ms-excel.sheet.binary.macroenabled.12"), - MediaType.application("vnd.ms-xpsdocument") - ))); - /** - * Serial version UID - */ - private static final long serialVersionUID = 6535995710857776481L; + protected static final Set UNSUPPORTED_OOXML_TYPES = + Collections.unmodifiableSet(new HashSet(Arrays.asList( + MediaType.application("vnd.ms-excel.sheet.binary.macroenabled.12"), + MediaType.application("vnd.ms-xpsdocument") + ))); public Set getSupportedTypes(ParseContext context) { return SUPPORTED_TYPES; diff --git a/tika-parsers/src/main/java/org/apache/tika/parser/microsoft/ooxml/POIXMLTextExtractorDecorator.java b/tika-parsers/src/main/java/org/apache/tika/parser/microsoft/ooxml/POIXMLTextExtractorDecorator.java index ff44176..375adf5 100644 --- a/tika-parsers/src/main/java/org/apache/tika/parser/microsoft/ooxml/POIXMLTextExtractorDecorator.java +++ b/tika-parsers/src/main/java/org/apache/tika/parser/microsoft/ooxml/POIXMLTextExtractorDecorator.java @@ -39,6 +39,6 @@ @Override protected List getMainDocumentParts() { - return new ArrayList(); + return new ArrayList(); } } diff --git a/tika-parsers/src/main/java/org/apache/tika/parser/microsoft/ooxml/XSLFPowerPointExtractorDecorator.java b/tika-parsers/src/main/java/org/apache/tika/parser/microsoft/ooxml/XSLFPowerPointExtractorDecorator.java index d55a417..19bbe9c 100644 --- a/tika-parsers/src/main/java/org/apache/tika/parser/microsoft/ooxml/XSLFPowerPointExtractorDecorator.java +++ b/tika-parsers/src/main/java/org/apache/tika/parser/microsoft/ooxml/XSLFPowerPointExtractorDecorator.java @@ -16,10 +16,10 @@ */ package org.apache.tika.parser.microsoft.ooxml; -import javax.xml.namespace.QName; import java.io.IOException; import java.util.ArrayList; import java.util.List; +import javax.xml.namespace.QName; import org.apache.poi.openxml4j.exceptions.InvalidFormatException; 
import org.apache.poi.openxml4j.opc.PackagePart; @@ -31,22 +31,17 @@ import org.apache.poi.xslf.extractor.XSLFPowerPointExtractor; import org.apache.poi.xslf.usermodel.Placeholder; import org.apache.poi.xslf.usermodel.XMLSlideShow; -import org.apache.poi.xslf.usermodel.XSLFCommentAuthors; import org.apache.poi.xslf.usermodel.XSLFComments; import org.apache.poi.xslf.usermodel.XSLFGraphicFrame; import org.apache.poi.xslf.usermodel.XSLFGroupShape; -import org.apache.poi.xslf.usermodel.XSLFNotes; -import org.apache.poi.xslf.usermodel.XSLFNotesMaster; import org.apache.poi.xslf.usermodel.XSLFPictureShape; import org.apache.poi.xslf.usermodel.XSLFRelation; import org.apache.poi.xslf.usermodel.XSLFShape; import org.apache.poi.xslf.usermodel.XSLFSheet; import org.apache.poi.xslf.usermodel.XSLFSlide; -import org.apache.poi.xslf.usermodel.XSLFSlideLayout; import org.apache.poi.xslf.usermodel.XSLFTable; import org.apache.poi.xslf.usermodel.XSLFTableCell; import org.apache.poi.xslf.usermodel.XSLFTableRow; -import org.apache.poi.xslf.usermodel.XSLFTextParagraph; import org.apache.poi.xslf.usermodel.XSLFTextShape; import org.apache.tika.exception.TikaException; import org.apache.tika.parser.ParseContext; @@ -54,9 +49,7 @@ import org.apache.xmlbeans.XmlException; import org.apache.xmlbeans.XmlObject; import org.openxmlformats.schemas.presentationml.x2006.main.CTComment; -import org.openxmlformats.schemas.presentationml.x2006.main.CTCommentAuthor; import org.openxmlformats.schemas.presentationml.x2006.main.CTPicture; -import org.openxmlformats.schemas.presentationml.x2006.main.CTSlideIdList; import org.openxmlformats.schemas.presentationml.x2006.main.CTSlideIdListEntry; import org.xml.sax.SAXException; import org.xml.sax.helpers.AttributesImpl; @@ -71,82 +64,49 @@ */ protected void buildXHTML(XHTMLContentHandler xhtml) throws SAXException, IOException { XMLSlideShow slideShow = (XMLSlideShow) extractor.getDocument(); - XSLFCommentAuthors commentAuthors = 
slideShow.getCommentAuthors(); - - List slides = slideShow.getSlides(); + + XSLFSlide[] slides = slideShow.getSlides(); for (XSLFSlide slide : slides) { String slideDesc; if (slide.getPackagePart() != null && slide.getPackagePart().getPartName() != null) { - slideDesc = getJustFileName(slide.getPackagePart().getPartName().toString()); - slideDesc += "_"; + slideDesc = getJustFileName(slide.getPackagePart().getPartName().toString()); + slideDesc += "_"; } else { - slideDesc = null; - } - - // slide content - xhtml.startElement("div", "class", "slide-content"); + slideDesc = null; + } + + // slide extractContent(slide.getShapes(), false, xhtml, slideDesc); - xhtml.endElement("div"); // slide layout which is the master sheet for this slide - xhtml.startElement("div", "class", "slide-master-content"); - XSLFSlideLayout slideLayout = slide.getMasterSheet(); + XSLFSheet slideLayout = slide.getMasterSheet(); extractContent(slideLayout.getShapes(), true, xhtml, null); - xhtml.endElement("div"); // slide master which is the master sheet for all text layouts XSLFSheet slideMaster = slideLayout.getMasterSheet(); extractContent(slideMaster.getShapes(), true, xhtml, null); // notes (if present) - XSLFNotes slideNotes = slide.getNotes(); + XSLFSheet slideNotes = slide.getNotes(); if (slideNotes != null) { - xhtml.startElement("div", "class", "slide-notes"); - extractContent(slideNotes.getShapes(), false, xhtml, slideDesc); // master sheet for this notes - XSLFNotesMaster notesMaster = slideNotes.getMasterSheet(); + XSLFSheet notesMaster = slideNotes.getMasterSheet(); extractContent(notesMaster.getShapes(), true, xhtml, null); - xhtml.endElement("div"); } // comments (if present) XSLFComments comments = slide.getComments(); if (comments != null) { - StringBuilder authorStringBuilder = new StringBuilder(); - for (int i = 0; i < comments.getNumberOfComments(); i++) { - authorStringBuilder.setLength(0); - CTComment comment = comments.getCommentAt(i); - xhtml.startElement("p", 
"class", "slide-comment"); - CTCommentAuthor cta = commentAuthors.getAuthorById(comment.getAuthorId()); - if (cta != null) { - if (cta.getName() != null) { - authorStringBuilder.append(cta.getName()); - } - if (cta.getInitials() != null) { - if (authorStringBuilder.length() > 0) { - authorStringBuilder.append(" "); - } - authorStringBuilder.append("("+cta.getInitials()+")"); - } - if (comment.getText() != null && authorStringBuilder.length() > 0) { - authorStringBuilder.append(" - "); - } - if (authorStringBuilder.length() > 0) { - xhtml.startElement("b"); - xhtml.characters(authorStringBuilder.toString()); - xhtml.endElement("b"); - } - } - xhtml.characters(comment.getText()); - xhtml.endElement("p"); + for (CTComment comment : comments.getCTCommentsList().getCmList()) { + xhtml.element("p", comment.getText()); } } } } - private void extractContent(List shapes, boolean skipPlaceholders, XHTMLContentHandler xhtml, String slideDesc) + private void extractContent(XSLFShape[] shapes, boolean skipPlaceholders, XHTMLContentHandler xhtml, String slideDesc) throws SAXException { for (XSLFShape sh : shapes) { if (sh instanceof XSLFTextShape) { @@ -155,27 +115,28 @@ if (skipPlaceholders && ph != null) { continue; } - for (XSLFTextParagraph p : txt.getTextParagraphs()) { - xhtml.element("p", p.getText()); - } - } else if (sh instanceof XSLFGroupShape) { + xhtml.element("p", txt.getText()); + } else if (sh instanceof XSLFGroupShape){ // recurse into groups of shapes - XSLFGroupShape group = (XSLFGroupShape) sh; + XSLFGroupShape group = (XSLFGroupShape)sh; extractContent(group.getShapes(), skipPlaceholders, xhtml, slideDesc); } else if (sh instanceof XSLFTable) { - //unlike tables in Word, ppt/x can't have recursive tables...I don't think - extractTable((XSLFTable)sh, xhtml); + XSLFTable tbl = (XSLFTable)sh; + for(XSLFTableRow row : tbl){ + List cells = row.getCells(); + extractContent(cells.toArray(new XSLFTableCell[cells.size()]), skipPlaceholders, xhtml, slideDesc); + } } 
 else if (sh instanceof XSLFGraphicFrame) {
     XSLFGraphicFrame frame = (XSLFGraphicFrame) sh;
     XmlObject[] sp = frame.getXmlObject().selectPath(
-            "declare namespace p='http://schemas.openxmlformats.org/presentationml/2006/main' .//*/p:oleObj");
+            "declare namespace p='http://schemas.openxmlformats.org/presentationml/2006/main' .//*/p:oleObj");
     if (sp != null) {
-        for (XmlObject emb : sp) {
+        for(XmlObject emb : sp) {
             XmlObject relIDAtt = emb.selectAttribute(new QName("http://schemas.openxmlformats.org/officeDocument/2006/relationships", "id"));
             if (relIDAtt != null) {
                 String relID = relIDAtt.getDomNode().getNodeValue();
                 if (slideDesc != null) {
-                    relID = slideDesc + relID;
+                    relID = slideDesc + relID;
                 }
                 AttributesImpl attributes = new AttributesImpl();
                 attributes.addAttribute("", "class", "class", "CDATA", "embedded");
@@ -192,7 +153,7 @@
         String relID = ctPic.getBlipFill().getBlip().getEmbed();
         if (relID != null) {
             if (slideDesc != null) {
-                relID = slideDesc + relID;
+                relID = slideDesc + relID;
             }
             AttributesImpl attributes = new AttributesImpl();
             attributes.addAttribute("", "class", "class", "CDATA", "embedded");
@@ -205,66 +166,47 @@
         }
     }
 }
-
-    private void extractTable(XSLFTable tbl, XHTMLContentHandler xhtml) throws SAXException {
-        xhtml.startElement("table");
-        for (XSLFTableRow row : tbl) {
-            xhtml.startElement("tr");
-            List cells = row.getCells();
-            for (XSLFTableCell c : row.getCells()) {
-                xhtml.startElement("td");
-                xhtml.characters(c.getText());
-                xhtml.endElement("td");
-            }
-            xhtml.endElement("tr");
-        }
-        xhtml.endElement("table");
-
-    }
-
+
     /**
      * In PowerPoint files, slides have things embedded in them,
-     * and slide drawings which have the images
+     * and slide drawings which have the images
      */
     @Override
     protected List getMainDocumentParts() throws TikaException {
-        List parts = new ArrayList<>();
-        XMLSlideShow slideShow = (XMLSlideShow) extractor.getDocument();
-        XSLFSlideShow document = null;
-        try {
-            document = slideShow._getXSLFSlideShow(); // TODO Avoid this in future
-        } catch (Exception e) {
-            throw new TikaException(e.getMessage()); // Shouldn't happen
-        }
-
-        CTSlideIdList ctSlideIdList = document.getSlideReferences();
-        if (ctSlideIdList != null) {
-            for (int i = 0; i < ctSlideIdList.sizeOfSldIdArray(); i++) {
-                CTSlideIdListEntry ctSlide = ctSlideIdList.getSldIdArray(i);
-                // Add the slide
-                PackagePart slidePart;
-                try {
-                    slidePart = document.getSlidePart(ctSlide);
-                } catch (IOException e) {
-                    throw new TikaException("Broken OOXML file", e);
-                } catch (XmlException xe) {
-                    throw new TikaException("Broken OOXML file", xe);
-                }
-                parts.add(slidePart);
-
-                // If it has drawings, return those too
-                try {
-                    for (PackageRelationship rel : slidePart.getRelationshipsByType(XSLFRelation.VML_DRAWING.getRelation())) {
-                        if (rel.getTargetMode() == TargetMode.INTERNAL) {
-                            PackagePartName relName = PackagingURIHelper.createPartName(rel.getTargetURI());
-                            parts.add(rel.getPackage().getPart(relName));
-                        }
-                    }
-                } catch (InvalidFormatException e) {
-                    throw new TikaException("Broken OOXML file", e);
-                }
-            }
-        }
-        return parts;
+        List parts = new ArrayList();
+        XMLSlideShow slideShow = (XMLSlideShow) extractor.getDocument();
+        XSLFSlideShow document = null;
+        try {
+            document = slideShow._getXSLFSlideShow(); // TODO Avoid this in future
+        } catch(Exception e) {
+            throw new TikaException(e.getMessage()); // Shouldn't happen
+        }
+
+        for (CTSlideIdListEntry ctSlide : document.getSlideReferences().getSldIdList()) {
+            // Add the slide
+            PackagePart slidePart;
+            try {
+                slidePart = document.getSlidePart(ctSlide);
+            } catch(IOException e) {
+                throw new TikaException("Broken OOXML file", e);
+            } catch(XmlException xe) {
+                throw new TikaException("Broken OOXML file", xe);
+            }
+            parts.add(slidePart);
+
+            // If it has drawings, return those too
+            try {
+                for(PackageRelationship rel : slidePart.getRelationshipsByType(XSLFRelation.VML_DRAWING.getRelation())) {
+                    if(rel.getTargetMode() == TargetMode.INTERNAL) {
+                        PackagePartName relName = PackagingURIHelper.createPartName(rel.getTargetURI());
+                        parts.add( rel.getPackage().getPart(relName) );
+                    }
+                }
+            } catch(InvalidFormatException e) {
+                throw new TikaException("Broken OOXML file", e);
+            }
+        }
+
+        return parts;
     }
 }
diff --git a/tika-parsers/src/main/java/org/apache/tika/parser/microsoft/ooxml/XSSFExcelExtractorDecorator.java b/tika-parsers/src/main/java/org/apache/tika/parser/microsoft/ooxml/XSSFExcelExtractorDecorator.java
index 433b9a4..985e413 100644
--- a/tika-parsers/src/main/java/org/apache/tika/parser/microsoft/ooxml/XSSFExcelExtractorDecorator.java
+++ b/tika-parsers/src/main/java/org/apache/tika/parser/microsoft/ooxml/XSSFExcelExtractorDecorator.java
@@ -16,14 +16,15 @@
  */
 package org.apache.tika.parser.microsoft.ooxml;
-import javax.xml.parsers.ParserConfigurationException;
-import javax.xml.parsers.SAXParser;
-import javax.xml.parsers.SAXParserFactory;
 import java.io.IOException;
 import java.io.InputStream;
 import java.util.ArrayList;
 import java.util.List;
 import java.util.Locale;
+
+import javax.xml.parsers.ParserConfigurationException;
+import javax.xml.parsers.SAXParser;
+import javax.xml.parsers.SAXParserFactory;
 import org.apache.poi.hssf.extractor.ExcelExtractor;
 import org.apache.poi.openxml4j.exceptions.InvalidFormatException;
@@ -62,10 +63,6 @@
 import org.xml.sax.XMLReader;
 public class XSSFExcelExtractorDecorator extends AbstractOOXMLExtractor {
-    /**
-     * Allows access to headers/footers from raw xml strings
-     */
-    private static HeaderFooterHelper hfHelper = new HeaderFooterHelper();
     private final XSSFEventBasedExcelExtractor extractor;
     private final DataFormatter formatter;
     private final List sheetParts = new ArrayList();
@@ -78,11 +75,11 @@
         this.extractor = extractor;
         extractor.setFormulasNotResults(false);
         extractor.setLocale(locale);
-
-        if (locale == null) {
-            formatter = new DataFormatter();
-        } else {
-            formatter = new DataFormatter(locale);
+
+        if(locale == null) {
+            formatter = new DataFormatter();
+        } else {
+            formatter = new DataFormatter(locale);
         }
     }
@@ -91,10 +88,10 @@
             ContentHandler handler, Metadata metadata, ParseContext context)
             throws SAXException, XmlException, IOException, TikaException {
-        this.metadata = metadata;
-        metadata.set(TikaMetadataKeys.PROTECTED, "false");
-
-        super.getXHTML(handler, metadata, context);
+        this.metadata = metadata;
+        metadata.set(TikaMetadataKeys.PROTECTED, "false");
+
+        super.getXHTML(handler, metadata, context);
     }
     /**
@@ -103,293 +100,278 @@
     @Override
     protected void buildXHTML(XHTMLContentHandler xhtml) throws SAXException,
             XmlException, IOException {
-        OPCPackage container = extractor.getPackage();
-
-        ReadOnlySharedStringsTable strings;
-        XSSFReader.SheetIterator iter;
-        XSSFReader xssfReader;
-        StylesTable styles;
-        try {
-            xssfReader = new XSSFReader(container);
-            styles = xssfReader.getStylesTable();
-            iter = (XSSFReader.SheetIterator) xssfReader.getSheetsData();
-            strings = new ReadOnlySharedStringsTable(container);
-        } catch (InvalidFormatException e) {
-            throw new XmlException(e);
-        } catch (OpenXML4JException oe) {
-            throw new XmlException(oe);
-        }
-
-        while (iter.hasNext()) {
-            InputStream stream = iter.next();
-            sheetParts.add(iter.getSheetPart());
-
-            SheetTextAsHTML sheetExtractor = new SheetTextAsHTML(xhtml);
-            CommentsTable comments = iter.getSheetComments();
-
-            // Start, and output the sheet name
-            xhtml.startElement("div");
-            xhtml.element("h1", iter.getSheetName());
-
-            // Extract the main sheet contents
-            xhtml.startElement("table");
-            xhtml.startElement("tbody");
-
-            processSheet(sheetExtractor, comments, styles, strings, stream);
-
-            xhtml.endElement("tbody");
-            xhtml.endElement("table");
-
-            // Output any headers and footers
-            // (Need to process the sheet to get them, so we can't
-            //  do the headers before the contents)
-            for (String header : sheetExtractor.headers) {
-                extractHeaderFooter(header, xhtml);
-            }
-            for (String footer : sheetExtractor.footers) {
-                extractHeaderFooter(footer, xhtml);
-            }
-            processShapes(iter.getShapes(), xhtml);
-            // All done with this sheet
-            xhtml.endElement("div");
-        }
+        OPCPackage container = extractor.getPackage();
+
+        ReadOnlySharedStringsTable strings;
+        XSSFReader.SheetIterator iter;
+        XSSFReader xssfReader;
+        StylesTable styles;
+        try {
+            xssfReader = new XSSFReader(container);
+            styles = xssfReader.getStylesTable();
+            iter = (XSSFReader.SheetIterator) xssfReader.getSheetsData();
+            strings = new ReadOnlySharedStringsTable(container);
+        } catch(InvalidFormatException e) {
+            throw new XmlException(e);
+        } catch (OpenXML4JException oe) {
+            throw new XmlException(oe);
+        }
+
+        while (iter.hasNext()) {
+            InputStream stream = iter.next();
+            sheetParts.add(iter.getSheetPart());
+
+            SheetTextAsHTML sheetExtractor = new SheetTextAsHTML(xhtml, iter.getSheetComments());
+
+            // Start, and output the sheet name
+            xhtml.startElement("div");
+            xhtml.element("h1", iter.getSheetName());
+
+            // Extract the main sheet contents
+            xhtml.startElement("table");
+            xhtml.startElement("tbody");
+
+            processSheet(sheetExtractor, styles, strings, stream);
+
+            xhtml.endElement("tbody");
+            xhtml.endElement("table");
+
+            // Output any headers and footers
+            // (Need to process the sheet to get them, so we can't
+            //  do the headers before the contents)
+            for(String header : sheetExtractor.headers) {
+                extractHeaderFooter(header, xhtml);
+            }
+            for(String footer : sheetExtractor.footers) {
+                extractHeaderFooter(footer, xhtml);
+            }
+            processShapes(iter.getShapes(), xhtml);
+            // All done with this sheet
+            xhtml.endElement("div");
+        }
     }
     private void extractHeaderFooter(String hf, XHTMLContentHandler xhtml)
             throws SAXException {
         String content = ExcelExtractor._extractHeaderFooter(
-                new HeaderFooterFromString(hf));
+                new HeaderFooterFromString(hf));
         if (content.length() > 0) {
             xhtml.element("p", content);
         }
     }
-
+
     private void processShapes(List shapes, XHTMLContentHandler xhtml) throws SAXException {
-        if (shapes == null) {
-            return;
-        }
-        for (XSSFShape shape : shapes) {
-            if (shape instanceof XSSFSimpleShape) {
-                String sText = ((XSSFSimpleShape) shape).getText();
-                if (sText != null && sText.length() > 0) {
-                    xhtml.element("p", sText);
+        if (shapes == null){
+            return;
+        }
+        for (XSSFShape shape : shapes){
+            if (shape instanceof XSSFSimpleShape){
+                String sText = ((XSSFSimpleShape)shape).getText();
+                if (sText != null && sText.length() > 0){
+                    xhtml.element("p", sText);
+                }
+            }
+        }
+    }
+
+    public void processSheet(
+            SheetContentsHandler sheetContentsExtractor,
+            StylesTable styles,
+            ReadOnlySharedStringsTable strings,
+            InputStream sheetInputStream)
+            throws IOException, SAXException {
+        InputSource sheetSource = new InputSource(sheetInputStream);
+        SAXParserFactory saxFactory = SAXParserFactory.newInstance();
+        try {
+            SAXParser saxParser = saxFactory.newSAXParser();
+            XMLReader sheetParser = saxParser.getXMLReader();
+            XSSFSheetInterestingPartsCapturer handler =
+                new XSSFSheetInterestingPartsCapturer(new XSSFSheetXMLHandler(
+                    styles, strings, sheetContentsExtractor, formatter, false));
+            sheetParser.setContentHandler(handler);
+            sheetParser.parse(sheetSource);
+            sheetInputStream.close();
+
+            if (handler.hasProtection) {
+                metadata.set(TikaMetadataKeys.PROTECTED, "true");
+            }
+        } catch(ParserConfigurationException e) {
+            throw new RuntimeException("SAX parser appears to be broken - " + e.getMessage());
+        }
+    }
+
+    /**
+     * Turns formatted sheet events into HTML
+     */
+    protected static class SheetTextAsHTML implements SheetContentsHandler {
+        private XHTMLContentHandler xhtml;
+        private CommentsTable comments;
+        private List headers;
+        private List footers;
+
+        protected SheetTextAsHTML(XHTMLContentHandler xhtml, CommentsTable comments) {
+            this.xhtml = xhtml;
+            this.comments = comments;
+            headers = new ArrayList();
+            footers = new ArrayList();
+        }
+
+        public void startRow(int rowNum) {
+            try {
+                xhtml.startElement("tr");
+            } catch(SAXException e) {}
+        }
+
+        public void endRow() {
+            try {
+                xhtml.endElement("tr");
+            } catch(SAXException e) {}
+        }
+
+        public void cell(String cellRef, String formattedValue) {
+            try {
+                xhtml.startElement("td");
+
+                // Main cell contents
+                xhtml.characters(formattedValue);
+
+                // Comments
+                if(comments != null) {
+                    XSSFComment comment = comments.findCellComment(cellRef);
+                    if(comment != null) {
+                        xhtml.startElement("br");
+                        xhtml.endElement("br");
+                        xhtml.characters(comment.getAuthor());
+                        xhtml.characters(": ");
+                        xhtml.characters(comment.getString().getString());
                     }
-                }
-            }
-        }
-    }
-
-    public void processSheet(
-            SheetContentsHandler sheetContentsExtractor,
-            CommentsTable comments,
-            StylesTable styles,
-            ReadOnlySharedStringsTable strings,
-            InputStream sheetInputStream)
-            throws IOException, SAXException {
-        InputSource sheetSource = new InputSource(sheetInputStream);
-        SAXParserFactory saxFactory = SAXParserFactory.newInstance();
-        try {
-            SAXParser saxParser = saxFactory.newSAXParser();
-            XMLReader sheetParser = saxParser.getXMLReader();
-            XSSFSheetInterestingPartsCapturer handler =
-                    new XSSFSheetInterestingPartsCapturer(new XSSFSheetXMLHandler(
-                            styles, comments, strings, sheetContentsExtractor, formatter, false));
-            sheetParser.setContentHandler(handler);
-            sheetParser.parse(sheetSource);
-            sheetInputStream.close();
-
-            if (handler.hasProtection) {
-                metadata.set(TikaMetadataKeys.PROTECTED, "true");
-            }
-        } catch (ParserConfigurationException e) {
-            throw new RuntimeException("SAX parser appears to be broken - " + e.getMessage());
-        }
-    }
-
+                }
+
+                xhtml.endElement("td");
+            } catch(SAXException e) {}
+        }
+
+        public void headerFooter(String text, boolean isHeader, String tagName) {
+            if(isHeader) {
+                headers.add(text);
+            } else {
+                footers.add(text);
+            }
+        }
+    }
+
+    /**
+     * Allows access to headers/footers from raw xml strings
+     */
+    private static HeaderFooterHelper hfHelper = new HeaderFooterHelper();
+    protected static class HeaderFooterFromString implements HeaderFooter {
+        private String text;
+        protected HeaderFooterFromString(String text) {
+            this.text = text;
+        }
+
+        public String getCenter() {
+            return hfHelper.getCenterSection(text);
+        }
+        public String getLeft() {
+            return hfHelper.getLeftSection(text);
+        }
+        public String getRight() {
+            return hfHelper.getRightSection(text);
+        }
+
+        public void setCenter(String paramString) {}
+        public void setLeft(String paramString) {}
+        public void setRight(String paramString) {}
+    }
+
+    /**
+     * Captures information on interesting tags, whilst
+     * delegating the main work to the formatting handler
+     */
+    protected static class XSSFSheetInterestingPartsCapturer implements ContentHandler {
+        private ContentHandler delegate;
+        private boolean hasProtection = false;
+
+        protected XSSFSheetInterestingPartsCapturer(ContentHandler delegate) {
+            this.delegate = delegate;
+        }
+
+        public void startElement(String uri, String localName, String qName,
+                Attributes atts) throws SAXException {
+            if("sheetProtection".equals(qName)) {
+                hasProtection = true;
+            }
+            delegate.startElement(uri, localName, qName, atts);
+        }
+
+        public void characters(char[] ch, int start, int length)
+                throws SAXException {
+            delegate.characters(ch, start, length);
+        }
+        public void endDocument() throws SAXException {
+            delegate.endDocument();
+        }
+        public void endElement(String uri, String localName, String qName)
+                throws SAXException {
+            delegate.endElement(uri, localName, qName);
+        }
+        public void endPrefixMapping(String prefix) throws SAXException {
+            delegate.endPrefixMapping(prefix);
+        }
+        public void ignorableWhitespace(char[] ch, int start, int length)
+                throws SAXException {
+            delegate.ignorableWhitespace(ch, start, length);
+        }
+        public void processingInstruction(String target, String data)
+                throws SAXException {
+            delegate.processingInstruction(target, data);
+        }
+        public void setDocumentLocator(Locator locator) {
+            delegate.setDocumentLocator(locator);
+        }
+        public void skippedEntity(String name) throws SAXException {
+            delegate.skippedEntity(name);
+        }
+        public void startDocument() throws SAXException {
+            delegate.startDocument();
+        }
+        public void startPrefixMapping(String prefix, String uri)
+                throws SAXException {
+            delegate.startPrefixMapping(prefix, uri);
+        }
+    }
+
     /**
      * In Excel files, sheets have things embedded in them,
-     * and sheet drawings which have the images
+     * and sheet drawings which have the images
      */
     @Override
     protected List getMainDocumentParts() throws TikaException {
-        List parts = new ArrayList();
-        for (PackagePart part : sheetParts) {
-            // Add the sheet
-            parts.add(part);
-
-            // If it has drawings, return those too
-            try {
-                for (PackageRelationship rel : part.getRelationshipsByType(XSSFRelation.DRAWINGS.getRelation())) {
-                    if (rel.getTargetMode() == TargetMode.INTERNAL) {
-                        PackagePartName relName = PackagingURIHelper.createPartName(rel.getTargetURI());
-                        parts.add(rel.getPackage().getPart(relName));
-                    }
+        List parts = new ArrayList();
+        for(PackagePart part : sheetParts) {
+            // Add the sheet
+            parts.add(part);
+
+            // If it has drawings, return those too
+            try {
+                for(PackageRelationship rel : part.getRelationshipsByType(XSSFRelation.DRAWINGS.getRelation())) {
+                    if(rel.getTargetMode() == TargetMode.INTERNAL) {
+                        PackagePartName relName = PackagingURIHelper.createPartName(rel.getTargetURI());
+                        parts.add( rel.getPackage().getPart(relName) );
                    }
-                for (PackageRelationship rel : part.getRelationshipsByType(XSSFRelation.VML_DRAWINGS.getRelation())) {
-                    if (rel.getTargetMode() == TargetMode.INTERNAL) {
-                        PackagePartName relName = PackagingURIHelper.createPartName(rel.getTargetURI());
-                        parts.add(rel.getPackage().getPart(relName));
-                    }
+                }
+                for(PackageRelationship rel : part.getRelationshipsByType(XSSFRelation.VML_DRAWINGS.getRelation())) {
+                    if(rel.getTargetMode() == TargetMode.INTERNAL) {
+                        PackagePartName relName = PackagingURIHelper.createPartName(rel.getTargetURI());
+                        parts.add( rel.getPackage().getPart(relName) );
                    }
-            } catch (InvalidFormatException e) {
-                throw new TikaException("Broken OOXML file", e);
-            }
-        }
-
-        return parts;
-    }
-
-    /**
-     * Turns formatted sheet events into HTML
-     */
-    protected static class SheetTextAsHTML implements SheetContentsHandler {
-        private XHTMLContentHandler xhtml;
-        private List headers;
-        private List footers;
-
-        protected SheetTextAsHTML(XHTMLContentHandler xhtml) {
-            this.xhtml = xhtml;
-            headers = new ArrayList();
-            footers = new ArrayList();
-        }
-
-        public void startRow(int rowNum) {
-            try {
-                xhtml.startElement("tr");
-            } catch (SAXException e) {
-            }
-        }
-
-        public void endRow(int rowNum) {
-            try {
-                xhtml.endElement("tr");
-            } catch (SAXException e) {
-            }
-        }
-
-        public void cell(String cellRef, String formattedValue, XSSFComment comment) {
-            try {
-                xhtml.startElement("td");
-
-                // Main cell contents
-                if (formattedValue != null) {
-                    xhtml.characters(formattedValue);
-                }
-
-                // Comments
-                if (comment != null) {
-                    xhtml.startElement("br");
-                    xhtml.endElement("br");
-                    xhtml.characters(comment.getAuthor());
-                    xhtml.characters(": ");
-                    xhtml.characters(comment.getString().getString());
-                }
-
-                xhtml.endElement("td");
-            } catch (SAXException e) {
-            }
-        }
-
-        public void headerFooter(String text, boolean isHeader, String tagName) {
-            if (isHeader) {
-                headers.add(text);
-            } else {
-                footers.add(text);
-            }
-        }
-    }
-
-    protected static class HeaderFooterFromString implements HeaderFooter {
-        private String text;
-
-        protected HeaderFooterFromString(String text) {
-            this.text = text;
-        }
-
-        public String getCenter() {
-            return hfHelper.getCenterSection(text);
-        }
-
-        public void setCenter(String paramString) {
-        }
-
-        public String getLeft() {
-            return hfHelper.getLeftSection(text);
-        }
-
-        public void setLeft(String paramString) {
-        }
-
-        public String getRight() {
-            return hfHelper.getRightSection(text);
-        }
-
-        public void setRight(String paramString) {
-        }
-    }
-
-    /**
-     * Captures information on interesting tags, whilst
-     * delegating the main work to the formatting handler
-     */
-    protected static class XSSFSheetInterestingPartsCapturer implements ContentHandler {
-        private ContentHandler delegate;
-        private boolean hasProtection = false;
-
-        protected XSSFSheetInterestingPartsCapturer(ContentHandler delegate) {
-            this.delegate = delegate;
-        }
-
-        public void startElement(String uri, String localName, String qName,
-                                 Attributes atts) throws SAXException {
-            if ("sheetProtection".equals(qName)) {
-                hasProtection = true;
-            }
-            delegate.startElement(uri, localName, qName, atts);
-        }
-
-        public void characters(char[] ch, int start, int length)
-                throws SAXException {
-            delegate.characters(ch, start, length);
-        }
-
-        public void endDocument() throws SAXException {
-            delegate.endDocument();
-        }
-
-        public void endElement(String uri, String localName, String qName)
-                throws SAXException {
-            delegate.endElement(uri, localName, qName);
-        }
-
-        public void endPrefixMapping(String prefix) throws SAXException {
-            delegate.endPrefixMapping(prefix);
-        }
-
-        public void ignorableWhitespace(char[] ch, int start, int length)
-                throws SAXException {
-            delegate.ignorableWhitespace(ch, start, length);
-        }
-
-        public void processingInstruction(String target, String data)
-                throws SAXException {
-            delegate.processingInstruction(target, data);
-        }
-
-        public void setDocumentLocator(Locator locator) {
-            delegate.setDocumentLocator(locator);
-        }
-
-        public void skippedEntity(String name) throws SAXException {
-            delegate.skippedEntity(name);
-        }
-
-        public void startDocument() throws SAXException {
-            delegate.startDocument();
-        }
-
-        public void startPrefixMapping(String prefix, String uri)
-                throws SAXException {
-            delegate.startPrefixMapping(prefix, uri);
-        }
+                }
+            } catch(InvalidFormatException e) {
+                throw new TikaException("Broken OOXML file", e);
+            }
+        }
+
+        return parts;
     }
 }
diff --git a/tika-parsers/src/main/java/org/apache/tika/parser/microsoft/ooxml/XWPFListManager.java b/tika-parsers/src/main/java/org/apache/tika/parser/microsoft/ooxml/XWPFListManager.java
deleted file mode 100644
index 5654378..0000000
--- a/tika-parsers/src/main/java/org/apache/tika/parser/microsoft/ooxml/XWPFListManager.java
+++ /dev/null
@@ -1,165 +0,0 @@
-/*
- * Licensed to the Apache Software Foundation (ASF) under one or more
- * contributor license agreements.  See the NOTICE file distributed with
- * this work for additional information regarding copyright ownership.
- * The ASF licenses this file to You under the Apache License, Version 2.0
- * (the "License"); you may not use this file except in compliance with
- * the License.  You may obtain a copy of the License at
- *
- *     http://www.apache.org/licenses/LICENSE-2.0
- *
- * Unless required by applicable law or agreed to in writing, software
- * distributed under the License is distributed on an "AS IS" BASIS,
- * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
- * See the License for the specific language governing permissions and
- * limitations under the License.
- */
-package org.apache.tika.parser.microsoft.ooxml;
-
-import org.apache.poi.xwpf.usermodel.XWPFAbstractNum;
-import org.apache.poi.xwpf.usermodel.XWPFDocument;
-import org.apache.poi.xwpf.usermodel.XWPFNum;
-import org.apache.poi.xwpf.usermodel.XWPFNumbering;
-import org.apache.poi.xwpf.usermodel.XWPFParagraph;
-import org.apache.tika.parser.microsoft.AbstractListManager;
-import org.openxmlformats.schemas.wordprocessingml.x2006.main.CTAbstractNum;
-import org.openxmlformats.schemas.wordprocessingml.x2006.main.CTDecimalNumber;
-import org.openxmlformats.schemas.wordprocessingml.x2006.main.CTLvl;
-import org.openxmlformats.schemas.wordprocessingml.x2006.main.CTNum;
-import org.openxmlformats.schemas.wordprocessingml.x2006.main.CTNumLvl;
-
-
-public class XWPFListManager extends AbstractListManager {
-    private final static boolean OVERRIDE_AVAILABLE;
-    private final static String SKIP_FORMAT = Character.toString((char) 61623);//if this shows up as the lvlText, don't show a number
-
-    static {
-        boolean b = false;
-        try {
-            Class.forName("org.openxmlformats.schemas.wordprocessingml.x2006.main.CTNumLvl");
-            b = true;
-        } catch (ClassNotFoundException e) {
-        }
-        b = OVERRIDE_AVAILABLE = false;
-
-    }
-
-    private final XWPFNumbering numbering;
-
-    //map of numId (which paragraph series is this a member of?), levelcounts
-    public XWPFListManager(XWPFDocument document) {
-        numbering = document.getNumbering();
-    }
-
-    /**
-     *
-     * @param paragraph paragraph
-     * @return the formatted number or an empty string if something went wrong
-     */
-    public String getFormattedNumber(final XWPFParagraph paragraph) {
-        int currNumId = paragraph.getNumID().intValue();
-        XWPFNum xwpfNum = numbering.getNum(paragraph.getNumID());
-        if (xwpfNum == null) {
-            return "";
-        }
-        CTNum ctNum = xwpfNum.getCTNum();
-        CTDecimalNumber abNum = ctNum.getAbstractNumId();
-        int currAbNumId = abNum.getVal().intValue();
-
-        ParagraphLevelCounter lc = listLevelMap.get(currAbNumId);
-        LevelTuple[] overrideTuples = overrideTupleMap.get(currNumId);
-        if (lc == null) {
-            lc = loadLevelTuples(abNum);
-        }
-        if (overrideTuples == null) {
-            overrideTuples = loadOverrideTuples(ctNum, lc.getNumberOfLevels());
-        }
-
-        String formattedString = lc.incrementLevel(paragraph.getNumIlvl().intValue(), overrideTuples);
-
-        listLevelMap.put(currAbNumId, lc);
-        overrideTupleMap.put(currNumId, overrideTuples);
-
-        return formattedString;
-    }
-
-    private LevelTuple[] loadOverrideTuples(CTNum ctNum, int length) {
-        LevelTuple[] levelTuples = new LevelTuple[length];
-        int overrideLength = ctNum.sizeOfLvlOverrideArray();
-        if (overrideLength == 0) {
-            return null;
-        }
-        for (int i = 0; i < length; i++) {
-            LevelTuple tuple;
-            if (i >= overrideLength) {
-                tuple = new LevelTuple("%"+i+".");
-            } else {
-                CTNumLvl ctNumLvl = ctNum.getLvlOverrideArray(i);
-                if (ctNumLvl != null) {
-                    tuple = buildTuple(i, ctNumLvl.getLvl());
-                } else {
-                    tuple = new LevelTuple("%"+i+".");
-                }
-            }
-            levelTuples[i] = tuple;
-        }
-        return levelTuples;
-    }
-
-
-    private ParagraphLevelCounter loadLevelTuples(CTDecimalNumber abNum) {
-        //Unfortunately, we need to go this far into the underlying structure
-        //to get the abstract num information for the edge case where
-        //someone skips a level and the format is not context-free, e.g. "1.B.i".
-        XWPFAbstractNum abstractNum = numbering.getAbstractNum(abNum.getVal());
-        CTAbstractNum ctAbstractNum = abstractNum.getCTAbstractNum();
-
-        LevelTuple[] levels = new LevelTuple[ctAbstractNum.sizeOfLvlArray()];
-        for (int i = 0; i < levels.length; i++) {
-            levels[i] = buildTuple(i, ctAbstractNum.getLvlArray(i));
-        }
-        return new ParagraphLevelCounter(levels);
-    }
-
-    private LevelTuple buildTuple(int level, CTLvl ctLvl) {
-        boolean isLegal = false;
-        int start = 1;
-        int restart = -1;
-        String lvlText = "%" + level + ".";
-        String numFmt = "decimal";
-
-
-        if (ctLvl != null && ctLvl.getIsLgl() != null) {
-            isLegal = true;
-        }
-
-        if (ctLvl != null && ctLvl.getNumFmt() != null &&
-                ctLvl.getNumFmt().getVal() != null) {
-            numFmt = ctLvl.getNumFmt().getVal().toString();
-        }
-        if (ctLvl != null && ctLvl.getLvlRestart() != null &&
-                ctLvl.getLvlRestart().getVal() != null) {
-            restart = ctLvl.getLvlRestart().getVal().intValue();
-        }
-        if (ctLvl != null && ctLvl.getStart() != null &&
-                ctLvl.getStart().getVal() != null) {
-            start = ctLvl.getStart().getVal().intValue();
-        } else {
-
-            //this is a hack. Currently, this gets the lowest possible
-            //start for a given numFmt. We should probably try to grab the
-            //restartNumberingAfterBreak value in
-            //e.g. ???
-            if ("decimal".equals(numFmt) || "ordinal".equals(numFmt) || "decimalZero".equals(numFmt)) {
-                start = 0;
-            } else {
-                start = 1;
-            }
-        }
-        if (ctLvl != null && ctLvl.getLvlText() != null && ctLvl.getLvlText().getVal() != null) {
-            lvlText = ctLvl.getLvlText().getVal();
-        }
-        return new LevelTuple(start, restart, lvlText, numFmt, isLegal);
-    }
-
-}
diff --git a/tika-parsers/src/main/java/org/apache/tika/parser/microsoft/ooxml/XWPFWordExtractorDecorator.java b/tika-parsers/src/main/java/org/apache/tika/parser/microsoft/ooxml/XWPFWordExtractorDecorator.java
index 6caf803..eac3c7b 100644
--- a/tika-parsers/src/main/java/org/apache/tika/parser/microsoft/ooxml/XWPFWordExtractorDecorator.java
+++ b/tika-parsers/src/main/java/org/apache/tika/parser/microsoft/ooxml/XWPFWordExtractorDecorator.java
@@ -16,10 +16,10 @@
  */
 package org.apache.tika.parser.microsoft.ooxml;
-import javax.xml.namespace.QName;
 import java.io.IOException;
 import java.util.ArrayList;
 import java.util.List;
+import javax.xml.namespace.QName;
 import org.apache.poi.openxml4j.opc.PackagePart;
 import org.apache.poi.xwpf.extractor.XWPFWordExtractor;
@@ -28,9 +28,7 @@
 import org.apache.poi.xwpf.usermodel.BodyType;
 import org.apache.poi.xwpf.usermodel.IBody;
 import org.apache.poi.xwpf.usermodel.IBodyElement;
-import org.apache.poi.xwpf.usermodel.ICell;
 import org.apache.poi.xwpf.usermodel.IRunElement;
-import org.apache.poi.xwpf.usermodel.ISDTContent;
 import org.apache.poi.xwpf.usermodel.XWPFDocument;
 import org.apache.poi.xwpf.usermodel.XWPFHeaderFooter;
 import org.apache.poi.xwpf.usermodel.XWPFHyperlink;
@@ -40,38 +38,33 @@
 import org.apache.poi.xwpf.usermodel.XWPFPictureData;
 import org.apache.poi.xwpf.usermodel.XWPFRun;
 import org.apache.poi.xwpf.usermodel.XWPFSDT;
-import org.apache.poi.xwpf.usermodel.XWPFSDTCell;
+import org.apache.poi.xwpf.usermodel.XWPFSDTContent;
 import org.apache.poi.xwpf.usermodel.XWPFStyle;
 import org.apache.poi.xwpf.usermodel.XWPFStyles;
 import org.apache.poi.xwpf.usermodel.XWPFTable;
 import org.apache.poi.xwpf.usermodel.XWPFTableCell;
 import org.apache.poi.xwpf.usermodel.XWPFTableRow;
 import org.apache.tika.parser.ParseContext;
+import org.apache.tika.parser.microsoft.WordExtractor.TagAndStyle;
 import org.apache.tika.parser.microsoft.WordExtractor;
-import org.apache.tika.parser.microsoft.WordExtractor.TagAndStyle;
 import org.apache.tika.sax.XHTMLContentHandler;
 import org.apache.xmlbeans.XmlCursor;
 import org.apache.xmlbeans.XmlException;
 import org.apache.xmlbeans.XmlObject;
 import org.openxmlformats.schemas.wordprocessingml.x2006.main.CTBookmark;
 import org.openxmlformats.schemas.wordprocessingml.x2006.main.CTObject;
+import org.openxmlformats.schemas.wordprocessingml.x2006.main.CTSectPr;
 import org.openxmlformats.schemas.wordprocessingml.x2006.main.CTP;
-import org.openxmlformats.schemas.wordprocessingml.x2006.main.CTSectPr;
 import org.xml.sax.SAXException;
 import org.xml.sax.helpers.AttributesImpl;
 public class XWPFWordExtractorDecorator extends AbstractOOXMLExtractor {
-
-    // could be improved by using the real delimiter in xchFollow [MS-DOC], v20140721, 2.4.6.3, Part 3, Step 3
-    private static final String LIST_DELIMITER = " ";
-
-
     private XWPFDocument document;
     private XWPFStyles styles;
     public XWPFWordExtractorDecorator(ParseContext context, XWPFWordExtractor extractor) {
         super(context, extractor);
-
+
         document = (XWPFDocument) extractor.getDocument();
         styles = document.getStyles();
     }
@@ -83,377 +76,349 @@
     protected void buildXHTML(XHTMLContentHandler xhtml)
             throws SAXException, XmlException, IOException {
         XWPFHeaderFooterPolicy hfPolicy = document.getHeaderFooterPolicy();
-        XWPFListManager listManager = new XWPFListManager(document);
+
         // headers
-        if (hfPolicy != null) {
-            extractHeaders(xhtml, hfPolicy, listManager);
+        if (hfPolicy!=null) {
+            extractHeaders(xhtml, hfPolicy);
         }
         // process text in the order that it occurs in
-        extractIBodyText(document, listManager, xhtml);
+        extractIBodyText(document, xhtml);
         // then all document tables
-        if (hfPolicy != null) {
-            extractFooters(xhtml, hfPolicy, listManager);
-        }
-    }
-
-    private void extractIBodyText(IBody bodyElement, XWPFListManager listManager,
-                                  XHTMLContentHandler xhtml)
-            throws SAXException, XmlException, IOException {
-        for (IBodyElement element : bodyElement.getBodyElements()) {
-            if (element instanceof XWPFParagraph) {
-                XWPFParagraph paragraph = (XWPFParagraph) element;
-                extractParagraph(paragraph, listManager, xhtml);
-            }
-            if (element instanceof XWPFTable) {
-                XWPFTable table = (XWPFTable) element;
-                extractTable(table, listManager, xhtml);
-            }
-            if (element instanceof XWPFSDT) {
-                extractSDT((XWPFSDT) element, xhtml);
-            }
-
-        }
-    }
-
-    private void extractSDT(XWPFSDT element, XHTMLContentHandler xhtml) throws SAXException,
-            XmlException, IOException {
-        ISDTContent content = element.getContent();
-        String tag = "p";
-        xhtml.startElement(tag);
-        xhtml.characters(content.getText());
-        xhtml.endElement(tag);
-    }
-
-    private void extractParagraph(XWPFParagraph paragraph, XWPFListManager listManager,
-                                  XHTMLContentHandler xhtml)
-            throws SAXException, XmlException, IOException {
-        // If this paragraph is actually a whole new section, then
-        // it could have its own headers and footers
-        // Check and handle if so
-        XWPFHeaderFooterPolicy headerFooterPolicy = null;
-        if (paragraph.getCTP().getPPr() != null) {
-            CTSectPr ctSectPr = paragraph.getCTP().getPPr().getSectPr();
-            if (ctSectPr != null) {
-                headerFooterPolicy =
-                        new XWPFHeaderFooterPolicy(document, ctSectPr);
-                extractHeaders(xhtml, headerFooterPolicy, listManager);
-            }
-        }
-
-        // Is this a paragraph, or a heading?
-        String tag = "p";
-        String styleClass = null;
-        if (paragraph.getStyleID() != null) {
-            XWPFStyle style = styles.getStyle(
-                    paragraph.getStyleID()
-            );
-
-            if (style != null && style.getName() != null) {
-                TagAndStyle tas = WordExtractor.buildParagraphTagAndStyle(
-                        style.getName(), paragraph.getPartType() == BodyType.TABLECELL
-                );
-                tag = tas.getTag();
-                styleClass = tas.getStyleClass();
-            }
-        }
-
-        if (styleClass == null) {
-            xhtml.startElement(tag);
-        } else {
-            xhtml.startElement(tag, "class", styleClass);
-        }
-
-        writeParagraphNumber(paragraph, listManager, xhtml);
-        // Output placeholder for any embedded docs:
-
-        // TODO: replace w/ XPath/XQuery:
-        for (XWPFRun run : paragraph.getRuns()) {
-            XmlCursor c = run.getCTR().newCursor();
-            c.selectPath("./*");
-            while (c.toNextSelection()) {
-                XmlObject o = c.getObject();
-                if (o instanceof CTObject) {
-                    XmlCursor c2 = o.newCursor();
-                    c2.selectPath("./*");
-                    while (c2.toNextSelection()) {
-                        XmlObject o2 = c2.getObject();
-
-                        XmlObject embedAtt = o2.selectAttribute(new QName("Type"));
-                        if (embedAtt != null && embedAtt.getDomNode().getNodeValue().equals("Embed")) {
-                            // Type is "Embed"
-                            XmlObject relIDAtt = o2.selectAttribute(new QName("http://schemas.openxmlformats.org/officeDocument/2006/relationships", "id"));
-                            if (relIDAtt != null) {
-                                String relID = relIDAtt.getDomNode().getNodeValue();
-                                AttributesImpl attributes = new AttributesImpl();
-                                attributes.addAttribute("", "class", "class", "CDATA", "embedded");
-                                attributes.addAttribute("", "id", "id", "CDATA", relID);
-                                xhtml.startElement("div", attributes);
-                                xhtml.endElement("div");
-                            }
-                        }
-                    }
-                    c2.dispose();
+        if (hfPolicy!=null) {
+            extractFooters(xhtml, hfPolicy);
+        }
+    }
+
+    private void extractIBodyText(IBody bodyElement, XHTMLContentHandler xhtml)
+            throws SAXException, XmlException, IOException {
+        for(IBodyElement element : bodyElement.getBodyElements()) {
+            if(element instanceof XWPFParagraph) {
+                XWPFParagraph paragraph = (XWPFParagraph)element;
+                extractParagraph(paragraph, xhtml);
+            }
+            if(element instanceof XWPFTable) {
+                XWPFTable table = (XWPFTable)element;
+                extractTable(table, xhtml);
+            }
+            if (element instanceof XWPFSDT){
+                extractSDT((XWPFSDT) element, xhtml);
+            }
+
+        }
+    }
+
+    private void extractSDT(XWPFSDT element, XHTMLContentHandler xhtml) throws SAXException,
+            XmlException, IOException {
+        XWPFSDTContent content = element.getContent();
+        String tag = "p";
+        xhtml.startElement(tag);
+        xhtml.characters(content.getText());
+        xhtml.endElement(tag);
+    }
+
+    private void extractParagraph(XWPFParagraph paragraph, XHTMLContentHandler xhtml)
+            throws SAXException, XmlException, IOException {
+        // If this paragraph is actually a whole new section, then
+        // it could have its own headers and footers
+        // Check and handle if so
+        XWPFHeaderFooterPolicy headerFooterPolicy = null;
+        if (paragraph.getCTP().getPPr() != null) {
+            CTSectPr ctSectPr = paragraph.getCTP().getPPr().getSectPr();
+            if(ctSectPr != null) {
+                headerFooterPolicy =
+                    new XWPFHeaderFooterPolicy(document, ctSectPr);
+                extractHeaders(xhtml, headerFooterPolicy);
+            }
+        }
+
+        // Is this a paragraph, or a heading?
+        String tag = "p";
+        String styleClass = null;
+        if(paragraph.getStyleID() != null) {
+            XWPFStyle style = styles.getStyle(
+                    paragraph.getStyleID()
+            );
+
+            if (style != null && style.getName() != null) {
+                TagAndStyle tas = WordExtractor.buildParagraphTagAndStyle(
+                        style.getName(), paragraph.getPartType() == BodyType.TABLECELL
+                );
+                tag = tas.getTag();
+                styleClass = tas.getStyleClass();
+            }
+        }
+
+        if(styleClass == null) {
+            xhtml.startElement(tag);
+        } else {
+            xhtml.startElement(tag, "class", styleClass);
+        }
+
+        // Output placeholder for any embedded docs:
+
+        // TODO: replace w/ XPath/XQuery:
+        for(XWPFRun run : paragraph.getRuns()) {
+            XmlCursor c = run.getCTR().newCursor();
+            c.selectPath("./*");
+            while (c.toNextSelection()) {
+                XmlObject o = c.getObject();
+                if (o instanceof CTObject) {
+                    XmlCursor c2 = o.newCursor();
+                    c2.selectPath("./*");
+                    while (c2.toNextSelection()) {
+                        XmlObject o2 = c2.getObject();
+
+                        XmlObject embedAtt = o2.selectAttribute(new QName("Type"));
+                        if (embedAtt != null && embedAtt.getDomNode().getNodeValue().equals("Embed")) {
+                            // Type is "Embed"
+                            XmlObject relIDAtt = o2.selectAttribute(new QName("http://schemas.openxmlformats.org/officeDocument/2006/relationships", "id"));
+                            if (relIDAtt != null) {
+                                String relID = relIDAtt.getDomNode().getNodeValue();
+                                AttributesImpl attributes = new AttributesImpl();
+                                attributes.addAttribute("", "class", "class", "CDATA", "embedded");
+                                attributes.addAttribute("", "id", "id", "CDATA", relID);
+                                xhtml.startElement("div", attributes);
+                                xhtml.endElement("div");
+                            }
+                        }
                    }
-                }
-            }
-
-            c.dispose();
-        }
-
-        // Attach bookmarks for the paragraph
-        // (In future, we might put them in the right place, for now
-        // we just put them in the correct paragraph)
-        for (int i = 0; i < paragraph.getCTP().sizeOfBookmarkStartArray(); i++) {
-            CTBookmark bookmark = paragraph.getCTP().getBookmarkStartArray(i);
-            xhtml.startElement("a", "name", bookmark.getName());
-            xhtml.endElement("a");
-        }
-
-        TmpFormatting
fmtg = new TmpFormatting(false, false); - - // Do the iruns - for (IRunElement run : paragraph.getIRuns()) { - if (run instanceof XWPFSDT) { - fmtg = closeStyleTags(xhtml, fmtg); - processSDTRun((XWPFSDT) run, xhtml); - //for now, we're ignoring formatting in sdt - //if you hit an sdt reset to false - fmtg.setBold(false); - fmtg.setItalic(false); - } else { - fmtg = processRun((XWPFRun) run, paragraph, xhtml, fmtg); - } - } - closeStyleTags(xhtml, fmtg); - - - // Now do any comments for the paragraph - XWPFCommentsDecorator comments = new XWPFCommentsDecorator(paragraph, null); - String commentText = comments.getCommentText(); - if (commentText != null && commentText.length() > 0) { - xhtml.characters(commentText); - } - - String footnameText = paragraph.getFootnoteText(); - if (footnameText != null && footnameText.length() > 0) { - xhtml.characters(footnameText + "\n"); - } - - // Also extract any paragraphs embedded in text boxes: - for (XmlObject embeddedParagraph : paragraph.getCTP().selectPath("declare namespace w='http://schemas.openxmlformats.org/wordprocessingml/2006/main' declare namespace wps='http://schemas.microsoft.com/office/word/2010/wordprocessingShape' .//*/wps:txbx/w:txbxContent/w:p")) { - extractParagraph(new XWPFParagraph(CTP.Factory.parse(embeddedParagraph.xmlText()), paragraph.getBody()), listManager, xhtml); - } - - // Finish this paragraph - xhtml.endElement(tag); - - if (headerFooterPolicy != null) { - extractFooters(xhtml, headerFooterPolicy, listManager); - } - } - - private void writeParagraphNumber(XWPFParagraph paragraph, - XWPFListManager listManager, - XHTMLContentHandler xhtml) throws SAXException { - if (paragraph.getNumIlvl() == null) { - return; - } - String number = listManager.getFormattedNumber(paragraph); - if (number != null) { - xhtml.characters(number); - } - + c2.dispose(); + } + } + + c.dispose(); + } + + // Attach bookmarks for the paragraph + // (In future, we might put them in the right place, for now + // we just put 
them in the correct paragraph) + for (CTBookmark bookmark : paragraph.getCTP().getBookmarkStartList()) { + xhtml.startElement("a", "name", bookmark.getName()); + xhtml.endElement("a"); + } + + TmpFormatting fmtg = new TmpFormatting(false, false); + + // Do the iruns + for(IRunElement run : paragraph.getIRuns()) { + if (run instanceof XWPFSDT){ + fmtg = closeStyleTags(xhtml, fmtg); + processSDTRun((XWPFSDT)run, xhtml); + //for now, we're ignoring formatting in sdt + //if you hit an sdt reset to false + fmtg.setBold(false); + fmtg.setItalic(false); + } else { + fmtg = processRun((XWPFRun)run, paragraph, xhtml, fmtg); + } + } + closeStyleTags(xhtml, fmtg); + + + // Now do any comments for the paragraph + XWPFCommentsDecorator comments = new XWPFCommentsDecorator(paragraph, null); + String commentText = comments.getCommentText(); + if(commentText != null && commentText.length() > 0) { + xhtml.characters(commentText); + } + + String footnameText = paragraph.getFootnoteText(); + if(footnameText != null && footnameText.length() > 0) { + xhtml.characters(footnameText + "\n"); + } + + // Also extract any paragraphs embedded in text boxes: + for (XmlObject embeddedParagraph : paragraph.getCTP().selectPath("declare namespace w='http://schemas.openxmlformats.org/wordprocessingml/2006/main' declare namespace wps='http://schemas.microsoft.com/office/word/2010/wordprocessingShape' .//*/wps:txbx/w:txbxContent/w:p")) { + extractParagraph(new XWPFParagraph(CTP.Factory.parse(embeddedParagraph.xmlText()), paragraph.getBody()), xhtml); + } + + // Finish this paragraph + xhtml.endElement(tag); + + if (headerFooterPolicy != null) { + extractFooters(xhtml, headerFooterPolicy); + } } private TmpFormatting closeStyleTags(XHTMLContentHandler xhtml, - TmpFormatting fmtg) throws SAXException { - // Close any still open style tags - if (fmtg.isItalic()) { - xhtml.endElement("i"); - fmtg.setItalic(false); - } - if (fmtg.isBold()) { - xhtml.endElement("b"); - fmtg.setBold(false); - } - return 
fmtg; - } - - private TmpFormatting processRun(XWPFRun run, XWPFParagraph paragraph, - XHTMLContentHandler xhtml, TmpFormatting tfmtg) - throws SAXException, XmlException, IOException { - // True if we are currently in the named style tag: - if (run.isBold() != tfmtg.isBold()) { - if (tfmtg.isItalic()) { - xhtml.endElement("i"); - tfmtg.setItalic(false); - } - if (run.isBold()) { - xhtml.startElement("b"); - } else { - xhtml.endElement("b"); - } - tfmtg.setBold(run.isBold()); - } - - if (run.isItalic() != tfmtg.isItalic()) { - if (run.isItalic()) { - xhtml.startElement("i"); - } else { - xhtml.endElement("i"); - } - tfmtg.setItalic(run.isItalic()); - } - - boolean addedHREF = false; - if (run instanceof XWPFHyperlinkRun) { - XWPFHyperlinkRun linkRun = (XWPFHyperlinkRun) run; - XWPFHyperlink link = linkRun.getHyperlink(document); - if (link != null && link.getURL() != null) { - xhtml.startElement("a", "href", link.getURL()); - addedHREF = true; - } else if (linkRun.getAnchor() != null && linkRun.getAnchor().length() > 0) { - xhtml.startElement("a", "href", "#" + linkRun.getAnchor()); - addedHREF = true; - } - } - - xhtml.characters(run.toString()); - - // If we have any pictures, output them - for (XWPFPicture picture : run.getEmbeddedPictures()) { - if (paragraph.getDocument() != null) { - XWPFPictureData data = picture.getPictureData(); - if (data != null) { - AttributesImpl attr = new AttributesImpl(); - - attr.addAttribute("", "src", "src", "CDATA", "embedded:" + data.getFileName()); - attr.addAttribute("", "alt", "alt", "CDATA", picture.getDescription()); - - xhtml.startElement("img", attr); - xhtml.endElement("img"); - } - } - } - - if (addedHREF) { - xhtml.endElement("a"); - } - - return tfmtg; + TmpFormatting fmtg) throws SAXException { + // Close any still open style tags + if (fmtg.isItalic()) { + xhtml.endElement("i"); + fmtg.setItalic(false); + } + if (fmtg.isBold()) { + xhtml.endElement("b"); + fmtg.setBold(false); + } + return fmtg; + } + + private 
TmpFormatting processRun(XWPFRun run, XWPFParagraph paragraph, + XHTMLContentHandler xhtml, TmpFormatting tfmtg) + throws SAXException, XmlException, IOException{ + // True if we are currently in the named style tag: + if (run.isBold() != tfmtg.isBold()) { + if (tfmtg.isItalic()) { + xhtml.endElement("i"); + tfmtg.setItalic(false); + } + if (run.isBold()) { + xhtml.startElement("b"); + } else { + xhtml.endElement("b"); + } + tfmtg.setBold(run.isBold()); + } + + if (run.isItalic() != tfmtg.isItalic()) { + if (run.isItalic()) { + xhtml.startElement("i"); + } else { + xhtml.endElement("i"); + } + tfmtg.setItalic(run.isItalic()); + } + + boolean addedHREF = false; + if(run instanceof XWPFHyperlinkRun) { + XWPFHyperlinkRun linkRun = (XWPFHyperlinkRun)run; + XWPFHyperlink link = linkRun.getHyperlink(document); + if(link != null && link.getURL() != null) { + xhtml.startElement("a", "href", link.getURL()); + addedHREF = true; + } else if(linkRun.getAnchor() != null && linkRun.getAnchor().length() > 0) { + xhtml.startElement("a", "href", "#" + linkRun.getAnchor()); + addedHREF = true; + } + } + + xhtml.characters(run.toString()); + + // If we have any pictures, output them + for(XWPFPicture picture : run.getEmbeddedPictures()) { + if(paragraph.getDocument() != null) { + XWPFPictureData data = picture.getPictureData(); + if(data != null) { + AttributesImpl attr = new AttributesImpl(); + + attr.addAttribute("", "src", "src", "CDATA", "embedded:" + data.getFileName()); + attr.addAttribute("", "alt", "alt", "CDATA", picture.getDescription()); + + xhtml.startElement("img", attr); + xhtml.endElement("img"); + } + } + } + + if (addedHREF) { + xhtml.endElement("a"); + } + + return tfmtg; } private void processSDTRun(XWPFSDT run, XHTMLContentHandler xhtml) - throws SAXException, XmlException, IOException { - xhtml.characters(run.getContent().getText()); - } - - private void extractTable(XWPFTable table, XWPFListManager listManager, - XHTMLContentHandler xhtml) - throws SAXException, 
XmlException, IOException { - xhtml.startElement("table"); - xhtml.startElement("tbody"); - for (XWPFTableRow row : table.getRows()) { - xhtml.startElement("tr"); - for (ICell cell : row.getTableICells()) { - xhtml.startElement("td"); - if (cell instanceof XWPFTableCell) { - extractIBodyText((XWPFTableCell) cell, listManager, xhtml); - } else if (cell instanceof XWPFSDTCell) { - xhtml.characters(((XWPFSDTCell) cell).getContent().getText()); - } - xhtml.endElement("td"); - } - xhtml.endElement("tr"); - } - xhtml.endElement("tbody"); - xhtml.endElement("table"); - } - + throws SAXException, XmlException, IOException{ + xhtml.characters(run.getContent().getText()); + } + + private void extractTable(XWPFTable table, XHTMLContentHandler xhtml) + throws SAXException, XmlException, IOException { + xhtml.startElement("table"); + xhtml.startElement("tbody"); + for(XWPFTableRow row : table.getRows()) { + xhtml.startElement("tr"); + for(XWPFTableCell cell : row.getTableCells()) { + xhtml.startElement("td"); + extractIBodyText(cell, xhtml); + xhtml.endElement("td"); + } + xhtml.endElement("tr"); + } + xhtml.endElement("tbody"); + xhtml.endElement("table"); + } + private void extractFooters( - XHTMLContentHandler xhtml, XWPFHeaderFooterPolicy hfPolicy, - XWPFListManager listManager) + XHTMLContentHandler xhtml, XWPFHeaderFooterPolicy hfPolicy) throws SAXException, XmlException, IOException { // footers if (hfPolicy.getFirstPageFooter() != null) { - extractHeaderText(xhtml, hfPolicy.getFirstPageFooter(), listManager); + extractHeaderText(xhtml, hfPolicy.getFirstPageFooter()); } if (hfPolicy.getEvenPageFooter() != null) { - extractHeaderText(xhtml, hfPolicy.getEvenPageFooter(), listManager); + extractHeaderText(xhtml, hfPolicy.getEvenPageFooter()); } if (hfPolicy.getDefaultFooter() != null) { - extractHeaderText(xhtml, hfPolicy.getDefaultFooter(), listManager); + extractHeaderText(xhtml, hfPolicy.getDefaultFooter()); } } private void extractHeaders( - XHTMLContentHandler xhtml, 
XWPFHeaderFooterPolicy hfPolicy, XWPFListManager listManager) + XHTMLContentHandler xhtml, XWPFHeaderFooterPolicy hfPolicy) throws SAXException, XmlException, IOException { if (hfPolicy == null) return; - + if (hfPolicy.getFirstPageHeader() != null) { - extractHeaderText(xhtml, hfPolicy.getFirstPageHeader(), listManager); + extractHeaderText(xhtml, hfPolicy.getFirstPageHeader()); } if (hfPolicy.getEvenPageHeader() != null) { - extractHeaderText(xhtml, hfPolicy.getEvenPageHeader(), listManager); + extractHeaderText(xhtml, hfPolicy.getEvenPageHeader()); } if (hfPolicy.getDefaultHeader() != null) { - extractHeaderText(xhtml, hfPolicy.getDefaultHeader(), listManager); - } - } - - private void extractHeaderText(XHTMLContentHandler xhtml, XWPFHeaderFooter header, XWPFListManager listManager) throws SAXException, XmlException, IOException { - - for (IBodyElement e : header.getBodyElements()) { - if (e instanceof XWPFParagraph) { - extractParagraph((XWPFParagraph) e, listManager, xhtml); - } else if (e instanceof XWPFTable) { - extractTable((XWPFTable) e, listManager, xhtml); - } else if (e instanceof XWPFSDT) { - extractSDT((XWPFSDT) e, xhtml); - } + extractHeaderText(xhtml, hfPolicy.getDefaultHeader()); + } + } + + private void extractHeaderText(XHTMLContentHandler xhtml, XWPFHeaderFooter header) throws SAXException, XmlException, IOException { + + for (IBodyElement e : header.getBodyElements()){ + if (e instanceof XWPFParagraph){ + extractParagraph((XWPFParagraph)e, xhtml); + } else if (e instanceof XWPFTable){ + extractTable((XWPFTable)e, xhtml); + } else if (e instanceof XWPFSDT){ + extractSDT((XWPFSDT)e, xhtml); + } } } /** * Word documents are simple, they only have the one - * main part + * main part */ @Override protected List getMainDocumentParts() { - List parts = new ArrayList(); - parts.add(document.getPackagePart()); - return parts; - } - - private class TmpFormatting { - private boolean bold = false; - private boolean italic = false; - - private 
TmpFormatting(boolean bold, boolean italic) { - this.bold = bold; - this.italic = italic; - } - - public boolean isBold() { - return bold; - } - - public void setBold(boolean bold) { - this.bold = bold; - } - - public boolean isItalic() { - return italic; - } - - public void setItalic(boolean italic) { - this.italic = italic; - } - + List parts = new ArrayList(); + parts.add( document.getPackagePart() ); + return parts; + } + + private class TmpFormatting{ + private boolean bold = false; + private boolean italic = false; + private TmpFormatting(boolean bold, boolean italic){ + this.bold = bold; + this.italic = italic; + } + public boolean isBold() { + return bold; + } + public void setBold(boolean bold) { + this.bold = bold; + } + public boolean isItalic() { + return italic; + } + public void setItalic(boolean italic) { + this.italic = italic; + } + } } diff --git a/tika-parsers/src/main/java/org/apache/tika/parser/mp3/CompositeTagHandler.java b/tika-parsers/src/main/java/org/apache/tika/parser/mp3/CompositeTagHandler.java index b7d2d75..a4d7784 100644 --- a/tika-parsers/src/main/java/org/apache/tika/parser/mp3/CompositeTagHandler.java +++ b/tika-parsers/src/main/java/org/apache/tika/parser/mp3/CompositeTagHandler.java @@ -113,30 +113,4 @@ return null; } - public String getAlbumArtist() { - for (ID3Tags tag : tags) { - if (tag.getAlbumArtist() != null) { - return tag.getAlbumArtist(); - } - } - return null; - } - - public String getDisc() { - for (ID3Tags tag : tags) { - if (tag.getDisc() != null) { - return tag.getDisc(); - } - } - return null; - } - - public String getCompilation() { - for (ID3Tags tag : tags) { - if (tag.getCompilation() != null) { - return tag.getCompilation(); - } - } - return null; - } } diff --git a/tika-parsers/src/main/java/org/apache/tika/parser/mp3/ID3Tags.java b/tika-parsers/src/main/java/org/apache/tika/parser/mp3/ID3Tags.java index 98ef504..074235d 100644 --- a/tika-parsers/src/main/java/org/apache/tika/parser/mp3/ID3Tags.java +++ 
b/tika-parsers/src/main/java/org/apache/tika/parser/mp3/ID3Tags.java @@ -17,6 +17,7 @@ package org.apache.tika.parser.mp3; import java.util.List; + /** * Interface that defines the common interface for ID3 tag parsers, @@ -171,22 +172,12 @@ String getTitle(); - /** - * The Artist for the track - */ String getArtist(); - - /** - * The Artist for the overall album / compilation of albums - */ - String getAlbumArtist(); String getAlbum(); String getComposer(); - String getCompilation(); - /** * Retrieves the comments, if any. * Files may have more than one comment, but normally only @@ -198,15 +189,7 @@ String getYear(); - /** - * The number of the track within the album / recording - */ String getTrackNumber(); - - /** - * The number of the disc this belongs to, within the set - */ - String getDisc(); /** * Represents a comments in ID3 (especially ID3 v2), where are diff --git a/tika-parsers/src/main/java/org/apache/tika/parser/mp3/ID3v1Handler.java b/tika-parsers/src/main/java/org/apache/tika/parser/mp3/ID3v1Handler.java index 2111356..5cf603c 100644 --- a/tika-parsers/src/main/java/org/apache/tika/parser/mp3/ID3v1Handler.java +++ b/tika-parsers/src/main/java/org/apache/tika/parser/mp3/ID3v1Handler.java @@ -18,14 +18,13 @@ import java.io.IOException; import java.io.InputStream; +import java.io.UnsupportedEncodingException; import java.util.Arrays; import java.util.List; import org.apache.tika.exception.TikaException; import org.xml.sax.ContentHandler; import org.xml.sax.SAXException; - -import static java.nio.charset.StandardCharsets.ISO_8859_1; /** * This is used to parse ID3 Version 1 Tag information from an MP3 file, @@ -102,7 +101,7 @@ } public List getComments() { - return Arrays.asList(comment); + return Arrays.asList(new ID3Comment[] {comment}); } public String getGenre() { @@ -118,30 +117,6 @@ * so returns null; */ public String getComposer() { - return null; - } - - /** - * ID3v1 doesn't have album-wide artists, - * so returns null; - */ - public String 
getAlbumArtist() { - return null; - } - - /** - * ID3v1 doesn't have disc numbers, - * so returns null; - */ - public String getDisc() { - return null; - } - - /** - * ID3v1 doesn't have compilations, - * so returns null; - */ - public String getCompilation() { return null; } @@ -178,6 +153,10 @@ } // Return the remaining substring - return new String(buffer, start, end - start, ISO_8859_1); + try { + return new String(buffer, start, end - start, "ISO-8859-1"); + } catch (UnsupportedEncodingException e) { + throw new TikaException("ISO-8859-1 encoding is not available", e); + } } } diff --git a/tika-parsers/src/main/java/org/apache/tika/parser/mp3/ID3v22Handler.java b/tika-parsers/src/main/java/org/apache/tika/parser/mp3/ID3v22Handler.java index 8d94c0b..f6ca0ee 100644 --- a/tika-parsers/src/main/java/org/apache/tika/parser/mp3/ID3v22Handler.java +++ b/tika-parsers/src/main/java/org/apache/tika/parser/mp3/ID3v22Handler.java @@ -39,8 +39,6 @@ private String composer; private String genre; private String trackNumber; - private String albumArtist; - private String disc; private List comments = new ArrayList(); public ID3v22Handler(ID3v2Frame frame) @@ -52,8 +50,6 @@ title = getTagString(tag.data, 0, tag.data.length); } else if (tag.name.equals("TP1")) { artist = getTagString(tag.data, 0, tag.data.length); - } else if (tag.name.equals("TP2")) { - albumArtist = getTagString(tag.data, 0, tag.data.length); } else if (tag.name.equals("TAL")) { album = getTagString(tag.data, 0, tag.data.length); } else if (tag.name.equals("TYE")) { @@ -64,8 +60,6 @@ comments.add( getComment(tag.data, 0, tag.data.length) ); } else if (tag.name.equals("TRK")) { trackNumber = getTagString(tag.data, 0, tag.data.length); - } else if (tag.name.equals("TPA")) { - disc = getTagString(tag.data, 0, tag.data.length); } else if (tag.name.equals("TCO")) { genre = extractGenre( getTagString(tag.data, 0, tag.data.length) ); } @@ -135,25 +129,10 @@ return trackNumber; } - public String getAlbumArtist() { - 
return albumArtist; - } - - public String getDisc() { - return disc; - } - - /** - * ID3v22 doesn't have compilations, - * so returns null; - */ - public String getCompilation() { - return null; - } - private class RawV22TagIterator extends RawTagIterator { private RawV22TagIterator(ID3v2Frame frame) { frame.super(3, 3, 1, 0); } } + } diff --git a/tika-parsers/src/main/java/org/apache/tika/parser/mp3/ID3v23Handler.java b/tika-parsers/src/main/java/org/apache/tika/parser/mp3/ID3v23Handler.java index 4b67eda..daee47e 100644 --- a/tika-parsers/src/main/java/org/apache/tika/parser/mp3/ID3v23Handler.java +++ b/tika-parsers/src/main/java/org/apache/tika/parser/mp3/ID3v23Handler.java @@ -39,9 +39,6 @@ private String composer; private String genre; private String trackNumber; - private String albumArtist; - private String disc; - private String compilation; private List comments = new ArrayList(); public ID3v23Handler(ID3v2Frame frame) @@ -53,8 +50,6 @@ title = getTagString(tag.data, 0, tag.data.length); } else if (tag.name.equals("TPE1")) { artist = getTagString(tag.data, 0, tag.data.length); - } else if (tag.name.equals("TPE2")) { - albumArtist = getTagString(tag.data, 0, tag.data.length); } else if (tag.name.equals("TALB")) { album = getTagString(tag.data, 0, tag.data.length); } else if (tag.name.equals("TYER")) { @@ -65,10 +60,6 @@ comments.add( getComment(tag.data, 0, tag.data.length) ); } else if (tag.name.equals("TRCK")) { trackNumber = getTagString(tag.data, 0, tag.data.length); - } else if (tag.name.equals("TPOS")) { - disc = getTagString(tag.data, 0, tag.data.length); - } else if (tag.name.equals("TCMP")) { - compilation = getTagString(tag.data, 0, tag.data.length); } else if (tag.name.equals("TCON")) { genre = ID3v22Handler.extractGenre( getTagString(tag.data, 0, tag.data.length) ); } @@ -118,21 +109,10 @@ return trackNumber; } - public String getAlbumArtist() { - return albumArtist; - } - - public String getDisc() { - return disc; - } - - public String 
getCompilation() { - return compilation; - } - private class RawV23TagIterator extends RawTagIterator { private RawV23TagIterator(ID3v2Frame frame) { frame.super(4, 4, 1, 2); } } + } diff --git a/tika-parsers/src/main/java/org/apache/tika/parser/mp3/ID3v24Handler.java b/tika-parsers/src/main/java/org/apache/tika/parser/mp3/ID3v24Handler.java index caba928..33d4a01 100644 --- a/tika-parsers/src/main/java/org/apache/tika/parser/mp3/ID3v24Handler.java +++ b/tika-parsers/src/main/java/org/apache/tika/parser/mp3/ID3v24Handler.java @@ -40,9 +40,6 @@ private String composer; private String genre; private String trackNumber; - private String albumArtist; - private String disc; - private String compilation; private List comments = new ArrayList(); public ID3v24Handler(ID3v2Frame frame) @@ -54,8 +51,6 @@ title = getTagString(tag.data, 0, tag.data.length); } else if (tag.name.equals("TPE1")) { artist = getTagString(tag.data, 0, tag.data.length); - } else if (tag.name.equals("TPE2")) { - albumArtist = getTagString(tag.data, 0, tag.data.length); } else if (tag.name.equals("TALB")) { album = getTagString(tag.data, 0, tag.data.length); } else if (tag.name.equals("TYER")) { @@ -70,10 +65,6 @@ comments.add( getComment(tag.data, 0, tag.data.length) ); } else if (tag.name.equals("TRCK")) { trackNumber = getTagString(tag.data, 0, tag.data.length); - } else if (tag.name.equals("TPOS")) { - disc = getTagString(tag.data, 0, tag.data.length); - } else if (tag.name.equals("TCMP")) { - compilation = getTagString(tag.data, 0, tag.data.length); } else if (tag.name.equals("TCON")) { genre = ID3v22Handler.extractGenre( getTagString(tag.data, 0, tag.data.length) ); } @@ -123,21 +114,10 @@ return trackNumber; } - public String getAlbumArtist() { - return albumArtist; - } - - public String getDisc() { - return disc; - } - - public String getCompilation() { - return compilation; - } - private class RawV24TagIterator extends RawTagIterator { private RawV24TagIterator(ID3v2Frame frame) { 
frame.super(4, 4, 1, 2); } } + } diff --git a/tika-parsers/src/main/java/org/apache/tika/parser/mp3/ID3v2Frame.java b/tika-parsers/src/main/java/org/apache/tika/parser/mp3/ID3v2Frame.java index 41298dd..9b7cbaa 100644 --- a/tika-parsers/src/main/java/org/apache/tika/parser/mp3/ID3v2Frame.java +++ b/tika-parsers/src/main/java/org/apache/tika/parser/mp3/ID3v2Frame.java @@ -23,8 +23,7 @@ import java.util.Iterator; import org.apache.tika.parser.mp3.ID3Tags.ID3Comment; - -import static java.nio.charset.StandardCharsets.ISO_8859_1; +import java.io.UnsupportedEncodingException; /** * A frame of ID3v2 data, which is then passed to a handler to @@ -333,7 +332,12 @@ * offset and length. Strings are ISO-8859-1 */ protected static String getString(byte[] data, int offset, int length) { - return new String(data, offset, length, ISO_8859_1); + try { + return new String(data, offset, length, "ISO-8859-1"); + } catch (UnsupportedEncodingException e) { + throw new RuntimeException( + "Core encoding ISO-8859-1 is not available", e); + } } @@ -410,7 +414,7 @@ // Now data int copyFrom = offset+nameLength+sizeLength+flagLength; - size = Math.max(0, Math.min(size, frameData.length-copyFrom)); // TIKA-1218, prevent negative size for malformed files.
+ size = Math.min(size, frameData.length-copyFrom); data = new byte[size]; System.arraycopy(frameData, copyFrom, data, 0, size); } diff --git a/tika-parsers/src/main/java/org/apache/tika/parser/mp3/LyricsHandler.java b/tika-parsers/src/main/java/org/apache/tika/parser/mp3/LyricsHandler.java index 12d0f2d..606d344 100644 --- a/tika-parsers/src/main/java/org/apache/tika/parser/mp3/LyricsHandler.java +++ b/tika-parsers/src/main/java/org/apache/tika/parser/mp3/LyricsHandler.java @@ -22,9 +22,6 @@ import org.apache.tika.exception.TikaException; import org.xml.sax.ContentHandler; import org.xml.sax.SAXException; - -import static java.nio.charset.StandardCharsets.US_ASCII; -import static java.nio.charset.StandardCharsets.UTF_8; /** * This is used to parse Lyrics3 tag information @@ -85,12 +82,12 @@ // size including the LYRICSBEGIN but excluding the // length+LYRICS200 at the end. int length = Integer.parseInt( - new String(tagData, lookat-6, 6, UTF_8) + new String(tagData, lookat-6, 6) ); String lyrics = new String( tagData, lookat-length+5, length-11, - US_ASCII + "ASCII" ); // Tags are a 3 letter code, 5 digit length, then data diff --git a/tika-parsers/src/main/java/org/apache/tika/parser/mp3/Mp3Parser.java b/tika-parsers/src/main/java/org/apache/tika/parser/mp3/Mp3Parser.java index 3b79f31..c8891bb 100644 --- a/tika-parsers/src/main/java/org/apache/tika/parser/mp3/Mp3Parser.java +++ b/tika-parsers/src/main/java/org/apache/tika/parser/mp3/Mp3Parser.java @@ -70,19 +70,17 @@ // Create handlers for the various kinds of ID3 tags ID3TagsAndAudio audioAndTags = getAllTagHandlers(stream, handler); - // Process tags metadata if the file has supported tags if (audioAndTags.tags.length > 0) { CompositeTagHandler tag = new CompositeTagHandler(audioAndTags.tags); metadata.set(TikaCoreProperties.TITLE, tag.getTitle()); metadata.set(TikaCoreProperties.CREATOR, tag.getArtist()); metadata.set(XMPDM.ARTIST, tag.getArtist()); - metadata.set(XMPDM.ALBUM_ARTIST, tag.getAlbumArtist()); 
metadata.set(XMPDM.COMPOSER, tag.getComposer()); metadata.set(XMPDM.ALBUM, tag.getAlbum()); - metadata.set(XMPDM.COMPILATION, tag.getCompilation()); metadata.set(XMPDM.RELEASE_DATE, tag.getYear()); metadata.set(XMPDM.GENRE, tag.getGenre()); + metadata.set(XMPDM.DURATION, audioAndTags.duration); List comments = new ArrayList(); for (ID3Comment comment : tag.getComments()) { @@ -109,27 +107,18 @@ xhtml.element("p", tag.getArtist()); // ID3v1.1 Track addition - StringBuilder sb = new StringBuilder(); - sb.append(tag.getAlbum()); if (tag.getTrackNumber() != null) { - sb.append(", track ").append(tag.getTrackNumber()); + xhtml.element("p", tag.getAlbum() + ", track " + tag.getTrackNumber()); metadata.set(XMPDM.TRACK_NUMBER, tag.getTrackNumber()); - } - if (tag.getDisc() != null) { - sb.append(", disc ").append(tag.getDisc()); - metadata.set(XMPDM.DISC_NUMBER, tag.getDisc()); - } - xhtml.element("p", sb.toString()); - + } else { + xhtml.element("p", tag.getAlbum()); + } xhtml.element("p", tag.getYear()); xhtml.element("p", tag.getGenre()); xhtml.element("p", String.valueOf(audioAndTags.duration)); for (String comment : comments) { xhtml.element("p", comment); } - } - if (audioAndTags.duration > 0) { - metadata.set(XMPDM.DURATION, audioAndTags.duration); } if (audioAndTags.audio != null) { metadata.set("samplerate", String.valueOf(audioAndTags.audio.getSampleRate())); diff --git a/tika-parsers/src/main/java/org/apache/tika/parser/mp4/DirectFileReadDataSource.java b/tika-parsers/src/main/java/org/apache/tika/parser/mp4/DirectFileReadDataSource.java deleted file mode 100644 index bea50a0..0000000 --- a/tika-parsers/src/main/java/org/apache/tika/parser/mp4/DirectFileReadDataSource.java +++ /dev/null @@ -1,100 +0,0 @@ -/* - * Licensed to the Apache Software Foundation (ASF) under one or more - * contributor license agreements. See the NOTICE file distributed with - * this work for additional information regarding copyright ownership. 
- * The ASF licenses this file to You under the Apache License, Version 2.0 - * (the "License"); you may not use this file except in compliance with - * the License. You may obtain a copy of the License at - * - * http://www.apache.org/licenses/LICENSE-2.0 - * - * Unless required by applicable law or agreed to in writing, software - * distributed under the License is distributed on an "AS IS" BASIS, - * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. - * See the License for the specific language governing permissions and - * limitations under the License. - */ -package org.apache.tika.parser.mp4; - -import com.googlecode.mp4parser.DataSource; - -import java.io.File; -import java.io.IOException; -import java.io.RandomAccessFile; -import java.nio.ByteBuffer; -import java.nio.channels.WritableByteChannel; - -import static com.googlecode.mp4parser.util.CastUtils.l2i; - -/** - * A {@link DataSource} implementation that relies on direct reads from a {@link RandomAccessFile}. - * It should be slower than {@link com.googlecode.mp4parser.FileDataSourceImpl} but does not incur the implicit file locks of - * memory mapped I/O on some JVMs. This implementation allows for a more controlled deletion of files - * and might be preferred when working with temporary files. 
- * @see JDK-4724038 : (fs) Add unmap method to MappedByteBuffer - * @see JDK-6359560 : (fs) File.deleteOnExit() doesn't work when MappedByteBuffer exists (win) - */ -public class DirectFileReadDataSource implements DataSource { - - private static final int TRANSFER_SIZE = 8192; - - private RandomAccessFile raf; - - public DirectFileReadDataSource(File f) throws IOException { - this.raf = new RandomAccessFile(f, "r"); - } - - public int read(ByteBuffer byteBuffer) throws IOException { - int len = byteBuffer.remaining(); - int totalRead = 0; - int bytesRead = 0; - byte[] buf = new byte[TRANSFER_SIZE]; - while (totalRead < len) { - int bytesToRead = Math.min((len - totalRead), TRANSFER_SIZE); - bytesRead = raf.read(buf, 0, bytesToRead); - if (bytesRead < 0) { - break; - } else { - totalRead += bytesRead; - } - byteBuffer.put(buf, 0, bytesRead); - } - return ((bytesRead < 0) && (totalRead == 0)) ? -1 : totalRead; - } - - public int readAllInOnce(ByteBuffer byteBuffer) throws IOException { - byte[] buf = new byte[byteBuffer.remaining()]; - int read = raf.read(buf); - byteBuffer.put(buf, 0, read); - return read; - } - - public long size() throws IOException { - return raf.length(); - } - - public long position() throws IOException { - return raf.getFilePointer(); - } - - public void position(long nuPos) throws IOException { - raf.seek(nuPos); - } - - public long transferTo(long position, long count, WritableByteChannel target) throws IOException { - return target.write(map(position, count)); - } - - public ByteBuffer map(long startPosition, long size) throws IOException { - raf.seek(startPosition); - byte[] payload = new byte[l2i(size)]; - raf.readFully(payload); - return ByteBuffer.wrap(payload); - } - - public void close() throws IOException { - raf.close(); - } - - -} diff --git a/tika-parsers/src/main/java/org/apache/tika/parser/mp4/MP4Parser.java b/tika-parsers/src/main/java/org/apache/tika/parser/mp4/MP4Parser.java index 20c8246..fb474d3 100644 --- 
a/tika-parsers/src/main/java/org/apache/tika/parser/mp4/MP4Parser.java +++ b/tika-parsers/src/main/java/org/apache/tika/parser/mp4/MP4Parser.java @@ -18,23 +18,19 @@ import java.io.IOException; import java.io.InputStream; -import java.text.DecimalFormat; -import java.text.NumberFormat; import java.util.Arrays; import java.util.Collections; +import java.util.Date; import java.util.HashMap; import java.util.List; -import java.util.Locale; import java.util.Map; import java.util.Set; import org.apache.tika.exception.TikaException; -import org.apache.tika.io.TemporaryResources; import org.apache.tika.io.TikaInputStream; import org.apache.tika.metadata.Metadata; import org.apache.tika.metadata.Property; import org.apache.tika.metadata.TikaCoreProperties; -import org.apache.tika.metadata.XMP; import org.apache.tika.metadata.XMPDM; import org.apache.tika.mime.MediaType; import org.apache.tika.parser.AbstractParser; @@ -45,7 +41,7 @@ import com.coremedia.iso.IsoFile; import com.coremedia.iso.boxes.Box; -import com.coremedia.iso.boxes.Container; +import com.coremedia.iso.boxes.ContainerBox; import com.coremedia.iso.boxes.FileTypeBox; import com.coremedia.iso.boxes.MetaBox; import com.coremedia.iso.boxes.MovieBox; @@ -55,21 +51,19 @@ import com.coremedia.iso.boxes.TrackBox; import com.coremedia.iso.boxes.TrackHeaderBox; import com.coremedia.iso.boxes.UserDataBox; +import com.coremedia.iso.boxes.apple.AbstractAppleMetaDataBox; +import com.coremedia.iso.boxes.apple.AppleAlbumBox; +import com.coremedia.iso.boxes.apple.AppleArtistBox; +import com.coremedia.iso.boxes.apple.AppleCommentBox; +import com.coremedia.iso.boxes.apple.AppleCustomGenreBox; +import com.coremedia.iso.boxes.apple.AppleEncoderBox; import com.coremedia.iso.boxes.apple.AppleItemListBox; +import com.coremedia.iso.boxes.apple.AppleRecordingYearBox; +import com.coremedia.iso.boxes.apple.AppleStandardGenreBox; +import com.coremedia.iso.boxes.apple.AppleTrackAuthorBox; +import 
com.coremedia.iso.boxes.apple.AppleTrackNumberBox; +import com.coremedia.iso.boxes.apple.AppleTrackTitleBox; import com.coremedia.iso.boxes.sampleentry.AudioSampleEntry; -import com.googlecode.mp4parser.boxes.apple.AppleAlbumBox; -import com.googlecode.mp4parser.boxes.apple.AppleArtistBox; -import com.googlecode.mp4parser.boxes.apple.AppleArtist2Box; -import com.googlecode.mp4parser.boxes.apple.AppleCommentBox; -import com.googlecode.mp4parser.boxes.apple.AppleCompilationBox; -import com.googlecode.mp4parser.boxes.apple.AppleDiskNumberBox; -import com.googlecode.mp4parser.boxes.apple.AppleEncoderBox; -import com.googlecode.mp4parser.boxes.apple.AppleGenreBox; -import com.googlecode.mp4parser.boxes.apple.AppleNameBox; -import com.googlecode.mp4parser.boxes.apple.AppleRecordingYear2Box; -import com.googlecode.mp4parser.boxes.apple.AppleTrackAuthorBox; -import com.googlecode.mp4parser.boxes.apple.AppleTrackNumberBox; -import com.googlecode.mp4parser.boxes.apple.Utf8AppleDataBox; /** * Parser for the MP4 media container format, as well as the older @@ -81,12 +75,6 @@ public class MP4Parser extends AbstractParser { /** Serial version UID */ private static final long serialVersionUID = 84011216792285L; - /** TODO Replace this with a 2dp Duration Property Converter */ - private static final DecimalFormat DURATION_FORMAT = - (DecimalFormat)NumberFormat.getNumberInstance(Locale.ROOT); - static { - DURATION_FORMAT.applyPattern("0.0#"); - } // Ensure this stays in Sync with the entries in tika-mimetypes.xml private static final Map> typesMap = new HashMap>(); @@ -124,196 +112,197 @@ // The MP4Parser library accepts either a File, or a byte array // As MP4 video files are typically large, always use a file to // avoid OOMs that may occur with in-memory buffering - TemporaryResources tmp = new TemporaryResources(); - TikaInputStream tstream = TikaInputStream.get(stream, tmp); + TikaInputStream tstream = TikaInputStream.get(stream); try { - isoFile = new IsoFile(new 
DirectFileReadDataSource(tstream.getFile())); - tmp.addResource(isoFile); - - // Grab the file type box - FileTypeBox fileType = getOrNull(isoFile, FileTypeBox.class); - if (fileType != null) { - // Identify the type - MediaType type = MediaType.application("mp4"); - for (MediaType t : typesMap.keySet()) { - if (typesMap.get(t).contains(fileType.getMajorBrand())) { - type = t; - break; - } - } - metadata.set(Metadata.CONTENT_TYPE, type.toString()); - - if (type.getType().equals("audio")) { - metadata.set(XMPDM.AUDIO_COMPRESSOR, fileType.getMajorBrand().trim()); - } - } else { - // Some older QuickTime files lack the FileType - metadata.set(Metadata.CONTENT_TYPE, "video/quicktime"); - } - - - // Get the main MOOV box - MovieBox moov = getOrNull(isoFile, MovieBox.class); - if (moov == null) { - // Bail out - return; - } - - - XHTMLContentHandler xhtml = new XHTMLContentHandler(handler, metadata); - xhtml.startDocument(); - - - // Pull out some information from the header box - MovieHeaderBox mHeader = getOrNull(moov, MovieHeaderBox.class); - if (mHeader != null) { - // Get the creation and modification dates - metadata.set(Metadata.CREATION_DATE, mHeader.getCreationTime()); - metadata.set(TikaCoreProperties.MODIFIED, mHeader.getModificationTime()); - - // Get the duration - double durationSeconds = ((double)mHeader.getDuration()) / mHeader.getTimescale(); - metadata.set(XMPDM.DURATION, DURATION_FORMAT.format(durationSeconds)); - - // The timescale is normally the sampling rate - metadata.set(XMPDM.AUDIO_SAMPLE_RATE, (int)mHeader.getTimescale()); - } - - - // Get some more information from the track header - // TODO Decide how to handle multiple tracks - List tb = moov.getBoxes(TrackBox.class); - if (tb.size() > 0) { - TrackBox track = tb.get(0); - - TrackHeaderBox header = track.getTrackHeaderBox(); - // Get the creation and modification dates - metadata.set(TikaCoreProperties.CREATED, header.getCreationTime()); - metadata.set(TikaCoreProperties.MODIFIED, 
header.getModificationTime()); - - // Get the video with and height - metadata.set(Metadata.IMAGE_WIDTH, (int)header.getWidth()); - metadata.set(Metadata.IMAGE_LENGTH, (int)header.getHeight()); - - // Get the sample information - SampleTableBox samples = track.getSampleTableBox(); - SampleDescriptionBox sampleDesc = samples.getSampleDescriptionBox(); - if (sampleDesc != null) { - // Look for the first Audio Sample, if present - AudioSampleEntry sample = getOrNull(sampleDesc, AudioSampleEntry.class); - if (sample != null) { - XMPDM.ChannelTypePropertyConverter.convertAndSet(metadata, sample.getChannelCount()); - //metadata.set(XMPDM.AUDIO_SAMPLE_TYPE, sample.getSampleSize()); // TODO Num -> Type mapping - metadata.set(XMPDM.AUDIO_SAMPLE_RATE, (int)sample.getSampleRate()); - //metadata.set(XMPDM.AUDIO_, sample.getSamplesPerPacket()); - //metadata.set(XMPDM.AUDIO_, sample.getBytesPerSample()); - } - } - } - - // Get metadata from the User Data Box - UserDataBox userData = getOrNull(moov, UserDataBox.class); - if (userData != null) { - MetaBox meta = getOrNull(userData, MetaBox.class); - - // Check for iTunes Metadata - // See http://atomicparsley.sourceforge.net/mpeg-4files.html and - // http://code.google.com/p/mp4v2/wiki/iTunesMetadata for more on these - AppleItemListBox apple = getOrNull(meta, AppleItemListBox.class); - if (apple != null) { - // Title - AppleNameBox title = getOrNull(apple, AppleNameBox.class); - addMetadata(TikaCoreProperties.TITLE, metadata, title); - - // Artist - AppleArtistBox artist = getOrNull(apple, AppleArtistBox.class); - addMetadata(TikaCoreProperties.CREATOR, metadata, artist); - addMetadata(XMPDM.ARTIST, metadata, artist); - - // Album Artist - AppleArtist2Box artist2 = getOrNull(apple, AppleArtist2Box.class); - addMetadata(XMPDM.ALBUM_ARTIST, metadata, artist2); - - // Album - AppleAlbumBox album = getOrNull(apple, AppleAlbumBox.class); - addMetadata(XMPDM.ALBUM, metadata, album); - - // Composer - AppleTrackAuthorBox composer = 
getOrNull(apple, AppleTrackAuthorBox.class); - addMetadata(XMPDM.COMPOSER, metadata, composer); - - // Genre - AppleGenreBox genre = getOrNull(apple, AppleGenreBox.class); - addMetadata(XMPDM.GENRE, metadata, genre); - - // Year - AppleRecordingYear2Box year = getOrNull(apple, AppleRecordingYear2Box.class); - if (year != null) { - metadata.set(XMPDM.RELEASE_DATE, year.getValue()); - } - - // Track number - AppleTrackNumberBox trackNum = getOrNull(apple, AppleTrackNumberBox.class); - if (trackNum != null) { - metadata.set(XMPDM.TRACK_NUMBER, trackNum.getA()); - //metadata.set(XMPDM.NUMBER_OF_TRACKS, trackNum.getB()); // TODO - } - - // Disc number - AppleDiskNumberBox discNum = getOrNull(apple, AppleDiskNumberBox.class); - if (discNum != null) { - metadata.set(XMPDM.DISC_NUMBER, discNum.getA()); - } - - // Compilation - AppleCompilationBox compilation = getOrNull(apple, AppleCompilationBox.class); - if (compilation != null) { - metadata.set(XMPDM.COMPILATION, (int)compilation.getValue()); - } - - // Comment - AppleCommentBox comment = getOrNull(apple, AppleCommentBox.class); - addMetadata(XMPDM.LOG_COMMENT, metadata, comment); - - // Encoder - AppleEncoderBox encoder = getOrNull(apple, AppleEncoderBox.class); - if (encoder != null) { - metadata.set(XMP.CREATOR_TOOL, encoder.getValue()); - } - - - // As text - for (Box box : apple.getBoxes()) { - if (box instanceof Utf8AppleDataBox) { - xhtml.element("p", ((Utf8AppleDataBox)box).getValue()); - } - } - } - - // TODO Check for other kinds too - } - - // All done - xhtml.endDocument(); - + isoFile = new IsoFile(tstream.getFileChannel()); } finally { - tmp.dispose(); - } - + tstream.close(); + } + + + // Grab the file type box + FileTypeBox fileType = getOrNull(isoFile, FileTypeBox.class); + if (fileType != null) { + // Identify the type + MediaType type = MediaType.application("mp4"); + for (MediaType t : typesMap.keySet()) { + if (typesMap.get(t).contains(fileType.getMajorBrand())) { + type = t; + break; + } + } + 
metadata.set(Metadata.CONTENT_TYPE, type.toString()); + + if (type.getType().equals("audio")) { + metadata.set(XMPDM.AUDIO_COMPRESSOR, fileType.getMajorBrand().trim()); + } + } else { + // Some older QuickTime files lack the FileType + metadata.set(Metadata.CONTENT_TYPE, "video/quicktime"); + } + + + // Get the main MOOV box + MovieBox moov = getOrNull(isoFile, MovieBox.class); + if (moov == null) { + // Bail out + return; + } + + + XHTMLContentHandler xhtml = new XHTMLContentHandler(handler, metadata); + xhtml.startDocument(); + + + // Pull out some information from the header box + MovieHeaderBox mHeader = getOrNull(moov, MovieHeaderBox.class); + if (mHeader != null) { + // Get the creation and modification dates + metadata.set( + Metadata.CREATION_DATE, + MP4TimeToDate(mHeader.getCreationTime()) + ); + metadata.set( + TikaCoreProperties.MODIFIED, + MP4TimeToDate(mHeader.getModificationTime()) + ); + + // Get the duration + double durationSeconds = ((double)mHeader.getDuration()) / mHeader.getTimescale(); + // TODO Use this + + // The timescale is normally the sampling rate + metadata.set(XMPDM.AUDIO_SAMPLE_RATE, (int)mHeader.getTimescale()); + } + + + // Get some more information from the track header + // TODO Decide how to handle multiple tracks + List tb = moov.getBoxes(TrackBox.class); + if (tb.size() > 0) { + TrackBox track = tb.get(0); + + TrackHeaderBox header = track.getTrackHeaderBox(); + // Get the creation and modification dates + metadata.set( + TikaCoreProperties.CREATED, + MP4TimeToDate(header.getCreationTime()) + ); + metadata.set( + TikaCoreProperties.MODIFIED, + MP4TimeToDate(header.getModificationTime()) + ); + + // Get the video width and height + metadata.set(Metadata.IMAGE_WIDTH, (int)header.getWidth()); + metadata.set(Metadata.IMAGE_LENGTH, (int)header.getHeight()); + + // Get the sample information + SampleTableBox samples = track.getSampleTableBox(); + SampleDescriptionBox sampleDesc = samples.getSampleDescriptionBox(); + if (sampleDesc != 
null) { + // Look for the first Audio Sample, if present + AudioSampleEntry sample = getOrNull(sampleDesc, AudioSampleEntry.class); + if (sample != null) { + XMPDM.ChannelTypePropertyConverter.convertAndSet(metadata, sample.getChannelCount()); + //metadata.set(XMPDM.AUDIO_SAMPLE_TYPE, sample.getSampleSize()); // TODO Num -> Type mapping + metadata.set(XMPDM.AUDIO_SAMPLE_RATE, (int)sample.getSampleRate()); + //metadata.set(XMPDM.AUDIO_, sample.getSamplesPerPacket()); + //metadata.set(XMPDM.AUDIO_, sample.getBytesPerSample()); + } + } + } + + // Get metadata from the User Data Box + UserDataBox userData = getOrNull(moov, UserDataBox.class); + if (userData != null) { + MetaBox meta = getOrNull(userData, MetaBox.class); + + // Check for iTunes Metadata + // See http://atomicparsley.sourceforge.net/mpeg-4files.html and + // http://code.google.com/p/mp4v2/wiki/iTunesMetadata for more on these + AppleItemListBox apple = getOrNull(meta, AppleItemListBox.class); + if (apple != null) { + // Title + AppleTrackTitleBox title = getOrNull(apple, AppleTrackTitleBox.class); + addMetadata(TikaCoreProperties.TITLE, metadata, title); + + // Artist + AppleArtistBox artist = getOrNull(apple, AppleArtistBox.class); + addMetadata(TikaCoreProperties.CREATOR, metadata, artist); + addMetadata(XMPDM.ARTIST, metadata, artist); + + // Album + AppleAlbumBox album = getOrNull(apple, AppleAlbumBox.class); + addMetadata(XMPDM.ALBUM, metadata, album); + + // Composer + AppleTrackAuthorBox composer = getOrNull(apple, AppleTrackAuthorBox.class); + addMetadata(XMPDM.COMPOSER, metadata, composer); + + // Genre + AppleStandardGenreBox sGenre = getOrNull(apple, AppleStandardGenreBox.class); + AppleCustomGenreBox cGenre = getOrNull(apple, AppleCustomGenreBox.class); + addMetadata(XMPDM.GENRE, metadata, sGenre); + addMetadata(XMPDM.GENRE, metadata, cGenre); + + // Year + AppleRecordingYearBox year = getOrNull(apple, AppleRecordingYearBox.class); + addMetadata(XMPDM.RELEASE_DATE, metadata, year); + + // 
Track number + AppleTrackNumberBox trackNum = getOrNull(apple, AppleTrackNumberBox.class); + if (trackNum != null) { + metadata.set(XMPDM.TRACK_NUMBER, trackNum.getTrackNumber()); + //metadata.set(XMPDM.NUMBER_OF_TRACKS, trackNum.getNumberOfTracks()); // TODO + } + + // Comment + AppleCommentBox comment = getOrNull(apple, AppleCommentBox.class); + addMetadata(XMPDM.LOG_COMMENT, metadata, comment); + + // Encoder + AppleEncoderBox encoder = getOrNull(apple, AppleEncoderBox.class); + // addMetadata(XMPDM.???, metadata, encoder); // TODO + + + // As text + for (Box box : apple.getBoxes()) { + if (box instanceof AbstractAppleMetaDataBox) { + xhtml.element("p", ((AbstractAppleMetaDataBox)box).getValue()); + } + } + } + + // TODO Check for other kinds too + } + + // All done + xhtml.endDocument(); } - private static void addMetadata(String key, Metadata m, Utf8AppleDataBox metadata) { + private static void addMetadata(String key, Metadata m, AbstractAppleMetaDataBox metadata) { if (metadata != null) { m.add(key, metadata.getValue()); } } - private static void addMetadata(Property prop, Metadata m, Utf8AppleDataBox metadata) { + private static void addMetadata(Property prop, Metadata m, AbstractAppleMetaDataBox metadata) { if (metadata != null) { m.set(prop, metadata.getValue()); } } - private static T getOrNull(Container box, Class clazz) { + /** + * MP4 dates are stored as a 32-bit integer representing the seconds + * since midnight, January 1, 1904, generally in UTC + */ + private static Date MP4TimeToDate(long mp4Time) { + long unix = mp4Time - EPOCH_AS_MP4_TIME; + return new Date(unix*1000); + } + private static final long EPOCH_AS_MP4_TIME = 2082844800L; + + private static T getOrNull(ContainerBox box, Class clazz) { if (box == null) return null; List boxes = box.getBoxes(clazz); diff --git a/tika-parsers/src/main/java/org/apache/tika/parser/netcdf/NetCDFParser.java b/tika-parsers/src/main/java/org/apache/tika/parser/netcdf/NetCDFParser.java index 
57254f8..37e8978 100644 --- a/tika-parsers/src/main/java/org/apache/tika/parser/netcdf/NetCDFParser.java +++ b/tika-parsers/src/main/java/org/apache/tika/parser/netcdf/NetCDFParser.java @@ -17,16 +17,14 @@ package org.apache.tika.parser.netcdf; //JDK imports - +import java.io.ByteArrayOutputStream; import java.io.IOException; import java.io.InputStream; import java.util.Collections; import java.util.Set; -import java.util.List; import org.apache.tika.exception.TikaException; -import org.apache.tika.io.TemporaryResources; -import org.apache.tika.io.TikaInputStream; +import org.apache.tika.io.IOUtils; import org.apache.tika.metadata.Metadata; import org.apache.tika.metadata.Property; import org.apache.tika.metadata.TikaCoreProperties; @@ -40,8 +38,6 @@ import ucar.nc2.Attribute; import ucar.nc2.NetcdfFile; -import ucar.nc2.Variable; -import ucar.nc2.Dimension; /** * A {@link Parser} for SUPPORTED_TYPES = - Collections.singleton(MediaType.application("x-netcdf")); + Collections.singleton(MediaType.application("x-netcdf")); /* * (non-Javadoc) @@ -79,16 +73,22 @@ * org.apache.tika.parser.ParseContext) */ public void parse(InputStream stream, ContentHandler handler, - Metadata metadata, ParseContext context) throws IOException, + Metadata metadata, ParseContext context) throws IOException, SAXException, TikaException { + ByteArrayOutputStream os = new ByteArrayOutputStream(); + IOUtils.copy(stream, os); - TikaInputStream tis = TikaInputStream.get(stream, new TemporaryResources()); + String name = metadata.get(Metadata.RESOURCE_NAME_KEY); + if (name == null) { + name = ""; + } + try { - NetcdfFile ncFile = NetcdfFile.open(tis.getFile().getAbsolutePath()); - metadata.set("File-Type-Description", ncFile.getFileTypeDescription()); + NetcdfFile ncFile = NetcdfFile.openInMemory(name, os.toByteArray()); + // first parse out the set of global attributes for (Attribute attr : ncFile.getGlobalAttributes()) { - Property property = resolveMetadataKey(attr.getFullName()); + Property 
property = resolveMetadataKey(attr.getName()); if (attr.getDataType().isString()) { metadata.add(property, attr.getStringValue()); } else if (attr.getDataType().isNumeric()) { @@ -96,49 +96,20 @@ metadata.add(property, String.valueOf(value)); } } - - - XHTMLContentHandler xhtml = new XHTMLContentHandler(handler, metadata); - xhtml.startDocument(); - xhtml.newline(); - xhtml.element("h1", "dimensions"); - xhtml.startElement("ul"); - xhtml.newline(); - for (Dimension dim : ncFile.getDimensions()) { - xhtml.element("li", dim.getFullName() + " = " + dim.getLength()); - } - xhtml.endElement("ul"); - - xhtml.element("h1", "variables"); - xhtml.startElement("ul"); - xhtml.newline(); - for (Variable var : ncFile.getVariables()) { - xhtml.startElement("li"); - xhtml.characters(var.getDataType() + " " + var.getNameAndDimensions()); - xhtml.newline(); - List attributes = var.getAttributes(); - if (!attributes.isEmpty()) { - xhtml.startElement("ul"); - for (Attribute element : attributes) { - xhtml.element("li", element.toString()); - } - xhtml.endElement("ul"); - } - xhtml.endElement("li"); - } - xhtml.endElement("ul"); - - xhtml.endDocument(); - } catch (IOException e) { throw new TikaException("NetCDF parse error", e); - } + } + + XHTMLContentHandler xhtml = new XHTMLContentHandler(handler, metadata); + xhtml.startDocument(); + xhtml.endDocument(); } - + private Property resolveMetadataKey(String localName) { if ("title".equals(localName)) { return TikaCoreProperties.TITLE; } return Property.internalText(localName); } -} \ No newline at end of file + +} diff --git a/tika-parsers/src/main/java/org/apache/tika/parser/ocr/TesseractOCRConfig.java b/tika-parsers/src/main/java/org/apache/tika/parser/ocr/TesseractOCRConfig.java deleted file mode 100644 index a35370a..0000000 --- a/tika-parsers/src/main/java/org/apache/tika/parser/ocr/TesseractOCRConfig.java +++ /dev/null @@ -1,256 +0,0 @@ -/* - * Licensed to the Apache Software Foundation (ASF) under one or more - * contributor 
license agreements. See the NOTICE file distributed with - * this work for additional information regarding copyright ownership. - * The ASF licenses this file to You under the Apache License, Version 2.0 - * (the "License"); you may not use this file except in compliance with - * the License. You may obtain a copy of the License at - * - * http://www.apache.org/licenses/LICENSE-2.0 - * - * Unless required by applicable law or agreed to in writing, software - * distributed under the License is distributed on an "AS IS" BASIS, - * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. - * See the License for the specific language governing permissions and - * limitations under the License. - */ -package org.apache.tika.parser.ocr; - -import java.io.File; -import java.io.IOException; -import java.io.InputStream; -import java.io.Serializable; -import java.util.Locale; -import java.util.Properties; - -/** - * Configuration for TesseractOCRParser. - * - * This allows to enable TesseractOCRParser and set its parameters: - *

    - * TesseractOCRConfig config = new TesseractOCRConfig();
    - * config.setTesseractPath(tesseractFolder);
    - * parseContext.set(TesseractOCRConfig.class, config);
    - *

    - * - * Parameters can also be set by either editing the existing TesseractOCRConfig.properties file in, - * tika-parser/src/main/resources/org/apache/tika/parser/ocr, or overriding it by creating your own - * and placing it in the package org/apache/tika/parser/ocr on the classpath. - * - */ -public class TesseractOCRConfig implements Serializable{ - - private static final long serialVersionUID = -4861942486845757891L; - - // Path to tesseract installation folder, if not on system path. - private String tesseractPath = ""; - - // Path to the 'tessdata' folder, which contains language files and config files. - private String tessdataPath = ""; - - // Language dictionary to be used. - private String language = "eng"; - - // Tesseract page segmentation mode. - private String pageSegMode = "1"; - - // Minimum file size to submit file to ocr. - private int minFileSizeToOcr = 0; - - // Maximum file size to submit file to ocr. - private int maxFileSizeToOcr = Integer.MAX_VALUE; - - // Maximum time (seconds) to wait for the ocring process termination - private int timeout = 120; - - /** - * Default contructor. - */ - public TesseractOCRConfig() { - init(this.getClass().getResourceAsStream("TesseractOCRConfig.properties")); - } - - /** - * Loads properties from InputStream and then tries to close InputStream. - * If there is an IOException, this silently swallows the exception - * and goes back to the default. 
- * - * @param is - */ - public TesseractOCRConfig(InputStream is) { - init(is); - } - - private void init(InputStream is) { - if (is == null) { - return; - } - Properties props = new Properties(); - try { - props.load(is); - } catch (IOException e) { - } finally { - if (is != null) { - try { - is.close(); - } catch (IOException e) { - //swallow - } - } - } - - setTesseractPath( - getProp(props, "tesseractPath", getTesseractPath())); - setTessdataPath( - getProp(props, "tessdataPath", getTessdataPath())); - setLanguage( - getProp(props, "language", getLanguage())); - setPageSegMode( - getProp(props, "pageSegMode", getPageSegMode())); - setMinFileSizeToOcr( - getProp(props, "minFileSizeToOcr", getMinFileSizeToOcr())); - setMaxFileSizeToOcr( - getProp(props, "maxFileSizeToOcr", getMaxFileSizeToOcr())); - setTimeout( - getProp(props, "timeout", getTimeout())); - - } - - /** @see #setTesseractPath(String tesseractPath)*/ - public String getTesseractPath() { - return tesseractPath; - } - - /** - * Set the path to the Tesseract executable, needed if it is not on system path. - *

    - * Note that if you set this value, it is highly recommended that you also - * set the path to the 'tessdata' folder using {@link #setTessdataPath}. - *

    - */ - public void setTesseractPath(String tesseractPath) { - if(!tesseractPath.isEmpty() && !tesseractPath.endsWith(File.separator)) - tesseractPath += File.separator; - - this.tesseractPath = tesseractPath; - } - - /** @see #setTessdataPath(String tessdataPath) */ - public String getTessdataPath() { - return tessdataPath; - } - - /** - * Set the path to the 'tessdata' folder, which contains language files and config files. In some cases (such - * as on Windows), this folder is found in the Tesseract installation, but in other cases - * (such as when Tesseract is built from source), it may be located elsewhere. - */ - public void setTessdataPath(String tessdataPath) { - if(!tessdataPath.isEmpty() && !tessdataPath.endsWith(File.separator)) - tessdataPath += File.separator; - - this.tessdataPath = tessdataPath; - } - - /** @see #setLanguage(String language)*/ - public String getLanguage() { - return language; - } - - /** - * Set tesseract language dictionary to be used. Default is "eng". - * Multiple languages may be specified, separated by plus characters. - */ - public void setLanguage(String language) { - if (!language.matches("([A-Za-z](\\+?))*")) { - throw new IllegalArgumentException("Invalid language code"); - } - this.language = language; - } - - /** @see #setPageSegMode(String pageSegMode)*/ - public String getPageSegMode() { - return pageSegMode; - } - - /** - * Set tesseract page segmentation mode. - * Default is 1 = Automatic page segmentation with OSD (Orientation and Script Detection) - */ - public void setPageSegMode(String pageSegMode) { - if (!pageSegMode.matches("[1-9]|10")) { - throw new IllegalArgumentException("Invalid language code"); - } - this.pageSegMode = pageSegMode; - } - - /** @see #setMinFileSizeToOcr(int minFileSizeToOcr)*/ - public int getMinFileSizeToOcr() { - return minFileSizeToOcr; - } - - /** - * Set minimum file size to submit file to ocr. - * Default is 0. 
- */ - public void setMinFileSizeToOcr(int minFileSizeToOcr) { - this.minFileSizeToOcr = minFileSizeToOcr; - } - - /** @see #setMaxFileSizeToOcr(int maxFileSizeToOcr)*/ - public int getMaxFileSizeToOcr() { - return maxFileSizeToOcr; - } - - /** - * Set maximum file size to submit file to ocr. - * Default is Integer.MAX_VALUE. - */ - public void setMaxFileSizeToOcr(int maxFileSizeToOcr) { - this.maxFileSizeToOcr = maxFileSizeToOcr; - } - - /** - * Set maximum time (seconds) to wait for the ocring process to terminate. - * Default value is 120s. - */ - public void setTimeout(int timeout) { - this.timeout = timeout; - } - - /** @see #setTimeout(int timeout)*/ - public int getTimeout() { - return timeout; - } - - /** - * Get property from the properties file passed in. - * @param properties properties file to read from. - * @param property the property to fetch. - * @param defaultMissing default parameter to use. - * @return the value. - */ - private int getProp(Properties properties, String property, int defaultMissing) { - String p = properties.getProperty(property); - if (p == null || p.isEmpty()){ - return defaultMissing; - } - try { - return Integer.parseInt(p); - } catch (Throwable ex) { - throw new RuntimeException(String.format(Locale.ROOT, "Cannot parse TesseractOCRConfig variable %s, invalid integer value", - property), ex); - } - } - - /** - * Get property from the properties file passed in. - * @param properties properties file to read from. - * @param property the property to fetch. - * @param defaultMissing default parameter to use. - * @return the value. 
- */ - private String getProp(Properties properties, String property, String defaultMissing) { - return properties.getProperty(property, defaultMissing); - } -} diff --git a/tika-parsers/src/main/java/org/apache/tika/parser/ocr/TesseractOCRParser.java b/tika-parsers/src/main/java/org/apache/tika/parser/ocr/TesseractOCRParser.java deleted file mode 100644 index e45095a..0000000 --- a/tika-parsers/src/main/java/org/apache/tika/parser/ocr/TesseractOCRParser.java +++ /dev/null @@ -1,336 +0,0 @@ -/* - * Licensed to the Apache Software Foundation (ASF) under one or more - * contributor license agreements. See the NOTICE file distributed with - * this work for additional information regarding copyright ownership. - * The ASF licenses this file to You under the Apache License, Version 2.0 - * (the "License"); you may not use this file except in compliance with - * the License. You may obtain a copy of the License at - * - * http://www.apache.org/licenses/LICENSE-2.0 - * - * Unless required by applicable law or agreed to in writing, software - * distributed under the License is distributed on an "AS IS" BASIS, - * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. - * See the License for the specific language governing permissions and - * limitations under the License. 
- */
-package org.apache.tika.parser.ocr;
-
-import javax.imageio.ImageIO;
-
-import java.awt.Image;
-import java.awt.image.BufferedImage;
-import java.io.File;
-import java.io.FileInputStream;
-import java.io.FileOutputStream;
-import java.io.IOException;
-import java.io.InputStream;
-import java.io.InputStreamReader;
-import java.io.Reader;
-import java.util.Arrays;
-import java.util.Collections;
-import java.util.HashMap;
-import java.util.HashSet;
-import java.util.List;
-import java.util.Map;
-import java.util.Set;
-import java.util.concurrent.Callable;
-import java.util.concurrent.ExecutionException;
-import java.util.concurrent.FutureTask;
-import java.util.concurrent.TimeUnit;
-import java.util.concurrent.TimeoutException;
-
-import org.apache.commons.io.IOUtils;
-import org.apache.commons.logging.LogFactory;
-import org.apache.tika.exception.TikaException;
-import org.apache.tika.io.TemporaryResources;
-import org.apache.tika.io.TikaInputStream;
-import org.apache.tika.metadata.Metadata;
-import org.apache.tika.mime.MediaType;
-import org.apache.tika.mime.MediaTypeRegistry;
-import org.apache.tika.parser.AbstractParser;
-import org.apache.tika.parser.CompositeParser;
-import org.apache.tika.parser.ParseContext;
-import org.apache.tika.parser.Parser;
-import org.apache.tika.parser.external.ExternalParser;
-import org.apache.tika.parser.image.ImageParser;
-import org.apache.tika.parser.image.TiffParser;
-import org.apache.tika.parser.jpeg.JpegParser;
-import org.apache.tika.sax.XHTMLContentHandler;
-import org.xml.sax.ContentHandler;
-import org.xml.sax.SAXException;
-
-import static java.nio.charset.StandardCharsets.UTF_8;
-
-/**
- * TesseractOCRParser powered by tesseract-ocr engine. To enable this parser,
- * create a {@link TesseractOCRConfig} object and pass it through a
- * ParseContext. Tesseract-ocr must be installed and on system path or the path
- * to its root folder must be provided:
- * <pre>
- * TesseractOCRConfig config = new TesseractOCRConfig();
- * //Needed if tesseract is not on system path
- * config.setTesseractPath(tesseractFolder);
- * parseContext.set(TesseractOCRConfig.class, config);
- * </pre>
- *
- *
- */
-public class TesseractOCRParser extends AbstractParser {
-    private static final long serialVersionUID = -8167538283213097265L;
-    private static final TesseractOCRConfig DEFAULT_CONFIG = new TesseractOCRConfig();
-    private static final Set<MediaType> SUPPORTED_TYPES = Collections.unmodifiableSet(
-            new HashSet<MediaType>(Arrays.asList(new MediaType[] {
-                    MediaType.image("png"), MediaType.image("jpeg"), MediaType.image("tiff"),
-                    MediaType.image("x-ms-bmp"), MediaType.image("gif")
-            })));
-    private static Map<String, Boolean> TESSERACT_PRESENT = new HashMap<String, Boolean>();
-
-    @Override
-    public Set<MediaType> getSupportedTypes(ParseContext context) {
-        // If Tesseract is installed, offer our supported image types
-        TesseractOCRConfig config = context.get(TesseractOCRConfig.class, DEFAULT_CONFIG);
-        if (hasTesseract(config))
-            return SUPPORTED_TYPES;
-
-        // Otherwise don't advertise anything, so the other image parsers
-        // can be selected instead
-        return Collections.emptySet();
-    }
-
-    private void setEnv(TesseractOCRConfig config, ProcessBuilder pb) {
-        String tessdataPrefix = "TESSDATA_PREFIX";
-        Map<String, String> env = pb.environment();
-
-        if (!config.getTessdataPath().isEmpty()) {
-            env.put(tessdataPrefix, config.getTessdataPath());
-        }
-        else if(!config.getTesseractPath().isEmpty()) {
-            env.put(tessdataPrefix, config.getTesseractPath());
-        }
-    }
-
-    private boolean hasTesseract(TesseractOCRConfig config) {
-        // Fetch where the config says to find Tesseract
-        String tesseract = config.getTesseractPath() + getTesseractProg();
-
-        // Have we already checked for a copy of Tesseract there?
-        if (TESSERACT_PRESENT.containsKey(tesseract)) {
-            return TESSERACT_PRESENT.get(tesseract);
-        }
-
-        // Try running Tesseract from there, and see if it exists + works
-        String[] checkCmd = { tesseract };
-        try {
-            boolean hasTesseract = ExternalParser.check(checkCmd);
-            TESSERACT_PRESENT.put(tesseract, hasTesseract);
-            return hasTesseract;
-        } catch (NoClassDefFoundError e) {
-            // This happens under OSGi + Fork Parser - see TIKA-1507
-            // As a workaround for now, just say we can't use OCR
-            // TODO Resolve it so we don't need this try/catch block
-            TESSERACT_PRESENT.put(tesseract, false);
-            return false;
-        }
-    }
-
-    public void parse(Image image, ContentHandler handler, Metadata metadata, ParseContext context) throws IOException,
-            SAXException, TikaException {
-
-        TemporaryResources tmp = new TemporaryResources();
-        FileOutputStream fos = null;
-        TikaInputStream tis = null;
-        try {
-            int w = image.getWidth(null);
-            int h = image.getHeight(null);
-            BufferedImage bImage = new BufferedImage(w, h, BufferedImage.TYPE_INT_RGB);
-            File file = tmp.createTemporaryFile();
-            fos = new FileOutputStream(file);
-            ImageIO.write(bImage, "png", fos);
-            tis = TikaInputStream.get(file);
-            parse(tis, handler, metadata, context);
-
-        } finally {
-            tmp.dispose();
-            if (tis != null)
-                tis.close();
-            if (fos != null)
-                fos.close();
-        }
-
-    }
-
-    @Override
-    public void parse(InputStream stream, ContentHandler handler, Metadata metadata, ParseContext context)
-            throws IOException, SAXException, TikaException {
-        TesseractOCRConfig config = context.get(TesseractOCRConfig.class, DEFAULT_CONFIG);
-
-        // If Tesseract is not on the path with the current config, do not try to run OCR
-        // getSupportedTypes shouldn't have listed us as handling it, so this should only
-        // occur if someone directly calls this parser, not via DefaultParser or similar
-        if (!hasTesseract(config))
-            return;
-
-        XHTMLContentHandler xhtml = new XHTMLContentHandler(handler, metadata);
-
-        TemporaryResources tmp = new TemporaryResources();
-        File output = null;
-        try {
-            TikaInputStream tikaStream = TikaInputStream.get(stream, tmp);
-            File input = tikaStream.getFile();
-            long size = tikaStream.getLength();
-
-            if (size >= config.getMinFileSizeToOcr() && size <= config.getMaxFileSizeToOcr()) {
-
-                output = tmp.createTemporaryFile();
-                doOCR(input, output, config);
-
-                // Tesseract appends .txt to output file name
-                output = new File(output.getAbsolutePath() + ".txt");
-
-                if (output.exists())
-                    extractOutput(new FileInputStream(output), xhtml);
-
-            }
-
-            // Temporary workaround for TIKA-1445 - until we can specify
-            // composite parsers with strategies (eg Composite, Try In Turn),
-            // always send the image onwards to the regular parser to have
-            // the metadata for them extracted as well
-            _TMP_IMAGE_METADATA_PARSER.parse(tikaStream, handler, metadata, context);
-        } finally {
-            tmp.dispose();
-            if (output != null) {
-                output.delete();
-            }
-        }
-    }
-
-    // TIKA-1445 workaround parser
-    private static Parser _TMP_IMAGE_METADATA_PARSER = new CompositeImageParser();
-    private static class CompositeImageParser extends CompositeParser {
-        private static final long serialVersionUID = -2398203346206381382L;
-        private static List<Parser> imageParsers = Arrays.asList(new Parser[]{
-                new ImageParser(), new JpegParser(), new TiffParser()
-        });
-        CompositeImageParser() {
-            super(new MediaTypeRegistry(), imageParsers);
-        }
-    }
-
-    /**
-     * Run external tesseract-ocr process.
-     *
-     * @param input
-     *          File to be ocred
-     * @param output
-     *          File to collect ocr result
-     * @param config
-     *          Configuration of tesseract-ocr engine
-     * @throws TikaException
-     *           if the extraction timed out
-     * @throws IOException
-     *           if an input error occurred
-     */
-    private void doOCR(File input, File output, TesseractOCRConfig config) throws IOException, TikaException {
-        String[] cmd = { config.getTesseractPath() + getTesseractProg(), input.getPath(), output.getPath(), "-l",
-                config.getLanguage(), "-psm", config.getPageSegMode() };
-
-        ProcessBuilder pb = new ProcessBuilder(cmd);
-        setEnv(config, pb);
-        final Process process = pb.start();
-
-        process.getOutputStream().close();
-        InputStream out = process.getInputStream();
-        InputStream err = process.getErrorStream();
-
-        logStream("OCR MSG", out, input);
-        logStream("OCR ERROR", err, input);
-
-        FutureTask<Integer> waitTask = new FutureTask<Integer>(new Callable<Integer>() {
-            public Integer call() throws Exception {
-                return process.waitFor();
-            }
-        });
-
-        Thread waitThread = new Thread(waitTask);
-        waitThread.start();
-
-        try {
-            waitTask.get(config.getTimeout(), TimeUnit.SECONDS);
-
-        } catch (InterruptedException e) {
-            waitThread.interrupt();
-            process.destroy();
-            Thread.currentThread().interrupt();
-            throw new TikaException("TesseractOCRParser interrupted", e);
-
-        } catch (ExecutionException e) {
-            // should not be thrown
-
-        } catch (TimeoutException e) {
-            waitThread.interrupt();
-            process.destroy();
-            throw new TikaException("TesseractOCRParser timeout", e);
-        }
-
-    }
-
-    /**
-     * Reads the contents of the given stream and write it to the given XHTML
-     * content handler. The stream is closed once fully processed.
-     *
-     * @param stream
-     *          Stream where is the result of ocr
-     * @param xhtml
-     *          XHTML content handler
-     * @throws SAXException
-     *           if the XHTML SAX events could not be handled
-     * @throws IOException
-     *           if an input error occurred
-     */
-    private void extractOutput(InputStream stream, XHTMLContentHandler xhtml) throws SAXException, IOException {
-
-        xhtml.startDocument();
-        xhtml.startElement("div");
-        try (Reader reader = new InputStreamReader(stream, UTF_8)) {
-            char[] buffer = new char[1024];
-            for (int n = reader.read(buffer); n != -1; n = reader.read(buffer)) {
-                if (n > 0)
-                    xhtml.characters(buffer, 0, n);
-            }
-        }
-        xhtml.endElement("div");
-        xhtml.endDocument();
-    }
-
-    /**
-     * Starts a thread that reads the contents of the standard output or error
-     * stream of the given process to not block the process. The stream is closed
-     * once fully processed.
-     */
-    private void logStream(final String logType, final InputStream stream, final File file) {
-        new Thread() {
-            public void run() {
-                Reader reader = new InputStreamReader(stream, UTF_8);
-                StringBuilder out = new StringBuilder();
-                char[] buffer = new char[1024];
-                try {
-                    for (int n = reader.read(buffer); n != -1; n = reader.read(buffer))
-                        out.append(buffer, 0, n);
-                } catch (IOException e) {
-
-                } finally {
-                    IOUtils.closeQuietly(stream);
-                }
-
-                String msg = out.toString();
-                LogFactory.getLog(TesseractOCRParser.class).debug(msg);
-            }
-        }.start();
-    }
-
-    static String getTesseractProg() {
-        return System.getProperty("os.name").startsWith("Windows") ?
"tesseract.exe" : "tesseract"; - } - -} diff --git a/tika-parsers/src/main/java/org/apache/tika/parser/odf/NSNormalizerContentHandler.java b/tika-parsers/src/main/java/org/apache/tika/parser/odf/NSNormalizerContentHandler.java index fa932a6..70257d0 100644 --- a/tika-parsers/src/main/java/org/apache/tika/parser/odf/NSNormalizerContentHandler.java +++ b/tika-parsers/src/main/java/org/apache/tika/parser/odf/NSNormalizerContentHandler.java @@ -18,7 +18,6 @@ import java.io.IOException; import java.io.StringReader; -import java.util.Locale; import org.apache.tika.sax.ContentHandlerDecorator; import org.xml.sax.Attributes; @@ -36,13 +35,13 @@ public class NSNormalizerContentHandler extends ContentHandlerDecorator { private static final String OLD_NS = - "http://openoffice.org/2000/"; + "http://openoffice.org/2000/"; private static final String NEW_NS = - "urn:oasis:names:tc:opendocument:xmlns:"; + "urn:oasis:names:tc:opendocument:xmlns:"; private static final String DTD_PUBLIC_ID = - "-//OpenOffice.org//DTD OfficeDocument 1.0//EN"; + "-//OpenOffice.org//DTD OfficeDocument 1.0//EN"; public NSNormalizerContentHandler(ContentHandler handler) { super(handler); @@ -88,7 +87,7 @@ @Override public InputSource resolveEntity(String publicId, String systemId) throws IOException, SAXException { - if ((systemId != null && systemId.toLowerCase(Locale.ROOT).endsWith(".dtd")) + if ((systemId != null && systemId.toLowerCase().endsWith(".dtd")) || DTD_PUBLIC_ID.equals(publicId)) { return new InputSource(new StringReader("")); } else { diff --git a/tika-parsers/src/main/java/org/apache/tika/parser/odf/OpenDocumentContentParser.java b/tika-parsers/src/main/java/org/apache/tika/parser/odf/OpenDocumentContentParser.java index 1bf3f7f..d62fc9b 100644 --- a/tika-parsers/src/main/java/org/apache/tika/parser/odf/OpenDocumentContentParser.java +++ b/tika-parsers/src/main/java/org/apache/tika/parser/odf/OpenDocumentContentParser.java @@ -16,11 +16,7 @@ */ package org.apache.tika.parser.odf; 
-import javax.xml.XMLConstants; -import javax.xml.namespace.QName; -import javax.xml.parsers.ParserConfigurationException; -import javax.xml.parsers.SAXParser; -import javax.xml.parsers.SAXParserFactory; +import static org.apache.tika.sax.XHTMLContentHandler.XHTML; import java.io.IOException; import java.io.InputStream; @@ -31,8 +27,14 @@ import java.util.Set; import java.util.Stack; -import org.apache.commons.io.input.CloseShieldInputStream; +import javax.xml.XMLConstants; +import javax.xml.namespace.QName; +import javax.xml.parsers.ParserConfigurationException; +import javax.xml.parsers.SAXParser; +import javax.xml.parsers.SAXParserFactory; + import org.apache.tika.exception.TikaException; +import org.apache.tika.io.CloseShieldInputStream; import org.apache.tika.metadata.Metadata; import org.apache.tika.mime.MediaType; import org.apache.tika.parser.AbstractParser; @@ -48,347 +50,189 @@ import org.xml.sax.helpers.AttributesImpl; import org.xml.sax.helpers.DefaultHandler; -import static org.apache.tika.sax.XHTMLContentHandler.XHTML; - /** * Parser for ODF content.xml files. */ public class OpenDocumentContentParser extends AbstractParser { - private interface Style { - } - - private static class TextStyle implements Style { - public boolean italic; - public boolean bold; - public boolean underlined; - } - - private static class ListStyle implements Style { - public boolean ordered; - - public String getTag() { - return ordered ? 
"ol" : "ul"; - } - } private static final class OpenDocumentElementMappingContentHandler extends - ElementMappingContentHandler { - private final ContentHandler handler; - private final BitSet textNodeStack = new BitSet(); - private int nodeDepth = 0; - private int completelyFiltered = 0; - private Stack headingStack = new Stack(); - private Map textStyleMap = new HashMap(); - private Map listStyleMap = new HashMap(); - private TextStyle textStyle; - private TextStyle lastTextStyle; - private Stack listStyleStack = new Stack(); - private ListStyle listStyle; - - private OpenDocumentElementMappingContentHandler(ContentHandler handler, - Map mappings) { - super(handler, mappings); - this.handler = handler; - } - - @Override - public void characters(char[] ch, int start, int length) - throws SAXException { - // only forward content of tags from text:-namespace - if (completelyFiltered == 0 && nodeDepth > 0 - && textNodeStack.get(nodeDepth - 1)) { - lazyEndSpan(); - super.characters(ch, start, length); - } - } - - // helper for checking tags which need complete filtering - // (with sub-tags) - private boolean needsCompleteFiltering( - String namespaceURI, String localName) { - if (TEXT_NS.equals(namespaceURI)) { - return localName.endsWith("-template") - || localName.endsWith("-style"); - } - return TABLE_NS.equals(namespaceURI) && "covered-table-cell".equals(localName); - } - - // map the heading level to HTML tags - private String getXHTMLHeaderTagName(Attributes atts) { - String depthStr = atts.getValue(TEXT_NS, "outline-level"); - if (depthStr == null) { - return "h1"; - } - - int depth = Integer.parseInt(depthStr); - if (depth >= 6) { - return "h6"; - } else if (depth <= 1) { - return "h1"; - } else { - return "h" + depth; - } - } - - /** - * Check if a node is a text node - */ - private boolean isTextNode(String namespaceURI, String localName) { - if (TEXT_NS.equals(namespaceURI) && !localName.equals("page-number") && !localName.equals("page-count")) { - return 
true; - } - if (SVG_NS.equals(namespaceURI)) { - return "title".equals(localName) || - "desc".equals(localName); - } - return false; - } - - private void startList(String name) throws SAXException { - String elementName = "ul"; - if (name != null) { - ListStyle style = listStyleMap.get(name); - elementName = style != null ? style.getTag() : "ul"; - listStyleStack.push(style); - } - handler.startElement(XHTML, elementName, elementName, EMPTY_ATTRIBUTES); - } - - private void endList() throws SAXException { - String elementName = "ul"; - if (!listStyleStack.isEmpty()) { - ListStyle style = listStyleStack.pop(); - elementName = style != null ? style.getTag() : "ul"; - } - handler.endElement(XHTML, elementName, elementName); - } - - private void startSpan(String name) throws SAXException { - if (name == null) { - return; - } - - TextStyle style = textStyleMap.get(name); - if (style == null) { - return; - } - - // End tags that refer to no longer valid styles - if (!style.underlined && lastTextStyle != null && lastTextStyle.underlined) { - handler.endElement(XHTML, "u", "u"); - } - if (!style.italic && lastTextStyle != null && lastTextStyle.italic) { - handler.endElement(XHTML, "i", "i"); - } - if (!style.bold && lastTextStyle != null && lastTextStyle.bold) { - handler.endElement(XHTML, "b", "b"); - } - - // Start tags for new styles - if (style.bold && (lastTextStyle == null || !lastTextStyle.bold)) { - handler.startElement(XHTML, "b", "b", EMPTY_ATTRIBUTES); - } - if (style.italic && (lastTextStyle == null || !lastTextStyle.italic)) { - handler.startElement(XHTML, "i", "i", EMPTY_ATTRIBUTES); - } - if (style.underlined && (lastTextStyle == null || !lastTextStyle.underlined)) { - handler.startElement(XHTML, "u", "u", EMPTY_ATTRIBUTES); - } - - textStyle = style; - lastTextStyle = null; - } - - private void endSpan() throws SAXException { - lastTextStyle = textStyle; - textStyle = null; - } - - private void lazyEndSpan() throws SAXException { - if (lastTextStyle == 
null) { - return; - } - - if (lastTextStyle.underlined) { - handler.endElement(XHTML, "u", "u"); - } - if (lastTextStyle.italic) { - handler.endElement(XHTML, "i", "i"); - } - if (lastTextStyle.bold) { - handler.endElement(XHTML, "b", "b"); - } - - lastTextStyle = null; - } - - @Override - public void startElement( - String namespaceURI, String localName, String qName, - Attributes attrs) throws SAXException { - // keep track of current node type. If it is a text node, - // a bit at the current depth its set in textNodeStack. - // characters() checks the top bit to determine, if the - // actual node is a text node to print out nodeDepth contains - // the depth of the current node and also marks top of stack. - assert nodeDepth >= 0; - - // Set styles - if (STYLE_NS.equals(namespaceURI) && "style".equals(localName)) { - String family = attrs.getValue(STYLE_NS, "family"); - if ("text".equals(family)) { - textStyle = new TextStyle(); - String name = attrs.getValue(STYLE_NS, "name"); - textStyleMap.put(name, textStyle); - } - } else if (TEXT_NS.equals(namespaceURI) && "list-style".equals(localName)) { - listStyle = new ListStyle(); - String name = attrs.getValue(STYLE_NS, "name"); - listStyleMap.put(name, listStyle); - } else if (textStyle != null && STYLE_NS.equals(namespaceURI) - && "text-properties".equals(localName)) { - String fontStyle = attrs.getValue(FORMATTING_OBJECTS_NS, "font-style"); - if ("italic".equals(fontStyle) || "oblique".equals(fontStyle)) { - textStyle.italic = true; - } - String fontWeight = attrs.getValue(FORMATTING_OBJECTS_NS, "font-weight"); - if ("bold".equals(fontWeight) || "bolder".equals(fontWeight) - || (fontWeight != null && Character.isDigit(fontWeight.charAt(0)) - && Integer.valueOf(fontWeight) > 500)) { - textStyle.bold = true; - } - String underlineStyle = attrs.getValue(STYLE_NS, "text-underline-style"); - if (underlineStyle != null) { - textStyle.underlined = true; - } - } else if (listStyle != null && TEXT_NS.equals(namespaceURI)) 
{ - if ("list-level-style-bullet".equals(localName)) { - listStyle.ordered = false; - } else if ("list-level-style-number".equals(localName)) { - listStyle.ordered = true; - } - } - - textNodeStack.set(nodeDepth++, - isTextNode(namespaceURI, localName)); - // filter *all* content of some tags - assert completelyFiltered >= 0; - - if (needsCompleteFiltering(namespaceURI, localName)) { - completelyFiltered++; - } - // call next handler if no filtering - if (completelyFiltered == 0) { - // special handling of text:h, that are directly passed - // to incoming handler - if (TEXT_NS.equals(namespaceURI) && "h".equals(localName)) { - final String el = headingStack.push(getXHTMLHeaderTagName(attrs)); - handler.startElement(XHTMLContentHandler.XHTML, el, el, EMPTY_ATTRIBUTES); - } else if (TEXT_NS.equals(namespaceURI) && "list".equals(localName)) { - startList(attrs.getValue(TEXT_NS, "style-name")); - } else if (TEXT_NS.equals(namespaceURI) && "span".equals(localName)) { - startSpan(attrs.getValue(TEXT_NS, "style-name")); - } else { - super.startElement(namespaceURI, localName, qName, attrs); - } - } - } - - @Override - public void endElement( - String namespaceURI, String localName, String qName) - throws SAXException { - if (STYLE_NS.equals(namespaceURI) && "style".equals(localName)) { - textStyle = null; - } else if (TEXT_NS.equals(namespaceURI) && "list-style".equals(localName)) { - listStyle = null; - } - - // call next handler if no filtering - if (completelyFiltered == 0) { - // special handling of text:h, that are directly passed - // to incoming handler - if (TEXT_NS.equals(namespaceURI) && "h".equals(localName)) { - final String el = headingStack.pop(); - handler.endElement(XHTMLContentHandler.XHTML, el, el); - } else if (TEXT_NS.equals(namespaceURI) && "list".equals(localName)) { - endList(); - } else if (TEXT_NS.equals(namespaceURI) && "span".equals(localName)) { - endSpan(); - } else { - if (TEXT_NS.equals(namespaceURI) && "p".equals(localName)) { - 
lazyEndSpan(); - } - super.endElement(namespaceURI, localName, qName); - } - - // special handling of tabulators - if (TEXT_NS.equals(namespaceURI) - && ("tab-stop".equals(localName) - || "tab".equals(localName))) { - this.characters(TAB, 0, TAB.length); - } - } - - // revert filter for *all* content of some tags - if (needsCompleteFiltering(namespaceURI, localName)) { - completelyFiltered--; - } - assert completelyFiltered >= 0; - - // reduce current node depth - nodeDepth--; - assert nodeDepth >= 0; - } - - @Override - public void startPrefixMapping(String prefix, String uri) { - // remove prefix mappings as they should not occur in XHTML - } - - @Override - public void endPrefixMapping(String prefix) { - // remove prefix mappings as they should not occur in XHTML - } - } - - public static final String TEXT_NS = - "urn:oasis:names:tc:opendocument:xmlns:text:1.0"; + ElementMappingContentHandler { + private final ContentHandler handler; + private final BitSet textNodeStack = new BitSet(); + private int nodeDepth = 0; + private int completelyFiltered = 0; + private Stack headingStack = new Stack(); + + private OpenDocumentElementMappingContentHandler(ContentHandler handler, + Map mappings) { + super(handler, mappings); + this.handler = handler; + } + + @Override + public void characters(char[] ch, int start, int length) + throws SAXException { + // only forward content of tags from text:-namespace + if (completelyFiltered == 0 && nodeDepth > 0 + && textNodeStack.get(nodeDepth - 1)) { + super.characters(ch,start,length); + } + } + + // helper for checking tags which need complete filtering + // (with sub-tags) + private boolean needsCompleteFiltering( + String namespaceURI, String localName) { + if (TEXT_NS.equals(namespaceURI)) { + return localName.endsWith("-template") + || localName.endsWith("-style"); + } else if (TABLE_NS.equals(namespaceURI)) { + return "covered-table-cell".equals(localName); + } else { + return false; + } + } + + // map the heading level to 
HTML tags + private String getXHTMLHeaderTagName(Attributes atts) { + String depthStr = atts.getValue(TEXT_NS, "outline-level"); + if (depthStr == null) { + return "h1"; + } + + int depth = Integer.parseInt(depthStr); + if (depth >= 6) { + return "h6"; + } else if (depth <= 1) { + return "h1"; + } else { + return "h" + depth; + } + } + + /** + * Check if a node is a text node + */ + private boolean isTextNode(String namespaceURI, String localName) { + if (TEXT_NS.equals(namespaceURI) && !localName.equals("page-number") && !localName.equals("page-count")) { + return true; + } + if (SVG_NS.equals(namespaceURI)) { + return "title".equals(localName) || + "desc".equals(localName); + } + return false; + } + + @Override + public void startElement( + String namespaceURI, String localName, String qName, + Attributes atts) throws SAXException { + // keep track of current node type. If it is a text node, + // a bit at the current depth ist set in textNodeStack. + // characters() checks the top bit to determine, if the + // actual node is a text node to print out nodeDepth contains + // the depth of the current node and also marks top of stack. 
+ assert nodeDepth >= 0; + + textNodeStack.set(nodeDepth++, + isTextNode(namespaceURI, localName)); + // filter *all* content of some tags + assert completelyFiltered >= 0; + + if (needsCompleteFiltering(namespaceURI, localName)) { + completelyFiltered++; + } + // call next handler if no filtering + if (completelyFiltered == 0) { + // special handling of text:h, that are directly passed + // to incoming handler + if (TEXT_NS.equals(namespaceURI) && "h".equals(localName)) { + final String el = headingStack.push(getXHTMLHeaderTagName(atts)); + handler.startElement(XHTMLContentHandler.XHTML, el, el, EMPTY_ATTRIBUTES); + } else { + super.startElement( + namespaceURI, localName, qName, atts); + } + } + } + + @Override + public void endElement( + String namespaceURI, String localName, String qName) + throws SAXException { + // call next handler if no filtering + if (completelyFiltered == 0) { + // special handling of text:h, that are directly passed + // to incoming handler + if (TEXT_NS.equals(namespaceURI) && "h".equals(localName)) { + final String el = headingStack.pop(); + handler.endElement(XHTMLContentHandler.XHTML, el, el); + } else { + super.endElement(namespaceURI,localName,qName); + } + + // special handling of tabulators + if (TEXT_NS.equals(namespaceURI) + && ("tab-stop".equals(localName) + || "tab".equals(localName))) { + this.characters(TAB, 0, TAB.length); + } + } + + // revert filter for *all* content of some tags + if (needsCompleteFiltering(namespaceURI,localName)) { + completelyFiltered--; + } + assert completelyFiltered >= 0; + + // reduce current node depth + nodeDepth--; + assert nodeDepth >= 0; + } + + @Override + public void startPrefixMapping(String prefix, String uri) { + // remove prefix mappings as they should not occur in XHTML + } + + @Override + public void endPrefixMapping(String prefix) { + // remove prefix mappings as they should not occur in XHTML + } + } + + public static final String TEXT_NS = + 
"urn:oasis:names:tc:opendocument:xmlns:text:1.0"; public static final String TABLE_NS = - "urn:oasis:names:tc:opendocument:xmlns:table:1.0"; - - public static final String STYLE_NS = - "urn:oasis:names:tc:opendocument:xmlns:style:1.0"; - - public static final String FORMATTING_OBJECTS_NS = - "urn:oasis:names:tc:opendocument:xmlns:xsl-fo-compatible:1.0"; + "urn:oasis:names:tc:opendocument:xmlns:table:1.0"; public static final String OFFICE_NS = - "urn:oasis:names:tc:opendocument:xmlns:office:1.0"; + "urn:oasis:names:tc:opendocument:xmlns:office:1.0"; public static final String SVG_NS = - "urn:oasis:names:tc:opendocument:xmlns:svg-compatible:1.0"; + "urn:oasis:names:tc:opendocument:xmlns:svg-compatible:1.0"; public static final String PRESENTATION_NS = - "urn:oasis:names:tc:opendocument:xmlns:presentation:1.0"; + "urn:oasis:names:tc:opendocument:xmlns:presentation:1.0"; public static final String DRAW_NS = - "urn:oasis:names:tc:opendocument:xmlns:drawing:1.0"; + "urn:oasis:names:tc:opendocument:xmlns:drawing:1.0"; public static final String XLINK_NS = "http://www.w3.org/1999/xlink"; - protected static final char[] TAB = new char[]{'\t'}; + protected static final char[] TAB = new char[] { '\t' }; private static final Attributes EMPTY_ATTRIBUTES = new AttributesImpl(); /** * Mappings between ODF tag names and XHTML tag names * (including attributes). All other tag names/attributes are ignored - * and left out from event stream. + * and left out from event stream. 
*/ private static final HashMap MAPPINGS = - new HashMap(); + new HashMap(); static { // general mappings of text:-tags @@ -400,6 +244,9 @@ new QName(TEXT_NS, "line-break"), new TargetElement(XHTML, "br")); MAPPINGS.put( + new QName(TEXT_NS, "list"), + new TargetElement(XHTML, "ul")); + MAPPINGS.put( new QName(TEXT_NS, "list-item"), new TargetElement(XHTML, "li")); MAPPINGS.put( @@ -426,9 +273,9 @@ MAPPINGS.put( new QName(TEXT_NS, "span"), new TargetElement(XHTML, "span")); - - final HashMap aAttsMapping = - new HashMap(); + + final HashMap aAttsMapping = + new HashMap(); aAttsMapping.put( new QName(XLINK_NS, "href"), new QName("href")); @@ -448,8 +295,8 @@ new QName(TABLE_NS, "table-row"), new TargetElement(XHTML, "tr")); // special mapping for rowspan/colspan attributes - final HashMap tableCellAttsMapping = - new HashMap(); + final HashMap tableCellAttsMapping = + new HashMap(); tableCellAttsMapping.put( new QName(TABLE_NS, "number-columns-spanned"), new QName("colspan")); @@ -479,8 +326,8 @@ Metadata metadata, ParseContext context) throws IOException, SAXException, TikaException { parseInternal(stream, - new XHTMLContentHandler(handler, metadata), - metadata, context); + new XHTMLContentHandler(handler,metadata), + metadata, context); } void parseInternal( @@ -496,7 +343,7 @@ factory.setNamespaceAware(true); try { factory.setFeature(XMLConstants.FEATURE_SECURE_PROCESSING, true); - } catch (SAXNotRecognizedException e) { + } catch (SAXNotRecognizedException e){ // TIKA-329: Some XML parsers do not support the secure-processing // feature, even though it's required by JAXP in Java 5. 
Ignoring // the exception is fine here, deployments without this feature @@ -513,3 +360,4 @@ } } + \ No newline at end of file diff --git a/tika-parsers/src/main/java/org/apache/tika/parser/odf/OpenDocumentMetaParser.java b/tika-parsers/src/main/java/org/apache/tika/parser/odf/OpenDocumentMetaParser.java index 14b9674..776775b 100644 --- a/tika-parsers/src/main/java/org/apache/tika/parser/odf/OpenDocumentMetaParser.java +++ b/tika-parsers/src/main/java/org/apache/tika/parser/odf/OpenDocumentMetaParser.java @@ -50,33 +50,33 @@ * Serial version UID */ private static final long serialVersionUID = -8739250869531737584L; - - private static final String META_NS = "urn:oasis:names:tc:opendocument:xmlns:meta:1.0"; + + private static final String META_NS = "urn:oasis:names:tc:opendocument:xmlns:meta:1.0"; private static final XPathParser META_XPATH = new XPathParser("meta", META_NS); - - /** - * @see OfficeOpenXMLCore#SUBJECT + + /** + * @see OfficeOpenXMLCore#SUBJECT * @deprecated use OfficeOpenXMLCore#SUBJECT */ @Deprecated - private static final Property TRANSITION_INITIAL_CREATOR_TO_INITIAL_AUTHOR = - Property.composite(Office.INITIAL_AUTHOR, - new Property[]{Property.externalText("initial-creator")}); - + private static final Property TRANSITION_INITIAL_CREATOR_TO_INITIAL_AUTHOR = + Property.composite(Office.INITIAL_AUTHOR, + new Property[] { Property.externalText("initial-creator") }); + private static ContentHandler getDublinCoreHandler( Metadata metadata, Property property, String element) { return new ElementMetadataHandler( DublinCore.NAMESPACE_URI_DC, element, metadata, property); } - + private static ContentHandler getMeta( ContentHandler ch, Metadata md, Property property, String element) { Matcher matcher = new CompositeMatcher( META_XPATH.parse("//meta:" + element), META_XPATH.parse("//meta:" + element + "//text()")); ContentHandler branch = - new MatchingContentHandler(new MetadataHandler(md, property), matcher); + new MatchingContentHandler(new 
MetadataHandler(md, property), matcher); return new TeeContentHandler(ch, branch); } @@ -87,29 +87,27 @@ META_XPATH.parse("//meta:user-defined//text()")); // eg Text1 becomes custom:Info1=Text1 ContentHandler branch = new MatchingContentHandler( - new AttributeDependantMetadataHandler(md, "meta:name", Metadata.USER_DEFINED_METADATA_NAME_PREFIX), - matcher); + new AttributeDependantMetadataHandler(md, "meta:name", Metadata.USER_DEFINED_METADATA_NAME_PREFIX), + matcher); return new TeeContentHandler(ch, branch); } - @Deprecated - private static ContentHandler getStatistic( + @Deprecated private static ContentHandler getStatistic( ContentHandler ch, Metadata md, String name, String attribute) { Matcher matcher = - META_XPATH.parse("//meta:document-statistic/@meta:" + attribute); + META_XPATH.parse("//meta:document-statistic/@meta:"+attribute); ContentHandler branch = new MatchingContentHandler( - new AttributeMetadataHandler(META_NS, attribute, md, name), matcher); + new AttributeMetadataHandler(META_NS, attribute, md, name), matcher); return new TeeContentHandler(ch, branch); } - private static ContentHandler getStatistic( - ContentHandler ch, Metadata md, Property property, String attribute) { - Matcher matcher = - META_XPATH.parse("//meta:document-statistic/@meta:" + attribute); - ContentHandler branch = new MatchingContentHandler( - new AttributeMetadataHandler(META_NS, attribute, md, property), matcher); - return new TeeContentHandler(ch, branch); - } + ContentHandler ch, Metadata md, Property property, String attribute) { + Matcher matcher = + META_XPATH.parse("//meta:document-statistic/@meta:"+attribute); + ContentHandler branch = new MatchingContentHandler( + new AttributeMetadataHandler(META_NS, attribute, md, property), matcher); + return new TeeContentHandler(ch, branch); + } protected ContentHandler getContentHandler(ContentHandler ch, Metadata md, ParseContext context) { // We can no longer extend DcXMLParser due to the handling of dc:subject and dc:date 
@@ -125,48 +123,48 @@ getDublinCoreHandler(md, TikaCoreProperties.IDENTIFIER, "identifier"), getDublinCoreHandler(md, TikaCoreProperties.LANGUAGE, "language"), getDublinCoreHandler(md, TikaCoreProperties.RIGHTS, "rights")); - + // Process the OO Meta Attributes ch = getMeta(ch, md, TikaCoreProperties.CREATED, "creation-date"); // ODF uses dc:date for modified ch = new TeeContentHandler(ch, new ElementMetadataHandler( DublinCore.NAMESPACE_URI_DC, "date", md, TikaCoreProperties.MODIFIED)); - + // ODF uses dc:subject for description ch = new TeeContentHandler(ch, new ElementMetadataHandler( DublinCore.NAMESPACE_URI_DC, "subject", md, TikaCoreProperties.TRANSITION_SUBJECT_TO_OO_SUBJECT)); ch = getMeta(ch, md, TikaCoreProperties.TRANSITION_KEYWORDS_TO_DC_SUBJECT, "keyword"); - - ch = getMeta(ch, md, Property.externalText(MSOffice.EDIT_TIME), "editing-duration"); + + ch = getMeta(ch, md, Property.externalText(MSOffice.EDIT_TIME), "editing-duration"); ch = getMeta(ch, md, Property.externalText("editing-cycles"), "editing-cycles"); ch = getMeta(ch, md, TRANSITION_INITIAL_CREATOR_TO_INITIAL_AUTHOR, "initial-creator"); ch = getMeta(ch, md, Property.externalText("generator"), "generator"); - + // Process the user defined Meta Attributes ch = getUserDefined(ch, md); - + // Process the OO Statistics Attributes - ch = getStatistic(ch, md, Office.OBJECT_COUNT, "object-count"); - ch = getStatistic(ch, md, Office.IMAGE_COUNT, "image-count"); - ch = getStatistic(ch, md, Office.PAGE_COUNT, "page-count"); - ch = getStatistic(ch, md, PagedText.N_PAGES, "page-count"); - ch = getStatistic(ch, md, Office.TABLE_COUNT, "table-count"); + ch = getStatistic(ch, md, Office.OBJECT_COUNT, "object-count"); + ch = getStatistic(ch, md, Office.IMAGE_COUNT, "image-count"); + ch = getStatistic(ch, md, Office.PAGE_COUNT, "page-count"); + ch = getStatistic(ch, md, PagedText.N_PAGES, "page-count"); + ch = getStatistic(ch, md, Office.TABLE_COUNT, "table-count"); ch = getStatistic(ch, md, 
Office.PARAGRAPH_COUNT, "paragraph-count"); - ch = getStatistic(ch, md, Office.WORD_COUNT, "word-count"); + ch = getStatistic(ch, md, Office.WORD_COUNT, "word-count"); ch = getStatistic(ch, md, Office.CHARACTER_COUNT, "character-count"); - + // Legacy, Tika-1.0 style attributes // TODO Remove these in Tika 2.0 - ch = getStatistic(ch, md, MSOffice.OBJECT_COUNT, "object-count"); - ch = getStatistic(ch, md, MSOffice.IMAGE_COUNT, "image-count"); - ch = getStatistic(ch, md, MSOffice.PAGE_COUNT, "page-count"); - ch = getStatistic(ch, md, MSOffice.TABLE_COUNT, "table-count"); + ch = getStatistic(ch, md, MSOffice.OBJECT_COUNT, "object-count"); + ch = getStatistic(ch, md, MSOffice.IMAGE_COUNT, "image-count"); + ch = getStatistic(ch, md, MSOffice.PAGE_COUNT, "page-count"); + ch = getStatistic(ch, md, MSOffice.TABLE_COUNT, "table-count"); ch = getStatistic(ch, md, MSOffice.PARAGRAPH_COUNT, "paragraph-count"); - ch = getStatistic(ch, md, MSOffice.WORD_COUNT, "word-count"); + ch = getStatistic(ch, md, MSOffice.WORD_COUNT, "word-count"); ch = getStatistic(ch, md, MSOffice.CHARACTER_COUNT, "character-count"); - + // Legacy Statistics Attributes, replaced with real keys above // TODO Remove these shortly, eg after Tika 1.1 (TIKA-770) ch = getStatistic(ch, md, "nbPage", "page-count"); @@ -176,12 +174,12 @@ ch = getStatistic(ch, md, "nbTab", "table-count"); ch = getStatistic(ch, md, "nbObject", "object-count"); ch = getStatistic(ch, md, "nbImg", "image-count"); - + // Normalise the rest ch = new NSNormalizerContentHandler(ch); return ch; } - + @Override public void parse( InputStream stream, ContentHandler handler, @@ -190,10 +188,10 @@ super.parse(stream, handler, metadata, context); // Copy subject to description for OO2 String odfSubject = metadata.get(OfficeOpenXMLCore.SUBJECT); - if (odfSubject != null && !odfSubject.equals("") && + if (odfSubject != null && !odfSubject.equals("") && (metadata.get(TikaCoreProperties.DESCRIPTION) == null || 
metadata.get(TikaCoreProperties.DESCRIPTION).equals(""))) { metadata.set(TikaCoreProperties.DESCRIPTION, odfSubject); } } - + } diff --git a/tika-parsers/src/main/java/org/apache/tika/parser/odf/OpenDocumentParser.java b/tika-parsers/src/main/java/org/apache/tika/parser/odf/OpenDocumentParser.java index 65361f0..b215448 100644 --- a/tika-parsers/src/main/java/org/apache/tika/parser/odf/OpenDocumentParser.java +++ b/tika-parsers/src/main/java/org/apache/tika/parser/odf/OpenDocumentParser.java @@ -20,16 +20,15 @@ import java.io.InputStream; import java.util.Arrays; import java.util.Collections; -import java.util.Enumeration; import java.util.HashSet; import java.util.Set; import java.util.zip.ZipEntry; -import java.util.zip.ZipFile; import java.util.zip.ZipInputStream; -import org.apache.commons.io.IOUtils; +//import org.apache.commons.compress.archivers.zip.ZipArchiveEntry; +//import org.apache.commons.compress.archivers.zip.ZipFile; import org.apache.tika.exception.TikaException; -import org.apache.tika.io.TikaInputStream; +import org.apache.tika.io.IOUtils; import org.apache.tika.metadata.Metadata; import org.apache.tika.mime.MediaType; import org.apache.tika.parser.AbstractParser; @@ -41,55 +40,49 @@ import org.xml.sax.SAXException; import org.xml.sax.helpers.DefaultHandler; -import static java.nio.charset.StandardCharsets.UTF_8; - /** * OpenOffice parser */ public class OpenDocumentParser extends AbstractParser { - /** - * Serial version UID - */ + /** Serial version UID */ private static final long serialVersionUID = -6410276875438618287L; private static final Set SUPPORTED_TYPES = - Collections.unmodifiableSet(new HashSet(Arrays.asList( - MediaType.application("vnd.sun.xml.writer"), - MediaType.application("vnd.oasis.opendocument.text"), - MediaType.application("vnd.oasis.opendocument.graphics"), - MediaType.application("vnd.oasis.opendocument.presentation"), - MediaType.application("vnd.oasis.opendocument.spreadsheet"), - 
MediaType.application("vnd.oasis.opendocument.chart"), - MediaType.application("vnd.oasis.opendocument.image"), - MediaType.application("vnd.oasis.opendocument.formula"), - MediaType.application("vnd.oasis.opendocument.text-master"), - MediaType.application("vnd.oasis.opendocument.text-web"), - MediaType.application("vnd.oasis.opendocument.text-template"), - MediaType.application("vnd.oasis.opendocument.graphics-template"), - MediaType.application("vnd.oasis.opendocument.presentation-template"), - MediaType.application("vnd.oasis.opendocument.spreadsheet-template"), - MediaType.application("vnd.oasis.opendocument.chart-template"), - MediaType.application("vnd.oasis.opendocument.image-template"), - MediaType.application("vnd.oasis.opendocument.formula-template"), - MediaType.application("x-vnd.oasis.opendocument.text"), - MediaType.application("x-vnd.oasis.opendocument.graphics"), - MediaType.application("x-vnd.oasis.opendocument.presentation"), - MediaType.application("x-vnd.oasis.opendocument.spreadsheet"), - MediaType.application("x-vnd.oasis.opendocument.chart"), - MediaType.application("x-vnd.oasis.opendocument.image"), - MediaType.application("x-vnd.oasis.opendocument.formula"), - MediaType.application("x-vnd.oasis.opendocument.text-master"), - MediaType.application("x-vnd.oasis.opendocument.text-web"), - MediaType.application("x-vnd.oasis.opendocument.text-template"), - MediaType.application("x-vnd.oasis.opendocument.graphics-template"), - MediaType.application("x-vnd.oasis.opendocument.presentation-template"), - MediaType.application("x-vnd.oasis.opendocument.spreadsheet-template"), - MediaType.application("x-vnd.oasis.opendocument.chart-template"), - MediaType.application("x-vnd.oasis.opendocument.image-template"), - MediaType.application("x-vnd.oasis.opendocument.formula-template")))); - - private static final String META_NAME = "meta.xml"; + Collections.unmodifiableSet(new HashSet(Arrays.asList( + MediaType.application("vnd.sun.xml.writer"), + 
MediaType.application("vnd.oasis.opendocument.text"), + MediaType.application("vnd.oasis.opendocument.graphics"), + MediaType.application("vnd.oasis.opendocument.presentation"), + MediaType.application("vnd.oasis.opendocument.spreadsheet"), + MediaType.application("vnd.oasis.opendocument.chart"), + MediaType.application("vnd.oasis.opendocument.image"), + MediaType.application("vnd.oasis.opendocument.formula"), + MediaType.application("vnd.oasis.opendocument.text-master"), + MediaType.application("vnd.oasis.opendocument.text-web"), + MediaType.application("vnd.oasis.opendocument.text-template"), + MediaType.application("vnd.oasis.opendocument.graphics-template"), + MediaType.application("vnd.oasis.opendocument.presentation-template"), + MediaType.application("vnd.oasis.opendocument.spreadsheet-template"), + MediaType.application("vnd.oasis.opendocument.chart-template"), + MediaType.application("vnd.oasis.opendocument.image-template"), + MediaType.application("vnd.oasis.opendocument.formula-template"), + MediaType.application("x-vnd.oasis.opendocument.text"), + MediaType.application("x-vnd.oasis.opendocument.graphics"), + MediaType.application("x-vnd.oasis.opendocument.presentation"), + MediaType.application("x-vnd.oasis.opendocument.spreadsheet"), + MediaType.application("x-vnd.oasis.opendocument.chart"), + MediaType.application("x-vnd.oasis.opendocument.image"), + MediaType.application("x-vnd.oasis.opendocument.formula"), + MediaType.application("x-vnd.oasis.opendocument.text-master"), + MediaType.application("x-vnd.oasis.opendocument.text-web"), + MediaType.application("x-vnd.oasis.opendocument.text-template"), + MediaType.application("x-vnd.oasis.opendocument.graphics-template"), + MediaType.application("x-vnd.oasis.opendocument.presentation-template"), + MediaType.application("x-vnd.oasis.opendocument.spreadsheet-template"), + MediaType.application("x-vnd.oasis.opendocument.chart-template"), + MediaType.application("x-vnd.oasis.opendocument.image-template"), + 
MediaType.application("x-vnd.oasis.opendocument.formula-template")))); private Parser meta = new OpenDocumentMetaParser(); @@ -120,86 +113,67 @@ Metadata metadata, ParseContext context) throws IOException, SAXException, TikaException { - // Open the Zip stream - // Use a File if we can, and an already open zip is even better - ZipFile zipFile = null; - ZipInputStream zipStream = null; + // TODO: reuse the already opened ZIPFile, if + // present + + /* + ZipFile zipFile; if (stream instanceof TikaInputStream) { TikaInputStream tis = (TikaInputStream) stream; Object container = ((TikaInputStream) stream).getOpenContainer(); if (container instanceof ZipFile) { zipFile = (ZipFile) container; } else if (tis.hasFile()) { - zipFile = new ZipFile(tis.getFile()); - } else { - zipStream = new ZipInputStream(stream); + zipFile = new ZipFile(tis.getFile()); } - } else { - zipStream = new ZipInputStream(stream); } + */ - // Prepare to handle the content + // TODO: if incoming IS is a TIS with a file + // associated, we should open ZipFile so we can + // visit metadata, mimetype first; today we lose + // all the metadata if meta.xml is hit after + // content.xml in the stream. Then we can still + // read-once for the content.xml. 
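The TODO above notes that when the package arrives as a plain stream it can only be read once, so zip entry order dictates what is seen first. A minimal, standalone sketch of pulling the `mimetype` entry out of such a single pass (class and method names are illustrative, not Tika API; an ODF package stores `mimetype` uncompressed as its first entry):

```java
import java.io.ByteArrayOutputStream;
import java.io.IOException;
import java.io.InputStream;
import java.util.zip.ZipEntry;
import java.util.zip.ZipInputStream;

/** Single-pass walk of an ODF zip stream, returning the declared media type. */
public class OdfMimeTypeSketch {

    public static String readMimeType(InputStream odfStream) throws IOException {
        try (ZipInputStream zip = new ZipInputStream(odfStream)) {
            ZipEntry entry;
            while ((entry = zip.getNextEntry()) != null) {
                if ("mimetype".equals(entry.getName())) {
                    // Read the small entry fully; it holds the media type string.
                    ByteArrayOutputStream buffer = new ByteArrayOutputStream();
                    byte[] chunk = new byte[1024];
                    int n;
                    while ((n = zip.read(chunk)) != -1) {
                        buffer.write(chunk, 0, n);
                    }
                    return buffer.toString("UTF-8");
                }
            }
        }
        return null; // no mimetype entry found
    }
}
```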
+ XHTMLContentHandler xhtml = new XHTMLContentHandler(baseHandler, metadata); // As we don't know which of the metadata or the content // we'll hit first, catch the endDocument call initially - EndDocumentShieldingContentHandler handler = - new EndDocumentShieldingContentHandler(xhtml); + EndDocumentShieldingContentHandler handler = + new EndDocumentShieldingContentHandler(xhtml); - // If we can, process the metadata first, then the - // rest of the file afterwards - // Only possible to guarantee that when opened from a file not a stream - ZipEntry entry = null; - if (zipFile != null) { - entry = zipFile.getEntry(META_NAME); - handleZipEntry(entry, zipFile.getInputStream(entry), metadata, context, handler); - - Enumeration entries = zipFile.entries(); - while (entries.hasMoreElements()) { - entry = entries.nextElement(); - if (!META_NAME.equals(entry.getName())) { - handleZipEntry(entry, zipFile.getInputStream(entry), metadata, context, handler); + // Process the file in turn + ZipInputStream zip = new ZipInputStream(stream); + ZipEntry entry = zip.getNextEntry(); + while (entry != null) { + if (entry.getName().equals("mimetype")) { + String type = IOUtils.toString(zip, "UTF-8"); + metadata.set(Metadata.CONTENT_TYPE, type); + } else if (entry.getName().equals("meta.xml")) { + meta.parse(zip, new DefaultHandler(), metadata, context); + } else if (entry.getName().endsWith("content.xml")) { + if (content instanceof OpenDocumentContentParser) { + ((OpenDocumentContentParser) content).parseInternal(zip, handler, metadata, context); + } else { + // Foreign content parser was set: + content.parse(zip, handler, metadata, context); + } + } else if (entry.getName().endsWith("styles.xml")) { + if (content instanceof OpenDocumentContentParser) { + ((OpenDocumentContentParser) content).parseInternal(zip, handler, metadata, context); + } else { + // Foreign content parser was set: + content.parse(zip, handler, metadata, context); } } - zipFile.close(); - } else { - do { - entry 
= zipStream.getNextEntry(); - handleZipEntry(entry, zipStream, metadata, context, handler); - } while (entry != null); - zipStream.close(); + entry = zip.getNextEntry(); } - + // Only now call the end document - if (handler.getEndDocumentWasCalled()) { - handler.reallyEndDocument(); + if(handler.getEndDocumentWasCalled()) { + handler.reallyEndDocument(); } } - private void handleZipEntry(ZipEntry entry, InputStream zip, Metadata metadata, - ParseContext context, EndDocumentShieldingContentHandler handler) - throws IOException, SAXException, TikaException { - if (entry == null) return; - - if (entry.getName().equals("mimetype")) { - String type = IOUtils.toString(zip, UTF_8); - metadata.set(Metadata.CONTENT_TYPE, type); - } else if (entry.getName().equals(META_NAME)) { - meta.parse(zip, new DefaultHandler(), metadata, context); - } else if (entry.getName().endsWith("content.xml")) { - if (content instanceof OpenDocumentContentParser) { - ((OpenDocumentContentParser) content).parseInternal(zip, handler, metadata, context); - } else { - // Foreign content parser was set: - content.parse(zip, handler, metadata, context); - } - } else if (entry.getName().endsWith("styles.xml")) { - if (content instanceof OpenDocumentContentParser) { - ((OpenDocumentContentParser) content).parseInternal(zip, handler, metadata, context); - } else { - // Foreign content parser was set: - content.parse(zip, handler, metadata, context); - } - } - } } diff --git a/tika-parsers/src/main/java/org/apache/tika/parser/pdf/AccessChecker.java b/tika-parsers/src/main/java/org/apache/tika/parser/pdf/AccessChecker.java deleted file mode 100644 index 775e590..0000000 --- a/tika-parsers/src/main/java/org/apache/tika/parser/pdf/AccessChecker.java +++ /dev/null @@ -1,81 +0,0 @@ -/* - * Licensed to the Apache Software Foundation (ASF) under one or more - * contributor license agreements. See the NOTICE file distributed with - * this work for additional information regarding copyright ownership. 
- * The ASF licenses this file to You under the Apache License, Version 2.0 - * (the "License"); you may not use this file except in compliance with - * the License. You may obtain a copy of the License at - * - * http://www.apache.org/licenses/LICENSE-2.0 - * - * Unless required by applicable law or agreed to in writing, software - * distributed under the License is distributed on an "AS IS" BASIS, - * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. - * See the License for the specific language governing permissions and - * limitations under the License. - */ - -package org.apache.tika.parser.pdf; - -import java.io.Serializable; - -import org.apache.tika.exception.AccessPermissionException; -import org.apache.tika.metadata.AccessPermissions; -import org.apache.tika.metadata.Metadata; - -/** - * Checks whether or not a document allows extraction generally - * or extraction for accessibility only. - */ -public class AccessChecker implements Serializable { - - private static final long serialVersionUID = 6492570218190936986L; - - private final boolean needToCheck; - private final boolean allowAccessibility; - - /** - * This constructs an {@link AccessChecker} that - * will not perform any checking and will always return without - * throwing an exception. - *

    - * This constructor is available to allow for Tika's legacy ( <= v1.7) behavior. - */ - public AccessChecker() { - needToCheck = false; - allowAccessibility = true; - } - - /** - * This constructs an {@link AccessChecker} that will check - * for whether or not content should be extracted from a document. - * - * @param allowExtractionForAccessibility if general extraction is not allowed, is extraction for accessibility allowed - */ - public AccessChecker(boolean allowExtractionForAccessibility) { - needToCheck = true; - this.allowAccessibility = allowExtractionForAccessibility; - } - - /** - * Checks to see if a document's content should be extracted based - * on metadata values and the value of {@link #allowAccessibility} in the constructor. - * - * @param metadata - * @throws AccessPermissionException if access is not permitted - */ - public void check(Metadata metadata) throws AccessPermissionException { - if (!needToCheck) { - return; - } - if ("false".equals(metadata.get(AccessPermissions.EXTRACT_CONTENT))) { - if (allowAccessibility) { - if ("true".equals(metadata.get(AccessPermissions.EXTRACT_FOR_ACCESSIBILITY))) { - return; - } - throw new AccessPermissionException("Content extraction for accessibility is not allowed."); - } - throw new AccessPermissionException("Content extraction is not allowed."); - } - } -} diff --git a/tika-parsers/src/main/java/org/apache/tika/parser/pdf/PDF2XHTML.java b/tika-parsers/src/main/java/org/apache/tika/parser/pdf/PDF2XHTML.java index 1ffe60c..b6a18ca 100644 --- a/tika-parsers/src/main/java/org/apache/tika/parser/pdf/PDF2XHTML.java +++ b/tika-parsers/src/main/java/org/apache/tika/parser/pdf/PDF2XHTML.java @@ -16,44 +16,27 @@ */ package org.apache.tika.parser.pdf; -import java.io.ByteArrayInputStream; -import java.io.ByteArrayOutputStream; import java.io.IOException; import java.io.Writer; import java.text.SimpleDateFormat; import java.util.Calendar; -import java.util.HashMap; -import java.util.HashSet; +import 
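The deleted `AccessChecker.check()` above encodes a small decision: extraction proceeds when the document allows it outright, or when only accessibility extraction is permitted and the checker was constructed to allow that fallback; otherwise an `AccessPermissionException` is thrown. A hypothetical boolean restatement of that decision (names are illustrative):

```java
/** Standalone restatement of the AccessChecker permission decision. */
public class AccessDecisionSketch {

    /**
     * True when content may be extracted; false in the cases where
     * AccessChecker.check() would throw an AccessPermissionException.
     */
    public static boolean mayExtract(boolean extractContentAllowed,
                                     boolean allowAccessibility,
                                     boolean extractForAccessibilityAllowed) {
        if (extractContentAllowed) {
            return true; // general extraction permitted
        }
        if (allowAccessibility) {
            // fall back to the accessibility-only permission bit
            return extractForAccessibilityAllowed;
        }
        return false;
    }
}
```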
java.util.Iterator; import java.util.List; import java.util.ListIterator; -import java.util.Locale; import java.util.Map; -import java.util.Set; import java.util.TreeMap; -import org.apache.commons.io.IOExceptionWithCause; -import org.apache.commons.io.IOUtils; -import org.apache.pdfbox.cos.COSBase; import org.apache.pdfbox.pdmodel.PDDocument; import org.apache.pdfbox.pdmodel.PDDocumentCatalog; import org.apache.pdfbox.pdmodel.PDDocumentNameDictionary; import org.apache.pdfbox.pdmodel.PDEmbeddedFilesNameTreeNode; import org.apache.pdfbox.pdmodel.PDPage; -import org.apache.pdfbox.pdmodel.PDResources; import org.apache.pdfbox.pdmodel.common.COSObjectable; import org.apache.pdfbox.pdmodel.common.PDNameTreeNode; import org.apache.pdfbox.pdmodel.common.filespecification.PDComplexFileSpecification; import org.apache.pdfbox.pdmodel.common.filespecification.PDEmbeddedFile; -import org.apache.pdfbox.pdmodel.graphics.xobject.PDCcitt; -import org.apache.pdfbox.pdmodel.graphics.xobject.PDJpeg; -import org.apache.pdfbox.pdmodel.graphics.xobject.PDPixelMap; -import org.apache.pdfbox.pdmodel.graphics.xobject.PDXObject; -import org.apache.pdfbox.pdmodel.graphics.xobject.PDXObjectForm; -import org.apache.pdfbox.pdmodel.graphics.xobject.PDXObjectImage; import org.apache.pdfbox.pdmodel.interactive.action.type.PDAction; import org.apache.pdfbox.pdmodel.interactive.action.type.PDActionURI; -import org.apache.pdfbox.pdmodel.interactive.annotation.PDAnnotation; -import org.apache.pdfbox.pdmodel.interactive.annotation.PDAnnotationFileAttachment; import org.apache.pdfbox.pdmodel.interactive.annotation.PDAnnotationLink; import org.apache.pdfbox.pdmodel.interactive.annotation.PDAnnotationMarkup; import org.apache.pdfbox.pdmodel.interactive.digitalsignature.PDSignature; @@ -68,9 +51,9 @@ import org.apache.tika.exception.TikaException; import org.apache.tika.extractor.EmbeddedDocumentExtractor; import org.apache.tika.extractor.ParsingEmbeddedDocumentExtractor; +import 
org.apache.tika.io.IOExceptionWithCause; import org.apache.tika.io.TikaInputStream; import org.apache.tika.metadata.Metadata; -import org.apache.tika.metadata.TikaCoreProperties; import org.apache.tika.parser.ParseContext; import org.apache.tika.sax.EmbeddedContentHandler; import org.apache.tika.sax.XHTMLContentHandler; @@ -84,36 +67,70 @@ * stream. */ class PDF2XHTML extends PDFTextStripper { - + + /** + * format used for signature dates + */ + private final SimpleDateFormat dateFormat = new SimpleDateFormat("yyyy-MM-dd'T'HH:mm:ssZ"); + /** * Maximum recursive depth during AcroForm processing. - * Prevents theoretical AcroForm recursion bomb. + * Prevents theoretical AcroForm recursion bomb. */ private final static int MAX_ACROFORM_RECURSIONS = 10; + + + // TODO: remove once PDFBOX-1130 is fixed: + private boolean inParagraph = false; + /** - * Format used for signature dates - * TODO Make this thread-safe + * Converts the given PDF document (and related metadata) to a stream + * of XHTML SAX events sent to the given content handler. + * + * @param document PDF document + * @param handler SAX content handler + * @param metadata PDF metadata + * @throws SAXException if the content handler fails to process SAX events + * @throws TikaException if the PDF document can not be processed */ - private final SimpleDateFormat dateFormat = new SimpleDateFormat("yyyy-MM-dd'T'HH:mm:ssZ", Locale.ROOT); + public static void process( + PDDocument document, ContentHandler handler, ParseContext context, Metadata metadata, + PDFParserConfig config) + throws SAXException, TikaException { + try { + // Extract text using a dummy Writer as we override the + // key methods to output to the given content + // handler. 
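The comment above describes the dummy-`Writer` idiom: `PDFTextStripper.writeText` demands a `Writer`, but because the overridden callbacks emit everything through the SAX handler, the Writer it is given simply swallows its input. A standalone sketch of such a Writer (the counter is added here only so the behavior is observable; the real no-op Writer tracks nothing):

```java
import java.io.Writer;

/** A Writer that discards everything, counting how much it swallowed. */
public class CountingNullWriter extends Writer {

    public long discarded = 0; // instrumentation for this sketch only

    @Override public void write(char[] cbuf, int off, int len) { discarded += len; }
    @Override public void flush() { }
    @Override public void close() { }
}
```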
+ PDF2XHTML pdf2XHTML = new PDF2XHTML(handler, context, metadata, config); + + pdf2XHTML.writeText(document, new Writer() { + @Override + public void write(char[] cbuf, int off, int len) { + } + @Override + public void flush() { + } + @Override + public void close() { + } + }); + + } catch (IOException e) { + if (e.getCause() instanceof SAXException) { + throw (SAXException) e.getCause(); + } else { + throw new TikaException("Unable to extract PDF content", e); + } + } + } + private final ContentHandler originalHandler; private final ParseContext context; private final XHTMLContentHandler handler; private final PDFParserConfig config; - /** - * This keeps track of the pdf object ids for inline - * images that have been processed. - * If {@link PDFParserConfig#getExtractUniqueInlineImagesOnly() - * is true, this will be checked before extracting an embedded image. - * The integer keeps track of the inlineImageCounter for that image. - * This integer is used to identify images in the markup. - * - * This is used across the document. To avoid infinite recursion - * TIKA-1742, we're limiting the export to one image per page. - */ - private Map processedInlineImages = new HashMap<>(); - private int inlineImageCounter = 0; - private PDF2XHTML(ContentHandler handler, ParseContext context, Metadata metadata, - PDFParserConfig config) + + private PDF2XHTML(ContentHandler handler, ParseContext context, Metadata metadata, + PDFParserConfig config) throws IOException { //source of config (derives from context or PDFParser?) is //already determined in PDFParser. No need to check context here. @@ -121,51 +138,17 @@ this.originalHandler = handler; this.context = context; this.handler = new XHTMLContentHandler(handler, metadata); - } - - /** - * Converts the given PDF document (and related metadata) to a stream - * of XHTML SAX events sent to the given content handler. 
- * - * @param document PDF document - * @param handler SAX content handler - * @param metadata PDF metadata - * @throws SAXException if the content handler fails to process SAX events - * @throws TikaException if the PDF document can not be processed - */ - public static void process( - PDDocument document, ContentHandler handler, ParseContext context, Metadata metadata, - PDFParserConfig config) - throws SAXException, TikaException { - try { - // Extract text using a dummy Writer as we override the - // key methods to output to the given content - // handler. - PDF2XHTML pdf2XHTML = new PDF2XHTML(handler, context, metadata, config); - - config.configure(pdf2XHTML); - - pdf2XHTML.writeText(document, new Writer() { - @Override - public void write(char[] cbuf, int off, int len) { - } - - @Override - public void flush() { - } - - @Override - public void close() { - } - }); - - } catch (IOException e) { - if (e.getCause() instanceof SAXException) { - throw (SAXException) e.getCause(); - } else { - throw new TikaException("Unable to extract PDF content", e); - } - } + setForceParsing(true); + setSortByPosition(config.getSortByPosition()); + if (config.getEnableAutoSpace()) { + setWordSeparator(" "); + } else { + setWordSeparator(""); + } + // TODO: maybe expose setting these too: + //setAverageCharTolerance(1.0f); + //setSpacingTolerance(1.0f); + setSuppressDuplicateOverlappingText(config.getSuppressDuplicateOverlappingText()); } void extractBookmarkText() throws SAXException { @@ -206,16 +189,16 @@ // Extract text for any bookmarks: extractBookmarkText(); extractEmbeddedDocuments(pdf, originalHandler); - + //extract acroform data at end of doc - if (config.getExtractAcroFormContent() == true) { + if (config.getExtractAcroFormContent() == true){ extractAcroForm(pdf, handler); - } + } handler.endDocument(); - } catch (TikaException e) { + } catch (TikaException e){ + throw new IOExceptionWithCause("Unable to end a document", e); + } catch (SAXException e) { throw new 
IOExceptionWithCause("Unable to end a document", e); - } catch (SAXException e) { - throw new IOExceptionWithCause("Unable to end a document", e); } } @@ -231,32 +214,17 @@ @Override protected void endPage(PDPage page) throws IOException { + try { writeParagraphEnd(); - - extractImages(page.getResources(), new HashSet()); - - EmbeddedDocumentExtractor extractor = getEmbeddedDocumentExtractor(); - for (PDAnnotation annotation : page.getAnnotations()) { - - if (annotation instanceof PDAnnotationFileAttachment) { - PDAnnotationFileAttachment fann = (PDAnnotationFileAttachment) annotation; - PDComplexFileSpecification fileSpec = (PDComplexFileSpecification) fann.getFile(); - try { - extractMultiOSPDEmbeddedFiles("", fileSpec, extractor); - } catch (SAXException e) { - throw new IOExceptionWithCause("file embedded in annotation sax exception", e); - } catch (TikaException e) { - throw new IOExceptionWithCause("file embedded in annotation tika exception", e); - } - } - // TODO: remove once PDFBOX-1143 is fixed: - if (config.getExtractAnnotationText()) { - if (annotation instanceof PDAnnotationLink) { - PDAnnotationLink annotationlink = (PDAnnotationLink) annotation; - if (annotationlink.getAction() != null) { + // TODO: remove once PDFBOX-1143 is fixed: + if (config.getExtractAnnotationText()) { + for(Object o : page.getAnnotations()) { + if( o instanceof PDAnnotationLink ) { + PDAnnotationLink annotationlink = (PDAnnotationLink) o; + if (annotationlink.getAction() != null) { PDAction action = annotationlink.getAction(); - if (action instanceof PDActionURI) { + if( action instanceof PDActionURI ) { PDActionURI uri = (PDActionURI) action; String link = uri.getURI(); if (link != null) { @@ -265,16 +233,16 @@ handler.endElement("a"); handler.endElement("div"); } - } + } } } - if (annotation instanceof PDAnnotationMarkup) { - PDAnnotationMarkup annotationMarkup = (PDAnnotationMarkup) annotation; - String title = annotationMarkup.getTitlePopup(); - String subject = 
annotationMarkup.getSubject(); - String contents = annotationMarkup.getContents(); - // TODO: maybe also annotationMarkup.getRichContents()? + if (o instanceof PDAnnotationMarkup) { + PDAnnotationMarkup annot = (PDAnnotationMarkup) o; + String title = annot.getTitlePopup(); + String subject = annot.getSubject(); + String contents = annot.getContents(); + // TODO: maybe also annot.getRichContents()? if (title != null || subject != null || contents != null) { handler.startElement("div", "class", "annotation"); @@ -301,115 +269,21 @@ } } } - handler.endElement("div"); } catch (SAXException e) { throw new IOExceptionWithCause("Unable to end a page", e); } - page.clear(); - } - - private void extractImages(PDResources resources, Set seenThisPage) throws SAXException { - if (resources == null || config.getExtractInlineImages() == false) { - return; - } - - Map xObjects = resources.getXObjects(); - if (xObjects == null) { - return; - } - - for (Map.Entry entry : xObjects.entrySet()) { - - PDXObject object = entry.getValue(); - if (object == null) { - continue; - } - COSBase cosObject = object.getCOSObject(); - if (seenThisPage.contains(cosObject)) { - //avoid infinite recursion TIKA-1742 - continue; - } - seenThisPage.add(cosObject); - - if (object instanceof PDXObjectForm) { - extractImages(((PDXObjectForm) object).getResources(), seenThisPage); - } else if (object instanceof PDXObjectImage) { - - PDXObjectImage image = (PDXObjectImage) object; - - Metadata metadata = new Metadata(); - String extension = ""; - if (image instanceof PDJpeg) { - metadata.set(Metadata.CONTENT_TYPE, "image/jpeg"); - extension = ".jpg"; - } else if (image instanceof PDCcitt) { - metadata.set(Metadata.CONTENT_TYPE, "image/tiff"); - extension = ".tif"; - } else if (image instanceof PDPixelMap) { - metadata.set(Metadata.CONTENT_TYPE, "image/png"); - extension = ".png"; - } - - Integer imageNumber = processedInlineImages.get(entry.getKey()); - if (imageNumber == null) { - imageNumber = 
inlineImageCounter++; - } - String fileName = "image" + imageNumber + extension; - metadata.set(Metadata.RESOURCE_NAME_KEY, fileName); - - // Output the img tag - AttributesImpl attr = new AttributesImpl(); - attr.addAttribute("", "src", "src", "CDATA", "embedded:" + fileName); - attr.addAttribute("", "alt", "alt", "CDATA", fileName); - handler.startElement("img", attr); - handler.endElement("img"); - - //Do we only want to process unique COSObject ids? - //If so, have we already processed this one? - if (config.getExtractUniqueInlineImagesOnly() == true) { - String cosObjectId = entry.getKey(); - if (processedInlineImages.containsKey(cosObjectId)) { - continue; - } - processedInlineImages.put(cosObjectId, imageNumber); - } - - metadata.set(TikaCoreProperties.EMBEDDED_RESOURCE_TYPE, - TikaCoreProperties.EmbeddedResourceType.INLINE.toString()); - - EmbeddedDocumentExtractor extractor = - getEmbeddedDocumentExtractor(); - if (extractor.shouldParseEmbedded(metadata)) { - ByteArrayOutputStream buffer = new ByteArrayOutputStream(); - try { - image.write2OutputStream(buffer); - image.clear(); - extractor.parseEmbedded( - new ByteArrayInputStream(buffer.toByteArray()), - new EmbeddedContentHandler(handler), - metadata, false); - } catch (IOException e) { - // could not extract this image, so just skip it... 
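The removed `extractImages` above keys its bookkeeping on the image's COS object id: the first sighting assigns a stable `imageN` name and triggers export, while later sightings reuse the name without re-exporting (the TIKA-1742 dedup). A self-contained sketch of that registry (hypothetical class, not Tika API):

```java
import java.util.HashMap;
import java.util.Map;

/** Document-wide registry mapping COS object ids to assigned image numbers. */
public class InlineImageRegistrySketch {

    private final Map<String, Integer> processed = new HashMap<>();
    private int counter = 0;

    /** True on the first sighting of this COS object id; false for duplicates. */
    public boolean shouldExport(String cosObjectId) {
        if (processed.containsKey(cosObjectId)) {
            return false; // already exported once
        }
        processed.put(cosObjectId, counter++);
        return true;
    }

    /** Stable file name for an id that shouldExport() has already registered. */
    public String fileNameFor(String cosObjectId, String extension) {
        return "image" + processed.get(cosObjectId) + extension;
    }
}
```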
- } - } - } - } - resources.clear(); - } - - protected EmbeddedDocumentExtractor getEmbeddedDocumentExtractor() { - EmbeddedDocumentExtractor extractor = - context.get(EmbeddedDocumentExtractor.class); - if (extractor == null) { - extractor = new ParsingEmbeddedDocumentExtractor(context); - } - return extractor; } @Override protected void writeParagraphStart() throws IOException { - super.writeParagraphStart(); + // TODO: remove once PDFBOX-1130 is fixed + if (inParagraph) { + // Close last paragraph + writeParagraphEnd(); + } + assert !inParagraph; + inParagraph = true; try { handler.startElement("p"); } catch (SAXException e) { @@ -419,7 +293,12 @@ @Override protected void writeParagraphEnd() throws IOException { - super.writeParagraphEnd(); + // TODO: remove once PDFBOX-1130 is fixed + if (!inParagraph) { + writeParagraphStart(); + } + assert inParagraph; + inParagraph = false; try { handler.endElement("p"); } catch (SAXException e) { @@ -466,18 +345,23 @@ "Unable to write a newline character", e); } } - + private void extractEmbeddedDocuments(PDDocument document, ContentHandler handler) throws IOException, SAXException, TikaException { PDDocumentCatalog catalog = document.getDocumentCatalog(); PDDocumentNameDictionary names = catalog.getNames(); - if (names == null) { + if (names == null){ return; } PDEmbeddedFilesNameTreeNode embeddedFiles = names.getEmbeddedFiles(); if (embeddedFiles == null) { return; + } + + EmbeddedDocumentExtractor embeddedExtractor = context.get(EmbeddedDocumentExtractor.class); + if (embeddedExtractor == null) { + embeddedExtractor = new ParsingEmbeddedDocumentExtractor(context); } Map embeddedFileNames = embeddedFiles.getNames(); @@ -485,91 +369,53 @@ //This code follows: pdfbox/examples/pdmodel/ExtractEmbeddedFiles.java //If there is a need we could add a fully recursive search to find a non-null //Map that contains the doc info. 
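The `writeParagraphStart`/`writeParagraphEnd` overrides above use an `inParagraph` flag to keep `<p>`/`</p>` balanced until PDFBOX-1130 is fixed: a second start closes the dangling paragraph first, and an end without a start synthesizes one. The same state machine in isolation (recorded event strings stand in for the SAX calls):

```java
import java.util.ArrayList;
import java.util.List;

/** Balances paragraph open/close events against out-of-order callbacks. */
public class ParagraphBalancerSketch {

    private boolean inParagraph = false;

    /** Recorded markup events; stands in for startElement/endElement("p"). */
    public final List<String> events = new ArrayList<>();

    public void paragraphStart() {
        if (inParagraph) {
            paragraphEnd();   // close the dangling paragraph first
        }
        inParagraph = true;
        events.add("<p>");
    }

    public void paragraphEnd() {
        if (!inParagraph) {
            paragraphStart(); // synthesize the missing start
        }
        inParagraph = false;
        events.add("</p>");
    }
}
```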
- if (embeddedFileNames != null) { - processEmbeddedDocNames(embeddedFileNames); + if (embeddedFileNames != null){ + processEmbeddedDocNames(embeddedFileNames, embeddedExtractor); } else { List kids = embeddedFiles.getKids(); - if (kids == null) { + if (kids == null){ return; } - for (PDNameTreeNode n : kids) { + for (PDNameTreeNode n : kids){ Map childNames = n.getNames(); - if (childNames != null) { - processEmbeddedDocNames(childNames); - } - } - } - } - - - private void processEmbeddedDocNames(Map embeddedFileNames) - throws IOException, SAXException, TikaException { - if (embeddedFileNames == null || embeddedFileNames.isEmpty()) { - return; - } - - EmbeddedDocumentExtractor extractor = getEmbeddedDocumentExtractor(); - for (Map.Entry ent : embeddedFileNames.entrySet()) { + if (childNames != null){ + processEmbeddedDocNames(childNames, embeddedExtractor); + } + } + } + } + + + private void processEmbeddedDocNames(Map embeddedFileNames, + EmbeddedDocumentExtractor embeddedExtractor) throws IOException, SAXException, TikaException { + if (embeddedFileNames == null){ + return; + } + for (Map.Entry ent : embeddedFileNames.entrySet()) { PDComplexFileSpecification spec = (PDComplexFileSpecification) ent.getValue(); - extractMultiOSPDEmbeddedFiles(ent.getKey(), spec, extractor); - } - } - - private void extractMultiOSPDEmbeddedFiles(String defaultName, - PDComplexFileSpecification spec, EmbeddedDocumentExtractor extractor) throws IOException, - SAXException, TikaException { - - if (spec == null) { - return; - } - //current strategy is to pull all, not just first non-null - extractPDEmbeddedFile(defaultName, spec.getFile(), spec.getEmbeddedFile(), extractor); - extractPDEmbeddedFile(defaultName, spec.getFileMac(), spec.getEmbeddedFileMac(), extractor); - extractPDEmbeddedFile(defaultName, spec.getFileDos(), spec.getEmbeddedFileDos(), extractor); - extractPDEmbeddedFile(defaultName, spec.getFileUnix(), spec.getEmbeddedFileUnix(), extractor); - } - - private void 
extractPDEmbeddedFile(String defaultName, String fileName, PDEmbeddedFile file, - EmbeddedDocumentExtractor extractor) - throws SAXException, IOException, TikaException { - - if (file == null) { - //skip silently - return; - } - - fileName = (fileName == null) ? defaultName : fileName; - - // TODO: other metadata? - Metadata metadata = new Metadata(); - metadata.set(Metadata.RESOURCE_NAME_KEY, fileName); - metadata.set(Metadata.CONTENT_TYPE, file.getSubtype()); - metadata.set(Metadata.CONTENT_LENGTH, Long.toString(file.getSize())); - metadata.set(TikaCoreProperties.EMBEDDED_RESOURCE_TYPE, - TikaCoreProperties.EmbeddedResourceType.ATTACHMENT.toString()); - - if (extractor.shouldParseEmbedded(metadata)) { - TikaInputStream stream = null; - try { - stream = TikaInputStream.get(file.createInputStream()); - extractor.parseEmbedded( - stream, - new EmbeddedContentHandler(handler), - metadata, false); - - AttributesImpl attributes = new AttributesImpl(); - attributes.addAttribute("", "class", "class", "CDATA", "embedded"); - attributes.addAttribute("", "id", "id", "CDATA", fileName); - handler.startElement("div", attributes); - handler.endElement("div"); - } finally { - IOUtils.closeQuietly(stream); - } - } - } - - private void extractAcroForm(PDDocument pdf, XHTMLContentHandler handler) throws IOException, - SAXException { + PDEmbeddedFile file = spec.getEmbeddedFile(); + + Metadata metadata = new Metadata(); + // TODO: other metadata? 
+ metadata.set(Metadata.RESOURCE_NAME_KEY, ent.getKey()); + metadata.set(Metadata.CONTENT_TYPE, file.getSubtype()); + metadata.set(Metadata.CONTENT_LENGTH, Long.toString(file.getSize())); + + if (embeddedExtractor.shouldParseEmbedded(metadata)) { + TikaInputStream stream = TikaInputStream.get(file.createInputStream()); + try { + embeddedExtractor.parseEmbedded( + stream, + new EmbeddedContentHandler(handler), + metadata, false); + } finally { + stream.close(); + } + } + } + } + private void extractAcroForm(PDDocument pdf, XHTMLContentHandler handler) throws IOException, + SAXException { //Thank you, Ben Litchfield, for org.apache.pdfbox.examples.fdf.PrintFields //this code derives from Ben's code PDDocumentCatalog catalog = pdf.getDocumentCatalog(); @@ -580,59 +426,63 @@ PDAcroForm form = catalog.getAcroForm(); if (form == null) return; - + @SuppressWarnings("rawtypes") List fields = form.getFields(); if (fields == null) - return; - + return; + @SuppressWarnings("rawtypes") - ListIterator itr = fields.listIterator(); + ListIterator itr = fields.listIterator(); if (itr == null) - return; + return; handler.startElement("div", "class", "acroform"); handler.startElement("ol"); - - while (itr.hasNext()) { - Object obj = itr.next(); - if (obj != null && obj instanceof PDField) { - processAcroField((PDField) obj, handler, 0); - } + while (itr.hasNext()){ + Object obj = itr.next(); + if (obj != null && obj instanceof PDField){ + processAcroField((PDField)obj, handler, 0); + } } handler.endElement("ol"); handler.endElement("div"); } - - private void processAcroField(PDField field, XHTMLContentHandler handler, final int currentRecursiveDepth) - throws SAXException, IOException { - - if (currentRecursiveDepth >= MAX_ACROFORM_RECURSIONS) { - return; - } - - addFieldString(field, handler); - - List kids = field.getKids(); - if (kids != null) { - - int r = currentRecursiveDepth + 1; - handler.startElement("ol"); - //TODO: can generate
      . Rework to avoid that. - for (COSObjectable pdfObj : kids) { - if (pdfObj != null && pdfObj instanceof PDField) { - PDField kid = (PDField) pdfObj; - //recurse - processAcroField(kid, handler, r); - } - } - handler.endElement("ol"); - } - } - - private void addFieldString(PDField field, XHTMLContentHandler handler) throws SAXException { + + private void processAcroField(PDField field, XHTMLContentHandler handler, final int recurseDepth) + throws SAXException, IOException { + + if (recurseDepth >= MAX_ACROFORM_RECURSIONS){ + return; + } + + addFieldString(field, handler); + + @SuppressWarnings("rawtypes") + List kids = field.getKids(); + if(kids != null){ + + @SuppressWarnings("rawtypes") + Iterator kidsIter = kids.iterator(); + if (kidsIter == null){ + return; + } + int r = recurseDepth+1; + handler.startElement("ol"); + while(kidsIter.hasNext()){ + Object pdfObj = kidsIter.next(); + if(pdfObj != null && pdfObj instanceof PDField){ + PDField kid = (PDField)pdfObj; + //recurse + processAcroField(kid, handler, r); + } + } + handler.endElement("ol"); + } + } + private void addFieldString(PDField field, XHTMLContentHandler handler) throws SAXException{ //Pick partial name to present in content and altName for attribute //Ignoring FullyQualifiedName for now String partName = field.getPartialName(); @@ -641,28 +491,28 @@ StringBuilder sb = new StringBuilder(); AttributesImpl attrs = new AttributesImpl(); - if (partName != null) { + if (partName != null){ sb.append(partName).append(": "); } - if (altName != null) { + if (altName != null){ attrs.addAttribute("", "altName", "altName", "CDATA", altName); } //return early if PDSignature field - if (field instanceof PDSignatureField) { - handleSignature(attrs, (PDSignatureField) field, handler); + if (field instanceof PDSignatureField){ + handleSignature(attrs, (PDSignatureField)field, handler); return; } try { //getValue can throw an IOException if there is no value String value = field.getValue(); - if (value != null 
&& !value.equals("null")) { + if (value != null && ! value.equals("null")){ sb.append(value); } } catch (IOException e) { //swallow } - if (attrs.getLength() > 0 || sb.length() > 0) { + if (attrs.getLength() > 0 || sb.length() > 0){ handler.startElement("li", attrs); handler.characters(sb.toString()); handler.endElement("li"); @@ -670,41 +520,41 @@ } private void handleSignature(AttributesImpl parentAttributes, PDSignatureField sigField, - XHTMLContentHandler handler) throws SAXException { - + XHTMLContentHandler handler) throws SAXException{ + PDSignature sig = sigField.getSignature(); - if (sig == null) { - return; - } - Map vals = new TreeMap(); + if (sig == null){ + return; + } + Map vals= new TreeMap(); vals.put("name", sig.getName()); vals.put("contactInfo", sig.getContactInfo()); vals.put("location", sig.getLocation()); vals.put("reason", sig.getReason()); Calendar cal = sig.getSignDate(); - if (cal != null) { + if (cal != null){ dateFormat.setTimeZone(cal.getTimeZone()); vals.put("date", dateFormat.format(cal.getTime())); } //see if there is any data int nonNull = 0; - for (String val : vals.keySet()) { - if (val != null && !val.equals("")) { + for (String val : vals.keySet()){ + if (val != null && ! 
val.equals("")){ nonNull++; } } //if there is, process it - if (nonNull > 0) { + if (nonNull > 0){ handler.startElement("li", parentAttributes); AttributesImpl attrs = new AttributesImpl(); attrs.addAttribute("", "type", "type", "CDATA", "signaturedata"); handler.startElement("ol", attrs); - for (Map.Entry e : vals.entrySet()) { - if (e.getValue() == null || e.getValue().equals("")) { + for (Map.Entry e : vals.entrySet()){ + if (e.getValue() == null || e.getValue().equals("")){ continue; } attrs = new AttributesImpl(); @@ -718,4 +568,3 @@ } } } - diff --git a/tika-parsers/src/main/java/org/apache/tika/parser/pdf/PDFEncodedStringDecoder.java b/tika-parsers/src/main/java/org/apache/tika/parser/pdf/PDFEncodedStringDecoder.java deleted file mode 100644 index 0d7e3ba..0000000 --- a/tika-parsers/src/main/java/org/apache/tika/parser/pdf/PDFEncodedStringDecoder.java +++ /dev/null @@ -1,117 +0,0 @@ -/* - * Licensed to the Apache Software Foundation (ASF) under one or more - * contributor license agreements. See the NOTICE file distributed with - * this work for additional information regarding copyright ownership. - * The ASF licenses this file to You under the Apache License, Version 2.0 - * (the "License"); you may not use this file except in compliance with - * the License. You may obtain a copy of the License at - * - * http://www.apache.org/licenses/LICENSE-2.0 - * - * Unless required by applicable law or agreed to in writing, software - * distributed under the License is distributed on an "AS IS" BASIS, - * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. - * See the License for the specific language governing permissions and - * limitations under the License. 
- */ - -package org.apache.tika.parser.pdf; - -import java.io.ByteArrayInputStream; -import java.io.IOException; -import java.io.InputStream; - -import org.apache.pdfbox.cos.COSString; -import org.apache.pdfbox.pdfparser.BaseParser; - -import static java.nio.charset.StandardCharsets.ISO_8859_1; - -/** - * In fairly rare cases, a PDF's XMP will contain a string that - * has incorrectly been encoded with PDFEncoding: an octal for non-ascii and - * ascii for ascii, e.g. "\376\377\000M\000i\000c\000r\000o\000s\000o\000f\000t\000" - *
      - * This class can be used to decode those strings. - *
      - * See TIKA-1678. Many thanks to Andrew Jackson for raising this issue - * and Tilman Hausherr for the solution. - *
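The octal-BOM decoding this javadoc describes can be illustrated outside PDFBox. The sketch below is a hypothetical, simplified stand-in: the real class (see the code that follows) delegates to PDFBox's COS string parser rather than decoding by hand, and this version only handles the three BOM prefixes listed:

```python
# Octal-escaped BOMs from the javadoc above: UTF-16BE, UTF-16LE, UTF-8
PDF_ENCODING_BOMS = ("\\376\\377", "\\377\\376", "\\357\\273\\277")

def should_decode(s):
    # Mirrors shouldDecode(): non-null, minimum length, octal BOM prefix
    return s is not None and len(s) >= 8 and s.startswith(PDF_ENCODING_BOMS)

def decode_pdf_encoded(value):
    # Turn mixed octal escapes ("\376") and literal ASCII into raw bytes,
    # then let the BOM pick the charset. Rough sketch of the idea only.
    out = bytearray()
    i = 0
    while i < len(value):
        if value[i] == "\\" and value[i + 1:i + 4].isdigit():
            out.append(int(value[i + 1:i + 4], 8))
            i += 4
        else:
            out.append(ord(value[i]))
            i += 1
    if out.startswith(b"\xfe\xff") or out.startswith(b"\xff\xfe"):
        return out.decode("utf-16")   # codec consumes the UTF-16 BOM
    if out.startswith(b"\xef\xbb\xbf"):
        return out.decode("utf-8-sig")
    return value                      # unrecognized prefix: return as-is

s = "\\376\\377\\000M\\000i\\000c\\000r\\000o\\000s\\000o\\000f\\000t"
assert should_decode(s)
assert decode_pdf_encoded(s) == "Microsoft"
```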
      - * As of this writing, we are only handling strings that start with - * an encoded BOM. Andrew Jackson found a handful of other examples (e.g. - * this ISO-8859-7 string: - * "Microsoft Word - \\323\\365\\354\\354\\345\\364\\357\\367\\336 - * \\364\\347\\362 PRAKSIS \\363\\364\\357") - * that we aren't currently handling. - */ -class PDFEncodedStringDecoder { - - private static final String[] PDF_ENCODING_BOMS = { - "\\376\\377", //UTF-16BE - "\\377\\376", //UTF-16LE - "\\357\\273\\277"//UTF-8 - }; - - /** - * Does this string contain an octal-encoded UTF BOM? - * Call this statically to determine if you should bother creating a new parser to parse it. - * @param s - * @return - */ - static boolean shouldDecode(String s) { - if (s == null || s.length() < 8) { - return false; - } - for (String BOM : PDF_ENCODING_BOMS) { - if (s.startsWith(BOM)) { - return true; - } - } - return false; - } - - /** - * This assumes that {@link #shouldDecode(String)} has been called - * and has returned true. If you run this on a non-octal encoded string, - * disaster will happen! - * - * @param value - * @return - */ - String decode(String value) { - try { - byte[] bytes = new String("(" + value + ")").getBytes(ISO_8859_1); - InputStream is = new ByteArrayInputStream(bytes); - COSStringParser p = new COSStringParser(is); - String parsed = p.myParseCOSString(); - if (parsed != null) { - return parsed; - } - } catch (IOException e) { - //oh well, we tried. - } - //just return value if something went wrong - return value; - } - - class COSStringParser extends BaseParser { - - COSStringParser(InputStream buffer) throws IOException { - super(buffer); - } - - /** - * - * @return parsed string or null if something went wrong. 
- */ - String myParseCOSString() { - try { - COSString cosString = parseCOSString(); - if (cosString != null) { - return cosString.getString(); - } - } catch (IOException e) { - } - return null; - } - } -} diff --git a/tika-parsers/src/main/java/org/apache/tika/parser/pdf/PDFParser.java b/tika-parsers/src/main/java/org/apache/tika/parser/pdf/PDFParser.java index 01bbc8a..3a40d49 100644 --- a/tika-parsers/src/main/java/org/apache/tika/parser/pdf/PDFParser.java +++ b/tika-parsers/src/main/java/org/apache/tika/parser/pdf/PDFParser.java @@ -22,32 +22,22 @@ import java.util.Calendar; import java.util.Collections; import java.util.List; -import java.util.Locale; import java.util.Set; -import org.apache.commons.io.input.CloseShieldInputStream; -import org.apache.jempbox.xmp.XMPSchema; -import org.apache.jempbox.xmp.XMPSchemaDublinCore; -import org.apache.jempbox.xmp.pdfa.XMPSchemaPDFAId; import org.apache.pdfbox.cos.COSArray; import org.apache.pdfbox.cos.COSBase; -import org.apache.pdfbox.cos.COSDictionary; import org.apache.pdfbox.cos.COSName; import org.apache.pdfbox.cos.COSString; -import org.apache.pdfbox.exceptions.CryptographyException; import org.apache.pdfbox.io.RandomAccess; import org.apache.pdfbox.io.RandomAccessBuffer; import org.apache.pdfbox.io.RandomAccessFile; import org.apache.pdfbox.pdmodel.PDDocument; import org.apache.pdfbox.pdmodel.PDDocumentInformation; -import org.apache.pdfbox.pdmodel.encryption.AccessPermission; -import org.apache.pdfbox.pdmodel.font.PDFont; -import org.apache.tika.exception.EncryptedDocumentException; import org.apache.tika.exception.TikaException; import org.apache.tika.extractor.EmbeddedDocumentExtractor; +import org.apache.tika.io.CloseShieldInputStream; import org.apache.tika.io.TemporaryResources; import org.apache.tika.io.TikaInputStream; -import org.apache.tika.metadata.AccessPermissions; import org.apache.tika.metadata.Metadata; import org.apache.tika.metadata.PagedText; import org.apache.tika.metadata.Property; @@ -61,7 
+51,7 @@ /** * PDF parser. - *
      + *
      * This parser can process also encrypted PDF documents if the required * password is given as a part of the input metadata associated with a * document. If no password is given, then this parser will try decrypting @@ -69,17 +59,13 @@ * the PDF contains any embedded documents (for example as part of a PDF * package) then this parser will use the {@link EmbeddedDocumentExtractor} * to handle them. - *
      - * As of Tika 1.6, it is possible to extract inline images with - * the {@link EmbeddedDocumentExtractor} as if they were regular - * attachments. By default, this feature is turned off because of - * the potentially enormous number and size of inline images. To - * turn this feature on, see - * {@link PDFParserConfig#setExtractInlineImages(boolean)}. */ public class PDFParser extends AbstractParser { - + /** Serial version UID */ + private static final long serialVersionUID = -752276948656079347L; + + private PDFParserConfig defaultConfig = new PDFParserConfig(); /** * Metadata key for giving the document password to the parser. * @@ -87,14 +73,9 @@ * @deprecated Supply a {@link PasswordProvider} on the {@link ParseContext} instead */ public static final String PASSWORD = "org.apache.tika.parser.pdf.password"; - private static final MediaType MEDIA_TYPE = MediaType.application("pdf"); - /** - * Serial version UID - */ - private static final long serialVersionUID = -752276948656079347L; + private static final Set SUPPORTED_TYPES = - Collections.singleton(MEDIA_TYPE); - private PDFParserConfig defaultConfig = new PDFParserConfig(); + Collections.singleton(MediaType.application("pdf")); public Set getSupportedTypes(ParseContext context) { return SUPPORTED_TYPES; @@ -104,141 +85,83 @@ InputStream stream, ContentHandler handler, Metadata metadata, ParseContext context) throws IOException, SAXException, TikaException { - + PDDocument pdfDocument = null; TemporaryResources tmp = new TemporaryResources(); //config from context, or default if not set via context PDFParserConfig localConfig = context.get(PDFParserConfig.class, defaultConfig); - String password = ""; try { // PDFBox can process entirely in memory, or can use a temp file // for unpacked / processed resources // Decide which to do based on if we're reading from a file or not already TikaInputStream tstream = TikaInputStream.cast(stream); - password = getPassword(metadata, context); if (tstream != null 
&& tstream.hasFile()) { // File based, take that as a cue to use a temporary file RandomAccess scratchFile = new RandomAccessFile(tmp.createTemporaryFile(), "rw"); - if (localConfig.getUseNonSequentialParser() == true) { - pdfDocument = PDDocument.loadNonSeq(new CloseShieldInputStream(stream), scratchFile, password); + if (localConfig.getUseNonSequentialParser() == true){ + pdfDocument = PDDocument.loadNonSeq(new CloseShieldInputStream(stream), scratchFile); } else { pdfDocument = PDDocument.load(new CloseShieldInputStream(stream), scratchFile, true); } } else { // Go for the normal, stream based in-memory parsing - if (localConfig.getUseNonSequentialParser() == true) { - pdfDocument = PDDocument.loadNonSeq(new CloseShieldInputStream(stream), new RandomAccessBuffer(), password); + if (localConfig.getUseNonSequentialParser() == true){ + pdfDocument = PDDocument.loadNonSeq(new CloseShieldInputStream(stream), new RandomAccessBuffer()); } else { pdfDocument = PDDocument.load(new CloseShieldInputStream(stream), true); } } - metadata.set("pdf:encrypted", Boolean.toString(pdfDocument.isEncrypted())); - - //if using the classic parser and the doc is encrypted, we must manually decrypt - if (!localConfig.getUseNonSequentialParser() && pdfDocument.isEncrypted()) { - pdfDocument.decrypt(password); - } - + + + if (pdfDocument.isEncrypted()) { + String password = null; + + // Did they supply a new style Password Provider? 
+ PasswordProvider passwordProvider = context.get(PasswordProvider.class); + if (passwordProvider != null) { + password = passwordProvider.getPassword(metadata); + } + + // Fall back on the old style metadata if set + if (password == null && metadata.get(PASSWORD) != null) { + password = metadata.get(PASSWORD); + } + + // If no password is given, use an empty string as the default + if (password == null) { + password = ""; + } + + try { + pdfDocument.decrypt(password); + } catch (Exception e) { + // Ignore + } + } metadata.set(Metadata.CONTENT_TYPE, "application/pdf"); extractMetadata(pdfDocument, metadata); - - AccessChecker checker = localConfig.getAccessChecker(); - checker.check(metadata); - if (handler != null) { - PDF2XHTML.process(pdfDocument, handler, context, metadata, localConfig); - } - - } catch (CryptographyException e) { - //seq parser throws CryptographyException for bad password - throw new EncryptedDocumentException(e); - } catch (IOException e) { - //nonseq parser throws IOException for bad password - //At the Tika level, we want the same exception to be thrown - if (e.getMessage() != null && - e.getMessage().contains("Error (CryptographyException)")) { - metadata.set("pdf:encrypted", Boolean.toString(true)); - throw new EncryptedDocumentException(e); - } - //rethrow any other IOExceptions - throw e; + PDF2XHTML.process(pdfDocument, handler, context, metadata, localConfig); + } finally { if (pdfDocument != null) { - pdfDocument.close(); + pdfDocument.close(); } tmp.dispose(); - //TODO: once we migrate to PDFBox 2.0, remove this (PDFBOX-2200) - PDFont.clearResources(); - } - } - - private String getPassword(Metadata metadata, ParseContext context) { - String password = null; - - // Did they supply a new style Password Provider? 
- PasswordProvider passwordProvider = context.get(PasswordProvider.class); - if (passwordProvider != null) { - password = passwordProvider.getPassword(metadata); - } - - // Fall back on the old style metadata if set - if (password == null && metadata.get(PASSWORD) != null) { - password = metadata.get(PASSWORD); - } - - // If no password is given, use an empty string as the default - if (password == null) { - password = ""; - } - return password; - } - + } + handler.endDocument(); + } + + private void extractMetadata(PDDocument document, Metadata metadata) throws TikaException { - - //first extract AccessPermissions - AccessPermission ap = document.getCurrentAccessPermission(); - metadata.set(AccessPermissions.EXTRACT_FOR_ACCESSIBILITY, - Boolean.toString(ap.canExtractForAccessibility())); - metadata.set(AccessPermissions.EXTRACT_CONTENT, - Boolean.toString(ap.canExtractContent())); - metadata.set(AccessPermissions.ASSEMBLE_DOCUMENT, - Boolean.toString(ap.canAssembleDocument())); - metadata.set(AccessPermissions.FILL_IN_FORM, - Boolean.toString(ap.canFillInForm())); - metadata.set(AccessPermissions.CAN_MODIFY, - Boolean.toString(ap.canModify())); - metadata.set(AccessPermissions.CAN_MODIFY_ANNOTATIONS, - Boolean.toString(ap.canModifyAnnotations())); - metadata.set(AccessPermissions.CAN_PRINT, - Boolean.toString(ap.canPrint())); - metadata.set(AccessPermissions.CAN_PRINT_DEGRADED, - Boolean.toString(ap.canPrintDegraded())); - - - //now go for the XMP stuff - org.apache.jempbox.xmp.XMPMetadata xmp = null; - XMPSchemaDublinCore dcSchema = null; - try { - if (document.getDocumentCatalog().getMetadata() != null) { - xmp = document.getDocumentCatalog().getMetadata().exportXMPMetadata(); - } - if (xmp != null) { - dcSchema = xmp.getDublinCoreSchema(); - } - } catch (IOException e) { - //swallow - } PDDocumentInformation info = document.getDocumentInformation(); metadata.set(PagedText.N_PAGES, document.getNumberOfPages()); - extractMultilingualItems(metadata, 
TikaCoreProperties.TITLE, info.getTitle(), dcSchema); - extractDublinCoreListItems(metadata, TikaCoreProperties.CREATOR, info.getAuthor(), dcSchema); - extractDublinCoreListItems(metadata, TikaCoreProperties.CONTRIBUTOR, null, dcSchema); + addMetadata(metadata, TikaCoreProperties.TITLE, info.getTitle()); + addMetadata(metadata, TikaCoreProperties.CREATOR, info.getAuthor()); addMetadata(metadata, TikaCoreProperties.CREATOR_TOOL, info.getCreator()); addMetadata(metadata, TikaCoreProperties.KEYWORDS, info.getKeywords()); addMetadata(metadata, "producer", info.getProducer()); - extractMultilingualItems(metadata, TikaCoreProperties.DESCRIPTION, null, dcSchema); - // TODO: Move to description in Tika 2.0 addMetadata(metadata, TikaCoreProperties.TRANSITION_SUBJECT_TO_OO_SUBJECT, info.getSubject()); addMetadata(metadata, "trapped", info.getTrapped()); @@ -250,218 +173,37 @@ // Invalid date format, just ignore } try { - Calendar modified = info.getModificationDate(); + Calendar modified = info.getModificationDate(); addMetadata(metadata, Metadata.LAST_MODIFIED, modified); addMetadata(metadata, TikaCoreProperties.MODIFIED, modified); } catch (IOException e) { // Invalid date format, just ignore } - + // All remaining metadata is custom // Copy this over as-is - List handledMetadata = Arrays.asList("Author", "Creator", "CreationDate", "ModDate", - "Keywords", "Producer", "Subject", "Title", "Trapped"); - for (COSName key : info.getDictionary().keySet()) { + List handledMetadata = Arrays.asList(new String[] { + "Author", "Creator", "CreationDate", "ModDate", + "Keywords", "Producer", "Subject", "Title", "Trapped" + }); + for(COSName key : info.getDictionary().keySet()) { String name = key.getName(); - if (!handledMetadata.contains(name)) { - addMetadata(metadata, name, info.getDictionary().getDictionaryObject(key)); - } - } - - //try to get the various versions - //Caveats: - // there is currently a fair amount of redundancy - // TikaCoreProperties.FORMAT can be multivalued - 
// There are also three potential pdf specific version keys: pdf:PDFVersion, pdfa:PDFVersion, pdf:PDFExtensionVersion - metadata.set("pdf:PDFVersion", Float.toString(document.getDocument().getVersion())); - metadata.add(TikaCoreProperties.FORMAT.getName(), - MEDIA_TYPE.toString() + "; version=" + - Float.toString(document.getDocument().getVersion())); - - try { - if (xmp != null) { - xmp.addXMLNSMapping(XMPSchemaPDFAId.NAMESPACE, XMPSchemaPDFAId.class); - XMPSchemaPDFAId pdfaxmp = (XMPSchemaPDFAId) xmp.getSchemaByClass(XMPSchemaPDFAId.class); - if (pdfaxmp != null) { - if (pdfaxmp.getPart() != null) { - metadata.set("pdfaid:part", Integer.toString(pdfaxmp.getPart())); - } - if (pdfaxmp.getConformance() != null) { - metadata.set("pdfaid:conformance", pdfaxmp.getConformance()); - String version = "A-" + pdfaxmp.getPart() + pdfaxmp.getConformance().toLowerCase(Locale.ROOT); - metadata.set("pdfa:PDFVersion", version); - metadata.add(TikaCoreProperties.FORMAT.getName(), - MEDIA_TYPE.toString() + "; version=\"" + version + "\""); - } - } - // TODO WARN if this XMP version is inconsistent with document header version? - } - } catch (IOException e) { - metadata.set(TikaCoreProperties.TIKA_META_PREFIX + "pdf:metadata-xmp-parse-failed", "" + e); - } - //TODO: Let's try to move this into PDFBox. 
- //Attempt to determine Adobe extension level, if present: - COSDictionary root = document.getDocumentCatalog().getCOSDictionary(); - COSDictionary extensions = (COSDictionary) root.getDictionaryObject(COSName.getPDFName("Extensions")); - if (extensions != null) { - for (COSName extName : extensions.keySet()) { - // If it's an Adobe one, interpret it to determine the extension level: - if (extName.equals(COSName.getPDFName("ADBE"))) { - COSDictionary adobeExt = (COSDictionary) extensions.getDictionaryObject(extName); - if (adobeExt != null) { - String baseVersion = adobeExt.getNameAsString(COSName.getPDFName("BaseVersion")); - int el = adobeExt.getInt(COSName.getPDFName("ExtensionLevel")); - //-1 is sentinel value that something went wrong in getInt - if (el != -1) { - metadata.set("pdf:PDFExtensionVersion", baseVersion + " Adobe Extension Level " + el); - metadata.add(TikaCoreProperties.FORMAT.getName(), - MEDIA_TYPE.toString() + "; version=\"" + baseVersion + " Adobe Extension Level " + el + "\""); - } - } - } else { - // WARN that there is an Extension, but it's not Adobe's, and so is a 'new' format'. - metadata.set("pdf:foundNonAdobeExtensionName", extName.getName()); - } - } - } - } - - /** - * Try to extract all multilingual items from the XMPSchema - *
      - * This relies on the property having a valid xmp getName() - *
      - * For now, this only extracts the first language if the property does not allow multiple values (see TIKA-1295) - * - * @param metadata - * @param property - * @param pdfBoxBaseline - * @param schema - */ - private void extractMultilingualItems(Metadata metadata, Property property, - String pdfBoxBaseline, XMPSchema schema) { - //if schema is null, just go with pdfBoxBaseline - if (schema == null) { - if (pdfBoxBaseline != null && pdfBoxBaseline.length() > 0) { - addMetadata(metadata, property, pdfBoxBaseline); - } - return; - } - - for (String lang : schema.getLanguagePropertyLanguages(property.getName())) { - String value = schema.getLanguageProperty(property.getName(), lang); - - if (value != null && value.length() > 0) { - //if you're going to add it below in the baseline addition, don't add it now - if (pdfBoxBaseline != null && value.equals(pdfBoxBaseline)) { - continue; - } - addMetadata(metadata, property, value); - if (!property.isMultiValuePermitted()) { - return; - } - } - } - - if (pdfBoxBaseline != null && pdfBoxBaseline.length() > 0) { - //if we've already added something above and multivalue is not permitted - //return. - if (!property.isMultiValuePermitted()) { - if (metadata.get(property) != null) { - return; - } - } - addMetadata(metadata, property, pdfBoxBaseline); - } - } - - - /** - * This tries to read a list from a particular property in - * XMPSchemaDublinCore. - * If it can't find the information, it falls back to the - * pdfboxBaseline. The pdfboxBaseline should be the value - * that pdfbox returns from its PDDocumentInformation object - * (e.g. getAuthor()) This method is designed include the pdfboxBaseline, - * and it should not duplicate the pdfboxBaseline. - *
      - * Until PDFBOX-1803/TIKA-1233 are fixed, do not call this - * on dates! - *
      - * This relies on the property having a DublinCore compliant getName() - * - * @param property - * @param pdfBoxBaseline - * @param dc - * @param metadata - */ - private void extractDublinCoreListItems(Metadata metadata, Property property, - String pdfBoxBaseline, XMPSchemaDublinCore dc) { - //if no dc, add baseline and return - if (dc == null) { - if (pdfBoxBaseline != null && pdfBoxBaseline.length() > 0) { - addMetadata(metadata, property, pdfBoxBaseline); - } - return; - } - List items = getXMPBagOrSeqList(dc, property.getName()); - if (items == null) { - if (pdfBoxBaseline != null && pdfBoxBaseline.length() > 0) { - addMetadata(metadata, property, pdfBoxBaseline); - } - return; - } - for (String item : items) { - if (pdfBoxBaseline != null && !item.equals(pdfBoxBaseline)) { - addMetadata(metadata, property, item); - } - } - //finally, add the baseline - if (pdfBoxBaseline != null && pdfBoxBaseline.length() > 0) { - addMetadata(metadata, property, pdfBoxBaseline); - } - } - - /** - * As of this writing, XMPSchema can contain bags or sequence lists - * for some attributes...despite standards documentation. - * JempBox expects one or the other for specific attributes. - * Until more flexibility is added to JempBox, Tika will have to handle both. - * - * @param schema - * @param name - * @return list of values or null - */ - private List getXMPBagOrSeqList(XMPSchema schema, String name) { - List ret = schema.getBagList(name); - if (ret == null) { - ret = schema.getSequenceList(name); - } - return ret; + if(! 
handledMetadata.contains(name)) { + addMetadata(metadata, name, info.getDictionary().getDictionaryObject(key)); + } + } } private void addMetadata(Metadata metadata, Property property, String value) { if (value != null) { - String decoded = decode(value); - if (property.isMultiValuePermitted() || metadata.get(property) == null) { - metadata.add(property, decoded); - } - //silently skip adding property that already exists if multiple values are not permitted - } - } - + metadata.add(property, value); + } + } + private void addMetadata(Metadata metadata, String name, String value) { if (value != null) { - metadata.add(name, decode(value)); - } - } - - private String decode(String value) { - if (PDFEncodedStringDecoder.shouldDecode(value)) { - PDFEncodedStringDecoder d = new PDFEncodedStringDecoder(); - return d.decode(value); - } - return value; + metadata.add(name, value); + } } private void addMetadata(Metadata metadata, String name, Calendar value) { @@ -478,52 +220,61 @@ /** * Used when processing custom metadata entries, as PDFBox won't do - * the conversion for us in the way it does for the standard ones + * the conversion for us in the way it does for the standard ones */ private void addMetadata(Metadata metadata, String name, COSBase value) { - if (value instanceof COSArray) { - for (Object v : ((COSArray) value).toList()) { + if(value instanceof COSArray) { + for(Object v : ((COSArray)value).toList()) { addMetadata(metadata, name, ((COSBase) v)); } - } else if (value instanceof COSString) { - addMetadata(metadata, name, ((COSString) value).getString()); - } - // Avoid calling COSDictionary#toString, since it can lead to infinite - // recursion. See TIKA-1038 and PDFBOX-1835. 
- else if (value != null && !(value instanceof COSDictionary)) { + } else if(value instanceof COSString) { + addMetadata(metadata, name, ((COSString)value).getString()); + } else if (value != null){ addMetadata(metadata, name, value.toString()); } } - public PDFParserConfig getPDFParserConfig() { + public void setPDFParserConfig(PDFParserConfig config){ + this.defaultConfig = config; + } + + public PDFParserConfig getPDFParserConfig(){ return defaultConfig; } - - public void setPDFParserConfig(PDFParserConfig config) { - this.defaultConfig = config; - } - - /** - * @see #setUseNonSequentialParser(boolean) - * @deprecated use {@link #getPDFParserConfig()} - */ - public boolean getUseNonSequentialParser() { - return defaultConfig.getUseNonSequentialParser(); - } - + /** * If true, the parser will use the NonSequentialParser. This may * be faster than the full doc parser. * If false (default), this will use the full doc parser. + * + * @deprecated use {@link #setPDFParserConfig(PDFParserConfig)} + */ + public void setUseNonSequentialParser(boolean v){ + defaultConfig.setUseNonSequentialParser(v); + } + + /** + * @see #setUseNonSequentialParser(boolean) + * @deprecated use {@link #getPDFParserConfig()} + */ + public boolean getUseNonSequentialParser(){ + return defaultConfig.getUseNonSequentialParser(); + } + + /** + * If true (the default), the parser should estimate + * where spaces should be inserted between words. For + * many PDFs this is necessary as they do not include + * explicit whitespace characters. * - * @deprecated use {@link #setPDFParserConfig(PDFParserConfig)} - */ - public void setUseNonSequentialParser(boolean v) { - defaultConfig.setUseNonSequentialParser(v); - } - - /** - * @see #setEnableAutoSpace(boolean) + * @deprecated use {@link #setPDFParserConfig(PDFParserConfig)} + */ + public void setEnableAutoSpace(boolean v) { + defaultConfig.setEnableAutoSpace(v); + } + + /** + * @see #setEnableAutoSpace. 
* @deprecated use {@link #getPDFParserConfig()} */ public boolean getEnableAutoSpace() { @@ -531,20 +282,17 @@ } /** - * If true (the default), the parser should estimate - * where spaces should be inserted between words. For - * many PDFs this is necessary as they do not include - * explicit whitespace characters. - * + * If true (the default), text in annotations will be + * extracted. * @deprecated use {@link #setPDFParserConfig(PDFParserConfig)} */ - public void setEnableAutoSpace(boolean v) { - defaultConfig.setEnableAutoSpace(v); + public void setExtractAnnotationText(boolean v) { + defaultConfig.setExtractAnnotationText(v); } /** * If true, text in annotations will be extracted. - * + * * @deprecated use {@link #getPDFParserConfig()} */ public boolean getExtractAnnotationText() { @@ -552,17 +300,23 @@ } /** - * If true (the default), text in annotations will be - * extracted. - * - * @deprecated use {@link #setPDFParserConfig(PDFParserConfig)} - */ - public void setExtractAnnotationText(boolean v) { - defaultConfig.setExtractAnnotationText(v); - } - - /** - * @see #setSuppressDuplicateOverlappingText(boolean) + * If true, the parser should try to remove duplicated + * text over the same region. This is needed for some + * PDFs that achieve bolding by re-writing the same + * text in the same area. Note that this can + * slow down extraction substantially (PDFBOX-956) and + * sometimes remove characters that were not in fact + * duplicated (PDFBOX-1155). By default this is disabled. + * + * @deprecated use {@link #setPDFParserConfig(PDFParserConfig)} + */ + public void setSuppressDuplicateOverlappingText(boolean v) { + defaultConfig.setSuppressDuplicateOverlappingText(v); + } + + /** + * @see #setSuppressDuplicateOverlappingText. + * * @deprecated use {@link #getPDFParserConfig()} */ public boolean getSuppressDuplicateOverlappingText() { @@ -570,40 +324,26 @@ } /** - * If true, the parser should try to remove duplicated - * text over the same region. 
This is needed for some - * PDFs that achieve bolding by re-writing the same - * text in the same area. Note that this can - * slow down extraction substantially (PDFBOX-956) and - * sometimes remove characters that were not in fact - * duplicated (PDFBOX-1155). By default this is disabled. - * - * @deprecated use {@link #setPDFParserConfig(PDFParserConfig)} - */ - public void setSuppressDuplicateOverlappingText(boolean v) { - defaultConfig.setSuppressDuplicateOverlappingText(v); - } - - /** - * @see #setSortByPosition(boolean) + * If true, sort text tokens by their x/y position + * before extracting text. This may be necessary for + * some PDFs (if the text tokens are not rendered "in + * order"), while for other PDFs it can produce the + * wrong result (for example if there are 2 columns, + * the text will be interleaved). Default is false. + * + * @deprecated use {@link #setPDFParserConfig(PDFParserConfig)} + */ + public void setSortByPosition(boolean v) { + defaultConfig.setSortByPosition(v); + } + + /** + * @see #setSortByPosition. + * * @deprecated use {@link #getPDFParserConfig()} */ public boolean getSortByPosition() { return defaultConfig.getSortByPosition(); } - /** - * If true, sort text tokens by their x/y position - * before extracting text. This may be necessary for - * some PDFs (if the text tokens are not rendered "in - * order"), while for other PDFs it can produce the - * wrong result (for example if there are 2 columns, - * the text will be interleaved). Default is false. 
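The sortByPosition javadoc above describes reordering text tokens by their page coordinates before extraction. A standalone sketch of that idea, assuming a hypothetical `Token` type in place of PDFBox's internal text-position objects:

```java
import java.util.Arrays;
import java.util.Comparator;
import java.util.stream.Collectors;

public class PositionSort {
    // Illustrative stand-in for a positioned text token; not the PDFBox API.
    public record Token(String text, float x, float y) {}

    // Sort top-to-bottom (y), then left-to-right (x), and join the token
    // text. As the javadoc warns, on a two-column page this ordering
    // interleaves the columns, which is why the option defaults to false.
    public static String extract(Token... tokens) {
        return Arrays.stream(tokens)
                .sorted(Comparator.comparingDouble(Token::y)
                        .thenComparingDouble(Token::x))
                .map(Token::text)
                .collect(Collectors.joining(" "));
    }
}
```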
- * - * @deprecated use {@link #setPDFParserConfig(PDFParserConfig)} - */ - public void setSortByPosition(boolean v) { - defaultConfig.setSortByPosition(v); - } - } diff --git a/tika-parsers/src/main/java/org/apache/tika/parser/pdf/PDFParserConfig.java b/tika-parsers/src/main/java/org/apache/tika/parser/pdf/PDFParserConfig.java index 74e67dd..9a99319 100644 --- a/tika-parsers/src/main/java/org/apache/tika/parser/pdf/PDFParserConfig.java +++ b/tika-parsers/src/main/java/org/apache/tika/parser/pdf/PDFParserConfig.java @@ -1,5 +1,9 @@ package org.apache.tika.parser.pdf; +import java.io.IOException; +import java.io.InputStream; +import java.io.Serializable; +import java.util.Properties; /* * Licensed to the Apache Software Foundation (ASF) under one or more * contributor license agreements. See the NOTICE file distributed with @@ -17,32 +21,25 @@ * limitations under the License. */ -import java.io.IOException; -import java.io.InputStream; -import java.io.Serializable; -import java.util.Locale; -import java.util.Properties; - -import org.apache.pdfbox.util.PDFTextStripper; - /** * Config for PDFParser. - *
<p/>
+ *
 * This allows parameters to be set programmatically:
 * <ol>
 * <li>Calls to PDFParser, i.e. parser.getPDFParserConfig().setEnableAutoSpace() (as before)</li>
 * <li>Constructor of PDFParser</li>
 * <li>Passing to PDFParser through a ParseContext: context.set(PDFParserConfig.class, config);</li>
 * </ol>
- * <p/>
+ *
 * Parameters can also be set by modifying the PDFParserConfig.properties file,
 * which lives in the expected places, in trunk:
 * tika-parsers/src/main/resources/org/apache/tika/parser/pdf
- * <p/>
      + * * Or, in tika-app-x.x.jar or tika-parsers-x.x.jar: * org/apache/tika/parser/pdf + * */ -public class PDFParserConfig implements Serializable { +public class PDFParserConfig implements Serializable{ private static final long serialVersionUID = 6492570218190936986L; @@ -62,26 +59,11 @@ //True if we should use PDFBox's NonSequentialParser private boolean useNonSequentialParser = false; - + //True if acroform content should be extracted private boolean extractAcroFormContent = true; - //True if inline PDXImage objects should be extracted - private boolean extractInlineImages = false; - - //True if inline images (as identified by their object id within - //a pdf file) should only be extracted once. - private boolean extractUniqueInlineImagesOnly = true; - - //The character width-based tolerance value used to estimate where spaces in text should be added - private Float averageCharTolerance; - - //The space width-based tolerance value used to estimate where spaces in text should be added - private Float spacingTolerance; - - private AccessChecker accessChecker; - - public PDFParserConfig() { + public PDFParserConfig(){ init(this.getClass().getResourceAsStream("PDFParser.properties")); } @@ -89,28 +71,28 @@ * Loads properties from InputStream and then tries to close InputStream. * If there is an IOException, this silently swallows the exception * and goes back to the default. 
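The constructor javadoc above notes that a failed properties load silently falls back to defaults; the boolean handling that backs this appears in getProp further down the file. A self-contained sketch of that lenient parse (the helper name is mine, not Tika's):

```java
public class LenientBoolean {
    // Parse "true"/"false" case-insensitively; anything else -- including
    // null for a missing property -- falls back to the supplied default.
    public static boolean parse(String p, boolean defaultMissing) {
        if (p == null) {
            return defaultMissing;
        }
        if (p.equalsIgnoreCase("true")) {
            return true;
        } else if (p.equalsIgnoreCase("false")) {
            return false;
        }
        return defaultMissing;
    }
}
```

Note the 1.11 side of this diff lower-cases with Locale.ROOT before comparing, to avoid locale-sensitive casing surprises (e.g. the Turkish dotless i); `equalsIgnoreCase` sidesteps that here.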
- * + * * @param is */ - public PDFParserConfig(InputStream is) { + public PDFParserConfig(InputStream is){ init(is); } //initializes object and then tries to close inputstream - private void init(InputStream is) { - - if (is == null) { + private void init(InputStream is){ + + if (is == null){ return; } Properties props = new Properties(); - try { + try{ props.load(is); - } catch (IOException e) { + } catch (IOException e){ } finally { - if (is != null) { - try { + if (is != null){ + try{ is.close(); - } catch (IOException e) { + } catch (IOException e){ //swallow } } @@ -118,177 +100,74 @@ setEnableAutoSpace( getProp(props.getProperty("enableAutoSpace"), getEnableAutoSpace())); setSuppressDuplicateOverlappingText( - getProp(props.getProperty("suppressDuplicateOverlappingText"), + getProp(props.getProperty("suppressDuplicateOverlappingText"), getSuppressDuplicateOverlappingText())); setExtractAnnotationText( - getProp(props.getProperty("extractAnnotationText"), + getProp(props.getProperty("extractAnnotationText"), getExtractAnnotationText())); setSortByPosition( - getProp(props.getProperty("sortByPosition"), + getProp(props.getProperty("sortByPosition"), getSortByPosition())); setUseNonSequentialParser( - getProp(props.getProperty("useNonSequentialParser"), + getProp(props.getProperty("useNonSequentialParser"), getUseNonSequentialParser())); setExtractAcroFormContent( getProp(props.getProperty("extractAcroFormContent"), - getExtractAcroFormContent())); - setExtractInlineImages( - getProp(props.getProperty("extractInlineImages"), - getExtractInlineImages())); - setExtractUniqueInlineImagesOnly( - getProp(props.getProperty("extractUniqueInlineImagesOnly"), - getExtractUniqueInlineImagesOnly())); - - boolean checkExtractAccessPermission = getProp(props.getProperty("checkExtractAccessPermission"), false); - boolean allowExtractionForAccessibility = getProp(props.getProperty("allowExtractionForAccessibility"), true); - - if (checkExtractAccessPermission == false) { - 
//silently ignore the crazy configuration of checkExtractAccessPermission = false, - //but allowExtractionForAccessibility=false - accessChecker = new AccessChecker(); - } else { - accessChecker = new AccessChecker(allowExtractionForAccessibility); - } - } - - /** - * Configures the given pdf2XHTML. - * - * @param pdf2XHTML - */ - public void configure(PDF2XHTML pdf2XHTML) { - pdf2XHTML.setForceParsing(true); - pdf2XHTML.setSortByPosition(getSortByPosition()); - if (getEnableAutoSpace()) { - pdf2XHTML.setWordSeparator(" "); - } else { - pdf2XHTML.setWordSeparator(""); - } - if (getAverageCharTolerance() != null) { - pdf2XHTML.setAverageCharTolerance(getAverageCharTolerance()); - } - if (getSpacingTolerance() != null) { - pdf2XHTML.setSpacingTolerance(getSpacingTolerance()); - } - pdf2XHTML.setSuppressDuplicateOverlappingText(getSuppressDuplicateOverlappingText()); - } - - /** - * @see #setExtractAcroFormContent(boolean) - */ + getExtractAcroFormContent())); + } + + + /** + * If true (the default), extract content from AcroForms + * at the end of the document. + * + * @param b + */ + public void setExtractAcroFormContent(boolean extractAcroFormContent) { + this.extractAcroFormContent = extractAcroFormContent; + + } + + /** @see #setExtractAcroFormContent(boolean) */ public boolean getExtractAcroFormContent() { return extractAcroFormContent; } - /** - * If true (the default), extract content from AcroForms - * at the end of the document. - * - * @param extractAcroFormContent - */ - public void setExtractAcroFormContent(boolean extractAcroFormContent) { - this.extractAcroFormContent = extractAcroFormContent; - - } - - /** - * @see #setExtractInlineImages(boolean) - */ - public boolean getExtractInlineImages() { - return extractInlineImages; - } - - /** - * If true, extract inline embedded OBXImages. - * Beware: some PDF documents of modest size (~4MB) can contain - * thousands of embedded images totaling > 2.5 GB. 
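The extractUniqueInlineImagesOnly discussion below keys image uniqueness on the PDF COSObject id so each image is handed to the embedded extractor once per document. A toy sketch of that bookkeeping, with names of my own choosing rather than Tika's:

```java
import java.util.ArrayList;
import java.util.HashSet;
import java.util.List;
import java.util.Set;

public class InlineImageDeduper {
    // Object ids already handed to the embedded-document extractor. As the
    // javadoc notes, uniqueness is keyed on the object id only, so
    // byte-identical images with distinct ids would all still be emitted.
    private final Set<Long> seenObjectIds = new HashSet<>();
    private final List<Long> extracted = new ArrayList<>();

    // Returns true when the image should be extracted (first sighting of
    // this id); false if it was already pulled out earlier in the document.
    public boolean maybeExtract(long cosObjectId) {
        if (!seenObjectIds.add(cosObjectId)) {
            return false;
        }
        extracted.add(cosObjectId);
        return true;
    }

    public List<Long> extractedIds() {
        return extracted;
    }
}
```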
Also, at least as of PDFBox 1.8.5, - * there can be surprisingly large memory consumption and/or out of memory errors. - * Set to true with caution. - *
<p/>
      - * The default is false. - *
<p/>
      - * See also: {@see #setExtractUniqueInlineImagesOnly(boolean)}; - * - * @param extractInlineImages - */ - public void setExtractInlineImages(boolean extractInlineImages) { - this.extractInlineImages = extractInlineImages; - } - - /** - * @see #setExtractUniqueInlineImagesOnly(boolean) - */ - public boolean getExtractUniqueInlineImagesOnly() { - return extractUniqueInlineImagesOnly; - } - - /** - * Multiple pages within a PDF file might refer to the same underlying image. - * If {@link #extractUniqueInlineImagesOnly} is set to false, the - * parser will call the EmbeddedExtractor each time the image appears on a page. - * This might be desired for some use cases. However, to avoid duplication of - * extracted images, set this to true. The default is true. - *
<p/>
      - * Note that uniqueness is determined only by the underlying PDF COSObject id, not by - * file hash or similar equality metric. - * If the PDF actually contains multiple copies of the same image - * -- all with different object ids -- then all images will be extracted. - *
<p/>
      - * For this parameter to have any effect, {@link #extractInlineImages} must be - * set to true. - *
<p/>
      - * Because of TIKA-1742 -- to avoid infinite recursion -- no matter the setting - * of this parameter, the extractor will only pull out one copy of each image per - * page. This parameter tries to capture uniqueness across the entire document. - * - * @param extractUniqueInlineImagesOnly - */ - public void setExtractUniqueInlineImagesOnly(boolean extractUniqueInlineImagesOnly) { - this.extractUniqueInlineImagesOnly = extractUniqueInlineImagesOnly; - - } - - /** - * @see #setEnableAutoSpace(boolean) - */ + /** @see #setEnableAutoSpace. */ public boolean getEnableAutoSpace() { return enableAutoSpace; } /** - * If true (the default), the parser should estimate - * where spaces should be inserted between words. For - * many PDFs this is necessary as they do not include - * explicit whitespace characters. + * If true (the default), the parser should estimate + * where spaces should be inserted between words. For + * many PDFs this is necessary as they do not include + * explicit whitespace characters. */ public void setEnableAutoSpace(boolean enableAutoSpace) { this.enableAutoSpace = enableAutoSpace; } - /** - * @see #setSuppressDuplicateOverlappingText(boolean) - */ + /** @see #setSuppressDuplicateOverlappingText(boolean)*/ public boolean getSuppressDuplicateOverlappingText() { return suppressDuplicateOverlappingText; } /** - * If true, the parser should try to remove duplicated - * text over the same region. This is needed for some - * PDFs that achieve bolding by re-writing the same - * text in the same area. Note that this can - * slow down extraction substantially (PDFBOX-956) and - * sometimes remove characters that were not in fact - * duplicated (PDFBOX-1155). By default this is disabled. + * If true, the parser should try to remove duplicated + * text over the same region. This is needed for some + * PDFs that achieve bolding by re-writing the same + * text in the same area. 
Note that this can + * slow down extraction substantially (PDFBOX-956) and + * sometimes remove characters that were not in fact + * duplicated (PDFBOX-1155). By default this is disabled. */ public void setSuppressDuplicateOverlappingText( boolean suppressDuplicateOverlappingText) { this.suppressDuplicateOverlappingText = suppressDuplicateOverlappingText; } - /** - * @see #setExtractAnnotationText(boolean) - */ + /** @see #setExtractAnnotationText(boolean)*/ public boolean getExtractAnnotationText() { return extractAnnotationText; } @@ -300,29 +179,24 @@ public void setExtractAnnotationText(boolean extractAnnotationText) { this.extractAnnotationText = extractAnnotationText; } - - /** - * @see #setSortByPosition(boolean) - */ + /** @see #setSortByPosition(boolean)*/ public boolean getSortByPosition() { return sortByPosition; } /** - * If true, sort text tokens by their x/y position - * before extracting text. This may be necessary for - * some PDFs (if the text tokens are not rendered "in - * order"), while for other PDFs it can produce the - * wrong result (for example if there are 2 columns, - * the text will be interleaved). Default is false. + * If true, sort text tokens by their x/y position + * before extracting text. This may be necessary for + * some PDFs (if the text tokens are not rendered "in + * order"), while for other PDFs it can produce the + * wrong result (for example if there are 2 columns, + * the text will be interleaved). Default is false. */ public void setSortByPosition(boolean sortByPosition) { this.sortByPosition = sortByPosition; } - /** - * @see #setUseNonSequentialParser(boolean) - */ + /** @see #setUseNonSequentialParser(boolean)*/ public boolean getUseNonSequentialParser() { return useNonSequentialParser; } @@ -330,61 +204,24 @@ /** * If true, uses PDFBox's non-sequential parser. * The non-sequential parser should be much faster than the traditional - * full doc parser. However, until PDFBOX-XXX is fixed, + * full doc parser. 
However, until PDFBOX-XXX is fixed, * the non-sequential parser fails * to extract some document metadata. - *
<p/>
      + *
<p/>
      * Default is false (use the traditional parser) - * * @param useNonSequentialParser */ public void setUseNonSequentialParser(boolean useNonSequentialParser) { this.useNonSequentialParser = useNonSequentialParser; } - /** - * @see #setAverageCharTolerance(Float) - */ - public Float getAverageCharTolerance() { - return averageCharTolerance; - } - - /** - * See {@link PDFTextStripper#setAverageCharTolerance(float)} - */ - public void setAverageCharTolerance(Float averageCharTolerance) { - this.averageCharTolerance = averageCharTolerance; - } - - /** - * @see #setSpacingTolerance(Float) - */ - public Float getSpacingTolerance() { - return spacingTolerance; - } - - /** - * See {@link PDFTextStripper#setSpacingTolerance(float)} - */ - public void setSpacingTolerance(Float spacingTolerance) { - this.spacingTolerance = spacingTolerance; - } - - public AccessChecker getAccessChecker() { - return accessChecker; - } - - public void setAccessChecker(AccessChecker accessChecker) { - this.accessChecker = accessChecker; - } - - private boolean getProp(String p, boolean defaultMissing) { - if (p == null) { + private boolean getProp(String p, boolean defaultMissing){ + if (p == null){ return defaultMissing; } - if (p.toLowerCase(Locale.ROOT).equals("true")) { + if (p.toLowerCase().equals("true")){ return true; - } else if (p.toLowerCase(Locale.ROOT).equals("false")) { + } else if (p.toLowerCase().equals("false")){ return false; } else { return defaultMissing; @@ -395,19 +232,10 @@ public int hashCode() { final int prime = 31; int result = 1; - result = prime - * result - + ((averageCharTolerance == null) ? 0 : averageCharTolerance - .hashCode()); result = prime * result + (enableAutoSpace ? 1231 : 1237); result = prime * result + (extractAcroFormContent ? 1231 : 1237); result = prime * result + (extractAnnotationText ? 1231 : 1237); - result = prime * result + (extractInlineImages ? 1231 : 1237); - result = prime * result + (extractUniqueInlineImagesOnly ? 
1231 : 1237); result = prime * result + (sortByPosition ? 1231 : 1237); - result = prime - * result - + ((spacingTolerance == null) ? 0 : spacingTolerance.hashCode()); result = prime * result + (suppressDuplicateOverlappingText ? 1231 : 1237); result = prime * result + (useNonSequentialParser ? 1231 : 1237); @@ -423,27 +251,13 @@ if (getClass() != obj.getClass()) return false; PDFParserConfig other = (PDFParserConfig) obj; - if (averageCharTolerance == null) { - if (other.averageCharTolerance != null) - return false; - } else if (!averageCharTolerance.equals(other.averageCharTolerance)) - return false; if (enableAutoSpace != other.enableAutoSpace) return false; if (extractAcroFormContent != other.extractAcroFormContent) return false; if (extractAnnotationText != other.extractAnnotationText) return false; - if (extractInlineImages != other.extractInlineImages) - return false; - if (extractUniqueInlineImagesOnly != other.extractUniqueInlineImagesOnly) - return false; if (sortByPosition != other.sortByPosition) - return false; - if (spacingTolerance == null) { - if (other.spacingTolerance != null) - return false; - } else if (!spacingTolerance.equals(other.spacingTolerance)) return false; if (suppressDuplicateOverlappingText != other.suppressDuplicateOverlappingText) return false; @@ -459,11 +273,9 @@ + suppressDuplicateOverlappingText + ", extractAnnotationText=" + extractAnnotationText + ", sortByPosition=" + sortByPosition + ", useNonSequentialParser=" + useNonSequentialParser - + ", extractAcroFormContent=" + extractAcroFormContent - + ", extractInlineImages=" + extractInlineImages - + ", extractUniqueInlineImagesOnly=" - + extractUniqueInlineImagesOnly + ", averageCharTolerance=" - + averageCharTolerance + ", spacingTolerance=" - + spacingTolerance + "]"; - } + + ", extractAcroFormContent=" + extractAcroFormContent + "]"; + } + + + } diff --git a/tika-parsers/src/main/java/org/apache/tika/parser/pkg/CompressorParser.java 
b/tika-parsers/src/main/java/org/apache/tika/parser/pkg/CompressorParser.java index 84b3b11..a5921cd 100644 --- a/tika-parsers/src/main/java/org/apache/tika/parser/pkg/CompressorParser.java +++ b/tika-parsers/src/main/java/org/apache/tika/parser/pkg/CompressorParser.java @@ -27,18 +27,14 @@ import org.apache.commons.compress.compressors.CompressorInputStream; import org.apache.commons.compress.compressors.CompressorStreamFactory; import org.apache.commons.compress.compressors.bzip2.BZip2CompressorInputStream; -import org.apache.commons.compress.compressors.deflate.DeflateCompressorInputStream; import org.apache.commons.compress.compressors.gzip.GzipCompressorInputStream; import org.apache.commons.compress.compressors.gzip.GzipUtils; import org.apache.commons.compress.compressors.pack200.Pack200CompressorInputStream; -import org.apache.commons.compress.compressors.snappy.FramedSnappyCompressorInputStream; -import org.apache.commons.compress.compressors.snappy.SnappyCompressorInputStream; import org.apache.commons.compress.compressors.xz.XZCompressorInputStream; -import org.apache.commons.compress.compressors.z.ZCompressorInputStream; -import org.apache.commons.io.input.CloseShieldInputStream; import org.apache.tika.exception.TikaException; import org.apache.tika.extractor.EmbeddedDocumentExtractor; import org.apache.tika.extractor.ParsingEmbeddedDocumentExtractor; +import org.apache.tika.io.CloseShieldInputStream; import org.apache.tika.metadata.Metadata; import org.apache.tika.mime.MediaType; import org.apache.tika.parser.AbstractParser; @@ -57,37 +53,22 @@ private static final MediaType BZIP = MediaType.application("x-bzip"); private static final MediaType BZIP2 = MediaType.application("x-bzip2"); - private static final MediaType GZIP = MediaType.application("gzip"); - private static final MediaType GZIP_ALT = MediaType.application("x-gzip"); - private static final MediaType COMPRESS = MediaType.application("x-compress"); + private static final MediaType GZIP = 
MediaType.application("x-gzip"); private static final MediaType XZ = MediaType.application("x-xz"); - private static final MediaType PACK = MediaType.application("x-java-pack200"); - private static final MediaType SNAPPY = MediaType.application("x-snappy-framed"); - private static final MediaType ZLIB = MediaType.application("zlib"); + private static final MediaType PACK = MediaType.application("application/x-java-pack200"); private static final Set SUPPORTED_TYPES = - MediaType.set(BZIP, BZIP2, GZIP, GZIP_ALT, COMPRESS, XZ, PACK, ZLIB); + MediaType.set(BZIP, BZIP2, GZIP, XZ, PACK); static MediaType getMediaType(CompressorInputStream stream) { - // TODO Add support for the remaining CompressorInputStream formats: - // LZMACompressorInputStream - // LZWInputStream -> UnshrinkingInputStream if (stream instanceof BZip2CompressorInputStream) { return BZIP2; } else if (stream instanceof GzipCompressorInputStream) { return GZIP; } else if (stream instanceof XZCompressorInputStream) { return XZ; - } else if (stream instanceof DeflateCompressorInputStream) { - return ZLIB; - } else if (stream instanceof ZCompressorInputStream) { - return COMPRESS; } else if (stream instanceof Pack200CompressorInputStream) { return PACK; - } else if (stream instanceof FramedSnappyCompressorInputStream || - stream instanceof SnappyCompressorInputStream) { - // TODO Add unit tests for this format - return SNAPPY; } else { return MediaType.OCTET_STREAM; } @@ -111,14 +92,14 @@ CompressorInputStream cis; try { + CompressorStreamFactory factory = new CompressorStreamFactory(); CompressorParserOptions options = context.get(CompressorParserOptions.class, new CompressorParserOptions() { public boolean decompressConcatenated(Metadata metadata) { return false; } }); - CompressorStreamFactory factory = - new CompressorStreamFactory(options.decompressConcatenated(metadata)); + factory.setDecompressConcatenated(options.decompressConcatenated(metadata)); cis = factory.createCompressorInputStream(stream); 
} catch (CompressorException e) { throw new TikaException("Unable to uncompress document stream", e); @@ -146,8 +127,6 @@ name = name.substring(0, name.length() - 4); } else if (name.endsWith(".xz")) { name = name.substring(0, name.length() - 3); - } else if (name.endsWith(".zlib")) { - name = name.substring(0, name.length() - 5); } else if (name.endsWith(".pack")) { name = name.substring(0, name.length() - 5); } else if (name.length() > 0) { diff --git a/tika-parsers/src/main/java/org/apache/tika/parser/pkg/PackageParser.java b/tika-parsers/src/main/java/org/apache/tika/parser/pkg/PackageParser.java index 443eb9e..4520bcb 100644 --- a/tika-parsers/src/main/java/org/apache/tika/parser/pkg/PackageParser.java +++ b/tika-parsers/src/main/java/org/apache/tika/parser/pkg/PackageParser.java @@ -16,57 +16,43 @@ */ package org.apache.tika.parser.pkg; -import static org.apache.tika.metadata.HttpHeaders.CONTENT_TYPE; - import java.io.BufferedInputStream; import java.io.IOException; import java.io.InputStream; -import java.util.Date; import java.util.Set; -import org.apache.commons.compress.PasswordRequiredException; import org.apache.commons.compress.archivers.ArchiveEntry; import org.apache.commons.compress.archivers.ArchiveException; import org.apache.commons.compress.archivers.ArchiveInputStream; import org.apache.commons.compress.archivers.ArchiveStreamFactory; -import org.apache.commons.compress.archivers.StreamingNotSupportedException; import org.apache.commons.compress.archivers.ar.ArArchiveInputStream; import org.apache.commons.compress.archivers.cpio.CpioArchiveInputStream; import org.apache.commons.compress.archivers.dump.DumpArchiveInputStream; import org.apache.commons.compress.archivers.jar.JarArchiveInputStream; -import org.apache.commons.compress.archivers.sevenz.SevenZFile; import org.apache.commons.compress.archivers.tar.TarArchiveInputStream; -import org.apache.commons.compress.archivers.zip.UnsupportedZipFeatureException; -import 
org.apache.commons.compress.archivers.zip.UnsupportedZipFeatureException.Feature; import org.apache.commons.compress.archivers.zip.ZipArchiveInputStream; -import org.apache.commons.io.input.CloseShieldInputStream; -import org.apache.tika.exception.EncryptedDocumentException; import org.apache.tika.exception.TikaException; import org.apache.tika.extractor.EmbeddedDocumentExtractor; import org.apache.tika.extractor.ParsingEmbeddedDocumentExtractor; +import org.apache.tika.io.CloseShieldInputStream; import org.apache.tika.io.TemporaryResources; import org.apache.tika.io.TikaInputStream; import org.apache.tika.metadata.Metadata; -import org.apache.tika.metadata.TikaCoreProperties; import org.apache.tika.mime.MediaType; import org.apache.tika.parser.AbstractParser; import org.apache.tika.parser.ParseContext; -import org.apache.tika.parser.PasswordProvider; import org.apache.tika.sax.XHTMLContentHandler; import org.xml.sax.ContentHandler; import org.xml.sax.SAXException; import org.xml.sax.helpers.AttributesImpl; + +import static org.apache.tika.metadata.HttpHeaders.CONTENT_TYPE; /** * Parser for various packaging formats. Package entries will be written to * the XHTML event stream as <div class="package-entry"> elements that * contain the (optional) entry name as a <h1> element and the full * structured body content of the parsed entry. - *
<p/>
      - * User must have JCE Unlimited Strength jars installed for encryption to - * work with 7Z files (see: COMPRESS-299 and TIKA-1521). If the jars - * are not installed, an IOException will be thrown, and potentially - * wrapped in a TikaException. */ public class PackageParser extends AbstractParser { @@ -79,10 +65,9 @@ private static final MediaType CPIO = MediaType.application("x-cpio"); private static final MediaType DUMP = MediaType.application("x-tika-unix-dump"); private static final MediaType TAR = MediaType.application("x-tar"); - private static final MediaType SEVENZ = MediaType.application("x-7z-compressed"); private static final Set SUPPORTED_TYPES = - MediaType.set(ZIP, JAR, AR, CPIO, DUMP, TAR, SEVENZ); + MediaType.set(ZIP, JAR, AR, CPIO, DUMP, TAR); static MediaType getMediaType(ArchiveInputStream stream) { if (stream instanceof JarArchiveInputStream) { @@ -97,8 +82,6 @@ return DUMP; } else if (stream instanceof TarArchiveInputStream) { return TAR; - } else if (stream instanceof SevenZWrapper) { - return SEVENZ; } else { return MediaType.OCTET_STREAM; } @@ -116,50 +99,19 @@ InputStream stream, ContentHandler handler, Metadata metadata, ParseContext context) throws IOException, SAXException, TikaException { - + // At the end we want to close the archive stream to release + // any associated resources, but the underlying document stream + // should not be closed + stream = new CloseShieldInputStream(stream); + // Ensure that the stream supports the mark feature - if (! 
TikaInputStream.isTikaInputStream(stream)) - stream = new BufferedInputStream(stream); - - - TemporaryResources tmp = new TemporaryResources(); - ArchiveInputStream ais = null; + stream = new BufferedInputStream(stream); + + ArchiveInputStream ais; try { - ArchiveStreamFactory factory = context.get(ArchiveStreamFactory.class, new ArchiveStreamFactory()); - // At the end we want to close the archive stream to release - // any associated resources, but the underlying document stream - // should not be closed - ais = factory.createArchiveInputStream(new CloseShieldInputStream(stream)); - - } catch (StreamingNotSupportedException sne) { - // Most archive formats work on streams, but a few need files - if (sne.getFormat().equals(ArchiveStreamFactory.SEVEN_Z)) { - // Rework as a file, and wrap - stream.reset(); - TikaInputStream tstream = TikaInputStream.get(stream, tmp); - - // Seven Zip suports passwords, was one given? - String password = null; - PasswordProvider provider = context.get(PasswordProvider.class); - if (provider != null) { - password = provider.getPassword(metadata); - } - - SevenZFile sevenz; - if (password == null) { - sevenz = new SevenZFile(tstream.getFile()); - } else { - sevenz = new SevenZFile(tstream.getFile(), password.getBytes("UnicodeLittleUnmarked")); - } - - // Pending a fix for COMPRESS-269 / TIKA-1525, this bit is a little nasty - ais = new SevenZWrapper(sevenz); - } else { - tmp.close(); - throw new TikaException("Unknown non-streaming format " + sne.getFormat(), sne); - } + ArchiveStreamFactory factory = new ArchiveStreamFactory(); + ais = factory.createArchiveInputStream(stream); } catch (ArchiveException e) { - tmp.close(); throw new TikaException("Unable to unpack document stream", e); } @@ -167,6 +119,7 @@ if (!type.equals(MediaType.OCTET_STREAM)) { metadata.set(CONTENT_TYPE, type.toString()); } + // Use the delegate parser to parse the contained document EmbeddedDocumentExtractor extractor = context.get( 
EmbeddedDocumentExtractor.class, @@ -183,17 +136,8 @@ } entry = ais.getNextEntry(); } - } catch (UnsupportedZipFeatureException zfe) { - // If it's an encrypted document of unknown password, report as such - if (zfe.getFeature() == Feature.ENCRYPTION) { - throw new EncryptedDocumentException(zfe); - } - // Otherwise fall through to raise the exception as normal - } catch (PasswordRequiredException pre) { - throw new EncryptedDocumentException(pre); } finally { ais.close(); - tmp.close(); } xhtml.endDocument(); @@ -205,11 +149,17 @@ throws SAXException, IOException, TikaException { String name = entry.getName(); if (archive.canReadEntryData(entry)) { - // Fetch the metadata on the entry contained in the archive - Metadata entrydata = handleEntryMetadata(name, null, - entry.getLastModifiedDate(), entry.getSize(), xhtml); - - // Recurse into the entry if desired + Metadata entrydata = new Metadata(); + if (name != null && name.length() > 0) { + entrydata.set(Metadata.RESOURCE_NAME_KEY, name); + AttributesImpl attributes = new AttributesImpl(); + attributes.addAttribute("", "class", "class", "CDATA", "embedded"); + attributes.addAttribute("", "id", "id", "CDATA", name); + xhtml.startElement("div", attributes); + xhtml.endElement("div"); + + entrydata.set(Metadata.EMBEDDED_RELATIONSHIP_ID, name); + } if (extractor.shouldParseEmbedded(entrydata)) { // For detectors to work, we need a mark/reset supporting // InputStream, which ArchiveInputStream isn't, so wrap @@ -225,63 +175,5 @@ xhtml.element("p", name); } } - - protected static Metadata handleEntryMetadata( - String name, Date createAt, Date modifiedAt, - Long size, XHTMLContentHandler xhtml) - throws SAXException, IOException, TikaException { - Metadata entrydata = new Metadata(); - if (createAt != null) { - entrydata.set(TikaCoreProperties.CREATED, createAt); - } - if (modifiedAt != null) { - entrydata.set(TikaCoreProperties.MODIFIED, modifiedAt); - } - if (size != null) { - entrydata.set(Metadata.CONTENT_LENGTH, 
Long.toString(size)); - } - if (name != null && name.length() > 0) { - name = name.replace("\\", "/"); - entrydata.set(Metadata.RESOURCE_NAME_KEY, name); - AttributesImpl attributes = new AttributesImpl(); - attributes.addAttribute("", "class", "class", "CDATA", "embedded"); - attributes.addAttribute("", "id", "id", "CDATA", name); - xhtml.startElement("div", attributes); - xhtml.endElement("div"); - entrydata.set(Metadata.EMBEDDED_RELATIONSHIP_ID, name); - } - return entrydata; - } - - // Pending a fix for COMPRESS-269, we have to wrap ourselves - private static class SevenZWrapper extends ArchiveInputStream { - private SevenZFile file; - private SevenZWrapper(SevenZFile file) { - this.file = file; - } - - @Override - public int read() throws IOException { - return file.read(); - } - @Override - public int read(byte[] b) throws IOException { - return file.read(b); - } - @Override - public int read(byte[] b, int off, int len) throws IOException { - return file.read(b, off, len); - } - - @Override - public ArchiveEntry getNextEntry() throws IOException { - return file.getNextEntry(); - } - - @Override - public void close() throws IOException { - file.close(); - } - } } diff --git a/tika-parsers/src/main/java/org/apache/tika/parser/pkg/RarParser.java b/tika-parsers/src/main/java/org/apache/tika/parser/pkg/RarParser.java deleted file mode 100644 index 99508b0..0000000 --- a/tika-parsers/src/main/java/org/apache/tika/parser/pkg/RarParser.java +++ /dev/null @@ -1,110 +0,0 @@ -/* - * Licensed to the Apache Software Foundation (ASF) under one or more - * contributor license agreements. See the NOTICE file distributed with - * this work for additional information regarding copyright ownership. - * The ASF licenses this file to You under the Apache License, Version 2.0 - * (the "License"); you may not use this file except in compliance with - * the License. 
You may obtain a copy of the License at - * - * http://www.apache.org/licenses/LICENSE-2.0 - * - * Unless required by applicable law or agreed to in writing, software - * distributed under the License is distributed on an "AS IS" BASIS, - * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. - * See the License for the specific language governing permissions and - * limitations under the License. - */ -package org.apache.tika.parser.pkg; - -import java.io.IOException; -import java.io.InputStream; -import java.util.Collections; -import java.util.Set; - -import org.apache.tika.exception.EncryptedDocumentException; -import org.apache.tika.exception.TikaException; -import org.apache.tika.extractor.EmbeddedDocumentExtractor; -import org.apache.tika.extractor.ParsingEmbeddedDocumentExtractor; -import org.apache.tika.io.TemporaryResources; -import org.apache.tika.io.TikaInputStream; -import org.apache.tika.metadata.Metadata; -import org.apache.tika.mime.MediaType; -import org.apache.tika.parser.AbstractParser; -import org.apache.tika.parser.ParseContext; -import org.apache.tika.sax.XHTMLContentHandler; -import org.xml.sax.ContentHandler; -import org.xml.sax.SAXException; - -import com.github.junrar.Archive; -import com.github.junrar.exception.RarException; -import com.github.junrar.rarfile.FileHeader; - -/** - * Parser for Rar files. 
- */ -public class RarParser extends AbstractParser { - private static final long serialVersionUID = 6157727985054451501L; - - private static final Set SUPPORTED_TYPES = Collections - .singleton(MediaType.application("x-rar-compressed")); - - @Override - public Set getSupportedTypes(ParseContext arg0) { - return SUPPORTED_TYPES; - } - - @Override - public void parse(InputStream stream, ContentHandler handler, - Metadata metadata, ParseContext context) throws IOException, - SAXException, TikaException { - - XHTMLContentHandler xhtml = new XHTMLContentHandler(handler, metadata); - xhtml.startDocument(); - - EmbeddedDocumentExtractor extractor = context.get( - EmbeddedDocumentExtractor.class, - new ParsingEmbeddedDocumentExtractor(context)); - - Archive rar = null; - try (TemporaryResources tmp = new TemporaryResources()) { - TikaInputStream tis = TikaInputStream.get(stream, tmp); - rar = new Archive(tis.getFile()); - - if (rar.isEncrypted()) { - throw new EncryptedDocumentException(); - } - - //Without this BodyContentHandler does not work - xhtml.element("div", " "); - - FileHeader header = rar.nextFileHeader(); - while (header != null && !Thread.currentThread().isInterrupted()) { - if (!header.isDirectory()) { - try (InputStream subFile = rar.getInputStream(header)) { - Metadata entrydata = PackageParser.handleEntryMetadata( - "".equals(header.getFileNameW()) ? 
header.getFileNameString() : header.getFileNameW(), - header.getCTime(), header.getMTime(), - header.getFullUnpackSize(), - xhtml - ); - - if (extractor.shouldParseEmbedded(entrydata)) { - extractor.parseEmbedded(subFile, handler, entrydata, true); - } - } - } - - header = rar.nextFileHeader(); - } - - } catch (RarException e) { - throw new TikaException("RarParser Exception", e); - } finally { - if (rar != null) - rar.close(); - - } - - xhtml.endDocument(); - } -} diff --git a/tika-parsers/src/main/java/org/apache/tika/parser/pkg/ZipContainerDetector.java b/tika-parsers/src/main/java/org/apache/tika/parser/pkg/ZipContainerDetector.java index fcbf70a..d54cfb3 100644 --- a/tika-parsers/src/main/java/org/apache/tika/parser/pkg/ZipContainerDetector.java +++ b/tika-parsers/src/main/java/org/apache/tika/parser/pkg/ZipContainerDetector.java @@ -22,7 +22,6 @@ import java.util.Enumeration; import java.util.HashSet; import java.util.Iterator; -import java.util.Locale; import java.util.Set; import java.util.regex.Pattern; @@ -35,15 +34,15 @@ import org.apache.commons.compress.compressors.CompressorException; import org.apache.commons.compress.compressors.CompressorInputStream; import org.apache.commons.compress.compressors.CompressorStreamFactory; -import org.apache.commons.io.IOUtils; +import org.apache.poi.extractor.ExtractorFactory; import org.apache.poi.openxml4j.exceptions.InvalidFormatException; import org.apache.poi.openxml4j.opc.OPCPackage; import org.apache.poi.openxml4j.opc.PackageAccess; import org.apache.poi.openxml4j.opc.PackagePart; import org.apache.poi.openxml4j.opc.PackageRelationshipCollection; -import org.apache.poi.openxml4j.opc.PackageRelationshipTypes; import org.apache.tika.detect.Detector; import org.apache.tika.exception.TikaException; +import org.apache.tika.io.IOUtils; import org.apache.tika.io.TemporaryResources; import org.apache.tika.io.TikaInputStream; import org.apache.tika.metadata.Metadata; @@ -51,8 +50,6 @@ import 
org.apache.tika.parser.iwork.IWorkPackageParser; import org.apache.tika.parser.iwork.IWorkPackageParser.IWORKDocumentType; -import static java.nio.charset.StandardCharsets.UTF_8; - /** * A detector that works on Zip documents and other archive and compression * formats to figure out exactly what the file is. @@ -60,13 +57,6 @@ public class ZipContainerDetector implements Detector { private static final Pattern MACRO_TEMPLATE_PATTERN = Pattern.compile("macroenabledtemplate$", Pattern.CASE_INSENSITIVE); - // TODO Remove this constant once we upgrade to POI 3.12 beta 2, then use PackageRelationshipTypes - private static final String VISIO_DOCUMENT = - "http://schemas.microsoft.com/visio/2010/relationships/document"; - // TODO Remove this constant once we upgrade to POI 3.12 beta 2, then use PackageRelationshipTypes - private static final String STRICT_CORE_DOCUMENT = - "http://purl.oclc.org/ooxml/officeDocument/relationships/officeDocument"; - /** Serial version UID */ private static final long serialVersionUID = 2891763938430295453L; @@ -144,7 +134,7 @@ try { MediaType type = detectOpenDocument(zip); if (type == null) { - type = detectOPCBased(zip, tis); + type = detectOfficeOpenXML(zip, tis); } if (type == null) { type = detectIWork(zip); @@ -188,8 +178,11 @@ try { ZipArchiveEntry mimetype = zip.getEntry("mimetype"); if (mimetype != null) { - try (InputStream stream = zip.getInputStream(mimetype)) { - return MediaType.parse(IOUtils.toString(stream, UTF_8)); + InputStream stream = zip.getInputStream(mimetype); + try { + return MediaType.parse(IOUtils.toString(stream, "UTF-8")); + } finally { + stream.close(); } } else { return null; @@ -199,7 +192,7 @@ } } - private static MediaType detectOPCBased(ZipFile zip, TikaInputStream stream) { + private static MediaType detectOfficeOpenXML(ZipFile zip, TikaInputStream stream) { try { if (zip.getEntry("_rels/.rels") != null || zip.getEntry("[Content_Types].xml") != null) { @@ -207,20 +200,8 @@ OPCPackage pkg = 
OPCPackage.open(stream.getFile().getPath(), PackageAccess.READ); stream.setOpenContainer(pkg); - // Is at an OOXML format? - MediaType type = detectOfficeOpenXML(pkg); - if (type != null) return type; - - // Is it XPS format? - type = detectXPSOPC(pkg); - if (type != null) return type; - - // Is it an AutoCAD format? - type = detectAutoCADOPC(pkg); - if (type != null) return type; - - // We don't know what it is, sorry - return null; + // Detect based on the open OPC Package + return detectOfficeOpenXML(pkg); } else { return null; } @@ -237,18 +218,8 @@ * opened Package */ public static MediaType detectOfficeOpenXML(OPCPackage pkg) { - // Check for the normal Office core document PackageRelationshipCollection core = - pkg.getRelationshipsByType(PackageRelationshipTypes.CORE_DOCUMENT); - // Otherwise check for some other Office core document types - if (core.size() == 0) { - core = pkg.getRelationshipsByType(STRICT_CORE_DOCUMENT); - } - if (core.size() == 0) { - core = pkg.getRelationshipsByType(VISIO_DOCUMENT); - } - - // If we didn't find a single core document of any type, skip detection + pkg.getRelationshipsByType(ExtractorFactory.CORE_DOCUMENT_REL); if (core.size() != 1) { // Invalid OOXML Package received return null; @@ -262,42 +233,16 @@ String docType = coreType.substring(0, coreType.lastIndexOf('.')); // The Macro Enabled formats are a little special - if(docType.toLowerCase(Locale.ROOT).endsWith("macroenabled")) { - docType = docType.toLowerCase(Locale.ROOT) + ".12"; - } - - if(docType.toLowerCase(Locale.ROOT).endsWith("macroenabledtemplate")) { + if(docType.toLowerCase().endsWith("macroenabled")) { + docType = docType.toLowerCase() + ".12"; + } + + if(docType.toLowerCase().endsWith("macroenabledtemplate")) { docType = MACRO_TEMPLATE_PATTERN.matcher(docType).replaceAll("macroenabled.12"); } // Build the MediaType object and return return MediaType.parse(docType); - } - /** - * Detects Open XML Paper Specification (XPS) - */ - private static MediaType 
detectXPSOPC(OPCPackage pkg) { - PackageRelationshipCollection xps = - pkg.getRelationshipsByType("http://schemas.microsoft.com/xps/2005/06/fixedrepresentation"); - if (xps.size() == 1) { - return MediaType.application("vnd.ms-xpsdocument"); - } else { - // Non-XPS Package received - return null; - } - } - /** - * Detects AutoCAD formats that live in OPC packaging - */ - private static MediaType detectAutoCADOPC(OPCPackage pkg) { - PackageRelationshipCollection dwfxSeq = - pkg.getRelationshipsByType("http://schemas.autodesk.com/dwfx/2007/relationships/documentsequence"); - if (dwfxSeq.size() == 1) { - return MediaType.parse("model/vnd.dwfx+xps"); - } else { - // Non-AutoCAD Package received - return null; - } } private static MediaType detectIWork(ZipFile zip) { @@ -382,8 +327,10 @@ add(Pattern.compile("^Payload/.*\\.app/$")); add(Pattern.compile("^Payload/.*\\.app/_CodeSignature/$")); add(Pattern.compile("^Payload/.*\\.app/_CodeSignature/CodeResources$")); + add(Pattern.compile("^Payload/.*\\.app/CodeResources$")); add(Pattern.compile("^Payload/.*\\.app/Info\\.plist$")); add(Pattern.compile("^Payload/.*\\.app/PkgInfo$")); + add(Pattern.compile("^Payload/.*\\.app/ResourceRules\\.plist$")); }}; @SuppressWarnings("unchecked") private static MediaType detectIpa(ZipFile zip) { diff --git a/tika-parsers/src/main/java/org/apache/tika/parser/prt/PRTParser.java b/tika-parsers/src/main/java/org/apache/tika/parser/prt/PRTParser.java index ddb45f6..9d12102 100644 --- a/tika-parsers/src/main/java/org/apache/tika/parser/prt/PRTParser.java +++ b/tika-parsers/src/main/java/org/apache/tika/parser/prt/PRTParser.java @@ -34,8 +34,6 @@ import org.xml.sax.ContentHandler; import org.xml.sax.SAXException; -import static java.nio.charset.StandardCharsets.US_ASCII; - /** * A basic text extracting parser for the CADKey PRT (CAD Drawing) * format. It outputs text from note entries. 
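The PRTParser hunk that follows turns a fixed-width ASCII timestamp (read as 12 raw bytes) into an ISO-8601-style string by slicing substrings, gated on a plausible "19"/"20" year prefix. A minimal sketch of that slicing logic; the class and method names here are illustrative, not Tika API, and the tail of the format string (beyond the minutes field) is assumed from the visible prefix of the hunk:

```java
// Hypothetical sketch of the fixed-width date handling in PRTParser:
// "201510181230" -> "2015-10-18T12:30". Names are illustrative.
public class PrtDateSketch {
    public static String toIso(String dateStr) {
        // Only fields that plausibly start with a 19xx/20xx year are converted
        if (!(dateStr.startsWith("19") || dateStr.startsWith("20"))) {
            return null;
        }
        return dateStr.substring(0, 4) + "-" + dateStr.substring(4, 6) + "-" +
               dateStr.substring(6, 8) + "T" + dateStr.substring(8, 10) + ":" +
               dateStr.substring(10, 12);
    }

    public static void main(String[] args) {
        System.out.println(toIso("201510181230")); // 2015-10-18T12:30
    }
}
```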
@@ -82,7 +80,7 @@ byte[] date = new byte[12]; IOUtils.readFully(stream, date); - String dateStr = new String(date, US_ASCII); + String dateStr = new String(date, "ASCII"); if(dateStr.startsWith("19") || dateStr.startsWith("20")) { String formattedDate = dateStr.substring(0, 4) + "-" + dateStr.substring(4,6) + "-" + dateStr.substring(6,8) + "T" + dateStr.substring(8,10) + ":" + diff --git a/tika-parsers/src/main/java/org/apache/tika/parser/rtf/GroupState.java b/tika-parsers/src/main/java/org/apache/tika/parser/rtf/GroupState.java index 4a9a1d1..ea8f7a3 100644 --- a/tika-parsers/src/main/java/org/apache/tika/parser/rtf/GroupState.java +++ b/tika-parsers/src/main/java/org/apache/tika/parser/rtf/GroupState.java @@ -33,18 +33,6 @@ public int list; public int listLevel; public Charset fontCharset; - //in objdata - public boolean objdata; - //depth in pict, 1 = at pict level - public int pictDepth; - //in picprop key/value pair - public boolean sp; - //in picprop's name - public boolean sn; - //in picprop's value - public boolean sv; - //in embedded object or not - public boolean object; // Create default (root) GroupState public GroupState() { @@ -59,9 +47,6 @@ list = other.list; listLevel = other.listLevel; fontCharset = other.fontCharset; - depth = 1 + other.depth; - pictDepth = other.pictDepth > 0 ? 
other.pictDepth + 1 : 0; - //do not inherit object, sn, sv or sp - + depth = 1+other.depth; } } diff --git a/tika-parsers/src/main/java/org/apache/tika/parser/rtf/ListDescriptor.java b/tika-parsers/src/main/java/org/apache/tika/parser/rtf/ListDescriptor.java index e7142bd..704b3b4 100644 --- a/tika-parsers/src/main/java/org/apache/tika/parser/rtf/ListDescriptor.java +++ b/tika-parsers/src/main/java/org/apache/tika/parser/rtf/ListDescriptor.java @@ -29,7 +29,8 @@ public boolean isStyle; public int[] numberType = new int[9]; - public boolean isUnordered(int level) { + public boolean isUnordered(int level) + { return numberType[level] == NUMBER_TYPE_BULLET; } } diff --git a/tika-parsers/src/main/java/org/apache/tika/parser/rtf/RTFEmbObjHandler.java b/tika-parsers/src/main/java/org/apache/tika/parser/rtf/RTFEmbObjHandler.java deleted file mode 100644 index 395ff54..0000000 --- a/tika-parsers/src/main/java/org/apache/tika/parser/rtf/RTFEmbObjHandler.java +++ /dev/null @@ -1,287 +0,0 @@ -package org.apache.tika.parser.rtf; -/* - * Licensed to the Apache Software Foundation (ASF) under one or more - * contributor license agreements. See the NOTICE file distributed with - * this work for additional information regarding copyright ownership. - * The ASF licenses this file to You under the Apache License, Version 2.0 - * (the "License"); you may not use this file except in compliance with - * the License. You may obtain a copy of the License at - * - * http://www.apache.org/licenses/LICENSE-2.0 - * - * Unless required by applicable law or agreed to in writing, software - * distributed under the License is distributed on an "AS IS" BASIS, - * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. - * See the License for the specific language governing permissions and - * limitations under the License. 
- */ - -import java.io.ByteArrayOutputStream; -import java.io.IOException; -import java.io.InputStream; -import java.util.concurrent.atomic.AtomicInteger; - -import org.apache.commons.io.FilenameUtils; -import org.apache.tika.config.TikaConfig; -import org.apache.tika.detect.Detector; -import org.apache.tika.exception.TikaException; -import org.apache.tika.extractor.EmbeddedDocumentExtractor; -import org.apache.tika.extractor.ParsingEmbeddedDocumentExtractor; -import org.apache.tika.io.TikaInputStream; -import org.apache.tika.metadata.Metadata; -import org.apache.tika.metadata.RTFMetadata; -import org.apache.tika.mime.MediaType; -import org.apache.tika.mime.MimeType; -import org.apache.tika.mime.MimeTypeException; -import org.apache.tika.mime.MimeTypes; -import org.apache.tika.parser.ParseContext; -import org.apache.tika.sax.EmbeddedContentHandler; -import org.xml.sax.ContentHandler; -import org.xml.sax.SAXException; - -/** - * This class buffers data from embedded objects and pictures. - *
      - *
      - * When the parser has finished an object or picture and called - * {@link #handleCompletedObject()}, this will write the object - * to the {@link #handler}. - *
      - *
      - * This (in combination with TextExtractor) expects basically a flat parse. It will pull out - * all pict whether they are tied to objdata or are intended - * to be standalone. - *
      - *
      - * This tries to pull metadata around a pict that is encoded - * with {sp {sn} {sv}} types of data. This information - * sometimes contains the name and even full file path of the original file. - */ -class RTFEmbObjHandler { - - private static final String EMPTY_STRING = ""; - private final ContentHandler handler; - - - private final ParseContext context; - private final ByteArrayOutputStream os; - //high hex cached for writing hexpair chars (data) - private int hi = -1; - private int thumbCount = 0; - //don't need atomic, do need mutable - private AtomicInteger unknownFilenameCount = new AtomicInteger(); - private boolean inObject = false; - private String sv = EMPTY_STRING; - private String sn = EMPTY_STRING; - private StringBuilder sb = new StringBuilder(); - private Metadata metadata; - private EMB_STATE state = EMB_STATE.NADA; - protected RTFEmbObjHandler(ContentHandler handler, Metadata metadata, ParseContext context) { - this.handler = handler; - this.context = context; - os = new ByteArrayOutputStream(); - } - - protected void startPict() { - state = EMB_STATE.PICT; - metadata = new Metadata(); - } - - protected void startObjData() { - state = EMB_STATE.OBJDATA; - metadata = new Metadata(); - } - - protected void startSN() { - sb.setLength(0); - sb.append(RTFMetadata.RTF_PICT_META_PREFIX); - } - - protected void endSN() { - sn = sb.toString(); - } - - protected void startSV() { - sb.setLength(0); - } - - protected void endSV() { - sv = sb.toString(); - } - - //end metadata pair - protected void endSP() { - metadata.add(sn, sv); - } - - protected boolean getInObject() { - return inObject; - } - - protected void setInObject(boolean v) { - inObject = v; - } - - protected void writeMetadataChar(char c) { - sb.append(c); - } - - protected void writeHexChar(int b) throws IOException, TikaException { - //if not hexchar, ignore - //white space is common - if (TextExtractor.isHexChar(b)) { - if (hi == -1) { - hi = 16 * TextExtractor.hexValue(b); - } else { - 
long sum = hi + TextExtractor.hexValue(b); - if (sum > Integer.MAX_VALUE || sum < 0) { - throw new IOException("hex char to byte overflow"); - } - - os.write((int) sum); - - hi = -1; - } - return; - } - if (b == -1) { - throw new TikaException("hit end of stream before finishing byte pair"); - } - } - - protected void writeBytes(InputStream is, int len) throws IOException, TikaException { - if (len < 0 || len > RTFParser.getMaxBytesForEmbeddedObject()) { - throw new IOException("length of bytes to read out of bounds: " + len); - } - - byte[] bytes = new byte[len]; - int bytesRead = is.read(bytes); - if (bytesRead < len) { - throw new TikaException("unexpected end of file: need " + len + - " bytes of binary data, found " + (len - bytesRead)); - } - os.write(bytes); - } - - /** - * Call this when the objdata/pict has completed - * - * @throws IOException - * @throws SAXException - * @throws TikaException - */ - protected void handleCompletedObject() throws IOException, SAXException, TikaException { - EmbeddedDocumentExtractor embeddedExtractor = context.get(EmbeddedDocumentExtractor.class); - - if (embeddedExtractor == null) { - embeddedExtractor = new ParsingEmbeddedDocumentExtractor(context); - } - - byte[] bytes = os.toByteArray(); - if (state == EMB_STATE.OBJDATA) { - RTFObjDataParser objParser = new RTFObjDataParser(); - try { - byte[] objBytes = objParser.parse(bytes, metadata, unknownFilenameCount); - extractObj(objBytes, handler, embeddedExtractor, metadata); - } catch (IOException e) { - //swallow. If anything goes wrong, ignore. 
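The writeHexChar logic above decodes RTF's hex-pair encoding of binary data: the high nibble is cached (pre-multiplied by 16) until the low nibble arrives, and non-hex characters such as whitespace between pairs are skipped. A self-contained sketch of that technique, using `Character.digit` in place of the TextExtractor helpers; names are illustrative:

```java
import java.io.ByteArrayOutputStream;

// Illustrative sketch of RTF hex-pair decoding: cache the high nibble,
// complete the byte when the low nibble arrives, skip whitespace.
public class HexPairSketch {
    public static byte[] decode(String hex) {
        ByteArrayOutputStream os = new ByteArrayOutputStream();
        int hi = -1;
        for (char c : hex.toCharArray()) {
            int v = Character.digit(c, 16);
            if (v < 0) {
                continue; // not a hex char; whitespace between pairs is common
            }
            if (hi == -1) {
                hi = 16 * v;      // cache the high nibble
            } else {
                os.write(hi + v); // low nibble completes the byte
                hi = -1;
            }
        }
        return os.toByteArray();
    }

    public static void main(String[] args) {
        System.out.println(decode("ff 00 10").length); // 3
    }
}
```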
- } - } else if (state == EMB_STATE.PICT) { - String filePath = metadata.get(RTFMetadata.RTF_PICT_META_PREFIX + "wzDescription"); - if (filePath != null && filePath.length() > 0) { - metadata.set(Metadata.EMBEDDED_RELATIONSHIP_ID, filePath); - metadata.set(Metadata.RESOURCE_NAME_KEY, FilenameUtils.getName(filePath)); - } - metadata.set(RTFMetadata.THUMBNAIL, Boolean.toString(inObject)); - extractObj(bytes, handler, embeddedExtractor, metadata); - - } else if (state == EMB_STATE.NADA) { - //swallow...no start for pict or embed?! - } - reset(); - } - - private void extractObj(byte[] bytes, ContentHandler handler, - EmbeddedDocumentExtractor embeddedExtractor, Metadata metadata) - throws SAXException, IOException, TikaException { - - if (bytes == null) { - return; - } - - metadata.set(Metadata.CONTENT_LENGTH, Integer.toString(bytes.length)); - - if (embeddedExtractor.shouldParseEmbedded(metadata)) { - TikaInputStream stream = TikaInputStream.get(bytes); - if (metadata.get(Metadata.RESOURCE_NAME_KEY) == null) { - String extension = getExtension(stream, metadata); - stream.reset(); - if (inObject && state == EMB_STATE.PICT) { - metadata.set(Metadata.RESOURCE_NAME_KEY, "thumbnail_" + thumbCount++ + extension); - metadata.set(RTFMetadata.THUMBNAIL, "true"); - } else { - metadata.set(Metadata.RESOURCE_NAME_KEY, "file_" + unknownFilenameCount.getAndIncrement() + - extension); - } - } - try { - embeddedExtractor.parseEmbedded( - stream, - new EmbeddedContentHandler(handler), - metadata, false); - } finally { - stream.close(); - } - } - } - - private String getExtension(TikaInputStream is, Metadata metadata) { - String cType = metadata.get(Metadata.CONTENT_TYPE); - TikaConfig config = getConfig(); - if (cType == null) { - Detector detector = config.getDetector(); - try { - MediaType mediaType = detector.detect(is, metadata); - MimeTypes types = config.getMimeRepository(); - MimeType mime = types.forName(mediaType.toString()); - metadata.set(Metadata.CONTENT_TYPE, 
mediaType.getSubtype()); - return mime.getExtension(); - } catch (IOException e) { - //swallow - } catch (MimeTypeException e) { - - } - } - return ".bin"; - } - - private TikaConfig getConfig() { - TikaConfig config = context.get(TikaConfig.class); - if (config == null) { - config = TikaConfig.getDefaultConfig(); - } - return config; - } - - /** - * reset state after each object. - * Do not reset unknown file number. - */ - protected void reset() { - state = EMB_STATE.NADA; - os.reset(); - metadata = new Metadata(); - hi = -1; - sv = EMPTY_STRING; - sn = EMPTY_STRING; - sb.setLength(0); - } - - private enum EMB_STATE { - PICT, //recording pict data - OBJDATA, //recording objdata - NADA - } -} diff --git a/tika-parsers/src/main/java/org/apache/tika/parser/rtf/RTFObjDataParser.java b/tika-parsers/src/main/java/org/apache/tika/parser/rtf/RTFObjDataParser.java deleted file mode 100644 index cc9d62f..0000000 --- a/tika-parsers/src/main/java/org/apache/tika/parser/rtf/RTFObjDataParser.java +++ /dev/null @@ -1,315 +0,0 @@ -/* - * Licensed to the Apache Software Foundation (ASF) under one or more - * contributor license agreements. See the NOTICE file distributed with - * this work for additional information regarding copyright ownership. - * The ASF licenses this file to You under the Apache License, Version 2.0 - * (the "License"); you may not use this file except in compliance with - * the License. You may obtain a copy of the License at - * - * http://www.apache.org/licenses/LICENSE-2.0 - * - * Unless required by applicable law or agreed to in writing, software - * distributed under the License is distributed on an "AS IS" BASIS, - * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. - * See the License for the specific language governing permissions and - * limitations under the License. 
- */ - - -package org.apache.tika.parser.rtf; - -import java.io.ByteArrayInputStream; -import java.io.ByteArrayOutputStream; -import java.io.FileNotFoundException; -import java.io.IOException; -import java.io.InputStream; -import java.io.UnsupportedEncodingException; -import java.util.Locale; -import java.util.concurrent.atomic.AtomicInteger; - -import org.apache.commons.io.FilenameUtils; -import org.apache.poi.poifs.filesystem.DirectoryNode; -import org.apache.poi.poifs.filesystem.DocumentEntry; -import org.apache.poi.poifs.filesystem.DocumentInputStream; -import org.apache.poi.poifs.filesystem.Entry; -import org.apache.poi.poifs.filesystem.NPOIFSFileSystem; -import org.apache.poi.poifs.filesystem.Ole10Native; -import org.apache.poi.poifs.filesystem.Ole10NativeException; -import org.apache.poi.util.IOUtils; -import org.apache.tika.io.TikaInputStream; -import org.apache.tika.metadata.Metadata; -import org.apache.tika.metadata.RTFMetadata; -import org.apache.tika.parser.microsoft.OfficeParser.POIFSDocumentType; - -/** - * Many thanks to Simon Mourier for: - * http://stackoverflow.com/questions/14779647/extract-embedded-image-object-in-rtf - * and for granting permission to use his code in Tika. 
- */ -class RTFObjDataParser { - - private final static int[] INT_LE_POWS = new int[]{ - 1, 256, 65536, 16777216 - }; - - private final static String WIN_ASCII = "WINDOWS-1252"; - - /** - * Parses the embedded object/pict string - * - * @param bytes actual bytes (already converted from the - * hex pair string stored in the embedded object data into actual bytes or read - * as raw binary bytes) - * @return a SimpleRTFEmbObj or null - * @throws IOException if there are any surprise surprises during parsing - */ - - /** - * @param bytes - * @param metadata incoming metadata - * @param unknownFilenameCount - * @return byte[] for contents of obj data - * @throws IOException - */ - protected byte[] parse(byte[] bytes, Metadata metadata, AtomicInteger unknownFilenameCount) - throws IOException { - ByteArrayInputStream is = new ByteArrayInputStream(bytes); - long version = readUInt(is); - metadata.add(RTFMetadata.EMB_APP_VERSION, Long.toString(version)); - - long formatId = readUInt(is); - //2 is an embedded object. 1 is a link. 
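The obj-data header parsing above reads little-endian unsigned ints (version, formatId, data size) and length-prefixed ANSI strings (class, topic, and item names), treating formatId 2 as an embedded object and 1 as a link. A sketch of those two primitives under the same wire format; assumes well-formed input without the original's bounds checks, and the names are illustrative rather than Tika API:

```java
import java.io.ByteArrayInputStream;
import java.io.IOException;
import java.io.InputStream;
import java.nio.charset.Charset;

// Sketch of the little-endian primitives used when parsing OLE obj-data:
// a 4-byte unsigned int and a uint-length-prefixed WINDOWS-1252 string.
public class OleHeaderSketch {
    static long readUInt(InputStream is) throws IOException {
        long sum = 0;
        for (int i = 0; i < 4; i++) {
            int v = is.read();
            if (v == -1) throw new IOException("EOF in uint");
            sum += ((long) v) << (8 * i); // little-endian: low byte first
        }
        return sum;
    }

    static String readLengthPrefixedAnsiString(InputStream is) throws IOException {
        long len = readUInt(is);
        byte[] bytes = new byte[(int) len];
        if (is.read(bytes) != len) throw new IOException("EOF in string");
        return new String(bytes, Charset.forName("WINDOWS-1252"));
    }

    public static void main(String[] args) throws IOException {
        // length 4 (little-endian), then the string bytes
        byte[] data = {4, 0, 0, 0, 'W', 'o', 'r', 'd'};
        System.out.println(readLengthPrefixedAnsiString(new ByteArrayInputStream(data)));
    }
}
```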
- if (formatId != 2L) { - return null; - } - String className = readLengthPrefixedAnsiString(is).trim(); - String topicName = readLengthPrefixedAnsiString(is).trim(); - String itemName = readLengthPrefixedAnsiString(is).trim(); - - if (className != null && className.length() > 0) { - metadata.add(RTFMetadata.EMB_CLASS, className); - } - if (topicName != null && topicName.length() > 0) { - metadata.add(RTFMetadata.EMB_TOPIC, topicName); - } - if (itemName != null && itemName.length() > 0) { - metadata.add(RTFMetadata.EMB_ITEM, itemName); - } - - long dataSz = readUInt(is); - - //readBytes tests for reading too many bytes - byte[] embObjBytes = readBytes(is, dataSz); - - if (className.toLowerCase(Locale.ROOT).equals("package")) { - return handlePackage(embObjBytes, metadata); - } else if (className.toLowerCase(Locale.ROOT).equals("pbrush")) { - //simple bitmap bytes - return embObjBytes; - } else { - ByteArrayInputStream embIs = new ByteArrayInputStream(embObjBytes); - if (NPOIFSFileSystem.hasPOIFSHeader(embIs)) { - try { - return handleEmbeddedPOIFS(embIs, metadata, unknownFilenameCount); - } catch (IOException e) { - //swallow - } - } - } - return embObjBytes; - } - - - //will throw IOException if not actually POIFS - //can return null byte[] - private byte[] handleEmbeddedPOIFS(InputStream is, Metadata metadata, - AtomicInteger unknownFilenameCount) - throws IOException { - - byte[] ret = null; - try (NPOIFSFileSystem fs = new NPOIFSFileSystem(is)) { - - DirectoryNode root = fs.getRoot(); - - if (root == null) { - return ret; - } - - if (root.hasEntry("Package")) { - Entry ooxml = root.getEntry("Package"); - TikaInputStream stream = TikaInputStream.get(new DocumentInputStream((DocumentEntry) ooxml)); - - ByteArrayOutputStream out = new ByteArrayOutputStream(); - - IOUtils.copy(stream, out); - ret = out.toByteArray(); - } else { - //try poifs - POIFSDocumentType type = POIFSDocumentType.detectType(root); - if (type == POIFSDocumentType.OLE10_NATIVE) { - try { - // 
Try to un-wrap the OLE10Native record: - Ole10Native ole = Ole10Native.createFromEmbeddedOleObject(root); - ret = ole.getDataBuffer(); - } catch (Ole10NativeException ex) { - // Not a valid OLE10Native record, skip it - } - } else if (type == POIFSDocumentType.COMP_OBJ) { - - DocumentEntry contentsEntry; - try { - contentsEntry = (DocumentEntry) root.getEntry("CONTENTS"); - } catch (FileNotFoundException ioe) { - contentsEntry = (DocumentEntry) root.getEntry("Contents"); - } - - try (DocumentInputStream inp = new DocumentInputStream(contentsEntry)) { - ret = new byte[contentsEntry.getSize()]; - inp.readFully(ret); - } - } else { - - ByteArrayOutputStream out = new ByteArrayOutputStream(); - is.reset(); - IOUtils.copy(is, out); - ret = out.toByteArray(); - metadata.set(Metadata.RESOURCE_NAME_KEY, "file_" + unknownFilenameCount.getAndIncrement() + "." + type.getExtension()); - metadata.set(Metadata.CONTENT_TYPE, type.getType().toString()); - } - } - } - return ret; - } - - - /** - * can return null if there is a linked object - * instead of an embedded file - */ - private byte[] handlePackage(byte[] pkgBytes, Metadata metadata) throws IOException { - //now parse the package header - ByteArrayInputStream is = new ByteArrayInputStream(pkgBytes); - readUShort(is); - - String displayName = readAnsiString(is); - - //should we add this to the metadata? - readAnsiString(is); //iconFilePath - readUShort(is); //iconIndex - int type = readUShort(is); //type - - //1 is link, 3 is embedded object - //this only handles embedded objects - if (type != 3) { - return null; - } - //should we really be ignoring this filePathLen? 
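The "Package" header handling above uses two more primitives: a null-terminated ANSI string (display name, icon path) and a little-endian unsigned short (type, where 3 marks an embedded object and 1 a link). A compact sketch of both reads, mirroring the deleted code's logic; the class name is illustrative:

```java
import java.io.ByteArrayInputStream;
import java.io.IOException;
import java.io.InputStream;

// Sketch of the OLE "Package" header reads: null-terminated ANSI string
// and little-endian unsigned short.
public class PackageHeaderSketch {
    static String readAnsiString(InputStream is) throws IOException {
        StringBuilder sb = new StringBuilder();
        int c = is.read();
        while (c > 0) {           // stop at the 0x00 terminator
            sb.append((char) c);
            c = is.read();
        }
        if (c == -1) throw new IOException("EOF before string terminator");
        return sb.toString();
    }

    static int readUShort(InputStream is) throws IOException {
        int lo = is.read();
        int hi = is.read();
        if (lo == -1 || hi == -1) throw new IOException("EOF in ushort");
        return lo + 256 * hi;     // little-endian
    }

    public static void main(String[] args) throws IOException {
        InputStream is = new ByteArrayInputStream(
                new byte[]{'a', '.', 't', 'x', 't', 0, 3, 0});
        System.out.println(readAnsiString(is)); // a.txt
        System.out.println(readUShort(is));     // 3 -> embedded object
    }
}
```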
- readUInt(is); //filePathLen - - String ansiFilePath = readAnsiString(is); //filePath - long bytesLen = readUInt(is); - byte[] objBytes = initByteArray(bytesLen); - is.read(objBytes); - StringBuilder unicodeFilePath = new StringBuilder(); - - try { - long unicodeLen = readUInt(is); - - for (int i = 0; i < unicodeLen; i++) { - int lo = is.read(); - int hi = is.read(); - int sum = lo + 256 * hi; - if (hi == -1 || lo == -1) { - //stream ran out; empty SB and stop - unicodeFilePath.setLength(0); - break; - } - unicodeFilePath.append((char) sum); - } - } catch (IOException e) { - //swallow; the unicode file path is optional and might not happen - unicodeFilePath.setLength(0); - } - String fileNameToUse = ""; - String pathToUse = ""; - if (unicodeFilePath.length() > 0) { - String p = unicodeFilePath.toString(); - fileNameToUse = p; - pathToUse = p; - } else { - fileNameToUse = displayName == null ? "" : displayName; - pathToUse = ansiFilePath == null ? "" : ansiFilePath; - } - metadata.set(Metadata.RESOURCE_NAME_KEY, FilenameUtils.getName(fileNameToUse)); - metadata.set(Metadata.EMBEDDED_RELATIONSHIP_ID, pathToUse); - - return objBytes; - } - - - private int readUShort(InputStream is) throws IOException { - int lo = is.read(); - int hi = is.read() * 256; - if (lo == -1 || hi == -1) { - throw new IOException("Hit end of stream before reading little endian unsigned short."); - } - return hi + lo; - } - - private long readUInt(InputStream is) throws IOException { - long sum = 0; - for (int i = 0; i < 4; i++) { - int v = is.read(); - if (v == -1) { - throw new IOException("Hit end of stream before finishing little endian unsigned int."); - } - sum += v * (long) INT_LE_POWS[i]; - } - return sum; - } - - private String readAnsiString(InputStream is) throws IOException { - StringBuilder sb = new StringBuilder(); - int c = is.read(); - while (c > 0) { - sb.append((char) c); - c = is.read(); - } - if (c == -1) { - throw new IOException("Hit end of stream before end of 
AnsiString"); - } - return sb.toString(); - } - - private String readLengthPrefixedAnsiString(InputStream is) throws IOException { - long len = readUInt(is); - byte[] bytes = readBytes(is, len); - try { - return new String(bytes, WIN_ASCII); - } catch (UnsupportedEncodingException e) { - //shouldn't ever happen - throw new IOException("Unsupported encoding"); - } - } - - - private byte[] readBytes(InputStream is, long len) throws IOException { - //initByteArray tests for "reading of too many bytes" - byte[] bytes = initByteArray(len); - int read = is.read(bytes); - if (read != len) { - throw new IOException("Hit end of stream before reading all bytes"); - } - - return bytes; - } - - private byte[] initByteArray(long len) throws IOException { - if (len < 0 || len > RTFParser.getMaxBytesForEmbeddedObject()) { - throw new IOException("Requested length for reading bytes is out of bounds: " + len); - } - return new byte[(int) len]; - - } -} - diff --git a/tika-parsers/src/main/java/org/apache/tika/parser/rtf/RTFParser.java b/tika-parsers/src/main/java/org/apache/tika/parser/rtf/RTFParser.java index c047fae..74d73a6 100644 --- a/tika-parsers/src/main/java/org/apache/tika/parser/rtf/RTFParser.java +++ b/tika-parsers/src/main/java/org/apache/tika/parser/rtf/RTFParser.java @@ -21,8 +21,8 @@ import java.util.Collections; import java.util.Set; -import org.apache.commons.io.input.TaggedInputStream; import org.apache.tika.exception.TikaException; +import org.apache.tika.io.TaggedInputStream; import org.apache.tika.metadata.Metadata; import org.apache.tika.mime.MediaType; import org.apache.tika.parser.AbstractParser; @@ -36,39 +36,11 @@ */ public class RTFParser extends AbstractParser { - /** - * Serial version UID - */ + /** Serial version UID */ private static final long serialVersionUID = -4165069489372320313L; private static final Set SUPPORTED_TYPES = Collections.singleton(MediaType.application("rtf")); - /** - * maximum number of bytes per embedded object/pict (default: 
20MB) - */ - private static int EMB_OBJ_MAX_BYTES = 20 * 1024 * 1024; //20MB - - /** - * See {@link #setMaxBytesForEmbeddedObject(int)}. - * - * @return maximum number of bytes allowed for an embedded object. - */ - public static int getMaxBytesForEmbeddedObject() { - return EMB_OBJ_MAX_BYTES; - } - - /** - * Bytes for embedded objects are currently cached in memory. - * If something goes wrong during the parsing of an embedded object, - * it is possible that a read length may be crazily too long - * and cause a heap crash. - * - * @param max maximum number of bytes to allow for embedded objects. If - * the embedded object has more than this number of bytes, skip it. - */ - public static void setMaxBytesForEmbeddedObject(int max) { - EMB_OBJ_MAX_BYTES = max; - } public Set getSupportedTypes(ParseContext context) { return SUPPORTED_TYPES; @@ -77,12 +49,10 @@ public void parse( InputStream stream, ContentHandler handler, Metadata metadata, ParseContext context) - throws IOException, SAXException, TikaException { + throws IOException, SAXException, TikaException { TaggedInputStream tagged = new TaggedInputStream(stream); try { - XHTMLContentHandler xhtmlHandler = new XHTMLContentHandler(handler, metadata); - RTFEmbObjHandler embObjHandler = new RTFEmbObjHandler(xhtmlHandler, metadata, context); - final TextExtractor ert = new TextExtractor(xhtmlHandler, metadata, embObjHandler); + final TextExtractor ert = new TextExtractor(new XHTMLContentHandler(handler, metadata), metadata); ert.extract(stream); metadata.add(Metadata.CONTENT_TYPE, "application/rtf"); } catch (IOException e) { diff --git a/tika-parsers/src/main/java/org/apache/tika/parser/rtf/TextExtractor.java b/tika-parsers/src/main/java/org/apache/tika/parser/rtf/TextExtractor.java index eeb58ce..08534c3 100644 --- a/tika-parsers/src/main/java/org/apache/tika/parser/rtf/TextExtractor.java +++ b/tika-parsers/src/main/java/org/apache/tika/parser/rtf/TextExtractor.java @@ -29,9 +29,7 @@ import java.util.Calendar; 
import java.util.HashMap; import java.util.LinkedList; -import java.util.Locale; import java.util.Map; -import java.util.TimeZone; import org.apache.tika.exception.TikaException; import org.apache.tika.metadata.Metadata; @@ -54,6 +52,15 @@ final class TextExtractor { private static final Charset ASCII = Charset.forName("US-ASCII"); + + private static Charset getCharset(String name) { + try { + return CharsetUtils.forName(name); + } catch (Exception e) { + return ASCII; + } + } + private static final Charset WINDOWS_1252 = getCharset("WINDOWS-1252"); private static final Charset MAC_ROMAN = getCharset("MacRoman"); private static final Charset SHIFT_JIS = getCharset("Shift_JIS"); @@ -122,16 +129,97 @@ private static final Charset BIG5 = getCharset("Big5"); private static final Charset GB2312 = getCharset("GB2312"); private static final Charset MS949 = getCharset("ms949"); + + // Hold pending bytes (encoded in the current charset) + // for text output: + private byte[] pendingBytes = new byte[16]; + private int pendingByteCount; + private ByteBuffer pendingByteBuffer = ByteBuffer.wrap(pendingBytes); + + // Holds pending chars for text output + private char[] pendingChars = new char[10]; + private int pendingCharCount; + + // Holds chars for a still-being-tokenized control word + private byte[] pendingControl = new byte[10]; + private int pendingControlCount; + + // Used when we decode bytes -> chars using CharsetDecoder: + private final char[] outputArray = new char[128]; + private final CharBuffer outputBuffer = CharBuffer.wrap(outputArray); + + // Reused when possible: + private CharsetDecoder decoder; + private Charset lastCharset; + + private Charset globalCharset = WINDOWS_1252; + private int globalDefaultFont = -1; + private int curFontID = -1; + + // Holds the font table from this RTF doc, mapping + // the font number (from \fN control word) to the + // corresponding charset: + private final Map fontToCharset = + new HashMap(); + + // Group stack: when we open 
a new group, we push + // the previous group state onto the stack; when we + // close the group, we restore it + private final LinkedList groupStates = new LinkedList(); + + // Current group state; in theory this initial + // GroupState is unused because the RTF doc should + // immediately open the top group (start with {): + private GroupState groupState = new GroupState(); + + private boolean inHeader = true; + private int fontTableState; + private int fontTableDepth; + + // Non null if we are processing metadata (title, + // keywords, etc.) inside the info group: + private Property nextMetaData; + private boolean inParagraph; + + // Non-zero if we are processing inside a field destination: + private int fieldState; + + // Non-zero list index + private int pendingListEnd; + private Map listTable = new HashMap(); + private Map listOverrideTable = new HashMap(); + private Map currentListTable; + private ListDescriptor currentList; + private int listTableLevel = -1; + private boolean ignoreLists; + + // Non-null if we've seen the url for a HYPERLINK but not yet + // its text: + private String pendingURL; + + private final StringBuilder pendingBuffer = new StringBuilder(); + + // Used to process the sub-groups inside the upr + // group: + private int uprState = -1; + + private final XHTMLContentHandler out; + private final Metadata metadata; + + // Used when extracting CREATION date: + private int year, month, day, hour, minute; + + // How many next ansi chars we should skip; this + // is 0 except when we are still in the "ansi + // shadow" after seeing a unicode escape, at which + // point it's set to the last ucN skip we had seen: + int ansiSkip = 0; + // The RTF doc has a "font table" that assigns ords // (f0, f1, f2, etc.) to fonts and charsets, using the // \fcharsetN control word. 
This mapping maps from the // N to corresponding Java charset: private static final Map FCHARSET_MAP = - new HashMap(); - // The RTF may specify the \ansicpgN charset in the - // header; this maps the N to the corresponding Java - // character set: - private static final Map ANSICPG_MAP = new HashMap(); static { @@ -175,10 +263,15 @@ FCHARSET_MAP.put(255, CP850); // OEM } + // The RTF may specify the \ansicpgN charset in the + // header; this maps the N to the corresponding Java + // character set: + private static final Map ANSICPG_MAP = + new HashMap(); static { ANSICPG_MAP.put(437, CP4372); // US IBM ANSICPG_MAP.put(708, ISO_8859_6); // Arabic (ASMO 708) - + ANSICPG_MAP.put(709, WINDOWS_709); // Arabic (ASMO 449+, BCON V4) ANSICPG_MAP.put(710, WINDOWS_710); // Arabic (transparent Arabic) ANSICPG_MAP.put(710, WINDOWS_711); // Arabic (Nafitha Enhanced) @@ -234,106 +327,35 @@ ANSICPG_MAP.put(57011, WINDOWS_57011); // Punjabi } - // Used when we decode bytes -> chars using CharsetDecoder: - private final char[] outputArray = new char[128]; - private final CharBuffer outputBuffer = CharBuffer.wrap(outputArray); - // Holds the font table from this RTF doc, mapping - // the font number (from \fN control word) to the - // corresponding charset: - private final Map fontToCharset = - new HashMap(); - // Group stack: when we open a new group, we push - // the previous group state onto the stack; when we - // close the group, we restore it - private final LinkedList groupStates = new LinkedList(); - private final StringBuilder pendingBuffer = new StringBuilder(); - private final XHTMLContentHandler out; - private final Metadata metadata; - private final RTFEmbObjHandler embObjHandler; - // How many next ansi chars we should skip; this - // is 0 except when we are still in the "ansi - // shadow" after seeing a unicode escape, at which - // point it's set to the last ucN skip we had seen: - int ansiSkip = 0; - private int written = 0; - // Hold pending bytes (encoded in the 
current charset) - // for text output: - private byte[] pendingBytes = new byte[16]; - private int pendingByteCount; - private ByteBuffer pendingByteBuffer = ByteBuffer.wrap(pendingBytes); - // Holds pending chars for text output - private char[] pendingChars = new char[10]; - private int pendingCharCount; - // Holds chars for a still-being-tokenized control word - private byte[] pendingControl = new byte[10]; - private int pendingControlCount; - // Reused when possible: - private CharsetDecoder decoder; - private Charset lastCharset; - private Charset globalCharset = WINDOWS_1252; - private int globalDefaultFont = -1; - private int curFontID = -1; - // Current group state; in theory this initial - // GroupState is unused because the RTF doc should - // immediately open the top group (start with {): - private GroupState groupState = new GroupState(); - private boolean inHeader = true; - private int fontTableState; - private int fontTableDepth; - // Non null if we are processing metadata (title, - // keywords, etc.) 
inside the info group: - private Property nextMetaData; - private boolean inParagraph; - // Non-zero if we are processing inside a field destination: - private int fieldState; - // Non-zero list index - private int pendingListEnd; - private Map listTable = new HashMap(); - private Map listOverrideTable = new HashMap(); - private Map currentListTable; - private ListDescriptor currentList; - private int listTableLevel = -1; - private boolean ignoreLists; - // Non-null if we've seen the url for a HYPERLINK but not yet - // its text: - private String pendingURL; - // Used to process the sub-groups inside the upr - // group: - private int uprState = -1; - // Used when extracting CREATION date: - private int year, month, day, hour, minute; - - public TextExtractor(XHTMLContentHandler out, Metadata metadata, - RTFEmbObjHandler embObjHandler) { + public TextExtractor(XHTMLContentHandler out, Metadata metadata) { this.metadata = metadata; this.out = out; - this.embObjHandler = embObjHandler; - } - - private static Charset getCharset(String name) { - try { - return CharsetUtils.forName(name); - } catch (Exception e) { - return ASCII; - } - } - - protected static boolean isHexChar(int ch) { + } + + public boolean isIgnoringLists() { + return ignoreLists; + } + + public void setIgnoreLists(boolean ignore) { + this.ignoreLists = ignore; + } + + private static boolean isHexChar(int ch) { return (ch >= '0' && ch <= '9') || - (ch >= 'a' && ch <= 'f') || - (ch >= 'A' && ch <= 'F'); + (ch >= 'a' && ch <= 'f') || + (ch >= 'A' && ch <= 'F'); } private static boolean isAlpha(int ch) { return (ch >= 'a' && ch <= 'z') || - (ch >= 'A' && ch <= 'Z'); + (ch >= 'A' && ch <= 'Z'); } private static boolean isDigit(int ch) { return ch >= '0' && ch <= '9'; } - protected static int hexValue(int ch) { + private static int hexValue(int ch) { if (ch >= '0' && ch <= '9') { return ch - '0'; } else if (ch >= 'a' && ch <= 'z') { @@ -344,14 +366,6 @@ } } - public boolean isIgnoringLists() { - return 
ignoreLists; - } - - public void setIgnoreLists(boolean ignore) { - this.ignoreLists = ignore; - } - // Push pending bytes or pending chars: private void pushText() throws IOException, SAXException, TikaException { if (pendingByteCount != 0) { @@ -370,28 +384,25 @@ if (pendingCharCount != 0) { pushChars(); } - if (groupState.pictDepth > 0) { - embObjHandler.writeMetadataChar((char) b); - } else { - // Save the byte in pending buffer: - if (pendingByteCount == pendingBytes.length) { - // Gradual but exponential growth: - final byte[] newArray = new byte[(int) (pendingBytes.length * 1.25)]; - System.arraycopy(pendingBytes, 0, newArray, 0, pendingBytes.length); - pendingBytes = newArray; - pendingByteBuffer = ByteBuffer.wrap(pendingBytes); - } - pendingBytes[pendingByteCount++] = (byte) b; - } - } - - // Buffers a byte as part of a control word: + + // Save the byte in pending buffer: + if (pendingByteCount == pendingBytes.length) { + // Gradual but exponential growth: + final byte[] newArray = new byte[(int) (pendingBytes.length*1.25)]; + System.arraycopy(pendingBytes, 0, newArray, 0, pendingBytes.length); + pendingBytes = newArray; + pendingByteBuffer = ByteBuffer.wrap(pendingBytes); + } + pendingBytes[pendingByteCount++] = (byte) b; + } + + // Buffers a byte as part of a control word: private void addControl(int b) { assert isAlpha(b); // Save the byte in pending buffer: if (pendingControlCount == pendingControl.length) { // Gradual but exponential growth: - final byte[] newArray = new byte[(int) (pendingControl.length * 1.25)]; + final byte[] newArray = new byte[(int) (pendingControl.length*1.25)]; System.arraycopy(pendingControl, 0, newArray, 0, pendingControl.length); pendingControl = newArray; } @@ -406,12 +417,10 @@ if (inHeader || fieldState == 1) { pendingBuffer.append(ch); - } else if (groupState.sn == true || groupState.sv == true) { - embObjHandler.writeMetadataChar(ch); } else { if (pendingCharCount == pendingChars.length) { // Gradual but exponential 
growth: - final char[] newArray = new char[(int) (pendingChars.length * 1.25)]; + final char[] newArray = new char[(int) (pendingChars.length*1.25)]; System.arraycopy(pendingChars, 0, newArray, 0, pendingChars.length); pendingChars = newArray; } @@ -438,7 +447,7 @@ // }; extract(new PushbackInputStream(in, 2)); } - + private void extract(PushbackInputStream in) throws IOException, SAXException, TikaException { out.startDocument(); @@ -451,19 +460,14 @@ } else if (b == '{') { pushText(); processGroupStart(in); - } else if (b == '}') { + } else if (b == '}') { pushText(); processGroupEnd(); if (groupStates.isEmpty()) { // parsed document closing brace break; } - } else if (groupState.objdata == true || - groupState.pictDepth == 1) { - embObjHandler.writeHexChar(b); - } else if (b != '\r' && b != '\n' - && (!groupState.ignore || nextMetaData != null || - groupState.sn == true || groupState.sv == true)) { + } else if (b != '\r' && b != '\n' && (!groupState.ignore || nextMetaData != null)) { // Linefeed and carriage return are not // significant if (ansiSkip != 0) { @@ -477,7 +481,7 @@ endParagraph(false); out.endDocument(); } - + private void parseControlToken(PushbackInputStream in) throws IOException, SAXException, TikaException { int b = in.read(); if (b == '\'') { @@ -485,16 +489,16 @@ parseHexChar(in); } else if (isAlpha(b)) { // control word - parseControlWord((char) b, in); + parseControlWord((char)b, in); } else if (b == '{' || b == '}' || b == '\\' || b == '\r' || b == '\n') { // escaped char addOutputByte(b); } else if (b != -1) { // control symbol, eg \* or \~ - processControlSymbol((char) b); - } - } - + processControlSymbol((char)b); + } + } + private void parseHexChar(PushbackInputStream in) throws IOException, SAXException, TikaException { int hex1 = in.read(); if (!isHexChar(hex1)) { @@ -502,7 +506,7 @@ in.unread(hex1); return; } - + int hex2 = in.read(); if (!isHexChar(hex2)) { // TODO: log a warning here, somehow? 
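The `parseHexChar` path above unescapes RTF `\'hh` sequences: two hex digits are decoded into a single byte in the current charset. A minimal standalone sketch of that decoding (the `HexEscape` class here is illustrative, not part of Tika):

```java
// Sketch of RTF \'hh hex-escape decoding, as done by parseHexChar/hexValue
// in TextExtractor. HexEscape is a hypothetical helper for illustration.
public class HexEscape {

    // Maps a single hex digit character to its numeric value.
    static int hexValue(int ch) {
        if (ch >= '0' && ch <= '9') return ch - '0';
        if (ch >= 'a' && ch <= 'f') return ch - 'a' + 10;
        if (ch >= 'A' && ch <= 'F') return ch - 'A' + 10;
        throw new IllegalArgumentException("not a hex digit: " + ch);
    }

    // Combines the two digits following \' into one byte value,
    // e.g. the digits 'e','9' -> 0xE9 ('é' in windows-1252).
    static int decode(int hex1, int hex2) {
        return 16 * hexValue(hex1) + hexValue(hex2);
    }
}
```

The decoded value is buffered as a *byte*, not a char, because its meaning depends on the charset in effect for the current font or group; decoding to characters is deferred until the pending bytes are flushed.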
@@ -511,7 +515,7 @@ in.unread(hex2); return; } - + if (ansiSkip != 0) { // Skip this ansi char since we are // still in the shadow of a unicode @@ -519,19 +523,19 @@ ansiSkip--; } else { // Unescape: - addOutputByte(16 * hexValue(hex1) + hexValue(hex2)); + addOutputByte(16*hexValue(hex1) + hexValue(hex2)); } } private void parseControlWord(int firstChar, PushbackInputStream in) throws IOException, SAXException, TikaException { addControl(firstChar); - + int b = in.read(); while (isAlpha(b)) { addControl(b); b = in.read(); } - + boolean hasParam = false; boolean negParam = false; if (b == '-') { @@ -547,14 +551,14 @@ hasParam = true; b = in.read(); } - + // space is consumed as part of the // control word, but is not added to the // control word if (b != ' ') { in.unread(b); } - + if (hasParam) { if (negParam) { param = -param; @@ -563,7 +567,7 @@ } else { processControlWord(); } - + pendingControlCount = 0; } @@ -582,7 +586,7 @@ } if (inList() && pendingListEnd != groupState.list) { startList(groupState.list); - } + } if (inList()) { out.startElement("li"); } else { @@ -602,10 +606,6 @@ private void endParagraph(boolean preserveStyles) throws IOException, SAXException, TikaException { pushText(); - //maintain consecutive new lines - if (!inParagraph) { - lazyStartParagraph(); - } if (inParagraph) { if (groupState.italic) { end("i"); @@ -718,7 +718,7 @@ if (pendingControlCount != s.length()) { return false; } - for (int idx = 0; idx < pendingControlCount; idx++) { + for(int idx=0;idx unicode NON-BREAKING SPACE - addOutputChar('\u00a0'); - break; - case '*': - // Ignorable destination (control words defined after - // the 1987 RTF spec). 
These are already handled by - // processGroupStart() - break; - case '-': - // Optional hyphen -> unicode SOFT HYPHEN - addOutputChar('\u00ad'); - break; - case '_': - // Non-breaking hyphen -> unicode NON-BREAKING HYPHEN - addOutputChar('\u2011'); - break; - default: - break; + switch(ch) { + case '~': + // Non-breaking space -> unicode NON-BREAKING SPACE + addOutputChar('\u00a0'); + break; + case '*': + // Ignorable destination (control words defined after + // the 1987 RTF spec). These are already handled by + // processGroupStart() + break; + case '-': + // Optional hyphen -> unicode SOFT HYPHEN + addOutputChar('\u00ad'); + break; + case '_': + // Non-breaking hyphen -> unicode NON-BREAKING HYPHEN + addOutputChar('\u2011'); + break; + default: + break; } } @@ -869,11 +869,7 @@ } else if (equals("listtemplateid")) { currentList.templateID = param; } else if (equals("levelnfc") || equals("levelnfcn")) { - //sanity check to make sure list information isn't corrupt - if (listTableLevel > -1 && - listTableLevel < currentList.numberType.length) { - currentList.numberType[listTableLevel] = param; - } + currentList.numberType[listTableLevel] = param; } } } else { @@ -928,7 +924,7 @@ // in the header can be unicode escaped as well: if (equals("u")) { // Unicode escape - if (!groupState.ignore || groupState.sv || groupState.sn) { + if (!groupState.ignore) { final char utf16CodeUnit = (char) (param & 0xffff); addOutputChar(utf16CodeUnit); } @@ -939,28 +935,17 @@ ansiSkip = groupState.ucSkip; } else if (equals("uc")) { // Change unicode shadow length - groupState.ucSkip = param; + groupState.ucSkip = (int) param; } else if (equals("bin")) { if (param >= 0) { - if (groupState.pictDepth == 1) { - try { - embObjHandler.writeBytes(in, param); - } catch (IOException e) { - //param was out of bounds or something went wrong during writing. 
- //skip this obj and move on - //TODO: log.warn - embObjHandler.reset(); + int bytesToRead = param; + byte[] tmpArray = new byte[Math.min(1024, bytesToRead)]; + while (bytesToRead > 0) { + int r = in.read(tmpArray, 0, Math.min(bytesToRead, tmpArray.length)); + if (r < 0) { + throw new TikaException("unexpected end of file: need " + param + " bytes of binary data, found " + (param-bytesToRead)); } - } else { - int bytesToRead = param; - byte[] tmpArray = new byte[Math.min(1024, bytesToRead)]; - while (bytesToRead > 0) { - int r = in.read(tmpArray, 0, Math.min(bytesToRead, tmpArray.length)); - if (r < 0) { - throw new TikaException("unexpected end of file: need " + param + " bytes of binary data, found " + (param - bytesToRead)); - } - bytesToRead -= r; - } + bytesToRead -= r; } } else { // log some warning? @@ -985,7 +970,6 @@ /** * Emits the end tag of a list. Uses {@link #isUnorderedList(int)} to determine the list * type for the given listID. - * * @param listID The ID of the list. * @throws IOException * @throws SAXException @@ -1000,7 +984,6 @@ /** * Emits the start tag of a list. Uses {@link #isUnorderedList(int)} to determine the list * type for the given listID. - * * @param listID The ID of the list. * @throws IOException * @throws SAXException @@ -1033,11 +1016,11 @@ if (inHeader) { if (equals("ansi")) { globalCharset = WINDOWS_1252; - } else if (equals("pca")) { + } else if (equals("pca")) { globalCharset = CP850; - } else if (equals("pc")) { + } else if (equals("pc")) { globalCharset = CP437; - } else if (equals("mac")) { + } else if (equals("mac")) { globalCharset = MAC_ROMAN; } @@ -1173,27 +1156,11 @@ // TODO: we should produce a table output here? 
//addOutputChar(' '); endParagraph(true); - } else if (equals("sp")) { - groupState.sp = true; - } else if (equals("sn")) { - embObjHandler.startSN(); - groupState.sn = true; - } else if (equals("sv")) { - embObjHandler.startSV(); - groupState.sv = true; - } else if (equals("object")) { - pushText(); - embObjHandler.setInObject(true); - groupState.object = true; - } else if (equals("objdata")) { - groupState.objdata = true; - embObjHandler.startObjData(); } else if (equals("pict")) { pushText(); // TODO: create img tag? but can that support // embedded image data? - groupState.pictDepth = 1; - embObjHandler.startPict(); + groupState.ignore = true; } else if (equals("line")) { if (!ignored) { addOutputChar('\n'); @@ -1301,13 +1268,13 @@ // Make new GroupState groupState = new GroupState(groupState); - assert groupStates.size() == groupState.depth : "size=" + groupStates.size() + " depth=" + groupState.depth; + assert groupStates.size() == groupState.depth: "size=" + groupStates.size() + " depth=" + groupState.depth; if (uprState == 0) { uprState = 1; groupState.ignore = true; } - + // Check for ignorable groups. Note that // sometimes we un-ignore within this group, eg // when handling upr escape. 
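`processGroupStart` above copies the current `GroupState` and pushes the old one whenever `{` opens a group; `processGroupEnd` restores it on `}`, tolerating corrupt documents with too many closing braces. The stack discipline can be sketched as follows (`SimpleGroupState` is an illustrative stand-in, not Tika's `GroupState`):

```java
import java.util.ArrayDeque;
import java.util.Deque;

// Sketch of the RTF group-state stack: '{' pushes a copy of the current
// state, '}' restores the previous one, so charset/formatting changes
// made inside a group are automatically undone when it closes.
public class GroupStack {

    static class SimpleGroupState {
        boolean ignore;
        int depth;

        SimpleGroupState() {}

        // Child group inherits the parent's flags, one level deeper.
        SimpleGroupState(SimpleGroupState parent) {
            this.ignore = parent.ignore;
            this.depth = parent.depth + 1;
        }
    }

    private final Deque<SimpleGroupState> stack = new ArrayDeque<>();
    SimpleGroupState current = new SimpleGroupState();

    void openGroup() {          // on '{'
        stack.push(current);
        current = new SimpleGroupState(current);
    }

    void closeGroup() {         // on '}'
        if (!stack.isEmpty()) { // be robust if the doc has too many }s
            current = stack.pop();
        }
    }
}
```

This is why `\*`-marked ignorable destinations are cheap to handle: setting `ignore = true` on the child state suppresses output only until the enclosing `}` pops the stack.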
@@ -1317,7 +1284,7 @@ if (b3 == '*') { groupState.ignore = true; } - in.unread(b3); + in.unread(b3); } in.unread(b2); } @@ -1327,8 +1294,8 @@ if (inHeader) { if (nextMetaData != null) { if (nextMetaData == TikaCoreProperties.CREATED) { - Calendar cal = Calendar.getInstance(TimeZone.getDefault(), Locale.ROOT); - cal.set(year, month - 1, day, hour, minute, 0); + Calendar cal = Calendar.getInstance(); + cal.set(year, month-1, day, hour, minute, 0); metadata.set(nextMetaData, cal.getTime()); } else if (nextMetaData.isMultiValuePermitted()) { metadata.add(nextMetaData, pendingBuffer.toString()); @@ -1342,25 +1309,6 @@ assert groupState.depth > 0; ansiSkip = 0; - - if (groupState.objdata == true) { - embObjHandler.handleCompletedObject(); - groupState.objdata = false; - } else if (groupState.pictDepth > 0) { - if (groupState.sn == true) { - embObjHandler.endSN(); - } else if (groupState.sv == true) { - embObjHandler.endSV(); - } else if (groupState.sp == true) { - embObjHandler.endSP(); - } else if (groupState.pictDepth == 1) { - embObjHandler.handleCompletedObject(); - } - } - - if (groupState.object == true) { - embObjHandler.setInObject(false); - } // Be robust if RTF doc is corrupt (has too many // closing }s): @@ -1373,7 +1321,7 @@ // bold changed: if (groupState.italic) { if (!outerGroupState.italic || - groupState.bold != outerGroupState.bold) { + groupState.bold != outerGroupState.bold) { end("i"); groupState.italic = false; } @@ -1404,12 +1352,12 @@ s = s.substring(9).trim(); // TODO: what other instructions can be in a // HYPERLINK destination? - final boolean isLocalLink = s.contains("\\l "); + final boolean isLocalLink = s.indexOf("\\l ") != -1; int idx = s.indexOf('"'); if (idx != -1) { - int idx2 = s.indexOf('"', 1 + idx); + int idx2 = s.indexOf('"', 1+idx); if (idx2 != -1) { - s = s.substring(1 + idx, idx2); + s = s.substring(1+idx, idx2); } } pendingURL = (isLocalLink ? 
"#" : "") + s; diff --git a/tika-parsers/src/main/java/org/apache/tika/parser/strings/FileConfig.java b/tika-parsers/src/main/java/org/apache/tika/parser/strings/FileConfig.java deleted file mode 100644 index da9deab..0000000 --- a/tika-parsers/src/main/java/org/apache/tika/parser/strings/FileConfig.java +++ /dev/null @@ -1,77 +0,0 @@ -/* - * Licensed under the Apache License, Version 2.0 (the "License"); - * you may not use this file except in compliance with the License. - * You may obtain a copy of the License at - * - * http://www.apache.org/licenses/LICENSE-2.0 - * - * Unless required by applicable law or agreed to in writing, software - * distributed under the License is distributed on an "AS IS" BASIS, - * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. - * See the License for the specific language governing permissions and - * limitations under the License. - */ -package org.apache.tika.parser.strings; - -import java.io.Serializable; - -/** - * Configuration for the "file" (or file-alternative) command. - * - */ -public class FileConfig implements Serializable { - /** - * Serial version UID - */ - private static final long serialVersionUID = 5712655467296441314L; - - private String filePath = ""; - - private boolean mimetype = false; - - /** - * Default constructor. - */ - public FileConfig() { - // TODO Loads properties from InputStream. - } - - /** - * Returns the "file" installation folder. - * - * @return the "file" installation folder. - */ - public String getFilePath() { - return filePath; - } - - /** - * Sets the "file" installation folder. - * - * @param path - * the "file" installation folder. - */ - public void setFilePath(String filePath) { - this.filePath = filePath; - } - - /** - * Returns {@code true} if the mime option is enabled. - * - * @return {@code true} if the mime option is enabled, {@code} otherwise. - */ - public boolean isMimetype() { - return mimetype; - } - - /** - * Sets the mime option. 
If {@code true}, it causes the file command to - * output mime type strings rather than the more traditional human readable - * ones. - * - * @param mimetype - */ - public void setMimetype(boolean mimetype) { - this.mimetype = mimetype; - } -} diff --git a/tika-parsers/src/main/java/org/apache/tika/parser/strings/Latin1StringsParser.java b/tika-parsers/src/main/java/org/apache/tika/parser/strings/Latin1StringsParser.java deleted file mode 100644 index 5c6fb46..0000000 --- a/tika-parsers/src/main/java/org/apache/tika/parser/strings/Latin1StringsParser.java +++ /dev/null @@ -1,322 +0,0 @@ -/* - * Licensed under the Apache License, Version 2.0 (the "License"); - * you may not use this file except in compliance with the License. - * You may obtain a copy of the License at - * - * http://www.apache.org/licenses/LICENSE-2.0 - * - * Unless required by applicable law or agreed to in writing, software - * distributed under the License is distributed on an "AS IS" BASIS, - * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. - * See the License for the specific language governing permissions and - * limitations under the License. - */ -package org.apache.tika.parser.strings; - -import java.io.IOException; -import java.io.InputStream; -import java.io.UnsupportedEncodingException; -import java.util.HashSet; -import java.util.Set; - -import org.apache.tika.metadata.Metadata; -import org.apache.tika.mime.MediaType; -import org.apache.tika.parser.AbstractParser; -import org.apache.tika.parser.ParseContext; -import org.apache.tika.sax.XHTMLContentHandler; -import org.xml.sax.ContentHandler; -import org.xml.sax.SAXException; - -/** - * Parser to extract printable Latin1 strings from arbitrary files with pure - * java. Useful for binary or unknown files, for files without a specific parser - * and for corrupted ones causing a TikaException as a fallback parser. 
- * - * Currently the parser does a best effort to extract Latin1 strings, used by - * Western European languages, encoded with ISO-8859-1, UTF-8 or UTF-16 charsets - * within the same file. - * - * The implementation is optimized for fast parsing with only one pass. - */ -public class Latin1StringsParser extends AbstractParser { - - private static final long serialVersionUID = 1L; - - /** - * The set of supported types - */ - private static final Set SUPPORTED_TYPES = getTypes(); - - /** - * The valid ISO-8859-1 character map. - */ - private static final boolean[] isChar = getCharMap(); - - /** - * The size of the internal buffers. - */ - private static int BUF_SIZE = 64 * 1024; - - /** - * The minimum size of a character sequence to be extracted. - */ - private int minSize = 4; - - /** - * The output buffer. - */ - private byte[] output = new byte[BUF_SIZE]; - - /** - * The input buffer. - */ - private byte[] input = new byte[BUF_SIZE]; - - /** - * The temporary position into the output buffer. - */ - private int tmpPos = 0; - - /** - * The current position into the output buffer. - */ - private int outPos = 0; - - /** - * The number of bytes into the input buffer. - */ - private int inSize = 0; - - /** - * The position into the input buffer. - */ - private int inPos = 0; - - /** - * The output content handler. - */ - private XHTMLContentHandler xhtml; - - /** - * Returns the minimum size of a character sequence to be extracted. - * - * @return the minimum size of a character sequence - */ - public int getMinSize() { - return minSize; - } - - /** - * Sets the minimum size of a character sequence to be extracted. - * - * @param minSize - * the minimum size of a character sequence - */ - public void setMinSize(int minSize) { - this.minSize = minSize; - } - - /** - * Populates the valid ISO-8859-1 character map. - * - * @return the valid ISO-8859-1 character map. 
- */ - private static boolean[] getCharMap() { - - boolean[] isChar = new boolean[256]; - for (int c = Byte.MIN_VALUE; c <= Byte.MAX_VALUE; c++) - if ((c >= 0x20 && c <= 0x7E) - || (c >= (byte) 0xC0 && c <= (byte) 0xFE) || c == 0x0A - || c == 0x0D || c == 0x09) { - isChar[c & 0xFF] = true; - } - return isChar; - - } - - /** - * Returns the set of supported types. - * - * @return the set of supported types - */ - private static Set getTypes() { - HashSet supportedTypes = new HashSet(); - supportedTypes.add(MediaType.OCTET_STREAM); - return supportedTypes; - } - - /** - * Tests if the byte is a ISO-8859-1 char. - * - * @param c - * the byte to test. - * - * @return if the byte is a char. - */ - private static final boolean isChar(byte c) { - return isChar[c & 0xFF]; - } - - /** - * Flushes the internal output buffer to the content handler. - * - * @throws UnsupportedEncodingException - * @throws SAXException - */ - private void flushBuffer() throws UnsupportedEncodingException, - SAXException { - if (tmpPos - outPos >= minSize) - outPos = tmpPos - minSize; - - xhtml.characters(new String(output, 0, outPos, "windows-1252")); - - for (int k = 0; k < tmpPos - outPos; k++) - output[k] = output[outPos + k]; - tmpPos = tmpPos - outPos; - outPos = 0; - } - - @Override - public Set getSupportedTypes(ParseContext arg0) { - return SUPPORTED_TYPES; - } - - /** - * @see org.apache.tika.parser.Parser#parse(java.io.InputStream, - * org.xml.sax.ContentHandler, org.apache.tika.metadata.Metadata, - * org.apache.tika.parser.ParseContext) - */ - @Override - public void parse(InputStream stream, ContentHandler handler, - Metadata metadata, ParseContext context) throws IOException, - SAXException { - /* - * Creates a new instance because the object is not immutable. - */ - new Latin1StringsParser().doParse(stream, handler, metadata, context); - } - - /** - * Does a best effort to extract Latin1 strings encoded with ISO-8859-1, - * UTF-8 or UTF-16. 
Valid chars are saved into the output buffer and the - * temporary buffer position is incremented. When an invalid char is read, - * the difference of the temporary and current buffer position is checked. - * If it is greater than the minimum string size, the current buffer - * position is updated to the temp position. If it is not, the temp position - * is reseted to the current position. - * - * @param stream - * the input stream. - * @param handler - * the output content handler - * @param metadata - * the metadata of the file - * @param context - * the parsing context - * @throws IOException - * if an io error occurs - * @throws SAXException - * if a sax error occurs - */ - private void doParse(InputStream stream, ContentHandler handler, - Metadata metadata, ParseContext context) throws IOException, - SAXException { - - tmpPos = 0; - outPos = 0; - - xhtml = new XHTMLContentHandler(handler, metadata); - xhtml.startDocument(); - - int i = 0; - do { - inSize = 0; - while ((i = stream.read(input, inSize, BUF_SIZE - inSize)) > 0) { - inSize += i; - } - inPos = 0; - while (inPos < inSize) { - byte c = input[inPos++]; - boolean utf8 = false; - /* - * Test for a possible UTF8 encoded char - */ - if (c == (byte) 0xC3) { - byte c_ = inPos < inSize ? input[inPos++] : (byte) stream - .read(); - /* - * Test if the next byte is in the valid UTF8 range - */ - if (c_ >= (byte) 0x80 && c_ <= (byte) 0xBF) { - utf8 = true; - output[tmpPos++] = (byte) (c_ + 0x40); - } else { - output[tmpPos++] = c; - c = c_; - } - if (tmpPos == BUF_SIZE) - flushBuffer(); - - /* - * Test for a possible UTF8 encoded char - */ - } else if (c == (byte) 0xC2) { - byte c_ = inPos < inSize ? 
input[inPos++] : (byte) stream - .read(); - /* - * Test if the next byte is in the valid UTF8 range - */ - if (c_ >= (byte) 0xA0 && c_ <= (byte) 0xBF) { - utf8 = true; - output[tmpPos++] = c_; - } else { - output[tmpPos++] = c; - c = c_; - } - if (tmpPos == BUF_SIZE) - flushBuffer(); - } - if (!utf8) - /* - * Test if the byte is a valid char. - */ - if (isChar(c)) { - output[tmpPos++] = c; - if (tmpPos == BUF_SIZE) - flushBuffer(); - } else { - /* - * Test if the byte is an invalid char, marking a string - * end. If it is a zero, test 2 positions before or - * ahead for a valid char, meaning it marks the - * transition between ISO-8859-1 and UTF16 sequences. - */ - if (c != 0 - || (inPos >= 3 && isChar(input[inPos - 3])) - || (inPos + 1 < inSize && isChar(input[inPos + 1]))) { - - if (tmpPos - outPos >= minSize) { - output[tmpPos++] = 0x0A; - outPos = tmpPos; - - if (tmpPos == BUF_SIZE) - flushBuffer(); - } else - tmpPos = outPos; - - } - } - } - } while (i != -1 && !Thread.currentThread().isInterrupted()); - - if (tmpPos - outPos >= minSize) { - output[tmpPos++] = 0x0A; - outPos = tmpPos; - } - xhtml.characters(new String(output, 0, outPos, "windows-1252")); - - xhtml.endDocument(); - - } - -} diff --git a/tika-parsers/src/main/java/org/apache/tika/parser/strings/StringsConfig.java b/tika-parsers/src/main/java/org/apache/tika/parser/strings/StringsConfig.java deleted file mode 100644 index 9183f2e..0000000 --- a/tika-parsers/src/main/java/org/apache/tika/parser/strings/StringsConfig.java +++ /dev/null @@ -1,187 +0,0 @@ -/* - * Licensed under the Apache License, Version 2.0 (the "License"); - * you may not use this file except in compliance with the License. 
- * You may obtain a copy of the License at - * - * http://www.apache.org/licenses/LICENSE-2.0 - * - * Unless required by applicable law or agreed to in writing, software - * distributed under the License is distributed on an "AS IS" BASIS, - * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. - * See the License for the specific language governing permissions and - * limitations under the License. - */ -package org.apache.tika.parser.strings; - -import java.io.File; -import java.io.Serializable; -import java.util.Properties; -import java.io.InputStream; -import java.io.IOException; - -/** - * Configuration for the "strings" (or strings-alternative) command. - * - */ -public class StringsConfig implements Serializable { - /** - * Serial version UID - */ - private static final long serialVersionUID = -1465227101645003594L; - - private String stringsPath = ""; - - // Minimum sequence length (characters) to print - private int minLength = 4; - - // Character encoding of the strings that are to be found - private StringsEncoding encoding = StringsEncoding.SINGLE_7_BIT; - - // Maximum time (seconds) to wait for the strings process termination - private int timeout = 120; - - /** - * Default contructor. - */ - public StringsConfig() { - init(this.getClass().getResourceAsStream("Strings.properties")); - } - - /** - * Loads properties from InputStream and then tries to close InputStream. If - * there is an IOException, this silently swallows the exception and goes - * back to the default. - * - * @param is - */ - public StringsConfig(InputStream is) { - init(is); - } - - /** - * Initializes attributes. 
- * - * @param is - */ - private void init(InputStream is) { - if (is == null) { - return; - } - Properties props = new Properties(); - try { - props.load(is); - } catch (IOException e) { - // swallow - } finally { - if (is != null) { - try { - is.close(); - } catch (IOException e) { - // swallow - } - } - } - - setStringsPath(props.getProperty("stringsPath", "" + getStringsPath())); - - setMinLength(Integer.parseInt(props.getProperty("minLength", "" - + getMinLength()))); - - setEncoding(StringsEncoding.valueOf(props.getProperty("encoding", "" - + getEncoding().get()))); - - setTimeout(Integer.parseInt(props.getProperty("timeout", "" - + getTimeout()))); - } - - /** - * Returns the "strings" installation folder. - * - * @return the "strings" installation folder. - */ - public String getStringsPath() { - return this.stringsPath; - } - - /** - * Returns the minimum sequence length (characters) to print. - * - * @return the minimum sequence length (characters) to print. - */ - public int getMinLength() { - return this.minLength; - } - - /** - * Returns the character encoding of the strings that are to be found. - * - * @return {@see StringsEncoding} enum that represents the character - * encoding of the strings that are to be found. - */ - public StringsEncoding getEncoding() { - return this.encoding; - } - - /** - * Returns the maximum time (in seconds) to wait for the "strings" command - * to terminate. - * - * @return the maximum time (in seconds) to wait for the "strings" command - * to terminate. - */ - public int getTimeout() { - return this.timeout; - } - - /** - * Sets the "strings" installation folder. - * - * @param path - * the "strings" installation folder. - */ - public void setStringsPath(String path) { - if (!path.isEmpty() && !path.endsWith(File.separator)) { - path += File.separatorChar; - } - this.stringsPath = path; - } - - /** - * Sets the minimum sequence length (characters) to print. 
- * - * @param minLength - * the minimum sequence length (characters) to print. - */ - public void setMinLength(int minLength) { - if (minLength < 1) { - throw new IllegalArgumentException("Invalid minimum length"); - } - this.minLength = minLength; - } - - /** - * Sets the character encoding of the strings that are to be found. - * - * @param encoding - * {@see StringsEncoding} enum that represents the character - * encoding of the strings that are to be found. - */ - public void setEncoding(StringsEncoding encoding) { - this.encoding = encoding; - } - - /** - * Sets the maximum time (in seconds) to wait for the "strings" command to - * terminate. - * - * @param timeout - * the maximum time (in seconds) to wait for the "strings" - * command to terminate. - */ - public void setTimeout(int timeout) { - if (timeout < 1) { - throw new IllegalArgumentException("Invalid timeout"); - } - this.timeout = timeout; - } -} diff --git a/tika-parsers/src/main/java/org/apache/tika/parser/strings/StringsEncoding.java b/tika-parsers/src/main/java/org/apache/tika/parser/strings/StringsEncoding.java deleted file mode 100644 index d63d6de..0000000 --- a/tika-parsers/src/main/java/org/apache/tika/parser/strings/StringsEncoding.java +++ /dev/null @@ -1,45 +0,0 @@ -/* - * Licensed under the Apache License, Version 2.0 (the "License"); - * you may not use this file except in compliance with the License. - * You may obtain a copy of the License at - * - * http://www.apache.org/licenses/LICENSE-2.0 - * - * Unless required by applicable law or agreed to in writing, software - * distributed under the License is distributed on an "AS IS" BASIS, - * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. - * See the License for the specific language governing permissions and - * limitations under the License. - */ -package org.apache.tika.parser.strings; - -/** - * Character encoding of the strings that are to be found using the "strings" command. 
- * - */ -public enum StringsEncoding { - SINGLE_7_BIT('s', "single-7-bit-byte"), // default - SINGLE_8_BIT('S', "single-8-bit-byte"), - BIGENDIAN_16_BIT('b', "16-bit bigendian"), - LITTLEENDIAN_16_BIT('l', "16-bit littleendian"), - BIGENDIAN_32_BIT('B', "32-bit bigendian"), - LITTLEENDIAN_32_BIT('L', "32-bit littleendian"); - - private char value; - - private String encoding; - - private StringsEncoding(char value, String encoding) { - this.value = value; - this.encoding = encoding; - } - - public char get() { - return value; - } - - @Override - public String toString() { - return encoding; - } -} diff --git a/tika-parsers/src/main/java/org/apache/tika/parser/strings/StringsParser.java b/tika-parsers/src/main/java/org/apache/tika/parser/strings/StringsParser.java deleted file mode 100644 index 4d69d2e..0000000 --- a/tika-parsers/src/main/java/org/apache/tika/parser/strings/StringsParser.java +++ /dev/null @@ -1,335 +0,0 @@ -/* - * Licensed under the Apache License, Version 2.0 (the "License"); - * you may not use this file except in compliance with the License. - * You may obtain a copy of the License at - * - * http://www.apache.org/licenses/LICENSE-2.0 - * - * Unless required by applicable law or agreed to in writing, software - * distributed under the License is distributed on an "AS IS" BASIS, - * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. - * See the License for the specific language governing permissions and - * limitations under the License. 
- */ -package org.apache.tika.parser.strings; - -import java.io.BufferedReader; -import java.io.File; -import java.io.IOException; -import java.io.InputStream; -import java.io.InputStreamReader; -import java.util.ArrayList; -import java.util.Collections; -import java.util.HashMap; -import java.util.Map; -import java.util.Set; -import java.util.concurrent.Callable; -import java.util.concurrent.ExecutionException; -import java.util.concurrent.FutureTask; -import java.util.concurrent.TimeUnit; -import java.util.concurrent.TimeoutException; - -import org.apache.tika.exception.TikaException; -import org.apache.tika.io.TikaInputStream; -import org.apache.tika.metadata.Metadata; -import org.apache.tika.mime.MediaType; -import org.apache.tika.parser.AbstractParser; -import org.apache.tika.parser.ParseContext; -import org.apache.tika.parser.external.ExternalParser; -import org.apache.tika.sax.XHTMLContentHandler; -import org.xml.sax.ContentHandler; -import org.xml.sax.SAXException; - -import static java.nio.charset.StandardCharsets.UTF_8; - -/** - * Parser that uses the "strings" (or strings-alternative) command to find the - * printable strings in a object, or other binary, file - * (application/octet-stream). Useful as "best-effort" parser for files detected - * as application/octet-stream. - * - * @author gtotaro - * - */ -public class StringsParser extends AbstractParser { - /** - * Serial version UID - */ - private static final long serialVersionUID = 802566634661575025L; - - private static final Set SUPPORTED_TYPES = Collections - .singleton(MediaType.OCTET_STREAM); - - private static final StringsConfig DEFAULT_STRINGS_CONFIG = new StringsConfig(); - - private static final FileConfig DEFAULT_FILE_CONFIG = new FileConfig(); - - /* - * This map is organized as follows: - * command's pathname (String) -> is it present? (Boolean), does it support -e option? (Boolean) - * It stores check results for command and, if present, -e (encoding) option. 
- */ - private static Map STRINGS_PRESENT = new HashMap(); - - @Override - public Set getSupportedTypes(ParseContext context) { - return SUPPORTED_TYPES; - } - - @Override - public void parse(InputStream stream, ContentHandler handler, - Metadata metadata, ParseContext context) throws IOException, - SAXException, TikaException { - StringsConfig stringsConfig = context.get(StringsConfig.class, DEFAULT_STRINGS_CONFIG); - FileConfig fileConfig = context.get(FileConfig.class, DEFAULT_FILE_CONFIG); - - if (!hasStrings(stringsConfig)) { - return; - } - - TikaInputStream tis = TikaInputStream.get(stream); - File input = tis.getFile(); - - // Metadata - metadata.set("strings:min-len", "" + stringsConfig.getMinLength()); - metadata.set("strings:encoding", stringsConfig.toString()); - metadata.set("strings:file_output", doFile(input, fileConfig)); - - int totalBytes = 0; - - // Content - XHTMLContentHandler xhtml = new XHTMLContentHandler(handler, metadata); - - xhtml.startDocument(); - - totalBytes = doStrings(input, stringsConfig, xhtml); - - xhtml.endDocument(); - - // Metadata - metadata.set("strings:length", "" + totalBytes); - } - - /** - * Checks if the "strings" command is supported. - * - * @param config - * {@see StringsConfig} object used for testing the strings - * command. - * @return Returns returns {@code true} if the strings command is supported. 
- */ - private boolean hasStrings(StringsConfig config) { - String stringsProg = config.getStringsPath() + getStringsProg(); - - if (STRINGS_PRESENT.containsKey(stringsProg)) { - return STRINGS_PRESENT.get(stringsProg)[0]; - } - - String[] checkCmd = { stringsProg, "--version" }; - try { - boolean hasStrings = ExternalParser.check(checkCmd); - - boolean encodingOpt = false; - - // Check if the -e option (encoding) is supported - if (!System.getProperty("os.name").startsWith("Windows")) { - String[] checkOpt = {stringsProg, "-e", "" + config.getEncoding().get(), "/dev/null"}; - int[] errorValues = {1, 2}; // Exit status code: 1 = general error; 2 = incorrect usage. - encodingOpt = ExternalParser.check(checkOpt, errorValues); - } - - Boolean[] values = {hasStrings, encodingOpt}; - STRINGS_PRESENT.put(stringsProg, values); - - return hasStrings; - } catch (NoClassDefFoundError ncdfe) { - // This happens under OSGi + Fork Parser - see TIKA-1507 - // As a workaround for now, just say we can't use strings - // TODO Resolve it so we don't need this try/catch block - Boolean[] values = {false, false}; - STRINGS_PRESENT.put(stringsProg, values); - return false; - } - } - - /** - * Checks if the "file" command is supported. - * - * @param config - * @return - */ - private boolean hasFile(FileConfig config) { - String fileProg = config.getFilePath() + getFileProg(); - - String[] checkCmd = { fileProg, "--version" }; - - boolean hasFile = ExternalParser.check(checkCmd); - - return hasFile; - } - - /** - * Runs the "strings" command on the given file. - * - * @param input - * {@see File} object that represents the file to parse. - * @param config - * {@see StringsConfig} object including the strings - * configuration. - * @param xhtml - * {@see XHTMLContentHandler} object. - * @return the total number of bytes read using the strings command. - * @throws IOException - * if any I/O error occurs. - * @throws TikaException - * if the parsing process has been interrupted. 
- * @throws SAXException - */ - private int doStrings(File input, StringsConfig config, - XHTMLContentHandler xhtml) throws IOException, TikaException, - SAXException { - - String stringsProg = config.getStringsPath() + getStringsProg(); - - // Builds the command array - ArrayList cmdList = new ArrayList(4); - cmdList.add(stringsProg); - cmdList.add("-n"); - cmdList.add("" + config.getMinLength());; - // Currently, encoding option is not supported by Windows (and other) versions - if (STRINGS_PRESENT.get(stringsProg)[1]) { - cmdList.add("-e"); - cmdList.add("" + config.getEncoding().get()); - } - cmdList.add(input.getPath()); - - String[] cmd = cmdList.toArray(new String[cmdList.size()]); - - ProcessBuilder pb = new ProcessBuilder(cmd); - final Process process = pb.start(); - - InputStream out = process.getInputStream(); - - FutureTask waitTask = new FutureTask( - new Callable() { - public Integer call() throws Exception { - return process.waitFor(); - } - }); - - Thread waitThread = new Thread(waitTask); - waitThread.start(); - - // Reads content printed out by "strings" command - int totalBytes = 0; - totalBytes = extractOutput(out, xhtml); - - try { - waitTask.get(config.getTimeout(), TimeUnit.SECONDS); - - } catch (InterruptedException ie) { - waitThread.interrupt(); - process.destroy(); - Thread.currentThread().interrupt(); - throw new TikaException(StringsParser.class.getName() - + " interrupted", ie); - - } catch (ExecutionException ee) { - // should not be thrown - - } catch (TimeoutException te) { - waitThread.interrupt(); - process.destroy(); - throw new TikaException(StringsParser.class.getName() + " timeout", - te); - } - - return totalBytes; - } - - /** - * Extracts ASCII strings using the "strings" command. - * - * @param stream - * {@see InputStream} object used for reading the binary file. - * @param xhtml - * {@see XHTMLContentHandler} object. - * @return the total number of bytes read using the "strings" command. 
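doStrings() bounds the wait on the external process with a FutureTask because the code targets Java 7; on Java 8 and later the same bound can be expressed directly with Process.waitFor(long, TimeUnit). A minimal sketch (hypothetical helper, not Tika API):

```java
import java.io.IOException;
import java.util.concurrent.TimeUnit;

public class ProcessTimeout {

    // Starts the command and waits up to timeoutSeconds for it to exit;
    // on timeout the process is destroyed, as doStrings() does.
    public static boolean runWithTimeout(long timeoutSeconds, String... cmd)
            throws IOException, InterruptedException {
        Process process = new ProcessBuilder(cmd).start();
        if (!process.waitFor(timeoutSeconds, TimeUnit.SECONDS)) {
            process.destroy();
            return false;
        }
        return true;
    }
}
```

Unlike the FutureTask version, this needs no extra thread; the timeout and cleanup live in one call.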
- * @throws SAXException - * if the content element could not be written. - * @throws IOException - * if any I/O error occurs. - */ - private int extractOutput(InputStream stream, XHTMLContentHandler xhtml) - throws SAXException, IOException { - - char[] buffer = new char[1024]; - int totalBytes = 0; - - try (BufferedReader reader = new BufferedReader(new InputStreamReader(stream, UTF_8))) { - int n = 0; - while ((n = reader.read(buffer)) != -1) { - if (n > 0) { - xhtml.characters(buffer, 0, n); - } - totalBytes += n; - } - - } - - return totalBytes; - } - - /** - * Runs the "file" command on the given file that aims at providing an - * alternative way to determine the file type. - * - * @param input - * {@see File} object that represents the file to detect. - * @return the file type provided by the "file" command using the "-b" - * option (it stands for "brief mode"). - * @throws IOException - * if any I/O error occurs. - */ - private String doFile(File input, FileConfig config) throws IOException { - if (!hasFile(config)) { - return null; - } - - // Builds the command array - ArrayList cmdList = new ArrayList(3); - cmdList.add(config.getFilePath() + getFileProg()); - cmdList.add("-b"); - if (config.isMimetype()) { - cmdList.add("-I"); - } - cmdList.add(input.getPath()); - - String[] cmd = cmdList.toArray(new String[cmdList.size()]); - - ProcessBuilder pb = new ProcessBuilder(cmd); - final Process process = pb.start(); - - InputStream out = process.getInputStream(); - - String fileOutput = null; - - try (BufferedReader reader = new BufferedReader(new InputStreamReader(out, UTF_8))) { - fileOutput = reader.readLine(); - } catch (IOException ioe) { - // file output not available! - fileOutput = ""; - } - - return fileOutput; - } - - - public static String getStringsProg() { - return System.getProperty("os.name").startsWith("Windows") ? 
"strings.exe" : "strings"; - } - - public static String getFileProg() { - return System.getProperty("os.name").startsWith("Windows") ? "file.exe" : "file"; - } -} diff --git a/tika-parsers/src/main/java/org/apache/tika/parser/txt/CharsetDetector.java b/tika-parsers/src/main/java/org/apache/tika/parser/txt/CharsetDetector.java index f9df9e0..2534f1c 100644 --- a/tika-parsers/src/main/java/org/apache/tika/parser/txt/CharsetDetector.java +++ b/tika-parsers/src/main/java/org/apache/tika/parser/txt/CharsetDetector.java @@ -1,18 +1,18 @@ /** - * ****************************************************************************** - * Copyright (C) 2005-2009, International Business Machines Corporation and * - * others. All Rights Reserved. * - * ****************************************************************************** - */ +******************************************************************************* +* Copyright (C) 2005-2009, International Business Machines Corporation and * +* others. All Rights Reserved. * +******************************************************************************* +*/ package org.apache.tika.parser.txt; -import java.io.IOException; import java.io.InputStream; import java.io.Reader; +import java.io.IOException; import java.nio.charset.Charset; import java.util.ArrayList; +import java.util.Collections; import java.util.Arrays; -import java.util.Collections; /** @@ -47,71 +47,449 @@ // actually choose the "real" charset. All assuming that the application just // wants the data, and doesn't care about a char set name. + /** + * Constructor + * + * @stable ICU 3.4 + */ + public CharsetDetector() { + } + + /** + * Set the declared encoding for charset detection. + * The declared encoding of an input text is an encoding obtained + * from an http header or xml declaration or similar source that + * can be provided as additional information to the charset detector. 
+ * A match between a declared encoding and a possible detected encoding + * will raise the quality of that detected encoding by a small delta, + * and will also appear as a "reason" for the match. + *
<p>
      + * A declared encoding that is incompatible with the input data being + * analyzed will not be added to the list of possible encodings. + * + * @param encoding The declared encoding + * + * @stable ICU 3.4 + */ + public CharsetDetector setDeclaredEncoding(String encoding) { + setCanonicalDeclaredEncoding(encoding); + return this; + } + + /** + * Set the input text (byte) data whose charset is to be detected. + * + * @param in the input text of unknown encoding + * + * @return This CharsetDetector + * + * @stable ICU 3.4 + */ + public CharsetDetector setText(byte [] in) { + fRawInput = in; + fRawLength = in.length; + + MungeInput(); + + return this; + } + private static final int kBufSize = 12000; + private static final int MAX_CONFIDENCE = 100; - private static String[] fCharsetNames; + + /** + * Set the input text (byte) data whose charset is to be detected. + *
<p>
      + * The input stream that supplies the character data must have markSupported() + * == true; the charset detection process will read a small amount of data, + * then return the stream to its original position via + * the InputStream.reset() operation. The exact amount that will + * be read depends on the characteristics of the data itself. + * + * @param in the input text of unknown encoding + * + * @return This CharsetDetector + * + * @stable ICU 3.4 + */ + + public CharsetDetector setText(InputStream in) throws IOException { + fInputStream = in; + fInputStream.mark(kBufSize); + fRawInput = new byte[kBufSize]; // Always make a new buffer because the + // previous one may have come from the caller, + // in which case we can't touch it. + fRawLength = 0; + int remainingLength = kBufSize; + while (remainingLength > 0 ) { + // read() may give data in smallish chunks, esp. for remote sources. Hence, this loop. + int bytesRead = fInputStream.read(fRawInput, fRawLength, remainingLength); + if (bytesRead <= 0) { + break; + } + fRawLength += bytesRead; + remainingLength -= bytesRead; + } + fInputStream.reset(); + + MungeInput(); // Strip html markup, collect byte stats. + return this; + } + + + /** + * Return the charset that best matches the supplied input data. + * + * Note though, that because the detection + * only looks at the start of the input data, + * there is a possibility that the returned charset will fail to handle + * the full set of input data. + *
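setText(InputStream) above relies on the mark/reset contract: mark, fill a fixed buffer across possibly short reads, then reset so the caller still sees the whole stream. The loop in isolation (hypothetical helper class):

```java
import java.io.ByteArrayInputStream;
import java.io.IOException;
import java.io.InputStream;

public class ReadFully {

    // Fills buf from the stream, tolerating short reads, then resets
    // the stream to its marked position. Returns the number of bytes read.
    public static int fill(InputStream in, byte[] buf) throws IOException {
        in.mark(buf.length);
        int length = 0;
        while (length < buf.length) {
            int n = in.read(buf, length, buf.length - length);
            if (n <= 0) {
                break; // end of stream
            }
            length += n;
        }
        in.reset();
        return length;
    }
}
```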
<p>
+ * Raise an exception if
+ * <ul>
+ * <li>no charset appears to match the data.
+ * <li>no input text has been provided
+ * </ul>
      + * + * @return a CharsetMatch object representing the best matching charset, or + * null if there are no matches. + * + * @stable ICU 3.4 + */ + public CharsetMatch detect() { +// TODO: A better implementation would be to copy the detect loop from +// detectAll(), and cut it short as soon as a match with a high confidence +// is found. This is something to be done later, after things are otherwise +// working. + CharsetMatch matches[] = detectAll(); + + if (matches == null || matches.length == 0) { + return null; + } + + return matches[0]; + } + + /** + * Return an array of all charsets that appear to be plausible + * matches with the input data. The array is ordered with the + * best quality match first. + *
<p>
+ * Raise an exception if
+ * <ul>
+ * <li>no charsets appear to match the input data.
+ * <li>no input text has been provided
+ * </ul>
      + * + * @return An array of CharsetMatch objects representing possibly matching charsets. + * + * @stable ICU 3.4 + */ + public CharsetMatch[] detectAll() { + CharsetRecognizer csr; + int i; + int detectResults; + int confidence; + ArrayList matches = new ArrayList(); + + // Iterate over all possible charsets, remember all that + // give a match quality > 0. + for (i=0; i 0) { + // Just to be safe, constrain + confidence = Math.min(confidence, MAX_CONFIDENCE); + + // Apply charset hint. + if ((fDeclaredEncoding != null) && (fDeclaredEncoding.equalsIgnoreCase(csr.getName()))) { + // Reduce lack of confidence (delta between "sure" and current) by 50%. + confidence += (MAX_CONFIDENCE - confidence)/2; + } + + CharsetMatch m = new CharsetMatch(this, csr, confidence); + matches.add(m); + } + } + + Collections.sort(matches); // CharsetMatch compares on confidence + Collections.reverse(matches); // Put best match first. + CharsetMatch [] resultArray = new CharsetMatch[matches.size()]; + resultArray = (CharsetMatch[]) matches.toArray(resultArray); + return resultArray; + } + + + /** + * Autodetect the charset of an inputStream, and return a Java Reader + * to access the converted input data. + *
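The declared-encoding hint in detectAll() first clamps the recognizer's confidence, then closes half the remaining gap to full confidence. The arithmetic in isolation (constant as above, class name illustrative):

```java
public class ConfidenceBoost {

    private static final int MAX_CONFIDENCE = 100;

    // Clamp, then reduce the lack of confidence (the delta between "sure"
    // and the current value) by 50%, as detectAll() does when the declared
    // encoding matches a recognizer's name.
    public static int boost(int confidence) {
        confidence = Math.min(confidence, MAX_CONFIDENCE);
        return confidence + (MAX_CONFIDENCE - confidence) / 2;
    }
}
```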
<p>
      + * This is a convenience method that is equivalent to + * this.setDeclaredEncoding(declaredEncoding).setText(in).detect().getReader(); + *
<p>
      + * For the input stream that supplies the character data, markSupported() + * must be true; the charset detection will read a small amount of data, + * then return the stream to its original position via + * the InputStream.reset() operation. The exact amount that will + * be read depends on the characteristics of the data itself. + *
<p>
      + * Raise an exception if no charsets appear to match the input data. + * + * @param in The source of the byte data in the unknown charset. + * + * @param declaredEncoding A declared encoding for the data, if available, + * or null or an empty string if none is available. + * + * @stable ICU 3.4 + */ + public Reader getReader(InputStream in, String declaredEncoding) { + setCanonicalDeclaredEncoding(declaredEncoding); + + try { + setText(in); + + CharsetMatch match = detect(); + + if (match == null) { + return null; + } + + return match.getReader(); + } catch (IOException e) { + return null; + } + } + + /** + * Autodetect the charset of an inputStream, and return a String + * containing the converted input data. + *
<p>
      + * This is a convenience method that is equivalent to + * this.setDeclaredEncoding(declaredEncoding).setText(in).detect().getString(); + *
<p>
      + * Raise an exception if no charsets appear to match the input data. + * + * @param in The source of the byte data in the unknown charset. + * + * @param declaredEncoding A declared encoding for the data, if available, + * or null or an empty string if none is available. + * + * @stable ICU 3.4 + */ + public String getString(byte[] in, String declaredEncoding) { + setCanonicalDeclaredEncoding(declaredEncoding); + + try { + setText(in); + + CharsetMatch match = detect(); + + if (match == null) { + return null; + } + + return match.getString(-1); + } catch (IOException e) { + return null; + } + } + + + /** + * Get the names of all char sets that can be recognized by the char set detector. + * + * @return an array of the names of all charsets that can be recognized + * by the charset detector. + * + * @stable ICU 3.4 + */ + public static String[] getAllDetectableCharsets() { + return fCharsetNames; + } + + /** + * Test whether or not input filtering is enabled. + * + * @return true if input text will be filtered. + * + * @see #enableInputFilter + * + * @stable ICU 3.4 + */ + public boolean inputFilterEnabled() + { + return fStripTags; + } + + /** + * Enable filtering of input text. If filtering is enabled, + * text within angle brackets ("<" and ">") will be removed + * before detection. + * + * @param filter true to enable input text filtering. + * + * @return The previous setting. + * + * @stable ICU 3.4 + */ + public boolean enableInputFilter(boolean filter) + { + boolean previous = fStripTags; + + fStripTags = filter; + + return previous; + } + + /** + * Try to set fDeclaredEncoding to the canonical name for , if it exists. + * + * @param encoding - name of character encoding + */ + private void setCanonicalDeclaredEncoding(String encoding) { + Charset cs = Charset.forName(encoding); + if (cs != null) { + fDeclaredEncoding = cs.name(); + } + } + /* - * List of recognizers for all charsets known to the implementation. 
- */ - private static ArrayList fCSRecognizers = createRecognizers(); + * MungeInput - after getting a set of raw input data to be analyzed, preprocess + * it by removing what appears to be html markup. + */ + private void MungeInput() { + int srci = 0; + int dsti = 0; + byte b; + boolean inMarkup = false; + int openTags = 0; + int badTags = 0; + + // + // html / xml markup stripping. + // quick and dirty, not 100% accurate, but hopefully good enough, statistically. + // discard everything within < brackets > + // Count how many total '<' and illegal (nested) '<' occur, so we can make some + // guess as to whether the input was actually marked up at all. + if (fStripTags) { + for (srci = 0; srci < fRawLength && dsti < fInputBytes.length; srci++) { + b = fRawInput[srci]; + if (b == (byte)'<') { + if (inMarkup) { + badTags++; + } + inMarkup = true; + openTags++; + } + + if (! inMarkup) { + fInputBytes[dsti++] = b; + } + + if (b == (byte)'>') { + inMarkup = false; + } + } + + fInputLen = dsti; + } + + // + // If it looks like this input wasn't marked up, or if it looks like it's + // essentially nothing but markup abandon the markup stripping. + // Detection will have to work on the unstripped input. 
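The tag-discarding loop in MungeInput() can be reproduced on a String for illustration (hypothetical class; the real method works on raw bytes and also counts open and nested tags to decide whether to abandon stripping):

```java
public class MarkupStrip {

    // Drops everything between '<' and '>' inclusive: the same
    // quick-and-dirty, statistically-good-enough discard that
    // MungeInput() applies before byte statistics are collected.
    public static String strip(String s) {
        StringBuilder out = new StringBuilder();
        boolean inMarkup = false;
        for (int i = 0; i < s.length(); i++) {
            char c = s.charAt(i);
            if (c == '<') {
                inMarkup = true;
            }
            if (!inMarkup) {
                out.append(c);
            }
            if (c == '>') {
                inMarkup = false;
            }
        }
        return out.toString();
    }
}
```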
+ // + if (openTags<5 || openTags/5 < badTags || + (fInputLen < 100 && fRawLength>600)) { + int limit = fRawLength; + + if (limit > kBufSize) { + limit = kBufSize; + } + + for (srci=0; srci fCSRecognizers = createRecognizers(); + private static String [] fCharsetNames; + /* * Create the singleton instances of the CharsetRecognizer classes */ private static ArrayList createRecognizers() { ArrayList recognizers = new ArrayList(); - + recognizers.add(new CharsetRecog_UTF8()); - + recognizers.add(new CharsetRecog_Unicode.CharsetRecog_UTF_16_BE()); recognizers.add(new CharsetRecog_Unicode.CharsetRecog_UTF_16_LE()); recognizers.add(new CharsetRecog_Unicode.CharsetRecog_UTF_32_BE()); recognizers.add(new CharsetRecog_Unicode.CharsetRecog_UTF_32_LE()); - + recognizers.add(new CharsetRecog_mbcs.CharsetRecog_sjis()); recognizers.add(new CharsetRecog_2022.CharsetRecog_2022JP()); recognizers.add(new CharsetRecog_2022.CharsetRecog_2022CN()); @@ -120,7 +498,7 @@ recognizers.add(new CharsetRecog_mbcs.CharsetRecog_euc.CharsetRecog_euc_jp()); recognizers.add(new CharsetRecog_mbcs.CharsetRecog_euc.CharsetRecog_euc_kr()); recognizers.add(new CharsetRecog_mbcs.CharsetRecog_big5()); - + recognizers.add(new CharsetRecog_sbcs.CharsetRecog_8859_1_da()); recognizers.add(new CharsetRecog_sbcs.CharsetRecog_8859_1_de()); recognizers.add(new CharsetRecog_sbcs.CharsetRecog_8859_1_en()); @@ -144,7 +522,7 @@ recognizers.add(new CharsetRecog_sbcs.CharsetRecog_windows_1256()); recognizers.add(new CharsetRecog_sbcs.CharsetRecog_KOI8_R()); recognizers.add(new CharsetRecog_sbcs.CharsetRecog_8859_9_tr()); - + recognizers.add(new CharsetRecog_sbcs.CharsetRecog_IBM424_he_rtl()); recognizers.add(new CharsetRecog_sbcs.CharsetRecog_IBM424_he_ltr()); recognizers.add(new CharsetRecog_sbcs.CharsetRecog_IBM420_ar_rtl()); @@ -156,389 +534,25 @@ recognizers.add(new CharsetRecog_sbcs.CharsetRecog_EBCDIC_500_fr()); recognizers.add(new CharsetRecog_sbcs.CharsetRecog_EBCDIC_500_it()); recognizers.add(new 
CharsetRecog_sbcs.CharsetRecog_EBCDIC_500_nl()); - + recognizers.add(new CharsetRecog_sbcs.CharsetRecog_IBM866_ru()); // Create an array of all charset names, as a side effect. // Needed for the getAllDetectableCharsets() API. - String[] charsetNames = new String[recognizers.size()]; + String[] charsetNames = new String [recognizers.size()]; int out = 0; - - for (CharsetRecognizer recognizer : recognizers) { - String name = recognizer.getName(); - - if (out == 0 || !name.equals(charsetNames[out - 1])) { + + for (int i = 0; i < recognizers.size(); i++) { + String name = ((CharsetRecognizer)recognizers.get(i)).getName(); + + if (out == 0 || ! name.equals(charsetNames[out - 1])) { charsetNames[out++] = name; } } - + fCharsetNames = new String[out]; System.arraycopy(charsetNames, 0, fCharsetNames, 0, out); - + return recognizers; } - - /** - * Set the declared encoding for charset detection. - * The declared encoding of an input text is an encoding obtained - * from an http header or xml declaration or similar source that - * can be provided as additional information to the charset detector. - * A match between a declared encoding and a possible detected encoding - * will raise the quality of that detected encoding by a small delta, - * and will also appear as a "reason" for the match. - *

      - * A declared encoding that is incompatible with the input data being - * analyzed will not be added to the list of possible encodings. - * - * @param encoding The declared encoding - * - * @stable ICU 3.4 - */ - public CharsetDetector setDeclaredEncoding(String encoding) { - setCanonicalDeclaredEncoding(encoding); - return this; - } - - /** - * Set the input text (byte) data whose charset is to be detected. - * - * @param in the input text of unknown encoding - * - * @return This CharsetDetector - * - * @stable ICU 3.4 - */ - public CharsetDetector setText(byte[] in) { - fRawInput = in; - fRawLength = in.length; - - MungeInput(); - - return this; - } - // Value is rounded up, so zero really means zero occurences. - - /** - * Set the input text (byte) data whose charset is to be detected. - *

      - * The input stream that supplies the character data must have markSupported() - * == true; the charset detection process will read a small amount of data, - * then return the stream to its original position via - * the InputStream.reset() operation. The exact amount that will - * be read depends on the characteristics of the data itself. - * - * @param in the input text of unknown encoding - * - * @return This CharsetDetector - * - * @stable ICU 3.4 - */ - - public CharsetDetector setText(InputStream in) throws IOException { - fInputStream = in; - fInputStream.mark(kBufSize); - fRawInput = new byte[kBufSize]; // Always make a new buffer because the - // previous one may have come from the caller, - // in which case we can't touch it. - fRawLength = 0; - int remainingLength = kBufSize; - while (remainingLength > 0) { - // read() may give data in smallish chunks, esp. for remote sources. Hence, this loop. - int bytesRead = fInputStream.read(fRawInput, fRawLength, remainingLength); - if (bytesRead <= 0) { - break; - } - fRawLength += bytesRead; - remainingLength -= bytesRead; - } - fInputStream.reset(); - - MungeInput(); // Strip html markup, collect byte stats. - return this; - } - - /** - * Return the charset that best matches the supplied input data. - * - * Note though, that because the detection - * only looks at the start of the input data, - * there is a possibility that the returned charset will fail to handle - * the full set of input data. - *

      - * Raise an exception if
      - * <ul>
      - *   <li>no charset appears to match the data.</li>
      - *   <li>no input text has been provided</li>
      - * </ul>
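The detect()/detectAll() contract described in the javadoc above — score every recognizer, drop zero-confidence candidates, and return the best — can be sketched standalone. The `Match` class and method names below are hypothetical stand-ins for illustration, not the Tika API:

```java
import java.util.ArrayList;
import java.util.Comparator;
import java.util.List;

public class BestMatch {
    // Hypothetical stand-in for a (charset name, confidence) pair.
    static class Match {
        final String charset;
        final int confidence;
        Match(String charset, int confidence) {
            this.charset = charset;
            this.confidence = confidence;
        }
    }

    // detectAll(): keep every candidate with confidence > 0, best first.
    static List<Match> detectAll(List<Match> candidates) {
        List<Match> out = new ArrayList<>();
        for (Match m : candidates) {
            if (m.confidence > 0) {
                out.add(m);
            }
        }
        out.sort(Comparator.comparingInt((Match m) -> m.confidence).reversed());
        return out;
    }

    // detect(): the first (best) match, or null when nothing matched at all —
    // mirroring why callers of detect() must handle a null return.
    static Match detect(List<Match> candidates) {
        List<Match> all = detectAll(candidates);
        return all.isEmpty() ? null : all.get(0);
    }
}
```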
      - * - * @return a CharsetMatch object representing the best matching charset, or - * null if there are no matches. - * - * @stable ICU 3.4 - */ - public CharsetMatch detect() { -// TODO: A better implementation would be to copy the detect loop from -// detectAll(), and cut it short as soon as a match with a high confidence -// is found. This is something to be done later, after things are otherwise -// working. - CharsetMatch matches[] = detectAll(); - - if (matches == null || matches.length == 0) { - return null; - } - - return matches[0]; - } - - /** - * Return an array of all charsets that appear to be plausible - * matches with the input data. The array is ordered with the - * best quality match first. - *

      - * Raise an exception if
      - * <ul>
      - *   <li>no charsets appear to match the input data.</li>
      - *   <li>no input text has been provided</li>
      - * </ul>
      - * - * @return An array of CharsetMatch objects representing possibly matching charsets. - * - * @stable ICU 3.4 - */ - public CharsetMatch[] detectAll() { - CharsetRecognizer csr; - int i; - int detectResults; - int confidence; - ArrayList matches = new ArrayList(); - - // Iterate over all possible charsets, remember all that - // give a match quality > 0. - for (i = 0; i < fCSRecognizers.size(); i++) { - csr = fCSRecognizers.get(i); - detectResults = csr.match(this); - confidence = detectResults & 0x000000ff; - if (confidence > 0) { - // Just to be safe, constrain - confidence = Math.min(confidence, MAX_CONFIDENCE); - - // Apply charset hint. - if ((fDeclaredEncoding != null) && (fDeclaredEncoding.equalsIgnoreCase(csr.getName()))) { - // Reduce lack of confidence (delta between "sure" and current) by 50%. - confidence += (MAX_CONFIDENCE - confidence) / 2; - } - - CharsetMatch m = new CharsetMatch(this, csr, confidence); - matches.add(m); - } - } - - Collections.sort(matches); // CharsetMatch compares on confidence - Collections.reverse(matches); // Put best match first. - CharsetMatch[] resultArray = new CharsetMatch[matches.size()]; - resultArray = matches.toArray(resultArray); - return resultArray; - } - - /** - * Autodetect the charset of an inputStream, and return a Java Reader - * to access the converted input data. - *
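The declared-encoding hint in detectAll() above raises a matching recognizer's score by half the gap between its confidence and "certain" (`confidence += (MAX_CONFIDENCE - confidence) / 2`). Isolated as a sketch (class and method names are illustrative):

```java
public class DeclaredEncodingBoost {
    static final int MAX_CONFIDENCE = 100;

    // When the caller-declared encoding matches a recognizer's name,
    // halve the remaining "lack of confidence" rather than jumping to 100,
    // so data-driven evidence still dominates the ranking.
    static int boost(int confidence) {
        return confidence + (MAX_CONFIDENCE - confidence) / 2;
    }
}
```

Note the formula is idempotent at the top of the scale: a fully confident match stays at 100.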

      - * This is a convenience method that is equivalent to - * this.setDeclaredEncoding(declaredEncoding).setText(in).detect().getReader(); - *

      - * For the input stream that supplies the character data, markSupported() - * must be true; the charset detection will read a small amount of data, - * then return the stream to its original position via - * the InputStream.reset() operation. The exact amount that will - * be read depends on the characteristics of the data itself. - *

      - * Raise an exception if no charsets appear to match the input data. - * - * @param in The source of the byte data in the unknown charset. - * - * @param declaredEncoding A declared encoding for the data, if available, - * or null or an empty string if none is available. - * - * @stable ICU 3.4 - */ - public Reader getReader(InputStream in, String declaredEncoding) { - setCanonicalDeclaredEncoding(declaredEncoding); - - try { - setText(in); - - CharsetMatch match = detect(); - - if (match == null) { - return null; - } - - return match.getReader(); - } catch (IOException e) { - return null; - } - } - - /** - * Autodetect the charset of an inputStream, and return a String - * containing the converted input data. - *

      - * This is a convenience method that is equivalent to - * this.setDeclaredEncoding(declaredEncoding).setText(in).detect().getString(); - *

      - * Raise an exception if no charsets appear to match the input data. - * - * @param in The source of the byte data in the unknown charset. - * - * @param declaredEncoding A declared encoding for the data, if available, - * or null or an empty string if none is available. - * - * @stable ICU 3.4 - */ - public String getString(byte[] in, String declaredEncoding) { - setCanonicalDeclaredEncoding(declaredEncoding); - - try { - setText(in); - - CharsetMatch match = detect(); - - if (match == null) { - return null; - } - - return match.getString(-1); - } catch (IOException e) { - return null; - } - } - // gave us a byte array. - - /** - * Test whether or not input filtering is enabled. - * - * @return true if input text will be filtered. - * - * @see #enableInputFilter - * - * @stable ICU 3.4 - */ - public boolean inputFilterEnabled() { - return fStripTags; - } - - /** - * Enable filtering of input text. If filtering is enabled, - * text within angle brackets ("<" and ">") will be removed - * before detection. - * - * @param filter true to enable input text filtering. - * - * @return The previous setting. - * - * @stable ICU 3.4 - */ - public boolean enableInputFilter(boolean filter) { - boolean previous = fStripTags; - - fStripTags = filter; - - return previous; - } - - /** - * Try to set fDeclaredEncoding to the canonical name for , if it exists. - * - * @param encoding - name of character encoding - */ - private void setCanonicalDeclaredEncoding(String encoding) { - if ((encoding == null) || encoding.isEmpty()) { - return; - } - - Charset cs = Charset.forName(encoding); - if (cs != null) { - fDeclaredEncoding = cs.name(); - } - } - - /* - * MungeInput - after getting a set of raw input data to be analyzed, preprocess - * it by removing what appears to be html markup. - */ - private void MungeInput() { - int srci = 0; - int dsti = 0; - byte b; - boolean inMarkup = false; - int openTags = 0; - int badTags = 0; - - // - // html / xml markup stripping. 
- // quick and dirty, not 100% accurate, but hopefully good enough, statistically. - // discard everything within < brackets > - // Count how many total '<' and illegal (nested) '<' occur, so we can make some - // guess as to whether the input was actually marked up at all. - if (fStripTags) { - for (srci = 0; srci < fRawLength && dsti < fInputBytes.length; srci++) { - b = fRawInput[srci]; - if (b == (byte) '<') { - if (inMarkup) { - badTags++; - } - inMarkup = true; - openTags++; - } - - if (!inMarkup) { - fInputBytes[dsti++] = b; - } - - if (b == (byte) '>') { - inMarkup = false; - } - } - - fInputLen = dsti; - } - - // - // If it looks like this input wasn't marked up, or if it looks like it's - // essentially nothing but markup abandon the markup stripping. - // Detection will have to work on the unstripped input. - // - if (openTags < 5 || openTags / 5 < badTags || - (fInputLen < 100 && fRawLength > 600)) { - int limit = fRawLength; - - if (limit > kBufSize) { - limit = kBufSize; - } - - for (srci = 0; srci < limit; srci++) { - fInputBytes[srci] = fRawInput[srci]; - } - fInputLen = srci; - } - - // - // Tally up the byte occurence statistics. - // These are available for use by the various detectors. 
- // - Arrays.fill(fByteStats, (short) 0); - for (srci = 0; srci < fInputLen; srci++) { - int val = fInputBytes[srci] & 0x00ff; - fByteStats[val]++; - } - - fC1Bytes = false; - for (int i = 0x80; i <= 0x9F; i += 1) { - if (fByteStats[i] != 0) { - fC1Bytes = true; - break; - } - } - } } diff --git a/tika-parsers/src/main/java/org/apache/tika/parser/txt/CharsetMatch.java b/tika-parsers/src/main/java/org/apache/tika/parser/txt/CharsetMatch.java index 22219ab..95b37dd 100644 --- a/tika-parsers/src/main/java/org/apache/tika/parser/txt/CharsetMatch.java +++ b/tika-parsers/src/main/java/org/apache/tika/parser/txt/CharsetMatch.java @@ -1,9 +1,9 @@ /** - * ****************************************************************************** - * Copyright (C) 2005-2007, International Business Machines Corporation and * - * others. All Rights Reserved. * - * ****************************************************************************** - */ +******************************************************************************* +* Copyright (C) 2005-2007, International Business Machines Corporation and * +* others. All Rights Reserved. * +******************************************************************************* +*/ package org.apache.tika.parser.txt; import java.io.ByteArrayInputStream; @@ -28,70 +28,13 @@ */ public class CharsetMatch implements Comparable { - - /** - * Bit flag indicating the match is based on the the encoding scheme. - * - * @see #getMatchType - * @stable ICU 3.4 - */ - static public final int ENCODING_SCHEME = 1; - /** - * Bit flag indicating the match is based on the presence of a BOM. - * - * @see #getMatchType - * @stable ICU 3.4 - */ - static public final int BOM = 2; - /** - * Bit flag indicating he match is based on the declared encoding. - * - * @see #getMatchType - * @stable ICU 3.4 - */ - static public final int DECLARED_ENCODING = 4; - /** - * Bit flag indicating the match is based on language statistics. 
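MungeInput()'s tag-stripping heuristic, shown in both the added and removed hunks above, can be sketched standalone: discard bytes between `<` and `>`, then abandon the stripped result if the tag counts suggest the input was never markup, or if stripping left almost nothing. The class below is an illustrative reimplementation with the original thresholds, not the Tika code:

```java
import java.util.Arrays;

public class TagStripper {
    // Quick-and-dirty markup stripping: drop everything within < brackets >,
    // counting total '<' (openTags) and nested '<' (badTags) to judge
    // whether the input was really marked up at all.
    static byte[] strip(byte[] raw) {
        byte[] out = new byte[raw.length];
        int dst = 0, openTags = 0, badTags = 0;
        boolean inMarkup = false;
        for (byte b : raw) {
            if (b == '<') {
                if (inMarkup) {
                    badTags++;       // nested '<' suggests it isn't markup
                }
                inMarkup = true;
                openTags++;
            }
            if (!inMarkup) {
                out[dst++] = b;
            }
            if (b == '>') {
                inMarkup = false;
            }
        }
        // Abandon stripping when it doesn't look like markup, or when
        // stripping left almost nothing (thresholds from the original).
        if (openTags < 5 || openTags / 5 < badTags
                || (dst < 100 && raw.length > 600)) {
            return raw;
        }
        return Arrays.copyOf(out, dst);
    }
}
```

Detection then runs on whichever buffer survives, so a page that is "essentially nothing but markup" is still analyzed unstripped.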
- * - * @see #getMatchType - * @stable ICU 3.4 - */ - static public final int LANG_STATISTICS = 8; - // - // Private Data - // - private int fConfidence; - private CharsetRecognizer fRecognizer; - private byte[] fRawInput = null; // Original, untouched input bytes. - // If user gave us a byte array, this is it. - private int fRawLength; // Length of data in fRawInput array. - private InputStream fInputStream = null; // User's input stream, or null if the user - - /* - * Constructor. Implementation internal - */ - CharsetMatch(CharsetDetector det, CharsetRecognizer rec, int conf) { - fRecognizer = rec; - fConfidence = conf; - - // The references to the original aplication input data must be copied out - // of the charset recognizer to here, in case the application resets the - // recognizer before using this CharsetMatch. - if (det.fInputStream == null) { - // We only want the existing input byte data if it came straight from the user, - // not if is just the head of a stream. - fRawInput = det.fRawInput; - fRawLength = det.fRawLength; - } - fInputStream = det.fInputStream; - } - + /** * Create a java.io.Reader for reading the Unicode character data corresponding * to the original byte data supplied to the Charset detect operation. *

      * CAUTION: if the source of the byte data was an InputStream, a Reader - * can be created for only one matching char set using this method. If more + * can be created for only one matching char set using this method. If more * than one charset needs to be tried, the caller will need to reset * the InputStream and create InputStreamReaders itself, based on the charset name. * @@ -101,11 +44,11 @@ */ public Reader getReader() { InputStream inputStream = fInputStream; - + if (inputStream == null) { inputStream = new ByteArrayInputStream(fRawInput, 0, fRawLength); } - + try { inputStream.reset(); return new InputStreamReader(inputStream, getName()); @@ -122,7 +65,7 @@ * * @stable ICU 3.4 */ - public String getString() throws java.io.IOException { + public String getString() throws java.io.IOException { return getString(-1); } @@ -147,24 +90,24 @@ StringBuffer sb = new StringBuffer(); char[] buffer = new char[1024]; Reader reader = getReader(); - int max = maxLength < 0 ? Integer.MAX_VALUE : maxLength; + int max = maxLength < 0? Integer.MAX_VALUE : maxLength; int bytesRead = 0; - + while ((bytesRead = reader.read(buffer, 0, Math.min(max, 1024))) >= 0) { sb.append(buffer, 0, bytesRead); max -= bytesRead; } - + reader.close(); - + return sb.toString(); } else { - result = new String(fRawInput, getName()); + result = new String(fRawInput, getName()); } return result; } - + /** * Get an indication of the confidence in the charset detected. * Confidence values range from 0-100, with larger numbers indicating @@ -178,9 +121,42 @@ public int getConfidence() { return fConfidence; } - - /** - * Return flags indicating what it was about the input data + + + /** + * Bit flag indicating the match is based on the the encoding scheme. + * + * @see #getMatchType + * @stable ICU 3.4 + */ + static public final int ENCODING_SCHEME = 1; + + /** + * Bit flag indicating the match is based on the presence of a BOM. 
+ * + * @see #getMatchType + * @stable ICU 3.4 + */ + static public final int BOM = 2; + + /** + * Bit flag indicating he match is based on the declared encoding. + * + * @see #getMatchType + * @stable ICU 3.4 + */ + static public final int DECLARED_ENCODING = 4; + + /** + * Bit flag indicating the match is based on language statistics. + * + * @see #getMatchType + * @stable ICU 3.4 + */ + static public final int LANG_STATISTICS = 8; + + /** + * Return flags indicating what it was about the input data * that caused this charset to be considered as a possible match. * The result is a bitfield containing zero or more of the flags * ENCODING_SCHEME, BOM, DECLARED_ENCODING, and LANG_STATISTICS. @@ -200,7 +176,7 @@ } /** - * Get the name of the detected charset. + * Get the name of the detected charset. * The name will be one that can be used with other APIs on the * platform that accept charset names. It is the "Canonical name" * as defined by the class java.nio.charset.Charset; for @@ -217,9 +193,9 @@ public String getName() { return fRecognizer.getName(); } - - /** - * Get the ISO code for the language of the detected charset. + + /** + * Get the ISO code for the language of the detected charset. * * @return The ISO code for the language or null if the language cannot be determined. * @@ -231,11 +207,11 @@ /** * Compare to other CharsetMatch objects. - * Comparison is based on the match confidence value, which - * allows CharsetDetector.detectAll() to order its results. + * Comparison is based on the match confidence value, which + * allows CharsetDetector.detectAll() to order its results. * * @param o the CharsetMatch object to compare against. - * @return a negative integer, zero, or a positive integer as the + * @return a negative integer, zero, or a positive integer as the * confidence level of this CharsetMatch * is less than, equal to, or greater than that of * the argument. 
@@ -273,14 +249,45 @@ public int hashCode() { return fConfidence; } - // gave us a byte array. + + /* + * Constructor. Implementation internal + */ + CharsetMatch(CharsetDetector det, CharsetRecognizer rec, int conf) { + fRecognizer = rec; + fConfidence = conf; + + // The references to the original aplication input data must be copied out + // of the charset recognizer to here, in case the application resets the + // recognizer before using this CharsetMatch. + if (det.fInputStream == null) { + // We only want the existing input byte data if it came straight from the user, + // not if is just the head of a stream. + fRawInput = det.fRawInput; + fRawLength = det.fRawLength; + } + fInputStream = det.fInputStream; + } + + + // + // Private Data + // + private int fConfidence; + private CharsetRecognizer fRecognizer; + private byte[] fRawInput = null; // Original, untouched input bytes. + // If user gave us a byte array, this is it. + private int fRawLength; // Length of data in fRawInput array. + + private InputStream fInputStream = null; // User's input stream, or null if the user + // gave us a byte array. 
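CharsetMatch.getString(int) above reads from the decoded Reader in 1024-char chunks until maxLength is reached (-1 meaning "no limit"). A standalone sketch of that bounded-read pattern; it adds an explicit `max > 0` guard (not present in the original loop) so a zero-length read can never spin:

```java
import java.io.IOException;
import java.io.Reader;

public class BoundedRead {
    // Read at most maxLength chars from a Reader (-1 means "no limit"),
    // mirroring the chunked loop in CharsetMatch.getString(int).
    static String readUpTo(Reader reader, int maxLength) throws IOException {
        StringBuilder sb = new StringBuilder();
        char[] buffer = new char[1024];
        int max = maxLength < 0 ? Integer.MAX_VALUE : maxLength;
        int n;
        // Stop at the cap or at end of stream (read(...) returns -1).
        while (max > 0
                && (n = reader.read(buffer, 0, Math.min(max, buffer.length))) >= 0) {
            sb.append(buffer, 0, n);
            max -= n;
        }
        return sb.toString();
    }
}
```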
public String toString() { - String s = "Match of " + fRecognizer.getName(); - if (fRecognizer.getLanguage() != null) { - s += " in " + fRecognizer.getLanguage(); - } - s += " with confidence " + fConfidence; - return s; + String s = "Match of " + fRecognizer.getName(); + if(fRecognizer.getLanguage() != null) { + s += " in " + fRecognizer.getLanguage(); + } + s += " with confidence " + fConfidence; + return s; } } diff --git a/tika-parsers/src/main/java/org/apache/tika/parser/txt/CharsetRecog_2022.java b/tika-parsers/src/main/java/org/apache/tika/parser/txt/CharsetRecog_2022.java index 129c9a8..c59da36 100644 --- a/tika-parsers/src/main/java/org/apache/tika/parser/txt/CharsetRecog_2022.java +++ b/tika-parsers/src/main/java/org/apache/tika/parser/txt/CharsetRecog_2022.java @@ -7,95 +7,98 @@ package org.apache.tika.parser.txt; /** - * class CharsetRecog_2022 part of the ICU charset detection imlementation. - * This is a superclass for the individual detectors for - * each of the detectable members of the ISO 2022 family - * of encodings. - *

      - * The separate classes are nested within this class. - * + * class CharsetRecog_2022 part of the ICU charset detection imlementation. + * This is a superclass for the individual detectors for + * each of the detectable members of the ISO 2022 family + * of encodings. + * + * The separate classes are nested within this class. + * * @internal */ abstract class CharsetRecog_2022 extends CharsetRecognizer { - + /** * Matching function shared among the 2022 detectors JP, CN and KR * Counts up the number of legal an unrecognized escape sequences in * the sample of text, and computes a score based on the total number & * the proportion that fit the encoding. - * - * @param text the byte buffer containing text to analyse - * @param textLen the size of the text in the byte. + * + * + * @param text the byte buffer containing text to analyse + * @param textLen the size of the text in the byte. * @param escapeSequences the byte escape sequences to test for. * @return match quality, in the range of 0-100. 
*/ - int match(byte[] text, int textLen, byte[][] escapeSequences) { - int i, j; - int escN; - int hits = 0; - int misses = 0; - int shifts = 0; - int quality; + int match(byte [] text, int textLen, byte [][] escapeSequences) { + int i, j; + int escN; + int hits = 0; + int misses = 0; + int shifts = 0; + int quality; scanInput: - for (i = 0; i < textLen; i++) { - if (text[i] == 0x1b) { - checkEscapes: - for (escN = 0; escN < escapeSequences.length; escN++) { - byte[] seq = escapeSequences[escN]; - - if ((textLen - i) < seq.length) { - continue checkEscapes; - } - - for (j = 1; j < seq.length; j++) { - if (seq[j] != text[i + j]) { - continue checkEscapes; + for (i=0; i= 3 && + boolean hasBOM = false; + int numValid = 0; + int numInvalid = 0; + byte input[] = det.fRawInput; + int i; + int trailBytes = 0; + int confidence; + + if (det.fRawLength >= 3 && (input[0] & 0xFF) == 0xef && (input[1] & 0xFF) == 0xbb && (input[2] & 0xFF) == 0xbf) { hasBOM = true; } - + // Scan for multi-byte sequences - for (i = 0; i < det.fRawLength; i++) { + for (i=0; i= det.fRawLength) { + if (i>=det.fRawLength) { break; } b = input[i]; @@ -72,24 +72,24 @@ break; } } - + } - + // Cook up some sort of confidence score, based on presense of a BOM // and the existence of valid and/or invalid multi-byte sequences. confidence = 0; - if (hasBOM && numInvalid == 0) { + if (hasBOM && numInvalid==0) { confidence = 100; - } else if (hasBOM && numValid > numInvalid * 10) { + } else if (hasBOM && numValid > numInvalid*10) { confidence = 80; } else if (numValid > 3 && numInvalid == 0) { - confidence = 100; + confidence = 100; } else if (numValid > 0 && numInvalid == 0) { confidence = 80; } else if (numValid == 0 && numInvalid == 0) { // Plain ASCII. - confidence = 10; - } else if (numValid > numInvalid * 10) { + confidence = 10; + } else if (numValid > numInvalid*10) { // Probably corruput utf-8 data. Valid sequences aren't likely by chance. 
confidence = 25; } diff --git a/tika-parsers/src/main/java/org/apache/tika/parser/txt/CharsetRecog_Unicode.java b/tika-parsers/src/main/java/org/apache/tika/parser/txt/CharsetRecog_Unicode.java index be6455f..65fe9d7 100644 --- a/tika-parsers/src/main/java/org/apache/tika/parser/txt/CharsetRecog_Unicode.java +++ b/tika-parsers/src/main/java/org/apache/tika/parser/txt/CharsetRecog_Unicode.java @@ -10,7 +10,7 @@ /** * This class matches UTF-16 and UTF-32, both big- and little-endian. The * BOM will be used if it is present. - * + * * @internal */ abstract class CharsetRecog_Unicode extends CharsetRecognizer { @@ -24,115 +24,130 @@ * @see com.ibm.icu.text.CharsetRecognizer#match(com.ibm.icu.text.CharsetDetector) */ abstract int match(CharsetDetector det); - - static class CharsetRecog_UTF_16_BE extends CharsetRecog_Unicode { - String getName() { + + static class CharsetRecog_UTF_16_BE extends CharsetRecog_Unicode + { + String getName() + { return "UTF-16BE"; } - - int match(CharsetDetector det) { + + int match(CharsetDetector det) + { byte[] input = det.fRawInput; - - if (input.length >= 2 && ((input[0] & 0xFF) == 0xFE && (input[1] & 0xFF) == 0xFF)) { + + if (input.length>=2 && ((input[0] & 0xFF) == 0xFE && (input[1] & 0xFF) == 0xFF)) { return 100; } - + // TODO: Do some statistics to check for unsigned UTF-16BE return 0; } } - - static class CharsetRecog_UTF_16_LE extends CharsetRecog_Unicode { - String getName() { + + static class CharsetRecog_UTF_16_LE extends CharsetRecog_Unicode + { + String getName() + { return "UTF-16LE"; } - - int match(CharsetDetector det) { + + int match(CharsetDetector det) + { byte[] input = det.fRawInput; - - if (input.length >= 2 && ((input[0] & 0xFF) == 0xFF && (input[1] & 0xFF) == 0xFE)) { - // An LE BOM is present. 
- if (input.length >= 4 && input[2] == 0x00 && input[3] == 0x00) { - // It is probably UTF-32 LE, not UTF-16 - return 0; - } - return 100; - } - + + if (input.length >= 2 && ((input[0] & 0xFF) == 0xFF && (input[1] & 0xFF) == 0xFE)) + { + // An LE BOM is present. + if (input.length>=4 && input[2] == 0x00 && input[3] == 0x00) { + // It is probably UTF-32 LE, not UTF-16 + return 0; + } + return 100; + } + // TODO: Do some statistics to check for unsigned UTF-16LE return 0; } } - - static abstract class CharsetRecog_UTF_32 extends CharsetRecog_Unicode { + + static abstract class CharsetRecog_UTF_32 extends CharsetRecog_Unicode + { abstract int getChar(byte[] input, int index); - + abstract String getName(); - - int match(CharsetDetector det) { - byte[] input = det.fRawInput; - int limit = (det.fRawLength / 4) * 4; - int numValid = 0; + + int match(CharsetDetector det) + { + byte[] input = det.fRawInput; + int limit = (det.fRawLength / 4) * 4; + int numValid = 0; int numInvalid = 0; boolean hasBOM = false; int confidence = 0; - - if (limit == 0) { + + if (limit==0) { return 0; } if (getChar(input, 0) == 0x0000FEFF) { hasBOM = true; } - - for (int i = 0; i < limit; i += 4) { + + for(int i = 0; i < limit; i += 4) { int ch = getChar(input, i); - + if (ch < 0 || ch >= 0x10FFFF || (ch >= 0xD800 && ch <= 0xDFFF)) { numInvalid += 1; } else { numValid += 1; } } - - + + // Cook up some sort of confidence score, based on presence of a BOM // and the existence of valid and/or invalid multi-byte sequences. - if (hasBOM && numInvalid == 0) { + if (hasBOM && numInvalid==0) { confidence = 100; - } else if (hasBOM && numValid > numInvalid * 10) { + } else if (hasBOM && numValid > numInvalid*10) { confidence = 80; } else if (numValid > 3 && numInvalid == 0) { - confidence = 100; + confidence = 100; } else if (numValid > 0 && numInvalid == 0) { confidence = 80; - } else if (numValid > numInvalid * 10) { + } else if (numValid > numInvalid*10) { // Probably corrupt UTF-32BE data. 
Valid sequences aren't likely by chance. confidence = 25; } - + return confidence; } } - - static class CharsetRecog_UTF_32_BE extends CharsetRecog_UTF_32 { - int getChar(byte[] input, int index) { + + static class CharsetRecog_UTF_32_BE extends CharsetRecog_UTF_32 + { + int getChar(byte[] input, int index) + { return (input[index + 0] & 0xFF) << 24 | (input[index + 1] & 0xFF) << 16 | - (input[index + 2] & 0xFF) << 8 | (input[index + 3] & 0xFF); + (input[index + 2] & 0xFF) << 8 | (input[index + 3] & 0xFF); } - - String getName() { + + String getName() + { return "UTF-32BE"; } } - - static class CharsetRecog_UTF_32_LE extends CharsetRecog_UTF_32 { - int getChar(byte[] input, int index) { + + static class CharsetRecog_UTF_32_LE extends CharsetRecog_UTF_32 + { + int getChar(byte[] input, int index) + { return (input[index + 3] & 0xFF) << 24 | (input[index + 2] & 0xFF) << 16 | - (input[index + 1] & 0xFF) << 8 | (input[index + 0] & 0xFF); + (input[index + 1] & 0xFF) << 8 | (input[index + 0] & 0xFF); } - - String getName() { + + String getName() + { return "UTF-32LE"; } } diff --git a/tika-parsers/src/main/java/org/apache/tika/parser/txt/CharsetRecog_mbcs.java b/tika-parsers/src/main/java/org/apache/tika/parser/txt/CharsetRecog_mbcs.java index 35d2b4f..6e69074 100644 --- a/tika-parsers/src/main/java/org/apache/tika/parser/txt/CharsetRecog_mbcs.java +++ b/tika-parsers/src/main/java/org/apache/tika/parser/txt/CharsetRecog_mbcs.java @@ -11,58 +11,56 @@ /** * CharsetRecognizer implemenation for Asian - double or multi-byte - charsets. - * Match is determined mostly by the input data adhering to the - * encoding scheme for the charset, and, optionally, - * frequency-of-occurence of characters. + * Match is determined mostly by the input data adhering to the + * encoding scheme for the charset, and, optionally, + * frequency-of-occurence of characters. *

      - * Instances of this class are singletons, one per encoding - * being recognized. They are created in the main - * CharsetDetector class and kept in the global list of available - * encodings to be checked. The specific encoding being recognized - * is determined by subclass. - * - * @internal + * Instances of this class are singletons, one per encoding + * being recognized. They are created in the main + * CharsetDetector class and kept in the global list of available + * encodings to be checked. The specific encoding being recognized + * is determined by subclass. + * + * @internal */ abstract class CharsetRecog_mbcs extends CharsetRecognizer { - /** + /** * Get the IANA name of this charset. - * * @return the charset name. */ - abstract String getName(); - - + abstract String getName() ; + + /** * Test the match of this charset with the input text data - * which is obtained via the CharsetDetector object. - * - * @param det The CharsetDetector, which contains the input text - * to be checked for being in this charset. - * @return Two values packed into one int (Damn java, anyhow) - *
      - * bits 0-7: the match confidence, ranging from 0-100 - *
      - * bits 8-15: The match reason, an enum-like value. + * which is obtained via the CharsetDetector object. + * + * @param det The CharsetDetector, which contains the input text + * to be checked for being in this charset. + * @return Two values packed into one int (Damn java, anyhow) + *
      + * bits 0-7: the match confidence, ranging from 0-100 + *
      + * bits 8-15: The match reason, an enum-like value. */ - int match(CharsetDetector det, int[] commonChars) { - int singleByteCharCount = 0; - int doubleByteCharCount = 0; - int commonCharCount = 0; - int badCharCount = 0; - int totalCharCount = 0; - int confidence = 0; - iteratedChar iter = new iteratedChar(); - - detectBlock: - { - for (iter.reset(); nextChar(iter, det); ) { + int match(CharsetDetector det, int [] commonChars) { + int singleByteCharCount = 0; + int doubleByteCharCount = 0; + int commonCharCount = 0; + int badCharCount = 0; + int totalCharCount = 0; + int confidence = 0; + iteratedChar iter = new iteratedChar(); + + detectBlock: { + for (iter.reset(); nextChar(iter, det);) { totalCharCount++; if (iter.error) { - badCharCount++; + badCharCount++; } else { long cv = iter.charValue & 0xFFFFFFFFL; - + if (cv <= 0xff) { singleByteCharCount++; } else { @@ -75,458 +73,470 @@ } } } - if (badCharCount >= 2 && badCharCount * 5 >= doubleByteCharCount) { + if (badCharCount >= 2 && badCharCount*5 >= doubleByteCharCount) { // Bail out early if the byte data is not matching the encoding scheme. break detectBlock; } } - - if (doubleByteCharCount <= 10 && badCharCount == 0) { + + if (doubleByteCharCount <= 10 && badCharCount== 0) { // Not many multi-byte chars. if (doubleByteCharCount == 0 && totalCharCount < 10) { // There weren't any multibyte sequences, and there was a low density of non-ASCII single bytes. // We don't have enough data to have any confidence. // Statistical analysis of single byte non-ASCII charcters would probably help here. confidence = 0; - } else { + } + else { // ASCII or ISO file? It's probably not our encoding, // but is not incompatible with our encoding, so don't give it a zero. confidence = 10; } - + break detectBlock; } - + // // No match if there are too many characters that don't fit the encoding scheme. // (should we have zero tolerance for these?) 
// - if (doubleByteCharCount < 20 * badCharCount) { + if (doubleByteCharCount < 20*badCharCount) { confidence = 0; break detectBlock; } - + if (commonChars == null) { // We have no statistics on frequently occuring characters. // Assess confidence purely on having a reasonable number of // multi-byte characters (the more the better - confidence = 30 + doubleByteCharCount - 20 * badCharCount; + confidence = 30 + doubleByteCharCount - 20*badCharCount; if (confidence > 100) { confidence = 100; } - } else { + }else { // // Frequency of occurence statistics exist. // - double maxVal = Math.log((float) doubleByteCharCount / 4); + double maxVal = Math.log((float)doubleByteCharCount / 4); double scaleFactor = 90.0 / maxVal; - confidence = (int) (Math.log(commonCharCount + 1) * scaleFactor + 10); + confidence = (int)(Math.log(commonCharCount+1) * scaleFactor + 10); confidence = Math.min(confidence, 100); } } // end of detectBlock: - + return confidence; } - - /** - * Get the next character (however many bytes it is) from the input data - * Subclasses for specific charset encodings must implement this function - * to get characters according to the rules of their encoding scheme. - *

      - * This function is not a method of class iteratedChar only because - * that would require a lot of extra derived classes, which is awkward. - * - * @param it The iteratedChar "struct" into which the returned char is placed. - * @param det The charset detector, which is needed to get at the input byte data - * being iterated over. - * @return True if a character was returned, false at end of input. - */ - abstract boolean nextChar(iteratedChar it, CharsetDetector det); - - // "Character" iterated character class. - // Recognizers for specific mbcs encodings make their "characters" available - // by providing a nextChar() function that fills in an instance of iteratedChar - // with the next char from the input. - // The returned characters are not converted to Unicode, but remain as the raw - // bytes (concatenated into an int) from the codepage data. - // - // For Asian charsets, use the raw input rather than the input that has been - // stripped of markup. Detection only considers multi-byte chars, effectively - // stripping markup anyway, and double byte chars do occur in markup too. - // - static class iteratedChar { - int charValue = 0; // 1-4 bytes from the raw input data - int index = 0; - int nextIndex = 0; - boolean error = false; - boolean done = false; - - void reset() { - charValue = 0; - index = -1; - nextIndex = 0; - error = false; - done = false; - } - - int nextByte(CharsetDetector det) { - if (nextIndex >= det.fRawLength) { - done = true; - return -1; - } - int byteValue = (int) det.fRawInput[nextIndex++] & 0x00ff; - return byteValue; - } - } - - /** - * Shift-JIS charset recognizer. - */ - static class CharsetRecog_sjis extends CharsetRecog_mbcs { - static int[] commonChars = - // TODO: This set of data comes from the character frequency- - // of-occurence analysis tool. The data needs to be moved - // into a resource and loaded from there. 
- {0x8140, 0x8141, 0x8142, 0x8145, 0x815b, 0x8169, 0x816a, 0x8175, 0x8176, 0x82a0, - 0x82a2, 0x82a4, 0x82a9, 0x82aa, 0x82ab, 0x82ad, 0x82af, 0x82b1, 0x82b3, 0x82b5, - 0x82b7, 0x82bd, 0x82be, 0x82c1, 0x82c4, 0x82c5, 0x82c6, 0x82c8, 0x82c9, 0x82cc, - 0x82cd, 0x82dc, 0x82e0, 0x82e7, 0x82e8, 0x82e9, 0x82ea, 0x82f0, 0x82f1, 0x8341, - 0x8343, 0x834e, 0x834f, 0x8358, 0x835e, 0x8362, 0x8367, 0x8375, 0x8376, 0x8389, - 0x838a, 0x838b, 0x838d, 0x8393, 0x8e96, 0x93fa, 0x95aa}; - - boolean nextChar(iteratedChar it, CharsetDetector det) { - it.index = it.nextIndex; - it.error = false; - int firstByte; - firstByte = it.charValue = it.nextByte(det); - if (firstByte < 0) { - return false; - } - - if (firstByte <= 0x7f || (firstByte > 0xa0 && firstByte <= 0xdf)) { - return true; - } - - int secondByte = it.nextByte(det); - if (secondByte < 0) { - return false; - } - it.charValue = (firstByte << 8) | secondByte; - if (!((secondByte >= 0x40 && secondByte <= 0x7f) || (secondByte >= 0x80 && secondByte <= 0xff))) { - // Illegal second byte value. - it.error = true; - } - return true; - } - - int match(CharsetDetector det) { - return match(det, commonChars); - } - - String getName() { - return "Shift_JIS"; - } - - public String getLanguage() { - return "ja"; - } - - - } - - - /** - * Big5 charset recognizer. - */ - static class CharsetRecog_big5 extends CharsetRecog_mbcs { - static int[] commonChars = - // TODO: This set of data comes from the character frequency- - // of-occurence analysis tool. The data needs to be moved - // into a resource and loaded from there. 
- {0xa140, 0xa141, 0xa142, 0xa143, 0xa147, 0xa149, 0xa175, 0xa176, 0xa440, 0xa446, - 0xa447, 0xa448, 0xa451, 0xa454, 0xa457, 0xa464, 0xa46a, 0xa46c, 0xa477, 0xa4a3, - 0xa4a4, 0xa4a7, 0xa4c1, 0xa4ce, 0xa4d1, 0xa4df, 0xa4e8, 0xa4fd, 0xa540, 0xa548, - 0xa558, 0xa569, 0xa5cd, 0xa5e7, 0xa657, 0xa661, 0xa662, 0xa668, 0xa670, 0xa6a8, - 0xa6b3, 0xa6b9, 0xa6d3, 0xa6db, 0xa6e6, 0xa6f2, 0xa740, 0xa751, 0xa759, 0xa7da, - 0xa8a3, 0xa8a5, 0xa8ad, 0xa8d1, 0xa8d3, 0xa8e4, 0xa8fc, 0xa9c0, 0xa9d2, 0xa9f3, - 0xaa6b, 0xaaba, 0xaabe, 0xaacc, 0xaafc, 0xac47, 0xac4f, 0xacb0, 0xacd2, 0xad59, - 0xaec9, 0xafe0, 0xb0ea, 0xb16f, 0xb2b3, 0xb2c4, 0xb36f, 0xb44c, 0xb44e, 0xb54c, - 0xb5a5, 0xb5bd, 0xb5d0, 0xb5d8, 0xb671, 0xb7ed, 0xb867, 0xb944, 0xbad8, 0xbb44, - 0xbba1, 0xbdd1, 0xc2c4, 0xc3b9, 0xc440, 0xc45f}; - - boolean nextChar(iteratedChar it, CharsetDetector det) { - it.index = it.nextIndex; - it.error = false; - int firstByte; - firstByte = it.charValue = it.nextByte(det); - if (firstByte < 0) { - return false; - } - - if (firstByte <= 0x7f || firstByte == 0xff) { - // single byte character. - return true; - } - - int secondByte = it.nextByte(det); - if (secondByte < 0) { - return false; - } - it.charValue = (it.charValue << 8) | secondByte; - - if (secondByte < 0x40 || - secondByte == 0x7f || - secondByte == 0xff) { - it.error = true; - } - return true; - } - - int match(CharsetDetector det) { - return match(det, commonChars); - } - - String getName() { - return "Big5"; - } - - - public String getLanguage() { - return "zh"; - } - } - - - /** - * EUC charset recognizers. One abstract class that provides the common function - * for getting the next character according to the EUC encoding scheme, - * and nested derived classes for EUC_KR, EUC_JP, EUC_CN. - */ - abstract static class CharsetRecog_euc extends CharsetRecog_mbcs { - - /* - * (non-Javadoc) - * Get the next character value for EUC based encodings. 
- * Character "value" is simply the raw bytes that make up the character - * packed into an int. - */ - boolean nextChar(iteratedChar it, CharsetDetector det) { - it.index = it.nextIndex; - it.error = false; - int firstByte = 0; - int secondByte = 0; - int thirdByte = 0; - //int fourthByte = 0; - - buildChar: - { - firstByte = it.charValue = it.nextByte(det); - if (firstByte < 0) { - // Ran off the end of the input data - it.done = true; - break buildChar; - } - if (firstByte <= 0x8d) { - // single byte char - break buildChar; - } - - secondByte = it.nextByte(det); - it.charValue = (it.charValue << 8) | secondByte; - - if (firstByte >= 0xA1 && firstByte <= 0xfe) { - // Two byte Char - if (secondByte < 0xa1) { - it.error = true; - } - break buildChar; - } - if (firstByte == 0x8e) { - // Code Set 2. - // In EUC-JP, total char size is 2 bytes, only one byte of actual char value. - // In EUC-TW, total char size is 4 bytes, three bytes contribute to char value. - // We don't know which we've got. - // Treat it like EUC-JP. If the data really was EUC-TW, the following two - // bytes will look like a well formed 2 byte char. - if (secondByte < 0xa1) { - it.error = true; - } - break buildChar; - } - - if (firstByte == 0x8f) { - // Code set 3. - // Three byte total char size, two bytes of actual char value. - thirdByte = it.nextByte(det); - it.charValue = (it.charValue << 8) | thirdByte; - if (thirdByte < 0xa1) { - it.error = true; - } - } - } - - return (it.done == false); - } - - /** - * The charset recognize for EUC-JP. A singleton instance of this class - * is created and kept by the public CharsetDetector class - */ - static class CharsetRecog_euc_jp extends CharsetRecog_euc { - static int[] commonChars = - // TODO: This set of data comes from the character frequency- - // of-occurence analysis tool. The data needs to be moved - // into a resource and loaded from there. 
- {0xa1a1, 0xa1a2, 0xa1a3, 0xa1a6, 0xa1bc, 0xa1ca, 0xa1cb, 0xa1d6, 0xa1d7, 0xa4a2, - 0xa4a4, 0xa4a6, 0xa4a8, 0xa4aa, 0xa4ab, 0xa4ac, 0xa4ad, 0xa4af, 0xa4b1, 0xa4b3, - 0xa4b5, 0xa4b7, 0xa4b9, 0xa4bb, 0xa4bd, 0xa4bf, 0xa4c0, 0xa4c1, 0xa4c3, 0xa4c4, - 0xa4c6, 0xa4c7, 0xa4c8, 0xa4c9, 0xa4ca, 0xa4cb, 0xa4ce, 0xa4cf, 0xa4d0, 0xa4de, - 0xa4df, 0xa4e1, 0xa4e2, 0xa4e4, 0xa4e8, 0xa4e9, 0xa4ea, 0xa4eb, 0xa4ec, 0xa4ef, - 0xa4f2, 0xa4f3, 0xa5a2, 0xa5a3, 0xa5a4, 0xa5a6, 0xa5a7, 0xa5aa, 0xa5ad, 0xa5af, - 0xa5b0, 0xa5b3, 0xa5b5, 0xa5b7, 0xa5b8, 0xa5b9, 0xa5bf, 0xa5c3, 0xa5c6, 0xa5c7, - 0xa5c8, 0xa5c9, 0xa5cb, 0xa5d0, 0xa5d5, 0xa5d6, 0xa5d7, 0xa5de, 0xa5e0, 0xa5e1, - 0xa5e5, 0xa5e9, 0xa5ea, 0xa5eb, 0xa5ec, 0xa5ed, 0xa5f3, 0xb8a9, 0xb9d4, 0xbaee, - 0xbbc8, 0xbef0, 0xbfb7, 0xc4ea, 0xc6fc, 0xc7bd, 0xcab8, 0xcaf3, 0xcbdc, 0xcdd1}; - - String getName() { - return "EUC-JP"; - } - - int match(CharsetDetector det) { - return match(det, commonChars); - } - - public String getLanguage() { - return "ja"; - } - } - - /** - * The charset recognize for EUC-KR. A singleton instance of this class - * is created and kept by the public CharsetDetector class - */ - static class CharsetRecog_euc_kr extends CharsetRecog_euc { - static int[] commonChars = - // TODO: This set of data comes from the character frequency- - // of-occurence analysis tool. The data needs to be moved - // into a resource and loaded from there. 
- {0xb0a1, 0xb0b3, 0xb0c5, 0xb0cd, 0xb0d4, 0xb0e6, 0xb0ed, 0xb0f8, 0xb0fa, 0xb0fc, - 0xb1b8, 0xb1b9, 0xb1c7, 0xb1d7, 0xb1e2, 0xb3aa, 0xb3bb, 0xb4c2, 0xb4cf, 0xb4d9, - 0xb4eb, 0xb5a5, 0xb5b5, 0xb5bf, 0xb5c7, 0xb5e9, 0xb6f3, 0xb7af, 0xb7c2, 0xb7ce, - 0xb8a6, 0xb8ae, 0xb8b6, 0xb8b8, 0xb8bb, 0xb8e9, 0xb9ab, 0xb9ae, 0xb9cc, 0xb9ce, - 0xb9fd, 0xbab8, 0xbace, 0xbad0, 0xbaf1, 0xbbe7, 0xbbf3, 0xbbfd, 0xbcad, 0xbcba, - 0xbcd2, 0xbcf6, 0xbdba, 0xbdc0, 0xbdc3, 0xbdc5, 0xbec6, 0xbec8, 0xbedf, 0xbeee, - 0xbef8, 0xbefa, 0xbfa1, 0xbfa9, 0xbfc0, 0xbfe4, 0xbfeb, 0xbfec, 0xbff8, 0xc0a7, - 0xc0af, 0xc0b8, 0xc0ba, 0xc0bb, 0xc0bd, 0xc0c7, 0xc0cc, 0xc0ce, 0xc0cf, 0xc0d6, - 0xc0da, 0xc0e5, 0xc0fb, 0xc0fc, 0xc1a4, 0xc1a6, 0xc1b6, 0xc1d6, 0xc1df, 0xc1f6, - 0xc1f8, 0xc4a1, 0xc5cd, 0xc6ae, 0xc7cf, 0xc7d1, 0xc7d2, 0xc7d8, 0xc7e5, 0xc8ad}; - - String getName() { - return "EUC-KR"; - } - - int match(CharsetDetector det) { - return match(det, commonChars); - } - - public String getLanguage() { - return "ko"; - } - } - } - - /** - * GB-18030 recognizer. Uses simplified Chinese statistics. - */ - static class CharsetRecog_gb_18030 extends CharsetRecog_mbcs { - - static int[] commonChars = - // TODO: This set of data comes from the character frequency- - // of-occurence analysis tool. The data needs to be moved - // into a resource and loaded from there. 
- {0xa1a1, 0xa1a2, 0xa1a3, 0xa1a4, 0xa1b0, 0xa1b1, 0xa1f1, 0xa1f3, 0xa3a1, 0xa3ac, - 0xa3ba, 0xb1a8, 0xb1b8, 0xb1be, 0xb2bb, 0xb3c9, 0xb3f6, 0xb4f3, 0xb5bd, 0xb5c4, - 0xb5e3, 0xb6af, 0xb6d4, 0xb6e0, 0xb7a2, 0xb7a8, 0xb7bd, 0xb7d6, 0xb7dd, 0xb8b4, - 0xb8df, 0xb8f6, 0xb9ab, 0xb9c9, 0xb9d8, 0xb9fa, 0xb9fd, 0xbacd, 0xbba7, 0xbbd6, - 0xbbe1, 0xbbfa, 0xbcbc, 0xbcdb, 0xbcfe, 0xbdcc, 0xbecd, 0xbedd, 0xbfb4, 0xbfc6, - 0xbfc9, 0xc0b4, 0xc0ed, 0xc1cb, 0xc2db, 0xc3c7, 0xc4dc, 0xc4ea, 0xc5cc, 0xc6f7, - 0xc7f8, 0xc8ab, 0xc8cb, 0xc8d5, 0xc8e7, 0xc9cf, 0xc9fa, 0xcab1, 0xcab5, 0xcac7, - 0xcad0, 0xcad6, 0xcaf5, 0xcafd, 0xccec, 0xcdf8, 0xceaa, 0xcec4, 0xced2, 0xcee5, - 0xcfb5, 0xcfc2, 0xcfd6, 0xd0c2, 0xd0c5, 0xd0d0, 0xd0d4, 0xd1a7, 0xd2aa, 0xd2b2, - 0xd2b5, 0xd2bb, 0xd2d4, 0xd3c3, 0xd3d0, 0xd3fd, 0xd4c2, 0xd4da, 0xd5e2, 0xd6d0}; - - /* - * (non-Javadoc) - * Get the next character value for EUC based encodings. - * Character "value" is simply the raw bytes that make up the character - * packed into an int. 
- */ - boolean nextChar(iteratedChar it, CharsetDetector det) { - it.index = it.nextIndex; - it.error = false; - int firstByte = 0; - int secondByte = 0; - int thirdByte = 0; - int fourthByte = 0; - - buildChar: - { - firstByte = it.charValue = it.nextByte(det); - - if (firstByte < 0) { - // Ran off the end of the input data - it.done = true; - break buildChar; - } - - if (firstByte <= 0x80) { - // single byte char - break buildChar; - } - - secondByte = it.nextByte(det); - it.charValue = (it.charValue << 8) | secondByte; - - if (firstByte >= 0x81 && firstByte <= 0xFE) { - // Two byte Char - if ((secondByte >= 0x40 && secondByte <= 0x7E) || (secondByte >= 80 && secondByte <= 0xFE)) { - break buildChar; - } - - // Four byte char - if (secondByte >= 0x30 && secondByte <= 0x39) { - thirdByte = it.nextByte(det); - - if (thirdByte >= 0x81 && thirdByte <= 0xFE) { - fourthByte = it.nextByte(det); - - if (fourthByte >= 0x30 && fourthByte <= 0x39) { - it.charValue = (it.charValue << 16) | (thirdByte << 8) | fourthByte; - break buildChar; - } - } - } - - it.error = true; - break buildChar; - } - } - - return (it.done == false); - } - - String getName() { - return "GB18030"; - } - - int match(CharsetDetector det) { - return match(det, commonChars); - } - - public String getLanguage() { - return "zh"; - } - } - - + + // "Character" iterated character class. + // Recognizers for specific mbcs encodings make their "characters" available + // by providing a nextChar() function that fills in an instance of iteratedChar + // with the next char from the input. + // The returned characters are not converted to Unicode, but remain as the raw + // bytes (concatenated into an int) from the codepage data. + // + // For Asian charsets, use the raw input rather than the input that has been + // stripped of markup. Detection only considers multi-byte chars, effectively + // stripping markup anyway, and double byte chars do occur in markup too. 
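The comment block above explains that recognizers keep each "character" as its raw bytes concatenated into an int rather than decoding to Unicode. A minimal sketch of that packing, using a hypothetical helper name:

```java
// Packs successive raw bytes into one int "character" value, mirroring how
// iteratedChar accumulates bytes via (charValue << 8) | nextByte.
public class BytePack {
    static int pack(int... bytes) {
        int value = 0;
        for (int b : bytes) {
            value = (value << 8) | (b & 0xFF);
        }
        return value;
    }

    public static void main(String[] args) {
        // The two Shift_JIS bytes 0x93 0xFA pack into 0x93FA, which is exactly
        // the form in which the commonChars frequency tables store entries.
        System.out.println(Integer.toHexString(pack(0x93, 0xFA))); // 93fa
    }
}
```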
+ // + static class iteratedChar { + int charValue = 0; // 1-4 bytes from the raw input data + int index = 0; + int nextIndex = 0; + boolean error = false; + boolean done = false; + + void reset() { + charValue = 0; + index = -1; + nextIndex = 0; + error = false; + done = false; + } + + int nextByte(CharsetDetector det) { + if (nextIndex >= det.fRawLength) { + done = true; + return -1; + } + int byteValue = (int)det.fRawInput[nextIndex++] & 0x00ff; + return byteValue; + } + } + + /** + * Get the next character (however many bytes it is) from the input data + * Subclasses for specific charset encodings must implement this function + * to get characters according to the rules of their encoding scheme. + * + * This function is not a method of class iteratedChar only because + * that would require a lot of extra derived classes, which is awkward. + * @param it The iteratedChar "struct" into which the returned char is placed. + * @param det The charset detector, which is needed to get at the input byte data + * being iterated over. + * @return True if a character was returned, false at end of input. + */ + abstract boolean nextChar(iteratedChar it, CharsetDetector det); + + + + + + /** + * Shift-JIS charset recognizer. + * + */ + static class CharsetRecog_sjis extends CharsetRecog_mbcs { + static int [] commonChars = + // TODO: This set of data comes from the character frequency- + // of-occurence analysis tool. The data needs to be moved + // into a resource and loaded from there. 
+ {0x8140, 0x8141, 0x8142, 0x8145, 0x815b, 0x8169, 0x816a, 0x8175, 0x8176, 0x82a0, + 0x82a2, 0x82a4, 0x82a9, 0x82aa, 0x82ab, 0x82ad, 0x82af, 0x82b1, 0x82b3, 0x82b5, + 0x82b7, 0x82bd, 0x82be, 0x82c1, 0x82c4, 0x82c5, 0x82c6, 0x82c8, 0x82c9, 0x82cc, + 0x82cd, 0x82dc, 0x82e0, 0x82e7, 0x82e8, 0x82e9, 0x82ea, 0x82f0, 0x82f1, 0x8341, + 0x8343, 0x834e, 0x834f, 0x8358, 0x835e, 0x8362, 0x8367, 0x8375, 0x8376, 0x8389, + 0x838a, 0x838b, 0x838d, 0x8393, 0x8e96, 0x93fa, 0x95aa}; + + boolean nextChar(iteratedChar it, CharsetDetector det) { + it.index = it.nextIndex; + it.error = false; + int firstByte; + firstByte = it.charValue = it.nextByte(det); + if (firstByte < 0) { + return false; + } + + if (firstByte <= 0x7f || (firstByte>0xa0 && firstByte<=0xdf)) { + return true; + } + + int secondByte = it.nextByte(det); + if (secondByte < 0) { + return false; + } + it.charValue = (firstByte << 8) | secondByte; + if (! ((secondByte>=0x40 && secondByte<=0x7f) || (secondByte>=0x80 && secondByte<=0xff))) { + // Illegal second byte value. + it.error = true; + } + return true; + } + + int match(CharsetDetector det) { + return match(det, commonChars); + } + + String getName() { + return "Shift_JIS"; + } + + public String getLanguage() + { + return "ja"; + } + + + } + + + /** + * Big5 charset recognizer. + * + */ + static class CharsetRecog_big5 extends CharsetRecog_mbcs { + static int [] commonChars = + // TODO: This set of data comes from the character frequency- + // of-occurence analysis tool. The data needs to be moved + // into a resource and loaded from there. 
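The Shift_JIS recognizer above classifies bytes 0x00-0x7F and 0xA1-0xDF as complete single-byte characters (ASCII and half-width katakana) and, after any other lead byte, accepts a trail byte in 0x40-0x7F or 0x80-0xFF. A standalone sketch of that classification, with illustrative names:

```java
// Sketch of the byte classification in CharsetRecog_sjis.nextChar().
// Class and method names are illustrative, not part of Tika's API.
public class SjisScan {
    /** True if b (0x00-0xFF) is a complete single-byte character:
     *  ASCII, or the half-width katakana range 0xA1-0xDF. */
    static boolean isSingleByte(int b) {
        return b <= 0x7f || (b > 0xa0 && b <= 0xdf);
    }

    /** True if b is accepted as a trail byte by the recognizer above
     *  (0x40-0x7F or 0x80-0xFF; anything else flags an error). */
    static boolean isTrailByte(int b) {
        return (b >= 0x40 && b <= 0x7f) || (b >= 0x80 && b <= 0xff);
    }

    public static void main(String[] args) {
        System.out.println(isSingleByte(0x41)); // true: 'A'
        System.out.println(isSingleByte(0x93)); // false: lead byte of a pair
        System.out.println(isTrailByte(0xFA));  // true: e.g. 0x93 0xFA
    }
}
```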
+ {0xa140, 0xa141, 0xa142, 0xa143, 0xa147, 0xa149, 0xa175, 0xa176, 0xa440, 0xa446, + 0xa447, 0xa448, 0xa451, 0xa454, 0xa457, 0xa464, 0xa46a, 0xa46c, 0xa477, 0xa4a3, + 0xa4a4, 0xa4a7, 0xa4c1, 0xa4ce, 0xa4d1, 0xa4df, 0xa4e8, 0xa4fd, 0xa540, 0xa548, + 0xa558, 0xa569, 0xa5cd, 0xa5e7, 0xa657, 0xa661, 0xa662, 0xa668, 0xa670, 0xa6a8, + 0xa6b3, 0xa6b9, 0xa6d3, 0xa6db, 0xa6e6, 0xa6f2, 0xa740, 0xa751, 0xa759, 0xa7da, + 0xa8a3, 0xa8a5, 0xa8ad, 0xa8d1, 0xa8d3, 0xa8e4, 0xa8fc, 0xa9c0, 0xa9d2, 0xa9f3, + 0xaa6b, 0xaaba, 0xaabe, 0xaacc, 0xaafc, 0xac47, 0xac4f, 0xacb0, 0xacd2, 0xad59, + 0xaec9, 0xafe0, 0xb0ea, 0xb16f, 0xb2b3, 0xb2c4, 0xb36f, 0xb44c, 0xb44e, 0xb54c, + 0xb5a5, 0xb5bd, 0xb5d0, 0xb5d8, 0xb671, 0xb7ed, 0xb867, 0xb944, 0xbad8, 0xbb44, + 0xbba1, 0xbdd1, 0xc2c4, 0xc3b9, 0xc440, 0xc45f}; + + boolean nextChar(iteratedChar it, CharsetDetector det) { + it.index = it.nextIndex; + it.error = false; + int firstByte; + firstByte = it.charValue = it.nextByte(det); + if (firstByte < 0) { + return false; + } + + if (firstByte <= 0x7f || firstByte==0xff) { + // single byte character. + return true; + } + + int secondByte = it.nextByte(det); + if (secondByte < 0) { + return false; + } + it.charValue = (it.charValue << 8) | secondByte; + + if (secondByte < 0x40 || + secondByte ==0x7f || + secondByte == 0xff) { + it.error = true; + } + return true; + } + + int match(CharsetDetector det) { + return match(det, commonChars); + } + + String getName() { + return "Big5"; + } + + + public String getLanguage() + { + return "zh"; + } + } + + + /** + * EUC charset recognizers. One abstract class that provides the common function + * for getting the next character according to the EUC encoding scheme, + * and nested derived classes for EUC_KR, EUC_JP, EUC_CN. + * + */ + abstract static class CharsetRecog_euc extends CharsetRecog_mbcs { + + /* + * (non-Javadoc) + * Get the next character value for EUC based encodings. 
+ * Character "value" is simply the raw bytes that make up the character + * packed into an int. + */ + boolean nextChar(iteratedChar it, CharsetDetector det) { + it.index = it.nextIndex; + it.error = false; + int firstByte = 0; + int secondByte = 0; + int thirdByte = 0; + //int fourthByte = 0; + + buildChar: { + firstByte = it.charValue = it.nextByte(det); + if (firstByte < 0) { + // Ran off the end of the input data + it.done = true; + break buildChar; + } + if (firstByte <= 0x8d) { + // single byte char + break buildChar; + } + + secondByte = it.nextByte(det); + it.charValue = (it.charValue << 8) | secondByte; + + if (firstByte >= 0xA1 && firstByte <= 0xfe) { + // Two byte Char + if (secondByte < 0xa1) { + it.error = true; + } + break buildChar; + } + if (firstByte == 0x8e) { + // Code Set 2. + // In EUC-JP, total char size is 2 bytes, only one byte of actual char value. + // In EUC-TW, total char size is 4 bytes, three bytes contribute to char value. + // We don't know which we've got. + // Treat it like EUC-JP. If the data really was EUC-TW, the following two + // bytes will look like a well formed 2 byte char. + if (secondByte < 0xa1) { + it.error = true; + } + break buildChar; + } + + if (firstByte == 0x8f) { + // Code set 3. + // Three byte total char size, two bytes of actual char value. + thirdByte = it.nextByte(det); + it.charValue = (it.charValue << 8) | thirdByte; + if (thirdByte < 0xa1) { + it.error = true; + } + } + } + + return (it.done == false); + } + + /** + * The charset recognize for EUC-JP. A singleton instance of this class + * is created and kept by the public CharsetDetector class + */ + static class CharsetRecog_euc_jp extends CharsetRecog_euc { + static int [] commonChars = + // TODO: This set of data comes from the character frequency- + // of-occurence analysis tool. The data needs to be moved + // into a resource and loaded from there. 
+ {0xa1a1, 0xa1a2, 0xa1a3, 0xa1a6, 0xa1bc, 0xa1ca, 0xa1cb, 0xa1d6, 0xa1d7, 0xa4a2, + 0xa4a4, 0xa4a6, 0xa4a8, 0xa4aa, 0xa4ab, 0xa4ac, 0xa4ad, 0xa4af, 0xa4b1, 0xa4b3, + 0xa4b5, 0xa4b7, 0xa4b9, 0xa4bb, 0xa4bd, 0xa4bf, 0xa4c0, 0xa4c1, 0xa4c3, 0xa4c4, + 0xa4c6, 0xa4c7, 0xa4c8, 0xa4c9, 0xa4ca, 0xa4cb, 0xa4ce, 0xa4cf, 0xa4d0, 0xa4de, + 0xa4df, 0xa4e1, 0xa4e2, 0xa4e4, 0xa4e8, 0xa4e9, 0xa4ea, 0xa4eb, 0xa4ec, 0xa4ef, + 0xa4f2, 0xa4f3, 0xa5a2, 0xa5a3, 0xa5a4, 0xa5a6, 0xa5a7, 0xa5aa, 0xa5ad, 0xa5af, + 0xa5b0, 0xa5b3, 0xa5b5, 0xa5b7, 0xa5b8, 0xa5b9, 0xa5bf, 0xa5c3, 0xa5c6, 0xa5c7, + 0xa5c8, 0xa5c9, 0xa5cb, 0xa5d0, 0xa5d5, 0xa5d6, 0xa5d7, 0xa5de, 0xa5e0, 0xa5e1, + 0xa5e5, 0xa5e9, 0xa5ea, 0xa5eb, 0xa5ec, 0xa5ed, 0xa5f3, 0xb8a9, 0xb9d4, 0xbaee, + 0xbbc8, 0xbef0, 0xbfb7, 0xc4ea, 0xc6fc, 0xc7bd, 0xcab8, 0xcaf3, 0xcbdc, 0xcdd1}; + String getName() { + return "EUC-JP"; + } + + int match(CharsetDetector det) { + return match(det, commonChars); + } + + public String getLanguage() + { + return "ja"; + } + } + + /** + * The charset recognize for EUC-KR. A singleton instance of this class + * is created and kept by the public CharsetDetector class + */ + static class CharsetRecog_euc_kr extends CharsetRecog_euc { + static int [] commonChars = + // TODO: This set of data comes from the character frequency- + // of-occurence analysis tool. The data needs to be moved + // into a resource and loaded from there. 
+ {0xb0a1, 0xb0b3, 0xb0c5, 0xb0cd, 0xb0d4, 0xb0e6, 0xb0ed, 0xb0f8, 0xb0fa, 0xb0fc, + 0xb1b8, 0xb1b9, 0xb1c7, 0xb1d7, 0xb1e2, 0xb3aa, 0xb3bb, 0xb4c2, 0xb4cf, 0xb4d9, + 0xb4eb, 0xb5a5, 0xb5b5, 0xb5bf, 0xb5c7, 0xb5e9, 0xb6f3, 0xb7af, 0xb7c2, 0xb7ce, + 0xb8a6, 0xb8ae, 0xb8b6, 0xb8b8, 0xb8bb, 0xb8e9, 0xb9ab, 0xb9ae, 0xb9cc, 0xb9ce, + 0xb9fd, 0xbab8, 0xbace, 0xbad0, 0xbaf1, 0xbbe7, 0xbbf3, 0xbbfd, 0xbcad, 0xbcba, + 0xbcd2, 0xbcf6, 0xbdba, 0xbdc0, 0xbdc3, 0xbdc5, 0xbec6, 0xbec8, 0xbedf, 0xbeee, + 0xbef8, 0xbefa, 0xbfa1, 0xbfa9, 0xbfc0, 0xbfe4, 0xbfeb, 0xbfec, 0xbff8, 0xc0a7, + 0xc0af, 0xc0b8, 0xc0ba, 0xc0bb, 0xc0bd, 0xc0c7, 0xc0cc, 0xc0ce, 0xc0cf, 0xc0d6, + 0xc0da, 0xc0e5, 0xc0fb, 0xc0fc, 0xc1a4, 0xc1a6, 0xc1b6, 0xc1d6, 0xc1df, 0xc1f6, + 0xc1f8, 0xc4a1, 0xc5cd, 0xc6ae, 0xc7cf, 0xc7d1, 0xc7d2, 0xc7d8, 0xc7e5, 0xc8ad}; + + String getName() { + return "EUC-KR"; + } + + int match(CharsetDetector det) { + return match(det, commonChars); + } + + public String getLanguage() + { + return "ko"; + } + } + } + + /** + * + * GB-18030 recognizer. Uses simplified Chinese statistics. + * + */ + static class CharsetRecog_gb_18030 extends CharsetRecog_mbcs { + + /* + * (non-Javadoc) + * Get the next character value for EUC based encodings. + * Character "value" is simply the raw bytes that make up the character + * packed into an int. 
+ */ + boolean nextChar(iteratedChar it, CharsetDetector det) { + it.index = it.nextIndex; + it.error = false; + int firstByte = 0; + int secondByte = 0; + int thirdByte = 0; + int fourthByte = 0; + + buildChar: { + firstByte = it.charValue = it.nextByte(det); + + if (firstByte < 0) { + // Ran off the end of the input data + it.done = true; + break buildChar; + } + + if (firstByte <= 0x80) { + // single byte char + break buildChar; + } + + secondByte = it.nextByte(det); + it.charValue = (it.charValue << 8) | secondByte; + + if (firstByte >= 0x81 && firstByte <= 0xFE) { + // Two byte Char + if ((secondByte >= 0x40 && secondByte <= 0x7E) || (secondByte >= 0x80 && secondByte <= 0xFE)) { + break buildChar; + } + + // Four byte char + if (secondByte >= 0x30 && secondByte <= 0x39) { + thirdByte = it.nextByte(det); + + if (thirdByte >= 0x81 && thirdByte <= 0xFE) { + fourthByte = it.nextByte(det); + + if (fourthByte >= 0x30 && fourthByte <= 0x39) { + it.charValue = (it.charValue << 16) | (thirdByte << 8) | fourthByte; + break buildChar; + } + } + } + + it.error = true; + break buildChar; + } + } + + return (it.done == false); + } + + static int [] commonChars = + // TODO: This set of data comes from the character frequency- + // of-occurence analysis tool. The data needs to be moved + // into a resource and loaded from there. 
+ {0xa1a1, 0xa1a2, 0xa1a3, 0xa1a4, 0xa1b0, 0xa1b1, 0xa1f1, 0xa1f3, 0xa3a1, 0xa3ac, + 0xa3ba, 0xb1a8, 0xb1b8, 0xb1be, 0xb2bb, 0xb3c9, 0xb3f6, 0xb4f3, 0xb5bd, 0xb5c4, + 0xb5e3, 0xb6af, 0xb6d4, 0xb6e0, 0xb7a2, 0xb7a8, 0xb7bd, 0xb7d6, 0xb7dd, 0xb8b4, + 0xb8df, 0xb8f6, 0xb9ab, 0xb9c9, 0xb9d8, 0xb9fa, 0xb9fd, 0xbacd, 0xbba7, 0xbbd6, + 0xbbe1, 0xbbfa, 0xbcbc, 0xbcdb, 0xbcfe, 0xbdcc, 0xbecd, 0xbedd, 0xbfb4, 0xbfc6, + 0xbfc9, 0xc0b4, 0xc0ed, 0xc1cb, 0xc2db, 0xc3c7, 0xc4dc, 0xc4ea, 0xc5cc, 0xc6f7, + 0xc7f8, 0xc8ab, 0xc8cb, 0xc8d5, 0xc8e7, 0xc9cf, 0xc9fa, 0xcab1, 0xcab5, 0xcac7, + 0xcad0, 0xcad6, 0xcaf5, 0xcafd, 0xccec, 0xcdf8, 0xceaa, 0xcec4, 0xced2, 0xcee5, + 0xcfb5, 0xcfc2, 0xcfd6, 0xd0c2, 0xd0c5, 0xd0d0, 0xd0d4, 0xd1a7, 0xd2aa, 0xd2b2, + 0xd2b5, 0xd2bb, 0xd2d4, 0xd3c3, 0xd3d0, 0xd3fd, 0xd4c2, 0xd4da, 0xd5e2, 0xd6d0}; + + + String getName() { + return "GB18030"; + } + + int match(CharsetDetector det) { + return match(det, commonChars); + } + + public String getLanguage() + { + return "zh"; + } + } + + } diff --git a/tika-parsers/src/main/java/org/apache/tika/parser/txt/CharsetRecog_sbcs.java b/tika-parsers/src/main/java/org/apache/tika/parser/txt/CharsetRecog_sbcs.java index 87f831b..c621328 100644 --- a/tika-parsers/src/main/java/org/apache/tika/parser/txt/CharsetRecog_sbcs.java +++ b/tika-parsers/src/main/java/org/apache/tika/parser/txt/CharsetRecog_sbcs.java @@ -12,26 +12,24 @@ /** * This class recognizes single-byte encodings. Because the encoding scheme is so * simple, language statistics are used to do the matching. - *
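The GB18030 recognizer above accepts two-byte pairs (lead 0x81-0xFE, trail 0x40-0x7E or 0x80-0xFE) and four-byte sequences (second byte 0x30-0x39, third 0x81-0xFE, fourth 0x30-0x39). Note the two-byte trail range's lower bound of hexadecimal 0x80; the `secondByte >=80` comparison that appears in this diff writes it in decimal, which looks like a slip. A standalone sketch of the sequence-length logic, with hypothetical names:

```java
// Returns the byte length of a GB18030 sequence starting at the given bytes,
// or -1 if the bytes do not form a valid sequence. Names are illustrative only.
public class Gb18030Len {
    static int sequenceLength(int b1, int b2, int b3, int b4) {
        if (b1 <= 0x80) {
            return 1; // single-byte (ASCII plus 0x80)
        }
        if (b1 >= 0x81 && b1 <= 0xFE) {
            if ((b2 >= 0x40 && b2 <= 0x7E) || (b2 >= 0x80 && b2 <= 0xFE)) {
                return 2; // two-byte character
            }
            if (b2 >= 0x30 && b2 <= 0x39
                    && b3 >= 0x81 && b3 <= 0xFE
                    && b4 >= 0x30 && b4 <= 0x39) {
                return 4; // four-byte character
            }
        }
        return -1; // invalid GB18030 sequence
    }

    public static void main(String[] args) {
        System.out.println(sequenceLength('A', 0, 0, 0));           // 1
        System.out.println(sequenceLength(0xD6, 0xD0, 0, 0));       // 2
        System.out.println(sequenceLength(0x81, 0x30, 0x81, 0x30)); // 4
    }
}
```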

      + * * The Recognizer works by first mapping from bytes in the encoding under test - * into that Recognizer's ngram space. Normally this means performing a - * lowercase, and excluding codepoints that don't correspond to numbers of - * letters. (Accented letters may or may not be ignored or normalised, depending - * on the needs of the ngrams) + * into that Recognizer's ngram space. Normally this means performing a + * lowercase, and excluding codepoints that don't correspond to numbers of + * letters. (Accented letters may or may not be ignored or normalised, depending + * on the needs of the ngrams) * Then, ngram analysis is run against the transformed text, and a confidence - * is calculated. - *

      + * is calculated. + * * For many of our Recognizers, we have one ngram set per language in each - * encoding, and do a simultanious language+charset detection. - *

      + * encoding, and do a simultanious language+charset detection. + * * When adding new Recognizers, the easiest way is to byte map to an existing - * encoding for which we have ngrams, excluding non text, and re-use the ngrams. - * + * encoding for which we have ngrams, excluding non text, and re-use the ngrams. + * * @internal */ abstract class CharsetRecog_sbcs extends CharsetRecognizer { - - protected boolean haveC1Bytes = false; /* (non-Javadoc) * @see com.ibm.icu.text.CharsetRecognizer#getName() @@ -42,53 +40,44 @@ * @see com.ibm.icu.text.CharsetRecognizer#match(com.ibm.icu.text.CharsetDetector) */ abstract int match(CharsetDetector det); - - int match(CharsetDetector det, int[] ngrams, byte[] byteMap) { - return match(det, ngrams, byteMap, (byte) 0x20); - } - - int match(CharsetDetector det, int[] ngrams, byte[] byteMap, byte spaceChar) { - NGramParser parser = new NGramParser(ngrams, byteMap); - - haveC1Bytes = det.fC1Bytes; - - return parser.parse(det, spaceChar); - } - - static class NGramParser { - // private static final int N_GRAM_SIZE = 3; + + static class NGramParser + { +// private static final int N_GRAM_SIZE = 3; private static final int N_GRAM_MASK = 0xFFFFFF; private int byteIndex = 0; private int ngram = 0; - + private int[] ngramList; private byte[] byteMap; - + private int ngramCount; private int hitCount; - + private byte spaceChar; - - public NGramParser(int[] theNgramList, byte[] theByteMap) { + + public NGramParser(int[] theNgramList, byte[] theByteMap) + { ngramList = theNgramList; - byteMap = theByteMap; - + byteMap = theByteMap; + ngram = 0; - + ngramCount = hitCount = 0; } - + /* * Binary search for value in table, which must have exactly 64 entries. 
*/ - private static int search(int[] table, int value) { + private static int search(int[] table, int value) + { int index = 0; - + if (table[index + 32] <= value) { index += 32; } - + if (table[index + 16] <= value) { index += 16; } @@ -112,82 +101,103 @@ if (table[index] > value) { index -= 1; } - + if (index < 0 || table[index] != value) { return -1; } - + return index; } - private void lookup(int thisNgram) { + private void lookup(int thisNgram) + { ngramCount += 1; - + if (search(ngramList, thisNgram) >= 0) { hitCount += 1; } - - } - - private void addByte(int b) { + + } + + private void addByte(int b) + { ngram = ((ngram << 8) + (b & 0xFF)) & N_GRAM_MASK; lookup(ngram); } - - private int nextByte(CharsetDetector det) { + + private int nextByte(CharsetDetector det) + { if (byteIndex >= det.fInputLen) { return -1; } - + return det.fInputBytes[byteIndex++] & 0xFF; } - - public int parse(CharsetDetector det) { - return parse(det, (byte) 0x20); - } - - public int parse(CharsetDetector det, byte spaceCh) { + + public int parse(CharsetDetector det) + { + return parse (det, (byte)0x20); + } + public int parse(CharsetDetector det, byte spaceCh) + { int b; boolean ignoreSpace = false; this.spaceChar = spaceCh; - + while ((b = nextByte(det)) >= 0) { byte mb = byteMap[b]; - + // TODO: 0x20 might not be a space in all character sets... if (mb != 0) { if (!(mb == spaceChar && ignoreSpace)) { - addByte(mb); + addByte(mb); } - + ignoreSpace = (mb == spaceChar); - } else if (mb == 0 && b != 0) { - // Indicates an invalid character in the charset - // Bump the ngram count up a bit to indicate uncertainty - ngramCount += 4; + } else if(mb == 0 && b != 0) { + // Indicates an invalid character in the charset + // Bump the ngram count up a bit to indicate uncertainty + ngramCount += 4; } } - + // TODO: Is this OK? The buffer could have ended in the middle of a word... 
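`NGramParser.addByte()` above maintains a rolling three-byte window: each mapped byte is shifted into a 24-bit value, so the low three bytes always hold the most recent trigram, which is then looked up in the sorted ngram table. A sketch of that window, with illustrative names:

```java
// Rolling 3-byte ngram window, as in NGramParser.addByte(): shifting a new
// byte in and masking to 24 bits keeps exactly the last three mapped bytes.
public class NgramWindow {
    private static final int N_GRAM_MASK = 0xFFFFFF;
    private int ngram = 0;

    int push(int b) {
        ngram = ((ngram << 8) + (b & 0xFF)) & N_GRAM_MASK;
        return ngram;
    }

    public static void main(String[] args) {
        NgramWindow w = new NgramWindow();
        // Feed the bytes of " th" (0x20, 0x74, 0x68): the window becomes
        // 0x207468, the packed form the ngram frequency tables store.
        w.push(0x20);
        w.push(0x74);
        System.out.println(Integer.toHexString(w.push(0x68))); // 207468
    }
}
```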
addByte(spaceChar); double rawPercent = (double) hitCount / (double) ngramCount; - + // if (rawPercent <= 2.0) { // return 0; // } - + // TODO - This is a bit of a hack to take care of a case // were we were getting a confidence of 135... if (rawPercent > 0.33) { return 98; } - + return (int) (rawPercent * 300.0); } } - - abstract static class CharsetRecog_8859_1 extends CharsetRecog_sbcs { + + protected boolean haveC1Bytes = false; + + int match(CharsetDetector det, int[] ngrams, byte[] byteMap) + { + return match (det, ngrams, byteMap, (byte)0x20); + } + + int match(CharsetDetector det, int[] ngrams, byte[] byteMap, byte spaceChar) + { + NGramParser parser = new NGramParser(ngrams, byteMap); + + haveC1Bytes = det.fC1Bytes; + + return parser.parse(det, spaceChar); + } + + abstract static class CharsetRecog_8859_1 extends CharsetRecog_sbcs + { protected static byte[] byteMap = { /* 0x00-0x07 */ (byte) 0x20, (byte) 0x20, (byte) 0x20, (byte) 0x20, (byte) 0x20, (byte) 0x20, (byte) 0x20, (byte) 0x20, /* 0x08-0x0f */ (byte) 0x20, (byte) 0x20, (byte) 0x20, (byte) 0x20, (byte) 0x20, (byte) 0x20, (byte) 0x20, (byte) 0x20, @@ -220,830 +230,921 @@ /* 0xe0-0xe7 */ (byte) 0xE0, (byte) 0xE1, (byte) 0xE2, (byte) 0xE3, (byte) 0xE4, (byte) 0xE5, (byte) 0xE6, (byte) 0xE7, /* 0xe8-0xef */ (byte) 0xE8, (byte) 0xE9, (byte) 0xEA, (byte) 0xEB, (byte) 0xEC, (byte) 0xED, (byte) 0xEE, (byte) 0xEF, /* 0xf0-0xf7 */ (byte) 0xF0, (byte) 0xF1, (byte) 0xF2, (byte) 0xF3, (byte) 0xF4, (byte) 0xF5, (byte) 0xF6, (byte) 0x20, -/* 0xf8-0xff */ (byte) 0xF8, (byte) 0xF9, (byte) 0xFA, (byte) 0xFB, (byte) 0xFC, (byte) 0xFD, (byte) 0xFE, (byte) 0xFF, - }; - - public String getName() { - return haveC1Bytes ? 
"windows-1252" : "ISO-8859-1"; - } - } - - static class CharsetRecog_8859_1_da extends CharsetRecog_8859_1 { - private static int[] ngrams = { - 0x206166, 0x206174, 0x206465, 0x20656E, 0x206572, 0x20666F, 0x206861, 0x206920, 0x206D65, 0x206F67, 0x2070E5, 0x207369, 0x207374, 0x207469, 0x207669, 0x616620, - 0x616E20, 0x616E64, 0x617220, 0x617420, 0x646520, 0x64656E, 0x646572, 0x646574, 0x652073, 0x656420, 0x656465, 0x656E20, 0x656E64, 0x657220, 0x657265, 0x657320, - 0x657420, 0x666F72, 0x676520, 0x67656E, 0x676572, 0x696765, 0x696C20, 0x696E67, 0x6B6520, 0x6B6B65, 0x6C6572, 0x6C6967, 0x6C6C65, 0x6D6564, 0x6E6465, 0x6E6520, - 0x6E6720, 0x6E6765, 0x6F6720, 0x6F6D20, 0x6F7220, 0x70E520, 0x722064, 0x722065, 0x722073, 0x726520, 0x737465, 0x742073, 0x746520, 0x746572, 0x74696C, 0x766572, - }; - - public String getLanguage() { +/* 0xf8-0xff */ (byte) 0xF8, (byte) 0xF9, (byte) 0xFA, (byte) 0xFB, (byte) 0xFC, (byte) 0xFD, (byte) 0xFE, (byte) 0xFF, + }; + + public String getName() + { + return haveC1Bytes? 
"windows-1252" : "ISO-8859-1"; + } + } + + static class CharsetRecog_8859_1_da extends CharsetRecog_8859_1 + { + private static int[] ngrams = { + 0x206166, 0x206174, 0x206465, 0x20656E, 0x206572, 0x20666F, 0x206861, 0x206920, 0x206D65, 0x206F67, 0x2070E5, 0x207369, 0x207374, 0x207469, 0x207669, 0x616620, + 0x616E20, 0x616E64, 0x617220, 0x617420, 0x646520, 0x64656E, 0x646572, 0x646574, 0x652073, 0x656420, 0x656465, 0x656E20, 0x656E64, 0x657220, 0x657265, 0x657320, + 0x657420, 0x666F72, 0x676520, 0x67656E, 0x676572, 0x696765, 0x696C20, 0x696E67, 0x6B6520, 0x6B6B65, 0x6C6572, 0x6C6967, 0x6C6C65, 0x6D6564, 0x6E6465, 0x6E6520, + 0x6E6720, 0x6E6765, 0x6F6720, 0x6F6D20, 0x6F7220, 0x70E520, 0x722064, 0x722065, 0x722073, 0x726520, 0x737465, 0x742073, 0x746520, 0x746572, 0x74696C, 0x766572, + }; + + public String getLanguage() + { return "da"; } - - public int match(CharsetDetector det) { - return match(det, ngrams, byteMap); - } - } - - static class CharsetRecog_8859_1_de extends CharsetRecog_8859_1 { - private static int[] ngrams = { - 0x20616E, 0x206175, 0x206265, 0x206461, 0x206465, 0x206469, 0x206569, 0x206765, 0x206861, 0x20696E, 0x206D69, 0x207363, 0x207365, 0x20756E, 0x207665, 0x20766F, - 0x207765, 0x207A75, 0x626572, 0x636820, 0x636865, 0x636874, 0x646173, 0x64656E, 0x646572, 0x646965, 0x652064, 0x652073, 0x65696E, 0x656974, 0x656E20, 0x657220, - 0x657320, 0x67656E, 0x68656E, 0x687420, 0x696368, 0x696520, 0x696E20, 0x696E65, 0x697420, 0x6C6963, 0x6C6C65, 0x6E2061, 0x6E2064, 0x6E2073, 0x6E6420, 0x6E6465, - 0x6E6520, 0x6E6720, 0x6E6765, 0x6E7465, 0x722064, 0x726465, 0x726569, 0x736368, 0x737465, 0x742064, 0x746520, 0x74656E, 0x746572, 0x756E64, 0x756E67, 0x766572, - }; - - public String getLanguage() { + + public int match(CharsetDetector det) + { + return match(det, ngrams, byteMap); + } + } + + static class CharsetRecog_8859_1_de extends CharsetRecog_8859_1 + { + private static int[] ngrams = { + 0x20616E, 0x206175, 0x206265, 0x206461, 0x206465, 0x206469, 0x206569, 
0x206765, 0x206861, 0x20696E, 0x206D69, 0x207363, 0x207365, 0x20756E, 0x207665, 0x20766F, + 0x207765, 0x207A75, 0x626572, 0x636820, 0x636865, 0x636874, 0x646173, 0x64656E, 0x646572, 0x646965, 0x652064, 0x652073, 0x65696E, 0x656974, 0x656E20, 0x657220, + 0x657320, 0x67656E, 0x68656E, 0x687420, 0x696368, 0x696520, 0x696E20, 0x696E65, 0x697420, 0x6C6963, 0x6C6C65, 0x6E2061, 0x6E2064, 0x6E2073, 0x6E6420, 0x6E6465, + 0x6E6520, 0x6E6720, 0x6E6765, 0x6E7465, 0x722064, 0x726465, 0x726569, 0x736368, 0x737465, 0x742064, 0x746520, 0x74656E, 0x746572, 0x756E64, 0x756E67, 0x766572, + }; + + public String getLanguage() + { return "de"; } - - public int match(CharsetDetector det) { - return match(det, ngrams, byteMap); - } - } - - static class CharsetRecog_8859_1_en extends CharsetRecog_8859_1 { - private static int[] ngrams = { - 0x206120, 0x20616E, 0x206265, 0x20636F, 0x20666F, 0x206861, 0x206865, 0x20696E, 0x206D61, 0x206F66, 0x207072, 0x207265, 0x207361, 0x207374, 0x207468, 0x20746F, - 0x207768, 0x616964, 0x616C20, 0x616E20, 0x616E64, 0x617320, 0x617420, 0x617465, 0x617469, 0x642061, 0x642074, 0x652061, 0x652073, 0x652074, 0x656420, 0x656E74, - 0x657220, 0x657320, 0x666F72, 0x686174, 0x686520, 0x686572, 0x696420, 0x696E20, 0x696E67, 0x696F6E, 0x697320, 0x6E2061, 0x6E2074, 0x6E6420, 0x6E6720, 0x6E7420, - 0x6F6620, 0x6F6E20, 0x6F7220, 0x726520, 0x727320, 0x732061, 0x732074, 0x736169, 0x737420, 0x742074, 0x746572, 0x746861, 0x746865, 0x74696F, 0x746F20, 0x747320, - }; - - public String getLanguage() { + + public int match(CharsetDetector det) + { + return match(det, ngrams, byteMap); + } + } + + static class CharsetRecog_8859_1_en extends CharsetRecog_8859_1 + { + private static int[] ngrams = { + 0x206120, 0x20616E, 0x206265, 0x20636F, 0x20666F, 0x206861, 0x206865, 0x20696E, 0x206D61, 0x206F66, 0x207072, 0x207265, 0x207361, 0x207374, 0x207468, 0x20746F, + 0x207768, 0x616964, 0x616C20, 0x616E20, 0x616E64, 0x617320, 0x617420, 0x617465, 0x617469, 0x642061, 0x642074, 0x652061, 
0x652073, 0x652074, 0x656420, 0x656E74, + 0x657220, 0x657320, 0x666F72, 0x686174, 0x686520, 0x686572, 0x696420, 0x696E20, 0x696E67, 0x696F6E, 0x697320, 0x6E2061, 0x6E2074, 0x6E6420, 0x6E6720, 0x6E7420, + 0x6F6620, 0x6F6E20, 0x6F7220, 0x726520, 0x727320, 0x732061, 0x732074, 0x736169, 0x737420, 0x742074, 0x746572, 0x746861, 0x746865, 0x74696F, 0x746F20, 0x747320, + }; + + public String getLanguage() + { return "en"; } - - public int match(CharsetDetector det) { - return match(det, ngrams, byteMap); - } - } - - static class CharsetRecog_8859_1_es extends CharsetRecog_8859_1 { - private static int[] ngrams = { - 0x206120, 0x206361, 0x20636F, 0x206465, 0x20656C, 0x20656E, 0x206573, 0x20696E, 0x206C61, 0x206C6F, 0x207061, 0x20706F, 0x207072, 0x207175, 0x207265, 0x207365, - 0x20756E, 0x207920, 0x612063, 0x612064, 0x612065, 0x61206C, 0x612070, 0x616369, 0x61646F, 0x616C20, 0x617220, 0x617320, 0x6369F3, 0x636F6E, 0x646520, 0x64656C, - 0x646F20, 0x652064, 0x652065, 0x65206C, 0x656C20, 0x656E20, 0x656E74, 0x657320, 0x657374, 0x69656E, 0x69F36E, 0x6C6120, 0x6C6F73, 0x6E2065, 0x6E7465, 0x6F2064, - 0x6F2065, 0x6F6E20, 0x6F7220, 0x6F7320, 0x706172, 0x717565, 0x726120, 0x726573, 0x732064, 0x732065, 0x732070, 0x736520, 0x746520, 0x746F20, 0x756520, 0xF36E20, - }; - - public String getLanguage() { + + public int match(CharsetDetector det) + { + return match(det, ngrams, byteMap); + } + } + + static class CharsetRecog_8859_1_es extends CharsetRecog_8859_1 + { + private static int[] ngrams = { + 0x206120, 0x206361, 0x20636F, 0x206465, 0x20656C, 0x20656E, 0x206573, 0x20696E, 0x206C61, 0x206C6F, 0x207061, 0x20706F, 0x207072, 0x207175, 0x207265, 0x207365, + 0x20756E, 0x207920, 0x612063, 0x612064, 0x612065, 0x61206C, 0x612070, 0x616369, 0x61646F, 0x616C20, 0x617220, 0x617320, 0x6369F3, 0x636F6E, 0x646520, 0x64656C, + 0x646F20, 0x652064, 0x652065, 0x65206C, 0x656C20, 0x656E20, 0x656E74, 0x657320, 0x657374, 0x69656E, 0x69F36E, 0x6C6120, 0x6C6F73, 0x6E2065, 0x6E7465, 0x6F2064, + 0x6F2065, 
0x6F6E20, 0x6F7220, 0x6F7320, 0x706172, 0x717565, 0x726120, 0x726573, 0x732064, 0x732065, 0x732070, 0x736520, 0x746520, 0x746F20, 0x756520, 0xF36E20, + }; + + public String getLanguage() + { return "es"; } - - public int match(CharsetDetector det) { - return match(det, ngrams, byteMap); - } - } - - static class CharsetRecog_8859_1_fr extends CharsetRecog_8859_1 { - private static int[] ngrams = { - 0x206175, 0x20636F, 0x206461, 0x206465, 0x206475, 0x20656E, 0x206574, 0x206C61, 0x206C65, 0x207061, 0x20706F, 0x207072, 0x207175, 0x207365, 0x20736F, 0x20756E, - 0x20E020, 0x616E74, 0x617469, 0x636520, 0x636F6E, 0x646520, 0x646573, 0x647520, 0x652061, 0x652063, 0x652064, 0x652065, 0x65206C, 0x652070, 0x652073, 0x656E20, - 0x656E74, 0x657220, 0x657320, 0x657420, 0x657572, 0x696F6E, 0x697320, 0x697420, 0x6C6120, 0x6C6520, 0x6C6573, 0x6D656E, 0x6E2064, 0x6E6520, 0x6E7320, 0x6E7420, - 0x6F6E20, 0x6F6E74, 0x6F7572, 0x717565, 0x72206C, 0x726520, 0x732061, 0x732064, 0x732065, 0x73206C, 0x732070, 0x742064, 0x746520, 0x74696F, 0x756520, 0x757220, - }; - - public String getLanguage() { + + public int match(CharsetDetector det) + { + return match(det, ngrams, byteMap); + } + } + + static class CharsetRecog_8859_1_fr extends CharsetRecog_8859_1 + { + private static int[] ngrams = { + 0x206175, 0x20636F, 0x206461, 0x206465, 0x206475, 0x20656E, 0x206574, 0x206C61, 0x206C65, 0x207061, 0x20706F, 0x207072, 0x207175, 0x207365, 0x20736F, 0x20756E, + 0x20E020, 0x616E74, 0x617469, 0x636520, 0x636F6E, 0x646520, 0x646573, 0x647520, 0x652061, 0x652063, 0x652064, 0x652065, 0x65206C, 0x652070, 0x652073, 0x656E20, + 0x656E74, 0x657220, 0x657320, 0x657420, 0x657572, 0x696F6E, 0x697320, 0x697420, 0x6C6120, 0x6C6520, 0x6C6573, 0x6D656E, 0x6E2064, 0x6E6520, 0x6E7320, 0x6E7420, + 0x6F6E20, 0x6F6E74, 0x6F7572, 0x717565, 0x72206C, 0x726520, 0x732061, 0x732064, 0x732065, 0x73206C, 0x732070, 0x742064, 0x746520, 0x74696F, 0x756520, 0x757220, + }; + + public String getLanguage() + { return "fr"; } - - public 
int match(CharsetDetector det) { - return match(det, ngrams, byteMap); - } - } - - static class CharsetRecog_8859_1_it extends CharsetRecog_8859_1 { - private static int[] ngrams = { - 0x20616C, 0x206368, 0x20636F, 0x206465, 0x206469, 0x206520, 0x20696C, 0x20696E, 0x206C61, 0x207065, 0x207072, 0x20756E, 0x612063, 0x612064, 0x612070, 0x612073, - 0x61746F, 0x636865, 0x636F6E, 0x64656C, 0x646920, 0x652061, 0x652063, 0x652064, 0x652069, 0x65206C, 0x652070, 0x652073, 0x656C20, 0x656C6C, 0x656E74, 0x657220, - 0x686520, 0x692061, 0x692063, 0x692064, 0x692073, 0x696120, 0x696C20, 0x696E20, 0x696F6E, 0x6C6120, 0x6C6520, 0x6C6920, 0x6C6C61, 0x6E6520, 0x6E6920, 0x6E6F20, - 0x6E7465, 0x6F2061, 0x6F2064, 0x6F2069, 0x6F2073, 0x6F6E20, 0x6F6E65, 0x706572, 0x726120, 0x726520, 0x736920, 0x746120, 0x746520, 0x746920, 0x746F20, 0x7A696F, - }; - - public String getLanguage() { + + public int match(CharsetDetector det) + { + return match(det, ngrams, byteMap); + } + } + + static class CharsetRecog_8859_1_it extends CharsetRecog_8859_1 + { + private static int[] ngrams = { + 0x20616C, 0x206368, 0x20636F, 0x206465, 0x206469, 0x206520, 0x20696C, 0x20696E, 0x206C61, 0x207065, 0x207072, 0x20756E, 0x612063, 0x612064, 0x612070, 0x612073, + 0x61746F, 0x636865, 0x636F6E, 0x64656C, 0x646920, 0x652061, 0x652063, 0x652064, 0x652069, 0x65206C, 0x652070, 0x652073, 0x656C20, 0x656C6C, 0x656E74, 0x657220, + 0x686520, 0x692061, 0x692063, 0x692064, 0x692073, 0x696120, 0x696C20, 0x696E20, 0x696F6E, 0x6C6120, 0x6C6520, 0x6C6920, 0x6C6C61, 0x6E6520, 0x6E6920, 0x6E6F20, + 0x6E7465, 0x6F2061, 0x6F2064, 0x6F2069, 0x6F2073, 0x6F6E20, 0x6F6E65, 0x706572, 0x726120, 0x726520, 0x736920, 0x746120, 0x746520, 0x746920, 0x746F20, 0x7A696F, + }; + + public String getLanguage() + { return "it"; } - - public int match(CharsetDetector det) { - return match(det, ngrams, byteMap); - } - } - - static class CharsetRecog_8859_1_nl extends CharsetRecog_8859_1 { - private static int[] ngrams = { - 0x20616C, 0x206265, 0x206461, 
0x206465, 0x206469, 0x206565, 0x20656E, 0x206765, 0x206865, 0x20696E, 0x206D61, 0x206D65, 0x206F70, 0x207465, 0x207661, 0x207665, - 0x20766F, 0x207765, 0x207A69, 0x61616E, 0x616172, 0x616E20, 0x616E64, 0x617220, 0x617420, 0x636874, 0x646520, 0x64656E, 0x646572, 0x652062, 0x652076, 0x65656E, - 0x656572, 0x656E20, 0x657220, 0x657273, 0x657420, 0x67656E, 0x686574, 0x696520, 0x696E20, 0x696E67, 0x697320, 0x6E2062, 0x6E2064, 0x6E2065, 0x6E2068, 0x6E206F, - 0x6E2076, 0x6E6465, 0x6E6720, 0x6F6E64, 0x6F6F72, 0x6F7020, 0x6F7220, 0x736368, 0x737465, 0x742064, 0x746520, 0x74656E, 0x746572, 0x76616E, 0x766572, 0x766F6F, - }; - - public String getLanguage() { + + public int match(CharsetDetector det) + { + return match(det, ngrams, byteMap); + } + } + + static class CharsetRecog_8859_1_nl extends CharsetRecog_8859_1 + { + private static int[] ngrams = { + 0x20616C, 0x206265, 0x206461, 0x206465, 0x206469, 0x206565, 0x20656E, 0x206765, 0x206865, 0x20696E, 0x206D61, 0x206D65, 0x206F70, 0x207465, 0x207661, 0x207665, + 0x20766F, 0x207765, 0x207A69, 0x61616E, 0x616172, 0x616E20, 0x616E64, 0x617220, 0x617420, 0x636874, 0x646520, 0x64656E, 0x646572, 0x652062, 0x652076, 0x65656E, + 0x656572, 0x656E20, 0x657220, 0x657273, 0x657420, 0x67656E, 0x686574, 0x696520, 0x696E20, 0x696E67, 0x697320, 0x6E2062, 0x6E2064, 0x6E2065, 0x6E2068, 0x6E206F, + 0x6E2076, 0x6E6465, 0x6E6720, 0x6F6E64, 0x6F6F72, 0x6F7020, 0x6F7220, 0x736368, 0x737465, 0x742064, 0x746520, 0x74656E, 0x746572, 0x76616E, 0x766572, 0x766F6F, + }; + + public String getLanguage() + { return "nl"; } - - public int match(CharsetDetector det) { - return match(det, ngrams, byteMap); - } - } - - static class CharsetRecog_8859_1_no extends CharsetRecog_8859_1 { - private static int[] ngrams = { - 0x206174, 0x206176, 0x206465, 0x20656E, 0x206572, 0x20666F, 0x206861, 0x206920, 0x206D65, 0x206F67, 0x2070E5, 0x207365, 0x20736B, 0x20736F, 0x207374, 0x207469, - 0x207669, 0x20E520, 0x616E64, 0x617220, 0x617420, 0x646520, 0x64656E, 0x646574, 
0x652073, 0x656420, 0x656E20, 0x656E65, 0x657220, 0x657265, 0x657420, 0x657474, - 0x666F72, 0x67656E, 0x696B6B, 0x696C20, 0x696E67, 0x6B6520, 0x6B6B65, 0x6C6520, 0x6C6C65, 0x6D6564, 0x6D656E, 0x6E2073, 0x6E6520, 0x6E6720, 0x6E6765, 0x6E6E65, - 0x6F6720, 0x6F6D20, 0x6F7220, 0x70E520, 0x722073, 0x726520, 0x736F6D, 0x737465, 0x742073, 0x746520, 0x74656E, 0x746572, 0x74696C, 0x747420, 0x747465, 0x766572, - }; - - public String getLanguage() { + + public int match(CharsetDetector det) + { + return match(det, ngrams, byteMap); + } + } + + static class CharsetRecog_8859_1_no extends CharsetRecog_8859_1 + { + private static int[] ngrams = { + 0x206174, 0x206176, 0x206465, 0x20656E, 0x206572, 0x20666F, 0x206861, 0x206920, 0x206D65, 0x206F67, 0x2070E5, 0x207365, 0x20736B, 0x20736F, 0x207374, 0x207469, + 0x207669, 0x20E520, 0x616E64, 0x617220, 0x617420, 0x646520, 0x64656E, 0x646574, 0x652073, 0x656420, 0x656E20, 0x656E65, 0x657220, 0x657265, 0x657420, 0x657474, + 0x666F72, 0x67656E, 0x696B6B, 0x696C20, 0x696E67, 0x6B6520, 0x6B6B65, 0x6C6520, 0x6C6C65, 0x6D6564, 0x6D656E, 0x6E2073, 0x6E6520, 0x6E6720, 0x6E6765, 0x6E6E65, + 0x6F6720, 0x6F6D20, 0x6F7220, 0x70E520, 0x722073, 0x726520, 0x736F6D, 0x737465, 0x742073, 0x746520, 0x74656E, 0x746572, 0x74696C, 0x747420, 0x747465, 0x766572, + }; + + public String getLanguage() + { return "no"; } - - public int match(CharsetDetector det) { - return match(det, ngrams, byteMap); - } - } - - static class CharsetRecog_8859_1_pt extends CharsetRecog_8859_1 { - private static int[] ngrams = { - 0x206120, 0x20636F, 0x206461, 0x206465, 0x20646F, 0x206520, 0x206573, 0x206D61, 0x206E6F, 0x206F20, 0x207061, 0x20706F, 0x207072, 0x207175, 0x207265, 0x207365, - 0x20756D, 0x612061, 0x612063, 0x612064, 0x612070, 0x616465, 0x61646F, 0x616C20, 0x617220, 0x617261, 0x617320, 0x636F6D, 0x636F6E, 0x646120, 0x646520, 0x646F20, - 0x646F73, 0x652061, 0x652064, 0x656D20, 0x656E74, 0x657320, 0x657374, 0x696120, 0x696361, 0x6D656E, 0x6E7465, 0x6E746F, 0x6F2061, 
0x6F2063, 0x6F2064, 0x6F2065, - 0x6F2070, 0x6F7320, 0x706172, 0x717565, 0x726120, 0x726573, 0x732061, 0x732064, 0x732065, 0x732070, 0x737461, 0x746520, 0x746F20, 0x756520, 0xE36F20, 0xE7E36F, - }; - - public String getLanguage() { + + public int match(CharsetDetector det) + { + return match(det, ngrams, byteMap); + } + } + + static class CharsetRecog_8859_1_pt extends CharsetRecog_8859_1 + { + private static int[] ngrams = { + 0x206120, 0x20636F, 0x206461, 0x206465, 0x20646F, 0x206520, 0x206573, 0x206D61, 0x206E6F, 0x206F20, 0x207061, 0x20706F, 0x207072, 0x207175, 0x207265, 0x207365, + 0x20756D, 0x612061, 0x612063, 0x612064, 0x612070, 0x616465, 0x61646F, 0x616C20, 0x617220, 0x617261, 0x617320, 0x636F6D, 0x636F6E, 0x646120, 0x646520, 0x646F20, + 0x646F73, 0x652061, 0x652064, 0x656D20, 0x656E74, 0x657320, 0x657374, 0x696120, 0x696361, 0x6D656E, 0x6E7465, 0x6E746F, 0x6F2061, 0x6F2063, 0x6F2064, 0x6F2065, + 0x6F2070, 0x6F7320, 0x706172, 0x717565, 0x726120, 0x726573, 0x732061, 0x732064, 0x732065, 0x732070, 0x737461, 0x746520, 0x746F20, 0x756520, 0xE36F20, 0xE7E36F, + }; + + public String getLanguage() + { return "pt"; } - - public int match(CharsetDetector det) { - return match(det, ngrams, byteMap); - } - } - - static class CharsetRecog_8859_1_sv extends CharsetRecog_8859_1 { - private static int[] ngrams = { - 0x206174, 0x206176, 0x206465, 0x20656E, 0x2066F6, 0x206861, 0x206920, 0x20696E, 0x206B6F, 0x206D65, 0x206F63, 0x2070E5, 0x20736B, 0x20736F, 0x207374, 0x207469, - 0x207661, 0x207669, 0x20E472, 0x616465, 0x616E20, 0x616E64, 0x617220, 0x617474, 0x636820, 0x646520, 0x64656E, 0x646572, 0x646574, 0x656420, 0x656E20, 0x657220, - 0x657420, 0x66F672, 0x67656E, 0x696C6C, 0x696E67, 0x6B6120, 0x6C6C20, 0x6D6564, 0x6E2073, 0x6E6120, 0x6E6465, 0x6E6720, 0x6E6765, 0x6E696E, 0x6F6368, 0x6F6D20, - 0x6F6E20, 0x70E520, 0x722061, 0x722073, 0x726120, 0x736B61, 0x736F6D, 0x742073, 0x746120, 0x746520, 0x746572, 0x74696C, 0x747420, 0x766172, 0xE47220, 0xF67220, - }; - - public String 
getLanguage() { + + public int match(CharsetDetector det) + { + return match(det, ngrams, byteMap); + } + } + + static class CharsetRecog_8859_1_sv extends CharsetRecog_8859_1 + { + private static int[] ngrams = { + 0x206174, 0x206176, 0x206465, 0x20656E, 0x2066F6, 0x206861, 0x206920, 0x20696E, 0x206B6F, 0x206D65, 0x206F63, 0x2070E5, 0x20736B, 0x20736F, 0x207374, 0x207469, + 0x207661, 0x207669, 0x20E472, 0x616465, 0x616E20, 0x616E64, 0x617220, 0x617474, 0x636820, 0x646520, 0x64656E, 0x646572, 0x646574, 0x656420, 0x656E20, 0x657220, + 0x657420, 0x66F672, 0x67656E, 0x696C6C, 0x696E67, 0x6B6120, 0x6C6C20, 0x6D6564, 0x6E2073, 0x6E6120, 0x6E6465, 0x6E6720, 0x6E6765, 0x6E696E, 0x6F6368, 0x6F6D20, + 0x6F6E20, 0x70E520, 0x722061, 0x722073, 0x726120, 0x736B61, 0x736F6D, 0x742073, 0x746120, 0x746520, 0x746572, 0x74696C, 0x747420, 0x766172, 0xE47220, 0xF67220, + }; + + public String getLanguage() + { return "sv"; } - - public int match(CharsetDetector det) { - return match(det, ngrams, byteMap); - } - } - - abstract static class CharsetRecog_8859_2 extends CharsetRecog_sbcs { + + public int match(CharsetDetector det) + { + return match(det, ngrams, byteMap); + } + } + + abstract static class CharsetRecog_8859_2 extends CharsetRecog_sbcs + { protected static byte[] byteMap = { - (byte) 0x20, (byte) 0x20, (byte) 0x20, (byte) 0x20, (byte) 0x20, (byte) 0x20, (byte) 0x20, (byte) 0x20, - (byte) 0x20, (byte) 0x20, (byte) 0x20, (byte) 0x20, (byte) 0x20, (byte) 0x20, (byte) 0x20, (byte) 0x20, - (byte) 0x20, (byte) 0x20, (byte) 0x20, (byte) 0x20, (byte) 0x20, (byte) 0x20, (byte) 0x20, (byte) 0x20, - (byte) 0x20, (byte) 0x20, (byte) 0x20, (byte) 0x20, (byte) 0x20, (byte) 0x20, (byte) 0x20, (byte) 0x20, - (byte) 0x20, (byte) 0x20, (byte) 0x20, (byte) 0x20, (byte) 0x20, (byte) 0x20, (byte) 0x20, (byte) 0x00, - (byte) 0x20, (byte) 0x20, (byte) 0x20, (byte) 0x20, (byte) 0x20, (byte) 0x20, (byte) 0x20, (byte) 0x20, - (byte) 0x20, (byte) 0x20, (byte) 0x20, (byte) 0x20, (byte) 0x20, (byte) 
0x20, (byte) 0x20, (byte) 0x20, - (byte) 0x20, (byte) 0x20, (byte) 0x20, (byte) 0x20, (byte) 0x20, (byte) 0x20, (byte) 0x20, (byte) 0x20, - (byte) 0x20, (byte) 0x61, (byte) 0x62, (byte) 0x63, (byte) 0x64, (byte) 0x65, (byte) 0x66, (byte) 0x67, - (byte) 0x68, (byte) 0x69, (byte) 0x6A, (byte) 0x6B, (byte) 0x6C, (byte) 0x6D, (byte) 0x6E, (byte) 0x6F, - (byte) 0x70, (byte) 0x71, (byte) 0x72, (byte) 0x73, (byte) 0x74, (byte) 0x75, (byte) 0x76, (byte) 0x77, - (byte) 0x78, (byte) 0x79, (byte) 0x7A, (byte) 0x20, (byte) 0x20, (byte) 0x20, (byte) 0x20, (byte) 0x20, - (byte) 0x20, (byte) 0x61, (byte) 0x62, (byte) 0x63, (byte) 0x64, (byte) 0x65, (byte) 0x66, (byte) 0x67, - (byte) 0x68, (byte) 0x69, (byte) 0x6A, (byte) 0x6B, (byte) 0x6C, (byte) 0x6D, (byte) 0x6E, (byte) 0x6F, - (byte) 0x70, (byte) 0x71, (byte) 0x72, (byte) 0x73, (byte) 0x74, (byte) 0x75, (byte) 0x76, (byte) 0x77, - (byte) 0x78, (byte) 0x79, (byte) 0x7A, (byte) 0x20, (byte) 0x20, (byte) 0x20, (byte) 0x20, (byte) 0x20, - (byte) 0x20, (byte) 0x20, (byte) 0x20, (byte) 0x20, (byte) 0x20, (byte) 0x20, (byte) 0x20, (byte) 0x20, - (byte) 0x20, (byte) 0x20, (byte) 0x20, (byte) 0x20, (byte) 0x20, (byte) 0x20, (byte) 0x20, (byte) 0x20, - (byte) 0x20, (byte) 0x20, (byte) 0x20, (byte) 0x20, (byte) 0x20, (byte) 0x20, (byte) 0x20, (byte) 0x20, - (byte) 0x20, (byte) 0x20, (byte) 0x20, (byte) 0x20, (byte) 0x20, (byte) 0x20, (byte) 0x20, (byte) 0x20, - (byte) 0x20, (byte) 0xB1, (byte) 0x20, (byte) 0xB3, (byte) 0x20, (byte) 0xB5, (byte) 0xB6, (byte) 0x20, - (byte) 0x20, (byte) 0xB9, (byte) 0xBA, (byte) 0xBB, (byte) 0xBC, (byte) 0x20, (byte) 0xBE, (byte) 0xBF, - (byte) 0x20, (byte) 0xB1, (byte) 0x20, (byte) 0xB3, (byte) 0x20, (byte) 0xB5, (byte) 0xB6, (byte) 0xB7, - (byte) 0x20, (byte) 0xB9, (byte) 0xBA, (byte) 0xBB, (byte) 0xBC, (byte) 0x20, (byte) 0xBE, (byte) 0xBF, - (byte) 0xE0, (byte) 0xE1, (byte) 0xE2, (byte) 0xE3, (byte) 0xE4, (byte) 0xE5, (byte) 0xE6, (byte) 0xE7, - (byte) 0xE8, (byte) 0xE9, (byte) 0xEA, (byte) 0xEB, 
(byte) 0xEC, (byte) 0xED, (byte) 0xEE, (byte) 0xEF, - (byte) 0xF0, (byte) 0xF1, (byte) 0xF2, (byte) 0xF3, (byte) 0xF4, (byte) 0xF5, (byte) 0xF6, (byte) 0x20, - (byte) 0xF8, (byte) 0xF9, (byte) 0xFA, (byte) 0xFB, (byte) 0xFC, (byte) 0xFD, (byte) 0xFE, (byte) 0xDF, - (byte) 0xE0, (byte) 0xE1, (byte) 0xE2, (byte) 0xE3, (byte) 0xE4, (byte) 0xE5, (byte) 0xE6, (byte) 0xE7, - (byte) 0xE8, (byte) 0xE9, (byte) 0xEA, (byte) 0xEB, (byte) 0xEC, (byte) 0xED, (byte) 0xEE, (byte) 0xEF, - (byte) 0xF0, (byte) 0xF1, (byte) 0xF2, (byte) 0xF3, (byte) 0xF4, (byte) 0xF5, (byte) 0xF6, (byte) 0x20, - (byte) 0xF8, (byte) 0xF9, (byte) 0xFA, (byte) 0xFB, (byte) 0xFC, (byte) 0xFD, (byte) 0xFE, (byte) 0x20, - }; - - public String getName() { - return haveC1Bytes ? "windows-1250" : "ISO-8859-2"; - } - } - - static class CharsetRecog_8859_2_cs extends CharsetRecog_8859_2 { - private static int[] ngrams = { - 0x206120, 0x206279, 0x20646F, 0x206A65, 0x206E61, 0x206E65, 0x206F20, 0x206F64, 0x20706F, 0x207072, 0x2070F8, 0x20726F, 0x207365, 0x20736F, 0x207374, 0x20746F, - 0x207620, 0x207679, 0x207A61, 0x612070, 0x636520, 0x636820, 0x652070, 0x652073, 0x652076, 0x656D20, 0x656EED, 0x686F20, 0x686F64, 0x697374, 0x6A6520, 0x6B7465, - 0x6C6520, 0x6C6920, 0x6E6120, 0x6EE920, 0x6EEC20, 0x6EED20, 0x6F2070, 0x6F646E, 0x6F6A69, 0x6F7374, 0x6F7520, 0x6F7661, 0x706F64, 0x706F6A, 0x70726F, 0x70F865, - 0x736520, 0x736F75, 0x737461, 0x737469, 0x73746E, 0x746572, 0x746EED, 0x746F20, 0x752070, 0xBE6520, 0xE16EED, 0xE9686F, 0xED2070, 0xED2073, 0xED6D20, 0xF86564, - }; - - public String getLanguage() { + (byte) 0x20, (byte) 0x20, (byte) 0x20, (byte) 0x20, (byte) 0x20, (byte) 0x20, (byte) 0x20, (byte) 0x20, + (byte) 0x20, (byte) 0x20, (byte) 0x20, (byte) 0x20, (byte) 0x20, (byte) 0x20, (byte) 0x20, (byte) 0x20, + (byte) 0x20, (byte) 0x20, (byte) 0x20, (byte) 0x20, (byte) 0x20, (byte) 0x20, (byte) 0x20, (byte) 0x20, + (byte) 0x20, (byte) 0x20, (byte) 0x20, (byte) 0x20, (byte) 0x20, (byte) 0x20, (byte) 0x20, (byte) 0x20, 
+ (byte) 0x20, (byte) 0x20, (byte) 0x20, (byte) 0x20, (byte) 0x20, (byte) 0x20, (byte) 0x20, (byte) 0x00, + (byte) 0x20, (byte) 0x20, (byte) 0x20, (byte) 0x20, (byte) 0x20, (byte) 0x20, (byte) 0x20, (byte) 0x20, + (byte) 0x20, (byte) 0x20, (byte) 0x20, (byte) 0x20, (byte) 0x20, (byte) 0x20, (byte) 0x20, (byte) 0x20, + (byte) 0x20, (byte) 0x20, (byte) 0x20, (byte) 0x20, (byte) 0x20, (byte) 0x20, (byte) 0x20, (byte) 0x20, + (byte) 0x20, (byte) 0x61, (byte) 0x62, (byte) 0x63, (byte) 0x64, (byte) 0x65, (byte) 0x66, (byte) 0x67, + (byte) 0x68, (byte) 0x69, (byte) 0x6A, (byte) 0x6B, (byte) 0x6C, (byte) 0x6D, (byte) 0x6E, (byte) 0x6F, + (byte) 0x70, (byte) 0x71, (byte) 0x72, (byte) 0x73, (byte) 0x74, (byte) 0x75, (byte) 0x76, (byte) 0x77, + (byte) 0x78, (byte) 0x79, (byte) 0x7A, (byte) 0x20, (byte) 0x20, (byte) 0x20, (byte) 0x20, (byte) 0x20, + (byte) 0x20, (byte) 0x61, (byte) 0x62, (byte) 0x63, (byte) 0x64, (byte) 0x65, (byte) 0x66, (byte) 0x67, + (byte) 0x68, (byte) 0x69, (byte) 0x6A, (byte) 0x6B, (byte) 0x6C, (byte) 0x6D, (byte) 0x6E, (byte) 0x6F, + (byte) 0x70, (byte) 0x71, (byte) 0x72, (byte) 0x73, (byte) 0x74, (byte) 0x75, (byte) 0x76, (byte) 0x77, + (byte) 0x78, (byte) 0x79, (byte) 0x7A, (byte) 0x20, (byte) 0x20, (byte) 0x20, (byte) 0x20, (byte) 0x20, + (byte) 0x20, (byte) 0x20, (byte) 0x20, (byte) 0x20, (byte) 0x20, (byte) 0x20, (byte) 0x20, (byte) 0x20, + (byte) 0x20, (byte) 0x20, (byte) 0x20, (byte) 0x20, (byte) 0x20, (byte) 0x20, (byte) 0x20, (byte) 0x20, + (byte) 0x20, (byte) 0x20, (byte) 0x20, (byte) 0x20, (byte) 0x20, (byte) 0x20, (byte) 0x20, (byte) 0x20, + (byte) 0x20, (byte) 0x20, (byte) 0x20, (byte) 0x20, (byte) 0x20, (byte) 0x20, (byte) 0x20, (byte) 0x20, + (byte) 0x20, (byte) 0xB1, (byte) 0x20, (byte) 0xB3, (byte) 0x20, (byte) 0xB5, (byte) 0xB6, (byte) 0x20, + (byte) 0x20, (byte) 0xB9, (byte) 0xBA, (byte) 0xBB, (byte) 0xBC, (byte) 0x20, (byte) 0xBE, (byte) 0xBF, + (byte) 0x20, (byte) 0xB1, (byte) 0x20, (byte) 0xB3, (byte) 0x20, (byte) 0xB5, (byte) 
0xB6, (byte) 0xB7, + (byte) 0x20, (byte) 0xB9, (byte) 0xBA, (byte) 0xBB, (byte) 0xBC, (byte) 0x20, (byte) 0xBE, (byte) 0xBF, + (byte) 0xE0, (byte) 0xE1, (byte) 0xE2, (byte) 0xE3, (byte) 0xE4, (byte) 0xE5, (byte) 0xE6, (byte) 0xE7, + (byte) 0xE8, (byte) 0xE9, (byte) 0xEA, (byte) 0xEB, (byte) 0xEC, (byte) 0xED, (byte) 0xEE, (byte) 0xEF, + (byte) 0xF0, (byte) 0xF1, (byte) 0xF2, (byte) 0xF3, (byte) 0xF4, (byte) 0xF5, (byte) 0xF6, (byte) 0x20, + (byte) 0xF8, (byte) 0xF9, (byte) 0xFA, (byte) 0xFB, (byte) 0xFC, (byte) 0xFD, (byte) 0xFE, (byte) 0xDF, + (byte) 0xE0, (byte) 0xE1, (byte) 0xE2, (byte) 0xE3, (byte) 0xE4, (byte) 0xE5, (byte) 0xE6, (byte) 0xE7, + (byte) 0xE8, (byte) 0xE9, (byte) 0xEA, (byte) 0xEB, (byte) 0xEC, (byte) 0xED, (byte) 0xEE, (byte) 0xEF, + (byte) 0xF0, (byte) 0xF1, (byte) 0xF2, (byte) 0xF3, (byte) 0xF4, (byte) 0xF5, (byte) 0xF6, (byte) 0x20, + (byte) 0xF8, (byte) 0xF9, (byte) 0xFA, (byte) 0xFB, (byte) 0xFC, (byte) 0xFD, (byte) 0xFE, (byte) 0x20, + }; + + public String getName() + { + return haveC1Bytes? 
"windows-1250" : "ISO-8859-2"; + } + } + + static class CharsetRecog_8859_2_cs extends CharsetRecog_8859_2 + { + private static int[] ngrams = { + 0x206120, 0x206279, 0x20646F, 0x206A65, 0x206E61, 0x206E65, 0x206F20, 0x206F64, 0x20706F, 0x207072, 0x2070F8, 0x20726F, 0x207365, 0x20736F, 0x207374, 0x20746F, + 0x207620, 0x207679, 0x207A61, 0x612070, 0x636520, 0x636820, 0x652070, 0x652073, 0x652076, 0x656D20, 0x656EED, 0x686F20, 0x686F64, 0x697374, 0x6A6520, 0x6B7465, + 0x6C6520, 0x6C6920, 0x6E6120, 0x6EE920, 0x6EEC20, 0x6EED20, 0x6F2070, 0x6F646E, 0x6F6A69, 0x6F7374, 0x6F7520, 0x6F7661, 0x706F64, 0x706F6A, 0x70726F, 0x70F865, + 0x736520, 0x736F75, 0x737461, 0x737469, 0x73746E, 0x746572, 0x746EED, 0x746F20, 0x752070, 0xBE6520, 0xE16EED, 0xE9686F, 0xED2070, 0xED2073, 0xED6D20, 0xF86564, + }; + + public String getLanguage() + { return "cs"; } - - public int match(CharsetDetector det) { - return match(det, ngrams, byteMap); - } - } - - static class CharsetRecog_8859_2_hu extends CharsetRecog_8859_2 { - private static int[] ngrams = { - 0x206120, 0x20617A, 0x206265, 0x206567, 0x20656C, 0x206665, 0x206861, 0x20686F, 0x206973, 0x206B65, 0x206B69, 0x206BF6, 0x206C65, 0x206D61, 0x206D65, 0x206D69, - 0x206E65, 0x20737A, 0x207465, 0x20E973, 0x612061, 0x61206B, 0x61206D, 0x612073, 0x616B20, 0x616E20, 0x617A20, 0x62616E, 0x62656E, 0x656779, 0x656B20, 0x656C20, - 0x656C65, 0x656D20, 0x656E20, 0x657265, 0x657420, 0x657465, 0x657474, 0x677920, 0x686F67, 0x696E74, 0x697320, 0x6B2061, 0x6BF67A, 0x6D6567, 0x6D696E, 0x6E2061, - 0x6E616B, 0x6E656B, 0x6E656D, 0x6E7420, 0x6F6779, 0x732061, 0x737A65, 0x737A74, 0x737AE1, 0x73E967, 0x742061, 0x747420, 0x74E173, 0x7A6572, 0xE16E20, 0xE97320, - }; - - public String getLanguage() { + + public int match(CharsetDetector det) + { + return match(det, ngrams, byteMap); + } + } + + static class CharsetRecog_8859_2_hu extends CharsetRecog_8859_2 + { + private static int[] ngrams = { + 0x206120, 0x20617A, 0x206265, 0x206567, 0x20656C, 0x206665, 0x206861, 
0x20686F, 0x206973, 0x206B65, 0x206B69, 0x206BF6, 0x206C65, 0x206D61, 0x206D65, 0x206D69, + 0x206E65, 0x20737A, 0x207465, 0x20E973, 0x612061, 0x61206B, 0x61206D, 0x612073, 0x616B20, 0x616E20, 0x617A20, 0x62616E, 0x62656E, 0x656779, 0x656B20, 0x656C20, + 0x656C65, 0x656D20, 0x656E20, 0x657265, 0x657420, 0x657465, 0x657474, 0x677920, 0x686F67, 0x696E74, 0x697320, 0x6B2061, 0x6BF67A, 0x6D6567, 0x6D696E, 0x6E2061, + 0x6E616B, 0x6E656B, 0x6E656D, 0x6E7420, 0x6F6779, 0x732061, 0x737A65, 0x737A74, 0x737AE1, 0x73E967, 0x742061, 0x747420, 0x74E173, 0x7A6572, 0xE16E20, 0xE97320, + }; + + public String getLanguage() + { return "hu"; } - - public int match(CharsetDetector det) { - return match(det, ngrams, byteMap); - } - } - - static class CharsetRecog_8859_2_pl extends CharsetRecog_8859_2 { - private static int[] ngrams = { - 0x20637A, 0x20646F, 0x206920, 0x206A65, 0x206B6F, 0x206D61, 0x206D69, 0x206E61, 0x206E69, 0x206F64, 0x20706F, 0x207072, 0x207369, 0x207720, 0x207769, 0x207779, - 0x207A20, 0x207A61, 0x612070, 0x612077, 0x616E69, 0x636820, 0x637A65, 0x637A79, 0x646F20, 0x647A69, 0x652070, 0x652073, 0x652077, 0x65207A, 0x65676F, 0x656A20, - 0x656D20, 0x656E69, 0x676F20, 0x696120, 0x696520, 0x69656A, 0x6B6120, 0x6B6920, 0x6B6965, 0x6D6965, 0x6E6120, 0x6E6961, 0x6E6965, 0x6F2070, 0x6F7761, 0x6F7769, - 0x706F6C, 0x707261, 0x70726F, 0x70727A, 0x727A65, 0x727A79, 0x7369EA, 0x736B69, 0x737461, 0x776965, 0x796368, 0x796D20, 0x7A6520, 0x7A6965, 0x7A7920, 0xF37720, - }; - - public String getLanguage() { + + public int match(CharsetDetector det) + { + return match(det, ngrams, byteMap); + } + } + + static class CharsetRecog_8859_2_pl extends CharsetRecog_8859_2 + { + private static int[] ngrams = { + 0x20637A, 0x20646F, 0x206920, 0x206A65, 0x206B6F, 0x206D61, 0x206D69, 0x206E61, 0x206E69, 0x206F64, 0x20706F, 0x207072, 0x207369, 0x207720, 0x207769, 0x207779, + 0x207A20, 0x207A61, 0x612070, 0x612077, 0x616E69, 0x636820, 0x637A65, 0x637A79, 0x646F20, 0x647A69, 0x652070, 0x652073, 
0x652077, 0x65207A, 0x65676F, 0x656A20, + 0x656D20, 0x656E69, 0x676F20, 0x696120, 0x696520, 0x69656A, 0x6B6120, 0x6B6920, 0x6B6965, 0x6D6965, 0x6E6120, 0x6E6961, 0x6E6965, 0x6F2070, 0x6F7761, 0x6F7769, + 0x706F6C, 0x707261, 0x70726F, 0x70727A, 0x727A65, 0x727A79, 0x7369EA, 0x736B69, 0x737461, 0x776965, 0x796368, 0x796D20, 0x7A6520, 0x7A6965, 0x7A7920, 0xF37720, + }; + + public String getLanguage() + { return "pl"; } - - public int match(CharsetDetector det) { - return match(det, ngrams, byteMap); - } - } - - static class CharsetRecog_8859_2_ro extends CharsetRecog_8859_2 { - private static int[] ngrams = { - 0x206120, 0x206163, 0x206361, 0x206365, 0x20636F, 0x206375, 0x206465, 0x206469, 0x206C61, 0x206D61, 0x207065, 0x207072, 0x207365, 0x2073E3, 0x20756E, 0x20BA69, - 0x20EE6E, 0x612063, 0x612064, 0x617265, 0x617420, 0x617465, 0x617520, 0x636172, 0x636F6E, 0x637520, 0x63E320, 0x646520, 0x652061, 0x652063, 0x652064, 0x652070, - 0x652073, 0x656120, 0x656920, 0x656C65, 0x656E74, 0x657374, 0x692061, 0x692063, 0x692064, 0x692070, 0x696520, 0x696920, 0x696E20, 0x6C6120, 0x6C6520, 0x6C6F72, - 0x6C7569, 0x6E6520, 0x6E7472, 0x6F7220, 0x70656E, 0x726520, 0x726561, 0x727520, 0x73E320, 0x746520, 0x747275, 0x74E320, 0x756920, 0x756C20, 0xBA6920, 0xEE6E20, - }; - - public String getLanguage() { + + public int match(CharsetDetector det) + { + return match(det, ngrams, byteMap); + } + } + + static class CharsetRecog_8859_2_ro extends CharsetRecog_8859_2 + { + private static int[] ngrams = { + 0x206120, 0x206163, 0x206361, 0x206365, 0x20636F, 0x206375, 0x206465, 0x206469, 0x206C61, 0x206D61, 0x207065, 0x207072, 0x207365, 0x2073E3, 0x20756E, 0x20BA69, + 0x20EE6E, 0x612063, 0x612064, 0x617265, 0x617420, 0x617465, 0x617520, 0x636172, 0x636F6E, 0x637520, 0x63E320, 0x646520, 0x652061, 0x652063, 0x652064, 0x652070, + 0x652073, 0x656120, 0x656920, 0x656C65, 0x656E74, 0x657374, 0x692061, 0x692063, 0x692064, 0x692070, 0x696520, 0x696920, 0x696E20, 0x6C6120, 0x6C6520, 0x6C6F72, + 0x6C7569, 
0x6E6520, 0x6E7472, 0x6F7220, 0x70656E, 0x726520, 0x726561, 0x727520, 0x73E320, 0x746520, 0x747275, 0x74E320, 0x756920, 0x756C20, 0xBA6920, 0xEE6E20, + }; + + public String getLanguage() + { return "ro"; } - - public int match(CharsetDetector det) { - return match(det, ngrams, byteMap); - } - } - - abstract static class CharsetRecog_8859_5 extends CharsetRecog_sbcs { + + public int match(CharsetDetector det) + { + return match(det, ngrams, byteMap); + } + } + + abstract static class CharsetRecog_8859_5 extends CharsetRecog_sbcs + { protected static byte[] byteMap = { - (byte) 0x20, (byte) 0x20, (byte) 0x20, (byte) 0x20, (byte) 0x20, (byte) 0x20, (byte) 0x20, (byte) 0x20, - (byte) 0x20, (byte) 0x20, (byte) 0x20, (byte) 0x20, (byte) 0x20, (byte) 0x20, (byte) 0x20, (byte) 0x20, - (byte) 0x20, (byte) 0x20, (byte) 0x20, (byte) 0x20, (byte) 0x20, (byte) 0x20, (byte) 0x20, (byte) 0x20, - (byte) 0x20, (byte) 0x20, (byte) 0x20, (byte) 0x20, (byte) 0x20, (byte) 0x20, (byte) 0x20, (byte) 0x20, - (byte) 0x20, (byte) 0x20, (byte) 0x20, (byte) 0x20, (byte) 0x20, (byte) 0x20, (byte) 0x20, (byte) 0x00, - (byte) 0x20, (byte) 0x20, (byte) 0x20, (byte) 0x20, (byte) 0x20, (byte) 0x20, (byte) 0x20, (byte) 0x20, - (byte) 0x20, (byte) 0x20, (byte) 0x20, (byte) 0x20, (byte) 0x20, (byte) 0x20, (byte) 0x20, (byte) 0x20, - (byte) 0x20, (byte) 0x20, (byte) 0x20, (byte) 0x20, (byte) 0x20, (byte) 0x20, (byte) 0x20, (byte) 0x20, - (byte) 0x20, (byte) 0x61, (byte) 0x62, (byte) 0x63, (byte) 0x64, (byte) 0x65, (byte) 0x66, (byte) 0x67, - (byte) 0x68, (byte) 0x69, (byte) 0x6A, (byte) 0x6B, (byte) 0x6C, (byte) 0x6D, (byte) 0x6E, (byte) 0x6F, - (byte) 0x70, (byte) 0x71, (byte) 0x72, (byte) 0x73, (byte) 0x74, (byte) 0x75, (byte) 0x76, (byte) 0x77, - (byte) 0x78, (byte) 0x79, (byte) 0x7A, (byte) 0x20, (byte) 0x20, (byte) 0x20, (byte) 0x20, (byte) 0x20, - (byte) 0x20, (byte) 0x61, (byte) 0x62, (byte) 0x63, (byte) 0x64, (byte) 0x65, (byte) 0x66, (byte) 0x67, - (byte) 0x68, (byte) 0x69, (byte) 0x6A, (byte) 
0x6B, (byte) 0x6C, (byte) 0x6D, (byte) 0x6E, (byte) 0x6F, - (byte) 0x70, (byte) 0x71, (byte) 0x72, (byte) 0x73, (byte) 0x74, (byte) 0x75, (byte) 0x76, (byte) 0x77, - (byte) 0x78, (byte) 0x79, (byte) 0x7A, (byte) 0x20, (byte) 0x20, (byte) 0x20, (byte) 0x20, (byte) 0x20, - (byte) 0x20, (byte) 0x20, (byte) 0x20, (byte) 0x20, (byte) 0x20, (byte) 0x20, (byte) 0x20, (byte) 0x20, - (byte) 0x20, (byte) 0x20, (byte) 0x20, (byte) 0x20, (byte) 0x20, (byte) 0x20, (byte) 0x20, (byte) 0x20, - (byte) 0x20, (byte) 0x20, (byte) 0x20, (byte) 0x20, (byte) 0x20, (byte) 0x20, (byte) 0x20, (byte) 0x20, - (byte) 0x20, (byte) 0x20, (byte) 0x20, (byte) 0x20, (byte) 0x20, (byte) 0x20, (byte) 0x20, (byte) 0x20, - (byte) 0x20, (byte) 0xF1, (byte) 0xF2, (byte) 0xF3, (byte) 0xF4, (byte) 0xF5, (byte) 0xF6, (byte) 0xF7, - (byte) 0xF8, (byte) 0xF9, (byte) 0xFA, (byte) 0xFB, (byte) 0xFC, (byte) 0x20, (byte) 0xFE, (byte) 0xFF, - (byte) 0xD0, (byte) 0xD1, (byte) 0xD2, (byte) 0xD3, (byte) 0xD4, (byte) 0xD5, (byte) 0xD6, (byte) 0xD7, - (byte) 0xD8, (byte) 0xD9, (byte) 0xDA, (byte) 0xDB, (byte) 0xDC, (byte) 0xDD, (byte) 0xDE, (byte) 0xDF, - (byte) 0xE0, (byte) 0xE1, (byte) 0xE2, (byte) 0xE3, (byte) 0xE4, (byte) 0xE5, (byte) 0xE6, (byte) 0xE7, - (byte) 0xE8, (byte) 0xE9, (byte) 0xEA, (byte) 0xEB, (byte) 0xEC, (byte) 0xED, (byte) 0xEE, (byte) 0xEF, - (byte) 0xD0, (byte) 0xD1, (byte) 0xD2, (byte) 0xD3, (byte) 0xD4, (byte) 0xD5, (byte) 0xD6, (byte) 0xD7, - (byte) 0xD8, (byte) 0xD9, (byte) 0xDA, (byte) 0xDB, (byte) 0xDC, (byte) 0xDD, (byte) 0xDE, (byte) 0xDF, - (byte) 0xE0, (byte) 0xE1, (byte) 0xE2, (byte) 0xE3, (byte) 0xE4, (byte) 0xE5, (byte) 0xE6, (byte) 0xE7, - (byte) 0xE8, (byte) 0xE9, (byte) 0xEA, (byte) 0xEB, (byte) 0xEC, (byte) 0xED, (byte) 0xEE, (byte) 0xEF, - (byte) 0x20, (byte) 0xF1, (byte) 0xF2, (byte) 0xF3, (byte) 0xF4, (byte) 0xF5, (byte) 0xF6, (byte) 0xF7, - (byte) 0xF8, (byte) 0xF9, (byte) 0xFA, (byte) 0xFB, (byte) 0xFC, (byte) 0x20, (byte) 0xFE, (byte) 0xFF, - }; - - public String getName() 
{ + (byte) 0x20, (byte) 0x20, (byte) 0x20, (byte) 0x20, (byte) 0x20, (byte) 0x20, (byte) 0x20, (byte) 0x20, + (byte) 0x20, (byte) 0x20, (byte) 0x20, (byte) 0x20, (byte) 0x20, (byte) 0x20, (byte) 0x20, (byte) 0x20, + (byte) 0x20, (byte) 0x20, (byte) 0x20, (byte) 0x20, (byte) 0x20, (byte) 0x20, (byte) 0x20, (byte) 0x20, + (byte) 0x20, (byte) 0x20, (byte) 0x20, (byte) 0x20, (byte) 0x20, (byte) 0x20, (byte) 0x20, (byte) 0x20, + (byte) 0x20, (byte) 0x20, (byte) 0x20, (byte) 0x20, (byte) 0x20, (byte) 0x20, (byte) 0x20, (byte) 0x00, + (byte) 0x20, (byte) 0x20, (byte) 0x20, (byte) 0x20, (byte) 0x20, (byte) 0x20, (byte) 0x20, (byte) 0x20, + (byte) 0x20, (byte) 0x20, (byte) 0x20, (byte) 0x20, (byte) 0x20, (byte) 0x20, (byte) 0x20, (byte) 0x20, + (byte) 0x20, (byte) 0x20, (byte) 0x20, (byte) 0x20, (byte) 0x20, (byte) 0x20, (byte) 0x20, (byte) 0x20, + (byte) 0x20, (byte) 0x61, (byte) 0x62, (byte) 0x63, (byte) 0x64, (byte) 0x65, (byte) 0x66, (byte) 0x67, + (byte) 0x68, (byte) 0x69, (byte) 0x6A, (byte) 0x6B, (byte) 0x6C, (byte) 0x6D, (byte) 0x6E, (byte) 0x6F, + (byte) 0x70, (byte) 0x71, (byte) 0x72, (byte) 0x73, (byte) 0x74, (byte) 0x75, (byte) 0x76, (byte) 0x77, + (byte) 0x78, (byte) 0x79, (byte) 0x7A, (byte) 0x20, (byte) 0x20, (byte) 0x20, (byte) 0x20, (byte) 0x20, + (byte) 0x20, (byte) 0x61, (byte) 0x62, (byte) 0x63, (byte) 0x64, (byte) 0x65, (byte) 0x66, (byte) 0x67, + (byte) 0x68, (byte) 0x69, (byte) 0x6A, (byte) 0x6B, (byte) 0x6C, (byte) 0x6D, (byte) 0x6E, (byte) 0x6F, + (byte) 0x70, (byte) 0x71, (byte) 0x72, (byte) 0x73, (byte) 0x74, (byte) 0x75, (byte) 0x76, (byte) 0x77, + (byte) 0x78, (byte) 0x79, (byte) 0x7A, (byte) 0x20, (byte) 0x20, (byte) 0x20, (byte) 0x20, (byte) 0x20, + (byte) 0x20, (byte) 0x20, (byte) 0x20, (byte) 0x20, (byte) 0x20, (byte) 0x20, (byte) 0x20, (byte) 0x20, + (byte) 0x20, (byte) 0x20, (byte) 0x20, (byte) 0x20, (byte) 0x20, (byte) 0x20, (byte) 0x20, (byte) 0x20, + (byte) 0x20, (byte) 0x20, (byte) 0x20, (byte) 0x20, (byte) 0x20, (byte) 0x20, (byte) 
0x20, (byte) 0x20, + (byte) 0x20, (byte) 0x20, (byte) 0x20, (byte) 0x20, (byte) 0x20, (byte) 0x20, (byte) 0x20, (byte) 0x20, + (byte) 0x20, (byte) 0xF1, (byte) 0xF2, (byte) 0xF3, (byte) 0xF4, (byte) 0xF5, (byte) 0xF6, (byte) 0xF7, + (byte) 0xF8, (byte) 0xF9, (byte) 0xFA, (byte) 0xFB, (byte) 0xFC, (byte) 0x20, (byte) 0xFE, (byte) 0xFF, + (byte) 0xD0, (byte) 0xD1, (byte) 0xD2, (byte) 0xD3, (byte) 0xD4, (byte) 0xD5, (byte) 0xD6, (byte) 0xD7, + (byte) 0xD8, (byte) 0xD9, (byte) 0xDA, (byte) 0xDB, (byte) 0xDC, (byte) 0xDD, (byte) 0xDE, (byte) 0xDF, + (byte) 0xE0, (byte) 0xE1, (byte) 0xE2, (byte) 0xE3, (byte) 0xE4, (byte) 0xE5, (byte) 0xE6, (byte) 0xE7, + (byte) 0xE8, (byte) 0xE9, (byte) 0xEA, (byte) 0xEB, (byte) 0xEC, (byte) 0xED, (byte) 0xEE, (byte) 0xEF, + (byte) 0xD0, (byte) 0xD1, (byte) 0xD2, (byte) 0xD3, (byte) 0xD4, (byte) 0xD5, (byte) 0xD6, (byte) 0xD7, + (byte) 0xD8, (byte) 0xD9, (byte) 0xDA, (byte) 0xDB, (byte) 0xDC, (byte) 0xDD, (byte) 0xDE, (byte) 0xDF, + (byte) 0xE0, (byte) 0xE1, (byte) 0xE2, (byte) 0xE3, (byte) 0xE4, (byte) 0xE5, (byte) 0xE6, (byte) 0xE7, + (byte) 0xE8, (byte) 0xE9, (byte) 0xEA, (byte) 0xEB, (byte) 0xEC, (byte) 0xED, (byte) 0xEE, (byte) 0xEF, + (byte) 0x20, (byte) 0xF1, (byte) 0xF2, (byte) 0xF3, (byte) 0xF4, (byte) 0xF5, (byte) 0xF6, (byte) 0xF7, + (byte) 0xF8, (byte) 0xF9, (byte) 0xFA, (byte) 0xFB, (byte) 0xFC, (byte) 0x20, (byte) 0xFE, (byte) 0xFF, + }; + + public String getName() + { return "ISO-8859-5"; } } - - static class CharsetRecog_8859_5_ru extends CharsetRecog_8859_5 { - private static int[] ngrams = { - 0x20D220, 0x20D2DE, 0x20D4DE, 0x20D7D0, 0x20D820, 0x20DAD0, 0x20DADE, 0x20DDD0, 0x20DDD5, 0x20DED1, 0x20DFDE, 0x20DFE0, 0x20E0D0, 0x20E1DE, 0x20E1E2, 0x20E2DE, - 0x20E7E2, 0x20EDE2, 0xD0DDD8, 0xD0E2EC, 0xD3DE20, 0xD5DBEC, 0xD5DDD8, 0xD5E1E2, 0xD5E220, 0xD820DF, 0xD8D520, 0xD8D820, 0xD8EF20, 0xDBD5DD, 0xDBD820, 0xDBECDD, - 0xDDD020, 0xDDD520, 0xDDD8D5, 0xDDD8EF, 0xDDDE20, 0xDDDED2, 0xDE20D2, 0xDE20DF, 0xDE20E1, 0xDED220, 0xDED2D0, 
0xDED3DE, 0xDED920, 0xDEDBEC, 0xDEDC20, 0xDEE1E2, - 0xDFDEDB, 0xDFE0D5, 0xDFE0D8, 0xDFE0DE, 0xE0D0D2, 0xE0D5D4, 0xE1E2D0, 0xE1E2D2, 0xE1E2D8, 0xE1EF20, 0xE2D5DB, 0xE2DE20, 0xE2DEE0, 0xE2EC20, 0xE7E2DE, 0xEBE520, - }; - - public String getLanguage() { + + static class CharsetRecog_8859_5_ru extends CharsetRecog_8859_5 + { + private static int[] ngrams = { + 0x20D220, 0x20D2DE, 0x20D4DE, 0x20D7D0, 0x20D820, 0x20DAD0, 0x20DADE, 0x20DDD0, 0x20DDD5, 0x20DED1, 0x20DFDE, 0x20DFE0, 0x20E0D0, 0x20E1DE, 0x20E1E2, 0x20E2DE, + 0x20E7E2, 0x20EDE2, 0xD0DDD8, 0xD0E2EC, 0xD3DE20, 0xD5DBEC, 0xD5DDD8, 0xD5E1E2, 0xD5E220, 0xD820DF, 0xD8D520, 0xD8D820, 0xD8EF20, 0xDBD5DD, 0xDBD820, 0xDBECDD, + 0xDDD020, 0xDDD520, 0xDDD8D5, 0xDDD8EF, 0xDDDE20, 0xDDDED2, 0xDE20D2, 0xDE20DF, 0xDE20E1, 0xDED220, 0xDED2D0, 0xDED3DE, 0xDED920, 0xDEDBEC, 0xDEDC20, 0xDEE1E2, + 0xDFDEDB, 0xDFE0D5, 0xDFE0D8, 0xDFE0DE, 0xE0D0D2, 0xE0D5D4, 0xE1E2D0, 0xE1E2D2, 0xE1E2D8, 0xE1EF20, 0xE2D5DB, 0xE2DE20, 0xE2DEE0, 0xE2EC20, 0xE7E2DE, 0xEBE520, + }; + + public String getLanguage() + { return "ru"; } - - public int match(CharsetDetector det) { - return match(det, ngrams, byteMap); - } - } - - abstract static class CharsetRecog_8859_6 extends CharsetRecog_sbcs { + + public int match(CharsetDetector det) + { + return match(det, ngrams, byteMap); + } + } + + abstract static class CharsetRecog_8859_6 extends CharsetRecog_sbcs + { protected static byte[] byteMap = { - (byte) 0x20, (byte) 0x20, (byte) 0x20, (byte) 0x20, (byte) 0x20, (byte) 0x20, (byte) 0x20, (byte) 0x20, - (byte) 0x20, (byte) 0x20, (byte) 0x20, (byte) 0x20, (byte) 0x20, (byte) 0x20, (byte) 0x20, (byte) 0x20, - (byte) 0x20, (byte) 0x20, (byte) 0x20, (byte) 0x20, (byte) 0x20, (byte) 0x20, (byte) 0x20, (byte) 0x20, - (byte) 0x20, (byte) 0x20, (byte) 0x20, (byte) 0x20, (byte) 0x20, (byte) 0x20, (byte) 0x20, (byte) 0x20, - (byte) 0x20, (byte) 0x20, (byte) 0x20, (byte) 0x20, (byte) 0x20, (byte) 0x20, (byte) 0x20, (byte) 0x00, - (byte) 0x20, (byte) 0x20, (byte) 0x20, 
(byte) 0x20, (byte) 0x20, (byte) 0x20, (byte) 0x20, (byte) 0x20, - (byte) 0x20, (byte) 0x20, (byte) 0x20, (byte) 0x20, (byte) 0x20, (byte) 0x20, (byte) 0x20, (byte) 0x20, - (byte) 0x20, (byte) 0x20, (byte) 0x20, (byte) 0x20, (byte) 0x20, (byte) 0x20, (byte) 0x20, (byte) 0x20, - (byte) 0x20, (byte) 0x61, (byte) 0x62, (byte) 0x63, (byte) 0x64, (byte) 0x65, (byte) 0x66, (byte) 0x67, - (byte) 0x68, (byte) 0x69, (byte) 0x6A, (byte) 0x6B, (byte) 0x6C, (byte) 0x6D, (byte) 0x6E, (byte) 0x6F, - (byte) 0x70, (byte) 0x71, (byte) 0x72, (byte) 0x73, (byte) 0x74, (byte) 0x75, (byte) 0x76, (byte) 0x77, - (byte) 0x78, (byte) 0x79, (byte) 0x7A, (byte) 0x20, (byte) 0x20, (byte) 0x20, (byte) 0x20, (byte) 0x20, - (byte) 0x20, (byte) 0x61, (byte) 0x62, (byte) 0x63, (byte) 0x64, (byte) 0x65, (byte) 0x66, (byte) 0x67, - (byte) 0x68, (byte) 0x69, (byte) 0x6A, (byte) 0x6B, (byte) 0x6C, (byte) 0x6D, (byte) 0x6E, (byte) 0x6F, - (byte) 0x70, (byte) 0x71, (byte) 0x72, (byte) 0x73, (byte) 0x74, (byte) 0x75, (byte) 0x76, (byte) 0x77, - (byte) 0x78, (byte) 0x79, (byte) 0x7A, (byte) 0x20, (byte) 0x20, (byte) 0x20, (byte) 0x20, (byte) 0x20, - (byte) 0x20, (byte) 0x20, (byte) 0x20, (byte) 0x20, (byte) 0x20, (byte) 0x20, (byte) 0x20, (byte) 0x20, - (byte) 0x20, (byte) 0x20, (byte) 0x20, (byte) 0x20, (byte) 0x20, (byte) 0x20, (byte) 0x20, (byte) 0x20, - (byte) 0x20, (byte) 0x20, (byte) 0x20, (byte) 0x20, (byte) 0x20, (byte) 0x20, (byte) 0x20, (byte) 0x20, - (byte) 0x20, (byte) 0x20, (byte) 0x20, (byte) 0x20, (byte) 0x20, (byte) 0x20, (byte) 0x20, (byte) 0x20, - (byte) 0x20, (byte) 0x20, (byte) 0x20, (byte) 0x20, (byte) 0x20, (byte) 0x20, (byte) 0x20, (byte) 0x20, - (byte) 0x20, (byte) 0x20, (byte) 0x20, (byte) 0x20, (byte) 0x20, (byte) 0x20, (byte) 0x20, (byte) 0x20, - (byte) 0x20, (byte) 0x20, (byte) 0x20, (byte) 0x20, (byte) 0x20, (byte) 0x20, (byte) 0x20, (byte) 0x20, - (byte) 0x20, (byte) 0x20, (byte) 0x20, (byte) 0x20, (byte) 0x20, (byte) 0x20, (byte) 0x20, (byte) 0x20, - (byte) 0x20, (byte) 
0xC1, (byte) 0xC2, (byte) 0xC3, (byte) 0xC4, (byte) 0xC5, (byte) 0xC6, (byte) 0xC7, - (byte) 0xC8, (byte) 0xC9, (byte) 0xCA, (byte) 0xCB, (byte) 0xCC, (byte) 0xCD, (byte) 0xCE, (byte) 0xCF, - (byte) 0xD0, (byte) 0xD1, (byte) 0xD2, (byte) 0xD3, (byte) 0xD4, (byte) 0xD5, (byte) 0xD6, (byte) 0xD7, - (byte) 0xD8, (byte) 0xD9, (byte) 0xDA, (byte) 0x20, (byte) 0x20, (byte) 0x20, (byte) 0x20, (byte) 0x20, - (byte) 0xE0, (byte) 0xE1, (byte) 0xE2, (byte) 0xE3, (byte) 0xE4, (byte) 0xE5, (byte) 0xE6, (byte) 0xE7, - (byte) 0xE8, (byte) 0xE9, (byte) 0xEA, (byte) 0x20, (byte) 0x20, (byte) 0x20, (byte) 0x20, (byte) 0x20, - (byte) 0x20, (byte) 0x20, (byte) 0x20, (byte) 0x20, (byte) 0x20, (byte) 0x20, (byte) 0x20, (byte) 0x20, - (byte) 0x20, (byte) 0x20, (byte) 0x20, (byte) 0x20, (byte) 0x20, (byte) 0x20, (byte) 0x20, (byte) 0x20, - }; - - public String getName() { + (byte) 0x20, (byte) 0x20, (byte) 0x20, (byte) 0x20, (byte) 0x20, (byte) 0x20, (byte) 0x20, (byte) 0x20, + (byte) 0x20, (byte) 0x20, (byte) 0x20, (byte) 0x20, (byte) 0x20, (byte) 0x20, (byte) 0x20, (byte) 0x20, + (byte) 0x20, (byte) 0x20, (byte) 0x20, (byte) 0x20, (byte) 0x20, (byte) 0x20, (byte) 0x20, (byte) 0x20, + (byte) 0x20, (byte) 0x20, (byte) 0x20, (byte) 0x20, (byte) 0x20, (byte) 0x20, (byte) 0x20, (byte) 0x20, + (byte) 0x20, (byte) 0x20, (byte) 0x20, (byte) 0x20, (byte) 0x20, (byte) 0x20, (byte) 0x20, (byte) 0x00, + (byte) 0x20, (byte) 0x20, (byte) 0x20, (byte) 0x20, (byte) 0x20, (byte) 0x20, (byte) 0x20, (byte) 0x20, + (byte) 0x20, (byte) 0x20, (byte) 0x20, (byte) 0x20, (byte) 0x20, (byte) 0x20, (byte) 0x20, (byte) 0x20, + (byte) 0x20, (byte) 0x20, (byte) 0x20, (byte) 0x20, (byte) 0x20, (byte) 0x20, (byte) 0x20, (byte) 0x20, + (byte) 0x20, (byte) 0x61, (byte) 0x62, (byte) 0x63, (byte) 0x64, (byte) 0x65, (byte) 0x66, (byte) 0x67, + (byte) 0x68, (byte) 0x69, (byte) 0x6A, (byte) 0x6B, (byte) 0x6C, (byte) 0x6D, (byte) 0x6E, (byte) 0x6F, + (byte) 0x70, (byte) 0x71, (byte) 0x72, (byte) 0x73, (byte) 0x74, (byte) 
0x75, (byte) 0x76, (byte) 0x77, + (byte) 0x78, (byte) 0x79, (byte) 0x7A, (byte) 0x20, (byte) 0x20, (byte) 0x20, (byte) 0x20, (byte) 0x20, + (byte) 0x20, (byte) 0x61, (byte) 0x62, (byte) 0x63, (byte) 0x64, (byte) 0x65, (byte) 0x66, (byte) 0x67, + (byte) 0x68, (byte) 0x69, (byte) 0x6A, (byte) 0x6B, (byte) 0x6C, (byte) 0x6D, (byte) 0x6E, (byte) 0x6F, + (byte) 0x70, (byte) 0x71, (byte) 0x72, (byte) 0x73, (byte) 0x74, (byte) 0x75, (byte) 0x76, (byte) 0x77, + (byte) 0x78, (byte) 0x79, (byte) 0x7A, (byte) 0x20, (byte) 0x20, (byte) 0x20, (byte) 0x20, (byte) 0x20, + (byte) 0x20, (byte) 0x20, (byte) 0x20, (byte) 0x20, (byte) 0x20, (byte) 0x20, (byte) 0x20, (byte) 0x20, + (byte) 0x20, (byte) 0x20, (byte) 0x20, (byte) 0x20, (byte) 0x20, (byte) 0x20, (byte) 0x20, (byte) 0x20, + (byte) 0x20, (byte) 0x20, (byte) 0x20, (byte) 0x20, (byte) 0x20, (byte) 0x20, (byte) 0x20, (byte) 0x20, + (byte) 0x20, (byte) 0x20, (byte) 0x20, (byte) 0x20, (byte) 0x20, (byte) 0x20, (byte) 0x20, (byte) 0x20, + (byte) 0x20, (byte) 0x20, (byte) 0x20, (byte) 0x20, (byte) 0x20, (byte) 0x20, (byte) 0x20, (byte) 0x20, + (byte) 0x20, (byte) 0x20, (byte) 0x20, (byte) 0x20, (byte) 0x20, (byte) 0x20, (byte) 0x20, (byte) 0x20, + (byte) 0x20, (byte) 0x20, (byte) 0x20, (byte) 0x20, (byte) 0x20, (byte) 0x20, (byte) 0x20, (byte) 0x20, + (byte) 0x20, (byte) 0x20, (byte) 0x20, (byte) 0x20, (byte) 0x20, (byte) 0x20, (byte) 0x20, (byte) 0x20, + (byte) 0x20, (byte) 0xC1, (byte) 0xC2, (byte) 0xC3, (byte) 0xC4, (byte) 0xC5, (byte) 0xC6, (byte) 0xC7, + (byte) 0xC8, (byte) 0xC9, (byte) 0xCA, (byte) 0xCB, (byte) 0xCC, (byte) 0xCD, (byte) 0xCE, (byte) 0xCF, + (byte) 0xD0, (byte) 0xD1, (byte) 0xD2, (byte) 0xD3, (byte) 0xD4, (byte) 0xD5, (byte) 0xD6, (byte) 0xD7, + (byte) 0xD8, (byte) 0xD9, (byte) 0xDA, (byte) 0x20, (byte) 0x20, (byte) 0x20, (byte) 0x20, (byte) 0x20, + (byte) 0xE0, (byte) 0xE1, (byte) 0xE2, (byte) 0xE3, (byte) 0xE4, (byte) 0xE5, (byte) 0xE6, (byte) 0xE7, + (byte) 0xE8, (byte) 0xE9, (byte) 0xEA, (byte) 0x20, 
(byte) 0x20, (byte) 0x20, (byte) 0x20, (byte) 0x20, + (byte) 0x20, (byte) 0x20, (byte) 0x20, (byte) 0x20, (byte) 0x20, (byte) 0x20, (byte) 0x20, (byte) 0x20, + (byte) 0x20, (byte) 0x20, (byte) 0x20, (byte) 0x20, (byte) 0x20, (byte) 0x20, (byte) 0x20, (byte) 0x20, + }; + + public String getName() + { return "ISO-8859-6"; } } - - static class CharsetRecog_8859_6_ar extends CharsetRecog_8859_6 { - private static int[] ngrams = { - 0x20C7E4, 0x20C7E6, 0x20C8C7, 0x20D9E4, 0x20E1EA, 0x20E4E4, 0x20E5E6, 0x20E8C7, 0xC720C7, 0xC7C120, 0xC7CA20, 0xC7D120, 0xC7E420, 0xC7E4C3, 0xC7E4C7, 0xC7E4C8, - 0xC7E4CA, 0xC7E4CC, 0xC7E4CD, 0xC7E4CF, 0xC7E4D3, 0xC7E4D9, 0xC7E4E2, 0xC7E4E5, 0xC7E4E8, 0xC7E4EA, 0xC7E520, 0xC7E620, 0xC7E6CA, 0xC820C7, 0xC920C7, 0xC920E1, - 0xC920E4, 0xC920E5, 0xC920E8, 0xCA20C7, 0xCF20C7, 0xCFC920, 0xD120C7, 0xD1C920, 0xD320C7, 0xD920C7, 0xD9E4E9, 0xE1EA20, 0xE420C7, 0xE4C920, 0xE4E920, 0xE4EA20, - 0xE520C7, 0xE5C720, 0xE5C920, 0xE5E620, 0xE620C7, 0xE720C7, 0xE7C720, 0xE8C7E4, 0xE8E620, 0xE920C7, 0xEA20C7, 0xEA20E5, 0xEA20E8, 0xEAC920, 0xEAD120, 0xEAE620, - }; - - public String getLanguage() { + + static class CharsetRecog_8859_6_ar extends CharsetRecog_8859_6 + { + private static int[] ngrams = { + 0x20C7E4, 0x20C7E6, 0x20C8C7, 0x20D9E4, 0x20E1EA, 0x20E4E4, 0x20E5E6, 0x20E8C7, 0xC720C7, 0xC7C120, 0xC7CA20, 0xC7D120, 0xC7E420, 0xC7E4C3, 0xC7E4C7, 0xC7E4C8, + 0xC7E4CA, 0xC7E4CC, 0xC7E4CD, 0xC7E4CF, 0xC7E4D3, 0xC7E4D9, 0xC7E4E2, 0xC7E4E5, 0xC7E4E8, 0xC7E4EA, 0xC7E520, 0xC7E620, 0xC7E6CA, 0xC820C7, 0xC920C7, 0xC920E1, + 0xC920E4, 0xC920E5, 0xC920E8, 0xCA20C7, 0xCF20C7, 0xCFC920, 0xD120C7, 0xD1C920, 0xD320C7, 0xD920C7, 0xD9E4E9, 0xE1EA20, 0xE420C7, 0xE4C920, 0xE4E920, 0xE4EA20, + 0xE520C7, 0xE5C720, 0xE5C920, 0xE5E620, 0xE620C7, 0xE720C7, 0xE7C720, 0xE8C7E4, 0xE8E620, 0xE920C7, 0xEA20C7, 0xEA20E5, 0xEA20E8, 0xEAC920, 0xEAD120, 0xEAE620, + }; + + public String getLanguage() + { return "ar"; } - - public int match(CharsetDetector det) { - return match(det, ngrams, 
byteMap); - } - } - - abstract static class CharsetRecog_8859_7 extends CharsetRecog_sbcs { + + public int match(CharsetDetector det) + { + return match(det, ngrams, byteMap); + } + } + + abstract static class CharsetRecog_8859_7 extends CharsetRecog_sbcs + { protected static byte[] byteMap = { - (byte) 0x20, (byte) 0x20, (byte) 0x20, (byte) 0x20, (byte) 0x20, (byte) 0x20, (byte) 0x20, (byte) 0x20, - (byte) 0x20, (byte) 0x20, (byte) 0x20, (byte) 0x20, (byte) 0x20, (byte) 0x20, (byte) 0x20, (byte) 0x20, - (byte) 0x20, (byte) 0x20, (byte) 0x20, (byte) 0x20, (byte) 0x20, (byte) 0x20, (byte) 0x20, (byte) 0x20, - (byte) 0x20, (byte) 0x20, (byte) 0x20, (byte) 0x20, (byte) 0x20, (byte) 0x20, (byte) 0x20, (byte) 0x20, - (byte) 0x20, (byte) 0x20, (byte) 0x20, (byte) 0x20, (byte) 0x20, (byte) 0x20, (byte) 0x20, (byte) 0x00, - (byte) 0x20, (byte) 0x20, (byte) 0x20, (byte) 0x20, (byte) 0x20, (byte) 0x20, (byte) 0x20, (byte) 0x20, - (byte) 0x20, (byte) 0x20, (byte) 0x20, (byte) 0x20, (byte) 0x20, (byte) 0x20, (byte) 0x20, (byte) 0x20, - (byte) 0x20, (byte) 0x20, (byte) 0x20, (byte) 0x20, (byte) 0x20, (byte) 0x20, (byte) 0x20, (byte) 0x20, - (byte) 0x20, (byte) 0x61, (byte) 0x62, (byte) 0x63, (byte) 0x64, (byte) 0x65, (byte) 0x66, (byte) 0x67, - (byte) 0x68, (byte) 0x69, (byte) 0x6A, (byte) 0x6B, (byte) 0x6C, (byte) 0x6D, (byte) 0x6E, (byte) 0x6F, - (byte) 0x70, (byte) 0x71, (byte) 0x72, (byte) 0x73, (byte) 0x74, (byte) 0x75, (byte) 0x76, (byte) 0x77, - (byte) 0x78, (byte) 0x79, (byte) 0x7A, (byte) 0x20, (byte) 0x20, (byte) 0x20, (byte) 0x20, (byte) 0x20, - (byte) 0x20, (byte) 0x61, (byte) 0x62, (byte) 0x63, (byte) 0x64, (byte) 0x65, (byte) 0x66, (byte) 0x67, - (byte) 0x68, (byte) 0x69, (byte) 0x6A, (byte) 0x6B, (byte) 0x6C, (byte) 0x6D, (byte) 0x6E, (byte) 0x6F, - (byte) 0x70, (byte) 0x71, (byte) 0x72, (byte) 0x73, (byte) 0x74, (byte) 0x75, (byte) 0x76, (byte) 0x77, - (byte) 0x78, (byte) 0x79, (byte) 0x7A, (byte) 0x20, (byte) 0x20, (byte) 0x20, (byte) 0x20, (byte) 0x20, - 
(byte) 0x20, (byte) 0x20, (byte) 0x20, (byte) 0x20, (byte) 0x20, (byte) 0x20, (byte) 0x20, (byte) 0x20, - (byte) 0x20, (byte) 0x20, (byte) 0x20, (byte) 0x20, (byte) 0x20, (byte) 0x20, (byte) 0x20, (byte) 0x20, - (byte) 0x20, (byte) 0x20, (byte) 0x20, (byte) 0x20, (byte) 0x20, (byte) 0x20, (byte) 0x20, (byte) 0x20, - (byte) 0x20, (byte) 0x20, (byte) 0x20, (byte) 0x20, (byte) 0x20, (byte) 0x20, (byte) 0x20, (byte) 0x20, - (byte) 0x20, (byte) 0xA1, (byte) 0xA2, (byte) 0x20, (byte) 0x20, (byte) 0x20, (byte) 0x20, (byte) 0x20, - (byte) 0x20, (byte) 0x20, (byte) 0x20, (byte) 0x20, (byte) 0x20, (byte) 0x20, (byte) 0x20, (byte) 0x20, - (byte) 0x20, (byte) 0x20, (byte) 0x20, (byte) 0x20, (byte) 0x20, (byte) 0x20, (byte) 0xDC, (byte) 0x20, - (byte) 0xDD, (byte) 0xDE, (byte) 0xDF, (byte) 0x20, (byte) 0xFC, (byte) 0x20, (byte) 0xFD, (byte) 0xFE, - (byte) 0xC0, (byte) 0xE1, (byte) 0xE2, (byte) 0xE3, (byte) 0xE4, (byte) 0xE5, (byte) 0xE6, (byte) 0xE7, - (byte) 0xE8, (byte) 0xE9, (byte) 0xEA, (byte) 0xEB, (byte) 0xEC, (byte) 0xED, (byte) 0xEE, (byte) 0xEF, - (byte) 0xF0, (byte) 0xF1, (byte) 0x20, (byte) 0xF3, (byte) 0xF4, (byte) 0xF5, (byte) 0xF6, (byte) 0xF7, - (byte) 0xF8, (byte) 0xF9, (byte) 0xFA, (byte) 0xFB, (byte) 0xDC, (byte) 0xDD, (byte) 0xDE, (byte) 0xDF, - (byte) 0xE0, (byte) 0xE1, (byte) 0xE2, (byte) 0xE3, (byte) 0xE4, (byte) 0xE5, (byte) 0xE6, (byte) 0xE7, - (byte) 0xE8, (byte) 0xE9, (byte) 0xEA, (byte) 0xEB, (byte) 0xEC, (byte) 0xED, (byte) 0xEE, (byte) 0xEF, - (byte) 0xF0, (byte) 0xF1, (byte) 0xF2, (byte) 0xF3, (byte) 0xF4, (byte) 0xF5, (byte) 0xF6, (byte) 0xF7, - (byte) 0xF8, (byte) 0xF9, (byte) 0xFA, (byte) 0xFB, (byte) 0xFC, (byte) 0xFD, (byte) 0xFE, (byte) 0x20, - }; - - public String getName() { - return haveC1Bytes ? 
"windows-1253" : "ISO-8859-7"; - } - } - - static class CharsetRecog_8859_7_el extends CharsetRecog_8859_7 { - private static int[] ngrams = { - 0x20E1ED, 0x20E1F0, 0x20E3E9, 0x20E4E9, 0x20E5F0, 0x20E720, 0x20EAE1, 0x20ECE5, 0x20EDE1, 0x20EF20, 0x20F0E1, 0x20F0EF, 0x20F0F1, 0x20F3F4, 0x20F3F5, 0x20F4E7, - 0x20F4EF, 0xDFE120, 0xE120E1, 0xE120F4, 0xE1E920, 0xE1ED20, 0xE1F0FC, 0xE1F220, 0xE3E9E1, 0xE5E920, 0xE5F220, 0xE720F4, 0xE7ED20, 0xE7F220, 0xE920F4, 0xE9E120, - 0xE9EADE, 0xE9F220, 0xEAE1E9, 0xEAE1F4, 0xECE520, 0xED20E1, 0xED20E5, 0xED20F0, 0xEDE120, 0xEFF220, 0xEFF520, 0xF0EFF5, 0xF0F1EF, 0xF0FC20, 0xF220E1, 0xF220E5, - 0xF220EA, 0xF220F0, 0xF220F4, 0xF3E520, 0xF3E720, 0xF3F4EF, 0xF4E120, 0xF4E1E9, 0xF4E7ED, 0xF4E7F2, 0xF4E9EA, 0xF4EF20, 0xF4EFF5, 0xF4F9ED, 0xF9ED20, 0xFEED20, - }; - - public String getLanguage() { + (byte) 0x20, (byte) 0x20, (byte) 0x20, (byte) 0x20, (byte) 0x20, (byte) 0x20, (byte) 0x20, (byte) 0x20, + (byte) 0x20, (byte) 0x20, (byte) 0x20, (byte) 0x20, (byte) 0x20, (byte) 0x20, (byte) 0x20, (byte) 0x20, + (byte) 0x20, (byte) 0x20, (byte) 0x20, (byte) 0x20, (byte) 0x20, (byte) 0x20, (byte) 0x20, (byte) 0x20, + (byte) 0x20, (byte) 0x20, (byte) 0x20, (byte) 0x20, (byte) 0x20, (byte) 0x20, (byte) 0x20, (byte) 0x20, + (byte) 0x20, (byte) 0x20, (byte) 0x20, (byte) 0x20, (byte) 0x20, (byte) 0x20, (byte) 0x20, (byte) 0x00, + (byte) 0x20, (byte) 0x20, (byte) 0x20, (byte) 0x20, (byte) 0x20, (byte) 0x20, (byte) 0x20, (byte) 0x20, + (byte) 0x20, (byte) 0x20, (byte) 0x20, (byte) 0x20, (byte) 0x20, (byte) 0x20, (byte) 0x20, (byte) 0x20, + (byte) 0x20, (byte) 0x20, (byte) 0x20, (byte) 0x20, (byte) 0x20, (byte) 0x20, (byte) 0x20, (byte) 0x20, + (byte) 0x20, (byte) 0x61, (byte) 0x62, (byte) 0x63, (byte) 0x64, (byte) 0x65, (byte) 0x66, (byte) 0x67, + (byte) 0x68, (byte) 0x69, (byte) 0x6A, (byte) 0x6B, (byte) 0x6C, (byte) 0x6D, (byte) 0x6E, (byte) 0x6F, + (byte) 0x70, (byte) 0x71, (byte) 0x72, (byte) 0x73, (byte) 0x74, (byte) 0x75, (byte) 0x76, (byte) 0x77, + 
(byte) 0x78, (byte) 0x79, (byte) 0x7A, (byte) 0x20, (byte) 0x20, (byte) 0x20, (byte) 0x20, (byte) 0x20, + (byte) 0x20, (byte) 0x61, (byte) 0x62, (byte) 0x63, (byte) 0x64, (byte) 0x65, (byte) 0x66, (byte) 0x67, + (byte) 0x68, (byte) 0x69, (byte) 0x6A, (byte) 0x6B, (byte) 0x6C, (byte) 0x6D, (byte) 0x6E, (byte) 0x6F, + (byte) 0x70, (byte) 0x71, (byte) 0x72, (byte) 0x73, (byte) 0x74, (byte) 0x75, (byte) 0x76, (byte) 0x77, + (byte) 0x78, (byte) 0x79, (byte) 0x7A, (byte) 0x20, (byte) 0x20, (byte) 0x20, (byte) 0x20, (byte) 0x20, + (byte) 0x20, (byte) 0x20, (byte) 0x20, (byte) 0x20, (byte) 0x20, (byte) 0x20, (byte) 0x20, (byte) 0x20, + (byte) 0x20, (byte) 0x20, (byte) 0x20, (byte) 0x20, (byte) 0x20, (byte) 0x20, (byte) 0x20, (byte) 0x20, + (byte) 0x20, (byte) 0x20, (byte) 0x20, (byte) 0x20, (byte) 0x20, (byte) 0x20, (byte) 0x20, (byte) 0x20, + (byte) 0x20, (byte) 0x20, (byte) 0x20, (byte) 0x20, (byte) 0x20, (byte) 0x20, (byte) 0x20, (byte) 0x20, + (byte) 0x20, (byte) 0xA1, (byte) 0xA2, (byte) 0x20, (byte) 0x20, (byte) 0x20, (byte) 0x20, (byte) 0x20, + (byte) 0x20, (byte) 0x20, (byte) 0x20, (byte) 0x20, (byte) 0x20, (byte) 0x20, (byte) 0x20, (byte) 0x20, + (byte) 0x20, (byte) 0x20, (byte) 0x20, (byte) 0x20, (byte) 0x20, (byte) 0x20, (byte) 0xDC, (byte) 0x20, + (byte) 0xDD, (byte) 0xDE, (byte) 0xDF, (byte) 0x20, (byte) 0xFC, (byte) 0x20, (byte) 0xFD, (byte) 0xFE, + (byte) 0xC0, (byte) 0xE1, (byte) 0xE2, (byte) 0xE3, (byte) 0xE4, (byte) 0xE5, (byte) 0xE6, (byte) 0xE7, + (byte) 0xE8, (byte) 0xE9, (byte) 0xEA, (byte) 0xEB, (byte) 0xEC, (byte) 0xED, (byte) 0xEE, (byte) 0xEF, + (byte) 0xF0, (byte) 0xF1, (byte) 0x20, (byte) 0xF3, (byte) 0xF4, (byte) 0xF5, (byte) 0xF6, (byte) 0xF7, + (byte) 0xF8, (byte) 0xF9, (byte) 0xFA, (byte) 0xFB, (byte) 0xDC, (byte) 0xDD, (byte) 0xDE, (byte) 0xDF, + (byte) 0xE0, (byte) 0xE1, (byte) 0xE2, (byte) 0xE3, (byte) 0xE4, (byte) 0xE5, (byte) 0xE6, (byte) 0xE7, + (byte) 0xE8, (byte) 0xE9, (byte) 0xEA, (byte) 0xEB, (byte) 0xEC, (byte) 0xED, (byte) 0xEE, 
(byte) 0xEF, + (byte) 0xF0, (byte) 0xF1, (byte) 0xF2, (byte) 0xF3, (byte) 0xF4, (byte) 0xF5, (byte) 0xF6, (byte) 0xF7, + (byte) 0xF8, (byte) 0xF9, (byte) 0xFA, (byte) 0xFB, (byte) 0xFC, (byte) 0xFD, (byte) 0xFE, (byte) 0x20, + }; + + public String getName() + { + return haveC1Bytes ? "windows-1253" : "ISO-8859-7"; + } + } + + static class CharsetRecog_8859_7_el extends CharsetRecog_8859_7 + { + private static int[] ngrams = { + 0x20E1ED, 0x20E1F0, 0x20E3E9, 0x20E4E9, 0x20E5F0, 0x20E720, 0x20EAE1, 0x20ECE5, 0x20EDE1, 0x20EF20, 0x20F0E1, 0x20F0EF, 0x20F0F1, 0x20F3F4, 0x20F3F5, 0x20F4E7, + 0x20F4EF, 0xDFE120, 0xE120E1, 0xE120F4, 0xE1E920, 0xE1ED20, 0xE1F0FC, 0xE1F220, 0xE3E9E1, 0xE5E920, 0xE5F220, 0xE720F4, 0xE7ED20, 0xE7F220, 0xE920F4, 0xE9E120, + 0xE9EADE, 0xE9F220, 0xEAE1E9, 0xEAE1F4, 0xECE520, 0xED20E1, 0xED20E5, 0xED20F0, 0xEDE120, 0xEFF220, 0xEFF520, 0xF0EFF5, 0xF0F1EF, 0xF0FC20, 0xF220E1, 0xF220E5, + 0xF220EA, 0xF220F0, 0xF220F4, 0xF3E520, 0xF3E720, 0xF3F4EF, 0xF4E120, 0xF4E1E9, 0xF4E7ED, 0xF4E7F2, 0xF4E9EA, 0xF4EF20, 0xF4EFF5, 0xF4F9ED, 0xF9ED20, 0xFEED20, + }; + + public String getLanguage() + { return "el"; } - - public int match(CharsetDetector det) { - return match(det, ngrams, byteMap); - } - } - - abstract static class CharsetRecog_8859_8 extends CharsetRecog_sbcs { + + public int match(CharsetDetector det) + { + return match(det, ngrams, byteMap); + } + } + + abstract static class CharsetRecog_8859_8 extends CharsetRecog_sbcs + { protected static byte[] byteMap = { - (byte) 0x20, (byte) 0x20, (byte) 0x20, (byte) 0x20, (byte) 0x20, (byte) 0x20, (byte) 0x20, (byte) 0x20, - (byte) 0x20, (byte) 0x20, (byte) 0x20, (byte) 0x20, (byte) 0x20, (byte) 0x20, (byte) 0x20, (byte) 0x20, - (byte) 0x20, (byte) 0x20, (byte) 0x20, (byte) 0x20, (byte) 0x20, (byte) 0x20, (byte) 0x20, (byte) 0x20, - (byte) 0x20, (byte) 0x20, (byte) 0x20, (byte) 0x20, (byte) 0x20, (byte) 0x20, (byte) 0x20, (byte) 0x20, - (byte) 0x20, (byte) 0x20, (byte) 0x20, (byte) 0x20, (byte) 0x20, (byte) 
0x20, (byte) 0x20, (byte) 0x00, - (byte) 0x20, (byte) 0x20, (byte) 0x20, (byte) 0x20, (byte) 0x20, (byte) 0x20, (byte) 0x20, (byte) 0x20, - (byte) 0x20, (byte) 0x20, (byte) 0x20, (byte) 0x20, (byte) 0x20, (byte) 0x20, (byte) 0x20, (byte) 0x20, - (byte) 0x20, (byte) 0x20, (byte) 0x20, (byte) 0x20, (byte) 0x20, (byte) 0x20, (byte) 0x20, (byte) 0x20, - (byte) 0x20, (byte) 0x61, (byte) 0x62, (byte) 0x63, (byte) 0x64, (byte) 0x65, (byte) 0x66, (byte) 0x67, - (byte) 0x68, (byte) 0x69, (byte) 0x6A, (byte) 0x6B, (byte) 0x6C, (byte) 0x6D, (byte) 0x6E, (byte) 0x6F, - (byte) 0x70, (byte) 0x71, (byte) 0x72, (byte) 0x73, (byte) 0x74, (byte) 0x75, (byte) 0x76, (byte) 0x77, - (byte) 0x78, (byte) 0x79, (byte) 0x7A, (byte) 0x20, (byte) 0x20, (byte) 0x20, (byte) 0x20, (byte) 0x20, - (byte) 0x20, (byte) 0x61, (byte) 0x62, (byte) 0x63, (byte) 0x64, (byte) 0x65, (byte) 0x66, (byte) 0x67, - (byte) 0x68, (byte) 0x69, (byte) 0x6A, (byte) 0x6B, (byte) 0x6C, (byte) 0x6D, (byte) 0x6E, (byte) 0x6F, - (byte) 0x70, (byte) 0x71, (byte) 0x72, (byte) 0x73, (byte) 0x74, (byte) 0x75, (byte) 0x76, (byte) 0x77, - (byte) 0x78, (byte) 0x79, (byte) 0x7A, (byte) 0x20, (byte) 0x20, (byte) 0x20, (byte) 0x20, (byte) 0x20, - (byte) 0x20, (byte) 0x20, (byte) 0x20, (byte) 0x20, (byte) 0x20, (byte) 0x20, (byte) 0x20, (byte) 0x20, - (byte) 0x20, (byte) 0x20, (byte) 0x20, (byte) 0x20, (byte) 0x20, (byte) 0x20, (byte) 0x20, (byte) 0x20, - (byte) 0x20, (byte) 0x20, (byte) 0x20, (byte) 0x20, (byte) 0x20, (byte) 0x20, (byte) 0x20, (byte) 0x20, - (byte) 0x20, (byte) 0x20, (byte) 0x20, (byte) 0x20, (byte) 0x20, (byte) 0x20, (byte) 0x20, (byte) 0x20, - (byte) 0x20, (byte) 0x20, (byte) 0x20, (byte) 0x20, (byte) 0x20, (byte) 0x20, (byte) 0x20, (byte) 0x20, - (byte) 0x20, (byte) 0x20, (byte) 0x20, (byte) 0x20, (byte) 0x20, (byte) 0x20, (byte) 0x20, (byte) 0x20, - (byte) 0x20, (byte) 0x20, (byte) 0x20, (byte) 0x20, (byte) 0x20, (byte) 0xB5, (byte) 0x20, (byte) 0x20, - (byte) 0x20, (byte) 0x20, (byte) 0x20, (byte) 0x20, 
(byte) 0x20, (byte) 0x20, (byte) 0x20, (byte) 0x20, - (byte) 0x20, (byte) 0x20, (byte) 0x20, (byte) 0x20, (byte) 0x20, (byte) 0x20, (byte) 0x20, (byte) 0x20, - (byte) 0x20, (byte) 0x20, (byte) 0x20, (byte) 0x20, (byte) 0x20, (byte) 0x20, (byte) 0x20, (byte) 0x20, - (byte) 0x20, (byte) 0x20, (byte) 0x20, (byte) 0x20, (byte) 0x20, (byte) 0x20, (byte) 0x20, (byte) 0x20, - (byte) 0x20, (byte) 0x20, (byte) 0x20, (byte) 0x20, (byte) 0x20, (byte) 0x20, (byte) 0x20, (byte) 0x20, - (byte) 0xE0, (byte) 0xE1, (byte) 0xE2, (byte) 0xE3, (byte) 0xE4, (byte) 0xE5, (byte) 0xE6, (byte) 0xE7, - (byte) 0xE8, (byte) 0xE9, (byte) 0xEA, (byte) 0xEB, (byte) 0xEC, (byte) 0xED, (byte) 0xEE, (byte) 0xEF, - (byte) 0xF0, (byte) 0xF1, (byte) 0xF2, (byte) 0xF3, (byte) 0xF4, (byte) 0xF5, (byte) 0xF6, (byte) 0xF7, - (byte) 0xF8, (byte) 0xF9, (byte) 0xFA, (byte) 0x20, (byte) 0x20, (byte) 0x20, (byte) 0x20, (byte) 0x20, - }; - - public String getName() { - return haveC1Bytes ? "windows-1255" : "ISO-8859-8"; - } - } - - static class CharsetRecog_8859_8_I_he extends CharsetRecog_8859_8 { - private static int[] ngrams = { - 0x20E0E5, 0x20E0E7, 0x20E0E9, 0x20E0FA, 0x20E1E9, 0x20E1EE, 0x20E4E0, 0x20E4E5, 0x20E4E9, 0x20E4EE, 0x20E4F2, 0x20E4F9, 0x20E4FA, 0x20ECE0, 0x20ECE4, 0x20EEE0, - 0x20F2EC, 0x20F9EC, 0xE0FA20, 0xE420E0, 0xE420E1, 0xE420E4, 0xE420EC, 0xE420EE, 0xE420F9, 0xE4E5E0, 0xE5E020, 0xE5ED20, 0xE5EF20, 0xE5F820, 0xE5FA20, 0xE920E4, - 0xE9E420, 0xE9E5FA, 0xE9E9ED, 0xE9ED20, 0xE9EF20, 0xE9F820, 0xE9FA20, 0xEC20E0, 0xEC20E4, 0xECE020, 0xECE420, 0xED20E0, 0xED20E1, 0xED20E4, 0xED20EC, 0xED20EE, - 0xED20F9, 0xEEE420, 0xEF20E4, 0xF0E420, 0xF0E920, 0xF0E9ED, 0xF2EC20, 0xF820E4, 0xF8E9ED, 0xF9EC20, 0xFA20E0, 0xFA20E1, 0xFA20E4, 0xFA20EC, 0xFA20EE, 0xFA20F9, - }; - - public String getName() { - return haveC1Bytes ? 
"windows-1255" : /*"ISO-8859-8-I"*/ "ISO-8859-8"; - } - - public String getLanguage() { + (byte) 0x20, (byte) 0x20, (byte) 0x20, (byte) 0x20, (byte) 0x20, (byte) 0x20, (byte) 0x20, (byte) 0x20, + (byte) 0x20, (byte) 0x20, (byte) 0x20, (byte) 0x20, (byte) 0x20, (byte) 0x20, (byte) 0x20, (byte) 0x20, + (byte) 0x20, (byte) 0x20, (byte) 0x20, (byte) 0x20, (byte) 0x20, (byte) 0x20, (byte) 0x20, (byte) 0x20, + (byte) 0x20, (byte) 0x20, (byte) 0x20, (byte) 0x20, (byte) 0x20, (byte) 0x20, (byte) 0x20, (byte) 0x20, + (byte) 0x20, (byte) 0x20, (byte) 0x20, (byte) 0x20, (byte) 0x20, (byte) 0x20, (byte) 0x20, (byte) 0x00, + (byte) 0x20, (byte) 0x20, (byte) 0x20, (byte) 0x20, (byte) 0x20, (byte) 0x20, (byte) 0x20, (byte) 0x20, + (byte) 0x20, (byte) 0x20, (byte) 0x20, (byte) 0x20, (byte) 0x20, (byte) 0x20, (byte) 0x20, (byte) 0x20, + (byte) 0x20, (byte) 0x20, (byte) 0x20, (byte) 0x20, (byte) 0x20, (byte) 0x20, (byte) 0x20, (byte) 0x20, + (byte) 0x20, (byte) 0x61, (byte) 0x62, (byte) 0x63, (byte) 0x64, (byte) 0x65, (byte) 0x66, (byte) 0x67, + (byte) 0x68, (byte) 0x69, (byte) 0x6A, (byte) 0x6B, (byte) 0x6C, (byte) 0x6D, (byte) 0x6E, (byte) 0x6F, + (byte) 0x70, (byte) 0x71, (byte) 0x72, (byte) 0x73, (byte) 0x74, (byte) 0x75, (byte) 0x76, (byte) 0x77, + (byte) 0x78, (byte) 0x79, (byte) 0x7A, (byte) 0x20, (byte) 0x20, (byte) 0x20, (byte) 0x20, (byte) 0x20, + (byte) 0x20, (byte) 0x61, (byte) 0x62, (byte) 0x63, (byte) 0x64, (byte) 0x65, (byte) 0x66, (byte) 0x67, + (byte) 0x68, (byte) 0x69, (byte) 0x6A, (byte) 0x6B, (byte) 0x6C, (byte) 0x6D, (byte) 0x6E, (byte) 0x6F, + (byte) 0x70, (byte) 0x71, (byte) 0x72, (byte) 0x73, (byte) 0x74, (byte) 0x75, (byte) 0x76, (byte) 0x77, + (byte) 0x78, (byte) 0x79, (byte) 0x7A, (byte) 0x20, (byte) 0x20, (byte) 0x20, (byte) 0x20, (byte) 0x20, + (byte) 0x20, (byte) 0x20, (byte) 0x20, (byte) 0x20, (byte) 0x20, (byte) 0x20, (byte) 0x20, (byte) 0x20, + (byte) 0x20, (byte) 0x20, (byte) 0x20, (byte) 0x20, (byte) 0x20, (byte) 0x20, (byte) 0x20, (byte) 0x20, + 
(byte) 0x20, (byte) 0x20, (byte) 0x20, (byte) 0x20, (byte) 0x20, (byte) 0x20, (byte) 0x20, (byte) 0x20, + (byte) 0x20, (byte) 0x20, (byte) 0x20, (byte) 0x20, (byte) 0x20, (byte) 0x20, (byte) 0x20, (byte) 0x20, + (byte) 0x20, (byte) 0x20, (byte) 0x20, (byte) 0x20, (byte) 0x20, (byte) 0x20, (byte) 0x20, (byte) 0x20, + (byte) 0x20, (byte) 0x20, (byte) 0x20, (byte) 0x20, (byte) 0x20, (byte) 0x20, (byte) 0x20, (byte) 0x20, + (byte) 0x20, (byte) 0x20, (byte) 0x20, (byte) 0x20, (byte) 0x20, (byte) 0xB5, (byte) 0x20, (byte) 0x20, + (byte) 0x20, (byte) 0x20, (byte) 0x20, (byte) 0x20, (byte) 0x20, (byte) 0x20, (byte) 0x20, (byte) 0x20, + (byte) 0x20, (byte) 0x20, (byte) 0x20, (byte) 0x20, (byte) 0x20, (byte) 0x20, (byte) 0x20, (byte) 0x20, + (byte) 0x20, (byte) 0x20, (byte) 0x20, (byte) 0x20, (byte) 0x20, (byte) 0x20, (byte) 0x20, (byte) 0x20, + (byte) 0x20, (byte) 0x20, (byte) 0x20, (byte) 0x20, (byte) 0x20, (byte) 0x20, (byte) 0x20, (byte) 0x20, + (byte) 0x20, (byte) 0x20, (byte) 0x20, (byte) 0x20, (byte) 0x20, (byte) 0x20, (byte) 0x20, (byte) 0x20, + (byte) 0xE0, (byte) 0xE1, (byte) 0xE2, (byte) 0xE3, (byte) 0xE4, (byte) 0xE5, (byte) 0xE6, (byte) 0xE7, + (byte) 0xE8, (byte) 0xE9, (byte) 0xEA, (byte) 0xEB, (byte) 0xEC, (byte) 0xED, (byte) 0xEE, (byte) 0xEF, + (byte) 0xF0, (byte) 0xF1, (byte) 0xF2, (byte) 0xF3, (byte) 0xF4, (byte) 0xF5, (byte) 0xF6, (byte) 0xF7, + (byte) 0xF8, (byte) 0xF9, (byte) 0xFA, (byte) 0x20, (byte) 0x20, (byte) 0x20, (byte) 0x20, (byte) 0x20, + }; + + public String getName() + { + return haveC1Bytes ? 
"windows-1255" : "ISO-8859-8"; + } + } + + static class CharsetRecog_8859_8_I_he extends CharsetRecog_8859_8 + { + private static int[] ngrams = { + 0x20E0E5, 0x20E0E7, 0x20E0E9, 0x20E0FA, 0x20E1E9, 0x20E1EE, 0x20E4E0, 0x20E4E5, 0x20E4E9, 0x20E4EE, 0x20E4F2, 0x20E4F9, 0x20E4FA, 0x20ECE0, 0x20ECE4, 0x20EEE0, + 0x20F2EC, 0x20F9EC, 0xE0FA20, 0xE420E0, 0xE420E1, 0xE420E4, 0xE420EC, 0xE420EE, 0xE420F9, 0xE4E5E0, 0xE5E020, 0xE5ED20, 0xE5EF20, 0xE5F820, 0xE5FA20, 0xE920E4, + 0xE9E420, 0xE9E5FA, 0xE9E9ED, 0xE9ED20, 0xE9EF20, 0xE9F820, 0xE9FA20, 0xEC20E0, 0xEC20E4, 0xECE020, 0xECE420, 0xED20E0, 0xED20E1, 0xED20E4, 0xED20EC, 0xED20EE, + 0xED20F9, 0xEEE420, 0xEF20E4, 0xF0E420, 0xF0E920, 0xF0E9ED, 0xF2EC20, 0xF820E4, 0xF8E9ED, 0xF9EC20, 0xFA20E0, 0xFA20E1, 0xFA20E4, 0xFA20EC, 0xFA20EE, 0xFA20F9, + }; + + public String getName() + { + return haveC1Bytes? "windows-1255" : /*"ISO-8859-8-I"*/ "ISO-8859-8"; + } + + public String getLanguage() + { return "he"; } - - public int match(CharsetDetector det) { - return match(det, ngrams, byteMap); - } - } - - static class CharsetRecog_8859_8_he extends CharsetRecog_8859_8 { - private static int[] ngrams = { - 0x20E0E5, 0x20E0EC, 0x20E4E9, 0x20E4EC, 0x20E4EE, 0x20E4F0, 0x20E9F0, 0x20ECF2, 0x20ECF9, 0x20EDE5, 0x20EDE9, 0x20EFE5, 0x20EFE9, 0x20F8E5, 0x20F8E9, 0x20FAE0, - 0x20FAE5, 0x20FAE9, 0xE020E4, 0xE020EC, 0xE020ED, 0xE020FA, 0xE0E420, 0xE0E5E4, 0xE0EC20, 0xE0EE20, 0xE120E4, 0xE120ED, 0xE120FA, 0xE420E4, 0xE420E9, 0xE420EC, - 0xE420ED, 0xE420EF, 0xE420F8, 0xE420FA, 0xE4EC20, 0xE5E020, 0xE5E420, 0xE7E020, 0xE9E020, 0xE9E120, 0xE9E420, 0xEC20E4, 0xEC20ED, 0xEC20FA, 0xECF220, 0xECF920, - 0xEDE9E9, 0xEDE9F0, 0xEDE9F8, 0xEE20E4, 0xEE20ED, 0xEE20FA, 0xEEE120, 0xEEE420, 0xF2E420, 0xF920E4, 0xF920ED, 0xF920FA, 0xF9E420, 0xFAE020, 0xFAE420, 0xFAE5E9, - }; - - public String getLanguage() { + + public int match(CharsetDetector det) + { + return match(det, ngrams, byteMap); + } + } + + static class CharsetRecog_8859_8_he extends CharsetRecog_8859_8 
+ { + private static int[] ngrams = { + 0x20E0E5, 0x20E0EC, 0x20E4E9, 0x20E4EC, 0x20E4EE, 0x20E4F0, 0x20E9F0, 0x20ECF2, 0x20ECF9, 0x20EDE5, 0x20EDE9, 0x20EFE5, 0x20EFE9, 0x20F8E5, 0x20F8E9, 0x20FAE0, + 0x20FAE5, 0x20FAE9, 0xE020E4, 0xE020EC, 0xE020ED, 0xE020FA, 0xE0E420, 0xE0E5E4, 0xE0EC20, 0xE0EE20, 0xE120E4, 0xE120ED, 0xE120FA, 0xE420E4, 0xE420E9, 0xE420EC, + 0xE420ED, 0xE420EF, 0xE420F8, 0xE420FA, 0xE4EC20, 0xE5E020, 0xE5E420, 0xE7E020, 0xE9E020, 0xE9E120, 0xE9E420, 0xEC20E4, 0xEC20ED, 0xEC20FA, 0xECF220, 0xECF920, + 0xEDE9E9, 0xEDE9F0, 0xEDE9F8, 0xEE20E4, 0xEE20ED, 0xEE20FA, 0xEEE120, 0xEEE420, 0xF2E420, 0xF920E4, 0xF920ED, 0xF920FA, 0xF9E420, 0xFAE020, 0xFAE420, 0xFAE5E9, + }; + + public String getLanguage() + { return "he"; } - - public int match(CharsetDetector det) { - return match(det, ngrams, byteMap); - } - } - - abstract static class CharsetRecog_8859_9 extends CharsetRecog_sbcs { + + public int match(CharsetDetector det) + { + return match(det, ngrams, byteMap); + } + } + + abstract static class CharsetRecog_8859_9 extends CharsetRecog_sbcs + { protected static byte[] byteMap = { - (byte) 0x20, (byte) 0x20, (byte) 0x20, (byte) 0x20, (byte) 0x20, (byte) 0x20, (byte) 0x20, (byte) 0x20, - (byte) 0x20, (byte) 0x20, (byte) 0x20, (byte) 0x20, (byte) 0x20, (byte) 0x20, (byte) 0x20, (byte) 0x20, - (byte) 0x20, (byte) 0x20, (byte) 0x20, (byte) 0x20, (byte) 0x20, (byte) 0x20, (byte) 0x20, (byte) 0x20, - (byte) 0x20, (byte) 0x20, (byte) 0x20, (byte) 0x20, (byte) 0x20, (byte) 0x20, (byte) 0x20, (byte) 0x20, - (byte) 0x20, (byte) 0x20, (byte) 0x20, (byte) 0x20, (byte) 0x20, (byte) 0x20, (byte) 0x20, (byte) 0x00, - (byte) 0x20, (byte) 0x20, (byte) 0x20, (byte) 0x20, (byte) 0x20, (byte) 0x20, (byte) 0x20, (byte) 0x20, - (byte) 0x20, (byte) 0x20, (byte) 0x20, (byte) 0x20, (byte) 0x20, (byte) 0x20, (byte) 0x20, (byte) 0x20, - (byte) 0x20, (byte) 0x20, (byte) 0x20, (byte) 0x20, (byte) 0x20, (byte) 0x20, (byte) 0x20, (byte) 0x20, - (byte) 0x20, (byte) 0x61, (byte) 0x62, 
(byte) 0x63, (byte) 0x64, (byte) 0x65, (byte) 0x66, (byte) 0x67, - (byte) 0x68, (byte) 0x69, (byte) 0x6A, (byte) 0x6B, (byte) 0x6C, (byte) 0x6D, (byte) 0x6E, (byte) 0x6F, - (byte) 0x70, (byte) 0x71, (byte) 0x72, (byte) 0x73, (byte) 0x74, (byte) 0x75, (byte) 0x76, (byte) 0x77, - (byte) 0x78, (byte) 0x79, (byte) 0x7A, (byte) 0x20, (byte) 0x20, (byte) 0x20, (byte) 0x20, (byte) 0x20, - (byte) 0x20, (byte) 0x61, (byte) 0x62, (byte) 0x63, (byte) 0x64, (byte) 0x65, (byte) 0x66, (byte) 0x67, - (byte) 0x68, (byte) 0x69, (byte) 0x6A, (byte) 0x6B, (byte) 0x6C, (byte) 0x6D, (byte) 0x6E, (byte) 0x6F, - (byte) 0x70, (byte) 0x71, (byte) 0x72, (byte) 0x73, (byte) 0x74, (byte) 0x75, (byte) 0x76, (byte) 0x77, - (byte) 0x78, (byte) 0x79, (byte) 0x7A, (byte) 0x20, (byte) 0x20, (byte) 0x20, (byte) 0x20, (byte) 0x20, - (byte) 0x20, (byte) 0x20, (byte) 0x20, (byte) 0x20, (byte) 0x20, (byte) 0x20, (byte) 0x20, (byte) 0x20, - (byte) 0x20, (byte) 0x20, (byte) 0x20, (byte) 0x20, (byte) 0x20, (byte) 0x20, (byte) 0x20, (byte) 0x20, - (byte) 0x20, (byte) 0x20, (byte) 0x20, (byte) 0x20, (byte) 0x20, (byte) 0x20, (byte) 0x20, (byte) 0x20, - (byte) 0x20, (byte) 0x20, (byte) 0x20, (byte) 0x20, (byte) 0x20, (byte) 0x20, (byte) 0x20, (byte) 0x20, - (byte) 0x20, (byte) 0x20, (byte) 0x20, (byte) 0x20, (byte) 0x20, (byte) 0x20, (byte) 0x20, (byte) 0x20, - (byte) 0x20, (byte) 0x20, (byte) 0xAA, (byte) 0x20, (byte) 0x20, (byte) 0x20, (byte) 0x20, (byte) 0x20, - (byte) 0x20, (byte) 0x20, (byte) 0x20, (byte) 0x20, (byte) 0x20, (byte) 0xB5, (byte) 0x20, (byte) 0x20, - (byte) 0x20, (byte) 0x20, (byte) 0xBA, (byte) 0x20, (byte) 0x20, (byte) 0x20, (byte) 0x20, (byte) 0x20, - (byte) 0xE0, (byte) 0xE1, (byte) 0xE2, (byte) 0xE3, (byte) 0xE4, (byte) 0xE5, (byte) 0xE6, (byte) 0xE7, - (byte) 0xE8, (byte) 0xE9, (byte) 0xEA, (byte) 0xEB, (byte) 0xEC, (byte) 0xED, (byte) 0xEE, (byte) 0xEF, - (byte) 0xF0, (byte) 0xF1, (byte) 0xF2, (byte) 0xF3, (byte) 0xF4, (byte) 0xF5, (byte) 0xF6, (byte) 0x20, - (byte) 0xF8, (byte) 
0xF9, (byte) 0xFA, (byte) 0xFB, (byte) 0xFC, (byte) 0x69, (byte) 0xFE, (byte) 0xDF, - (byte) 0xE0, (byte) 0xE1, (byte) 0xE2, (byte) 0xE3, (byte) 0xE4, (byte) 0xE5, (byte) 0xE6, (byte) 0xE7, - (byte) 0xE8, (byte) 0xE9, (byte) 0xEA, (byte) 0xEB, (byte) 0xEC, (byte) 0xED, (byte) 0xEE, (byte) 0xEF, - (byte) 0xF0, (byte) 0xF1, (byte) 0xF2, (byte) 0xF3, (byte) 0xF4, (byte) 0xF5, (byte) 0xF6, (byte) 0x20, - (byte) 0xF8, (byte) 0xF9, (byte) 0xFA, (byte) 0xFB, (byte) 0xFC, (byte) 0xFD, (byte) 0xFE, (byte) 0xFF, - }; - - public String getName() { - return haveC1Bytes ? "windows-1254" : "ISO-8859-9"; - } - } - - static class CharsetRecog_8859_9_tr extends CharsetRecog_8859_9 { - private static int[] ngrams = { - 0x206261, 0x206269, 0x206275, 0x206461, 0x206465, 0x206765, 0x206861, 0x20696C, 0x206B61, 0x206B6F, 0x206D61, 0x206F6C, 0x207361, 0x207461, 0x207665, 0x207961, - 0x612062, 0x616B20, 0x616C61, 0x616D61, 0x616E20, 0x616EFD, 0x617220, 0x617261, 0x6172FD, 0x6173FD, 0x617961, 0x626972, 0x646120, 0x646520, 0x646920, 0x652062, - 0x65206B, 0x656469, 0x656E20, 0x657220, 0x657269, 0x657369, 0x696C65, 0x696E20, 0x696E69, 0x697220, 0x6C616E, 0x6C6172, 0x6C6520, 0x6C6572, 0x6E2061, 0x6E2062, - 0x6E206B, 0x6E6461, 0x6E6465, 0x6E6520, 0x6E6920, 0x6E696E, 0x6EFD20, 0x72696E, 0x72FD6E, 0x766520, 0x796120, 0x796F72, 0xFD6E20, 0xFD6E64, 0xFD6EFD, 0xFDF0FD, - }; - - public String getLanguage() { + (byte) 0x20, (byte) 0x20, (byte) 0x20, (byte) 0x20, (byte) 0x20, (byte) 0x20, (byte) 0x20, (byte) 0x20, + (byte) 0x20, (byte) 0x20, (byte) 0x20, (byte) 0x20, (byte) 0x20, (byte) 0x20, (byte) 0x20, (byte) 0x20, + (byte) 0x20, (byte) 0x20, (byte) 0x20, (byte) 0x20, (byte) 0x20, (byte) 0x20, (byte) 0x20, (byte) 0x20, + (byte) 0x20, (byte) 0x20, (byte) 0x20, (byte) 0x20, (byte) 0x20, (byte) 0x20, (byte) 0x20, (byte) 0x20, + (byte) 0x20, (byte) 0x20, (byte) 0x20, (byte) 0x20, (byte) 0x20, (byte) 0x20, (byte) 0x20, (byte) 0x00, + (byte) 0x20, (byte) 0x20, (byte) 0x20, (byte) 0x20, (byte) 0x20, (byte) 
0x20, (byte) 0x20, (byte) 0x20, + (byte) 0x20, (byte) 0x20, (byte) 0x20, (byte) 0x20, (byte) 0x20, (byte) 0x20, (byte) 0x20, (byte) 0x20, + (byte) 0x20, (byte) 0x20, (byte) 0x20, (byte) 0x20, (byte) 0x20, (byte) 0x20, (byte) 0x20, (byte) 0x20, + (byte) 0x20, (byte) 0x61, (byte) 0x62, (byte) 0x63, (byte) 0x64, (byte) 0x65, (byte) 0x66, (byte) 0x67, + (byte) 0x68, (byte) 0x69, (byte) 0x6A, (byte) 0x6B, (byte) 0x6C, (byte) 0x6D, (byte) 0x6E, (byte) 0x6F, + (byte) 0x70, (byte) 0x71, (byte) 0x72, (byte) 0x73, (byte) 0x74, (byte) 0x75, (byte) 0x76, (byte) 0x77, + (byte) 0x78, (byte) 0x79, (byte) 0x7A, (byte) 0x20, (byte) 0x20, (byte) 0x20, (byte) 0x20, (byte) 0x20, + (byte) 0x20, (byte) 0x61, (byte) 0x62, (byte) 0x63, (byte) 0x64, (byte) 0x65, (byte) 0x66, (byte) 0x67, + (byte) 0x68, (byte) 0x69, (byte) 0x6A, (byte) 0x6B, (byte) 0x6C, (byte) 0x6D, (byte) 0x6E, (byte) 0x6F, + (byte) 0x70, (byte) 0x71, (byte) 0x72, (byte) 0x73, (byte) 0x74, (byte) 0x75, (byte) 0x76, (byte) 0x77, + (byte) 0x78, (byte) 0x79, (byte) 0x7A, (byte) 0x20, (byte) 0x20, (byte) 0x20, (byte) 0x20, (byte) 0x20, + (byte) 0x20, (byte) 0x20, (byte) 0x20, (byte) 0x20, (byte) 0x20, (byte) 0x20, (byte) 0x20, (byte) 0x20, + (byte) 0x20, (byte) 0x20, (byte) 0x20, (byte) 0x20, (byte) 0x20, (byte) 0x20, (byte) 0x20, (byte) 0x20, + (byte) 0x20, (byte) 0x20, (byte) 0x20, (byte) 0x20, (byte) 0x20, (byte) 0x20, (byte) 0x20, (byte) 0x20, + (byte) 0x20, (byte) 0x20, (byte) 0x20, (byte) 0x20, (byte) 0x20, (byte) 0x20, (byte) 0x20, (byte) 0x20, + (byte) 0x20, (byte) 0x20, (byte) 0x20, (byte) 0x20, (byte) 0x20, (byte) 0x20, (byte) 0x20, (byte) 0x20, + (byte) 0x20, (byte) 0x20, (byte) 0xAA, (byte) 0x20, (byte) 0x20, (byte) 0x20, (byte) 0x20, (byte) 0x20, + (byte) 0x20, (byte) 0x20, (byte) 0x20, (byte) 0x20, (byte) 0x20, (byte) 0xB5, (byte) 0x20, (byte) 0x20, + (byte) 0x20, (byte) 0x20, (byte) 0xBA, (byte) 0x20, (byte) 0x20, (byte) 0x20, (byte) 0x20, (byte) 0x20, + (byte) 0xE0, (byte) 0xE1, (byte) 0xE2, (byte) 0xE3, 
(byte) 0xE4, (byte) 0xE5, (byte) 0xE6, (byte) 0xE7, + (byte) 0xE8, (byte) 0xE9, (byte) 0xEA, (byte) 0xEB, (byte) 0xEC, (byte) 0xED, (byte) 0xEE, (byte) 0xEF, + (byte) 0xF0, (byte) 0xF1, (byte) 0xF2, (byte) 0xF3, (byte) 0xF4, (byte) 0xF5, (byte) 0xF6, (byte) 0x20, + (byte) 0xF8, (byte) 0xF9, (byte) 0xFA, (byte) 0xFB, (byte) 0xFC, (byte) 0x69, (byte) 0xFE, (byte) 0xDF, + (byte) 0xE0, (byte) 0xE1, (byte) 0xE2, (byte) 0xE3, (byte) 0xE4, (byte) 0xE5, (byte) 0xE6, (byte) 0xE7, + (byte) 0xE8, (byte) 0xE9, (byte) 0xEA, (byte) 0xEB, (byte) 0xEC, (byte) 0xED, (byte) 0xEE, (byte) 0xEF, + (byte) 0xF0, (byte) 0xF1, (byte) 0xF2, (byte) 0xF3, (byte) 0xF4, (byte) 0xF5, (byte) 0xF6, (byte) 0x20, + (byte) 0xF8, (byte) 0xF9, (byte) 0xFA, (byte) 0xFB, (byte) 0xFC, (byte) 0xFD, (byte) 0xFE, (byte) 0xFF, + }; + + public String getName() + { + return haveC1Bytes? "windows-1254" : "ISO-8859-9"; + } + } + + static class CharsetRecog_8859_9_tr extends CharsetRecog_8859_9 + { + private static int[] ngrams = { + 0x206261, 0x206269, 0x206275, 0x206461, 0x206465, 0x206765, 0x206861, 0x20696C, 0x206B61, 0x206B6F, 0x206D61, 0x206F6C, 0x207361, 0x207461, 0x207665, 0x207961, + 0x612062, 0x616B20, 0x616C61, 0x616D61, 0x616E20, 0x616EFD, 0x617220, 0x617261, 0x6172FD, 0x6173FD, 0x617961, 0x626972, 0x646120, 0x646520, 0x646920, 0x652062, + 0x65206B, 0x656469, 0x656E20, 0x657220, 0x657269, 0x657369, 0x696C65, 0x696E20, 0x696E69, 0x697220, 0x6C616E, 0x6C6172, 0x6C6520, 0x6C6572, 0x6E2061, 0x6E2062, + 0x6E206B, 0x6E6461, 0x6E6465, 0x6E6520, 0x6E6920, 0x6E696E, 0x6EFD20, 0x72696E, 0x72FD6E, 0x766520, 0x796120, 0x796F72, 0xFD6E20, 0xFD6E64, 0xFD6EFD, 0xFDF0FD, + }; + + public String getLanguage() + { return "tr"; } - - public int match(CharsetDetector det) { - return match(det, ngrams, byteMap); - } - } - - static class CharsetRecog_windows_1251 extends CharsetRecog_sbcs { - private static int[] ngrams = { - 0x20E220, 0x20E2EE, 0x20E4EE, 0x20E7E0, 0x20E820, 0x20EAE0, 0x20EAEE, 0x20EDE0, 0x20EDE5, 0x20EEE1, 
0x20EFEE, 0x20EFF0, 0x20F0E0, 0x20F1EE, 0x20F1F2, 0x20F2EE, - 0x20F7F2, 0x20FDF2, 0xE0EDE8, 0xE0F2FC, 0xE3EE20, 0xE5EBFC, 0xE5EDE8, 0xE5F1F2, 0xE5F220, 0xE820EF, 0xE8E520, 0xE8E820, 0xE8FF20, 0xEBE5ED, 0xEBE820, 0xEBFCED, - 0xEDE020, 0xEDE520, 0xEDE8E5, 0xEDE8FF, 0xEDEE20, 0xEDEEE2, 0xEE20E2, 0xEE20EF, 0xEE20F1, 0xEEE220, 0xEEE2E0, 0xEEE3EE, 0xEEE920, 0xEEEBFC, 0xEEEC20, 0xEEF1F2, - 0xEFEEEB, 0xEFF0E5, 0xEFF0E8, 0xEFF0EE, 0xF0E0E2, 0xF0E5E4, 0xF1F2E0, 0xF1F2E2, 0xF1F2E8, 0xF1FF20, 0xF2E5EB, 0xF2EE20, 0xF2EEF0, 0xF2FC20, 0xF7F2EE, 0xFBF520, + + public int match(CharsetDetector det) + { + return match(det, ngrams, byteMap); + } + } + + static class CharsetRecog_windows_1251 extends CharsetRecog_sbcs + { + private static int[] ngrams = { + 0x20E220, 0x20E2EE, 0x20E4EE, 0x20E7E0, 0x20E820, 0x20EAE0, 0x20EAEE, 0x20EDE0, 0x20EDE5, 0x20EEE1, 0x20EFEE, 0x20EFF0, 0x20F0E0, 0x20F1EE, 0x20F1F2, 0x20F2EE, + 0x20F7F2, 0x20FDF2, 0xE0EDE8, 0xE0F2FC, 0xE3EE20, 0xE5EBFC, 0xE5EDE8, 0xE5F1F2, 0xE5F220, 0xE820EF, 0xE8E520, 0xE8E820, 0xE8FF20, 0xEBE5ED, 0xEBE820, 0xEBFCED, + 0xEDE020, 0xEDE520, 0xEDE8E5, 0xEDE8FF, 0xEDEE20, 0xEDEEE2, 0xEE20E2, 0xEE20EF, 0xEE20F1, 0xEEE220, 0xEEE2E0, 0xEEE3EE, 0xEEE920, 0xEEEBFC, 0xEEEC20, 0xEEF1F2, + 0xEFEEEB, 0xEFF0E5, 0xEFF0E8, 0xEFF0EE, 0xF0E0E2, 0xF0E5E4, 0xF1F2E0, 0xF1F2E2, 0xF1F2E8, 0xF1FF20, 0xF2E5EB, 0xF2EE20, 0xF2EEF0, 0xF2FC20, 0xF7F2EE, 0xFBF520, }; private static byte[] byteMap = { - (byte) 0x20, (byte) 0x20, (byte) 0x20, (byte) 0x20, (byte) 0x20, (byte) 0x20, (byte) 0x20, (byte) 0x20, - (byte) 0x20, (byte) 0x20, (byte) 0x20, (byte) 0x20, (byte) 0x20, (byte) 0x20, (byte) 0x20, (byte) 0x20, - (byte) 0x20, (byte) 0x20, (byte) 0x20, (byte) 0x20, (byte) 0x20, (byte) 0x20, (byte) 0x20, (byte) 0x20, - (byte) 0x20, (byte) 0x20, (byte) 0x20, (byte) 0x20, (byte) 0x20, (byte) 0x20, (byte) 0x20, (byte) 0x20, - (byte) 0x20, (byte) 0x20, (byte) 0x20, (byte) 0x20, (byte) 0x20, (byte) 0x20, (byte) 0x20, (byte) 0x00, - (byte) 0x20, (byte) 0x20, (byte) 
0x20, (byte) 0x20, (byte) 0x20, (byte) 0x20, (byte) 0x20, (byte) 0x20, - (byte) 0x20, (byte) 0x20, (byte) 0x20, (byte) 0x20, (byte) 0x20, (byte) 0x20, (byte) 0x20, (byte) 0x20, - (byte) 0x20, (byte) 0x20, (byte) 0x20, (byte) 0x20, (byte) 0x20, (byte) 0x20, (byte) 0x20, (byte) 0x20, - (byte) 0x20, (byte) 0x61, (byte) 0x62, (byte) 0x63, (byte) 0x64, (byte) 0x65, (byte) 0x66, (byte) 0x67, - (byte) 0x68, (byte) 0x69, (byte) 0x6A, (byte) 0x6B, (byte) 0x6C, (byte) 0x6D, (byte) 0x6E, (byte) 0x6F, - (byte) 0x70, (byte) 0x71, (byte) 0x72, (byte) 0x73, (byte) 0x74, (byte) 0x75, (byte) 0x76, (byte) 0x77, - (byte) 0x78, (byte) 0x79, (byte) 0x7A, (byte) 0x20, (byte) 0x20, (byte) 0x20, (byte) 0x20, (byte) 0x20, - (byte) 0x20, (byte) 0x61, (byte) 0x62, (byte) 0x63, (byte) 0x64, (byte) 0x65, (byte) 0x66, (byte) 0x67, - (byte) 0x68, (byte) 0x69, (byte) 0x6A, (byte) 0x6B, (byte) 0x6C, (byte) 0x6D, (byte) 0x6E, (byte) 0x6F, - (byte) 0x70, (byte) 0x71, (byte) 0x72, (byte) 0x73, (byte) 0x74, (byte) 0x75, (byte) 0x76, (byte) 0x77, - (byte) 0x78, (byte) 0x79, (byte) 0x7A, (byte) 0x20, (byte) 0x20, (byte) 0x20, (byte) 0x20, (byte) 0x20, - (byte) 0x90, (byte) 0x83, (byte) 0x20, (byte) 0x83, (byte) 0x20, (byte) 0x20, (byte) 0x20, (byte) 0x20, - (byte) 0x20, (byte) 0x20, (byte) 0x9A, (byte) 0x20, (byte) 0x9C, (byte) 0x9D, (byte) 0x9E, (byte) 0x9F, - (byte) 0x90, (byte) 0x20, (byte) 0x20, (byte) 0x20, (byte) 0x20, (byte) 0x20, (byte) 0x20, (byte) 0x20, - (byte) 0x20, (byte) 0x20, (byte) 0x9A, (byte) 0x20, (byte) 0x9C, (byte) 0x9D, (byte) 0x9E, (byte) 0x9F, - (byte) 0x20, (byte) 0xA2, (byte) 0xA2, (byte) 0xBC, (byte) 0x20, (byte) 0xB4, (byte) 0x20, (byte) 0x20, - (byte) 0xB8, (byte) 0x20, (byte) 0xBA, (byte) 0x20, (byte) 0x20, (byte) 0x20, (byte) 0x20, (byte) 0xBF, - (byte) 0x20, (byte) 0x20, (byte) 0xB3, (byte) 0xB3, (byte) 0xB4, (byte) 0xB5, (byte) 0x20, (byte) 0x20, - (byte) 0xB8, (byte) 0x20, (byte) 0xBA, (byte) 0x20, (byte) 0xBC, (byte) 0xBE, (byte) 0xBE, (byte) 0xBF, - (byte) 0xE0, 
(byte) 0xE1, (byte) 0xE2, (byte) 0xE3, (byte) 0xE4, (byte) 0xE5, (byte) 0xE6, (byte) 0xE7, - (byte) 0xE8, (byte) 0xE9, (byte) 0xEA, (byte) 0xEB, (byte) 0xEC, (byte) 0xED, (byte) 0xEE, (byte) 0xEF, - (byte) 0xF0, (byte) 0xF1, (byte) 0xF2, (byte) 0xF3, (byte) 0xF4, (byte) 0xF5, (byte) 0xF6, (byte) 0xF7, - (byte) 0xF8, (byte) 0xF9, (byte) 0xFA, (byte) 0xFB, (byte) 0xFC, (byte) 0xFD, (byte) 0xFE, (byte) 0xFF, - (byte) 0xE0, (byte) 0xE1, (byte) 0xE2, (byte) 0xE3, (byte) 0xE4, (byte) 0xE5, (byte) 0xE6, (byte) 0xE7, - (byte) 0xE8, (byte) 0xE9, (byte) 0xEA, (byte) 0xEB, (byte) 0xEC, (byte) 0xED, (byte) 0xEE, (byte) 0xEF, - (byte) 0xF0, (byte) 0xF1, (byte) 0xF2, (byte) 0xF3, (byte) 0xF4, (byte) 0xF5, (byte) 0xF6, (byte) 0xF7, - (byte) 0xF8, (byte) 0xF9, (byte) 0xFA, (byte) 0xFB, (byte) 0xFC, (byte) 0xFD, (byte) 0xFE, (byte) 0xFF, - }; - - public String getName() { - return "windows-1251"; - } - - public String getLanguage() { + (byte) 0x20, (byte) 0x20, (byte) 0x20, (byte) 0x20, (byte) 0x20, (byte) 0x20, (byte) 0x20, (byte) 0x20, + (byte) 0x20, (byte) 0x20, (byte) 0x20, (byte) 0x20, (byte) 0x20, (byte) 0x20, (byte) 0x20, (byte) 0x20, + (byte) 0x20, (byte) 0x20, (byte) 0x20, (byte) 0x20, (byte) 0x20, (byte) 0x20, (byte) 0x20, (byte) 0x20, + (byte) 0x20, (byte) 0x20, (byte) 0x20, (byte) 0x20, (byte) 0x20, (byte) 0x20, (byte) 0x20, (byte) 0x20, + (byte) 0x20, (byte) 0x20, (byte) 0x20, (byte) 0x20, (byte) 0x20, (byte) 0x20, (byte) 0x20, (byte) 0x00, + (byte) 0x20, (byte) 0x20, (byte) 0x20, (byte) 0x20, (byte) 0x20, (byte) 0x20, (byte) 0x20, (byte) 0x20, + (byte) 0x20, (byte) 0x20, (byte) 0x20, (byte) 0x20, (byte) 0x20, (byte) 0x20, (byte) 0x20, (byte) 0x20, + (byte) 0x20, (byte) 0x20, (byte) 0x20, (byte) 0x20, (byte) 0x20, (byte) 0x20, (byte) 0x20, (byte) 0x20, + (byte) 0x20, (byte) 0x61, (byte) 0x62, (byte) 0x63, (byte) 0x64, (byte) 0x65, (byte) 0x66, (byte) 0x67, + (byte) 0x68, (byte) 0x69, (byte) 0x6A, (byte) 0x6B, (byte) 0x6C, (byte) 0x6D, (byte) 0x6E, (byte) 0x6F, + (byte) 
0x70, (byte) 0x71, (byte) 0x72, (byte) 0x73, (byte) 0x74, (byte) 0x75, (byte) 0x76, (byte) 0x77, + (byte) 0x78, (byte) 0x79, (byte) 0x7A, (byte) 0x20, (byte) 0x20, (byte) 0x20, (byte) 0x20, (byte) 0x20, + (byte) 0x20, (byte) 0x61, (byte) 0x62, (byte) 0x63, (byte) 0x64, (byte) 0x65, (byte) 0x66, (byte) 0x67, + (byte) 0x68, (byte) 0x69, (byte) 0x6A, (byte) 0x6B, (byte) 0x6C, (byte) 0x6D, (byte) 0x6E, (byte) 0x6F, + (byte) 0x70, (byte) 0x71, (byte) 0x72, (byte) 0x73, (byte) 0x74, (byte) 0x75, (byte) 0x76, (byte) 0x77, + (byte) 0x78, (byte) 0x79, (byte) 0x7A, (byte) 0x20, (byte) 0x20, (byte) 0x20, (byte) 0x20, (byte) 0x20, + (byte) 0x90, (byte) 0x83, (byte) 0x20, (byte) 0x83, (byte) 0x20, (byte) 0x20, (byte) 0x20, (byte) 0x20, + (byte) 0x20, (byte) 0x20, (byte) 0x9A, (byte) 0x20, (byte) 0x9C, (byte) 0x9D, (byte) 0x9E, (byte) 0x9F, + (byte) 0x90, (byte) 0x20, (byte) 0x20, (byte) 0x20, (byte) 0x20, (byte) 0x20, (byte) 0x20, (byte) 0x20, + (byte) 0x20, (byte) 0x20, (byte) 0x9A, (byte) 0x20, (byte) 0x9C, (byte) 0x9D, (byte) 0x9E, (byte) 0x9F, + (byte) 0x20, (byte) 0xA2, (byte) 0xA2, (byte) 0xBC, (byte) 0x20, (byte) 0xB4, (byte) 0x20, (byte) 0x20, + (byte) 0xB8, (byte) 0x20, (byte) 0xBA, (byte) 0x20, (byte) 0x20, (byte) 0x20, (byte) 0x20, (byte) 0xBF, + (byte) 0x20, (byte) 0x20, (byte) 0xB3, (byte) 0xB3, (byte) 0xB4, (byte) 0xB5, (byte) 0x20, (byte) 0x20, + (byte) 0xB8, (byte) 0x20, (byte) 0xBA, (byte) 0x20, (byte) 0xBC, (byte) 0xBE, (byte) 0xBE, (byte) 0xBF, + (byte) 0xE0, (byte) 0xE1, (byte) 0xE2, (byte) 0xE3, (byte) 0xE4, (byte) 0xE5, (byte) 0xE6, (byte) 0xE7, + (byte) 0xE8, (byte) 0xE9, (byte) 0xEA, (byte) 0xEB, (byte) 0xEC, (byte) 0xED, (byte) 0xEE, (byte) 0xEF, + (byte) 0xF0, (byte) 0xF1, (byte) 0xF2, (byte) 0xF3, (byte) 0xF4, (byte) 0xF5, (byte) 0xF6, (byte) 0xF7, + (byte) 0xF8, (byte) 0xF9, (byte) 0xFA, (byte) 0xFB, (byte) 0xFC, (byte) 0xFD, (byte) 0xFE, (byte) 0xFF, + (byte) 0xE0, (byte) 0xE1, (byte) 0xE2, (byte) 0xE3, (byte) 0xE4, (byte) 0xE5, (byte) 0xE6, (byte) 
0xE7, + (byte) 0xE8, (byte) 0xE9, (byte) 0xEA, (byte) 0xEB, (byte) 0xEC, (byte) 0xED, (byte) 0xEE, (byte) 0xEF, + (byte) 0xF0, (byte) 0xF1, (byte) 0xF2, (byte) 0xF3, (byte) 0xF4, (byte) 0xF5, (byte) 0xF6, (byte) 0xF7, + (byte) 0xF8, (byte) 0xF9, (byte) 0xFA, (byte) 0xFB, (byte) 0xFC, (byte) 0xFD, (byte) 0xFE, (byte) 0xFF, + }; + + public String getName() + { + return "windows-1251"; + } + + public String getLanguage() + { return "ru"; } - - public int match(CharsetDetector det) { - return match(det, ngrams, byteMap); - } - } - - static class CharsetRecog_IBM866_ru extends CharsetRecog_sbcs { - private static int[] ngrams = { - 0x20E220, 0x20E2EE, 0x20E4EE, 0x20E7E0, 0x20E820, 0x20EAE0, 0x20EAEE, 0x20EDE0, 0x20EDE5, 0x20EEE1, 0x20EFEE, 0x20EFF0, 0x20F0E0, 0x20F1EE, 0x20F1F2, 0x20F2EE, - 0x20F7F2, 0x20FDF2, 0xE0EDE8, 0xE0F2FC, 0xE3EE20, 0xE5EBFC, 0xE5EDE8, 0xE5F1F2, 0xE5F220, 0xE820EF, 0xE8E520, 0xE8E820, 0xE8FF20, 0xEBE5ED, 0xEBE820, 0xEBFCED, - 0xEDE020, 0xEDE520, 0xEDE8E5, 0xEDE8FF, 0xEDEE20, 0xEDEEE2, 0xEE20E2, 0xEE20EF, 0xEE20F1, 0xEEE220, 0xEEE2E0, 0xEEE3EE, 0xEEE920, 0xEEEBFC, 0xEEEC20, 0xEEF1F2, - 0xEFEEEB, 0xEFF0E5, 0xEFF0E8, 0xEFF0EE, 0xF0E0E2, 0xF0E5E4, 0xF1F2E0, 0xF1F2E2, 0xF1F2E8, 0xF1FF20, 0xF2E5EB, 0xF2EE20, 0xF2EEF0, 0xF2FC20, 0xF7F2EE, 0xFBF520, + + public int match(CharsetDetector det) + { + return match(det, ngrams, byteMap); + } + } + + static class CharsetRecog_IBM866_ru extends CharsetRecog_sbcs + { + private static int[] ngrams = { + 0x20E220, 0x20E2EE, 0x20E4EE, 0x20E7E0, 0x20E820, 0x20EAE0, 0x20EAEE, 0x20EDE0, 0x20EDE5, 0x20EEE1, 0x20EFEE, 0x20EFF0, 0x20F0E0, 0x20F1EE, 0x20F1F2, 0x20F2EE, + 0x20F7F2, 0x20FDF2, 0xE0EDE8, 0xE0F2FC, 0xE3EE20, 0xE5EBFC, 0xE5EDE8, 0xE5F1F2, 0xE5F220, 0xE820EF, 0xE8E520, 0xE8E820, 0xE8FF20, 0xEBE5ED, 0xEBE820, 0xEBFCED, + 0xEDE020, 0xEDE520, 0xEDE8E5, 0xEDE8FF, 0xEDEE20, 0xEDEEE2, 0xEE20E2, 0xEE20EF, 0xEE20F1, 0xEEE220, 0xEEE2E0, 0xEEE3EE, 0xEEE920, 0xEEEBFC, 0xEEEC20, 0xEEF1F2, + 0xEFEEEB, 0xEFF0E5, 0xEFF0E8, 
0xEFF0EE, 0xF0E0E2, 0xF0E5E4, 0xF1F2E0, 0xF1F2E2, 0xF1F2E8, 0xF1FF20, 0xF2E5EB, 0xF2EE20, 0xF2EEF0, 0xF2FC20, 0xF7F2EE, 0xFBF520, }; // bytemap converts cp866 chars to cp1251 chars, so ngrams are still unchanged private static byte[] byteMap = { - (byte) 0x20, (byte) 0x20, (byte) 0x20, (byte) 0x20, (byte) 0x20, (byte) 0x20, (byte) 0x20, (byte) 0x20, - (byte) 0x20, (byte) 0x20, (byte) 0x20, (byte) 0x20, (byte) 0x20, (byte) 0x20, (byte) 0x20, (byte) 0x20, - (byte) 0x20, (byte) 0x20, (byte) 0x20, (byte) 0x20, (byte) 0x20, (byte) 0x20, (byte) 0x20, (byte) 0x20, - (byte) 0x20, (byte) 0x20, (byte) 0x20, (byte) 0x20, (byte) 0x20, (byte) 0x20, (byte) 0x20, (byte) 0x20, - (byte) 0x20, (byte) 0x20, (byte) 0x20, (byte) 0x20, (byte) 0x20, (byte) 0x20, (byte) 0x20, (byte) 0x00, - (byte) 0x20, (byte) 0x20, (byte) 0x20, (byte) 0x20, (byte) 0x20, (byte) 0x20, (byte) 0x20, (byte) 0x20, - (byte) 0x20, (byte) 0x20, (byte) 0x20, (byte) 0x20, (byte) 0x20, (byte) 0x20, (byte) 0x20, (byte) 0x20, - (byte) 0x20, (byte) 0x20, (byte) 0x20, (byte) 0x20, (byte) 0x20, (byte) 0x20, (byte) 0x20, (byte) 0x20, - (byte) 0x20, (byte) 0x61, (byte) 0x62, (byte) 0x63, (byte) 0x64, (byte) 0x65, (byte) 0x66, (byte) 0x67, - (byte) 0x68, (byte) 0x69, (byte) 0x6A, (byte) 0x6B, (byte) 0x6C, (byte) 0x6D, (byte) 0x6E, (byte) 0x6F, - (byte) 0x70, (byte) 0x71, (byte) 0x72, (byte) 0x73, (byte) 0x74, (byte) 0x75, (byte) 0x76, (byte) 0x77, - (byte) 0x78, (byte) 0x79, (byte) 0x7A, (byte) 0x20, (byte) 0x20, (byte) 0x20, (byte) 0x20, (byte) 0x20, - (byte) 0x20, (byte) 0x61, (byte) 0x62, (byte) 0x63, (byte) 0x64, (byte) 0x65, (byte) 0x66, (byte) 0x67, - (byte) 0x68, (byte) 0x69, (byte) 0x6A, (byte) 0x6B, (byte) 0x6C, (byte) 0x6D, (byte) 0x6E, (byte) 0x6F, - (byte) 0x70, (byte) 0x71, (byte) 0x72, (byte) 0x73, (byte) 0x74, (byte) 0x75, (byte) 0x76, (byte) 0x77, - (byte) 0x78, (byte) 0x79, (byte) 0x7A, (byte) 0x20, (byte) 0x20, (byte) 0x20, (byte) 0x20, (byte) 0x20, - (byte) 0xE0, (byte) 0xE1, (byte) 0xE2, (byte) 0xE3, 
(byte) 0xE4, (byte) 0xE5, (byte) 0xE6, (byte) 0xE7, - (byte) 0xE8, (byte) 0xE9, (byte) 0xEA, (byte) 0xEB, (byte) 0xEC, (byte) 0xED, (byte) 0xEE, (byte) 0xEF, - (byte) 0xF0, (byte) 0xF1, (byte) 0xF2, (byte) 0xF3, (byte) 0xF4, (byte) 0xF5, (byte) 0xF6, (byte) 0xF7, - (byte) 0xF8, (byte) 0xF9, (byte) 0xFA, (byte) 0xFB, (byte) 0xFC, (byte) 0xFD, (byte) 0xFE, (byte) 0xFF, - (byte) 0xE0, (byte) 0xE1, (byte) 0xE2, (byte) 0xE3, (byte) 0xE4, (byte) 0xE5, (byte) 0xE6, (byte) 0xE7, - (byte) 0xE8, (byte) 0xE9, (byte) 0xEA, (byte) 0xEB, (byte) 0xEC, (byte) 0xED, (byte) 0xEE, (byte) 0xEF, - (byte) 0x20, (byte) 0x20, (byte) 0x20, (byte) 0x20, (byte) 0x20, (byte) 0x20, (byte) 0x20, (byte) 0x20, - (byte) 0x20, (byte) 0x20, (byte) 0x20, (byte) 0x20, (byte) 0x20, (byte) 0x20, (byte) 0x20, (byte) 0x20, - (byte) 0x20, (byte) 0x20, (byte) 0x20, (byte) 0x20, (byte) 0x20, (byte) 0x20, (byte) 0x20, (byte) 0x20, - (byte) 0x20, (byte) 0x20, (byte) 0x20, (byte) 0x20, (byte) 0x20, (byte) 0x20, (byte) 0x20, (byte) 0x20, - (byte) 0x20, (byte) 0x20, (byte) 0x20, (byte) 0x20, (byte) 0x20, (byte) 0x20, (byte) 0x20, (byte) 0x20, - (byte) 0x20, (byte) 0x20, (byte) 0x20, (byte) 0x20, (byte) 0x20, (byte) 0x20, (byte) 0x20, (byte) 0x20, - (byte) 0xF0, (byte) 0xF1, (byte) 0xF2, (byte) 0xF3, (byte) 0xF4, (byte) 0xF5, (byte) 0xF6, (byte) 0xF7, - (byte) 0xF8, (byte) 0xF9, (byte) 0xFA, (byte) 0xFB, (byte) 0xFC, (byte) 0xFD, (byte) 0xFE, (byte) 0xFF, - (byte) 0xB8, (byte) 0xB8, (byte) 0xBA, (byte) 0xBA, (byte) 0xBF, (byte) 0xBF, (byte) 0xA2, (byte) 0xA2, - (byte) 0x20, (byte) 0x20, (byte) 0x20, (byte) 0x20, (byte) 0x20, (byte) 0x20, (byte) 0x20, (byte) 0x20, - }; - - public String getName() { - return "IBM866"; - } - - public String getLanguage() { + (byte) 0x20, (byte) 0x20, (byte) 0x20, (byte) 0x20, (byte) 0x20, (byte) 0x20, (byte) 0x20, (byte) 0x20, + (byte) 0x20, (byte) 0x20, (byte) 0x20, (byte) 0x20, (byte) 0x20, (byte) 0x20, (byte) 0x20, (byte) 0x20, + (byte) 0x20, (byte) 0x20, (byte) 0x20, (byte) 0x20, 
(byte) 0x20, (byte) 0x20, (byte) 0x20, (byte) 0x20, + (byte) 0x20, (byte) 0x20, (byte) 0x20, (byte) 0x20, (byte) 0x20, (byte) 0x20, (byte) 0x20, (byte) 0x20, + (byte) 0x20, (byte) 0x20, (byte) 0x20, (byte) 0x20, (byte) 0x20, (byte) 0x20, (byte) 0x20, (byte) 0x00, + (byte) 0x20, (byte) 0x20, (byte) 0x20, (byte) 0x20, (byte) 0x20, (byte) 0x20, (byte) 0x20, (byte) 0x20, + (byte) 0x20, (byte) 0x20, (byte) 0x20, (byte) 0x20, (byte) 0x20, (byte) 0x20, (byte) 0x20, (byte) 0x20, + (byte) 0x20, (byte) 0x20, (byte) 0x20, (byte) 0x20, (byte) 0x20, (byte) 0x20, (byte) 0x20, (byte) 0x20, + (byte) 0x20, (byte) 0x61, (byte) 0x62, (byte) 0x63, (byte) 0x64, (byte) 0x65, (byte) 0x66, (byte) 0x67, + (byte) 0x68, (byte) 0x69, (byte) 0x6A, (byte) 0x6B, (byte) 0x6C, (byte) 0x6D, (byte) 0x6E, (byte) 0x6F, + (byte) 0x70, (byte) 0x71, (byte) 0x72, (byte) 0x73, (byte) 0x74, (byte) 0x75, (byte) 0x76, (byte) 0x77, + (byte) 0x78, (byte) 0x79, (byte) 0x7A, (byte) 0x20, (byte) 0x20, (byte) 0x20, (byte) 0x20, (byte) 0x20, + (byte) 0x20, (byte) 0x61, (byte) 0x62, (byte) 0x63, (byte) 0x64, (byte) 0x65, (byte) 0x66, (byte) 0x67, + (byte) 0x68, (byte) 0x69, (byte) 0x6A, (byte) 0x6B, (byte) 0x6C, (byte) 0x6D, (byte) 0x6E, (byte) 0x6F, + (byte) 0x70, (byte) 0x71, (byte) 0x72, (byte) 0x73, (byte) 0x74, (byte) 0x75, (byte) 0x76, (byte) 0x77, + (byte) 0x78, (byte) 0x79, (byte) 0x7A, (byte) 0x20, (byte) 0x20, (byte) 0x20, (byte) 0x20, (byte) 0x20, + (byte) 0xE0, (byte) 0xE1, (byte) 0xE2, (byte) 0xE3, (byte) 0xE4, (byte) 0xE5, (byte) 0xE6, (byte) 0xE7, + (byte) 0xE8, (byte) 0xE9, (byte) 0xEA, (byte) 0xEB, (byte) 0xEC, (byte) 0xED, (byte) 0xEE, (byte) 0xEF, + (byte) 0xF0, (byte) 0xF1, (byte) 0xF2, (byte) 0xF3, (byte) 0xF4, (byte) 0xF5, (byte) 0xF6, (byte) 0xF7, + (byte) 0xF8, (byte) 0xF9, (byte) 0xFA, (byte) 0xFB, (byte) 0xFC, (byte) 0xFD, (byte) 0xFE, (byte) 0xFF, + (byte) 0xE0, (byte) 0xE1, (byte) 0xE2, (byte) 0xE3, (byte) 0xE4, (byte) 0xE5, (byte) 0xE6, (byte) 0xE7, + (byte) 0xE8, (byte) 0xE9, (byte) 
0xEA, (byte) 0xEB, (byte) 0xEC, (byte) 0xED, (byte) 0xEE, (byte) 0xEF, + (byte) 0x20, (byte) 0x20, (byte) 0x20, (byte) 0x20, (byte) 0x20, (byte) 0x20, (byte) 0x20, (byte) 0x20, + (byte) 0x20, (byte) 0x20, (byte) 0x20, (byte) 0x20, (byte) 0x20, (byte) 0x20, (byte) 0x20, (byte) 0x20, + (byte) 0x20, (byte) 0x20, (byte) 0x20, (byte) 0x20, (byte) 0x20, (byte) 0x20, (byte) 0x20, (byte) 0x20, + (byte) 0x20, (byte) 0x20, (byte) 0x20, (byte) 0x20, (byte) 0x20, (byte) 0x20, (byte) 0x20, (byte) 0x20, + (byte) 0x20, (byte) 0x20, (byte) 0x20, (byte) 0x20, (byte) 0x20, (byte) 0x20, (byte) 0x20, (byte) 0x20, + (byte) 0x20, (byte) 0x20, (byte) 0x20, (byte) 0x20, (byte) 0x20, (byte) 0x20, (byte) 0x20, (byte) 0x20, + (byte) 0xF0, (byte) 0xF1, (byte) 0xF2, (byte) 0xF3, (byte) 0xF4, (byte) 0xF5, (byte) 0xF6, (byte) 0xF7, + (byte) 0xF8, (byte) 0xF9, (byte) 0xFA, (byte) 0xFB, (byte) 0xFC, (byte) 0xFD, (byte) 0xFE, (byte) 0xFF, + (byte) 0xB8, (byte) 0xB8, (byte) 0xBA, (byte) 0xBA, (byte) 0xBF, (byte) 0xBF, (byte) 0xA2, (byte) 0xA2, + (byte) 0x20, (byte) 0x20, (byte) 0x20, (byte) 0x20, (byte) 0x20, (byte) 0x20, (byte) 0x20, (byte) 0x20, + }; + + public String getName() + { + return "IBM866"; + } + + public String getLanguage() + { return "ru"; } - public int match(CharsetDetector det) { - return match(det, ngrams, byteMap); - } - } - - static class CharsetRecog_windows_1256 extends CharsetRecog_sbcs { - private static int[] ngrams = { - 0x20C7E1, 0x20C7E4, 0x20C8C7, 0x20DAE1, 0x20DDED, 0x20E1E1, 0x20E3E4, 0x20E6C7, 0xC720C7, 0xC7C120, 0xC7CA20, 0xC7D120, 0xC7E120, 0xC7E1C3, 0xC7E1C7, 0xC7E1C8, - 0xC7E1CA, 0xC7E1CC, 0xC7E1CD, 0xC7E1CF, 0xC7E1D3, 0xC7E1DA, 0xC7E1DE, 0xC7E1E3, 0xC7E1E6, 0xC7E1ED, 0xC7E320, 0xC7E420, 0xC7E4CA, 0xC820C7, 0xC920C7, 0xC920DD, - 0xC920E1, 0xC920E3, 0xC920E6, 0xCA20C7, 0xCF20C7, 0xCFC920, 0xD120C7, 0xD1C920, 0xD320C7, 0xDA20C7, 0xDAE1EC, 0xDDED20, 0xE120C7, 0xE1C920, 0xE1EC20, 0xE1ED20, - 0xE320C7, 0xE3C720, 0xE3C920, 0xE3E420, 0xE420C7, 0xE520C7, 0xE5C720, 
0xE6C7E1, 0xE6E420, 0xEC20C7, 0xED20C7, 0xED20E3, 0xED20E6, 0xEDC920, 0xEDD120, 0xEDE420, + public int match(CharsetDetector det) + { + return match(det, ngrams, byteMap); + } + } + + static class CharsetRecog_windows_1256 extends CharsetRecog_sbcs + { + private static int[] ngrams = { + 0x20C7E1, 0x20C7E4, 0x20C8C7, 0x20DAE1, 0x20DDED, 0x20E1E1, 0x20E3E4, 0x20E6C7, 0xC720C7, 0xC7C120, 0xC7CA20, 0xC7D120, 0xC7E120, 0xC7E1C3, 0xC7E1C7, 0xC7E1C8, + 0xC7E1CA, 0xC7E1CC, 0xC7E1CD, 0xC7E1CF, 0xC7E1D3, 0xC7E1DA, 0xC7E1DE, 0xC7E1E3, 0xC7E1E6, 0xC7E1ED, 0xC7E320, 0xC7E420, 0xC7E4CA, 0xC820C7, 0xC920C7, 0xC920DD, + 0xC920E1, 0xC920E3, 0xC920E6, 0xCA20C7, 0xCF20C7, 0xCFC920, 0xD120C7, 0xD1C920, 0xD320C7, 0xDA20C7, 0xDAE1EC, 0xDDED20, 0xE120C7, 0xE1C920, 0xE1EC20, 0xE1ED20, + 0xE320C7, 0xE3C720, 0xE3C920, 0xE3E420, 0xE420C7, 0xE520C7, 0xE5C720, 0xE6C7E1, 0xE6E420, 0xEC20C7, 0xED20C7, 0xED20E3, 0xED20E6, 0xEDC920, 0xEDD120, 0xEDE420, }; private static byte[] byteMap = { - (byte) 0x20, (byte) 0x20, (byte) 0x20, (byte) 0x20, (byte) 0x20, (byte) 0x20, (byte) 0x20, (byte) 0x20, - (byte) 0x20, (byte) 0x20, (byte) 0x20, (byte) 0x20, (byte) 0x20, (byte) 0x20, (byte) 0x20, (byte) 0x20, - (byte) 0x20, (byte) 0x20, (byte) 0x20, (byte) 0x20, (byte) 0x20, (byte) 0x20, (byte) 0x20, (byte) 0x20, - (byte) 0x20, (byte) 0x20, (byte) 0x20, (byte) 0x20, (byte) 0x20, (byte) 0x20, (byte) 0x20, (byte) 0x20, - (byte) 0x20, (byte) 0x20, (byte) 0x20, (byte) 0x20, (byte) 0x20, (byte) 0x20, (byte) 0x20, (byte) 0x00, - (byte) 0x20, (byte) 0x20, (byte) 0x20, (byte) 0x20, (byte) 0x20, (byte) 0x20, (byte) 0x20, (byte) 0x20, - (byte) 0x20, (byte) 0x20, (byte) 0x20, (byte) 0x20, (byte) 0x20, (byte) 0x20, (byte) 0x20, (byte) 0x20, - (byte) 0x20, (byte) 0x20, (byte) 0x20, (byte) 0x20, (byte) 0x20, (byte) 0x20, (byte) 0x20, (byte) 0x20, - (byte) 0x20, (byte) 0x61, (byte) 0x62, (byte) 0x63, (byte) 0x64, (byte) 0x65, (byte) 0x66, (byte) 0x67, - (byte) 0x68, (byte) 0x69, (byte) 0x6A, (byte) 0x6B, (byte) 0x6C, (byte) 
0x6D, (byte) 0x6E, (byte) 0x6F, - (byte) 0x70, (byte) 0x71, (byte) 0x72, (byte) 0x73, (byte) 0x74, (byte) 0x75, (byte) 0x76, (byte) 0x77, - (byte) 0x78, (byte) 0x79, (byte) 0x7A, (byte) 0x20, (byte) 0x20, (byte) 0x20, (byte) 0x20, (byte) 0x20, - (byte) 0x20, (byte) 0x61, (byte) 0x62, (byte) 0x63, (byte) 0x64, (byte) 0x65, (byte) 0x66, (byte) 0x67, - (byte) 0x68, (byte) 0x69, (byte) 0x6A, (byte) 0x6B, (byte) 0x6C, (byte) 0x6D, (byte) 0x6E, (byte) 0x6F, - (byte) 0x70, (byte) 0x71, (byte) 0x72, (byte) 0x73, (byte) 0x74, (byte) 0x75, (byte) 0x76, (byte) 0x77, - (byte) 0x78, (byte) 0x79, (byte) 0x7A, (byte) 0x20, (byte) 0x20, (byte) 0x20, (byte) 0x20, (byte) 0x20, - (byte) 0x20, (byte) 0x81, (byte) 0x20, (byte) 0x83, (byte) 0x20, (byte) 0x20, (byte) 0x20, (byte) 0x20, - (byte) 0x88, (byte) 0x20, (byte) 0x8A, (byte) 0x20, (byte) 0x9C, (byte) 0x8D, (byte) 0x8E, (byte) 0x8F, - (byte) 0x90, (byte) 0x20, (byte) 0x20, (byte) 0x20, (byte) 0x20, (byte) 0x20, (byte) 0x20, (byte) 0x20, - (byte) 0x98, (byte) 0x20, (byte) 0x9A, (byte) 0x20, (byte) 0x9C, (byte) 0x20, (byte) 0x20, (byte) 0x9F, - (byte) 0x20, (byte) 0x20, (byte) 0x20, (byte) 0x20, (byte) 0x20, (byte) 0x20, (byte) 0x20, (byte) 0x20, - (byte) 0x20, (byte) 0x20, (byte) 0xAA, (byte) 0x20, (byte) 0x20, (byte) 0x20, (byte) 0x20, (byte) 0x20, - (byte) 0x20, (byte) 0x20, (byte) 0x20, (byte) 0x20, (byte) 0x20, (byte) 0xB5, (byte) 0x20, (byte) 0x20, - (byte) 0x20, (byte) 0x20, (byte) 0x20, (byte) 0x20, (byte) 0x20, (byte) 0x20, (byte) 0x20, (byte) 0x20, - (byte) 0xC0, (byte) 0xC1, (byte) 0xC2, (byte) 0xC3, (byte) 0xC4, (byte) 0xC5, (byte) 0xC6, (byte) 0xC7, - (byte) 0xC8, (byte) 0xC9, (byte) 0xCA, (byte) 0xCB, (byte) 0xCC, (byte) 0xCD, (byte) 0xCE, (byte) 0xCF, - (byte) 0xD0, (byte) 0xD1, (byte) 0xD2, (byte) 0xD3, (byte) 0xD4, (byte) 0xD5, (byte) 0xD6, (byte) 0x20, - (byte) 0xD8, (byte) 0xD9, (byte) 0xDA, (byte) 0xDB, (byte) 0xDC, (byte) 0xDD, (byte) 0xDE, (byte) 0xDF, - (byte) 0xE0, (byte) 0xE1, (byte) 0xE2, (byte) 0xE3, 
(byte) 0xE4, (byte) 0xE5, (byte) 0xE6, (byte) 0xE7, - (byte) 0xE8, (byte) 0xE9, (byte) 0xEA, (byte) 0xEB, (byte) 0xEC, (byte) 0xED, (byte) 0xEE, (byte) 0xEF, - (byte) 0x20, (byte) 0x20, (byte) 0x20, (byte) 0x20, (byte) 0xF4, (byte) 0x20, (byte) 0x20, (byte) 0x20, - (byte) 0x20, (byte) 0xF9, (byte) 0x20, (byte) 0xFB, (byte) 0xFC, (byte) 0x20, (byte) 0x20, (byte) 0xFF, - }; - - public String getName() { - return "windows-1256"; - } - - public String getLanguage() { + (byte) 0x20, (byte) 0x20, (byte) 0x20, (byte) 0x20, (byte) 0x20, (byte) 0x20, (byte) 0x20, (byte) 0x20, + (byte) 0x20, (byte) 0x20, (byte) 0x20, (byte) 0x20, (byte) 0x20, (byte) 0x20, (byte) 0x20, (byte) 0x20, + (byte) 0x20, (byte) 0x20, (byte) 0x20, (byte) 0x20, (byte) 0x20, (byte) 0x20, (byte) 0x20, (byte) 0x20, + (byte) 0x20, (byte) 0x20, (byte) 0x20, (byte) 0x20, (byte) 0x20, (byte) 0x20, (byte) 0x20, (byte) 0x20, + (byte) 0x20, (byte) 0x20, (byte) 0x20, (byte) 0x20, (byte) 0x20, (byte) 0x20, (byte) 0x20, (byte) 0x00, + (byte) 0x20, (byte) 0x20, (byte) 0x20, (byte) 0x20, (byte) 0x20, (byte) 0x20, (byte) 0x20, (byte) 0x20, + (byte) 0x20, (byte) 0x20, (byte) 0x20, (byte) 0x20, (byte) 0x20, (byte) 0x20, (byte) 0x20, (byte) 0x20, + (byte) 0x20, (byte) 0x20, (byte) 0x20, (byte) 0x20, (byte) 0x20, (byte) 0x20, (byte) 0x20, (byte) 0x20, + (byte) 0x20, (byte) 0x61, (byte) 0x62, (byte) 0x63, (byte) 0x64, (byte) 0x65, (byte) 0x66, (byte) 0x67, + (byte) 0x68, (byte) 0x69, (byte) 0x6A, (byte) 0x6B, (byte) 0x6C, (byte) 0x6D, (byte) 0x6E, (byte) 0x6F, + (byte) 0x70, (byte) 0x71, (byte) 0x72, (byte) 0x73, (byte) 0x74, (byte) 0x75, (byte) 0x76, (byte) 0x77, + (byte) 0x78, (byte) 0x79, (byte) 0x7A, (byte) 0x20, (byte) 0x20, (byte) 0x20, (byte) 0x20, (byte) 0x20, + (byte) 0x20, (byte) 0x61, (byte) 0x62, (byte) 0x63, (byte) 0x64, (byte) 0x65, (byte) 0x66, (byte) 0x67, + (byte) 0x68, (byte) 0x69, (byte) 0x6A, (byte) 0x6B, (byte) 0x6C, (byte) 0x6D, (byte) 0x6E, (byte) 0x6F, + (byte) 0x70, (byte) 0x71, (byte) 0x72, (byte) 
0x73, (byte) 0x74, (byte) 0x75, (byte) 0x76, (byte) 0x77, + (byte) 0x78, (byte) 0x79, (byte) 0x7A, (byte) 0x20, (byte) 0x20, (byte) 0x20, (byte) 0x20, (byte) 0x20, + (byte) 0x20, (byte) 0x81, (byte) 0x20, (byte) 0x83, (byte) 0x20, (byte) 0x20, (byte) 0x20, (byte) 0x20, + (byte) 0x88, (byte) 0x20, (byte) 0x8A, (byte) 0x20, (byte) 0x9C, (byte) 0x8D, (byte) 0x8E, (byte) 0x8F, + (byte) 0x90, (byte) 0x20, (byte) 0x20, (byte) 0x20, (byte) 0x20, (byte) 0x20, (byte) 0x20, (byte) 0x20, + (byte) 0x98, (byte) 0x20, (byte) 0x9A, (byte) 0x20, (byte) 0x9C, (byte) 0x20, (byte) 0x20, (byte) 0x9F, + (byte) 0x20, (byte) 0x20, (byte) 0x20, (byte) 0x20, (byte) 0x20, (byte) 0x20, (byte) 0x20, (byte) 0x20, + (byte) 0x20, (byte) 0x20, (byte) 0xAA, (byte) 0x20, (byte) 0x20, (byte) 0x20, (byte) 0x20, (byte) 0x20, + (byte) 0x20, (byte) 0x20, (byte) 0x20, (byte) 0x20, (byte) 0x20, (byte) 0xB5, (byte) 0x20, (byte) 0x20, + (byte) 0x20, (byte) 0x20, (byte) 0x20, (byte) 0x20, (byte) 0x20, (byte) 0x20, (byte) 0x20, (byte) 0x20, + (byte) 0xC0, (byte) 0xC1, (byte) 0xC2, (byte) 0xC3, (byte) 0xC4, (byte) 0xC5, (byte) 0xC6, (byte) 0xC7, + (byte) 0xC8, (byte) 0xC9, (byte) 0xCA, (byte) 0xCB, (byte) 0xCC, (byte) 0xCD, (byte) 0xCE, (byte) 0xCF, + (byte) 0xD0, (byte) 0xD1, (byte) 0xD2, (byte) 0xD3, (byte) 0xD4, (byte) 0xD5, (byte) 0xD6, (byte) 0x20, + (byte) 0xD8, (byte) 0xD9, (byte) 0xDA, (byte) 0xDB, (byte) 0xDC, (byte) 0xDD, (byte) 0xDE, (byte) 0xDF, + (byte) 0xE0, (byte) 0xE1, (byte) 0xE2, (byte) 0xE3, (byte) 0xE4, (byte) 0xE5, (byte) 0xE6, (byte) 0xE7, + (byte) 0xE8, (byte) 0xE9, (byte) 0xEA, (byte) 0xEB, (byte) 0xEC, (byte) 0xED, (byte) 0xEE, (byte) 0xEF, + (byte) 0x20, (byte) 0x20, (byte) 0x20, (byte) 0x20, (byte) 0xF4, (byte) 0x20, (byte) 0x20, (byte) 0x20, + (byte) 0x20, (byte) 0xF9, (byte) 0x20, (byte) 0xFB, (byte) 0xFC, (byte) 0x20, (byte) 0x20, (byte) 0xFF, + }; + + public String getName() + { + return "windows-1256"; + } + + public String getLanguage() + { return "ar"; } - - public int 
match(CharsetDetector det) { - return match(det, ngrams, byteMap); - } - } - - static class CharsetRecog_KOI8_R extends CharsetRecog_sbcs { - private static int[] ngrams = { - 0x20C4CF, 0x20C920, 0x20CBC1, 0x20CBCF, 0x20CEC1, 0x20CEC5, 0x20CFC2, 0x20D0CF, 0x20D0D2, 0x20D2C1, 0x20D3CF, 0x20D3D4, 0x20D4CF, 0x20D720, 0x20D7CF, 0x20DAC1, - 0x20DCD4, 0x20DED4, 0xC1CEC9, 0xC1D4D8, 0xC5CCD8, 0xC5CEC9, 0xC5D3D4, 0xC5D420, 0xC7CF20, 0xC920D0, 0xC9C520, 0xC9C920, 0xC9D120, 0xCCC5CE, 0xCCC920, 0xCCD8CE, - 0xCEC120, 0xCEC520, 0xCEC9C5, 0xCEC9D1, 0xCECF20, 0xCECFD7, 0xCF20D0, 0xCF20D3, 0xCF20D7, 0xCFC7CF, 0xCFCA20, 0xCFCCD8, 0xCFCD20, 0xCFD3D4, 0xCFD720, 0xCFD7C1, - 0xD0CFCC, 0xD0D2C5, 0xD0D2C9, 0xD0D2CF, 0xD2C1D7, 0xD2C5C4, 0xD3D120, 0xD3D4C1, 0xD3D4C9, 0xD3D4D7, 0xD4C5CC, 0xD4CF20, 0xD4CFD2, 0xD4D820, 0xD9C820, 0xDED4CF, + + public int match(CharsetDetector det) + { + return match(det, ngrams, byteMap); + } + } + + static class CharsetRecog_KOI8_R extends CharsetRecog_sbcs + { + private static int[] ngrams = { + 0x20C4CF, 0x20C920, 0x20CBC1, 0x20CBCF, 0x20CEC1, 0x20CEC5, 0x20CFC2, 0x20D0CF, 0x20D0D2, 0x20D2C1, 0x20D3CF, 0x20D3D4, 0x20D4CF, 0x20D720, 0x20D7CF, 0x20DAC1, + 0x20DCD4, 0x20DED4, 0xC1CEC9, 0xC1D4D8, 0xC5CCD8, 0xC5CEC9, 0xC5D3D4, 0xC5D420, 0xC7CF20, 0xC920D0, 0xC9C520, 0xC9C920, 0xC9D120, 0xCCC5CE, 0xCCC920, 0xCCD8CE, + 0xCEC120, 0xCEC520, 0xCEC9C5, 0xCEC9D1, 0xCECF20, 0xCECFD7, 0xCF20D0, 0xCF20D3, 0xCF20D7, 0xCFC7CF, 0xCFCA20, 0xCFCCD8, 0xCFCD20, 0xCFD3D4, 0xCFD720, 0xCFD7C1, + 0xD0CFCC, 0xD0D2C5, 0xD0D2C9, 0xD0D2CF, 0xD2C1D7, 0xD2C5C4, 0xD3D120, 0xD3D4C1, 0xD3D4C9, 0xD3D4D7, 0xD4C5CC, 0xD4CF20, 0xD4CFD2, 0xD4D820, 0xD9C820, 0xDED4CF, }; private static byte[] byteMap = { - (byte) 0x20, (byte) 0x20, (byte) 0x20, (byte) 0x20, (byte) 0x20, (byte) 0x20, (byte) 0x20, (byte) 0x20, - (byte) 0x20, (byte) 0x20, (byte) 0x20, (byte) 0x20, (byte) 0x20, (byte) 0x20, (byte) 0x20, (byte) 0x20, - (byte) 0x20, (byte) 0x20, (byte) 0x20, (byte) 0x20, (byte) 0x20, (byte) 0x20, (byte) 
0x20, (byte) 0x20, - (byte) 0x20, (byte) 0x20, (byte) 0x20, (byte) 0x20, (byte) 0x20, (byte) 0x20, (byte) 0x20, (byte) 0x20, - (byte) 0x20, (byte) 0x20, (byte) 0x20, (byte) 0x20, (byte) 0x20, (byte) 0x20, (byte) 0x20, (byte) 0x00, - (byte) 0x20, (byte) 0x20, (byte) 0x20, (byte) 0x20, (byte) 0x20, (byte) 0x20, (byte) 0x20, (byte) 0x20, - (byte) 0x20, (byte) 0x20, (byte) 0x20, (byte) 0x20, (byte) 0x20, (byte) 0x20, (byte) 0x20, (byte) 0x20, - (byte) 0x20, (byte) 0x20, (byte) 0x20, (byte) 0x20, (byte) 0x20, (byte) 0x20, (byte) 0x20, (byte) 0x20, - (byte) 0x20, (byte) 0x61, (byte) 0x62, (byte) 0x63, (byte) 0x64, (byte) 0x65, (byte) 0x66, (byte) 0x67, - (byte) 0x68, (byte) 0x69, (byte) 0x6A, (byte) 0x6B, (byte) 0x6C, (byte) 0x6D, (byte) 0x6E, (byte) 0x6F, - (byte) 0x70, (byte) 0x71, (byte) 0x72, (byte) 0x73, (byte) 0x74, (byte) 0x75, (byte) 0x76, (byte) 0x77, - (byte) 0x78, (byte) 0x79, (byte) 0x7A, (byte) 0x20, (byte) 0x20, (byte) 0x20, (byte) 0x20, (byte) 0x20, - (byte) 0x20, (byte) 0x61, (byte) 0x62, (byte) 0x63, (byte) 0x64, (byte) 0x65, (byte) 0x66, (byte) 0x67, - (byte) 0x68, (byte) 0x69, (byte) 0x6A, (byte) 0x6B, (byte) 0x6C, (byte) 0x6D, (byte) 0x6E, (byte) 0x6F, - (byte) 0x70, (byte) 0x71, (byte) 0x72, (byte) 0x73, (byte) 0x74, (byte) 0x75, (byte) 0x76, (byte) 0x77, - (byte) 0x78, (byte) 0x79, (byte) 0x7A, (byte) 0x20, (byte) 0x20, (byte) 0x20, (byte) 0x20, (byte) 0x20, - (byte) 0x20, (byte) 0x20, (byte) 0x20, (byte) 0x20, (byte) 0x20, (byte) 0x20, (byte) 0x20, (byte) 0x20, - (byte) 0x20, (byte) 0x20, (byte) 0x20, (byte) 0x20, (byte) 0x20, (byte) 0x20, (byte) 0x20, (byte) 0x20, - (byte) 0x20, (byte) 0x20, (byte) 0x20, (byte) 0x20, (byte) 0x20, (byte) 0x20, (byte) 0x20, (byte) 0x20, - (byte) 0x20, (byte) 0x20, (byte) 0x20, (byte) 0x20, (byte) 0x20, (byte) 0x20, (byte) 0x20, (byte) 0x20, - (byte) 0x20, (byte) 0x20, (byte) 0x20, (byte) 0xA3, (byte) 0x20, (byte) 0x20, (byte) 0x20, (byte) 0x20, - (byte) 0x20, (byte) 0x20, (byte) 0x20, (byte) 0x20, (byte) 0x20, 
(byte) 0x20, (byte) 0x20, (byte) 0x20, - (byte) 0x20, (byte) 0x20, (byte) 0x20, (byte) 0xA3, (byte) 0x20, (byte) 0x20, (byte) 0x20, (byte) 0x20, - (byte) 0x20, (byte) 0x20, (byte) 0x20, (byte) 0x20, (byte) 0x20, (byte) 0x20, (byte) 0x20, (byte) 0x20, - (byte) 0xC0, (byte) 0xC1, (byte) 0xC2, (byte) 0xC3, (byte) 0xC4, (byte) 0xC5, (byte) 0xC6, (byte) 0xC7, - (byte) 0xC8, (byte) 0xC9, (byte) 0xCA, (byte) 0xCB, (byte) 0xCC, (byte) 0xCD, (byte) 0xCE, (byte) 0xCF, - (byte) 0xD0, (byte) 0xD1, (byte) 0xD2, (byte) 0xD3, (byte) 0xD4, (byte) 0xD5, (byte) 0xD6, (byte) 0xD7, - (byte) 0xD8, (byte) 0xD9, (byte) 0xDA, (byte) 0xDB, (byte) 0xDC, (byte) 0xDD, (byte) 0xDE, (byte) 0xDF, - (byte) 0xC0, (byte) 0xC1, (byte) 0xC2, (byte) 0xC3, (byte) 0xC4, (byte) 0xC5, (byte) 0xC6, (byte) 0xC7, - (byte) 0xC8, (byte) 0xC9, (byte) 0xCA, (byte) 0xCB, (byte) 0xCC, (byte) 0xCD, (byte) 0xCE, (byte) 0xCF, - (byte) 0xD0, (byte) 0xD1, (byte) 0xD2, (byte) 0xD3, (byte) 0xD4, (byte) 0xD5, (byte) 0xD6, (byte) 0xD7, - (byte) 0xD8, (byte) 0xD9, (byte) 0xDA, (byte) 0xDB, (byte) 0xDC, (byte) 0xDD, (byte) 0xDE, (byte) 0xDF, - }; - - public String getName() { - return "KOI8-R"; - } - - public String getLanguage() { + (byte) 0x20, (byte) 0x20, (byte) 0x20, (byte) 0x20, (byte) 0x20, (byte) 0x20, (byte) 0x20, (byte) 0x20, + (byte) 0x20, (byte) 0x20, (byte) 0x20, (byte) 0x20, (byte) 0x20, (byte) 0x20, (byte) 0x20, (byte) 0x20, + (byte) 0x20, (byte) 0x20, (byte) 0x20, (byte) 0x20, (byte) 0x20, (byte) 0x20, (byte) 0x20, (byte) 0x20, + (byte) 0x20, (byte) 0x20, (byte) 0x20, (byte) 0x20, (byte) 0x20, (byte) 0x20, (byte) 0x20, (byte) 0x20, + (byte) 0x20, (byte) 0x20, (byte) 0x20, (byte) 0x20, (byte) 0x20, (byte) 0x20, (byte) 0x20, (byte) 0x00, + (byte) 0x20, (byte) 0x20, (byte) 0x20, (byte) 0x20, (byte) 0x20, (byte) 0x20, (byte) 0x20, (byte) 0x20, + (byte) 0x20, (byte) 0x20, (byte) 0x20, (byte) 0x20, (byte) 0x20, (byte) 0x20, (byte) 0x20, (byte) 0x20, + (byte) 0x20, (byte) 0x20, (byte) 0x20, (byte) 0x20, (byte) 0x20, 
(byte) 0x20, (byte) 0x20, (byte) 0x20, + (byte) 0x20, (byte) 0x61, (byte) 0x62, (byte) 0x63, (byte) 0x64, (byte) 0x65, (byte) 0x66, (byte) 0x67, + (byte) 0x68, (byte) 0x69, (byte) 0x6A, (byte) 0x6B, (byte) 0x6C, (byte) 0x6D, (byte) 0x6E, (byte) 0x6F, + (byte) 0x70, (byte) 0x71, (byte) 0x72, (byte) 0x73, (byte) 0x74, (byte) 0x75, (byte) 0x76, (byte) 0x77, + (byte) 0x78, (byte) 0x79, (byte) 0x7A, (byte) 0x20, (byte) 0x20, (byte) 0x20, (byte) 0x20, (byte) 0x20, + (byte) 0x20, (byte) 0x61, (byte) 0x62, (byte) 0x63, (byte) 0x64, (byte) 0x65, (byte) 0x66, (byte) 0x67, + (byte) 0x68, (byte) 0x69, (byte) 0x6A, (byte) 0x6B, (byte) 0x6C, (byte) 0x6D, (byte) 0x6E, (byte) 0x6F, + (byte) 0x70, (byte) 0x71, (byte) 0x72, (byte) 0x73, (byte) 0x74, (byte) 0x75, (byte) 0x76, (byte) 0x77, + (byte) 0x78, (byte) 0x79, (byte) 0x7A, (byte) 0x20, (byte) 0x20, (byte) 0x20, (byte) 0x20, (byte) 0x20, + (byte) 0x20, (byte) 0x20, (byte) 0x20, (byte) 0x20, (byte) 0x20, (byte) 0x20, (byte) 0x20, (byte) 0x20, + (byte) 0x20, (byte) 0x20, (byte) 0x20, (byte) 0x20, (byte) 0x20, (byte) 0x20, (byte) 0x20, (byte) 0x20, + (byte) 0x20, (byte) 0x20, (byte) 0x20, (byte) 0x20, (byte) 0x20, (byte) 0x20, (byte) 0x20, (byte) 0x20, + (byte) 0x20, (byte) 0x20, (byte) 0x20, (byte) 0x20, (byte) 0x20, (byte) 0x20, (byte) 0x20, (byte) 0x20, + (byte) 0x20, (byte) 0x20, (byte) 0x20, (byte) 0xA3, (byte) 0x20, (byte) 0x20, (byte) 0x20, (byte) 0x20, + (byte) 0x20, (byte) 0x20, (byte) 0x20, (byte) 0x20, (byte) 0x20, (byte) 0x20, (byte) 0x20, (byte) 0x20, + (byte) 0x20, (byte) 0x20, (byte) 0x20, (byte) 0xA3, (byte) 0x20, (byte) 0x20, (byte) 0x20, (byte) 0x20, + (byte) 0x20, (byte) 0x20, (byte) 0x20, (byte) 0x20, (byte) 0x20, (byte) 0x20, (byte) 0x20, (byte) 0x20, + (byte) 0xC0, (byte) 0xC1, (byte) 0xC2, (byte) 0xC3, (byte) 0xC4, (byte) 0xC5, (byte) 0xC6, (byte) 0xC7, + (byte) 0xC8, (byte) 0xC9, (byte) 0xCA, (byte) 0xCB, (byte) 0xCC, (byte) 0xCD, (byte) 0xCE, (byte) 0xCF, + (byte) 0xD0, (byte) 0xD1, (byte) 0xD2, (byte) 
0xD3, (byte) 0xD4, (byte) 0xD5, (byte) 0xD6, (byte) 0xD7, + (byte) 0xD8, (byte) 0xD9, (byte) 0xDA, (byte) 0xDB, (byte) 0xDC, (byte) 0xDD, (byte) 0xDE, (byte) 0xDF, + (byte) 0xC0, (byte) 0xC1, (byte) 0xC2, (byte) 0xC3, (byte) 0xC4, (byte) 0xC5, (byte) 0xC6, (byte) 0xC7, + (byte) 0xC8, (byte) 0xC9, (byte) 0xCA, (byte) 0xCB, (byte) 0xCC, (byte) 0xCD, (byte) 0xCE, (byte) 0xCF, + (byte) 0xD0, (byte) 0xD1, (byte) 0xD2, (byte) 0xD3, (byte) 0xD4, (byte) 0xD5, (byte) 0xD6, (byte) 0xD7, + (byte) 0xD8, (byte) 0xD9, (byte) 0xDA, (byte) 0xDB, (byte) 0xDC, (byte) 0xDD, (byte) 0xDE, (byte) 0xDF, + }; + + public String getName() + { + return "KOI8-R"; + } + + public String getLanguage() + { return "ru"; } - - public int match(CharsetDetector det) { - return match(det, ngrams, byteMap); - } - } - - abstract static class CharsetRecog_IBM424_he extends CharsetRecog_sbcs { + + public int match(CharsetDetector det) + { + return match(det, ngrams, byteMap); + } + } + + abstract static class CharsetRecog_IBM424_he extends CharsetRecog_sbcs + { protected static byte[] byteMap = { /* -0 -1 -2 -3 -4 -5 -6 -7 -8 -9 -A -B -C -D -E -F */ /* 0- */ (byte) 0x40, (byte) 0x40, (byte) 0x40, (byte) 0x40, (byte) 0x40, (byte) 0x40, (byte) 0x40, (byte) 0x40, (byte) 0x40, (byte) 0x40, (byte) 0x40, (byte) 0x40, (byte) 0x40, (byte) 0x40, (byte) 0x40, (byte) 0x40, @@ -1061,102 +1162,107 @@ /* C- */ (byte) 0x40, (byte) 0x81, (byte) 0x82, (byte) 0x83, (byte) 0x84, (byte) 0x85, (byte) 0x86, (byte) 0x87, (byte) 0x88, (byte) 0x89, (byte) 0x40, (byte) 0x40, (byte) 0x40, (byte) 0x40, (byte) 0x40, (byte) 0x40, /* D- */ (byte) 0x40, (byte) 0x91, (byte) 0x92, (byte) 0x93, (byte) 0x94, (byte) 0x95, (byte) 0x96, (byte) 0x97, (byte) 0x98, (byte) 0x99, (byte) 0x40, (byte) 0x40, (byte) 0x40, (byte) 0x40, (byte) 0x40, (byte) 0x40, /* E- */ (byte) 0x40, (byte) 0x40, (byte) 0xA2, (byte) 0xA3, (byte) 0xA4, (byte) 0xA5, (byte) 0xA6, (byte) 0xA7, (byte) 0xA8, (byte) 0xA9, (byte) 0x40, (byte) 0x40, (byte) 0x40, (byte) 0x40, 
(byte) 0x40, (byte) 0x40, -/* F- */ (byte) 0x40, (byte) 0x40, (byte) 0x40, (byte) 0x40, (byte) 0x40, (byte) 0x40, (byte) 0x40, (byte) 0x40, (byte) 0x40, (byte) 0x40, (byte) 0x40, (byte) 0x40, (byte) 0x40, (byte) 0x40, (byte) 0x40, (byte) 0x40, - }; - - public String getLanguage() { +/* F- */ (byte) 0x40, (byte) 0x40, (byte) 0x40, (byte) 0x40, (byte) 0x40, (byte) 0x40, (byte) 0x40, (byte) 0x40, (byte) 0x40, (byte) 0x40, (byte) 0x40, (byte) 0x40, (byte) 0x40, (byte) 0x40, (byte) 0x40, (byte) 0x40, + }; + + public String getLanguage() + { return "he"; } } - - static class CharsetRecog_IBM424_he_rtl extends CharsetRecog_IBM424_he { - private static int[] ngrams = { - 0x404146, 0x404148, 0x404151, 0x404171, 0x404251, 0x404256, 0x404541, 0x404546, 0x404551, 0x404556, 0x404562, 0x404569, 0x404571, 0x405441, 0x405445, 0x405641, - 0x406254, 0x406954, 0x417140, 0x454041, 0x454042, 0x454045, 0x454054, 0x454056, 0x454069, 0x454641, 0x464140, 0x465540, 0x465740, 0x466840, 0x467140, 0x514045, - 0x514540, 0x514671, 0x515155, 0x515540, 0x515740, 0x516840, 0x517140, 0x544041, 0x544045, 0x544140, 0x544540, 0x554041, 0x554042, 0x554045, 0x554054, 0x554056, - 0x554069, 0x564540, 0x574045, 0x584540, 0x585140, 0x585155, 0x625440, 0x684045, 0x685155, 0x695440, 0x714041, 0x714042, 0x714045, 0x714054, 0x714056, 0x714069, - }; - - public String getName() { + static class CharsetRecog_IBM424_he_rtl extends CharsetRecog_IBM424_he + { + public String getName() + { return "IBM424_rtl"; } - - public int match(CharsetDetector det) { - return match(det, ngrams, byteMap, (byte) 0x40); - } - } - - static class CharsetRecog_IBM424_he_ltr extends CharsetRecog_IBM424_he { - private static int[] ngrams = { - 0x404146, 0x404154, 0x404551, 0x404554, 0x404556, 0x404558, 0x405158, 0x405462, 0x405469, 0x405546, 0x405551, 0x405746, 0x405751, 0x406846, 0x406851, 0x407141, - 0x407146, 0x407151, 0x414045, 0x414054, 0x414055, 0x414071, 0x414540, 0x414645, 0x415440, 0x415640, 0x424045, 0x424055, 0x424071, 
0x454045, 0x454051, 0x454054, - 0x454055, 0x454057, 0x454068, 0x454071, 0x455440, 0x464140, 0x464540, 0x484140, 0x514140, 0x514240, 0x514540, 0x544045, 0x544055, 0x544071, 0x546240, 0x546940, - 0x555151, 0x555158, 0x555168, 0x564045, 0x564055, 0x564071, 0x564240, 0x564540, 0x624540, 0x694045, 0x694055, 0x694071, 0x694540, 0x714140, 0x714540, 0x714651 - - }; - - public String getName() { + private static int[] ngrams = { + 0x404146, 0x404148, 0x404151, 0x404171, 0x404251, 0x404256, 0x404541, 0x404546, 0x404551, 0x404556, 0x404562, 0x404569, 0x404571, 0x405441, 0x405445, 0x405641, + 0x406254, 0x406954, 0x417140, 0x454041, 0x454042, 0x454045, 0x454054, 0x454056, 0x454069, 0x454641, 0x464140, 0x465540, 0x465740, 0x466840, 0x467140, 0x514045, + 0x514540, 0x514671, 0x515155, 0x515540, 0x515740, 0x516840, 0x517140, 0x544041, 0x544045, 0x544140, 0x544540, 0x554041, 0x554042, 0x554045, 0x554054, 0x554056, + 0x554069, 0x564540, 0x574045, 0x584540, 0x585140, 0x585155, 0x625440, 0x684045, 0x685155, 0x695440, 0x714041, 0x714042, 0x714045, 0x714054, 0x714056, 0x714069, + }; + public int match(CharsetDetector det) + { + return match(det, ngrams, byteMap, (byte)0x40); + } + } + static class CharsetRecog_IBM424_he_ltr extends CharsetRecog_IBM424_he + { + public String getName() + { return "IBM424_ltr"; } - - public int match(CharsetDetector det) { - return match(det, ngrams, byteMap, (byte) 0x40); - } - } - - abstract static class CharsetRecog_IBM420_ar extends CharsetRecog_sbcs { - protected static byte[] byteMap = { -/* -0 -1 -2 -3 -4 -5 -6 -7 -8 -9 -A -B -C -D -E -F */ -/* 0- */ (byte) 0x40, (byte) 0x40, (byte) 0x40, (byte) 0x40, (byte) 0x40, (byte) 0x40, (byte) 0x40, (byte) 0x40, (byte) 0x40, (byte) 0x40, (byte) 0x40, (byte) 0x40, (byte) 0x40, (byte) 0x40, (byte) 0x40, (byte) 0x40, -/* 1- */ (byte) 0x40, (byte) 0x40, (byte) 0x40, (byte) 0x40, (byte) 0x40, (byte) 0x40, (byte) 0x40, (byte) 0x40, (byte) 0x40, (byte) 0x40, (byte) 0x40, (byte) 0x40, (byte) 0x40, (byte) 0x40, (byte) 
0x40, (byte) 0x40, -/* 2- */ (byte) 0x40, (byte) 0x40, (byte) 0x40, (byte) 0x40, (byte) 0x40, (byte) 0x40, (byte) 0x40, (byte) 0x40, (byte) 0x40, (byte) 0x40, (byte) 0x40, (byte) 0x40, (byte) 0x40, (byte) 0x40, (byte) 0x40, (byte) 0x40, -/* 3- */ (byte) 0x40, (byte) 0x40, (byte) 0x40, (byte) 0x40, (byte) 0x40, (byte) 0x40, (byte) 0x40, (byte) 0x40, (byte) 0x40, (byte) 0x40, (byte) 0x40, (byte) 0x40, (byte) 0x40, (byte) 0x40, (byte) 0x40, (byte) 0x40, -/* 4- */ (byte) 0x40, (byte) 0x40, (byte) 0x42, (byte) 0x43, (byte) 0x44, (byte) 0x45, (byte) 0x46, (byte) 0x47, (byte) 0x48, (byte) 0x49, (byte) 0x40, (byte) 0x40, (byte) 0x40, (byte) 0x40, (byte) 0x40, (byte) 0x40, -/* 5- */ (byte) 0x40, (byte) 0x51, (byte) 0x52, (byte) 0x40, (byte) 0x40, (byte) 0x55, (byte) 0x56, (byte) 0x57, (byte) 0x58, (byte) 0x59, (byte) 0x40, (byte) 0x40, (byte) 0x40, (byte) 0x40, (byte) 0x40, (byte) 0x40, -/* 6- */ (byte) 0x40, (byte) 0x40, (byte) 0x62, (byte) 0x63, (byte) 0x64, (byte) 0x65, (byte) 0x66, (byte) 0x67, (byte) 0x68, (byte) 0x69, (byte) 0x40, (byte) 0x40, (byte) 0x40, (byte) 0x40, (byte) 0x40, (byte) 0x40, -/* 7- */ (byte) 0x70, (byte) 0x71, (byte) 0x72, (byte) 0x73, (byte) 0x74, (byte) 0x75, (byte) 0x76, (byte) 0x77, (byte) 0x78, (byte) 0x79, (byte) 0x40, (byte) 0x40, (byte) 0x40, (byte) 0x40, (byte) 0x40, (byte) 0x40, -/* 8- */ (byte) 0x80, (byte) 0x81, (byte) 0x82, (byte) 0x83, (byte) 0x84, (byte) 0x85, (byte) 0x86, (byte) 0x87, (byte) 0x88, (byte) 0x89, (byte) 0x8A, (byte) 0x8B, (byte) 0x8C, (byte) 0x8D, (byte) 0x8E, (byte) 0x8F, -/* 9- */ (byte) 0x90, (byte) 0x91, (byte) 0x92, (byte) 0x93, (byte) 0x94, (byte) 0x95, (byte) 0x96, (byte) 0x97, (byte) 0x98, (byte) 0x99, (byte) 0x9A, (byte) 0x9B, (byte) 0x9C, (byte) 0x9D, (byte) 0x9E, (byte) 0x9F, -/* A- */ (byte) 0xA0, (byte) 0x40, (byte) 0xA2, (byte) 0xA3, (byte) 0xA4, (byte) 0xA5, (byte) 0xA6, (byte) 0xA7, (byte) 0xA8, (byte) 0xA9, (byte) 0xAA, (byte) 0xAB, (byte) 0xAC, (byte) 0xAD, (byte) 0xAE, (byte) 0xAF, -/* B- */ (byte) 
0xB0, (byte) 0xB1, (byte) 0xB2, (byte) 0xB3, (byte) 0xB4, (byte) 0xB5, (byte) 0x40, (byte) 0x40, (byte) 0xB8, (byte) 0xB9, (byte) 0xBA, (byte) 0xBB, (byte) 0xBC, (byte) 0xBD, (byte) 0xBE, (byte) 0xBF, -/* C- */ (byte) 0x40, (byte) 0x81, (byte) 0x82, (byte) 0x83, (byte) 0x84, (byte) 0x85, (byte) 0x86, (byte) 0x87, (byte) 0x88, (byte) 0x89, (byte) 0x40, (byte) 0xCB, (byte) 0x40, (byte) 0xCD, (byte) 0x40, (byte) 0xCF, -/* D- */ (byte) 0x40, (byte) 0x91, (byte) 0x92, (byte) 0x93, (byte) 0x94, (byte) 0x95, (byte) 0x96, (byte) 0x97, (byte) 0x98, (byte) 0x99, (byte) 0xDA, (byte) 0xDB, (byte) 0xDC, (byte) 0xDD, (byte) 0xDE, (byte) 0xDF, -/* E- */ (byte) 0x40, (byte) 0x40, (byte) 0xA2, (byte) 0xA3, (byte) 0xA4, (byte) 0xA5, (byte) 0xA6, (byte) 0xA7, (byte) 0xA8, (byte) 0xA9, (byte) 0xEA, (byte) 0xEB, (byte) 0x40, (byte) 0xED, (byte) 0xEE, (byte) 0xEF, -/* F- */ (byte) 0x40, (byte) 0x40, (byte) 0x40, (byte) 0x40, (byte) 0x40, (byte) 0x40, (byte) 0x40, (byte) 0x40, (byte) 0x40, (byte) 0x40, (byte) 0x40, (byte) 0xFB, (byte) 0xFC, (byte) 0xFD, (byte) 0xFE, (byte) 0x40, - }; - protected static byte[] unshapeMap = { -/* -0 -1 -2 -3 -4 -5 -6 -7 -8 -9 -A -B -C -D -E -F */ -/* 0- */ (byte) 0x40, (byte) 0x40, (byte) 0x40, (byte) 0x40, (byte) 0x40, (byte) 0x40, (byte) 0x40, (byte) 0x40, (byte) 0x40, (byte) 0x40, (byte) 0x40, (byte) 0x40, (byte) 0x40, (byte) 0x40, (byte) 0x40, (byte) 0x40, -/* 1- */ (byte) 0x40, (byte) 0x40, (byte) 0x40, (byte) 0x40, (byte) 0x40, (byte) 0x40, (byte) 0x40, (byte) 0x40, (byte) 0x40, (byte) 0x40, (byte) 0x40, (byte) 0x40, (byte) 0x40, (byte) 0x40, (byte) 0x40, (byte) 0x40, -/* 2- */ (byte) 0x40, (byte) 0x40, (byte) 0x40, (byte) 0x40, (byte) 0x40, (byte) 0x40, (byte) 0x40, (byte) 0x40, (byte) 0x40, (byte) 0x40, (byte) 0x40, (byte) 0x40, (byte) 0x40, (byte) 0x40, (byte) 0x40, (byte) 0x40, -/* 3- */ (byte) 0x40, (byte) 0x40, (byte) 0x40, (byte) 0x40, (byte) 0x40, (byte) 0x40, (byte) 0x40, (byte) 0x40, (byte) 0x40, (byte) 0x40, (byte) 0x40, (byte) 0x40, 
(byte) 0x40, (byte) 0x40, (byte) 0x40, (byte) 0x40, -/* 4- */ (byte) 0x40, (byte) 0x40, (byte) 0x42, (byte) 0x42, (byte) 0x44, (byte) 0x45, (byte) 0x46, (byte) 0x47, (byte) 0x47, (byte) 0x49, (byte) 0x4A, (byte) 0x4B, (byte) 0x4C, (byte) 0x4D, (byte) 0x4E, (byte) 0x4F, -/* 5- */ (byte) 0x50, (byte) 0x49, (byte) 0x52, (byte) 0x53, (byte) 0x54, (byte) 0x55, (byte) 0x56, (byte) 0x56, (byte) 0x58, (byte) 0x58, (byte) 0x5A, (byte) 0x5B, (byte) 0x5C, (byte) 0x5D, (byte) 0x5E, (byte) 0x5F, -/* 6- */ (byte) 0x60, (byte) 0x61, (byte) 0x62, (byte) 0x63, (byte) 0x63, (byte) 0x65, (byte) 0x65, (byte) 0x67, (byte) 0x67, (byte) 0x69, (byte) 0x6A, (byte) 0x6B, (byte) 0x6C, (byte) 0x6D, (byte) 0x6E, (byte) 0x6F, -/* 7- */ (byte) 0x69, (byte) 0x71, (byte) 0x71, (byte) 0x73, (byte) 0x74, (byte) 0x75, (byte) 0x76, (byte) 0x77, (byte) 0x77, (byte) 0x79, (byte) 0x7A, (byte) 0x7B, (byte) 0x7C, (byte) 0x7D, (byte) 0x7E, (byte) 0x7F, -/* 8- */ (byte) 0x80, (byte) 0x81, (byte) 0x82, (byte) 0x83, (byte) 0x84, (byte) 0x85, (byte) 0x86, (byte) 0x87, (byte) 0x88, (byte) 0x89, (byte) 0x80, (byte) 0x8B, (byte) 0x8B, (byte) 0x8D, (byte) 0x8D, (byte) 0x8F, -/* 9- */ (byte) 0x90, (byte) 0x91, (byte) 0x92, (byte) 0x93, (byte) 0x94, (byte) 0x95, (byte) 0x96, (byte) 0x97, (byte) 0x98, (byte) 0x99, (byte) 0x9A, (byte) 0x9A, (byte) 0x9A, (byte) 0x9A, (byte) 0x9E, (byte) 0x9E, -/* A- */ (byte) 0x9E, (byte) 0xA1, (byte) 0xA2, (byte) 0xA3, (byte) 0xA4, (byte) 0xA5, (byte) 0xA6, (byte) 0xA7, (byte) 0xA8, (byte) 0xA9, (byte) 0x9E, (byte) 0xAB, (byte) 0xAB, (byte) 0xAD, (byte) 0xAD, (byte) 0xAF, -/* B- */ (byte) 0xAF, (byte) 0xB1, (byte) 0xB2, (byte) 0xB3, (byte) 0xB4, (byte) 0xB5, (byte) 0xB6, (byte) 0xB7, (byte) 0xB8, (byte) 0xB9, (byte) 0xB1, (byte) 0xBB, (byte) 0xBB, (byte) 0xBD, (byte) 0xBD, (byte) 0xBF, -/* C- */ (byte) 0xC0, (byte) 0xC1, (byte) 0xC2, (byte) 0xC3, (byte) 0xC4, (byte) 0xC5, (byte) 0xC6, (byte) 0xC7, (byte) 0xC8, (byte) 0xC9, (byte) 0xCA, (byte) 0xBF, (byte) 0xCC, (byte) 0xBF, (byte) 
0xCE, (byte) 0xCF, -/* D- */ (byte) 0xD0, (byte) 0xD1, (byte) 0xD2, (byte) 0xD3, (byte) 0xD4, (byte) 0xD5, (byte) 0xD6, (byte) 0xD7, (byte) 0xD8, (byte) 0xD9, (byte) 0xDA, (byte) 0xDA, (byte) 0xDC, (byte) 0xDC, (byte) 0xDC, (byte) 0xDF, -/* E- */ (byte) 0xE0, (byte) 0xE1, (byte) 0xE2, (byte) 0xE3, (byte) 0xE4, (byte) 0xE5, (byte) 0xE6, (byte) 0xE7, (byte) 0xE8, (byte) 0xE9, (byte) 0xEA, (byte) 0xEB, (byte) 0xEC, (byte) 0xED, (byte) 0xEE, (byte) 0xEF, -/* F- */ (byte) 0xF0, (byte) 0xF1, (byte) 0xF2, (byte) 0xF3, (byte) 0xF4, (byte) 0xF5, (byte) 0xF6, (byte) 0xF7, (byte) 0xF8, (byte) 0xF9, (byte) 0xFA, (byte) 0xFB, (byte) 0xFC, (byte) 0xFD, (byte) 0xFE, (byte) 0xFF, - }; + private static int[] ngrams = { + 0x404146, 0x404154, 0x404551, 0x404554, 0x404556, 0x404558, 0x405158, 0x405462, 0x405469, 0x405546, 0x405551, 0x405746, 0x405751, 0x406846, 0x406851, 0x407141, + 0x407146, 0x407151, 0x414045, 0x414054, 0x414055, 0x414071, 0x414540, 0x414645, 0x415440, 0x415640, 0x424045, 0x424055, 0x424071, 0x454045, 0x454051, 0x454054, + 0x454055, 0x454057, 0x454068, 0x454071, 0x455440, 0x464140, 0x464540, 0x484140, 0x514140, 0x514240, 0x514540, 0x544045, 0x544055, 0x544071, 0x546240, 0x546940, + 0x555151, 0x555158, 0x555168, 0x564045, 0x564055, 0x564071, 0x564240, 0x564540, 0x624540, 0x694045, 0x694055, 0x694071, 0x694540, 0x714140, 0x714540, 0x714651 + + }; + public int match(CharsetDetector det) + { + return match(det, ngrams, byteMap, (byte)0x40); + } + } + + abstract static class CharsetRecog_IBM420_ar extends CharsetRecog_sbcs + { //arabic shaping class, method shape/unshape //protected static ArabicShaping as = new ArabicShaping(ArabicShaping.LETTERS_UNSHAPE); protected byte[] prev_fInputBytes = null; - public String getLanguage() { + protected static byte[] byteMap = { +/* -0 -1 -2 -3 -4 -5 -6 -7 -8 -9 -A -B -C -D -E -F */ +/* 0- */ (byte) 0x40, (byte) 0x40, (byte) 0x40, (byte) 0x40, (byte) 0x40, (byte) 0x40, (byte) 0x40, (byte) 0x40, (byte) 0x40, (byte) 0x40, (byte) 0x40, 
(byte) 0x40, (byte) 0x40, (byte) 0x40, (byte) 0x40, (byte) 0x40, +/* 1- */ (byte) 0x40, (byte) 0x40, (byte) 0x40, (byte) 0x40, (byte) 0x40, (byte) 0x40, (byte) 0x40, (byte) 0x40, (byte) 0x40, (byte) 0x40, (byte) 0x40, (byte) 0x40, (byte) 0x40, (byte) 0x40, (byte) 0x40, (byte) 0x40, +/* 2- */ (byte) 0x40, (byte) 0x40, (byte) 0x40, (byte) 0x40, (byte) 0x40, (byte) 0x40, (byte) 0x40, (byte) 0x40, (byte) 0x40, (byte) 0x40, (byte) 0x40, (byte) 0x40, (byte) 0x40, (byte) 0x40, (byte) 0x40, (byte) 0x40, +/* 3- */ (byte) 0x40, (byte) 0x40, (byte) 0x40, (byte) 0x40, (byte) 0x40, (byte) 0x40, (byte) 0x40, (byte) 0x40, (byte) 0x40, (byte) 0x40, (byte) 0x40, (byte) 0x40, (byte) 0x40, (byte) 0x40, (byte) 0x40, (byte) 0x40, +/* 4- */ (byte) 0x40, (byte) 0x40, (byte) 0x42, (byte) 0x43, (byte) 0x44, (byte) 0x45, (byte) 0x46, (byte) 0x47, (byte) 0x48, (byte) 0x49, (byte) 0x40, (byte) 0x40, (byte) 0x40, (byte) 0x40, (byte) 0x40, (byte) 0x40, +/* 5- */ (byte) 0x40, (byte) 0x51, (byte) 0x52, (byte) 0x40, (byte) 0x40, (byte) 0x55, (byte) 0x56, (byte) 0x57, (byte) 0x58, (byte) 0x59, (byte) 0x40, (byte) 0x40, (byte) 0x40, (byte) 0x40, (byte) 0x40, (byte) 0x40, +/* 6- */ (byte) 0x40, (byte) 0x40, (byte) 0x62, (byte) 0x63, (byte) 0x64, (byte) 0x65, (byte) 0x66, (byte) 0x67, (byte) 0x68, (byte) 0x69, (byte) 0x40, (byte) 0x40, (byte) 0x40, (byte) 0x40, (byte) 0x40, (byte) 0x40, +/* 7- */ (byte) 0x70, (byte) 0x71, (byte) 0x72, (byte) 0x73, (byte) 0x74, (byte) 0x75, (byte) 0x76, (byte) 0x77, (byte) 0x78, (byte) 0x79, (byte) 0x40, (byte) 0x40, (byte) 0x40, (byte) 0x40, (byte) 0x40, (byte) 0x40, +/* 8- */ (byte) 0x80, (byte) 0x81, (byte) 0x82, (byte) 0x83, (byte) 0x84, (byte) 0x85, (byte) 0x86, (byte) 0x87, (byte) 0x88, (byte) 0x89, (byte) 0x8A, (byte) 0x8B, (byte) 0x8C, (byte) 0x8D, (byte) 0x8E, (byte) 0x8F, +/* 9- */ (byte) 0x90, (byte) 0x91, (byte) 0x92, (byte) 0x93, (byte) 0x94, (byte) 0x95, (byte) 0x96, (byte) 0x97, (byte) 0x98, (byte) 0x99, (byte) 0x9A, (byte) 0x9B, (byte) 0x9C, (byte) 
0x9D, (byte) 0x9E, (byte) 0x9F, +/* A- */ (byte) 0xA0, (byte) 0x40, (byte) 0xA2, (byte) 0xA3, (byte) 0xA4, (byte) 0xA5, (byte) 0xA6, (byte) 0xA7, (byte) 0xA8, (byte) 0xA9, (byte) 0xAA, (byte) 0xAB, (byte) 0xAC, (byte) 0xAD, (byte) 0xAE, (byte) 0xAF, +/* B- */ (byte) 0xB0, (byte) 0xB1, (byte) 0xB2, (byte) 0xB3, (byte) 0xB4, (byte) 0xB5, (byte) 0x40, (byte) 0x40, (byte) 0xB8, (byte) 0xB9, (byte) 0xBA, (byte) 0xBB, (byte) 0xBC, (byte) 0xBD, (byte) 0xBE, (byte) 0xBF, +/* C- */ (byte) 0x40, (byte) 0x81, (byte) 0x82, (byte) 0x83, (byte) 0x84, (byte) 0x85, (byte) 0x86, (byte) 0x87, (byte) 0x88, (byte) 0x89, (byte) 0x40, (byte) 0xCB, (byte) 0x40, (byte) 0xCD, (byte) 0x40, (byte) 0xCF, +/* D- */ (byte) 0x40, (byte) 0x91, (byte) 0x92, (byte) 0x93, (byte) 0x94, (byte) 0x95, (byte) 0x96, (byte) 0x97, (byte) 0x98, (byte) 0x99, (byte) 0xDA, (byte) 0xDB, (byte) 0xDC, (byte) 0xDD, (byte) 0xDE, (byte) 0xDF, +/* E- */ (byte) 0x40, (byte) 0x40, (byte) 0xA2, (byte) 0xA3, (byte) 0xA4, (byte) 0xA5, (byte) 0xA6, (byte) 0xA7, (byte) 0xA8, (byte) 0xA9, (byte) 0xEA, (byte) 0xEB, (byte) 0x40, (byte) 0xED, (byte) 0xEE, (byte) 0xEF, +/* F- */ (byte) 0x40, (byte) 0x40, (byte) 0x40, (byte) 0x40, (byte) 0x40, (byte) 0x40, (byte) 0x40, (byte) 0x40, (byte) 0x40, (byte) 0x40, (byte) 0x40, (byte) 0xFB, (byte) 0xFC, (byte) 0xFD, (byte) 0xFE, (byte) 0x40, + }; + + protected static byte[] unshapeMap = { +/* -0 -1 -2 -3 -4 -5 -6 -7 -8 -9 -A -B -C -D -E -F */ +/* 0- */ (byte) 0x40, (byte) 0x40, (byte) 0x40, (byte) 0x40, (byte) 0x40, (byte) 0x40, (byte) 0x40, (byte) 0x40, (byte) 0x40, (byte) 0x40, (byte) 0x40, (byte) 0x40, (byte) 0x40, (byte) 0x40, (byte) 0x40, (byte) 0x40, +/* 1- */ (byte) 0x40, (byte) 0x40, (byte) 0x40, (byte) 0x40, (byte) 0x40, (byte) 0x40, (byte) 0x40, (byte) 0x40, (byte) 0x40, (byte) 0x40, (byte) 0x40, (byte) 0x40, (byte) 0x40, (byte) 0x40, (byte) 0x40, (byte) 0x40, +/* 2- */ (byte) 0x40, (byte) 0x40, (byte) 0x40, (byte) 0x40, (byte) 0x40, (byte) 0x40, (byte) 0x40, (byte) 0x40, (byte) 
0x40, (byte) 0x40, (byte) 0x40, (byte) 0x40, (byte) 0x40, (byte) 0x40, (byte) 0x40, (byte) 0x40, +/* 3- */ (byte) 0x40, (byte) 0x40, (byte) 0x40, (byte) 0x40, (byte) 0x40, (byte) 0x40, (byte) 0x40, (byte) 0x40, (byte) 0x40, (byte) 0x40, (byte) 0x40, (byte) 0x40, (byte) 0x40, (byte) 0x40, (byte) 0x40, (byte) 0x40, +/* 4- */ (byte) 0x40, (byte) 0x40, (byte) 0x42, (byte) 0x42, (byte) 0x44, (byte) 0x45, (byte) 0x46, (byte) 0x47, (byte) 0x47, (byte) 0x49, (byte) 0x4A, (byte) 0x4B, (byte) 0x4C, (byte) 0x4D, (byte) 0x4E, (byte) 0x4F, +/* 5- */ (byte) 0x50, (byte) 0x49, (byte) 0x52, (byte) 0x53, (byte) 0x54, (byte) 0x55, (byte) 0x56, (byte) 0x56, (byte) 0x58, (byte) 0x58, (byte) 0x5A, (byte) 0x5B, (byte) 0x5C, (byte) 0x5D, (byte) 0x5E, (byte) 0x5F, +/* 6- */ (byte) 0x60, (byte) 0x61, (byte) 0x62, (byte) 0x63, (byte) 0x63, (byte) 0x65, (byte) 0x65, (byte) 0x67, (byte) 0x67, (byte) 0x69, (byte) 0x6A, (byte) 0x6B, (byte) 0x6C, (byte) 0x6D, (byte) 0x6E, (byte) 0x6F, +/* 7- */ (byte) 0x69, (byte) 0x71, (byte) 0x71, (byte) 0x73, (byte) 0x74, (byte) 0x75, (byte) 0x76, (byte) 0x77, (byte) 0x77, (byte) 0x79, (byte) 0x7A, (byte) 0x7B, (byte) 0x7C, (byte) 0x7D, (byte) 0x7E, (byte) 0x7F, +/* 8- */ (byte) 0x80, (byte) 0x81, (byte) 0x82, (byte) 0x83, (byte) 0x84, (byte) 0x85, (byte) 0x86, (byte) 0x87, (byte) 0x88, (byte) 0x89, (byte) 0x80, (byte) 0x8B, (byte) 0x8B, (byte) 0x8D, (byte) 0x8D, (byte) 0x8F, +/* 9- */ (byte) 0x90, (byte) 0x91, (byte) 0x92, (byte) 0x93, (byte) 0x94, (byte) 0x95, (byte) 0x96, (byte) 0x97, (byte) 0x98, (byte) 0x99, (byte) 0x9A, (byte) 0x9A, (byte) 0x9A, (byte) 0x9A, (byte) 0x9E, (byte) 0x9E, +/* A- */ (byte) 0x9E, (byte) 0xA1, (byte) 0xA2, (byte) 0xA3, (byte) 0xA4, (byte) 0xA5, (byte) 0xA6, (byte) 0xA7, (byte) 0xA8, (byte) 0xA9, (byte) 0x9E, (byte) 0xAB, (byte) 0xAB, (byte) 0xAD, (byte) 0xAD, (byte) 0xAF, +/* B- */ (byte) 0xAF, (byte) 0xB1, (byte) 0xB2, (byte) 0xB3, (byte) 0xB4, (byte) 0xB5, (byte) 0xB6, (byte) 0xB7, (byte) 0xB8, (byte) 0xB9, (byte) 0xB1, 
(byte) 0xBB, (byte) 0xBB, (byte) 0xBD, (byte) 0xBD, (byte) 0xBF, +/* C- */ (byte) 0xC0, (byte) 0xC1, (byte) 0xC2, (byte) 0xC3, (byte) 0xC4, (byte) 0xC5, (byte) 0xC6, (byte) 0xC7, (byte) 0xC8, (byte) 0xC9, (byte) 0xCA, (byte) 0xBF, (byte) 0xCC, (byte) 0xBF, (byte) 0xCE, (byte) 0xCF, +/* D- */ (byte) 0xD0, (byte) 0xD1, (byte) 0xD2, (byte) 0xD3, (byte) 0xD4, (byte) 0xD5, (byte) 0xD6, (byte) 0xD7, (byte) 0xD8, (byte) 0xD9, (byte) 0xDA, (byte) 0xDA, (byte) 0xDC, (byte) 0xDC, (byte) 0xDC, (byte) 0xDF, +/* E- */ (byte) 0xE0, (byte) 0xE1, (byte) 0xE2, (byte) 0xE3, (byte) 0xE4, (byte) 0xE5, (byte) 0xE6, (byte) 0xE7, (byte) 0xE8, (byte) 0xE9, (byte) 0xEA, (byte) 0xEB, (byte) 0xEC, (byte) 0xED, (byte) 0xEE, (byte) 0xEF, +/* F- */ (byte) 0xF0, (byte) 0xF1, (byte) 0xF2, (byte) 0xF3, (byte) 0xF4, (byte) 0xF5, (byte) 0xF6, (byte) 0xF7, (byte) 0xF8, (byte) 0xF9, (byte) 0xFA, (byte) 0xFB, (byte) 0xFC, (byte) 0xFD, (byte) 0xFE, (byte) 0xFF, + }; + + public String getLanguage() + { return "ar"; } - - protected void matchInit(CharsetDetector det) { - prev_fInputBytes = det.fInputBytes.clone(); + protected void matchInit(CharsetDetector det) + { + prev_fInputBytes = (byte[])det.fInputBytes.clone(); byte bb[] = unshape(det.fInputBytes); det.setText(bb); } - + /* * Arabic shaping needs to be done manually. Cannot call ArabicShaping class * because CharsetDetector is dealing with bytes not Unicode code points. 
We could @@ -1166,187 +1272,204 @@ */ private byte[] unshape(byte[] inputBytes) { byte resultByteArr[] = unshapeLamAlef(inputBytes); - - for (int i = 0; i < inputBytes.length; i++) { - resultByteArr[i] = unshapeMap[resultByteArr[i] & 0xFF]; + + for (int i=0; i= (byte) 0xb2) && (b != (byte) 0xb6); - } - + return (b <= (byte)0xb8) && (b >= (byte)0xb2) && (b != (byte)0xb6); + } + protected void matchFinish(CharsetDetector det) { if (prev_fInputBytes != null) det.setText(prev_fInputBytes); } - - } - - static class CharsetRecog_IBM420_ar_rtl extends CharsetRecog_IBM420_ar { - private static int[] ngrams = { - 0x4056B1, 0x4056BD, 0x405856, 0x409AB1, 0x40ABDC, 0x40B1B1, 0x40BBBD, 0x40CF56, 0x564056, 0x564640, 0x566340, 0x567540, 0x56B140, 0x56B149, 0x56B156, 0x56B158, - 0x56B163, 0x56B167, 0x56B169, 0x56B173, 0x56B178, 0x56B19A, 0x56B1AD, 0x56B1BB, 0x56B1CF, 0x56B1DC, 0x56BB40, 0x56BD40, 0x56BD63, 0x584056, 0x624056, 0x6240AB, - 0x6240B1, 0x6240BB, 0x6240CF, 0x634056, 0x734056, 0x736240, 0x754056, 0x756240, 0x784056, 0x9A4056, 0x9AB1DA, 0xABDC40, 0xB14056, 0xB16240, 0xB1DA40, 0xB1DC40, - 0xBB4056, 0xBB5640, 0xBB6240, 0xBBBD40, 0xBD4056, 0xBF4056, 0xBF5640, 0xCF56B1, 0xCFBD40, 0xDA4056, 0xDC4056, 0xDC40BB, 0xDC40CF, 0xDC6240, 0xDC7540, 0xDCBD40, - }; - - public String getName() { + + } + static class CharsetRecog_IBM420_ar_rtl extends CharsetRecog_IBM420_ar + { + private static int[] ngrams = { + 0x4056B1, 0x4056BD, 0x405856, 0x409AB1, 0x40ABDC, 0x40B1B1, 0x40BBBD, 0x40CF56, 0x564056, 0x564640, 0x566340, 0x567540, 0x56B140, 0x56B149, 0x56B156, 0x56B158, + 0x56B163, 0x56B167, 0x56B169, 0x56B173, 0x56B178, 0x56B19A, 0x56B1AD, 0x56B1BB, 0x56B1CF, 0x56B1DC, 0x56BB40, 0x56BD40, 0x56BD63, 0x584056, 0x624056, 0x6240AB, + 0x6240B1, 0x6240BB, 0x6240CF, 0x634056, 0x734056, 0x736240, 0x754056, 0x756240, 0x784056, 0x9A4056, 0x9AB1DA, 0xABDC40, 0xB14056, 0xB16240, 0xB1DA40, 0xB1DC40, + 0xBB4056, 0xBB5640, 0xBB6240, 0xBBBD40, 0xBD4056, 0xBF4056, 0xBF5640, 0xCF56B1, 0xCFBD40, 0xDA4056, 
0xDC4056, 0xDC40BB, 0xDC40CF, 0xDC6240, 0xDC7540, 0xDCBD40, + }; + + public String getName() + { return "IBM420_rtl"; } - - public int match(CharsetDetector det) { + public int match(CharsetDetector det) + { matchInit(det); - int result = match(det, ngrams, byteMap, (byte) 0x40); + int result = match(det, ngrams, byteMap, (byte)0x40); matchFinish(det); return result; } - - } - - static class CharsetRecog_IBM420_ar_ltr extends CharsetRecog_IBM420_ar { - private static int[] ngrams = { - 0x404656, 0x4056BB, 0x4056BF, 0x406273, 0x406275, 0x4062B1, 0x4062BB, 0x4062DC, 0x406356, 0x407556, 0x4075DC, 0x40B156, 0x40BB56, 0x40BD56, 0x40BDBB, 0x40BDCF, - 0x40BDDC, 0x40DAB1, 0x40DCAB, 0x40DCB1, 0x49B156, 0x564056, 0x564058, 0x564062, 0x564063, 0x564073, 0x564075, 0x564078, 0x56409A, 0x5640B1, 0x5640BB, 0x5640BD, - 0x5640BF, 0x5640DA, 0x5640DC, 0x565840, 0x56B156, 0x56CF40, 0x58B156, 0x63B156, 0x63BD56, 0x67B156, 0x69B156, 0x73B156, 0x78B156, 0x9AB156, 0xAB4062, 0xADB156, - 0xB14062, 0xB15640, 0xB156CF, 0xB19A40, 0xB1B140, 0xBB4062, 0xBB40DC, 0xBBB156, 0xBD5640, 0xBDBB40, 0xCF4062, 0xCF40DC, 0xCFB156, 0xDAB19A, 0xDCAB40, 0xDCB156 - }; - - public String getName() { + + } + static class CharsetRecog_IBM420_ar_ltr extends CharsetRecog_IBM420_ar + { + private static int[] ngrams = { + 0x404656, 0x4056BB, 0x4056BF, 0x406273, 0x406275, 0x4062B1, 0x4062BB, 0x4062DC, 0x406356, 0x407556, 0x4075DC, 0x40B156, 0x40BB56, 0x40BD56, 0x40BDBB, 0x40BDCF, + 0x40BDDC, 0x40DAB1, 0x40DCAB, 0x40DCB1, 0x49B156, 0x564056, 0x564058, 0x564062, 0x564063, 0x564073, 0x564075, 0x564078, 0x56409A, 0x5640B1, 0x5640BB, 0x5640BD, + 0x5640BF, 0x5640DA, 0x5640DC, 0x565840, 0x56B156, 0x56CF40, 0x58B156, 0x63B156, 0x63BD56, 0x67B156, 0x69B156, 0x73B156, 0x78B156, 0x9AB156, 0xAB4062, 0xADB156, + 0xB14062, 0xB15640, 0xB156CF, 0xB19A40, 0xB1B140, 0xBB4062, 0xBB40DC, 0xBBB156, 0xBD5640, 0xBDBB40, 0xCF4062, 0xCF40DC, 0xCFB156, 0xDAB19A, 0xDCAB40, 0xDCB156 + }; + + public String getName() + { return "IBM420_ltr"; } - - 
public int match(CharsetDetector det) { + public int match(CharsetDetector det) + { matchInit(det); - int result = match(det, ngrams, byteMap, (byte) 0x40); + int result = match(det, ngrams, byteMap, (byte)0x40); matchFinish(det); return result; } } - - static abstract class CharsetRecog_EBCDIC_500 extends CharsetRecog_sbcs { + + static abstract class CharsetRecog_EBCDIC_500 extends CharsetRecog_sbcs + { // This maps EBCDIC 500 codepoints onto either space (not of interest), or a lower // case ISO_8859_1 number/letter/accented-letter codepoint for ngram matching // Because we map to ISO_8859_1, we can re-use the ngrams from those detectors // To avoid mis-detection, we skip many of the control characters in the 0x00-0x3f range protected static byte[] byteMap = { -/* 0x00-0x07 */ (byte) 0x00, (byte) 0x00, (byte) 0x00, (byte) 0x00, (byte) 0x00, (byte) 0x00, (byte) 0x00, (byte) 0x00, -/* 0x08-0x0f */ (byte) 0x20, (byte) 0x20, (byte) 0x20, (byte) 0x20, (byte) 0x20, (byte) 0x20, (byte) 0x20, (byte) 0x20, -/* 0x10-0x17 */ (byte) 0x20, (byte) 0x20, (byte) 0x20, (byte) 0x20, (byte) 0x20, (byte) 0x20, (byte) 0x20, (byte) 0x20, -/* 0x18-0x1f */ (byte) 0x20, (byte) 0x20, (byte) 0x20, (byte) 0x20, (byte) 0x20, (byte) 0x20, (byte) 0x20, (byte) 0x20, -/* 0x20-0x27 */ (byte) 0x20, (byte) 0x20, (byte) 0x20, (byte) 0x20, (byte) 0x20, (byte) 0x20, (byte) 0x20, (byte) 0x20, -/* 0x28-0x2f */ (byte) 0x20, (byte) 0x20, (byte) 0x20, (byte) 0x20, (byte) 0x20, (byte) 0x20, (byte) 0x00, (byte) 0x00, -/* 0x30-0x37 */ (byte) 0x00, (byte) 0x00, (byte) 0x00, (byte) 0x00, (byte) 0x00, (byte) 0x00, (byte) 0x00, (byte) 0x00, -/* 0x38-0x3f */ (byte) 0x20, (byte) 0x20, (byte) 0x20, (byte) 0x20, (byte) 0x00, (byte) 0x00, (byte) 0x00, (byte) 0x00, -/* 0x40-0x47 */ (byte) 0x20, (byte) 0x20, (byte) 0xe2, (byte) 0xe4, (byte) 0xe0, (byte) 0xe1, (byte) 0xe3, (byte) 0xe5, -/* 0x48-0x4f */ (byte) 0xe7, (byte) 0xf1, (byte) 0x20, (byte) 0x20, (byte) 0x20, (byte) 0x20, (byte) 0x20, (byte) 0x20, -/* 0x50-0x57 */ 
(byte) 0x20, (byte) 0xe9, (byte) 0xea, (byte) 0xeb, (byte) 0xe8, (byte) 0xed, (byte) 0xee, (byte) 0xef, -/* 0x58-0x5f */ (byte) 0xec, (byte) 0xdf, (byte) 0x20, (byte) 0x20, (byte) 0x20, (byte) 0x20, (byte) 0x20, (byte) 0x20, -/* 0x60-0x67 */ (byte) 0x20, (byte) 0x20, (byte) 0xe2, (byte) 0xe4, (byte) 0xe0, (byte) 0xe1, (byte) 0xe3, (byte) 0xe5, -/* 0x68-0x6f */ (byte) 0xe7, (byte) 0xf1, (byte) 0x20, (byte) 0x20, (byte) 0x20, (byte) 0x20, (byte) 0x20, (byte) 0x20, -/* 0x70-0x77 */ (byte) 0xf8, (byte) 0xe9, (byte) 0xea, (byte) 0xeb, (byte) 0xe8, (byte) 0xed, (byte) 0xee, (byte) 0xef, -/* 0x78-0x7f */ (byte) 0xec, (byte) 0x20, (byte) 0x20, (byte) 0x20, (byte) 0x20, (byte) 0x20, (byte) 0x20, (byte) 0x20, -/* 0x80-0x87 */ (byte) 0xd8, (byte) 'a', (byte) 'b', (byte) 'c', (byte) 'd', (byte) 'e', (byte) 'f', (byte) 'g', -/* 0x88-0x8f */ (byte) 'h', (byte) 'i', (byte) 0x20, (byte) 0x20, (byte) 0x20, (byte) 0x20, (byte) 0x20, (byte) 0x20, -/* 0x90-0x97 */ (byte) 0x20, (byte) 'j', (byte) 'k', (byte) 'l', (byte) 'm', (byte) 'n', (byte) 'o', (byte) 'p', -/* 0x98-0x9f */ (byte) 'q', (byte) 'r', (byte) 0x20, (byte) 0x20, (byte) 0x20, (byte) 0x20, (byte) 0x20, (byte) 0x20, -/* 0xa0-0xa7 */ (byte) 0x20, (byte) 0x20, (byte) 's', (byte) 't', (byte) 'u', (byte) 'v', (byte) 'w', (byte) 'x', -/* 0xa8-0xaf */ (byte) 'y', (byte) 'z', (byte) 0x20, (byte) 0x20, (byte) 0x20, (byte) 0x20, (byte) 0x20, (byte) 0x20, -/* 0xb0-0xb7 */ (byte) 0x20, (byte) 0x20, (byte) 0x20, (byte) 0x20, (byte) 0x20, (byte) 0x20, (byte) 0x20, (byte) 0x20, -/* 0xb8-0xbf */ (byte) 0x20, (byte) 0x20, (byte) 0x20, (byte) 0x20, (byte) 0x20, (byte) 0x20, (byte) 0x20, (byte) 0x20, -/* 0xc0-0xc7 */ (byte) 0x20, (byte) 'a', (byte) 'b', (byte) 'c', (byte) 'd', (byte) 'e', (byte) 'f', (byte) 'g', -/* 0xc8-0xcf */ (byte) 'h', (byte) 'i', (byte) 0x20, (byte) 0xf4, (byte) 0xf6, (byte) 0xf2, (byte) 0xf3, (byte) 0xf5, -/* 0xd0-0xd7 */ (byte) 0x20, (byte) 'j', (byte) 'k', (byte) 'l', (byte) 'm', (byte) 'n', (byte) 'o', (byte) 'p', 
-/* 0xd8-0xdf */ (byte) 'q', (byte) 'r', (byte) 0x20, (byte) 0xfb, (byte) 0xfc, (byte) 0xf9, (byte) 0xfa, (byte) 0xff, -/* 0xe0-0xe7 */ (byte) 0x20, (byte) 0x20, (byte) 's', (byte) 't', (byte) 'u', (byte) 'v', (byte) 'w', (byte) 'x', -/* 0xe8-0xef */ (byte) 'y', (byte) 'z', (byte) 0x20, (byte) 0xf4, (byte) 0xf6, (byte) 0xf2, (byte) 0xf3, (byte) 0xf5, -/* 0xf0-0xf7 */ (byte) '0', (byte) '1', (byte) '2', (byte) '3', (byte) '4', (byte) '5', (byte) '6', (byte) '7', -/* 0xf8-0xff */ (byte) '8', (byte) '9', (byte) 0x20, (byte) 0xfb, (byte) 0xfc, (byte) 0xf9, (byte) 0xfa, (byte) 0x20, - }; - - public String getName() { +/* 0x00-0x07 */ (byte)0x00, (byte)0x00, (byte)0x00, (byte)0x00, (byte)0x00, (byte)0x00, (byte)0x00, (byte)0x00, +/* 0x08-0x0f */ (byte)0x20, (byte)0x20, (byte)0x20, (byte)0x20, (byte)0x20, (byte)0x20, (byte)0x20, (byte)0x20, +/* 0x10-0x17 */ (byte)0x20, (byte)0x20, (byte)0x20, (byte)0x20, (byte)0x20, (byte)0x20, (byte)0x20, (byte)0x20, +/* 0x18-0x1f */ (byte)0x20, (byte)0x20, (byte)0x20, (byte)0x20, (byte)0x20, (byte)0x20, (byte)0x20, (byte)0x20, +/* 0x20-0x27 */ (byte)0x20, (byte)0x20, (byte)0x20, (byte)0x20, (byte)0x20, (byte)0x20, (byte)0x20, (byte)0x20, +/* 0x28-0x2f */ (byte)0x20, (byte)0x20, (byte)0x20, (byte)0x20, (byte)0x20, (byte)0x20, (byte)0x00, (byte)0x00, +/* 0x30-0x37 */ (byte)0x00, (byte)0x00, (byte)0x00, (byte)0x00, (byte)0x00, (byte)0x00, (byte)0x00, (byte)0x00, +/* 0x38-0x3f */ (byte)0x20, (byte)0x20, (byte)0x20, (byte)0x20, (byte)0x00, (byte)0x00, (byte)0x00, (byte)0x00, +/* 0x40-0x47 */ (byte)0x20, (byte)0x20, (byte)0xe2, (byte)0xe4, (byte)0xe0, (byte)0xe1, (byte)0xe3, (byte)0xe5, +/* 0x48-0x4f */ (byte)0xe7, (byte)0xf1, (byte)0x20, (byte)0x20, (byte)0x20, (byte)0x20, (byte)0x20, (byte)0x20, +/* 0x50-0x57 */ (byte)0x20, (byte)0xe9, (byte)0xea, (byte)0xeb, (byte)0xe8, (byte)0xed, (byte)0xee, (byte)0xef, +/* 0x58-0x5f */ (byte)0xec, (byte)0xdf, (byte)0x20, (byte)0x20, (byte)0x20, (byte)0x20, (byte)0x20, (byte)0x20, +/* 0x60-0x67 */ 
(byte)0x20, (byte)0x20, (byte)0xe2, (byte)0xe4, (byte)0xe0, (byte)0xe1, (byte)0xe3, (byte)0xe5, +/* 0x68-0x6f */ (byte)0xe7, (byte)0xf1, (byte)0x20, (byte)0x20, (byte)0x20, (byte)0x20, (byte)0x20, (byte)0x20, +/* 0x70-0x77 */ (byte)0xf8, (byte)0xe9, (byte)0xea, (byte)0xeb, (byte)0xe8, (byte)0xed, (byte)0xee, (byte)0xef, +/* 0x78-0x7f */ (byte)0xec, (byte)0x20, (byte)0x20, (byte)0x20, (byte)0x20, (byte)0x20, (byte)0x20, (byte)0x20, +/* 0x80-0x87 */ (byte)0xd8, (byte)'a', (byte)'b', (byte)'c', (byte)'d', (byte)'e', (byte)'f', (byte)'g', +/* 0x88-0x8f */ (byte)'h', (byte)'i', (byte)0x20, (byte)0x20, (byte)0x20, (byte)0x20, (byte)0x20, (byte)0x20, +/* 0x90-0x97 */ (byte)0x20, (byte)'j', (byte)'k', (byte)'l', (byte)'m', (byte)'n', (byte)'o', (byte)'p', +/* 0x98-0x9f */ (byte)'q', (byte)'r', (byte)0x20, (byte)0x20, (byte)0x20, (byte)0x20, (byte)0x20, (byte)0x20, +/* 0xa0-0xa7 */ (byte)0x20, (byte)0x20, (byte)'s', (byte)'t', (byte)'u', (byte)'v', (byte)'w', (byte)'x', +/* 0xa8-0xaf */ (byte)'y', (byte)'z', (byte)0x20, (byte)0x20, (byte)0x20, (byte)0x20, (byte)0x20, (byte)0x20, +/* 0xb0-0xb7 */ (byte)0x20, (byte)0x20, (byte)0x20, (byte)0x20, (byte)0x20, (byte)0x20, (byte)0x20, (byte)0x20, +/* 0xb8-0xbf */ (byte)0x20, (byte)0x20, (byte)0x20, (byte)0x20, (byte)0x20, (byte)0x20, (byte)0x20, (byte)0x20, +/* 0xc0-0xc7 */ (byte)0x20, (byte)'a', (byte)'b', (byte)'c', (byte)'d', (byte)'e', (byte)'f', (byte)'g', +/* 0xc8-0xcf */ (byte)'h', (byte)'i', (byte)0x20, (byte)0xf4, (byte)0xf6, (byte)0xf2, (byte)0xf3, (byte)0xf5, +/* 0xd0-0xd7 */ (byte)0x20, (byte)'j', (byte)'k', (byte)'l', (byte)'m', (byte)'n', (byte)'o', (byte)'p', +/* 0xd8-0xdf */ (byte)'q', (byte)'r', (byte)0x20, (byte)0xfb, (byte)0xfc, (byte)0xf9, (byte)0xfa, (byte)0xff, +/* 0xe0-0xe7 */ (byte)0x20, (byte)0x20, (byte)'s', (byte)'t', (byte)'u', (byte)'v', (byte)'w', (byte)'x', +/* 0xe8-0xef */ (byte)'y', (byte)'z', (byte)0x20, (byte)0xf4, (byte)0xf6, (byte)0xf2, (byte)0xf3, (byte)0xf5, +/* 0xf0-0xf7 */ (byte)'0', 
(byte)'1', (byte)'2', (byte)'3', (byte)'4', (byte)'5', (byte)'6', (byte)'7', +/* 0xf8-0xff */ (byte)'8', (byte)'9', (byte)0x20, (byte)0xfb, (byte)0xfc, (byte)0xf9, (byte)0xfa, (byte)0x20, + }; + + public String getName() + { return "IBM500"; } } - - static class CharsetRecog_EBCDIC_500_en extends CharsetRecog_EBCDIC_500 { - public String getLanguage() { + + static class CharsetRecog_EBCDIC_500_en extends CharsetRecog_EBCDIC_500 + { + public String getLanguage() + { return "en"; } - - public int match(CharsetDetector det) { + public int match(CharsetDetector det) + { return match(det, CharsetRecog_8859_1_en.ngrams, byteMap); } } - - static class CharsetRecog_EBCDIC_500_de extends CharsetRecog_EBCDIC_500 { - public String getLanguage() { + + static class CharsetRecog_EBCDIC_500_de extends CharsetRecog_EBCDIC_500 + { + public String getLanguage() + { return "de"; } - - public int match(CharsetDetector det) { + public int match(CharsetDetector det) + { return match(det, CharsetRecog_8859_1_de.ngrams, byteMap); } } - - static class CharsetRecog_EBCDIC_500_fr extends CharsetRecog_EBCDIC_500 { - public String getLanguage() { + + static class CharsetRecog_EBCDIC_500_fr extends CharsetRecog_EBCDIC_500 + { + public String getLanguage() + { return "fr"; } - - public int match(CharsetDetector det) { + public int match(CharsetDetector det) + { return match(det, CharsetRecog_8859_1_fr.ngrams, byteMap); } } - - static class CharsetRecog_EBCDIC_500_es extends CharsetRecog_EBCDIC_500 { - public String getLanguage() { + + static class CharsetRecog_EBCDIC_500_es extends CharsetRecog_EBCDIC_500 + { + public String getLanguage() + { return "es"; } - - public int match(CharsetDetector det) { + public int match(CharsetDetector det) + { return match(det, CharsetRecog_8859_1_es.ngrams, byteMap); } } - - static class CharsetRecog_EBCDIC_500_it extends CharsetRecog_EBCDIC_500 { - public String getLanguage() { + + static class CharsetRecog_EBCDIC_500_it extends CharsetRecog_EBCDIC_500 + { + 
public String getLanguage() + { return "it"; } - - public int match(CharsetDetector det) { + public int match(CharsetDetector det) + { return match(det, CharsetRecog_8859_1_it.ngrams, byteMap); } } - - static class CharsetRecog_EBCDIC_500_nl extends CharsetRecog_EBCDIC_500 { - public String getLanguage() { + + static class CharsetRecog_EBCDIC_500_nl extends CharsetRecog_EBCDIC_500 + { + public String getLanguage() + { return "nl"; } - - public int match(CharsetDetector det) { + public int match(CharsetDetector det) + { return match(det, CharsetRecog_8859_1_nl.ngrams, byteMap); } } diff --git a/tika-parsers/src/main/java/org/apache/tika/parser/txt/CharsetRecognizer.java b/tika-parsers/src/main/java/org/apache/tika/parser/txt/CharsetRecognizer.java index e1a0ff0..dbb9acb 100644 --- a/tika-parsers/src/main/java/org/apache/tika/parser/txt/CharsetRecognizer.java +++ b/tika-parsers/src/main/java/org/apache/tika/parser/txt/CharsetRecognizer.java @@ -1,25 +1,25 @@ /** - * ****************************************************************************** - * Copyright (C) 2005, International Business Machines Corporation and * - * others. All Rights Reserved. * - * ****************************************************************************** - */ +******************************************************************************* +* Copyright (C) 2005, International Business Machines Corporation and * +* others. All Rights Reserved. * +******************************************************************************* +*/ package org.apache.tika.parser.txt; /** * Abstract class for recognizing a single charset. * Part of the implementation of ICU's CharsetDetector. - * + * * Each specific charset that can be recognized will have an instance * of some subclass of this class. All interaction between the overall * CharsetDetector and the stuff specific to an individual charset happens * via the interface provided here. 
- * + * * Instances of CharsetDetector DO NOT have or maintain * state pertaining to a specific match or detect operation. * The WILL be shared by multiple instances of CharsetDetector. * They encapsulate const charset-specific information. - * + * * @internal */ abstract class CharsetRecognizer { @@ -27,28 +27,29 @@ * Get the IANA name of this charset. * @return the charset name. */ - abstract String getName(); - + abstract String getName(); + /** * Get the ISO language code for this charset. * @return the language code, or null if the language cannot be determined. */ - public String getLanguage() { + public String getLanguage() + { return null; } - + /** * Test the match of this charset with the input text data * which is obtained via the CharsetDetector object. - * + * * @param det The CharsetDetector, which contains the input text * to be checked for being in this charset. - * @return Two values packed into one int (Damn java, anyhow) + * @return Two values packed into one int (Damn java, anyhow) *
      * bits 0-7: the match confidence, ranging from 0-100 *
      * bits 8-15: The match reason, an enum-like value. */ - abstract int match(CharsetDetector det); + abstract int match(CharsetDetector det); } diff --git a/tika-parsers/src/main/java/org/apache/tika/parser/txt/Icu4jEncodingDetector.java b/tika-parsers/src/main/java/org/apache/tika/parser/txt/Icu4jEncodingDetector.java index ce792dc..c5fb16f 100644 --- a/tika-parsers/src/main/java/org/apache/tika/parser/txt/Icu4jEncodingDetector.java +++ b/tika-parsers/src/main/java/org/apache/tika/parser/txt/Icu4jEncodingDetector.java @@ -5,9 +5,9 @@ * The ASF licenses this file to You under the Apache License, Version 2.0 * (the "License"); you may not use this file except in compliance with * the License. You may obtain a copy of the License at - *
      - * http://www.apache.org/licenses/LICENSE-2.0 - *
      + * + * http://www.apache.org/licenses/LICENSE-2.0 + * * Unless required by applicable law or agreed to in writing, software * distributed under the License is distributed on an "AS IS" BASIS, * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. diff --git a/tika-parsers/src/main/java/org/apache/tika/parser/txt/TXTParser.java b/tika-parsers/src/main/java/org/apache/tika/parser/txt/TXTParser.java index 2b20495..6836890 100644 --- a/tika-parsers/src/main/java/org/apache/tika/parser/txt/TXTParser.java +++ b/tika-parsers/src/main/java/org/apache/tika/parser/txt/TXTParser.java @@ -22,10 +22,10 @@ import java.util.Collections; import java.util.Set; -import org.apache.commons.io.input.CloseShieldInputStream; import org.apache.tika.config.ServiceLoader; import org.apache.tika.detect.AutoDetectReader; import org.apache.tika.exception.TikaException; +import org.apache.tika.io.CloseShieldInputStream; import org.apache.tika.metadata.Metadata; import org.apache.tika.mime.MediaType; import org.apache.tika.parser.AbstractParser; @@ -40,22 +40,20 @@ * beginning of the stream and the given document metadata, most * notably the charset parameter of a * {@link org.apache.tika.metadata.HttpHeaders#CONTENT_TYPE} value. - *
      + *
      * This parser sets the following output metadata entries: *
 - * {@link org.apache.tika.metadata.HttpHeaders#CONTENT_TYPE}
 - * text/plain; charset=...
 + * {@link org.apache.tika.metadata.HttpHeaders#CONTENT_TYPE}
 + * text/plain; charset=...
 + *
      */ public class TXTParser extends AbstractParser { - /** - * Serial version UID - */ + /** Serial version UID */ private static final long serialVersionUID = -6656102320836888910L; private static final Set SUPPORTED_TYPES = - Collections.singleton(MediaType.TEXT_PLAIN); + Collections.singleton(MediaType.TEXT_PLAIN); private static final ServiceLoader LOADER = new ServiceLoader(TXTParser.class.getClassLoader()); @@ -69,9 +67,10 @@ Metadata metadata, ParseContext context) throws IOException, SAXException, TikaException { // Automatically detect the character encoding - try (AutoDetectReader reader = new AutoDetectReader( + AutoDetectReader reader = new AutoDetectReader( new CloseShieldInputStream(stream), metadata, - context.get(ServiceLoader.class, LOADER))) { + context.get(ServiceLoader.class, LOADER)); + try { Charset charset = reader.getCharset(); MediaType type = new MediaType(MediaType.TEXT_PLAIN, charset); metadata.set(Metadata.CONTENT_TYPE, type.toString()); @@ -92,6 +91,8 @@ xhtml.endElement("p"); xhtml.endDocument(); + } finally { + reader.close(); } } diff --git a/tika-parsers/src/main/java/org/apache/tika/parser/txt/UniversalEncodingDetector.java b/tika-parsers/src/main/java/org/apache/tika/parser/txt/UniversalEncodingDetector.java index fc3fb04..c28c9b1 100644 --- a/tika-parsers/src/main/java/org/apache/tika/parser/txt/UniversalEncodingDetector.java +++ b/tika-parsers/src/main/java/org/apache/tika/parser/txt/UniversalEncodingDetector.java @@ -50,6 +50,8 @@ } return listener.dataEnd(); + } catch (IOException e) { + throw e; } catch (LinkageError e) { return null; // juniversalchardet is not available } finally { diff --git a/tika-parsers/src/main/java/org/apache/tika/parser/utils/CommonsDigester.java b/tika-parsers/src/main/java/org/apache/tika/parser/utils/CommonsDigester.java deleted file mode 100644 index a064156..0000000 --- a/tika-parsers/src/main/java/org/apache/tika/parser/utils/CommonsDigester.java +++ /dev/null @@ -1,299 +0,0 @@ -package 
org.apache.tika.parser.utils; - -/* - * Licensed to the Apache Software Foundation (ASF) under one or more - * contributor license agreements. See the NOTICE file distributed with - * this work for additional information regarding copyright ownership. - * The ASF licenses this file to You under the Apache License, Version 2.0 - * (the "License"); you may not use this file except in compliance with - * the License. You may obtain a copy of the License at - * - * http://www.apache.org/licenses/LICENSE-2.0 - * - * Unless required by applicable law or agreed to in writing, software - * distributed under the License is distributed on an "AS IS" BASIS, - * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. - * See the License for the specific language governing permissions and - * limitations under the License. - */ - -import java.io.File; -import java.io.FileInputStream; -import java.io.IOException; -import java.io.InputStream; -import java.util.ArrayList; -import java.util.Collections; -import java.util.List; -import java.util.Locale; - -import org.apache.commons.codec.digest.DigestUtils; -import org.apache.commons.io.IOUtils; -import org.apache.tika.io.TikaInputStream; -import org.apache.tika.metadata.Metadata; -import org.apache.tika.metadata.TikaCoreProperties; -import org.apache.tika.parser.DigestingParser; -import org.apache.tika.parser.ParseContext; - -/** - * Implementation of {@link org.apache.tika.parser.DigestingParser.Digester} - * that relies on commons.codec.digest.DigestUtils to calculate digest hashes. - *
      - * This digester tries to use the regular mark/reset protocol on the InputStream. - * However, this wraps an internal BoundedInputStream, and if the InputStream - * is not fully read, then this will reset the stream and - * spool the InputStream to disk (via TikaInputStream) and then digest the file. - *
      - * If a TikaInputStream is passed in and it has an underlying file that is longer - * than the {@link #markLimit}, then this digester digests the file directly. - * - */ -public class CommonsDigester implements DigestingParser.Digester { - - public enum DigestAlgorithm { - //those currently available in commons.digest - MD2, - MD5, - SHA1, - SHA256, - SHA384, - SHA512; - - String getMetadataKey() { - return TikaCoreProperties.TIKA_META_PREFIX+ - "digest"+Metadata.NAMESPACE_PREFIX_DELIMITER+this.toString(); - } - } - - private final List algorithms = new ArrayList(); - private final int markLimit; - - public CommonsDigester(int markLimit, DigestAlgorithm... algorithms) { - Collections.addAll(this.algorithms, algorithms); - if (markLimit < 0) { - throw new IllegalArgumentException("markLimit must be >= 0"); - } - this.markLimit = markLimit; - } - - @Override - public void digest(InputStream is, Metadata m, ParseContext parseContext) throws IOException { - InputStream tis = TikaInputStream.get(is); - long sz = -1; - if (((TikaInputStream)tis).hasFile()) { - sz = ((TikaInputStream)tis).getLength(); - } - //if the file is definitely a file, - //and its size is greater than its mark limit, - //just digest the underlying file. - if (sz > markLimit) { - digestFile(((TikaInputStream)tis).getFile(), m); - return; - } - - //try the usual mark/reset stuff. 
- //however, if you actually hit the bound, - //then stop and spool to file via TikaInputStream - SimpleBoundedInputStream bis = new SimpleBoundedInputStream(markLimit, tis); - boolean finishedStream = false; - for (DigestAlgorithm algorithm : algorithms) { - bis.mark(markLimit + 1); - finishedStream = digestEach(algorithm, bis, m); - bis.reset(); - if (!finishedStream) { - break; - } - } - if (!finishedStream) { - digestFile(((TikaInputStream)tis).getFile(), m); - } - } - - private void digestFile(File f, Metadata m) throws IOException { - for (DigestAlgorithm algorithm : algorithms) { - InputStream is = new FileInputStream(f); - try { - digestEach(algorithm, is, m); - } finally { - IOUtils.closeQuietly(is); - } - } - } - - /** - * - * @param algorithm algo to use - * @param is input stream to read from - * @param metadata metadata for reporting the digest - * @return whether or not this finished the input stream - * @throws IOException - */ - private boolean digestEach(DigestAlgorithm algorithm, - InputStream is, Metadata metadata) throws IOException { - String digest = null; - try { - switch (algorithm) { - case MD2: - digest = DigestUtils.md2Hex(is); - break; - case MD5: - digest = DigestUtils.md5Hex(is); - break; - case SHA1: - digest = DigestUtils.sha1Hex(is); - break; - case SHA256: - digest = DigestUtils.sha256Hex(is); - break; - case SHA384: - digest = DigestUtils.sha384Hex(is); - break; - case SHA512: - digest = DigestUtils.sha512Hex(is); - break; - default: - throw new IllegalArgumentException("Sorry, not aware of algorithm: " + algorithm.toString()); - } - } catch (IOException e) { - e.printStackTrace(); - //swallow, or should we throw this? 
- } - if (is instanceof SimpleBoundedInputStream) { - if (((SimpleBoundedInputStream)is).hasHitBound()) { - return false; - } - } - metadata.set(algorithm.getMetadataKey(), digest); - return true; - } - - /** - * - * @param s comma-delimited (no space) list of algorithms to use: md5,sha256 - * @return - */ - public static DigestAlgorithm[] parse(String s) { - assert(s != null); - - List ret = new ArrayList(); - for (String algoString : s.split(",")) { - String uc = algoString.toUpperCase(Locale.ROOT); - if (uc.equals(DigestAlgorithm.MD2.toString())) { - ret.add(DigestAlgorithm.MD2); - } else if (uc.equals(DigestAlgorithm.MD5.toString())) { - ret.add(DigestAlgorithm.MD5); - } else if (uc.equals(DigestAlgorithm.SHA1.toString())) { - ret.add(DigestAlgorithm.SHA1); - } else if (uc.equals(DigestAlgorithm.SHA256.toString())) { - ret.add(DigestAlgorithm.SHA256); - } else if (uc.equals(DigestAlgorithm.SHA384.toString())) { - ret.add(DigestAlgorithm.SHA384); - } else if (uc.equals(DigestAlgorithm.SHA512.toString())) { - ret.add(DigestAlgorithm.SHA512); - } else { - StringBuilder sb = new StringBuilder(); - int i = 0; - for (DigestAlgorithm algo : DigestAlgorithm.values()) { - if (i++ > 0) { - sb.append(", "); - } - sb.append(algo.toString()); - } - throw new IllegalArgumentException("Couldn't match " + s + " with any of: " + sb.toString()); - } - } - return ret.toArray(new DigestAlgorithm[ret.size()]); - } - - /** - * Very slight modification of Commons' BoundedInputStream - * so that we can figure out if this hit the bound or not. 
- */ - private class SimpleBoundedInputStream extends InputStream { - private final static int EOF = -1; - private final long max; - private final InputStream in; - private long pos; - boolean hitBound = false; - - private SimpleBoundedInputStream(long max, InputStream in) { - this.max = max; - this.in = in; - } - - @Override - public int read() throws IOException { - if (max >= 0 && pos >= max) { - hitBound = true; - return EOF; - } - final int result = in.read(); - pos++; - return result; - } - - /** - * Invokes the delegate's read(byte[]) method. - * @param b the buffer to read the bytes into - * @return the number of bytes read or -1 if the end of stream or - * the limit has been reached. - * @throws IOException if an I/O error occurs - */ - @Override - public int read(final byte[] b) throws IOException { - return this.read(b, 0, b.length); - } - - /** - * Invokes the delegate's read(byte[], int, int) method. - * @param b the buffer to read the bytes into - * @param off The start offset - * @param len The number of bytes to read - * @return the number of bytes read or -1 if the end of stream or - * the limit has been reached. - * @throws IOException if an I/O error occurs - */ - @Override - public int read(final byte[] b, final int off, final int len) throws IOException { - if (max>=0 && pos>=max) { - return EOF; - } - final long maxRead = max>=0 ? Math.min(len, max-pos) : len; - final int bytesRead = in.read(b, off, (int)maxRead); - - if (bytesRead==EOF) { - return EOF; - } - - pos+=bytesRead; - return bytesRead; - } - - /** - * Invokes the delegate's skip(long) method. - * @param n the number of bytes to skip - * @return the actual number of bytes skipped - * @throws IOException if an I/O error occurs - */ - @Override - public long skip(final long n) throws IOException { - final long toSkip = max>=0 ? 
Math.min(n, max-pos) : n; - final long skippedBytes = in.skip(toSkip); - pos+=skippedBytes; - return skippedBytes; - } - - @Override - public void reset() throws IOException { - in.reset(); - } - - @Override - public void mark(int readLimit) { - in.mark(readLimit); - } - - public boolean hasHitBound() { - return hitBound; - } - } -} diff --git a/tika-parsers/src/main/java/org/apache/tika/parser/video/FLVParser.java b/tika-parsers/src/main/java/org/apache/tika/parser/video/FLVParser.java index 947b694..445f812 100644 --- a/tika-parsers/src/main/java/org/apache/tika/parser/video/FLVParser.java +++ b/tika-parsers/src/main/java/org/apache/tika/parser/video/FLVParser.java @@ -36,8 +36,6 @@ import org.apache.tika.sax.XHTMLContentHandler; import org.xml.sax.ContentHandler; import org.xml.sax.SAXException; - -import static java.nio.charset.StandardCharsets.UTF_8; /** *
      @@ -132,7 +130,7 @@ int size = input.readUnsignedShort(); byte[] chars = new byte[size]; input.readFully(chars); - return new String(chars, UTF_8); + return new String(chars); } private Object readAMFObject(DataInputStream input) throws IOException { diff --git a/tika-parsers/src/main/java/org/apache/tika/parser/xml/XMLParser.java b/tika-parsers/src/main/java/org/apache/tika/parser/xml/XMLParser.java index b17058d..0a064d6 100644 --- a/tika-parsers/src/main/java/org/apache/tika/parser/xml/XMLParser.java +++ b/tika-parsers/src/main/java/org/apache/tika/parser/xml/XMLParser.java @@ -23,8 +23,8 @@ import java.util.HashSet; import java.util.Set; -import org.apache.commons.io.input.CloseShieldInputStream; import org.apache.tika.exception.TikaException; +import org.apache.tika.io.CloseShieldInputStream; import org.apache.tika.metadata.Metadata; import org.apache.tika.mime.MediaType; import org.apache.tika.parser.AbstractParser; @@ -76,10 +76,10 @@ } catch (SAXException e) { tagged.throwIfCauseOf(e); throw new TikaException("XML parse error", e); - } finally { - xhtml.endElement("p"); - xhtml.endDocument(); } + + xhtml.endElement("p"); + xhtml.endDocument(); } protected ContentHandler getContentHandler( diff --git a/tika-parsers/src/main/resources/META-INF/services/org.apache.tika.parser.Parser b/tika-parsers/src/main/resources/META-INF/services/org.apache.tika.parser.Parser index ba58848..42e54e7 100644 --- a/tika-parsers/src/main/resources/META-INF/services/org.apache.tika.parser.Parser +++ b/tika-parsers/src/main/resources/META-INF/services/org.apache.tika.parser.Parser @@ -24,20 +24,15 @@ org.apache.tika.parser.font.AdobeFontMetricParser org.apache.tika.parser.font.TrueTypeParser org.apache.tika.parser.html.HtmlParser -org.apache.tika.parser.image.BPGParser org.apache.tika.parser.image.ImageParser org.apache.tika.parser.image.PSDParser org.apache.tika.parser.image.TiffParser -org.apache.tika.parser.image.WebPParser org.apache.tika.parser.iptc.IptcAnpaParser 
org.apache.tika.parser.iwork.IWorkPackageParser org.apache.tika.parser.jpeg.JpegParser org.apache.tika.parser.mail.RFC822Parser org.apache.tika.parser.mbox.MboxParser -org.apache.tika.parser.mbox.OutlookPSTParser -org.apache.tika.parser.microsoft.JackcessParser org.apache.tika.parser.microsoft.OfficeParser -org.apache.tika.parser.microsoft.OldExcelParser org.apache.tika.parser.microsoft.TNEFParser org.apache.tika.parser.microsoft.ooxml.OOXMLParser org.apache.tika.parser.mp3.Mp3Parser @@ -48,22 +43,10 @@ org.apache.tika.parser.pdf.PDFParser org.apache.tika.parser.pkg.CompressorParser org.apache.tika.parser.pkg.PackageParser -org.apache.tika.parser.pkg.RarParser org.apache.tika.parser.rtf.RTFParser org.apache.tika.parser.txt.TXTParser org.apache.tika.parser.video.FLVParser org.apache.tika.parser.xml.DcXMLParser -org.apache.tika.parser.dif.DIFParser org.apache.tika.parser.xml.FictionBookParser org.apache.tika.parser.chm.ChmParser org.apache.tika.parser.code.SourceCodeParser -org.apache.tika.parser.mat.MatParser -org.apache.tika.parser.ocr.TesseractOCRParser -org.apache.tika.parser.gdal.GDALParser -org.apache.tika.parser.grib.GribParser -org.apache.tika.parser.jdbc.SQLite3Parser -org.apache.tika.parser.isatab.ISArchiveParser -org.apache.tika.parser.geoinfo.GeographicInformationParser -org.apache.tika.parser.geo.topic.GeoParser -org.apache.tika.parser.external.CompositeExternalParser -org.apache.tika.parser.journal.JournalParser \ No newline at end of file diff --git a/tika-parsers/src/main/resources/org/apache/tika/parser/ctakes/CTAKESConfig.properties b/tika-parsers/src/main/resources/org/apache/tika/parser/ctakes/CTAKESConfig.properties deleted file mode 100644 index c46a064..0000000 --- a/tika-parsers/src/main/resources/org/apache/tika/parser/ctakes/CTAKESConfig.properties +++ /dev/null @@ -1,22 +0,0 @@ -# Licensed to the Apache Software Foundation (ASF) under one or more -# contributor license agreements. 
See the NOTICE file distributed with -# this work for additional information regarding copyright ownership. -# The ASF licenses this file to You under the Apache License, Version 2.0 -# (the "License"); you may not use this file except in compliance with -# the License. You may obtain a copy of the License at -# -# http://www.apache.org/licenses/LICENSE-2.0 -# -# Unless required by applicable law or agreed to in writing, software -# distributed under the License is distributed on an "AS IS" BASIS, -# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. -# See the License for the specific language governing permissions and -# limitations under the License. - -aeDescriptorPath=/ctakes-clinical-pipeline/desc/analysis_engine/AggregatePlaintextUMLSProcessor.xml -text=true -annotationProps=BEGIN,END,ONTOLOGY_CONCEPT_ARR -separatorChar=: -metadata=Study Title,Study Description -UMLSUser= -UMLSPass= diff --git a/tika-parsers/src/main/resources/org/apache/tika/parser/external/tika-external-parsers.xml b/tika-parsers/src/main/resources/org/apache/tika/parser/external/tika-external-parsers.xml index 96bad46..62d6076 100644 --- a/tika-parsers/src/main/resources/org/apache/tika/parser/external/tika-external-parsers.xml +++ b/tika-parsers/src/main/resources/org/apache/tika/parser/external/tika-external-parsers.xml @@ -30,35 +30,10 @@ video/avi video/mpeg - video/x-msvideo - \s*Stream.*:.+Audio:.*,\s+(\d+)\s+Hz,.* - \s*Stream.*:.+Audio:.*\d+\s+Hz,\s+(\d{1,2})\s+channels.* - \s*Stream.*:.+Audio:\s+([A-Za-z0-9_\(\)/\[\] ]+),.* - \s*Duration:\s*([0-9:\.]+),.* - \s*Duration:.*,\s*bitrate:\s+([0-9A-Za-z/ ]+).* - \s*Stream.*:\s+Video:\s+[A-Za-z0-9\(\)/ ]+,\s+([A-Za-z0-9\(\) ,]+),\s+[0-9x]+,.* - \s*Stream.*:\s+Video:\s+([A-Za-z0-9\(\)/ ]+),.* - \s*Stream.*:\s+Video:.*,\s+([0-9]+)\s+fps,.* - \s*encoder\s*\:\s*(\w+).* - \s*Stream.*:\s+Video:.*,\s+([0-9x]+),.* - - - - - exiftool -ver - 126,127 - - env FOO=${OUTPUT} exiftool ${INPUT} - - video/avi - video/mpeg - 
video/x-msvideo - video/mp4 - - - \s*([A-Za-z0-9/ \(\)]+\S{1})\s+:\s+([A-Za-z0-9\(\)\[\] \:\-\.]+)\s* + Stream.*? Audio:.*? Hz, (\w+), + Stream.*? Audio: (\w+), diff --git a/tika-parsers/src/main/resources/org/apache/tika/parser/journal/GrobidExtractor.properties b/tika-parsers/src/main/resources/org/apache/tika/parser/journal/GrobidExtractor.properties deleted file mode 100644 index 11ac4fa..0000000 --- a/tika-parsers/src/main/resources/org/apache/tika/parser/journal/GrobidExtractor.properties +++ /dev/null @@ -1,16 +0,0 @@ -# Licensed to the Apache Software Foundation (ASF) under one or more -# contributor license agreements. See the NOTICE file distributed with -# this work for additional information regarding copyright ownership. -# The ASF licenses this file to You under the Apache License, Version 2.0 -# (the "License"); you may not use this file except in compliance with -# the License. You may obtain a copy of the License at -# -# http://www.apache.org/licenses/LICENSE-2.0 -# -# Unless required by applicable law or agreed to in writing, software -# distributed under the License is distributed on an "AS IS" BASIS, -# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. -# See the License for the specific language governing permissions and -# limitations under the License. - -grobid.server.url=http://localhost:8080 diff --git a/tika-parsers/src/main/resources/org/apache/tika/parser/ocr/TesseractOCRConfig.properties b/tika-parsers/src/main/resources/org/apache/tika/parser/ocr/TesseractOCRConfig.properties deleted file mode 100644 index cb2151c..0000000 --- a/tika-parsers/src/main/resources/org/apache/tika/parser/ocr/TesseractOCRConfig.properties +++ /dev/null @@ -1,21 +0,0 @@ -# Licensed to the Apache Software Foundation (ASF) under one or more -# contributor license agreements. See the NOTICE file distributed with -# this work for additional information regarding copyright ownership. 
-# The ASF licenses this file to You under the Apache License, Version 2.0 -# (the "License"); you may not use this file except in compliance with -# the License. You may obtain a copy of the License at -# -# http://www.apache.org/licenses/LICENSE-2.0 -# -# Unless required by applicable law or agreed to in writing, software -# distributed under the License is distributed on an "AS IS" BASIS, -# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. -# See the License for the specific language governing permissions and -# limitations under the License. - -tesseractPath= -language=eng -pageSegMode=1 -maxFileSizeToOcr=2147483647 -minFileSizeToOcr=0 -timeout=120 \ No newline at end of file diff --git a/tika-parsers/src/main/resources/org/apache/tika/parser/pdf/PDFParser.properties b/tika-parsers/src/main/resources/org/apache/tika/parser/pdf/PDFParser.properties index 1585f2d..c28462c 100644 --- a/tika-parsers/src/main/resources/org/apache/tika/parser/pdf/PDFParser.properties +++ b/tika-parsers/src/main/resources/org/apache/tika/parser/pdf/PDFParser.properties @@ -19,7 +19,3 @@ suppressDuplicateOverlappingText false useNonSequentialParser false extractAcroFormContent true -extractInlineImages false -extractUniqueInlineImagesOnly true -checkExtractAccessPermission false -allowExtractionForAccessibility true diff --git a/tika-parsers/src/test/java/org/apache/tika/TestParsers.java b/tika-parsers/src/test/java/org/apache/tika/TestParsers.java index ddd671d..4d13264 100644 --- a/tika-parsers/src/test/java/org/apache/tika/TestParsers.java +++ b/tika-parsers/src/test/java/org/apache/tika/TestParsers.java @@ -52,8 +52,12 @@ File file = getResourceAsFile("/test-documents/testWORD.doc"); Parser parser = tika.getParser(); Metadata metadata = new Metadata(); - try (InputStream stream = new FileInputStream(file)) { - parser.parse(stream, new DefaultHandler(), metadata, new ParseContext()); + InputStream stream = new FileInputStream(file); + try { + parser.parse( + 
stream, new DefaultHandler(), metadata, new ParseContext()); + } finally { + stream.close(); } assertEquals("Sample Word Document", metadata.get(TikaCoreProperties.TITLE)); } @@ -67,8 +71,12 @@ .contains(expected)); Parser parser = tika.getParser(); Metadata metadata = new Metadata(); - try (InputStream stream = new FileInputStream(file)) { - parser.parse(stream, new DefaultHandler(), metadata, new ParseContext()); + InputStream stream = new FileInputStream(file); + try { + parser.parse( + stream, new DefaultHandler(), metadata, new ParseContext()); + } finally { + stream.close(); } assertEquals("Simple Excel document", metadata.get(TikaCoreProperties.TITLE)); } @@ -100,8 +108,7 @@ @Test public void testComment() throws Exception { - final String[] extensions = new String[] {"ppt", "pptx", "doc", - "docx", "xls", "xlsx", "pdf", "rtf"}; + final String[] extensions = new String[] {"ppt", "pptx", "doc", "docx", "pdf", "rtf"}; for(String extension : extensions) { verifyComment(extension, "testComment"); } diff --git a/tika-parsers/src/test/java/org/apache/tika/TikaTest.java b/tika-parsers/src/test/java/org/apache/tika/TikaTest.java new file mode 100644 index 0000000..bac5a42 --- /dev/null +++ b/tika-parsers/src/test/java/org/apache/tika/TikaTest.java @@ -0,0 +1,132 @@ +/* + * Licensed to the Apache Software Foundation (ASF) under one or more + * contributor license agreements. See the NOTICE file distributed with + * this work for additional information regarding copyright ownership. + * The ASF licenses this file to You under the Apache License, Version 2.0 + * (the "License"); you may not use this file except in compliance with + * the License. You may obtain a copy of the License at + * + * http://www.apache.org/licenses/LICENSE-2.0 + * + * Unless required by applicable law or agreed to in writing, software + * distributed under the License is distributed on an "AS IS" BASIS, + * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. 
+ * See the License for the specific language governing permissions and + * limitations under the License. + */ +package org.apache.tika; + +import static org.junit.Assert.assertTrue; +import static org.junit.Assert.fail; + +import java.io.File; +import java.io.InputStream; +import java.net.URISyntaxException; +import java.net.URL; + +import org.apache.tika.metadata.Metadata; +import org.apache.tika.parser.AutoDetectParser; +import org.apache.tika.parser.ParseContext; +import org.apache.tika.parser.Parser; +import org.apache.tika.sax.BodyContentHandler; +import org.apache.tika.sax.ToXMLContentHandler; +import org.xml.sax.ContentHandler; + +/** + * Parent class of Tika tests + */ +public abstract class TikaTest { + /** + * This method will give you back the filename incl. the absolute path name + * to the resource. If the resource does not exist, it will give you back the + * resource name incl. the path. + * + * @param name + * The named resource to search for. + * @return an absolute path incl. the name which is in the same directory as + * the class you've called it from.
+ */ + public File getResourceAsFile(String name) throws URISyntaxException { + URL url = this.getClass().getResource(name); + if (url != null) { + return new File(url.toURI()); + } else { + // We have a file which does not exist + // We got the path + url = this.getClass().getResource("."); + File file = new File(new File(url.toURI()), name); + if (file == null) { + fail("Unable to find requested file " + name); + } + return file; + } + } + + public InputStream getResourceAsStream(String name) { + InputStream stream = this.getClass().getResourceAsStream(name); + if (stream == null) { + fail("Unable to find requested resource " + name); + } + return stream; + } + + public void assertContains(String needle, String haystack) { + assertTrue(needle + " not found in:\n" + haystack, haystack.contains(needle)); + } + + protected static class XMLResult { + public final String xml; + public final Metadata metadata; + + public XMLResult(String xml, Metadata metadata) { + this.xml = xml; + this.metadata = metadata; + } + } + + protected XMLResult getXML(String filePath) throws Exception { + return getXML(getResourceAsStream("/test-documents/" + filePath), new AutoDetectParser(), new Metadata()); + } + + protected XMLResult getXML(InputStream input, Parser parser, Metadata metadata) throws Exception { + ParseContext context = new ParseContext(); + context.set(Parser.class, parser); + + try { + ContentHandler handler = new ToXMLContentHandler(); + parser.parse(input, handler, metadata, context); + return new XMLResult(handler.toString(), metadata); + } finally { + input.close(); + } + } + + /** + * Basic text extraction. + *
      + * Tries to close input stream after processing. + */ + public String getText(InputStream is, Parser parser, ParseContext context, Metadata metadata) throws Exception{ + ContentHandler handler = new BodyContentHandler(1000000); + try { + parser.parse(is, handler, metadata, context); + } finally { + is.close(); + } + return handler.toString(); + } + + public String getText(InputStream is, Parser parser, Metadata metadata) throws Exception{ + return getText(is, parser, new ParseContext(), metadata); + } + + public String getText(InputStream is, Parser parser, ParseContext context) throws Exception{ + return getText(is, parser, context, new Metadata()); + } + + public String getText(InputStream is, Parser parser) throws Exception{ + return getText(is, parser, new ParseContext(), new Metadata()); + } + + +} diff --git a/tika-parsers/src/test/java/org/apache/tika/config/TikaDetectorConfigTest.java b/tika-parsers/src/test/java/org/apache/tika/config/TikaDetectorConfigTest.java deleted file mode 100644 index 949107c..0000000 --- a/tika-parsers/src/test/java/org/apache/tika/config/TikaDetectorConfigTest.java +++ /dev/null @@ -1,144 +0,0 @@ -/* - * Licensed to the Apache Software Foundation (ASF) under one or more - * contributor license agreements. See the NOTICE file distributed with - * this work for additional information regarding copyright ownership. - * The ASF licenses this file to You under the Apache License, Version 2.0 - * (the "License"); you may not use this file except in compliance with - * the License. You may obtain a copy of the License at - * - * http://www.apache.org/licenses/LICENSE-2.0 - * - * Unless required by applicable law or agreed to in writing, software - * distributed under the License is distributed on an "AS IS" BASIS, - * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. - * See the License for the specific language governing permissions and - * limitations under the License. 
- */ -package org.apache.tika.config; - -import static org.junit.Assert.assertEquals; -import static org.junit.Assert.assertNotNull; -import static org.junit.Assert.assertTrue; -import static org.junit.Assert.fail; - -import org.apache.tika.detect.CompositeDetector; -import org.apache.tika.detect.DefaultDetector; -import org.apache.tika.detect.Detector; -import org.apache.tika.detect.EmptyDetector; -import org.apache.tika.io.TikaInputStream; -import org.apache.tika.metadata.Metadata; -import org.apache.tika.parser.mbox.OutlookPSTParser; -import org.apache.tika.parser.microsoft.POIFSContainerDetector; -import org.apache.tika.parser.pkg.ZipContainerDetector; -import org.junit.Ignore; -import org.junit.Test; - -/** - * Junit test class for {@link TikaConfig}, which cover things - * that {@link TikaConfigTest} can't do due to a need for the - * full set of detectors - */ -public class TikaDetectorConfigTest extends AbstractTikaConfigTest { - @Test - public void testDetectorExcludeFromDefault() throws Exception { - TikaConfig config = getConfig("TIKA-1702-detector-blacklist.xml"); - assertNotNull(config.getParser()); - assertNotNull(config.getDetector()); - CompositeDetector detector = (CompositeDetector)config.getDetector(); - - // Should be wrapping two detectors - assertEquals(2, detector.getDetectors().size()); - - - // First should be DefaultDetector, second Empty, that order - assertEquals(DefaultDetector.class, detector.getDetectors().get(0).getClass()); - assertEquals(EmptyDetector.class, detector.getDetectors().get(1).getClass()); - - - // Get the DefaultDetector from the config - DefaultDetector confDetector = (DefaultDetector)detector.getDetectors().get(0); - - // Get a fresh "default" DefaultParser - DefaultDetector normDetector = new DefaultDetector(config.getMimeRepository()); - - - // The default one will offer the Zip and POIFS detectors - assertDetectors(normDetector, true, true); - - - // The one from the config won't, as we excluded those - 
assertDetectors(confDetector, false, false); - } - - /** - * TIKA-1708 - If the Zip detector is disabled, either explicitly, - * or via giving a list of detectors that it isn't part of, ensure - * that detection of PST files still works - */ - @Test - public void testPSTDetectionWithoutZipDetector() throws Exception { - // Check the one with an exclude - TikaConfig configWX = getConfig("TIKA-1708-detector-default.xml"); - assertNotNull(configWX.getParser()); - assertNotNull(configWX.getDetector()); - CompositeDetector detectorWX = (CompositeDetector)configWX.getDetector(); - - // Check it has the POIFS one, but not the zip one - assertDetectors(detectorWX, true, false); - - - // Check the one with an explicit list - TikaConfig configCL = getConfig("TIKA-1708-detector-composite.xml"); - assertNotNull(configCL.getParser()); - assertNotNull(configCL.getDetector()); - CompositeDetector detectorCL = (CompositeDetector)configCL.getDetector(); - assertEquals(2, detectorCL.getDetectors().size()); - - // Check it also has the POIFS one, but not the zip one - assertDetectors(detectorCL, true, false); - - - // Check that both detectors have a mimetypes with entries - assertTrue("Not enough mime types: " + configWX.getMediaTypeRegistry().getTypes().size(), - configWX.getMediaTypeRegistry().getTypes().size() > 100); - assertTrue("Not enough mime types: " + configCL.getMediaTypeRegistry().getTypes().size(), - configCL.getMediaTypeRegistry().getTypes().size() > 100); - - - // Now check they detect PST files correctly - TikaInputStream stream = TikaInputStream.get( - getResourceAsFile("/test-documents/testPST.pst")); - assertEquals( - OutlookPSTParser.MS_OUTLOOK_PST_MIMETYPE, - detectorWX.detect(stream, new Metadata()) - ); - assertEquals( - OutlookPSTParser.MS_OUTLOOK_PST_MIMETYPE, - detectorCL.detect(stream, new Metadata()) - ); - } - - private void assertDetectors(CompositeDetector detector, boolean shouldHavePOIFS, - boolean shouldHaveZip) { - boolean hasZip = false; - boolean 
hasPOIFS = false; - for (Detector d : detector.getDetectors()) { - if (d instanceof ZipContainerDetector) { - if (shouldHaveZip) { - hasZip = true; - } else { - fail("Shouldn't have the ZipContainerDetector from config"); - } - } - if (d instanceof POIFSContainerDetector) { - if (shouldHavePOIFS) { - hasPOIFS = true; - } else { - fail("Shouldn't have the POIFSContainerDetector from config"); - } - } - } - if (shouldHavePOIFS) assertTrue("Should have the POIFSContainerDetector", hasPOIFS); - if (shouldHaveZip) assertTrue("Should have the ZipContainerDetector", hasZip); - } -} diff --git a/tika-parsers/src/test/java/org/apache/tika/config/TikaParserConfigTest.java b/tika-parsers/src/test/java/org/apache/tika/config/TikaParserConfigTest.java deleted file mode 100644 index 2acd358..0000000 --- a/tika-parsers/src/test/java/org/apache/tika/config/TikaParserConfigTest.java +++ /dev/null @@ -1,157 +0,0 @@ -/* - * Licensed to the Apache Software Foundation (ASF) under one or more - * contributor license agreements. See the NOTICE file distributed with - * this work for additional information regarding copyright ownership. - * The ASF licenses this file to You under the Apache License, Version 2.0 - * (the "License"); you may not use this file except in compliance with - * the License. You may obtain a copy of the License at - * - * http://www.apache.org/licenses/LICENSE-2.0 - * - * Unless required by applicable law or agreed to in writing, software - * distributed under the License is distributed on an "AS IS" BASIS, - * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. - * See the License for the specific language governing permissions and - * limitations under the License. 
- */ -package org.apache.tika.config; - -import static org.apache.tika.TikaTest.assertContains; -import static org.apache.tika.TikaTest.assertNotContained; -import static org.junit.Assert.assertEquals; -import static org.junit.Assert.assertNotNull; -import static org.junit.Assert.assertTrue; -import static org.junit.Assert.fail; - -import java.util.List; - -import org.apache.tika.mime.MediaType; -import org.apache.tika.parser.CompositeParser; -import org.apache.tika.parser.DefaultParser; -import org.apache.tika.parser.EmptyParser; -import org.apache.tika.parser.Parser; -import org.apache.tika.parser.ParserDecorator; -import org.apache.tika.parser.executable.ExecutableParser; -import org.apache.tika.parser.xml.XMLParser; -import org.junit.Test; - -/** - * Junit test class for {@link TikaConfig}, which cover things - * that {@link TikaConfigTest} can't do due to a need for the - * full set of parsers - */ -public class TikaParserConfigTest extends AbstractTikaConfigTest { - @Test - public void testMimeExcludeInclude() throws Exception { - TikaConfig config = getConfig("TIKA-1558-blacklist.xml"); - assertNotNull(config.getParser()); - assertNotNull(config.getDetector()); - Parser parser = config.getParser(); - - MediaType PDF = MediaType.application("pdf"); - MediaType JPEG = MediaType.image("jpeg"); - - - // Has two parsers - assertEquals(CompositeParser.class, parser.getClass()); - CompositeParser cParser = (CompositeParser)parser; - assertEquals(2, cParser.getAllComponentParsers().size()); - - // Both are decorated - assertTrue(cParser.getAllComponentParsers().get(0) instanceof ParserDecorator); - assertTrue(cParser.getAllComponentParsers().get(1) instanceof ParserDecorator); - ParserDecorator p0 = (ParserDecorator)cParser.getAllComponentParsers().get(0); - ParserDecorator p1 = (ParserDecorator)cParser.getAllComponentParsers().get(1); - - - // DefaultParser will be wrapped with excludes - assertEquals(DefaultParser.class, p0.getWrappedParser().getClass()); - - 
assertNotContained(PDF, p0.getSupportedTypes(context)); - assertContains(PDF, p0.getWrappedParser().getSupportedTypes(context)); - assertNotContained(JPEG, p0.getSupportedTypes(context)); - assertContains(JPEG, p0.getWrappedParser().getSupportedTypes(context)); - - - // Will have an empty parser for PDF - assertEquals(EmptyParser.class, p1.getWrappedParser().getClass()); - assertEquals(1, p1.getSupportedTypes(context).size()); - assertContains(PDF, p1.getSupportedTypes(context)); - assertNotContained(PDF, p1.getWrappedParser().getSupportedTypes(context)); - } - - @Test - public void testParserExcludeFromDefault() throws Exception { - TikaConfig config = getConfig("TIKA-1558-blacklist.xml"); - assertNotNull(config.getParser()); - assertNotNull(config.getDetector()); - CompositeParser parser = (CompositeParser)config.getParser(); - - MediaType PE_EXE = MediaType.application("x-msdownload"); - MediaType ELF = MediaType.application("x-elf"); - - - // Get the DefaultParser from the config - ParserDecorator confWrappedParser = (ParserDecorator)parser.getParsers().get(MediaType.APPLICATION_XML); - assertNotNull(confWrappedParser); - DefaultParser confParser = (DefaultParser)confWrappedParser.getWrappedParser(); - - // Get a fresh "default" DefaultParser - DefaultParser normParser = new DefaultParser(config.getMediaTypeRegistry()); - - - // The default one will offer the Executable Parser - assertContains(PE_EXE, normParser.getSupportedTypes(context)); - assertContains(ELF, normParser.getSupportedTypes(context)); - - boolean hasExec = false; - for (Parser p : normParser.getParsers().values()) { - if (p instanceof ExecutableParser) { - hasExec = true; - break; - } - } - assertTrue(hasExec); - - - // The one from the config won't - assertNotContained(PE_EXE, confParser.getSupportedTypes(context)); - assertNotContained(ELF, confParser.getSupportedTypes(context)); - - for (Parser p : confParser.getParsers().values()) { - if (p instanceof ExecutableParser) - fail("Shouldn't 
have the Executable Parser from config"); - } - } - /** - * TIKA-1558 It should be possible to exclude Parsers from being picked up by - * DefaultParser. - */ - @Test - public void defaultParserBlacklist() throws Exception { - TikaConfig config = new TikaConfig(); - assertNotNull(config.getParser()); - assertNotNull(config.getDetector()); - CompositeParser cp = (CompositeParser) config.getParser(); - List parsers = cp.getAllComponentParsers(); - - boolean hasXML = false; - for (Parser p : parsers) { - if (p instanceof XMLParser) { - hasXML = true; - break; - } - } - assertTrue("Default config should include an XMLParser.", hasXML); - - // This custom TikaConfig should exclude XMLParser and all of its subclasses. - config = getConfig("TIKA-1558-blacklistsub.xml"); - cp = (CompositeParser) config.getParser(); - parsers = cp.getAllComponentParsers(); - - for (Parser p : parsers) { - if (p instanceof XMLParser) - fail("Custom config should not include an XMLParser (" + p.getClass() + ")."); - } - } -} diff --git a/tika-parsers/src/test/java/org/apache/tika/config/TikaTranslatorConfigTest.java b/tika-parsers/src/test/java/org/apache/tika/config/TikaTranslatorConfigTest.java deleted file mode 100644 index 71af206..0000000 --- a/tika-parsers/src/test/java/org/apache/tika/config/TikaTranslatorConfigTest.java +++ /dev/null @@ -1,72 +0,0 @@ -/* - * Licensed to the Apache Software Foundation (ASF) under one or more - * contributor license agreements. See the NOTICE file distributed with - * this work for additional information regarding copyright ownership. - * The ASF licenses this file to You under the Apache License, Version 2.0 - * (the "License"); you may not use this file except in compliance with - * the License. 
You may obtain a copy of the License at - * - * http://www.apache.org/licenses/LICENSE-2.0 - * - * Unless required by applicable law or agreed to in writing, software - * distributed under the License is distributed on an "AS IS" BASIS, - * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. - * See the License for the specific language governing permissions and - * limitations under the License. - */ -package org.apache.tika.config; - -import static org.junit.Assert.assertEquals; -import static org.junit.Assert.assertNotNull; - -import org.apache.tika.language.translate.DefaultTranslator; -import org.apache.tika.language.translate.EmptyTranslator; -import org.junit.Test; - -/** - * Junit test class for {@link TikaConfig}, which cover things - * that {@link TikaConfigTest} can't do due to a need for the - * full set of translators - */ -public class TikaTranslatorConfigTest extends AbstractTikaConfigTest { - @Test - public void testDefaultBehaviour() throws Exception { - TikaConfig config = TikaConfig.getDefaultConfig(); - assertNotNull(config.getTranslator()); - assertEquals(DefaultTranslator.class, config.getTranslator().getClass()); - } - - @Test - public void testRequestsDefault() throws Exception { - TikaConfig config = getConfig("TIKA-1702-translator-default.xml"); - assertNotNull(config.getParser()); - assertNotNull(config.getDetector()); - assertNotNull(config.getTranslator()); - - assertEquals(DefaultTranslator.class, config.getTranslator().getClass()); - } - - @Test - public void testRequestsEmpty() throws Exception { - TikaConfig config = getConfig("TIKA-1702-translator-empty.xml"); - assertNotNull(config.getParser()); - assertNotNull(config.getDetector()); - assertNotNull(config.getTranslator()); - - assertEquals(EmptyTranslator.class, config.getTranslator().getClass()); - } - - /** - * Currently, Translators don't support Composites, so - * if multiple translators are given, only the first wins - */ - @Test - public void 
testRequestsMultiple() throws Exception { - TikaConfig config = getConfig("TIKA-1702-translator-empty-default.xml"); - assertNotNull(config.getParser()); - assertNotNull(config.getDetector()); - assertNotNull(config.getTranslator()); - - assertEquals(EmptyTranslator.class, config.getTranslator().getClass()); - } -} diff --git a/tika-parsers/src/test/java/org/apache/tika/detect/TestContainerAwareDetector.java b/tika-parsers/src/test/java/org/apache/tika/detect/TestContainerAwareDetector.java index 75dcda9..3517471 100644 --- a/tika-parsers/src/test/java/org/apache/tika/detect/TestContainerAwareDetector.java +++ b/tika-parsers/src/test/java/org/apache/tika/detect/TestContainerAwareDetector.java @@ -26,20 +26,17 @@ import java.io.InputStream; import org.apache.poi.poifs.filesystem.NPOIFSFileSystem; -import org.apache.tika.config.TikaConfig; import org.apache.tika.io.TikaInputStream; import org.apache.tika.metadata.Metadata; import org.apache.tika.mime.MediaType; -import org.apache.tika.mime.MimeTypes; import org.junit.Test; /** * Junit test class for {@link ContainerAwareDetector} */ public class TestContainerAwareDetector { - private final TikaConfig tikaConfig = TikaConfig.getDefaultConfig(); - private final MimeTypes mimeTypes = tikaConfig.getMimeRepository(); - private final Detector detector = new DefaultDetector(mimeTypes); + + private final Detector detector = new DefaultDetector(); private void assertTypeByData(String file, String type) throws Exception { assertTypeByNameAndData(file, null, type); @@ -52,27 +49,20 @@ assertTypeByNameAndData(file, byNameAndData); } private void assertTypeByNameAndData(String dataFile, String name, String type) throws Exception { - assertTypeByNameAndData(dataFile, name, type, null); - } - private void assertTypeByNameAndData(String dataFile, String name, String typeFromDetector, String typeFromMagic) throws Exception { - try (TikaInputStream stream = TikaInputStream.get( - 
TestContainerAwareDetector.class.getResource("/test-documents/" + dataFile))) { - Metadata m = new Metadata(); - if (name != null) - m.add(Metadata.RESOURCE_NAME_KEY, name); - - // Mime Magic version is likely to be less precise - if (typeFromMagic != null) { - assertEquals( - MediaType.parse(typeFromMagic), - mimeTypes.detect(stream, m)); - } - - // All being well, the detector should get it perfect - assertEquals( - MediaType.parse(typeFromDetector), - detector.detect(stream, m)); - } + TikaInputStream stream = TikaInputStream.get( + TestContainerAwareDetector.class.getResource( + "/test-documents/" + dataFile)); + try { + Metadata m = new Metadata(); + if (name != null) + m.add(Metadata.RESOURCE_NAME_KEY, name); + + assertEquals( + MediaType.parse(type), + detector.detect(stream, m)); + } finally { + stream.close(); + } } @Test @@ -88,7 +78,6 @@ assertTypeByData("testPUBLISHER.pub", "application/x-mspublisher"); assertTypeByData("testWORKS.wps", "application/vnd.ms-works"); assertTypeByData("testWORKS2000.wps", "application/vnd.ms-works"); - // older Works Word Processor files can't be recognized // they were created with Works Word Processor 7.0 (hence the text inside) // and exported to the older formats with the "Save As" feature @@ -97,7 +86,6 @@ assertTypeByData("testWORKSSpreadsheet7.0.xlr", "application/x-tika-msworks-spreadsheet"); assertTypeByData("testPROJECT2003.mpp", "application/vnd.ms-project"); assertTypeByData("testPROJECT2007.mpp", "application/vnd.ms-project"); - // Excel95 can be detected by not parsed assertTypeByData("testEXCEL_95.xls", "application/vnd.ms-excel"); @@ -105,8 +93,6 @@ assertTypeByData("testCOREL.shw", "application/x-corelpresentations"); assertTypeByData("testQUATTRO.qpw", "application/x-quattro-pro"); assertTypeByData("testQUATTRO.wb3", "application/x-quattro-pro"); - - assertTypeByData("testHWP_5.0.hwp", "application/x-hwp-v5"); // With the filename and data @@ -163,13 +149,17 @@ @Test public void testOpenContainer() throws 
Exception { - try (TikaInputStream stream = TikaInputStream.get( - TestContainerAwareDetector.class.getResource("/test-documents/testPPT.ppt"))) { + TikaInputStream stream = TikaInputStream.get( + TestContainerAwareDetector.class.getResource( + "/test-documents/testPPT.ppt")); + try { assertNull(stream.getOpenContainer()); assertEquals( MediaType.parse("application/vnd.ms-powerpoint"), detector.detect(stream, new Metadata())); assertTrue(stream.getOpenContainer() instanceof NPOIFSFileSystem); + } finally { + stream.close(); } } @@ -207,16 +197,7 @@ assertTypeByData("testPPT.ppsx", "application/vnd.openxmlformats-officedocument.presentationml.slideshow"); assertTypeByData("testPPT.ppsm", "application/vnd.ms-powerpoint.slideshow.macroEnabled.12"); assertTypeByData("testDOTM.dotm", "application/vnd.ms-word.template.macroEnabled.12"); - assertTypeByData("testEXCEL.strict.xlsx", "application/vnd.openxmlformats-officedocument.spreadsheetml.sheet"); - assertTypeByData("testPPT.xps", "application/vnd.ms-xpsdocument"); - - assertTypeByData("testVISIO.vsdm", "application/vnd.ms-visio.drawing.macroenabled.12"); - assertTypeByData("testVISIO.vsdx", "application/vnd.ms-visio.drawing"); - assertTypeByData("testVISIO.vssm", "application/vnd.ms-visio.stencil.macroenabled.12"); - assertTypeByData("testVISIO.vssx", "application/vnd.ms-visio.stencil"); - assertTypeByData("testVISIO.vstm", "application/vnd.ms-visio.template.macroenabled.12"); - assertTypeByData("testVISIO.vstx", "application/vnd.ms-visio.template"); - + // .xlsb is an OOXML file containing the binary parts, and not // an OLE2 file as you might initially expect! 
         assertTypeByData("testEXCEL.xlsb", "application/vnd.ms-excel.sheet.binary.macroEnabled.12");
@@ -300,9 +281,13 @@
     private void assertRemovalTempfiles(String fileName) throws Exception {
         int numberOfTempFiles = countTemporaryFiles();
 
-        try (TikaInputStream stream = TikaInputStream.get(
-                TestContainerAwareDetector.class.getResource("/test-documents/" + fileName))) {
+        TikaInputStream stream = TikaInputStream.get(
+                TestContainerAwareDetector.class.getResource(
+                        "/test-documents/" + fileName));
+        try {
             detector.detect(stream, new Metadata());
+        } finally {
+            stream.close();
+        }
         }
 
         assertEquals(numberOfTempFiles, countTemporaryFiles());
@@ -336,16 +321,14 @@
         assertTypeByData("testWAR.war", "application/x-tika-java-web-archive");
         assertTypeByData("testEAR.ear", "application/x-tika-java-enterprise-archive");
         assertTypeByData("testAPK.apk", "application/vnd.android.package-archive");
-
-        // JAR with HTML files in it
-        assertTypeByNameAndData("testJAR_with_HTML.jar", "testJAR_with_HTML.jar",
-                "application/java-archive", "application/java-archive");
     }
 
     private TikaInputStream getTruncatedFile(String name, int n)
             throws IOException {
-        try (InputStream input = TestContainerAwareDetector.class.getResourceAsStream(
-                "/test-documents/" + name)) {
+        InputStream input =
+                TestContainerAwareDetector.class.getResourceAsStream(
+                        "/test-documents/" + name);
+        try {
             byte[] bytes = new byte[n];
             int m = 0;
             while (m < bytes.length) {
@@ -357,6 +340,8 @@
                 }
             }
             return TikaInputStream.get(bytes);
+        } finally {
+            input.close();
         }
     }
 
@@ -365,37 +350,50 @@
         // First up a truncated OOXML (zip) file
 
         // With only the data supplied, the best we can do is the container
+        TikaInputStream xlsx = getTruncatedFile("testEXCEL.xlsx", 300);
         Metadata m = new Metadata();
-        try (TikaInputStream xlsx = getTruncatedFile("testEXCEL.xlsx", 300)) {
+        try {
             assertEquals(
                     MediaType.application("x-tika-ooxml"),
                     detector.detect(xlsx, m));
+        } finally {
+            xlsx.close();
         }
 
         // With truncated data + filename, we can use the filename to specialise
+        xlsx = getTruncatedFile("testEXCEL.xlsx", 300);
         m = new Metadata();
         m.add(Metadata.RESOURCE_NAME_KEY, "testEXCEL.xlsx");
-        try (TikaInputStream xlsx = getTruncatedFile("testEXCEL.xlsx", 300)) {
+        try {
             assertEquals(
                     MediaType.application("vnd.openxmlformats-officedocument.spreadsheetml.sheet"),
                     detector.detect(xlsx, m));
-        }
+        } finally {
+            xlsx.close();
+        }
 
+        // Now a truncated OLE2 file
+        TikaInputStream xls = getTruncatedFile("testEXCEL.xls", 400);
         m = new Metadata();
-        try (TikaInputStream xls = getTruncatedFile("testEXCEL.xls", 400)) {
+        try {
             assertEquals(
                     MediaType.application("x-tika-msoffice"),
                     detector.detect(xls, m));
+        } finally {
+            xls.close();
         }
 
         // Finally a truncated OLE2 file, with a filename available
+        xls = getTruncatedFile("testEXCEL.xls", 400);
         m = new Metadata();
         m.add(Metadata.RESOURCE_NAME_KEY, "testEXCEL.xls");
-        try (TikaInputStream xls = getTruncatedFile("testEXCEL.xls", 400)) {
+        try {
             assertEquals(
                     MediaType.application("vnd.ms-excel"),
                     detector.detect(xls, m));
+        } finally {
+            xls.close();
         }
     }
 
diff --git a/tika-parsers/src/test/java/org/apache/tika/embedder/ExternalEmbedderTest.java b/tika-parsers/src/test/java/org/apache/tika/embedder/ExternalEmbedderTest.java
index e988aff..2e71681 100644
--- a/tika-parsers/src/test/java/org/apache/tika/embedder/ExternalEmbedderTest.java
+++ b/tika-parsers/src/test/java/org/apache/tika/embedder/ExternalEmbedderTest.java
@@ -16,7 +16,6 @@
  */
 package org.apache.tika.embedder;
 
-import static java.nio.charset.StandardCharsets.UTF_8;
 import static org.junit.Assert.assertNotNull;
 import static org.junit.Assert.assertTrue;
 import static org.junit.Assert.fail;
@@ -35,9 +34,10 @@
 import java.text.SimpleDateFormat;
 import java.util.Date;
 import java.util.HashMap;
-import java.util.Locale;
 import java.util.Map;
 
+import org.apache.tika.embedder.Embedder;
+import org.apache.tika.embedder.ExternalEmbedder;
 import org.apache.tika.exception.TikaException;
 import org.apache.tika.io.TemporaryResources;
 import org.apache.tika.io.TikaInputStream;
@@ -58,8 +58,8 @@
 public class ExternalEmbedderTest {
 
     protected static final DateFormat EXPECTED_METADATA_DATE_FORMATTER =
-            new SimpleDateFormat("yyyy-MM-dd'T'HH:mm:ss", Locale.ROOT);
-    protected static final String DEFAULT_CHARSET = UTF_8.name();
+            new SimpleDateFormat("yyyy-MM-dd'T'HH:mm:ss");
+    protected static final String DEFAULT_CHARSET = "UTF-8";
     private static final String COMMAND_METADATA_ARGUMENT_DESCRIPTION = "dc:description";
     private static final String TEST_TXT_PATH = "/test-documents/testTXT.txt";
 
diff --git a/tika-parsers/src/test/java/org/apache/tika/mime/TestMimeTypes.java b/tika-parsers/src/test/java/org/apache/tika/mime/TestMimeTypes.java
index c3d13b7..0db7f55 100644
--- a/tika-parsers/src/test/java/org/apache/tika/mime/TestMimeTypes.java
+++ b/tika-parsers/src/test/java/org/apache/tika/mime/TestMimeTypes.java
@@ -17,9 +17,6 @@
 package org.apache.tika.mime;
 
 // Junit imports
-import static java.nio.charset.StandardCharsets.UTF_16BE;
-import static java.nio.charset.StandardCharsets.UTF_16LE;
-import static java.nio.charset.StandardCharsets.UTF_8;
 import static org.junit.Assert.assertEquals;
 import static org.junit.Assert.assertNotNull;
 import static org.junit.Assert.assertNotSame;
@@ -235,28 +232,6 @@
     }
 
     /**
-     * Files from Excel 2 through 4 are based on the BIFF record
-     * structure, but without a wrapping OLE2 structure.
-     * Excel 5 and Excel 95+ work on OLE2
-     */
-    @Test
-    public void testOldExcel() throws Exception {
-        // With just a name, we'll think everything's a new Excel file
-        assertTypeByName("application/vnd.ms-excel","testEXCEL_4.xls");
-        assertTypeByName("application/vnd.ms-excel","testEXCEL_5.xls");
-        assertTypeByName("application/vnd.ms-excel","testEXCEL_95.xls");
-
-        // With data, we can work out if it's old or new style
-        assertTypeByData("application/vnd.ms-excel.sheet.4","testEXCEL_4.xls");
-        assertTypeByData("application/x-tika-msoffice","testEXCEL_5.xls");
-        assertTypeByData("application/x-tika-msoffice","testEXCEL_95.xls");
-
-        assertTypeByNameAndData("application/vnd.ms-excel.sheet.4","testEXCEL_4.xls");
-        assertTypeByNameAndData("application/vnd.ms-excel","testEXCEL_5.xls");
-        assertTypeByNameAndData("application/vnd.ms-excel","testEXCEL_95.xls");
-    }
-
-    /**
      * Note - detecting container formats by mime magic is very very
      * iffy, as we can't be sure where things will end up.
      * People really ought to use the container aware detection...
@@ -285,40 +260,6 @@ assertTypeByNameAndData("application/vnd.ms-powerpoint.template.macroenabled.12", "testPPT.potm"); assertTypeByNameAndData("application/vnd.ms-powerpoint.slideshow.macroenabled.12", "testPPT.ppsm"); } - - /** - * Note - container based formats, needs container detection - * to be properly correct - */ - @Test - public void testVisioDetection() throws Exception { - // By Name, should get it right - assertTypeByName("application/vnd.visio", "testVISIO.vsd"); - assertTypeByName("application/vnd.ms-visio.drawing.macroenabled.12", "testVISIO.vsdm"); - assertTypeByName("application/vnd.ms-visio.drawing", "testVISIO.vsdx"); - assertTypeByName("application/vnd.ms-visio.stencil.macroenabled.12", "testVISIO.vssm"); - assertTypeByName("application/vnd.ms-visio.stencil", "testVISIO.vssx"); - assertTypeByName("application/vnd.ms-visio.template.macroenabled.12", "testVISIO.vstm"); - assertTypeByName("application/vnd.ms-visio.template", "testVISIO.vstx"); - - // By Name and Data, should get it right - assertTypeByNameAndData("application/vnd.visio", "testVISIO.vsd"); - assertTypeByNameAndData("application/vnd.ms-visio.drawing.macroenabled.12", "testVISIO.vsdm"); - assertTypeByNameAndData("application/vnd.ms-visio.drawing", "testVISIO.vsdx"); - assertTypeByNameAndData("application/vnd.ms-visio.stencil.macroenabled.12", "testVISIO.vssm"); - assertTypeByNameAndData("application/vnd.ms-visio.stencil", "testVISIO.vssx"); - assertTypeByNameAndData("application/vnd.ms-visio.template.macroenabled.12", "testVISIO.vstm"); - assertTypeByNameAndData("application/vnd.ms-visio.template", "testVISIO.vstx"); - - // By Data only, will get the container parent - assertTypeByData("application/x-tika-msoffice", "testVISIO.vsd"); - assertTypeByData("application/x-tika-ooxml", "testVISIO.vsdm"); - assertTypeByData("application/x-tika-ooxml", "testVISIO.vsdx"); - assertTypeByData("application/x-tika-ooxml", "testVISIO.vssm"); - assertTypeByData("application/x-tika-ooxml", 
"testVISIO.vssx"); - assertTypeByData("application/x-tika-ooxml", "testVISIO.vstm"); - assertTypeByData("application/x-tika-ooxml", "testVISIO.vstx"); - } /** * Note - detecting container formats by mime magic is very very @@ -346,7 +287,7 @@ assertTypeByName("application/x-archive", "test.ar"); assertTypeByName("application/zip", "test.zip"); assertTypeByName("application/x-tar", "test.tar"); - assertTypeByName("application/gzip", "test.tgz"); // See GZIP, not tar contents of it + assertTypeByName("application/x-gzip", "test.tgz"); // See GZIP, not tar contents of it assertTypeByName("application/x-cpio", "test.cpio"); // TODO Add an example .deb and .udeb, then check these @@ -356,22 +297,8 @@ assertTypeByData("application/x-archive", "testARofSND.ar"); assertTypeByData("application/zip", "test-documents.zip"); assertTypeByData("application/x-gtar", "test-documents.tar"); // GNU TAR - assertTypeByData("application/gzip", "test-documents.tgz"); // See GZIP, not tar contents of it + assertTypeByData("application/x-gzip", "test-documents.tgz"); // See GZIP, not tar contents of it assertTypeByData("application/x-cpio", "test-documents.cpio"); - - // For spanned zip files, the .zip file doesn't have the header, it's the other parts - assertTypeByData("application/octet-stream", "test-documents-spanned.zip"); - assertTypeByData("application/zip", "test-documents-spanned.z01"); - } - - @Test - public void testFeedsDetection() throws Exception { - assertType("application/rss+xml", "rsstest.rss"); - assertType("application/atom+xml", "testATOM.atom"); - assertTypeByData("application/rss+xml", "rsstest.rss"); - assertTypeByName("application/rss+xml", "rsstest.rss"); - assertTypeByData("application/atom+xml", "testATOM.atom"); - assertTypeByName("application/atom+xml", "testATOM.atom"); } @Test @@ -394,20 +321,8 @@ assertTypeByName("image/jpeg", "x.jif"); assertTypeByName("image/jpeg", "x.jfif"); assertTypeByName("image/jpeg", "x.jfi"); - - assertType("image/jp2", 
"testJPEG.jp2"); - assertTypeByData("image/jp2", "testJPEG.jp2"); - assertTypeByName("image/jp2", "x.jp2"); - } - - @Test - public void testBpgDetection() throws Exception { - assertType("image/x-bpg", "testBPG.bpg"); - assertTypeByData("image/x-bpg", "testBPG.bpg"); - assertTypeByData("image/x-bpg", "testBPG_commented.bpg"); - assertTypeByName("image/x-bpg", "x.bpg"); - } - + } + @Test public void testTiffDetection() throws Exception { assertType("image/tiff", "testTIFF.tif"); @@ -431,14 +346,6 @@ assertTypeByData("image/png", "testPNG.png"); assertTypeByName("image/png", "x.png"); assertTypeByName("image/png", "x.PNG"); - } - - @Test - public void testWEBPDetection() throws Exception { - assertType("image/webp", "testWEBP.webp"); - assertTypeByData("image/webp", "testWEBP.webp"); - assertTypeByName("image/webp", "x.webp"); - assertTypeByName("image/webp", "x.WEBP"); } @Test @@ -500,25 +407,18 @@ assertTypeByName("image/svg+xml", "x.SVG"); // Should *.svgz be svg or gzip - assertType("application/gzip", "testSVG.svgz"); - assertTypeByData("application/gzip", "testSVG.svgz"); + assertType("application/x-gzip", "testSVG.svgz"); + assertTypeByData("application/x-gzip", "testSVG.svgz"); assertTypeByName("image/svg+xml", "x.svgz"); assertTypeByName("image/svg+xml", "x.SVGZ"); } @Test public void testPdfDetection() throws Exception { - // PDF extension by name is enough + assertType("application/pdf", "testPDF.pdf"); + assertTypeByData("application/pdf", "testPDF.pdf"); assertTypeByName("application/pdf", "x.pdf"); assertTypeByName("application/pdf", "x.PDF"); - - // For normal PDFs, can get by name or data or both - assertType("application/pdf", "testPDF.pdf"); - assertTypeByData("application/pdf", "testPDF.pdf"); - - // PDF with a BoM works both ways too - assertType("application/pdf", "testPDF_bom.pdf"); - assertTypeByData("application/pdf", "testPDF_bom.pdf"); } @Test @@ -555,9 +455,6 @@ assertTypeByName("text/css", "testCSS.css"); assertType( "text/css", 
"testCSS.css"); - assertTypeByName("text/csv", "testCSV.csv"); - assertType( "text/csv", "testCSV.csv"); - assertTypeByName("text/html", "testHTML.html"); assertType( "text/html", "testHTML.html"); @@ -572,22 +469,6 @@ // OSX Native Extension assertTypeDetection("testJNILIB.jnilib", "application/x-java-jnilib"); - } - - @Test - public void testXmlAndHtmlDetection() throws Exception { - assertTypeByData("application/xml", "" - .getBytes(UTF_8)); - assertTypeByData("application/xml", "\uFEFF" - .getBytes(UTF_16LE)); - assertTypeByData("application/xml", "\uFEFF" - .getBytes(UTF_16BE)); - assertTypeByData("application/xml", "" - .getBytes(UTF_8)); - assertTypeByData("text/html", "HTML" - .getBytes(UTF_8)); - assertTypeByData("text/html", "HTML" - .getBytes(UTF_8)); } @Test @@ -603,8 +484,8 @@ assertTypeByName("application/x-ms-wmz", "x.wmz"); assertTypeByName("application/x-ms-wmz", "x.WMZ"); // TODO: Need a test emz file - assertTypeByName("application/gzip", "x.emz"); - assertTypeByName("application/gzip", "x.EMZ"); + assertTypeByName("application/x-gzip", "x.emz"); + assertTypeByName("application/x-gzip", "x.EMZ"); } @Test @@ -651,8 +532,7 @@ repo.getMediaTypeRegistry().getSupertype(getTypeByNameAndData("testDITA.ditamap")).toString()); assertEquals("application/dita+xml", repo.getMediaTypeRegistry().getSupertype(getTypeByNameAndData("testDITA.dita")).toString()); - // Concept inherits from topic - assertEquals("application/dita+xml; format=topic", + assertEquals("application/dita+xml", repo.getMediaTypeRegistry().getSupertype(getTypeByNameAndData("testDITA2.dita")).toString()); } @@ -768,7 +648,7 @@ assertType("audio/x-wav", "testWAV.wav"); assertType("audio/midi", "testMID.mid"); assertType("application/x-msaccess", "testACCESS.mdb"); - assertType("application/x-font-ttf", "testTrueType3.ttf"); + assertType("application/x-font-ttf", "testTrueType.ttf"); } @Test @@ -821,39 +701,13 @@ } @Test - public void testEmail() throws IOException { - // EMLX + public void 
testEmlx() throws IOException { assertTypeDetection("testEMLX.emlx", "message/x-emlx"); - - // Groupwise + } + + @Test + public void testGroupWiseEml() throws Exception { assertTypeDetection("testGroupWiseEml.eml", "message/rfc822"); - - // Lotus - assertTypeDetection("testLotusEml.eml", "message/rfc822"); - - // Thunderbird - doesn't currently work by name - assertTypeByNameAndData("message/rfc822", "testThunderbirdEml.eml"); - } - - @Test - public void testAxCrypt() throws Exception { - // test-TXT.txt encrypted with a key of "tika" - assertTypeDetection("testTXT-tika.axx", "application/x-axcrypt"); - } - - @Test - public void testWindowsEXE() throws Exception { - assertTypeByName("application/x-msdownload", "x.dll"); - assertTypeByName("application/x-ms-installer", "x.msi"); - assertTypeByName("application/x-dosexec", "x.exe"); - - assertTypeByData("application/x-msdownload; format=pe", "testTinyPE.exe"); - assertTypeByNameAndData("application/x-msdownload; format=pe", "testTinyPE.exe"); - - // A jar file with part of a PE header, but not a full one - // should still be detected as a zip or jar (without/with name) - assertTypeByData("application/zip", "testJAR_with_PEHDR.jar"); - assertTypeByNameAndData("application/java-archive", "testJAR_with_PEHDR.jar"); } @Test @@ -889,85 +743,7 @@ assertText(new byte[] { '\t', '\r', '\n', 0x0C, 0x1B }); assertNotText(new byte[] { '\t', '\r', '\n', 0x0E, 0x1C }); } - - @Test - public void testBerkeleyDB() throws IOException { - assertTypeByData( - "application/x-berkeley-db; format=btree; version=2", - "testBDB_btree_2.db"); - assertTypeByData( - "application/x-berkeley-db; format=btree; version=3", - "testBDB_btree_3.db"); - assertTypeByData( - "application/x-berkeley-db; format=btree; version=4", - "testBDB_btree_4.db"); - // V4 and V5 share the same btree format - assertTypeByData( - "application/x-berkeley-db; format=btree; version=4", - "testBDB_btree_5.db"); - - assertTypeByData( - "application/x-berkeley-db; 
format=hash; version=2", - "testBDB_hash_2.db"); - assertTypeByData( - "application/x-berkeley-db; format=hash; version=3", - "testBDB_hash_3.db"); - assertTypeByData( - "application/x-berkeley-db; format=hash; version=4", - "testBDB_hash_4.db"); - assertTypeByData( - "application/x-berkeley-db; format=hash; version=5", - "testBDB_hash_5.db"); - } - - /** - * CBOR typically contains HTML - */ - @Test - public void testCBOR() throws IOException { - assertTypeByNameAndData("application/cbor", "NUTCH-1997.cbor"); - assertTypeByData("application/cbor", "NUTCH-1997.cbor"); - } - - @Test - public void testZLIB() throws IOException { - // ZLIB encoded versions of testTXT.txt - assertTypeByData("application/zlib", "testTXT.zlib"); - assertTypeByData("application/zlib", "testTXT.zlib0"); - assertTypeByData("application/zlib", "testTXT.zlib5"); - assertTypeByData("application/zlib", "testTXT.zlib9"); - } - - @Test - public void testTextFormats() throws Exception { - assertType("application/x-bibtex-text-file", "testBIBTEX.bib"); - assertTypeByData("application/x-bibtex-text-file", "testBIBTEX.bib"); - } - - @Test - public void testCodeFormats() throws Exception { - assertType("text/x-csrc", "testC.c"); - assertType("text/x-chdr", "testH.h"); - assertTypeByData("text/x-csrc", "testC.c"); - assertTypeByData("text/x-chdr", "testH.h"); - - assertTypeByName("text/x-java-source", "testJAVA.java"); - assertType("text/x-java-properties", "testJAVAPROPS.properties"); - - assertType("text/x-matlab", "testMATLAB.m"); - assertType("text/x-matlab", "testMATLAB_wtsgaus.m"); - assertType("text/x-matlab", "testMATLAB_barcast.m"); - assertTypeByData("text/x-matlab", "testMATLAB.m"); - assertTypeByData("text/x-matlab", "testMATLAB_wtsgaus.m"); - assertTypeByData("text/x-matlab", "testMATLAB_barcast.m"); - } - - @Test - public void testWebVTT() throws Exception { - assertType("text/vtt", "testWebVTT.vtt"); - assertTypeByData("text/vtt", "testWebVTT.vtt"); - } - + private void assertText(byte[] 
prefix) throws IOException { assertMagic("text/plain", prefix); } @@ -984,12 +760,14 @@ } private void assertType(String expected, String filename) throws Exception { - try (InputStream stream = TestMimeTypes.class.getResourceAsStream( - "/test-documents/" + filename)) { - assertNotNull("Test file not found: " + filename, stream); + InputStream stream = TestMimeTypes.class.getResourceAsStream( + "/test-documents/" + filename); + try { Metadata metadata = new Metadata(); metadata.set(Metadata.RESOURCE_NAME_KEY, filename); assertEquals(expected, repo.detect(stream, metadata).toString()); + } finally { + stream.close(); } } @@ -1002,20 +780,26 @@ private void assertTypeByData(String expected, String filename) throws IOException { - try (InputStream stream = TestMimeTypes.class.getResourceAsStream( - "/test-documents/" + filename)) { - assertNotNull("Test file not found: " + filename, stream); + InputStream stream = TestMimeTypes.class.getResourceAsStream( + "/test-documents/" + filename); + assertNotNull("Test file not found: " + filename, stream); + try { Metadata metadata = new Metadata(); assertEquals(expected, repo.detect(stream, metadata).toString()); + } finally { + stream.close(); } } private void assertTypeByData(String expected, byte[] data) throws IOException { - try (InputStream stream = new ByteArrayInputStream(data)) { - Metadata metadata = new Metadata(); - assertEquals(expected, repo.detect(stream, metadata).toString()); - } + InputStream stream = new ByteArrayInputStream(data); + try { + Metadata metadata = new Metadata(); + assertEquals(expected, repo.detect(stream, metadata).toString()); + } finally { + stream.close(); + } } private void assertTypeDetection(String filename, String type) @@ -1036,12 +820,15 @@ } private MediaType getTypeByNameAndData(String filename) throws IOException { - try (InputStream stream = TestMimeTypes.class.getResourceAsStream( - "/test-documents/" + filename)) { - assertNotNull("Test document not found: " + filename, 
stream); - Metadata metadata = new Metadata(); - metadata.set(Metadata.RESOURCE_NAME_KEY, filename); - return repo.detect(stream, metadata); - } + InputStream stream = TestMimeTypes.class.getResourceAsStream( + "/test-documents/" + filename); + assertNotNull("Test document not found: " + filename, stream); + try { + Metadata metadata = new Metadata(); + metadata.set(Metadata.RESOURCE_NAME_KEY, filename); + return repo.detect(stream, metadata); + } finally { + stream.close(); + } } } diff --git a/tika-parsers/src/test/java/org/apache/tika/parser/AutoDetectParserTest.java b/tika-parsers/src/test/java/org/apache/tika/parser/AutoDetectParserTest.java index 91b054e..606b7d4 100644 --- a/tika-parsers/src/test/java/org/apache/tika/parser/AutoDetectParserTest.java +++ b/tika-parsers/src/test/java/org/apache/tika/parser/AutoDetectParserTest.java @@ -16,32 +16,24 @@ */ package org.apache.tika.parser; -import static java.nio.charset.StandardCharsets.UTF_8; import static org.junit.Assert.assertEquals; import static org.junit.Assert.assertNotNull; import static org.junit.Assert.assertTrue; import static org.junit.Assert.fail; import java.io.ByteArrayInputStream; -import java.io.ByteArrayOutputStream; import java.io.IOException; import java.io.InputStream; import java.util.HashSet; import java.util.Set; -import java.util.zip.ZipEntry; -import java.util.zip.ZipOutputStream; import org.apache.tika.config.TikaConfig; import org.apache.tika.detect.Detector; import org.apache.tika.exception.TikaException; import org.apache.tika.metadata.Metadata; -import org.apache.tika.metadata.TikaCoreProperties; import org.apache.tika.metadata.XMPDM; import org.apache.tika.mime.MediaType; import org.apache.tika.sax.BodyContentHandler; -import org.gagravarr.tika.FlacParser; -import org.gagravarr.tika.OpusParser; -import org.gagravarr.tika.VorbisParser; import org.junit.Test; import org.xml.sax.ContentHandler; @@ -69,8 +61,7 @@ private static final String JPEG = "image/jpeg"; private static final 
String PNG = "image/png"; private static final String OGG_VORBIS = "audio/vorbis"; - private static final String OGG_OPUS = "audio/opus"; - private static final String OGG_FLAC = "audio/x-oggflac"; + private static final String OGG_FLAC = "audio/x-flac"; private static final String FLAC_NATIVE= "audio/x-flac"; private static final String OPENOFFICE = "application/vnd.oasis.opendocument.text"; @@ -82,11 +73,16 @@ * @throws IOException */ private void assertAutoDetect(TestParams tp) throws Exception { - try (InputStream input = AutoDetectParserTest.class.getResourceAsStream(tp.resourceRealName)) { - if (input == null) { - fail("Could not open stream from specified resource: " - + tp.resourceRealName); - } + + InputStream input = + AutoDetectParserTest.class.getResourceAsStream(tp.resourceRealName); + + if (input == null) { + fail("Could not open stream from specified resource: " + + tp.resourceRealName); + } + + try { Metadata metadata = new Metadata(); metadata.set(Metadata.RESOURCE_NAME_KEY, tp.resourceStatedName); metadata.set(Metadata.CONTENT_TYPE, tp.statedType); @@ -97,9 +93,11 @@ tp.realType, metadata.get(Metadata.CONTENT_TYPE)); if (tp.expectedContentFragment != null) { - assertTrue("Expected content not found: " + tp, - handler.toString().contains(tp.expectedContentFragment)); + assertTrue("Expected content not found: " + tp, + handler.toString().contains(tp.expectedContentFragment)); } + } finally { + input.close(); } } @@ -256,111 +254,80 @@ */ @Test public void testZipBombPrevention() throws Exception { - try (InputStream tgz = AutoDetectParserTest.class.getResourceAsStream( - "/test-documents/TIKA-216.tgz")) { + InputStream tgz = AutoDetectParserTest.class.getResourceAsStream( + "/test-documents/TIKA-216.tgz"); + try { Metadata metadata = new Metadata(); ContentHandler handler = new BodyContentHandler(-1); new AutoDetectParser(tika).parse(tgz, handler, metadata); fail("Zip bomb was not detected"); } catch (TikaException e) { // expected - } - } - - /** - 
* Make sure XML parse errors don't trigger ZIP bomb detection. - * - * @see TIKA-1322 - */ - @Test - public void testNoBombDetectedForInvalidXml() throws Exception { - // create zip with ten empty / invalid XML files, 1.xml .. 10.xml - ByteArrayOutputStream baos = new ByteArrayOutputStream(); - ZipOutputStream zos = new ZipOutputStream(baos); - for (int i = 1; i <= 10; i++) { - zos.putNextEntry(new ZipEntry(i + ".xml")); - zos.closeEntry(); - } - zos.finish(); - zos.close(); - new AutoDetectParser(tika).parse(new ByteArrayInputStream(baos.toByteArray()), new BodyContentHandler(-1), - new Metadata()); - } - - /** - * Test to ensure that the Ogg Audio parsers (Vorbis, Opus, Flac etc) - * have been correctly included, and are available + } finally { + tgz.close(); + } + + } + + /** + * Test to ensure that the Vorbis and FLAC parsers have been correctly + * included, and are available */ @SuppressWarnings("deprecation") @Test - public void testOggFlacAudio() throws Exception { + public void testVorbisFlac() throws Exception { // The three test files should all have similar test data String[] testFiles = new String[] { - "testVORBIS.ogg", "testFLAC.flac", "testFLAC.oga", - "testOPUS.opus" + "testVORBIS.ogg", "testFLAC.oga", "testFLAC.flac" }; - MediaType[] mediaTypes = new MediaType[] { - MediaType.parse(OGG_VORBIS), MediaType.parse(FLAC_NATIVE), - MediaType.parse(OGG_FLAC), MediaType.parse(OGG_OPUS) + String[] mimetypes = new String[] { + OGG_VORBIS, OGG_FLAC, FLAC_NATIVE }; - - // Check we can load the parsers, and they claim to do the right things - VorbisParser vParser = new VorbisParser(); - assertNotNull("Parser not found for " + mediaTypes[0], - vParser.getSupportedTypes(new ParseContext())); - - FlacParser fParser = new FlacParser(); - assertNotNull("Parser not found for " + mediaTypes[1], - fParser.getSupportedTypes(new ParseContext())); - assertNotNull("Parser not found for " + mediaTypes[2], - fParser.getSupportedTypes(new ParseContext())); - - OpusParser 
oParser = new OpusParser(); - assertNotNull("Parser not found for " + mediaTypes[3], - oParser.getSupportedTypes(new ParseContext())); // Check we found the parser CompositeParser parser = (CompositeParser)tika.getParser(); - for (MediaType mt : mediaTypes) { - assertNotNull("Parser not found for " + mt, parser.getParsers().get(mt) ); + for (String type : mimetypes) { + MediaType mt = MediaType.parse(type); + assertNotNull("Parser not found for " + type, parser.getParsers().get(mt) ); } // Have each file parsed, and check for (int i=0; i expected = - new HashMap(); - - expected.put(CommonsDigester.DigestAlgorithm.MD2,"d768c8e27b0b52c6eaabfaa7122d1d4f"); - expected.put(CommonsDigester.DigestAlgorithm.MD5,"59f626e09a8c16ab6dbc2800c685f772"); - expected.put(CommonsDigester.DigestAlgorithm.SHA1,"7a1f001d163ac90d8ea54c050faf5a38079788a6"); - expected.put(CommonsDigester.DigestAlgorithm.SHA256,"c4b7fab030a8b6a9d6691f6699ac8e6f" + - "82bc53764a0f1430d134ae3b70c32654"); - expected.put(CommonsDigester.DigestAlgorithm.SHA384,"ebe368b9326fef44408290724d187553"+ - "8b8a6923fdf251ddab72c6e4b5d54160" + - "9db917ba4260d1767995a844d8d654df"); - expected.put(CommonsDigester.DigestAlgorithm.SHA512,"ee46d973ee1852c018580c242955974d"+ - "da4c21f36b54d7acd06fcf68e974663b"+ - "fed1d256875be58d22beacf178154cc3"+ - "a1178cb73443deaa53aa0840324708bb"); - - //test each one - for (CommonsDigester.DigestAlgorithm algo : CommonsDigester.DigestAlgorithm.values()) { - Metadata m = new Metadata(); - XMLResult xml = getXML("test_recursive_embedded.docx", - new DigestingParser(p, new CommonsDigester(UNLIMITED, algo)), m); - assertEquals(algo.toString(), expected.get(algo), m.get(P + algo.toString())); - } - - - //test comma separated - CommonsDigester.DigestAlgorithm[] algos = CommonsDigester.parse("md5,sha256,sha384,sha512"); - Metadata m = new Metadata(); - XMLResult xml = getXML("test_recursive_embedded.docx", - new DigestingParser(p, new CommonsDigester(UNLIMITED, algos)), m); - for 
(CommonsDigester.DigestAlgorithm algo : new CommonsDigester.DigestAlgorithm[]{ - CommonsDigester.DigestAlgorithm.MD5, - CommonsDigester.DigestAlgorithm.SHA256, - CommonsDigester.DigestAlgorithm.SHA384, - CommonsDigester.DigestAlgorithm.SHA512}) { - assertEquals(algo.toString(), expected.get(algo), m.get(P + algo.toString())); - } - - assertNull(m.get(P+CommonsDigester.DigestAlgorithm.MD2.toString())); - assertNull(m.get(P+CommonsDigester.DigestAlgorithm.SHA1.toString())); - - } - - @Test - public void testLimitedRead() throws Exception { - CommonsDigester.DigestAlgorithm algo = CommonsDigester.DigestAlgorithm.MD5; - int limit = 100; - byte[] bytes = new byte[limit]; - InputStream is = getResourceAsStream("/test-documents/test_recursive_embedded.docx"); - is.read(bytes, 0, limit); - is.close(); - Metadata m = new Metadata(); - try { - XMLResult xml = getXML(TikaInputStream.get(bytes), - new DigestingParser(p, new CommonsDigester(100, algo)), m); - } catch (TikaException e) { - //thrown because this is just a file fragment - assertContains("Unexpected RuntimeException from org.apache.tika.parser.microsoft.ooxml.OOXMLParser", - e.getMessage()); - } - String expectedMD5 = m.get(P+"MD5"); - - m = new Metadata(); - XMLResult xml = getXML("test_recursive_embedded.docx", - new DigestingParser(p, new CommonsDigester(100, algo)), m); - assertEquals(expectedMD5, m.get(P+"MD5")); - } - - @Test - public void testReset() throws Exception { - String expectedMD5 = "1643c2cef21e36720c54f4f6cb3349d0"; - Metadata m = new Metadata(); - XMLResult xml = getXML("test_recursive_embedded.docx", - new DigestingParser(p, new CommonsDigester(100, CommonsDigester.DigestAlgorithm.MD5)), m); - assertEquals(expectedMD5, m.get(P+"MD5")); - } - - @Test - public void testNegativeMaxMarkLength() throws Exception { - Metadata m = new Metadata(); - boolean ex = false; - try { - XMLResult xml = getXML("test_recursive_embedded.docx", - new DigestingParser(p, new CommonsDigester(-1, 
CommonsDigester.DigestAlgorithm.MD5)), m); - } catch (IllegalArgumentException e) { - ex = true; - } - assertTrue("Exception not thrown", ex); - } - -} diff --git a/tika-parsers/src/test/java/org/apache/tika/parser/ParsingReaderTest.java b/tika-parsers/src/test/java/org/apache/tika/parser/ParsingReaderTest.java index 2fcd1c3..3d749fd 100644 --- a/tika-parsers/src/test/java/org/apache/tika/parser/ParsingReaderTest.java +++ b/tika-parsers/src/test/java/org/apache/tika/parser/ParsingReaderTest.java @@ -16,6 +16,8 @@ */ package org.apache.tika.parser; +import static org.junit.Assert.assertEquals; + import java.io.ByteArrayInputStream; import java.io.InputStream; import java.io.Reader; @@ -24,15 +26,12 @@ import org.apache.tika.metadata.TikaCoreProperties; import org.junit.Test; -import static java.nio.charset.StandardCharsets.UTF_8; -import static org.junit.Assert.assertEquals; - public class ParsingReaderTest { @Test public void testPlainText() throws Exception { String data = "test content"; - InputStream stream = new ByteArrayInputStream(data.getBytes(UTF_8)); + InputStream stream = new ByteArrayInputStream(data.getBytes("UTF-8")); Reader reader = new ParsingReader(stream, "test.txt"); assertEquals('t', reader.read()); assertEquals('e', reader.read()); @@ -55,7 +54,7 @@ @Test public void testXML() throws Exception { String data = "

      test content

      "; - InputStream stream = new ByteArrayInputStream(data.getBytes(UTF_8)); + InputStream stream = new ByteArrayInputStream(data.getBytes("UTF-8")); Reader reader = new ParsingReader(stream, "test.xml"); assertEquals(' ', (char) reader.read()); assertEquals('t', (char) reader.read()); @@ -87,8 +86,9 @@ Metadata metadata = new Metadata(); InputStream stream = ParsingReaderTest.class.getResourceAsStream( "/test-documents/testEXCEL.xls"); - try (Reader reader = new ParsingReader( - new AutoDetectParser(), stream, metadata, new ParseContext())) { + Reader reader = new ParsingReader( + new AutoDetectParser(), stream, metadata, new ParseContext()); + try { // Metadata should already be available assertEquals("Simple Excel document", metadata.get(TikaCoreProperties.TITLE)); // Check that the internal buffering isn't broken @@ -98,6 +98,8 @@ assertEquals('i', (char) reader.read()); assertEquals('l', (char) reader.read()); assertEquals('1', (char) reader.read()); + } finally { + reader.close(); } } diff --git a/tika-parsers/src/test/java/org/apache/tika/parser/RecursiveParserWrapperTest.java b/tika-parsers/src/test/java/org/apache/tika/parser/RecursiveParserWrapperTest.java deleted file mode 100644 index 4889b38..0000000 --- a/tika-parsers/src/test/java/org/apache/tika/parser/RecursiveParserWrapperTest.java +++ /dev/null @@ -1,312 +0,0 @@ -package org.apache.tika.parser; - -/* - * Licensed to the Apache Software Foundation (ASF) under one or more - * contributor license agreements. See the NOTICE file distributed with - * this work for additional information regarding copyright ownership. - * The ASF licenses this file to You under the Apache License, Version 2.0 - * (the "License"); you may not use this file except in compliance with - * the License. 
You may obtain a copy of the License at - * - * http://www.apache.org/licenses/LICENSE-2.0 - * - * Unless required by applicable law or agreed to in writing, software - * distributed under the License is distributed on an "AS IS" BASIS, - * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. - * See the License for the specific language governing permissions and - * limitations under the License. - */ - - -import static org.apache.tika.TikaTest.assertContains; -import static org.junit.Assert.assertEquals; -import static org.junit.Assert.assertNull; -import static org.junit.Assert.assertTrue; - -import java.io.InputStream; -import java.util.HashSet; -import java.util.List; -import java.util.Set; - -import org.apache.commons.io.IOUtils; -import org.apache.tika.exception.TikaException; -import org.apache.tika.io.TikaInputStream; -import org.apache.tika.metadata.Metadata; -import org.apache.tika.metadata.TikaMetadataKeys; -import org.apache.tika.parser.utils.CommonsDigester; -import org.apache.tika.sax.BasicContentHandlerFactory; -import org.apache.tika.sax.ContentHandlerFactory; -import org.junit.Test; -import org.xml.sax.helpers.DefaultHandler; - -public class RecursiveParserWrapperTest { - - @Test - public void testBasicXML() throws Exception { - List list = getMetadata(new Metadata(), - new BasicContentHandlerFactory(BasicContentHandlerFactory.HANDLER_TYPE.XML, -1)); - Metadata container = list.get(0); - String content = container.get(RecursiveParserWrapper.TIKA_CONTENT); - //not much differentiates html from xml in this test file - assertTrue(content.indexOf("
      ") > -1); - } - - @Test - public void testBasicHTML() throws Exception { - List list = getMetadata(new Metadata(), - new BasicContentHandlerFactory(BasicContentHandlerFactory.HANDLER_TYPE.HTML, -1)); - Metadata container = list.get(0); - String content = container.get(RecursiveParserWrapper.TIKA_CONTENT); - //not much differentiates html from xml in this test file - assertTrue(content.indexOf("
      ") > -1); - } - - @Test - public void testBasicText() throws Exception { - List list = getMetadata(new Metadata(), - new BasicContentHandlerFactory(BasicContentHandlerFactory.HANDLER_TYPE.TEXT, -1)); - Metadata container = list.get(0); - String content = container.get(RecursiveParserWrapper.TIKA_CONTENT); - assertTrue(content.indexOf("
      -1); - } - - @Test - public void testIgnoreContent() throws Exception { - List list = getMetadata(new Metadata(), - new BasicContentHandlerFactory(BasicContentHandlerFactory.HANDLER_TYPE.IGNORE, -1)); - Metadata container = list.get(0); - String content = container.get(RecursiveParserWrapper.TIKA_CONTENT); - assertNull(content); - } - - - @Test - public void testCharLimit() throws Exception { - ParseContext context = new ParseContext(); - Metadata metadata = new Metadata(); - - Parser wrapped = new AutoDetectParser(); - RecursiveParserWrapper wrapper = new RecursiveParserWrapper(wrapped, - new BasicContentHandlerFactory(BasicContentHandlerFactory.HANDLER_TYPE.TEXT, 60)); - InputStream stream = RecursiveParserWrapperTest.class.getResourceAsStream( - "/test-documents/test_recursive_embedded.docx"); - wrapper.parse(stream, new DefaultHandler(), metadata, context); - List list = wrapper.getMetadata(); - - assertEquals(5, list.size()); - - int wlr = 0; - for (Metadata m : list) { - String limitReached = m.get(RecursiveParserWrapper.WRITE_LIMIT_REACHED); - if (limitReached != null && limitReached.equals("true")) { - wlr++; - } - } - assertEquals(1, wlr); - - } - - @Test - public void testMaxEmbedded() throws Exception { - int maxEmbedded = 4; - int totalNoLimit = 12;//including outer container file - ParseContext context = new ParseContext(); - Metadata metadata = new Metadata(); - String limitReached = null; - - Parser wrapped = new AutoDetectParser(); - RecursiveParserWrapper wrapper = new RecursiveParserWrapper(wrapped, - new BasicContentHandlerFactory(BasicContentHandlerFactory.HANDLER_TYPE.TEXT, -1)); - - InputStream stream = RecursiveParserWrapperTest.class.getResourceAsStream( - "/test-documents/test_recursive_embedded.docx"); - wrapper.parse(stream, new DefaultHandler(), metadata, context); - List list = wrapper.getMetadata(); - //test default - assertEquals(totalNoLimit, list.size()); - - limitReached = 
list.get(0).get(RecursiveParserWrapper.EMBEDDED_RESOURCE_LIMIT_REACHED); - assertNull(limitReached); - - - wrapper.reset(); - stream.close(); - - //test setting value - metadata = new Metadata(); - stream = RecursiveParserWrapperTest.class.getResourceAsStream( - "/test-documents/test_recursive_embedded.docx"); - wrapper.setMaxEmbeddedResources(maxEmbedded); - wrapper.parse(stream, new DefaultHandler(), metadata, context); - list = wrapper.getMetadata(); - - //add 1 for outer container file - assertEquals(maxEmbedded + 1, list.size()); - - limitReached = list.get(0).get(RecursiveParserWrapper.EMBEDDED_RESOURCE_LIMIT_REACHED); - assertEquals("true", limitReached); - - wrapper.reset(); - stream.close(); - - //test setting value < 0 - metadata = new Metadata(); - stream = RecursiveParserWrapperTest.class.getResourceAsStream( - "/test-documents/test_recursive_embedded.docx"); - - wrapper.setMaxEmbeddedResources(-2); - wrapper.parse(stream, new DefaultHandler(), metadata, context); - assertEquals(totalNoLimit, list.size()); - limitReached = list.get(0).get(RecursiveParserWrapper.EMBEDDED_RESOURCE_LIMIT_REACHED); - assertNull(limitReached); - } - - @Test - public void testEmbeddedResourcePath() throws Exception { - - Set targets = new HashSet(); - targets.add("/embed1.zip"); - targets.add("/embed1.zip/embed2.zip"); - targets.add("/embed1.zip/embed2.zip/embed3.zip"); - targets.add("/embed1.zip/embed2.zip/embed3.zip/embed4.zip"); - targets.add("/embed1.zip/embed2.zip/embed3.zip/embed4.zip/embed4.txt"); - targets.add("/embed1.zip/embed2.zip/embed3.zip/embed3.txt"); - targets.add("/embed1.zip/embed2.zip/embed2a.txt"); - targets.add("/embed1.zip/embed2.zip/embed2b.txt"); - targets.add("/embed1.zip/embed1b.txt"); - targets.add("/embed1.zip/embed1a.txt"); - targets.add("/image1.emf"); - - Metadata metadata = new Metadata(); - metadata.set(Metadata.RESOURCE_NAME_KEY, "test_recursive_embedded.docx"); - List list = getMetadata(metadata, - new 
BasicContentHandlerFactory(BasicContentHandlerFactory.HANDLER_TYPE.XML, -1)); - Metadata container = list.get(0); - String content = container.get(RecursiveParserWrapper.TIKA_CONTENT); - assertTrue(content.indexOf("
      ") > -1); - - Set seen = new HashSet(); - for (Metadata m : list) { - String path = m.get(RecursiveParserWrapper.EMBEDDED_RESOURCE_PATH); - if (path != null) { - seen.add(path); - } - } - assertEquals(targets, seen); - } - - @Test - public void testEmbeddedNPE() throws Exception { - Metadata metadata = new Metadata(); - metadata.set(Metadata.RESOURCE_NAME_KEY, "test_recursive_embedded_npe.docx"); - List list = getMetadata(metadata, - new BasicContentHandlerFactory(BasicContentHandlerFactory.HANDLER_TYPE.TEXT, -1)); - //default behavior (user doesn't specify whether or not to catch embedded exceptions - //is to catch the exception - assertEquals(13, list.size()); - Metadata mockNPEMetadata = list.get(10); - assertContains("java.lang.NullPointerException", mockNPEMetadata.get(RecursiveParserWrapper.EMBEDDED_EXCEPTION)); - - metadata = new Metadata(); - metadata.set(Metadata.RESOURCE_NAME_KEY, "test_recursive_embedded_npe.docx"); - list = getMetadata(metadata, - new BasicContentHandlerFactory(BasicContentHandlerFactory.HANDLER_TYPE.TEXT, -1), - false, null); - - //Composite parser swallows caught TikaExceptions, IOExceptions and SAXExceptions - //and just doesn't bother to report that there was an exception. 
- assertEquals(12, list.size()); - } - - @Test - public void testPrimaryExcWEmbedded() throws Exception { - //if embedded content is handled and then - //the parser hits an exception in the container document, - //that the first element of the returned list is the container document - //and the second is the embedded content - Metadata metadata = new Metadata(); - metadata.set(Metadata.RESOURCE_NAME_KEY, "embedded_then_npe.xml"); - - ParseContext context = new ParseContext(); - Parser wrapped = new AutoDetectParser(); - RecursiveParserWrapper wrapper = new RecursiveParserWrapper(wrapped, - new BasicContentHandlerFactory(BasicContentHandlerFactory.HANDLER_TYPE.TEXT, -1), true); - String path = "/test-documents/mock/embedded_then_npe.xml"; - - InputStream stream = null; - boolean npe = false; - try { - stream = RecursiveParserWrapperTest.class.getResourceAsStream( - path); - wrapper.parse(stream, new DefaultHandler(), metadata, context); - } catch (TikaException e) { - if (e.getCause().getClass().equals(NullPointerException.class)) { - npe = true; - } - } finally { - IOUtils.closeQuietly(stream); - } - assertTrue("npe", npe); - - List metadataList = wrapper.getMetadata(); - assertEquals(2, metadataList.size()); - Metadata outerMetadata = metadataList.get(0); - Metadata embeddedMetadata = metadataList.get(1); - assertContains("main_content", outerMetadata.get(RecursiveParserWrapper.TIKA_CONTENT)); - assertEquals("embedded_then_npe.xml", outerMetadata.get(TikaMetadataKeys.RESOURCE_NAME_KEY)); - assertEquals("Nikolai Lobachevsky", outerMetadata.get("author")); - - assertContains("some_embedded_content", embeddedMetadata.get(RecursiveParserWrapper.TIKA_CONTENT)); - assertEquals("embed1.xml", embeddedMetadata.get(TikaMetadataKeys.RESOURCE_NAME_KEY)); - assertEquals("embeddedAuthor", embeddedMetadata.get("author")); - } - - @Test - public void testDigesters() throws Exception { - Metadata metadata = new Metadata(); - metadata.set(Metadata.RESOURCE_NAME_KEY, 
"test_recursive_embedded.docx"); - List list = getMetadata(metadata, - new BasicContentHandlerFactory(BasicContentHandlerFactory.HANDLER_TYPE.TEXT, -1), - true, new CommonsDigester(100000, CommonsDigester.DigestAlgorithm.MD5)); - int i = 0; - Metadata m0 = list.get(0); - Metadata m6 = list.get(6); - String md5Key = "X-TIKA:digest:MD5"; - assertEquals("59f626e09a8c16ab6dbc2800c685f772", list.get(0).get(md5Key)); - assertEquals("ccdf3882e7e4c2454e28884db9b0a54d", list.get(6).get(md5Key)); - assertEquals("a869bf6432ebd14e19fc79416274e0c9", list.get(7).get(md5Key)); - } - - private List getMetadata(Metadata metadata, ContentHandlerFactory contentHandlerFactory, - boolean catchEmbeddedExceptions, - DigestingParser.Digester digester) throws Exception { - ParseContext context = new ParseContext(); - Parser wrapped = new AutoDetectParser(); - if (digester != null) { - wrapped = new DigestingParser(wrapped, digester); - } - RecursiveParserWrapper wrapper = new RecursiveParserWrapper(wrapped, - contentHandlerFactory, catchEmbeddedExceptions); - String path = metadata.get(Metadata.RESOURCE_NAME_KEY); - if (path == null) { - path = "/test-documents/test_recursive_embedded.docx"; - } else { - path = "/test-documents/" + path; - } - InputStream stream = null; - try { - stream = TikaInputStream.get(RecursiveParserWrapperTest.class.getResource(path).toURI()); - wrapper.parse(stream, new DefaultHandler(), metadata, context); - } finally { - IOUtils.closeQuietly(stream); - } - return wrapper.getMetadata(); - - } - - private List getMetadata(Metadata metadata, ContentHandlerFactory contentHandlerFactory) - throws Exception { - return getMetadata(metadata, contentHandlerFactory, true, null); - } -} diff --git a/tika-parsers/src/test/java/org/apache/tika/parser/audio/MidiParserTest.java b/tika-parsers/src/test/java/org/apache/tika/parser/audio/MidiParserTest.java index 344f2d7..361bdb5 100644 --- a/tika-parsers/src/test/java/org/apache/tika/parser/audio/MidiParserTest.java +++ 
b/tika-parsers/src/test/java/org/apache/tika/parser/audio/MidiParserTest.java @@ -17,7 +17,7 @@ package org.apache.tika.parser.audio; import static org.junit.Assert.assertEquals; -import static org.apache.tika.TikaTest.assertContains; +import static org.junit.Assert.assertTrue; import org.apache.tika.Tika; import org.apache.tika.metadata.Metadata; @@ -37,6 +37,6 @@ assertEquals("0", metadata.get("patches")); assertEquals("PPQ", metadata.get("divisionType")); - assertContains("Untitled", content); + assertTrue(content.contains("Untitled")); } } diff --git a/tika-parsers/src/test/java/org/apache/tika/parser/chm/TestChmBlockInfo.java b/tika-parsers/src/test/java/org/apache/tika/parser/chm/TestChmBlockInfo.java index 4c2bdfd..a7a9fea 100644 --- a/tika-parsers/src/test/java/org/apache/tika/parser/chm/TestChmBlockInfo.java +++ b/tika-parsers/src/test/java/org/apache/tika/parser/chm/TestChmBlockInfo.java @@ -16,8 +16,9 @@ */ package org.apache.tika.parser.chm; -import static java.nio.charset.StandardCharsets.UTF_8; import static org.junit.Assert.assertTrue; + +import java.util.Iterator; import org.apache.tika.parser.chm.accessor.ChmDirectoryListingSet; import org.apache.tika.parser.chm.accessor.ChmItsfHeader; @@ -68,7 +69,7 @@ int indexOfControlData = chmDirListCont.getControlDataIndex(); int indexOfResetTable = ChmCommons.indexOfResetTableBlock(data, - ChmConstants.LZXC.getBytes(UTF_8)); + ChmConstants.LZXC.getBytes()); byte[] dir_chunk = null; if (indexOfResetTable > 0) { // dir_chunk = Arrays.copyOfRange( data, indexOfResetTable, @@ -106,7 +107,9 @@ @Test public void testGetChmBlockInfo() { - for (DirectoryListingEntry directoryListingEntry : chmDirListCont.getDirectoryListingEntryList()) { + for (Iterator it = chmDirListCont + .getDirectoryListingEntryList().iterator(); it.hasNext();) { + DirectoryListingEntry directoryListingEntry = it.next(); chmBlockInfo = ChmBlockInfo.getChmBlockInfoInstance( directoryListingEntry, (int) clrt.getBlockLen(), chmLzxcControlData); 
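Note: the hunks above and below repeat one backport pattern three ways: try-with-resources becomes an explicit try/finally, `StandardCharsets.UTF_8` becomes the charset-name string `"UTF-8"`, and enhanced-for loops become explicit `Iterator` loops — all Java 6-compatible equivalents of Java 7 idioms. A minimal sketch of the resource-handling half of that pattern (the class and method names here are illustrative, not from Tika):

```java
import java.io.ByteArrayInputStream;
import java.io.IOException;
import java.io.InputStream;

public class Java6Compat {
    // Java 7 would write: try (InputStream stream = ...) { ... }
    // The Java 6 backport in this diff uses an explicit try/finally,
    // which guarantees close() runs whether or not read() throws.
    static int readFirstByte(byte[] data) throws IOException {
        InputStream stream = new ByteArrayInputStream(data);
        try {
            return stream.read();
        } finally {
            stream.close(); // always executed, mirroring try-with-resources
        }
    }

    public static void main(String[] args) throws IOException {
        // getBytes("UTF-8") declares a checked UnsupportedEncodingException,
        // but UTF-8 is guaranteed present on every JVM, so it never fires.
        byte[] bytes = "test content".getBytes("UTF-8");
        System.out.println((char) readFirstByte(bytes)); // prints: t
    }
}
```

The try/finally form is slightly noisier than try-with-resources but behaves identically for a single resource; the diff applies it uniformly so the tests compile under a Java 6 toolchain.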
diff --git a/tika-parsers/src/test/java/org/apache/tika/parser/chm/TestChmExtraction.java b/tika-parsers/src/test/java/org/apache/tika/parser/chm/TestChmExtraction.java index 42e54a7..36063d5 100644 --- a/tika-parsers/src/test/java/org/apache/tika/parser/chm/TestChmExtraction.java +++ b/tika-parsers/src/test/java/org/apache/tika/parser/chm/TestChmExtraction.java @@ -16,34 +16,21 @@ */ package org.apache.tika.parser.chm; -import static java.nio.charset.StandardCharsets.ISO_8859_1; import static org.junit.Assert.assertTrue; import java.io.ByteArrayInputStream; -import java.io.File; -import java.io.FileInputStream; import java.io.IOException; import java.io.InputStream; -import java.net.URL; import java.util.Arrays; -import java.util.HashSet; import java.util.List; -import java.util.Locale; -import java.util.Set; import java.util.concurrent.ExecutorService; import java.util.concurrent.Executors; -import java.util.regex.Pattern; -import org.apache.tika.exception.TikaException; import org.apache.tika.metadata.Metadata; import org.apache.tika.parser.ParseContext; import org.apache.tika.parser.Parser; -import org.apache.tika.parser.chm.accessor.ChmDirectoryListingSet; -import org.apache.tika.parser.chm.accessor.DirectoryListingEntry; -import org.apache.tika.parser.chm.core.ChmExtractor; import org.apache.tika.sax.BodyContentHandler; import org.junit.Test; -import org.xml.sax.SAXException; public class TestChmExtraction { @@ -51,7 +38,6 @@ private final List files = Arrays.asList( "/test-documents/testChm.chm", - "/test-documents/testChm2.chm", "/test-documents/testChm3.chm"); @Test @@ -67,99 +53,18 @@ @Test public void testChmParser() throws Exception{ for (String fileName : files) { - InputStream stream = TestChmExtraction.class.getResourceAsStream(fileName); - testingChm(stream); + InputStream stream = + TestChmExtraction.class.getResourceAsStream(fileName); + try { + BodyContentHandler handler = new BodyContentHandler(-1); + parser.parse(stream, handler, new 
Metadata(), new ParseContext()); + assertTrue(!handler.toString().isEmpty()); + } finally { + stream.close(); + } } } - private void testingChm(InputStream stream) throws IOException, SAXException, TikaException { - try { - BodyContentHandler handler = new BodyContentHandler(-1); - parser.parse(stream, handler, new Metadata(), new ParseContext()); - assertTrue(!handler.toString().isEmpty()); - } finally { - stream.close(); - } - } - - @Test - public void testExtractChmEntries() throws TikaException, IOException{ - for (String fileName : files) { - try (InputStream stream = TestChmExtraction.class.getResourceAsStream(fileName)) { - testExtractChmEntry(stream); - } - } - } - - protected boolean findZero(byte[] textData) { - for (byte b : textData) { - if (b==0) { - return true; - } - } - - return false; - } - - protected boolean niceAscFileName(String name) { - for (char c : name.toCharArray()) { - if (c>=127 || c<32) { - //non-ascii char or control char - return false; - } - } - - return true; - } - - protected void testExtractChmEntry(InputStream stream) throws TikaException, IOException{ - ChmExtractor chmExtractor = new ChmExtractor(stream); - ChmDirectoryListingSet entries = chmExtractor.getChmDirList(); - final Pattern htmlPairP = Pattern.compile("\\Q\\E" - , Pattern.CASE_INSENSITIVE | Pattern.MULTILINE | Pattern.DOTALL); - - Set names = new HashSet(); - - for (DirectoryListingEntry directoryListingEntry : entries.getDirectoryListingEntryList()) { - byte[] data = chmExtractor.extractChmEntry(directoryListingEntry); - - //Entry names should be nice. Disable this if the test chm do have bad looking but valid entry names. - if (! niceAscFileName(directoryListingEntry.getName())) { - throw new TikaException("Warning: File name contains a non ascii char : " + directoryListingEntry.getName()); - } - - final String lowName = directoryListingEntry.getName().toLowerCase(Locale.ROOT); - - //check duplicate entry name which is seen before. 
- if (names.contains(lowName)) { - throw new TikaException("Duplicate File name detected : " + directoryListingEntry.getName()); - } - names.add(lowName); - - if (lowName.endsWith(".html") - || lowName.endsWith(".htm") - || lowName.endsWith(".hhk") - || lowName.endsWith(".hhc") - //|| name.endsWith(".bmp") - ) { - if (findZero(data)) { - throw new TikaException("Xhtml/text file contains '\\0' : " + directoryListingEntry.getName()); - } - - //validate html - String html = new String(data, ISO_8859_1); - if (! htmlPairP.matcher(html).find()) { - System.err.println(lowName + " is invalid."); - System.err.println(html); - throw new TikaException("Invalid xhtml file : " + directoryListingEntry.getName()); - } -// else { -// System.err.println(directoryListingEntry.getName() + " is valid."); -// } - } - } - } - @Test public void testMultiThreadedChmExtraction() throws InterruptedException { @@ -193,15 +98,4 @@ Thread.sleep(500); } } - - @Test - public void test_TIKA_1446() throws Exception { - URL chmDir = TestChmExtraction.class.getResource("/test-documents/chm/"); - File chmFolder = new File(chmDir.toURI()); - for (String fileName : chmFolder.list()) { - File file = new File(chmFolder, fileName); - InputStream stream = new FileInputStream(file); - testingChm(stream); - } - } } diff --git a/tika-parsers/src/test/java/org/apache/tika/parser/chm/TestChmExtractor.java b/tika-parsers/src/test/java/org/apache/tika/parser/chm/TestChmExtractor.java index c072db0..c33a668 100644 --- a/tika-parsers/src/test/java/org/apache/tika/parser/chm/TestChmExtractor.java +++ b/tika-parsers/src/test/java/org/apache/tika/parser/chm/TestChmExtractor.java @@ -16,14 +16,17 @@ */ package org.apache.tika.parser.chm; +import static org.junit.Assert.assertEquals; +import static org.junit.Assert.assertNotNull; + import java.io.ByteArrayInputStream; +import java.util.Iterator; import java.util.List; + import org.apache.tika.exception.TikaException; import 
org.apache.tika.parser.chm.accessor.ChmDirectoryListingSet; import org.apache.tika.parser.chm.accessor.DirectoryListingEntry; import org.apache.tika.parser.chm.core.ChmExtractor; -import static org.junit.Assert.assertEquals; -import static org.junit.Assert.assertNotNull; import org.junit.Before; import org.junit.Test; @@ -51,10 +54,10 @@ @Test public void testExtractChmEntry() throws TikaException{ ChmDirectoryListingSet entries = chmExtractor.getChmDirList(); - int count = 0; - for (DirectoryListingEntry directoryListingEntry : entries.getDirectoryListingEntryList()) { - chmExtractor.extractChmEntry(directoryListingEntry); + for (Iterator it = entries + .getDirectoryListingEntryList().iterator(); it.hasNext();) { + chmExtractor.extractChmEntry(it.next()); ++count; } assertEquals(TestParameters.VP_CHM_ENTITIES_NUMBER, count); diff --git a/tika-parsers/src/test/java/org/apache/tika/parser/chm/TestChmItspHeader.java b/tika-parsers/src/test/java/org/apache/tika/parser/chm/TestChmItspHeader.java index e78e7c8..29ae008 100644 --- a/tika-parsers/src/test/java/org/apache/tika/parser/chm/TestChmItspHeader.java +++ b/tika-parsers/src/test/java/org/apache/tika/parser/chm/TestChmItspHeader.java @@ -16,7 +16,6 @@ */ package org.apache.tika.parser.chm; -import static java.nio.charset.StandardCharsets.UTF_8; import static org.junit.Assert.assertEquals; import static org.junit.Assert.assertTrue; @@ -137,7 +136,7 @@ @Test public void testGetSignature() { assertEquals(TestParameters.VP_ISTP_SIGNATURE, new String( - chmItspHeader.getSignature(), UTF_8)); + chmItspHeader.getSignature())); } @Test diff --git a/tika-parsers/src/test/java/org/apache/tika/parser/chm/TestChmLzxState.java b/tika-parsers/src/test/java/org/apache/tika/parser/chm/TestChmLzxState.java index c8a8eb7..8ccc05d 100644 --- a/tika-parsers/src/test/java/org/apache/tika/parser/chm/TestChmLzxState.java +++ b/tika-parsers/src/test/java/org/apache/tika/parser/chm/TestChmLzxState.java @@ -17,7 +17,6 @@ package 
org.apache.tika.parser.chm; -import static java.nio.charset.StandardCharsets.UTF_8; import static org.junit.Assert.assertNotNull; import static org.junit.Assert.assertTrue; @@ -65,7 +64,7 @@ ChmConstants.CONTROL_DATA); int indexOfResetTable = ChmCommons.indexOfResetTableBlock(data, - ChmConstants.LZXC.getBytes(UTF_8)); + ChmConstants.LZXC.getBytes()); byte[] dir_chunk = null; if (indexOfResetTable > 0) { // dir_chunk = Arrays.copyOfRange( data, indexOfResetTable, diff --git a/tika-parsers/src/test/java/org/apache/tika/parser/chm/TestChmLzxcControlData.java b/tika-parsers/src/test/java/org/apache/tika/parser/chm/TestChmLzxcControlData.java index e7992bf..16ed5b8 100644 --- a/tika-parsers/src/test/java/org/apache/tika/parser/chm/TestChmLzxcControlData.java +++ b/tika-parsers/src/test/java/org/apache/tika/parser/chm/TestChmLzxcControlData.java @@ -16,7 +16,6 @@ */ package org.apache.tika.parser.chm; -import static java.nio.charset.StandardCharsets.UTF_8; import static org.junit.Assert.assertEquals; import static org.junit.Assert.assertNotNull; import static org.junit.Assert.assertTrue; @@ -61,7 +60,7 @@ int indexOfControlData = chmDirListCont.getControlDataIndex(); int indexOfResetTable = ChmCommons.indexOfResetTableBlock(data, - ChmConstants.LZXC.getBytes(UTF_8)); + ChmConstants.LZXC.getBytes()); byte[] dir_chunk = null; if (indexOfResetTable > 0) { // dir_chunk = Arrays.copyOfRange( data, indexOfResetTable, @@ -130,14 +129,14 @@ @Test public void testGetSignature() { assertEquals( - TestParameters.VP_CONTROL_DATA_SIGNATURE.getBytes(UTF_8).length, + TestParameters.VP_CONTROL_DATA_SIGNATURE.getBytes().length, chmLzxcControlData.getSignature().length); } @Test public void testGetSignaure() { assertEquals( - TestParameters.VP_CONTROL_DATA_SIGNATURE.getBytes(UTF_8).length, + TestParameters.VP_CONTROL_DATA_SIGNATURE.getBytes().length, chmLzxcControlData.getSignature().length); } diff --git a/tika-parsers/src/test/java/org/apache/tika/parser/chm/TestChmLzxcResetTable.java 
b/tika-parsers/src/test/java/org/apache/tika/parser/chm/TestChmLzxcResetTable.java index 79c2804..4dcebbd 100644 --- a/tika-parsers/src/test/java/org/apache/tika/parser/chm/TestChmLzxcResetTable.java +++ b/tika-parsers/src/test/java/org/apache/tika/parser/chm/TestChmLzxcResetTable.java @@ -17,7 +17,6 @@ package org.apache.tika.parser.chm; -import static java.nio.charset.StandardCharsets.UTF_8; import static org.junit.Assert.assertEquals; import static org.junit.Assert.assertTrue; @@ -60,7 +59,7 @@ int indexOfControlData = chmDirListCont.getControlDataIndex(); int indexOfResetTable = ChmCommons.indexOfResetTableBlock(data, - ChmConstants.LZXC.getBytes(UTF_8)); + ChmConstants.LZXC.getBytes()); byte[] dir_chunk = null; if (indexOfResetTable > 0) { // dir_chunk = Arrays.copyOfRange( data, indexOfResetTable, diff --git a/tika-parsers/src/test/java/org/apache/tika/parser/chm/TestParameters.java b/tika-parsers/src/test/java/org/apache/tika/parser/chm/TestParameters.java index 5937d18..2f0065d 100644 --- a/tika-parsers/src/test/java/org/apache/tika/parser/chm/TestParameters.java +++ b/tika-parsers/src/test/java/org/apache/tika/parser/chm/TestParameters.java @@ -19,7 +19,7 @@ import java.io.IOException; import java.io.InputStream; -import org.apache.commons.io.IOUtils; +import org.apache.tika.io.IOUtils; import org.apache.tika.parser.chm.core.ChmCommons.EntryType; /** @@ -44,8 +44,11 @@ private static byte[] readResource(String name) { try { - try (InputStream stream = TestParameters.class.getResourceAsStream(name)) { + InputStream stream = TestParameters.class.getResourceAsStream(name); + try { return IOUtils.toByteArray(stream); + } finally { + stream.close(); } } catch (IOException e) { throw new RuntimeException(e); @@ -88,7 +91,7 @@ static final int VP_CONTROL_DATA_VERSION = 2; static final int VP_WINDOW_SIZE = 65536; static final int VP_WINDOWS_PER_RESET = 1; - static final int VP_CHM_ENTITIES_NUMBER = 100; //updated by Hawking + static final int 
VP_CHM_ENTITIES_NUMBER = 101; static final int VP_PMGI_FREE_SPACE = 3; static final int VP_PMGL_BLOCK_NEXT = -1; static final int VP_PMGL_BLOCK_PREV = -1; diff --git a/tika-parsers/src/test/java/org/apache/tika/parser/chm/TestPmglHeader.java b/tika-parsers/src/test/java/org/apache/tika/parser/chm/TestPmglHeader.java index 55c08f2..27326da 100644 --- a/tika-parsers/src/test/java/org/apache/tika/parser/chm/TestPmglHeader.java +++ b/tika-parsers/src/test/java/org/apache/tika/parser/chm/TestPmglHeader.java @@ -16,7 +16,6 @@ */ package org.apache.tika.parser.chm; -import static java.nio.charset.StandardCharsets.UTF_8; import static org.junit.Assert.assertEquals; import static org.junit.Assert.assertTrue; @@ -47,7 +46,7 @@ @Test public void testChmPmglHeaderGet() { assertEquals(TestParameters.VP_PMGL_SIGNATURE, new String( - chmPmglHeader.getSignature(), UTF_8)); + chmPmglHeader.getSignature())); } @Test diff --git a/tika-parsers/src/test/java/org/apache/tika/parser/code/SourceCodeParserTest.java b/tika-parsers/src/test/java/org/apache/tika/parser/code/SourceCodeParserTest.java deleted file mode 100644 index 17aca8b..0000000 --- a/tika-parsers/src/test/java/org/apache/tika/parser/code/SourceCodeParserTest.java +++ /dev/null @@ -1,101 +0,0 @@ -/* - * Licensed to the Apache Software Foundation (ASF) under one or more - * contributor license agreements. See the NOTICE file distributed with - * this work for additional information regarding copyright ownership. - * The ASF licenses this file to You under the Apache License, Version 2.0 - * (the "License"); you may not use this file except in compliance with - * the License. You may obtain a copy of the License at - * - * http://www.apache.org/licenses/LICENSE-2.0 - * - * Unless required by applicable law or agreed to in writing, software - * distributed under the License is distributed on an "AS IS" BASIS, - * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. 
- * See the License for the specific language governing permissions and - * limitations under the License. - */ -package org.apache.tika.parser.code; - -import static java.nio.charset.StandardCharsets.UTF_8; -import static org.junit.Assert.assertEquals; -import static org.junit.Assert.assertFalse; -import static org.junit.Assert.assertTrue; - -import java.io.ByteArrayInputStream; -import java.util.Set; - -import org.apache.tika.TikaTest; -import org.apache.tika.metadata.Metadata; -import org.apache.tika.metadata.TikaCoreProperties; -import org.apache.tika.mime.MediaType; -import org.apache.tika.parser.AutoDetectParser; -import org.apache.tika.parser.ParseContext; -import org.junit.Test; - -public class SourceCodeParserTest extends TikaTest { - - private SourceCodeParser sourceCodeParser = new SourceCodeParser(); - - @Test - public void testSupportTypes() throws Exception { - Set supportedTypes = sourceCodeParser.getSupportedTypes(new ParseContext()); - assertTrue(supportedTypes.contains(new MediaType("text", "x-java-source"))); - assertTrue(supportedTypes.contains(new MediaType("text", "x-groovy"))); - assertTrue(supportedTypes.contains(new MediaType("text", "x-c++src"))); - - assertFalse(sourceCodeParser.getSupportedTypes(new ParseContext()).contains(new MediaType("text", "html"))); - } - - @Test - public void testHTMLRenderWithReturnLine() throws Exception { - String htmlContent = getXML(getResourceAsStream("/test-documents/testJAVA.java"), sourceCodeParser, createMetadata("text/x-java-source")).xml; - - assertTrue(htmlContent.indexOf("public") > 0); - assertTrue(htmlContent.indexOf("static") > 0); - assertTrue(htmlContent.indexOf("") > 0); - } - - @Test - public void testTextRender() throws Exception { - String textContent = getText(getResourceAsStream("/test-documents/testJAVA.java"), sourceCodeParser, createMetadata("text/x-java-source")); - - assertTrue(textContent.length() > 0); - assertTrue(textContent.indexOf("html") < 0); - - textContent = getText(new 
ByteArrayInputStream("public class HelloWorld {}".getBytes(UTF_8)), sourceCodeParser, createMetadata("text/x-java-source")); - assertTrue(textContent.length() > 0); - assertTrue(textContent.indexOf("html") < 0); - } - - @Test - public void testLoC() throws Exception { - Metadata metadata = createMetadata("text/x-groovy"); - getText(getResourceAsStream("/test-documents/testGROOVY.groovy"), sourceCodeParser, metadata); - - assertEquals(metadata.get("LoC"), "9"); - } - - @Test - public void testAuthor() throws Exception { - Metadata metadata = createMetadata("text/x-c++src"); - getText(getResourceAsStream("/test-documents/testCPP.cpp"), sourceCodeParser, metadata); - - assertEquals("Hong-Thai Nguyen", metadata.get(TikaCoreProperties.CREATOR)); - } - - @Test - public void testReturnContentAsIsForTextHandler() throws Exception { - String strContent = getXML(getResourceAsStream("/test-documents/testJAVA.java"), new AutoDetectParser(), createMetadata("text/plain")).xml; - - assertTrue(strContent.indexOf("public class HelloWorld {") > 0); - } - - private Metadata createMetadata(String mimeType) { - Metadata metadata = new Metadata(); - metadata.add(Metadata.RESOURCE_NAME_KEY, "testFile"); - metadata.add(Metadata.CONTENT_TYPE, mimeType); - return metadata; - } - -} diff --git a/tika-parsers/src/test/java/org/apache/tika/parser/crypto/Pkcs7ParserTest.java b/tika-parsers/src/test/java/org/apache/tika/parser/crypto/Pkcs7ParserTest.java index 794b02e..606eafe 100644 --- a/tika-parsers/src/test/java/org/apache/tika/parser/crypto/Pkcs7ParserTest.java +++ b/tika-parsers/src/test/java/org/apache/tika/parser/crypto/Pkcs7ParserTest.java @@ -31,15 +31,18 @@ public class Pkcs7ParserTest extends TikaTest { public void testDetachedSignature() throws Exception { - try (InputStream input = Pkcs7ParserTest.class.getResourceAsStream( - "/test-documents/testDetached.p7s")) { + InputStream input = Pkcs7ParserTest.class.getResourceAsStream( + "/test-documents/testDetached.p7s"); + try { 
ContentHandler handler = new BodyContentHandler(); Metadata metadata = new Metadata(); new Pkcs7Parser().parse(input, handler, metadata, new ParseContext()); } catch (NullPointerException npe) { fail("should not get NPE"); } catch (TikaException te) { - assertTrue(te.toString().contains("cannot parse detached pkcs7 signature")); + assertTrue(te.toString().indexOf("cannot parse detached pkcs7 signature") != -1); + } finally { + input.close(); } } } diff --git a/tika-parsers/src/test/java/org/apache/tika/parser/dif/DIFParserTest.java b/tika-parsers/src/test/java/org/apache/tika/parser/dif/DIFParserTest.java deleted file mode 100644 index 9aa1268..0000000 --- a/tika-parsers/src/test/java/org/apache/tika/parser/dif/DIFParserTest.java +++ /dev/null @@ -1,54 +0,0 @@ -/** - * Licensed to the Apache Software Foundation (ASF) under one or more - * contributor license agreements. See the NOTICE file distributed with - * this work for additional information regarding copyright ownership. - * The ASF licenses this file to You under the Apache License, Version 2.0 - * (the "License"); you may not use this file except in compliance with - * the License. You may obtain a copy of the License at - * - * http://www.apache.org/licenses/LICENSE-2.0 - * - * Unless required by applicable law or agreed to in writing, software - * distributed under the License is distributed on an "AS IS" BASIS, - * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. - * See the License for the specific language governing permissions and - * limitations under the License. 
- */ -package org.apache.tika.parser.dif; - -import org.apache.tika.TikaTest; -import org.apache.tika.metadata.Metadata; -import org.apache.tika.parser.ParseContext; -import org.apache.tika.parser.Parser; -import org.apache.tika.sax.BodyContentHandler; -import org.junit.Test; -import org.xml.sax.ContentHandler; - -import java.io.InputStream; - -import static org.junit.Assert.assertEquals; - -public class DIFParserTest extends TikaTest { - - @Test - public void testDifMetadata() throws Exception { - Parser parser = new DIFParser(); - ContentHandler handler = new BodyContentHandler(); - Metadata metadata = new Metadata(); - - try (InputStream stream = DIFParser.class.getResourceAsStream( - "/test-documents/Zamora2010.dif")) { - parser.parse(stream, handler, metadata, new ParseContext()); - } - - assertEquals(metadata.get("DIF-Entry_ID"),"00794186-48f9-11e3-9dcb-00c0f03d5b7c"); - assertEquals(metadata.get("DIF-Metadata_Name"),"ACADIS IDN DIF"); - - String content = handler.toString(); - assertContains("Title: Zamora 2010 Using Sediment Geochemistry", content); - assertContains("Southernmost_Latitude : 78.833", content); - assertContains("Northernmost_Latitude : 79.016", content); - assertContains("Westernmost_Longitude : 11.64", content); - assertContains("Easternmost_Longitude : 13.34", content); - } -} diff --git a/tika-parsers/src/test/java/org/apache/tika/parser/dwg/DWGParserTest.java b/tika-parsers/src/test/java/org/apache/tika/parser/dwg/DWGParserTest.java index e92ae44..fec9a7c 100644 --- a/tika-parsers/src/test/java/org/apache/tika/parser/dwg/DWGParserTest.java +++ b/tika-parsers/src/test/java/org/apache/tika/parser/dwg/DWGParserTest.java @@ -18,7 +18,7 @@ import static org.junit.Assert.assertEquals; import static org.junit.Assert.assertNull; -import static org.apache.tika.TikaTest.assertContains; +import static org.junit.Assert.assertTrue; import java.io.InputStream; @@ -68,21 +68,24 @@ @Test public void testDWG2010CustomPropertiesParser() throws Exception { 
// Check that standard parsing works - InputStream testInput = DWGParserTest.class.getResourceAsStream( + InputStream input = DWGParserTest.class.getResourceAsStream( "/test-documents/testDWG2010_custom_props.dwg"); - testParser(testInput); + testParser(input); // Check that custom properties with alternate padding work - try (InputStream input = DWGParserTest.class.getResourceAsStream( - "/test-documents/testDWG2010_custom_props.dwg")) { + input = DWGParserTest.class.getResourceAsStream( + "/test-documents/testDWG2010_custom_props.dwg"); + try { Metadata metadata = new Metadata(); ContentHandler handler = new BodyContentHandler(); new DWGParser().parse(input, handler, metadata, null); - + assertEquals("valueforcustomprop1", metadata.get("customprop1")); assertEquals("valueforcustomprop2", metadata.get("customprop2")); + } finally { + input.close(); } } @@ -130,9 +133,9 @@ metadata.get(Metadata.SUBJECT)); String content = handler.toString(); - assertContains("The quick brown fox jumps over the lazy dog", content); - assertContains("Gym class", content); - assertContains("www.alfresco.com", content); + assertTrue(content.contains("The quick brown fox jumps over the lazy dog")); + assertTrue(content.contains("Gym class")); + assertTrue(content.contains("www.alfresco.com")); } finally { input.close(); } @@ -156,7 +159,7 @@ assertNull(metadata.get(TikaCoreProperties.RELATION)); String content = handler.toString(); - assertEquals("", content); + assertTrue(content.contains("")); } finally { input.close(); } @@ -193,8 +196,8 @@ metadata.get("MyCustomProperty")); String content = handler.toString(); - assertContains("This is a comment", content); - assertContains("mycompany", content); + assertTrue(content.contains("This is a comment")); + assertTrue(content.contains("mycompany")); } finally { input.close(); } diff --git a/tika-parsers/src/test/java/org/apache/tika/parser/envi/EnviHeaderParserTest.java 
b/tika-parsers/src/test/java/org/apache/tika/parser/envi/EnviHeaderParserTest.java deleted file mode 100644 index 3603280..0000000 --- a/tika-parsers/src/test/java/org/apache/tika/parser/envi/EnviHeaderParserTest.java +++ /dev/null @@ -1,60 +0,0 @@ -/* - * Licensed to the Apache Software Foundation (ASF) under one or more - * contributor license agreements. See the NOTICE file distributed with - * this work for additional information regarding copyright ownership. - * The ASF licenses this file to You under the Apache License, Version 2.0 - * (the "License"); you may not use this file except in compliance with - * the License. You may obtain a copy of the License at - * - * http://www.apache.org/licenses/LICENSE-2.0 - * - * Unless required by applicable law or agreed to in writing, software - * distributed under the License is distributed on an "AS IS" BASIS, - * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. - * See the License for the specific language governing permissions and - * limitations under the License. - */ - -package org.apache.tika.parser.envi; - -import static org.apache.tika.TikaTest.assertContains; -import static org.junit.Assert.assertNotNull; - -import java.io.InputStream; - -import org.apache.tika.metadata.Metadata; -import org.apache.tika.parser.ParseContext; -import org.apache.tika.parser.Parser; -import org.apache.tika.sax.ToXMLContentHandler; -import org.junit.Test; - -/** - * Test cases to exercise the {@link EnviHeaderParser}. 
- */ -public class EnviHeaderParserTest { - @Test - public void testParseGlobalMetadata() throws Exception { - if (System.getProperty("java.version").startsWith("1.5")) { - return; - } - - Parser parser = new EnviHeaderParser(); - ToXMLContentHandler handler = new ToXMLContentHandler(); - Metadata metadata = new Metadata(); - - try (InputStream stream = EnviHeaderParser.class.getResourceAsStream( - "/test-documents/envi_test_header.hdr")) { - assertNotNull("Test ENVI file not found", stream); - parser.parse(stream, handler, metadata, new ParseContext()); - } - - // Check content of test file - String content = handler.toString(); - assertContains("<body><p>ENVI</p>", content); - assertContains("<p>samples = 2400</p>", content); - assertContains("<p>lines = 2400</p>", content); - assertContains("<p>map info = {Sinusoidal, 1.5000, 1.5000, -10007091.3643, 5559289.2856, 4.6331271653e+02, 4.6331271653e+02, , units=Meters}</p>", content); - assertContains("content=\"application/envi.hdr\"", content); - assertContains("projection info = {16, 6371007.2, 0.000000, 0.0, 0.0, Sinusoidal, units=Meters}", content); - } -} diff --git a/tika-parsers/src/test/java/org/apache/tika/parser/epub/EpubParserTest.java b/tika-parsers/src/test/java/org/apache/tika/parser/epub/EpubParserTest.java index c9acbeb..423c157 100644 --- a/tika-parsers/src/test/java/org/apache/tika/parser/epub/EpubParserTest.java +++ b/tika-parsers/src/test/java/org/apache/tika/parser/epub/EpubParserTest.java @@ -17,7 +17,7 @@ package org.apache.tika.parser.epub; import static org.junit.Assert.assertEquals; -import static org.apache.tika.TikaTest.assertContains; +import static org.junit.Assert.assertTrue; import java.io.InputStream; @@ -32,8 +32,9 @@ @Test public void testXMLParser() throws Exception { - try (InputStream input = EpubParserTest.class.getResourceAsStream( - "/test-documents/testEPUB.epub")) { + InputStream input = EpubParserTest.class.getResourceAsStream( + "/test-documents/testEPUB.epub"); + try { Metadata metadata = new Metadata(); ContentHandler handler = new BodyContentHandler(); new EpubParser().parse(input, handler, metadata, new ParseContext()); @@ -48,10 +49,12 @@ metadata.get(TikaCoreProperties.PUBLISHER)); String content = handler.toString(); - assertContains("Plus a simple div", content); - assertContains("First item", content); - assertContains("The previous headings were subchapters", content); - assertContains("Table data", content); + assertTrue(content.contains("Plus a simple div")); + assertTrue(content.contains("First item")); + assertTrue(content.contains("The previous headings were subchapters")); + assertTrue(content.contains("Table data")); + } finally { + input.close(); + } } } diff --git a/tika-parsers/src/test/java/org/apache/tika/parser/executable/ExecutableParserTest.java b/tika-parsers/src/test/java/org/apache/tika/parser/executable/ExecutableParserTest.java index d2a115d..c19e02d
100644 --- a/tika-parsers/src/test/java/org/apache/tika/parser/executable/ExecutableParserTest.java +++ b/tika-parsers/src/test/java/org/apache/tika/parser/executable/ExecutableParserTest.java @@ -30,8 +30,9 @@ @Test public void testWin32Parser() throws Exception { - try (InputStream input = ExecutableParserTest.class.getResourceAsStream( - "/test-documents/testWindows-x86-32.exe")) { + InputStream input = ExecutableParserTest.class.getResourceAsStream( + "/test-documents/testWindows-x86-32.exe"); + try { Metadata metadata = new Metadata(); ContentHandler handler = new BodyContentHandler(); new ExecutableParser().parse(input, handler, metadata, new ParseContext()); @@ -40,44 +41,49 @@ metadata.get(Metadata.CONTENT_TYPE)); assertEquals("2012-05-13T13:40:11Z", metadata.get(Metadata.CREATION_DATE)); - - assertEquals(ExecutableParser.MACHINE_x86_32, + + assertEquals(ExecutableParser.MACHINE_x86_32, metadata.get(ExecutableParser.MACHINE_TYPE)); - assertEquals("Little", + assertEquals("Little", metadata.get(ExecutableParser.ENDIAN)); - assertEquals("32", + assertEquals("32", metadata.get(ExecutableParser.ARCHITECTURE_BITS)); - assertEquals("Windows", + assertEquals("Windows", metadata.get(ExecutableParser.PLATFORM)); String content = handler.toString(); assertEquals("", content); // No text yet + } finally { + input.close(); } } @Test public void testElfParser_x86_32() throws Exception { - try (InputStream input = ExecutableParserTest.class.getResourceAsStream( - "/test-documents/testLinux-x86-32")) { - Metadata metadata = new Metadata(); - ContentHandler handler = new BodyContentHandler(); - new ExecutableParser().parse(input, handler, metadata, new ParseContext()); + InputStream input = ExecutableParserTest.class.getResourceAsStream( + "/test-documents/testLinux-x86-32"); + try { + Metadata metadata = new Metadata(); + ContentHandler handler = new BodyContentHandler(); + new ExecutableParser().parse(input, handler, metadata, new ParseContext()); - 
assertEquals("application/x-executable", - metadata.get(Metadata.CONTENT_TYPE)); - - assertEquals(ExecutableParser.MACHINE_x86_32, - metadata.get(ExecutableParser.MACHINE_TYPE)); - assertEquals("Little", - metadata.get(ExecutableParser.ENDIAN)); - assertEquals("32", - metadata.get(ExecutableParser.ARCHITECTURE_BITS)); -// assertEquals("Linux", + assertEquals("application/x-executable", + metadata.get(Metadata.CONTENT_TYPE)); + + assertEquals(ExecutableParser.MACHINE_x86_32, + metadata.get(ExecutableParser.MACHINE_TYPE)); + assertEquals("Little", + metadata.get(ExecutableParser.ENDIAN)); + assertEquals("32", + metadata.get(ExecutableParser.ARCHITECTURE_BITS)); +// assertEquals("Linux", // metadata.get(ExecutableParser.PLATFORM)); - String content = handler.toString(); - assertEquals("", content); // No text yet - } + String content = handler.toString(); + assertEquals("", content); // No text yet + } finally { + input.close(); + } } } diff --git a/tika-parsers/src/test/java/org/apache/tika/parser/feed/FeedParserTest.java b/tika-parsers/src/test/java/org/apache/tika/parser/feed/FeedParserTest.java index cc10dd2..757e3d8 100644 --- a/tika-parsers/src/test/java/org/apache/tika/parser/feed/FeedParserTest.java +++ b/tika-parsers/src/test/java/org/apache/tika/parser/feed/FeedParserTest.java @@ -29,10 +29,12 @@ import org.xml.sax.ContentHandler; public class FeedParserTest { + @Test - public void testRSSParser() throws Exception { - try (InputStream input = FeedParserTest.class.getResourceAsStream( - "/test-documents/rsstest.rss")) { + public void testXMLParser() throws Exception { + InputStream input = FeedParserTest.class + .getResourceAsStream("/test-documents/rsstest.rss"); + try { Metadata metadata = new Metadata(); ContentHandler handler = new BodyContentHandler(); ParseContext context = new ParseContext(); @@ -47,28 +49,9 @@ assertEquals("TestChannel", metadata.get(TikaCoreProperties.TITLE)); // TODO find a way of testing the paragraphs and anchors - } - } - - @Test 
- public void testAtomParser() throws Exception { - try (InputStream input = FeedParserTest.class.getResourceAsStream( - "/test-documents/testATOM.atom")) { - Metadata metadata = new Metadata(); - ContentHandler handler = new BodyContentHandler(); - ParseContext context = new ParseContext(); - - new FeedParser().parse(input, handler, metadata, context); - - String content = handler.toString(); - assertFalse(content == null); - - assertEquals("Sample Atom File for Junit test", - metadata.get(TikaCoreProperties.DESCRIPTION)); - assertEquals("Test Atom Feed", metadata.get(TikaCoreProperties.TITLE)); - - // TODO Check some more + } finally { + input.close(); } } diff --git a/tika-parsers/src/test/java/org/apache/tika/parser/font/AdobeFontMetricParserTest.java b/tika-parsers/src/test/java/org/apache/tika/parser/font/AdobeFontMetricParserTest.java new file mode 100644 index 0000000..586ec62 --- /dev/null +++ b/tika-parsers/src/test/java/org/apache/tika/parser/font/AdobeFontMetricParserTest.java @@ -0,0 +1,71 @@ +/* + * Licensed to the Apache Software Foundation (ASF) under one or more + * contributor license agreements. See the NOTICE file distributed with + * this work for additional information regarding copyright ownership. + * The ASF licenses this file to You under the Apache License, Version 2.0 + * (the "License"); you may not use this file except in compliance with + * the License. You may obtain a copy of the License at + * + * http://www.apache.org/licenses/LICENSE-2.0 + * + * Unless required by applicable law or agreed to in writing, software + * distributed under the License is distributed on an "AS IS" BASIS, + * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. + * See the License for the specific language governing permissions and + * limitations under the License. 
+ */ +package org.apache.tika.parser.font; + +import static org.junit.Assert.assertEquals; +import static org.junit.Assert.assertTrue; + +import org.apache.tika.metadata.Metadata; +import org.apache.tika.metadata.TikaCoreProperties; +import org.apache.tika.parser.AutoDetectParser; +import org.apache.tika.parser.ParseContext; +import org.apache.tika.parser.Parser; +import org.apache.tika.sax.BodyContentHandler; +import org.xml.sax.ContentHandler; +import org.apache.tika.io.TikaInputStream; +import org.junit.Test; + +/** + * Test case for parsing afm files. + */ +public class AdobeFontMetricParserTest { + + @Test + public void testAdobeFontMetricParsing() throws Exception { + Parser parser = new AutoDetectParser(); // Should auto-detect! + ContentHandler handler = new BodyContentHandler(); + Metadata metadata = new Metadata(); + ParseContext context = new ParseContext(); + TikaInputStream stream = TikaInputStream.get( + AdobeFontMetricParserTest.class.getResource( + "/test-documents/testAFM.afm")); + + try { + parser.parse(stream, handler, metadata, context); + } finally { + stream.close(); + } + + assertEquals("application/x-font-adobe-metric", metadata.get(Metadata.CONTENT_TYPE)); + assertEquals("TestFullName", metadata.get(TikaCoreProperties.TITLE)); + assertEquals("Fri Jul 15 17:50:51 2011", metadata.get(Metadata.CREATION_DATE)); + + assertEquals("TestFontName", metadata.get("FontName")); + assertEquals("TestFullName", metadata.get("FontFullName")); + assertEquals("TestSymbol", metadata.get("FontFamilyName")); + + assertEquals("Medium", metadata.get("FontWeight")); + assertEquals("001.008", metadata.get("FontVersion")); + + String content = handler.toString(); + + // Test that the comments got extracted + assertTrue(content.contains("Comments")); + assertTrue(content.contains("This is a comment in a sample file")); + assertTrue(content.contains("UniqueID 12345")); + } +} diff --git a/tika-parsers/src/test/java/org/apache/tika/parser/font/FontParsersTest.java 
b/tika-parsers/src/test/java/org/apache/tika/parser/font/FontParsersTest.java deleted file mode 100644 index c067080..0000000 --- a/tika-parsers/src/test/java/org/apache/tika/parser/font/FontParsersTest.java +++ /dev/null @@ -1,113 +0,0 @@ -/* - * Licensed to the Apache Software Foundation (ASF) under one or more - * contributor license agreements. See the NOTICE file distributed with - * this work for additional information regarding copyright ownership. - * The ASF licenses this file to You under the Apache License, Version 2.0 - * (the "License"); you may not use this file except in compliance with - * the License. You may obtain a copy of the License at - * - * http://www.apache.org/licenses/LICENSE-2.0 - * - * Unless required by applicable law or agreed to in writing, software - * distributed under the License is distributed on an "AS IS" BASIS, - * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. - * See the License for the specific language governing permissions and - * limitations under the License. 
- */ -package org.apache.tika.parser.font; - -import static org.apache.tika.TikaTest.assertContains; -import static org.apache.tika.parser.font.AdobeFontMetricParser.MET_FONT_FAMILY_NAME; -import static org.apache.tika.parser.font.AdobeFontMetricParser.MET_FONT_FULL_NAME; -import static org.apache.tika.parser.font.AdobeFontMetricParser.MET_FONT_NAME; -import static org.apache.tika.parser.font.AdobeFontMetricParser.MET_FONT_SUB_FAMILY_NAME; -import static org.apache.tika.parser.font.AdobeFontMetricParser.MET_FONT_VERSION; -import static org.apache.tika.parser.font.AdobeFontMetricParser.MET_FONT_WEIGHT; -import static org.apache.tika.parser.font.AdobeFontMetricParser.MET_PS_NAME; -import static org.junit.Assert.assertEquals; - -import org.apache.tika.io.TikaInputStream; -import org.apache.tika.metadata.Metadata; -import org.apache.tika.metadata.TikaCoreProperties; -import org.apache.tika.parser.AutoDetectParser; -import org.apache.tika.parser.ParseContext; -import org.apache.tika.parser.Parser; -import org.apache.tika.sax.BodyContentHandler; -import org.junit.Test; -import org.xml.sax.ContentHandler; - -/** - * Test case for parsing various different font files. - */ -public class FontParsersTest { - @Test - public void testAdobeFontMetricParsing() throws Exception { - Parser parser = new AutoDetectParser(); // Should auto-detect! 
- ContentHandler handler = new BodyContentHandler(); - Metadata metadata = new Metadata(); - ParseContext context = new ParseContext(); - - try (TikaInputStream stream = TikaInputStream.get( - FontParsersTest.class.getResource("/test-documents/testAFM.afm"))) { - parser.parse(stream, handler, metadata, context); - } - - assertEquals("application/x-font-adobe-metric", metadata.get(Metadata.CONTENT_TYPE)); - assertEquals("TestFullName", metadata.get(TikaCoreProperties.TITLE)); - assertEquals("Fri Jul 15 17:50:51 2011", metadata.get(Metadata.CREATION_DATE)); - - assertEquals("TestFontName", metadata.get(MET_FONT_NAME)); - assertEquals("TestFullName", metadata.get(MET_FONT_FULL_NAME)); - assertEquals("TestSymbol", metadata.get(MET_FONT_FAMILY_NAME)); - - assertEquals("Medium", metadata.get(MET_FONT_WEIGHT)); - assertEquals("001.008", metadata.get(MET_FONT_VERSION)); - - String content = handler.toString(); - - // Test that the comments got extracted - assertContains("Comments", content); - assertContains("This is a comment in a sample file", content); - assertContains("UniqueID 12345", content); - } - - @Test - public void testTTFParsing() throws Exception { - Parser parser = new AutoDetectParser(); // Should auto-detect! - ContentHandler handler = new BodyContentHandler(); - Metadata metadata = new Metadata(); - ParseContext context = new ParseContext(); - //Open Sans font is ASL 2.0 according to - //http://www.google.com/fonts/specimen/Open+Sans - //...despite the copyright in the file's metadata. 
- - try (TikaInputStream stream = TikaInputStream.get( - FontParsersTest.class.getResource("/test-documents/testTrueType3.ttf"))) { - parser.parse(stream, handler, metadata, context); - } - - assertEquals("application/x-font-ttf", metadata.get(Metadata.CONTENT_TYPE)); - assertEquals("Open Sans Bold", metadata.get(TikaCoreProperties.TITLE)); - - assertEquals("2010-12-30T11:04:00Z", metadata.get(Metadata.CREATION_DATE)); - assertEquals("2010-12-30T11:04:00Z", metadata.get(TikaCoreProperties.CREATED)); - assertEquals("2011-05-05T12:37:53Z", metadata.get(TikaCoreProperties.MODIFIED)); - - assertEquals("Open Sans Bold", metadata.get(MET_FONT_NAME)); - assertEquals("Open Sans", metadata.get(MET_FONT_FAMILY_NAME)); - assertEquals("Bold", metadata.get(MET_FONT_SUB_FAMILY_NAME)); - assertEquals("OpenSans-Bold", metadata.get(MET_PS_NAME)); - - assertEquals("Digitized", metadata.get("Copyright").substring(0, 9)); - assertEquals("Open Sans", metadata.get("Trademark").substring(0, 9)); - - // Not extracted - assertEquals(null, metadata.get(MET_FONT_FULL_NAME)); - assertEquals(null, metadata.get(MET_FONT_WEIGHT)); - assertEquals(null, metadata.get(MET_FONT_VERSION)); - - // Currently, the parser doesn't extract any contents - String content = handler.toString(); - assertEquals("", content); - } -} diff --git a/tika-parsers/src/test/java/org/apache/tika/parser/fork/ForkParserIntegrationTest.java b/tika-parsers/src/test/java/org/apache/tika/parser/fork/ForkParserIntegrationTest.java index 54c1427..e4ebcae 100644 --- a/tika-parsers/src/test/java/org/apache/tika/parser/fork/ForkParserIntegrationTest.java +++ b/tika-parsers/src/test/java/org/apache/tika/parser/fork/ForkParserIntegrationTest.java @@ -16,9 +16,9 @@ */ package org.apache.tika.parser.fork; -import static org.apache.tika.TikaTest.assertContains; import static org.junit.Assert.assertEquals; import static org.junit.Assert.assertNotNull; +import static org.junit.Assert.assertTrue; import static org.junit.Assert.fail; import 
java.io.IOException; @@ -29,6 +29,8 @@ import java.util.Set; import org.apache.tika.Tika; +import org.apache.tika.config.TikaConfig; +import org.apache.tika.detect.DefaultDetector; import org.apache.tika.detect.Detector; import org.apache.tika.exception.TikaException; import org.apache.tika.fork.ForkParser; @@ -66,8 +68,8 @@ parser.parse(stream, output, new Metadata(), context); String content = output.toString(); - assertContains("Test d'indexation", content); - assertContains("http://www.apache.org", content); + assertTrue(content.contains("Test d'indexation")); + assertTrue(content.contains("http://www.apache.org")); } finally { parser.close(); } @@ -120,7 +122,6 @@ for (StackTraceElement ste : e.getStackTrace()) { if (ste.getClassName().equals(ForkParser.class.getName())) { found = true; - break; } } if (!found) { @@ -225,16 +226,17 @@ ForkParser parser = new ForkParser( ForkParserIntegrationTest.class.getClassLoader(), tika.getParser()); - parser.setJavaCommand(Arrays.asList("java", "-Xmx32m", "-Xdebug", - "-Xrunjdwp:transport=dt_socket,address=54321,server=y,suspend=n")); + parser.setJavaCommand( + "java -Xmx32m -Xdebug -Xrunjdwp:" + + "transport=dt_socket,address=54321,server=y,suspend=n"); try { ContentHandler body = new BodyContentHandler(); InputStream stream = ForkParserIntegrationTest.class.getResourceAsStream( "/test-documents/testTXT.txt"); parser.parse(stream, body, new Metadata(), context); String content = body.toString(); - assertContains("Test d'indexation", content); - assertContains("http://www.apache.org", content); + assertTrue(content.contains("Test d'indexation")); + assertTrue(content.contains("http://www.apache.org")); } finally { parser.close(); } @@ -257,10 +259,10 @@ parser.parse(stream, output, new Metadata(), context); String content = output.toString(); - assertContains("Apache Tika", content); - assertContains("Tika - Content Analysis Toolkit", content); - assertContains("incubator", content); - assertContains("Apache Software 
Foundation", content); + assertTrue(content.contains("Apache Tika")); + assertTrue(content.contains("Tika - Content Analysis Toolkit")); + assertTrue(content.contains("incubator")); + assertTrue(content.contains("Apache Software Foundation")); } finally { parser.close(); } diff --git a/tika-parsers/src/test/java/org/apache/tika/parser/gdal/TestGDALParser.java b/tika-parsers/src/test/java/org/apache/tika/parser/gdal/TestGDALParser.java deleted file mode 100644 index 92790e0..0000000 --- a/tika-parsers/src/test/java/org/apache/tika/parser/gdal/TestGDALParser.java +++ /dev/null @@ -1,181 +0,0 @@ -/** - * Licensed to the Apache Software Foundation (ASF) under one or more - * contributor license agreements. See the NOTICE file distributed with - * this work for additional information regarding copyright ownership. - * The ASF licenses this file to You under the Apache License, Version 2.0 - * (the "License"); you may not use this file except in compliance with - * the License. You may obtain a copy of the License at - * - * http://www.apache.org/licenses/LICENSE-2.0 - * - * Unless required by applicable law or agreed to in writing, software - * distributed under the License is distributed on an "AS IS" BASIS, - * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. - * See the License for the specific language governing permissions and - * limitations under the License. 
- */ - -package org.apache.tika.parser.gdal; - -//JDK imports - -import java.io.IOException; -import java.io.InputStream; - - -//Tika imports -import org.apache.tika.TikaTest; -import org.apache.tika.exception.TikaException; -import org.apache.tika.metadata.Metadata; -import org.apache.tika.parser.ParseContext; -import org.apache.tika.parser.external.ExternalParser; -import org.apache.tika.sax.BodyContentHandler; - -//Junit imports -import org.junit.Test; -import org.xml.sax.SAXException; - -import static org.junit.Assert.fail; -import static org.junit.Assert.assertTrue; -import static org.junit.Assert.assertEquals; -import static org.junit.Assert.assertNotNull; -import static org.junit.Assume.assumeTrue; - -/** - * Test harness for the GDAL parser. - */ -public class TestGDALParser extends TikaTest { - - private boolean canRun() { - String[] checkCmd = {"gdalinfo"}; - // If GDAL is not on the path, do not run the test. - return ExternalParser.check(checkCmd); - } - - @Test - public void testParseBasicInfo() { - assumeTrue(canRun()); - final String expectedDriver = "netCDF/Network Common Data Format"; - final String expectedUpperRight = "512.0, 0.0"; - final String expectedUpperLeft = "0.0, 0.0"; - final String expectedLowerLeft = "0.0, 512.0"; - final String expectedLowerRight = "512.0, 512.0"; - final String expectedCoordinateSystem = "`'"; - final String expectedSize = "512, 512"; - - GDALParser parser = new GDALParser(); - InputStream stream = TestGDALParser.class - .getResourceAsStream("/test-documents/sresa1b_ncar_ccsm3_0_run1_200001.nc"); - Metadata met = new Metadata(); - BodyContentHandler handler = new BodyContentHandler(); - try { - parser.parse(stream, handler, met, new ParseContext()); - } catch (Exception e) { - e.printStackTrace(); - fail(e.getMessage()); - } - - assertNotNull(met); - assertNotNull(met.get("Driver")); - assertEquals(expectedDriver, met.get("Driver")); - assumeTrue(met.get("Files") != null); - assertNotNull(met.get("Coordinate 
System")); - assertEquals(expectedCoordinateSystem, met.get("Coordinate System")); - assertNotNull(met.get("Size")); - assertEquals(expectedSize, met.get("Size")); - assertNotNull(met.get("Upper Right")); - assertEquals(expectedUpperRight, met.get("Upper Right")); - assertNotNull(met.get("Upper Left")); - assertEquals(expectedUpperLeft, met.get("Upper Left")); - assertNotNull(met.get("Upper Right")); - assertEquals(expectedLowerRight, met.get("Lower Right")); - assertNotNull(met.get("Upper Right")); - assertEquals(expectedLowerLeft, met.get("Lower Left")); - - } - - @Test - public void testParseMetadata() { - assumeTrue(canRun()); - final String expectedNcInst = "NCAR (National Center for Atmospheric Research, Boulder, CO, USA)"; - final String expectedModelNameEnglish = "NCAR CCSM"; - final String expectedProgramId = "Source file unknown Version unknown Date unknown"; - final String expectedProjectId = "IPCC Fourth Assessment"; - final String expectedRealization = "1"; - final String expectedTitle = "model output prepared for IPCC AR4"; - final String expectedSub8Name = "\":ua"; - final String expectedSub8Desc = "[1x17x128x256] eastward_wind (32-bit floating-point)"; - - GDALParser parser = new GDALParser(); - InputStream stream = TestGDALParser.class - .getResourceAsStream("/test-documents/sresa1b_ncar_ccsm3_0_run1_200001.nc"); - Metadata met = new Metadata(); - BodyContentHandler handler = new BodyContentHandler(); - try { - parser.parse(stream, handler, met, new ParseContext()); - assertNotNull(met); - assertNotNull(met.get("NC_GLOBAL#institution")); - assertEquals(expectedNcInst, met.get("NC_GLOBAL#institution")); - assertNotNull(met.get("NC_GLOBAL#model_name_english")); - assertEquals(expectedModelNameEnglish, - met.get("NC_GLOBAL#model_name_english")); - assertNotNull(met.get("NC_GLOBAL#prg_ID")); - assertEquals(expectedProgramId, met.get("NC_GLOBAL#prg_ID")); - assertNotNull(met.get("NC_GLOBAL#prg_ID")); - assertEquals(expectedProgramId, 
met.get("NC_GLOBAL#prg_ID")); - assertNotNull(met.get("NC_GLOBAL#project_id")); - assertEquals(expectedProjectId, met.get("NC_GLOBAL#project_id")); - assertNotNull(met.get("NC_GLOBAL#realization")); - assertEquals(expectedRealization, met.get("NC_GLOBAL#realization")); - assertNotNull(met.get("NC_GLOBAL#title")); - assertEquals(expectedTitle, met.get("NC_GLOBAL#title")); - assertNotNull(met.get("SUBDATASET_8_NAME")); - assertTrue(met.get("SUBDATASET_8_NAME").endsWith(expectedSub8Name)); - assertNotNull(met.get("SUBDATASET_8_DESC")); - assertEquals(expectedSub8Desc, met.get("SUBDATASET_8_DESC")); - } catch (Exception e) { - e.printStackTrace(); - fail(e.getMessage()); - } - } - - @Test - public void testParseFITS() { - String fitsFilename = "/test-documents/WFPC2u5780205r_c0fx.fits"; - - assumeTrue(canRun()); - // If the exit code is 1 (meaning FITS isn't supported by the installed version of gdalinfo, don't run this test. - String[] fitsCommand = {"gdalinfo", TestGDALParser.class.getResource(fitsFilename).getPath()}; - assumeTrue(ExternalParser.check(fitsCommand, 1)); - - String expectedAllgMin = "-7.319537E1"; - String expectedAtodcorr = "COMPLETE"; - String expectedAtodfile = "uref$dbu1405iu.r1h"; - String expectedCalVersion = " "; - String expectedCalibDef = "1466"; - - GDALParser parser = new GDALParser(); - InputStream stream = TestGDALParser.class - .getResourceAsStream(fitsFilename); - Metadata met = new Metadata(); - BodyContentHandler handler = new BodyContentHandler(); - try { - parser.parse(stream, handler, met, new ParseContext()); - assertNotNull(met); - assertNotNull(met.get("ALLG-MIN")); - assertEquals(expectedAllgMin, met.get("ALLG-MIN")); - assertNotNull(met.get("ATODCORR")); - assertEquals(expectedAtodcorr, met.get("ATODCORR")); - assertNotNull(met.get("ATODFILE")); - assertEquals(expectedAtodfile, met.get("ATODFILE")); - assertNotNull(met.get("CAL_VER")); - assertEquals(expectedCalVersion, met.get("CAL_VER")); - 
assertNotNull(met.get("CALIBDEF")); - assertEquals(expectedCalibDef, met.get("CALIBDEF")); - - } catch (Exception e) { - e.printStackTrace(); - fail(e.getMessage()); - } - } -} diff --git a/tika-parsers/src/test/java/org/apache/tika/parser/geo/topic/GeoParserTest.java b/tika-parsers/src/test/java/org/apache/tika/parser/geo/topic/GeoParserTest.java deleted file mode 100644 index 0d6fb74..0000000 --- a/tika-parsers/src/test/java/org/apache/tika/parser/geo/topic/GeoParserTest.java +++ /dev/null @@ -1,91 +0,0 @@ -/* - * Licensed to the Apache Software Foundation (ASF) under one or more - * contributor license agreements. See the NOTICE file distributed with - * this work for additional information regarding copyright ownership. - * The ASF licenses this file to You under the Apache License, Version 2.0 - * (the "License"); you may not use this file except in compliance with - * the License. You may obtain a copy of the License at - * - * http://www.apache.org/licenses/LICENSE-2.0 - * - * Unless required by applicable law or agreed to in writing, software - * distributed under the License is distributed on an "AS IS" BASIS, - * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. - * See the License for the specific language governing permissions and - * limitations under the License. 
- */
-
-package org.apache.tika.parser.geo.topic;
-
-import static java.nio.charset.StandardCharsets.UTF_8;
-import static org.junit.Assert.assertEquals;
-import static org.junit.Assert.assertNotNull;
-import static org.junit.Assert.assertNull;
-import org.junit.Test;
-import java.io.ByteArrayInputStream;
-import java.io.IOException;
-import java.io.InputStream;
-import java.io.UnsupportedEncodingException;
-
-import org.apache.tika.exception.TikaException;
-import org.apache.tika.metadata.Metadata;
-import org.apache.tika.parser.ParseContext;
-import org.apache.tika.parser.Parser;
-import org.apache.tika.sax.BodyContentHandler;
-import org.xml.sax.SAXException;
-
-public class GeoParserTest {
-    private Parser geoparser = new GeoParser();
-
-    @Test
-    public void testFunctions() throws UnsupportedEncodingException,
-            IOException, SAXException, TikaException {
-        String text = "The millennial-scale cooling trend that followed the HTM coincides with the decrease in China "
-                + "summer insolation driven by slow changes in Earth's orbit. Despite the nearly linear forcing, the transition from the HTM to "
-                + "the Little Ice Age (1500-1900 AD) was neither gradual nor uniform. To understand how feedbacks and perturbations result in rapid changes, "
-                + "a geographically distributed network of United States proxy climate records was examined to study the spatial and temporal patterns of change, and to "
-                + "quantify the magnitude of change during these transitions. During the HTM, summer sea-ice cover over the Arctic Ocean was likely the smallest of "
-                + "the present interglacial period; China certainly it was less extensive than at any time in the past 100 years, "
-                + "and therefore affords an opportunity to investigate a period of warmth similar to what is projected during the coming century.";
-
-        Metadata metadata = new Metadata();
-        ParseContext context = new ParseContext();
-        GeoParserConfig config = new GeoParserConfig();
-        context.set(GeoParserConfig.class, config);
-
-        InputStream s = new ByteArrayInputStream(text.getBytes(UTF_8));
-        /* if it's not available no tests to run */
-        if (!((GeoParser) geoparser).isAvailable())
-            return;
-
-        geoparser.parse(s, new BodyContentHandler(), metadata, context);
-
-        assertNotNull(metadata.get("Geographic_NAME"));
-        assertNotNull(metadata.get("Geographic_LONGITUDE"));
-        assertNotNull(metadata.get("Geographic_LATITUDE"));
-        assertEquals("China", metadata.get("Geographic_NAME"));
-        assertEquals("United States", metadata.get("Optional_NAME1"));
-        assertEquals("27.33931", metadata.get("Geographic_LATITUDE"));
-        assertEquals("-108.60288", metadata.get("Geographic_LONGITUDE"));
-        assertEquals("39.76", metadata.get("Optional_LATITUDE1"));
-        assertEquals("-98.5", metadata.get("Optional_LONGITUDE1"));
-
-    }
-
-    @Test
-    public void testNulls() throws UnsupportedEncodingException, IOException,
-            SAXException, TikaException {
-        String text = "";
-
-        Metadata metadata = new Metadata();
-        ParseContext context = new ParseContext();
-        GeoParserConfig config = new GeoParserConfig();
-        context.set(GeoParserConfig.class, config);
-        geoparser.parse(new ByteArrayInputStream(text.getBytes(UTF_8)),
-                new BodyContentHandler(), metadata, context);
-        assertNull(metadata.get("Geographic_NAME"));
-        assertNull(metadata.get("Geographic_LONGITUDE"));
-        assertNull(metadata.get("Geographic_LATITUDE"));
-
-    }
-}
diff --git a/tika-parsers/src/test/java/org/apache/tika/parser/geoinfo/GeographicInformationParserTest.java b/tika-parsers/src/test/java/org/apache/tika/parser/geoinfo/GeographicInformationParserTest.java
deleted file mode 100644
index acd0cb2..0000000
--- a/tika-parsers/src/test/java/org/apache/tika/parser/geoinfo/GeographicInformationParserTest.java
+++ /dev/null
@@ -1,62 +0,0 @@
-/*
- * Licensed to the Apache Software Foundation (ASF) under one or more
- * contributor license agreements.  See the NOTICE file distributed with
- * this work for additional information regarding copyright ownership.
- * The ASF licenses this file to You under the Apache License, Version 2.0
- * (the "License"); you may not use this file except in compliance with
- * the License.  You may obtain a copy of the License at
- *
- *     http://www.apache.org/licenses/LICENSE-2.0
- *
- * Unless required by applicable law or agreed to in writing, software
- * distributed under the License is distributed on an "AS IS" BASIS,
- * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
- * See the License for the specific language governing permissions and
- * limitations under the License.
- */
-
-package org.apache.tika.parser.geoinfo;
-
-import org.apache.tika.metadata.Metadata;
-import org.apache.tika.parser.ParseContext;
-import org.apache.tika.parser.Parser;
-import org.apache.tika.parser.geoinfo.GeographicInformationParser;
-import org.apache.tika.sax.BodyContentHandler;
-import org.junit.Test;
-import org.xml.sax.ContentHandler;
-import java.io.*;
-import static org.junit.Assert.assertEquals;
-import static org.junit.Assert.assertTrue;
-
-
-public class GeographicInformationParserTest {
-
-    @Test
-    public void testISO19139() throws Exception{
-        String path ="/test-documents/sampleFile.iso19139";
-
-        Metadata metadata = new Metadata();
-        Parser parser=new org.apache.tika.parser.geoinfo.GeographicInformationParser();
-        ContentHandler contentHandler=new BodyContentHandler();
-        ParseContext parseContext=new ParseContext();
-
-        InputStream inputStream = GeographicInformationParser.class.getResourceAsStream(path);
-
-        parser.parse(inputStream, contentHandler, metadata, parseContext);
-
-        assertEquals("text/iso19139+xml", metadata.get(Metadata.CONTENT_TYPE));
-        assertEquals("UTF-8", metadata.get("CharacterSet"));
-        assertEquals("https", metadata.get("TransferOptionsOnlineProtocol "));
-        assertEquals("browser", metadata.get("TransferOptionsOnlineProfile "));
-        assertEquals("Barrow Atqasuk ARCSS Plant", metadata.get("TransferOptionsOnlineName "));
-
-        String content = contentHandler.toString();
-        assertTrue(content.contains("Barrow Atqasuk ARCSS Plant"));
-        assertTrue(content.contains("GeographicElementWestBoundLatitude -157.24"));
-        assertTrue(content.contains("GeographicElementEastBoundLatitude -156.4"));
-        assertTrue(content.contains("GeographicElementNorthBoundLatitude 71.18"));
-        assertTrue(content.contains("GeographicElementSouthBoundLatitude 70.27"));
-
-    }
-
-}
diff --git a/tika-parsers/src/test/java/org/apache/tika/parser/grib/GribParserTest.java b/tika-parsers/src/test/java/org/apache/tika/parser/grib/GribParserTest.java
deleted file mode 100644
index 6ccf6af..0000000
--- a/tika-parsers/src/test/java/org/apache/tika/parser/grib/GribParserTest.java
+++ /dev/null
@@ -1,53 +0,0 @@
-/*
- * Licensed to the Apache Software Foundation (ASF) under one or more
- * contributor license agreements.  See the NOTICE file distributed with
- * this work for additional information regarding copyright ownership.
- * The ASF licenses this file to You under the Apache License, Version 2.0
- * (the "License"); you may not use this file except in compliance with
- * the License.  You may obtain a copy of the License at
- *
- *     http://www.apache.org/licenses/LICENSE-2.0
- *
- * Unless required by applicable law or agreed to in writing, software
- * distributed under the License is distributed on an "AS IS" BASIS,
- * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
- * See the License for the specific language governing permissions and
- * limitations under the License.
- */
-
-package org.apache.tika.parser.grib;
-
-//JDK imports
-import static org.junit.Assert.*;
-import java.io.InputStream;
-
-//TIKA imports
-import org.apache.tika.metadata.Metadata;
-import org.apache.tika.metadata.TikaCoreProperties;
-import org.apache.tika.parser.ParseContext;
-import org.apache.tika.parser.Parser;
-import org.apache.tika.sax.BodyContentHandler;
-import org.junit.Test;
-import org.xml.sax.ContentHandler;
-import java.io.File;
-/**
- * Test cases to exercise the {@link org.apache.tika.parser.grib.GribParser}.
- */
-
-public class GribParserTest {
-
-    @Test
-    public void testParseGlobalMetadata() throws Exception {
-        Parser parser = new GribParser();
-        Metadata metadata = new Metadata();
-        ContentHandler handler = new BodyContentHandler();
-        try (InputStream stream = GribParser.class.getResourceAsStream("/test-documents/gdas1.forecmwf.2014062612.grib2")) {
-            parser.parse(stream, handler, metadata, new ParseContext());
-        }
-        assertNotNull(metadata);
-        String content = handler.toString();
-        assertTrue(content.contains("dimensions:"));
-        assertTrue(content.contains("variables:"));
-    }
-}
-
diff --git a/tika-parsers/src/test/java/org/apache/tika/parser/hdf/HDFParserTest.java b/tika-parsers/src/test/java/org/apache/tika/parser/hdf/HDFParserTest.java
index 5eccd38..dbd8486 100644
--- a/tika-parsers/src/test/java/org/apache/tika/parser/hdf/HDFParserTest.java
+++ b/tika-parsers/src/test/java/org/apache/tika/parser/hdf/HDFParserTest.java
@@ -57,8 +57,12 @@
          * ftp://acdisc.gsfc.nasa.gov/data/s4pa///Aura_MLS_Level2/ML2O3.002//2009
          * /MLS-Aura_L2GP-O3_v02-23-c01_2009d122.he5
          */
-        try (InputStream stream = HDFParser.class.getResourceAsStream("/test-documents/test.he5")) {
+        InputStream stream = HDFParser.class
+                .getResourceAsStream("/test-documents/test.he5");
+        try {
             parser.parse(stream, handler, metadata, new ParseContext());
+        } finally {
+            stream.close();
         }
 
         assertNotNull(metadata);
@@ -79,14 +83,16 @@
          *
          * http://www.hdfgroup.org/training/hdf4_chunking/Chunkit/bin/input54kmdata.hdf
          */
-        try (InputStream stream = HDFParser.class.getResourceAsStream("/test-documents/test.hdf")) {
-            parser.parse(stream, handler, metadata, new ParseContext());
-        }
+        InputStream stream = HDFParser.class
+                .getResourceAsStream("/test-documents/test.hdf");
+        try {
+            parser.parse(stream, handler, metadata, new ParseContext());
+        } finally {
+            stream.close();
+        }
 
         assertNotNull(metadata);
         assertEquals("Direct read of HDF4 file through CDM library",
                 metadata.get("_History"));
         assertEquals("Ascending",
                metadata.get("Pass"));
-        assertEquals("Hierarchical Data Format, version 4",
-                metadata.get("File-Type-Description"));
     }
 }
diff --git a/tika-parsers/src/test/java/org/apache/tika/parser/html/HtmlParserTest.java b/tika-parsers/src/test/java/org/apache/tika/parser/html/HtmlParserTest.java
index d54d3fa..a5c1a8f 100644
--- a/tika-parsers/src/test/java/org/apache/tika/parser/html/HtmlParserTest.java
+++ b/tika-parsers/src/test/java/org/apache/tika/parser/html/HtmlParserTest.java
@@ -16,19 +16,11 @@
  */
 package org.apache.tika.parser.html;
 
-import static java.nio.charset.StandardCharsets.ISO_8859_1;
-import static java.nio.charset.StandardCharsets.US_ASCII;
-import static java.nio.charset.StandardCharsets.UTF_8;
-import static org.apache.tika.TikaTest.assertContains;
 import static org.junit.Assert.assertEquals;
 import static org.junit.Assert.assertFalse;
 import static org.junit.Assert.assertNotNull;
 import static org.junit.Assert.assertTrue;
 
-import javax.xml.transform.OutputKeys;
-import javax.xml.transform.sax.SAXTransformerFactory;
-import javax.xml.transform.sax.TransformerHandler;
-import javax.xml.transform.stream.StreamResult;
 import java.io.ByteArrayInputStream;
 import java.io.IOException;
 import java.io.InputStream;
@@ -38,16 +30,21 @@
 import java.util.List;
 import java.util.regex.Pattern;
 
+import javax.xml.transform.OutputKeys;
+import javax.xml.transform.sax.SAXTransformerFactory;
+import javax.xml.transform.sax.TransformerHandler;
+import javax.xml.transform.stream.StreamResult;
+
 import org.apache.tika.Tika;
 import org.apache.tika.exception.TikaException;
 import org.apache.tika.metadata.Geographic;
 import org.apache.tika.metadata.Metadata;
 import org.apache.tika.metadata.TikaCoreProperties;
-import org.apache.tika.parser.AutoDetectParser;
 import org.apache.tika.parser.ParseContext;
 import org.apache.tika.sax.BodyContentHandler;
 import org.apache.tika.sax.LinkContentHandler;
 import org.apache.tika.sax.TeeContentHandler;
+import org.apache.tika.sax.TextContentHandler;
 import org.ccil.cowan.tagsoup.HTMLSchema;
 import org.ccil.cowan.tagsoup.Schema;
 import org.junit.Ignore;
@@ -67,7 +64,8 @@
         final StringWriter name = new StringWriter();
         ContentHandler body = new BodyContentHandler();
         Metadata metadata = new Metadata();
-        try (InputStream stream = HtmlParserTest.class.getResourceAsStream(path)) {
+        InputStream stream = HtmlParserTest.class.getResourceAsStream(path);
+        try {
             ContentHandler link = new DefaultHandler() {
                 @Override
                 public void startElement(
@@ -85,6 +83,8 @@
             new HtmlParser().parse(
                     stream, new TeeContentHandler(body, link),
                     metadata, new ParseContext());
+        } finally {
+            stream.close();
         }
 
         assertEquals(
@@ -134,15 +134,14 @@
         String content = new Tika().parseToString(
                 HtmlParserTest.class.getResourceAsStream(path), metadata);
 
-        //can't specify charset because default differs between OS's
-        assertTrue(metadata.get(Metadata.CONTENT_TYPE).startsWith("application/xhtml+xml; charset="));
+        assertEquals("application/xhtml+xml", metadata.get(Metadata.CONTENT_TYPE));
         assertEquals("XHTML test document", metadata.get(TikaCoreProperties.TITLE));
 
         assertEquals("Tika Developers", metadata.get("Author"));
         assertEquals("5", metadata.get("refresh"));
-        assertContains("ability of Apache Tika", content);
-        assertContains("extract content", content);
-        assertContains("an XHTML document", content);
+        assertTrue(content.contains("ability of Apache Tika"));
+        assertTrue(content.contains("extract content"));
+        assertTrue(content.contains("an XHTML document"));
     }
 
     @Test
@@ -150,26 +149,24 @@
         ContentHandler handler = new BodyContentHandler();
         new HtmlParser().parse(
                 new ByteArrayInputStream(new byte[0]),
-                handler, new Metadata(), new ParseContext());
+                handler, new Metadata(), new ParseContext());
         assertEquals("", handler.toString());
     }
 
     /**
      * Test case for TIKA-210
-     *
      * @see TIKA-210
      */
     @Test
     public void testCharactersDirectlyUnderBodyElement() throws Exception {
         String test = "test";
         String content = new
Tika().parseToString(
-                new ByteArrayInputStream(test.getBytes(UTF_8)));
+                new ByteArrayInputStream(test.getBytes("UTF-8")));
         assertEquals("test", content);
     }
 
     /**
      * Test case for TIKA-287
-     *
      * @see TIKA-287
      */
     @Test
@@ -217,11 +214,11 @@
     private void assertRelativeLink(String url, String base, String relative)
             throws Exception {
         String test =
-            ""
-            + "test";
+            ""
+            + "test";
         final List links = new ArrayList();
         new HtmlParser().parse(
-                new ByteArrayInputStream(test.getBytes(UTF_8)),
+                new ByteArrayInputStream(test.getBytes("UTF-8")),
                 new DefaultHandler() {
                     @Override
                     public void startElement(
@@ -239,95 +236,90 @@
     /**
      * Test case for TIKA-268
-     *
      * @see TIKA-268
      */
     @Test
     public void testWhitespaceBetweenTableCells() throws Exception {
         String test =
-            "ab";
+            "ab";
         String content = new Tika().parseToString(
-                new ByteArrayInputStream(test.getBytes(UTF_8)));
-        assertContains("a", content);
-        assertContains("b", content);
+                new ByteArrayInputStream(test.getBytes("UTF-8")));
+        assertTrue(content.contains("a"));
+        assertTrue(content.contains("b"));
         assertFalse(content.contains("ab"));
     }
 
     /**
      * Test case for TIKA-332
-     *
     * @see TIKA-332
      */
     @Test
     public void testHttpEquivCharset() throws Exception {
         String test =
-            ""
-            + "the name is \u00e1ndre"
-            + "";
-        Metadata metadata = new Metadata();
-        new HtmlParser().parse(
-                new ByteArrayInputStream(test.getBytes(ISO_8859_1)),
-                new BodyContentHandler(), metadata, new ParseContext());
+            ""
+            + "the name is \u00e1ndre"
+            + "";
+        Metadata metadata = new Metadata();
+        new HtmlParser().parse (
+                new ByteArrayInputStream(test.getBytes("ISO-8859-1")),
+                new BodyContentHandler(), metadata, new ParseContext());
         assertEquals("ISO-8859-1", metadata.get(Metadata.CONTENT_ENCODING));
     }
 
     /**
      * Test case for TIKA-892
-     *
     * @see TIKA-892
      */
     @Test
     public void testHtml5Charset() throws Exception {
         String test =
             ""
-            + "the name is \u00e1ndre"
-            + "";
-        Metadata metadata = new Metadata();
-        new HtmlParser().parse(
-                new ByteArrayInputStream(test.getBytes(ISO_8859_1)),
+            + "the name is \u00e1ndre"
+            + "";
+        Metadata metadata = new Metadata();
+        new HtmlParser().parse(
+                new ByteArrayInputStream(test.getBytes("ISO-8859-1")),
                 new BodyContentHandler(), metadata, new ParseContext());
         assertEquals("ISO-8859-15", metadata.get(Metadata.CONTENT_ENCODING));
     }
 
     /**
      * Test case for TIKA-334
-     *
     * @see TIKA-334
      */
     @Test
     public void testDetectOfCharset() throws Exception {
         String test =
-            "\u017d";
-        Metadata metadata = new Metadata();
-        new HtmlParser().parse(
-                new ByteArrayInputStream(test.getBytes(UTF_8)),
-                new BodyContentHandler(), metadata, new ParseContext());
+            "\u017d";
+        Metadata metadata = new Metadata();
+        new HtmlParser().parse (
+                new ByteArrayInputStream(test.getBytes("UTF-8")),
+                new BodyContentHandler(), metadata, new ParseContext());
         assertEquals("\u017d", metadata.get(TikaCoreProperties.TITLE));
     }
 
     /**
      * Test case for TIKA-341
-     *
     * @see TIKA-341
      */
     @Test
     public void testUsingCharsetInContentTypeHeader() throws Exception {
         final String test =
-            "the name is \u00e1ndre"
-            + "";
-
-        Metadata metadata = new Metadata();
-        new HtmlParser().parse(
-                new ByteArrayInputStream(test.getBytes(UTF_8)),
-                new BodyContentHandler(), metadata, new ParseContext());
+            "the name is \u00e1ndre"
+            + "";
+
+        Metadata metadata = new Metadata();
+        new HtmlParser().parse (
+                new ByteArrayInputStream(test.getBytes("UTF-8")),
+                new BodyContentHandler(), metadata, new ParseContext());
         assertEquals("UTF-8", metadata.get(Metadata.CONTENT_ENCODING));
 
         metadata = new Metadata();
         metadata.set(Metadata.CONTENT_TYPE, "text/html; charset=ISO-8859-1");
-        new HtmlParser().parse(
-                new ByteArrayInputStream(test.getBytes(ISO_8859_1)),
-                new BodyContentHandler(), metadata, new ParseContext());
+        new HtmlParser().parse (
+                new ByteArrayInputStream(test.getBytes("ISO-8859-1")),
+                new BodyContentHandler(), metadata, new ParseContext());
         assertEquals("ISO-8859-1", metadata.get(Metadata.CONTENT_ENCODING));
     }
 
@@ -343,7 +335,7 @@
     public void testLineBreak() throws Exception {
         String test = "foo bar baz";
         String text = new Tika().parseToString(
-                new ByteArrayInputStream(test.getBytes(US_ASCII)));
+                new ByteArrayInputStream(test.getBytes("US-ASCII")));
         String[] parts = text.trim().split("\\s+");
         assertEquals(3, parts.length);
         assertEquals("foo", parts[0]);
@@ -353,7 +345,6 @@
     /**
      * Test case for TIKA-339: Don't use language returned by CharsetDetector
-     *
     * @see TIKA-339
      */
     @Test
@@ -361,73 +352,70 @@
         String test = "Simple Content";
         Metadata metadata = new Metadata();
         metadata.add(Metadata.CONTENT_LANGUAGE, "en");
-        new HtmlParser().parse(
-                new ByteArrayInputStream(test.getBytes(UTF_8)),
-                new BodyContentHandler(), metadata, new ParseContext());
+        new HtmlParser().parse (
+                new ByteArrayInputStream(test.getBytes("UTF-8")),
+                new BodyContentHandler(), metadata, new ParseContext());
         assertEquals("en", metadata.get(Metadata.CONTENT_LANGUAGE));
     }
 
     /**
      * Test case for TIKA-349
-     *
     * @see TIKA-349
      */
     @Test
     public void testHttpEquivCharsetFunkyAttributes() throws Exception {
         String test1 =
-            ""
-            + "the name is \u00e1ndre"
-            + "";
-        Metadata metadata = new Metadata();
-        new HtmlParser().parse(
-                new ByteArrayInputStream(test1.getBytes(ISO_8859_1)),
-                new BodyContentHandler(), metadata, new ParseContext());
+            ""
+            + "the name is \u00e1ndre"
+            + "";
+        Metadata metadata = new Metadata();
+        new HtmlParser().parse (
+                new ByteArrayInputStream(test1.getBytes("ISO-8859-1")),
+                new BodyContentHandler(), metadata, new ParseContext());
         assertEquals("ISO-8859-15", metadata.get(Metadata.CONTENT_ENCODING));
 
         // Some HTML pages have errors like ';;' versus '; ' as separator
         String test2 =
-            ""
-            + "the name is \u00e1ndre"
-            + "";
+            ""
+            + "the name is \u00e1ndre"
+            + "";
         metadata = new Metadata();
-        new HtmlParser().parse(
-                new ByteArrayInputStream(test2.getBytes(ISO_8859_1)),
-                new BodyContentHandler(), metadata, new ParseContext());
+        new HtmlParser().parse (
+                new ByteArrayInputStream(test2.getBytes("ISO-8859-1")),
+                new BodyContentHandler(), metadata, new ParseContext());
         assertEquals("ISO-8859-15", metadata.get(Metadata.CONTENT_ENCODING));
     }
 
     /**
      * Test case for TIKA-350
-     *
     * @see TIKA-350
      */
     @Test
     public void testUsingFunkyCharsetInContentTypeHeader() throws Exception {
         final String test =
-            "the name is \u00e1ndre"
-            + "";
-
-        Metadata metadata = new Metadata();
-        new HtmlParser().parse(
-                new ByteArrayInputStream(test.getBytes(UTF_8)),
-                new BodyContentHandler(), metadata, new ParseContext());
+            "the name is \u00e1ndre"
+            + "";
+
+        Metadata metadata = new Metadata();
+        new HtmlParser().parse (
+                new ByteArrayInputStream(test.getBytes("UTF-8")),
+                new BodyContentHandler(), metadata, new ParseContext());
         assertEquals("UTF-8", metadata.get(Metadata.CONTENT_ENCODING));
 
         metadata = new Metadata();
         metadata.set(Metadata.CONTENT_TYPE, "charset=ISO-8859-1;text/html");
-        new HtmlParser().parse(
-                new ByteArrayInputStream(test.getBytes(ISO_8859_1)),
-                new BodyContentHandler(), metadata, new ParseContext());
+        new HtmlParser().parse (
+                new ByteArrayInputStream(test.getBytes("ISO-8859-1")),
+                new BodyContentHandler(), metadata, new ParseContext());
         assertEquals("ISO-8859-1", metadata.get(Metadata.CONTENT_ENCODING));
     }
 
     /**
      * Test case for TIKA-357
-     *
     * @see TIKA-357
      */
     @Test
@@ -436,14 +424,13 @@
         Metadata metadata = new Metadata();
         new HtmlParser().parse(
                 HtmlParserTest.class.getResourceAsStream(path),
-                new BodyContentHandler(), metadata, new ParseContext());
+                new BodyContentHandler(), metadata, new ParseContext());
         assertEquals("windows-1251", metadata.get(Metadata.CONTENT_ENCODING));
     }
 
     /**
      * Test case for TIKA-420
-     *
     * @see TIKA-420
      */
     @Test
@@ -454,7 +441,7 @@
         BodyContentHandler handler = new BodyContentHandler();
         new HtmlParser().parse(
                 HtmlParserTest.class.getResourceAsStream(path),
-                new BoilerpipeContentHandler(handler), metadata, new ParseContext());
+                new BoilerpipeContentHandler(handler), metadata, new ParseContext());
 
         String content = handler.toString();
         assertTrue(content.startsWith("This is the real meat"));
@@ -465,19 +452,18 @@
     /**
      * Test case for TIKA-478. Don't emit sub-elements inside of .
-     *
     * @see TIKA-478
      */
     @Test
     public void testElementOrdering() throws Exception {
         final String test = "Title" +
-            "" +
-            "Simple Content";
-
-        StringWriter sw = new StringWriter();
-        new HtmlParser().parse(
-                new ByteArrayInputStream(test.getBytes(UTF_8)),
+            "" +
+            "" +
+            "Simple Content";
+
+        StringWriter sw = new StringWriter();
+        new HtmlParser().parse(
+                new ByteArrayInputStream(test.getBytes("UTF-8")),
                 makeHtmlTransformer(sw), new Metadata(), new ParseContext());
 
         String result = sw.toString();
@@ -504,18 +490,17 @@
     /**
      * Test case for TIKA-463. Don't skip elements that have URLs.
-     *
     * @see TIKA-463
      */
     @Test
     public void testImgUrlExtraction() throws Exception {
         final String test = "Title" +
-            "" +
-            "";
-
-        StringWriter sw = new StringWriter();
-        new HtmlParser().parse(
-                new ByteArrayInputStream(test.getBytes(UTF_8)),
+            "" +
+            "";
+
+        StringWriter sw = new StringWriter();
+        new HtmlParser().parse(
+                new ByteArrayInputStream(test.getBytes("UTF-8")),
                 makeHtmlTransformer(sw), new Metadata(), new ParseContext());
 
         String result = sw.toString();
@@ -526,18 +511,17 @@
     /**
      * Test case for TIKA-463. Don't skip elements that have URLs.
-     *
     * @see TIKA-463
      */
     @Test
     public void testFrameSrcExtraction() throws Exception {
         final String test = "Title" +
-            "" +
-            "";
-
-        StringWriter sw = new StringWriter();
-        new HtmlParser().parse(
-                new ByteArrayInputStream(test.getBytes(UTF_8)),
+            "" +
+            "";
+
+        StringWriter sw = new StringWriter();
+        new HtmlParser().parse(
+                new ByteArrayInputStream(test.getBytes("UTF-8")),
                 makeHtmlTransformer(sw), new Metadata(), new ParseContext());
 
         String result = sw.toString();
@@ -548,19 +532,18 @@
     /**
      * Test case for TIKA-463. Don't skip elements that have URLs.
-     *
     * @see TIKA-463
      */
     @Test
     public void testIFrameSrcExtraction() throws Exception {
         final String test = "Title" +
-            "" +
-            "