Parsing Parsing Parsing...

After the release of LogicalDOC EE 4.5, our partners reported a small problem with the automatic extraction of tags from certain types of documents.
Specifically, we confirmed that when tags were extracted from OpenOffice documents, accented letters (found in most Latin-based languages) were dropped.
Our development team set out to fix the problem and to review LogicalDOC's various parsers and extractors, so that text is extracted from documents as accurately as possible and in compliance with UTF-8.
The extracted text is then indexed by LogicalDOC, which makes the document archive full-text searchable.
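
To illustrate the kind of extraction involved, here is a minimal Python sketch that reads the text of an OpenOffice Writer file while preserving accented letters. It is not LogicalDOC's parser; it only relies on the fact that an .odt file is a ZIP archive whose content.xml holds the document body. The function names extract_odt_text and build_sample_odt are hypothetical, chosen for this example.

```python
import io
import zipfile
import xml.etree.ElementTree as ET

# Namespace used by OpenDocument for text elements such as <text:p> and <text:h>.
TEXT_NS = "urn:oasis:names:tc:opendocument:xmlns:text:1.0"
TEXT_TAGS = (f"{{{TEXT_NS}}}p", f"{{{TEXT_NS}}}h")


def extract_odt_text(source):
    """Return the paragraph and heading text of an .odt file (path or file object)."""
    with zipfile.ZipFile(source) as archive:
        raw = archive.read("content.xml")
    # Parse the raw bytes so the XML parser honours the declared encoding
    # (UTF-8 in OpenDocument files) instead of a platform default.
    root = ET.fromstring(raw)
    paragraphs = []
    for element in root.iter():
        if element.tag in TEXT_TAGS:
            # itertext() also collects text nested in child elements such as spans.
            paragraphs.append("".join(element.itertext()))
    return "\n".join(paragraphs)


def build_sample_odt(text):
    """Build a tiny in-memory archive with a content.xml, for demonstration only."""
    xml = (
        '<?xml version="1.0" encoding="UTF-8"?>'
        '<office:document-content'
        ' xmlns:office="urn:oasis:names:tc:opendocument:xmlns:office:1.0"'
        f' xmlns:text="{TEXT_NS}">'
        "<office:body><office:text>"
        f"<text:p>{text}</text:p>"
        "</office:text></office:body></office:document-content>"
    )
    buffer = io.BytesIO()
    with zipfile.ZipFile(buffer, "w") as archive:
        archive.writestr("content.xml", xml.encode("utf-8"))
    buffer.seek(0)
    return buffer


if __name__ == "__main__":
    print(extract_odt_text(build_sample_odt("città perché élève")))
```

The important design choice is to hand the parser bytes rather than a string decoded with a guessed encoding; decoding with the wrong charset is a common way for accented letters to be lost or mangled. The sketch ignores tables of contents, nested text boxes, and other structures a production parser would need to handle.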

The Community Edition of LogicalDOC includes parsers for Microsoft Office 2003 applications (Word, Excel, PowerPoint), AbiWord, compressed AbiWord (.zabw files), OpenOffice 2.3/3.0, StarOffice, KOffice 1.6, HTML, XML, TXT, PDF, PS (PostScript), and WordPerfect (versions 4, 5, 6).

In addition to the parsers included in the open-source release, the licensed LogicalDOC Enterprise edition can index the text content of Microsoft Office 2007 documents (.docx, .xlsx, .pptx) and AutoCAD DWG documents, and can perform optical character recognition (OCR) on raster PDF files and on multipage TIFF, JPG, and PNG images.

Our developers have also been working on support for extracting the textual content of .eml documents (emails saved by Thunderbird or MS Outlook Express, forwarded emails) and of documents with the .msg file extension (Microsoft Office Outlook 2007).
These new parsers will be available in the next release of LogicalDOC Enterprise, scheduled for next autumn.
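
As a rough idea of what extracting text from an .eml file involves, the following Python sketch uses the standard library's email package to pull the subject and the plain-text body out of a saved message. This is an assumption-laden illustration, not the upcoming LogicalDOC parser; the function name extract_eml_text is hypothetical, and the .msg format (a binary Outlook format) is not covered.

```python
from email import message_from_bytes, policy


def extract_eml_text(raw_bytes):
    """Return the subject and plain-text body of an .eml message as one string."""
    # policy.default enables the modern API, which decodes headers and
    # bodies according to the charset declared in the message itself.
    message = message_from_bytes(raw_bytes, policy=policy.default)
    subject = message["subject"] or ""
    body = message.get_body(preferencelist=("plain",))
    text = body.get_content() if body is not None else ""
    return f"{subject}\n{text}".strip()


if __name__ == "__main__":
    sample = (
        "Subject: Riunione\r\n"
        "Content-Type: text/plain; charset=utf-8\r\n"
        "Content-Transfer-Encoding: 8bit\r\n"
        "\r\n"
        "La città è già pronta.\r\n"
    ).encode("utf-8")
    print(extract_eml_text(sample))
```

Reading the file as bytes and letting the library apply the declared charset keeps accented characters intact, which matters for the same reason as in the OpenOffice case. Messages that only carry an HTML body yield just the subject here; a fuller parser would fall back to stripping the HTML part.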

Friday, August 14, 2009