New! Version 0.9 of the PDF Extraction Toolkit, which includes GraphWrap, has now been released and is available here.
Imagine that you have a large amount of data in one or more PDF files, which is presented in a consistent format, such as product specifications, measurements, prices or contact information. In order to make this data amenable to machine processing, it must first be extracted into a structured format such as XML or a relational database. As most PDF files lack the structuring information which would allow us to locate the individual data instances, this is a challenging task.
GraphWrap, which is currently at prototype stage, allows a non-expert user to create wrappers (programs that locate and extract such data) for almost any PDF file in an intuitive and interactive manner. After the user selects an example instance in the document, a few clicks on the graph representation to set conditions and choose which data items to extract are usually all that is required. The resulting wrapper can then be run on other pages and documents which exhibit a similar visual structure. A screenshot of the system is shown below.
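To make the idea of a graph-based wrapper more concrete, the following is a minimal Python sketch. It is not GraphWrap's actual implementation or API. It assumes that a page is represented as a graph whose nodes are text segments and whose edges are labelled spatial relations, and that a wrapper is a small pattern graph with a condition on each node. The function `run_wrapper`, the attribute names and all sample data are hypothetical and chosen purely for illustration.

```python
import re

# Hypothetical document graph for one page: nodes are text segments,
# edges are labelled spatial relations ((a, b): "right" means b is right of a).
page_nodes = {
    1: {"text": "Widget A", "bold": True},
    2: {"text": "EUR 12.50", "bold": False},
    3: {"text": "Widget B", "bold": True},
    4: {"text": "EUR 7.00", "bold": False},
    5: {"text": "Page 1", "bold": False},
}
page_edges = {(1, 2): "right", (3, 4): "right", (1, 3): "below", (2, 4): "below"}

# Hypothetical wrapper: a pattern graph with a condition per node and a flag
# saying whether the matched text should be extracted.
wrapper_nodes = {
    "name": {"condition": lambda n: n["bold"], "extract": True},
    "price": {
        "condition": lambda n: re.fullmatch(r"EUR \d+\.\d{2}", n["text"]) is not None,
        "extract": True,
    },
}
wrapper_edges = {("name", "price"): "right"}


def run_wrapper(w_nodes, w_edges, d_nodes, d_edges):
    """Return one dict of extracted fields per match of the wrapper pattern."""
    keys = list(w_nodes)
    results = []

    def extend(assignment):
        # All pattern nodes assigned: record the fields marked for extraction.
        if len(assignment) == len(keys):
            results.append(
                {k: d_nodes[v]["text"] for k, v in assignment.items() if w_nodes[k]["extract"]}
            )
            return
        key = keys[len(assignment)]
        for node_id, attrs in d_nodes.items():
            if node_id in assignment.values():
                continue  # each document node is used at most once per match
            if not w_nodes[key]["condition"](attrs):
                continue  # node condition not satisfied
            trial = {**assignment, key: node_id}
            # Check every pattern edge whose two endpoints are already assigned.
            if all(
                d_edges.get((trial[a], trial[b])) == label
                for (a, b), label in w_edges.items()
                if a in trial and b in trial
            ):
                extend(trial)

    extend({})
    return results


records = run_wrapper(wrapper_nodes, wrapper_edges, page_nodes, page_edges)
```

Under these assumptions, the same `wrapper_nodes` and `wrapper_edges` could be run against the graph of any other page with a similar layout, which mirrors how a wrapper built from a single example is reused. The simple backtracking search is used here only for brevity.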
This prototype was presented at CeBIT at the stand of the Austrian Computer Society from 3 to 5 March 2009. The accompanying handouts from the presentation with instructions for use can be downloaded here in English (PDF) or German (PDF).
The back-end of GraphWrap is now published under the Apache licence. The GUI, which uses the TouchGraph and XMIllum libraries and can be used to interactively design wrappers, is published under the GPL.
More detailed instructions for the GraphWrap prototype are available in English (PDF) and German (PDF). If you have any further questions, please do not hesitate to send me an e-mail.