Invited Talk Announcement: Data Extraction from poorly structured formats -- PDF to HTML Conversion



Date: Friday, 12th of November, 13:30

Place: Seminar room 184/2 (DBAI), Favoritenstraße 9-11/1842, 3rd floor (when you leave the elevator, turn left and go through the corridor; the entrance is on the right side)

Title: Data Extraction from poorly structured formats -- PDF to HTML Conversion
Speaker: Tamir Hassan

Abstract:

This talk presents my project on PDF to HTML Conversion, which was undertaken
during my third year of study at Warwick University.  Unlike similar
converters on the market, it does not attempt to reconstruct the original
layout and appearance of the page.  Instead, it generates a correctly
structured HTML document from which the content can easily be lifted and
re-used.

The resulting program can cope with various page structures, including
columns, and can therefore generate good results with fairly complicated
pages.  Certain structures, such as tables, are not yet understood by the
converter, and suggestions will be given for further work both to increase
the range of understood layouts and to improve the reliability of the
extraction process.

The project also provides a starting point for converting to a more
descriptive XML-based format, which can be integrated into Lixto to provide a
wrapper generator for PDF documents.