NEXTWRAP

NEXTWRAP - Next Generation Web Wrapper Technologies

This project aims at significant scientific and technological improvements in Web information extraction and annotation technology. Current systems for automated Web information extraction allow an application designer to visually specify extraction patterns on sample HTML documents. Pattern instances are then automatically extracted from production documents and translated into XML. In this project we want to pave the way to a next-generation extraction technology by performing basic and experimental research towards the following goals:

Enabling visual data extraction from poorly structured sources such as plain character documents, IBM 3270 screen images, and PDF documents. Based on general and specific document structure ontologies, algorithms and methods will be developed for imposing a tree structure on such source documents and thus making them accessible to visual wrapping methods.
Enabling a visual information extraction system to deliver information into RDF repositories and other ontological knowledge bases through a tight coupling of the system's pattern hierarchy with an ontological mapping mechanism. This would result in the first extraction technology able to deliver knowledge content directly.
Enabling the automated correction of tree-based wrappers in case of major changes in the structure of the input document(s). So far, techniques for the automated adaptation and repair of wrappers have been considered for text-based grammatical wrappers only. A major research effort - and an improved understanding of "change ontologies" - is necessary before repair techniques can be developed for the more powerful tree-based wrappers.
Investigating new interface paradigms for wrapper generation. Currently, the specification of nontrivial data extraction is very hard for non-experts. We want to investigate novel simplified interfaces to facilitate wrapper construction by lay users. At the same time, we want to facilitate the creation and maintenance of community-based ontologies.

All tasks are centred on tree-based wrapping and have as their common denominator the use of ontologies. As a team of researchers, we will build strong competence in establishing a common ontological framework, which we believe will form the basis of next-generation extraction technology.

Our group is involved in this project together with TU Graz and Lixto Software GmbH. Project duration: 1.1.2005-31.12.2006.

Funded by FFG Fit-IT.