NEXTWRAP - Next Generation Web Wrapper Technologies
This project aims at significant scientific and technological improvements
of Web information extraction and annotation technology. Current systems for
automated Web information extraction allow an application designer to visually
specify extraction patterns on sample HTML documents. Pattern instances are
then automatically extracted from production documents and translated into XML.
In this project we want to pave the way for a next generation extraction technology
by performing basic and experimental research towards the following goals:
- Enabling visual data extraction from poorly structured sources such as plain
character documents, IBM 3270 screen images, and PDF documents. Based on general
and specific document structure ontologies, algorithms and methods will be
developed for imposing a tree structure on such source documents and thus
making them accessible to visual wrapping methods.
- Enabling a visual information extraction system to deliver information into
RDF repositories and other ontological knowledge bases through a tight coupling
of the system's pattern hierarchy with an ontological mapping mechanism.
This would result in the first extraction technology able to deliver
knowledge content directly.
- Enabling the automated correction of tree-based wrappers in case of major
changes in the structure of the input document(s). So far, techniques for
automated adaptation and repair of wrappers have been considered for text-based
grammatical wrappers only. A major research effort - and an improved understanding
of "change ontologies" - is necessary before repair techniques can
be developed for the more powerful tree-based wrappers.
- Investigating new interface paradigms for wrapper generation. Currently,
the specification of nontrivial data extraction is very hard for non-experts.
We want to investigate novel simplified interfaces to facilitate wrapper construction
by lay users. At the same time, we want to facilitate the creation and maintenance
of community-based ontologies.
All tasks are centred on tree-based wrapping and have as their common denominator
the use of ontologies. We will develop a strong competence as a team of researchers
in establishing a common ontological framework that we believe will form the
basis of next generation extraction technology.
Our group is involved in this project together with TU Graz and Lixto Software GmbH.
Project duration: 1.1.2005-31.12.2006.
Funded by FFG Fit-IT.