
Talk Announcement: tomorrow, Thursday, 3 May 11.45h -- "Towards Domain-Independent Information Extraction from Web Tables"



This talk is a trial run for a presentation at WWW-07 next week. 
Critical questions and feedback are very welcome after the talk.

============================================================================
Date & Time: Thursday, May 3, 2007, 11:45 - 12:30 (30 min talk)
Place: SE 184/2, Favoritenstraße 9-11, 3rd floor
Title: Towards Domain-Independent Information Extraction from Web Tables
Authors: Wolfgang Gatterbauer, Paul Bohunsky, Marcus Herzog, Bernhard Krüpl,
Bernhard Pollak

Abstract:
Traditionally, information extraction from web tables has focused on small,
more or less homogeneous corpora, often based on assumptions about the use
of <table> tags. A multitude of different HTML implementations of web tables
make these approaches difficult to scale. In this paper, we approach the
problem of domain-independent information extraction from web tables by
shifting our attention from the tree-based representation of web pages to a
variation of the two-dimensional visual box model used by web browsers to
display the information on the screen. The thereby obtained topological and
style information allows us to fill the gap created by missing
domain-specific knowledge about content and table templates. We believe
that, in a future step, this approach can become the basis for a new way of
large-scale knowledge acquisition from the current "Visual Web."

Paper available online at: http://www2007.org/papers/paper790.pdf
============================================================================


Afterwards, a shorter trial presentation (15 min) for another 
occasion will introduce the following topic.

============================================================================
Abstract:
Information contained in databases often appears in redundant form, meaning
that the actual core is stated repeatedly in various places. Dissemination
of the information is usually biased, meaning that some pieces of
information get repeated more often than others. This skewed presentation of
information has important consequences for a subsequent search or information
acquisition process and its ability to acquire and learn the contained core
information. This talk briefly investigates how bias in information
dissemination influences the effectiveness of data integration or
information acquisition. Firstly, we motivate a metric that reflects the
success of information integration from redundant data. Secondly, we derive
analytic functions for this measure that depend on redundancy, the bias in the
distribution of redundancy, and the standard recall measure. Thirdly, we
apply the developed mathematical tools to real-world data like the Enron
email data set, Austrian World Wide Web link data, and the French Word
Corpus.