Web Data Extraction and Integration

Lecture Overview
Number and Type: 181.130 VU WS 2012/13
Lecturer: Robert Baumgartner (exercises together with tutor Alexander Fischl)
Selected Keywords: Information extraction, approaches, tools and methods for wrapper generation, web querying, data integration, XML
Preliminary Meeting: Friday 5th of October, 16:00 (s.t.), EI 2 Pichelmayer HS
Registration: Until 4th of October via TISS (limited number of participants). Please deregister in TISS if you decide not to take the course. ECML students who cannot yet register: please send me a message so that a place can be reserved for you.
Language: Slides in English; the lecture language depends on whether non-German-speaking students attend
Timetable: Selected Fridays 16:00-19:00 (see below for details; two exercise slots)
Procedure: Lecture coupled with exercises and group work
Topics:
  • Information Extraction: Setting, History, IE vs. IR
  • Structured Data Extraction and Wrapping
  • XML Transformation and Query Languages, DOM
  • Web Wrapper Languages
  • Wrapper Generation Tools
  • Wrappers for Mashups, SOA and BI
  • Inductive Wrapper Generation
  • Automatic Data Extraction / Web Data Mining
  • Supervised Wrapper Generation
  • Deep Web Navigation Approaches
  • Data Extraction from PDF documents
  • Mediation and Integration Approaches
  • Web Data Cleaning
  • Lixto Visual Wrapper and Transformation Server
Fields of Study: This VU is part of the curriculum of several master's programs and of the European Master's Program in Computational Logic.


Structure of the Lecture and Slides
Session | Topics / Slides | Date | Lecture Time | Lecture Location | Grp A* Exercises | Grp B* Exercises
1 | Preliminary Meeting and Motivation/History Information Extraction (6 in 1) | 5.10. | 16:00-17:15 | EI 2 Pichelmayer HS | - | -
2 | XPath and XSLT (6 in 1, Resources, 1st Exercises, Reference Solutions) | 19.10. | 16:00-18:00 | EI 2 Pichelmayer HS | - | -
3 | DOM and approaches to wrapper generation (6 in 1, Resources, 2nd Exercises) | 9.11. | 17:00-18:00 | EI 2 Pichelmayer HS | 16:00-17:00 | 18:00-19:00
4 | Tools for Web Information Extraction (6 in 1, 3rd Exercises) | 23.11. | 17:00-18:00 | EI 2 Pichelmayer HS | 16:00-17:00 | 18:00-19:00
5 | Lixto Visual Developer (6 in 1, 4th Exercises, Group Project Topics) | 30.11. | 17:00-18:00 | EI 2 Pichelmayer HS | 16:00-17:00 | 18:00-19:00
6 | Spatial Querying, Automated Extraction and Adaptation (6 in 1, 5th Exercises) | 7.12. | 17:00-18:00 | EI 2 Pichelmayer HS | 16:00-17:00 | 18:00-19:00
7 | Automatic Data Extraction and Web Data Integration (6 in 1, 6th Exercises) | 14.12. | 17:00-18:00 | EI 2 Pichelmayer HS | 16:00-17:00 | 18:00-19:00
8 | TamCrow Project (6 in 1) | 11.1. | 17:00-18:00 | EI 2 Pichelmayer HS | 16:00-17:00 | 18:00-19:00
9/A | Group Presentations (Group Project Topics, Group Projects Agenda, Group Projects Download) | 18.1. | - | EI 2 Pichelmayer HS | 16:00-19:00 | -
9/B | Group Presentations (continued) | 25.1. | - | EI 2 Pichelmayer HS | - | 16:00-19:00