WebPageDump is a Firefox extension which allows you to save
local copies of pages from the Web. It sounds simple, but it is not: the
standard "Save page" function of web browsers fails with most web
pages, and web site downloaders do not work in a satisfactory manner either.
These shortcomings were a serious problem for our research.
Each web page is saved in an automatically named subdirectory, making it easy to
create whole (shareable) web page collections.
WebPageDump is built upon the
Scrapbook extension
and enhances its capabilities regarding HTML entities, charsets and command-line/batch functionality,
improving the visual exactness of the local copy.
Disclaimer - Please read before downloading:
This material is presented to ensure timely dissemination of scholarly and technical work.
Copyright and all rights therein are retained by authors or by other copyright holders.
All persons copying this information are expected to adhere to the terms and constraints
invoked by each document's copyright. In most cases, these works may not be reposted
without the explicit permission of the copyright holders. Personal use is permitted.
WebPageDump and the related problems, including the WPD naming, are described in the following
SOFSEM 2007 Student Research Forum paper:
Abstract: In the research area of automatic web information extraction, there is
a need for permanent and annotated web page collections enabling objective
performance evaluation of different algorithms. Currently, researchers are
suffering from the absence of such representative and contemporary test
collections, especially on web tables. At the same time, creating your own
sharable web page collections is not trivial nowadays because of the dynamic
and diverse nature of modern web technologies employed to create often
short-lived online content. In this paper, we cover the problem of creating static
representations of web pages in order to build sharable ground truth test sets.
We explain the principal difficulties of the problem, discuss possible
approaches and introduce our solution, a Firefox extension capable of saving
web pages exactly as they are rendered online. Finally, we benchmark our system
with current alternatives using an innovative automatic method based on image
snapshots.
Keywords: saving web pages, web information extraction, test data, Firefox,
web table ground truth, performance evaluation
In interactive use, WebPageDump is invoked via the "WebPageDump" entry in the Firefox "Tools" menu.
After selecting the destination directory, the currently displayed web page is saved into a WPD-named subdirectory.
This will be the "normal" mode for most web page collecting applications.
For batch processing, the following options can be passed on the Firefox command line. These command-line options
exist mainly for testing WebPageDump, but may be useful for some special
applications; a short usage example follows the option list below. Be sure that a single batch command has finished before starting another one.
-wpd_srcurl [URL] -wpd_dest [Dest]
    Save the web page from [URL] to the [Dest] directory.
    Sample: -wpd_srcurl "http://www.tuwien.ac.at" -wpd_dest "/home/foo"
-wpd_srcurl [URL] -wpd_dest
    Save the web page from [URL] to the current directory.
-wpd_srcbatch [Batch File] -wpd_dest [WPDDest]
    Process the batch file and save to the [WPDDest] directory.
-wpd_srcbatch [Batch File]
    Process the batch file and save to the directory of [Batch File], inside WPD-named subdirectories.
-wpd_srcdir [WPDSource] -wpd_dest [WPDDest]
    Save from [WPDSource] to the [WPDDest] directory.
-wpd_srcdir [WPDSource]
    Use the [WPDSource] directory as destination; no files are written (this mode is useful for generating local screenshots).
-wpd_srcdir
    Use the current directory as source and destination; no files are written.
-wpd_screenshot [FileName]
    Generate full-page images of the web pages during the saving process using the Pearl Crescent Page Saver (if installed). If no [FileName] is chosen, "local.png" or "online.png" is used, depending on the source.
-wpd_testdir [WPDTest]
    Report the URLs, URL IDs and screenshot sizes inside the [WPDTest] directory to a text file.
-wpd_testdir
    Report the URLs, URL IDs and screenshot sizes inside the current WPD directory to a text file.
-wpd_collect
    Copy the screenshots inside the current WPD directory to a subdirectory named "wpd_collect" (which will be created) and rename them according to the WPD naming.
-wpd_collect [WPDDest]
    Copy and rename the screenshots in the current WPD directory to [WPDDest].
-wpd_collect [WPDSource] -wpd_dest [Dest]
    Copy and rename the screenshots from [WPDSource] to [Dest].
-wpd_update
    Do not generate different versions; skip existing directories (useful in case of download errors).
Batch file: a simple text file containing the required URLs line by line.
WPD directory: a directory which contains the web pages inside
WPD-named subdirectories (referred to above as [WPDSource], [WPDDest], ...).
WPD naming: a short readable name consisting of the domain name and a number generated from the characters of the whole URL, including
version and copy count (e.g. "www_dbai_tuwien_ac_at_1243v1").
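To illustrate, a minimal batch run could look as follows. The file names and paths are placeholders, and depending on the platform the Firefox executable may have to be invoked with its full path:
Sample batch file "urls.txt":
    http://www.tuwien.ac.at
    http://www.dbai.tuwien.ac.at
Sample call:
    firefox -wpd_srcbatch "/home/foo/urls.txt" -wpd_dest "/home/foo/pages"
After the run, "/home/foo/pages" should contain one WPD-named subdirectory per URL in the batch file.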
WebPageDump performs an extensive DOM tree processing before generating the
HTML code using the innerHTML property. In theory this approach
has no limitations, because the DOM tree contains all the
information necessary for the visual appearance.
In practice, however, there are limitations due to various bugs in the Gecko
rendering engine. For example,
some tags cause side effects which should not
influence the rendering (e.g. empty JavaScript tags vs. removed ones).
Another limitation is the loss of named/numbered HTML entities: all entities
are converted to Unicode characters inside the DOM tree. Furthermore, the implementations
of the innerHTML and cssText properties are not free from bugs.
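The entity loss can be reproduced with a few lines of plain DOM scripting (a minimal sketch of the browser behavior in question, not WebPageDump code):
    var div = document.createElement("div");
    div.innerHTML = "&auml; and &#228;";  // named and numbered entity for the same character
    // Reading innerHTML back serializes the parsed DOM tree; both entities
    // have been normalized to the literal Unicode character.
    alert(div.innerHTML);                 // shows: "ä and ä"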
WebPageDump v0.3 (beta) Firefox extension
WebPageDump v0.3 (beta) source code
The extension is provided under the terms of the Mozilla Public License.
To install WebPageDump you will either have
to manually allow extension installations from this URL
or save the XPI file via "Save as".
See
changes.txt for the version information.
Tested web pages (~68 MB)
Because of copyright issues we have removed the package of test web pages,
but we will make it available for serious scientific research.
The pages were downloaded and modified with WebPageDump using the SmartCache Java proxy.
URL list
List of URLs used in the experiments.
Test results
Raw output of the test mode of WebPageDump (-wpd_testdir)
containing the file sizes of the screenshots.
BECAUSE THE PROGRAM IS LICENSED FREE OF CHARGE, THE COPYRIGHT
HOLDERS PROVIDE THE PROGRAM "AS IS" WITHOUT WARRANTY
OF ANY KIND, EITHER EXPRESSED OR IMPLIED. THE ENTIRE RISK AS TO THE
QUALITY AND PERFORMANCE OF THE PROGRAM IS WITH YOU. IN NO EVENT
WILL ANY COPYRIGHT HOLDER, OR ANY OTHER PARTY WHO MAY MODIFY
AND/OR REDISTRIBUTE THE PROGRAM ACCORDING TO THE LICENSE,
BE LIABLE TO YOU FOR DAMAGES (INCLUDING BUT NOT LIMITED
TO LOSS OF DATA).
© 2007 by
Bernhard Pollak
I am not responsible for any content linked or referred to from these pages.
The provided links should not be understood as an endorsement of those websites or
their owners (or their products/services).