WebPageDump is a Firefox extension which allows you to save
local copies of pages from the Web. It sounds simple, but it is not: the
standard "Save page" function of web browsers fails with most web
pages, and web site downloaders do not work in a satisfactory manner either.
These shortcomings were a serious problem for our research.
Each web page is saved in an automatically named subdirectory, making it easy to
create whole (shareable) web page collections.
WebPageDump is built upon the
Scrapbook extension
and enhances its capabilities regarding HTML entities, charsets and command-line/batch functionality,
improving the visual exactness of the local copy.
Disclaimer - Please read before downloading:
This material is presented to ensure timely dissemination of scholarly and technical work.
Copyright and all rights therein are retained by authors or by other copyright holders.
All persons copying this information are expected to adhere to the terms and constraints
invoked by each document's copyright. In most cases, these works may not be reposted
without the explicit permission of the copyright holders. Personal use is permitted.
WebPageDump and the related problems, including the WPD naming, are described in the following
SOFSEM 2007 Student Research Forum paper:
Abstract: In the research area of automatic web information extraction, there is
a need for permanent and annotated web page collections enabling objective
performance evaluation of different algorithms. Currently, researchers are
suffering from the absence of such representative and contemporary test
collections, especially on web tables. At the same time, creating your own
sharable web page collections is not trivial nowadays because of the dynamic
and diverse nature of modern web technologies employed to create often
short-lived online content. In this paper, we cover the problem of creating static
representations of web pages in order to build sharable ground truth test sets.
We explain the principal difficulties of the problem, discuss possible
approaches and introduce our solution, a Firefox extension capable of saving
web pages exactly as they are rendered online. Finally, we benchmark our system
with current alternatives using an innovative automatic method based on image
snapshots.
Keywords: saving web pages, web information extraction, test data, Firefox,
web table ground truth, performance evaluation
In interactive use, WebPageDump is invoked via the "WebPageDump" entry in the Firefox "Tools" menu.
After selecting the destination directory, the currently displayed web page is saved into a WPD-named subdirectory.
This will be the "normal" mode for most web page collecting applications.
For batch processing, the following options can be passed on the Firefox command line. These command-line options
exist mainly for testing WebPageDump, but may be useful for some special
applications; a short usage example follows the option list below. Be sure that a single batch command has finished before starting another one.
-wpd_srcurl [URL] -wpd_dest [Dest]
    Save the web page from [URL] to the [Dest] directory.
    Sample: -wpd_srcurl "http://www.tuwien.ac.at" -wpd_dest "/home/foo"
-wpd_srcurl [URL] -wpd_dest
    Save the web page from [URL] to the current directory.
-wpd_srcbatch [Batch File] -wpd_dest [WPDDest]
    Process the batch file and save to the [WPDDest] directory.
-wpd_srcbatch [Batch File]
    Process the batch file and save to the directory of [Batch File], inside WPD-named subdirectories.
-wpd_srcdir [WPDSource] -wpd_dest [WPDDest]
    Save from [WPDSource] to the [WPDDest] directory.
-wpd_srcdir [WPDSource]
    Use the [WPDSource] directory as destination; no files are written (this mode is useful for generating local screenshots).
-wpd_srcdir
    Use the current directory as source and destination; no files are written.
-wpd_screenshot [FileName]
    Generate full-page images of the web pages during the saving process using the Pearl Crescent Page Saver (if installed). If no [FileName] is chosen, "local.png" or "online.png" is used, depending on the source.
-wpd_testdir [WPDTest]
    Report the URLs, URL IDs and screenshot sizes inside the [WPDTest] directory to a text file.
-wpd_testdir
    Report the URLs, URL IDs and screenshot sizes inside the current WPD directory to a text file.
-wpd_collect
    Copy the screenshots inside the current WPD directory to a subdirectory named "wpd_collect" (which will be created) and rename them according to the WPD naming.
-wpd_collect [WPDDest]
    Copy and rename the screenshots in the current WPD directory to [WPDDest].
-wpd_collect [WPDSource] -wpd_dest [Dest]
    Copy and rename the screenshots from [WPDSource] to [Dest].
-wpd_update
    Do not generate different versions; skip existing directories (useful in case of download errors).
Batch file: a simple text file containing the required URLs line by line.
WPD directory: a directory which contains the web pages inside
WPD-named subdirectories (referred to above as [WPDSource], [WPDDest], ...).
WPD naming: a short readable name consisting of the domain name and a number generated from the characters of the whole URL, including
version and copy count (e.g. "www_dbai_tuwien_ac_at_1243v1").
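To illustrate, a minimal batch run could look as follows. The file names and paths are placeholders, and depending on the platform the Firefox executable may have to be invoked with its full path:
Sample batch file "urls.txt":
    http://www.tuwien.ac.at
    http://www.dbai.tuwien.ac.at
Sample call:
    firefox -wpd_srcbatch "/home/foo/urls.txt" -wpd_dest "/home/foo/pages"
After the run, "/home/foo/pages" should contain one WPD-named subdirectory per URL in the batch file.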
WebPageDump performs an extensive DOM tree processing before generating the
HTML code using the innerHTML property. In theory this approach
has no limitations, because the DOM tree contains all the
information necessary for the visual appearance.
In practice, however, there are limitations due to various bugs in the Gecko
rendering engine. For example,
some tags cause side effects which should not
influence the rendering (e.g. empty JavaScript tags vs. removed ones).
Another limitation is the loss of named/numbered HTML entities: all entities
are converted to Unicode characters inside the DOM tree. Furthermore, the implementations
of the innerHTML and cssText properties are not free from bugs.
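The entity loss can be reproduced with a few lines of plain DOM scripting (a minimal sketch of the browser behavior in question, not WebPageDump code):
    var div = document.createElement("div");
    div.innerHTML = "&auml; and &#228;";  // named and numbered entity for the same character
    // Reading innerHTML back serializes the parsed DOM tree; both entities
    // have been normalized to the literal Unicode character.
    alert(div.innerHTML);                 // shows: "ä and ä"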
WebPageDump v0.3 (beta) Firefox extension
WebPageDump v0.3 (beta) source code
The extension is provided under the terms of the Mozilla Public License.
To install WebPageDump you will either have
to manually allow extension installations from this URL
or save the XPI file via "Save as".
See
changes.txt for the version information.
Tested web pages (~68 MB)
Because of copyright issues we have removed the package of test web pages,
but we will make it available for serious scientific research.
The pages were downloaded and modified with WebPageDump using the SmartCache Java proxy.
URL list
List of URLs used in the experiments.
Test results
Raw output of the test mode of WebPageDump (-wpd_testdir)
containing the file sizes of the screenshots.
BECAUSE THE PROGRAM IS LICENSED FREE OF CHARGE, THE COPYRIGHT
HOLDERS PROVIDE THE PROGRAM "AS IS" WITHOUT WARRANTY
OF ANY KIND, EITHER EXPRESSED OR IMPLIED. THE ENTIRE RISK AS TO THE
QUALITY AND PERFORMANCE OF THE PROGRAM IS WITH YOU. IN NO EVENT
WILL ANY COPYRIGHT HOLDER, OR ANY OTHER PARTY WHO MAY MODIFY
AND/OR REDISTRIBUTE THE PROGRAM ACCORDING TO THE LICENSE,
BE LIABLE TO YOU FOR DAMAGES (INCLUDING BUT NOT LIMITED
TO LOSS OF DATA).
© 2007 by
Bernhard Pollak
I am not responsible for any content linked or referred to from these pages.
The provided links should not be understood as an endorsement of those websites or
their owners (or their products/services).