PDF to HTML Conversion

This page describes my third-year project on PDF to HTML conversion, which was written in Summer 2003 as part of my degree in Computer Science at the University of Warwick.

What's New?

I have received several requests for the older versions of the JDom and JPedal libraries that were used to write the program in 2003, as the current versions are not compatible.

These libraries are available here:

JDom: jdom.jar, 124 kB
JPedal: jpedal.jar, 859 kB

Project Summary

The original aim of this project was to convert PDF files into HTML as accurately as possible. After investigating the existing converters that were available, it was found that they all took this approach, and that the results were often unsatisfactory.

To avoid repeating work that had already been done, and to improve on the situation, the project's aims moved towards "intelligent" text extraction: extracting the textual data from a PDF file, which may include columns, images and other features, and creating a "clean" HTML file containing all of the text from the PDF but without the original layout.

The implementation of the project has resulted in a program that can process fairly complex page layouts, including columns, and output the text in HTML with CSS, retaining formatting information where possible.

The program works solely by analysing the text blocks within the document. Suggestions for further improvements, such as dealing with graphics, are given in the final report.
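To give a feel for what analysing text blocks can involve, here is a minimal sketch of one possible approach. It is not the code from pdf2html.java or PdfGrouping.java, and it does not use JPedal or JDom. The class ColumnOrderSketch, the TextBlock type and all method names are hypothetical, and the sketch assumes each block already comes with a position, a width and a font size. Blocks whose horizontal ranges overlap are treated as one column, each column is read from top to bottom, and the text is written out as HTML paragraphs with the font size kept as inline CSS.

```java
import java.util.ArrayList;
import java.util.List;

public class ColumnOrderSketch {

    /** Hypothetical positioned run of text; y grows downwards from the top of the page. */
    static final class TextBlock {
        final double x, y, width;
        final int fontSize;
        final String text;

        TextBlock(double x, double y, double width, int fontSize, String text) {
            this.x = x;
            this.y = y;
            this.width = width;
            this.fontSize = fontSize;
            this.text = text;
        }
    }

    /** Groups blocks into columns (left to right); blocks within a column are ordered top to bottom. */
    static List<List<TextBlock>> groupIntoColumns(List<TextBlock> blocks) {
        List<TextBlock> sorted = new ArrayList<>(blocks);
        sorted.sort((a, b) -> Double.compare(a.x, b.x));

        List<List<TextBlock>> columns = new ArrayList<>();
        double columnRight = Double.NEGATIVE_INFINITY;
        for (TextBlock block : sorted) {
            if (columns.isEmpty() || block.x > columnRight) {
                // The block starts to the right of everything seen so far: new column.
                columns.add(new ArrayList<>());
                columnRight = block.x + block.width;
            } else {
                // The block overlaps the current column horizontally: extend that column.
                columnRight = Math.max(columnRight, block.x + block.width);
            }
            columns.get(columns.size() - 1).add(block);
        }
        for (List<TextBlock> column : columns) {
            column.sort((a, b) -> Double.compare(a.y, b.y));
        }
        return columns;
    }

    /** Escapes the characters that are significant in HTML text content. */
    static String escape(String s) {
        return s.replace("&", "&amp;").replace("<", "&lt;").replace(">", "&gt;");
    }

    /** Emits one paragraph per block in reading order, keeping the font size as inline CSS. */
    static String toHtml(List<TextBlock> blocks) {
        StringBuilder html = new StringBuilder("<html><body>\n");
        for (List<TextBlock> column : groupIntoColumns(blocks)) {
            for (TextBlock block : column) {
                html.append("<p style=\"font-size:").append(block.fontSize).append("pt\">")
                    .append(escape(block.text)).append("</p>\n");
            }
        }
        return html.append("</body></html>\n").toString();
    }

    public static void main(String[] args) {
        List<TextBlock> page = new ArrayList<>();
        page.add(new TextBlock(300, 100, 200, 10, "Right column, first paragraph"));
        page.add(new TextBlock(50, 300, 200, 10, "Left column, second paragraph"));
        page.add(new TextBlock(50, 100, 200, 14, "Left column, first paragraph"));
        System.out.print(toHtml(page));
    }
}
```

In this sketch the three sample blocks come out in the order left column first paragraph, left column second paragraph, right column first paragraph, regardless of the order in which they were supplied. A real page needs more care, for example with headings that span several columns, which this simple overlap rule would merge into a single column.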

The final report is available here in PDF format (2.2 MB).

The two program files are available for download:

pdf2html.java (main program)
PdfGrouping.java (text grouping library)

These files require the JDom and JPedal libraries to be installed.

Getting Started

Ensure that both java and javac are in your path. To compile the program, copy both Java files and both libraries into a directory. From within that directory type:

Windows: javac -classpath .;jpedal.jar;jdom.jar pdf2html.java
Unix: javac -classpath .:jpedal.jar:jdom.jar pdf2html.java

To run the converter type:

Windows: java -classpath .;jpedal.jar;jdom.jar pdf2html
Unix: java -classpath .:jpedal.jar:jdom.jar pdf2html

Running the program as above, with no command-line parameters, will display the syntax.

Other Downloads

The following historical documents are also available for download in PDF format:

Project Specification (28 October 2002)
Progress Report (6 December 2002)

If you have any further questions or suggestions, please feel free to get in touch.
