Internship 2010: Improve PDF Import

From Apache OpenOffice Wiki
Jump to: navigation, search


Writer Icon.png

Writer Project

Please view the guidelines
before contributing.

Popular Subcategories:

Internal Documentation:

API Documentation:

Ongoing Efforts:

Sw.OpenOffice.org


Abstract

The PDF Import Extension allows you to import and modify PDF documents. Best results with 100% layout accuracy can be achieved with the "PDF/ODF hybrid file" format, which this extension also enables. A hybrid PDF/ODF file is a PDF file that contains an embedded ODF source file. Hybrid PDF/ODF files will be opened in OpenOffice.org as an ODF file without any layout changes. Users without this extension can open the PDF part of the hybrid file with their PDF viewer.

The PDF Import Extension also allows you to import and modify PDF documents for non hybrid PDF/ODF files. PDF documents are imported in Draw to preserve the layout and to allow basic editing. This is the perfect solution for changing dates, numbers or small portions of text with a minimum loss of formatting information for simple formatted documents.

Goals for a PDF import

The document created by importing a PDF file should resemble the original as close as possible; nevertheless PDF per se does not lend itself to that end easily: most PDF files contain no information about layout or document structure at all. Therefore a PDF file will never be able to be imported on a 1:1 basis. We have to define goals to define what level of similarity must be achieved on a basis of feasibility.

These goals should be treated as paramount:

  • all text that is visible in the original PDF document should be imported
  • text attributes: font family, font size, weight (bold, not bold), style (italic, not italic) should be imported together with the respective text.
  • all drawing elements (images, vector graphics) should be imported.
  • if the implementation has to choose between layout fidelity and editability, lean towards layout.

Additionally there are some goals that would greatly enhance the import result, all of these features can by their nature only be implemented with heuristic methods since PDF (unless the file uses tagged PDF) does not contain structural information. The following text features should be detected (sequence in descending importance):

  • Paragraphs
  • Enumerations
  • Titles
  • Underlined text
  • subscript/superscript
Personal tools