OpenOffice.org Internship/Projects/2010/Improve PDF Import

Revision as of 08:58, 26 July 2010


Abstract

The PDF Import Extension allows you to import and modify PDF documents. Best results with 100% layout accuracy can be achieved with the "PDF/ODF hybrid file" format, which this extension also enables. A hybrid PDF/ODF file is a PDF file that contains an embedded ODF source file. Hybrid PDF/ODF files will be opened in OpenOffice.org as an ODF file without any layout changes. Users without this extension can open the PDF part of the hybrid file with their PDF viewer.

The PDF Import Extension also allows you to import and modify PDF documents that are not hybrid PDF/ODF files. Such documents are imported into Draw to preserve the layout and to allow basic editing. For simply formatted documents, this works well for changing dates, numbers, or small portions of text with minimal loss of formatting information.

Goals for a PDF import

The document created by importing a PDF file should resemble the original as closely as possible. However, PDF does not lend itself easily to that end: most PDF files contain no information about layout or document structure at all. A PDF file can therefore never be imported on a 1:1 basis, so we have to define goals that state what level of similarity must be achieved, based on feasibility.

These goals should be treated as paramount:

  • All text that is visible in the original PDF document should be imported.
  • Text attributes (font family, font size, bold or regular weight, italic or upright style) should be imported together with the respective text.
  • All drawing elements (images, vector graphics) should be imported.
  • If the implementation has to choose between layout fidelity and editability, it should lean towards layout fidelity.
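
As an illustration of the second goal, the following is a minimal sketch of how text could be carried together with its attributes. The names `TextStyle`, `TextRun`, and `merge_runs` are hypothetical and are not taken from the extension's code; the sketch only assumes that a parser delivers text in small fragments, each with a font family, size, weight, and style.

```python
from dataclasses import dataclass
from typing import List


@dataclass(frozen=True)
class TextStyle:
    """Attributes named in the goals above (hypothetical type)."""
    font_family: str
    font_size: float
    bold: bool = False
    italic: bool = False


@dataclass
class TextRun:
    """A piece of text together with the style it was drawn in."""
    text: str
    style: TextStyle


def merge_runs(runs: List[TextRun]) -> List[TextRun]:
    """Join adjacent runs that share exactly the same style."""
    merged: List[TextRun] = []
    for run in runs:
        if merged and merged[-1].style == run.style:
            # Same attributes as the previous run: extend it.
            merged[-1] = TextRun(merged[-1].text + run.text, run.style)
        else:
            merged.append(TextRun(run.text, run.style))
    return merged
```

Merging equal-style fragments keeps the attributes attached to the text while producing fewer, longer pieces that are easier to edit.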

Additionally, some goals would greatly enhance the import result. By their nature, all of these features can only be implemented with heuristic methods, since PDF (unless the file uses tagged PDF) does not contain structural information. The following text features should be detected (in descending order of importance):

  • Paragraphs
  • Enumerations
  • Titles
  • Underlined text
  • Subscript/superscript
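
To make the idea of a heuristic concrete, here is a minimal sketch of paragraph detection based only on the vertical distance between text lines. It is not the extension's algorithm: the `TextLine` type, the `group_paragraphs` function, and the threshold of 1.5 times the font size are all assumptions chosen for illustration.

```python
from dataclasses import dataclass
from typing import List, Optional


@dataclass
class TextLine:
    """One line of text as a PDF parser might report it (hypothetical type)."""
    text: str
    top: float        # distance from the top of the page, in points
    font_size: float  # in points


def group_paragraphs(lines: List[TextLine], max_gap_factor: float = 1.5) -> List[str]:
    """Group lines into paragraphs by the vertical gap between them.

    A new paragraph starts when the distance to the previous line is larger
    than max_gap_factor times the previous line's font size. The factor is
    an assumed value, not a measured one.
    """
    paragraphs: List[List[str]] = []
    previous: Optional[TextLine] = None
    for line in sorted(lines, key=lambda item: item.top):
        starts_new = (
            previous is None
            or line.top - previous.top > max_gap_factor * previous.font_size
        )
        if starts_new:
            paragraphs.append([line.text])
        else:
            paragraphs[-1].append(line.text)
        previous = line
    return [" ".join(parts) for parts in paragraphs]
```

A real implementation would probably also need to consider indentation, columns, and font changes; the sketch only shows why the result depends on a tunable threshold rather than on information stored in the file.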

Backlog

This section lists the tasks that are planned for the internship but have not been started yet.

  1. Pop-up window that allows fonts to be replaced
  2. Allow import of only selected pages
  3. Native PDF forms
  4. Proper paragraphs
  5. Processing layout of LaTeX PDF
  6. Import of complex vector graphics elements
  7. Conversion of tables
  8. Import of EPS graphics
  9. RTL (right-to-left) text/font support
  10. Change ContentSink class
  11. Fix disappearing bookmarks
  12. Fix Ghostscript PDF import
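
For the task "Allow import of only selected pages", one possible way to accept a page selection from the user is sketched below. The selection syntax ("1-3,5"), the function name `parse_page_selection`, and the 1-based numbering are assumptions made for illustration; the page does not specify how the selection will be entered.

```python
from typing import List


def parse_page_selection(spec: str, page_count: int) -> List[int]:
    """Expand a selection such as "1-3,5" into sorted 1-based page numbers.

    Raises ValueError if a part is not a number or range, or if it refers
    to a page outside 1..page_count.
    """
    pages = set()
    for part in spec.split(","):
        part = part.strip()
        if not part:
            continue  # tolerate stray commas
        if "-" in part:
            first_text, last_text = part.split("-", 1)
            first, last = int(first_text), int(last_text)
        else:
            first = last = int(part)
        if first < 1 or last > page_count or first > last:
            raise ValueError(f"invalid page range: {part!r}")
        pages.update(range(first, last + 1))
    return sorted(pages)
```

Returning a sorted list without duplicates means the import code can simply skip every page that is not in the list.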

Current tasks

This section contains the list of tasks that are being done right now.

What has been done so far

  1. Introduction

Project status

  • The project has been accepted for the OpenOffice.org summer internship program 2010.