Internship/Tasks/Proper paragraphs

From Apache OpenOffice Wiki

The task is to implement correct import of text paragraphs. The current version of the extension can import only single lines, which is quite inconvenient when we try to edit the text.

Line importing in current extension

XPDF passes on no information that a BT tag (the start of a text object) was met, so we cannot determine when a new text object begins. A line is therefore recognized by the position of consecutive glyphs (more precisely, of the rectangles containing the glyphs). If two consecutive rectangles are close enough to each other, they are treated as belonging to the same line. This solution is not perfect, because we have to decide what "close enough" means.
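The "close enough" test can be sketched as follows. This is only an illustration under stated assumptions: the GlyphRect type, the sameLine function and both tolerance factors are hypothetical and are not taken from the extension code. The tolerances are scaled by glyph height so that the test does not depend on font size.

```cpp
#include <algorithm>
#include <cassert>
#include <cmath>

// Hypothetical glyph rectangle (not the extension's real type).
// Coordinates are assumed to grow to the right and downward.
struct GlyphRect {
    double x0, y0, x1, y1; // left, top, right, bottom
    double height() const { return y1 - y0; }
};

// Returns true when `next` plausibly continues the line that ends with `prev`.
// gapFactor and baselineFactor are illustrative tolerances, relative to glyph height.
bool sameLine(const GlyphRect& prev, const GlyphRect& next,
              double gapFactor = 0.5, double baselineFactor = 0.3)
{
    assert(gapFactor >= 0.0 && baselineFactor >= 0.0);
    const double h = std::max(prev.height(), next.height());
    const double gap = next.x0 - prev.x1;                      // horizontal distance
    const double baselineShift = std::fabs(next.y1 - prev.y1); // vertical offset of the bottoms
    return std::fabs(gap) <= gapFactor * h && baselineShift <= baselineFactor * h;
}
```

Choosing the two factors is exactly the open question mentioned above: values that are too small split words, values that are too large merge neighbouring columns.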

Idea of paragraph importing

To import whole paragraphs I suggest a solution similar to the one described above, but with lines and paragraphs in place of glyphs and lines. That is, when lines are close enough to each other, they are treated as one paragraph. Several cases may occur, but most of them are quite easy.
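The same idea one level up might look like the sketch below. It assumes that lines arrive in reading order from top to bottom; the LineBox type, the function names and the leadingFactor tolerance are hypothetical and chosen only for illustration.

```cpp
#include <algorithm>
#include <cassert>
#include <vector>

// Hypothetical bounding box of an already recognized line (y grows downward).
struct LineBox { double left, top, right, bottom; };

// Returns true when `next` lies directly below `prev` and close enough to it.
// leadingFactor is an illustrative tolerance relative to the line height.
bool sameParagraph(const LineBox& prev, const LineBox& next, double leadingFactor = 0.8)
{
    assert(leadingFactor >= 0.0);
    const double h = std::max(prev.bottom - prev.top, next.bottom - next.top);
    const double vGap = next.top - prev.bottom;
    const bool overlapsHorizontally = next.left < prev.right && prev.left < next.right;
    return vGap >= 0.0 && vGap <= leadingFactor * h && overlapsHorizontally;
}

// Groups consecutive lines into paragraphs by comparing each line with
// the last line of the paragraph that is currently being built.
std::vector<std::vector<LineBox>> groupIntoParagraphs(const std::vector<LineBox>& lines)
{
    std::vector<std::vector<LineBox>> paragraphs;
    for (const LineBox& line : lines) {
        if (!paragraphs.empty() && sameParagraph(paragraphs.back().back(), line))
            paragraphs.back().push_back(line);
        else
            paragraphs.push_back(std::vector<LineBox>{line});
    }
    return paragraphs;
}
```

With three lines spaced 2 units apart followed by a line 40 units lower, this grouping yields two paragraphs, of three lines and one line.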

Moreover, glyph processing is quite complex. It would be better to use encapsulation and delegate glyph processing to a standalone class. This would reduce the clutter in PDFIProcessor, which contains methods responsible for every kind of processing. The main goal is to make PDFIProcessor a wrapper around smaller classes with separate responsibilities; this approach has many advantages.

Another solution

Another solution would be to modify Gfx and OutDev from XPDF. As was said at the beginning of this page, no information is passed on when BT is met, so the solution would be to change the code so that OutDev is informed about it. Unfortunately, I see some problems with this solution: a BT object sometimes contains much more than a single paragraph, and another problem is the positioning of glyphs within drawn text objects. Moreover, it requires changes to the makefile (the extension code).

Description of implemented solution

  1. Changes in PDFIProcessor

All responsibility for glyph processing has been moved to the CharGlyphsProcessor class, which is initialized by passing a PDFIProcessor object to its only constructor. In fact it does not receive the whole object, only the few required functions, exposed through the facade design pattern; modifying PDFIProcessor content from within CharGlyphsProcessor is discouraged. PDFIProcessor possesses a CharGlyphsProcessor object, and instead of running processGlyphLine in the drawGlyphs function it executes CharGlyphsProcessor::process, which starts processing of the current glyph.
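The facade arrangement can be illustrated with a minimal sketch. All names here (MiniProcessor, ProcessorFacade, GlyphsWorker, emitText, resetState) are hypothetical stand-ins and do not reproduce the real PDFIProcessor or CharGlyphsProcessor interfaces; the point is only that the worker class sees a narrow subset of the processor.

```cpp
#include <cassert>
#include <string>
#include <vector>

// Hypothetical stand-in for the large processor class.
class MiniProcessor {
public:
    void emitText(const std::string& s) { emitted.push_back(s); }
    void resetState() { emitted.clear(); } // deliberately not exposed by the facade
    std::vector<std::string> emitted;
};

// Facade: forwards only the calls that the glyph code is allowed to make.
class ProcessorFacade {
public:
    explicit ProcessorFacade(MiniProcessor& p) : m_processor(p) {}
    void emitText(const std::string& s) { m_processor.emitText(s); }
private:
    MiniProcessor& m_processor;
};

// Hypothetical glyph worker: it holds the facade, never the processor itself.
class GlyphsWorker {
public:
    explicit GlyphsWorker(ProcessorFacade facade) : m_facade(facade) {}
    void process(const std::string& glyph) {
        assert(!glyph.empty());
        m_facade.emitText(glyph);
    }
private:
    ProcessorFacade m_facade;
};
```

Because GlyphsWorker only holds a ProcessorFacade, a call such as resetState is not reachable from it, which is the kind of protection against external modification described above.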

  2. CharGlyphProcessor

The only constructor of the class receives a facade to the PDFIProcessor class. This allows the required methods of PDFIProcessor to be called while hiding the rest, which prevents its content from being modified externally. Objects of the class possess a pointer to the paragraph currently being computed. The main function of the class is the "process" method, which receives the arguments rFontMatrix, aRect and the char to draw, and starts building the paragraph structure. An attempt is made to add every new glyph to the current paragraph; if that fails, an attempt is made to create a new line within the current paragraph. If that fails as well, the paragraph is dropped and a new one is created to replace it. Every time a new glyph or line is added, the paragraph properties need to be updated so that it can be correctly determined whether the next glyphs or lines may be contained in it. The second public method is "drop", which drops the content explicitly when the PDF "end of text object" command is met while parsing.
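The control flow of "process" and "drop" could be sketched roughly as below. This is a simplified illustration, not the implemented class: the Rect, Line, Paragraph and GlyphFlow types and the numeric tolerances are hypothetical, the rFontMatrix argument is left out, and "drop" is assumed here to hand the finished paragraph over as plain text.

```cpp
#include <cassert>
#include <cmath>
#include <string>
#include <vector>

struct Rect { double x0, y0, x1, y1; };      // hypothetical glyph rectangle, y grows downward
struct Line { std::string text; Rect box; }; // text of a line and its bounding box

class Paragraph {
public:
    bool empty() const { return m_lines.empty(); }
    // Try to append the glyph to the last line; tolerances are illustrative.
    bool tryAddGlyph(char c, const Rect& r) {
        if (m_lines.empty()) return false;
        Line& last = m_lines.back();
        const double h = r.y1 - r.y0;
        const double gap = r.x0 - last.box.x1;
        if (std::fabs(r.y1 - last.box.y1) > 0.3 * h || std::fabs(gap) > 0.5 * h)
            return false;
        last.text += c;
        last.box.x1 = r.x1; // update line properties for the next comparison
        return true;
    }
    // Try to start a new line directly below the last one (always works when empty).
    bool tryAddLine(char c, const Rect& r) {
        if (!m_lines.empty()) {
            const double h = r.y1 - r.y0;
            const double vGap = r.y0 - m_lines.back().box.y1;
            if (vGap < 0.0 || vGap > 0.8 * h) return false;
        }
        m_lines.push_back(Line{std::string(1, c), r});
        return true;
    }
    std::string text() const {
        std::string out;
        for (const Line& l : m_lines) {
            if (!out.empty()) out += '\n';
            out += l.text;
        }
        return out;
    }
private:
    std::vector<Line> m_lines;
};

class GlyphFlow {
public:
    void process(char c, const Rect& r) {
        assert(r.x1 > r.x0 && r.y1 > r.y0);
        if (m_current.tryAddGlyph(c, r)) return; // 1. same line
        if (m_current.tryAddLine(c, r)) return;  // 2. new line, same paragraph
        drop();                                  // 3. close the paragraph ...
        m_current.tryAddLine(c, r);              //    ... and start a new one
    }
    void drop() {
        if (!m_current.empty()) m_finished.push_back(m_current.text());
        m_current = Paragraph();
    }
    const std::vector<std::string>& finished() const { return m_finished; }
private:
    Paragraph m_current;
    std::vector<std::string> m_finished;
};
```

In this sketch the paragraph keeps only the bounding box of each line up to date; a real implementation would presumably track more properties, such as font and indentation.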

  3. CharGlyphParagraph

Object contains a list of lines that are
