WordprocessingML is the XML format used by Microsoft Word 2007/2010 and is part of the Office Open XML specification. It defines the structure of all word-processing data in a document. Shapes and text boxes in MS Word 2007 are described in VML.
Finding sample files
Search the web (for example, with Google) for "download docx".
The import filter can be divided into three main parts: the XML parser (XML token handler), the XSL templates, and the DomainMapper (content handler).
The main code can be found in the following paths:
\writerfilter\source\dmapper is the main part, which parses the WordprocessingML content and handles the document data.
\writerfilter\source\ooxml is the XML parser, which reads the XML tokens from the file.
WordprocessingML XML token parser
OOXMLFastContextHandler is the main class for XML parsing. It inherits from xml::sax::XFastContextHandler, the general context handler interface used to parse XML files.
All handlers that parse WordprocessingML tokens inherit from OOXMLFastContextHandler. They are located in \writerfilter\source\ooxml. The class diagram is as below:
For example, OOXMLFastContextHandler_wordprocessingml_CT_Picture parses the tokens defined for drawing pictures in Word.
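The inheritance relationship described above can be illustrated with a minimal, self-contained sketch. It does not use the real UNO interface: ContextHandler, BaseHandler, PictureHandler and parsePictureExample are hypothetical stand-ins invented for this example, and the choice of "w:pict" as the token that selects the picture handler is an assumption made for illustration.

```cpp
#include <cassert>
#include <memory>
#include <string>
#include <vector>

// Hypothetical, simplified stand-in for a SAX-style context handler.
// The real xml::sax::XFastContextHandler is a UNO interface with more methods.
class ContextHandler {
public:
    virtual ~ContextHandler() = default;
    // Returns the handler responsible for a child element.
    virtual std::shared_ptr<ContextHandler> createChild(const std::string& token) = 0;
    virtual void startElement(const std::string& token) = 0;
    virtual void endElement(const std::string& token) = 0;
};

// Stand-in for OOXMLFastContextHandler: behaviour shared by all handlers.
class BaseHandler : public ContextHandler {
public:
    explicit BaseHandler(std::vector<std::string>& log) : mLog(log) {}
    std::shared_ptr<ContextHandler> createChild(const std::string& token) override;
    void startElement(const std::string& token) override { mLog.push_back("start:" + token); }
    void endElement(const std::string& token) override { mLog.push_back("end:" + token); }
protected:
    std::vector<std::string>& mLog;  // records events so the flow can be inspected
};

// Stand-in for a specialised handler such as ..._CT_Picture.
class PictureHandler : public BaseHandler {
public:
    using BaseHandler::BaseHandler;
    void startElement(const std::string& token) override { mLog.push_back("picture:" + token); }
};

// The parent chooses a specialised handler based on the child token.
std::shared_ptr<ContextHandler> BaseHandler::createChild(const std::string& token) {
    if (token == "w:pict")  // assumed token for this illustration
        return std::make_shared<PictureHandler>(mLog);
    return std::make_shared<BaseHandler>(mLog);
}

// Simulates the events a parser would send for <w:r><w:pict/></w:r>.
std::vector<std::string> parsePictureExample() {
    std::vector<std::string> log;
    std::shared_ptr<ContextHandler> root = std::make_shared<BaseHandler>(log);
    root->startElement("w:r");
    std::shared_ptr<ContextHandler> child = root->createChild("w:pict");
    assert(child != nullptr);
    child->startElement("w:pict");
    child->endElement("w:pict");
    root->endElement("w:r");
    return log;
}
```

Calling parsePictureExample() returns the recorded events in order, which shows that the picture element was routed to the specialised handler while its end event fell back to the shared base behaviour.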
XSL template for XML context handler
\writerfilter\source\ooxml contains many .xsl files. These XSL (Extensible Stylesheet Language) files define the templates from which the XML context handlers are generated. After you build writerfilter, the .hxx and .cxx files corresponding to these XSL definitions are generated under the build path \misc. To add a new context handler for WordprocessingML, we only need to add the token definitions to the data model; the corresponding handler is then generated automatically from the templates.
Three kinds of files are involved:
- XML Model – model.xml
Defines all the OOXML tokens and their relationships. New entries can be added here to support more XML tokens.
- Class file templates
Define the .hxx and .cxx file templates that are used to generate the handlers for a specific area of WordprocessingML.
- ContextHandler templates
Define the templates for the OOXML token context handlers.
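To make the generation step more concrete, here is a minimal sketch of the idea of turning one model entry into a handler class declaration. It is not the actual XSL pipeline: ModelType, handlerClassName and generateDeclaration are hypothetical names, the structure of ModelType is an assumed simplification of a model.xml entry, and only the class naming pattern is taken from the handler name quoted earlier on this page.

```cpp
#include <cassert>
#include <string>
#include <vector>

// Hypothetical, simplified view of one type definition in the data model.
struct ModelType {
    std::string nameSpace;              // e.g. "wordprocessingml"
    std::string typeName;               // e.g. "CT_Picture"
    std::vector<std::string> children;  // child tokens the type accepts (illustrative)
};

// Builds the handler class name using the pattern
// OOXMLFastContextHandler_<namespace>_<type>.
std::string handlerClassName(const ModelType& type) {
    return "OOXMLFastContextHandler_" + type.nameSpace + "_" + type.typeName;
}

// Emits a class declaration for one model entry, roughly what a
// class file template would produce for each definition.
std::string generateDeclaration(const ModelType& type) {
    assert(!type.typeName.empty());
    std::string out = "class " + handlerClassName(type)
                    + " : public OOXMLFastContextHandler\n{\n";
    for (const std::string& child : type.children)
        out += "    // handles child token: " + child + "\n";
    out += "};\n";
    return out;
}
```

With this approach, supporting a new token only means adding another ModelType entry; the declaration text follows from the template logic, which mirrors why the real handlers do not need to be written by hand.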
This part deals with the file content. After the context handlers have parsed the XML tokens from the file, DomainMapper acts as the main stream handler: it reads the content from the XML tokens, decides how to arrange it, and inserts it into the core model. The class diagram is as below:
- The source for this part is in the following files:
\writerfilter\inc\domainmapper.hxx
\writerfilter\source\DomainMapper.cxx
\writerfilter\source\DomainMapper_Impl.cxx
\writerfilter\source\DomainMapper_Impl.hxx
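The role of a stream handler that receives parsed content and builds up the document can be sketched as follows. This is only an illustration under simplifying assumptions: SimpleMapper and Paragraph are hypothetical types, the four event methods are invented for this example, and the real DomainMapper interface and core model are considerably richer.

```cpp
#include <cassert>
#include <string>
#include <vector>

// Hypothetical stand-in for one paragraph in the target document model.
struct Paragraph {
    std::string text;
    std::string styleName;
};

// Simplified stand-in for a content handler: it receives events from the
// token parser and decides how the content ends up in the document.
class SimpleMapper {
public:
    void startParagraph() {
        assert(!mOpen);  // nested paragraphs are not modelled in this sketch
        mCurrent = Paragraph{};
        mOpen = true;
    }
    // Applies a property to the paragraph being built; unknown names are ignored.
    void property(const std::string& name, const std::string& value) {
        if (mOpen && name == "style")
            mCurrent.styleName = value;
    }
    // Appends a chunk of text; the parser may deliver text in several pieces.
    void text(const std::string& chunk) {
        if (mOpen)
            mCurrent.text += chunk;
    }
    // Inserts the finished paragraph into the document model.
    void endParagraph() {
        if (mOpen) {
            mDocument.push_back(mCurrent);
            mOpen = false;
        }
    }
    const std::vector<Paragraph>& document() const { return mDocument; }

private:
    std::vector<Paragraph> mDocument;
    Paragraph mCurrent;
    bool mOpen = false;
};
```

A caller would invoke startParagraph(), any number of property() and text() calls, and then endParagraph(); the paragraph only becomes part of the document at the final call, which reflects the idea that the mapper, not the parser, decides when and how content is inserted.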
OOXML file import has several weak areas:
- Unsupported objects
- Features supported with limitations