XML Filter Detection

From Apache OpenOffice Wiki

XML files can conform to many different DTD specifications, so a single filter and file type definition cannot handle every available format. To let Apache OpenOffice handle multiple filter definitions and implementations, an additional filter detection module is needed that determines the type of the XML file being read from its DocType declaration.

To accomplish this, you can implement a filter detection service, com.sun.star.document.ExtendedTypeDetection, that distinguishes between many different XML-based file formats. This service supersedes basic flat detection, which determines the type from the file's suffix alone. Instead, it performs deep detection, which examines the file's internal structure and content to determine its true type.
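
The limitation of flat detection can be illustrated with a small standalone sketch. It uses plain standard C++ rather than the UNO API, and suffixOf is a hypothetical helper introduced only for this illustration. Because it inspects nothing but the file name, a DocBook document and any other XML document that both end in .xml look identical to it:

```cpp
#include <cassert>
#include <string>

// Flat detection in miniature: only the suffix of the file name is inspected.
// Returns the text after the last '.', or an empty string if there is none.
std::string suffixOf(const std::string& fileName)
{
    assert(!fileName.empty());
    const std::string::size_type dot = fileName.rfind('.');
    return dot == std::string::npos ? std::string() : fileName.substr(dot + 1);
}
```

Calling suffixOf on two different XML formats yields the same suffix for both, which is why the file content has to be examined as well.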

Requirements for Deep Detection

Implementing a deep detection module that can identify one or more unique XML types has three requirements:

  • An extended type definition for describing the format in more detail (TypeDetection.xcu).
  • A DetectService implementation.
  • A DetectService definition (TypeDetection.xcu).

Extending the File Type Definition

Because different XML files can conform to different DTDs, the type definition of a particular XML file needs to be extended. To do this, some or all of the DocType information is included in the file type definition, as part of the ClipboardFormat property of the type node. A unique prefix (doctype: in the sample below) identifies the string at that position in the sequence as a DocType declaration.

Sample Type definition:

  <node oor:name="writer_DocBook_File" oor:op="replace">
      <prop oor:name="UIName">
          <value xml:lang="en-US">DocBook</value>
      </prop>
      <prop oor:name="Data">
          <value>0,,doctype:-//OASIS//DTD DocBook XML V4.1.2//EN,,XML,20002,</value>
      </prop>
  </node>
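
As an illustration of how the DocType information might be recovered from such a Data value, here is a minimal standalone sketch in standard C++ (not the UNO string classes). The helper name extractDocType is hypothetical, and the sketch assumes that the DocType identifier itself contains no commas:

```cpp
#include <cassert>
#include <sstream>
#include <string>

// Hypothetical helper: returns the DocType identifier stored in a
// comma-separated Data string, or an empty string if no field carries
// the given prefix.
std::string extractDocType(const std::string& data,
                           const std::string& prefix = "doctype:")
{
    assert(!prefix.empty());
    std::istringstream fields(data);
    std::string field;
    while (std::getline(fields, field, ',')) {
        // A field that starts with the prefix holds the DocType declaration.
        if (field.compare(0, prefix.size(), prefix) == 0) {
            return field.substr(prefix.size());
        }
    }
    return std::string();
}
```

Applied to the Data value of the sample above, the helper returns the public identifier of the DocBook DTD. A definition without a doctype: field yields an empty string.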

The ExtendedTypeDetection Service Implementation

In order for the type detection code to function as an ExtendedTypeDetection service, you must implement the detect() method as defined by the com.sun.star.document.XExtendedFilterDetection interface definition:

  string detect( [inout]sequence<com::sun::star::beans::PropertyValue > Descriptor );

This method supplies a sequence of PropertyValues from which you can extract the current TypeName and the URL of the file being loaded:

  ::rtl::OUString SAL_CALL FilterDetect::detect(
          com::sun::star::uno::Sequence< com::sun::star::beans::PropertyValue >& aArguments )
          throw (com::sun::star::uno::RuntimeException)
  {
      ::rtl::OUString sTypeName;   // type name supplied in the descriptor
      ::rtl::OUString sUrl;        // location of the file being loaded
      const PropertyValue * pValue = aArguments.getConstArray();
      sal_Int32 nLength = aArguments.getLength();
      for (sal_Int32 i = 0; i < nLength; i++) {
          if (pValue[i].Name.equalsAsciiL(RTL_CONSTASCII_STRINGPARAM("TypeName"))) {
              pValue[i].Value >>= sTypeName;
          }
          else if (pValue[i].Name.equalsAsciiL(RTL_CONSTASCII_STRINGPARAM("URL"))) {
              pValue[i].Value >>= sUrl;
          }
      }
      // the body of detect() continues with the fragments in the following steps

Once you have the URL of the file, you can then use it to create a ::ucb::Content from which you can open an XInputStream to the file:

  Reference< com::sun::star::ucb::XCommandEnvironment > xEnv;
  ::ucb::Content aContent(sUrl, xEnv);
  Reference< com::sun::star::io::XInputStream > xInStream = aContent.openStream();

You can now use this XInputStream to read the header of the file being loaded. Because the exact location of the DocType information within the file is not known, the first 1000 bytes are read:

  ::rtl::OString resultString;
  com::sun::star::uno::Sequence< sal_Int8 > aData;
  // readBytes returns the number of bytes actually read, which may be fewer than requested
  sal_Int32 nBytesRead = xInStream->readBytes(aData, 1000);
  resultString = ::rtl::OString( (const sal_Char *)aData.getConstArray(), nBytesRead );
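
The same bounded read can be tried outside the office with standard C++ streams. The following is only a sketch under that assumption: readHeader is a hypothetical stand-in for the XInputStream::readBytes call, and std::istream takes the place of the UNO stream. As in the fragment above, only the bytes actually read are kept, so a file shorter than 1000 bytes does not leave padding in the header string:

```cpp
#include <cassert>
#include <cstddef>
#include <istream>
#include <sstream>   // std::istringstream, handy for trying the helper on in-memory data
#include <string>

// Hypothetical stand-in for the readBytes step: reads at most maxBytes from
// a standard stream and returns only the bytes that were actually read.
std::string readHeader(std::istream& in, std::size_t maxBytes = 1000)
{
    assert(maxBytes > 0);
    std::string buffer(maxBytes, '\0');
    in.read(&buffer[0], static_cast<std::streamsize>(maxBytes));
    // gcount() reports how many bytes were read, which may be fewer than
    // requested when the input is shorter than maxBytes.
    buffer.resize(static_cast<std::size_t>(in.gcount()));
    return buffer;
}
```

Wrapping a short string in a std::istringstream and passing it to readHeader returns the whole string, while a longer input is cut off after maxBytes.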

Once you have this information, you can start looking for a type that describes the file being loaded. In order to do this, you need to get a list of the types currently supported:

  Reference< XNameAccess > xTypeCont(mxMSF->createInstance(OUString::createFromAscii(
                                  "com.sun.star.document.TypeDetection" )), UNO_QUERY);
  // the space after '<' keeps older compilers from reading "<:" as a digraph
  Sequence< ::rtl::OUString > myTypes = xTypeCont->getElementNames();
  nLength = myTypes.getLength();

For each of these types, you must first determine whether the ClipboardFormat property contains a DocType:

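  Sequence< ::rtl::OUString > ClipboardFormatSeq;
  Type_Props[Loc_of_ClipboardFormat].Value >>= ClipboardFormatSeq;
  const ::rtl::OUString sDocTypePrefix( OUString::createFromAscii("doctype:") );
  for (sal_Int32 k = 0; k < ClipboardFormatSeq.getLength(); k++) {
          if (ClipboardFormatSeq[k].match(sDocTypePrefix)) {
                      // if it contains a DocType, start to compare to header
          }
  }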
Each DocType declaration found in the type definitions is compared with the file header. If a match is found, the name of the corresponding type is returned. If no match is found, an empty string is returned, which forces Apache OpenOffice to fall back to flat detection.
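
The matching step can be sketched in standalone standard C++ as follows. This is an illustration rather than the office implementation: DocTypeTable and detectType are hypothetical names, and the table of type names and DocType identifiers is assumed to have been collected from the type definitions beforehand.

```cpp
#include <cassert>
#include <string>
#include <utility>
#include <vector>

// Each entry pairs a type name with the DocType identifier taken from its
// type definition (a hypothetical representation of the registered types).
typedef std::vector<std::pair<std::string, std::string> > DocTypeTable;

// Returns the name of the first type whose DocType identifier occurs in the
// header, or an empty string so that the caller falls back to flat detection.
std::string detectType(const std::string& header, const DocTypeTable& types)
{
    for (DocTypeTable::const_iterator it = types.begin(); it != types.end(); ++it) {
        assert(!it->first.empty());   // every registered type has a name
        // Skip types without a DocType: an empty needle would match any header.
        if (!it->second.empty() && header.find(it->second) != std::string::npos) {
            return it->first;
        }
    }
    return std::string();
}
```

A plain substring search is used here for brevity. A stricter check could first locate the DOCTYPE declaration in the header and compare only its public identifier.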

TypeDetection.xcu DetectServices Entry

Now that you have created the ExtendedTypeDetection service implementation, you need to tell Apache OpenOffice when to use this service.

First create a DetectServices node, unless one already exists, and then add the information specific to the detection service that has been implemented, that is, the name of the service and the file types that use it.

  <node oor:name="DetectServices">
      <node oor:name="com.sun.star.comp.filters.XMLDetect" oor:op="replace">
          <prop oor:name="ServiceName">
              <value xml:lang="en-US">com.sun.star.comp.filters.XMLDetect</value>
          </prop>
          <prop oor:name="Types">
              <value>writer_DocBook_File</value>
          </prop>
      </node>
  </node>

Content on this page is licensed under the Public Documentation License (PDL).