Adding new Apache Tika Plugins to Alfresco (Part 3)

October 27th, 2010 by nickb

In Part 1 we saw what Apache Tika is and does, and in Part 2 we saw what it has brought to Alfresco. Now it’s time to look at adding new Tika Parsers, to support new file formats.

Firstly, why might you want to add a new parser? The most common reason is licensing – all the parsers that ship as standard with Apache Tika are Apache Licensed or similar, along with their dependencies, and so can be freely distributed and included in other projects. However, some file formats only have libraries that are available under GPL or proprietary licenses, and so these can’t be included in the standard Tika distribution.

There is a list of available 3rd party parsers on the Tika 3rd Party Plugins wiki page, currently made up of GPL licensed parsers + dependencies. If your format isn’t listed there, and you want to add it to Tika within Alfresco, then what to do?

Firstly, you need to write / acquire a Tika Parser. Writing a Tika Parser is quite easy, as the 5 minute parser guide explains. There are basically two methods to implement:

Set<MediaType> getSupportedTypes(ParseContext context);
void parse(InputStream stream, ContentHandler handler, Metadata metadata,
           ParseContext context) throws IOException, SAXException, TikaException;

The first allows you to indicate the file types your parser can handle. This is needed when registering the parser with the AutoDetectParser and similar, but isn’t needed if you select the parser explicitly. The second method is the one where you do the real work of outputting the contents and populating the metadata object.

To see this in action, let’s take a look at a simple “Hello World” Tika Parser:

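package example;

import java.io.InputStream;
import java.util.HashSet;
import java.util.Set;

import org.apache.tika.metadata.Metadata;
import org.apache.tika.mime.MediaType;
import org.apache.tika.parser.ParseContext;
import org.apache.tika.parser.Parser;
import org.apache.tika.sax.XHTMLContentHandler;
import org.xml.sax.ContentHandler;
import org.xml.sax.SAXException;

public class HelloWorldParser implements Parser {
  // The (fake) mime type that this parser claims to handle
  public Set<MediaType> getSupportedTypes(ParseContext context) {
    Set<MediaType> types = new HashSet<MediaType>();
    types.add(MediaType.parse("hello/world"));
    return types;
  }

  // Ignores the input stream, and always outputs the same heading and metadata
  public void parse(InputStream stream, ContentHandler handler,
         Metadata metadata, ParseContext context) throws SAXException {
    XHTMLContentHandler xhtml = new XHTMLContentHandler(handler, metadata);
    xhtml.startDocument();
    xhtml.startElement("h1");
    xhtml.characters("Hello, World!");
    xhtml.endElement("h1");
    xhtml.endDocument();

    metadata.set("hello","world");
    metadata.set("title","Hello World!");
    metadata.set("custom1","Hello, Custom Metadata 1!");
    metadata.set("custom2","Hello, Custom Metadata 2!");
  }
}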

Before we can use this in Alfresco, we need to compile this against tika-core.jar (note – you may need to implement the parse method without a ParseContext object if you’re using an older version of Tika), and then wrap our class file up in a jar. Once our jar is deployed into our application container (eg the shared lib of Tomcat), we’re ready to configure it.
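As an aside, the output half of the parse method above boils down to firing SAX events at whatever handler it was given. The sketch below shows that idea using only the SAX classes that ship with the JDK, so it can be run without Tika on the classpath. The class and method names are made up for illustration, and it is a simplification: Tika’s XHTMLContentHandler does more than this, including wrapping the output in the surrounding XHTML document structure.

```java
import org.xml.sax.Attributes;
import org.xml.sax.ContentHandler;
import org.xml.sax.SAXException;
import org.xml.sax.helpers.AttributesImpl;
import org.xml.sax.helpers.DefaultHandler;

// JDK-only stand-in for the output half of parse(); no Tika classes are used.
final class SaxEventsSketch {
    static final String XHTML_NS = "http://www.w3.org/1999/xhtml";

    // Fires the heading events by hand, much as the XHTML helper does for us.
    static void emitHello(ContentHandler handler) throws SAXException {
        char[] text = "Hello, World!".toCharArray();
        handler.startDocument();
        handler.startElement(XHTML_NS, "h1", "h1", new AttributesImpl());
        handler.characters(text, 0, text.length);
        handler.endElement(XHTML_NS, "h1", "h1");
        handler.endDocument();
    }

    // Runs emitHello against a handler that writes one line per event received.
    static String trace() {
        final StringBuilder log = new StringBuilder();
        DefaultHandler recorder = new DefaultHandler() {
            @Override
            public void startElement(String uri, String localName, String qName, Attributes atts) {
                log.append("start ").append(localName).append('\n');
            }

            @Override
            public void characters(char[] ch, int start, int length) {
                log.append("text ").append(ch, start, length).append('\n');
            }

            @Override
            public void endElement(String uri, String localName, String qName) {
                log.append("end ").append(localName).append('\n');
            }
        };
        try {
            emitHello(recorder);
        } catch (SAXException e) {
            throw new IllegalStateException(e);
        }
        return log.toString();
    }

    public static void main(String[] args) {
        System.out.print(trace());
    }
}
```

Whoever supplies the handler decides what becomes of the events, which is why the same parser can feed both text extraction and HTML output.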

For 3rd party parsers which provide the Tika service metadata files, if we don’t want to control the registration in Alfresco then we can simply allow the default Tika-Auto metadata and transformer classes to handle it. In our case, we want to register it explicitly. To do that, we’ll create a new extension spring context file, and populate it:

<?xml version='1.0' encoding='UTF-8'?>
<!DOCTYPE beans PUBLIC '-//SPRING//DTD BEAN//EN' 'http://www.springframework.org/dtd/spring-beans.dtd'>
<beans>
    <bean id="extracter.MyCustomTika"
          class="org.alfresco.repo.content.metadata.TikaSpringConfiguredMetadataExtracter"
          parent="baseMetadataExtracter" >
        <!-- This is the name of our example parser compiled above -->
        <property name="tikaParserName">
           <value>example.HelloWorldParser</value>
        </property>

        <!-- Use the default mappings from TikaSpringConfiguredMetadataExtracter.properties -->
        <property name="inheritDefaultMapping">
            <value>true</value>
        </property>
        <!-- Map our extra keys to the content model -->
        <property name="mappingProperties">
            <props>
                <prop key="namespace.prefix.cm">http://www.alfresco.org/model/content/1.0</prop>
                <prop key="custom1">cm:description</prop>
                <prop key="custom2">cm:author</prop>
            </props>
        </property>
    </bean>

    <bean id="transformer.MyCustomTika"
          class="org.alfresco.repo.content.transform.TikaSpringConfiguredContentTransformer"
          parent="baseContentTransformer">
        <!-- Same as above -->
        <property name="tikaParserName">
           <value>example.HelloWorldParser</value>
        </property>
    </bean>
</beans>

To test this, we’ll need a node with the special fake mimetype of “hello/world”, which is what our Tika Parser is configured to handle. We can do that with a snippet of JavaScript like this:

var doc = userhome.createFile("hello.world");
doc.content = "This text will largely be ignored";
doc.mimetype = "hello/world";

If we run the above JavaScript, we’ll get a node called “hello.world”. If we then run the “extract common metadata fields” action on it, we’ll see the metadata properties showing through. If we transform it to text/html, we’ll see a heading of “Hello, World!”. Thus we have verified that our custom Tika parser has been wired into Alfresco, is available for text transformation, and can do metadata extraction including custom keys.

More information on Tika and Alfresco is available on the Alfresco wiki.

Spring in a QName

September 28th, 2010 by nickb

Within Alfresco, we make a lot of use of Qualified Names (QNames) for addressing and naming things. Generally, when configuring Alfresco through Spring or properties files, we can use the short form, eg

<bean id="coreBean" class="org.alfresco.some.thing.core">
  <property name="typeQName">
    <value>cm:description</value>
  </property>
</bean>

Within the bean, the NamespaceResolver is used to turn the friendly, short form (eg cm:description) into the full form (eg {http://www.alfresco.org/model/content/1.0}description).
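To make the short form to full form step concrete, here is a minimal sketch in plain Java. It is not Alfresco’s NamespaceResolver, just a stand-in with a hard-coded prefix table; the class and method names are invented for the example.

```java
import java.util.HashMap;
import java.util.Map;

// Illustrative stand-in for prefix resolution; not Alfresco's implementation.
final class ShortQNameSketch {
    private static final Map<String, String> PREFIXES = new HashMap<>();
    static {
        PREFIXES.put("cm", "http://www.alfresco.org/model/content/1.0");
    }

    // Turns "prefix:localName" into "{namespaceURI}localName".
    static String expand(String shortForm) {
        int colon = shortForm.indexOf(':');
        if (colon < 0) {
            throw new IllegalArgumentException("No prefix in " + shortForm);
        }
        String prefix = shortForm.substring(0, colon);
        String uri = PREFIXES.get(prefix);
        if (uri == null) {
            throw new IllegalArgumentException("Unknown prefix " + prefix);
        }
        return "{" + uri + "}" + shortForm.substring(colon + 1);
    }

    public static void main(String[] args) {
        System.out.println(expand("cm:description"));
    }
}
```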

However, every so often you may find yourself trying to configure something with Spring that no-one ever expected you to be trying to do… In this situation, the string form isn’t accepted by the class, and only a real QName object may be sprung in.

As it turns out, creating a real QName object from within Spring isn’t actually too hard to do. So, in case you ever find yourself needing to do it, the definition will look something like this:

<bean id="coreBean" class="org.alfresco.some.thing.core">
  <property name="typeQName">
    <bean class="org.alfresco.service.namespace.QName"
            factory-method="createQName">
      <constructor-arg value="http://www.alfresco.org/model/content/1.0" />
      <constructor-arg value="description"/>
    </bean>
  </property>
</bean>

Apache Tika powered updates to Alfresco (Part 2)

September 24th, 2010 by nickb

In Part 1, we learnt a little about what Apache Tika is and does. In this part, we’ll see what new features using Tika gives us in Cheetah.

On the metadata side, Tika delivers three important things for us. These are support for a wider range of formats (see below), enhanced ease of adding custom parsers (you can just spring in a bean with the class name of your parser and you’re done), and consistent metadata.

This last one is less of an issue for Alfresco users than for many other users, but is a real issue for extractor developers. Within Alfresco, we always map the raw properties onto ones in the content model, but this is handled at the extractor level. As such, it shouldn’t matter to the user if one document format has a “last author”, another “last editor” and a third “last edited by”, they’ll all turn up in the same property in Alfresco. However, the extractor writer has to know about this, to provide the mapping, and this makes writing an extractor harder, and increases the chance of error.
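To give a feel for that burden, here is a minimal sketch of the kind of lookup table an extractor writer would end up maintaining by hand. The class name, and the choice of cm:author as the target property, are assumptions made purely for illustration; this is not code from any Alfresco extractor.

```java
import java.util.HashMap;
import java.util.Map;

// Hand-maintained mapping from format-specific keys to one model property.
final class LastAuthorMappingSketch {
    private static final Map<String, String> RAW_TO_MODEL = new HashMap<>();
    static {
        // Three formats, three names, one destination (target chosen for illustration)
        RAW_TO_MODEL.put("last author", "cm:author");
        RAW_TO_MODEL.put("last editor", "cm:author");
        RAW_TO_MODEL.put("last edited by", "cm:author");
    }

    // Copies across only the raw keys that we know how to map; the rest are dropped.
    static Map<String, String> toModel(Map<String, String> raw) {
        Map<String, String> mapped = new HashMap<>();
        for (Map.Entry<String, String> entry : raw.entrySet()) {
            String target = RAW_TO_MODEL.get(entry.getKey());
            if (target != null) {
                mapped.put(target, entry.getValue());
            }
        }
        return mapped;
    }

    public static void main(String[] args) {
        Map<String, String> raw = new HashMap<>();
        raw.put("last edited by", "nickb");
        System.out.println(toModel(raw));
    }
}
```

Every new format means another row in the table, and a missed row means silently lost metadata.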

Within Tika, there is a set list of common metadata keys, and each Tika parser internally maps its properties onto these. As such, when you receive your metadata back from Tika, it all looks the same no matter what file you got it from. If the metadata is a date, then Tika will also take care of converting it to a common format, so you don’t have to worry about parsing a dozen different date representations.
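As a rough illustration of the date handling this saves us from, the sketch below tries a handful of formats in turn and hands back a single ISO 8601 form. The formats listed and the class name are assumptions picked for the example, and it is not how Tika itself performs the conversion.

```java
import java.time.LocalDateTime;
import java.time.format.DateTimeFormatter;
import java.time.format.DateTimeParseException;
import java.util.Arrays;
import java.util.List;
import java.util.Locale;

// Tries each known format in turn and returns one common representation.
final class DateNormaliserSketch {
    private static final List<DateTimeFormatter> KNOWN_FORMATS = Arrays.asList(
        DateTimeFormatter.ISO_LOCAL_DATE_TIME,
        DateTimeFormatter.ofPattern("dd/MM/yyyy HH:mm:ss", Locale.ENGLISH),
        DateTimeFormatter.ofPattern("EEE MMM d HH:mm:ss yyyy", Locale.ENGLISH));

    static String normalise(String raw) {
        for (DateTimeFormatter format : KNOWN_FORMATS) {
            try {
                LocalDateTime parsed = LocalDateTime.parse(raw, format);
                return parsed.format(DateTimeFormatter.ISO_LOCAL_DATE_TIME);
            } catch (DateTimeParseException ignored) {
                // Not this format, so fall through and try the next one
            }
        }
        throw new IllegalArgumentException("Unrecognised date: " + raw);
    }

    public static void main(String[] args) {
        System.out.println(normalise("27/10/2010 09:30:00"));
        System.out.println(normalise("2010-09-24T12:00:00"));
    }
}
```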

Finally, because the metadata is in a common format, we can more easily map it to the content model. Thus, in Alfresco 3.4, we see most of the common extractors have a wider range of metadata mappings to the content model as standard. One big example of this is in the case of images – EXIF tags are now automatically extracted and mapped onto the content model, and if the image was geotagged, then the location of the image is also mapped onto the content model.

Give it a try – upload a geotagged image to Alfresco Share in 3.4, and see all the new metadata that shows up such as the location, camera, focal length and more!

In the past, most text extractors that were used in Alfresco were only able to produce plain text. However, all the Tika parsers generate XHTML SAX events, and so are able to produce not only plain text, but also HTML and XHTML. Also, since the Tika XHTML is a true XML document, we can make use of XSLT to chain transformations.
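Because the output is well-formed XML, an onward XSLT step needs nothing beyond the JDK. The sketch below runs a tiny stylesheet over a hand-written XHTML string and pulls out the headings. The stylesheet, the sample markup and the class name are all assumptions for illustration; in practice the input would be whatever XHTML a Tika parser produced.

```java
import java.io.StringReader;
import java.io.StringWriter;
import javax.xml.transform.Transformer;
import javax.xml.transform.TransformerException;
import javax.xml.transform.TransformerFactory;
import javax.xml.transform.stream.StreamResult;
import javax.xml.transform.stream.StreamSource;

// Applies a small XSLT stylesheet to an XHTML document held in a String.
final class XhtmlXsltSketch {
    // Outputs the text of every h1 element, one per line.
    static final String STYLESHEET =
          "<xsl:stylesheet version='1.0'"
        + " xmlns:xsl='http://www.w3.org/1999/XSL/Transform'"
        + " xmlns:h='http://www.w3.org/1999/xhtml'>"
        + "<xsl:output method='text'/>"
        + "<xsl:template match='/'>"
        + "<xsl:for-each select='//h:h1'>"
        + "<xsl:value-of select='.'/><xsl:text>&#10;</xsl:text>"
        + "</xsl:for-each>"
        + "</xsl:template>"
        + "</xsl:stylesheet>";

    static String headings(String xhtml) {
        try {
            Transformer transformer = TransformerFactory.newInstance()
                .newTransformer(new StreamSource(new StringReader(STYLESHEET)));
            StringWriter out = new StringWriter();
            transformer.transform(new StreamSource(new StringReader(xhtml)),
                                  new StreamResult(out));
            return out.toString();
        } catch (TransformerException e) {
            throw new IllegalStateException(e);
        }
    }

    public static void main(String[] args) {
        String xhtml = "<html xmlns='http://www.w3.org/1999/xhtml'>"
            + "<head><title>Sample</title></head>"
            + "<body><h1>Hello, World!</h1><p>Body text</p></body></html>";
        System.out.print(headings(xhtml));
    }
}
```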

The immediate benefit then is that all plain text content transformers that are powered by Tika can deliver an HTML version at no effort. Thus, HTML versions of PDFs, Word Documents etc can now be requested.

(At the moment, the HTML generated is very clean, but not always all that complex. The Tika community is gradually improving the markup generated to include more meaning, especially semantic information, and Alfresco is pleased to be involved in this effort)

Does being able to generate XHTML help that much? I’d say yes! With the forthcoming WCM Quick Start, we’ll shortly be adding some features around HTML versions of some kinds of uploaded documents. Using Tika, we were able to implement this feature very quickly, allowing us to concentrate the developer time on enhancing Tika. Next up, for some cases we wanted a whole XHTML document, and for others we only wanted the body content. Using Tika and the SAX handlers, it’s a one-line change to toggle between the whole document, or just the body contents, by picking a different transform handler. Finally, the output is XHTML, so for demos we’ve been able to use XSLT and E4X (from within a script action) to effortlessly manipulate the content.
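The whole document versus body only toggle is easy to picture with plain SAX. In the sketch below a boolean flag stands in for the choice of handler: the same stream of events goes in, and the handler decides whether to keep what falls outside the body element. The class names and sample markup are assumptions for illustration, and this is not the Tika or Alfresco handler code.

```java
import java.io.IOException;
import java.io.StringReader;
import javax.xml.parsers.ParserConfigurationException;
import javax.xml.parsers.SAXParserFactory;
import org.xml.sax.Attributes;
import org.xml.sax.InputSource;
import org.xml.sax.SAXException;
import org.xml.sax.helpers.DefaultHandler;

final class BodyOnlySketch {
    // Collects character data; when bodyOnly is set, text outside <body> is ignored.
    static final class TextCollector extends DefaultHandler {
        private final boolean bodyOnly;
        private final StringBuilder text = new StringBuilder();
        private int bodyDepth = 0;

        TextCollector(boolean bodyOnly) {
            this.bodyOnly = bodyOnly;
        }

        @Override
        public void startElement(String uri, String localName, String qName, Attributes atts) {
            if ("body".equals(localName)) {
                bodyDepth++;
            }
        }

        @Override
        public void endElement(String uri, String localName, String qName) {
            if ("body".equals(localName)) {
                bodyDepth--;
            }
        }

        @Override
        public void characters(char[] ch, int start, int length) {
            if (!bodyOnly || bodyDepth > 0) {
                text.append(ch, start, length);
            }
        }

        String text() {
            return text.toString();
        }
    }

    // Parses the XHTML and returns either all of the text, or just the body text.
    static String extract(String xhtml, boolean bodyOnly) {
        try {
            SAXParserFactory factory = SAXParserFactory.newInstance();
            factory.setNamespaceAware(true);
            TextCollector collector = new TextCollector(bodyOnly);
            factory.newSAXParser().parse(new InputSource(new StringReader(xhtml)), collector);
            return collector.text();
        } catch (ParserConfigurationException | SAXException | IOException e) {
            throw new IllegalStateException(e);
        }
    }

    public static void main(String[] args) {
        String xhtml = "<html xmlns='http://www.w3.org/1999/xhtml'>"
            + "<head><title>Title</title></head><body><p>Body</p></body></html>";
        System.out.println(extract(xhtml, false));
        System.out.println(extract(xhtml, true));
    }
}
```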

Finally, as mentioned, using Tika delivers us support for a large number of new file formats. The current list of files supported via Tika is:

  • Audio – wav, riff, midi
  • DWG (CAD files)
  • Epub
  • RSS and ATOM feeds
  • True Type Fonts
  • HTML
  • Image – JPEG, PNG, Gif, TIFF and Bitmap (including EXIF information where found)
  • iWork (Keynote, Pages etc)
  • Mbox mail
  • Microsoft Office – Word, PowerPoint, Excel, Visio, Outlook, Publisher, Works
  • Microsoft Office OOXML – Word (docx), PowerPoint (pptx), Excel (xlsx)
  • MP3 (ID3 v1 and v2)
  • CDF (scientific data)
  • Open Document Format
  • PDF
  • Zip and Tar archives
  • RDF
  • Plain Text
  • FLV Video
  • XML
  • Java class files

What’s more, generally it’s just a case of dropping new Tika jars into Alfresco with little/no configuration changes, so we can look forward to easy addition of new formats with each new Alfresco release as the Tika support grows!

In Part 3, we will look at mapping between Tika’s common metadata, and the Alfresco content model.

More information on Tika and Alfresco is available on the Alfresco Wiki. Tika will also be discussed at the Alfresco Developer Conferences in Paris and New York later this year.

Apache Tika and Alfresco – Part 1

September 24th, 2010 by nickb

For the forthcoming Project Cheetah release, there have been a number of improvements to Metadata Extraction and Content Transformations. These improvements have been delivered by using Apache Tika to power many of the standard extractors and transformers.

In this series of blog posts, we’ll be looking at what Apache Tika is and what it does, how it fits into Alfresco, what new features it has delivered, how you can customise how Tika works, and how you can add new Tika parsers to easily support new formats.

The idea for Apache Tika was hatched in 2006, largely from people involved in Apache Lucene, who were struggling to sensibly index all of their documents. The project went through the Apache Incubator, and after a period of time as a Lucene sub-project, in 2010 became its own top-level Apache project. Tika is used by people indexing content, spidering the web, doing NLP and text processing, as well as with content repositories.

For all these use cases, the problems are largely the same. You start with a number of documents in a variety of formats. You wish to know what they are, and hence which libraries may be useful in processing them. You then want to get some consistent metadata out of them, and possibly a rich textual representation of the content. You also probably wanted all of this yesterday!

(As a side note, Alfresco users have historically been in a more fortunate position than most when faced with these challenges, as the Metadata Extractor and Content Transformation services have handled most of these for you.)

What services does Tika provide then?

Firstly, Tika offers content and language detection. Through this, you can pass Tika a piece of unknown content, and get back information on what kind of file it is (eg pdf, docx), along with what language the text is written in (eg utf-8 english). Within Alfresco we tend to already know this information, so as yet don’t make much use of detection.
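One common ingredient of type detection is looking at the “magic” bytes at the start of a file. The toy sketch below checks just two well-known signatures in plain Java. The class name is made up, and it is nowhere near what Tika really does; for instance, a docx file is a zip container, so a check this shallow would only ever report it as a zip.

```java
import java.nio.charset.StandardCharsets;

// Toy magic-byte sniffer covering two signatures; real detection does far more.
final class MagicSniffSketch {
    private static final byte[] PDF_MAGIC = "%PDF-".getBytes(StandardCharsets.US_ASCII);
    private static final byte[] ZIP_MAGIC = new byte[] {'P', 'K', 3, 4};

    // Returns a mime type guess based only on the first few bytes.
    static String sniff(byte[] head) {
        if (startsWith(head, PDF_MAGIC)) {
            return "application/pdf";
        }
        if (startsWith(head, ZIP_MAGIC)) {
            return "application/zip";
        }
        return "application/octet-stream";
    }

    private static boolean startsWith(byte[] data, byte[] prefix) {
        if (data.length < prefix.length) {
            return false;
        }
        for (int i = 0; i < prefix.length; i++) {
            if (data[i] != prefix[i]) {
                return false;
            }
        }
        return true;
    }

    public static void main(String[] args) {
        System.out.println(sniff("%PDF-1.4".getBytes(StandardCharsets.US_ASCII)));
    }
}
```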

Secondly, through the parser system, Tika provides access to the metadata of the document. You can use Tika to find out the last author of a Word file, the title of an HTML page, or even the location where a geo-tagged image was taken. In addition, Tika provides a consistent view across the different formats’ metadata, mapping internally from document specific to general metadata entries. As such, you don’t need to know if a format uses “last author”, “last editor” or “last edited by”, Tika instead always provides the same information. We’ll see more on using Tika for metadata in Part 2.

Thirdly, through the parsers, Tika provides access to the textual content of files. The text is available as plain text, HTML and XHTML, with the latter offering options for onward transformations through SAX and XSLT to additional representations. This can be used for full text indexing, for web previews, and much more. Again, in Part 2 we’ll see how this is being used in Alfresco.

Finally, Tika provides access to the embedded resources within files. This could be 2 images embedded in a Word document, or an Excel spreadsheet held within a PowerPoint file, or even half a dozen PDFs contained within a zip file. This is quite a new Tika feature, and we’ll hopefully be making more use of it in the future. For now, it offers the adventurous a consistent way to get at resources inside other files.
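For the zip case, the underlying idea can be sketched with nothing but the JDK: walk the container and visit each entry in turn. The class and method names below are invented for the example, and this is not Tika’s embedded resource API, which aims to offer the same kind of walk across many container formats rather than just zip.

```java
import java.io.ByteArrayInputStream;
import java.io.ByteArrayOutputStream;
import java.io.IOException;
import java.io.UncheckedIOException;
import java.nio.charset.StandardCharsets;
import java.util.ArrayList;
import java.util.List;
import java.util.zip.ZipEntry;
import java.util.zip.ZipInputStream;
import java.util.zip.ZipOutputStream;

final class EmbeddedEntriesSketch {
    // Builds a small zip in memory, so the example needs no files on disk.
    static byte[] sampleZip(String... names) {
        ByteArrayOutputStream bytes = new ByteArrayOutputStream();
        try (ZipOutputStream zip = new ZipOutputStream(bytes)) {
            for (String name : names) {
                zip.putNextEntry(new ZipEntry(name));
                zip.write(("contents of " + name).getBytes(StandardCharsets.UTF_8));
                zip.closeEntry();
            }
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
        return bytes.toByteArray();
    }

    // Walks the container and returns the name of each embedded entry, in order.
    static List<String> listEntries(byte[] zipBytes) {
        List<String> names = new ArrayList<>();
        try (ZipInputStream zip = new ZipInputStream(new ByteArrayInputStream(zipBytes))) {
            for (ZipEntry entry = zip.getNextEntry(); entry != null; entry = zip.getNextEntry()) {
                names.add(entry.getName());
            }
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
        return names;
    }

    public static void main(String[] args) {
        System.out.println(listEntries(sampleZip("one.pdf", "two.pdf")));
    }
}
```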

In Part 2, we’ll look at the new features and support that Tika delivers to Cheetah.

More information on Tika and Alfresco is available on the Alfresco Wiki. Tika will also be discussed at the Alfresco Developer Conferences in Paris and New York later this year.

