XML Metadata Extraction for WCM
Monday, December 1st, 2008
While the XSDs in the WCM (AVM) are the equivalent of content models in DM, there is no effective way to search them. More specifically, it’s useful to be able to search on specific metadata elements in the generated XML files, something you need to do frequently on highly dynamic sites. In this post, we’ll discuss the sample I created for this, which is in the content community here (registration required).
In this example, I have a WCM content type defined through the XSD press_release.xsd. I want to extract some metadata from it, for example the expiration date of the press release. The extraction process works similarly to the way things function on the DM side, with some gotchas. When extracting metadata, you need somewhere (properties) to store it.
For WCM (where all the XML nodes are stored as the wcm:avmplaincontent type), the way to do this is to create an aspect with the properties that you want to extract. This is important: it has to be an aspect. This aspect will automatically get applied appropriately, as we’ll see later. In the included example, I have an XSD that creates a press release. I want to extract and index three properties: abstract (string), expiration date (date), and numtimes (int). Here is the code (I removed all the indexing properties for simplicity):
<aspects>
  <aspect name="my:press_release_metadata">
    <title>Sample Aspect for WCM – Press Release</title>
    <properties>
      <property name="my:abstract">
        <type>d:text</type>
      </property>
      <property name="my:expiration_date">
        <type>d:datetime</type>
      </property>
      <property name="my:numtimes">
        <type>d:int</type>
      </property>
    </properties>
  </aspect>
</aspects>
You’ll also need to expose your aspect properties in the UI through web-client-config-custom.xml:
<config evaluator="aspect-name" condition="my:press_release_metadata">
  <property-sheet>
    <show-property name="my:abstract" />
    <show-property name="my:expiration_date" />
    <show-property name="my:numtimes" />
  </property-sheet>
</config>
Once I have the aspect in my content model (customModel.xml, which I introduce to the Data Dictionary through custom-model-context.xml), I can start configuring the extraction process as outlined in wcm-xml-metadata-extracter-context.xml. There are two key sections:
- Selector section (extracter.xml.sample.selector.XPathSelector bean), which looks inside the XML and maps it to the correct Extractor Bean. Since all the XForms of any type get saved as XML, we need to select the appropriate one (in this case pr:press_release). This configuration associates the specific XForm with a specific extraction definition.
<bean id="extracter.xml.sample.selector.XPathSelector" class="org.alfresco.repo.content.selector.XPathContentWorkerSelector" init-method="init">
  <property name="workers">
    <map>
      <entry key="/pr:press_release">
        <ref bean="extracter.xml.sample.AlfrescoCustomModelMetadataExtracter" />
      </entry>
    </map>
  </property>
</bean>
- Extractor bean for each of the Web Content Types you defined. These have two parts:
A. xpathMappingProperties: takes an XPath expression that extracts a value from the XML file and stores it in an internal map. For example, the abstract property can be found through the XPath expression "/press_release/abstract". The value is then stored under the "abstract" key of the internal map.
<prop key="abstract">/press_release/abstract</prop>
Note that we also have to specify the namespace so Alfresco can resolve the prefixes appropriately:
<prop key="namespace.prefix.pr">http://www.alfresco.org/alfresco/pr</prop>
B. mappingProperties: takes each value out of the internal map and puts it into the specified data dictionary property. Here is the key: the extractor finds the corresponding aspect and automatically ("automagically") adds it to the Alfresco node. Before setting a value, it checks the target data type and attempts to convert the value to that type. In the example we are using, it takes the internal map entry "abstract" and sets it on the property my:abstract. In this case the property is a string, so no conversion is really required.
<prop key="abstract">my:abstract</prop>
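The selector and the two mapping stages above can be sketched outside Alfresco to make the data flow concrete. This is only an illustration in Python, not how the Alfresco beans are implemented: the sample document, the worker name, and the helper functions are hypothetical, and the child elements are assumed to be namespace-qualified. Paths are written relative to the root element because Python's ElementTree supports only a subset of XPath.

```python
import xml.etree.ElementTree as ET

# Hypothetical sample; the real structure is defined by press_release.xsd.
SAMPLE = """<pr:press_release xmlns:pr="http://www.alfresco.org/alfresco/pr">
  <pr:abstract>Quarterly results announced</pr:abstract>
  <pr:numtimes>3</pr:numtimes>
</pr:press_release>"""

# Counterpart of the namespace.prefix.pr entry.
NAMESPACES = {"pr": "http://www.alfresco.org/alfresco/pr"}

# Selector: root element -> extracter to use (worker name is made up).
WORKERS = {"pr:press_release": "press_release_extracter"}

# Stage A: internal map key -> path of the element holding the value.
XPATH_MAPPING = {"abstract": "pr:abstract", "numtimes": "pr:numtimes"}

# Stage B: internal map key -> (target property, converter for its type).
PROPERTY_MAPPING = {
    "abstract": ("my:abstract", str),
    "numtimes": ("my:numtimes", int),
}


def qualified(prefixed_name):
    """Turn 'pr:name' into ElementTree's '{uri}name' form."""
    prefix, local = prefixed_name.split(":")
    return "{%s}%s" % (NAMESPACES[prefix], local)


def select_worker(root):
    """Pick the extracter whose root element matches, or None."""
    for name, worker in WORKERS.items():
        if root.tag == qualified(name):
            return worker
    return None


def extract(xml_text):
    root = ET.fromstring(xml_text)
    if select_worker(root) is None:
        return {}  # not a document type we know how to extract
    # Stage A: pull raw strings into the internal map.
    raw = {}
    for key, path in XPATH_MAPPING.items():
        node = root.find(path, NAMESPACES)
        if node is not None and node.text is not None:
            raw[key] = node.text.strip()
    # Stage B: convert each value and store it under the target property.
    props = {}
    for key, value in raw.items():
        target, convert = PROPERTY_MAPPING[key]
        props[target] = convert(value)
    return props


properties = extract(SAMPLE)
```

Here `properties` maps `my:abstract` to the string from the document and `my:numtimes` to an integer, while a document with a different root element yields an empty result because no worker is selected.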
Note on Converting Dates: When I did this initially, I got an exception when converting dates. This is because the XForms store dates in the format 2008-04-28, and the automatic cast did not work. To remedy that, I added a configuration setting that specifies the correct date format for the extractor to use:
<property name="supportedDateFormats">
  <list>
    <value>yyyy-MM-dd</value>
  </list>
</property>
Note that metadata extraction runs when you create content (in the user sandbox). The key here is that the aspect gets applied automatically by the extraction process: you don’t need to make sure it’s added. You don’t even see mention of the aspect anywhere in the configuration files. This is what it should look like in the example: [screenshot]
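The date-conversion problem from the note above is easy to reproduce in isolation. The sketch below is a Python analogy, not Alfresco's converter: `parse_date` and `SUPPORTED_DATE_FORMATS` are hypothetical names, and `%Y-%m-%d` is Python's counterpart of the Java pattern yyyy-MM-dd. It tries each configured format in turn and fails only when none matches.

```python
from datetime import datetime

# Python equivalent of the yyyy-MM-dd entry in supportedDateFormats.
SUPPORTED_DATE_FORMATS = ["%Y-%m-%d"]


def parse_date(text, formats=SUPPORTED_DATE_FORMATS):
    """Return a datetime using the first format that matches the text."""
    for fmt in formats:
        try:
            return datetime.strptime(text, fmt)
        except ValueError:
            continue  # try the next configured format
    raise ValueError("no supported date format matches %r" % text)


# The date-only value that XForms writes parses once its format is listed.
expiration = parse_date("2008-04-28")
```

With only a full date-time format configured, for example `["%Y-%m-%dT%H:%M:%S"]`, the same call raises `ValueError`, which mirrors the exception described above.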
The Lucene indexing happens when you promote the content to the staging sandbox.
Indexing is governed by the <index> element on each property you define in the aspect:
<property name="my:expiration_date">
  <type>d:datetime</type>
  <index enabled="true">
    <atomic>true</atomic>
    <stored>false</stored>
    <tokenised>false</tokenised>
  </index>
</property>
Because this is WCM content, you cannot search on these properties from the Search UI. You can, however, query them in the Node Browser for testing, and the ultimate goal is likely to expose them through some web scripts.
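For testing in the Node Browser, a Lucene query on the extracted properties might look like the string built below. This is a sketch under assumptions: the helper names are hypothetical, the `my` prefix is assumed to be registered by the custom model, and the query has to be run against the store that is actually indexed (the staging sandbox). The one syntactic detail worth showing is that the colon in the property's prefixed name is escaped with a backslash.

```python
def property_field(qname):
    """Build a Lucene field reference such as @my\\:numtimes from my:numtimes."""
    return "@" + qname.replace(":", "\\:")


def press_releases_with(qname, value):
    """Query for nodes carrying the aspect whose property equals the value."""
    return 'ASPECT:"my:press_release_metadata" AND %s:%s' % (
        property_field(qname), value)


# String to paste into the Node Browser with the "lucene" query type.
query = press_releases_with("my:numtimes", 3)
print(query)
```

The same string could later be passed to a search call inside a web script, which is the likely end goal.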
