Posts Tagged ‘DM’

Version Baselining

Tuesday, November 3rd, 2009

One of the great things about working with Alfresco is the vast number of extension points the system offers to developers.  Some of these stem from the pervasive use of the Spring framework, some of them to a well thought out application architecture, and many of them from a number of guiding principles that are consistently applied even when their potential uses aren’t necessarily known with certainty ahead of time.

I recently had the pleasure of being reminded of this latter case when a customer asked for an extension that allowed their content contributors to control the “baseline” version number of documents in their Alfresco installation.  The idea was to allow their contributors to (optionally) enter a version number along with each document, and have the Alfresco versioning system start with that version number instead of the default of 1.0.

Although I didn’t know how this might be achieved, in less than 10 minutes I had my answer and it relied on a slight variation of a mechanism that I’d used in the past.  The customer was also gracious enough to release the IP, so I’ve made the initial version of the extension available on google code.

Here is a brief overview of its usage:

This extension works by extending Alfresco with a custom content type called “Version Baselined Content” that includes a single property called “Base Version”.  This property is where the content contributor can set the base version to be used if/when versioning is enabled on the document.

In order to create content of this type, “Version Baselined Content” needs to be selected in the “Type” dropdown of the “Add Content Dialog”:

Provided the “Modify all properties when this page closes” checkbox is left checked (the default), the contributor will then be presented with the option to specify the base version number for this document (if/when versioning is enabled):

The default value for this field is “0.1″ – if the contributor elects to skip modification of the new content’s properties, this is the base version number it will be assigned automatically.

The base version number must be a valid non-negative decimal number (ie. it must be a number greater than or equal to 0.0).  If an invalid value is entered, an error will be displayed when the user clicks the “OK” button.

Once the version number is populated, it may be edited via the document’s properties as many times as are necessary, up until the time versioning is enabled for the document:

Once versioning is enabled for the document, the initial version number will be set to the value of the “Base Version Number” property at that time:

From this point on, any modifications to the “Base Version Number” property will be ignored as it is not possible to renumber an existing Alfresco version history.

Other than allowing explicit control over the initial version number for a document, this extension does not change any other versioning behavior in the system.  For example creating a new minor revision of a document (via checkout and checkin) will increment the version number by 0.1.  Similarly, creating a new major revision of a document (via checkout and checkin) will increment the major component of the version number by 1, and set the minor component to 0:

While the extension is quite neat and (due to the generosity of the customer) available for anyone to use, refine and extend, what really grabbed me as I developed it was how, despite having no prior experience with this particular extension point, it was familiar enough that I was able to understand it almost immediately and leverage it to achieve the desired goal.

Bulk Import from a Filesystem

Thursday, October 22nd, 2009

The Use Case

In any CMS implementation an almost ubiquitous requirement is to load existing content into the new system. That content may reside in a legacy CMS, on a shared network drive, on individual user’s hard drives or in email, but the requirement is almost always there – to inventory the content that’s out there and bring some or all of it into the CMS with a minimum of effort.

Alfresco provides several mechanisms that can be used to import content, including:

Alfresco is also fortunate to have SI partners such as Technology Services Group who provide specialised content migration services and tools (their open source OpenMigrate tool has proven to be popular amongst Alfresco implementers).

That said, most of these approaches suffer from one or more of the following limitations:

  • They require the content to be massaged into some other format prior to ingestion
  • Orchestration of the ingestion process is performed external (ie. out-of-process) to Alfresco, resulting in excessive chattiness between the orchestrator and Alfresco.
  • They require development or configuration work
  • They’re more general in nature, and so aren’t as performant as a specialised solution

An Opinionated (but High Performance!) Alternative

For that reason I recently set about implementing a bulk filesystem import tool, that focuses on satisfying a single, highly specific use case in the most performant manner possible: to take a set of folders and files on local disk and load them into the repository as quickly and efficiently as possible.

The key assumption that allows this process to be efficient is that the source folders and files must be on disk that is locally accessible to the Alfresco server – typically this will mean a filesystem that is located on a hard drive physically housed in the server Alfresco is running on.  This allows the code to directly stream from disk into the repository, which basically devolves into disk-to-disk streaming – far more efficient than any kind of mechanism that requires network I/O.

How those folders and files got onto the local disk is left as an exercise for the reader, but most OSes provide efficient mechanisms for transferring files across a network (rsync and robocopy, for example).  Alternatively it’s also possible to mount a remote filesystem using an OS-native mechanism (CIFS, NFS, GFS and the like), although doing so reintroduces network I/O overhead.

Another key differentiator of this solution is that all of the logic for ingestion executes in-process within Alfresco.  This completely eliminates expensive network RPCs while ingestion is occurring, and also provides fine grained control of various expensive operations (such as transaction commits / rollbacks).

Which leads into another advantage of this solution: like most transactional systems, there are some general strategies that should be followed when writing large amount of data into the Alfresco repository:

  1. Break up large volumes of writes into multiple batches – long running transactions are problematic for most transactional systems (including Alfresco).
  2. Avoid updating the same objects from different concurrent transactions.  In the case of Alfresco, this is particularly noticeable when writing content into the same folder, as those writes cause updates to the parent folder’s modification timestamp.[EDIT] In recent versions of Alfresco, the automatic update of a folder’s modification timestamp (cm:modified property) has been disabled by default. It can be turned back on (by setting the property “system.enableTimestampPropagation” to true), but the default is false so this is likely to be less of an impact to bulk ingestion than I’d originally thought.

The bulk filesystem import tool implements both of these strategies (something that is not easily accomplished when ingestion is coordinated by a separate process).  It batches the source content by folder, using a separate transaction per folder, and it also breaks up any folder containing more than a specific number of files (1,000 by default) into multiple transactions.  It also creates all of the children of a given folder (both files and sub-folders) as part of the same transaction, so that indirect updates to the parent folder occur from that single transaction.

But What Does this Mean in Real Life?

The benefit of this approach was demonstrated recently when an Alfresco implementation had a bulk ingestion process that regularly loaded large numbers (1,000s) of large image files (several MBs per file) into the repository via CIFS.  In one test, it took approximately an hour to load 1,500 files into the repository via CIFS.  In contrast the bulk filesystem import tool took less than 5 minutes to ingest the same content set.

Now clearly this ignores the time it took to copy the 1,500 files onto the Alfresco server’s hard drive prior to running the bulk filesystem import tool, but in this case it was possible to modify the sourcing process so that it dropped the content directly onto the Alfresco server’s hard drive, providing a substantial (order of magnitude) overall saving.

What Doesn’t it Do (Yet)?

Despite already being in use in production, this tool is not what I would consider complete.  The issue tracker in the Google Code project has details on the functionality that’s currently missing; the most notable gap being the lack of support for population of metadata (folders are created as cm:folder and files are created as cm:content). [EDIT] v0.5 adds a first cut at metadata import functionality.  The “user experience” (I hesitate to call it that) is also very rough and could easily be substantially improved. [EDIT] v0.4 added several UI Web Scripts that significantly improve the usability of the tool (at least for the target audience: Alfresco developers and administrators).

That said, the core logic is sound, and has been in production use for some time.  You may find that it’s worth investigating even in its currently rough state.

[POST EDIT] This tool seems to have attracted quite a bit of interest amongst the Alfresco implementer community. I’m chuffed that that’s the case and would request that any questions or comments you have be raised on the mailing list.  If you believe you’ve found a bug, or wish to request an enhancement to the tool, the issue tracker is the best place. Thanks!


Alfresco Home | Legal | Privacy | Accessibility | Site Map | RSS  RSS

© 2012 Alfresco Software, Inc. All Rights Reserved.