Archive for the ‘alfresco’ Category

Version Baselining

Tuesday, November 3rd, 2009

One of the great things about working with Alfresco is the vast number of extension points the system offers to developers.  Some of these stem from the pervasive use of the Spring framework, some from a well-thought-out application architecture, and many from a number of guiding principles that are consistently applied even when their potential uses aren’t known ahead of time.

I recently had the pleasure of being reminded of this latter case when a customer asked for an extension that allowed their content contributors to control the “baseline” version number of documents in their Alfresco installation.  The idea was to allow their contributors to (optionally) enter a version number along with each document, and have the Alfresco versioning system start with that version number instead of the default of 1.0.

Although I didn’t know how this might be achieved, in less than 10 minutes I had my answer, and it relied on a slight variation of a mechanism that I’d used in the past.  The customer was also gracious enough to release the IP, so I’ve made the initial version of the extension available on Google Code.

Here is a brief overview of its usage:

This extension works by extending Alfresco with a custom content type called “Version Baselined Content” that includes a single property called “Base Version”.  This property is where the content contributor can set the base version to be used if/when versioning is enabled on the document.

In order to create content of this type, “Version Baselined Content” needs to be selected in the “Type” dropdown of the “Add Content Dialog”:

Provided the “Modify all properties when this page closes” checkbox is left checked (the default), the contributor will then be presented with the option to specify the base version number for this document (if/when versioning is enabled):

The default value for this field is “0.1” – if the contributor elects to skip modification of the new content’s properties, this is the base version number it will be assigned automatically.

The base version number must be a valid non-negative decimal number (ie. it must be a number greater than or equal to 0.0).  If an invalid value is entered, an error will be displayed when the user clicks the “OK” button.

Once the version number is populated, it may be edited via the document’s properties as many times as are necessary, up until the time versioning is enabled for the document:

Once versioning is enabled for the document, the initial version number will be set to the value of the “Base Version Number” property at that time:

From this point on, any modifications to the “Base Version Number” property will be ignored as it is not possible to renumber an existing Alfresco version history.

Other than allowing explicit control over the initial version number for a document, this extension does not change any other versioning behavior in the system.  For example creating a new minor revision of a document (via checkout and checkin) will increment the version number by 0.1.  Similarly, creating a new major revision of a document (via checkout and checkin) will increment the major component of the version number by 1, and set the minor component to 0:
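The numbering rules described above can be sketched in a few lines of Python. This is an illustration only and not the extension's code: the function names are hypothetical, and it assumes a version label is made up of two non-negative integer components (major.minor), so a minor revision after 1.9 would give 1.10 rather than 2.0. The description above does not say which of those the real system does.

```python
import re

# Assumption: a version label is "major" or "major.minor", both non-negative integers.
_LABEL = re.compile(r"^(\d+)(?:\.(\d+))?$")


def parse_label(label):
    """Split a label such as "0.1" into (major, minor); reject anything else."""
    match = _LABEL.match(label.strip())
    if match is None:
        raise ValueError(f"not a valid non-negative version number: {label!r}")
    # A missing minor component (e.g. "5") is treated as 0.
    return int(match.group(1)), int(match.group(2) or 0)


def next_label(label, major=False):
    """Return the label that follows `label` after a checkin."""
    major_part, minor_part = parse_label(label)
    if major:
        # Major revision: bump the major component and reset the minor to 0.
        return f"{major_part + 1}.0"
    # Minor revision: bump the minor component by one step.
    return f"{major_part}.{minor_part + 1}"
```

With a base version of “0.1”, the first minor revision would be labelled 0.2 and a major revision would be labelled 1.0.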

While the extension is quite neat and (due to the generosity of the customer) available for anyone to use, refine and extend, what really grabbed me as I developed it was how, despite having no prior experience with this particular extension point, it was familiar enough that I was able to understand it almost immediately and leverage it to achieve the desired goal.

Bulk Import from a Filesystem

Thursday, October 22nd, 2009

The Use Case

In any CMS implementation an almost ubiquitous requirement is to load existing content into the new system. That content may reside in a legacy CMS, on a shared network drive, on individual users’ hard drives or in email, but the requirement is almost always there - to inventory the content that’s out there and bring some or all of it into the CMS with a minimum of effort.

Alfresco provides several mechanisms that can be used to import content, including:

Alfresco is also fortunate to have SI partners such as Technology Services Group who provide specialised content migration services and tools (their open source OpenMigrate tool has proven to be popular amongst Alfresco implementers).

That said, most of these approaches suffer from one or more of the following limitations:

  • They require the content to be massaged into some other format prior to ingestion
  • Orchestration of the ingestion process is performed external (ie. out-of-process) to Alfresco, resulting in excessive chattiness between the orchestrator and Alfresco.
  • They require development or configuration work
  • They’re more general in nature, and so aren’t as performant as a specialised solution

An Opinionated (but High Performance!) Alternative

For that reason I recently set about implementing a bulk filesystem import tool that focuses on satisfying a single, highly specific use case in the most performant manner possible: to take a set of folders and files on local disk and load them into the repository as quickly and efficiently as possible.

The key assumption that allows this process to be efficient is that the source folders and files must be on disk that is locally accessible to the Alfresco server - typically this will mean a filesystem that is located on a hard drive physically housed in the server Alfresco is running on.  This allows the code to directly stream from disk into the repository, which basically devolves into disk-to-disk streaming - far more efficient than any kind of mechanism that requires network I/O.

How those folders and files got onto the local disk is left as an exercise for the reader, but most OSes provide efficient mechanisms for transferring files across a network (rsync and robocopy, for example).  Alternatively it’s also possible to mount a remote filesystem using an OS-native mechanism (CIFS, NFS, GFS and the like), although doing so reintroduces network I/O overhead.

Another key differentiator of this solution is that all of the logic for ingestion executes in-process within Alfresco.  This completely eliminates expensive network RPCs while ingestion is occurring, and also provides fine grained control of various expensive operations (such as transaction commits / rollbacks).

This leads to another advantage of the solution.  As with most transactional systems, there are some general strategies that should be followed when writing large amounts of data into the Alfresco repository:

  1. Break up large volumes of writes into multiple batches - long running transactions are problematic for most transactional systems (including Alfresco).
  2. Avoid updating the same objects from different concurrent transactions.  In the case of Alfresco, this is particularly noticeable when writing content into the same folder, as those writes cause updates to the parent folder’s modification timestamp.

The bulk filesystem import tool implements both of these strategies (something that is not easily accomplished when ingestion is coordinated by a separate process).  It batches the source content by folder, using a separate transaction per folder, and it also breaks up any folder containing more than a specific number of files (1,000 by default) into multiple transactions.  It also creates all of the children of a given folder (both files and sub-folders) as part of the same transaction, so that indirect updates to the parent folder occur from that single transaction.
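As a rough illustration of that batching strategy (not the tool's actual implementation), the following Python sketch plans one transaction per folder and splits any folder with more than 1,000 children into several transactions. The function names and the dictionary used to describe the source tree are assumptions made for the example.

```python
DEFAULT_BATCH_SIZE = 1000  # the default quoted above


def split_into_batches(children, batch_size=DEFAULT_BATCH_SIZE):
    """Split one folder's children into lists of at most batch_size items."""
    if batch_size < 1:
        raise ValueError("batch_size must be at least 1")
    return [children[i:i + batch_size] for i in range(0, len(children), batch_size)]


def plan_transactions(tree, batch_size=DEFAULT_BATCH_SIZE):
    """Yield (folder, batch) pairs; each pair stands for one transaction.

    `tree` is a hypothetical in-memory description of the source: it maps a
    folder path to the names of its direct children (files and sub-folders).
    All children in a batch are created under the same parent, so indirect
    updates to that parent come from a single transaction at a time.
    """
    for folder, children in tree.items():
        for batch in split_into_batches(list(children), batch_size):
            yield folder, batch
```

For a folder holding 2,500 files this yields three transactions of 1,000, 1,000 and 500 children, all against the same parent folder.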

But What Does this Mean in Real Life?

The benefit of this approach was demonstrated recently when an Alfresco implementation had a bulk ingestion process that regularly loaded large numbers (1,000s) of large image files (several MBs per file) into the repository via CIFS.  In one test, it took approximately an hour to load 1,500 files into the repository via CIFS.  In contrast the bulk filesystem import tool took less than 5 minutes to ingest the same content set.

Now clearly this ignores the time it took to copy the 1,500 files onto the Alfresco server’s hard drive prior to running the bulk filesystem import tool, but in this case it was possible to modify the sourcing process so that it dropped the content directly onto the Alfresco server’s hard drive, providing a substantial (order of magnitude) overall saving.

What Doesn’t it Do (Yet)?

Despite already being in use in production, this tool is not what I would consider complete.  The issue tracker in the Google Code project has details on the functionality that’s currently missing; the most notable gap being the lack of support for population of metadata (folders are created as cm:folder and files are created as cm:content). [EDIT] v0.5 adds a first cut at metadata import functionality.  The “user experience” (I hesitate to call it that) is also very rough and could easily be substantially improved. [EDIT] v0.4 added several UI Web Scripts that significantly improve the usability of the tool (at least for the target audience: Alfresco developers and administrators).

That said, the core logic is sound, and has been in production use for some time.  You may find that it’s worth investigating even in its currently rough state.

[POST EDIT] This tool seems to have attracted quite a bit of interest amongst the Alfresco implementer community. I’m chuffed that that’s the case, and would ask that any requests you have be logged via the issue tracker in Google Code so that I can keep track of all of the great ideas that I’ve received. Thanks!

Including a Static XSD in a Web Form

Tuesday, July 7th, 2009

Since their inception, Alfresco WCM Web Forms have supported an inclusion mechanism based on the standard XML Schema include and import constructs.  Originally this mechanism read the included assets from the Web Project where the user was creating the content, but since v2.2SP3 the preferred mechanism has been to reference a Web Script instead (in fact the legacy mechanism may be deprecated in a future release).

One question that this new approach raises is how to support inclusion of static XSDs, as Web Scripts are inherently dynamic and introduce some unnecessary overhead for the simple static case. The good news is that Alfresco ships with a Web Script that simply reads a file from the repository and returns its contents:

/api/path/content{property}/{store_type}/{store_id}/{path}?a={attach?}

An example usage is:

/api/path/content/workspace/SpacesStore/Company Home/Data Dictionary/Presentation Templates/readme.ftl
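Because repository paths such as “Company Home” contain spaces, a client calling this Web Script will usually need to URL-encode the path. The Python sketch below shows one way to build such a URL. The function name and the service base URL are assumptions (adjust them to your installation), and the optional {property} segment of the URI template is left out.

```python
from urllib.parse import quote, urlencode


def content_url(service_base, store_type, store_id, path, attach=None):
    """Build a URL for the content-by-path Web Script shown above.

    service_base is an assumed Web Script root, for example
    "http://localhost:8080/alfresco/service".
    """
    # Percent-encode each path segment but keep the "/" separators.
    encoded_path = quote(path.strip("/"), safe="/")
    url = (
        f"{service_base.rstrip('/')}/api/path/content/"
        f"{store_type}/{store_id}/{encoded_path}"
    )
    if attach is not None:
        # The "a" argument corresponds to {attach?} in the URI template.
        url += "?" + urlencode({"a": "true" if attach else "false"})
    return url
```

Calling it with the store type “workspace”, the store id “SpacesStore” and the path from the example above produces the same URL with each space encoded as %20.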

Using the Web Script inclusion mechanism for Web Forms, we can use this Web Script to include or import any XSD file stored in the DM repository.  For example, if we have a file called “my-include.xsd” in the “Company Home” space that contains the following content:

<?xml version="1.0"?>
<xs:schema xmlns:xs="http://www.w3.org/2001/XMLSchema"
           xmlns:alf="http://www.alfresco.org/"
           targetNamespace="http://www.alfresco.org/"
           elementFormDefault="qualified">
  <xs:complexType abstract="true" name="IncludedComplexType">
    <xs:sequence>
      <xs:element name="Title"
                  type="xs:normalizedString"
                  minOccurs="1"
                  maxOccurs="1" />
      <xs:element name="Summary"
                  type="xs:string"
                  minOccurs="0"
                  maxOccurs="1" />
      <xs:element name="Keyword"
                  type="xs:normalizedString"
                  minOccurs="0"
                  maxOccurs="unbounded" />
    </xs:sequence>
  </xs:complexType>
</xs:schema>

We could include it into a Web Form XSD using an include statement such as the following:

<?xml version="1.0"?>
<xs:schema xmlns:xs="http://www.w3.org/2001/XMLSchema"
           xmlns:alf="http://www.alfresco.org/"
           targetNamespace="http://www.alfresco.org/"
           elementFormDefault="qualified">
  <xs:include schemaLocation="webscript://api/path/content/workspace/SpacesStore/Company Home/my-include.xsd?ticket={ticket}" />
  <xs:complexType name="MyWebFormType">
    <xs:complexContent>
      <xs:extension base="alf:IncludedComplexType">
        <xs:sequence>
          <xs:element name="Body"
                      type="xs:string"
                      minOccurs="1"
                      maxOccurs="1" />
        </xs:sequence>
      </xs:extension>
    </xs:complexContent>
  </xs:complexType>
  <xs:element name="MyWebForm" type="alf:MyWebFormType" />
</xs:schema>

This is clearly faster and easier than developing a custom Web Script to either emit the XML Schema shown above, or to return the contents of a specific XSD file from the repository!

This approach also provides a solution to another question: how does one neatly package up a Web Form, along with all of its dependencies, ready for deployment to another Alfresco environment?

By storing included XSD files in Company Home > Data Dictionary > Web Forms, we give ourselves the option to package up the entire Web Forms space as an ACP file and deploy that ACP file to any other Alfresco environment, knowing that we’ve captured not only all of the Web Forms in the source environment, but all dependent XSD files as well.

Code Movement vs Content Movement

Wednesday, July 1st, 2009

Seth Gottlieb has written a great post entitled “Code moves forward. Content moves backward.” that, by strange coincidence, echoes an Alfresco KB item authored by Alfresco’s very own Ben Hagan last year.

What’s interesting to me is that there is an alternative world view that asserts that code and content are two sides of the same coin and hence should be managed the same way in the same management system.  This meme seems particularly strong amongst those who are adherents of the Boiko school of thought and also those who’ve had significant exposure to certain Web CMS products (that shall remain nameless) that are clearly designed for the blended model, and so indoctrinate users / developers to use a blended model in all cases (whether appropriate or not).

My experience has been that blending code and content management together doesn’t work well in the majority of cases, for two primary reasons:

  1. Typically very different groups are producing the code and the content - often they’re in completely different divisions within the organisation (ie. IT vs business unit) and sometimes are even separate companies (ie. web agency vs client).
  2. The release cycles for code and content are vastly different - code is typically released infrequently (weekly, at best), while the content on any large site is typically changing virtually non-stop.

The net result is that shoehorning both activities together creates unnecessary procedural couplings between groups who are typically poorly structured (from a communication and coordination perspective) to manage those redundant couplings efficiently.

Anyway, it’s a great post on a very interesting topic, and I’d definitely encourage anyone involved in implementing a Web CMS (whether Alfresco WCM or not) to give it a solid read.

Web CMS’s Dissected

Wednesday, November 5th, 2008

It’s no secret that Content Management Systems (CMS) are a pretty heterogeneous bunch of technologies, covering everything from paper document imaging systems through to portal servers through to desktop productivity apps - applying some Tufte to the CMS space demonstrates this heterogeneity pretty clearly. What’s less apparent is that even within the relatively narrow confines of Web CMS (WCMS) technologies, industry definitions are almost as fuzzy.

Interestingly, this confusion is not apparent when looking at WCMS technologies directly.  Over the last decade or so the WCMS market has matured into basically two quite well defined and quite distinct types of application, yet I rarely find this distinction reflected in discussions around WCMS technology.

I refer to these two categories of WCMS as:

  1. Content Production Systems (CPS)
  2. Presentation Management Systems (PMS)

Here are some typical distinguishing characteristics for these two types of system:

Architecture
  • CPS: Separate management and delivery systems
  • PMS: Single monolithic system for both management and delivery

Content Production Capabilities (individual content production workspaces, or “sandboxes”; versioning / audit history; workflow / QA; in-context preview prior to launch; site rollback; deployment / publication)
  • CPS: Strong
  • PMS: Weak to none

Content Delivery Capabilities
  • CPS: Weak to none - often API based
  • PMS: Strong

Canned Content Models
  • CPS: Few to none
  • PMS: Extensive, though mostly presentation-centric (sites; navigation / site map; pages; layouts; regions; components / modules / portlets; templates; skins / themes)

Support for Custom Content Models
  • CPS: Rich, often based on XML or relational technologies
  • PMS: Typically weak, often limited to simple map / dictionary data structures

Examples
  • CPS: Interwoven TeamSite, Vignette VCM, Documentum Web Publisher
  • PMS: Drupal, Joomla, The Nukes (PHPNuke / DotNetNuke), Portal Servers, Wikis

 
As can be gleaned from the table, for the most part these types of systems address orthogonal use cases (the content creation / production process vs the content delivery process) which explains why it’s so confusing to directly compare a CPS with a PMS (something I’ve seen numerous times).

Now there’s no reason that a single system couldn’t do both, and in fact some WCMS vendors have product offerings that attempt to do this. The problem is that to date most of these attempts have involved cobbling together what were previously independent applications, resulting in seemingly arbitrary distinctions between content that’s fully managed via the CPS vs content that isn’t.

As an example, one of the products sold by one of the vendors listed above marries a CPS with a portal server, but none of the portal server’s configuration data (pages, layouts, regions, portlets, etc.) is stored in the CPS, so there’s no ability to manage (workflow, review, version, etc.) changes to that data.  To the typical editorial team, this distinction is arbitrary and baffling, and can contribute to adoption problems.

So where does Alfresco WCM fit in all of this?

By now it should be clear that current versions of Alfresco WCM are solidly in the CPS camp - the core functionality is specifically focused on the content production use case, with presentation management left up to the delivery tier (which implementers of Alfresco WCM can implement using whatever technologies they’re comfortable with).

That said, Alfresco has recognised the value of a combined CPS + PMS solution for some time, but up until recently the focus was on implementing the CPS first, since:

  1. it’s arguably easier to add PMS constructs on top of a CPS than it is to retrofit a PMS with CPS style functionality
  2. there are situations where a PMS is already in place (eg. a custom web application), and the requirement is to introduce a WCMS that can integrate with that PMS rather than replacing it - in this case a pure play CPS is an appropriate solution

Earlier this year work began on the PMS side of things, and that’s started to bear fruit in the recent 3.0 release; specifically with the introduction of the Surf platform.  The next step (currently targeting a later release in the 3.x product line) is to introduce a visual site building tool (tentatively called Web Studio) that allows less technical users to visually manipulate the Surf content model (ie. build “sites”, “pages” etc. using a visual editing tool).

The beauty of this approach is that the Surf data model is stored in the (existing) repository, so all of the content production capabilities of the repository (sandboxed content creation / modification, workflow / QA, in-context preview, full revision history of the content set, rollback / roll forward, deployment / publishing, etc.) apply to all changes to a Surf site, regardless of whether it’s a user writing a new press release, a subject matter expert optimising the navigational hierarchy for their section of the site, a web admin re-skinning the entire site or a web developer creating new page templates to add to the library.

Compare this to the product described above, where some changes are made in the WCMS (and can be content managed) and some are made in the portal (and are not managed at all) and I’m sure you’ll see why we’re so excited about both the Surf platform and the upcoming Web Studio tool.

Implementing “DocFlip” for FSRs

Thursday, October 30th, 2008

In my previous post I discussed how File System Receivers (FSRs) implement deployment transactions on top of non-transactional filesystems.  As discussed in that post, there is a window of time in which an inconsistent state could be seen by an application reading the content; that is, while the FSR is in the middle of the commit phase.  Now the duration of this window varies based on a number of factors, but in some cases it’s critical to minimise the inconsistent window as much as possible, and in these cases a technique called “docflip” can help.

I first heard about “docflip” almost 10 years ago, and have seen it in use several times since then.  The basic approach is relatively simple:

  1. Two full copies of the target directory are maintained.
  2. A symlink is used that points to one of these directories.  All applications that are reading content use this symlink exclusively (they are unaware of the two underlying directories).
  3. At any point in time:
    1. One of the directories (the one pointed to by the symlink) is the “live” copy.
    2. The other directory (that is not pointed to by anything) is the “shadow” copy.
  4. A transaction involves:
    1. Writing all of the changes to the shadow copy.
    2. Either committing the transaction, which involves:
      1. Flipping the symlink from the current live directory to the (newly updated) shadow directory, effectively swapping which directory is live and which is the shadow.
      2. Re-running step 4.1 against the (new) shadow directory (the directory that was live up until step 4.2.1) – this can also be achieved by simply rsyncing from the (new) live to the (new) shadow directory, if rerunning the original set of content modifications is too difficult or expensive.
    3. Or rolling back the transaction, which involves replacing the (partially updated) shadow directory with the contents of the current live directory, without touching the symlink at all.
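The commit path in the steps above can be sketched in Python as follows. This is a minimal illustration under stated assumptions rather than a production implementation: the function names are hypothetical, it assumes a POSIX filesystem (where renaming one symlink over another is atomic), and the content changes are represented by a caller-supplied `apply_changes` function that can be safely run against either directory.

```python
import os


def flip_symlink(link_path, new_target):
    """Repoint link_path at new_target in a single atomic step.

    A temporary symlink is created next to the real one and then renamed
    over it, so readers never observe a missing or half-written link.
    """
    tmp_link = link_path + ".flip"  # hypothetical temporary name
    if os.path.lexists(tmp_link):
        os.remove(tmp_link)
    os.symlink(new_target, tmp_link)
    os.replace(tmp_link, link_path)  # rename(2) on POSIX


def commit(link_path, dir_a, dir_b, apply_changes):
    """Run one docflip transaction and return the new live directory."""
    live = os.path.realpath(link_path)
    shadow = dir_b if live == os.path.realpath(dir_a) else dir_a
    apply_changes(shadow)             # step 4.1: update the shadow copy
    flip_symlink(link_path, shadow)   # step 4.2.1: the shadow becomes live
    apply_changes(live)               # step 4.2.2: bring the old live copy up to date
    return shadow
```

Rollback (step 4.3) is not shown; it would restore the shadow directory from the live one and leave the symlink untouched.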

Note that there are some downsides to this approach, including:

  • It requires two full copies of the target directory, which can be problematic with large content sets.
  • It assumes that applications don’t keep files open for extended periods of time - updates to a file are only visible when that file is (re)opened.
  • It doesn’t work very well on Windows platforms due to Windows’ unfortunate choice of using fully qualified paths for file handles instead of inodes, making it impossible to flip the symlink / junction if any files are currently held open by an application.

Regardless, “docflip” greatly reduces the window of time in which the filesystem is in an inconsistent state - basically to the time it takes to rewrite a symlink.  That said, it doesn’t completely eliminate phantom reads: an application can still read a file before a transaction, have the transaction commit (flipping the symlink), and then re-read the file afterwards and find that it has changed.  However without introducing read transactions (which would require changes to the applications reading the filesystem, along with some kind of transaction coordinator), it’s probably impossible to obtain serialisable isolation on non-transactional filesystems.

So now that we have a technique for minimising the time for changes to commit, how would this be implemented with an Alfresco FSR?

Without enhancing the FSR in any way, the approach I’ve considered involves:

  1. Having 3 copies of the target filesystem - one managed by the FSR, the other two (the live and shadow copies) managed by the custom “docflip” process.  As with vanilla “docflip” a symlink would point to the currently live copy of the content, and all applications reading the content would read via that symlink.
      • It’s not possible to use the FSR’s own target directory as one of the live / shadow directories, since that would require that the FSR itself can be dynamically reconfigured to ensure it always writes to the shadow (which changes with every flip of the symlink).
  2. Configuring a ProgramRunnable that calls a “docflip” shell script.  This shell script:
    1. Replicates the deployed delta from the FSR target directory to the shadow copy.
    2. Commits the transaction by flipping the symlink (ie. swaps the shadow and live copies).
    3. Re-replicates the deployed delta from the FSR target directory to the (new) shadow copy.
  3. Rollback doesn’t need to be considered, since by the time the ProgramRunnable is invoked, the FSR has already committed the deployed content to the target directory.  The only concern would be if step 2.3 fails – that would need to raise a critical administration alert of some kind since it would require manual intervention to avoid throwing all subsequent deployments into disarray.  Forcibly shutting down the FSR in this case might be justified, just to ensure that no further deployment can occur until the issue is resolved.

Replicating the changes made to the FSR’s target directory to the “docflip” directories (steps 2.1 and 2.3) could be done in a number of ways, including:

  1. Brute force rsync of the entire target directory.
  2. Directed rsync, using the manifest of changes that are sent to the shell script by the ProgramRunnable.
  3. By interpreting the manifest of changes that are sent to the shell script by the ProgramRunnable and executing equivalent cp / rm / mkdir / rmdir commands.
  4. Implementing the entire “docflip” process in Java instead of a shell script, and directly interpreting the manifest of changes.

These are listed in what I believe would be least dev effort / worst performance to most dev effort / highest performance.  The “sweet spot” is likely to be a combination of options 2 and 3, where rsync is used for creates / updates and rm / mkdir / rmdir are used for file deletes and directory operations.  If performance trumps all else option 4 is worth considering, possibly leveraging Java NIO and/or multi-threading techniques (being careful to preserve the order of operations listed in the manifest that are order-dependent eg. create directory A, …, …, create file A/B.txt).
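To make the manifest-driven options more concrete, here is a small Python sketch that replays a list of changes from the FSR target directory onto a “docflip” directory, strictly in the order given. The manifest format used here, a list of (operation, relative path) pairs with the operations "mkdir", "create", "update" and "delete", is purely hypothetical; the real manifest passed by the ProgramRunnable would need to be parsed into something like it first.

```python
import os
import shutil


def apply_manifest(manifest, source_root, target_root):
    """Replay manifest entries from source_root onto target_root, in order.

    Entries are processed sequentially because some depend on earlier ones
    (a directory must exist before a file is created inside it).
    """
    for operation, relative_path in manifest:
        source = os.path.join(source_root, relative_path)
        target = os.path.join(target_root, relative_path)
        if operation == "mkdir":
            os.makedirs(target, exist_ok=True)
        elif operation in ("create", "update"):
            os.makedirs(os.path.dirname(target), exist_ok=True)
            shutil.copy2(source, target)  # copies content and timestamps
        elif operation == "delete":
            if os.path.isdir(target):
                shutil.rmtree(target)
            elif os.path.lexists(target):
                os.remove(target)
        else:
            raise ValueError(f"unknown manifest operation: {operation!r}")
```

The same function would be called twice per deployment, once for each of steps 2.1 and 2.3, with a different target directory each time.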

So there you have it - a (hopefully enlightening!) exploration of the intricacies of FSR deployment, as well as ways to mitigate some of the potential concerns with the default implementation.



      © 2009 Alfresco Software, Ltd, All Rights Reserved