Category Archives: WCM

Alfresco and Groovy, Baby!

For quite a few years now I’ve been a fan of scripting languages that run on the JVM, initially experimenting with the venerable BeanShell, then tinkering with Javascript (via Rhino), JRuby and finally discovering Groovy in late 2007. A significant advantage that Groovy has over most of those other languages (with the possible exception of BeanShell) is that it is basically a superset of Java, so most valid Java code is also valid Groovy code and can therefore be executed by the Groovy “interpreter”¹ without requiring compilation, packaging or deployment – three things that significantly drag down one’s productivity with “real” Java.

To that end I decided to see if there was a way to implement Alfresco Web Scripts using Groovy, ideally in the hope of gaining access to the powerful Alfresco Java APIs with all of the productivity benefits of working in a scripting-like interpreted environment.

It turns out that the Spring Framework (a central part of Alfresco) moved in this direction some time ago, with support for what they refer to as dynamic-language-backed beans. Given that a Java backed Web Script is little more than a Spring bean plus a descriptor and some view templates, initially it seemed like Groovy backed Web Scripts might be possible in Alfresco already, merely by adding the Groovy runtime JAR to the Alfresco classpath and then configuring a Java-backed Web Script with a dynamic-language-backed Spring bean.

Oh behave!

Unfortunately this approach ran into one small snag: Alfresco requires that Java Web Script beans have a “parent” of “webscript”, as follows:

<bean id="webscript.my.web.script.get"
class="com.acme.MyWebScript"
parent="webscript">
<constructor-arg index="0" ref="ServiceRegistry" />
</bean>

but Spring doesn’t allow dynamic-language-backed beans to have a “parent” clause.

It’s freedom baby, yeah!

There are several ways to work around this issue, but the simplest was to implement a “proxy” Web Script bean in Java that simply delegates to another Spring bean, which itself could be a dynamic-language-backed Spring bean implemented in any of the dynamic languages Spring supports.

This class ends up looking something like (imports and comments removed in the interest of brevity):

public class DelegatingWebScript
    extends DeclarativeWebScript
{
    private final DynamicDeclarativeWebScript dynamicWebScript;

    public DelegatingWebScript(final DynamicDeclarativeWebScript dynamicWebScript)
    {
        this.dynamicWebScript = dynamicWebScript;
    }

    @Override
    protected Map executeImpl(WebScriptRequest request, Status status, Cache cache)
    {
        return(dynamicWebScript.execute(request, status, cache));
    }
}

While DynamicDeclarativeWebScript looks something like:

public interface DynamicDeclarativeWebScript
{
    Map execute(WebScriptRequest request, Status status, Cache cache);
}

This Java interface defines the API the Groovy code needs to implement in order for the DelegatingWebScript to be able to delegate to it correctly when the Web Script is invoked.

The net effect of all this is that a Web Script can now be implemented in Groovy (or any of the dynamic languages Spring supports for beans), by implementing the DynamicDeclarativeWebScript interface in a Groovy class, declaring a Spring bean with the script file containing that Groovy class and then configuring a new DelegatingWebScript instance with that dynamic bean. This may sound complicated, but as you can see in the following example, it’s pretty straightforward:

<lang:groovy id="groovy.myWebScript"
refresh-check-delay="5000"
script-source="classpath:alfresco/extension/groovy/MyWebScript.groovy">
<lang:property name="serviceRegistry" ref="ServiceRegistry" />
</lang:groovy>

<bean id="webscript.groovy.myWebScript"
class="org.alfresco.extension.webscripts.groovy.DynamicDelegatingWebScript"
parent="webscript">
<constructor-arg index="0" ref="groovy.myWebScript" />
</bean>
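
For completeness, here’s a rough sketch of what MyWebScript.groovy itself might look like. This is a hypothetical example (the Web Script framework package names, in particular, are assumed), and since valid Java is also valid Groovy, it’s written in plain Java syntax:

import java.util.HashMap;
import java.util.Map;

import org.alfresco.service.ServiceRegistry;
import org.springframework.extensions.webscripts.Cache;
import org.springframework.extensions.webscripts.Status;
import org.springframework.extensions.webscripts.WebScriptRequest;

public class MyWebScript implements DynamicDeclarativeWebScript
{
    private ServiceRegistry serviceRegistry;

    // Injection target for the <lang:property> element shown above
    public void setServiceRegistry(final ServiceRegistry serviceRegistry)
    {
        this.serviceRegistry = serviceRegistry;
    }

    public Map execute(WebScriptRequest request, Status status, Cache cache)
    {
        // Build the model for the Web Script's view templates, with the full
        // Alfresco Java API available via serviceRegistry
        Map model = new HashMap();
        model.put("message", "G'day from Groovy!");
        return model;
    }
}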

While a little more work than I’d expected, this approach meets all of my goals of being able to write Groovy backed Web Scripts, and in the interests of sharing I’ve put the code up on Google Code.

I demand the sum… …OF 1 MILLION DOLLARS!

But wait – there’s more! Not content with simply providing a framework for developing custom Web Scripts in Groovy, I decided to test out this framework by implementing a “Groovy Shell” Web Script. The idea here is that rather than having to develop and register a new Groovy Web Script each and every time I want to tinker with some Groovy code, instead the Web Script would receive the Groovy code as a parameter and execute whatever is passed to it.

Before we go any further, I should mention one very important thing: this opens up a massive script-injection-attack hole in Alfresco, and as a result this Web Script should NOT be used in any environment where data loss (or worse!) is unacceptable!! It is trivial to upload a script that does extremely nasty things to the machine hosting Alfresco (including, but by no means limited to, formatting all drives attached to the system) so please be extremely cautious about where this Web Script gets deployed!

Getting back on track, I accomplished this using Groovy’s GroovyShell class to evaluate a form POSTed parameter to the Web Script as Groovy code (this is conceptually identical to Javascript’s “eval” function, hence the warning about injection attacks). Effectively we have a Groovy-backed Web Script that interprets an input parameter as Groovy code, and then goes ahead and dynamically executes it! It’s turtles all the way down!
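
Stripped of error handling and JSON formatting, the core of the Web Script boils down to just a few lines. Here’s a minimal sketch using Groovy’s standard GroovyShell and Binding classes (the class, method and parameter names are illustrative), including the injection of the Alfresco ServiceRegistry into the script’s context (described further below):

import groovy.lang.Binding;
import groovy.lang.GroovyShell;
import org.alfresco.service.ServiceRegistry;

public class GroovyEvaluator
{
    // Evaluate the POSTed Groovy source, with the ServiceRegistry bound into
    // the script's execution context as the "serviceRegistry" variable
    public Object evaluate(final String groovySource, final ServiceRegistry serviceRegistry)
    {
        Binding binding = new Binding();
        binding.setVariable("serviceRegistry", serviceRegistry);
        GroovyShell shell = new GroovyShell(binding);
        return shell.evaluate(groovySource);   // the "eval" step - use with extreme caution!
    }
}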

The code also transforms the output of the script into JSON format, since there are existing Java libraries for transforming arbitrary object graphs (as would be returned by an arbitrary Groovy script) into JSON format.

Here’s a screenshot showing the end result:

Screenshot: Alfresco Groovy Shell - Vanilla Groovy Script

The more observant reader will have noticed the notes in the top right corner, particularly the note referring to a “serviceRegistry” object. Before evaluating the script, the Web Script injects the all-important Alfresco ServiceRegistry object into the execution context of the script, in a Groovy variable called “serviceRegistry”. The reason for doing so is obvious – this allows the script to interrogate and manipulate the Alfresco repository:

Screenshot: Alfresco Groovy Shell - Groovy Script that Interrogates the Alfresco Repository

Sharks with lasers strapped to their heads!

Now if you look carefully at this script, you’ll notice that it (mostly) looks like Java, and this is where the value of this Groovy Shell Web Script starts to become apparent: because most valid Java code is also valid Groovy code, you can use this Web Script to prototype Java code that interacts with the Alfresco repository, without going through the usual Java rigmarole of compiling, packaging, deploying and restarting!

I recently conducted an in-depth custom code review for an Alfresco customer who had used Java extensively, and this Web Script was a godsend – not only did I eliminate the drudgery of compiling, packaging and deploying the customer’s custom code (not to mention restarting Alfresco each time), I also completely avoided the time consuming (and, let’s be honest, painful) task of trying to reverse engineer their build toolchain so that I could build the code in my environment. This alone was worth the price of admission, but coupled with the rapid turnaround on changes (the mythical “edit / test / edit / test” cycle), I was able to diagnose their issues in a much shorter time than would otherwise have been possible.

Conclusion

As always I’m keen to hear of your experiences with this project should you choose to use it, and am keen to have others join me in maintaining and enhancing the code (which is surprisingly little, once all’s said and done).


¹ Technically Groovy does not have an interpreter; rather it compiles source scripts into JVM bytecode on demand. The net effect for the developer, however, is the same – the developer doesn’t have to build, package or deploy their code prior to execution – a serious productivity boost.

Disabling “Configure Workflow”

By default, the Alfresco WCM UI allows an author to select a different workflow and even reconfigure it at submission time, as shown in the following screenshot:

Screenshot: Configure Workflow at Submit Time

The obvious issue is that typically authors should not have the ability to influence the approval process, which, after all, is intended to ensure that any content they submit is appropriate for display on the live site.  As the feature currently exists in Alfresco, it is possible, for example, for the author to set themselves as the approver of their change set, completely circumventing the approval process that has been put in place.

While there is an open enhancement request (ENH-466) requesting that these controls be removed, many implementers need to be able to remove them immediately, on versions of Alfresco where this enhancement request has not yet been implemented.

Luckily there’s a straightforward way of doing this, albeit one that requires modification of a core Alfresco JSP.  The UI for the Submit Items dialog is rendered by a single JSP in the Alfresco Explorer UI:

/jsp/wcm/submit-dialog.jsp

At around line 104 (on Enterprise 3.2r – it may be slightly earlier or later in the file on other versions) the following two <h:panelGrid> blocks appear:

<h:panelGrid columns="1" cellpadding="2" style="padding-top:12px;padding-bottom:4px;"
width="100%" rowClasses="wizardSectionHeading">
<h:outputText value="&nbsp;#{msg.workflow}" escape="false" />
</h:panelGrid>

<h:panelGrid columns="1" cellpadding="2" cellpadding="2" width="100%" style="margin-left:8px">
<h:panelGroup rendered="#{DialogManager.bean.workflowListSize != 0}">
<h:outputText value="#{msg.submit_workflow_selection}" />
<h:panelGrid columns="2" cellpadding="2" cellpadding="2">
<a:selectList id="workflow-list" multiSelect="false" styleClass="noBrColumn" itemStyle="padding-top: 3px;"
value="#{DialogManager.bean.workflowSelectedValue}">
<a:listItems value="#{DialogManager.bean.workflowList}" />
</a:selectList>
<h:commandButton value="#{msg.submit_configure_workflow}" style="margin:4px" styleClass="dialogControls"
action="dialog:submitConfigureWorkflow" actionListener="#{DialogManager.bean.setupConfigureWorkflow}" />
</h:panelGrid>
</h:panelGroup>
<h:panelGroup rendered="#{DialogManager.bean.workflowListSize == 0}">
<f:verbatim><% PanelGenerator.generatePanelStart(out, request.getContextPath(), "yellowInner", "#ffffcc"); %></f:verbatim>
<h:panelGrid columns="2" cellpadding="0" cellpadding="0">
<h:graphicImage url="/images/icons/warning.gif" style="padding-top:2px;padding-right:4px" width="16" height="16"/>
<h:outputText styleClass="mainSubText" value="#{msg.submit_no_workflow_warning}" />
</h:panelGrid>
<f:verbatim><% PanelGenerator.generatePanelEnd(out, request.getContextPath(), "yellowInner"); %></f:verbatim>
</h:panelGroup>
</h:panelGrid>

Removing the ability for authors to select a different workflow and/or reconfigure the selected workflow is as simple as commenting out both of these blocks, using JSP style comment tags (<%-- and --%>).  The result appears as follows:

Screenshot: No Ability to Select or Configure Workflow

As you can see, the entire Workflow section of the Submit Items dialog has now been removed, and the user no longer has the ability to select a different workflow or reconfigure it.

A Note about Packaging

While it may be tempting to simply modify the JSP directly in the exploded Alfresco webapp, it is critically important to understand that doing so is unsafe.  Specifically, Tomcat may choose to re-explode the alfresco.war file at any time, overwriting your changes without warning and thereby reverting the Submit Items dialog to the default behaviour.

A better approach is to package up the modified JSP file into an AMP file, and deploy it to other environments (test, production, etc.) using the apply_amps script or the Module Management Tool.  Packaging the JSP as an AMP file also allows you to “pin” the change to a specific version of Alfresco (via the module.repo.version.min and module.repo.version.max properties, described here), which is also important to prevent someone accidentally installing an older version of the JSP into a newer version of Alfresco (which can create other, difficult-to-track-down issues in Alfresco).
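
For example, a hypothetical module.properties for such an AMP, pinned to Alfresco 3.2, might look like the following (the module id, title, description and versions are all illustrative):

# Identity of the module
module.id=com.acme.module.submitDialogPatch
module.title=Submit Dialog Workflow Patch
module.description=Removes the workflow selection controls from the Submit Items dialog
module.version=1.0

# Pin the module to the version of Alfresco the patched JSP was taken from
module.repo.version.min=3.2
module.repo.version.max=3.2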

Please note that modifying core Alfresco code (even JSPs) will technically invalidate support for the installation if you are a subscriber to Alfresco’s Enterprise Network – this should not be done lightly! In this case, however, the risk of unexpected side effects is minimal and although the change will need to be manually re-applied every time the installation is upgraded, there are ways of pro-actively managing that risk.

Timed Deployment

While Alfresco WCM contains a sophisticated deployment engine, the options for initiating deployment are rather more limited, comprising the manual “Deploy Snapshot” function in the Explorer UI, and the automatic “Auto Deploy” function that can be configured in the Web Project Settings and then requested by an author at submission time.

While these options are useful, they each have their downsides.  Manual deployment is, well, highly manual, and in practice it’s usually unacceptable to dedicate a Content Manager to monitoring promotions and deploying them as they roll in.  Auto-deployment removes the manual step (once authors are trained to check the “auto-deploy” checkbox during submission to workflow), but has the problem that a high rate of concurrent promotion can overwhelm the deployment system (since each and every promotion is auto-deployed individually, despite the deployment engine offering a far more efficient “batch” deployment mode).

A Better Alternative

A better approach is to have deployment initiated automatically on a scheduled basis, picking up the latest snapshot at that point in time (which will automatically include all prior snapshots since the last successful deployment).

So what would be involved in developing this as a customisation?

Not very much, as it turns out.  Alfresco already includes all of the necessary components to provide this functionality:

  1. It includes the Quartz job scheduling engine.
  2. It includes an Action for initiating deployment.

All that’s needed is some glue code to tie these two components together, and thanks to the generosity of a recent Alfresco Enterprise customer, that glue code has now been developed and is available as part of the alfresco-wcm-deployment project on Google Code.

Under the Hood

As it turns out there was one critical detail that made this slightly less straightforward than I’d expected. Specifically, the action responsible for deploying a Web Site (the AVMDeployWebsite action) isn’t responsible for writing deployment reports into the Web Project – that step is performed in the Explorer UI’s JSF code instead (in a private class called org.alfresco.web.bean.wcm.DeployWebsiteDialog).

Given that deployment reports are a critical piece of operational reporting information, it was clear that generating the deployment reports in exactly the same fashion as the OOTB “Deploy Snapshot” and “Auto Deploy” functions was a high priority.  As a result the code doesn’t call the AVMDeployWebsite action directly – instead I copied the relevant block of code out of DeployWebsiteDialog and added it to my own custom class¹.

Other than that, the code is pretty straightforward.  Following Alfresco’s ubiquitous “in-process SOA” pattern, I introduced a new service interface (called WebProjectDeploymentService), wired it into my Quartz job class using Spring, then configured it (with the cron expression that controls how frequently it runs) in a separate “trigger” bean.
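
The wiring looks something like the following sketch. CronTriggerBean and the schedulerFactory bean are standard Alfresco components, but the job class and bean names shown here are illustrative – check the Google Code project for the real definitions:

<bean id="webProjectDeploymentJobDetail"
      class="org.springframework.scheduling.quartz.JobDetailBean">
  <property name="jobClass"
            value="org.alfresco.extension.deployment.WebProjectDeploymentJob" />  <!-- assumed name -->
  <property name="jobDataAsMap">
    <map>
      <entry key="webProjectDeploymentService" value-ref="WebProjectDeploymentService" />
    </map>
  </property>
</bean>

<bean id="webProjectDeploymentTrigger" class="org.alfresco.util.CronTriggerBean">
  <property name="jobDetail" ref="webProjectDeploymentJobDetail" />
  <property name="scheduler" ref="schedulerFactory" />
  <!-- eg. check for and deploy the latest snapshot every 15 minutes -->
  <property name="cronExpression" value="0 0/15 * * * ?" />
</bean>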

As always, if you have any questions or comments, please feel free to reply here.  However, I’d request that any bug reports or enhancement requests be raised in the issue tracker in the Google Code project – they are far easier to monitor there than in the comments on this post.


¹ Note that this creates upgrade risk, since that code could change in future versions of Alfresco. Given I work with the Alfresco code day-to-day, I’m in a better position to detect when such changes have occurred, but if you’re doing something like this yourself I would encourage extra diligence in monitoring changes to the original code, to ensure your extension doesn’t break unexpectedly following an upgrade.

The Case for Killing “WCM”

As if the gaudy Christmas lights, crass inflatable Santas and disturbing illuminated mechanical deer weren’t enough, CMS Watch have loudly proclaimed the start of the silly season with their annual prognostication on the state of CMS for the coming year.

This has generated a range of responses from the usual suspects, but the response that really caught my eye was Jon Marks’ “Visions of Jon: WCM is for losers”.

Considering myself a “WCM guy”, I took some umbrage at being called a loser (even by someone of Jon’s pedigree!), but after digesting his proposal (along with a “venti” serving of pre-season, 100-proof egg nog to help calm the nerves) the idea is beginning to grow on me. That’s the idea that WCM is a nonsense term – the jury is still out on whether I’m a loser or not!

From one of Jon’s comments:

I think the VCM and Drupal are fundamentally different, and neither are an ECM system.

This is a specific example of a general pattern I’ve observed for a while now. Jon continues:

The problem we have at the moment is that both of them are called WCM systems. … The fact that we have to put them both into the same WCM bucket kills me.

This really struck a chord with me, and had me rethinking my previous stance that WCM is a single product category with 2 major subdivisions. Perhaps the problem is deeper than that, and CPSes and PMSes are so different that there’s little justification for grouping them together into a single “WCM” bucket? If so, we’ve arrived at the same conclusion as Jon: WCM is a meaningless term and deserves to be deprecated.

To start undoing the 15 years of mind share that the term “WCM” has enjoyed, it’s time to start thinking about new terminology that better describes these two functional categories. For several years I’ve been throwing around the terms “Content Production System” (CPS) and “Presentation Management System” (PMS), and in their COPE strategy NPR uses the terms “Content Management System” (CMS) and “Web Publishing Tool” (WPT).

What terms do you use (or think could / should be used) to describe these two product categories?

“Little Bits of History Repeating”

(with apologies to the Propeller Heads and Shirley Bassey)

Laurence Hart recently posted some reminiscences regarding his formative years in content management, and it got me feeling a little nostalgic about my own introduction to and history with content management.  Allow me to bore you with a rather self-indulgent look back at the last decade or so…

Sun, Surf and Sandstone

For me it all started in late 1996, when I decided to update the 1991 rockclimbing guide to Sydney.  Lacking in publishing experience and having heard from more experienced souls that publishing was more than half the work in preparing such a guide, I decided to update the information, put it online and then consider a hard copy edition at a later date (the classic divide-and-conquer get-bored-and-do-something-else approach).

At the time I was working for one of the (then) Big-5 management consulting firms, and had specialised in BEA (now Oracle) Tuxedo, so the web and its technologies was pretty much foreign territory for me.  I figured this little guidebook project would be a good use case for learning about this newfangled interwebitube thingamajig.

Not having heard of content management (in part because it was a niche industry in those days!) I rolled my own “CMS” in MS Access, and used that to publish the new guidebook as a static HTML site.  This wasn’t just a one-trick pony CMS either – the editor of a rockclimbing guide to the Glasshouse Mountains also picked it up a year or two later, and has been using it to manage his guidebook ever since.  It’s with mixed feelings that I admit that this is one of the longer lived CMS implementations I’ve worked on!

The key takeaway for me from this period was that keeping presentation and content separate is indeed a highly valuable guiding principle, but that it’s also difficult to do without creating a visually repetitive site (which isn’t necessarily a bad thing, but tends to rub marketing and creative folks the wrong way).

The 300lbs Gorilla

Having caught the web bug (and if the truth be known, being completely fed up with developing business applications in C & C++), in 2000 I took a leap of faith and joined Vignette, arguably at about the time the company was at the pinnacle of its success.  To the casual observer it could appear that Vignette was on a steady decline from that point on, but for me personally it was a pretty wild ride – a lot of very smart people with a dizzying array of ideas – many of them brilliant, even more of them completely outlandish and/or impractical in the extreme.

And of course all of it focused on how best to manage and deliver web content, rather than being seen as a slightly perverse hobby that detracted from the “real work” of OLTP, N-tier client-server, data warehouses and the like!

In some ways the dotcom bust and subsequent “dark ages” actually helped Vignette, by bringing a previously missing intensity of focus to operational matters and (mostly) putting paid to the hubris accrued during the heady closing days of the 20th century.

If I can summarise that period in one statement, it would be that relational databases make *terrible* CMSes.  So many of Vignette’s technical flaws (specifically in the StoryServer and VCM product lines) stem directly or indirectly from the architectural decision to implement custom content models directly as relational data models.

Creative Interlude

After a stint in product management, I left Vignette in early 2006 and joined Avenue A | Razorfish – a Web Design Agency.  While only brief, this assignment gave me a new appreciation for the fine art of web design and the highly skilled, creative individuals who choose this profession.

It also reinforced the fact that many Web CMSes are still wrestling with basic plumbing issues (versioning, deployment, performance etc.) and have yet to really tackle the higher level issues of usability and productivity, all while supporting creative freedom.

On a more mundane note, this experience also gave me a marked distaste for docroot management systems – that model was antiquated last millennium and makes no sense in this day and age!

Open Source Comes Calling

While I’d always had an interest in open source (in fact the Sydney climbing guidebook has been published under an open source documentation license – the GNU FDL – since its first edition in 1997), I’d never worked for an open source company before, so when the opportunity presented itself in late 2006, I jumped at the chance to join Alfresco, where I continue to work.

While it’s a little premature for me to be drawing any conclusions from my experiences at Alfresco, there are some patterns that I can clearly identify.  For starters there’s no doubt that open source is a disruptive business model – having a company that spends the majority of its revenue on R&D (rather than on sales commissions) is a huge win for everyone (except career sales executives!).  There’s also something to be said for openly visible source code – the “given enough eyeballs, all bugs are shallow” principle and all that.

In terms of content management, Alfresco comes closest (that I’ve seen) to realising the promise of a blended DM and WCM system (although as with any system there’s always room for improvement).

The Future of CMS Technologies

Julian Wraith recently started a discussion entitled “The future of content management” that has kicked off quite a few interesting responses.

Of those, the one that really grabbed my attention was Justin Cormack’s great response entitled “CMS technology choices”. By strange coincidence it closely echoes (but far more eloquently and in a lot more detail!) a conversation Kevin Cochrane and I had on Twitter at about the same time, and while I almost entirely agree with everything Justin has written, the Twitter conversation does highlight my one fundamental disagreement with the post.  Here’s the transcript of my side of that conversation:

Managing web content is about more than simply supporting the technical constructs the web uses (REST, stateless etc.).

eg. the graph of relationships between the content items making up a site can be an important source of information for authors.

But the web itself has no direct support for graph data structures (beyond humble “pointers”: <a href> tags and the like).

And perhaps as a consequence many (most?) Web CMSes don’t have support for that either. ;-)

IMNSHO the future is: schemaless (ala CouchDB, MongoDB, et al), graph based (ala Neo4J), distributed version control (ala Git).

(in hindsight I should also have mentioned “queryable (ala RDBMS, MongoDB, etc.)”)

To better describe my divergence from Justin’s vision of the future, I believe that management of, and visibility into the “content graph” (the set of links / relationships / associations / dependencies / call-them-what-you-will) is one of the more important features a CMS can provide, particularly for web content management where the link structure (including, but not limited to, the site’s navigation model) is so integral to the consumer’s final experience of the content.

So what “content graph” features, specifically, should a hypothetical CMS provide?

In my opinion a CMS needs to support at least the following operations on the content graph (a minimal API sketch follows the list):

  • Track all links between assets that are under management, in such a way that the content graph can be:
    • bi-directionally traversed ie. the CMS can quickly and efficiently answer questions such as “which assets does asset X refer to?”, “which assets refer to asset X?”
    • used within queries ie. the CMS can quickly and efficiently answer questions such as “show me all content items that are within 3 degrees of separation from asset X, are of type ‘press release’, and were published in the last month by ‘Peter Monks'”
  • Flag any content modifications that “break” the content graph eg. deletion of an asset that is the target of one or more references
    • From a usability perspective our hypothetical CMS would provide the ability for the user requesting the breaking change to automatically “fix” the breakages eg. by correcting the soon-to-be invalid (dangling) links in the source item(s)
  • Support arbitrary metadata on references, preferably using the same metadata modeling language that is used for “real” content assets
  • Support basic validity checking of external links – links that point to assets that are not under management (eg. URIs that point to other web sites)
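
To make the first couple of operations concrete, here’s a minimal sketch of the kind of interface our hypothetical CMS might expose. All of the names here are invented for illustration – AssetRef and Query are placeholders for whatever asset identifier and query abstractions the CMS provides:

import java.util.List;
import java.util.Set;

public interface ContentGraphService
{
    interface AssetRef {}   // placeholder for an asset identifier
    interface Query {}      // placeholder for a metadata query

    // "Which assets does asset X refer to?"
    Set<AssetRef> getOutboundReferences(AssetRef asset);

    // "Which assets refer to asset X?"
    Set<AssetRef> getInboundReferences(AssetRef asset);

    // "Show me all assets within 'degreesOfSeparation' hops of 'asset' that
    // also match 'query'" (eg. type and date-range constraints)
    List<AssetRef> findRelated(AssetRef asset, int degreesOfSeparation, Query query);
}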

Other than linking, I think Justin’s post pretty much nails it.  I’m a big fan of schemaless repositories, having worked extensively with several “schemaed” CMSes that made seemingly simple steps (such as adding or removing a single property from a content type that happened to have instances in existence) a lengthy exercise in frustration.

I’m also a big fan of “structural” versioning (ala SVN, Git, Mercurial etc.), as it’s the only way to properly support rollback in the presence of deletions.  Trying to explain to an irate user that they just deleted not only an asset but also its entire revision history is not something I particularly relish!

Rich query and search facilities are a given – it’s one thing to put content into a CMS, but if you can’t query and search that content, it’s little better than a filesystem.

Replication (as in CouchDB, Git, etc.) is also an inevitable requirement for CMSes – I regularly see requirements for a CMS that can provide efficient access to documents across locations that are widely geographically distributed (including cases where connectivity to some of those locations is low bandwidth and/or intermittent).  Replication (with automatic conflict detection and sophisticated features to assist with the inevitably manual process of conflict resolution) is the only mechanism I’m aware of that can handle these cases.

And in closing, a big thank you to Julian Wraith for initiating this discussion – it’s extremely refreshing to discover other folks who are as passionate and (if I may say) as opinionated about CMS technology as I am!

Including a Static XSD in a Web Form

Since their inception, Alfresco WCM Web Forms have supported an inclusion mechanism based on the standard XML Schema include and import constructs.  Originally this mechanism read the included assets from the Web Project where the user was creating the content, but since v2.2SP3 the preferred mechanism has been to reference a Web Script instead (in fact the legacy mechanism may be deprecated in a future release).

One question that this new approach raises is how to support inclusion of static XSDs, as Web Scripts are inherently dynamic and introduce some unnecessary overhead for the simple static case. The good news is that Alfresco ships with a Web Script that simply reads a file from the repository and returns its contents:

/api/path/content{property}/{store_type}/{store_id}/{path}?a={attach?}

An example usage is:

/api/path/content/workspace/SpacesStore/Company Home/Data Dictionary/Presentation Templates/readme.ftl

Using the Web Script inclusion mechanism for Web Forms, we can use this Web Script to include or import any XSD file stored in the DM repository.  For example, if we have a file called “my-include.xsd” in the “Company Home” space that contains the following content:

<?xml version="1.0"?>
<xs:schema xmlns:xs="http://www.w3.org/2001/XMLSchema"
           xmlns:alf="http://www.alfresco.org/"
           targetNamespace="http://www.alfresco.org/"
           elementFormDefault="qualified">
  <xs:complexType abstract="true" name="IncludedComplexType">
    <xs:sequence>
      <xs:element name="Title"
                  type="xs:normalizedString"
                  minOccurs="1"
                  maxOccurs="1" />
      <xs:element name="Summary"
                  type="xs:string"
                  minOccurs="0"
                  maxOccurs="1" />
      <xs:element name="Keyword"
                  type="xs:normalizedString"
                  minOccurs="0"
                  maxOccurs="unbounded" />
    </xs:sequence>
  </xs:complexType>
</xs:schema>

We could include it into a Web Form XSD using an include statement such as the following:

<?xml version="1.0"?>
<xs:schema xmlns:xs="http://www.w3.org/2001/XMLSchema"
           xmlns:alf="http://www.alfresco.org/"
           targetNamespace="http://www.alfresco.org/"
           elementFormDefault="qualified">
  <xs:include schemaLocation="webscript://api/path/content/workspace/SpacesStore/Company Home/my-include.xsd?ticket={ticket}" />
  <xs:complexType name="MyWebFormType">
    <xs:complexContent>
      <xs:extension base="alf:IncludedComplexType">
        <xs:sequence>
          <xs:element name="Body"
                      type="xs:string"
                      minOccurs="1"
                      maxOccurs="1" />
        </xs:sequence>
      </xs:extension>
    </xs:complexContent>
  </xs:complexType>
  <xs:element name="MyWebForm" type="alf:MyWebFormType" />
</xs:schema>

This is clearly faster and easier than developing a custom Web Script to either emit the XML Schema shown above, or to return the contents of a specific XSD file from the repository!

This approach also provides a solution to another question: how does one neatly package up a Web Form, along with all of its dependencies, ready for deployment to another Alfresco environment?

By storing included XSD files in Company Home > Data Dictionary > Web Forms, we give ourselves the option to package up the entire Web Forms space as an ACP file and deploy that ACP file to any other Alfresco environment, knowing that we’ve captured not only all of the Web Forms in the source environment, but all dependent XSD files as well.

Code Movement vs Content Movement

Seth Gottlieb has written a great post entitled “Code moves forward. Content moves backward.” that, by strange coincidence, echoes an Alfresco KB item authored by Alfresco’s very own Ben Hagan last year.

What’s interesting to me is that there is an alternative world view that asserts that code and content are two sides of the same coin, and hence should be managed the same way in the same management system.  This meme seems particularly strong amongst those who are adherents of the Boiko school of thought, and also those who’ve had significant exposure to certain Web CMS products (that shall remain nameless) that are clearly designed for the blended model, and so indoctrinate users / developers to use a blended model in all cases (whether appropriate or not).

My experience has been that blending code and content management together doesn’t work well in the majority of cases, for two primary reasons:

  1. Typically very different groups are producing the code and the content – often they’re in completely different divisions within the organisation (ie. IT vs business unit) and sometimes are even separate companies (ie. web agency vs client).
  2. The release cycles for code and content are vastly different – code is typically released infrequently (weekly, at best), while the content on any large site is typically changing virtually non-stop.

The net result is that shoehorning both activities together creates unnecessary procedural couplings between groups that are typically poorly positioned (from a communication and coordination perspective) to manage those couplings efficiently.

Anyway, it’s a great post on a very interesting topic, and I’d definitely encourage anyone involved in implementing a Web CMS (whether Alfresco WCM or not) to give it a solid read.

Web CMSes Dissected

It’s no secret that Content Management Systems (CMS) are a pretty heterogeneous bunch of technologies, covering everything from paper document imaging systems through to portal servers through to desktop productivity apps – applying some Tufte to the CMS space demonstrates this heterogeneity pretty clearly. What’s less apparent is that even within the relatively narrow confines of Web CMS (WCMS) technologies, industry definitions are almost as fuzzy.

Interestingly, this confusion is not apparent when looking at WCMS technologies directly.  Over the last decade or so the WCMS market has matured into basically two quite well defined and quite distinct types of application, yet I rarely find this distinction reflected in discussions around WCMS technology.

I refer to these two categories of WCMS as:

  1. Content Production Systems (CPS)
  2. Presentation Management Systems (PMS)

Here are some typical distinguishing characteristics for these two types of system:

Architecture
  • CPS: Separate management and delivery systems
  • PMS: Single monolithic system for both management and delivery

Content Production Capabilities (individual content production workspaces / “sandboxes”, versioning / audit history, workflow / QA, in-context preview prior to launch, site rollback, deployment / publication)
  • CPS: Strong
  • PMS: Weak to none

Content Delivery Capabilities
  • CPS: Weak to none – often API based
  • PMS: Strong

Canned Content Models
  • CPS: Few to none
  • PMS: Extensive, though mostly presentation-centric: sites, navigation / site map, pages, layouts, regions, components / modules / portlets, templates, skins / themes

Support for Custom Content Models
  • CPS: Rich, often based on XML or relational technologies
  • PMS: Typically weak, often limited to simple map / dictionary data structures

Examples
  • CPS: Interwoven TeamSite, Vignette VCM, Documentum Web Publisher
  • PMS: Drupal, Joomla, The Nukes (PHPNuke / DotNetNuke), portal servers, wikis

As can be gleaned from the comparison above, for the most part these types of systems address orthogonal use cases (the content creation / production process vs the content delivery process), which explains why it’s so confusing to directly compare a CPS with a PMS (something I’ve seen numerous times).

Now there’s no reason that a single system couldn’t do both, and in fact some WCMS vendors have product offerings that attempt to do this. The problem is that to date most of these attempts have involved cobbling together what were previously independent applications, resulting in seemingly arbitrary distinctions between content that’s fully managed via the CPS vs content that isn’t.

As an example, one of the products sold by one of the vendors listed above marries a CPS with a portal server, but none of the portal server’s configuration data (pages, layouts, regions, portlets, etc.) is stored in the CPS, so there’s no ability to manage (workflow, review, version, etc.) changes to that data.  To the typical editorial team, this distinction is arbitrary and baffling, and can contribute to adoption problems.

So where does Alfresco WCM fit in all of this?

By now it should be clear that current versions of Alfresco WCM are solidly in the CPS camp – the core functionality is specifically focused on the content production use case, with presentation management left up to the delivery tier (which implementers of Alfresco WCM can implement using whatever technologies they’re comfortable with).

That said, Alfresco has recognised the value of a combined CPS + PMS solution for some time, but up until recently the focus was on implementing the CPS first, since:

  1. it’s arguably easier to add PMS constructs on top of a CPS than it is to retrofit a PMS with CPS style functionality
  2. there are situations where a PMS is already in place (eg. a custom web application), and the requirement is to introduce a WCMS that can integrate with that PMS rather than replacing it – in this case a pure play CPS is an appropriate solution

Earlier this year work began on the PMS side of things, and that’s started to bear fruit in the recent 3.0 release, specifically with the introduction of the Surf platform.  The next step (currently targeting a later release in the 3.x product line) is to introduce a visual site building tool (tentatively called Web Studio) that allows less technical users to visually manipulate the Surf content model (ie. build “sites”, “pages” etc. using a visual editing tool).

The beauty of this approach is that the Surf data model is stored in the (existing) repository, so all of the content production capabilities of the repository (sandboxed content creation / modification, workflow / QA, in-context preview, full revision history of the content set, rollback / roll forward, deployment / publishing, etc.) apply to all changes to a Surf site, regardless of whether it’s a user writing a new press release, a subject matter expert optimising the navigational hierarchy for their section of the site, a web admin re-skinning the entire site or a web developer creating new page templates to add to the library.

Compare this to the product described above, where some changes are made in the WCMS (and can be content managed) and some are made in the portal (and are not managed at all) and I’m sure you’ll see why we’re so excited about both the Surf platform and the upcoming Web Studio tool.

Implementing “DocFlip” for FSRs

In my previous post I discussed how File System Receivers (FSRs) implement deployment transactions on top of non-transactional filesystems.  As discussed in that post, there is a window of time during which an application reading the content can see an inconsistent state; that is, while the FSR is in the middle of the commit phase.  The duration of this window varies based on a number of factors, but in some cases it’s critical to make it as short as possible, and in these cases a technique called “docflip” can help.

I first heard about “docflip” almost 10 years ago, and have seen it in use several times since then.  The basic approach is relatively simple:

  1. Two full copies of the target directory are maintained.
  2. A symlink is used that points to one of these directories.  All applications that are reading content use this symlink exclusively (they are unaware of the two underlying directories).
  3. At any point in time:
    1. One of the directories (the one pointed to by the symlink) is the “live” copy.
    2. The other directory (that is not pointed to by anything) is the “shadow” copy.
  4. A transaction involves:
    1. Writing all of the changes to the shadow copy.
    2. Either committing the transaction, which involves:
      1. Flipping the symlink from the current live directory to the (newly updated) shadow directory, effectively swapping which directory is live and which is the shadow (see the sketch after this list).
      2. Re-running step 4.1 against the (new) shadow directory (the directory that was live up until step 4.2.1) – this can also be achieved by simply rsyncing from the (new) live to the (new) shadow directory, if rerunning the original set of content modifications is too difficult or expensive.
    3. Or rolling back the transaction, which involves replacing the (partially updated) shadow directory with the contents of the current live directory, without touching the symlink at all.
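
The commit step (4.2.1) amounts to atomically repointing a symlink. Here’s a minimal Java NIO sketch of that step, assuming a POSIX filesystem (where rename(2) atomically replaces an existing link); the class and method names are illustrative:

import java.nio.file.Files;
import java.nio.file.Path;
import java.nio.file.StandardCopyOption;

public class DocFlip
{
    // Atomically repoint the "live" symlink at the freshly updated shadow
    // directory, so that readers see either the old or the new content set
    // in full - never a missing or half-written link.
    public static void flip(final Path liveLink, final Path newTarget) throws Exception
    {
        Path tempLink = liveLink.resolveSibling(liveLink.getFileName() + ".tmp");
        Files.deleteIfExists(tempLink);
        Files.createSymbolicLink(tempLink, newTarget);
        // On POSIX filesystems, rename(2) replaces the existing link atomically
        Files.move(tempLink, liveLink, StandardCopyOption.ATOMIC_MOVE);
    }
}

Rollback, by contrast, never touches the symlink at all.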

Note that there are some downsides to this approach, including:

  • It requires two full copies of the target directory, which can be problematic with large content sets.
  • It assumes that applications don’t keep files open for extended periods of time – updates to a file are only visible when that file is (re)opened.
  • It doesn’t work very well on Windows platforms due to Windows’ unfortunate choice of using fully qualified paths for file handles instead of inodes, making it impossible to flip the symlink / junction if any files are currently held open by an application.

Regardless, “docflip” greatly reduces the window of time in which the filesystem is in an inconsistent state – basically to the time it takes to rewrite a symlink.  That said it doesn’t completely eliminate phantom reads, since it’s still possible for an application to read a file prior to a transaction, a transaction commits (flipping the symlink) and then the application re-reads the file a second time post transaction and the file has changed.  However without introducing read transactions (which would require changes to the applications reading the filesystem, along with some kind of transaction coordinator), it’s probably impossible to obtain serialisable isolation on non-transactional filesystems.

So now that we have a technique for minimising the time for changes to commit, how would this be implemented with an Alfresco FSR?

Without enhancing the FSR in any way, the approach I’ve considered involves:

  1. Having 3 copies of the target filesystem – one managed by the FSR, the other two (the live and shadow copies) managed by the custom “docflip” process.  As with vanilla “docflip”, a symlink would point to the currently live copy of the content, and all applications reading the content would read via that symlink.
    • It’s not possible to use the FSR’s own target directory as one of the live / shadow directories, since that would require that the FSR itself be dynamically reconfigured to ensure it always writes to the shadow (which changes with every flip of the symlink).
  2. Configuring a ProgramRunnable that calls a “docflip” shell script.  This shell script:
    1. Replicates the deployed delta from the FSR target directory to the shadow copy.
    2. Commits the transaction by flipping the symlink (ie. swaps the shadow and live copies).
    3. Re-replicates the deployed delta from the FSR target directory to the (new) shadow copy.
  3. Rollback doesn’t need to be considered, since by the time the ProgramRunnable is invoked, the FSR has already committed the deployed content to the target directory.  The only concern would be if step 2.3 fails – that would need to raise a critical administration alert of some kind, since it would require manual intervention to avoid throwing all subsequent deployments into disarray.  Forcibly shutting down the FSR in this case might be justified, just to ensure that no further deployment can occur until the issue is resolved.

Replicating the changes made to the FSR’s target directory to the “docflip” directories (steps 2.1 and 2.3) could be done in a number of ways, including:

  1. Brute force rsync of the entire target directory.
  2. Directed rsync, using the manifest of changes that is sent to the shell script by the ProgramRunnable.
  3. Interpreting the manifest of changes that is sent to the shell script by the ProgramRunnable, and executing equivalent cp / rm / mkdir / rmdir commands.
  4. Implementing the entire “docflip” process in Java instead of a shell script, and directly interpreting the manifest of changes.

These are listed in order from least dev effort / worst performance to most dev effort / best performance.  The “sweet spot” is likely to be a combination of options 2 and 3, where rsync is used for creates / updates, and rm / mkdir / rmdir are used for file deletes and directory operations.  If performance trumps all else, option 4 is worth considering, possibly leveraging Java NIO and/or multi-threading techniques (being careful to preserve the order of operations listed in the manifest that are order-dependent, eg. create directory A, …, …, create file A/B.txt).

So there you have it – a (hopefully enlightening!) exploration of the intricacies of FSR deployment, as well as ways to mitigate some of the potential concerns with the default implementation.