Bulk Import from a Filesystem

The Use Case

In any CMS implementation an almost ubiquitous requirement is to load existing content into the new system. That content may reside in a legacy CMS, on a shared network drive, on individual users’ hard drives or in email, but the requirement is almost always there: to inventory the content that’s out there and bring some or all of it into the CMS with a minimum of effort.

Alfresco provides several mechanisms that can be used to import content.

Alfresco is also fortunate to have SI partners such as Technology Services Group who provide specialised content migration services and tools (their open source OpenMigrate tool has proven to be popular amongst Alfresco implementers).

That said, most of these approaches suffer from one or more of the following limitations:

  • They require the content to be massaged into some other format prior to ingestion
  • Orchestration of the ingestion process is performed externally to Alfresco (i.e. out-of-process), resulting in excessive chattiness between the orchestrator and Alfresco
  • They require development or configuration work
  • They’re more general in nature, and so aren’t as performant as a specialised solution

An Opinionated (but High Performance!) Alternative

For that reason I recently set about implementing a bulk filesystem import tool that focuses on satisfying a single, highly specific use case in the most performant manner possible: to take a set of folders and files on local disk and load them into the repository as quickly and efficiently as possible.

The key assumption that allows this process to be efficient is that the source folders and files must be on disk that is locally accessible to the Alfresco server - typically this will mean a filesystem that is located on a hard drive physically housed in the server Alfresco is running on.  This allows the code to directly stream from disk into the repository, which basically devolves into disk-to-disk streaming - far more efficient than any kind of mechanism that requires network I/O.
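To make the disk-to-disk idea concrete, here is a minimal Python sketch (not the tool's actual code, which runs inside Alfresco) that streams a local file into a target directory in fixed-size chunks. The function name stream_into_store, the store_dir argument and the 1 MB chunk size are all assumptions made for illustration; a plain directory stands in for the repository's content store.

```python
import shutil
from pathlib import Path


def stream_into_store(source: Path, store_dir: Path, chunk_size: int = 1024 * 1024) -> Path:
    """Copy one local file into a (hypothetical) content store directory.

    The file is read and written in chunks of chunk_size bytes, so memory
    use stays flat regardless of how large the source file is.
    """
    store_dir.mkdir(parents=True, exist_ok=True)
    target = store_dir / source.name
    with source.open("rb") as src, target.open("wb") as dst:
        # copyfileobj loops over read(chunk_size)/write() until EOF
        shutil.copyfileobj(src, dst, chunk_size)
    return target
```

Because both ends are local files, the only cost is reading from one disk location and writing to another; no bytes cross the network.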

How those folders and files got onto the local disk is left as an exercise for the reader, but most OSes provide efficient mechanisms for transferring files across a network (rsync and robocopy, for example).  Alternatively it’s also possible to mount a remote filesystem using an OS-native mechanism (CIFS, NFS, GFS and the like), although doing so reintroduces network I/O overhead.

Another key differentiator of this solution is that all of the logic for ingestion executes in-process within Alfresco.  This completely eliminates expensive network RPCs while ingestion is occurring, and also provides fine-grained control of various expensive operations (such as transaction commits / rollbacks).

Which leads into another advantage of this solution: as with most transactional systems, there are some general strategies that should be followed when writing large amounts of data into the Alfresco repository:

  1. Break up large volumes of writes into multiple batches - long-running transactions are problematic for most transactional systems (including Alfresco).
  2. Avoid updating the same objects from different concurrent transactions.  In the case of Alfresco, this is particularly noticeable when writing content into the same folder, as those writes cause updates to the parent folder’s modification timestamp.

The bulk filesystem import tool implements both of these strategies (something that is not easily accomplished when ingestion is coordinated by a separate process).  It batches the source content by folder, using a separate transaction per folder, and it also breaks up any folder containing more than a specific number of files (1,000 by default) into multiple transactions.  It also creates all of the children of a given folder (both files and sub-folders) as part of the same transaction, so that indirect updates to the parent folder occur from that single transaction.
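The batching rule described above can be sketched roughly as follows. This is an illustrative Python approximation, not the tool's implementation: plan_batches and DEFAULT_BATCH_SIZE are hypothetical names, and each yielded batch merely stands in for one repository transaction.

```python
import os
from typing import Iterator, List, Tuple

# 1,000 is the default quoted in the post; the constant name is hypothetical.
DEFAULT_BATCH_SIZE = 1000


def plan_batches(source_root: str,
                 batch_size: int = DEFAULT_BATCH_SIZE) -> Iterator[Tuple[str, List[str]]]:
    """Yield (folder, children) pairs, one pair per intended transaction.

    All children of a folder (sub-folders and files alike) are grouped
    together, so writes that touch the same parent never come from two
    batches running at once. Folders with more than batch_size children
    are split into several consecutive batches.
    """
    for folder, subfolders, files in os.walk(source_root):
        children = sorted(subfolders) + sorted(files)
        for start in range(0, len(children), batch_size):
            yield folder, children[start:start + batch_size]
```

Each (folder, children) pair would then be written in its own transaction, so under the default setting a folder with 2,500 children would be split into three batches (1,000, 1,000 and 500).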

But What Does this Mean in Real Life?

The benefit of this approach was demonstrated recently when an Alfresco implementation had a bulk ingestion process that regularly loaded large numbers (1,000s) of large image files (several MBs per file) into the repository via CIFS.  In one test, it took approximately an hour to load 1,500 files into the repository via CIFS.  In contrast the bulk filesystem import tool took less than 5 minutes to ingest the same content set.

Now clearly this ignores the time it took to copy the 1,500 files onto the Alfresco server’s hard drive prior to running the bulk filesystem import tool, but in this case it was possible to modify the sourcing process so that it dropped the content directly onto the Alfresco server’s hard drive, providing a substantial (order of magnitude) overall saving.
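As a rough back-of-the-envelope check on those figures, the short Python calculation below treats "approximately an hour" as 60 minutes and "less than 5 minutes" as exactly 5 minutes, so the tool's numbers are lower bounds. Those roundings are assumptions, not additional measurements.

```python
FILES = 1500
CIFS_MINUTES = 60   # "approximately an hour" (rounded, an assumption)
BULK_MINUTES = 5    # "less than 5 minutes", so this is an upper bound

cifs_rate = FILES / CIFS_MINUTES        # files per minute via CIFS
bulk_rate = FILES / BULK_MINUTES        # lower bound on files per minute via the tool
speedup = CIFS_MINUTES / BULK_MINUTES   # lower bound on the speedup

print(f"CIFS: ~{cifs_rate:.0f} files/min")
print(f"Bulk import: at least {bulk_rate:.0f} files/min")
print(f"Speedup: at least {speedup:.0f}x")
```

With these roundings the load itself is at least 12 times faster, which is consistent with the order-of-magnitude saving once the files are dropped directly onto the server's disk.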

What Doesn’t it Do (Yet)?

Despite already being in use in production, this tool is not what I would consider complete.  The issue tracker in the Google Code project has details on the functionality that’s currently missing; the most notable gap being the lack of support for population of metadata (folders are created as cm:folder and files are created as cm:content). [EDIT] v0.5 adds a first cut at metadata import functionality.  The “user experience” (I hesitate to call it that) is also very rough and could easily be substantially improved. [EDIT] v0.4 added several UI Web Scripts that significantly improve the usability of the tool (at least for the target audience: Alfresco developers and administrators).

That said, the core logic is sound, and has been in production use for some time.  You may find that it’s worth investigating even in its currently rough state.

[POST EDIT] This tool seems to have attracted quite a bit of interest amongst the Alfresco implementer community. I’m chuffed that that’s the case and would ask that any requests you have be logged via the issue tracker in Google Code, so that I can keep track of all of the great ideas that I’ve received. Thanks!

25 Responses to “Bulk Import from a Filesystem”

  1. Jan Pfitzner Says:

    hi,
    just want to add that fme AG (an Alfresco Partner in Germany and my prior employer) offers a migration tool called migration-center.
    This was developed for high speed & high load migration from filesystem to Documentum. It is used by some well known companies.
    migration-center will also be able to talk with Alfresco; you’ll simply have to implement a specific importer-Interface. If you want to import from another import source (e.g. another ECM repo) you can do the same by implementing a specific scanner-Interface.
    cheers, jan

  2. John Meewes Says:

    Peter -

    Thank you for your efforts bringing this key piece of functionality to the community. As a vendor tasked with CMS implementation and regularly loading hundreds of thousands of files onto client installations, one of our challenges with Alfresco has been a reliable supported interface for importing pre-indexed scanned documents. We look forward to working with your new tool.

    Best,

    John

  3. Peter Monks Says:

    John,

    Good to hear! I’m very keen to hear of your experiences with the tool. If you have ideas for improvement or (heaven forbid! ;-) ) run into bugs, please don’t hesitate to use the issue tracker in the Google Code project to track those.

    Cheers,
    Peter

  4. Rami Says:

    I developed a tool to upload documents to Alfresco with their metadata and tried to make the interface as simple as possible. It will generate an ACP that can be imported into Alfresco automatically or manually.

    http://forge.alfresco.com/projects/acpgenerator/

  5. Peter Monks Says:

    Rami, the issue with ACPs is that they’re imported in a single transaction, so if the content set is large that approach will run afoul of the various issues with long-running transactions.

    The ACP approach also requires that the content is copied three times:

    1. from disk into the ACP file
    2. the ACP file itself is transferred (copied) from disk into the repository (which may occur over the network, introducing network I/O latencies into the process as well)
    3. from the ACP file into the content store

    The bulk filesystem import tool only incurs one of these copy costs - copying the files from disk into the content store (which is the bare minimum that Alfresco requires).

    Still, for smaller content sets ACP files work just fine, and as you point out they have support for importing metadata today (which, at the time of writing, the bulk filesystem importer still lacks).

  6. Keith Veleba Says:

    If you bulk load into a space that has content rules applied, what happens? Will the rules still fire? If they do, is there a way to NOT use the metadata importers at all, and let the rules handle everything?

    All in all, this is the solution to my prayers. I’ve been trying to load 300,000 files to a repository via CIFS and Alfresco really hates that!

  7. Peter Monks Says:

    Keith, yes rules will still fire. In fact that’s the reason the original customer who sponsored this development didn’t require metadata loading - they already had rules configured that synthesised their metadata. It sounds like your use case (replacing CIFS + rules with metadata importer + rules) is identical to their case, so you should be in good shape.

    The metadata loading functionality is optional - you can control via Spring configuration which (if any) of the metadata loader implementations are used. As of v0.5 the “basic” and “properties file” metadata loaders are configured by default, but unless you create metadata files on disk alongside your original content, the property file metadata loading logic won’t take effect.

    The “basic” metadata loader is required however, as it’s responsible for correctly setting the type (cm:content vs cm:folder) of each node as it’s created in the repository, as well as setting the cm:name and cm:title properties to the name of the file on disk. CIFS is doing both of these things too, it’s just that you don’t really see it explicitly (CIFS makes the repository look like a filesystem, but under the covers it’s actually populating some standard metadata properties, such as type, cm:name, cm:title, etc.).

    Anyway, I’m very keen to hear about your experiences with the tool - please keep us apprised of your progress!

  8. Keith Veleba Says:

    Hi Peter,

    Thanks for the info. I installed the tool and tested it out. My initial test, leaving it configured as default, imported my file structure, along with filenames but all of my files were 0K length and were unable to be retrieved from the repository.

    Any thoughts? I turned on logging, but all I’m seeing are the “Ingesting..” statements and the properties file metadata failures as I have no shadow files.

    I changed the permissions on my source directory and files to 777, and no luck so far. So close I can taste it!

  9. Peter Monks Says:

    Keith, I’d suggest raising this in the issue tracker in the Google Code project. The more details you can provide (environment - OS, DB, Java; log file output; filesystem ownership information on the source content vs the user Alfresco is running as; the Alfresco user you’re running the Web Scripts as, etc. etc.) the more likely it is that a possible explanation will suggest itself.

  10. Keith Veleba Says:

    Will do. I’ve been unable to get this thing to work at all.

  11. Peter Monks Says:

    Yeah I suspect something is wrong with your installation or environment, given that it’s in successful production use in at least one location and has been evaluated by a dozen or so other installations (that I’m aware of).

    Once you create the issue, I’ll take a look and see if anything obvious jumps out at me.

  12. Peter Monks Says:

    For anyone who’s seeing similar issues, Keith reported his issue here.

  13. Keith Veleba Says:

    Peter,

    Wanted to follow up and thank you for the quick turnaround on addressing the issue I discovered. I’m using the tool to upload approximately 300K files to an instance of Alfresco Community. It’s been working great, and is helping out tremendously. Up until I discovered your tool, I was dreading having to write one myself. Thanks for making this available to the community, and I hope you have a prosperous and Happy New Year!

  14. Mihaela Apostol Says:

    Hi Peter,

    We are in a project where the next urgent step is to bulk import documents into Alfresco, and we thought of using your solution.
    Unfortunately I am not familiar with Maven and this is why I am not sure what to do at the first step of the installation process:
    “1. Build the AMP file using Maven2 (“mvn clean package”)”.

    In the meantime I’ve installed apache-maven-2.2.1 on my computer; I hope this is the utility I need.
    As for your explanations, I understood that I have to “manually edit the pom.xml file in order to point Maven to either the Community Artifact repository (sponsored by SourceSense, one of Alfresco’s European SI partners), or to a Maven repository I have to create that contains the Alfresco Enterprise artifacts”. Please elaborate on this step, even if this is basic routine for you. Thank you in advance!

  15. Peter Monks Says:

    Mihaela, the Google Code page has a pre-built AMP file available for download that obviates the need to build the package yourself.

  16. Luke Delengowski Says:

    I have installed the .amp file, but I am at a loss on how to use the actual functionality. The readme file associated with the project states that the web script is available at /bulk/import/filesystem, yet I am unable to locate it.

    Thanks!

  17. Peter Monks Says:

    Luke, Web Script URLs start with “/alfresco/service”, so the fully qualified URL for the Web Script would be along the lines of:

    http://myalfrescoserver:8080/alfresco/service/index/uri/bulk/import/filesystem

  18. Max Says:

    Hi Peter,

    Thank you for the great tool. I am using version 0.6 on Alfresco Community 3.2r2.

    I seem to have an issue with bulk importing to “/Company Home” - it gives a file not found error. However importing to “/Company Home/foldername” works as expected. Is this by design or is there a solution?

    Also, are you planning (or is there already) the ability to maintain file dates?

    Max

  19. Peter Monks Says:

    Max, I’d suggest raising an issue for the “/Company Home” problem in the issue tracker on the Google code project - it sounds like a bug.

    The issue regarding file dates is already tracked in the issue tracker as issue #4. This is a rather more complex problem, as Java (at least until JSR-203 is implemented - currently slated for JDK 1.7) is unable to read filesystem metadata (including most file dates).

  20. gary hickey Says:

    I would love to use your tool as I need to load thousands of PDFs nightly and Alfresco Share hangs on both CIFS and FTP copies after about 600-1000 documents. I used MMT to include the supplied AMP file into alfresco.war.

    I get the following error in alfresco.log:
    14:32:53,670 ERROR [org.alfresco.web.scripts.AbstractRuntime] Exception from executeScript - redirecting to status template error: 02030011 Not implemented
    org.alfresco.error.AlfrescoRuntimeException: 02030011 Not implemented
    at org.alfresco.repo.security.authentication.DefaultMutableAuthenticationDao.loadUserByUsername(DefaultMutableAuthenticationDao.java:410)

    If I go through the webscripts, I get the following error.

    Web Script Status 500 - Internal Error

    The Web Script /alfresco/service/bulk/import/filesystem/initiate has responded with a status of 500 - Internal Error.

    500 Description: An error inside the HTTP server which prevented it from fulfilling the request.

    Message: 02030011 Not implemented

    Exception: org.alfresco.error.AlfrescoRuntimeException - 02030011 Not implemented

    org.alfresco.repo.security.authentication.DefaultMutableAuthenticationDao.loadUserByUsername(DefaultMutableAuthenticationDao.java:410)
    sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
    sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:39)
    sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:25)
    java.lang.reflect.Method.invoke(Method.java:597)
    org.alfresco.repo.management.subsystems.ChainingSubsystemProxyFactory$1.invoke(ChainingSubsystemProxyFactory.java:95)
    org.springframework.aop.framework.ReflectiveMethodInvocation.proceed(ReflectiveMethodInvocation.java:171)
    org.springframework.aop.framework.JdkDynamicAopProxy.invoke(JdkDynamicAopProxy.java:204)
    $Proxy22.loadUserByUsername(Unknown Source)
    net.sf.acegisecurity.providers.dao.DaoAuthenticationProvider.getUserFromBackend(DaoAuthenticationProvider.java:390)
    net.sf.acegisecurity.providers.dao.DaoAuthenticationProvider.authenticate(DaoAuthenticationProvider.java:225)
    net.sf.acegisecurity.providers.ProviderManager.doAuthentication(ProviderManager.java:159)
    net.sf.acegisecurity.AbstractAuthenticationManager.authenticate(AbstractAuthenticationManager.java:49)
    org.alfresco.repo.security.authentication.AuthenticationComponentImpl.authenticateImpl(AuthenticationComponentImpl.java:81)
    org.alfresco.repo.security.authentication.AbstractAuthenticationComponent.authenticate(AbstractAuthenticationComponent.java:144)
    org.alfresco.repo.security.authentication.AuthenticationServiceImpl.authenticate(AuthenticationServiceImpl.java:129)
    org.alfresco.repo.security.authentication.AbstractChainingAuthenticationService.authenticate(AbstractChainingAuthenticationService.java:166)
    sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
    sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:39)
    sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:25)
    java.lang.reflect.Method.invoke(Method.java:597)
    org.springframework.aop.support.AopUtils.invokeJoinpointUsingReflection(AopUtils.java:304)
    org.springframework.aop.framework.ReflectiveMethodInvocation.invokeJoinpoint(ReflectiveMethodInvocation.java:182)
    org.springframework.aop.framework.ReflectiveMethodInvocation.proceed(ReflectiveMethodInvocation.java:149)
    net.sf.acegisecurity.intercept.method.aopalliance.MethodSecurityInterceptor.invoke(MethodSecurityInterceptor.java:80)
    org.springframework.aop.framework.ReflectiveMethodInvocation.proceed(ReflectiveMethodInvocation.java:171)
    org.alfresco.repo.security.permissions.impl.ExceptionTranslatorMethodInterceptor.invoke(ExceptionTranslatorMethodInterceptor.java:49)
    org.springframework.aop.framework.ReflectiveMethodInvocation.proceed(ReflectiveMethodInvocation.java:171)
    org.alfresco.repo.audit.AuditComponentImpl.audit(AuditComponentImpl.java:275)
    org.alfresco.repo.audit.AuditMethodInterceptor.invoke(AuditMethodInterceptor.java:69)
    org.springframework.aop.framework.ReflectiveMethodInvocation.proceed(ReflectiveMethodInvocation.java:171)
    org.springframework.transaction.interceptor.TransactionInterceptor.invoke(TransactionInterceptor.java:106)
    org.springframework.aop.framework.ReflectiveMethodInvocation.proceed(ReflectiveMethodInvocation.java:171)
    org.springframework.aop.framework.JdkDynamicAopProxy.invoke(JdkDynamicAopProxy.java:204)
    $Proxy26.authenticate(Unknown Source)
    org.alfresco.repo.web.scripts.servlet.BasicHttpAuthenticatorFactory$BasicHttpAuthenticator.authenticate(BasicHttpAuthenticatorFactory.java:187)
    org.alfresco.repo.web.scripts.RepositoryContainer.executeScript(RepositoryContainer.java:280)
    org.alfresco.web.scripts.AbstractRuntime.executeScript(AbstractRuntime.java:262)
    org.alfresco.web.scripts.AbstractRuntime.executeScript(AbstractRuntime.java:139)
    org.alfresco.web.scripts.servlet.WebScriptServlet.service(WebScriptServlet.java:122)
    javax.servlet.http.HttpServlet.service(HttpServlet.java:717)
    org.apache.catalina.core.ApplicationFilterChain.internalDoFilter(ApplicationFilterChain.java:290)
    org.apache.catalina.core.ApplicationFilterChain.doFilter(ApplicationFilterChain.java:206)
    org.apache.catalina.core.StandardWrapperValve.invoke(StandardWrapperValve.java:233)
    org.apache.catalina.core.StandardContextValve.invoke(StandardContextValve.java:191)
    org.apache.catalina.core.StandardHostValve.invoke(StandardHostValve.java:128)
    org.apache.catalina.valves.ErrorReportValve.invoke(ErrorReportValve.java:102)
    org.apache.catalina.core.StandardEngineValve.invoke(StandardEngineValve.java:109)
    org.apache.catalina.connector.CoyoteAdapter.service(CoyoteAdapter.java:286)
    org.apache.coyote.http11.Http11AprProcessor.process(Http11AprProcessor.java:857)
    org.apache.coyote.http11.Http11AprProtocol$Http11ConnectionHandler.process(Http11AprProtocol.java:565)
    org.apache.tomcat.util.net.AprEndpoint$Worker.run(AprEndpoint.java:1509)
    java.lang.Thread.run(Thread.java:619)

    I’m not sure if I have not turned something on or if it is a security error. Any help would be appreciated.

    Gary

  21. Peter Monks Says:

    Gary, can I suggest you raise this in the issue tracker in the Google Code project? Thanks!

  22. Frank Gruska Says:

    I was able to run the script and import 100s of documents without any problem. However, as a next step I have to associate some custom metadata. I tried to follow the instructions in the readme file but so far have had no success. Can you please provide an example of a metadata/properties file which also includes a custom content type and properties?

    Thanks, Frank

  23. Peter Monks Says:

    Frank, the readme file goes into some detail on how and where to put the metadata properties files, and there’s a simple metadata properties file example about halfway down.

  24. Frank Gruska Says:

    Hello Peter,
    Many thanks. I got it to work. However, could it be that dates (e.g. 2012-05-23) are not supported?

  25. Peter Monks Says:

    Frank, currently the code relies on Alfresco to convert string values in the properties files into their correct data type in the repository. I can’t recall exactly which implicit data type conversions Alfresco supports natively, but there is a chance they are quite limited and don’t extend to dates or date/times.

    Regardless, this has been raised as a task in the issue tracker in the Google code project - please feel free to look into this further if you have the time and interest, as I’m not sure when I will next have an opportunity to investigate it.


© 2009 Alfresco Software, Ltd, All Rights Reserved