Bulk Import from a Filesystem

The Use Case

In any CMS implementation, an almost ubiquitous requirement is to load existing content into the new system. That content may reside in a legacy CMS, on a shared network drive, on individual users’ hard drives or in email, but the requirement is almost always there – to inventory the content that’s out there and bring some or all of it into the CMS with a minimum of effort.

Alfresco provides several mechanisms that can be used to import content, including:

Alfresco is also fortunate to have SI partners such as Technology Services Group who provide specialised content migration services and tools (their open source OpenMigrate tool has proven to be popular amongst Alfresco implementers).

That said, most of these approaches suffer from one or more of the following limitations:

  • They require the content to be massaged into some other format prior to ingestion
  • Orchestration of the ingestion process is performed externally (i.e. out-of-process) to Alfresco, resulting in excessive chattiness between the orchestrator and Alfresco
  • They require development or configuration work
  • They’re more general in nature, and so aren’t as performant as a specialised solution

An Opinionated (but High Performance!) Alternative

For that reason I recently set about implementing a bulk filesystem import tool that focuses on satisfying a single, highly specific use case in the most performant manner possible: to take a set of folders and files on local disk and load them into the repository as quickly and efficiently as possible.

The key assumption that allows this process to be efficient is that the source folders and files must be on disk that is locally accessible to the Alfresco server – typically this will mean a filesystem that is located on a hard drive physically housed in the server Alfresco is running on.  This allows the code to directly stream from disk into the repository, which basically devolves into disk-to-disk streaming – far more efficient than any kind of mechanism that requires network I/O.

How those folders and files got onto the local disk is left as an exercise for the reader, but most OSes provide efficient mechanisms for transferring files across a network (rsync and robocopy, for example).  Alternatively it’s also possible to mount a remote filesystem using an OS-native mechanism (CIFS, NFS, GFS and the like), although doing so reintroduces network I/O overhead.

Another key differentiator of this solution is that all of the logic for ingestion executes in-process within Alfresco.  This completely eliminates expensive network RPCs while ingestion is occurring, and also provides fine grained control of various expensive operations (such as transaction commits / rollbacks).

Which leads into another advantage of this solution: like most transactional systems, there are some general strategies that should be followed when writing large amounts of data into the Alfresco repository:

  1. Break up large volumes of writes into multiple batches – long running transactions are problematic for most transactional systems (including Alfresco).
  2. Avoid updating the same objects from different concurrent transactions.  In the case of Alfresco, this is particularly noticeable when writing content into the same folder, as those writes cause updates to the parent folder’s modification timestamp. [EDIT] In recent versions of Alfresco, the automatic update of a folder’s modification timestamp (cm:modified property) has been disabled by default. It can be turned back on (by setting the property “system.enableTimestampPropagation” to true), but the default is false, so this is likely to have less of an impact on bulk ingestion than I’d originally thought.

The bulk filesystem import tool implements both of these strategies (something that is not easily accomplished when ingestion is coordinated by a separate process).  It batches the source content by folder, using a separate transaction per folder, and it also breaks up any folder containing more than a specific number of files (1,000 by default) into multiple transactions.  It also creates all of the children of a given folder (both files and sub-folders) as part of the same transaction, so that indirect updates to the parent folder occur from that single transaction.
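
As a rough illustration of the second half of that batching strategy, here is a minimal Java sketch that splits the children of a single folder into batches of at most 1,000 entries. The class and method names (FolderBatcher, split) are hypothetical and are not taken from the tool’s source. In the tool each batch would be written inside its own repository transaction; this sketch only simulates that step with a print statement.

```java
import java.util.ArrayList;
import java.util.List;

// Hypothetical helper, for illustration only: not part of the import tool.
final class FolderBatcher {
    // The post states that 1,000 entries per transaction is the tool's default.
    static final int DEFAULT_BATCH_SIZE = 1000;

    // Splits the children of one folder into consecutive batches,
    // each holding at most batchSize entries.
    static <T> List<List<T>> split(List<T> children, int batchSize) {
        if (batchSize <= 0) {
            throw new IllegalArgumentException("batchSize must be positive");
        }
        List<List<T>> batches = new ArrayList<>();
        for (int start = 0; start < children.size(); start += batchSize) {
            int end = Math.min(start + batchSize, children.size());
            batches.add(new ArrayList<>(children.subList(start, end)));
        }
        return batches;
    }

    public static void main(String[] args) {
        List<String> names = new ArrayList<>();
        for (int i = 0; i < 2500; i++) {
            names.add("file-" + i + ".pdf");
        }
        for (List<String> batch : split(names, DEFAULT_BATCH_SIZE)) {
            // Assumption: a real importer would open a transaction here,
            // create the nodes for this batch, then commit.
            System.out.println("would commit " + batch.size() + " entries");
        }
    }
}
```

Keeping every batch within one folder means that any indirect update to the parent folder comes from one transaction at a time, which is the point of the second strategy above.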

But What Does this Mean in Real Life?

The benefit of this approach was demonstrated recently when an Alfresco implementation had a bulk ingestion process that regularly loaded large numbers (1,000s) of large image files (several MBs per file) into the repository via CIFS.  In one test, it took approximately an hour to load 1,500 files into the repository via CIFS.  In contrast the bulk filesystem import tool took less than 5 minutes to ingest the same content set.

Now clearly this ignores the time it took to copy the 1,500 files onto the Alfresco server’s hard drive prior to running the bulk filesystem import tool, but in this case it was possible to modify the sourcing process so that it dropped the content directly onto the Alfresco server’s hard drive, providing a substantial (order of magnitude) overall saving.

What Doesn’t it Do (Yet)?

Despite already being in use in production, this tool is not what I would consider complete.  The issue tracker in the Google Code project has details on the functionality that’s currently missing; the most notable gap being the lack of support for population of metadata (folders are created as cm:folder and files are created as cm:content). [EDIT] v0.5 adds a first cut at metadata import functionality.  The “user experience” (I hesitate to call it that) is also very rough and could easily be substantially improved. [EDIT] v0.4 added several UI Web Scripts that significantly improve the usability of the tool (at least for the target audience: Alfresco developers and administrators).

That said, the core logic is sound, and has been in production use for some time.  You may find that it’s worth investigating even in its currently rough state.

[POST EDIT] This tool seems to have attracted quite a bit of interest amongst the Alfresco implementer community. I’m chuffed that that’s the case and would request that any questions or comments you have be raised on the mailing list.  If you believe you’ve found a bug, or wish to request an enhancement to the tool, the issue tracker is the best place. Thanks!

  • http://alfrescian.org Jan Pfitzner

    hi,
    I just want to add that fme AG (an Alfresco Partner in Germany and my prior employer) offers a migration tool called migration-center.
    This was developed for high speed & high load migration from filesystem to Documentum. It is used by some well known companies.
    migration-center will also be able to talk to Alfresco; you’ll simply have to implement a specific importer-Interface. If you want to import from another import source (e.g. another ECM repo) you can do the same by implementing a specific scanner-Interface.
    cheers, jan

  • http://www.nationalscanning.com John Meewes

    Peter -

    Thank you for your efforts bringing this key piece of functionality to the community. As a vendor tasked with CMS implementation and regularly loading hundreds of thousands of files onto client installations, one of our challenges with Alfresco has been a reliable supported interface for importing pre-indexed scanned documents. We look forward to working with your new tool.

    Best,

    John

  • http://www.sydneyclimbing.com/ Peter Monks

    John,

    Good to hear! I’m very keen to hear of your experiences with the tool. If you have ideas for improvement or (heaven forbid! ;-) ) run into bugs, please don’t hesitate to use the issue tracker in the Google Code project to track those.

    Cheers,
    Peter

  • http://www.vrami.net Rami

    I developed a tool to upload documents to Alfresco with their metadata, and tried to make the interface as simple as possible. It generates an ACP that can be imported into Alfresco automatically or manually.

    http://forge.alfresco.com/projects/acpgenerator/

  • http://www.sydneyclimbing.com/ Peter Monks

    Rami, the issue with ACPs is that they’re imported in a single transaction, so if the content set is large that approach will run afoul of the various issues with long running transactions.

    The ACP approach also requires that the content is copied three times:

    1. from disk into the ACP file
    2. the ACP file itself is transferred (copied) from disk into the repository (which may occur over the network, introducing network I/O latencies into the process as well)
    3. from the ACP file into the content store

    The bulk filesystem import tool only incurs one of these copy costs – copying the files from disk into the content store (which is the bare minimum that Alfresco requires).

    Still, for smaller content sets ACP files work just fine, and as you point out they have support for importing metadata today (which, at the time of writing, the bulk filesystem importer still lacks).

  • http://www.sanduskyregister.com Keith Veleba

    If you bulk load into a space that has content rules applied, what happens? Will the rules still fire? If they do, is there a way to NOT use the metadata importers at all, and let the rules handle everything?

    All in all, this is the solution to my prayers. I’ve been trying to load 300,000 files to a repository via CIFS and Alfresco really hates that!

  • http://www.sydneyclimbing.com/ Peter Monks

    Keith, yes rules will still fire. In fact that’s the reason the original customer who sponsored this development didn’t require metadata loading – they already had rules configured that synthesised their metadata. It sounds like your use case (replacing CIFS + rules with metadata importer + rules) is identical to their case, so you should be in good shape.

    The metadata loading functionality is optional – you can control via Spring configuration which (if any) of the metadata loader implementations are used. As of v0.5 the “basic” and “properties file” metadata loaders are configured by default, but unless you create metadata files on disk alongside your original content, the property file metadata loading logic won’t take effect.

    The “basic” metadata loader is required however, as it’s responsible for correctly setting the type (cm:content vs cm:folder) of each node as it’s created in the repository, as well as setting the cm:name and cm:title properties to the name of the file on disk. CIFS is doing both of these things too, it’s just that you don’t really see it explicitly (CIFS makes the repository look like a filesystem, but under the covers it’s actually populating some standard metadata properties, such as type, cm:name, cm:title, etc.).

    Anyway, I’m very keen to hear about your experiences with the tool – please keep us apprised of your progress!

  • http://www.sanduskyregister.com Keith Veleba

    Hi Peter,

    Thanks for the info. I installed the tool and tested it out. My initial test, leaving it configured as default, imported my file structure, along with filenames but all of my files were 0K length and were unable to be retrieved from the repository.

    Any thoughts? I turned on logging, but all I’m seeing are the “Ingesting..” statements and the properties file metadata failures as I have no shadow files.

    I changed the permissions on my source directory and files to 777, and no luck so far. So close I can taste it!

  • http://www.sydneyclimbing.com/ Peter Monks

    Keith, I’d suggest raising this in the issue tracker in the Google Code project. The more details you can provide (environment – OS, DB, Java; log file output; filesystem ownership information on the source content vs the user Alfresco is running as; the Alfresco user you’re running the Web Scripts as, etc. etc.) the more likely it is that a possible explanation will suggest itself.

  • http://www.sanduskyregister.com Keith Veleba

    Will do. I’ve been unable to get this thing to work at all.

  • http://www.sydneyclimbing.com/ Peter Monks

    Yeah I suspect something is wrong with your installation or environment, given that it’s in successful production use in at least one location and has been evaluated by a dozen or so other installations (that I’m aware of).

    Once you create the issue, I’ll take a look and see if anything obvious jumps out at me.

  • http://www.sydneyclimbing.com/ Peter Monks

    For anyone who’s seeing similar issues, Keith reported his issue here.

  • http://sanduskyregister.com Keith Veleba

    Peter,

    Wanted to follow up and thank you for the quick turnaround on addressing the issue I discovered. I’m using the tool to upload approximately 300K files to an instance of Alfresco Community. It’s been working great, and is helping out tremendously. Up until I discovered your tool, I was dreading having to write one myself. Thanks for making this available to the community, and I hope you have a prosperous and Happy New Year!

  • Mihaela Apostol

    Hi Peter,

    We are working on a project where the next urgent step is to bulk import documents into Alfresco, and we thought of using your solution.
    Unfortunately I am not familiar with Maven and this is why I am not sure what to do at the first step of the installation process:
    “1. Build the AMP file using Maven2 (“mvn clean package”)”.

    In the meantime I’ve installed apache-maven-2.2.1 on my computer; I hope this is the utility I need.
    As for your explanations, I understood that I have to “manually edit the pom.xml file in order to point Maven to either the Community Artifact repository (sponsored by SourceSense, one of Alfresco’s European SI partners), or to a Maven repository I have to create that contains the Alfresco Enterprise artifacts”. Please elaborate on this step, even if this is basic routine for you. Thank you in advance!

  • http://www.sydneyclimbing.com/ Peter Monks

    Mihaela, the Google Code page has a pre-built AMP file available for download that obviates the need to build the package yourself.

  • Luke Delengowski

    I have installed the .amp file, but I am at a loss on how to use the actual functionality. The readme file associated with the project states that the web script is available at /bulk/import/filesystem, yet I am unable to locate it.

    Thanks!

  • http://www.sydneyclimbing.com/ Peter Monks

    Luke, Web Script URLs start with “/alfresco/service”, so the fully qualified URL for the Web Script would be along the lines of:

    http://myalfrescoserver:8080/alfresco/service/bulk/import/filesystem

  • Max

    Hi Peter,

    Thank you for the great tool. I am using version 0.6 on Alfresco Community 3.2r2.

    I seem to have an issue with bulk importing to “/Company Home” – it gives a file not found error. However importing to “/Company Home/foldername” works as expected. Is this by design or is there a solution?

    Also, are you planning (or is there already) the ability to maintain file dates?

    Max

  • http://www.sydneyclimbing.com/ Peter Monks

    Max, I’d suggest raising an issue for the “/Company Home” problem in the issue tracker on the Google code project – it sounds like a bug.

    The issue regarding file dates is already tracked in the issue tracker as issue #4. This is a rather more complex problem, as Java (at least until JSR-203 is implemented – currently slated for JDK 1.7) is unable to read filesystem metadata (including most file dates).
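
For reference, the JSR-203 API mentioned above is the java.nio.file package. The following is a minimal sketch, assuming a JDK that includes that package, of how creation and modification dates can be read from a file. The class name FileDates is hypothetical, and this is not the code used by the tool.

```java
import java.io.IOException;
import java.io.UncheckedIOException;
import java.nio.file.Files;
import java.nio.file.Path;
import java.nio.file.attribute.BasicFileAttributes;
import java.nio.file.attribute.FileTime;

// Hypothetical helper, for illustration only: not part of the import tool.
final class FileDates {
    // Returns { creationTime, lastModifiedTime } for the given file.
    // Note: on filesystems that do not record a creation time, the value
    // returned for it is implementation specific.
    static FileTime[] read(Path file) {
        try {
            BasicFileAttributes attrs =
                Files.readAttributes(file, BasicFileAttributes.class);
            return new FileTime[] { attrs.creationTime(), attrs.lastModifiedTime() };
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
    }

    public static void main(String[] args) throws IOException {
        Path tmp = Files.createTempFile("bulk-import-demo", ".txt");
        try {
            FileTime[] dates = read(tmp);
            System.out.println("created:  " + dates[0]);
            System.out.println("modified: " + dates[1]);
        } finally {
            Files.delete(tmp);
        }
    }
}
```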

  • gary hickey

    I would love to use your tool as I need to load thousands of PDFs nightly and Alfresco share hangs on both CIFS and FTP copies after about 600-1000 documents. I used MMT to include the supplied AMP file into Alfresco.war.

    I get the following error in alfresco.log:
    14:32:53,670 ERROR [org.alfresco.web.scripts.AbstractRuntime] Exception from executeScript – redirecting to status template error: 02030011 Not implemented
    org.alfresco.error.AlfrescoRuntimeException: 02030011 Not implemented
    at org.alfresco.repo.security.authentication.DefaultMutableAuthenticationDao.loadUserByUsername(DefaultMutableAuthenticationDao.java:410)

    If I go through the webscripts, I get the following error.

    Web Script Status 500 – Internal Error

    The Web Script /alfresco/service/bulk/import/filesystem/initiate has responded with a status of 500 – Internal Error.

    500 Description: An error inside the HTTP server which prevented it from fulfilling the request.

    Message: 02030011 Not implemented

    Exception: org.alfresco.error.AlfrescoRuntimeException – 02030011 Not implemented

    org.alfresco.repo.security.authentication.DefaultMutableAuthenticationDao.loadUserByUsername(DefaultMutableAuthenticationDao.java:410)
    sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
    sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:39)
    sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:25)
    java.lang.reflect.Method.invoke(Method.java:597)
    org.alfresco.repo.management.subsystems.ChainingSubsystemProxyFactory$1.invoke(ChainingSubsystemProxyFactory.java:95)
    org.springframework.aop.framework.ReflectiveMethodInvocation.proceed(ReflectiveMethodInvocation.java:171)
    org.springframework.aop.framework.JdkDynamicAopProxy.invoke(JdkDynamicAopProxy.java:204)
    $Proxy22.loadUserByUsername(Unknown Source)
    net.sf.acegisecurity.providers.dao.DaoAuthenticationProvider.getUserFromBackend(DaoAuthenticationProvider.java:390)
    net.sf.acegisecurity.providers.dao.DaoAuthenticationProvider.authenticate(DaoAuthenticationProvider.java:225)
    net.sf.acegisecurity.providers.ProviderManager.doAuthentication(ProviderManager.java:159)
    net.sf.acegisecurity.AbstractAuthenticationManager.authenticate(AbstractAuthenticationManager.java:49)
    org.alfresco.repo.security.authentication.AuthenticationComponentImpl.authenticateImpl(AuthenticationComponentImpl.java:81)
    org.alfresco.repo.security.authentication.AbstractAuthenticationComponent.authenticate(AbstractAuthenticationComponent.java:144)
    org.alfresco.repo.security.authentication.AuthenticationServiceImpl.authenticate(AuthenticationServiceImpl.java:129)
    org.alfresco.repo.security.authentication.AbstractChainingAuthenticationService.authenticate(AbstractChainingAuthenticationService.java:166)
    sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
    sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:39)
    sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:25)
    java.lang.reflect.Method.invoke(Method.java:597)
    org.springframework.aop.support.AopUtils.invokeJoinpointUsingReflection(AopUtils.java:304)
    org.springframework.aop.framework.ReflectiveMethodInvocation.invokeJoinpoint(ReflectiveMethodInvocation.java:182)
    org.springframework.aop.framework.ReflectiveMethodInvocation.proceed(ReflectiveMethodInvocation.java:149)
    net.sf.acegisecurity.intercept.method.aopalliance.MethodSecurityInterceptor.invoke(MethodSecurityInterceptor.java:80)
    org.springframework.aop.framework.ReflectiveMethodInvocation.proceed(ReflectiveMethodInvocation.java:171)
    org.alfresco.repo.security.permissions.impl.ExceptionTranslatorMethodInterceptor.invoke(ExceptionTranslatorMethodInterceptor.java:49)
    org.springframework.aop.framework.ReflectiveMethodInvocation.proceed(ReflectiveMethodInvocation.java:171)
    org.alfresco.repo.audit.AuditComponentImpl.audit(AuditComponentImpl.java:275)
    org.alfresco.repo.audit.AuditMethodInterceptor.invoke(AuditMethodInterceptor.java:69)
    org.springframework.aop.framework.ReflectiveMethodInvocation.proceed(ReflectiveMethodInvocation.java:171)
    org.springframework.transaction.interceptor.TransactionInterceptor.invoke(TransactionInterceptor.java:106)
    org.springframework.aop.framework.ReflectiveMethodInvocation.proceed(ReflectiveMethodInvocation.java:171)
    org.springframework.aop.framework.JdkDynamicAopProxy.invoke(JdkDynamicAopProxy.java:204)
    $Proxy26.authenticate(Unknown Source)
    org.alfresco.repo.web.scripts.servlet.BasicHttpAuthenticatorFactory$BasicHttpAuthenticator.authenticate(BasicHttpAuthenticatorFactory.java:187)
    org.alfresco.repo.web.scripts.RepositoryContainer.executeScript(RepositoryContainer.java:280)
    org.alfresco.web.scripts.AbstractRuntime.executeScript(AbstractRuntime.java:262)
    org.alfresco.web.scripts.AbstractRuntime.executeScript(AbstractRuntime.java:139)
    org.alfresco.web.scripts.servlet.WebScriptServlet.service(WebScriptServlet.java:122)
    javax.servlet.http.HttpServlet.service(HttpServlet.java:717)
    org.apache.catalina.core.ApplicationFilterChain.internalDoFilter(ApplicationFilterChain.java:290)
    org.apache.catalina.core.ApplicationFilterChain.doFilter(ApplicationFilterChain.java:206)
    org.apache.catalina.core.StandardWrapperValve.invoke(StandardWrapperValve.java:233)
    org.apache.catalina.core.StandardContextValve.invoke(StandardContextValve.java:191)
    org.apache.catalina.core.StandardHostValve.invoke(StandardHostValve.java:128)
    org.apache.catalina.valves.ErrorReportValve.invoke(ErrorReportValve.java:102)
    org.apache.catalina.core.StandardEngineValve.invoke(StandardEngineValve.java:109)
    org.apache.catalina.connector.CoyoteAdapter.service(CoyoteAdapter.java:286)
    org.apache.coyote.http11.Http11AprProcessor.process(Http11AprProcessor.java:857)
    org.apache.coyote.http11.Http11AprProtocol$Http11ConnectionHandler.process(Http11AprProtocol.java:565)
    org.apache.tomcat.util.net.AprEndpoint$Worker.run(AprEndpoint.java:1509)
    java.lang.Thread.run(Thread.java:619)

    I’m not sure if I have not turned something on or if it is a security error. Any help would be appreciated.

    Gary

  • http://www.sydneyclimbing.com/ Peter Monks

    Gary, can I suggest you raise this in the issue tracker in the Google Code project? Thanks!

  • Frank Gruska

    I was able to run the script and import 100s of documents without any problem. However, as a next step I have to associate some custom metadata. I tried to follow the instructions in the readme file but so far have had no success. Can you please provide an example of a metadata/properties file which also includes a custom content type and properties?

    Thanks, Frank

  • http://www.sydneyclimbing.com/ Peter Monks

    Frank, the readme file goes into some detail on how and where to put the metadata properties files, and there’s a simple metadata properties file example about halfway down.

  • Frank Gruska

    Hello Peter,
    Many thanks. I got it to work. However, could it be that dates (e.g. 2012-05-23) are not supported?

  • http://www.sydneyclimbing.com/ Peter Monks

    Frank, currently the code relies on Alfresco to convert string values in the properties files into their correct data type in the repository. I can’t recall exactly which implicit data type conversions Alfresco supports natively, but there is a chance they are quite limited and don’t extend to dates or date/times.

    Regardless, this has been raised as a task in the issue tracker in the Google code project – please feel free to look into this further if you have the time and interest, as I’m not sure when I will next have an opportunity to investigate it.

  • Bastiaan

    Is it ready to bulk import into Alfresco 3.3? Can you mention somewhere which versions of Alfresco your tool is usable with? I did not find anything about this.

  • http://www.sydneyclimbing.com/ Peter Monks

    Bastiaan, the tested versions are mentioned in the readme file, although the tool is basically very simple and should work on all 3.x versions of Alfresco. In fact it may even work on 2.x versions of Alfresco, but currently the AMP is configured to only allow installation on versions 3.0 and above as I’ve not tested on any 2.x release.

  • Susan

    Peter,
    I’ve been testing the bulk importer, and it works well except that the “update existing files” option doesn’t seem to do anything. Has anyone else had trouble with this?

  • http://www.sydneyclimbing.com/ Peter Monks

    Susan, can I suggest you raise this in the issue tracker in the Google Code project? It’s far easier to manage / track in there. Thanks!

  • Arthur

    Thanks for this tool Peter!
    I migrated 50 GB of data from a shared drive to Alfresco and it worked perfectly (the process took something around 15 hours).
    The only problem I had was with folders whose names ended with whitespace, so it might be a good idea to trim whitespace from names on creation.
    Great work anyway.

  • http://www.sydneyclimbing.com/ Peter Monks

    Arthur, good to hear! Just out of interest, approximately how many files and folders were in the source content set?

    I’ve raised an issue regarding the whitespace in the issue tracker – it’s issue #33.

  • justin

    Hi i’ve tried this amp and it works great! How can i add custom aspects in the metadata.properties ? Is this possible?

  • http://www.sydneyclimbing.com/ Peter Monks

    justin, the readme describes how to attach aspects (regardless of whether they’re built-in or custom) to the ingested content – see line 81.

  • savic.prvoslav

    Hi, I just used it and it works great. I personally like that when spaces and folders overlap and the spaces have rules on them, it works perfectly. Great job!

  • http://www.sydneyclimbing.com/ Peter Monks

    Thanks Savic! Glad to hear you’ve had success with the tool!

  • Walter Raboch

    Hi Peter,
    I have a fix for issue 4 ‘creation and modification dates’. I would like to share with you.
    If you are interested, just contact me.
    Regards,
    Walter

  • http://www.sydneyclimbing.com/ Peter Monks

    Walter, if you could attach a patchfile to issue #4 in the issue tracker, I’ll give it a review.

    I should also point out that this isn’t actually an issue with the bulk importer, but a bug in Alfresco that was recently fixed (see ALF-2565).

  • Tiur

    Hi Peter,

    Could you compile the latest head version of your source code? We really need the fix for issue #4, but we are still having difficulties compiling it with Maven.

    Regards,

    Tiur

  • http://www.sydneyclimbing.com/ Peter Monks

    Tiur, please review the existing issues in the issue tracker (there are several that are similar), and raise a new issue if appropriate.

  • Tiur

    Hi Peter,

    We’ve raised a new issue : http://code.google.com/p/alfresco-bulk-filesystem-import/issues/detail?id=55

    Could you kindly provide the AMP file for Alfresco 3.4? We are willing to make a donation to the project if you do this. We need this functionality ASAP, so it would be very helpful if you could help us.

    Regards,

    Tiur

  • http://www.sydneyclimbing.com/ Peter Monks

    Tiur, I’ve updated issue #55 with the current state of play. Unfortunately the issue you’re running into relates to the Community Maven repository (which I am in no way involved in supporting or maintaining) rather than the Bulk Filesystem Import Tool, so your best bet is to chase it up with them separately.

    FWIW I’ve also reached out to them internally to try to find out what’s going on, but you should chase it up directly as well as I rarely use that repository.

  • Susan

    Is it possible to use the bulk importer to load comments along with other metadata? Thanks.

  • chiru

    Hi Peter,

    This is a fantastic tool. I was able to upload around 130GB of data in 2 – 3 hours…
    I was surprised, since even over the intranet the transfer rate is 7Mbps… I have had no problems so far when using it with Alfresco and Share.
    But when it comes to Open Office 3.2.0 with the Oracle connector, I was not able to access the data which resides in the folder.
    Sorry for raising this issue here, but I want to know what the problem is.

  • http://www.sydneyclimbing.com/ Peter Monks

    Susan, it depends how comments are defined in the content model.

    If they’re a property of the node, then yes they can be loaded (although note that multi-valued properties are not yet supported – see issue #20).

    If comments are stored as sub-nodes of the file then currently there’s no way of loading that structure, since filesystems don’t typically support files that are also folders (unlike Alfresco, which does support that structure).

  • http://www.sydneyclimbing.com/ Peter Monks

    chiru, can you describe the problem in more detail? It doesn’t sound like the issue you’re seeing is related to the import tool, although it’s a bit hard to tell from your description.

  • Antonio

    Hi Peter,

    I have been checking out your import tool and it looks good, but I have a question: is it possible at this moment to set a document’s categories?

    Thanks

  • http://www.sydneyclimbing.com/ Peter Monks

    Antonio, it’s possible to set a single category (issue #19 in the issue tracker describes how this is done), but it’s not yet possible to set multi-valued properties – that’s issue #20. Setting a single category isn’t very useful, obviously, but I have not had time to look at issue #20.

  • znikolovski

    Hi Peter,

    We’re in the process of upgrading Alfresco to 3.3.4 but they want to do a gradual upgrade (and decided to use Alfresco 3.2.* as an interim step). We had the bulk import working before with a custom aspect defined and when we tried to run it on 3.2 it failed with the following exception:

    namespace prefix [prefix] is not mapped to a namespace uri

    I should mention that applying the aspect manually to a file didn’t produce any errors.

    Any idea why the bulk import is complaining?

    My metadata file is of the following structure:

    type=cm:content
    aspects=sensis:prod
    cm\:title=09000001800699fe.pdf
    cm\:description=Contract
    sensis\:advertiserId=478283400
    sensis\:campaignCode=N00Y
    sensis\:generationDate=2005-05-31T12:00:00.000+10:00
    sensis\:issue=26
    cm\:storeName=storeA

    Thanks in advance for any suggestions you might have.
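
A note on the backslashes in the metadata file above: under standard java.util.Properties parsing rules, an unescaped colon in a key acts as a key/value separator, which is why prefixed property names are written as cm\:title rather than cm:title. The sketch below shows the difference. It assumes the tool reads these files with java.util.Properties (the escaping in the example suggests so, but that is an assumption), and the class name MetadataPropertiesDemo is hypothetical.

```java
import java.io.IOException;
import java.io.StringReader;
import java.io.UncheckedIOException;
import java.util.Properties;

// Hypothetical demo class, for illustration only: not part of the import tool.
final class MetadataPropertiesDemo {
    // Parses properties-file text using the standard java.util.Properties rules.
    static Properties parse(String text) {
        Properties props = new Properties();
        try {
            props.load(new StringReader(text));
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
        return props;
    }

    public static void main(String[] args) {
        String text =
              "type=cm:content\n"                    // colon in a value needs no escape
            + "cm\\:title=09000001800699fe.pdf\n"    // escaped colon: key is "cm:title"
            + "cm:description=Contract\n";           // unescaped colon: key is just "cm"
        Properties props = parse(text);
        System.out.println("type     -> " + props.getProperty("type"));
        System.out.println("cm:title -> " + props.getProperty("cm:title"));
        System.out.println("cm       -> " + props.getProperty("cm"));
    }
}
```

With the unescaped form, the parser stores a key named "cm" with the value "description=Contract", so no cm:description property would be found.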

    • http://www.sydneyclimbing.com/ Peter Monks

      Zoran, that error usually indicates that the content model containing the given namespace (almost certainly “sensis” in this case) isn’t registered with the repository. That said, if you’re able to attach the “sensis:prod” aspect manually via the UI (Explorer or Share) then that pretty much rules that possibility out.

      Would you mind raising this in the issue tracker in the Google Code project, so that I can track it properly? The above detail is good, but what would be even better would be the files you’re using to register the content model with the repository (both the model file itself and the Spring application context that loads it), or a cut-down equivalent that also demonstrates the issue. Thanks!

  • polgarine

    Hi

    I just wanted to say that, thanks to this tool, we were able to upload 4.5 million documents in an Alfresco repository in only 4 days.
    This would have taken weeks with webdav or ftp.

    Thank you very much for this awesome tool

  • http://www.sydneyclimbing.com/ Peter Monks

    polgarine, that’s great to hear – thanks for commenting! Just out of interest, roughly how large (in MB / GB) were the documents in total?

  • Leo

    Thanks a lot for this tool. Suppressing the blank in your readme, line
    aspects=cm:versionable, custom:myAspect
    or adapting your code near
    ((String)metadataProperties.get(key)).split(",")
    might avoid some trouble.
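
To make the suggestion above concrete: String.split(",") keeps the blank that follows each comma, so the second aspect name would start with a space unless each token is trimmed. The following is a minimal sketch of a trimming variant. The class and method names (AspectListParser, parse) are hypothetical and this is not the tool’s actual code.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;

// Hypothetical helper, for illustration only: not part of the import tool.
final class AspectListParser {
    // Splits a comma-delimited list of names, trimming surrounding
    // whitespace and skipping empty entries.
    static List<String> parse(String value) {
        List<String> names = new ArrayList<>();
        for (String token : value.split(",")) {
            String name = token.trim();
            if (!name.isEmpty()) {
                names.add(name);
            }
        }
        return names;
    }

    public static void main(String[] args) {
        String raw = "cm:versionable, custom:myAspect";
        // Plain split: the second element keeps its leading space.
        System.out.println(Arrays.asList(raw.split(",")));
        // Trimmed variant: both names are clean.
        System.out.println(parse(raw));
    }
}
```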

  • http://www.sydneyclimbing.com/ Peter Monks

    Leo, would you mind raising this in the issue tracker in the Google Code project, so that I can track it properly?

    I’d be particularly interested in knowing precisely what the behaviour is when the list of aspect names includes spaces (i.e. is an exception thrown, does the aspect fail to get applied, do incorrect aspects get applied, etc.).

  • Steve

    Can you explain how the webscript you wrote gets access to the content that lives on the file system? When I read the wiki regarding web scripts, an HTML input form is the only way shown to access the file content, that is, it uploads the content via the form and then the web script has access to the form fields and the file content. In a bulk file scenario where there isn’t a UI, it’s not obvious how to gain access to file content. Can you enlighten us? Thanks in advance.

  • http://www.sydneyclimbing.com/ Peter Monks

    Steve, the key is that the Web Script is reading the source content off the server’s filesystem, not the client that initiated the import. This is part of the reason that this is an administrator-only tool for now – it requires that the content be copied to a disk that’s mounted to the server hosting Alfresco, prior to the tool being run (typically end-users wouldn’t have direct filesystem access to the server(s) Alfresco is running on, so wouldn’t be able to use this tool).

    This isn’t a problem for the tool’s primary use case of course, which is around large scale content migration / ingestion. It’s unlikely that an end-user would be able to accomplish this unassisted anyway, even if the tool supported it.

  • Diane

    Does this work on 64 bit Linux? I ran apply_amps.sh and it appears the WARs (alfresco & share) were both corrupted; after replacing them with the backups, Tomcat now boots up cleanly again. Before that it crashed at startup :-(

  • http://www.sydneyclimbing.com/ Peter Monks

    Diane, the tool is developed on 64bit Mac OSX, which (from an Alfresco / Tomcat perspective) is basically the same as 64bit Linux. Did you try applying the AMP again? My first suspicion would be that this issue was caused by a one-time glitch in the apply_amps process.

  • Fred Grafe

    Is the tool compatible with the latest version of Alfresco (3.4d)? Getting the following exception in the tomcat logs:
    Module ‘org.alfresco.extension.alfresco-bulk-filesystem-import’ version 0.11 is incompatible with the current repository version 3.4.0.
    The repository version required must be in range [3.3.0 : 3.3.99].
    at org.alfresco.error.AlfrescoRuntimeException.create(AlfrescoRuntimeException.java:46)
    at org.alfresco.repo.module.ModuleComponentHelper.startModule(ModuleComponentHelper.java:509)
    at org.alfresco.repo.module.ModuleComponentHelper.access$400(ModuleComponentHelper.java:57)
    at org.alfresco.repo.module.ModuleComponentHelper$1$1.execute(ModuleComponentHelper.java:239)
    at org.alfresco.repo.transaction.RetryingTransactionHelper.doInTransaction(RetryingTransactionHelper.java:381)
    at org.alfresco.repo.transaction.RetryingTransactionHelper.doInTransaction(RetryingTransactionHelper.java:272)
    at org.alfresco.repo.module.ModuleComponentHelper$1.doWork(ModuleComponentHelper.java:260)
    … 54 more

  • http://www.sydneyclimbing.com/ Peter Monks

    Fred, v0.11 of the bulk filesystem import tool was developed and tested against Alfresco v3.3 (the then latest release of Alfresco). The module (AMP file) was therefore “pinned” to version 3.3, resulting in the above error when installed on Alfresco v3.4 (or indeed any version other than 3.3.x).

    You can manually override the supported version of the AMP by editing the module.properties within the AMP file, which will allow the tool to be installed on 3.4, but there’s no guarantee it’ll work. I’m not aware of anything that would prevent it from working, but haven’t verified it myself.

    The next version of the tool will be built and tested against v3.4 but I don’t have an ETA on it, unfortunately.

  • Fred Grafe

    Hi Peter,
    Thanks for getting back to me about the version issue. I updated the configuration and it deployed just fine in Alfresco. I was able to upload a bunch of files successfully as a test. Right now I’m trying to figure out how to get the metadata import to work with our defined model.

    I do have a question: is this tool mainly intended for an individual to use to upload files? Or could it be used by a job scheduler or called from within another application?

  • http://www.sydneyclimbing.com/ Peter Monks

    Fred, the import tool itself is currently exposed as two REST APIs (Web Scripts):
    1. An “initiate” API
    2. A “status” API

    Both of these are invoked via HTTP GET† requests, which can be scripted from an external job scheduler (e.g. cron or at) or called from any external application that is capable of executing an HTTP GET request. In addition, the status API can emit either HTML or XML, allowing external applications to poll the tool and obtain detailed status information.

    The UI Web Script that’s used to manually initiate an import is little more than a convenience layer on top of these two REST APIs, and is not central to the operation of the tool itself.

    † Technically I should have used HTTP POST or HTTP PUT for the “initiate” API, in keeping with REST principles, but my pragmatic experience has been that HTTP GETs are far easier to call (particularly from within the browser, shell scripts etc.) provided minimal data is being passed in the request (as is the case here). There’s an enhancement request on this in the issue tracker.
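
As a rough sketch of how an external application might call the two APIs described above, the following Java (11+) class builds and sends the HTTP GET requests. Several things here are assumptions rather than facts from the tool's documentation: the host and port, the "/status" path, and the use of HTTP Basic authentication. The "/initiate" path and the sourceDirectory and targetPath parameter names are taken from a curl command quoted in a later comment. The class and method names are hypothetical.

```java
import java.net.URI;
import java.net.URLEncoder;
import java.net.http.HttpClient;
import java.net.http.HttpRequest;
import java.net.http.HttpResponse;
import java.nio.charset.StandardCharsets;
import java.util.Base64;

public class BulkImportClient {
    // URLEncoder emits '+' for spaces (form encoding); use %20 instead so the
    // value is unambiguous inside a URL query string.
    public static String encode(String value) {
        return URLEncoder.encode(value, StandardCharsets.UTF_8).replace("+", "%20");
    }

    // Builds the "initiate" URL. Parameter names follow the curl example
    // quoted elsewhere in this thread.
    public static URI initiateUri(String base, String sourceDirectory, String targetPath) {
        return URI.create(base + "/initiate?sourceDirectory=" + encode(sourceDirectory)
                + "&targetPath=" + encode(targetPath));
    }

    // Builds a GET request carrying HTTP Basic credentials (an assumption
    // about how the Alfresco instance is configured).
    public static HttpRequest get(URI uri, String user, String password) {
        String token = Base64.getEncoder()
                .encodeToString((user + ":" + password).getBytes(StandardCharsets.UTF_8));
        return HttpRequest.newBuilder(uri)
                .header("Authorization", "Basic " + token)
                .GET()
                .build();
    }

    // Sends the request and returns the status code followed by the body.
    public static String send(HttpRequest request) throws Exception {
        HttpClient client = HttpClient.newHttpClient();
        HttpResponse<String> response =
                client.send(request, HttpResponse.BodyHandlers.ofString());
        return response.statusCode() + "\n" + response.body();
    }

    public static void main(String[] args) {
        // Assumed base URL; adjust for the actual server.
        String base = "http://localhost:8080/alfresco/service/bulkfsimport";
        URI initiate = initiateUri(base, "/data/to-import", "/Company Home/sites/test");
        // Only print the URLs here; call send(get(uri, user, password)) to execute them.
        System.out.println("GET " + initiate);
        System.out.println("GET " + URI.create(base + "/status")); // "/status" is an assumed path
    }
}
```

A scheduler would typically call the initiate URL once and then poll the status URL at intervals; how completion is signalled in the status output is not covered here.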

  • http://dhartford.blogspot.com dhartford

    Hey all,
    Thanks for the tool, this adds another option for migration that, for larger jobs, will likely be much easier (opposed to planning *days* of active migration with other approaches)!

    One question, though, as I’m reviewing the options discussed at http://forums.alfresco.com/en/viewtopic.php?f=9&t=38889: does this bulk import from the filesystem work with Alfresco’s content store such that the backlog, once put into Alfresco, is pre-separated in the contentstore, or is the entire backlog currently dumped into the current year’s contentstore directory on the filesystem (alf_data/contentstore/2011/*** for example)?

  • http://dhartford.blogspot.com dhartford

    Adding to the above: my comment is for Alfresco 3.4.d Community Edition, so the next version after 0.11 would be helpful for me as well as for Fred Grafe. I’ve added an issue in the tracker.

  • http://www.sydneyclimbing.com/ Peter Monks

    dhartford, currently the tool simply imports the content into Alfresco using whatever contentstore implementation (and therefore storage policy) that Alfresco instance is configured with. So for example if the XAM connector is configured, binaries will be stored on the CAS device using a hashed id rather than the default “timestamp hashbucket directory structure” approach.

    For the use case described in the forum post, I’d suggest that using Content Storage Policies (see also this webinar) is a better approach, as it will provide the archival mechanism you require, independent of the underlying content store implementation. Relying on the internal implementation details of a particular contentstore implementation (such as the timestamp hashbucket behaviour of the filesystem contentstore) is somewhat risky, as Alfresco reserves the right to change those implementation details at a later time.

  • http://dhartford.blogspot.com dhartford

    Wow, http://wiki.alfresco.com/wiki/Content_Store_Selector is exactly what I was looking for — thanks so much Peter!

    That will help address my specific requirement regardless of the backlog import tool used. Just so you have some numbers: my current test using CMIS (admittedly in a single-threaded, serial fashion) was only netting 2 transactions (images)/sec, so I do hope the bulk filesystem import will work with 3.4.d. I haven’t modified the module.properties to try 0.11 on it yet, but will let you know of any findings.

  • http://www.sydneyclimbing.com/ Peter Monks

    dhartford, good to hear! Yeah 2 txns / sec is not great. FWIW I see sustained throughput of 15 – 20 docs / sec using the tool on my 2009 MacBook Pro, using a vanilla Alfresco Enterprise 3.3 install, and in the past Alfresco has been demonstrated to handle sustained throughput of up to around 100 docs / sec (though that was on fairly beefy hardware).

    I expect Alfresco 3.4 will be faster still (removing Hibernate gave the repository a noticeable performance bump across the board), and following the next (versioning) release, I intend to work on a couple of performance-focused enhancements that should also speed the tool up. Issue #56 in particular has the potential to significantly improve the performance of imports.

  • http://dhartford.blogspot.com dhartford

    doh, the Content Store Selector approach complains about no “storeSelectorContentStoreBase” defined…it appears through the forums that this is an enterprise-only feature not available in the Community Edition. Too bad, as this seems like a pretty common scenario to use if people knew about it more.

  • http://www.mq.edu.au Jordi Anguela

    Hi Peter, we are evaluating your tool to import 260,000 records into a Records Management site.

    Peter, have you ever used your tool for that purpose?

    We started with a small sample of 1000 Record Folders with some metadata and it took 24min to ingest them. Not acceptable, we need to improve this speed. When we tried to ingest those Records Folders as normal folders with their metadata into a normal Site it took only 15sec. We would like to get similar performance with the RM site.

    We think that something is slowing down the process in the RM site (auditing, maybe?). We have already tried some of the tuning techniques described in the Alfresco documentation, without any luck.

    Best regards,
    Jordi

  • http://www.sydneyclimbing.com/ Peter Monks

    Jordi, I have not tested the tool against an RM site as the tool doesn’t know (or care) what type of space the target is – it simply writes the source content to wherever you tell it to. In other words, the performance discrepancy you’re seeing is likely due to the repository, its configuration or the environment, rather than the tool itself.

    Have you tried profiling / DB tracing while the import is in progress to try to find out what specifically is taking the extra time?

  • http://www.mq.edu.au Jordi Anguela

    Hi Peter,

    thank you for your reply. We tried using VisualVM to find a bottleneck, without luck. However, we have apparently solved the slow ingestion of records into the RM site. As you know, the File Plan in the RM site has 4 levels (Series, Categories, Record Folders & Records Files). In our first test we had created shadow files ONLY for the Record Folders (skipping the Series and Categories). Once shadow files were also created for the Series & Categories, the tool worked much better: 5000 records in 15min (on my local deployment).

    Another important piece of information that we discovered along the way, and that you could add to the documentation, is that the shadow files must be created specifically in “UTF-8”. The java.io.FileWriter class doesn’t write UTF-8 by default (it uses the platform’s default encoding, which was ISO-8859-1 in our case) and this was causing a NullPointerException during execution.

    Hope this information is useful,
    Regards, Jordi
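
To make the encoding point above concrete, here is a minimal sketch of writing a text file with an explicit UTF-8 charset instead of relying on a writer's default. The class and method names (Utf8ShadowFileWriter, writeUtf8) and the sample content are hypothetical; the format the tool expects for shadow files is not shown here, only the encoding handling.

```java
import java.io.IOException;
import java.io.Writer;
import java.nio.charset.StandardCharsets;
import java.nio.file.Files;
import java.nio.file.Path;

public class Utf8ShadowFileWriter {
    // Writes the given text using UTF-8 explicitly, so the result does not
    // depend on the platform's default charset.
    public static void writeUtf8(Path file, String content) throws IOException {
        try (Writer out = Files.newBufferedWriter(file, StandardCharsets.UTF_8)) {
            out.write(content);
        }
    }

    public static void main(String[] args) throws IOException {
        Path file = Files.createTempFile("shadow-example", ".txt");
        // "\u00e7" is c-cedilla: two bytes in UTF-8, one byte in ISO-8859-1.
        writeUtf8(file, "Rela\u00e7\u00e3o de Lan\u00e7amento");
        System.out.println("Bytes written: " + Files.readAllBytes(file).length);
        Files.delete(file);
    }
}
```

On older Java versions an equivalent approach is to wrap a FileOutputStream in an OutputStreamWriter constructed with the "UTF-8" charset name.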

  • http://www.sydneyclimbing.com/ Peter Monks

    Jordi, I’d be very interested in seeing an example file that causes the NPE, along with the full stack trace for the UTF-8 vs ISO-8859 issue. Would you mind raising an issue in the issue tracker?

  • Zhihai Liu

    Hi Peter,

    I ran into an exception when trying an import with the 1.0 release. The error seemed to happen on multi-valued cm:taggable property in the metadata file. I created a new issue as http://code.google.com/p/alfresco-bulk-filesystem-import/issues/detail?id=88. Later on I found a related issue here
    http://code.google.com/p/alfresco-bulk-filesystem-import/issues/detail?id=57. Do you mind taking a look?

    Thanks.

  • http://www.sydneyclimbing.com/ Peter Monks

    Zhihai, I’ve updated issue #88 [1] with more information, and confirmed that it is indeed a duplicate of issue #57 [2]. Once you correct your metadata files, everything should work as expected.

    [1] http://code.google.com/p/alfresco-bulk-filesystem-import/issues/detail?id=88
    [2] http://code.google.com/p/alfresco-bulk-filesystem-import/issues/detail?id=57

  • http://www.sydneyclimbing.com/ Peter Monks

    I’ve just created a mailing list to help facilitate assistance with and discussion of the tool. Please try to use that resource rather than commenting here, since blog comments aren’t great for that kind of thing. Thanks!

  • http://www.previdenciasocial.gov.br Luiz Candido Borges

    I’m trying to use the Alfresco Bulk Filesystem Import on Community Edition 3.4.d (RHEL platform). After some trouble importing content (the Source Directory must be under /tomcat/bin, which I’m sure is not the standard way), I haven’t managed to import metadata. I mixed the content files with their respective metadata files in the same hierarchical organization as in the repository. As a result, the content is loaded but not the metadata. In fact, the metadata file is loaded into the repository like any other file. Is this the correct way to load metadata? The documentation is very succinct: it shows how to configure the tool but not how to use it.

    Thanks for your help,

    Additional information:

    Custom content model file (it’s working well by using Alfresco Explorer)

    Customizacao para o Laboratorio de Conversao de Midia
    Candido
    2011-09-29
    1.0

    Relacao de Lancamento por Lote
    cm:content

    Microfilmagem
    d:text

    true
    false
    true

    A metadata file:

    cm:digitalizacao01
    1234/1056-1078
    Lote de Lancamento
    Roseli

  • http://www.previdenciasocial.gov.br Luiz Candido Borges

    Peter, I’m sorry for the previous post: the XML sample I sent didn’t come through well on your blog. If you want, I can send the files as attachments by e-mail.

    Best regards,

  • http://www.sydneyclimbing.com/ Peter Monks

    Luiz, would you mind raising this in the mailing list? That’s a much better forum for discussing topics like this one.

  • Pallavi

    Hi,
    I was using your bulk import tool. It works great. I would like to know how we can declare the imported files as records. In one of the books I read, we can give cm:declareRecords in the aspects tag and then give all the mandatory properties, and the file will be declared as a record. But it doesn’t seem to work for me. Can someone help, please?

    Pallavi

  • http://www.sydneyclimbing.com/ Peter Monks

    Pallavi, can I suggest you raise this on the project’s mailing list? That’s a much better forum for discussing topics like this one.

  • oubaid

    hi peter
    I was using your bulk import tool. It works great.
    But I want to run this command to automatically import metadata for the folders’ contents:
    curl -u admin:admin -d "sourceDirectory=/Users/user/Documents/Nouveaudossier/metadata&targetPath=/Company%20Home/sites/test" "http://localhost:8080/alfresco/service/bulkfsimport/initiate"

    but every time, the Web Script /alfresco/service/bulkfsimport/initiate responds with a status of 302.
    Please, I need your help.

    Best regards,

  • http://www.sydneyclimbing.com/ Peter Monks

    oubaid, please raise this on the project’s mailing list.

  • http://www.jeanmicheldavid.com Jean-Michel David

    Hi,

    is there any limitation on the path length? I have some paths that are longer than 256 characters.

    Thanks.

  • http://www.sydneyclimbing.com/ Peter Monks

    Jean-Michel, please raise this on the project’s mailing list.



© 2012 Alfresco Software, Inc. All Rights Reserved.