Bulk Import from a Filesystem
The Use Case
In any CMS implementation an almost ubiquitous requirement is to load existing content into the new system. That content may reside in a legacy CMS, on a shared network drive, on individual users’ hard drives or in email, but the requirement is almost always there – to inventory the content that’s out there and bring some or all of it into the CMS with a minimum of effort.
Alfresco provides several mechanisms that can be used to import content, including:
- Alfresco JLAN Server – allows the repository to be manipulated as if it were a Windows network volume, FTP server, WebDAV repository or NFS volume
- CMIS API
- REST API
- SOAP API
- ACP files
Alfresco is also fortunate to have SI partners such as Technology Services Group who provide specialised content migration services and tools (their open source OpenMigrate tool has proven to be popular amongst Alfresco implementers).
That said, most of these approaches suffer from one or more of the following limitations:
- They require the content to be massaged into some other format prior to ingestion
- Orchestration of the ingestion process is performed externally (i.e. out-of-process) to Alfresco, resulting in excessive chattiness between the orchestrator and Alfresco.
- They require development or configuration work
- They’re more general in nature, and so aren’t as performant as a specialised solution
An Opinionated (but High Performance!) Alternative
For that reason I recently set about implementing a bulk filesystem import tool that focuses on satisfying a single, highly specific use case in the most performant manner possible: to take a set of folders and files on local disk and load them into the repository as quickly and efficiently as possible.
The key assumption that allows this process to be efficient is that the source folders and files must be on disk that is locally accessible to the Alfresco server – typically this will mean a filesystem that is located on a hard drive physically housed in the server Alfresco is running on. This allows the code to directly stream from disk into the repository, which basically devolves into disk-to-disk streaming – far more efficient than any kind of mechanism that requires network I/O.
How those folders and files got onto the local disk is left as an exercise for the reader, but most OSes provide efficient mechanisms for transferring files across a network (rsync and robocopy, for example). Alternatively it’s also possible to mount a remote filesystem using an OS-native mechanism (CIFS, NFS, GFS and the like), although doing so reintroduces network I/O overhead.
Another key differentiator of this solution is that all of the logic for ingestion executes in-process within Alfresco. This completely eliminates expensive network RPCs while ingestion is occurring, and also provides fine grained control of various expensive operations (such as transaction commits / rollbacks).
Which leads into another advantage of this solution: as with most transactional systems, there are some general strategies that should be followed when writing large amounts of data into the Alfresco repository:
- Break up large volumes of writes into multiple batches – long running transactions are problematic for most transactional systems (including Alfresco).
- Avoid updating the same objects from different concurrent transactions. In the case of Alfresco, this is particularly noticeable when writing content into the same folder, as those writes cause updates to the parent folder’s modification timestamp. [EDIT] In recent versions of Alfresco, the automatic update of a folder’s modification timestamp (cm:modified property) has been disabled by default. It can be turned back on (by setting the property “system.enableTimestampPropagation” to true), but the default is false so this is likely to have less of an impact on bulk ingestion than I’d originally thought.
The bulk filesystem import tool implements both of these strategies (something that is not easily accomplished when ingestion is coordinated by a separate process). It batches the source content by folder, using a separate transaction per folder, and it also breaks up any folder containing more than a specific number of files (1,000 by default) into multiple transactions. It also creates all of the children of a given folder (both files and sub-folders) as part of the same transaction, so that indirect updates to the parent folder occur from that single transaction.
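The batching rule described above can be sketched roughly as follows. This is an illustration only, not the tool’s actual code: the class and method names are hypothetical, and each returned batch simply stands in for the work that would be done in one repository transaction.

```java
import java.util.ArrayList;
import java.util.List;

// Hypothetical sketch, not the tool's actual code: split one folder's children
// into batches, each of which would be written in its own transaction.
final class FolderBatcher {
    // The default batch size mentioned in the post.
    static final int DEFAULT_BATCH_SIZE = 1000;

    static <T> List<List<T>> toBatches(List<T> children, int batchSize) {
        if (batchSize < 1) {
            throw new IllegalArgumentException("batchSize must be at least 1");
        }
        List<List<T>> batches = new ArrayList<>();
        for (int start = 0; start < children.size(); start += batchSize) {
            int end = Math.min(start + batchSize, children.size());
            // Copy the slice so each batch is independent of the source list.
            batches.add(new ArrayList<>(children.subList(start, end)));
        }
        return batches;
    }
}
```

With the default batch size, a folder containing 2,500 files would yield three batches (1,000, 1,000 and 500 files), and so three transactions rather than one long-running one.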
But What Does this Mean in Real Life?
The benefit of this approach was demonstrated recently when an Alfresco implementation had a bulk ingestion process that regularly loaded large numbers (1,000s) of large image files (several MBs per file) into the repository via CIFS. In one test, it took approximately an hour to load 1,500 files into the repository via CIFS. In contrast the bulk filesystem import tool took less than 5 minutes to ingest the same content set.
Now clearly this ignores the time it took to copy the 1,500 files onto the Alfresco server’s hard drive prior to running the bulk filesystem import tool, but in this case it was possible to modify the sourcing process so that it dropped the content directly onto the Alfresco server’s hard drive, providing a substantial (order of magnitude) overall saving.
What Doesn’t it Do (Yet)?
Despite already being in use in production, this tool is not what I would consider complete. The issue tracker in the Google Code project has details on the functionality that’s currently missing; the most notable gap being the lack of support for population of metadata (folders are created as cm:folder and files are created as cm:content). [EDIT] v0.5 adds a first cut at metadata import functionality. The “user experience” (I hesitate to call it that) is also very rough and could easily be substantially improved. [EDIT] v0.4 added several UI Web Scripts that significantly improve the usability of the tool (at least for the target audience: Alfresco developers and administrators).
That said, the core logic is sound, and has been in production use for some time. You may find that it’s worth investigating even in its currently rough state.
[POST EDIT] This tool seems to have attracted quite a bit of interest amongst the Alfresco implementer community. I’m chuffed that that’s the case and would request that any questions or comments you have be raised on the mailing list. If you believe you’ve found a bug, or wish to request an enhancement to the tool, the issue tracker is the best place. Thanks!
Tags: bulk, content migration, DM, filesystem, high performance, import, ingestion, web script

October 23rd, 2009 at 7:05 am
hi,
just want to add that fme AG (an Alfresco Partner in germany and my prior employer) offers a migration tool called migration-center.
This was developed for high speed & high load migration from filesystem to documentum. It is used by some well known companies.
migration-center will also be able to talk with alfresco, you’ll simply have to implement a specific importer-Interface. If you want to import from another import source (e.g. another ecm repo) you can do the same by implementing a specific scanner-Interface.
cheers, jan
November 6th, 2009 at 12:45 am
Peter -
Thank you for your efforts bringing this key piece of functionality to the community. As a vendor tasked with CMS implementation and regularly loading hundreds of thousands of files onto client installations, one of our challenges with Alfresco has been a reliable supported interface for importing pre-indexed scanned documents. We look forward to working with your new tool.
Best,
John
November 6th, 2009 at 12:50 am
John,
Good to hear! I’m very keen to hear of your experiences with the tool. If you have ideas for improvement or (heaven forbid!) run into bugs, please don’t hesitate to use the issue tracker in the Google Code project to track those.
Cheers,
Peter
November 26th, 2009 at 8:31 am
I developed a tool to upload documents to alfresco with their meta data and tried to make the interface as simple as possible, it will generate ACP that can be imported into Alfresco automatically or manually
http://forge.alfresco.com/projects/acpgenerator/
November 26th, 2009 at 5:01 pm
Rami, the issue with ACPs is that they’re imported in a single transaction, so if the content set is large that approach will run afoul of the various issues with long running transactions.
The ACP approach also requires that the content is copied three times:
1. from disk into the ACP file
2. the ACP file itself is transferred (copied) from disk into the repository (which may occur over the network, introducing network I/O latencies into the process as well)
3. from the ACP file into the content store
The bulk filesystem import tool only incurs one of these copy costs – copying the files from disk into the content store (which is the bare minimum that Alfresco requires).
Still, for smaller content sets ACP files work just fine, and as you point out they have support for importing metadata today (which, at the time of writing, the bulk filesystem importer still lacks).
December 11th, 2009 at 8:19 pm
If you bulk load into a space that has content rules applied, what happens? Will the rules still fire? If they do, is there a way to NOT use the metadata importers at all, and let the rules handle everything?
All in all, this is the solution to my prayers. I’ve been trying to load 300,000 files to a repository via CIFS and Alfresco really hates that!
December 11th, 2009 at 8:32 pm
Keith, yes rules will still fire. In fact that’s the reason the original customer who sponsored this development didn’t require metadata loading – they already had rules configured that synthesised their metadata. It sounds like your use case (replacing CIFS + rules with metadata importer + rules) is identical to their case, so you should be in good shape.
The metadata loading functionality is optional – you can control via Spring configuration which (if any) of the metadata loader implementations are used. As of v0.5 the “basic” and “properties file” metadata loaders are configured by default, but unless you create metadata files on disk alongside your original content, the property file metadata loading logic won’t take effect.
The “basic” metadata loader is required however, as it’s responsible for correctly setting the type (cm:content vs cm:folder) of each node as it’s created in the repository, as well as setting the cm:name and cm:title properties to the name of the file on disk. CIFS is doing both of these things too, it’s just that you don’t really see it explicitly (CIFS makes the repository look like a filesystem, but under the covers it’s actually populating some standard metadata properties, such as type, cm:name, cm:title, etc.).
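To make that concrete, here is a rough sketch of the kind of metadata the “basic” loader is described as producing. It is an assumption-laden illustration rather than the tool’s code: the class name, method name and the use of a plain map are all hypothetical.

```java
import java.io.File;
import java.util.LinkedHashMap;
import java.util.Map;

// Hypothetical sketch of the "basic" metadata described above; not the tool's code.
final class BasicMetadataSketch {
    static Map<String, String> basicMetadata(File source) {
        Map<String, String> metadata = new LinkedHashMap<>();
        // Directories become folders; everything else is treated as content.
        metadata.put("type", source.isDirectory() ? "cm:folder" : "cm:content");
        // Both name and title default to the name of the file on disk.
        metadata.put("cm:name", source.getName());
        metadata.put("cm:title", source.getName());
        return metadata;
    }
}
```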
Anyway, I’m very keen to hear about your experiences with the tool – please keep us apprised of your progress!
December 14th, 2009 at 8:16 pm
Hi Peter,
Thanks for the info. I installed the tool and tested it out. My initial test, leaving it configured as default, imported my file structure, along with filenames but all of my files were 0K length and were unable to be retrieved from the repository.
Any thoughts? I turned on logging, but all I’m seeing are the “Ingesting..” statements and the properties file metadata failures as I have no shadow files.
I changed the permissions on my source directory and files to 777, and no luck so far. So close I can taste it!
December 14th, 2009 at 8:40 pm
Keith, I’d suggest raising this in the issue tracker in the Google Code project. The more details you can provide (environment – OS, DB, Java; log file output; filesystem ownership information on the source content vs the user Alfresco is running as; the Alfresco user you’re running the Web Scripts as, etc. etc.) the more likely it is that a possible explanation will suggest itself.
December 15th, 2009 at 9:21 pm
Will do. I’ve been unable to get this thing to work at all.
December 15th, 2009 at 9:26 pm
Yeah I suspect something is wrong with your installation or environment, given that it’s in successful production use in at least one location and has been evaluated by a dozen or so other installations (that I’m aware of).
Once you create the issue, I’ll take a look and see if anything obvious jumps out at me.
December 15th, 2009 at 10:17 pm
For anyone who’s seeing similar issues, Keith reported his issue here.
December 28th, 2009 at 5:55 pm
Peter,
Wanted to follow up and thank you for the quick turnaround on addressing the issue I discovered. I’m using the tool to upload approximately 300K files to an instance of Alfresco Community. It’s been working great, and is helping out tremendously. Up until I discovered your tool, I was dreading having to write one myself. Thanks for making this available to the community, and I hope you have a prosperous and Happy New Year!
January 19th, 2010 at 9:15 am
Hi Peter,
We are in a project where the next urgent step is to bulk import documents into Alfresco, and we thought of using your solution.
Unfortunately I am not familiar with Maven and this is why I am not sure what to do at the first step of the installation process:
“1. Build the AMP file using Maven2 (“mvn clean package”)”.
In the meantime I’ve installed apache-maven-2.2.1 on my computer; I hope this is the utility I need.
As for your explanations, I understood that I have to “manually edit the pom.xml file in order to point Maven to either the Community Artifact repository (sponsored by SourceSense, one of Alfresco’s European SI partners), or to a Maven repository I have to create that contains the Alfresco Enterprise artifacts” . Please emphasis this step, even if this is basic routine for you. Thank you in advance!
January 19th, 2010 at 3:43 pm
Mihaela, the Google Code page has a pre-built AMP file available for download that obviates the need to build the package yourself.
January 19th, 2010 at 9:03 pm
I have installed the .amp file, but I am at a loss on how to use the actual functionality. The readme file associated with the project states that the web script is available at /bulk/import/filesystem, yet I am unable to locate it.
Thanks!
January 19th, 2010 at 10:59 pm
Luke, Web Script URLs start with “/alfresco/service”, so the fully qualified URL for the Web Script would be along the lines of:
http://myalfrescoserver:8080/alfresco/service/bulk/import/filesystem
February 1st, 2010 at 1:44 am
Hi Peter,
Thank you for the great tool. I am using version 0.6 on Alfresco Community 3.2r2.
I seem to have an issue with bulk importing to “/Company Home” – it gives a file not found error. However importing to “/Company Home/foldername” works as expected. Is this by design or is there a solution?
Also, are you planning (or is there already) the ability to maintain file dates?
Max
February 1st, 2010 at 4:47 am
Max, I’d suggest raising an issue for the “/Company Home” problem in the issue tracker on the Google Code project – it sounds like a bug.
The issue regarding file dates is already tracked in the issue tracker as issue #4. This is a rather more complex problem, as Java (at least until JSR-203 is implemented – currently slated for JDK 1.7) is unable to read filesystem metadata (including most file dates).
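For reference, the JSR-203 API shipped as the java.nio.file package in Java 7, and reading a file’s timestamps with it looks roughly like the sketch below. This is a standalone illustration with a hypothetical class name, not part of the import tool, and whether a true creation time is available depends on the operating system and filesystem.

```java
import java.io.IOException;
import java.nio.file.Files;
import java.nio.file.Path;
import java.nio.file.attribute.BasicFileAttributes;

// Standalone sketch (hypothetical class name); requires Java 7 or later.
final class FileDatesSketch {
    // Returns { creation time, last modified time } in milliseconds since the epoch.
    static long[] readDates(Path path) throws IOException {
        BasicFileAttributes attrs = Files.readAttributes(path, BasicFileAttributes.class);
        return new long[] {
            attrs.creationTime().toMillis(),
            attrs.lastModifiedTime().toMillis()
        };
    }
}
```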
March 3rd, 2010 at 10:30 pm
I would love to use your tool as I need to load thousands of PDFs nightly and Alfresco share hangs on both CIFS and FTP copies after about 600-1000 documents. I used MMT to include the supplied AMP file into Alfresco.war.
I get the following error in alfresco.log:
14:32:53,670 ERROR [org.alfresco.web.scripts.AbstractRuntime] Exception from executeScript – redirecting to status template error: 02030011 Not implemented
org.alfresco.error.AlfrescoRuntimeException: 02030011 Not implemented
at org.alfresco.repo.security.authentication.DefaultMutableAuthenticationDao.loadUserByUsername(DefaultMutableAuthenticationDao.java:410)
If I go through the webscripts, I get the following error.
Web Script Status 500 – Internal Error
The Web Script /alfresco/service/bulk/import/filesystem/initiate has responded with a status of 500 – Internal Error.
500 Description: An error inside the HTTP server which prevented it from fulfilling the request.
Message: 02030011 Not implemented
Exception: org.alfresco.error.AlfrescoRuntimeException – 02030011 Not implemented
org.alfresco.repo.security.authentication.DefaultMutableAuthenticationDao.loadUserByUsername(DefaultMutableAuthenticationDao.java:410)
sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:39)
sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:25)
java.lang.reflect.Method.invoke(Method.java:597)
org.alfresco.repo.management.subsystems.ChainingSubsystemProxyFactory$1.invoke(ChainingSubsystemProxyFactory.java:95)
org.springframework.aop.framework.ReflectiveMethodInvocation.proceed(ReflectiveMethodInvocation.java:171)
org.springframework.aop.framework.JdkDynamicAopProxy.invoke(JdkDynamicAopProxy.java:204)
$Proxy22.loadUserByUsername(Unknown Source)
net.sf.acegisecurity.providers.dao.DaoAuthenticationProvider.getUserFromBackend(DaoAuthenticationProvider.java:390)
net.sf.acegisecurity.providers.dao.DaoAuthenticationProvider.authenticate(DaoAuthenticationProvider.java:225)
net.sf.acegisecurity.providers.ProviderManager.doAuthentication(ProviderManager.java:159)
net.sf.acegisecurity.AbstractAuthenticationManager.authenticate(AbstractAuthenticationManager.java:49)
org.alfresco.repo.security.authentication.AuthenticationComponentImpl.authenticateImpl(AuthenticationComponentImpl.java:81)
org.alfresco.repo.security.authentication.AbstractAuthenticationComponent.authenticate(AbstractAuthenticationComponent.java:144)
org.alfresco.repo.security.authentication.AuthenticationServiceImpl.authenticate(AuthenticationServiceImpl.java:129)
org.alfresco.repo.security.authentication.AbstractChainingAuthenticationService.authenticate(AbstractChainingAuthenticationService.java:166)
sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:39)
sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:25)
java.lang.reflect.Method.invoke(Method.java:597)
org.springframework.aop.support.AopUtils.invokeJoinpointUsingReflection(AopUtils.java:304)
org.springframework.aop.framework.ReflectiveMethodInvocation.invokeJoinpoint(ReflectiveMethodInvocation.java:182)
org.springframework.aop.framework.ReflectiveMethodInvocation.proceed(ReflectiveMethodInvocation.java:149)
net.sf.acegisecurity.intercept.method.aopalliance.MethodSecurityInterceptor.invoke(MethodSecurityInterceptor.java:80)
org.springframework.aop.framework.ReflectiveMethodInvocation.proceed(ReflectiveMethodInvocation.java:171)
org.alfresco.repo.security.permissions.impl.ExceptionTranslatorMethodInterceptor.invoke(ExceptionTranslatorMethodInterceptor.java:49)
org.springframework.aop.framework.ReflectiveMethodInvocation.proceed(ReflectiveMethodInvocation.java:171)
org.alfresco.repo.audit.AuditComponentImpl.audit(AuditComponentImpl.java:275)
org.alfresco.repo.audit.AuditMethodInterceptor.invoke(AuditMethodInterceptor.java:69)
org.springframework.aop.framework.ReflectiveMethodInvocation.proceed(ReflectiveMethodInvocation.java:171)
org.springframework.transaction.interceptor.TransactionInterceptor.invoke(TransactionInterceptor.java:106)
org.springframework.aop.framework.ReflectiveMethodInvocation.proceed(ReflectiveMethodInvocation.java:171)
org.springframework.aop.framework.JdkDynamicAopProxy.invoke(JdkDynamicAopProxy.java:204)
$Proxy26.authenticate(Unknown Source)
org.alfresco.repo.web.scripts.servlet.BasicHttpAuthenticatorFactory$BasicHttpAuthenticator.authenticate(BasicHttpAuthenticatorFactory.java:187)
org.alfresco.repo.web.scripts.RepositoryContainer.executeScript(RepositoryContainer.java:280)
org.alfresco.web.scripts.AbstractRuntime.executeScript(AbstractRuntime.java:262)
org.alfresco.web.scripts.AbstractRuntime.executeScript(AbstractRuntime.java:139)
org.alfresco.web.scripts.servlet.WebScriptServlet.service(WebScriptServlet.java:122)
javax.servlet.http.HttpServlet.service(HttpServlet.java:717)
org.apache.catalina.core.ApplicationFilterChain.internalDoFilter(ApplicationFilterChain.java:290)
org.apache.catalina.core.ApplicationFilterChain.doFilter(ApplicationFilterChain.java:206)
org.apache.catalina.core.StandardWrapperValve.invoke(StandardWrapperValve.java:233)
org.apache.catalina.core.StandardContextValve.invoke(StandardContextValve.java:191)
org.apache.catalina.core.StandardHostValve.invoke(StandardHostValve.java:128)
org.apache.catalina.valves.ErrorReportValve.invoke(ErrorReportValve.java:102)
org.apache.catalina.core.StandardEngineValve.invoke(StandardEngineValve.java:109)
org.apache.catalina.connector.CoyoteAdapter.service(CoyoteAdapter.java:286)
org.apache.coyote.http11.Http11AprProcessor.process(Http11AprProcessor.java:857)
org.apache.coyote.http11.Http11AprProtocol$Http11ConnectionHandler.process(Http11AprProtocol.java:565)
org.apache.tomcat.util.net.AprEndpoint$Worker.run(AprEndpoint.java:1509)
java.lang.Thread.run(Thread.java:619)
I’m not sure if I have not turned something on or if it is a security error. Any help would be appreciated.
Gary
March 4th, 2010 at 12:09 am
Gary, can I suggest you raise this in the issue tracker in the Google Code project? Thanks!
March 9th, 2010 at 4:21 pm
I was able to run the script and import 100s of documents without any problem. However, as a next steps I have to associate some custom meta data. I tried to follow the instruction in the readme file but so far had no success. Can you please provide example of a metadata/properties file which also includes a custom content type and properties.
Thanks, Frank
March 9th, 2010 at 8:53 pm
Frank, the readme file goes into some detail on how and where to put the metadata properties files, and there’s a simple metadata properties file example about halfway down.
March 10th, 2010 at 6:57 pm
Hello Peter,
Many thanks. I got it to work. However, could it be that dates ( e.g. 2012-05-23) are not supported.
March 10th, 2010 at 7:57 pm
Frank, currently the code relies on Alfresco to convert string values in the properties files into their correct data type in the repository. I can’t recall exactly which implicit data type conversions Alfresco supports natively, but there is a chance they are quite limited and don’t extend to dates or date/times.
Regardless, this has been raised as a task in the issue tracker in the Google Code project – please feel free to look into this further if you have the time and interest, as I’m not sure when I will next have an opportunity to investigate it.
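If explicit conversion did turn out to be necessary, one possible approach to parsing ISO 8601 values read from a properties file is sketched below. This is an assumption about how it could be done, not a description of what the tool or Alfresco does; it uses java.time (Java 8 or later), the class name is hypothetical, and treating a bare date as midnight UTC is an arbitrary choice made for the example.

```java
import java.time.LocalDate;
import java.time.OffsetDateTime;
import java.time.ZoneOffset;
import java.time.format.DateTimeParseException;

// Hypothetical sketch of explicit string-to-date conversion; not the tool's code.
final class DateValueSketch {
    // Accepts a full ISO 8601 date/time with offset, or a bare ISO date.
    static OffsetDateTime parse(String value) {
        String trimmed = value.trim();
        try {
            return OffsetDateTime.parse(trimmed);
        } catch (DateTimeParseException e) {
            // Fall back to a bare date, assumed here to mean midnight UTC.
            return LocalDate.parse(trimmed).atStartOfDay().atOffset(ZoneOffset.UTC);
        }
    }
}
```

Values that match neither form still fail with a DateTimeParseException, which is probably preferable to silently storing a wrong date.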
April 1st, 2010 at 8:47 pm
[...] to further increase migration performance. In the coming months, we will be looking to integrate concepts from Peter Monk’s work with the Alfresco Bulk Filesystem [...]
April 20th, 2010 at 12:09 pm
Is it ready to bulk import into Alfresco 3.3? Can you mention somewhere which versions of Alfresco your tool is usable with? I did not find anything about this.
April 20th, 2010 at 10:12 pm
Bastiaan, the tested versions are mentioned in the readme file, although the tool is basically very simple and should work on all 3.x versions of Alfresco. In fact it may even work on 2.x versions of Alfresco, but currently the AMP is configured to only allow installation on versions 3.0 and above as I’ve not tested on any 2.x release.
April 27th, 2010 at 11:03 pm
Peter,
I’ve been testing the bulk importer, and it works well except that the “update existing files” option doesn’t seem to do anything. Has anyone else had trouble with this?
April 27th, 2010 at 11:16 pm
Susan, can I suggest you raise this in the issue tracker in the Google Code project? It’s far easier to manage / track in there. Thanks!
June 14th, 2010 at 9:03 am
Thanks for this tool Peter!
I migrated 50 GB of data from a shared drive to Alfresco and it worked perfectly (the process took around 15 hours).
The only problem I had was with folders whose names ended with whitespace, so it might be a good idea to trim whitespace from names on creation.
Great work anyway.
June 14th, 2010 at 4:39 pm
Arthur, good to hear! Just out of interest, approximately how many files and folders were in the source content set?
I’ve raised an issue regarding the whitespace in the issue tracker – it’s issue #33.
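For anyone wanting to work around this in the meantime, stripping trailing whitespace from a name is straightforward. The helper below is a hypothetical sketch (it is not code from the tool, and says nothing about how issue #33 will actually be resolved):

```java
// Hypothetical helper; not code from the import tool.
final class NameTrimSketch {
    // Removes trailing whitespace only, leaving leading and inner whitespace intact.
    static String stripTrailingWhitespace(String name) {
        int end = name.length();
        while (end > 0 && Character.isWhitespace(name.charAt(end - 1))) {
            end--;
        }
        return name.substring(0, end);
    }
}
```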
August 24th, 2010 at 6:23 am
Hi i’ve tried this amp and it works great! How can i add custom aspects in the metadata.properties ? Is this possible?
August 24th, 2010 at 6:33 am
justin, the readme describes how to attach aspects (regardless of whether they’re built-in or custom) to the ingested content – see line 81.
September 10th, 2010 at 10:47 pm
Hi, I just used and it works great, I personally like when spaces and folders overlap and spaces have rules on them, it works perfectly. Great job !
September 10th, 2010 at 11:29 pm
Thanks Savic! Glad to hear you’ve had success with the tool!
September 16th, 2010 at 3:24 pm
Hi Peter,
I have a fix for issue 4 ‘creation and modification dates’. I would like to share with you.
If you are interested, just contact me.
Regards,
Walter
September 16th, 2010 at 5:51 pm
Walter, if you could attach a patchfile to issue #4 in the issue tracker, I’ll give it a review.
I should also point out that this isn’t actually an issue with the bulk importer, but a bug in Alfresco that was recently fixed (see ALF-2565).
December 4th, 2010 at 2:45 am
Hi Peter,
Could you compile the latest head version of your source code? We really need the fix for issue #4 but we are still having difficulties compiling it in Maven.
Regards,
Tiur
December 4th, 2010 at 3:47 am
Tiur, please review the existing issues in the issue tracker (there are several that are similar), and raise a new issue if appropriate.
December 6th, 2010 at 5:41 am
Hi Peter,
We’ve raised a new issue : http://code.google.com/p/alfresco-bulk-filesystem-import/issues/detail?id=55
Could you kindly provide the AMP file for Alfresco 3.4? We are willing to make a donation to the project if you do this. We need this functionality ASAP, so it would be very helpful if you could help us.
Regards,
Tiur
December 6th, 2010 at 6:37 am
Tiur, I’ve updated issue #55 with the current state of play. Unfortunately the issue you’re running into relates to the Community Maven repository (which I am in no way involved in supporting or maintaining) rather than the Bulk Filesystem Import Tool, so your best bet is to chase it up with them separately.
FWIW I’ve also reached out to them internally to try to find out what’s going on, but you should chase it up directly as well as I rarely use that repository.
December 8th, 2010 at 10:25 pm
Is it possible to use the bulk importer to load comments along with other metadata? Thanks.
December 9th, 2010 at 1:29 pm
Hi Peter,
This is a fantastic tool i am able upload the around 130GB data in 2 – 3 hours…
i was wondered even from intranet the transfer rate is 7Mbps… I have no problem sofar when I am using with alfresco and share.
But when it comes to Open office 3.2.0 with oracle connector I was not able to access the data which resides in folder.
Sorry for raising this issue here, but I want to know what the problem is.
December 16th, 2010 at 6:42 pm
Susan, it depends how comments are defined in the content model.
If they’re a property of the node, then yes they can be loaded (although note that multi-valued properties are not yet supported – see issue #20).
If comments are stored as sub-nodes of the file then currently there’s no way of loading that structure, since filesystems don’t typically support files that are also folders (unlike Alfresco, which does support that structure).
December 16th, 2010 at 6:43 pm
chiru, can you describe the problem in more detail? It doesn’t sound like the issue you’re seeing is related to the import tool, although it’s a bit hard to tell from your description.
December 28th, 2010 at 1:22 pm
Hi Peter,
I have been checking your import tool and it looks good, but I have a question: is it possible at the moment to set a document’s categories?
Thanks
December 28th, 2010 at 5:08 pm
Antonio, it’s possible to set a single category (issue #19 in the issue tracker describes how this is done), but it’s not yet possible to set multi-valued properties – that’s issue #20. Setting a single category isn’t very useful, obviously, but I have not had time to look at issue #20.
January 7th, 2011 at 4:35 am
[...] to the source and target system. It quickly became obvious (partly due to Peter Monks’ blog post on one approach) that there were a handful of options [...]
January 18th, 2011 at 5:08 am
Hi Peter,
We’re in the process of upgrading Alfresco to 3.3.4 but they want to do a gradual upgrade (and decided to use Alfresco 3.2.* as an interim step). We had the bulk import working before with a custom aspect defined and when we tried to run it on 3.2 it failed with the following exception:
namespace prefix [prefix] is not mapped to a namespace uri
I should mention that applying the aspect manually to a file didn’t produce any errors.
Any idea why the bulk import is complaining?
My metadata file is of the following structure:
type=cm:content
aspects=sensis:prod
cm\:title=09000001800699fe.pdf
cm\:description=Contract
sensis\:advertiserId=478283400
sensis\:campaignCode=N00Y
sensis\:generationDate=2005-05-31T12:00:00.000+10:00
sensis\:issue=26
cm\:storeName=storeA
Thanks in advance for any suggestions you might have.
January 18th, 2011 at 5:41 am
Zoran, that error usually indicates that the content model containing the given namespace (almost certainly “sensis” in this case) isn’t registered with the repository. That said, if you’re able to attach the “sensis:prod” aspect manually via the UI (Explorer or Share) then that pretty much rules that possibility out.
Would you mind raising this in the issue tracker in the Google Code project, so that I can track it properly? The above detail is good, but what would be even better would be the files you’re using to register the content model with the repository (both the model file itself and the Spring application context that loads it), or a cut-down equivalent that also demonstrates the issue. Thanks!
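As a quick sanity check on a metadata file like the one above, it’s possible to list the namespace prefixes used in its property keys and compare them against the prefixes you expect to be registered. The sketch below is hypothetical (the class, method and the idea of passing in a set of known prefixes are all assumptions for illustration); it also shows that java.util.Properties unescapes “cm\:title” to “cm:title” on load. Note that it only inspects keys, not the values of “type” or “aspects”.

```java
import java.io.IOException;
import java.io.StringReader;
import java.io.UncheckedIOException;
import java.util.Properties;
import java.util.Set;
import java.util.TreeSet;

// Hypothetical diagnostic sketch; not part of the import tool.
final class PrefixCheckSketch {
    // Returns the prefixes used in property keys that are not in knownPrefixes.
    static Set<String> unknownPrefixes(String propertiesText, Set<String> knownPrefixes) {
        Properties props = new Properties();
        try {
            // Loading unescapes keys, so "cm\:title" becomes "cm:title".
            props.load(new StringReader(propertiesText));
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
        Set<String> unknown = new TreeSet<>();
        for (String key : props.stringPropertyNames()) {
            int colon = key.indexOf(':');
            if (colon > 0 && !knownPrefixes.contains(key.substring(0, colon))) {
                unknown.add(key.substring(0, colon));
            }
        }
        return unknown;
    }
}
```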
January 26th, 2011 at 9:43 am
Hi
I just wanted to say that, thanks to this tool, we were able to upload 4.5 million documents in an Alfresco repository in only 4 days.
This would have taken weeks with webdav or ftp.
Thank you very much for this awesome tool
January 26th, 2011 at 5:14 pm
polgarine, that’s great to hear – thanks for commenting! Just out of interest, roughly how large (in MB / GB) were the documents in total?
February 4th, 2011 at 12:50 pm
Thanks a lot for this tool. Removing the blank in your readme line
aspects=cm:versionable, custom:myAspect
or adapting your code near
((String)metadataProperties.get(key)).split(",")
might avoid some trouble.
February 4th, 2011 at 5:22 pm
Leo, would you mind raising this in the issue tracker in the Google Code project, so that I can track it properly?
I’d be particularly interested in knowing precisely what the behaviour is when the list of aspect names includes spaces (i.e. is an exception thrown, does the aspect fail to get applied, do incorrect aspects get applied, etc.).
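For illustration, the difference between a plain split and one that tolerates whitespace around the commas can be sketched as follows. This is not the tool’s code and not necessarily how the fix would be implemented; the class and method names are hypothetical.

```java
import java.util.ArrayList;
import java.util.List;

// Hypothetical sketch of whitespace-tolerant parsing of an "aspects" value.
final class AspectListSketch {
    static List<String> parseAspects(String value) {
        List<String> aspects = new ArrayList<>();
        for (String part : value.split(",")) {
            // A plain split leaves " custom:myAspect" with its leading space.
            String name = part.trim();
            if (!name.isEmpty()) {
                aspects.add(name);
            }
        }
        return aspects;
    }
}
```

Given “cm:versionable, custom:myAspect”, a plain split on “,” produces a second element with a leading space, whereas the sketch above returns the two aspect names without surrounding whitespace.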
February 9th, 2011 at 4:14 pm
Can you explain how the webscript you wrote gets access to the content that lives on the file system? When I read the wiki regarding web scripts, an HTML input form is the only way shown to access the file content, that is, it uploads the content via the form and then the web script has access to the form fields and the file content. In a bulk file scenario where there isn’t a UI, it’s not obvious how to gain access to file content. Can you enlighten us? Thanks in advance.
February 9th, 2011 at 11:15 pm
Steve, the key is that the Web Script is reading the source content off the server’s filesystem, not the client that initiated the import. This is part of the reason that this is an administrator-only tool for now – it requires that the content be copied to a disk that’s mounted to the server hosting Alfresco, prior to the tool being run (typically end-users wouldn’t have direct filesystem access to the server(s) Alfresco is running on, so wouldn’t be able to use this tool).
This isn’t a problem for the tool’s primary use case of course, which is around large scale content migration / ingestion. It’s unlikely that an end-user would be able to accomplish this unassisted anyway, even if the tool supported it.
March 18th, 2011 at 7:13 pm
Does this work on 64-bit Linux? I ran apply_amps.sh and it appears the WARs (alfresco & share) were both corrupted: Tomcat crashed at startup, and only booted cleanly again once I replaced them with the backups.
March 18th, 2011 at 8:37 pm
Diane, the tool is developed on 64bit Mac OSX, which (from an Alfresco / Tomcat perspective) is basically the same as 64bit Linux. Did you try applying the AMP again? My first suspicion would be that this issue was caused by a one-time glitch in the apply_amps process.
May 16th, 2011 at 6:29 pm
Is the tool compatible with the latest version of Alfresco (3.4d)? Getting the following exception in the tomcat logs:
Module ‘org.alfresco.extension.alfresco-bulk-filesystem-import’ version 0.11 is incompatible with the current repository version 3.4.0.
The repository version required must be in range [3.3.0 : 3.3.99].
at org.alfresco.error.AlfrescoRuntimeException.create(AlfrescoRuntimeException.java:46)
at org.alfresco.repo.module.ModuleComponentHelper.startModule(ModuleComponentHelper.java:509)
at org.alfresco.repo.module.ModuleComponentHelper.access$400(ModuleComponentHelper.java:57)
at org.alfresco.repo.module.ModuleComponentHelper$1$1.execute(ModuleComponentHelper.java:239)
at org.alfresco.repo.transaction.RetryingTransactionHelper.doInTransaction(RetryingTransactionHelper.java:381)
at org.alfresco.repo.transaction.RetryingTransactionHelper.doInTransaction(RetryingTransactionHelper.java:272)
at org.alfresco.repo.module.ModuleComponentHelper$1.doWork(ModuleComponentHelper.java:260)
… 54 more
May 17th, 2011 at 3:11 am
Fred, v0.11 of the bulk filesystem import tool was developed and tested against Alfresco v3.3 (the then latest release of Alfresco). The module (AMP file) was therefore “pinned” to version 3.3, resulting in the above error when installed on Alfresco v3.4 (or indeed any version other than 3.3.x).
You can manually override the supported version of the AMP by editing the module.properties within the AMP file, which will allow the tool to be installed on 3.4, but there’s no guarantee it’ll work. I’m not aware of anything that would prevent it from working, but haven’t verified it myself.
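As a rough illustration of that override, the sketch below rewrites the upper bound of the supported repository range in the text of a module.properties file. It assumes the range is held in the keys module.repo.version.min and module.repo.version.max (check the file inside your own AMP before relying on this); the class and method names are hypothetical, and the usual caveat applies that widening the range does not make the module tested on 3.4.

```java
import java.io.IOException;
import java.io.StringReader;
import java.io.UncheckedIOException;
import java.util.Properties;

// Hypothetical helper: widens the repository version range declared in an
// AMP's module.properties. The key name below is an assumption.
final class ModuleVersionRange {
    private ModuleVersionRange() {}

    /** Parses the given module.properties text and replaces the maximum supported version. */
    static Properties widenMaxVersion(String modulePropertiesText, String newMax) {
        Properties props = new Properties();
        try {
            props.load(new StringReader(modulePropertiesText));
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
        // Assumed key; all other entries (module.id, module.version, ...) are left untouched.
        props.setProperty("module.repo.version.max", newMax);
        return props;
    }
}
```

The returned Properties object can then be written back with Properties.store and the file replaced inside the AMP (an AMP is a zip archive) before re-running apply_amps.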
The next version of the tool will be built and tested against v3.4 but I don’t have an ETA on it, unfortunately.
May 20th, 2011 at 8:40 pm
Hi Peter,
Thanks for getting back to me about the version issue. I updated the configuration and it deployed just fine in Alfresco. I was able to upload a bunch of files successfully as a test. Right now I’m trying to figure out how to get the metadata import to work with our defined model.
I do have a question: is the purpose of this tool mainly for an individual to use to upload the files? Or could this be used by a job scheduler or called from within another application?
May 20th, 2011 at 9:00 pm
Fred, the import tool itself is currently exposed as two REST APIs (Web Scripts):
1. An “initiate” API
2. A “status” API
Both of these are invoked via HTTP GET† requests, which can be scripted from an external job scheduler (e.g. cron or at) or called from any external application that is capable of executing an HTTP GET request. In addition, the status API can emit either HTML or XML, allowing external applications to poll the tool and obtain detailed status information.
The UI Web Script that’s used to manually initiate an import is little more than a convenience layer on top of these two REST APIs, and is not central to the operation of the tool itself.
† Technically I should have used HTTP POST or HTTP PUT for the “initiate” API, in keeping with REST principles, but my pragmatic experience has been that HTTP GETs are far easier to call (particularly from within the browser, shell scripts etc.) provided minimal data is being passed in the request (as is the case here). There’s an enhancement request on this in the issue tracker.
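For anyone wiring this into a scheduler, the sketch below shows one way to assemble a correctly encoded GET URL for either API. It is an illustration only: the endpoint path and the parameter names in the usage note are placeholders rather than the tool's documented API, so substitute the real Web Script URLs from the project documentation.

```java
import java.io.UnsupportedEncodingException;
import java.net.URLEncoder;
import java.util.Map;

// Hypothetical helper: builds a GET URL with URL-encoded query parameters.
// Endpoint and parameter names are supplied by the caller and are NOT taken
// from the tool's documentation.
final class ImportRequestUrl {
    private ImportRequestUrl() {}

    /** Appends each parameter to the endpoint as an encoded key=value pair. */
    static String build(String endpoint, Map<String, String> params) {
        StringBuilder url = new StringBuilder(endpoint);
        char separator = '?';
        for (Map.Entry<String, String> entry : params.entrySet()) {
            url.append(separator)
               .append(encode(entry.getKey()))
               .append('=')
               .append(encode(entry.getValue()));
            separator = '&';
        }
        return url.toString();
    }

    private static String encode(String value) {
        try {
            return URLEncoder.encode(value, "UTF-8");
        } catch (UnsupportedEncodingException e) {
            // UTF-8 is guaranteed to be available on every Java platform.
            throw new IllegalStateException(e);
        }
    }
}
```

For example, calling build with a placeholder endpoint such as http://localhost:8080/example/initiate and a single placeholder parameter sourceDirectory=/mnt/legacy docs produces a URL in which the slashes and the space are encoded. The result can be handed to any HTTP client (java.net.HttpURLConnection, curl from cron, and so on), with the administrator's credentials supplied in whatever way the Alfresco instance requires.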
May 23rd, 2011 at 8:21 pm
Hey all,
Thanks for the tool, this adds another option for migration that, for larger jobs, will likely be much easier (as opposed to planning *days* of active migration with other approaches)!
A question, however, as I’m reviewing the options discussed here http://forums.alfresco.com/en/viewtopic.php?f=9&t=38889: does this bulk import from the filesystem work with Alfresco’s content store such that the backlog, once put into Alfresco, is pre-separated in the contentstore, or is the entire backlog currently dumped into the current year’s contentstore directory on the filesystem (alf_data/contentstore/2011/*** for example)?
May 23rd, 2011 at 8:28 pm
Adding: the comment above is for Alfresco 3.4.d CE, so the next version above 0.11 would be helpful for me as well as for Fred Grafe. I’ve added an issue in the tracker.
May 23rd, 2011 at 9:00 pm
dhartford, currently the tool simply imports the content into Alfresco using whatever contentstore implementation (and therefore storage policy) that Alfresco instance is configured with. So for example if the XAM connector is configured, binaries will be stored on the CAS device using a hashed id rather than the default “timestamp hashbucket directory structure” approach.
For the use case described in the forum post, I’d suggest that using Content Storage Policies (see also this webinar) is a better approach, as it will provide the archival mechanism you require, independent of the underlying content store implementation. Relying on the internal implementation details of a particular contentstore implementation (such as the timestamp hashbucket behaviour of the filesystem contentstore) is somewhat risky, as Alfresco reserves the right to change those implementation details at a later time.
May 24th, 2011 at 1:53 pm
Wow, http://wiki.alfresco.com/wiki/Content_Store_Selector is exactly what I was looking for — thanks so much Peter!
That will help address my specific requirement regardless of the backlog import tool used. Just so you can have some numbers, my current review of using CMIS (admittedly, in a single-thread/serial fashion) was only netting 2 transactions (images)/sec, so I do hope the bulk filesystem import will work for 3.4.d. I haven’t modified the module.properties to try 0.11 on it yet, but will let you know of any findings.
May 24th, 2011 at 3:07 pm
dhartford, good to hear! Yeah 2 txns / sec is not great. FWIW I see sustained throughput of 15 – 20 docs / sec using the tool on my 2009 MacBook Pro, using a vanilla Alfresco Enterprise 3.3 install, and in the past Alfresco has been demonstrated to handle sustained throughput of up to around 100 docs / sec (though that was on fairly beefy hardware).
I expect Alfresco 3.4 will be faster still (removing Hibernate gave the repository a noticeable performance bump across the board), and following the next (versioning) release, I intend to work on a couple of performance-focused enhancements that should also speed the tool up. Issue #56 in particular has the potential to significantly improve the performance of imports.
May 24th, 2011 at 6:31 pm
doh, the Content Store Selector approach complains about no “storeSelectorContentStoreBase” defined…it appears through the forums that this is an enterprise-only feature not available in the Community Edition. Too bad, as this seems like a pretty common scenario to use if people knew about it more.
September 20th, 2011 at 11:00 pm
Hi Peter, we are evaluating your tool to import 260,000 records into a Records Management site.
Peter, have you ever used your tool for that purpose?
We started with a small sample of 1000 Record Folders with some metadata and it took 24min to ingest them. That’s not acceptable; we need to improve this speed. When we tried to ingest those Record Folders as normal folders with their metadata into a normal Site it took only 15sec. We would like to get similar performance with the RM site.
We think that something is slowing down the process in the RM Site (auditing, maybe?). We have already tried some of the tuning techniques described in the Alfresco documentation, without any luck.
Best regards,
Jordi
September 21st, 2011 at 9:53 am
Jordi, I have not tested the tool against an RM site as the tool doesn’t know (or care) what type of space the target is – it simply writes the source content to wherever you tell it to. In other words, the performance discrepancy you’re seeing is likely due to the repository, its configuration or the environment, rather than the tool itself.
Have you tried profiling / DB tracing while the import is in progress to try to find out what specifically is taking the extra time?
September 26th, 2011 at 1:06 am
Hi Peter,
Thank you for your reply. We tried using VisualVM to find the bottleneck, without luck. However, we appear to have solved the slow ingestion of records into the RM Site. As you know, the File Plan in the RM site has 4 levels (Series, Categories, Record Folders & Records Files). In our first test we had created the shadow files ONLY for the Record Folders (skipping the Series and Categories). Once shadow files were also created for the Series & Categories, the tool worked much better: 5000 records in 15min (on my local deployment).
Another important piece of information that we discovered along the way, and that you could add to the documentation, is that the shadow files must be created specifically in “UTF-8”. The java.io.FileWriter class doesn’t use UTF-8 by default (it uses the platform’s default encoding, ISO-8859-1 in our case) and this was causing a NullPointerException during the execution.
Hope this information is useful,
Regards, Jordi
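For readers generating shadow files from Java, the encoding point above can be handled by naming the charset explicitly instead of relying on FileWriter's default. The following is a minimal sketch under that assumption; the class name ShadowFileWriter is hypothetical, and the contents of the metadata file are whatever format your version of the tool expects.

```java
import java.io.IOException;
import java.io.OutputStream;
import java.io.OutputStreamWriter;
import java.io.UncheckedIOException;
import java.io.Writer;
import java.nio.charset.StandardCharsets;
import java.nio.file.Files;
import java.nio.file.Path;

// Hypothetical helper: writes text as UTF-8 regardless of the platform default charset.
final class ShadowFileWriter {
    private ShadowFileWriter() {}

    /** Writes the contents to the given file (created or overwritten) as UTF-8. */
    static void writeUtf8(Path target, String contents) {
        try (OutputStream out = Files.newOutputStream(target)) {
            writeUtf8(out, contents);
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
    }

    /** Writes the contents to the given stream as UTF-8 and flushes it. */
    static void writeUtf8(OutputStream out, String contents) {
        try {
            // Explicit charset: this is the difference from a plain FileWriter.
            Writer writer = new OutputStreamWriter(out, StandardCharsets.UTF_8);
            writer.write(contents);
            writer.flush();
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
    }
}
```

Accented characters in titles or descriptions are then encoded identically whatever the server's locale settings are.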
September 26th, 2011 at 6:56 am
Jordi, I’d be very interested in seeing an example file that causes the NPE, along with the full stack trace for the UTF-8 vs ISO-8859 issue. Would you mind raising an issue in the issue tracker?
September 29th, 2011 at 3:19 pm
Hi Peter,
I ran into an exception when trying an import with the 1.0 release. The error seemed to happen on a multi-valued cm:taggable property in the metadata file. I created a new issue at http://code.google.com/p/alfresco-bulk-filesystem-import/issues/detail?id=88. Later on I found a related issue here
http://code.google.com/p/alfresco-bulk-filesystem-import/issues/detail?id=57. Do you mind taking a look?
Thanks.
September 29th, 2011 at 6:52 pm
Zhihai, I’ve updated issue #88 [1] with more information, and confirmed that it is indeed a duplicate of issue #57 [2]. Once you correct your metadata files, everything should work as expected.
[1] http://code.google.com/p/alfresco-bulk-filesystem-import/issues/detail?id=88
[2] http://code.google.com/p/alfresco-bulk-filesystem-import/issues/detail?id=57
September 29th, 2011 at 9:28 pm
I’ve just created a mailing list to help facilitate assistance with and discussion of the tool. Please try to use that resource rather than commenting here, since blog comments aren’t great for that kind of thing. Thanks!
October 10th, 2011 at 6:48 pm
I’m trying to use the Alfresco Bulk Filesystem Import on Community Edition 3.4.d (RHEL platform). After some trouble importing content (the Source Directory had to be under /tomcat/bin, which I’m sure is not the standard way), I haven’t managed to import metadata. I placed the content files alongside their respective metadata files, in the same hierarchical organization as in the repository. As a result, the content is loaded but the metadata is not; in fact, each metadata file is loaded into the repository like any other file. Is this the correct way to load metadata? The documentation is very succinct: it shows how to configure the tool but not how to use it.
Thanks for your help,
Additional information:
Custom content model file (it’s working well by using Alfresco Explorer)
Customizacao para o Laboratorio de Conversao de Midia
Candido
2011-09-29
1.0
Relacao de Lancamento por Lote
cm:content
Microfilmagem
d:text
true
false
true
A metadata file:
cm:digitalizacao01
1234/1056-1078
Lote de Lancamento
Roseli
October 10th, 2011 at 6:54 pm
Peter, I’m sorry for the previous post: the XML sample I sent didn’t come through well on your blog. If you want, I can send it as an attachment by e-mail.
Best regards,
October 10th, 2011 at 9:08 pm
Luiz, would you mind raising this in the mailing list? That’s a much better forum for discussing topics like this one.
January 12th, 2012 at 1:30 pm
Hi,
I was using your bulk import tool. It works great. I would like to know how we can declare the imported files as records. According to one of the books I read, we can put cm:declareRecords in the aspects tag and then supply all the mandatory properties, and the file will be declared as a record. But it doesn’t seem to work for me. Can someone help, please?
Pallavi
January 13th, 2012 at 3:32 am
Pallavi, can I suggest you raise this on the project’s mailing list? That’s a much better forum for discussing topics like this one.