Monday, August 17, 2009

Content Migration - How we got a project over the finish line 16.9 times faster

Last year we ended up migrating 38 web sites and major site sections (10,710 pages in total) in just over a week each. A reputable nationwide vendor had estimated each of them to take 3 to 6 months. How did we do it?

The short answer is this: we used Batch Loader. Easy enough. Am I simply comparing a manual import to the use of a tool? Nope. I'm not about to waste your time. After all, that vendor was also using Batch Loader.

Batch Loader Turbo Charged

When it comes to a mass check-in, Batch Loader is a nice and useful tool, but when you're loading tens of thousands of files from dozens of locations, and each file has unique derived values in its metadata, Batch Loader alone won't be of much help.

Almost any enterprise-scale content migration will have you falling flat on your face if you simply rely on Batch Loader to "magically" load your content into ECM.

So what is the quickest way to automate such a migration?

It's simple. The answer becomes obvious when you look at HOW Batch Loader works. It processes a typical HDA file one record at a time. It picks up a file from the location you specify in the primaryFile field, sets the metadata values to the ones you tell it to use, and calls the Check In service. In other words, it reads the batch loader script one record at a time and checks in files one by one.

What if we could create one cool batch loader script that imports all the files we want? All at once! Sure, that would be nice, but how do we go about creating one?

The Batch Builder utility that comes with Batch Loader is very limited. It builds a very simple script based on the contents of a single directory and lets you use file system data as metadata. It won't let you pick up files from multiple locations or create complex metadata values.

So, here's the biggie - to turbo-charge your content migration effort, you should consider GENERATING your own batch loader scripts.

How to use Code Generation effectively

How do you go about generating it? For a simple migration, you can get away with using your editor's search and replace function on a comma-separated list of files.

Let's say your Excel file has the following columns:

Content Id, Title, Author, Security Group, Account, Doc Type, Date, File Path

After you save it in a comma-separated (CSV) format, you'll end up with something like this:

A2561405, Migration Project Plan, Bill, Public, , abstract, 8/12/09 4:20 PM, C:/Migration/Project Plan v.3.4.doc

Now, you could use a RegEx like this to produce a batch loader script out of your CSV file:

Replace this:

^([^,]*),([^,]*),([^,]*),([^,]*),([^,]*),([^,]*),([^,]*),([^,]*)$

( If you're new to regular expressions, this says:

- Begin at the start of a line
- Capture every character up to the next comma, eight times over - the eight fields are separated by seven literal commas
- Finish at the end of the line )

With this replacement string:

dDocName=$1\ndDocTitle=$2\ndDocAuthor=$3\ndSecurityGroup=$4\ndDocAccount=$5\ndDocType=$6\ndInDate=$7\nprimaryFile=$8\n<<EOD>>\n

You may need to test the RegEx in your own editor, as each one has a slightly different syntax.

After you run it, your comma-separated line will transform into an HDA entry that looks like this:

dDocName=A2561405
dDocTitle=Migration Project Plan
dDocAuthor=Bill
dSecurityGroup=Public
dDocAccount=
dDocType=abstract
dInDate=8/12/09 4:20 PM
primaryFile=C:/Migration/Project Plan v.3.4.doc
<<EOD>>

I hope you get the idea.
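
By the way, if your editor's RegEx dialect fights you, the same search and replace runs unchanged from a scripting language. Here's a minimal sketch in Python (the file names are placeholders, not anything from the actual project):

    import re

    # The search pattern and replacement string from above, ported verbatim.
    PATTERN = re.compile(r'^([^,]*),([^,]*),([^,]*),([^,]*),'
                         r'([^,]*),([^,]*),([^,]*),([^,]*)$')
    REPLACEMENT = ('dDocName=\\1\ndDocTitle=\\2\ndDocAuthor=\\3\n'
                   'dSecurityGroup=\\4\ndDocAccount=\\5\ndDocType=\\6\n'
                   'dInDate=\\7\nprimaryFile=\\8\n<<EOD>>')

    with open('migration.csv') as src, open('batchload.hda', 'w') as out:
        for line in src:
            out.write(PATTERN.sub(REPLACEMENT, line.rstrip('\n')) + '\n')

One caveat either way: if your spreadsheet export puts a space after each comma (like the sample line above does), those spaces land inside your metadata values, so trim them in the export - or read on.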

How to scale it up

You can adapt this technique to any level of complexity. Just use Perl, Ruby, or another scripting language of your choice to generate the metadata values, file names, and locations.

Be sure to use subroutines, structure your code well, and store it in a source-control system. A code-generation script can get quite complex quite quickly.
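
To make that concrete, here's a minimal sketch of such a generator in Python. The column layout matches the Excel example above, but the file names and the dDocName fallback rule are illustrative assumptions on my part, not something prescribed by Batch Loader:

    import csv
    import os

    # Column order expected from the spreadsheet export shown earlier.
    FIELDS = ['dDocName', 'dDocTitle', 'dDocAuthor', 'dSecurityGroup',
              'dDocAccount', 'dDocType', 'dInDate', 'primaryFile']

    def derive_docname(path):
        # One example of a "derived" metadata value: build a Content ID
        # from the file name when the spreadsheet leaves it blank.
        return os.path.splitext(os.path.basename(path))[0].upper()

    def make_record(row):
        # Trim the stray spaces Excel leaves after each comma.
        values = dict(zip(FIELDS, (v.strip() for v in row)))
        if not values['dDocName']:
            values['dDocName'] = derive_docname(values['primaryFile'])
        return '\n'.join('%s=%s' % (f, values[f]) for f in FIELDS) + '\n<<EOD>>\n'

    with open('migration.csv') as src, open('batchload.hda', 'w') as out:
        for row in csv.reader(src):
            if len(row) == len(FIELDS):  # skip blank or malformed lines
                out.write(make_record(row))

A nice side effect of using a proper CSV parser instead of a raw RegEx: a Title like "Plans, Budgets and Forecasts" survives intact, as long as Excel quotes it on export.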

Important last-minute tips

Today, there will be three:

- You'll very likely need to debug your code-generation script and run your batch load file more than once, so be sure to:
  • Add a custom meta field or a special value like "batch_loader" for the dDocAuthor so you can find your files quickly and delete them when it's time to start fresh
  • Test on a small subset (under 200 items) so you don't have to wait 8 hours for those 500 GB to import (see the sketch after this list)
- Be sure to CLEAR the "Clean up files after successful check in" box. If you leave it checked - your source files WILL be deleted and you won't find them in the Recycle Bin!
- Be sure to MARK the "Enable error file" box. This will create a detailed log file and a smaller batch loader script file for the files that didn't load. Absolutely essential!
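
For the second point, it helps to be able to slice a big generated file down to a test-sized batch. Here's a quick sketch (Python again, with placeholder file names), relying on the <<EOD>> delimiter between records:

    # Keep only the first N records of a generated batch load file.
    N = 200
    with open('batchload.hda') as src:
        records = [r for r in src.read().split('<<EOD>>') if r.strip()]
    with open('batchload_test.hda', 'w') as out:
        out.write('<<EOD>>'.join(records[:N]) + '<<EOD>>\n')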

That's all for now.

Happy Migrating!

9 comments:

  1. Is "Enable Error File" available as a command line parameter?

  2. Hi Mike

    From what I know, Batch Loader is a GUI app, not a command-line one... Why would you bother calling Batch Loader from a script anyway?

    Actually, there's an even more powerful tool you can use to automate your content migrations. It's called IdcCommand. If I see enough people looking for that info - I'll do an article on that too...

    Best
    D

  3. Batchloader is ABSOLUTELY a command-line application. Been using it that way for years (mainly due to issues running X-Windows, but that's another story). Most of the loads happen quickly and without issue. However, occasionally there is a problem with a filename or another parameter that prevents a record from loading. "Enable Error File" would have saved me time pulling out the few records that failed. Just wondering if that parameter is available. I have read through the docs (for 7.5.x) and there is no mention of the parameter.

  4. Just remembered this blog again. Did you ever determine if there is an Error File switch on the command line for Batch Loader?

  5. In quite a similar way, I prefer using Excel macros to create batch loader script entries for each file. I find that easier and faster.

    And the batch loader parameters are available on the command line as well. These configuration settings are recommended to be set in /bin/intradoc.cfg:

    EnableErrorFile=true (writes an error log for all files whose upload fails during a batch)

    MaxErrorsAllowed=100 (sets the number of errors after which Batch Loader stops processing records from the batch load file)

    CleanUp=false (if true, the source files are deleted from the file system as they are batch loaded)

    And this fires the tool in quiet command-line mode, making it even faster:
    BatchLoader -q -console -n

  6. Hi Dmitri,

    I'm looking for some information on setting up Archiver for contribution-to-consumption replication. Could you please give me some information so I know where to start?

    Thanks in advance

  7. Hello Miguel

    Essentially, setting up replication involves 4 steps:

    1. Setting up an outgoing provider on your source system
    2. Updating the host name filter on your target system - to let your source system connect
    3. Setting up an export-automated archive on the source system
    4. Setting up an archive on the target system that will automatically import when the transfer is completed

    For low-level detail and step-by-step guidance, go to page 151 of the Content Server - System Migration Guide 10gR3 (the page number differs in the 11g guide).


    The Guide is also available here: http://download.oracle.com/docs/cd/E17967_01/cs/cs_doc_10/documentation/admin/sysmigration_cserver_10en.pdf

  8. Hi, I am trying to estimate the effort it takes to move 18 million PDF and TIFF files (a 50/50 distribution) into UCM in the fastest time possible. Do you suggest a tool that lets me do that? How do legacy systems that have been around for decades move to UCM?

  9. Batch Loader is still your tool of choice. You may want to shut off automatic index updates while you import.... Do a test load with a thousand items and see how long it takes. It all depends on your configuration - whether you have an IBR, full-text indexing, and so on. The speed of your database and file system also plays a part.

    If the system is not performing, there are a lot of options to consider....
