rjrudin edited this page Aug 21, 2017 · 32 revisions

Version 3.0.0 features several new tasks for leveraging the new Data Movement SDK (DMSDK) in version 4 of the MarkLogic Java Client API.

Until version 3.0.0 is available, you can try this out on 3.0-beta1, which has all of the new DMSDK tasks in it. And you can easily create your own tasks that use DMSDK - see below for more information.

The problem

The goal of these new tasks is to solve a common problem: you need to perform some kind of update operation on thousands, tens of thousands, or even more documents, and the operation times out in Query Console or a simple main module.

So you then either break the operation up and run it multiple times in Query Console, or you create a new CoRB job for an ad hoc operation. CoRB is an important tool on MarkLogic projects for running transforms, particularly as part of a deployment (like migrating your data as part of a new release). But ideally you can get the job done with just the Java Client API, which you already have access to in your build.gradle file via ml-gradle, and by leveraging the new features and benefits of DMSDK.

The solution

ml-gradle now provides a solution to this common problem by using DMSDK to perform all the updates, thus scaling to any number of documents, and with a simple command line interface. The tasks in 3.0.0 are focused on common update operations on document collections and permissions, along with using collections and URI patterns to select the documents to update. But there's also support for easily creating your own tasks that use DMSDK to perform any kind of update based on any set of documents.

So while you'll almost certainly keep using CoRB and Gradle together for transforms that need to be repeated often or that benefit from custom code that goes beyond simple queries and update operations, you can use these new DMSDK-based Gradle tasks for simple operations that don't need custom code and can be knocked out quickly via the command line and a few parameters.

Trying it out

To see all the new tasks, just run the following:

gradle tasks

And look for the new "Data Movement Tasks" group.

Here are a few examples to give you an idea of how the tasks work.

Let's say we have 1 million documents in a collection named "red". We can easily add those to another collection - note how "whereCollections" defines the comma-separated set of collections of documents we want to modify (a document only needs to belong to one of the collections), and "collections" defines the comma-separated collections we want to add to each selected document:

gradle mlAddCollections -Pcollections=blue -PwhereCollections=red

Generally, properties that let you select documents to modify will be prefixed with "where".

We can also explicitly set all the collections:

gradle mlSetCollections -Pcollections=red,blue,green -PwhereCollections=red

And then remove collections to get back to our original state:

gradle mlRemoveCollections -Pcollections=blue,green -PwhereCollections=red
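
To make the add/set/remove sequence above concrete, here is a minimal Java sketch that models one document's collections as an in-memory set. It is only an illustration of the semantics described on this page, not how ml-gradle or DMSDK perform the updates; the class name CollectionOpsSketch and its methods are hypothetical, and it assumes that "set" replaces whatever collections the document already has.

```java
import java.util.Arrays;
import java.util.LinkedHashSet;
import java.util.Set;

// Illustrative model only: a document's collections held as an in-memory set.
// The class and method names are hypothetical and not part of ml-gradle or DMSDK.
class CollectionOpsSketch {

    // Splits a comma-separated property value such as "red,blue" into a set.
    static Set<String> parse(String csv) {
        return new LinkedHashSet<>(Arrays.asList(csv.split(",")));
    }

    // A document is selected if it belongs to at least one "where" collection.
    static boolean isSelected(Set<String> docCollections, Set<String> whereCollections) {
        for (String collection : whereCollections) {
            if (docCollections.contains(collection)) {
                return true;
            }
        }
        return false;
    }

    // Models mlAddCollections: existing collections are kept, new ones are added.
    static Set<String> add(Set<String> current, Set<String> toAdd) {
        Set<String> result = new LinkedHashSet<>(current);
        result.addAll(toAdd);
        return result;
    }

    // Models mlSetCollections, assuming existing collections are replaced.
    static Set<String> set(Set<String> current, Set<String> toSet) {
        return new LinkedHashSet<>(toSet);
    }

    // Models mlRemoveCollections: only the named collections are removed.
    static Set<String> remove(Set<String> current, Set<String> toRemove) {
        Set<String> result = new LinkedHashSet<>(current);
        result.removeAll(toRemove);
        return result;
    }

    public static void main(String[] args) {
        Set<String> doc = parse("red");   // the document starts in "red"
        Set<String> where = parse("red"); // -PwhereCollections=red
        if (isSelected(doc, where)) {
            doc = add(doc, parse("blue"));            // red, blue
            doc = set(doc, parse("red,blue,green"));  // red, blue, green
            doc = remove(doc, parse("blue,green"));   // back to red
        }
        System.out.println(doc); // prints [red]
    }
}
```

Running the three steps in order leaves the document only in "red", which matches the "back to our original state" outcome of the three Gradle commands.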

We can also select documents via a URI pattern, which is processed under the hood by cts:uri-match. Version 3.1.0 will support specifying a full query on the command line as well; in 3.0.0, you can achieve that by writing your own task as described below:

gradle mlAddCollections -Pcollections=xmlDocuments -PwhereUriPattern=**.xml

And just like collections, we can set permissions too, using the common "role,capability,role,capability" syntax for specifying permissions:

gradle mlAddPermissions -Ppermissions=rest-reader,read,rest-writer,update -PwhereUriPattern=**.json
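
The "role,capability,role,capability" syntax reads as a flat list of pairs. As a rough illustration of how such a value breaks down, the Java sketch below splits it into role/capability pairs and rejects a value with an odd number of tokens. This is not ml-gradle's actual parser; the class name PermissionStringSketch is hypothetical, and rest-reader/read and rest-writer/update are used only as sample pairs.

```java
import java.util.ArrayList;
import java.util.List;

// Illustrative only: splits a "role,capability,role,capability" value into pairs.
// This is a hypothetical helper, not the parser that ml-gradle uses.
class PermissionStringSketch {

    // Returns a list of two-element arrays: [role, capability].
    static List<String[]> parse(String value) {
        String[] tokens = value.split(",");
        if (tokens.length % 2 != 0) {
            throw new IllegalArgumentException(
                "Expected role,capability pairs but got " + tokens.length + " token(s)");
        }
        List<String[]> pairs = new ArrayList<>();
        for (int i = 0; i < tokens.length; i += 2) {
            pairs.add(new String[] { tokens[i].trim(), tokens[i + 1].trim() });
        }
        return pairs;
    }

    public static void main(String[] args) {
        for (String[] pair : parse("rest-reader,read,rest-writer,update")) {
            System.out.println(pair[0] + " -> " + pair[1]);
        }
        // prints:
        // rest-reader -> read
        // rest-writer -> update
    }
}
```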

As you'd probably expect by now, you can use mlRemovePermissions and mlSetPermissions to remove and set document permissions as well.

And of course, sometimes you just need to delete entire collections that contain tens of millions of documents - no problem now:

gradle mlDeleteCollections -Pcollections=red,blue

Configurable properties for any DMSDK task

Each of these tasks has several properties that affect how DMSDK operates.

You can configure the thread count (defaults to 8):

gradle -PthreadCount=32 ...

Or the batch size (defaults to 100):

gradle -PbatchSize=50 ...

Or the job name:

gradle -PjobName=my-job ...

Or whether a consistent snapshot is used (defaults to true):

gradle -PapplyConsistentSnapshot=false ...

In addition, if you'd like some basic logging generated for each batch of URIs that's processed, just do the following:

gradle -PlogBatches=true ...
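
To get a feel for how the batch size relates to the amount of work, the small Java sketch below computes a lower bound on the number of batches for a given document count. It relies only on the fact that a batch holds at most batchSize URIs; the real number of batches may be higher, since batches are not guaranteed to be full. The class and method names are hypothetical.

```java
// Illustrative arithmetic only; BatchMathSketch is a hypothetical helper,
// not something provided by ml-gradle or DMSDK.
class BatchMathSketch {

    // Smallest possible number of batches when each batch holds at most batchSize URIs.
    static long minimumBatches(long documentCount, int batchSize) {
        if (documentCount < 0 || batchSize <= 0) {
            throw new IllegalArgumentException("documentCount must be >= 0 and batchSize must be > 0");
        }
        // Integer ceiling of documentCount / batchSize.
        return (documentCount + batchSize - 1) / batchSize;
    }

    public static void main(String[] args) {
        // 1 million documents with the default batch size of 100
        System.out.println(minimumBatches(1_000_000L, 100)); // prints 10000
        // The same documents with -PbatchSize=50
        System.out.println(minimumBatches(1_000_000L, 50));  // prints 20000
    }
}
```

If -PlogBatches=true is set, this gives a rough idea of the minimum number of batch log entries to expect for a job.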

Writing your own tasks that use DMSDK

Version 3.0.0 includes tasks for manipulating collections and permissions on documents selected via a URI pattern or collections, but these of course won't address every use case. You can easily reuse the plumbing that these tasks are using by creating a new task like this:

task myTask(type: com.marklogic.gradle.task.datamovement.DataMovementTask) {
  doLast {
    applyOnCollections(new com.example.MyListener(), "some-collection")
  }
}

With a little more code, you can access any method on the QueryBatcherTemplate class that's used under the hood - such as using a method that lets you run any query that returns URIs:

task myTask(type: com.marklogic.gradle.task.datamovement.DataMovementTask) {
  doLast {
    def client = newClient()
    try {
      newQueryBatcherTemplate(client).applyOnUrisQuery(new com.example.MyListener(), "cts:uris((), (), some query...)")
    } finally {
      client.release()
    }
  }
}

For more information, please see DataMovementTask in the ml-gradle project and QueryBatcherTemplate in the ml-javaclient-util project. QueryBatcherTemplate is a Spring-style template class that simplifies using a QueryBatcher.

Note that by overriding the newQueryBatcherTemplate method, you have complete control over how the QueryBatcherTemplate is configured - for instance, you can add as many URI listeners and failure listeners as you want.
