This repository has been archived by the owner on Sep 26, 2019. It is now read-only.

Introduction

Sascha T. Ishikawa edited this page Jul 9, 2015 · 5 revisions

What is Scribe 2.0?

Scribe 2.0 is a partial-text, or metadata, transcription framework that reduces a complex transcription task into simpler, self-contained sub-tasks. The division of labor into atomic tasks is further reinforced by providing volunteers with canonical and independent workflows to mark, transcribe, and verify documents. Furthermore, it enables volunteers to step through any workflow in no particular order. Though Scribe 2.0 was not originally intended to be a full-transcription framework, we believe it can be adapted as such.

The Mark-Transcribe-Verify Framework

The front-end first downloads a document for transcription. It is retrieved in the form of a subject, which is a JSON object containing a URI to a uniquely identifiable media document (in this case an image of a document that needs to be transcribed) along with data that may indicate how many other users have transcribed the particular document, additional metadata, etc.

The process of transcribing an entire document begins with the “mark” workflow. Given a document, a volunteer is asked to choose from a set of tools to mark specific pre-defined regions in need of transcription. Once a volunteer indicates that a mark is complete, by clicking the “done” button, it is submitted to the server as a classification. A classification is a record of a volunteer’s response(s) to a given subject’s workflow and contains important information used for the final transcription, A/B splits, etc. This information includes the subject ID, timestamp(s), A/B split designations, and a hash of annotations with subject-specific responses such as mark locations and dimensions, transcription text, or verification responses.

Once a “mark” classification is received, it is processed by the server [1], saved to a database, and a secondary “transcribe” subject is generated. The “transcribe” subject references its parent subject and contains, among other information, a URI to the same image as the subject it was generated from; however, it is part of a separate “transcribe” workflow that asks users to focus on a subregion of the image corresponding to the mark that was generated previously (either by the same user or someone else). Each workflow fetches its own subjects and therefore can be accessed independently, with a separate UI. In other words, the unit of work is atomic and allows different stages of transcription to be completed not only by volunteers but also by automated approaches. For example, in some instances, the “mark” workflow may be replaced with a computer vision algorithm that detects regions of text to be transcribed. In such cases, volunteers would only interact with a “transcribe” (and perhaps “verify”) workflow.

From the front-end perspective, the workflows operate independently of each other; they are routed to their own URL hashes; they request their own subjects, use their own tools, and create their own classifications. However, they are linked through the back-end. A “mark” subject produces a “mark” classification, which generates a secondary “transcribe” subject and then a “transcribe” classification. Depending on the needs of a project, each “mark” or “transcribe” classification may produce additional “verify” subjects that can be presented to volunteers as a final verification step before a final transcription is produced. It becomes clear that each subsequent workflow depends on the “mark” workflow. Without any marks, there is nothing to transcribe or verify.

The separation of concerns between the “mark” and “transcribe” (and “verify”) workflows is an important aspect of Scribe 2.0 for several reasons. First and foremost, it removes the need to store the state in between workflows. This simplifies the front-end logic; rather than having an all-encompassing UI component that keeps track of a user’s progress throughout multiple workflows, where each step potentially reflects another intermediate state, having a single component responsible for each workflow reduces the number of states a component is responsible for. This simplifies the design of each component and promotes a more modular and readable codebase.

This serves to highlight one more advantage of independent workflows in the front-end. Assuming enough marks are produced, volunteers are afforded the flexibility to choose any workflow over the others, in no particular order, or step through all of them in sequence.

[1] “Marks” are sent to the server as a classification and denormalized into a “transcribe” subject. During denormalization, some fields from the “mark” classification are copied to the “transcribe” subject as it is generated. The most important of these are the “mark” subject’s ID, the annotation data that specifies the mark’s location and/or dimensions, and the type of data the mark represents.
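
The denormalization step described in the footnote above can be sketched as follows. This is an illustration in Python under stated assumptions, not the server's actual code: the function name transcribe_subjects_from_mark and the output fields workflow, parent_subject_id, region, and type are hypothetical, and annotations are modeled as a plain list for brevity. Only subject_id, location, annotations, tool, x, y, width, and height are taken from the classification example later on this page.

```python
from typing import Any


def transcribe_subjects_from_mark(classification: dict[str, Any]) -> list[dict[str, Any]]:
    """Build one hypothetical "transcribe" subject per mark in a "mark" classification."""
    subjects = []
    for annotation in classification["annotations"]:
        marks = annotation.get("value")
        if not isinstance(marks, list):
            continue  # e.g. a yes/no answer: nothing to transcribe
        for mark in marks:
            subjects.append({
                "workflow": "transcribe",  # assumed field name
                "parent_subject_id": classification["subject_id"],  # assumed field name
                "location": classification["location"],  # same image as the parent
                # Keep only the geometry fields this mark actually carries.
                "region": {k: mark[k] for k in ("x", "y", "width", "height") if k in mark},
                "type": mark["tool"],  # assumption: the tool stands in for the data type
            })
    return subjects


mark_classification = {
    "subject_id": "5526dbdc65626f1b6b050000",
    "location": "offline/example_subjects/logbookofalfredg1851unse_0083.jpg",
    "annotations": [
        {"task": "determine_has_records", "value": "yes"},
        {"task": "identify_records", "value": [
            {"tool": "1", "x": 995.99, "y": 219.47, "width": 734.02, "height": 73.16},
        ]},
    ],
}

for subject in transcribe_subjects_from_mark(mark_classification):
    print(subject["parent_subject_id"], subject["region"])
```

Each generated subject carries everything the “transcribe” workflow needs, so that workflow never has to look up the original classification.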

Subjects, Workflows, and Classifications (Oh, my!)

The API provides subjects, defines workflows, and receives classifications. In addition, the site content, from the custom background image URL to the HTML markup representing the web page content, is returned by the API and inserted into existing templates.

[TO DO: write about how subjects are fetched]

Subject-Workflow Relationship

Each subject belongs to a specific workflow. Therefore, in order to retrieve subjects, a workflow must be designated. Scribe 2.0 provides the following endpoint for this,

http://project.scribe.org/workflows/:workflow_id/subjects.json?limit=:limit

where :workflow_id is the ID or name of a specific workflow and :limit specifies the maximum number of subjects to return. The subjects are returned in a JSON object.
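
As a rough illustration, the endpoint URL can be assembled like this. The helper name subjects_url is hypothetical, and using "mark" as a workflow name is an assumption based on the workflows described above; the sketch only builds the URL and does not contact a server.

```python
from urllib.parse import quote, urlencode


def subjects_url(base_url: str, workflow_id: str, limit: int) -> str:
    """Return the subjects endpoint URL for one workflow (ID or name)."""
    if limit < 1:
        raise ValueError("limit must be a positive integer")
    # Escape the workflow identifier so it is safe as a single path segment.
    path = f"/workflows/{quote(workflow_id, safe='')}/subjects.json"
    return f"{base_url.rstrip('/')}{path}?{urlencode({'limit': limit})}"


print(subjects_url("http://project.scribe.org", "mark", 5))
# http://project.scribe.org/workflows/mark/subjects.json?limit=5
```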

A workflow contains all the necessary information to produce a classification for a particular subject. This includes a list of tasks to be completed, their progression (defined by specifying the next task in the sequence), the required tools, available choices, and instructions to prompt the user to perform a particular action.
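
To make the shape of a workflow more concrete, here is a minimal sketch of a linear workflow and a helper that walks its task progression. The keys first_task, tasks, and tool, and the two task names, appear elsewhere on this page; the keys name, instruction, and next_task, the tool names, and the helper task_sequence are assumptions made for illustration only.

```python
workflow = {
    "name": "mark",  # assumed key
    "first_task": "determine_has_records",
    "tasks": {
        "determine_has_records": {
            "tool": "pickOne",  # hypothetical tool name
            "instruction": "Does this page contain any records?",
            "next_task": "identify_records",  # assumed key for the progression
        },
        "identify_records": {
            "tool": "drawing",  # hypothetical tool name
            "instruction": "Draw a mark around each record.",
            "next_task": None,  # end of the workflow
        },
    },
}


def task_sequence(workflow: dict) -> list[str]:
    """Follow next_task links from first_task and return the task keys in order."""
    order = []
    key = workflow["first_task"]
    while key is not None:
        if key in order:
            raise ValueError(f"task cycle detected at {key!r}")
        order.append(key)
        key = workflow["tasks"][key]["next_task"]
    return order


print(task_sequence(workflow))
# ['determine_has_records', 'identify_records']
```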

The Lifecycle of a Subject

Creation

Retirement

Retirement Rules

The status field in the subject’s model can take on the following values:

  • inactive
  • active
  • retired
  • complete

What is a classification?

A classification is a record of a user’s work on a subject. There is a one-to-one relationship between subjects and classifications. Think of a classification as the end product of a workflow, where user-generated data gets appended after the completion of each task within the workflow.

Like subjects, classifications are represented as JSON objects containing key/value hashes and arrays. Data collected from a user is stored within “annotations”, where each element represents an object returned from a task. The JSON object below represents a typical classification; note that in this stored form “annotations” is serialized as an object keyed by index rather than as an array.

{
  "_id": ObjectId("5527fe2545626f2c7a010000"),
  "workflow_id": ObjectId("5526dbdc65626f1b6b020000"),
  "subject_id": ObjectId("5526dbdc65626f1b6b050000"),
  "annotations": {
    "0": {
      "task": "determine_has_records",
      "value": "yes",
      "_key": "0.04673911537975073"
    },
    "1": {
      "task": "identify_records",
      "_key": "0.2517089892644435",
      "_toolIndex": "2",
      "value": {
        "0": {
          "key": "0",
          "tool": "0",
          "x": "459.50298683142955",
          "y": "441.3859649122807",
          "_key": "0.6388408085331321"
        },
        "1": {
          "key": "1",
          "tool": "1",
          "x": "995.9941801192058",
          "y": "219.4736842105263",
          "width": "734.017496271003",
          "height": "73.15789473684214",
          "_key": "0.4052213248796761",
          "status": "mark"
        },
        "2": {
          "key": "2",
          "tool": "2",
          "x": "1588.573089068886",
          "y": "1024.2105263157894",
          "yLower": "1074.2105263157894",
          "yUpper": "974.2105263157894",
          "_key": "0.15537359076552093"
        }
      }
    }
  },
  "location": "offline/example_subjects/logbookofalfredg1851unse_0083.jpg",
  "started_at": "2015-04-10T16:45:18.559Z",
  "finished_at": "2015-04-10T16:45:25.616Z",
  "user_agent": "Mozilla/5.0 (Macintosh; Intel Mac OS X 10_9_5) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/41.0.2272.118 Safari/537.36"
}

The workflow can be partially reconstructed simply by inspecting the annotations. The first task, named determine_has_records, prompts the user to indicate if the subject has any transcribable records. The second task, identify_records, asks users to draw marks around each record to be transcribed at a later time. It’s clear that the user marked three records, each with a different marking tool, identified by its tool index. Note that the tools produce slightly different data.
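
The inspection described above can be sketched in a few lines of Python. The annotations below are copied from the example classification with the _key, key, _toolIndex, and status fields omitted for brevity; the helper names in_index_order and summarize are hypothetical.

```python
annotations = {
    "0": {"task": "determine_has_records", "value": "yes"},
    "1": {
        "task": "identify_records",
        "value": {
            "0": {"tool": "0", "x": "459.50298683142955", "y": "441.3859649122807"},
            "1": {"tool": "1", "x": "995.9941801192058", "y": "219.4736842105263",
                  "width": "734.017496271003", "height": "73.15789473684214"},
            "2": {"tool": "2", "x": "1588.573089068886", "y": "1024.2105263157894",
                  "yLower": "1074.2105263157894", "yUpper": "974.2105263157894"},
        },
    },
}


def in_index_order(indexed: dict) -> list:
    """Turn an object keyed by "0", "1", ... into a list ordered by index."""
    return [indexed[k] for k in sorted(indexed, key=int)]


def summarize(annotations: dict) -> list[tuple[str, object]]:
    """Return (task, summary) pairs: the answer, or the tool index of each mark."""
    summary = []
    for annotation in in_index_order(annotations):
        value = annotation["value"]
        if isinstance(value, dict):  # a set of marks rather than a single answer
            value = [mark["tool"] for mark in in_index_order(value)]
        summary.append((annotation["task"], value))
    return summary


print(summarize(annotations))
# [('determine_has_records', 'yes'), ('identify_records', ['0', '1', '2'])]
```

Sorting the keys numerically matters once there are ten or more entries, since a plain string sort would place "10" before "2".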

Workflows

Workflows are represented by a JSON object.

An empty classification is created at the beginning of each workflow. As the user advances through each task in the workflow, the data collected from that task is appended to the classification’s annotations.

The state of the workflow, that is, which task and tool is currently active, whether or not the task navigation allows you to go onto the next task or submit your work, etc., depends entirely upon the contents of the last annotation. The following code uses currentAnnotation to determine the task component that will be stored in TaskComponent for rendering.

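annotations = @props.classification.annotations
currentAnnotation = annotations[annotations.length - 1] # the last annotation

# Look up the task definition, then the component registered for its tool.
# The ?. guards avoid a TypeError when there is no annotation or task yet.
currentTask = @props.workflow.tasks[currentAnnotation?.task]
TaskComponent = tasks[currentTask?.tool]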

When a workflow is first loaded, the current task automatically defaults to the one specified by the first_task key. An empty annotation is created with a call to the addAnnotationForTask() method, which is passed a task key. In this case, where no annotations have been made yet, taskKey == first_task.
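
The rule that the last annotation determines the active task, with first_task as the starting point, can be illustrated with a small language-neutral sketch in Python. The function add_annotation_for_task mirrors the addAnnotationForTask() method named above, but its body here is an assumption, as is the helper current_task_key; the real front-end logic lives in the CoffeeScript components.

```python
def add_annotation_for_task(classification: dict, task_key: str) -> dict:
    """Append an empty annotation for task_key and return it."""
    annotation = {"task": task_key}
    classification["annotations"].append(annotation)
    return annotation


def current_task_key(workflow: dict, classification: dict) -> str:
    """Return the key of the active task, starting the workflow if needed."""
    if not classification["annotations"]:
        # Nothing recorded yet: begin at the workflow's first_task.
        add_annotation_for_task(classification, workflow["first_task"])
    # The last annotation always identifies the task in progress.
    return classification["annotations"][-1]["task"]


workflow = {"first_task": "determine_has_records"}
classification = {"annotations": []}

print(current_task_key(workflow, classification))  # determine_has_records
add_annotation_for_task(classification, "identify_records")
print(current_task_key(workflow, classification))  # identify_records
```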

A classification prop is set before the Mark component is mounted. This is where all the user’s work is stored, within the annotations object. Any changes to the classification are applied via json-api-client (https://www.npmjs.com/package/json-api-client), which listens for events and uses Promises to keep asynchronous events in order.

Tasks

Tasks are managed in the “mark” route component (mark/index.cjsx); the equivalent for the “transcribe” route is still in development. There they are kept synchronized with the annotations in the classification.

Each task is a pre-defined component located in the tasks directory. All available tasks are indexed in index.cjsx, located in the same directory. They can be used in other components by requiring the index file. To select a particular task component for rendering, use the corresponding hash key, as in the example that follows.

tasks = require '../tasks'
TaskComponent = tasks['textRow'] # returns the text-row task component

…
render: ->
  …
  <TaskComponent … />

Tools

The Workflow Component

The workflow component (for example the marking component in mark/index.cjsx) is a ReactJS component that

Child Subjects

Secondary or tertiary subjects are generated from the “mark” classifications.

In the case of the “transcribe” workflow, a secondary subject corresponds to a mark over a document region with text that must be transcribed. All other regions of the document are made irrelevant as volunteers may only transcribe one region at a time (the rest of the document may be cropped out or obscured).

Groups

Subjects can be collected into groups, each of which belongs to a specific workflow. When the front-end requests a subject, it can do so from a specific workflow or from a specific group within a workflow. This enables small campaigns and the grouping of subjects by region, location, time frame, or other criteria. Groups can contain metadata about the grouping they represent.
