DeepDive Lite

Motivation

DeepDive Lite is an attempt to provide a lighter-weight interface to the process of creating a structured information extraction application in DeepDive. DeepDive Lite is built for rapid prototyping and development, focused solely on defining an input/output schema and creating a set of distant supervision rules. The goal is then to be able to plug these objects directly into DeepDive proper, and instantly get a more scalable, performant, and customizable version of the application (which can then be iterated on within the DeepDive development framework).

A shorter-term motivation is to provide a lighter-weight entry point to the DeepDive application development cycle for new, non-expert users. However, DeepDive Lite may also be useful to "expert" DeepDive users as a simple toolset for certain development and prototyping tasks.

DeepDive Lite is also part of a broader attempt to answer the following research questions: How much progress can be made with the schema and distant supervision rules as the sole user entry points to the application development process? To what degree can DeepDive be seen and used as an (iterative) compiler, which takes in a rule-based program and transforms it into a statistical learning and inference-based one?

Installation / dependencies

First, make sure all git submodules have been downloaded:

git submodule update --init

DeepDive Lite requires a few Python packages, which are listed in python-package-requirement.txt.

We provide a simple way to install everything using virtualenv:

# set up a Python virtualenv
virtualenv .virtualenv
source .virtualenv/bin/activate

pip install --requirement python-package-requirement.txt

Alternatively, the packages can be installed system-wide by skipping the virtualenv setup and activation and using sudo pip instead of pip in the last command:
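sudo pip install --requirement python-package-requirement.txt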

In addition, the Stanford CoreNLP parser jars need to be downloaded; this can be done by running:

./install-parser.sh

Finally, DeepDive Lite is built specifically with usage in Jupyter/IPython notebooks in mind. The jupyter command is installed as part of the above steps, so running the following command within the virtualenv opens our demo notebook:

jupyter notebook DeepDiveLite.ipynb

Basics

Please see the Jupyter notebook demo in DeepDiveLite.ipynb for more detail!

Preprocessing Input

The SentenceParser can be used to split a document (input as a string) into sentences, and to extract a range of basic linguistic features from these sentences, such as part-of-speech tags, a dependency tree parse, lemmatized words, etc.:

from ddlite import SentenceParser  # import path is an assumption; see the demo notebook for the exact import

parser = SentenceParser()
# parse() yields Sentence objects for the raw document string doc_string
sentences = list(parser.parse(doc_string))

parser.parse returns a generator of Sentence objects, which have various useful sentence attributes (as partially described above).
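As a quick sanity check, one can iterate over the parsed sentences and print a few of these attributes. A minimal sketch, where the attribute names words and lemmas are assumptions purely for illustration (see the demo notebook for the actual Sentence fields):

for sent in sentences:
  # Attribute names below are assumed; each Sentence carries the token-level
  # annotations described above (POS tags, dependency parse, lemmas, ...)
  print(sent.words)   # tokenized words
  print(sent.lemmas)  # lemmatized words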

Note: this is often the slowest part of the process, so for large document sets, pre-processing with high parallelism and/or outside of DeepDive Lite is recommended. Further speed improvements are planned [TODO].

Candidate Extraction

DeepDive Lite is (currently) focused on extracting relation mentions involving either one or two entities from text. In either case, we define a Relations object, which extracts a set of candidate relation mentions. Our task is then to train the system to distinguish true relation mentions from false ones.

For the binary case, we define a relation based on two sets of entity mentions, described via declarative operators. For example, we can define a relation as occurring between phrases that match a list of gene names and phrases that match a list of phenotype names, and then extract candidates from a set of sentences:

r = Relations(
  DictionaryMatch('G', genes, ignore_case=False),  # gene name matcher (case-sensitive)
  DictionaryMatch('P', phenos),                    # phenotype name matcher
  sentences)

The Relations object both extracts the candidate relations and serves as the interface to and container for them. To access them as Relation objects, we use r.relations; we can render a visualization of one via e.g. r.relations[0].render.
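For example, a minimal sketch of inspecting the extracted candidates (whether render is a property or a method is not specified here, so the last line is purely illustrative):

candidates = r.relations   # list of Relation objects
print(len(candidates))     # number of candidate relation mentions
candidates[0].render       # visualize the first candidate, as described above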

Distant Supervision

The goal is now to create a set of rules that specify which relations are true versus false, which we will use to train the system to perform this inference correctly.*

In the context of DeepDive Lite, a rule is simply a function which accepts a Relation object and returns a value in {-1,0,1} (where 0 means 'abstain'). Once a list of rules is created, this list is applied to the Relations set via r.apply_rules(rules). This generates a matrix of rule labels r.rules, with rows corresponding to rules, and columns to relation candidates.
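As an illustrative sketch, two rules for the gene-phenotype example above might look like the following (the lemmas attribute on the Relation object is an assumption for illustration; any attribute exposing the sentence's lemmatized words would serve the same purpose):

def LF_cause(r):
  # Label true if a causal word appears among the sentence lemmas (assumed attribute)
  return 1 if 'cause' in r.lemmas else 0

def LF_negation(r):
  # Label false if an explicit negation appears; otherwise abstain
  return -1 if 'not' in r.lemmas else 0

rules = [LF_cause, LF_negation]
r.apply_rules(rules)   # builds the rules-by-candidates label matrix r.rules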

A natural question is: how well would the rules alone do on the classification task? This provides a useful baseline for assessing performance further downstream. To answer this question relative to a set of ground truth labels, we can use r.get_rule_priority_vote_accuracy(idxs, ground_truth).
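For example, assuming idxs and ground_truth hold the indices and true labels of a labeled subset of the candidates:

baseline_acc = r.get_rule_priority_vote_accuracy(idxs, ground_truth)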

*Note that if a set of labeled data is available, these labels could technically be used to create a trivial set of rules; however, we assume we are operating in domains where a sufficiently large labeled training set is not available.

Feature Extraction

Feature extraction is done automatically via r.extract_features(). The method of featurization can, however, be selected and customized [TODO]. After this has been performed, a (sparse) matrix of features r.feats is generated, with rows corresponding to features and columns to relation candidates.
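For example, a minimal sketch (printing the shape assumes r.feats behaves like a standard scipy/numpy matrix, which is an assumption here):

r.extract_features()    # populate the sparse feature matrix r.feats
print(r.feats.shape)    # (number of features, number of candidates), per the layout above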

Learning

Learning of rule & feature weights can be done using logistic regression, via r.learn_feats_and_weights(). This generates a learned parameter array r.w. Predicted relation values (with -1 meaning false, and 1 meaning true) can then be generated via r.get_predicted, and accuracy relative to a set of ground truth labels via r.get_classification_accuracy(idxs, ground_truth).
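Putting the last few steps together, a minimal end-to-end sketch (whether get_predicted is a property or a method is not specified above, and idxs / ground_truth are assumed to hold a labeled evaluation subset):

r.learn_feats_and_weights()    # fit logistic regression over rules & features; sets r.w
predictions = r.get_predicted  # predicted labels, with -1 meaning false and 1 meaning true
accuracy = r.get_classification_accuracy(idxs, ground_truth)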
