DeepDive Lite

Motivation

DeepDive Lite is an attempt to provide a lighter-weight interface to the process of creating a structured information extraction application in DeepDive. DeepDive Lite is built for rapid prototyping and development solely focused around defining an input/output schema, and creating a set of distant supervision rules. The goal is to then be able to directly plug these objects into DeepDive proper, and instantly get a more scalable, performant and customizable version of the application (which can then be iterated on within the DeepDive development framework).

One shorter-term motivation is also to provide a lighter-weight entry point to the DeepDive application development cycle for new, non-expert users. However, DeepDive Lite may also be useful for "expert" DeepDive users as a simple toolset for certain development and prototyping tasks.

DeepDive Lite is also part of a broader attempt to answer the following research questions: how much progress can be made with the schema and distant supervision rules being the sole user entry point to the application development process? To what degree can DeepDive be seen/used as an (iterative) compiler, which takes in a rule-based program, and transforms it to a statistical learning & inference-based one?

Installation / dependencies

First, make sure all git submodules have been downloaded:

git submodule update --init

DeepDive Lite requires a few Python packages, listed in python-package-requirement.txt. We provide a simple way to install everything using virtualenv:

# set up a Python virtualenv
virtualenv .virtualenv
source .virtualenv/bin/activate

pip install --requirement python-package-requirement.txt

Alternatively, the packages can be installed system-wide by skipping the virtualenv setup and activation and running the last command with sudo pip instead of pip.

In addition, the Stanford CoreNLP parser jars need to be downloaded; this can be done using:

./install-parser.sh

Finally, DeepDive Lite is built specifically with usage in Jupyter/IPython notebooks in mind. The jupyter command is installed as part of the above installation steps, so the following command, run within the virtualenv, opens our demo notebook:

jupyter notebook DeepDiveLite.ipynb

Basics

Please see the Jupyter notebook demo in DeepDiveLite.ipynb for more detail!

Preprocessing Input

The SentenceParser can be used to split a document (input as a string) into sentences, and to extract a range of basic linguistic features from these sentences, such as part-of-speech tags, a dependency tree parse, lemmatized words, etc:

parser = SentenceParser()
# parse() returns a generator; collect the Sentence objects into a list
sentences = list(parser.parse(doc_string))

The output of parser.parse is a generator of Sentence objects, each of which carries the sentence attributes mentioned above (part-of-speech tags, dependency parse, lemmas, etc.).

Note: this is often the slowest part of the process, so for large document sets we recommend pre-processing with high parallelism and/or outside of DeepDive Lite. Further speed improvements are planned [TODO].

Candidate Extraction

DeepDive Lite is (currently) focused around extracting relation mentions from text, involving either one or two entities. In either case, we define a Relations object, which extracts a set of candidate relation mentions. Our task is then to train the system to distinguish true relation mentions from false ones.

For the binary case, we define a relation based on two sets of entity mentions, described via declarative operators. For example, we can define a relation as occurring between phrases that match a list of gene names and phrases that match a list of phenotype names, and then extract the candidates from a set of sentences:

r = Relations(
  DictionaryMatch('G', genes, ignore_case=False),
  DictionaryMatch('P', phenos),
  sentences)

The Relations object both extracts the candidate relations and serves as the interface to, and container of, them. To access them as Relation objects, we use r.relations; a single candidate can be visualized via e.g. r.relations[0].render.
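To make the candidate-extraction step more concrete, here is a minimal, self-contained sketch of the idea behind dictionary matching in plain Python. It is not the DeepDive Lite implementation: the helper names `dictionary_matches` and `candidate_pairs`, the representation of a sentence as a list of word strings, the single-word matching, and the `ignore_case=True` default are all assumptions made for illustration.

```python
from itertools import product


def dictionary_matches(words, dictionary, ignore_case=True):
    """Return indices of words found in the dictionary (hypothetical helper)."""
    if ignore_case:
        lowered = {d.lower() for d in dictionary}
        return [i for i, w in enumerate(words) if w.lower() in lowered]
    exact = set(dictionary)
    return [i for i, w in enumerate(words) if w in exact]


def candidate_pairs(sentences, genes, phenos):
    """Yield (sentence_index, gene_word_index, pheno_word_index) candidates."""
    for s_idx, words in enumerate(sentences):
        gene_idxs = dictionary_matches(words, genes, ignore_case=False)
        pheno_idxs = dictionary_matches(words, phenos)
        # Pair every gene mention with every phenotype mention in the sentence.
        for g, p in product(gene_idxs, pheno_idxs):
            yield (s_idx, g, p)


sentences = [
    ["BRCA1", "mutations", "are", "linked", "to", "Cancer"],
    ["brca1", "is", "lowercase", "here"],
]
candidates = list(candidate_pairs(sentences, genes={"BRCA1"}, phenos={"cancer"}))
```

Only the first sentence produces a candidate: the gene match is case-sensitive, so "brca1" in the second sentence is skipped, while the phenotype match ignores case.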

Distant Supervision

The goal is now to create a set of rules that specify which relations are true versus false, which we will use to train the system to perform this inference correctly.*

In the context of DeepDive Lite, a rule is simply a function which accepts a Relation object and returns a value in {-1,0,1} (where 0 means 'abstain'). Once a list of rules is created, this list is applied to the Relations set via r.apply_rules(rules). This generates a matrix of rule labels r.rules, with rows corresponding to rules, and columns to relation candidates.
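As an illustration of what such rules and the resulting label matrix might look like, the sketch below uses a stand-in candidate class rather than the real Relation object. The class `FakeRelation`, its `words` attribute, and the helper `build_label_matrix` are hypothetical names introduced only for this example; the actual attributes available on a Relation are shown in the demo notebook.

```python
class FakeRelation:
    """Hypothetical stand-in for a Relation candidate: just the sentence words."""

    def __init__(self, words):
        self.words = words


def rule_causes(rel):
    # Vote "true" (1) when a causal verb appears in the sentence, else abstain (0).
    return 1 if "causes" in rel.words else 0


def rule_negation(rel):
    # Vote "false" (-1) when the sentence is negated, else abstain (0).
    return -1 if "not" in rel.words else 0


def build_label_matrix(rules, relations):
    """Build the label matrix: one row per rule, one column per candidate."""
    return [[rule(rel) for rel in relations] for rule in rules]


relations = [
    FakeRelation(["GENE", "causes", "PHENO"]),
    FakeRelation(["GENE", "does", "not", "affect", "PHENO"]),
    FakeRelation(["GENE", "and", "PHENO"]),
]
labels = build_label_matrix([rule_causes, rule_negation], relations)
```

Here `labels` has two rows (one per rule) and three columns (one per candidate); the third candidate receives 0 from both rules, meaning neither rule has an opinion about it.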

A natural question is: 'how well would my rules alone do on the classification task?' The answer provides a baseline for assessing performance further downstream. To answer it relative to a set of ground truth labels, we can use r.get_rule_priority_vote_accuracy(idxs, ground_truth).
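The following sketch shows one plausible reading of a "priority vote" baseline, under stated assumptions: rules are ordered from highest to lowest priority, the first non-abstaining rule decides each candidate's label, and `ground_truth[k]` is the true label of candidate `idxs[k]`. The function names are hypothetical and this is not necessarily how r.get_rule_priority_vote_accuracy is implemented.

```python
def priority_vote(label_matrix):
    """For each candidate (column), take the first non-abstaining rule's label.

    Assumes rows are ordered from highest to lowest priority; the vote is 0
    when every rule abstains.
    """
    n_candidates = len(label_matrix[0]) if label_matrix else 0
    votes = []
    for j in range(n_candidates):
        vote = 0
        for row in label_matrix:
            if row[j] != 0:
                vote = row[j]
                break
        votes.append(vote)
    return votes


def vote_accuracy(label_matrix, idxs, ground_truth):
    """Fraction of candidates in idxs whose vote equals the ground-truth label."""
    votes = priority_vote(label_matrix)
    correct = sum(1 for i, truth in zip(idxs, ground_truth) if votes[i] == truth)
    return correct / len(idxs)


label_matrix = [
    [1, 0, 0, 1],    # higher-priority rule
    [-1, -1, 0, 0],  # lower-priority rule
]
votes = priority_vote(label_matrix)
accuracy = vote_accuracy(label_matrix, idxs=[0, 1, 3], ground_truth=[1, -1, -1])
```

In this toy example the rules alone get two of the three labeled candidates right, which is the kind of baseline figure a learned model would then be compared against.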

*Note that if a set of labeled data is available, these labels could technically be used to create a trivial set of rules; however we assume we are operating in domains where a sufficiently large labeled training set is not available.

Feature Extraction

Feature extraction is done automatically via r.extract_features(). The method of featurization can however be selected and customized [TODO]. After this has been performed, a (sparse) matrix of features r.feats is generated, with rows corresponding to features and columns to relation candidates.

Learning

Learning of rule & feature weights can be done using logistic regression, via r.learn_feats_and_weights(). This generates a learned parameter array r.w. Predicted relation values (with -1 meaning false, and 1 meaning true) can then be generated via r.get_predicted, and accuracy relative to a set of ground truth labels via r.get_classification_accuracy(idxs, ground_truth).
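To show how a learned weight vector relates to the predicted values and the accuracy figure, here is a small sketch in plain Python. It assumes a standard logistic regression decision rule (predict true when the weighted sum is positive, i.e. when the modeled probability exceeds 0.5), a dense matrix with one row per feature and one column per candidate, and ground truth aligned with `idxs`. The functions `predict` and `classification_accuracy` are hypothetical and do not reproduce the library's internals.

```python
def predict(w, X):
    """Predict -1/1 per candidate from weights w and a dense matrix X.

    X has one row per feature and one column per candidate. Under logistic
    regression, P(true) > 0.5 exactly when the linear score w . x is positive.
    """
    n_candidates = len(X[0]) if X else 0
    predictions = []
    for j in range(n_candidates):
        score = sum(w_i * row[j] for w_i, row in zip(w, X))
        predictions.append(1 if score > 0 else -1)
    return predictions


def classification_accuracy(predictions, idxs, ground_truth):
    """Fraction of candidates in idxs whose prediction matches the ground truth."""
    correct = sum(1 for i, truth in zip(idxs, ground_truth) if predictions[i] == truth)
    return correct / len(idxs)


w = [2.0, -1.0]        # one weight per feature (toy values)
X = [
    [1, 0, 1],         # feature 0 for candidates 0..2
    [0, 1, 3],         # feature 1 for candidates 0..2
]
predicted = predict(w, X)
accuracy = classification_accuracy(predicted, idxs=[0, 2], ground_truth=[1, 1])
```

The linear scores for the three candidates are 2, -1 and -1, so only the first candidate is predicted true.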

