FOBIE dataset and code for Semi-Open Relation Extraction, applied to Biology for Computer-Aided Biomimetics.

rubenkruiper/FOBIE

Semi-Open Relation Extraction

The Focused Open Biology Information Extraction (FOBIE) dataset aims to support Information Extraction (IE) for Computer-Aided Biomimetics. The dataset contains ~1,500 sentences from scientific biological texts. These sentences are annotated with TRADE-OFFS and syntactically similar relations between unbounded arguments, as well as argument-modifiers.

The FOBIE dataset has been used to explore Semi-Open Relation Extraction (SORE). The SORE code and instructions can be found in the Readme.md inside the SORE folder, or in the ReadTheDocs documentation.

Format

The train/dev/test data files are provided in two formats. The first is a verbose JSON format inspired by the SemEval-2018 Task 7 dataset:

{"[document_ID]":
  {"[relation_ID_within_document]":
    {"annotations":
      {"modifiers":
        {"[within_sentence_modifier_ID]":
          {"Arg0": {"span_start": "[token_index]",
                    "span_end": "[token_index]",
                    "span_id": "[brat_ID]",
                    "text": "[string]"},
           "Arg1": {"span_start": "[token_index]",
                    "span_end": "[token_index]",
                    "span_id": "[brat_ID]",
                    "text": "[string]"}
          }
        },
       "tradeoffs":
        {"[within_sentence_tradeoff_ID]":
          {"Arg0": {"span_start": "[token_index]",
                    "span_end": "[token_index]",
                    "span_id": "[brat_ID]",
                    "text": "[string]"},
           "Arg1": {"span_start": "[token_index]",
                    "span_end": "[token_index]",
                    "span_id": "[brat_ID]",
                    "text": "[string]"},
           "TO_indicator": {"span_start": "[token_index]",
                            "span_end": "[token_index]",
                            "span_id": "[brat_ID]",
                            "text": "[string]"},
           "labels": {"Confidence": "High"}
          }
        }
      },
     "sentence": "[string]"
    }
  },
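
As a minimal sketch of how the verbose format could be read, the snippet below walks the nesting shown above (document, relation, annotations, trade-offs) and collects the trade-off indicator with its argument texts. The record is made up for illustration, the helper name `iter_tradeoffs` is hypothetical, and storing the token indices as integers is an assumption; the released files may store them differently.

```python
# Illustrative record that follows the verbose schema; all values are invented.
example = {
    "doc_1": {
        "rel_1": {
            "annotations": {
                "modifiers": {},
                "tradeoffs": {
                    "T1": {
                        "Arg0": {"span_start": 5, "span_end": 5,
                                 "span_id": "T2", "text": "growth"},
                        "Arg1": {"span_start": 7, "span_end": 7,
                                 "span_id": "T3", "text": "reproduction"},
                        "TO_indicator": {"span_start": 3, "span_end": 3,
                                         "span_id": "T1", "text": "trade-off"},
                        "labels": {"Confidence": "High"},
                    }
                },
            },
            "sentence": "There is a trade-off between growth and reproduction .",
        }
    }
}


def iter_tradeoffs(data):
    """Yield (doc_id, sentence, indicator_text, argument_texts) per trade-off."""
    for doc_id, relations in data.items():
        for relation in relations.values():
            tradeoffs = relation["annotations"].get("tradeoffs", {})
            for tradeoff in tradeoffs.values():
                # Argument keys are "Arg0", "Arg1", ...; other keys are metadata.
                arg_keys = sorted(k for k in tradeoff if k.startswith("Arg"))
                args = [tradeoff[k]["text"] for k in arg_keys]
                yield (doc_id, relation["sentence"],
                       tradeoff["TO_indicator"]["text"], args)


for doc_id, sentence, indicator, args in iter_tradeoffs(example):
    print(doc_id, indicator, args)
```

For a real data file, `example` would be replaced by the result of `json.load` on one of the train/dev/test files.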

The second is the Sci-ERC dataset format, which is used to train the SciIE system:

{   "clusters": [],
    "sentences": [["List", "of", "some", "tokens", "."]],
    "ner": [[[4, 4, "Generic"]]],
    "relations": [[[4, 4, 6, 17, "Tradeoff"]]],
    "doc_key": "XXX"}
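
The snippet below is a small sketch of how the relation entries in this format could be turned back into text. It assumes that span indices count tokens over the whole document (all sentences concatenated) and that the end index is inclusive; these conventions are not stated above, so check them against the actual files. The record and the helper name `decode_relations` are invented for illustration, with the labels copied from the example above.

```python
from itertools import chain

# Invented single-sentence record in the format shown above.
record = {
    "clusters": [],
    "sentences": [["Growth", "trades", "off", "against", "reproduction", "."]],
    "ner": [[[0, 0, "Generic"], [1, 2, "Generic"], [4, 4, "Generic"]]],
    "relations": [[[1, 2, 0, 0, "Tradeoff"], [1, 2, 4, 4, "Tradeoff"]]],
    "doc_key": "XXX",
}


def decode_relations(rec):
    """Return (first_span_text, label, second_span_text) for every relation."""
    # Assumption: indices refer to the flattened token list of the document.
    tokens = list(chain.from_iterable(rec["sentences"]))
    decoded = []
    for sentence_relations in rec["relations"]:
        for start1, end1, start2, end2, label in sentence_relations:
            # Assumption: end indices are inclusive, hence the +1.
            first = " ".join(tokens[start1:end1 + 1])
            second = " ".join(tokens[start2:end2 + 1])
            decoded.append((first, label, second))
    return decoded


print(decode_relations(record))
```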

We also provide a script to convert data from the verbose format to SciIE format, as well as a script to convert BRAT annotations to the verbose format.
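
To illustrate how the two formats relate, here is a rough sketch of a verbose-to-SciIE conversion for a single annotated sentence. It is not the conversion script from this repository. It assumes whitespace-tokenised sentences, sentence-relative token indices with inclusive ends, and relations that run from the trade-off indicator to each argument. The function name `verbose_to_sciie`, the default label strings, and the `doc_key` scheme are all hypothetical.

```python
def verbose_to_sciie(doc_id, rel_id, entry, span_label="Generic",
                     tradeoff_label="Tradeoff", modifier_label="Arg_Modifier"):
    """Convert one verbose entry into a single-sentence SciIE-style record."""
    tokens = entry["sentence"].split()  # assumption: whitespace-tokenised
    spans, relations = set(), []

    def span(arg):
        # int() also covers files that store the indices as strings.
        start, end = int(arg["span_start"]), int(arg["span_end"])
        spans.add((start, end))
        return start, end

    annotations = entry["annotations"]
    for tradeoff in annotations.get("tradeoffs", {}).values():
        trigger = span(tradeoff["TO_indicator"])
        for key in sorted(k for k in tradeoff if k.startswith("Arg")):
            relations.append([*trigger, *span(tradeoff[key]), tradeoff_label])
    for modifier in annotations.get("modifiers", {}).values():
        relations.append([*span(modifier["Arg0"]), *span(modifier["Arg1"]),
                          modifier_label])

    return {
        "clusters": [],
        "sentences": [tokens],
        "ner": [[[s, e, span_label] for s, e in sorted(spans)]],
        "relations": [relations],
        "doc_key": f"{doc_id}_{rel_id}",  # hypothetical key scheme
    }


# Invented entry that follows the verbose schema.
entry = {
    "annotations": {
        "modifiers": {},
        "tradeoffs": {
            "T1": {
                "Arg0": {"span_start": "5", "span_end": "5",
                         "span_id": "T2", "text": "growth"},
                "Arg1": {"span_start": "7", "span_end": "7",
                         "span_id": "T3", "text": "reproduction"},
                "TO_indicator": {"span_start": "3", "span_end": "3",
                                 "span_id": "T1", "text": "trade-off"},
                "labels": {"Confidence": "High"},
            }
        },
    },
    "sentence": "There is a trade-off between growth and reproduction .",
}

converted = verbose_to_sciie("doc_1", "rel_1", entry)
print(converted["relations"])
```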

Statistics

Also see dataset_statistics.py under the scripts folder.

| | Train | Dev | Test | Total |
|---|---|---|---|---|
| # Unique documents | 1010 | 138 | 144 | 1292 |
| # Sentences | 1248 | 150 | 150 | 1548 |
| Avg. sent. length | 37.42 | 38.91 | 40.02 | 37.81 |
| % of sents ≥ 25 tokens | 82.21 % | 85.33 % | 83.33 % | 82.62 % |
| Relations: | | | | |
| - Trade-Off | 639 | 54 | 72 | 765 |
| - Not-a-Trade-Off | 2004 | 258 | 240 | 2502 |
| - Arg-Modifier | 1247 | 142 | 132 | 1521 |
| Triggers | 1292 | 155 | 153 | 1600 |
| Keyphrases | 3436 | 401 | 398 | 4235 |
| Keyphrases w/ multiple relations | 1600 | 188 | 163 | 1951 |
| Spans | 4728 | 556 | 551 | 5835 |
| Max relations/sent | 9 | 8 | 8 | |
| Max spans/sent | 9 | 8 | 8 | |
| Max triggers/sent | 2 | 2 | 2 | |
| Max args/trigger | 5 | 4 | 4 | |
| Unique spans | | | | 3643 |
| Unique triggers | | | | 41 |
| # single-word keyphrases | | | | 864 (20.4%) |
| Avg. tokens per keyphrase | | | | 3.46 |
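
As a rough illustration of how the sentence-level figures above could be computed, the sketch below derives the sentence count, the average sentence length, and the share of sentences with at least 25 tokens from records in the SciIE format. It is not the repository's dataset_statistics.py; the function name `sentence_length_stats` and the toy records are invented for this example.

```python
def sentence_length_stats(records, threshold=25):
    """Return (n_sentences, avg_length, pct_at_or_above_threshold)."""
    lengths = [len(sentence)
               for record in records
               for sentence in record["sentences"]]
    n = len(lengths)
    avg = sum(lengths) / n
    pct = 100.0 * sum(length >= threshold for length in lengths) / n
    return n, avg, pct


# Toy records: one 30-token sentence and one 10-token sentence.
toy_records = [
    {"sentences": [["tok"] * 30], "doc_key": "a"},
    {"sentences": [["tok"] * 10], "doc_key": "b"},
]

n, avg, pct = sentence_length_stats(toy_records)
print(f"{n} sentences, avg. length {avg:.2f}, {pct:.2f} % with >= 25 tokens")
```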

If you use the FOBIE dataset or SORE code in your research, please consider citing the following papers:

@inproceedings{Kruiper2020_SORE,
  author =      "Kruiper, Ruben
                and Vincent, Julian F V
                and Chen-Burger, Jessica
                and Desmulliez, Marc P Y
                and Konstas, Ioannis",
  title =       "In Layman's Terms: Semi-Open Relation Extraction from Scientific Texts",
  year =        "2020",
  url =         "https://arxiv.org/pdf/2005.07751.pdf",
  arxivId =     "2005.07751"
}
@inproceedings{Kruiper2020_FOBIE,
  author =      "Kruiper, Ruben
                and Vincent, Julian F V
                and Chen-Burger, Jessica
                and Desmulliez, Marc P Y
                and Konstas, Ioannis",
  title =       "A Scientific Information Extraction Dataset for Nature Inspired Engineering",
  booktitle =   "Proceedings of the 12th Conference on Language Resources and Evaluation (LREC 2020)",
  year =        "2020",
  keywords =    "Biomimetics,Relation Extraction,Scientific Information Extraction,Trade-Offs",
  pages =       "2078--2085",
  url =         "http://www.lrec-conf.org/proceedings/lrec2020/pdf/2020.lrec-1.255.pdf",
  arxivId =     "2005.07753"
}

The FOBIE dataset and the SORE code in this repository are licensed under a Creative Commons Attribution 4.0 License.
