
NatUKE Wiki

Paulo do Carmo edited this page Nov 3, 2022 · 4 revisions

Welcome to the NatUKE Wiki! Here we present usage explanations and an extension of the paper with additional results and discussion. We also provide a preview of the data used in the experiments and access to that data.

Usability

An explanation of the NatUKE source code, covering how to understand, run, and evaluate the experiments.

Source code breakdown

Here we explain each source file in the repository and the order in which to execute them:

  1. clean_pdfs.ipynb: loads the PDFs listed in the database and prepares two dataframes for further use;
  2. phrases_flow.py: loads the texts dataframe and splits each text into phrases of up to 512 tokens;
  3. topic_generation.ipynb: loads the phrases dataframe and creates topic clusters using BERTopic [4];
  4. topic_distribution.ipynb: loads the BERTopic model and the phrases dataframe, distributes the topics, filters them according to an upper limit on their proportion, and outputs the resulting dataframe;
  5. hin_generation.ipynb: loads the filtered topics dataset and paper information to generate the usable knowledge graph;
  6. knn_dynamic_benchmark.py: runs the experiments on the generated knowledge graph, using the parameters set in the main portion of the code;
  7. dynamic_benchmark_evaluation.py: generates hits@k and MRR metrics for the experiments, allowing different parameters to be set for both the algorithms and the metrics;
  8. execution_time_processer.py: processes the .txt files dynamically generated by the knn_dynamic_benchmark.py experiments into a dataframe of execution times;
  9. metric_graphs.py: generates customizable graphs from the metric results and execution times;
  • natuke_utils.py: contains the source for the embedding methods, split algorithms, similar-entity prediction, and metrics;
  • exploration.ipynb: used to explore the data, e.g. the quantities of each property.
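
As a rough illustration of step 2, the chunking performed by phrases_flow.py can be sketched as below. This is a simplified sketch using plain whitespace tokens; the actual script may rely on a model tokenizer, and the function name here is illustrative, not from the repository:

```python
def split_into_phrases(text, max_tokens=512):
    """Greedily pack whitespace-separated tokens into chunks of
    at most max_tokens tokens each."""
    tokens = text.split()
    return [" ".join(tokens[i:i + max_tokens])
            for i in range(0, len(tokens), max_tokens)]
```

For a 1,100-token text this yields three chunks (512, 512, and 76 tokens), which is the shape the later topic-generation steps consume.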

Submodules

GraphEmbeddings

The GraphEmbeddings submodule is based on https://github.com/shenweichen/GraphEmbedding, but the algorithms used here work with TensorFlow 2.x.

To install this version of GraphEmbeddings run:

cd GraphEmbeddings
python setup.py install

Metapath2Vec

The metapath2vec submodule is based on: https://stellargraph.readthedocs.io/en/stable/demos/link-prediction/metapath2vec-link-prediction.html

Environment compatibility

For a better user experience, we recommend setting up two virtual environments:

  • requirements.txt for all code except topic_distribution.ipynb, topic_generation.ipynb, and hin_generation.ipynb;
  • requirements_topic.txt for topic_distribution.ipynb, topic_generation.ipynb, and hin_generation.ipynb (BERTopic requires a different numpy version for numba).

Extended paper

Introduction

Knowledge graphs (KGs) play a key role as a source of structured data for a variety of applications [8]. They can be specialized or multi-purpose, containing information from a multitude of domains. Nevertheless, the process of building KGs is usually very cumbersome and time-consuming, often relying on manual efforts that can lead to errors or incompleteness [11]. It is a manifold task that involves several complex natural-language reading-comprehension techniques, and it becomes significantly more challenging when the data set must be kept up to date with the latest information [7]. Devising automatic knowledge extraction methods is therefore a far-reaching goal to facilitate KG curation and maintenance.

Recently, machine learning (ML) methods have shown promising results in various natural language tasks, coping with data nuances that can go unnoticed by rule-based approaches designed by humans from limited data observations. In this work, we introduce a crowd-sourced evaluation benchmark containing a corpus of over two thousand exemplars for evaluating natural product knowledge extraction from academic literature. We refer to natural products as chemical compounds generated by living organisms; they contribute as much as 67% of all drugs approved worldwide [9]. Natural product research relies mainly on text for academic communication, so building approaches that facilitate data querying, exploration, and organization is pivotal to speed up research. We also evaluate different state-of-the-art unsupervised embedding generation methods and show that it is possible to extract some properties with relatively good accuracy. In our evaluation, EPHEN outperforms other approaches on natural product knowledge extraction. Although we focus on natural products, the methods evaluated in this work can be easily extended to other domains. Overall, our contributions are as follows:

  • A large crowd-sourced benchmark for natural product knowledge extraction from academic literature, containing over two thousand manually curated entries; and

  • An evaluation of different state-of-the-art unsupervised embedding generation methods on the task of end-to-end natural product knowledge extraction from academic literature.

Benchmark

The problem with knowledge extraction from unstructured data sources is that authors may use different words or methods to describe the same thing, which makes rule-based information extraction algorithms very challenging to apply. In this work, we propose a benchmark and evaluate different ML embeddings on the task of unsupervised knowledge extraction. We design the evaluation such that we measure the performance of each approach as randomly selected portions of a crowd-sourced training data set are inserted.

In order to simulate a scenario where new training data is constantly added to the model, we removed all nodes originating from the crowd-sourced data set from the KG, leaving the papers connected only to their topics. The first train/test split is a 20%/80% division; at each subsequent stage, the train split is increased by 20% until it reaches an 80%/20% division. We also enriched our KG with topics related to the papers using BERTopic [4]. In Figure 1 we present a visualization of the evaluation stages. In our benchmark, we evaluate the accuracy of each approach in predicting the resource for different chemical compound properties using hits@k. The hits@k metric is the average of how many predictions achieve a top-k ranking [2].
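
The staged splits described above can be sketched as follows. This is an illustrative reimplementation under our own assumptions (the repository's actual split algorithms live in natuke_utils.py); reusing a single shuffled order guarantees that each stage only adds edges to the training set, simulating new data arriving over time:

```python
import random

def evaluation_stages(edges, stages=(0.2, 0.4, 0.6, 0.8), seed=42):
    """Yield (train, test) splits with a growing train portion.

    The same shuffled order is reused at every stage, so each stage's
    training set is a superset of the previous one."""
    rng = random.Random(seed)
    shuffled = list(edges)
    rng.shuffle(shuffled)
    for frac in stages:
        cut = int(len(shuffled) * frac)
        yield shuffled[:cut], shuffled[cut:]
```

With 100 edges this produces train sets of 20, 40, 60, and 80 edges, matching the 20/80% through 80/20% divisions above.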

Figure 1 - evaluation stages

evaluation_stages

Data

The data set used for evaluation as well as training was generated from hundreds of peer-reviewed scientific articles, from which information on more than 2,000 natural products was extracted. The data set was built manually by chemistry specialists who read the articles, annotating four relevant properties associated with each natural product discussed in the paper: (I) metabolic class, (II) bioactivity, (III) the species from which the natural product was extracted, and (IV) the collection site of this species.

The dataset can be found in different formats:

For the benchmark, data was extracted from the linked data endpoint and then joined with the spreadsheet to fill missing values, using the following SPARQL query:

PREFIX rdf: <http://www.w3.org/1999/02/22-rdf-syntax-ns#>
PREFIX rdfs: <http://www.w3.org/2000/01/rdf-schema#>
PREFIX nubbe: <http://nubbe.db/>
PREFIX nubbeprop: <http://nubbe.db/property/>
PREFIX nubbeclass: <http://nubbe.db/class/>

SELECT DISTINCT 
  ?doi ?bioActivity ?collectionSpecie ?collectionSite ?collectionType ?molType 
  ?molecularMass ?monoisotropicMass ?cLogP ?tpsa ?numberOfLipinskiViolations 
  ?numberOfH_bondAcceptors ?numberOfH_bondDonors ?numberOfRotableBonds 
  ?molecularVolume ?smile
WHERE {
  ?data     nubbeprop:doi                           ?doi                          .
  OPTIONAL {
    ?data   nubbeprop:biologicalActivity            ?bioActivity                  ;
            nubbeprop:collectionSpecie              ?collectionSpecie             ;
            nubbeprop:collectionSite                ?collectionSite               ;
            nubbeprop:collectionType                ?collectionType               ;
            rdf:type                                ?molType                      ;
            nubbeprop:molecularMass                 ?molecularMass                ;
            nubbeprop:monoisotropicMass             ?monoisotropicMass            ;
            nubbeprop:cLogP                         ?cLogP                        ;
            nubbeprop:tpsa                          ?tpsa                         ;
            nubbeprop:numberOfLipinskiViolations    ?numberOfLipinskiViolations   ;
            nubbeprop:numberOfH-bondAcceptors       ?numberOfH_bondAcceptors      ;
            nubbeprop:numberOfH-bondDonors          ?numberOfH_bondDonors         ;
            nubbeprop:numberOfRotableBonds          ?numberOfRotableBonds         ;
            nubbeprop:molecularVolume               ?molecularVolume              ;
            nubbeprop:smile                         ?smile                        .
  }
}
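
Once the endpoint returns results, they can be flattened into tabular rows before the join with the spreadsheet. A minimal sketch, assuming the standard SPARQL 1.1 JSON results format; the helper name is ours and the sample data used below is purely illustrative:

```python
def bindings_to_rows(results):
    """Flatten SPARQL JSON results into a list of plain dicts.

    Variables absent from a binding (e.g. when the OPTIONAL block
    did not match) are filled with None."""
    variables = results["head"]["vars"]
    return [
        {v: binding[v]["value"] if v in binding else None
         for v in variables}
        for binding in results["results"]["bindings"]
    ]
```

The resulting list of dicts can be loaded directly into a pandas DataFrame, with the None cells marking the values to recover from the spreadsheet.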

Models

We compare four different unsupervised graph embedding methods for our knowledge extraction task: (1) DeepWalk [10] is an unsupervised graph embedding method that uses random walks to sample a training data set for a skip-gram architecture; (2) Node2Vec [5] extends the DeepWalk method to allow more control over the random walks; (3) Metapath2Vec [3] is another extension of DeepWalk that turns the random walks into meta-path-based walks; and (4) Embedding Propagation on Heterogeneous Networks (EPHEN) [1] is an embedding propagation method that uses a regularization function to distribute an initial BERT embedding over a KG, meaning that it considers both text and structured data in an unsupervised scenario. We also tried to evaluate GraphSAGE [6], but our dataset does not yield a fully connected KG, which GraphSAGE's architecture cannot handle.
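
The uniform random walks that DeepWalk samples (and that Node2Vec biases and Metapath2Vec constrains to meta-paths) can be sketched over a plain adjacency dict. This is a minimal sketch under our own naming, not the implementation used in the experiments:

```python
import random

def random_walk(adj, start, length, rng=random):
    """One uniform random walk of up to `length` nodes over an
    adjacency dict {node: [neighbours]}; stops early at dead ends."""
    walk = [start]
    for _ in range(length - 1):
        neighbours = adj.get(walk[-1])
        if not neighbours:
            break
        walk.append(rng.choice(neighbours))
    return walk
```

Many such walks, started from every node, form the "sentences" fed to the skip-gram model; Node2Vec changes only the transition probabilities, and Metapath2Vec restricts which node types each step may visit.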

Results

We use the NubbeDB ontology (https://github.com/AKSW/dinobbio/tree/main/ontology) for property prediction. We extract five different properties: (1) name, (2) bioactivity, (3) specie, (4) collection site, and (5) isolation type. We use different values of k in proportion to the difficulty of predicting each property value. For instance, it is significantly more challenging to predict the right natural product name than the isolation type, because there are considerably fewer exemplars in the training data set for each natural product than for each isolation type. For that reason, we evaluated with different values of k from 1 to 50, considering multiples of 5.
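
The hits@k metric used throughout can be sketched as follows; the names are illustrative, and the repository's own implementation lives in natuke_utils.py:

```python
def hits_at_k(ranked_lists, true_items, k):
    """Fraction of queries whose true item appears in the top-k
    positions of the predicted ranking."""
    hits = sum(
        1 for ranking, truth in zip(ranked_lists, true_items)
        if truth in ranking[:k]
    )
    return hits / len(true_items)
```

Sweeping k over 1 and the multiples of 5 up to 50 then gives one curve per property, from which the reported k is chosen.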

Graphs

The graphs show the results of experiments extracting five different natural product properties from biochemical academic papers. Each graph presents a different property extraction and value of k for the hits@k metric: (1) name, k = 50; (2) bioactivity, k = 5; (3) specie, k = 50; (4) collection site, k = 20; and (5) isolation type, k = 1. The final k value for each extraction is defined either when a score higher than 0.50 is achieved at any evaluation stage or at the upper limit of k = 50.

Graph 1 - compound name extraction

compound_name

In graph 1, we can see that only Metapath2Vec achieved better performance on the compound name extraction at the 4th evaluation stage than at the previous stages. We can also see that the execution time of every algorithm decreases as the evaluation stages progress. This happens because the time it takes to calculate the similarity is also measured. Additionally, at the 4th evaluation stage, EPHEN's execution time becomes lower than Metapath2Vec's, even though Metapath2Vec is implemented as a parallel algorithm while EPHEN is sequential. This happens because EPHEN can generate embeddings for new nodes and links by iterating further, while all the other methods must reconstruct the embeddings from scratch.

Graph 2 - bioactivity extraction

bioactivity

In graph 2, EPHEN's performance increases throughout the evaluation stages in the bioactivity extraction scenario and is also above our threshold of 0.50 with the hits@5 metric. We can also see a clearer separation between the execution times of DeepWalk and Node2Vec, as well as the same behavior between Metapath2Vec's and EPHEN's execution times.

Graph 3 - collection specie extraction

specie

In graph 3, the collection specie extraction scenario, the threshold of 0.50 was achieved with the hits@50 metric. It is also another scenario where Metapath2Vec achieves both the best performance at every evaluation stage and the best results at the 4th evaluation stage. This shows that Metapath2Vec cannot place the correct link in the first position as well as EPHEN can, but it is better at holding a middle ground for larger values of k in the hits@k metric.

Graph 4 - collection site extraction

collection_site

Graph 4, the collection site extraction, is another scenario where EPHEN's performance increases throughout the evaluation stages. In this more challenging scenario, Node2Vec and DeepWalk achieve the best and second-best performance, respectively, at the first evaluation stage, but their performance drops steadily throughout the evaluation stages, while Metapath2Vec's performance is maintained and EPHEN's increases.

Graph 5 - collection type extraction

extraction_type

In graph 5, we see once again that EPHEN is better at placing the correct link at the top of the list, while the apparent performance of Metapath2Vec decreases for lower values of k. The collection type extraction is the only scenario where EPHEN achieved 0.75 hits@1.

Tables

Table 1 shows the results of experiments extracting five different natural product properties from biochemical academic papers. They are presented with different values of k for the hits@k metric: (1) name, k = 50; (2) bioactivity, k = 5; (3) specie, k = 50; (4) collection site, k = 20; and (5) isolation type, k = 1. The final k value for each extraction is defined either when a score higher than 0.50 is achieved at any evaluation stage or at the upper limit of k = 50.

Table 2 shows the results of experiments extracting five different natural product properties from biochemical academic papers. They are presented with different values of k for the hits@k metric: (1) name, k = 50; (2) bioactivity, k = 1; (3) specie, k = 20; (4) collection site, k = 5; and (5) isolation type, k = 1. The final k value for each extraction is defined either when a score higher than 0.20 is achieved at any evaluation stage or at the upper limit of k = 50.

Table 1

Results table for extracting: chemical compound (C), bioactivity (B), specie (S), collection site (L), and isolation type (T). Each cell reports the average and standard deviation of hits@k, with k respectively 50, 5, 50, 20, and 1.

| Property | Evaluation Stage | DeepWalk | Node2Vec | Metapath2Vec | EPHEN |
|----------|------------------|----------|----------|--------------|-------|
| C | 1st | 0.08 ± 0.01 | 0.08 ± 0.01 | 0.10 ± 0.01 | 0.09 ± 0.01 |
| C | 2nd | 0.01 ± 0.01 | 0.00 ± 0.01 | 0.08 ± 0.02 | 0.02 ± 0.01 |
| C | 3rd | 0.01 ± 0.01 | 0.01 ± 0.01 | 0.09 ± 0.03 | 0.03 ± 0.02 |
| C | 4th | 0.00 ± 0.00 | 0.00 ± 0.00 | 0.20 ± 0.05 | 0.04 ± 0.05 |
| B | 1st | 0.41 ± 0.08 | 0.41 ± 0.07 | 0.27 ± 0.03 | 0.55 ± 0.06 |
| B | 2nd | 0.12 ± 0.02 | 0.07 ± 0.03 | 0.17 ± 0.06 | 0.57 ± 0.07 |
| B | 3rd | 0.10 ± 0.03 | 0.03 ± 0.03 | 0.13 ± 0.04 | 0.60 ± 0.08 |
| B | 4th | 0.07 ± 0.04 | 0.03 ± 0.03 | 0.12 ± 0.06 | 0.64 ± 0.07 |
| S | 1st | 0.37 ± 0.04 | 0.36 ± 0.04 | 0.40 ± 0.03 | 0.36 ± 0.04 |
| S | 2nd | 0.24 ± 0.03 | 0.22 ± 0.03 | 0.41 ± 0.06 | 0.24 ± 0.03 |
| S | 3rd | 0.27 ± 0.07 | 0.25 ± 0.06 | 0.42 ± 0.04 | 0.29 ± 0.07 |
| S | 4th | 0.25 ± 0.10 | 0.24 ± 0.07 | 0.44 ± 0.12 | 0.30 ± 0.06 |
| L | 1st | 0.56 ± 0.06 | 0.57 ± 0.05 | 0.40 ± 0.05 | 0.53 ± 0.03 |
| L | 2nd | 0.41 ± 0.05 | 0.36 ± 0.08 | 0.42 ± 0.04 | 0.52 ± 0.06 |
| L | 3rd | 0.38 ± 0.06 | 0.28 ± 0.04 | 0.42 ± 0.08 | 0.55 ± 0.04 |
| L | 4th | 0.29 ± 0.05 | 0.23 ± 0.10 | 0.40 ± 0.12 | 0.55 ± 0.06 |
| T | 1st | 0.25 ± 0.09 | 0.10 ± 0.05 | 0.28 ± 0.04 | 0.71 ± 0.04 |
| T | 2nd | 0.14 ± 0.08 | 0.07 ± 0.06 | 0.22 ± 0.08 | 0.66 ± 0.10 |
| T | 3rd | 0.14 ± 0.09 | 0.05 ± 0.04 | 0.19 ± 0.04 | 0.75 ± 0.10 |
| T | 4th | 0.09 ± 0.05 | 0.01 ± 0.02 | 0.19 ± 0.06 | 0.75 ± 0.11 |

Table 2

Results table for extracting: chemical compound (C), bioactivity (B), specie (S), collection site (L), and isolation type (T). Each cell reports the average and standard deviation of hits@k, with k respectively 50, 1, 20, 5, and 1.

| Property | Evaluation Stage | DeepWalk | Node2Vec | Metapath2Vec | EPHEN |
|----------|------------------|----------|----------|--------------|-------|
| C | 1st | 0.08 ± 0.01 | 0.08 ± 0.01 | 0.10 ± 0.01 | 0.09 ± 0.01 |
| C | 2nd | 0.01 ± 0.01 | 0.00 ± 0.01 | 0.08 ± 0.02 | 0.02 ± 0.01 |
| C | 3rd | 0.01 ± 0.01 | 0.01 ± 0.01 | 0.09 ± 0.03 | 0.03 ± 0.02 |
| C | 4th | 0.00 ± 0.00 | 0.00 ± 0.00 | 0.20 ± 0.05 | 0.04 ± 0.05 |
| B | 1st | 0.10 ± 0.03 | 0.09 ± 0.04 | 0.06 ± 0.04 | 0.17 ± 0.05 |
| B | 2nd | 0.01 ± 0.01 | 0.02 ± 0.01 | 0.04 ± 0.03 | 0.19 ± 0.05 |
| B | 3rd | 0.01 ± 0.01 | 0.01 ± 0.01 | 0.03 ± 0.02 | 0.24 ± 0.06 |
| B | 4th | 0.01 ± 0.02 | 0.01 ± 0.01 | 0.10 ± 0.04 | 0.25 ± 0.06 |
| S | 1st | 0.10 ± 0.03 | 0.10 ± 0.02 | 0.15 ± 0.02 | 0.10 ± 0.02 |
| S | 2nd | 0.12 ± 0.04 | 0.13 ± 0.03 | 0.11 ± 0.03 | 0.15 ± 0.03 |
| S | 3rd | 0.12 ± 0.04 | 0.11 ± 0.05 | 0.15 ± 0.04 | 0.19 ± 0.05 |
| S | 4th | 0.11 ± 0.06 | 0.11 ± 0.06 | 0.19 ± 0.07 | 0.22 ± 0.07 |
| L | 1st | 0.15 ± 0.04 | 0.13 ± 0.04 | 0.12 ± 0.02 | 0.26 ± 0.04 |
| L | 2nd | 0.09 ± 0.03 | 0.08 ± 0.04 | 0.13 ± 0.04 | 0.29 ± 0.05 |
| L | 3rd | 0.06 ± 0.03 | 0.06 ± 0.03 | 0.11 ± 0.04 | 0.30 ± 0.07 |
| L | 4th | 0.06 ± 0.04 | 0.05 ± 0.03 | 0.13 ± 0.08 | 0.27 ± 0.07 |
| T | 1st | 0.25 ± 0.09 | 0.10 ± 0.05 | 0.28 ± 0.04 | 0.71 ± 0.04 |
| T | 2nd | 0.14 ± 0.08 | 0.07 ± 0.06 | 0.22 ± 0.08 | 0.66 ± 0.10 |
| T | 3rd | 0.14 ± 0.09 | 0.05 ± 0.04 | 0.19 ± 0.04 | 0.75 ± 0.10 |
| T | 4th | 0.09 ± 0.05 | 0.01 ± 0.02 | 0.19 ± 0.06 | 0.75 ± 0.11 |

Discussion

We can observe that overall, EPHEN achieves the best performance and, most importantly, can increase performance through the evaluation stages. For example, in the bioactivity extraction, EPHEN achieves 0.55 hits@5 in the first evaluation stage and progressively better results until 0.64 in the fourth evaluation stage. In contrast, DeepWalk has the second-best results on the first evaluation with 0.41 and drops performance until reaching the second-worst results with 0.07 on the fourth evaluation stage.

We can also observe that the fewer completion options an attribute has, the better the relative extraction performance. For example, compound name and collection type have the largest and smallest counts of nodes to link to, at 446 and 6, respectively. EPHEN achieves the second-best performance on the compound name extraction, but its performance does not surpass the 0.10 hits@50 mark. Meanwhile, it achieves performance results ranging from 0.71 to 0.75 hits@1, from the first to the fourth evaluation stage, on the isolation type extraction.

Conclusion

Our evaluation shows that it is possible to use unsupervised embedding approaches to extract chemical compound properties from academic literature, in particular those with fewer candidates (i.e. bioactivity and isolation type). In most cases, the use of context-aware data does lead to improvements in extraction quality. EPHEN achieved the best results, while Metapath2Vec showed good performance in the more challenging scenarios (i.e. chemical compound and collection site prediction). Finally, DeepWalk's and Node2Vec's random walks perform better with smaller training corpora. In future work, we plan to fine-tune EPHEN using resource similarity data and develop an automatic natural product knowledge extraction framework with a human in the loop.

References

  1. do Carmo, P., Marcacini, R.: Embedding propagation over heterogeneous event networks for link prediction. In: 2021 IEEE International Conference on Big Data (Big Data). pp. 4812–4821 (2021). https://doi.org/10.1109/BigData52589.2021.9671645
  2. Docs, A.: Hits at n score (Mar 2019), https://docs.ampligraph.org/en/1.4.0/generated/ampligraph.evaluation.hits_at_n_score.html
  3. Dong, Y., Chawla, N.V., Swami, A.: metapath2vec: Scalable representation learning for heterogeneous networks. In: Proceedings of the 23rd ACM SIGKDD international conference on knowledge discovery and data mining. pp. 135–144 (2017)
  4. Grootendorst, M.: Bertopic: Leveraging bert and c-tf-idf to create easily interpretable topics. (2020). https://doi.org/10.5281/zenodo.4381785, https://doi.org/10.5281/zenodo.4381785
  5. Grover, A., Leskovec, J.: node2vec: Scalable feature learning for networks. In: Proceedings of the 22nd ACM SIGKDD international conference on Knowledge discovery and data mining. pp. 855–864 (2016)
  6. Hamilton, W., Ying, Z., Leskovec, J.: Inductive representation learning on large graphs. Advances in neural information processing systems 30 (2017)
  7. Hellmann, S., Stadler, C., Lehmann, J., Auer, S.: Dbpedia live extraction. In: OTM Confederated International Conferences” On the Move to Meaningful Internet Systems”. pp. 1209–1223. Springer (2009)
  8. Hogan, A., Blomqvist, E., Cochez, M., d’Amato, C., Melo, G.d., Gutierrez, C., Kirrane, S., Gayo, J.E.L., Navigli, R., Neumaier, S., et al.: Knowledge graphs. Synthesis Lectures on Data, Semantics, and Knowledge 12(2), 1–257 (2021)
  9. Newman, D.J., Cragg, G.M.: Natural products as sources of new drugs from 1981 to 2014. Journal of natural products 79(3), 629–661 (2016)
  10. Perozzi, B., Al-Rfou, R., Skiena, S.: Deepwalk: Online learning of social representations. In: Proceedings of the 20th ACM SIGKDD international conference on Knowledge discovery and data mining. pp. 701–710 (2014)
  11. Zaveri, A., Kontokostas, D., Sherif, M.A., Bühmann, L., Morsey, M., Auer, S., Lehmann, J.: User-driven quality evaluation of dbpedia. In: Proceedings of the 9th International Conference on Semantic Systems. p. 97–104. I-SEMANTICS ’13, Association for Computing Machinery, New York, NY, USA (2013). https://doi.org/10.1145/2506182.2506195