-
Notifications
You must be signed in to change notification settings - Fork 1
/
README.rmd
142 lines (109 loc) · 6.05 KB
/
README.rmd
1
2
3
4
5
6
7
8
9
10
11
12
13
14
15
16
17
18
19
20
21
22
23
24
25
26
27
28
29
30
31
32
33
34
35
36
37
38
39
40
41
42
43
44
45
46
47
48
49
50
51
52
53
54
55
56
57
58
59
60
61
62
63
64
65
66
67
68
69
70
71
72
73
74
75
76
77
78
79
80
81
82
83
84
85
86
87
88
89
90
91
92
93
94
95
96
97
98
99
100
101
102
103
104
105
106
107
108
109
110
111
112
113
114
115
116
117
118
119
120
121
122
123
124
125
126
127
128
129
130
131
132
133
134
135
136
137
138
139
140
141
142
---
title: "R package: queryup"
author: "Guillaume Voisinne"
output:
github_document :
html_preview: true
date: "`r format(Sys.time(), '%Y - %m - %d')`"
---
```{r include=FALSE}
library(knitr)
knitr::opts_chunk$set(echo = TRUE)
devtools::load_all()
```
[![CRAN_Status_Badge](https://www.r-pkg.org/badges/version/queryup)](https://cran.r-project.org/package=queryup)
[![R-CMD-check](https://github.com/VoisinneG/queryup/workflows/R-CMD-check/badge.svg)](https://github.com/VoisinneG/queryup/actions)
[![Codecov test coverage](https://codecov.io/gh/VoisinneG/queryup/branch/master/graph/badge.svg)](https://app.codecov.io/gh/VoisinneG/queryup?branch=master)
[![CRAN mirror downloads](https://cranlogs.r-pkg.org/badges/queryup)](https://cran.r-project.org/package=queryup/)
The `queryup` R package aims to facilitate retrieving information from the UniProt database using R.
Programmatic access to the UniProt database is performed by submitting queries to the [UniProt website REST API](https://www.uniprot.org/help/api_queries).
## Install
You can install the package from CRAN using:
```{r eval=FALSE, message=FALSE, warnings=FALSE}
install.packages("queryup")
```
Alternatively, you may also install the package from github using devtools:
```{r eval=FALSE, message=FALSE, warnings=FALSE}
devtools::install_github("VoisinneG/queryup")
```
## Queries
Queries combine different fields to identify matching database entries. Here, queries are submitted using the function `query_uniprot()`. In the `queryup` R package, a query must be formatted as a list containing character vectors named after existing UniProt fields (available query fields can be found in the [API documentation](https://www.uniprot.org/help/query-fields) or in the package data `query_fields$field`). Different query fields must be matched simultaneously. For instance, the following query uses the fields *gene_exact* to return the UniProt entries of all proteins encoded by gene *Pik3r1* :
```{r eval=FALSE, message=FALSE, warnings=FALSE}
library(queryup)
```
```{r message=FALSE, warnings=FALSE}
query <- list("gene_exact" = "Pik3r1")
df <- query_uniprot(query, show_progress = FALSE)
head(df)
```
Available query fields can be listed using the package data `query_fields`:
```{r}
query_fields$field
```
## Columns
By default, `query_uniprot()` returns a data.frame with UniProt accession IDs, gene names, organism and Swiss-Prot review status. You can choose which data columns to retrieve using the `columns` parameter.
```{r message=FALSE, warnings=FALSE}
df <- query_uniprot(query,
columns = c("id", "sequence", "keyword", "gene_primary"),
show_progress = FALSE)
```
See the [API documentation](https://www.uniprot.org/help/return_fields) or the package data `return_fields` for all available columns.
Available returned fields can be listed using the package data `return_fields`:
```{r}
head(return_fields)
```
Note that the parameter `columns` and the name of the corresponding column in the output data frame do not necessarily match (they correspond to columns "field" and "label" respectively in the package data `return_fields`).
```{r}
names(df)
```
Let's check the sequence and the UniProt keywords corresponding to the first entry :
```{r}
as.character(df$Sequence[1])
as.character(df$Keywords[1])
```
## Combining query fields
Our first query returned many matches. We can build more specific queries by using more than one query field. By default, matching entries must satisfy all query fields simultaneously. Let's retrieve the only Swiss-Prot reviewed protein entry encoded by gene *Pik3r1* in *Homo sapiens* (taxon: 9606):
```{r message=FALSE, warnings=FALSE}
query <- list("gene_exact" = "Pik3r1",
"reviewed" = "true",
"organism_id" = "9606")
df <- query_uniprot(query, show_progress = FALSE)
print(df)
```
## Multiple items per query field
It is also possible to look for entries that match different items within a single query field. Items from a given query field are looked for independently. Hence, the following query will return all Swiss-Prot reviewed proteins encoded by either *Pik3r1* or *Pik3r2* in either *Mus musculus* (taxon: 10090) or *Homo sapiens* (taxon: 9606):
```{r message=FALSE, warnings=FALSE}
query <- list("gene_exact" = c("Pik3r1", "Pik3r2"),
"reviewed" = "true",
"organism_id" = c("9606", "10090"))
df <- query_uniprot(query, show_progress = FALSE)
print(df)
```
## Queries with invalid entries
If a query containing invalid entries is sent to the UniProt REST API, an error message is returned and no information about the other potentially valid entries can be retrieved. To overcome this limitation, `queryup` parses the error messages and remove invalid entries from the query. Hence, `query_uniprot()` will return information for valid entries only :
```{r}
invalid_ids <- c("P226", "CON_P22682", "REV_P47941")
valid_ids <- c("A0A0U1ZFN5", "P22682")
ids <- c(invalid_ids, valid_ids)
query <- list("accession_id" = ids)
query_uniprot(query)
```
## Long queries
Because UniProt REST API limits the size of queries, long queries containing more than a few hundreds entries cannot be passed in a single request. To overcome this limitation, the `queryup` package splits long queries into smaller ones. For instance, the dataset `uniprot_entries` that is bundled with the `queryup` package contains information for 1000 UniProt entries. We could retrieve the ENSEMBL ids corresponding to these entries using :
```{r message=FALSE, warning=FALSE}
ids <- uniprot_entries$Entry
query <- list("accession_id" = ids)
columns <- c("gene_names", "xref_ensembl")
df <- query_uniprot(query, columns = columns, show_progress = FALSE)
head(df)
```
## Protein-protein interactions
Another usage could be to retrieve protein-protein interactions among a set of UniProt entries:
```{r message=FALSE, warning=FALSE}
ids <- sample(uniprot_entries$Entry, 400)
query <- list("accession_id" = ids,
"interactor" = ids)
columns <- "cc_interaction"
df <- query_uniprot(query = query, columns = columns, show_progress = FALSE)
head(df)
```