This repository is intended for building, updating, and querying DBNascent, a MySQL database cataloguing all nascent sequencing experiments in the SRA through 2020. The database is built and maintained by the DnA Lab at the University of Colorado Boulder.
The database draws on manually curated metadata tables, quality control data, and bidirectional call data from samples. All of this data is stored on the Fiji cluster at CU Boulder.
Version notes (12/20/2023):
- The database has been somewhat restructured.
- All table names are different but describe the same fields. The table equivalents are as follows (`linkIDs` and `searchEquiv` are the same):

  | Old table | New table |
  |---|---|
  | sampleAccum | samples |
  | exptMetadata | papers |
  | sampleID | sampleEquiv |
  | geneticInfo | genetics |
  | organismInfo | organisms |
  | tissueDetails | tissues |
  | bidirSummary | bidirs |
  | conditionInfo | conditions |
  | sampleCondition | conditionLink |
  | nascentflowMetadata | nascentflowRuns |
  | sampleNascentflow | nascentflowLink |
  | bidirflowMetadata | bidirflowRuns |
  | sampleBidirflow | bidirflowLink |

- A few fields have changed names. The primary key of every table is now simply `id` rather than a name that identifies the table, and tables that link to that key use a field named `<linkedTable>_id` (see the fields and linkages in the schema). This helps Django navigate the database. Other new field names are as follows:

  | Old field | New field |
  |---|---|
  | paper_id | paper_name |
  | samp_qc_score | sample_qc_score |
  | samp_data_score | sample_nro_score |
  | paper_data_score | paper_nro_score |

- All table linkages that used non-integer identifiers have been removed, so `paper_name` and `sample_name` are no longer in `linkIDs`, and `organisms` is linked to the `papers` and `genetics` tables by a numeric id instead of the organism name. The same applies to the `sampleEquiv` linkage to the `samples` table.
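
Scripts or saved queries written against the old schema need their identifiers updated. The following is a minimal sketch of how the renames listed in the version notes could be applied to a query string. The helper `translate_identifiers` is hypothetical and not part of this repository, and it only handles the listed renames: primary keys that became `id` and the removed name-based linkages still need to be reviewed by hand.

```python
import re

# Old -> new table names, copied from the version notes.
TABLE_RENAMES = {
    "sampleAccum": "samples",
    "exptMetadata": "papers",
    "sampleID": "sampleEquiv",
    "geneticInfo": "genetics",
    "organismInfo": "organisms",
    "tissueDetails": "tissues",
    "bidirSummary": "bidirs",
    "conditionInfo": "conditions",
    "sampleCondition": "conditionLink",
    "nascentflowMetadata": "nascentflowRuns",
    "sampleNascentflow": "nascentflowLink",
    "bidirflowMetadata": "bidirflowRuns",
    "sampleBidirflow": "bidirflowLink",
}

# Old -> new field names, copied from the version notes.
FIELD_RENAMES = {
    "paper_id": "paper_name",
    "samp_qc_score": "sample_qc_score",
    "samp_data_score": "sample_nro_score",
    "paper_data_score": "paper_nro_score",
}


def translate_identifiers(text):
    """Replace old table and field names in `text` with their new names.

    Hypothetical helper: matches whole identifiers only, so names that
    merely contain an old name (e.g. `sampleIDs`) are left untouched.
    """
    renames = {**TABLE_RENAMES, **FIELD_RENAMES}
    pattern = re.compile(r"\b(" + "|".join(map(re.escape, renames)) + r")\b")
    return pattern.sub(lambda match: renames[match.group(1)], text)


old_query = "SELECT samp_qc_score FROM sampleAccum"
print(translate_identifiers(old_query))
```

The output is a starting point for a manual check rather than a guaranteed-valid query against the new schema.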
The database was built with Python 3.6.3. The following packages are required for building or querying:
- configparser v5.2.0 or higher
- numpy v1.19.2 or higher
- yaml v5.4.1 or higher
- pymysql v1.0.2 or higher (may substitute a different MySQL translator)
- sqlalchemy v1.4.31 or higher
Schema diagram (generated with https://github.com/sqlalchemy/sqlalchemy/wiki/SchemaDisplay)
All database objects and functions are defined in dborm.py and dbutils.py.
To integrate seamlessly with the Django website that queries this database, the tables should initially be created through a Django migration within the website repository on GitLab. The schemas specified for Django are the same as those specified here, apart from a few additional tables generated by Django, so the database can be created with this repository alone if necessary.
`config_build.py` defines file paths and fields, both outside of and within the database. Adding a field to a metadata table requires adding it to `config_build.py` as well.
`organisms.txt`, `sample_cell_types.txt`, and `searcheq.txt` are manually curated tables defining organisms, tissues, and unique values within the database. Adding data may require adding additional lines to these files.
The main scripts for building the database are `db_global_add_update.py` and `db_paper_add_update.py`, combined in the `db_build_full.sbatch` script.
The database can be queried with defined fields and filtering specifications using `query_printout.py`, which produces output for DESeq2 or other applications. This script relies on the `config_query.txt` config file, as well as `dborm.py` and `dbutils.py`. A query too complex for this script may require a manual MySQL query, which can be passed to the database and printed out with the `manual_query_printout.py` script.
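
To illustrate what a manual query and its printout might look like, here is a minimal sketch that joins samples to papers through a numeric id and writes the result as tab-delimited text. It is not the code of `manual_query_printout.py`. Several things are assumptions made only so the sketch runs anywhere: an in-memory `sqlite3` database stands in for the MySQL server, the column layout of `samples` and `papers` is guessed from the field names mentioned in the version notes, the linking field `paper_id` is a guess at the `<linkedTable>_id` convention, the rows are made-up placeholders, and `print_query` is a hypothetical helper.

```python
import csv
import sqlite3
import sys


def print_query(conn, sql, params=(), out=None):
    """Run `sql` and write a header row plus results as tab-delimited text."""
    out = out if out is not None else sys.stdout
    cursor = conn.execute(sql, params)
    writer = csv.writer(out, delimiter="\t", lineterminator="\n")
    # cursor.description holds one tuple per result column; item 0 is its name.
    writer.writerow([column[0] for column in cursor.description])
    writer.writerows(cursor)


# Stand-in database with an ASSUMED, simplified layout and placeholder rows.
conn = sqlite3.connect(":memory:")
conn.executescript(
    """
    CREATE TABLE papers (id INTEGER PRIMARY KEY, paper_name TEXT);
    CREATE TABLE samples (
        id INTEGER PRIMARY KEY,
        sample_name TEXT,
        paper_id INTEGER REFERENCES papers(id),  -- assumed link field name
        sample_qc_score INTEGER
    );
    INSERT INTO papers VALUES (1, 'ExamplePaper2020');
    INSERT INTO samples VALUES (1, 'SRR0000001', 1, 2);
    INSERT INTO samples VALUES (2, 'SRR0000002', 1, 4);
    """
)

# sqlite3 uses `?` placeholders; pymysql uses `%s`-style placeholders instead.
QUERY = """
    SELECT s.sample_name AS sample_name,
           p.paper_name AS paper_name,
           s.sample_qc_score AS sample_qc_score
    FROM samples AS s
    JOIN papers AS p ON s.paper_id = p.id
    WHERE s.sample_qc_score <= ?
    ORDER BY s.sample_name
"""

print_query(conn, QUERY, (3,))
```

Check the actual table and field names against the schema before adapting a query like this to the real database.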
Both config files refer to a credentials file that contains your credentials for accessing the database. This file should be a one-line, two-column, tab-delimited file: `<username><tab><password>`
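
As a rough sketch of how such a credentials file could be read and turned into a SQLAlchemy-style connection URL for pymysql: the helpers `read_credentials` and `build_db_url`, the host `localhost`, and the database name `dbnascent` are all assumptions for illustration and may not match what `dbutils.py` or the config files actually use.

```python
import tempfile
from pathlib import Path
from urllib.parse import quote_plus


def read_credentials(path):
    """Return (username, password) from a one-line, tab-delimited file."""
    lines = Path(path).read_text().splitlines()
    if not lines:
        raise ValueError(f"{path} is empty")
    fields = lines[0].split("\t")
    if len(fields) != 2:
        raise ValueError(f"expected <username><tab><password> in {path}")
    return fields[0], fields[1]


def build_db_url(username, password, host="localhost", database="dbnascent"):
    """Build a mysql+pymysql URL; host and database defaults are placeholders."""
    # Percent-encode both parts so characters such as '/' or '@' stay valid.
    return (
        f"mysql+pymysql://{quote_plus(username)}:{quote_plus(password)}"
        f"@{host}/{database}"
    )


# Demonstration with a throwaway file and made-up credentials.
with tempfile.TemporaryDirectory() as tmp_dir:
    cred_path = Path(tmp_dir) / "credentials.txt"
    cred_path.write_text("example_user\texample_pass/word\n")
    user, password = read_credentials(cred_path)

url = build_db_url(user, password)
print(url)
```

Keeping the credentials in a separate file means they stay out of the config files and out of version control.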