Skip to content

Reference sequences and annotation files for commonly analyzed organisms

Notifications You must be signed in to change notification settings

igordot/reference-genomes

Folders and files

NameName
Last commit message
Last commit date

Latest commit

 

History

8 Commits
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 

Repository files navigation

Reference Sequence and Annotation Files

Supplementary genome reference files in addition to the basic FASTA and GTF as well as notes and commentary.

Disclaimer: Reference genomes are complicated and there are many caveats, even for commonly used ones (see: which human reference genome to use). Use at your own risk.


Initializing Reference Genome Directory

Genomics bioinformatic tools generally require a reference file/index, but the exact requirements can vary. This repository includes notes for setting up a reference genome directory that can be used for common analysis types (RNA-seq, ChIP-seq, whole genome/exome sequencing) as well as some pre-generated supplementary files. The organization scheme is inspired by Illumina iGenomes, a collection of sequence and annotation files for commonly analyzed genomes. The files are placed in separate directories based on the genome reference version, such as hg38 or mm10. Within each genome directory, the files are named based on the type.

The two primary files that are required:

  • genome.fa - genome sequence in FASTA format
  • genes.gtf - gene annotations in GTF format

These can usually be downloaded from genome browsers like Ensembl, UCSC, or NCBI. The best source will vary depending on the species of interest. There are also species-specific resources like FlyBase or WormBase. The differences can be minor (different contig names), but they could be more substantial (thousands of additional genes).

Sanity Testing

There are a few issues to be aware of to avoid potential downstream problems. Both genome.fa and genes.gtf have to be valid FASTA and GTF files, but they should also be compatible with each other.

If a FASTA index and a sequence dictionary can be successfully generated (see below), the FASTA is likely compatible with all tools. One exception is that contig names (such as "chr1") with spaces will cause issues for some applications. Check contig names with:

grep "^>" genome.fa

Contig names should be identical in the FASTA and the GTF. Create a list of contigs in both files and check that the GTF does not have more contigs than the FASTA:

cat genome.fa.fai | cut -f 1 | sort > contigs.fa.txt
cat genes.gtf | cut -f 1 | sort | uniq > contigs.gtf.txt
wc -l contigs.*.txt

This counts the total number of contigs from each file.

Check that all the contig names in the GTF are present in the FASTA:

diff --side-by-side --suppress-common-lines contigs.fa.txt contigs.gtf.txt

Only the non-matching contigs will be output. If the FASTA has additional contigs, that should be fine.

The GTF should include gene_name and gene_id attributes for all records. Check the number of genes with and without those attributes:

cat genes.gtf | grep -F -v "transcript_id" | grep -F -w "gene" | wc -l
cat genes.gtf | grep -F -v "transcript_id" | grep -F -w "gene" | grep -F "gene_name" | wc -l
cat genes.gtf | grep -F -v "transcript_id" | grep -F -w "gene" | grep -F "gene_id" | wc -l

The three numbers should be identical.

The GTF should include gene, transcript, and exon features (column 3) for all genes.

Generating Common Supplementary Files

Generate FASTA index genome.fa.fai using samtools:

samtools faidx genome.fa

Generate sequence dictionary genome.dict using Picard (used by Picard and GATK):

java -Xms16G -Xmx16G -jar /path/to/picard.jar CreateSequenceDictionary \
  REFERENCE=genome.fa OUTPUT=genome.dict

Generate chrom.sizes (used by bedtools and UCSC utilities):

cut -f 1,2 genome.fa.fai > chrom.sizes

Generate genome BED file using bedtools:

bedtools makewindows -g chrom.sizes -w 999999999 \
  | LC_ALL=C sort -k1,1 \
  > genome.bed

Generate Bowtie 2 index (also compatible with Bowtie 1.2.3+):

mkdir Bowtie2
ln -s ../genome.fa Bowtie2/genome.fa
bowtie2-build --threads 4 Bowtie2/genome.fa Bowtie2/genome

Generate BWA index:

mkdir BWA
ln -s ../genome.fa BWA/genome.fa
bwa index -a bwtsw BWA/genome.fa

Generate gene predictions in refFlat format (must be gzipped for Picard CollectRnaSeqMetrics):

# gtfToGenePred is part of UCSC Genome Browser Utilities
gtfToGenePred -genePredExt -geneNameAsName2 genes.gtf refFlat.tmp
paste <(cut -f 12 refFlat.tmp) <(cut -f 1-10 refFlat.tmp) > refFlat.txt
rm -f refFlat.tmp
gzip refFlat.txt

Generate STAR index:

mkdir STAR
STAR --runMode genomeGenerate --sjdbOverhang 100 --runThreadN 4 \
  --genomeDir STAR --genomeFastaFiles genome.fa --sjdbGTFfile genes.gtf

Generate kallisto index:

kallisto index --make-unique --index kallisto.idx transcripts.fa  

Generate Salmon index:

salmon index --type quasi --threads 4 \
  --transcripts transcripts.fa \
  --index Salmon

Generate Salmon (version 1.2+) index for decoy-aware selective alignment:

mkdir Salmon
cat transcripts.fa genome.fa > Salmon/gentrome.fa
gzip -v Salmon/gentrome.fa
# a default k of 31 is optimized for reads >75bp, consider a smaller k with shorter reads
salmon index --threads 4 --kmerLen 23 \
  --transcripts Salmon/gentrome.fa.gz \
  --decoys Salmon/decoys.txt \
  --gencode \
  --index Salmon

About

Reference sequences and annotation files for commonly analyzed organisms

Topics

Resources

Stars

Watchers

Forks

Releases

No releases published

Packages

No packages published