Skip to content

Latest commit

 

History

History
89 lines (68 loc) · 6.68 KB

preprocessing.md

File metadata and controls

89 lines (68 loc) · 6.68 KB

How to prepare input bam and HLA panel for Kourami

(1) Input bam to Kourami [read alignments to HLA panel]

The input to Kourami is a bam file containing read alignments of a subset of reads from HLA loci to known HLA reference panel generated by incorporating gene-length MSA and exon-only MSA provided by IMGT/HLA database. Currently, Kourami uses 3.24.0 release of the database. Different versions of the database can be used easily and how to use other IMGT releases or your own db is explained in the second part of this document.

What you need

  1. [Kourami]
  2. bamUtil 1.0.14 or later
  3. samtools 1.3.1 or later
  4. bwa 0.7.15-r1140 or later
  5. bwa-kit 0.7.12 if you don't already have GRCh38 alignment

Download a suitable flavor of GRCh38

The current version (GRCh38) of the human genome comes in multiple flavors because they are published as multiple components. The components are:

Ⓐ. Primary assembly : chromosome, unplaced, and unlocaized contigs + EBV (195 contigs)

Ⓑ. Decoy (2386 contigs)

Ⓒ. ALT : ALT haplotype (261 contigs)

Ⓓ. HLA alleles packaged in hs38DH in bwa.kit (525 contigs)

We find that either using hs38NoAltDH ( a + b + d ) or hs38DH ( a + b + c + d ) is most effective for extracting reads from HLA loci. The hs38DH flavor is used by 1000 Genome project (see here)

You need to download hs38NoAltDH by running download_grch38.sh script located under script directory

kourami@kourami:~/kourami$./scripts/download_grch38.sh hs38NoAltDH

This will generate the reference fasta file hs38NoAltDH.fa under resources directory. Then you need to bwa index the downloaded reference by running:

kourami@kourami:~/kourami$bwa index resources/hs38NoAltDH.fa	

Read Extraction and input bam generation from GRCh38 bam

If you have WGS data aligned to GRCh38 reference, we first need to extract reads that likely coming from HLA loci. If not, see the section "When aligned bam files to GRCh38 are not available" below. Depending on the GRCh38 flavor your bam is aligned to, you need to use the correct script to extract reads (See Table below). The WGS data aligned to GRCh38 should be either in bam or cram format (have to be sorted and indexed) prior to running an extraction script.

GRCh38 flavor Use
hs38DH alignAndExtract_hs38DH.sh
hs38NoAltDH alignAndExtract_hs38DH_NoAlt.sh
Ⓐ + (Ⓑ optional) + Ⓒ alignAndExtract_hs38Alt.sh
Ⓐ + (Ⓑ optional) [NOT recommended] alignAndExtract_hs38.sh

Running the extraction (pre-processing) script

An example is shown for bam aligned to hs38DH below:

kourami@kourami:~/kourami$mkdir test
kourami@kourami:~/kourami$cd test
kourami@kourami:~/kourami/test$../scripts/alignAndExtract_hs38DH.sh NA12878 /mnt/data/NA12878.hs38DH.bam

This will generate NA12878_on_KouramiPanel.bam and this can be fed into Kourami.

Specifying another version of IMGT/HLA DB other then the default

Kourami formatted IMGT/HLA DB as well as the reference panel sequences are normally located under db directotry under kourami installation. In case you want to use another version of IMGT/HLA DB, you can sepcify the path to the desired database (by using -d [path-to-db] option)when running one of the extraction scripts. Before your another IMGT/HLA release, you must change it to Kourami-compatible format. This is explained in the second part of this document (see [Using another version of IMGT/HLA DB or a custom verion] below).

kourami@kourami:~/kourami/test$../scripts/alignAndExtract_hs38DH.sh -d ~/kourami/customDB NA12878 /mnt/data/NA12878.hs38DH.bam

When aligned bam files to GRCh38 are not available:

When an aligned bam file (to the human genome) is not available, you must first align high coverage WGS data ( >30X coverage ) to the reference human genome, we recommend using the hs38NoAltDH or hs38DH flavor (see [Downloading the correct version of GRCh38] section above). We recommend you to follow 1000 genomes GRCh38 pipeline explained here. Generate bam or cram file should be sorted and indexed.

(2) Creating Kourami HLA panel and merged MSAs from another version (release) of IMGT/HLA DB or a custom version

The default version (Kourami formatted - IMGT/HLA release 3.24.0) can be automatically downloaded and bwa-indexed by running scripts/download_panel.sh (See Installation section in README.

Given a set of MSAs prepared in each release of IMGT/HLA DB, you will need to reformat them to be used with Kourami. You can use the script formatIMGT.sh under scripts directiory.

Dependencies

Input to the script

  • All multiple sequence alignment (MSA) flat files of alignments directory in an IMGT/HLA DB release.
  • In each release, alignments directory can be separately downloaded. alignments directory is distributed as a zip file (Alignments_Rel_XXXX.zip where XXXX is Release number). The text format of MSA is explained under [File Formats]-[Sequence Alignments] in here. Zipped alignments directory can be downloaded either from https://github.com/ANHIG/IMGTHLA or ftp://ftp.ebi.ac.uk/pub/databases/ipd/imgt/hla/ . The ftp site only provides the latest release.
  • hla_nom_g.txt file is included in each IMGT/HLA release and this is a required file for Kourami. When downloading alignments directory, hla_nom_g.txt from the same release of IMGT/HLA must be downloaded as well. Place this file in the downloaded alignments directory.

Usage

<PATH-TO>/formatIMGT.sh -i [IMGT/HLA alignments dir] <optional parameters>
Option Tag Description
-i <input_dir> path to IMGT/HLA alignments directory [required]
-v <ver_number> IMGT/HLA release number is automatically set from IMGT/HLA MSA files. If specified, it overrides. [optional]
-o <output_dir> name of the directory the output will be written to. Output files will be written to <output_dir>/<ver_number>. If not provided, custom_db/<ver_number> under Kourami installation directory will be used. [optional]
-h print this message [optional]