garcia-nacho/WDL_CodeChallenge_LifeBit

LifeBit code challenge

Info

The pipeline uses two docker containers:
garcianacho/lb_base
garcianacho/lb_vep_full
These containers include all the dependencies and scripts necessary to run the pipeline.

The garcianacho/lb_base container ships a scatter script that differs from the one used for the NextFlow version of the pipeline. The change accommodates the way WDL parses the paths to the R script.

Run

To run the pipeline you must clone this repo.

git clone garcia-nacho code-challenge-nextflow-wdl-annotation

and run the following command inside the code-challenge-nextflow-wdl-annotation folder

miniwdl run ./lb_challenge.wdl InputVcf=./VCFsubset.vcf

Note that miniwdl must be installed on your system; you can get it via pip install miniwdl.

Interestingly, the same WDL code doesn't work under Cromwell (v.85). Under Cromwell, the input files for the last process are placed inside subfolders, so the R script can't find them. Miniwdl behaves differently: it places all the files together in the same folder, which is what the R script expects. In other words, Miniwdl's behaviour matches NextFlow's and differs from Cromwell's.
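
One possible workaround for the Cromwell layout, which this repo does not implement, would be to copy every localized input into a single folder before the R script runs. The following is only a minimal Python sketch of that idea; the function name flatten_inputs and both directory arguments are hypothetical, and it assumes the chunk file names are unique.

```python
import shutil
from pathlib import Path


def flatten_inputs(src_root, flat_dir):
    """Copy every file found under src_root into flat_dir, without subfolders.

    Hypothetical helper: not part of the pipeline in this repo.
    flat_dir is assumed to live outside src_root.
    """
    flat = Path(flat_dir)
    flat.mkdir(parents=True, exist_ok=True)
    copied = []
    for path in sorted(Path(src_root).rglob("*")):
        if not path.is_file():
            continue
        target = flat / path.name
        if target.exists():
            # Two inputs with the same name would overwrite each other.
            raise FileExistsError(f"duplicate file name: {path.name}")
        shutil.copy2(path, target)
        copied.append(target)
    return copied
```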

Input

As input, I subsampled a VCF file from the 1000 Genomes Project: 1000Genomes/trio/HG00702_SH089_CHS. To speed up testing, I kept only a few variants from each chromosome, as required.

Under the hood

The command runs a script that splits the input VCF file into several parts. Because the input VCF is small, each part holds just 10 variants; this chunk size can be easily adjusted. Next, all the chunks are sent to the vep command. VEP runs the following plugins:

BLOSUM62
CSN
DownstreamProtein
ProteinLengthChange
HGVS_IntronEndOffset
HGVS_IntronStartOffset
LOVD
NearestExonJB
ReferenceQuality
SpliceRegion
TSSDistance
FlagLRG
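
The scatter step that precedes VEP can be illustrated with a short sketch. The pipeline itself uses its own scatter script inside the garcianacho/lb_base container; the Python function below, with the hypothetical name split_vcf, only shows the general idea of cutting the records into chunks of 10 while repeating the header in every chunk.

```python
def split_vcf(lines, chunk_size=10):
    """Split VCF lines into chunks of at most chunk_size records.

    Illustrative only: every chunk starts with the full header
    (the lines beginning with '#') so it remains a valid VCF.
    """
    header = [ln for ln in lines if ln.startswith("#")]
    records = [ln for ln in lines if ln.strip() and not ln.startswith("#")]
    return [
        header + records[i:i + chunk_size]
        for i in range(0, len(records), chunk_size)
    ]
```

With 25 records and the default chunk size, this yields three chunks holding 10, 10 and 5 records.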

In the last step, the pipeline gathers all the results and generates a single VCF file inside the Results folder, which the pipeline creates.
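
The gather step in this repo is done by an R script. As a rough illustration of what such a step involves, the Python sketch below (the function name gather_vcf is hypothetical) keeps the header of the first annotated chunk and appends the records of every chunk in order. It assumes all chunks share the same header, which may not match what the R script actually does.

```python
def gather_vcf(chunks):
    """Merge annotated chunks into one list of VCF lines.

    Illustrative only: the header is taken from the first chunk,
    and the records follow in chunk order.
    """
    if not chunks:
        return []
    header = [ln for ln in chunks[0] if ln.startswith("#")]
    records = [
        ln
        for chunk in chunks
        for ln in chunk
        if ln.strip() and not ln.startswith("#")
    ]
    return header + records
```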

Output

The output file is called Results.vcf and its md5sum is 2a5f79e048b74f8ab98f10b47725c7dc.
It is located inside the ./XXXX_UUUU_RunVep/out/gather_vep.finaloutput folder.
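
To confirm that a run reproduced the expected result, the checksum can be compared programmatically. The following is a minimal Python sketch; the function name md5_of_file is hypothetical, and the path passed to it must be adapted to the actual run folder.

```python
import hashlib

# Checksum reported above for Results.vcf.
EXPECTED_MD5 = "2a5f79e048b74f8ab98f10b47725c7dc"


def md5_of_file(path, block_size=1 << 20):
    """Return the hexadecimal MD5 digest of a file, read in blocks."""
    digest = hashlib.md5()
    with open(path, "rb") as handle:
        for block in iter(lambda: handle.read(block_size), b""):
            digest.update(block)
    return digest.hexdigest()


# Example usage (adapt the path to your run folder):
# print(md5_of_file("Results.vcf") == EXPECTED_MD5)
```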
