Scripts PanCancer Specific

Keiran Raine edited this page Mar 19, 2015 · 6 revisions

These scripts are only relevant for users who are members of the ICGC/TCGA PanCancer working groups.

gnos_pull.pl

This tool lets you selectively download either results or alignment data from the GNOS repositories (provided you have the relevant access keys).

It uses the Elasticsearch JSON file donor_p*.jsonl.gz, hosted on pancancer.info, as its source of information. You can override the default if you want to use a specific version during large-scale pulls.
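If you want to inspect a particular version of the index before a large pull, the gzipped JSON Lines file can be read one record at a time. The following is a minimal Python sketch, assuming you already have a local copy of the file. The function names `iter_donors` and `count_donors` are illustrative and not part of gnos_pull.pl, and no particular record fields are assumed.

```python
import gzip
import json


def iter_donors(path):
    """Yield one decoded JSON object per non-empty line of a .jsonl.gz file."""
    with gzip.open(path, "rt", encoding="utf-8") as handle:
        for line in handle:
            line = line.strip()
            if line:
                yield json.loads(line)


def count_donors(path):
    """Count records without holding the whole file in memory."""
    return sum(1 for _ in iter_donors(path))
```

Calling `count_donors("path/to/local_copy.jsonl.gz")` (a placeholder path) gives the number of donor records in that version of the file, which is a quick way to confirm which index you are working from.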

The main utility of this tool is the ability to subset and filter what you are retrieving by setting filters in an ini file. The example included in the codebase details all the available options and filters. For example:

[COMPOSITE_FILTERS]
multi_tumour=1

This results in the retrieval of variant calling data for all donors that have multiple tumour files:

$ ./gnos_pull.pl -o $SCRATCH -c .../gnos_pull.ini -a CALLS -i
Retained donors: 21
Rejected donors: 2471
Project Distribution
    CMDI-UK:	16	(avg ? GB)
    EOPC-DE:	3	(avg ? GB)
    PACA-CA:	2	(avg ? GB)

NOTE: Sizes are shown as "avg ? GB" because this data was not populated in the jsonl at the time of writing.

Alternatively, setting the command-line argument -a to ALIGNMENTS gives details of the BAM files:

$ ./gnos_pull.pl -o $SCRATCH -c .../gnos_pull.ini -a ALIGNMENTS -i
Retained donors: 54
Rejected donors: 2438
Project Distribution
  CMDI-UK:	31	(avg 376 GB)
  EOPC-DE:	8	(avg 679 GB)
  LIRI-JP:	10	(avg 333 GB)
  PACA-CA:	2	(avg 444 GB)
  PRAD-UK:	3	(avg 756 GB)

For data hosted in multiple GNOS repositories, you can give the script a list of the repositories in descending order of transfer rate (at present this has to be constructed by hand using the appropriate transfer table on pancancer.info):

[TRANSFER]
order=<<EOT
https://gtrepo-dkfz.annailabs.com/
https://gtrepo-ebi.annailabs.com/
https://gtrepo-etri.annailabs.com/
EOT
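Since the list has to be built by hand, a small helper can at least do the sorting and formatting once you have copied the rates out of the transfer table. This is a Python sketch under stated assumptions: the function name `transfer_section` is hypothetical, and the rates in the example are placeholders for illustration, not measured values.

```python
def transfer_section(rates):
    """Return a [TRANSFER] ini block listing repositories fastest first.

    `rates` maps repository URL -> transfer rate (any consistent unit).
    """
    ordered = sorted(rates, key=rates.get, reverse=True)
    lines = ["[TRANSFER]", "order=<<EOT", *ordered, "EOT"]
    return "\n".join(lines)


if __name__ == "__main__":
    # Placeholder rates for illustration only; take real values from the
    # transfer table on pancancer.info.
    example_rates = {
        "https://gtrepo-ebi.annailabs.com/": 2.0,
        "https://gtrepo-dkfz.annailabs.com/": 3.0,
        "https://gtrepo-etri.annailabs.com/": 1.0,
    }
    print(transfer_section(example_rates))
```

The printed block can be pasted into your ini file in place of the existing [TRANSFER] section.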

Please see the command line help for the most current details of the command line options:

./gnos_pull.pl -h

NOTE: If GNOS hangs you will have to kill the process and restart the script; it will resume from the last completed dataset.

bam_to_sra_sub.pl

Generates the XML files required to submit data to the PanCancer GNOS instances. Details of how to prepare your BAM/FASTQ data can be found on the OICR-PanCancer Wiki (requires login).

xml_to_bas.pl

Regenerates a *.bas file from an analysisFull.xml URI. PCAP::Bam::Stats is able to parse this correctly regardless of column ordering (which may be inconsistent).

e.g.

$ xml_to_bas.pl -d https://gtrepo-ebi.annailabs.com/cghub/metadata/analysisFull/4e183691-ba1f-4103-a517-948f363928b8
read_group_id	#_divergent_bases	#_divergent_bases_r1	#_divergent_bases_r2	#_duplicate_reads	#_gc_bases_r1	#_gc_bases_r2	#_mapped_bases	#_mapped_bases_r1	#_mapped_bases_r2	#_mapped_reads	#_mapped_reads_properly_paired	#_mapped_reads_r1	#_mapped_reads_r2	#_total_reads	#_total_reads_r1	#_total_reads_r2	bam_filename	insert_size_sd	library	mean_insert_size	median_insert_size	platform	platform_unit	read_length_r1	read_length_r2	readgroup	sample
WTSI25941	109656822	53400157	56256665	0	4900621926	4913073665	23957071382	12092611993	11864459389	242992645	91741465	121667967	121324678	243357940	121678970	121678970	out_4.bam	83.731	WGS:WTSI:12490	408.841	400.000	ILLUMINA	WTSI:7119_5	100	100	WTSI25941	f8467ec8-2d5e-ba21-e040-11ac0c483584
WTSI25938	105448163	50962194	54485969	0	5059285020	5071727818	24756788121	12492499609	12264288512	251054082	94753203	125696465	125357617	251410540	125705270	125705270	out_1.bam	83.591	WGS:WTSI:12490	409.087	400.000	ILLUMINA	WTSI:7119_2	100	100	WTSI25938	f8467ec8-2d5e-ba21-e040-11ac0c483584
WTSI25937	124757189	68266593	56490596	0	5003807799	5007016723	24418124464	12326534108	12091590356	247931492	93471492	124212116	123719376	248447342	124223671	124223671	out_0.bam	83.328	WGS:WTSI:12490	408.817	400.000	ILLUMINA	WTSI:7119_1	100	100	WTSI25937	f8467ec8-2d5e-ba21-e040-11ac0c483584
WTSI25940	120711275	58333598	62377677	0	4926375001	4938756287	24057793106	12144140521	11913652585	244147184	92150919	122241408	121905776	244501826	122250913	122250913	out_3.bam	83.724	WGS:WTSI:12490	408.665	400.000	ILLUMINA	WTSI:7119_4	100	100	WTSI25940	f8467ec8-2d5e-ba21-e040-11ac0c483584
WTSI25939	110785952	52341213	58444739	0	5110036344	5124651899	24997394847	12618022324	12379372523	253555020	95683446	126945749	126609271	253911232	126955616	126955616	out_2.bam	83.557	WGS:WTSI:12490	408.982	400.000	ILLUMINA	WTSI:7119_3	100	100	WTSI25939	f8467ec8-2d5e-ba21-e040-11ac0c483584
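Because the column order of a *.bas file may vary, any code consuming this output is safest looking values up by header name rather than by position. PCAP::Bam::Stats handles this in Perl; the following is an independent Python sketch of the same idea, not the PCAP implementation. The function names `read_bas` and `mapped_fraction` are illustrative, while the column names are taken from the header shown above.

```python
import csv
import io


def read_bas(text):
    """Parse tab-separated *.bas text into one dict per read group, keyed by header."""
    return list(csv.DictReader(io.StringIO(text), delimiter="\t"))


def mapped_fraction(row):
    """Fraction of reads that mapped for one read group row."""
    return int(row["#_mapped_reads"]) / int(row["#_total_reads"])
```

Passing the captured output of xml_to_bas.pl to `read_bas` gives one dictionary per read group, so `mapped_fraction` returns the same value however the columns happen to be ordered.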