Skip to content

Latest commit

 

History

History
executable file
·
82 lines (58 loc) · 4.94 KB

steps_profile_VIS_scMultiome.md

File metadata and controls

executable file
·
82 lines (58 loc) · 4.94 KB

Profiling vector integration sites at single-cell level using scATAC-seq or scMultiome data

Here are the steps for profiling VIS from 10X scATAC-seq/scMultiome data. The main tool we used is EpiVIA. The starting point is the fastq files generated by 10X scATAC-seq data. There are 4 fq files: I1, R1, R2 and R3. I1 is the 8 bp sample barcode, R1 is the forward read, R2 is the 16 bp 10x feature barcode and R3 is the reverse read. 

As prerequisites, files from multiple lanes should be merged. Qualtiy control was performed for R1 and R3 (fastqc), reads should be trimmed in case adapters are found. More important, build a bwa index for a modified human genome. The provirus sequence has to be appended as an additional chromosome.

Step 1 (alignment) Map the reads to the modified genome by bwa.

idx=/research/projects/yu3grp/IO_JY/yu3grp/LVXSCID/patients_scATACseq/bwa_index/hg19withpv/hg19wpvidx ##location of the bwa index
input1=P1_scMulti_ATAC_S1_R1.fastq.gz
input2=P1_scMulti_ATAC_S1_R3.fastq.gz
bam_file=P1_scMulti_ATAC_S1_pe.bam
bwa mem -t 4 $idx $input1 $input2 | samtools view -bS - > $bam_file

Step 2 (extract the chimeric reads) Extract the chimeric reads in which one end is mapped to a human chromosome, whereas another end is mapped to the provirus sequence.

input=P1_scMulti_ATAC_S1_pe.bam
out=P1_scMulti_ATAC_S1_pe_mated.bam
out_filter1=P1_scMulti_ATAC_S1_pe.mated.filter1.bam
out_filter2=P1_scMulti_ATAC_S1_pe.mated.filter2.bam
out_filter=P1_scMulti_ATAC_S1_pe.mated.filter.bam

samtools view -F 12 $input1 -o $out

###It will get all host-virus pairs
samtools view -h $out | awk '($7 == "chrZ_provirus" && $3!="=") || ($3 == "chrZ_provirus" && $7!="=") || $1 ~ /^@/' | samtools view -bS - > $out1_filter1

###get the host-host, virus-virus pairs with soft clip
samtools view -h $out | awk '($7 == "="  && $6 ~ /S/) || ($7 == "="  && $14 ~ /S/) || $1 ~ /^@/' | samtools view -bS - > $out1_filter2

samtools merge $out_filter $out_filter1 $out_filter2

Step 3 (Match cellular barcodes) Extract the cellular barcode for each chimeric read using the R2 fastq file. The barcode is added as an additional tag in the bam file.

##input files
input_bam=P1_scMulti_ATAC_S1_pe.mated.filter.bam. #bam file from step 2 storing chimeric reads
cellID_file=P1_scMulti_ATAC_S1_R2.fastq.gz.       #R2 files storing all cellular barcodes from 10X   

##name output files 
input_bam_R2=P1_scMulti_ATAC_S1_pe.mated.filter_R2.fastq
aux_file=aux

#keep barcodes in R2 that present in the chimeric reads
samtools view $input_bam | awk -F "\t" '{print $1}' > $aux_file
seqtk subseq $cellID_file $aux_file > $input_bam_R2
rm $aux_file

Now, we have a fastq file storing the R2 (cellular barcodes) for every potential chimeric read in the filtered bam file. Next, run this python script to add the corresponding barcode for each read in the bam file under the tar "CB", and convert the file back to bam.

samtools view -bS P1_scMulti_ATAC_S1_pe.mated.filter_wCB.sam > P1_scMulti_ATAC_S2_pe.mated.filter_wCB.bam

Step 4 (Run EpiVIA) Install the tool EpiVIA. The basic input is the bam file storing potential chimeric reads with CB as tag. In addition, the locations of the provirus sequence and bwa index have to be provided. Inside the provirus sequence, the coordinates of the viral long-term repeat (LTR) should be provided.

input_bam=P1_scMulti_ATAC_S1_pe.mated.filter_wCB.bam
provirus_file=/research/projects/yu3grp/IO_JY/yu3grp/LVXSCID/patients_scATACseq/provirus_sequence
out_dir=/research/projects/yu3grp/IO_JY/yu3grp/LVXSCID/patients_scATACseq/multiome_P1/05_epiVIA_res_v3

out_bam=$out_dir/P1_scMulti_ATAC_S1_out.wCB.bam #output bam file
out_bam_aux=$out_dir/P1_scMulti_ATAC_S1_out #folder storing the output bam file
mkdir $out_bam_aux

epiVIA --Provirus $provirus_file --ltr5_start 1 --ltr5_end 650 --ltr3_start 3991 --ltr3_end 4640 --HostIndex /research/projects/yu3grp/IO_JY/yu3grp/LVXSCID/patients_scATACseq/bwa_index/hg19/hg19idx --Host2bit /research/projects/yu3grp/IO_JY/yu3grp/LVXSCID/patients_scATACseq/hg19.2bit --tempdir $out_bam_aux --candidate_bam $out_bam --gbdb http://hgdownload.soe.ucsc.edu/gbdb/ --genome hg19 $input_bam $out_bam_aux

The output file "integration_sites.txt" compiles a list of integration sites with the corresponding barcodes. For scATAC-seq data, the barcodes correspond to the column names of the count matrice in 10X (imported by tools like Seurat). For scMultiome data, it is not the case. 10X generated a file called "per_barcode_metrics.csv" which maps the ATAC-seq barcodes outputed by sequencer to barcodes that are consistent with the expression data.

Alternatively, one could run 10X cell ranger ATAC or ARC with the modified reference genome. It could potentially get rid of step 3.