Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Read alignment off by one after primer trimming #36

Closed
peterk87 opened this issue Apr 15, 2020 · 10 comments
Closed

Read alignment off by one after primer trimming #36

peterk87 opened this issue Apr 15, 2020 · 10 comments
Assignees
Labels
bug Something isn't working

Comments

@peterk87
Copy link

Describe the bug

Some read alignments are shifted after ivar trim (v1.2) compared to untrimmed read alignments from Minimap2 (v2.17-r941) and NGMLR (v0.2.7).

To Reproduce

I aligned the Nanopore reads to the reference genome (MN908947.3) with Minimap2 (and NGMLR), trimmed primer sequences and low quality bases

$ minimap2 -ax map-ont -t16 ref.fa reads.fq | samtools sort -@16 | samtools view -F4 -b -o reads.sort.bam
$ ivar trim -b artic-ncov2019/primer_schemes/nCoV-2019/V2/nCoV-2019.bed -p reads.trim -i reads.sort.bam -q 1 -m 30 -s 4

Note: for some reason many of the reads had very low average quality scores for this run and we were mostly interested in trimming the primer sequences.

I viewed and compared alignments in IGV (v2.8.2 installed from BioConda) and noticed some big differences between the untrimmed and trimmed alignments of some reads in certain regions.

Top alignment is untrimmed Minimap2 and bottom is iVar trimmed alignment:

Screenshot_20200415_121302-IGV-ivar-minimap2-comparison-multiple-reads

For example, here's the same read at position MN908947.3:29,383 where in the original alignment the base is G (QV=27) and in the trimmed alignment the base is A (QV=11):

Before ivar trim read alignment information in IGV

Read name = 04853829-6653-42bb-b303-351fbd160119
Read length = 515bp
----------------------
Mapping = Primary @ MAPQ 60
Reference span = MN908947.3:29,363-29,646 (+) = 284bp
Cigar = 89S15M1D11M1I3M1D20M1I14M5I3M2I2M2I18M1I33M1D7M1I7M2I12M1I28M2I8M5I3M1D12M2I3M4I12M1I6M2D5M1D23M1D31M120S
Clipping = Left 89 soft; Right 120 soft
----------------------
s1 = 79
s2 = 0
NM = 51
AS = 314
de = 0.1145
rl = 0
cm = 9
nn = 0
tp = P
ms = 314Location = MN908947.3:29,383
Base = G @ QV 27

After ivar trim

Read name = 04853829-6653-42bb-b303-351fbd160119
Read length = 515bp
----------------------
Mapping = Primary @ MAPQ 60
Reference span = MN908947.3:29,380-29,647 (+) = 268bp
Cigar = 104S11M1I3M1D20M1I14M5I3M2I2M2I18M1I33M1D7M1I7M2I12M1I28M2I8M5I3M1D12M2I3M4I12M1I6M2D5M1D23M1D31M120S
Clipping = Left 104 soft; Right 120 soft
----------------------
s1 = 79
s2 = 0
NM = 51
AS = 314
de = 0.1145
rl = 0
cm = 9
nn = 0
tp = P
ms = 314
Hidden tags: XALocation = MN908947.3:29,383
Base = A @ QV 11

It looks like the iVar trimmed alignment has been shifted when I zoom in on the read in both untrimmed and trimmed alignments:

Screenshot_20200415_112320-IGV-ivar-minimap2-comparison

Note that the position is incremented by 1 in the trimmed read alignment relative to the untrimmed alignment.

Expected behavior

I expected that the alignments should remain the same aside from primer and quality trimming.

Workstation:

  • OS: Arch Linux x86_64 (Kernel Release: 5.5.9-arch1-2)
  • CPU: Intel(R) Core(TM) i9-9980HK CPU @ 2.40GHz
  • iVar Version 1.2 from BioConda

Additional context

Conda env YAML ($ conda env export) for running iVar, Minimap2, NGMLR and samtools:

channels:
  - conda-forge
  - bioconda
  - defaults
dependencies:
  - _libgcc_mutex=0.1=conda_forge
  - _openmp_mutex=4.5=0_gnu
  - bzip2=1.0.8=h516909a_2
  - ca-certificates=2020.4.5.1=hecc5488_0
  - curl=7.69.1=h33f0ec9_0
  - htslib=1.9=h4da6232_3
  - krb5=1.17.1=h2fd8d38_0
  - libcurl=7.69.1=hf7181ac_0
  - libdeflate=1.5=h516909a_0
  - libedit=3.1.20170329=hf8c457e_1001
  - libgcc-ng=9.2.0=h24d8f2e_2
  - libgomp=9.2.0=h24d8f2e_2
  - libssh2=1.8.2=h22169c7_2
  - libstdcxx-ng=9.2.0=hdf63c60_2
  - minimap2=2.17=h8b12597_1
  - ncurses=6.1=hf484d3e_1002
  - ngmlr=0.2.7=he860b03_1
  - openssl=1.1.1f=h516909a_0
  - samtools=1.9=h10a08f8_12
  - sniffles=1.0.11=hdbcaa40_1
  - tclap=1.2.1=h470a237_1
  - tk=8.6.10=hed695b0_0
  - xz=5.2.5=h516909a_0
  - zlib=1.2.11=h516909a_1006
@gkarthik
Copy link
Member

gkarthik commented Apr 16, 2020

Hey Peter, thank you for bringing this to my attention. This is a bug that occurs if there is a deletion at the final position of the primer sequence. I patched this and I'm writing a few tests to account for it. I will release a version by EOD.

@gkarthik gkarthik reopened this Apr 16, 2020
@gkarthik gkarthik added good first issue Good for newcomers bug Something isn't working and removed good first issue Good for newcomers labels Apr 16, 2020
@peterk87
Copy link
Author

Awesome! Thanks for fixing it so quickly!

@gkarthik
Copy link
Member

I just released v1.2.1 that should fix this. Could you please run it and let me know if it works as expected? I can then close this issue. Thanks!

@peterk87
Copy link
Author

Hi @gkarthik

I get the following error when trying to compile 1.2.1

$ make
make  all-recursive
make[1]: Entering directory '/home/pkruczkiewicz/sandbox/ivar-1.2.1'
Making all in src
make[2]: Entering directory '/home/pkruczkiewicz/sandbox/ivar-1.2.1/src'
depbase=`echo call_consensus_pileup.o | sed 's|[^/]*$|.deps/&|;s|\.o$||'`;\
g++ -DHAVE_CONFIG_H -I. -I..   -DNDEBUG -D_FORTIFY_SOURCE=2 -O2 -isystem /home/pkruczkiewicz/miniconda3/include  -g -std=c++11 -Wall -Wextra -Werror -MT call_consensus_pileup.o -MD -MP -MF $depbase.Tpo -c -o call_consensus_pileup.o call_consensus_pileup.cpp &&\
mv -f $depbase.Tpo $depbase.Po
call_consensus_pileup.cpp: In function ‘int call_consensus_from_plup(std::istream&, std::string, uint8_t, double, uint8_t, char, bool)’:
call_consensus_pileup.cpp:185:31: error: ‘ref’ may be used uninitialized in this function [-Werror=maybe-uninitialized]
  185 |       ad = update_allele_depth(ref, bases, qualities, min_qual);
      |            ~~~~~~~~~~~~~~~~~~~^~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
cc1plus: all warnings being treated as errors
make[2]: *** [Makefile:381: call_consensus_pileup.o] Error 1
make[2]: Leaving directory '/home/pkruczkiewicz/sandbox/ivar-1.2.1/src'
make[1]: *** [Makefile:359: all-recursive] Error 1
make[1]: Leaving directory '/home/pkruczkiewicz/sandbox/ivar-1.2.1'
make: *** [Makefile:300: all] Error 2

I am able to compile iVar 1.2 from source with no issues.

@gkarthik
Copy link
Member

gkarthik commented Apr 16, 2020

For some reason, I'm unable to reproduce this error with gcc v6.4.0, v7.4.0 or v7.3.0. I fixed the release to account for this error. Could you please try this again?

If you could also send me your compiler version that would help. Looks like its compiler dependent but I'm unable to isolate the issue. Thanks!

@peterk87
Copy link
Author

Hi @gkarthik I fixed the issue I had compiling with gcc v9.3.0 in #37, however, I am still encountering some cases where the read seems to be shifted.

For example, here's the information from Tablet for a read that seems to have been shifted over:

Before:

e8c04a0a-a1e2-4773-8e5e-664a3db7a8b3
From: 17,432 U17,432 to 17,760 U17,760
Length: 329 U312 (37 mismatches)
Cigar: 56S7M4I14M3D3M1I2M2D12M1I12M1I2M3D31M2I13M1D14M7I42M1D17M1D8M2I21M2I1M1D7M2...
Read direction is FORWARD

>e8c04a0a-a1e2-4773-8e5e-664a3db7a8b3
TTGGCGACCCTACTCAATTAC***CACCA**CACATTGCTAACTCAGGGC
ACACTAG***CAGGCTGTTTCAATTCAGTGTGTAGACTTATGAAAACTAT
AGGT*CAGACATGTTCCTCGGAACTTGTCGGCGTTATCCTGCTGAAATTG
TTGACACTGTG*GTGCTTTGGTTTATGAT*AGCCGCTTAAAGCACATAAA
GACAAATCAG*TCAATGCTTTAAAATGTTT**TAAGGGTTTTATCACGCA
TGATGTTTCATCTACATTCAACAGCCCACAAATAGGCGTGGCGAGA*AAT
TTCGCACAC**ATCCCTGCTTGGAGAAAA

After:

e8c04a0a-a1e2-4773-8e5e-664a3db7a8b3
From: 17,453 U17,453 to 17,757 U17,757
Length: 305 U291 (235 mismatches)
Cigar: 81S3M1I2M2D12M1I12M1I2M3D31M2I13M1D14M7I42M1D17M1D8M2I21M2I1M1D7M2I12M2D7M1...
Read direction is FORWARD

>e8c04a0a-a1e2-4773-8e5e-664a3db7a8b3
CACCA**CACATTGCTAACTCAGGGCACACTAG***CAGGCTGTTTCAAT
TCAGTGTGTAGACTTATGAAAACTATAGGT*CAGACATGTTCCTCGGAAC
TTGTCGGCGTTATCCTGCTGAAATTGTTGACACTGTG*GTGCTTTGGTTT
ATGAT*AGCCGCTTAAAGCACATAAAGACAAATCAG*TCAATGCTTTAAA
ATGTTT**TAAGGGTTTTATCACGCATGATGTTTCATCTACATTCAACAG
CCCACAAATAGGCGTGGCGAGA*AATTTCGCACAC**ATCCCTGCTTGGA
GAAAA

Another example:

Before:

66990fce-f0ec-419b-a609-41e77e1ee0b9
From: 713 U713 to 1,017 U1,017
Length: 305 U270 (53 mismatches)
Cigar: 66S10M1I1M4D4M1D5M1D12M2I8M1I21M2D4M2D18M1I3M1I3M1D17M7D1M1I7M1D3M2D20M2D9M...
Read direction is FORWARD

>66990fce-f0ec-419b-a609-41e77e1ee0b9
GGCACTGATCC****GAAG*TTTTC*AGAAAACTGGAACACTAAACCTAG
CGGTGGTGTTACCCGTG**GGCA**CGTGAGCTTAACGGAGGGGGATAC*
CTCGCTATGTCGATAAC*******GTGGCGGT*AAG**TACCCTCTTGAG
TGCATTGA**ACCTTCTGG**CGTGCTGGTAA**CT**AAGCAC**TGTC
CGCACAACTGGACTTTATTGACACTAAGA**GGTG*CT*CTGCTGCCGCG
AACGCAGGCATGAAATTGCTTGGTACACGGAACGTTCTGAAAAGAGCTAT
GAATT

After:

66990fce-f0ec-419b-a609-41e77e1ee0b9
From: 727 U727 to 1,016 U1,016
Length: 290 U259 (226 mismatches)
Cigar: 78S4M1D5M1D12M2I8M1I21M2D4M2D18M1I3M1I3M1D17M7D1M1I7M1D3M2D20M2D9M2D7M1I4M2...
Read direction is FORWARD

>66990fce-f0ec-419b-a609-41e77e1ee0b9
GAAG*TTTTC*AGAAAACTGGAACACTAAACCTAGCGGTGGTGTTACCCG
TG**GGCA**CGTGAGCTTAACGGAGGGGGATAC*CTCGCTATGTCGATA
AC*******GTGGCGGT*AAG**TACCCTCTTGAGTGCATTGA**ACCTT
CTGG**CGTGCTGGTAA**CT**AAGCAC**TGTCCGCACAACTGGACTT
TATTGACACTAAGA**GGTG*CT*CTGCTGCCGCGAACGCAGGCATGAAA
TTGCTTGGTACACGGAACGTTCTGAAAAGAGCTATGAATT

Unfortunately, it looks like there are still a number of reads that have been shifted resulting in misalignments.

@gkarthik gkarthik self-assigned this Apr 18, 2020
@gkarthik
Copy link
Member

@peterk87 This issue turned out to be more complicated. Because the reads are unpaired, the reads were being trimmed only on one end. I just pushed code to trim unpaired reads on both ends (fwd and rev primer). I also fixed the issues with deletions right after primer position. Could you please pull the master branch and see if it works on your dataset?

I want to make sure it works on your data before creating a new release. Thank you!

@peterk87
Copy link
Author

Thanks @gkarthik It looks like the issue is fixed as of 72b35e6. Primer sequences are trimmed in the read alignments without any visible issues!

I have been comparing the results from iVar with those from the ARTIC pipeline trim_align.py script and the results are comparable. I appreciate that iVar doesn't discard a lot of the potentially informative reads in low coverage regions as the trim_align.py script does. It's also very nice that iVar is much faster than the Python script.

@gkarthik
Copy link
Member

Fantastic! Thank you for raising this issue. Helped me work through a few other things too. I'm adding another feature to create bcf reports for consensus today and will create a release. I think with the ability to trim unpaired reads, ivar trim can be used for nanopore reads. However, please keep in mind that it will require tweaking the quality trimming threshold using -q. Maybe between 5 and 10 since the default is 20 for illumina data.. 🤔 I started playing around with the trimming for nanopore data but haven't generated any benchmarks yet.

@peterk87
Copy link
Author

Thanks @gkarthik! I'm glad you were able to fix it so quickly!

I've been setting the quality trimming threshold very low for my data. The Flongle seems to produce lower quality scores than the MinION.

Looking forward to the new bcf report feature!

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
bug Something isn't working
Development

No branches or pull requests

2 participants