-
Notifications
You must be signed in to change notification settings - Fork 7
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
CZID-8088: Fix host gene count breaking; add rRNA count #282
Conversation
@mlin While we're looking at the kallisto task, could you tell me if this TODO is still meaningful?
It looks to me like those params have been added though? But maybe they're just placeholders? I saw you had the last change on that TODO line, hoping you knew. Thanks! |
@vincent-czi @katrinakalantar re the fragment length parameters. Yes, we're setting them to arbitrary hard-coded values whereas it would be more ideal to set them more accurately for the data at hand.
It would be nice to file this formally as a todo, but I think it's relatively low priority for the following reasons.
|
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
@vincent-czi the code changes LGTM, nice work! Two confirmatory questions:
- In your testing, was the mentioned rRNA the only example that couldn't be joined between the tsv and gtf by the ENST id?
- The non-human host species generally continue to use Ensembl gene annotations since GENCODE annotations only exist for human (and mouse I think). I think (when I last worked on this) CZ ID only outputs the host quantifications for human, so we only really need to handle that path. But if that's changed then we might need to handle both cases, and in any case this might be worth documenting somewhere in case it changes in the future.
@mlin For the confirmatory questions, yes, good on both counts.
Regarding the other comment, thanks for all the info on the fragment length parameters. I'll update the TODO to point at that comment since that really clarifies what it's saying. |
Ticket: CZID-8088
Core issue
We recently updated the kallisto index. The GTF file we had been using for matching up transcripts to genes used a different identification pattern than the new index, so when we tried to roll up the various reads by gene, nothing matched and the returned TSV was empty (other than header row info).
Fix
We're now using the latest GTF file from GENCODE (where we sourced the inputs to create the new kallisto index). This means the identification pattern matches up better between kallisto output and the GTF used for rolling up into reads by gene. There were some more tweaks required for how we do that match up, but that's the big picture.
Notes
"ENST00000456328.2|ENSG00000290825.1|-|OTTHUMT00000362751.1|DDX11L2-202|DDX11L2|1657|"
), so we extract out the ENST part of the id and match that against the ENST ids we get from GTF.reads_per_transcript.kallisto.tsv
we weren't handling before:U13369.1
. I talked to Katrina about this, and it's a ribosomal rna gene we added special. It's used to estimate how much of the host material is rRNA. I couldn't find good documentation in a ticket, but Katrina gave me the expected behavior: for that specific read, those numbers should get written to the output TSV, and just jumped over the gene rollup entirely. The end users will understand what they're looking at. That said, I think maybe we should document it somewhere beyond the code comments I added? Not sure where though.reads_per_transcript.kallisto.tsv
andgencode.v43.annotation.gtf.gz
as inputs to it. So this has been tested and the resulting outputreads_per_gene.kallisto.tsv
was looked over by both me and Katrina, seems to work as expected with these changes.kallisto
WDL task with all of these changes incorporated on my dev EC2 instance before merging+deploying, although I'll probably talk to Ryan about the best way to do that.Still need to do
Separately, I still need to upload the new GTF file to S3, and update the appropriate DB records (via data migration) so that the new GTF file will start getting used.