Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

CZID-8088: Fix host gene count breaking; add rRNA count #282

Merged
merged 3 commits into from
Jul 24, 2023

Conversation

vincent-czi
Copy link
Contributor

Ticket: CZID-8088

Core issue

We recently updated the kallisto index. The GTF file we had been using for matching up transcripts to genes used a different identification pattern than the new index, so when we tried to roll up the various reads by gene, nothing matched and the returned TSV was empty (other than header row info).

Fix

We're now using the latest GTF file from GENCODE (where we sourced the inputs to create the new kallisto index). This means the identification pattern matches up better between kallisto output and the GTF used for rolling up into reads by gene. There were some more tweaks required for how we do that match up, but that's the big picture.

Notes

  • The matching between kallisto output and GTF is being done via the ENST ids. There are a bunch of different id schemes, and kallisto output has all of them in one smushed together string (ex: "ENST00000456328.2|ENSG00000290825.1|-|OTTHUMT00000362751.1|DDX11L2-202|DDX11L2|1657|"), so we extract out the ENST part of the id and match that against the ENST ids we get from GTF.
  • There was one special id in reads_per_transcript.kallisto.tsv we weren't handling before: U13369.1. I talked to Katrina about this, and it's a ribosomal rna gene we added special. It's used to estimate how much of the host material is rRNA. I couldn't find good documentation in a ticket, but Katrina gave me the expected behavior: for that specific read, those numbers should get written to the output TSV, and just jumped over the gene rollup entirely. The end users will understand what they're looking at. That said, I think maybe we should document it somewhere beyond the code comments I added? Not sure where though.
  • I devved and subsequently tested this work by extracting out just the Python portion of the WDL that performs the rollup/output and running that by itself with the reads_per_transcript.kallisto.tsv and gencode.v43.annotation.gtf.gz as inputs to it. So this has been tested and the resulting output reads_per_gene.kallisto.tsv was looked over by both me and Katrina, seems to work as expected with these changes.
    • I still plan to run the kallisto WDL task with all of these changes incorporated on my dev EC2 instance before merging+deploying, although I'll probably talk to Ryan about the best way to do that.

Still need to do

Separately, I still need to upload the new GTF file to S3, and update the appropriate DB records (via data migration) so that the new GTF file will start getting used.

@vincent-czi vincent-czi requested review from mlin and rzlim08 July 20, 2023 19:23
@vincent-czi
Copy link
Contributor Author

@mlin While we're looking at the kallisto task, could you tell me if this TODO is still meaningful?

# TODO: input fragment length parameters for non-paired-end (l = average, s = std dev)

# TODO: input fragment length parameters for non-paired-end (l = average, s = std dev)
  String kallisto_invocation = "/kallisto/kallisto quant"
      + " -i '${kallisto_idx}' -o $(pwd) --plaintext ${if (paired) then '' else '--single -l 200 -s 20'} ${kallisto_options}"
      + " '~{fastp1_fastq}'" + if (defined(fastp2_fastq)) then " '~{fastp2_fastq}'" else ""

It looks to me like those params have been added though? But maybe they're just placeholders? I saw you had the last change on that TODO line, hoping you knew. Thanks!

@mlin
Copy link
Collaborator

mlin commented Jul 24, 2023

@vincent-czi @katrinakalantar re the fragment length parameters. Yes, we're setting them to arbitrary hard-coded values whereas it would be more ideal to set them more accurately for the data at hand.

-l, --fragment-length=DOUBLE  Estimated average fragment length
-s, --sd=DOUBLE               Estimated standard deviation of fragment length
                              (default: -l, -s values are estimated from paired
                               end data, but are required when using --single)

It would be nice to file this formally as a todo, but I think it's relatively low priority for the following reasons.

  1. As indicated, they're not required for paired-end samples, which is most common by far in modern short-read sequencing.
  2. Although I don't know exactly how kallisto uses the information, I personally doubt it's make-or-break as it's statistical in nature, and thus usually used to break ties or as some kind of weak prior.
  3. Implementation would be nontrivial because the information might need to be provided by the user upon upload and fed through to this pipeline step (I'm not sure if it can be inferred from the data itself)

Copy link
Collaborator

@mlin mlin left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@vincent-czi the code changes LGTM, nice work! Two confirmatory questions:

  1. In your testing, was the mentioned rRNA the only example that couldn't be joined between the tsv and gtf by the ENST id?
  2. The non-human host species generally continue to use Ensembl gene annotations since GENCODE annotations only exist for human (and mouse I think). I think (when I last worked on this) CZ ID only outputs the host quantifications for human, so we only really need to handle that path. But if that's changed then we might need to handle both cases, and in any case this might be worth documenting somewhere in case it changes in the future.

@vincent-czi
Copy link
Contributor Author

@mlin For the confirmatory questions, yes, good on both counts.

  1. When I was testing, everything fell precisely in three categories: ENST (vast majority), ERCC (a handful), and the single U13369.1 rRNA.
  2. That's my understanding as well: we only ever do the host gene counts for human. That said, I'm not sure how clearly that's explained anywhere in code or documentation. The workflow here runs based on existence of the gtf_gz file, but doesn't mention the human-only aspect. And the kickoff in czid-web code that percolates down to this workflow conditionally sets gtf_gz based on if it's human, but doesn't explain anything about why. I might be missing something that formally details this aspect, but I think it falls into institutional knowledge.

Regarding the other comment, thanks for all the info on the fragment length parameters. I'll update the TODO to point at that comment since that really clarifies what it's saying.

@vincent-czi vincent-czi merged commit 46e0ec8 into main Jul 24, 2023
12 checks passed
@vincent-czi vincent-czi deleted the vince/czid-8088-kallisto-fix branch July 24, 2023 20:45
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

2 participants