
Analyze organism/source #108

Open
uniqueg opened this issue Mar 3, 2023 · 12 comments
Assignees
Labels
library source Infer source organism or cell line/tissue
Milestone

Comments

@uniqueg
Member

uniqueg commented Mar 3, 2023

  1. Compare predictions with reported source
  2. Quantify accuracy: correct, false, undecided

One goal is to minimize false predictions. Here we can try playing with --library-source-min-match-percentage and --library-source-min-frequency-ratio.

However, another goal is also to make sure that we get reliable predictions at least for the most common model organisms (it's not useful if we can never call a mouse because, based on our rRNA gene transcripts, we cannot sufficiently distinguish between mouse and shrew mouse). To that end, there are basically two things we can do:

  • Don't support so many organisms, and particularly those that are closely related
  • Add transcripts of more genes, including ones with a more dynamic evolution
@balajtimate
Contributor

I'm currently running the tests on the 770 samples with different --library-source-min-match-percentage values to try to optimize the parameter for correctly identified organisms. Just wanted to come back to this issue, continuing from this comment: #107 (comment)

To summarize: of the 77 samples from the lab, which consisted mostly of model organisms, only 5 had the organism falsely inferred: 1 mouse (where the issue was that the ratio between the 1st and 2nd inferred org was less than 2, with 30% as mmusculus and 17% as omeridionalis, which is weird because that's a butterfly; for the other mouse samples, the second most frequent org is usually another mouse, like mspicilegus) and 4 E. coli samples, which I checked, and E. coli is not in our list of transcripts. I suppose some RP (ribosomal protein) sequences could be added?

@balajtimate
Contributor

balajtimate commented Jun 7, 2023

So unfortunately, as it turns out, there's no point in lowering --library-source-min-match-percentage, as there's no difference between the results of the different parameter settings.
[Figure: bar plot of predicted organisms per parameter setting]

However, out of the 226 Undecided samples (29% of the 770 samples), 77 had the organism correctly matched except that the frequency ratio between the 1st and 2nd most frequent source was <2 (the default), and 94 had an incorrect organism inferred. We could try setting this to 1, as none of the samples had a frequency ratio lower than that (except when none of the organisms were identified - 55 samples). We would get an additional ~10% increase in correctly identified organisms, but at the same time a ~12% increase in false predictions. What do you think?

Here are the results from the rerun, with --library-source-min-frequency-ratio set to 1.
[Figure: bar plot of predicted organisms with --library-source-min-frequency-ratio set to 1]
So there are 55 samples (7%) for which no organism could be determined. It's also worth noting that out of the 229 (29%) falsely identified samples, 52 had the correct organism in the 2nd most frequent place; the first one simply had more counts. I'm not sure what could be done in this case, given that most of the time the 1st and 2nd most frequent organisms are not even closely related (or the problem is with the samples themselves, e.g. incorrect metadata or contamination from other species).
mined_test_data_result_libfreq_1_0806.csv
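The frequency-ratio rule discussed here can be sketched as below. This is a hedged illustration of the decision logic described in this thread, not HTSinfer's actual implementation; `infer_source` and the input dict are hypothetical.

```python
def infer_source(tpm_by_org: dict, min_freq_ratio: float = 2.0):
    """Call the top organism only if its abundance exceeds the runner-up's
    by at least min_freq_ratio; otherwise return None (undecided).
    Sketch of the rule described above, not HTSinfer's code."""
    ranked = sorted(tpm_by_org.items(), key=lambda kv: kv[1], reverse=True)
    if not ranked:
        return None
    if len(ranked) == 1 or ranked[1][1] == 0:
        return ranked[0][0]
    top, second = ranked[0], ranked[1]
    return top[0] if top[1] / second[1] >= min_freq_ratio else None

# The mouse example above (30% mmusculus vs 17% omeridionalis):
print(infer_source({"mmusculus": 30.0, "omeridionalis": 17.0}))       # None
print(infer_source({"mmusculus": 30.0, "omeridionalis": 17.0}, 1.0))  # mmusculus
```

With the default ratio of 2, the 30%/17% split (ratio ≈ 1.76) stays undecided; lowering the threshold to 1 accepts the top hit.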

@uniqueg
Member Author

uniqueg commented Jun 9, 2023

Interesting results. I think it's really not too bad for a start. A couple of points:

  • I'm not quite sure what you mean when you say that none of the samples had a frequency ratio above 1. It's the ratio of the counts of the most frequent and the second most frequent orgs, so by definition it must be >= 1, or NaN (if there are no counts for any org, or for all but one org). Thus, setting that parameter to 1 basically deactivates this check.

  • We absolutely want to minimize false positives. It's much better to have results undecided and then have users check the complete results and make up their own minds. So when deciding parameters, if anything, we should be more strict, rather than less. For a start, I think we should not have more than 5% false positives, preferably less than 1%.

  • What can we do to improve results, apart from playing with our params? (1) Add additional highly expressed genes (as you mention), including some that are possibly a bit less conserved (ideally, there should be orthologs in all organisms, but they should differ more than the ribosomal protein genes do); (2) remove closely related organisms and focus only on the most common ones (we could count how many samples of each org in our list are in SRA and pick just the top 50 or 100 or so) - this should solve the mouse problem you observed.

  • It is of course also possible that the SRA annotations are wrong in a few cases (or highly contaminated, as you say; maybe that mouse had just eaten a butterfly 😅). So for those samples where we have false positives, we should map these samples against the complete annotations of the top 2 or 3 results. That should give us the real result. And it would indeed be nice to find such cases; it makes the case for the tool stronger. Especially if we manage (with the above measures) to bring down the false positives to a level that is similar to or even lower than that of SRA. That would tell people: You're better off running HTSinfer than relying on what's written in SRA.

@uniqueg
Member Author

uniqueg commented Jun 9, 2023

One other thing you could do: Create separate stats for (1) the 10 most common organisms in SRA or whatever number of organisms we need to account for more than 95% of RNA-Seq samples in SRA, out of the organisms we support (for these we wanna be particularly good), (2) the rest of the organisms, and (3) all organisms (what you have now).
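The three suggested cohorts could be tallied along these lines. A sketch only: the helper, outcome labels, and data structures are illustrative, and the top-10 set here is just an example list of common model organisms.

```python
from collections import Counter, defaultdict

# Example set of common organisms (illustrative, not a fixed choice)
TOP10 = {"hsapiens", "mmusculus", "athaliana", "dmelanogaster", "drerio",
         "rnorvegicus", "zmays", "mmulatta", "scerevisiae", "osativa"}

def stratified_stats(results):
    """results: iterable of (reported_org, outcome) pairs.
    Returns per-cohort outcome counts: top10 / rest / all."""
    stats = defaultdict(Counter)
    for org, outcome in results:
        cohort = "top10" if org in TOP10 else "rest"
        stats[cohort][outcome] += 1
        stats["all"][outcome] += 1
    return stats

stats = stratified_stats([("hsapiens", "correct"), ("bmori", "undecided")])
print(stats["top10"]["correct"], stats["rest"]["undecided"])  # 1 1
```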

@uniqueg
Member Author

uniqueg commented Jun 9, 2023

And one more question: Are all of the annotated organisms for the 770 samples supported, in theory, by HTSinfer?

@balajtimate
Contributor

  • I'm not quite sure what you mean when you say that none of the samples had a frequency ratio above 1. It's the ratio of the counts of the most frequent and the second most frequent orgs, so by definition it must be >= 1, or NaN (if there are no counts for any org, or for all but one org). Thus, setting that parameter to 1 basically deactivates this check.

Yeah, sorry, I got it mixed up with the min-match-percentage 😅 It makes sense that we wouldn't want this lowered.

And one more question: Are all of the annotated organisms for the 770 samples supported, in theory, by HTSinfer?

Yes, I checked these, I think they were specifically chosen because they're in the transcript list.

So for those samples where we have false positives, we should map these samples against the complete annotations of the top 2 or 3 results. That should give us the real result.

One other thing you could do: Create separate stats for (1) the 10 most common organisms in SRA or whatever number of organisms we need to account for more than 95% of RNA-Seq samples in SRA, out of the organisms we support (for these we wanna be particularly good), (2) the rest of the organisms, and (3) all organisms (what you have now).

It's a good idea, I'll do the mapping and create the stats. Tbh I think it would be best to focus on the most common organisms (like you mentioned); currently there are 440 orgs supported by HTSinfer, which is far too many, and there's a large number of closely related ones. I'll look up the most common ones in SRA, remove the others from the list, and check if that lowers the number of false positives.
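The proposed filtering could look something like this sketch (function and data names are hypothetical): rank organisms by SRA sample count and keep the top n that are also in the supported transcript list.

```python
def top_supported(sra_sample_counts: dict, supported: set, n: int = 100):
    """Rank organisms by SRA sample count (descending) and keep the
    top n that are also supported. Hypothetical helper for illustration."""
    ranked = sorted(sra_sample_counts, key=sra_sample_counts.get, reverse=True)
    return [org for org in ranked if org in supported][:n]

# Toy counts; real numbers would come from SRA metadata
counts = {"hsapiens": 1_500_000, "mmusculus": 900_000, "bmori": 1_200}
print(top_supported(counts, supported={"hsapiens", "bmori"}, n=2))
# ['hsapiens', 'bmori']
```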

@uniqueg
Member Author

uniqueg commented Jun 12, 2023

Yes, but please don't remove orgs too drastically. Once you have a list of orgs by sample count in SRA, we can take the top 100. If that's still not enough, we can further reduce. But being able to support many organisms is still a cool feature, so I wouldn't want to go down to 20 or less, if we can avoid it.

@balajtimate
Contributor

Well, it seems reducing the number of organisms is probably not the way to go 😓
Out of the 3.4 million Illumina RNA-seq samples in SRA, sorted by organism frequency, I selected the ones we already have in our list and took the 100 most common; all in all, this amounted to ~92% of the SRA samples.
This actually lowered the undecided rate, but increased the false positivity rate:
[Figure: bar plot of predicted organisms with the 100-organism filtered transcript list]
Even just looking at the 10 most common organisms (hsapiens, mmusculus, athaliana, dmelanogaster, drerio, rnorvegicus, zmays, mmulatta, scerevisiae, osativa) in both cases (I just selected the samples where the metadata org was one of these - 208 samples), it seems the 100-filtered transcript prediction is only slightly better than the one with all organism transcripts, but the false positive rate also increased:
[Figure: prediction results for the 10 most common organisms]

Also, do you know how, in the case of PE samples, the predicted organism from the second read gets taken into account? I found quite a lot of samples where the organism could have been correctly predicted from R2, with a large ratio between the first and second most frequent organism for that read, but since R1 had a ratio <2, the end result is Undecided. It's defined in get_library_source.py, and what I gather is that both reads get quantified with kallisto, but the resulting source (org) gets determined only for the first read?
Examples (1_tpm_1 is the first hit and 1_tpm_2 the second hit for R1, and similarly 2_tpm_1, 2_tpm_2 etc. for R2):

| sample | org | 1_tpm_1 | 1_org_1 | 1_tpm_2 | 1_org_2 | 1_percent | 2_tpm_1 | 2_org_1 | 2_tpm_2 | 2_org_2 | 2_percent | match_org |
|---|---|---|---|---|---|---|---|---|---|---|---|---|
| SRR17499120 | bmori | 54.893 | lchalumnae | 35.848 | omeridionalis | 1.531 | 64.649 | bmori | 7.038 | omeridionalis | 9.186 | Undecided |
| SRR17499123 | bmori | 55.264 | lchalumnae | 36.041 | omeridionalis | 1.533 | 64.964 | bmori | 6.969 | omeridionalis | 9.322 | Undecided |
| SRR22952772 | btaurus | 38.895 | btaurus | 20.333 | bbbison | 1.913 | 55.846 | btaurus | 15.559 | bgrunniens | 3.589 | Undecided |
| SRR22952770 | btaurus | 36.444 | btaurus | 22.088 | bbbison | 1.65 | 51.714 | btaurus | 16.228 | bgrunniens | 3.187 | Undecided |

@uniqueg
Member Author

uniqueg commented Jun 15, 2023

Oh, that's a bit unexpected!

I mean, checking against all the 770 orgs, okay, we are then "forcing" a lot of false positives, because the true ones aren't in the list. But for the second part, where you only consider the most common orgs, I would have expected a better result: more reduction of undecided, and less false positives. Hard to imagine why we have more, although ... I guess we are still pushing reads that don't map well to the target organism to any of fewer other choices, and those could then maybe mess with the numbers, which are probably rather low, in general.

Two things I can think of:

  • We could increase the number of reads (500k is rather low)
  • We could decrease the number of allowed errors, possibly putting it down to zero. It kind of makes sense in this exercise (where we have many similar templates) to only allow perfect matches, for better separation of organisms. This should increase the undecided rate (given that the total count will drop, but we can hopefully counteract this by increasing --records), but hopefully decrease the false positives.

As for the PE samples: that's weird indeed. What I would actually expect for PE samples is that both reads are mapped together, resulting in a single file of alignments - rather than mapping both libraries separately, deciding separately, and then somehow combining or concatenating. This should actually be much more stringent, because the best RP gene compatible with both mates would be picked, and in cases where, say, there are two reasonably good options for one mate, it is a lot less likely that for the other mate the wrong organism of the two would also be among the top choices.
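As a rough illustration only: even short of true joint paired-end mapping (which is what is proposed here, and which HTSinfer's get_library_source.py would have to do at the alignment level), simply pooling the per-mate abundances before applying the frequency-ratio check would already rescue cases like those in the table. The function and the default threshold below are a hypothetical sketch, not the actual implementation.

```python
from collections import Counter

def combined_call(tpm_r1: dict, tpm_r2: dict, min_ratio: float = 2.0):
    """Pool per-organism abundances from both mates, then apply the
    frequency-ratio check once. Hypothetical sketch, not HTSinfer code."""
    total = Counter(tpm_r1) + Counter(tpm_r2)
    ranked = total.most_common(2)
    if not ranked:
        return None
    if len(ranked) == 1:
        return ranked[0][0]
    return ranked[0][0] if ranked[0][1] / ranked[1][1] >= min_ratio else None

# Numbers like SRR22952772 above: R1 alone is below the ratio of 2,
# but pooled across mates btaurus clearly wins.
r1 = {"btaurus": 38.895, "bbbison": 20.333}
r2 = {"btaurus": 55.846, "bgrunniens": 15.559}
print(combined_call(r1, r2))  # btaurus
```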

Anyway, thanks for the good work!

balajtimate added a commit that referenced this issue Nov 9, 2023
balajtimate added a commit that referenced this issue Nov 10, 2023
balajtimate added a commit that referenced this issue Nov 14, 2023
balajtimate added a commit that referenced this issue Nov 15, 2023
* feat: add org param

* refactor: avoid duplicate mappings (#131)

Co-authored-by: Boris Jurič <499542@mail.muni.cz>
Co-authored-by: Alex Kanitz <alexander.kanitz@alumni.ethz.ch>

* fix typo, update pylint config

* feat: add org_id param #108

* refactor: get_library_source.py #108

* test: add org param tests #108

* fix: update Pydantic version (#146)

* fix pydantic issues

* fix: update pydantic version in envs

* fix: pin sphinx-rtd-theme into env

* fix: update readthedocs config

* update readme, gitignore

* feat: infer org source if id not in dict #108

* replace json with model_dump

* feat: add org_id param #108

* feat: add org_id param #108

* refactor: replace org with tax-id

* refactor get_library_source

* refactor get_library_source tests

* refactor: update models.py

* refactor: fix typos

---------

Co-authored-by: Boris Jurič <74237898+BorisYourich@users.noreply.github.com>
Co-authored-by: Boris Jurič <499542@mail.muni.cz>
Co-authored-by: Alex Kanitz <alexander.kanitz@alumni.ethz.ch>
@uniqueg uniqueg added the library source Infer source organism or cell line/tissue label Dec 20, 2023
@uniqueg
Member Author

uniqueg commented Dec 20, 2023

It is important that we also get an idea of (1) why some of our organism annotations fail and (2) whether perhaps there are also mistakes in the SRA metadata.

@balajtimate: To do that, please map all libraries for which the organism was falsely annotated and at least a few dozen of the libraries for which no organism was annotated against the top 3 organisms and check the mapping rates.
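The check being requested could be summarized as: whichever candidate organism yields the highest mapping rate for a library is its likely true source. A minimal sketch, with hypothetical function name and made-up rates:

```python
def likely_source(mapping_rates: dict) -> str:
    """Given mapping rates of one library against its top candidate
    organisms (fraction of reads mapped), return the organism with the
    highest rate. Values here are illustrative, not real results."""
    return max(mapping_rates, key=mapping_rates.get)

rates = {"pchabaudi": 0.12, "mmusculus": 0.87, "rnorvegicus": 0.31}
print(likely_source(rates))  # mmusculus
```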

@uniqueg uniqueg added this to the Paper MVP milestone Dec 20, 2023
@balajtimate
Contributor

So here are some final notes on this issue:
For one thing, the metadata quality of the 780 samples SRA mined dataset is... not great. I know I really should have done this in the beginning, but I just crossed check the lib sources for each sample in our dataset with the metadata from SRA, and there are some differences. Out of 780, 135 were considered mismatches for lib source. Some examples from our original testing data:

  1. 15 samples, like SRR23906030, were marked as bvulgaris (as in Beta vulgaris) because they had "beta" in the study description, when in fact they are C. elegans samples. HTSinfer correctly identified the lib source for all 15.
  2. All 30 ppaniscus and ptroglodytes samples were in fact Homo sapiens, I don't know where that information came from.
  3. Others, like SRR17509963, were described as mmmarmota, when in fact it was a meta-transcriptome analysis.
  4. And there were samples like SRR23906946, which we had in our list because the Strategy was RNA-seq, but in fact they were scRNA-seq samples, so I took those out.

So I cleaned up the mined dataset (corrected the lib source so that it matched the SRA metadata, and removed 45 problematic samples, like single-cell ones and those where the actual lib source organism is not in our transcript db) and checked again whether the result of HTSinfer matched the metadata:

There are still 37 mismatches, but I think 5% false positives is not bad. There are also some weird cases:

  1. SRX6058708, where the organism is indicated as Plasmodium chabaudi chabaudi and HTSinfer determined Mus musculus; from the abstract, however, it's clear that they used mouse cells in the experimental design.
  2. Then there is also SRR21614910, which has Mesocricetus auratus as organism in the metadata, but the title is simply "RNA-seq of mus", the description is "transcriptomic data in mouse", and the inferred lib source is Mus musculus. I guess one could call a hamster a mouse, but there could be a clearer description in the SRA entry.

I would say that in these cases HTSinfer actually provided more precise information about the lib source than the metadata from SRA.

Lastly, I mapped the rest of the libraries with mismatches against both the metadata organism and the inferred organism, and unsurprisingly, the samples had a significantly higher mapping rate to the metadata organism than to the inferred one. This was also true for the couple of Undecided lib sources that I mapped. In most cases, the decision came down to the ratio of mapped reads between the first and second most common lib source, which has to be at least 2. Lowering it to 1.5 wouldn't make sense either, as we would get an almost equal number of matches and mismatches. It's also worth noting that the most commonly incorrectly inferred organisms are Ficedula albicollis and Oryza meridionalis. I have no idea what to make of that.

So while there is still room for improvement, I think with the testing data that we had this is probably the best result. I will summarize the findings from here and write it in #56.

@uniqueg
Member Author

uniqueg commented Jan 29, 2024

This is awesome work, thanks a lot @balajtimate! And fantastic results 😍

The only thing I would still like to see though is the full mappings of at least the 37 mismatches (ideally of all samples) against the top 3 inferred organisms/sources and comparison with what SRA reports. It is really a powerful statement for the paper (and the usefulness of HTSinfer) if we are able to quantify (to a limited extent) how often samples are misannotated or heavily contaminated.
