
Analyze organism/source #108

Open
uniqueg opened this issue Mar 3, 2023 · 12 comments
Assignees
Labels
library source Infer source organism or cell line/tissue
Milestone

Comments

@uniqueg
Member

uniqueg commented Mar 3, 2023

  1. Compare predictions with reported source
  2. Quantify accuracy: correct, false, undecided

One goal is to minimize false predictions. Here we can try playing with --library-source-min-match-percentage and --library-source-min-frequency-ratio.

However, another goal is also to make sure that we get reliable predictions at least for the most common model organisms (it's not useful if we can never call a mouse because, based on our rRNA gene transcripts, we cannot sufficiently distinguish between mouse and shrew mouse). To that end, there are basically two things we can do:

  • Don't support so many organisms, and particularly those that are closely related
  • Add transcripts of more genes, including ones with a more dynamic evolution
@balajtimate
Contributor

I'm currently running the tests on the 770 samples with different --library-source-min-match-percentage values to try to optimize the parameter for correctly identified organisms. Just wanted to come back to this issue, continuing from this comment: #107 (comment)

To summarize: of the 77 samples from the lab, which consisted mostly of model organisms, only 5 had the organism falsely inferred: 1 mouse (where the issue was that the ratio between the 1st and 2nd inferred org was less than 2, with 30% as mmusculus and 17% as omeridionalis, which is weird because that's a butterfly; for the other mouse samples, the second most frequent org is usually another mouse, like mspicilegus) and 4 E. coli samples, which I checked, and E. coli is not in our list of transcripts. I suppose some RP (ribosomal protein) sequences could be added?

@balajtimate
Contributor

balajtimate commented Jun 7, 2023

So unfortunately, as it turns out, there's no point in lowering --library-source-min-match-percentage, as there's no difference between the results of the different parameter settings.
[Figure: bar plot of predicted organisms per parameter setting]

However, out of the 226 Undecided samples (29% of the 770 samples), 77 had the organism correctly matched except that the frequency ratio between the 1st and 2nd most frequent source was <2 (the default), and 94 had an incorrect organism inferred. We could try setting this to 1, as none of the samples had a frequency ratio lower than that (except when none of the organisms were identified - 55 samples). We would get an additional ~10% increase in correctly identified organisms, but at the same time a ~12% increase in false predictions. What do you think?

Here are the results from the rerun, with --library-source-min-frequency-ratio set to 1.
[Figure: bar plot of predicted organisms with --library-source-min-frequency-ratio set to 1]
So there are 55 samples (7%) for which no organism could be determined. It's also worth noting that out of the 229 (29%) falsely identified samples, 52 had the correct organism in the 2nd most frequent place; the first one simply had more counts. I'm not sure what could be done in this case, given that most of the time the 1st and 2nd most frequent organisms are not even closely related (or the problem is with the samples themselves, e.g. incorrect metadata or contamination from other species).
mined_test_data_result_libfreq_1_0806.csv
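The frequency-ratio rule discussed here can be sketched as below. This is a hedged illustration of the decision logic described in this thread, not HTSinfer's actual implementation; `infer_source` and the input dict are hypothetical.

```python
def infer_source(tpm_by_org: dict, min_freq_ratio: float = 2.0):
    """Call the top organism only if its abundance exceeds the runner-up's
    by at least min_freq_ratio; otherwise return None (undecided).
    Sketch of the rule described above, not HTSinfer's code."""
    ranked = sorted(tpm_by_org.items(), key=lambda kv: kv[1], reverse=True)
    if not ranked:
        return None
    if len(ranked) == 1 or ranked[1][1] == 0:
        return ranked[0][0]
    top, second = ranked[0], ranked[1]
    return top[0] if top[1] / second[1] >= min_freq_ratio else None

# The mouse example above (30% mmusculus vs 17% omeridionalis):
print(infer_source({"mmusculus": 30.0, "omeridionalis": 17.0}))       # None
print(infer_source({"mmusculus": 30.0, "omeridionalis": 17.0}, 1.0))  # mmusculus
```

With the default ratio of 2, the 30%/17% split (ratio ≈ 1.76) stays undecided; lowering the threshold to 1 accepts the top hit.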

@uniqueg
Member Author

uniqueg commented Jun 9, 2023

Interesting results. I think it's really not too bad for a start. A couple of points:

  • I'm not quite sure what you mean when you say that none of the samples had a frequency ratio above 1. It's the ratio of the counts of the most frequent and the second most frequent orgs, so by definition it must be >= 1, or NaN (if there are no counts for any org, or for all but one org). Thus, setting that parameter to 1 basically deactivates this check.

  • We absolutely want to minimize false positives. It's much better to have results undecided and then have users check the complete results and make up their own minds. So when deciding parameters, if anything, we should be more strict, rather than less. For a start, I think we should not have more than 5% false positives, preferably less than 1%.

  • What can we do to improve results, apart from playing with our params? (1) Add additional highly expressed genes (as you mention), including some that are possibly a bit less conserved (ideally, there should be orthologs in all organisms, but they should differ more than the ribosomal protein genes do); (2) remove closely related organisms and focus only on the most common ones (we could count how many samples of each org in our list are in SRA and pick just the top 50 or 100 or so) - this should solve the mouse problem you observed.

  • It is of course also possible that the SRA annotations are wrong in a few cases (or highly contaminated, as you say; maybe that mouse had just eaten a butterfly 😅). So for those samples where we have false positives, we should map these samples against the complete annotations of the top 2 or 3 results. That should give us the real result. And it would indeed be nice to find such cases; it makes the case for the tool stronger. Especially if we manage (with the above measures) to bring down the false positives to a level that is similar to or even lower than that of SRA. That would tell people: You're better off running HTSinfer than relying on what's written in SRA.

@uniqueg
Member Author

uniqueg commented Jun 9, 2023

One other thing you could do: Create separate stats for (1) the 10 most common organisms in SRA or whatever number of organisms we need to account for more than 95% of RNA-Seq samples in SRA, out of the organisms we support (for these we wanna be particularly good), (2) the rest of the organisms, and (3) all organisms (what you have now).
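The three suggested cohorts could be tallied along these lines. A sketch only: the helper, outcome labels, and data structures are illustrative, and the top-10 set here is just an example list of common model organisms.

```python
from collections import Counter, defaultdict

# Example set of common organisms (illustrative, not a fixed choice)
TOP10 = {"hsapiens", "mmusculus", "athaliana", "dmelanogaster", "drerio",
         "rnorvegicus", "zmays", "mmulatta", "scerevisiae", "osativa"}

def stratified_stats(results):
    """results: iterable of (reported_org, outcome) pairs.
    Returns per-cohort outcome counts: top10 / rest / all."""
    stats = defaultdict(Counter)
    for org, outcome in results:
        cohort = "top10" if org in TOP10 else "rest"
        stats[cohort][outcome] += 1
        stats["all"][outcome] += 1
    return stats

stats = stratified_stats([("hsapiens", "correct"), ("bmori", "undecided")])
print(stats["top10"]["correct"], stats["rest"]["undecided"])  # 1 1
```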

@uniqueg
Member Author

uniqueg commented Jun 9, 2023

And one more question: Are all of the annotated organisms for the 770 samples supported, in theory, by HTSinfer?

@balajtimate
Contributor

  • I'm not quite sure what you mean when you say that none of the samples had a frequency ratio above 1. It's the ratio of the counts of the most frequent and the second most frequent orgs, so by definition it must be >= 1, or NaN (if there are no counts for any org, or for all but one org). Thus, setting that parameter to 1 basically deactivates this check.

Yeah, sorry, I got it mixed up with the min-match-percentage 😅 It makes sense that we wouldn't want this lowered.

And one more question: Are all of the annotated organisms for the 770 samples supported, in theory, by HTSinfer?

Yes, I checked these, I think they were specifically chosen because they're in the transcript list.

So for those samples where we have false positives, we should map these samples against the complete annotations of the top 2 or 3 results. That should give us the real result.

One other thing you could do: Create separate stats for (1) the 10 most common organisms in SRA or whatever number of organisms we need to account for more than 95% of RNA-Seq samples in SRA, out of the organisms we support (for these we wanna be particularly good), (2) the rest of the organisms, and (3) all organisms (what you have now).

It's a good idea, I'll do the mapping and create the stats. Tbh I think it would be best to focus on the most common organisms (like you mentioned); currently there are 440 orgs supported by HTSinfer, which is far too many, and there's a large number of closely related ones. I'll look up the most common ones in SRA, remove the others from the list, and check if that lowers the number of false positives.
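The proposed filtering could look something like this sketch (function and data names are hypothetical): rank organisms by SRA sample count and keep the top n that are also in the supported transcript list.

```python
def top_supported(sra_sample_counts: dict, supported: set, n: int = 100):
    """Rank organisms by SRA sample count (descending) and keep the
    top n that are also supported. Hypothetical helper for illustration."""
    ranked = sorted(sra_sample_counts, key=sra_sample_counts.get, reverse=True)
    return [org for org in ranked if org in supported][:n]

# Toy counts; real numbers would come from SRA metadata
counts = {"hsapiens": 1_500_000, "mmusculus": 900_000, "bmori": 1_200}
print(top_supported(counts, supported={"hsapiens", "bmori"}, n=2))
# ['hsapiens', 'bmori']
```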

@uniqueg
Member Author

uniqueg commented Jun 12, 2023

Yes, but please don't remove orgs too drastically. Once you have a list of orgs by sample count in SRA, we can take the top 100. If that's still not enough, we can further reduce. But being able to support many organisms is still a cool feature, so I wouldn't want to go down to 20 or less, if we can avoid it.

@balajtimate
Contributor

Well, it seems reducing the number of organisms is probably not the way to go 😓
Out of the 3.4 million Illumina RNA-seq samples in SRA, sorted by organism frequency, I selected the ones we already have in our list and took the 100 most common; all in all, this amounted to ~92% of the SRA samples.
This actually lowered the undecided rate, but increased the false positivity rate:
[Figure: bar plot of predicted organisms with the 100-organism filtered transcript list]
Even just looking at the 10 most common organisms (hsapiens, mmusculus, athaliana, dmelanogaster, drerio, rnorvegicus, zmays, mmulatta, scerevisiae, osativa) in both cases (I just selected the samples where the metadata org was one of these - 208 samples), it seems the 100-filtered transcript prediction is only slightly better than the one with all organism transcripts, but the false positive rate also increased:
[Figure: prediction results for the 10 most common organisms]

Also, do you know how, in the case of PE samples, the predicted organism from the second read gets taken into account? I found quite a lot of samples where the organism could have been correctly predicted from R2, with a large ratio between the first and second most frequent organism for that read, but since R1 had a ratio <2, the end result is Undecided. It's defined in get_library_source.py, and what I gather is that both reads get quantified with kallisto, but the resulting source (org) gets determined only for the first read?
Examples (1_tpm_1 is the first hit and 1_tpm_2 the second hit for R1, and similarly 2_tpm_1, 2_tpm_2 etc. for R2):

| sample | org | 1_tpm_1 | 1_org_1 | 1_tpm_2 | 1_org_2 | 1_percent | 2_tpm_1 | 2_org_1 | 2_tpm_2 | 2_org_2 | 2_percent | match_org |
|---|---|---|---|---|---|---|---|---|---|---|---|---|
| SRR17499120 | bmori | 54.893 | lchalumnae | 35.848 | omeridionalis | 1.531 | 64.649 | bmori | 7.038 | omeridionalis | 9.186 | Undecided |
| SRR17499123 | bmori | 55.264 | lchalumnae | 36.041 | omeridionalis | 1.533 | 64.964 | bmori | 6.969 | omeridionalis | 9.322 | Undecided |
| SRR22952772 | btaurus | 38.895 | btaurus | 20.333 | bbbison | 1.913 | 55.846 | btaurus | 15.559 | bgrunniens | 3.589 | Undecided |
| SRR22952770 | btaurus | 36.444 | btaurus | 22.088 | bbbison | 1.65 | 51.714 | btaurus | 16.228 | bgrunniens | 3.187 | Undecided |

@uniqueg
Member Author

uniqueg commented Jun 15, 2023

Oh, that's a bit unexpected!

I mean, checking against all the 770 orgs, okay, we are then "forcing" a lot of false positives, because the true ones aren't in the list. But for the second part, where you only consider the most common orgs, I would have expected a better result: more reduction of undecided, and less false positives. Hard to imagine why we have more, although ... I guess we are still pushing reads that don't map well to the target organism to any of fewer other choices, and those could then maybe mess with the numbers, which are probably rather low, in general.

Two things I can think of:

  • We could increase the number of reads (500k is rather low)
  • We could decrease the number of allowed errors, possibly putting it down to zero. It kind of makes sense in this exercise (where we have many similar templates) to only allow perfect matches, for better separation of organisms. This should increase the undecided rate (given that the total count will drop, but we can hopefully counteract this by increasing --records), but hopefully decrease the false positives.

As for the PE samples: that's weird indeed. What I would actually expect for PE samples is that both reads are mapped together, resulting in a single file of alignments - rather than mapping both libraries separately, deciding separately, and then somehow combining or concatenating. This should actually be much more stringent, because the best RP gene compatible with both mates would be picked, and in cases where, say, there are two reasonably good options for one mate, it is a lot less likely that for the other mate the wrong organism of the two would also be among the top choices.
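As a rough illustration only: even short of true joint paired-end mapping (which is what is proposed here, and which HTSinfer's get_library_source.py would have to do at the alignment level), simply pooling the per-mate abundances before applying the frequency-ratio check would already rescue cases like those in the table. The function and the default threshold below are a hypothetical sketch, not the actual implementation.

```python
from collections import Counter

def combined_call(tpm_r1: dict, tpm_r2: dict, min_ratio: float = 2.0):
    """Pool per-organism abundances from both mates, then apply the
    frequency-ratio check once. Hypothetical sketch, not HTSinfer code."""
    total = Counter(tpm_r1) + Counter(tpm_r2)
    ranked = total.most_common(2)
    if not ranked:
        return None
    if len(ranked) == 1:
        return ranked[0][0]
    return ranked[0][0] if ranked[0][1] / ranked[1][1] >= min_ratio else None

# Numbers like SRR22952772 above: R1 alone is below the ratio of 2,
# but pooled across mates btaurus clearly wins.
r1 = {"btaurus": 38.895, "bbbison": 20.333}
r2 = {"btaurus": 55.846, "bgrunniens": 15.559}
print(combined_call(r1, r2))  # btaurus
```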

Anyway, thanks for the good work!

balajtimate added a commit that referenced this issue Nov 9, 2023
balajtimate added a commit that referenced this issue Nov 10, 2023
balajtimate added a commit that referenced this issue Nov 14, 2023
balajtimate added a commit that referenced this issue Nov 15, 2023
* feat: add org param

* refactor: avoid duplicate mappings (#131)

Co-authored-by: Boris Jurič <499542@mail.muni.cz>
Co-authored-by: Alex Kanitz <alexander.kanitz@alumni.ethz.ch>

* fix typo, update pylint config

* feat: add org_id param #108

* refactor: get_library_source.py #108

* test: add org param tests #108

* fix: update Pydantic version (#146)

* fix pydantic issues

* fix: update pydantic version in envs

* fix: pin sphinx-rtd-theme into env

* fix: update readthedocs config

* update readme, gitignore

* feat: infer org source if id not in dict #108

* replace json with model_dump

* feat: add org_id param #108

* feat: add org_id param #108

* refactor: replace org with tax-id

* refactor get_library_source

* refactor get_library_source tests

* refactor: update models.py

* refactor: fix typos

---------

Co-authored-by: Boris Jurič <74237898+BorisYourich@users.noreply.github.com>
Co-authored-by: Boris Jurič <499542@mail.muni.cz>
Co-authored-by: Alex Kanitz <alexander.kanitz@alumni.ethz.ch>
@uniqueg uniqueg added the library source Infer source organism or cell line/tissue label Dec 20, 2023
@uniqueg
Member Author

uniqueg commented Dec 20, 2023

It is important that we also get an idea of (1) why some of our organism annotations fail and (2) whether perhaps there are also mistakes in the SRA metadata.

@balajtimate: To do that, please map all libraries for which the organism was falsely annotated and at least a few dozen of the libraries for which no organism was annotated against the top 3 organisms and check the mapping rates.
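The check being requested could be summarized as: whichever candidate organism yields the highest mapping rate for a library is its likely true source. A minimal sketch, with hypothetical function name and made-up rates:

```python
def likely_source(mapping_rates: dict) -> str:
    """Given mapping rates of one library against its top candidate
    organisms (fraction of reads mapped), return the organism with the
    highest rate. Values here are illustrative, not real results."""
    return max(mapping_rates, key=mapping_rates.get)

rates = {"pchabaudi": 0.12, "mmusculus": 0.87, "rnorvegicus": 0.31}
print(likely_source(rates))  # mmusculus
```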

@uniqueg uniqueg added this to the Paper MVP milestone Dec 20, 2023
@balajtimate
Contributor

So here are some final notes on this issue:
For one thing, the metadata quality of the 780 samples SRA mined dataset is... not great. I know I really should have done this in the beginning, but I just crossed check the lib sources for each sample in our dataset with the metadata from SRA, and there are some differences. Out of 780, 135 were considered mismatches for lib source. Some examples from our original testing data:

  1. 15 samples, like SRR23906030, were marked as bvulgaris (as in Beta vulgaris) because they had "beta" in the study description, when in fact they are C. elegans samples. HTSinfer correctly identified the lib source for all 15.
  2. All 30 ppaniscus and ptroglodytes samples were in fact Homo sapiens, I don't know where that information came from.
  3. Others, like SRR17509963, were described as mmmarmota, when in fact it was a meta-transcriptome analysis.
  4. And there were samples like SRR23906946, which we had in our list because the Strategy was RNA-seq, but in fact they were scRNA-seq samples, so I took those out.

So I cleaned up the mined dataset (corrected the lib source so that it matched the SRA metadata, and removed 45 problematic samples, like single-cell ones and those where the actual lib source organism is not in our transcript db) and checked again whether the result of HTSinfer matched the metadata:

There are still 37 mismatches, but I think 5% false positives is not bad. There are also some weird cases:

  1. SRX6058708, where the organism is indicated as Plasmodium chabaudi chabaudi and HTSinfer determined Mus musculus; from the abstract, however, it's clear that they used mouse cells in the experimental design.
  2. Then there is also SRR21614910, which has Mesocricetus auratus as organism in the metadata, but the title is simply "RNA-seq of mus", the description is "transcriptomic data in mouse", and the inferred lib source is Mus musculus. I guess one could call a hamster a mouse, but there could be a clearer description in the SRA entry.

I would say that in these cases HTSinfer actually provided more precise information about the lib source than the metadata from SRA.

Lastly, I mapped the rest of the libraries with mismatches against both the metadata organism and the inferred organism, and unsurprisingly, the samples had a significantly higher mapping rate to the metadata organism than to the inferred one. This was also true for the couple of Undecided lib sources that I mapped. In most cases, the decision came down to the ratio of mapped reads between the first and second most common lib source, which has to be at least 2. Lowering it to 1.5 wouldn't make sense either, as we would get an almost equal number of matches and mismatches. It's also worth noting that the most commonly incorrectly inferred organisms are Ficedula albicollis and Oryza meridionalis. I have no idea what to make of that.

So while there is still room for improvement, I think with the testing data that we had this is probably the best result. I will summarize the findings from here and write it in #56.

@uniqueg
Member Author

uniqueg commented Jan 29, 2024

This is awesome work, thanks a lot @balajtimate! And fantastic results 😍

The only thing I would still like to see though is the full mappings of at least the 37 mismatches (ideally of all samples) against the top 3 inferred organisms/sources and comparison with what SRA reports. It is really a powerful statement for the paper (and the usefulness of HTSinfer) if we are able to quantify (to a limited extent) how often samples are misannotated or heavily contaminated.
