Improve library source inference #56

Open
rohank63 opened this issue Feb 10, 2021 · 6 comments
Labels
enhancement New feature or request library source Infer source organism or cell line/tissue

Comments

@rohank63
Collaborator

rohank63 commented Feb 10, 2021

  • Consistently update set of reference transcripts for library source inference and include more genes for better accuracy
  • Include organisms/sources from all clades in Ensembl: Bacteria, Plants, Fungi, Metazoa etc.
@rohank63 rohank63 added the enhancement New feature or request label Feb 10, 2021
@uniqueg uniqueg added the library source Infer source organism or cell line/tissue label Mar 14, 2022
@uniqueg uniqueg changed the title Need to update Transcripts.fasta for Infer_organsim Update reference transcripts Mar 14, 2022
@uniqueg uniqueg changed the title Update reference transcripts Update/extend reference transcripts Mar 14, 2022
@uniqueg
Member

uniqueg commented Mar 14, 2022

Handle with or after #72.

@uniqueg uniqueg added the blocked Issue is blocked by another issue label Mar 14, 2022
@uniqueg uniqueg added low_priority Not urgent and removed blocked Issue is blocked by another issue labels May 19, 2022
@uniqueg
Member

uniqueg commented Dec 20, 2023

@balajtimate: In this issue, please create a short list of all the strategies we discussed to improve the library source inference.

@uniqueg uniqueg changed the title Update/extend reference transcripts Improve library source inference Dec 20, 2023
@uniqueg uniqueg removed the low_priority Not urgent label Dec 20, 2023
@balajtimate
Contributor

As both the library type and the orientation inference rely on the inferred library source, it's extremely important to improve the source inference. The key points from #108 and other discussions:

  1. Add more genes, other than ribosomal protein (RP) genes, for the organisms currently supported. These should be genes that are highly conserved within a species but variable enough between species to be used for identification. One approach would be to use DNA barcoding genes, like cytochrome c oxidase I (COI), cytochrome b (CYTB), histone 3 (H3) for mammals, matK and rbcL for plants. One source for these could be the BOLD database.
  2. This should focus on the most common organisms in SRA: hsapiens, mmusculus, athaliana, drerio, rnorvegicus, zmays, mmulatta, scerevisiae, osativa, btaurus, sscrofa, celegans, ggallus
  3. Currently, HTSinfer doesn't support bacteria, but the next most common organism is ecoli, so add the RP genes from Ensembl Bacteria
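
As a rough illustration of the selection criterion in point 1, candidate genes could be ranked by how similar their sequences are within a species versus between species. The sketch below is only an assumption about how that might be scored: it uses alignment-free k-mer Jaccard similarity as a stand-in for a proper alignment-based identity, and the names `kmers`, `jaccard` and `marker_score` are hypothetical, not part of HTSinfer.

```python
from itertools import combinations
from statistics import mean


def kmers(seq: str, k: int = 4) -> set[str]:
    """Return the set of k-mers in a sequence (upper-cased)."""
    seq = seq.upper()
    return {seq[i:i + k] for i in range(len(seq) - k + 1)}


def jaccard(a: set[str], b: set[str]) -> float:
    """Jaccard similarity of two k-mer sets; 0.0 if both are empty."""
    union = a | b
    return len(a & b) / len(union) if union else 0.0


def marker_score(seqs_by_species: dict[str, list[str]], k: int = 4) -> float:
    """Mean within-species similarity minus mean between-species similarity.

    Higher scores suggest a gene that is conserved within each species
    but differs between species (assumed proxy, not a validated metric).
    """
    profiles = {
        sp: [kmers(s, k) for s in seqs] for sp, seqs in seqs_by_species.items()
    }
    # All pairs of sequences that belong to the same species.
    within = [
        jaccard(a, b) for ps in profiles.values() for a, b in combinations(ps, 2)
    ]
    # All pairs of sequences that belong to different species.
    between = [
        jaccard(a, b)
        for sp1, sp2 in combinations(profiles, 2)
        for a in profiles[sp1]
        for b in profiles[sp2]
    ]
    # Assumption: a term with no comparable pairs contributes 0.
    return (mean(within) if within else 0.0) - (mean(between) if between else 0.0)


# Toy input: one gene that separates the two species, one that does not.
discriminating = {
    "hsapiens": ["ACGTACGTAC", "ACGTACGTAC"],
    "mmusculus": ["TTGGCCAATT", "TTGGCCAATT"],
}
uninformative = {
    "hsapiens": ["ACGTACGTAC", "ACGTACGTAC"],
    "mmusculus": ["ACGTACGTAC", "ACGTACGTAC"],
}
ranked = sorted(
    {"geneA": discriminating, "geneB": uninformative}.items(),
    key=lambda item: marker_score(item[1]),
    reverse=True,
)
```

With real data, the per-species sequences would come from several individuals or strains, and the k-mer similarity would likely be replaced by identity from a multiple sequence alignment.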

@uniqueg
Member

uniqueg commented Jan 29, 2024

Thanks! To clarify: What exactly do you mean by "This should focus" in 2? What is "This", and how would we make "this" focus on just the listed organisms?

@balajtimate
Contributor

I meant that adding more genes (other than the RP genes) should focus on the 15 most common organisms, so that the lib source inference is more precise for at least those organisms.

@uniqueg
Member

uniqueg commented Jan 29, 2024

Thanks. Any concrete ideas what such a strategy could look like? I mean, how do we find genes that are broadly conserved while at the same time maximizing the difference between the most common orgs? I don't really see how to start with such an exercise. Or were you suggesting not to care about conservation beyond the most common organisms at all? And then maybe have a 2-stage process: look first at the broadly conserved (current) genes and then, based on the results for those, pick another subset of genes for better resolution?
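
To make the 2-stage idea concrete, here is a minimal sketch under stated assumptions. It supposes that read hits per organism are already available as plain dictionaries, once against the broadly conserved genes and once against a separate marker gene set. The function name `two_stage_call`, its parameters and the thresholds are hypothetical and do not reflect how HTSinfer currently works.

```python
from typing import Optional


def two_stage_call(
    broad_hits: dict[str, int],
    marker_hits: dict[str, int],
    top_n: int = 3,
    min_fraction: float = 0.5,
) -> Optional[str]:
    """Pick a source organism in two stages, or None if ambiguous.

    Stage 1 shortlists the top_n organisms by hits to broadly conserved
    genes. Stage 2 compares only the shortlisted organisms by hits to
    higher-resolution marker genes.
    """
    # Stage 1: shortlist candidates from the broadly conserved genes.
    shortlist = sorted(broad_hits, key=broad_hits.get, reverse=True)[:top_n]

    # Stage 2: look at marker-gene hits for the shortlisted organisms only.
    counts = {org: marker_hits.get(org, 0) for org in shortlist}
    total = sum(counts.values())
    if total == 0:
        return None  # no marker evidence for any candidate

    best = max(counts, key=counts.get)
    # Assumption: require the winner to hold a minimum share of marker hits.
    return best if counts[best] / total >= min_fraction else None


# Toy counts: stage 1 cannot separate the two primates, stage 2 can.
broad = {"hsapiens": 900, "mmulatta": 880, "mmusculus": 400, "drerio": 20}
markers = {"hsapiens": 150, "mmulatta": 10, "mmusculus": 5, "drerio": 300}
call = two_stage_call(broad, markers)
```

In this toy input, `drerio` has the most marker hits but never reaches stage 2 because it is not shortlisted in stage 1, which is the intended effect of restricting the second stage to plausible candidates.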

3 participants