
Replace pangenome.vcf with a presence-absence.vcf as main output, but keep it to build the graph genomes #30

Open
clemgoub opened this issue May 24, 2024 · 0 comments
Labels
enhancement New feature or request

Comments

@clemgoub
Collaborator

Replace pangenome.vcf with a presence-absence.vcf in the 3_TSD_Search/ output folder. This new file will have one genotype column per sample, with calls restricted to 1 or 0 (i.e. the same information as the SUPP_VEC field). We still need to output pangenome.vcf for compatibility with the --graffite-vcf option (which skips SV search and annotation and uses the provided VCF to build the graph and map reads). Alternatively, we could stop outputting pangenome.vcf but keep it internally to build the graph when needed. This would require modifying the --graffite-vcf routines to strip the genotype columns and replace them with a single column in which every variant is 1|0.
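To make the idea concrete, here is a minimal Python sketch of how SUPP_VEC could be expanded into one 0/1 column per sample. It is not GraffiTE code: the function names are hypothetical, and it assumes that SUPP_VEC is a string of 0/1 characters in the INFO column and that the sample order is already known (currently it has to be recovered from vcf.txt).

```python
def supp_vec_to_calls(info, samples):
    """Map each sample name to its 0/1 call taken from SUPP_VEC.

    Assumption: `info` is a VCF INFO string containing SUPP_VEC=<0/1 string>,
    and `samples` lists the sample names in SUPP_VEC order.
    """
    fields = dict(kv.split("=", 1) for kv in info.split(";") if "=" in kv)
    supp_vec = fields["SUPP_VEC"]
    if len(supp_vec) != len(samples):
        raise ValueError("SUPP_VEC length does not match the number of samples")
    return dict(zip(samples, supp_vec))


def presence_absence_line(fixed_cols, samples):
    """Build one presence-absence VCF line from the 8 fixed VCF columns."""
    calls = supp_vec_to_calls(fixed_cols[7], samples)
    # FORMAT is a bare GT field; each sample column holds only 0 or 1.
    return "\t".join(list(fixed_cols[:8]) + ["GT"] + [calls[s] for s in samples])


fixed = ["chr1", "100", "var1", "N", "<INS>", ".", "PASS", "SVTYPE=INS;SUPP_VEC=101"]
print(presence_absence_line(fixed, ["sampleA", "sampleB", "sampleC"]))
```

With named sample columns in the header, readers would no longer need to know the SUPP_VEC order to interpret a call.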

I anticipate a possible source of confusion: "presence-absence" could be read as the presence or absence of a TE rather than the presence or absence of the variant. One solution would be to output two files. The first, GraffiTE_variants_presence-absence.vcf, would be in VCF format and follow the VCF convention. The second, GraffiTE_TE_presence-absence.tsv, would be a TSV table identical to the non-header lines of the VCF, except that the DEL calls are inverted to match the presence/absence pattern of the TEs in each sample.
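The inversion for the TSV could look roughly like the Python sketch below. The function name is hypothetical, and it assumes that deletions are identified by SVTYPE=DEL in the INFO column and that the sample columns already hold plain 0/1 calls, as in the proposed presence-absence VCF.

```python
def vcf_line_to_te_row(line):
    """Turn one presence-absence VCF data line into a TE presence/absence row.

    Assumption: for a DEL the TE is in the reference, so the alt allele (1)
    means the TE is absent and the call has to be flipped.
    """
    cols = line.rstrip("\n").split("\t")
    info = dict(kv.split("=", 1) for kv in cols[7].split(";") if "=" in kv)
    calls = [int(c) for c in cols[9:]]
    if info.get("SVTYPE") == "DEL":
        calls = [1 - c for c in calls]  # 1 (variant present) becomes 0 (TE absent)
    return "\t".join(cols[:9] + [str(c) for c in calls])


deletion = "chr1\t100\tvar1\tN\t<DEL>\t.\tPASS\tSVTYPE=DEL\tGT\t1\t0\t1"
insertion = "chr1\t500\tvar2\tN\t<INS>\t.\tPASS\tSVTYPE=INS\tGT\t1\t0\t1"
for record in (deletion, insertion):
    print(vcf_line_to_te_row(record))
```

Insertions pass through unchanged, so only the DEL rows differ between the two files.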

Of course, we will need to update the documentation accordingly.

This change has several advantages:

  1. it is more explicit and easier to interpret: each variant/sample combination in the VCF shows either 1 (alt allele) or 0 (ref allele), and each TE/sample combination in the TSV shows either 1 (TE presence) or 0 (TE absence).
  2. it should be easier to parse than the SUPP_VEC field.
  3. it avoids having to pull the vcf.txt file in order to know which position of the SUPP_VEC corresponds to which sample.
@clemgoub clemgoub added the enhancement New feature or request label May 24, 2024