Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

agat_convert_sp_gxf2gxf.pl does not sort as expected by tabix #230

Closed
Shellfishgene opened this issue Mar 11, 2022 · 2 comments
Closed

agat_convert_sp_gxf2gxf.pl does not sort as expected by tabix #230

Shellfishgene opened this issue Mar 11, 2022 · 2 comments

Comments

@Shellfishgene
Copy link

Hi!

When sorting the below gff3 with agat_convert_sp_gxf2gxf.pl, the file only gets sorted within types, but (in this example) exon, CDS and intron are not sorted by start base. gt gff -sortlines does it as tabix expects. Is there a way to change the agat sorting behaviour?

agat output (does not report any problems):

##gff-version 3
Chr01   AUGUSTUS        gene    48532   57251   1.09    -       .       ID=gene-10011
Chr01   AUGUSTUS        transcript      48532   57251   0.59    -       .       ID=transcript-11073;Parent=gene-10011
Chr01   AUGUSTUS        gene    48532   57251   0.64    -       .       ID=g10011;gene_id=g10011;transcript_id=g10011.t1
Chr01   AUGUSTUS        mRNA    48532   57251   0.64    -       .       ID=g10011.t1;Parent=g10011;gene_id=g10011;transcript_id=g10011.t1
Chr01   AUGUSTUS        exon    48532   48701   .       -       .       ID=exon-69840;Parent=g10011.t1;gene_id=g10011;transcript_id=g10011.t1
Chr01   AUGUSTUS        exon    48803   49221   .       -       .       ID=exon-69841;Parent=g10011.t1;gene_id=g10011;transcript_id=g10011.t1
Chr01   AUGUSTUS        exon    54260   54305   .       -       .       ID=exon-69842;Parent=g10011.t1;gene_id=g10011;transcript_id=g10011.t1
Chr01   AUGUSTUS        exon    57161   57251   .       -       .       ID=exon-69843;Parent=g10011.t1;gene_id=g10011;transcript_id=g10011.t1
Chr01   AUGUSTUS        CDS     48532   48701   0.64    -       2       ID=cds-69840;Parent=g10011.t1;gene_id=g10011;transcript_id=g10011.t1
Chr01   AUGUSTUS        CDS     48803   49221   0.76    -       1       ID=cds-69841;Parent=g10011.t1;gene_id=g10011;transcript_id=g10011.t1
Chr01   AUGUSTUS        CDS     54260   54305   0.96    -       2       ID=cds-69842;Parent=g10011.t1;gene_id=g10011;transcript_id=g10011.t1
Chr01   AUGUSTUS        CDS     57161   57251   0.98    -       0       ID=cds-69843;Parent=g10011.t1;gene_id=g10011;transcript_id=g10011.t1
Chr01   AUGUSTUS        intron  48702   48802   0.84    -       .       ID=intron-58782;Parent=g10011.t1;gene_id=g10011;transcript_id=g10011.t1
Chr01   AUGUSTUS        intron  49222   54259   0.96    -       .       ID=intron-58783;Parent=g10011.t1;gene_id=g10011;transcript_id=g10011.t1
Chr01   AUGUSTUS        intron  54306   57160   0.96    -       .       ID=intron-58784;Parent=g10011.t1;gene_id=g10011;transcript_id=g10011.t1

$ bgzip test.sort.gff3
$ tabix test.sort.gff3.gz

[E::hts_idx_push] Unsorted positions on sequence #1: 57161 followed by 48532
tbx_index_build failed: augustus.hints.gff.gz
@Juke34
Copy link
Collaborator

Juke34 commented Mar 11, 2022

The output sorting is made here in the code:

# omniscient is a hash containing a whole gXf file in memory sorted in a specific way (3 levels)
sub print_omniscient{
my ($hash_omniscient, $gffout) = @_ ;
my @l3_out_priority = ("tss", "exon", "cds", "tts");
#uri_decode_omniscient($hash_omniscient);
# --------- deal with header --------------
write_headers($hash_omniscient, $gffout);
### OLD FASHION GOING TRHOUGH LEVEL1
#foreach my $primary_tag_l1 ( sort {$a <=> $b or $a cmp $b} keys %{$hash_omniscient->{'level1'}}){ # primary_tag_l1 = gene or repeat etc...
# foreach my $id_tag_key_level1 ( sort { $hash_omniscient->{'level1'}{$primary_tag_l1}{$a}->start <=> $hash_omniscient->{'level1'}{$primary_tag_l1}{$b}->start } keys %{$hash_omniscient->{'level1'}{$primary_tag_l1}} ) { #sort by position
### NEW FASHION GOING TRHOUGH LEVEL1 - Have to first create a hash of seq_id -> level1_feature , then we can go through in alphanumerical order.
# sort by seq id
my ( $hash_sortBySeq, $hash_sortBySeq_std, $hash_sortBySeq_topf ) = collect_l1_info_sorted_by_seqid_and_location($hash_omniscient);
# Read by seqId to sort properly the output by seq ID
# sort { (($a =~ /(\d+)$/)[0] || 0) <=> (($b =~ /(\d+)$/)[0] || 0) will provide sorting like that: contig contig1 contig2 contig3 contig10 contig11 contig22 contig100 contig101
foreach my $seqid (sort { (($a =~ /(\d+)$/)[0] || 0) <=> (($b =~ /(\d+)$/)[0] || 0) } keys %{$hash_sortBySeq}){ # loop over all the feature level1
#################
# == LEVEL 1 == # IF not in omniscient do that, otherwise we us within. Make a method for it.
#################
write_top_features($gffout, $seqid, $hash_sortBySeq_topf, $hash_omniscient);
foreach my $locationid ( sort { ncmp ($a, $b) } keys %{$hash_sortBySeq->{$seqid} } ){
my $primary_tag_l1 = $hash_sortBySeq->{$seqid}{$locationid}{'tag'};
my $id_tag_key_level1 = $hash_sortBySeq->{$seqid}{$locationid}{'id'};
$gffout->write_feature($hash_omniscient->{'level1'}{$primary_tag_l1}{$id_tag_key_level1}); # print feature
#################
# == LEVEL 2 == #
#################
foreach my $primary_tag_l2 (sort {$a cmp $b} keys %{$hash_omniscient->{'level2'}}){ # primary_tag_l2 = mrna or mirna or ncrna or trna etc...
if ( exists_keys( $hash_omniscient, ('level2', $primary_tag_l2, $id_tag_key_level1) ) ){
foreach my $feature_level2 ( sort { ncmp ($a->start.$a->end.$a->_tag_value('ID'), $b->start.$b->end.$b->_tag_value('ID') ) } @{$hash_omniscient->{'level2'}{$primary_tag_l2}{$id_tag_key_level1}}) {
$gffout->write_feature($feature_level2);
#################
# == LEVEL 3 == #
#################
my $level2_ID = lc($feature_level2->_tag_value('ID'));
###########
# Before tss
if ( exists_keys($hash_omniscient,('level3','tss',$level2_ID)) ){
foreach my $feature_level3 ( @{$hash_omniscient->{'level3'}{'tss'}{$level2_ID}}) {
#_uri_encode_one_feature($feature_level3);
$gffout->write_feature($feature_level3);
}
}
######
# FIRST EXON
if ( exists_keys( $hash_omniscient, ('level3', 'exon', $level2_ID) ) ){
foreach my $feature_level3 ( sort {$a->start <=> $b->start} @{$hash_omniscient->{'level3'}{'exon'}{$level2_ID}}) {
$gffout->write_feature($feature_level3);
}
}
###########
# SECOND CDS
if ( exists_keys( $hash_omniscient, ('level3', 'cds', $level2_ID) ) ){
foreach my $feature_level3 ( sort {$a->start <=> $b->start} @{$hash_omniscient->{'level3'}{'cds'}{$level2_ID}}) {
$gffout->write_feature($feature_level3);
}
}
###########
# Last tts
if ( exists_keys($hash_omniscient,('level3','tts',$level2_ID)) ){
foreach my $feature_level3 ( @{$hash_omniscient->{'level3'}{'tts'}{$level2_ID}}) {
#_uri_encode_one_feature($feature_level3);
$gffout->write_feature($feature_level3);
}
}
############
# THEN ALL THE REST
foreach my $primary_tag_l3 (sort {$a cmp $b} keys %{$hash_omniscient->{'level3'}}){ # primary_tag_l3 = cds or exon or start_codon or utr etc...
if (! grep { $_ eq $primary_tag_l3 } @l3_out_priority){
if ( exists_keys( $hash_omniscient, ('level3', $primary_tag_l3, $level2_ID) ) ){
foreach my $feature_level3 ( sort {$a->start <=> $b->start} @{$hash_omniscient->{'level3'}{$primary_tag_l3}{$level2_ID}}) {
$gffout->write_feature($feature_level3);
}
}
}
}
}
}
}
}
}
# --------- deal with fasta seq --------------
write_fasta($gffout, $hash_omniscient);
}

I have think in many occasion to improve this part. I'm not entirely satisfied. I think I followed what was doing Ensembl ~ 2018.
Currently the rule for level3 features is first tss, then exon, then CDS, then tts, then all the rest. Eah type is sorted by start position.
We can add a option to keep topological sorting for level1(gene) and level2(mrna) and then just print by start position. I just do not know if we need a rule to prioritize one type of feature over another when they have same start position...

@Shellfishgene
Copy link
Author

I thinnk it pretty common for gff to be bgzipped and indexed with tabix, for genome browsers for example, so it mayb be a good idea to add coordinate based sorting.

Juke34 added a commit that referenced this issue Apr 2, 2022
@Juke34 Juke34 closed this as completed in 302d28f Apr 4, 2022
This issue was closed.
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

No branches or pull requests

2 participants