agat_convert_sp_gxf2gxf.pl does not sort as expected by tabix #230

Shellfishgene · 2022-03-11T09:13:35Z

Hi!

When sorting the below gff3 with agat_convert_sp_gxf2gxf.pl, the file only gets sorted within types, but (in this example) exon, CDS and intron are not sorted by start base. gt gff -sortlines does it as tabix expects. Is there a way to change the agat sorting behaviour?

agat output (does not report any problems):

##gff-version 3
Chr01   AUGUSTUS        gene    48532   57251   1.09    -       .       ID=gene-10011
Chr01   AUGUSTUS        transcript      48532   57251   0.59    -       .       ID=transcript-11073;Parent=gene-10011
Chr01   AUGUSTUS        gene    48532   57251   0.64    -       .       ID=g10011;gene_id=g10011;transcript_id=g10011.t1
Chr01   AUGUSTUS        mRNA    48532   57251   0.64    -       .       ID=g10011.t1;Parent=g10011;gene_id=g10011;transcript_id=g10011.t1
Chr01   AUGUSTUS        exon    48532   48701   .       -       .       ID=exon-69840;Parent=g10011.t1;gene_id=g10011;transcript_id=g10011.t1
Chr01   AUGUSTUS        exon    48803   49221   .       -       .       ID=exon-69841;Parent=g10011.t1;gene_id=g10011;transcript_id=g10011.t1
Chr01   AUGUSTUS        exon    54260   54305   .       -       .       ID=exon-69842;Parent=g10011.t1;gene_id=g10011;transcript_id=g10011.t1
Chr01   AUGUSTUS        exon    57161   57251   .       -       .       ID=exon-69843;Parent=g10011.t1;gene_id=g10011;transcript_id=g10011.t1
Chr01   AUGUSTUS        CDS     48532   48701   0.64    -       2       ID=cds-69840;Parent=g10011.t1;gene_id=g10011;transcript_id=g10011.t1
Chr01   AUGUSTUS        CDS     48803   49221   0.76    -       1       ID=cds-69841;Parent=g10011.t1;gene_id=g10011;transcript_id=g10011.t1
Chr01   AUGUSTUS        CDS     54260   54305   0.96    -       2       ID=cds-69842;Parent=g10011.t1;gene_id=g10011;transcript_id=g10011.t1
Chr01   AUGUSTUS        CDS     57161   57251   0.98    -       0       ID=cds-69843;Parent=g10011.t1;gene_id=g10011;transcript_id=g10011.t1
Chr01   AUGUSTUS        intron  48702   48802   0.84    -       .       ID=intron-58782;Parent=g10011.t1;gene_id=g10011;transcript_id=g10011.t1
Chr01   AUGUSTUS        intron  49222   54259   0.96    -       .       ID=intron-58783;Parent=g10011.t1;gene_id=g10011;transcript_id=g10011.t1
Chr01   AUGUSTUS        intron  54306   57160   0.96    -       .       ID=intron-58784;Parent=g10011.t1;gene_id=g10011;transcript_id=g10011.t1

$ bgzip test.sort.gff3
$ tabix test.sort.gff3.gz

[E::hts_idx_push] Unsorted positions on sequence #1: 57161 followed by 48532
tbx_index_build failed: augustus.hints.gff.gz

The text was updated successfully, but these errors were encountered:

Juke34 · 2022-03-11T10:57:09Z

The output sorting is made here in the code:

AGAT/lib/AGAT/OmniscientO.pm

Lines 45 to 147 in 60de645

    
           # omniscient is a hash containing a whole gXf file in memory sorted in a specific way (3 levels) 
        
           sub print_omniscient{ 
        
           	my ($hash_omniscient, $gffout) = @_  ; 
        
           	my @l3_out_priority = ("tss", "exon", "cds", "tts"); 
        
           	#uri_decode_omniscient($hash_omniscient); 
        
             # --------- deal with header -------------- 
        
             write_headers($hash_omniscient, $gffout); 
        
           ### OLD FASHION GOING TRHOUGH LEVEL1 
        
           	#foreach my $primary_tag_l1 ( sort {$a <=> $b or $a cmp $b} keys %{$hash_omniscient->{'level1'}}){ # primary_tag_l1 = gene or repeat etc... 
        
           	#	foreach my $id_tag_key_level1 ( sort { $hash_omniscient->{'level1'}{$primary_tag_l1}{$a}->start <=> $hash_omniscient->{'level1'}{$primary_tag_l1}{$b}->start } keys %{$hash_omniscient->{'level1'}{$primary_tag_l1}} ) { #sort by position 
        
           ### NEW FASHION GOING TRHOUGH LEVEL1 - Have to first create a hash of seq_id -> level1_feature , then we can go through in alphanumerical order. 
        
           	# sort by seq id 
        
           	my ( $hash_sortBySeq, $hash_sortBySeq_std, $hash_sortBySeq_topf ) = collect_l1_info_sorted_by_seqid_and_location($hash_omniscient); 
        
           	# Read by seqId to sort properly the output by seq ID 
        
           	# sort { (($a =~ /(\d+)$/)[0] || 0) <=> (($b =~ /(\d+)$/)[0] || 0) will provide sorting like that: contig contig1 contig2 contig3 contig10 contig11 contig22 contig100 contig101 
        
           	foreach my $seqid (sort { (($a =~ /(\d+)$/)[0] || 0) <=> (($b =~ /(\d+)$/)[0] || 0) } keys %{$hash_sortBySeq}){ # loop over all the feature level1 
        
             ################# 
        
           	# == LEVEL 1 == # IF not in omniscient do that, otherwise we us within. Make a method for it. 
        
           	################# 
        
           		write_top_features($gffout, $seqid, $hash_sortBySeq_topf, $hash_omniscient); 
        
           		foreach my $locationid ( sort { ncmp ($a, $b) } keys %{$hash_sortBySeq->{$seqid} } ){ 
        
           				my $primary_tag_l1 = $hash_sortBySeq->{$seqid}{$locationid}{'tag'}; 
        
           				my $id_tag_key_level1 = $hash_sortBySeq->{$seqid}{$locationid}{'id'}; 
        
           			  $gffout->write_feature($hash_omniscient->{'level1'}{$primary_tag_l1}{$id_tag_key_level1}); # print feature 
        
           				################# 
        
           				# == LEVEL 2 == # 
        
           				################# 
        
           				foreach my $primary_tag_l2 (sort {$a cmp $b} keys %{$hash_omniscient->{'level2'}}){ # primary_tag_l2 = mrna or mirna or ncrna or trna etc... 
        
           					if ( exists_keys( $hash_omniscient, ('level2', $primary_tag_l2, $id_tag_key_level1) ) ){ 
        
           						foreach my $feature_level2 ( sort { ncmp ($a->start.$a->end.$a->_tag_value('ID'), $b->start.$b->end.$b->_tag_value('ID') ) } @{$hash_omniscient->{'level2'}{$primary_tag_l2}{$id_tag_key_level1}}) { 
        
           							$gffout->write_feature($feature_level2); 
        
           							################# 
        
           							# == LEVEL 3 == # 
        
           							################# 
        
           							my $level2_ID = lc($feature_level2->_tag_value('ID')); 
        
                         ########### 
        
                         # Before tss 
        
                         if ( exists_keys($hash_omniscient,('level3','tss',$level2_ID)) ){ 
        
                           foreach my $feature_level3 ( @{$hash_omniscient->{'level3'}{'tss'}{$level2_ID}}) { 
        
                             #_uri_encode_one_feature($feature_level3); 
        
                             $gffout->write_feature($feature_level3); 
        
                           } 
        
                         } 
        
           							###### 
        
           							# FIRST EXON 
        
           							if ( exists_keys( $hash_omniscient, ('level3', 'exon', $level2_ID) ) ){ 
        
           								foreach my $feature_level3 ( sort {$a->start <=> $b->start} @{$hash_omniscient->{'level3'}{'exon'}{$level2_ID}}) { 
        
           									$gffout->write_feature($feature_level3); 
        
           								} 
        
           							} 
        
           							########### 
        
           							# SECOND CDS 
        
           							if ( exists_keys( $hash_omniscient, ('level3', 'cds', $level2_ID) ) ){ 
        
           								foreach my $feature_level3 ( sort {$a->start <=> $b->start} @{$hash_omniscient->{'level3'}{'cds'}{$level2_ID}}) { 
        
           									$gffout->write_feature($feature_level3); 
        
           								} 
        
           							} 
        
                         ########### 
        
                         # Last tts 
        
                         if ( exists_keys($hash_omniscient,('level3','tts',$level2_ID)) ){ 
        
                           foreach my $feature_level3 ( @{$hash_omniscient->{'level3'}{'tts'}{$level2_ID}}) { 
        
                             #_uri_encode_one_feature($feature_level3); 
        
                             $gffout->write_feature($feature_level3); 
        
                           } 
        
                         } 
        
           							############ 
        
           							# THEN ALL THE REST 
        
           							foreach my $primary_tag_l3 (sort {$a cmp $b} keys %{$hash_omniscient->{'level3'}}){ # primary_tag_l3 = cds or exon or start_codon or utr etc... 
        
           								if (! grep { $_ eq $primary_tag_l3 } @l3_out_priority){ 
        
           									if ( exists_keys( $hash_omniscient, ('level3', $primary_tag_l3, $level2_ID) ) ){ 
        
           										foreach my $feature_level3 ( sort {$a->start <=> $b->start} @{$hash_omniscient->{'level3'}{$primary_tag_l3}{$level2_ID}}) { 
        
           											$gffout->write_feature($feature_level3); 
        
           										} 
        
           									} 
        
           								} 
        
           							} 
        
           						} 
        
           					} 
        
           				} 
        
           			} 
        
           	} 
        
           	# --------- deal with fasta seq -------------- 
        
           	write_fasta($gffout, $hash_omniscient); 
        
           }

I have think in many occasion to improve this part. I'm not entirely satisfied. I think I followed what was doing Ensembl ~ 2018.
Currently the rule for level3 features is first tss, then exon, then CDS, then tts, then all the rest. Eah type is sorted by start position.
We can add a option to keep topological sorting for level1(gene) and level2(mrna) and then just print by start position. I just do not know if we need a rule to prioritize one type of feature over another when they have same start position...

Shellfishgene · 2022-03-11T11:34:48Z

I thinnk it pretty common for gff to be bgzipped and indexed with tabix, for genome browsers for example, so it mayb be a good idea to add coordinate based sorting.

Juke34 added a commit that referenced this issue Apr 2, 2022

add test for tabix. fix #230

babdd88

Juke34 closed this as completed in 302d28f Apr 4, 2022

This issue was closed.

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

agat_convert_sp_gxf2gxf.pl does not sort as expected by tabix #230

agat_convert_sp_gxf2gxf.pl does not sort as expected by tabix #230

Shellfishgene commented Mar 11, 2022

Juke34 commented Mar 11, 2022

Shellfishgene commented Mar 11, 2022

agat_convert_sp_gxf2gxf.pl does not sort as expected by tabix #230

agat_convert_sp_gxf2gxf.pl does not sort as expected by tabix #230

Comments

Shellfishgene commented Mar 11, 2022

Juke34 commented Mar 11, 2022

Shellfishgene commented Mar 11, 2022