Suggestions for improving prediction workflow documentation on https://reat.readthedocs.io/ #35

dadrasarmin · 2022-06-29T13:39:45Z

Hi reat developers,

First, I want to thank you for developing this great tool. I participated in your workshop at EI, and I am very motivated to use reat. I had some struggles during the last few days to run the prediction workflow and I have a four suggestions for the documentation file.

I thought I could pass one of the portcullis output to --introns, however, my run failed multiple times. After some investigation, I noticed I had to provide the gff3 file instead of the bed file. I think it would be nice to mention this in the document.

reat/annotation/prediction_module/main.wdl

Line 1617 in be8c679

junctools convert -o junctions.bed -if igff -of ebed ~{junctions}
On the documentation link, in section the "Evidence Modeler default weights file", we can find an example like:

ABINITIO_PREDICTION     GlimmerHMM      1
ABINITIO_PREDICTION     SNAP    1
ABINITIO_PREDICTION     CodingQuarry_v2.0       1
ABINITIO_PREDICTION AUGUSTUS_RUN_ABINITIO 1
PROTEIN hq_protein_alignment    4
PROTEIN lq_protein_alignment    1
TRANSCRIPT      hq_assembly     4
TRANSCRIPT      lq_asssembly    1
TRANSCRIPT      homology_models 10
TRANSCRIPT      transcriptome_models    10
OTHER_PREDICTION        AUGUSTUS_RUN1   10
OTHER_PREDICTION        AUGUSTUS_RUN2   10
OTHER_PREDICTION        AUGUSTUS_RUN3   10

When I looked at call-EVM/execute/Scafold_name/evm.out.log, I noticed that

WARNING: not considering ev_type: lq_assembly since not included in weights file
WARNING: not considering ev_type: hq_protein since not included in weights file
WARNING, skipping predType: homology_models, since not specified in weights file.
WARNING, skipping predType: transcriptome_models, since not specified in weights file

There are a few typos here, probably some variable names changed during the development and I am not sure what to do with homology and transcriptome models, therefore, I put them under OTHER_PREDICTIONS tag for the moment:

hq_protein_alignment -> hq_protein
lq_protein_alignment -> lq_protein
lq_asssembly -> lq_assembly
TRANSCRIPT homology_models 10 -> OTHER_PREDICTIONS homology_models 10
TRANSCRIPT transcriptome_models 10 -> OTHER_PREDICTIONS transcriptome_models 10

Strange warning/error for, $EVM_HOME/EvmUtils/write_EVM_commands.pl:
For this tool, I think the EVM code is somehow broken. Because if I run just $EVM_HOME/EvmUtils/write_EVM_commands.pl, I can see that there is an option -S. However, if I run $EVM_HOME/EvmUtils/write_EVM_commands.pl -S, I get:
Unknown option: S
I just change the -S option from this line of the code and everything worked smoothly and without any warning/error.

reat/annotation/prediction_module/main.wdl

Line 1504 in be8c679

$EVM_HOME/EvmUtils/write_EVM_commands.pl -S ~{extra_params} --genome ~{genome} \

In the "Configuring Augustus runs" section:
I found this line confusing.

reat/docs/modules/prediction/index.rst

Line 48 in 3852eee

    
           The output directory will contain a file of predictions corresponding to each :code:`--augustus_runs` input files, these files are named `augustus_run#` where `#` corresponds to the position of the file in the command-line argument list of run files.

In the last section of the same document, it is mentioned that we have to name our AUGUSTUS+hints predictions like AUGUSTUS_RUN1. However, here it is written with lower case letters (augustus_run#). In my case, the job only successfully proceeds if I use the capital letter in the EVM weights, as the file names, and the same way for calling reat with --augustus_runs AUGUSTUS_RUN1.

Best regards,
Armin

The text was updated successfully, but these errors were encountered:

swarbred · 2022-06-29T16:43:26Z

@dadrasarmin

Thanks for the feedback the documentation has lagged behind the development and there are a few holes that need filling

Yes this is the intron style GFF file output along with the bed file in reat transcriptome, we will update the documentation
Yes the example weights file is not correct and we will correct, an example file would be

ABINITIO_PREDICTION	GlimmerHMM	1
ABINITIO_PREDICTION	SNAP	1
ABINITIO_PREDICTION	AUGUSTUS_RUN_ABINITIO	1
ABINITIO_PREDICTION	CodingQuarry_v2.0	1
PROTEIN	hq_protein	4
PROTEIN lq_protein	1
TRANSCRIPT hq_assembly	4
TRANSCRIPT lq_assembly	1
OTHER_PREDICTION	AUGUSTUS_RUN1	10
OTHER_PREDICTION	AUGUSTUS_RUN2	10
OTHER_PREDICTION	AUGUSTUS_RUN3	10
OTHER_PREDICTION	homology_models	8
OTHER_PREDICTION	transcriptome_models	10

Yes this looks to be an EVM issue, it shouldn't cause an issue in reat but we will remove the -S option
This
"The output directory will contain a file of predictions corresponding to each :code:--augustus_runs input files, these files are named augustus_run# where # corresponds to the position of the file in the command-line argument list of run files."
is describing how the output files are named under

outputs/GenePredictors/Augustus/
augustus_abinitio.gff  augustus_run1.gff  augustus_run2.gff  augustus_run3.gff

As described in

reat/docs/modules/prediction/index.rst

Line 26 in 3852eee

    
           The EVM weights file should contain a line per prediction, in case of :code:`--augustus_runs` there should be a line with a label and a weight for each Augustus run, the labels are fixed and have the form `AUGUSTUS_RUN#` where `#` corresponds to the position of the run file in the list of :code:`--augustus_runs` provided through the command-line arguments.

They need to be referenced in the evm weights files as AUGUSTUS_RUN#

When providing these via --augustus_runs they can be named as desired by the user e.g. mine are named

--augustus_runs Inputs/Prediction/run1.augp Inputs/Prediction/run2.augp Inputs/Prediction/run3.augp

the order provided will determine which run corresponds to AUGUSTUS_RUN1 AUGUSTUS_RUN2 etc

Please let us know of any other issues

dadrasarmin · 2022-06-29T19:36:49Z

@swarbred
Thanks for your quick reply.
At the moment, I am stuck at "call-CombineEVM" step. The problem is related to this line

reat/annotation/prediction_module/main.wdl

Line 560 in be8c679

    
           cat $($EVM_HOME/EvmUtils/convert_EVM_outputs_to_GFF3.pl  --partitions ~{partitions} --output evm.out --genome ~{genome} | awk -F',' '{printf $2"/evm.out.gff3"} END{print ""}') > evm.out.gff3

/home/armin/projects/meso_genome_annotation/prediction_workflow/cromwell-executions/ei_prediction/ef913ab4-c6e5-405a-ad66-cfcc309a8146/call-CombineEVM/execution/script: line 24: /usr/bin/cat: Argument list too long

I guess the reason behind this is that my genome has lots of scaffold and contigs. I am not sure whether I can solve the problem by replacing this line with:

$EVM_HOME/EvmUtils/convert_EVM_outputs_to_GFF3.pl  --partitions ~{partitions} --output evm.out --genome ~{genome} | awk -F',' '{printf $2"/evm.out.gff3"} END{print ""}' | xargs cat >> evm.out.gff3

Do you have any suggestions?

swarbred · 2022-06-30T10:16:18Z

@dadrasarmin

You are hitting the ARG_MAX limit, I was aware that we would likely need to change this (call-JoinAugustus is also a cat).

Perhaps the easiest way is to use xargs

$EVM_HOME/EvmUtils/convert_EVM_outputs_to_GFF3.pl  --partitions ~{partitions} --output evm.out --genome ~{genome} | awk -F',' '{printf $2"/evm.out.gff3"} END{print ""}') | xargs -r cat > evm.out.gff3

That should work but we will make the change ASAP, we are very likely going to get the same error on the wheat annotation we are running.

swarbred · 2022-06-30T10:23:33Z

@dadrasarmin
I see you edited your reply to the same solution :-)

swarbred · 2022-06-30T13:37:46Z

2nd location to update

reat/annotation/prediction_module/augustus.wdl

Line 411 in 8ffc62b

cat ~{sep=' ' files} | join_aug_pred.pl > ~{out_filename}

dadrasarmin · 2022-06-30T13:42:57Z

@swarbred
Yes, I tried a few different commands, and most did not work. However, the one I wrote (without -r) and you suggested (with -r) worked without any problem. Thanks a lot.

Best regards,
Armin

dadrasarmin · 2022-06-30T15:22:16Z

One other thing that I forgot to mention. For transcriptome and homology workflow, "Configurable computational resources available" suggested on the documentation work completely fine. However, the one suggested for the prediction workflow causes an error and the program immediately stops working. Also, there is one computing resource that is necessary to run the prediction workflow, and it was not mentioned in the documentation that it is obligatory.(ei_prediction.augustus_resources). I changed the resource usage by changing the scripts instead of using --computational_resources for prediction workflow.

reat/annotation/prediction_module/main.wdl

Line 55 in c8c680f

RuntimeAttr augustus_resources

As a comparison we can see the difference between homology:

reat/annotation/homology_module/main.wdl

Lines 25 to 28 in c8c680f

    
           RuntimeAttr? index_attr 
        
           RuntimeAttr? score_attr 
        
           RuntimeAttr? aln_attr 
        
           RuntimeAttr? mikado_attr

The transcriptome workflow:

reat/annotation/transcriptome_module/subworkflows/mikado/wf_mikado.wdl

Lines 28 to 35 in c8c680f

    
           RuntimeAttr? orf_calling_resources 
        
           RuntimeAttr? orf_protein_index_resources 
        
           RuntimeAttr? orf_protein_alignment_resources 
        
           RuntimeAttr? homology_alignment_resources 
        
           RuntimeAttr? homology_index_resources 
        
           RuntimeAttr? mikado_pick_resources 
        
           RuntimeAttr? mikado_serialise_resources 
        
           RuntimeAttr? mikado_prepare_resources

And Prediction workflow:

reat/annotation/prediction_module/main.wdl

Line 568 in c8c680f

RuntimeAttr? resources

I guess the problem could be solved by substituting RuntimeAttr? resources with subprocesses names like RuntimeAttr? ExecuteEVMCommand.resources

swarbred · 2022-06-30T16:37:37Z

@dadrasarmin

For info here is an example compute_inputs.json for prediction

We will check what is given in the documentation

The line

reat/annotation/prediction_module/main.wdl

Line 568 in c8c680f

RuntimeAttr? resources

looks correct to me as is

The doc shows

"ei_prediction.Augustus.resources": " {
               cpu_cores -> Int
              max_retries -> Int?
              boot_disk_gb -> Int?
              queue -> String?
              disk_gb -> Int?
              constraints -> String?
              mem_gb -> Float?
              preemptible_tries -> Int?
              }? (optional)",

and this should be ei_prediction.augustus_resources as shown below

{
	"ei_prediction.AlignProteins.resources": {
		"cpu_cores": 30,
		"mem_gb": 64,
		"constraints": "avx"
	},
	"ei_prediction.IndexProteinsDatabase.resources": {
		"cpu_cores": 8,
		"mem_gb": 8,
		"constraints": "avx"
	},
	"ei_prediction.SelfBlastFilter.resources": {
		"cpu_cores": 8,
		"mem_gb": 16,
		"constraints": "avx"
	},
	"ei_prediction.MikadoPick.resources": {
		"cpu_cores": 30,
		"mem_gb": 120
	},
	"ei_prediction.Mikado.resources": {
		"cpu_cores": 30,
		"mem_gb": 120
	},
	"ei_prediction.LengthChecker.resources": {
		"cpu_cores": 1,
		"mem_gb": 96
	},
	"ei_prediction.ExecuteEVMCommand.resources": {
		"cpu_cores": 1,
                "max_retries": 3,
		"mem_gb": 24
	},
	"ei_prediction.augustus_resources": {
		"cpu_cores": 1,
		"max_retries": 3,
		"mem_gb": 24
	},
	"ei_prediction.OptimiseAugustus.resources": {
		"cpu_cores": 1,
		"mem_gb": 24
	}
}

dadrasarmin · 2022-06-30T21:36:13Z

@swarbred

Besides the ei_prediction.Augustus.resources, there is another problem. If I substitute the incorrect one ei_prediction.Augustus.resources with the correct one ei_prediction.augustus_resources, the program still crashes. If I use the modified version from the documents:

{
	"ei_prediction.AlignProteins.resources": {
		"cpu_cores": 30,
		"mem_gb": 64,
		"constraints": "avx"
	},
	"ei_prediction.IndexProteinsDatabase.resources": {
		"cpu_cores": 8,
		"mem_gb": 8,
		"constraints": "avx"
	},
	"ei_prediction.SelfBlastFilter.resources": {
		"cpu_cores": 8,
		"mem_gb": 16,
		"constraints": "avx"
	},
	"ei_prediction.MikadoPick.resources": {
		"cpu_cores": 30,
		"mem_gb": 120
	},
	"ei_prediction.Mikado.resources": {
		"cpu_cores": 30,
		"mem_gb": 120
	},
	"ei_prediction.LengthChecker.resources": {
		"cpu_cores": 1,
		"mem_gb": 96
	},
	"ei_prediction.ExecuteEVMCommand.resources": {
		"cpu_cores": 1,
                "max_retries": 3,
		"mem_gb": 24
	},
	"ei_prediction.augustus_resources": {
		"cpu_cores": 1,
		"max_retries": 3,
		"mem_gb": 24
	},
	"ei_prediction.AugustusAbinitio.resources": {
		"cpu_cores": 1,
		"max_retries": 3,
		"mem_gb": 24
	},
	"ei_prediction.OptimiseAugustus.resources": {
		"cpu_cores": 1,
		"mem_gb": 24
	}
}

If I take a look at run_details.json, I can see:

"run_details.json" [noeol] 47L, 81375B                                                                                       1,1           Top
  "status": "Failed",
  "failures": [{
    "message": "Workflow input processing failed",
    "causedBy": [{
      "causedBy": [],
      "message": "WARNING: Unexpected input provided: ei_prediction.AugustusAbinitio.resources (expected inputs: [ei_prediction.evm_extra_params, ei_prediction.base_training.resources, ei_prediction.homology_models, ei_prediction.extrinsic_config, ei_prediction.HQ_assembly, ei_prediction.extra_training_models, ei_prediction.SelectAugustusTestAndTrain.min_train_models, ei_prediction.AugustusAbinitio.bronze_models, ei_prediction.train_after_optimise.resources, ei_prediction.secondstrand_exon_hints, ei_prediction.reference_genome, ei_prediction.LengthChecker.query_start_scoring_distance, ei_prediction.Mikado.min_cdna_length, ei_prediction.mikado_scoring, ei_prediction.LengthChecker.min_pct_cds_fraction, ei_prediction.AlignProteins.resources, ei_prediction.do_augustus, ei_prediction.LengthChecker.min_target_coverage_scoring_percentage, ei_prediction.SelfBlastFilter.resources, ei_prediction.LengthChecker.max_tp_utr, ei_prediction.LengthChecker.target_end_hard_filter_distance, ei_prediction.EVM_weights, ei_prediction.LengthChecker.evalue_filter, ei_prediction.augustus_runs, ei_prediction.SelfBlastFilter.identity, ei_prediction.do_glimmer, ei_prediction.SNAP.resources, ei_prediction.Mikado.resources, ei_prediction.LengthChecker.max_tp_utr_complete, ei_prediction.LengthChecker.max_single_gap_score, ei_prediction.LQ_protein_alignments, ei_prediction.SelfBlastFilter.coverage, ei_prediction.GlimmerHMM.resources, ei_prediction.LengthChecker.target_end_scoring_distance, ei_prediction.Mikado.max_intron_length, ei_prediction.snap_training, ei_prediction.AugustusAbinitio.intron_hints, ei_prediction.codon_table, ei_prediction.chunk_size, ei_prediction.augustus_resources, ei_prediction.LengthChecker.min_target_coverage_score, ei_prediction.CodingQuarry.fresh_prediction, ei_prediction.AugustusAbinitio.all_models, ei_prediction.CodingQuarryFresh.resources, ei_prediction.firststrand_exon_hints, ei_prediction.AugustusAbinitio.expressed_exon_hints, ei_prediction.flank, ei_prediction.MikadoPick.resources, ei_prediction.LengthChecker.min_query_coverage_score, ei_prediction.intron_hints, ei_prediction.SelectAugustusTestAndTrain.force, ei_prediction.AugustusAbinitio.lq_protein_alignment_models, ei_prediction.LengthChecker.target_start_hard_filter_distance, ei_prediction.LengthChecker.target_end_score, ei_prediction.AugustusAbinitio.lq_assembly_models, ei_prediction.transcriptome_models, ei_prediction.SelfBlastFilter.top_n, ei_prediction.LengthChecker.query_end_hard_filter_distance, ei_prediction.do_codingquarry, ei_prediction.glimmer_training, ei_prediction.AugustusAbinitio.hints_source_and_priority, ei_prediction.IndexProteinsDatabase.resources, ei_prediction.AugustusAbinitio.silver_models, ei_prediction.species, ei_prediction.MikadoPick.extra_config, ei_prediction.unstranded_exon_hints, ei_prediction.AugustusAbinitio.homology_models, ei_prediction.protein_validation_database, ei_prediction.snap_extra_params, ei_prediction.LQ_assembly, ei_prediction.LengthChecker.query_start_hard_filter_distance, ei_prediction.LengthChecker.max_single_gap_hard_filter, ei_prediction.SelectAugustusTestAndTrain.target_mono_exonic_percentage, ei_prediction.LengthChecker.resources, ei_prediction.LengthChecker.min_fp_utr, ei_prediction.augustus_config_path, ei_prediction.LengthChecker.max_single_gap_scoring_length, ei_prediction.LengthChecker.target_start_score, ei_prediction.kfold, ei_prediction.ExecuteEVMCommand.resources, ei_prediction.LengthChecker.max_fp_utr, ei_prediction.optimise_augustus, ei_prediction.LengthChecker.query_start_score, ei_prediction.CodingQuarry.resources, ei_prediction.AugustusAbinitio.hq_assembly_models, ei_prediction.AugustusAbinitio.gold_models, ei_prediction.mikado_utr_files, ei_prediction.LengthChecker.query_end_scoring_distance, ei_prediction.overlap_size, ei_prediction.LengthChecker.max_fp_utr_complete, ei_prediction.codingquarry_extra_params, ei_prediction.codingquarry_training, ei_prediction.repeats_gff, ei_prediction.AugustusAbinitio.hq_protein_alignment_models, ei_prediction.force_train, ei_prediction.OptimiseAugustus.resources, ei_prediction.SelectAugustusTestAndTrain.max_test_models, ei_prediction.LengthChecker.query_end_score, ei_prediction.HQ_protein_alignments, ei_prediction.mikado_config, ei_prediction.do_snap, ei_prediction.LengthChecker.min_tp_utr, ei_prediction.LengthChecker.min_target_coverage_hard_filter, ei_prediction.LengthChecker.min_query_coverage_scoring_percentage, ei_prediction.glimmer_extra_params, ei_prediction.LengthChecker.min_query_coverage_hard_filter, ei_prediction.SelectAugustusTestAndTrain.max_train_models, ei_prediction.LengthChecker.target_start_scoring_distance, ei_prediction.augustus_extra_params])"
    }]

Therefore, "ei_prediction.AugustusAbinitio.resources" should be removed from the documentation.

gemygk · 2022-07-01T08:24:40Z

Hi @dadrasarmin

I appreciate your detailed checks.

We are doing to necessary changes as we speak and carrying out the tests before merging the changes.

The ei_prediction.AugustusAbinitio.resources is failing because it should have been ei_prediction.AugustusAbinitio.augustus_resources

dadrasarmin · 2022-07-01T08:42:14Z

Hi @gemygk,

As I said in my first comment, I really appreciate what you have built and the results I got recently are excellent. That is why I investigated the codes in more detail and found a few suggestions.
Thanks for sharing the correct way of giving the AugustusAbinitio computing resources. I gave up last week and changed the default values in the code instead of using --computational_resources, which is not a good practice in the long term.

gemygk · 2022-07-01T09:15:08Z

Thanks @dadrasarmin . It is the result of hard work from @ljyanesm and the team.

Please do let us know if you have any issues. In the meantime, we will try to get these changes into the repo.

Updated the documentation solving the following issue following issue: EI-CoreBioinformatics#35

@swarbred

Updated the values as described by @swarbred in the following issue: EI-CoreBioinformatics#35

Add aclaaration realted to de issue: EI-CoreBioinformatics#35

MmasterT added a commit to MmasterT/reat that referenced this issue Oct 12, 2022

Update index.rst

760a04b

Updated the documentation solving the following issue following issue: EI-CoreBioinformatics#35

MmasterT added a commit to MmasterT/reat that referenced this issue Oct 12, 2022

Update evm_default_weights.wgt

8b0e7d7

Updated the values as described by @swarbred in the following issue: EI-CoreBioinformatics#35

MmasterT added a commit to MmasterT/reat that referenced this issue Oct 12, 2022

Update __main__.py

6da4213

Add aclaaration realted to de issue: EI-CoreBioinformatics#35

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Suggestions for improving prediction workflow documentation on https://reat.readthedocs.io/ #35

Suggestions for improving prediction workflow documentation on https://reat.readthedocs.io/ #35

dadrasarmin commented Jun 29, 2022

swarbred commented Jun 29, 2022

dadrasarmin commented Jun 29, 2022 •

edited

Loading

swarbred commented Jun 30, 2022

swarbred commented Jun 30, 2022

swarbred commented Jun 30, 2022

dadrasarmin commented Jun 30, 2022

dadrasarmin commented Jun 30, 2022

swarbred commented Jun 30, 2022

dadrasarmin commented Jun 30, 2022 •

edited

Loading

gemygk commented Jul 1, 2022

dadrasarmin commented Jul 1, 2022

gemygk commented Jul 1, 2022

Suggestions for improving prediction workflow documentation on https://reat.readthedocs.io/ #35

Suggestions for improving prediction workflow documentation on https://reat.readthedocs.io/ #35

Comments

dadrasarmin commented Jun 29, 2022

swarbred commented Jun 29, 2022

dadrasarmin commented Jun 29, 2022 • edited Loading

swarbred commented Jun 30, 2022

swarbred commented Jun 30, 2022

swarbred commented Jun 30, 2022

dadrasarmin commented Jun 30, 2022

dadrasarmin commented Jun 30, 2022

swarbred commented Jun 30, 2022

dadrasarmin commented Jun 30, 2022 • edited Loading

gemygk commented Jul 1, 2022

dadrasarmin commented Jul 1, 2022

gemygk commented Jul 1, 2022

dadrasarmin commented Jun 29, 2022 •

edited

Loading

dadrasarmin commented Jun 30, 2022 •

edited

Loading