Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Suggestions for improving prediction workflow documentation on https://reat.readthedocs.io/ #35

Open
dadrasarmin opened this issue Jun 29, 2022 · 12 comments

Comments

@dadrasarmin
Copy link

Hi reat developers,

First, I want to thank you for developing this great tool. I participated in your workshop at EI, and I am very motivated to use reat. I had some struggles during the last few days to run the prediction workflow and I have a four suggestions for the documentation file.

  1. I thought I could pass one of the portcullis output to --introns, however, my run failed multiple times. After some investigation, I noticed I had to provide the gff3 file instead of the bed file. I think it would be nice to mention this in the document.

    junctools convert -o junctions.bed -if igff -of ebed ~{junctions}

  2. On the documentation link, in section the "Evidence Modeler default weights file", we can find an example like:

ABINITIO_PREDICTION     GlimmerHMM      1
ABINITIO_PREDICTION     SNAP    1
ABINITIO_PREDICTION     CodingQuarry_v2.0       1
ABINITIO_PREDICTION AUGUSTUS_RUN_ABINITIO 1
PROTEIN hq_protein_alignment    4
PROTEIN lq_protein_alignment    1
TRANSCRIPT      hq_assembly     4
TRANSCRIPT      lq_asssembly    1
TRANSCRIPT      homology_models 10
TRANSCRIPT      transcriptome_models    10
OTHER_PREDICTION        AUGUSTUS_RUN1   10
OTHER_PREDICTION        AUGUSTUS_RUN2   10
OTHER_PREDICTION        AUGUSTUS_RUN3   10

When I looked at call-EVM/execute/Scafold_name/evm.out.log, I noticed that

WARNING: not considering ev_type: lq_assembly since not included in weights file
WARNING: not considering ev_type: hq_protein since not included in weights file
WARNING, skipping predType: homology_models, since not specified in weights file.
WARNING, skipping predType: transcriptome_models, since not specified in weights file

There are a few typos here, probably some variable names changed during the development and I am not sure what to do with homology and transcriptome models, therefore, I put them under OTHER_PREDICTIONS tag for the moment:

hq_protein_alignment -> hq_protein
lq_protein_alignment -> lq_protein
lq_asssembly -> lq_assembly
TRANSCRIPT homology_models 10 -> OTHER_PREDICTIONS homology_models 10
TRANSCRIPT transcriptome_models 10 -> OTHER_PREDICTIONS transcriptome_models 10

  1. Strange warning/error for, $EVM_HOME/EvmUtils/write_EVM_commands.pl:
    For this tool, I think the EVM code is somehow broken. Because if I run just $EVM_HOME/EvmUtils/write_EVM_commands.pl, I can see that there is an option -S. However, if I run $EVM_HOME/EvmUtils/write_EVM_commands.pl -S, I get:
    Unknown option: S
    I just change the -S option from this line of the code and everything worked smoothly and without any warning/error.

    $EVM_HOME/EvmUtils/write_EVM_commands.pl -S ~{extra_params} --genome ~{genome} \

  2. In the "Configuring Augustus runs" section:
    I found this line confusing.

    The output directory will contain a file of predictions corresponding to each :code:`--augustus_runs` input files, these files are named `augustus_run#` where `#` corresponds to the position of the file in the command-line argument list of run files.

    In the last section of the same document, it is mentioned that we have to name our AUGUSTUS+hints predictions like AUGUSTUS_RUN1. However, here it is written with lower case letters (augustus_run#). In my case, the job only successfully proceeds if I use the capital letter in the EVM weights, as the file names, and the same way for calling reat with --augustus_runs AUGUSTUS_RUN1.

Best regards,
Armin

@swarbred
Copy link
Collaborator

@dadrasarmin

Thanks for the feedback the documentation has lagged behind the development and there are a few holes that need filling

  1. Yes this is the intron style GFF file output along with the bed file in reat transcriptome, we will update the documentation
  2. Yes the example weights file is not correct and we will correct, an example file would be
ABINITIO_PREDICTION	GlimmerHMM	1
ABINITIO_PREDICTION	SNAP	1
ABINITIO_PREDICTION	AUGUSTUS_RUN_ABINITIO	1
ABINITIO_PREDICTION	CodingQuarry_v2.0	1
PROTEIN	hq_protein	4
PROTEIN lq_protein	1
TRANSCRIPT hq_assembly	4
TRANSCRIPT lq_assembly	1
OTHER_PREDICTION	AUGUSTUS_RUN1	10
OTHER_PREDICTION	AUGUSTUS_RUN2	10
OTHER_PREDICTION	AUGUSTUS_RUN3	10
OTHER_PREDICTION	homology_models	8
OTHER_PREDICTION	transcriptome_models	10
  1. Yes this looks to be an EVM issue, it shouldn't cause an issue in reat but we will remove the -S option
  2. This
    "The output directory will contain a file of predictions corresponding to each :code:--augustus_runs input files, these files are named augustus_run# where # corresponds to the position of the file in the command-line argument list of run files."
    is describing how the output files are named under
outputs/GenePredictors/Augustus/
augustus_abinitio.gff  augustus_run1.gff  augustus_run2.gff  augustus_run3.gff

As described in

The EVM weights file should contain a line per prediction, in case of :code:`--augustus_runs` there should be a line with a label and a weight for each Augustus run, the labels are fixed and have the form `AUGUSTUS_RUN#` where `#` corresponds to the position of the run file in the list of :code:`--augustus_runs` provided through the command-line arguments.

They need to be referenced in the evm weights files as AUGUSTUS_RUN#

When providing these via --augustus_runs they can be named as desired by the user e.g. mine are named

--augustus_runs Inputs/Prediction/run1.augp Inputs/Prediction/run2.augp Inputs/Prediction/run3.augp

the order provided will determine which run corresponds to AUGUSTUS_RUN1 AUGUSTUS_RUN2 etc

Please let us know of any other issues

@dadrasarmin
Copy link
Author

dadrasarmin commented Jun 29, 2022

@swarbred
Thanks for your quick reply.
At the moment, I am stuck at "call-CombineEVM" step. The problem is related to this line

cat $($EVM_HOME/EvmUtils/convert_EVM_outputs_to_GFF3.pl --partitions ~{partitions} --output evm.out --genome ~{genome} | awk -F',' '{printf $2"/evm.out.gff3"} END{print ""}') > evm.out.gff3

/home/armin/projects/meso_genome_annotation/prediction_workflow/cromwell-executions/ei_prediction/ef913ab4-c6e5-405a-ad66-cfcc309a8146/call-CombineEVM/execution/script: line 24: /usr/bin/cat: Argument list too long

I guess the reason behind this is that my genome has lots of scaffold and contigs. I am not sure whether I can solve the problem by replacing this line with:

$EVM_HOME/EvmUtils/convert_EVM_outputs_to_GFF3.pl  --partitions ~{partitions} --output evm.out --genome ~{genome} | awk -F',' '{printf $2"/evm.out.gff3"} END{print ""}' | xargs cat >> evm.out.gff3

Do you have any suggestions?

@swarbred
Copy link
Collaborator

@dadrasarmin

You are hitting the ARG_MAX limit, I was aware that we would likely need to change this (call-JoinAugustus is also a cat).

Perhaps the easiest way is to use xargs

$EVM_HOME/EvmUtils/convert_EVM_outputs_to_GFF3.pl  --partitions ~{partitions} --output evm.out --genome ~{genome} | awk -F',' '{printf $2"/evm.out.gff3"} END{print ""}') | xargs -r cat > evm.out.gff3

That should work but we will make the change ASAP, we are very likely going to get the same error on the wheat annotation we are running.

@swarbred
Copy link
Collaborator

@dadrasarmin
I see you edited your reply to the same solution :-)

@swarbred
Copy link
Collaborator

2nd location to update

cat ~{sep=' ' files} | join_aug_pred.pl > ~{out_filename}

@dadrasarmin
Copy link
Author

@swarbred
Yes, I tried a few different commands, and most did not work. However, the one I wrote (without -r) and you suggested (with -r) worked without any problem. Thanks a lot.

Best regards,
Armin

@dadrasarmin
Copy link
Author

One other thing that I forgot to mention. For transcriptome and homology workflow, "Configurable computational resources available" suggested on the documentation work completely fine. However, the one suggested for the prediction workflow causes an error and the program immediately stops working. Also, there is one computing resource that is necessary to run the prediction workflow, and it was not mentioned in the documentation that it is obligatory.(ei_prediction.augustus_resources). I changed the resource usage by changing the scripts instead of using --computational_resources for prediction workflow.

RuntimeAttr augustus_resources

As a comparison we can see the difference between homology:

RuntimeAttr? index_attr
RuntimeAttr? score_attr
RuntimeAttr? aln_attr
RuntimeAttr? mikado_attr

The transcriptome workflow:
RuntimeAttr? orf_calling_resources
RuntimeAttr? orf_protein_index_resources
RuntimeAttr? orf_protein_alignment_resources
RuntimeAttr? homology_alignment_resources
RuntimeAttr? homology_index_resources
RuntimeAttr? mikado_pick_resources
RuntimeAttr? mikado_serialise_resources
RuntimeAttr? mikado_prepare_resources

And Prediction workflow:
RuntimeAttr? resources

I guess the problem could be solved by substituting RuntimeAttr? resources with subprocesses names like RuntimeAttr? ExecuteEVMCommand.resources

@swarbred
Copy link
Collaborator

@dadrasarmin

For info here is an example compute_inputs.json for prediction

We will check what is given in the documentation

The line

RuntimeAttr? resources

looks correct to me as is

The doc shows

"ei_prediction.Augustus.resources": " {
               cpu_cores -> Int
              max_retries -> Int?
              boot_disk_gb -> Int?
              queue -> String?
              disk_gb -> Int?
              constraints -> String?
              mem_gb -> Float?
              preemptible_tries -> Int?
              }? (optional)",

and this should be ei_prediction.augustus_resources as shown below

{
	"ei_prediction.AlignProteins.resources": {
		"cpu_cores": 30,
		"mem_gb": 64,
		"constraints": "avx"
	},
	"ei_prediction.IndexProteinsDatabase.resources": {
		"cpu_cores": 8,
		"mem_gb": 8,
		"constraints": "avx"
	},
	"ei_prediction.SelfBlastFilter.resources": {
		"cpu_cores": 8,
		"mem_gb": 16,
		"constraints": "avx"
	},
	"ei_prediction.MikadoPick.resources": {
		"cpu_cores": 30,
		"mem_gb": 120
	},
	"ei_prediction.Mikado.resources": {
		"cpu_cores": 30,
		"mem_gb": 120
	},
	"ei_prediction.LengthChecker.resources": {
		"cpu_cores": 1,
		"mem_gb": 96
	},
	"ei_prediction.ExecuteEVMCommand.resources": {
		"cpu_cores": 1,
                "max_retries": 3,
		"mem_gb": 24
	},
	"ei_prediction.augustus_resources": {
		"cpu_cores": 1,
		"max_retries": 3,
		"mem_gb": 24
	},
	"ei_prediction.OptimiseAugustus.resources": {
		"cpu_cores": 1,
		"mem_gb": 24
	}
}

@dadrasarmin
Copy link
Author

dadrasarmin commented Jun 30, 2022

@swarbred

Besides the ei_prediction.Augustus.resources, there is another problem. If I substitute the incorrect one ei_prediction.Augustus.resources with the correct one ei_prediction.augustus_resources, the program still crashes. If I use the modified version from the documents:

{
	"ei_prediction.AlignProteins.resources": {
		"cpu_cores": 30,
		"mem_gb": 64,
		"constraints": "avx"
	},
	"ei_prediction.IndexProteinsDatabase.resources": {
		"cpu_cores": 8,
		"mem_gb": 8,
		"constraints": "avx"
	},
	"ei_prediction.SelfBlastFilter.resources": {
		"cpu_cores": 8,
		"mem_gb": 16,
		"constraints": "avx"
	},
	"ei_prediction.MikadoPick.resources": {
		"cpu_cores": 30,
		"mem_gb": 120
	},
	"ei_prediction.Mikado.resources": {
		"cpu_cores": 30,
		"mem_gb": 120
	},
	"ei_prediction.LengthChecker.resources": {
		"cpu_cores": 1,
		"mem_gb": 96
	},
	"ei_prediction.ExecuteEVMCommand.resources": {
		"cpu_cores": 1,
                "max_retries": 3,
		"mem_gb": 24
	},
	"ei_prediction.augustus_resources": {
		"cpu_cores": 1,
		"max_retries": 3,
		"mem_gb": 24
	},
	"ei_prediction.AugustusAbinitio.resources": {
		"cpu_cores": 1,
		"max_retries": 3,
		"mem_gb": 24
	},
	"ei_prediction.OptimiseAugustus.resources": {
		"cpu_cores": 1,
		"mem_gb": 24
	}
}

If I take a look at run_details.json, I can see:

"run_details.json" [noeol] 47L, 81375B                                                                                       1,1           Top
  "status": "Failed",
  "failures": [{
    "message": "Workflow input processing failed",
    "causedBy": [{
      "causedBy": [],
      "message": "WARNING: Unexpected input provided: ei_prediction.AugustusAbinitio.resources (expected inputs: [ei_prediction.evm_extra_params, ei_prediction.base_training.resources, ei_prediction.homology_models, ei_prediction.extrinsic_config, ei_prediction.HQ_assembly, ei_prediction.extra_training_models, ei_prediction.SelectAugustusTestAndTrain.min_train_models, ei_prediction.AugustusAbinitio.bronze_models, ei_prediction.train_after_optimise.resources, ei_prediction.secondstrand_exon_hints, ei_prediction.reference_genome, ei_prediction.LengthChecker.query_start_scoring_distance, ei_prediction.Mikado.min_cdna_length, ei_prediction.mikado_scoring, ei_prediction.LengthChecker.min_pct_cds_fraction, ei_prediction.AlignProteins.resources, ei_prediction.do_augustus, ei_prediction.LengthChecker.min_target_coverage_scoring_percentage, ei_prediction.SelfBlastFilter.resources, ei_prediction.LengthChecker.max_tp_utr, ei_prediction.LengthChecker.target_end_hard_filter_distance, ei_prediction.EVM_weights, ei_prediction.LengthChecker.evalue_filter, ei_prediction.augustus_runs, ei_prediction.SelfBlastFilter.identity, ei_prediction.do_glimmer, ei_prediction.SNAP.resources, ei_prediction.Mikado.resources, ei_prediction.LengthChecker.max_tp_utr_complete, ei_prediction.LengthChecker.max_single_gap_score, ei_prediction.LQ_protein_alignments, ei_prediction.SelfBlastFilter.coverage, ei_prediction.GlimmerHMM.resources, ei_prediction.LengthChecker.target_end_scoring_distance, ei_prediction.Mikado.max_intron_length, ei_prediction.snap_training, ei_prediction.AugustusAbinitio.intron_hints, ei_prediction.codon_table, ei_prediction.chunk_size, ei_prediction.augustus_resources, ei_prediction.LengthChecker.min_target_coverage_score, ei_prediction.CodingQuarry.fresh_prediction, ei_prediction.AugustusAbinitio.all_models, ei_prediction.CodingQuarryFresh.resources, ei_prediction.firststrand_exon_hints, ei_prediction.AugustusAbinitio.expressed_exon_hints, ei_prediction.flank, ei_prediction.MikadoPick.resources, ei_prediction.LengthChecker.min_query_coverage_score, ei_prediction.intron_hints, ei_prediction.SelectAugustusTestAndTrain.force, ei_prediction.AugustusAbinitio.lq_protein_alignment_models, ei_prediction.LengthChecker.target_start_hard_filter_distance, ei_prediction.LengthChecker.target_end_score, ei_prediction.AugustusAbinitio.lq_assembly_models, ei_prediction.transcriptome_models, ei_prediction.SelfBlastFilter.top_n, ei_prediction.LengthChecker.query_end_hard_filter_distance, ei_prediction.do_codingquarry, ei_prediction.glimmer_training, ei_prediction.AugustusAbinitio.hints_source_and_priority, ei_prediction.IndexProteinsDatabase.resources, ei_prediction.AugustusAbinitio.silver_models, ei_prediction.species, ei_prediction.MikadoPick.extra_config, ei_prediction.unstranded_exon_hints, ei_prediction.AugustusAbinitio.homology_models, ei_prediction.protein_validation_database, ei_prediction.snap_extra_params, ei_prediction.LQ_assembly, ei_prediction.LengthChecker.query_start_hard_filter_distance, ei_prediction.LengthChecker.max_single_gap_hard_filter, ei_prediction.SelectAugustusTestAndTrain.target_mono_exonic_percentage, ei_prediction.LengthChecker.resources, ei_prediction.LengthChecker.min_fp_utr, ei_prediction.augustus_config_path, ei_prediction.LengthChecker.max_single_gap_scoring_length, ei_prediction.LengthChecker.target_start_score, ei_prediction.kfold, ei_prediction.ExecuteEVMCommand.resources, ei_prediction.LengthChecker.max_fp_utr, ei_prediction.optimise_augustus, ei_prediction.LengthChecker.query_start_score, ei_prediction.CodingQuarry.resources, ei_prediction.AugustusAbinitio.hq_assembly_models, ei_prediction.AugustusAbinitio.gold_models, ei_prediction.mikado_utr_files, ei_prediction.LengthChecker.query_end_scoring_distance, ei_prediction.overlap_size, ei_prediction.LengthChecker.max_fp_utr_complete, ei_prediction.codingquarry_extra_params, ei_prediction.codingquarry_training, ei_prediction.repeats_gff, ei_prediction.AugustusAbinitio.hq_protein_alignment_models, ei_prediction.force_train, ei_prediction.OptimiseAugustus.resources, ei_prediction.SelectAugustusTestAndTrain.max_test_models, ei_prediction.LengthChecker.query_end_score, ei_prediction.HQ_protein_alignments, ei_prediction.mikado_config, ei_prediction.do_snap, ei_prediction.LengthChecker.min_tp_utr, ei_prediction.LengthChecker.min_target_coverage_hard_filter, ei_prediction.LengthChecker.min_query_coverage_scoring_percentage, ei_prediction.glimmer_extra_params, ei_prediction.LengthChecker.min_query_coverage_hard_filter, ei_prediction.SelectAugustusTestAndTrain.max_train_models, ei_prediction.LengthChecker.target_start_scoring_distance, ei_prediction.augustus_extra_params])"
    }]

Therefore, "ei_prediction.AugustusAbinitio.resources" should be removed from the documentation.

@gemygk
Copy link
Collaborator

gemygk commented Jul 1, 2022

Hi @dadrasarmin

I appreciate your detailed checks.

We are doing to necessary changes as we speak and carrying out the tests before merging the changes.

The ei_prediction.AugustusAbinitio.resources is failing because it should have been ei_prediction.AugustusAbinitio.augustus_resources

@dadrasarmin
Copy link
Author

Hi @gemygk,

As I said in my first comment, I really appreciate what you have built and the results I got recently are excellent. That is why I investigated the codes in more detail and found a few suggestions.
Thanks for sharing the correct way of giving the AugustusAbinitio computing resources. I gave up last week and changed the default values in the code instead of using --computational_resources, which is not a good practice in the long term.

@gemygk
Copy link
Collaborator

gemygk commented Jul 1, 2022

Thanks @dadrasarmin . It is the result of hard work from @ljyanesm and the team.

Please do let us know if you have any issues. In the meantime, we will try to get these changes into the repo.

MmasterT added a commit to MmasterT/reat that referenced this issue Oct 12, 2022
Updated the documentation  solving the following issue following issue: EI-CoreBioinformatics#35
MmasterT added a commit to MmasterT/reat that referenced this issue Oct 12, 2022
Updated the values as described by @swarbred in the following issue: EI-CoreBioinformatics#35
MmasterT added a commit to MmasterT/reat that referenced this issue Oct 12, 2022
Add aclaaration realted to de issue: EI-CoreBioinformatics#35
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

No branches or pull requests

3 participants