numeric nlg - template changes #1041

Merged
merged 8 commits into main from numericnlg_template
Jul 28, 2024

Conversation

ShirApp
Collaborator

@ShirApp ShirApp commented Jul 22, 2024

No description provided.

@ShirApp ShirApp closed this Jul 22, 2024
@ShirApp ShirApp reopened this Jul 22, 2024
Member

@elronbandel elronbandel left a comment

See my comment regarding the web UI and the task definition

Signed-off-by: elronbandel <elron.bandel@ibm.com>
@elronbandel elronbandel enabled auto-merge (squash) July 28, 2024 07:05
@elronbandel elronbandel merged commit bc8e644 into main Jul 28, 2024
7 of 8 checks passed
@elronbandel elronbandel deleted the numericnlg_template branch July 28, 2024 10:31
csrajmohan pushed a commit that referenced this pull request Jul 29, 2024
benjaminsznajder pushed a commit that referenced this pull request Jul 29, 2024
Signed-off-by: Benjamin Sznajder <benjams@il.ibm.com>
benjaminsznajder added a commit that referenced this pull request Jul 29, 2024
* Fix bug in data classes and add support for field overriding in fields containing types or functions (#1027)

Fix data classes not supporting field overriding in fields containing types or functions

Signed-off-by: elronbandel <elron.bandel@ibm.com>
Signed-off-by: Benjamin Sznajder <benjams@il.ibm.com>

* Added seed to LLM as judges for consistent results (#1029)

Signed-off-by: Yoav Katz <katz@il.ibm.com>
Signed-off-by: Benjamin Sznajder <benjams@il.ibm.com>

* replace type and __type__ in type error (#1035)

Signed-off-by: Yotam Perlitz <yotam.perlitz@ibm.com>
Signed-off-by: Benjamin Sznajder <benjams@il.ibm.com>

* add rag_end_to_end metrics

Signed-off-by: Benjamin Sznajder <benjams@il.ibm.com>

* add rag_end_to_end metrics

Signed-off-by: Benjamin Sznajder <benjams@il.ibm.com>

* Add task rag_end_to_end

Signed-off-by: Benjamin Sznajder <benjams@il.ibm.com>

* add card for clapnq end_to_end

Signed-off-by: Benjamin Sznajder <benjams@il.ibm.com>

* add sandbox_benjams

Signed-off-by: Benjamin Sznajder <benjams@il.ibm.com>

* add subset

Signed-off-by: Benjamin Sznajder <benjams@il.ibm.com>

* add a reduction of clap_nq

Signed-off-by: Benjamin Sznajder <benjams@il.ibm.com>

* add a reduction of clap_nq

Signed-off-by: Benjamin Sznajder <benjams@il.ibm.com>

* remove constants

Signed-off-by: Benjamin Sznajder <benjams@il.ibm.com>

* rename sandbox_benjams to sandbox

Signed-off-by: Benjamin Sznajder <benjams@il.ibm.com>

* remove sandbox

Signed-off-by: Benjamin Sznajder <benjams@il.ibm.com>

* Add string to context id in rag (#1036)

* allow strings (hash) as context id

Signed-off-by: Yotam Perlitz <yotam.perlitz@ibm.com>

* save to catalog

Signed-off-by: Yotam Perlitz <yotam.perlitz@ibm.com>

---------

Signed-off-by: Yotam Perlitz <yotam.perlitz@ibm.com>
Signed-off-by: Benjamin Sznajder <benjams@il.ibm.com>

* Fixed issues with fresh install (#1037)

Signed-off-by: Benjamin Sznajder <benjams@il.ibm.com>

* add validation to tldr, remove shuffle from billsum (#1038)

* add validation to tldr, remove shuffle from billsum
(shuffled by the SplitRandomMix)

Signed-off-by: ALON HALFON <ALONHAL@il.ibm.com>

* fix formatting

Signed-off-by: ALON HALFON <ALONHAL@il.ibm.com>

---------

Signed-off-by: ALON HALFON <ALONHAL@il.ibm.com>
Signed-off-by: Benjamin Sznajder <benjams@il.ibm.com>

* Refactor Rouge and Meteor to InstanceMetric for faster score computation (#1011)

* Remove confidence interval calculation for meteor metric by default

added a new metric with interval calculations

Signed-off-by: Yoav Katz <katz@il.ibm.com>

* Added error message when metrics are not a list

Signed-off-by: Yoav Katz <katz@il.ibm.com>

* Added error message

when post processors are not a list

Signed-off-by: Yoav Katz <katz@il.ibm.com>

* Changed Rouge to be HuggingfaceBulkMetric

to avoid recalculation of metric on every resample

Signed-off-by: Yoav Katz <katz@il.ibm.com>

* added meteor as an HuggingFaceInstanceMetric

Signed-off-by: dafnapension <dafnashein@yahoo.com>

* removed meteor_with_confidence_intervals.json

Signed-off-by: dafnapension <dafnashein@yahoo.com>

* fixed test_metric_utils.py by better concentrating on rougeL only

Signed-off-by: dafnapension <dafnashein@yahoo.com>

* comment about rounded floats in tested scores

Signed-off-by: dafnapension <dafnashein@yahoo.com>

* while generating metric meteor, compare against HF implementation

Signed-off-by: dafnapension <dafnashein@yahoo.com>

* added a test comparing new Rouge with HF Rouge, and per arielge's good advice, changed bootstrap method to percentile in case of 100 or more instances

Signed-off-by: dafnapension <dafnashein@yahoo.com>

* implemented Meteor and Rouge with in-house code

Signed-off-by: dafnapension <dafnashein@yahoo.com>

* download quietly, and import in prepare

Signed-off-by: dafnapension <dafnashein@yahoo.com>

* trying to avoid .secrets.baseline

Signed-off-by: dafnapension <dafnashein@yahoo.com>

* secret.baseline how do I get rid of it?

Signed-off-by: dafnapension <dafnashein@yahoo.com>

---------

Signed-off-by: Yoav Katz <katz@il.ibm.com>
Signed-off-by: dafnapension <dafnashein@yahoo.com>
Co-authored-by: dafnapension <dafnashein@yahoo.com>
Co-authored-by: Elron Bandel <elronbandel@gmail.com>
Signed-off-by: Benjamin Sznajder <benjams@il.ibm.com>

* Add CloseTextSampler and FixedIndicesSampler (#1034)

* Add CloseTextSampler

That returns demos that are textually close to the current instance.

Signed-off-by: Yoav Katz <katz@il.ibm.com>

* Make sampler call pass the current instance

Added end-to-end test of sampler that depends on output

Signed-off-by: Yoav Katz <katz@il.ibm.com>

* Added FixedIndicesSampler(Sampler):

Selects a fixed set of samples based on a list of indices from the demo pool

Signed-off-by: Yoav Katz <katz@il.ibm.com>

* Made splitter currently use random_generators

Signed-off-by: Yoav Katz <katz@il.ibm.com>

* Changed all Sample randomization

To use common code to create randomizer per instance

Signed-off-by: Yoav Katz <katz@il.ibm.com>

* Updated demos in test

After a non backward compatible change

Signed-off-by: Yoav Katz <katz@il.ibm.com>

* Updated demos in test

After a non backward compatible change

Signed-off-by: Yoav Katz <katz@il.ibm.com>

---------

Signed-off-by: Yoav Katz <katz@il.ibm.com>
Signed-off-by: Benjamin Sznajder <benjams@il.ibm.com>

* changed input and output of templates to "input_fields" and "reference_fields" - Non backward compatible (#1030)

* changed input and output of templates

to "input_fields" and "reference_fields".

This is to continue the work done on tasks.

Signed-off-by: Yoav Katz <katz@il.ibm.com>

* Fixed type hint

Signed-off-by: Yoav Katz <katz@il.ibm.com>

* Documentation update

Signed-off-by: Yoav Katz <katz@il.ibm.com>

---------

Signed-off-by: Yoav Katz <katz@il.ibm.com>
Signed-off-by: Benjamin Sznajder <benjams@il.ibm.com>

* FinQA - filter problematic examples (#1039)

filter problematic examples

Signed-off-by: Benjamin Sznajder <benjams@il.ibm.com>

* Arena hard elad2 (#1026)

* bug fixes in PairwiseChoiceTemplate

* add arena hard regex parser operator

* update mt bench card common

* update mt bench card common

* add reward bench

* update metric to pairwise comparison task

* arena hard tasks and cards

* update mt bench template

* add duplicate stream operator

* add PairwiseComparativeRatingTemplate

* add card

* add card

* add template

* add winrate metrics

* add comparative rating task

* add ExtractArenaHardNumericalJudgment

* add arena hard cards

* add arena hard template

* add weighted winrate metrics

* delete file

* update PairwiseComparativeRatingTemplate

* add metric

* add metric

* update

* update

* update

* fix template bug

* update

* llama 3 update

* update

* update

* update jsons

* update

* update

* update

* update

* update

* update

* update

* update

* update

* update

* update

* update

* update

* update

* fix

* fix

* fix

* update

* update

* update

* bluebench related changes

* fix type issue

Signed-off-by: Yotam Perlitz <yotam.perlitz@ibm.com>

* update

* update

* update

* prometheus1

* update

* fix

* fix

* merge with arena_branch

Signed-off-by: Yotam Perlitz <yotam.perlitz@ibm.com>

* rebuild catalog

Signed-off-by: Yotam Perlitz <yotam.perlitz@ibm.com>

* add debugging to clapnq

* Reproduce all artifacts

* Add missing artifacts to catalog

* Add secrets baseline

Signed-off-by: Elad Venezian <eladv@il.ibm.com>

* Fix bugs with catalog creation

* Remove arena hard examples from tests, since they don't pass

* Add missing metadata to test mock

* Add data_classification_policy and recipe_metadata to the streams tests

* Fix test failures

* Update multi_turn_gpt4_judgement.py

* Update multi_turn_with_reference_gpt4_judgement.py

* Update docs/docs/examples.rst

Co-authored-by: Yoav Katz <68273864+yoavkatz@users.noreply.github.com>

* revert catalog consistency and preparation yml files

* revert catalog consistency and preparation yml files

* revert catalog consistency and preparation yml files

* revert catalog consistency and preparation yml files

* Update docs/docs/examples.rst

Co-authored-by: Yoav Katz <68273864+yoavkatz@users.noreply.github.com>

* bug fix in LoadFromHFSpace

* revert

* revert

* update examples

* add comment to explain change

* update to new params usage

* pr fixes

* pr fixes

* update

* update

* update

* update

* update

* update

* Update prepare/templates/rag/response_generation.py

Co-authored-by: Yotam Perlitz <perlitz@gmail.com>

* Update prepare/templates/rag/response_generation.py

Co-authored-by: Yotam Perlitz <perlitz@gmail.com>

* update

* cr fixes

* llmaj format fix

* llmaj format fix

---------

Signed-off-by: Yotam Perlitz <yotam.perlitz@ibm.com>
Signed-off-by: Elad Venezian <eladv@il.ibm.com>
Co-authored-by: ofirarviv <ofir.arviv@ibm.com>
Co-authored-by: Yotam Perlitz <yotam.perlitz@ibm.com>
Co-authored-by: michal <shmueli@il.ibm.com>
Co-authored-by: Yoav Katz <68273864+yoavkatz@users.noreply.github.com>
Co-authored-by: Yotam Perlitz <perlitz@gmail.com>
Signed-off-by: Benjamin Sznajder <benjams@il.ibm.com>

* demo's target prefix is now taken from demo instance (#1031)

* demo's target prefix is now taken from demo instance

Signed-off-by: dafnapension <dafnashein@yahoo.com>

* do not pop fields out of demo instances.
Traditionally done for main instance, but not allowed for demo instance that should serve also other main instances in the stream

Signed-off-by: dafnapension <dafnashein@yahoo.com>

* simplified test case per @yoavkatz's idea. Eager mode still samples different demos than non-eager mode

Signed-off-by: dafnapension <dafnashein@yahoo.com>

---------

Signed-off-by: dafnapension <dafnashein@yahoo.com>
Co-authored-by: Yoav Katz <68273864+yoavkatz@users.noreply.github.com>
Signed-off-by: Benjamin Sznajder <benjams@il.ibm.com>

* remove the reduced clap_nq

Signed-off-by: Benjamin Sznajder <benjams@il.ibm.com>

* define an empty template for rag end_to_end

Signed-off-by: Benjamin Sznajder <benjams@il.ibm.com>

* Implement metrics ensemble (#1047)

Signed-off-by: Benjamin Sznajder <benjams@il.ibm.com>

* add load_json_predictions as processor in the template

Signed-off-by: Benjamin Sznajder <benjams@il.ibm.com>

* add the processors/load_json_predictions.json generated to the catalog

Signed-off-by: Benjamin Sznajder <benjams@il.ibm.com>

* Add flores101 (#1053)

Signed-off-by: Benjamin Sznajder <benjams@il.ibm.com>

* Added example for selection of demos (#1052)

* Added example for selection of demos

Signed-off-by: Yoav Katz <katz@il.ibm.com>

* Added example doc

Signed-off-by: Yoav Katz <katz@il.ibm.com>

* Update docs/docs/examples.rst

* Update docs/docs/examples.rst

---------

Signed-off-by: Yoav Katz <katz@il.ibm.com>
Signed-off-by: Benjamin Sznajder <benjams@il.ibm.com>

* fix - the build test is not working. The reason is that opendatasets depends on kaggle without pinning a version, and kaggle-1.6.15 currently fails. We pin kaggle to version 1.6.14 as a workaround

Signed-off-by: Benjamin Sznajder <benjams@il.ibm.com>

* add overwrite

Signed-off-by: Benjamin Sznajder <benjams@il.ibm.com>

* Update introduction.rst - - copy edits (grammar, consistency, clarity) (#1063)

Signed-off-by: welisheva22 <welisheva22@gmail.com>
Signed-off-by: Benjamin Sznajder <benjams@il.ibm.com>

* Fix typo in japanese_llama system prompt (issue #964) (#1056)

Signed-off-by: Jonathan Bnayahu <bnayahu@il.ibm.com>
Co-authored-by: Elron Bandel <elronbandel@gmail.com>
Signed-off-by: Benjamin Sznajder <benjams@il.ibm.com>

* Allow assigning None in overwrites when fetching artifacts with modifications (#1062)

allow =None in overwrites for fetch

Signed-off-by: dafnapension <dafnashein@yahoo.com>
Co-authored-by: Elron Bandel <elronbandel@gmail.com>
Signed-off-by: Benjamin Sznajder <benjams@il.ibm.com>

* Make sure preparation times printed fully and nicely (#1046)

Signed-off-by: elronbandel <elron.bandel@ibm.com>
Signed-off-by: Benjamin Sznajder <benjams@il.ibm.com>

* numeric nlg - template changes (#1041)

Signed-off-by: Benjamin Sznajder <benjams@il.ibm.com>

* add judge input to the metric (#1064)

* add judge input to the metric

* add judge input to the metric

* fix

* fix test

Signed-off-by: Benjamin Sznajder <benjams@il.ibm.com>

* Unitxt capitalization adding_dataset.rst (#1057)

making Unitxt capitalization consistent in text

Co-authored-by: Yoav Katz <68273864+yoavkatz@users.noreply.github.com>
Signed-off-by: Benjamin Sznajder <benjams@il.ibm.com>

* fixed the score_ci inconsistency issue (#1065)

* suggested fix for score_ci inconsistency issue

Signed-off-by: dafnapension <dafnashein@yahoo.com>

* unify with the update, and thus simplified the check

Signed-off-by: dafnapension <dafnashein@yahoo.com>
---------

Signed-off-by: dafnapension <dafnashein@yahoo.com>
Signed-off-by: Benjamin Sznajder <benjams@il.ibm.com>

* Use of conventional python types in input definition of tasks and metrics (#1045)

* Fix data classes not supporting field overriding in fields containing types or functions

Signed-off-by: elronbandel <elron.bandel@ibm.com>

* Make tasks types python types

Signed-off-by: elronbandel <elron.bandel@ibm.com>

* Fix errors

Signed-off-by: elronbandel <elron.bandel@ibm.com>

* Some fixes

Signed-off-by: elronbandel <elron.bandel@ibm.com>

* More fixes

Signed-off-by: elronbandel <elron.bandel@ibm.com>

* Update catalog

Signed-off-by: elronbandel <elron.bandel@ibm.com>

* Fix cards

Signed-off-by: elronbandel <elron.bandel@ibm.com>

* Revert change

Signed-off-by: elronbandel <elron.bandel@ibm.com>

* Fix typing in docs with new convention

Signed-off-by: elronbandel <elron.bandel@ibm.com>

* refactor of new asset to new convention

Signed-off-by: elronbandel <elron.bandel@ibm.com>

* Update secrets baseline

Signed-off-by: elronbandel <elron.bandel@ibm.com>

---------

Signed-off-by: elronbandel <elron.bandel@ibm.com>
Signed-off-by: Benjamin Sznajder <benjams@il.ibm.com>

* Added prediction type to llm as judge to avoid warning (#1072)

* Added prediction type to llm as judge to avoid warning

Clarified the standalone llm as judge example

Signed-off-by: Yoav Katz <katz@il.ibm.com>

* Removed accidentally added file

Signed-off-by: Yoav Katz <katz@il.ibm.com>

---------

Signed-off-by: Yoav Katz <katz@il.ibm.com>
Signed-off-by: Benjamin Sznajder <benjams@il.ibm.com>

* Fixed clapnq to check with reasonable error values

Also updated rag tasks to use new typing (instead of string types)

Signed-off-by: Yoav Katz <katz@il.ibm.com>
Signed-off-by: Benjamin Sznajder <benjams@il.ibm.com>

* fix the type hint

Signed-off-by: Benjamin Sznajder <benjams@il.ibm.com>

* update catalog

Signed-off-by: Benjamin Sznajder <benjams@il.ibm.com>

* Add metric "metrics.rag.retrieval_at_k" to catalog (#1074)

* add metric "metrics.rag.retrieval_at_k" to catalog
this is a wrapper around the retrieval_at_k for the ragas scheme

* add corresponding json file for the new metric

---------

Co-authored-by: Elron Bandel <elronbandel@gmail.com>
Signed-off-by: Benjamin Sznajder <benjams@il.ibm.com>

* merge - resolve conflict

Signed-off-by: Benjamin Sznajder <benjams@il.ibm.com>

---------

Signed-off-by: elronbandel <elron.bandel@ibm.com>
Signed-off-by: Benjamin Sznajder <benjams@il.ibm.com>
Signed-off-by: Yoav Katz <katz@il.ibm.com>
Signed-off-by: Yotam Perlitz <yotam.perlitz@ibm.com>
Signed-off-by: ALON HALFON <ALONHAL@il.ibm.com>
Signed-off-by: dafnapension <dafnashein@yahoo.com>
Signed-off-by: Elad Venezian <eladv@il.ibm.com>
Signed-off-by: welisheva22 <welisheva22@gmail.com>
Signed-off-by: Jonathan Bnayahu <bnayahu@il.ibm.com>
Co-authored-by: Elron Bandel <elronbandel@gmail.com>
Co-authored-by: Yoav Katz <68273864+yoavkatz@users.noreply.github.com>
Co-authored-by: Yotam Perlitz <perlitz@gmail.com>
Co-authored-by: Benjamin Sznajder <benjams@il.ibm.com>
Co-authored-by: Alon H <alonh@users.noreply.github.com>
Co-authored-by: dafnapension <dafnashein@yahoo.com>
Co-authored-by: ShirApp <58909189+ShirApp@users.noreply.github.com>
Co-authored-by: Elad <eladv@il.ibm.com>
Co-authored-by: ofirarviv <ofir.arviv@ibm.com>
Co-authored-by: Yotam Perlitz <yotam.perlitz@ibm.com>
Co-authored-by: michal <shmueli@il.ibm.com>
Co-authored-by: dafnapension <46454972+dafnapension@users.noreply.github.com>
Co-authored-by: welisheva22 <welisheva22@gmail.com>
Co-authored-by: Jonathan Bnayahu <bnayahu@il.ibm.com>
Co-authored-by: hanansinger <95229126+hanansinger@users.noreply.github.com>
Co-authored-by: Yoav Katz <katz@il.ibm.com>
Co-authored-by: matanor <55045955+matanor@users.noreply.github.com>
csrajmohan pushed a commit that referenced this pull request Aug 29, 2024
csrajmohan pushed a commit that referenced this pull request Aug 29, 2024

2 participants