
[Feature Branch][DeepSparse Evaluation API] Update lm-eval, perplexity, additional datasets #1580

Merged: 36 commits merged into main from feature/damian/generate_until on Feb 9, 2024

Conversation

@dbogunowicz (Contributor, PR author) commented on Feb 5, 2024:

This PR updates the version of lm-eval from 0.3 to 0.4.
Supported and tested datasets to evaluate on: gsm8k, hellaswag, arc_challenge.

Example usage

Example using CLI (when lm-eval is not installed):

 deepsparse.eval hf:mgoin/TinyStories-1M-ds --dataset hellaswag --dataset arc_challange --limit 2

2024-02-05 13:27:51 deepsparse.evaluation.cli INFO     Creating deepsparse pipeline to evaluate from model path: hf:mgoin/TinyStories-1M-ds
2024-02-05 13:27:51 deepsparse.evaluation.cli INFO     Datasets to evaluate on: ['hellaswag', 'arc_challange']
Batch size: 1
Splits to evaluate on: None
Metrics to evaluate on: None
Additional integration arguments supplied: {'limit': 2}
Fetching 11 files: 100%|█████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████| 11/11 [00:00<00:00, 164189.84it/s]
Fetching 11 files: 100%|██████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████| 11/11 [00:00<00:00, 54535.87it/s]
DeepSparse, Copyright 2021-present / Neuralmagic, Inc. version: 1.7.0.20240104 COMMUNITY | (86c38139) (release) (optimized) (system=avx2, binary=avx2)
Fetching 11 files: 100%|██████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████| 11/11 [00:00<00:00, 92459.61it/s]
2024-02-05 13:27:54 deepsparse.evaluation.registry INFO     No integration specified, inferring the evaluation function from the input arguments...
2024-02-05 13:27:54 deepsparse.evaluation.registry INFO     Inferred the evaluation function: lm-evaluation-harness
Traceback (most recent call last):
  File "/nm/drive0/damian/deepsparse/src/deepsparse/evaluation/integrations/__init__.py", line 20, in try_import_lm_evaluation_harness
    import lm_eval
ModuleNotFoundError: No module named 'lm_eval'

During handling of the above exception, another exception occurred:

Traceback (most recent call last):
  File "/nm/drive0/damian/deepsparse/deepsparse_venv/bin/deepsparse.eval", line 8, in <module>
    sys.exit(main())
  File "/nm/drive0/damian/deepsparse/deepsparse_venv/lib/python3.10/site-packages/click/core.py", line 1157, in __call__
    return self.main(*args, **kwargs)
  File "/nm/drive0/damian/deepsparse/deepsparse_venv/lib/python3.10/site-packages/click/core.py", line 1078, in main
    rv = self.invoke(ctx)
  File "/nm/drive0/damian/deepsparse/deepsparse_venv/lib/python3.10/site-packages/click/core.py", line 1434, in invoke
    return ctx.invoke(self.callback, **ctx.params)
  File "/nm/drive0/damian/deepsparse/deepsparse_venv/lib/python3.10/site-packages/click/core.py", line 783, in invoke
    return __callback(*args, **kwargs)
  File "/nm/drive0/damian/deepsparse/src/deepsparse/evaluation/cli.py", line 193, in main
    result: Result = evaluate(
  File "/nm/drive0/damian/deepsparse/src/deepsparse/evaluation/evaluator.py", line 63, in evaluate
    eval_integration = EvaluationRegistry.resolve(pipeline, datasets, integration)
  File "/nm/drive0/damian/deepsparse/src/deepsparse/evaluation/registry.py", line 72, in resolve
    potentially_check_dependency_import(integration)
  File "/nm/drive0/damian/deepsparse/src/deepsparse/evaluation/utils.py", line 46, in potentially_check_dependency_import
    try_import_lm_evaluation_harness(raise_error=True)
  File "/nm/drive0/damian/deepsparse/src/deepsparse/evaluation/integrations/__init__.py", line 25, in try_import_lm_evaluation_harness
    raise ImportError(
ImportError: Unable to import lm_eval. To install run 'pip install lm-eval==0.4.0'

Example using CLI (with lm-eval installed):

 deepsparse.eval hf:mgoin/TinyStories-1M-ds --dataset hellaswag --dataset arc_challange --limit 2

2024-02-05 13:24:42 deepsparse.evaluation.cli INFO     Creating deepsparse pipeline to evaluate from model path: hf:mgoin/TinyStories-1M-ds
2024-02-05 13:24:42 deepsparse.evaluation.cli INFO     Datasets to evaluate on: ['hellaswag', 'arc_challange']
Batch size: 1
Splits to evaluate on: None
Metrics to evaluate on: None
Additional integration arguments supplied: {'limit': 2}
Fetching 11 files: 100%|██████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████| 11/11 [00:00<00:00, 39911.20it/s]
Fetching 11 files: 100%|██████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████| 11/11 [00:00<00:00, 20042.29it/s]
DeepSparse, Copyright 2021-present / Neuralmagic, Inc. version: 1.7.0.20240104 COMMUNITY | (86c38139) (release) (optimized) (system=avx2, binary=avx2)
Fetching 11 files: 100%|██████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████| 11/11 [00:00<00:00, 31906.88it/s]
2024-02-05 13:24:45 deepsparse.evaluation.registry INFO     No integration specified, inferring the evaluation function from the input arguments...
2024-02-05 13:24:45 deepsparse.evaluation.registry INFO     Inferred the evaluation function: lm-evaluation-harness
2024-02-05:13:24:49,100 WARNING  [__init__.py:194] Some tasks could not be loaded due to missing dependencies. Run with `--verbosity DEBUG` for full details.
2024-02-05:13:24:51,939 WARNING  [__init__.py:194] Some tasks could not be loaded due to missing dependencies. Run with `--verbosity DEBUG` for full details.
2024-02-05 13:24:51 deepsparse.evaluation.integrations.lm_evaluation_harness INFO     Selected Tasks: ['hellaswag']
2024-02-05:13:24:51,940 INFO     [lm_evaluation_harness.py:67] Selected Tasks: ['hellaswag']
2024-02-05:13:24:55,591 INFO     [task.py:340] Building contexts for task on rank 0...
2024-02-05:13:24:55,592 INFO     [evaluator.py:319] Running loglikelihood requests
100%|██████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████| 8/8 [01:11<00:00,  8.98s/it]
2024-02-05 13:26:07 deepsparse.evaluation.cli INFO     Evaluation done. Results:
formatted=[Evaluation(task='lm_evaluation_harness', dataset=Dataset(type=None, name='hellaswag', config={'model': 'DeepSparseLM', 'model_args': None, 'batch_size': 1, 'batch_sizes': [], 'device': None, 'use_cache': None, 'limit': 2, 'bootstrap_iters': 100000, 'gen_kwargs': None}, split=None), metrics=[Metric(name='acc,none', value=0.0), Metric(name='acc_stderr,none', value=0.0), Metric(name='acc_norm,none', value=1.0), Metric(name='acc_norm_stderr,none', value=0.0)], samples=None)]
2024-02-05 13:26:07 deepsparse.evaluation.cli INFO     Saving the evaluation results to /nm/drive0/damian/deepsparse/result.json
2024-02-05:13:26:07,507 INFO     [cli.py:212] Saving the evaluation results to /nm/drive0/damian/deepsparse/result.json

Example using evaluate function:

from deepsparse import evaluate

out = evaluate(
    model="hf:mgoin/TinyStories-1M-ds",
    datasets=["hellaswag", "arc_challenge"],
    limit=2,
)
print(out)
Fetching 11 files: 100%|██████████| 11/11 [00:00<00:00, 131820.98it/s]
Fetching 11 files: 100%|██████████| 11/11 [00:00<00:00, 151767.58it/s]
DeepSparse, Copyright 2021-present / Neuralmagic, Inc. version: 1.7.0.20240104 COMMUNITY | (86c38139) (release) (optimized) (system=avx2, binary=avx2)
Fetching 11 files: 100%|██████████| 11/11 [00:00<00:00, 35654.83it/s]
2024-02-05 13:09:34 deepsparse.evaluation.registry INFO     No integration specified, inferring the evaluation function from the input arguments...
2024-02-05 13:09:34 deepsparse.evaluation.registry INFO     Inferred the evaluation function: lm-evaluation-harness
2024-02-05:13:09:38,769 WARNING  [__init__.py:194] Some tasks could not be loaded due to missing dependencies. Run with `--verbosity DEBUG` for full details.
2024-02-05:13:09:41,599 WARNING  [__init__.py:194] Some tasks could not be loaded due to missing dependencies. Run with `--verbosity DEBUG` for full details.
2024-02-05 13:09:41 deepsparse.evaluation.integrations.lm_evaluation_harness INFO     Selected Tasks: ['arc_challenge', 'hellaswag']
2024-02-05:13:09:41,601 INFO     [lm_evaluation_harness.py:67] Selected Tasks: ['arc_challenge', 'hellaswag']
2024-02-05:13:09:48,822 INFO     [task.py:340] Building contexts for task on rank 0...
2024-02-05:13:09:48,829 INFO     [task.py:340] Building contexts for task on rank 0...
2024-02-05:13:09:48,832 INFO     [evaluator.py:319] Running loglikelihood requests
100%|██████████| 16/16 [05:34<00:00, 20.92s/it]
formatted=[Evaluation(task='lm_evaluation_harness', dataset=Dataset(type=None, name='arc_challenge', config={'model': 'DeepSparseLM', 'model_args': None, ...
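
For reference, a minimal sketch of reading the metrics back out of the returned Result object. It assumes the formatted, dataset, metrics, name, and value attributes behave as shown in the repr above; this is an illustration, not a documented API guarantee.

# Hypothetical sketch: walk the Result printed above and list each metric.
# Assumes out.formatted is a list of Evaluation objects exposing .dataset.name
# and .metrics, as suggested by the repr in the log output.
for evaluation in out.formatted:
    print(f"dataset: {evaluation.dataset.name}")
    for metric in evaluation.metrics:
        print(f"  {metric.name}: {metric.value}")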

Example running unit tests (requires lm-eval==0.4 to be installed):

damian@gpuserver6:/nm/drive0/damian/deepsparse$ pytest tests/deepsparse/evaluation/integrations/test_lm_evaluation_harness.py 
================================================================================================================================ test session starts ================================================================================================================================
platform linux -- Python 3.10.12, pytest-7.4.3, pluggy-1.3.0
rootdir: /nm/drive0/damian/deepsparse
configfile: pyproject.toml
plugins: flaky-3.7.0, anyio-3.7.1
collected 8 items                                                                                                                                                                                                                                                                   

tests/deepsparse/evaluation/integrations/test_lm_evaluation_harness.py ........                                                                                                                                                                                               [100%]

==================================================================================================================== 8 passed, 19 warnings in 302.35s (0:05:02) =====================================================================================================================

@@ -79,24 +81,66 @@ def if_generative_language_model(pipeline: Pipeline) -> bool:
     return False


-def args_to_dict(args: Tuple[Any, ...]) -> Dict[str, Any]:
+def parse_kwarg_tuples(kwargs: tuple) -> Dict:
@dbogunowicz (Contributor, PR author) commented:

This is a 1:1 copy of the same function from SparseML. As proposed by @rahul-tuli, in the future let's move it into nm-utils so it can be shared between DeepSparse and SparseML.
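
For illustration only, a rough sketch of what such a kwarg-tuple parser could look like (hypothetical, not the SparseML implementation): it folds the extra CLI tokens, e.g. ('--limit', '2'), into the keyword dictionary that appears in the logs above as {'limit': 2}.

import ast
from typing import Any, Dict, Tuple


def parse_kwarg_tuples_sketch(kwargs: Tuple[str, ...]) -> Dict[str, Any]:
    # Illustrative only: fold ("--limit", "2")-style pairs into {"limit": 2}.
    parsed: Dict[str, Any] = {}
    for key, value in zip(kwargs[::2], kwargs[1::2]):
        key = key.lstrip("-")
        try:
            # best-effort conversion of "2" -> 2, "[1, 2]" -> [1, 2], etc.
            value = ast.literal_eval(value)
        except (ValueError, SyntaxError):
            pass  # keep the raw string if it is not a Python literal
        parsed[key] = value
    return parsed


print(parse_kwarg_tuples_sketch(("--limit", "2")))  # {'limit': 2}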

Base automatically changed from feature/damian/ui_improvements to main February 5, 2024 15:56

-LM_EVALUATION_HARNESS = "lm-evaluation-harness"
 _LOGGER = logging.getLogger(__name__)
+LM_EVALUATION_HARNESS = "lm-eval-harness"
@mgoin (Member) commented on Feb 6, 2024:

Considering lm-eval-harness is still the name of the repo, I would propose to keep it as lm-eval-harness. I think if you'd like to alias lm_eval as well since that is the name of their CLI command, that would be fine


@dbogunowicz (Contributor, PR author) commented:

Yes, this is something that was introduced in this PR, right? We are indeed committing to the name 'lm-eval-harness'.

@mgoin (Member) commented:

Sorry, there are multiple typos in my comment. I meant to keep lm-evaluation-harness and contest the change to lm-eval-harness.

@dbogunowicz (Contributor, PR author) commented:

@mgoin note: this comes from the docs: https://neuralmagic.github.io/docs-v2/get-started/deploy (bottom of the page). I am happy to change this to whatever the product team sees fit.

@dbogunowicz (Contributor, PR author) commented:

Let's get this in in its current state; I can always change this detail later if needed.

@@ -24,8 +24,7 @@ def try_import_lm_evaluation_harness(raise_error=False):
     if raise_error:
         raise ImportError(
             "Unable to import lm_eval. "
-            "To install run 'pip install "
-            "git+https://github.com/EleutherAI/lm-evaluation-harness@b018a7d51'"
+            "To install run 'pip install lm-eval==0.4.0'"
@mgoin (Member) commented:

When or how will this error during normal use if raise_error=False by default? Once the eval actually begins?

@dbogunowicz (Contributor, PR author) commented on Feb 7, 2024:

I see. Good point. Yes, I will change the default behavior of this function and set raise_error to True.

This is the intended behavior when the actual eval is being run. At runtime, when the user intends to use lm-eval, the module will attempt a hot import of lm-eval. If it fails to find the dependency installed, it will raise the error.

However, when testing, I do not want to raise errors, but rather use the boolean output of this function to skip the tests that require lm-eval to be installed.
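
For reference, a minimal sketch of the guarded-import pattern being discussed, based only on the diff and traceback shown above; the default value of raise_error follows the change proposed in this thread rather than any particular released version.

def try_import_lm_evaluation_harness(raise_error: bool = True) -> bool:
    # Sketch of the guarded import discussed above: raise at eval time,
    # return a boolean when callers (e.g. tests) only need to probe availability.
    try:
        import lm_eval  # noqa: F401

        return True
    except ImportError as import_error:
        if raise_error:
            raise ImportError(
                "Unable to import lm_eval. "
                "To install run 'pip install lm-eval==0.4.0'"
            ) from import_error
        return False


# In tests, the boolean form can be used to skip lm-eval dependent cases, e.g.:
# @pytest.mark.skipif(not try_import_lm_evaluation_harness(raise_error=False),
#                     reason="lm-eval is not installed")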

bfineran previously approved these changes Feb 8, 2024
dbogunowicz and others added 2 commits February 9, 2024 12:17
* initial commit

* Update src/deepsparse/evaluation/integrations/__init__.py

* design ready, time to define additional features

* split prep_for_generation operator

* fix logits

* update non-kv cache pipeline and tests

* add tests to address edge cases

* add condition to check of kv_cache full during prompt inference, add test to cover this case, revert debugging changes

* fix typing

* remove commented code

* remove irrelevant condition

* perplexity for non-kv cache pipelines works!

* logic is working

* ready for review

* [DeepSparse Evaluation API] Perplexity eval support for `openai_humaneval`, `c4`, `wikitext2` (#1586)

* fix tests 2

* initial commit

* add return to a function

* make script more robust

---------

Co-authored-by: Dipika Sikka <dipikasikka1@gmail.com>
@bfineran bfineran changed the title [DeepSparse Evaluation API] Update lm-eval support from 0.3 to 0.4 [Feature Branch][DeepSparse Evaluation API] Update lm-eval, perplexity, additional datasets Feb 9, 2024
bfineran previously approved these changes Feb 9, 2024
@bfineran merged commit 517fd15 into main on Feb 9, 2024
13 checks passed
@bfineran deleted the feature/damian/generate_until branch on February 9, 2024 at 17:11
dbogunowicz added a commit that referenced this pull request Feb 12, 2024
bfineran pushed a commit that referenced this pull request Feb 12, 2024