# Databricks notebook source
# MAGIC %md
# MAGIC
# MAGIC ## Basic Prompting & Evaluation
# MAGIC <hr/>
# MAGIC <img src="https://promptengineeringdbl.blob.core.windows.net/img/header.png"/>
# MAGIC
# MAGIC <br/>
# MAGIC <hr/>
# MAGIC
# MAGIC * Now that we know a bit more about the possible parameters used in LLMs, let's start with some basic Prompt Engineering experiments.
# MAGIC * For this first part, we will use relatively lightweight LLMs - 1B parameters or fewer:
# MAGIC   * [gpt2-large](https://huggingface.co/gpt2-large) (774M parameters)
# MAGIC   * [DialoGPT-large](https://huggingface.co/microsoft/DialoGPT-large?text=Hey+my+name+is+Julien%21+How+are+you%3F) (762M parameters)
# MAGIC   * [bloom-560m](https://huggingface.co/bigscience/bloom-560m) (560M parameters)
# MAGIC * We will leverage [Hugging Face](https://huggingface.co) to obtain the model weights, along with the `transformers` library.
# MAGIC * For prompt tracking, evaluation and LLM comparison, we will leverage some of the new MLflow LLM features, such as `mlflow.evaluate()`.
# MAGIC * For the sake of simplicity, we're dealing here with a specific case of prompt engineering, where we provide some **context** - examples of the questions and outputs we expect. Concretely, this is known as [Few-Shot Learning](https://en.wikipedia.org/w/index.php?title=Few-shot_learning_%28natural_language_processing%29).
# MAGIC * For more details on Few-Shot Learning and Generative AI, you can browse some of [Databricks' training offerings around the topic](https://www.databricks.com/blog/now-available-new-generative-ai-learning-offerings).
# COMMAND ----------
# DBTITLE 1,Installing dependencies
# MAGIC %pip install transformers accelerate torch mlflow xformers -q
# COMMAND ----------
dbutils.library.restartPython()
# COMMAND ----------
# DBTITLE 1,Instantiating our models
# Wrapper class for Transformers models
from models.transformer import PyfuncTransformer

gcfg = {
    "max_length": 180,
    "max_new_tokens": 10,
    "do_sample": False,
}

example = (
    "Q: Are elephants larger than mice?\nA: Yes.\n\n"
    "Q: Are mice carnivorous?\nA: No, mice are typically omnivores.\n\n"
    "Q: What is the average lifespan of an elephant?\nA: The average lifespan of an elephant in the wild is about 60 to 70 years.\n\n"
    "Q: Is Mount Everest the highest mountain in the world?\nA: Yes.\n\n"
    "Q: Which city is known as the 'City of Love'?\nA: Paris is often referred to as the 'City of Love'.\n\n"
    "Q: What is the capital of Australia?\nA: The capital of Australia is Canberra.\n\n"
    "Q: Who wrote the novel '1984'?\nA: The novel '1984' was written by George Orwell.\n\n"
)

gpt2large = PyfuncTransformer(
    "gpt2-large",
    gen_config_dict=gcfg,
    examples=example,
)

dialogpt = PyfuncTransformer(
    "microsoft/DialoGPT-large",
    gen_config_dict=gcfg,
    examples=example,
)

bloom560 = PyfuncTransformer(
    "bigscience/bloom-560m",
    gen_config_dict=gcfg,
    examples=example,
)
# COMMAND ----------
# MAGIC %md
# MAGIC
# MAGIC ## Logging Models into MLflow
# MAGIC
# MAGIC <br />
# MAGIC
# MAGIC * Model logging in MLflow is essentially version control for Machine Learning models: it guarantees reproducibility by recording model and environment details.
# MAGIC * Model logging also allows us to track (and compare) model versions, so we can refer back to older models and see what effect our changes have.
# MAGIC * More details about MLflow Experiment Tracking and Model Registry can be found in MLflow documentation:
# MAGIC * [Experiment Tracking](https://mlflow.org/docs/latest/tracking.html)
# MAGIC * [Model Registry](https://mlflow.org/docs/latest/model-registry.html)
# COMMAND ----------
import mlflow
import time

mlflow.set_experiment(experiment_name="/Shared/compare_small_models")

run_ids = []
artifact_paths = []
model_names = ["gpt2large", "dialogpt", "bloom560"]

for model, name in zip([gpt2large, dialogpt, bloom560], model_names):
    with mlflow.start_run(run_name=f"log_model_{name}_{time.time_ns()}"):
        artifact_path = f"models/{name}"
        mlflow.pyfunc.log_model(
            artifact_path=artifact_path,
            python_model=model,
            input_example="Q: What color is the sky?\nA:",
        )
        run_ids.append(mlflow.active_run().info.run_id)
        artifact_paths.append(artifact_path)
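# COMMAND ----------
# DBTITLE 1,Optional: loading a logged model back
# A minimal sketch: any of the logged models can be reloaded from its run URI and
# queried like a regular pyfunc model. The exact input shape that predict()
# expects depends on the wrapper implementation in models/transformer.py; here we
# assume a one-column DataFrame of prompts, mirroring the input example above.
import pandas as pd

loaded = mlflow.pyfunc.load_model(f"runs:/{run_ids[0]}/{artifact_paths[0]}")
loaded.predict(pd.DataFrame({"question": ["Q: What color is the sky?\nA:"]}))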
# COMMAND ----------
# MAGIC %md
# MAGIC
# MAGIC By now you should be able to see our MLflow experiment, along with the different runs we created for each model:
# MAGIC
# MAGIC <hr />
# MAGIC
# MAGIC <img src="https://github.com/rafaelvp-db/databricks-llm-workshop/blob/main/img/experiment.png?raw=true" />
# COMMAND ----------
# DBTITLE 1,Defining our evaluation prompts
import pandas as pd

eval_df = pd.DataFrame(
    {
        "question": [
            "Q: What color is the sky?\nA:",
            "Q: Are trees plants or animals?\nA:",
            "Q: What is 2+2?\nA:",
            "Q: Who is Darth Vader?\nA:",
            "Q: What is your favorite color?\nA:",
        ]
    }
)
eval_df
# COMMAND ----------
# MAGIC %md
# MAGIC
# MAGIC ### Evaluation
# MAGIC
# MAGIC Let's loop through our models and evaluate the answers to our prompts. We will reuse the runs created in the previous cells - this is not strictly required, but it ensures that all elements appear together in MLflow's UI in a clean format.
# COMMAND ----------
run_ids
# COMMAND ----------
for i in range(3):
    # Reopen the run with the stored run ID so the evaluation results are
    # attached to the same run as the logged model
    with mlflow.start_run(run_id=run_ids[i]):
        model_uri = f"runs:/{run_ids[i]}/{artifact_paths[i]}"
        evaluation_results = mlflow.evaluate(
            model=model_uri,
            model_type="text",
            data=eval_df,
        )
# COMMAND ----------
# MAGIC %md
# MAGIC
# MAGIC * Further below, we'll make one other change: we'll save the evaluation data as an [MLflow Dataset](https://mlflow.org/docs/latest/python_api/mlflow.data.html#mlflow.data.dataset.Dataset) with `eval_data = mlflow.data.from_pandas(eval_df, name="evaluate_configurations")` and then refer to this dataset in our `evaluate()` call, explicitly associating the dataset with the evaluation.
# MAGIC * We can then retrieve the dataset information from the run in the future if needed, ensuring that we don't lose track of the data used in the evaluation.
# MAGIC * If we browse through our experiment page once again, we can see that we have the generation results for each of the models that we evaluated:
# MAGIC
# MAGIC <hr />
# MAGIC
# MAGIC <img src="https://github.com/rafaelvp-db/databricks-llm-workshop/blob/main/img/evaluate1.png?raw=true" />
# COMMAND ----------
# MAGIC %md
# MAGIC
# MAGIC ## Keeping track of generation configurations
# MAGIC
# MAGIC <hr />
# MAGIC
# MAGIC * Our `PyfuncTransformerWithParams` class supports passing **generation configuration** parameters at inference time.
# MAGIC * Logging these configurations in addition to our inputs and outputs can be quite useful, as we can compare how different configurations affect our generation outputs and quality.
# MAGIC * Let's do another sequence of runs in order to properly track these parameters, along with inputs and outputs for each of our models.
# MAGIC * This time, we'll focus on `bloom560` and use different configuration values - mainly changing `do_sample` to `True` and experimenting with `top_k` sampling.
# COMMAND ----------
import json

config_dict1 = {
    "do_sample": True,
    "top_k": 10,
    "max_length": 180,
    "max_new_tokens": 10,
}

config_dict2 = {
    "do_sample": False,
    "max_length": 180,
    "max_new_tokens": 10,
}

few_shot_examples_1 = (
    "Q: Are elephants larger than mice?\nA: Yes.\n\n"
    "Q: Are mice carnivorous?\nA: No, mice are typically omnivores.\n\n"
    "Q: What is the average lifespan of an elephant?\nA: The average lifespan of an elephant in the wild is about 60 to 70 years.\n\n"
)

few_shot_examples_2 = (
    "Q: Is Mount Everest the highest mountain in the world?\nA: Yes.\n\n"
    "Q: Which city is known as the 'City of Love'?\nA: Paris is often referred to as the 'City of Love'.\n\n"
    "Q: What is the capital of Australia?\nA: The capital of Australia is Canberra.\n\n"
    "Q: Who wrote the novel '1984'?\nA: The novel '1984' was written by George Orwell.\n\n"
)

few_shot_examples = [few_shot_examples_1, few_shot_examples_2]
config_dicts = [config_dict1, config_dict2]

questions = [
    "Q: What color is the sky?\nA:",
    "Q: Are trees plants or animals?\nA:",
    "Q: What is 2+2?\nA:",
    "Q: Who is Darth Vader?\nA:",
    "Q: What is your favorite color?\nA:",
]

# Pair every question with each few-shot set and its matching generation config:
# rows 0-4 use few_shot_examples_1 with config_dict1, rows 5-9 use
# few_shot_examples_2 with config_dict2
data = {
    "input_text": questions * len(few_shot_examples),
    "few_shot_examples": [
        example for example in few_shot_examples for _ in range(len(questions))
    ],
    "config_dict": [
        json.dumps(config)
        for config in config_dicts
        for _ in range(len(questions))
    ],
}
eval_df = pd.DataFrame(data)
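# COMMAND ----------
# Quick sanity check: 5 questions x 2 (few-shot set, generation config) pairs
# should yield a 10-row evaluation frame
eval_df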
# COMMAND ----------
from models.transformer import PyfuncTransformerWithParams

bloom560_with_params = PyfuncTransformerWithParams("bigscience/bloom-560m")

mlflow.set_experiment(experiment_name="/Shared/compare_generation_params")
model_name = "bloom560"

with mlflow.start_run(run_name=f"log_model_{model_name}_{time.time_ns()}"):
    # Define an input example; each column holds a one-element list so pandas
    # builds a valid one-row DataFrame
    input_example = pd.DataFrame(
        {
            "input_text": ["Q: What color is the sky?\nA:"],
            "few_shot_examples": [example],  # the few-shot prompts defined earlier
            "config_dict": [json.dumps({})],  # empty generation config for this example
        }
    )
    # No need to compute perplexity by hand here: mlflow.evaluate() with
    # model_type="text" logs perplexity (and other text metrics) automatically
    # Define the artifact_path
    artifact_path = f"models/{model_name}"
    # Log the data as an MLflow Dataset so it is explicitly associated with the run
    eval_data = mlflow.data.from_pandas(eval_df, name="evaluate_configurations")
    # Log the model
    mlflow.pyfunc.log_model(
        artifact_path=artifact_path,
        python_model=bloom560_with_params,
        input_example=input_example,
    )
    # Define the model_uri
    model_uri = f"runs:/{mlflow.active_run().info.run_id}/{artifact_path}"
    # Evaluate the model
    mlflow.evaluate(model=model_uri, model_type="text", data=eval_data)
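# COMMAND ----------
# DBTITLE 1,Optional: retrieving the dataset info from the run
# A minimal sketch (assuming MLflow >= 2.4): once the Dataset has been associated
# with a run, its name and digest can be read back from the run's recorded inputs.
run = mlflow.get_run(mlflow.last_active_run().info.run_id)
for dataset_input in run.inputs.dataset_inputs:
    print(dataset_input.dataset.name, dataset_input.dataset.digest)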
# COMMAND ----------
# DBTITLE 1,Visualizing Evaluation Results: Few Shot Examples
# MAGIC %md
# MAGIC
# MAGIC For our new experiment, we can now visualize not only the **inputs** and **outputs** for each of our model runs, but also our **few-shot examples**:
# MAGIC
# MAGIC <br />
# MAGIC
# MAGIC <img src="https://github.com/rafaelvp-db/databricks-llm-workshop/blob/main/img/evaluate2.png?raw=true" />
# COMMAND ----------
# DBTITLE 1,Visualizing Evaluation Results: Generation Configurations
# MAGIC %md
# MAGIC
# MAGIC On top of the different **few-shot examples** used, we also have access to the different **generation configuration parameters**:
# MAGIC
# MAGIC <img src="https://github.com/rafaelvp-db/databricks-llm-workshop/blob/main/img/evaluate3.png?raw=true" />
# COMMAND ----------
# MAGIC %md
# MAGIC
# MAGIC ## References
# MAGIC
# MAGIC <hr />
# MAGIC
# MAGIC * [Comparing LLMs with MLflow](https://medium.com/@dliden/comparing-llms-with-mlflow-1c69553718df)
# MAGIC * [Announcing MLflow 2.4: LLMOps Tools for Robust Model Evaluation](https://www.databricks.com/blog/announcing-mlflow-24-llmops-tools-robust-model-evaluation)