Jun 30 Public Preview for Generative AI (#2409)
* Creating folder structure

* Add RAG docs

* Add rag notebooks.

* add create faiss index notebook

* Apply black formatting to rag notebooks.

* change to aka link

---------

Co-authored-by: Lucas Pickup <lupickup@microsoft.com>
Co-authored-by: qiqicui <qiqicui@microsoft.com>
3 people committed Jun 30, 2023
1 parent a8c4255 commit 7348716
Showing 30 changed files with 5,223 additions and 0 deletions.
237 changes: 237 additions & 0 deletions sdk/python/generative-ai/promptflow/create_faiss_index.ipynb
@@ -0,0 +1,237 @@
{
"cells": [
{
"attachments": {},
"cell_type": "markdown",
"metadata": {},
"source": [
"# An example code for creating Faiss index\n",
"\n",
"Efficient text retrieval and matching from a large volume of text are crucial for building a Q&A system. One common method is to convert the text into vector representations and create index based on these vectors, which enables fast retrieval by utilizing the similarity between vectors. This example demonstrates the process of spliting the document into small chunks, leveraging an embedding store to convert the text into vectors and generating Faiss index."
]
},
{
"attachments": {},
"cell_type": "markdown",
"metadata": {},
"source": [
"## Install Embeddingstore SDK"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"%pip install embeddingstore --extra-index-url https://azuremlsdktestpypi.azureedge.net/embeddingstore/"
]
},
{
"attachments": {},
"cell_type": "markdown",
"metadata": {},
"source": [
"## Import required libraries"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"import os\n",
"from typing import List\n",
"import urllib.request\n",
"from bs4 import BeautifulSoup\n",
"from langchain.text_splitter import RecursiveCharacterTextSplitter\n",
"\n",
"from embeddingstore.core.contracts import (\n",
" EmbeddingModelType,\n",
" StorageType,\n",
" StoreCoreConfig,\n",
")\n",
"from embeddingstore.core.embeddingstore_core import EmbeddingStoreCore"
]
},
{
"attachments": {},
"cell_type": "markdown",
"metadata": {},
"source": [
"## Prepare your data\n",
"For convenience, a few Azure Machine Learning documentation webpages are selected here as sample data. You can replace them with your own dataset.\n"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"URL_PREFIX = \"https://learn.microsoft.com/en-us/azure/machine-learning/\"\n",
"URL_NAME_LIST = [\n",
" \"tutorial-azure-ml-in-a-day\",\n",
" \"overview-what-is-azure-machine-learning\",\n",
" \"concept-v2\",\n",
"]"
]
},
{
"attachments": {},
"cell_type": "markdown",
"metadata": {},
"source": [
"Download the data to local path."
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"local_file_path = os.path.join(os.getcwd(), \"data\")\n",
"os.makedirs(local_file_path, exist_ok=True)\n",
"for url_name in URL_NAME_LIST:\n",
" url = os.path.join(URL_PREFIX, url_name)\n",
" destination_path = os.path.join(local_file_path, url_name)\n",
" urllib.request.urlretrieve(url, destination_path)"
]
},
{
"attachments": {},
"cell_type": "markdown",
"metadata": {},
"source": [
"## Configure and create an embedding store\n",
"Embedding store sdk supports multiple types of embedding models (Azure OpenAI, OpenAI) and multiple types of store path (local path, HTTP URL, Azure blob). In this example, configure an embedding store with Azure OpenAI embedding model and local store path.\n",
"\n",
"Please refer to [create a resource and deploy a model using Azure OpenAI](https://learn.microsoft.com/en-us/azure/cognitive-services/openai/how-to/create-resource?pivots=web-portal) to set up an AOAI embedding model deployment. The output vector returned by different embedding models has different dimensions. It is recommended to deploy `text-embedding-ada-002` model, and the dimension of the output vector returned by this model is 1536. \n",
"\n",
"To use AOAI model, please store `Azure_OpenAI_MODEL_ENDPOINT` and `Azure_OpenAI_MODEL_API_KEY` as environment variables."
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"MODEL_API_VERSION = \"2023-05-15\"\n",
"MODEL_DEPLOYMENT_NAME = \"text-embedding-ada-002\"\n",
"DIMENSION = 1536\n",
"\n",
"# Configure an embedding store to store index file.\n",
"store_path = os.path.join(os.getcwd(), \"faiss_index_store\")\n",
"config = StoreCoreConfig.create_config(\n",
" storage_type=StorageType.LOCAL,\n",
" store_identifier=store_path,\n",
" model_type=EmbeddingModelType.AOAI,\n",
" model_api_base=os.environ[\"Azure_OpenAI_MODEL_ENDPOINT\"],\n",
" model_api_key=os.environ[\"Azure_OpenAI_MODEL_API_KEY\"],\n",
" model_api_version=MODEL_API_VERSION,\n",
" model_name=MODEL_DEPLOYMENT_NAME,\n",
" dimension=DIMENSION,\n",
" create_if_not_exists=True,\n",
")\n",
"store = EmbeddingStoreCore(config)"
]
},
{
"attachments": {},
"cell_type": "markdown",
"metadata": {},
"source": [
"## Split document to chunks, embed chunks and create Faiss index."
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"def get_file_chunks(file_name: str) -> List[str]:\n",
" with open(file_name, \"r\", encoding=\"utf-8\") as f:\n",
" page_content = f.read()\n",
" # use BeautifulSoup to parse HTML content\n",
" soup = BeautifulSoup(page_content, \"html.parser\")\n",
" text = soup.get_text(\" \", strip=True)\n",
" chunks = []\n",
" splitter = RecursiveCharacterTextSplitter(chunk_size=1024, chunk_overlap=10)\n",
" for chunk in splitter.split_text(text):\n",
" chunks.append(chunk)\n",
" return chunks"
]
},
{
"attachments": {},
"cell_type": "markdown",
"metadata": {},
"source": [
"When inserting chunks into embedding store, the chunks are transformed into embeddings and Faiss index is generated under the store path."
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"for root, _, files in os.walk(local_file_path):\n",
" for file in files:\n",
" each_file_path = os.path.join(root, file)\n",
"\n",
" # Split the file into chunks.\n",
" chunks = get_file_chunks(each_file_path)\n",
" count = len(chunks)\n",
" if URL_PREFIX is not None:\n",
" metadatas = [\n",
" {\"title\": file, \"source\": os.path.join(URL_PREFIX, file)}\n",
" ] * count\n",
" else:\n",
" metadatas = [{\"title\": file}] * count\n",
"\n",
" # Embed chunks into embeddings, generate index in embedding store.\n",
" # If your data is large, inserting too many chunks at once may cause\n",
" # rate limit error,you can refer to the following link to find solution\n",
" # https://learn.microsoft.com/en-us/azure/cognitive-services/openai/quotas-limits\n",
" store.batch_insert_texts(chunks, metadatas)\n",
" print(f\"Create index for {file} file successfully.\\n\")"
]
},
{
"attachments": {},
"cell_type": "markdown",
"metadata": {},
"source": [
"## Next step\n",
"Now you have successfully created Faiss index. To build a complete Q&A system, you can use [Faiss Index Lookup tool](https://aka.ms/faiss_index_lookup_tool) to search relavant texts from the created index by [Azure Machine Learning Prompt Flow](https://aka.ms/AMLPromptflow)."
]
}
],
"metadata": {
"kernelspec": {
"display_name": "amlsubmitjobs",
"language": "python",
"name": "python3"
},
"language_info": {
"codemirror_mode": {
"name": "ipython",
"version": 3
},
"file_extension": ".py",
"mimetype": "text/x-python",
"name": "python",
"nbconvert_exporter": "python",
"pygments_lexer": "ipython3",
"version": "3.9.16"
},
"orig_nbformat": 4
},
"nbformat": 4,
"nbformat_minor": 2
}
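The notebook above delegates chunking to langchain's `RecursiveCharacterTextSplitter`. To illustrate what the `chunk_size` and `chunk_overlap` parameters mean, here is a minimal sliding-window splitter in plain Python — a simplified sketch of the idea, not langchain's recursive, separator-aware algorithm:

```python
def split_text(text: str, chunk_size: int = 1024, chunk_overlap: int = 10) -> list:
    """Greedy fixed-window splitter: each chunk is at most chunk_size
    characters and overlaps the previous chunk by chunk_overlap characters."""
    if chunk_size <= chunk_overlap:
        raise ValueError("chunk_size must be larger than chunk_overlap")
    chunks = []
    start = 0
    step = chunk_size - chunk_overlap  # advance by this much each iteration
    while start < len(text):
        chunks.append(text[start : start + chunk_size])
        start += step
    return chunks

text = "abcdefghij" * 30  # 300 characters of sample text
chunks = split_text(text, chunk_size=100, chunk_overlap=10)
print(len(chunks))      # -> 4
print(len(chunks[0]))   # -> 100
```

The overlap ensures a sentence cut at a chunk boundary still appears whole in one of the two adjacent chunks, which helps retrieval quality at the cost of some duplicated embedding work.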
34 changes: 34 additions & 0 deletions sdk/python/generative-ai/rag/README.md
@@ -0,0 +1,34 @@
# RAG (Retrieval Augmented Generation) with AzureML

## Overview

Retrieval Augmented Generation (RAG) is a technique for augmenting large language models (LLMs) with business-specific data. RAG enables businesses to harness the power of LLMs, such as text generation, summarization, and conversation, in the context of their enterprise or domain-specific data.

View more details about [what is Retrieval Augmented Generation (RAG)](./what-is-rag.md).

## Getting started

Follow the quick start guide: [Retrieval Augmented Generation using Azure Machine Learning prompt flow](./rag-quick-start.md).

## Documentation

This folder contains documentation about RAG. The following table lists the available documents.

| Category | Article |
|----------------|----------------|
|Overview|[What is Retrieval Augmented Generation](./what-is-rag.md)|
|Quick start|[Retrieval Augmented Generation using Azure Machine Learning prompt flow](./rag-quick-start.md)|
|Notebooks|[Process Git Repo into Azure Cognitive Search with Embeddings](./examples/notebooks/azure_cognitive_search/acs_mlindex_with_langchain.ipynb)|
|Notebooks|[Process private Git Repo into FAISS Embeddings Index](./examples/notebooks/faiss/faiss_mlindex_with_langchain.ipynb)|
|Notebooks|[QA Test Generation](./examples/notebooks/qa_data_generation.ipynb)|
|Notebooks| [Productionize Vector Index with Test Data Generation, Auto Prompt, Evaluations and Prompt Flow](./examples/notebooks/mlindex_with_testgen_autoprompt.ipynb)|

## Feedback and support

We'd love to hear your feedback, which will help us improve the product.

You can reach out to azuremlpreviews@microsoft.com with any questions, or share feedback using the [feedback form](https://forms.office.com/r/sGTkJ53e72).

## Troubleshoot

To troubleshoot issues, see [TROUBLESHOOT.md](TROUBLESHOOT.md).
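The retrieval half of RAG boils down to nearest-neighbor search over embedding vectors. The sketch below shows the core idea with numpy-only cosine similarity; it is an illustration with made-up toy vectors, whereas a real pipeline uses a learned embedding model (e.g. text-embedding-ada-002, 1536 dimensions) and an index such as FAISS or Azure Cognitive Search:

```python
import numpy as np

# Toy corpus with made-up 4-dimensional "embeddings".
docs = ["intro to azure ml", "deploy a model", "pricing overview"]
doc_vecs = np.array([
    [0.9, 0.1, 0.0, 0.1],
    [0.1, 0.9, 0.1, 0.0],
    [0.0, 0.1, 0.9, 0.2],
])

def top_k(query_vec, doc_vecs, k=2):
    """Rank documents by cosine similarity to the query vector."""
    q = query_vec / np.linalg.norm(query_vec)
    d = doc_vecs / np.linalg.norm(doc_vecs, axis=1, keepdims=True)
    sims = d @ q                       # cosine similarity per document
    return np.argsort(sims)[::-1][:k]  # indices of the k best matches

query = np.array([0.2, 0.8, 0.0, 0.1])  # pretend embedding of a question
for i in top_k(query, doc_vecs):
    print(docs[i])  # prints "deploy a model" then "intro to azure ml"
```

The retrieved documents are then stuffed into the LLM prompt as grounding context, which is what the notebooks in this folder automate at scale.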
39 changes: 39 additions & 0 deletions sdk/python/generative-ai/rag/TROUBLESHOOT.md
@@ -0,0 +1,39 @@
# Troubleshoot

- [Troubleshoot](#troubleshoot)
- [AOAI deployment does not exist](#aoai-deployment-does-not-exist)
- [Unable to retrieve OBO tokens for resource](#unable-to-retrieve-obo-tokens-for-resource)
- [No Files Found](#no-files-found)
- [Rate Limit Error from Azure OpenAI](#rate-limit-error-from-azure-openai)

## AOAI deployment does not exist
If a pipeline fails in the llm_rag_validate_deployments component, it indicates that the RAG pipeline is unable to access model deployments through the Azure OpenAI connection. Common causes include:
- An incorrect API base or key for the Azure OpenAI workspace connection.
- A deployment for the selected model name (e.g. text-embedding-ada-002, text-davinci-003, gpt-35-turbo, gpt-4) does not exist on the AOAI resource.
- An incorrect deployment name for the selected AOAI model.

## Unable to retrieve OBO tokens for resource

If a Component (ex. `LLM - Generate QnA Test Data`) fails with the following error in `Outputs + logs > user_logs/std_log.txt`:

> azure.core.exceptions.ClientAuthenticationError: Unexpected content type "text/plain; charset=utf-8"
> Content: Unable to retrieve OBO tokens for resource https://management.azure.com.

It's probably because OBO token caching does not occur if the user is new to a region. A temporary workaround is to create a dataset from the UI and explore it. Follow these steps:

1. Open Data tab and click on Create:
![](./media/troubleshooting-1.png)
2. Set `Name` as `dummy-data` and `Type` as `Table (mltable)`. Hit Next.
3. Select `From a URI`. Hit Next.
4. Copy this [link](https://github.com/raw/Azure/azureml-examples/main/sdk/python/jobs/automl-standalone-jobs/automl-classification-task-bankmarketing/data/training-mltable-folder/) and paste into `URI`:
![](./media/troubleshooting-2.png)
5. Hit Next and then Create. The details page should open up.
6. Click on the `Explore` tab and wait for `Preview` to load:
![](./media/troubleshooting-3.png)
7. Once the `Preview` table loads, the OBO token should have been cached. You may retry whatever action you were attempting before.

## No Files Found
If the llm_rag_crack_and_chunk component fails with a `No Files Found` error, it indicates that the component could not find any of the supported file types in the provided source data. Supported file types include .txt, .pdf, .html, .md, .ppt(x), .doc(x), and .xls(x); any other file types are ignored.

## Rate Limit Error from Azure OpenAI
Components for embedding generation (llm_rag_generate_embeddings_parallel), test data generation (llm_rag_qa_data_generation), and auto prompt (llm_autoprompt_qna) may occasionally fail due to throttling or rate limit errors from Azure OpenAI. These components implement retries with backoff, but will error out if rate limit errors persist. Please retry the job after some time, or ensure no other jobs are using the same Azure OpenAI instance simultaneously.
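The retry-with-backoff behavior these components implement can be sketched as follows. This is a generic illustration, not the components' actual code; `RateLimitError` and the delay values are placeholders:

```python
import random
import time

class RateLimitError(Exception):
    """Placeholder for a throttling error returned by the service."""

def call_with_backoff(fn, max_retries=5, base_delay=1.0):
    """Retry fn on rate-limit errors with exponential backoff plus jitter."""
    for attempt in range(max_retries):
        try:
            return fn()
        except RateLimitError:
            if attempt == max_retries - 1:
                raise  # give up after the final attempt
            # Delay doubles each attempt; jitter avoids synchronized retries.
            delay = base_delay * (2 ** attempt) + random.uniform(0, 0.1)
            time.sleep(delay)

# Example: a flaky call that succeeds on the third attempt.
attempts = {"n": 0}
def flaky():
    attempts["n"] += 1
    if attempts["n"] < 3:
        raise RateLimitError()
    return "ok"

print(call_with_backoff(flaky, base_delay=0.01))  # prints "ok"
```

Backoff smooths out transient throttling, but if the quota itself is exhausted the retries will still fail, which is why reducing concurrent jobs on the same Azure OpenAI instance is the more reliable fix.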
Binary file added sdk/python/generative-ai/rag/media/doc/UI-1.png
Binary file added sdk/python/generative-ai/rag/media/doc/UI-2.png
Binary file added sdk/python/generative-ai/rag/media/doc/UI-3a.png
Binary file added sdk/python/generative-ai/rag/media/doc/UI-4a.png
Binary file added sdk/python/generative-ai/rag/media/doc/UI-5a.png
Binary file added sdk/python/generative-ai/rag/media/doc/UI-6-2.png
Binary file added sdk/python/generative-ai/rag/media/doc/UI-6-4.png
Binary file added sdk/python/generative-ai/rag/media/doc/UI-6-5.png
Binary file added sdk/python/generative-ai/rag/media/doc/UI-6-7.png
Binary file added sdk/python/generative-ai/rag/media/doc/UI-6-9.png
Binary file added sdk/python/generative-ai/rag/media/doc/UI-6a.png
Binary file added sdk/python/generative-ai/rag/media/doc/UI-7-2.png
Binary file added sdk/python/generative-ai/rag/media/doc/UI-7-4.png
18 changes: 18 additions & 0 deletions sdk/python/generative-ai/rag/notebooks/README.md
@@ -0,0 +1,18 @@
# RAG (Retrieval Augmented Generation) with AzureML Components and Pipelines

The following notebooks demonstrate using AzureML components to process your data into a Vector Index and use it with Prompt Flow or LangChain.

| Category | Article |
|----------------|----------------|
|Notebooks|[Process Git Repo into Azure Cognitive Search with Embeddings](./azure_cognitive_search/acs_mlindex_with_langchain.ipynb)|
|Notebooks|[Process private Git Repo into FAISS Embeddings Index](./faiss/faiss_mlindex_with_langchain.ipynb)|
|Notebooks|[QA Test Generation](./qa_data_generation.ipynb)|
|Notebooks|[Productionize Vector Index with Test Data Generation, Auto Prompt, Evaluations and Prompt Flow](./mlindex_with_testgen_autoprompt.ipynb)|
|Notebooks|[Import data from an S3 bucket into an Azure Cognitive Search Index](./azure_cognitive_search/s3_to_acs_mlindex_with_langchain.ipynb)|
|Notebooks|[Update a FAISS based Vector Index on a Schedule](./faiss/scheduled_update_faiss_index.ipynb)|

## Feedback and support

We'd love to hear your feedback, which will help us improve the product.

You can reach out to azuremlpreviews@microsoft.com with any questions, or share feedback using the [feedback form](https://forms.office.com/r/sGTkJ53e72).