Commit

Packaging
louisbrulenaudet committed Aug 8, 2024
1 parent 1c9a9db commit ba6e115
Showing 73 changed files with 4,113 additions and 1,399 deletions.
Binary file modified .DS_Store
Binary file not shown.
108 changes: 92 additions & 16 deletions README.md
@@ -2,13 +2,8 @@

# RAGoon: High-level library for batched embeddings generation, blazingly fast web-based RAG, and quantized index processing ⚡
[![Python](https://img.shields.io/pypi/pyversions/tensorflow.svg)](https://badge.fury.io/py/tensorflow) [![License](https://img.shields.io/badge/License-Apache_2.0-blue.svg)](https://opensource.org/licenses/Apache-2.0) ![Maintainer](https://img.shields.io/badge/maintainer-@louisbrulenaudet-blue)
<a target="_blank" href="https://colab.research.google.com/github/louisbrulenaudet/ragoon/blob/main/RAGoon%20%3A%20Improve%20Large%20Language%20Models%20retrieval%20using%20dynamic%20web-search.ipynb">
<img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/>
</a>

RAGoon is a Python library that aims to improve the performance of language models by providing contextually relevant information through retrieval-based querying, web scraping, and data augmentation techniques. It offers an integration of various APIs, enabling users to retrieve information from the web, enrich it with domain-specific knowledge, and feed it to language models for more informed responses.

RAGoon's core functionality revolves around the concept of few-shot learning, where language models are provided with a small set of high-quality examples to enhance their understanding and generate more accurate outputs. By curating and retrieving relevant data from the web, RAGoon equips language models with the necessary context and knowledge to tackle complex queries and generate insightful responses.
RAGoon is a set of NLP utilities for multi-model embedding production and high-dimensional vector visualization. It also aims to improve language model performance by providing contextually relevant information through search-based querying, web scraping, and data augmentation techniques.

## Quick install
The reference page for RAGoon is available on the official page of PyPI: [RAGoon](https://pypi.org/project/ragoon/).
@@ -17,8 +12,95 @@ The reference page for RAGoon is available on the official page of PyPI: [RAGoon
pip install ragoon
```

## Usage Example
Here's an example of how to use WebRAG:
## Usage

This section provides an overview of different code blocks that can be executed with RAGoon to enhance your NLP and language model projects.

### Embeddings production

The `EmbeddingsDataLoader` class loads a dataset from Hugging Face, adds embeddings computed with the specified models, and provides methods to save and upload the processed dataset.

```python
from ragoon import EmbeddingsDataLoader
from datasets import load_dataset

# Initialize the dataset loader with multiple models
loader = EmbeddingsDataLoader(
    token="hf_token",
    dataset=load_dataset("louisbrulenaudet/dac6-instruct", split="train"),  # If dataset is already loaded.
    # dataset_name="louisbrulenaudet/dac6-instruct",  # If you want to load the dataset from the class.
    model_configs=[
        {"model": "bert-base-uncased", "query_prefix": "Query:"},
        {"model": "distilbert-base-uncased", "query_prefix": "Query:"}
        # Add more model configurations as needed
    ]
)

# Uncomment this line if passing dataset_name instead of dataset.
# loader.load_dataset()

# Process the splits with all models loaded
loader.process(
    column="output",
    preload_models=True
)

# To access the processed dataset
processed_dataset = loader.get_dataset()
print(processed_dataset[0])
```

You can also embed a single text using multiple models:

```python
from ragoon import EmbeddingsDataLoader

# Initialize the dataset loader with multiple models
loader = EmbeddingsDataLoader(
    token="hf_token",
    model_configs=[
        {"model": "bert-base-uncased"},
        {"model": "distilbert-base-uncased"}
    ]
)

# Load models
loader.load_models()

# Embed a single text with all loaded models
text = "This is a single text for embedding."
embedding_result = loader.batch_encode(text)

# Output the embeddings
print(embedding_result)
```
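
As a rough mental model of multi-model embedding, the sketch below adds one embedding column per configured model to each row. It is not RAGoon's implementation: `toy_encode`, `add_embeddings`, and the `<model>_embeddings` column naming are hypothetical, and a hash-based stand-in replaces the real models so the example runs with the standard library only.

```python
import hashlib


def toy_encode(text, model_name, dim=4):
    """Hypothetical stand-in for a real model: a deterministic pseudo-embedding."""
    digest = hashlib.sha256(f"{model_name}:{text}".encode("utf-8")).digest()
    # Map the first `dim` bytes of the hash to floats in [0, 1].
    return [byte / 255.0 for byte in digest[:dim]]


def add_embeddings(rows, column, model_configs):
    """Return copies of `rows` with one extra embedding column per model config."""
    processed = []
    for row in rows:
        new_row = dict(row)
        for config in model_configs:
            # Prepend the optional prefix, mirroring the "query_prefix" key above.
            prefix = config.get("query_prefix", "")
            text = f"{prefix} {row[column]}".strip()
            # The column name is an assumption made for this illustration.
            new_row[f"{config['model']}_embeddings"] = toy_encode(text, config["model"])
        processed.append(new_row)
    return processed


rows = [{"output": "This is a single text for embedding."}]
configs = [
    {"model": "bert-base-uncased", "query_prefix": "Query:"},
    {"model": "distilbert-base-uncased"},
]
print(add_embeddings(rows, "output", configs)[0].keys())
```

Keeping one column per model makes it straightforward to compare models on the same rows later on.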

### Embeddings visualization

The `EmbeddingsVisualizer` class loads embeddings from a FAISS index, reduces their dimensionality using PCA and/or t-SNE, and visualizes them in an interactive 3D plot.

```python
from ragoon import EmbeddingsVisualizer

visualizer = EmbeddingsVisualizer(
    index_path="path/to/index",
    dataset_path="path/to/dataset"
)

visualizer.visualize(
    method="pca",
    save_html=True,
    html_file_name="embedding_visualization.html"
)
```

![Plot](https://github.com/louisbrulenaudet/ragoon/blob/main/assets/embeddings_visualization.gif?raw=true)
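
To make the PCA step more concrete, here is a minimal standard-library sketch that projects vectors onto their top three principal components using power iteration. It only illustrates the technique: `pca_project` is a hypothetical helper, and nothing here is taken from how `EmbeddingsVisualizer` is actually implemented, which may well rely on a numerical library instead.

```python
import math
import random


def pca_project(vectors, n_components=3, n_iter=200, seed=0):
    """Project `vectors` onto their top principal components (power iteration)."""
    n, dim = len(vectors), len(vectors[0])
    # Center the data so that each dimension has zero mean.
    means = [sum(v[j] for v in vectors) / n for j in range(dim)]
    centered = [[v[j] - means[j] for j in range(dim)] for v in vectors]
    # Sample covariance matrix (dim x dim).
    cov = [
        [sum(row[a] * row[b] for row in centered) / (n - 1) for b in range(dim)]
        for a in range(dim)
    ]
    rng = random.Random(seed)
    components = []
    for _ in range(n_components):
        vec = [rng.gauss(0, 1) for _ in range(dim)]
        for _ in range(n_iter):
            # Multiply by the covariance matrix.
            vec = [sum(cov[a][b] * vec[b] for b in range(dim)) for a in range(dim)]
            # Remove the directions already found (deflation).
            for comp in components:
                dot = sum(x * y for x, y in zip(vec, comp))
                vec = [x - dot * y for x, y in zip(vec, comp)]
            norm = math.sqrt(sum(x * x for x in vec))
            if norm == 0:
                break  # No variance left in the remaining directions.
            vec = [x / norm for x in vec]
        components.append(vec)
    # Coordinates of each centered vector along each component.
    return [[sum(r[j] * c[j] for j in range(dim)) for c in components] for r in centered]


vectors = [[float(i), 2.0 * i, 0.1 * (i % 3), 0.0, 0.5 * (i % 4)] for i in range(8)]
points_3d = pca_project(vectors)
print(len(points_3d), len(points_3d[0]))
```

The resulting three coordinates per vector are what a 3D scatter plot would display; t-SNE would replace this linear projection with a non-linear one.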

### Dynamic web search

RAGoon is a Python library that aims to improve the performance of language models by providing contextually relevant information through retrieval-based querying, web scraping, and data augmentation techniques. It integrates various APIs, enabling users to retrieve information from the web, enrich it with domain-specific knowledge, and feed it to language models for more informed responses.

RAGoon's core functionality revolves around the concept of few-shot learning, where language models are provided with a small set of high-quality examples to enhance their understanding and generate more accurate outputs. By curating and retrieving relevant data from the web, RAGoon equips language models with the necessary context and knowledge to tackle complex queries and generate insightful responses.

```python
from groq import Groq
@@ -33,7 +115,7 @@ ragoon = WebRAG(
)

# Search and get results
query = "I want to do a left join in python polars"
query = "I want to do a left join in Python Polars"
results = ragoon.search(
query=query,
completion_model="Llama3-70b-8192",
@@ -45,13 +127,6 @@ results = ragoon.search(
print(results)
```
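
The idea of feeding retrieved web content to a language model can be sketched as simple prompt assembly. The helper below is hypothetical (`build_prompt` and its wording are assumptions, not RAGoon's actual prompt template); it only shows how a few retrieved snippets might be numbered and placed ahead of the user's question.

```python
def build_prompt(query, snippets, max_snippets=3):
    """Assemble a context-augmented prompt from retrieved text snippets."""
    # Keep only the first few snippets and number them for easy reference.
    context_lines = [
        f"[{i}] {snippet.strip()}"
        for i, snippet in enumerate(snippets[:max_snippets], start=1)
    ]
    context = "\n".join(context_lines)
    return (
        "Use the following web excerpts as context.\n\n"
        f"{context}\n\n"
        f"Question: {query}\nAnswer:"
    )


snippets = [
    "Polars joins are performed with DataFrame.join.",
    "The join strategy is selected with the `how` argument.",
]
print(build_prompt("I want to do a left join in Python Polars", snippets))
```

Capping the number of snippets keeps the prompt within the model's context window at the cost of dropping lower-ranked results.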

## Key Features
- **Query Generation**: RAGoon generates search queries tailored to retrieve results that directly address the user's intent, enhancing the context for subsequent language model interactions.
- **Web Scraping and Data Retrieval**: RAGoon leverages web scraping capabilities to extract relevant content from various websites, providing language models with domain-specific knowledge.
- **Parallel Processing**: RAGoon utilizes parallel processing techniques to efficiently scrape and retrieve data from multiple URLs simultaneously.
- **Language Model Integration**: RAGoon integrates with language models, such as OpenAI's GPT-3 or Llama 3 on Groq Cloud, enabling users to leverage natural language processing capabilities for their applications.
- **Extensible Design**: RAGoon's modular architecture allows for the integration of new data sources, retrieval methods, and language models, ensuring future extensibility.
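
The parallel processing feature listed above can be illustrated with a thread pool that fetches several URLs concurrently. This is a sketch under stated assumptions: `fetch_page` is a hypothetical stand-in that performs no network access, and `scrape_all` is not a RAGoon function.

```python
from concurrent.futures import ThreadPoolExecutor, as_completed


def fetch_page(url):
    """Hypothetical stand-in for an HTTP request followed by text extraction."""
    return f"content of {url}"


def scrape_all(urls, fetch=fetch_page, max_workers=4):
    """Fetch all URLs concurrently and map each URL to its content (or None)."""
    results = {}
    with ThreadPoolExecutor(max_workers=max_workers) as executor:
        # Submit one task per URL and remember which future belongs to which URL.
        futures = {executor.submit(fetch, url): url for url in urls}
        for future in as_completed(futures):
            url = futures[future]
            try:
                results[url] = future.result()
            except Exception:
                # A failing page should not abort the whole batch.
                results[url] = None
    return results


print(scrape_all(["https://example.com/a", "https://example.com/b"]))
```

Threads suit this workload because scraping is I/O-bound: most of the time is spent waiting on responses rather than computing.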

## Citing this project
If you use this code in your research, please use the following BibTeX entry.

@@ -63,5 +138,6 @@ If you use this code in your research, please use the following BibTeX entry.
year = {2024}
}
```

## Feedback
If you have any feedback, please reach out at [louisbrulenaudet@icloud.com](mailto:louisbrulenaudet@icloud.com).
Binary file added assets/.DS_Store
Binary file not shown.
Binary file added assets/embeddings_visualization.gif
Binary file modified docs/.DS_Store
Binary file not shown.
Binary file not shown.
Binary file not shown.
Binary file modified docs/build/doctrees/api/ragoon.embeddings.doctree
Binary file not shown.
Binary file modified docs/build/doctrees/api/ragoon.web_rag.doctree
Binary file not shown.
Binary file modified docs/build/doctrees/autosummary.doctree
Binary file not shown.
Binary file modified docs/build/doctrees/environment.pickle
Binary file not shown.
Binary file modified docs/build/doctrees/generated/ragoon.doctree
Binary file not shown.
Binary file modified docs/build/doctrees/index.doctree
Binary file not shown.
Binary file added docs/build/doctrees/installation.doctree
Binary file not shown.
Binary file modified docs/build/doctrees/modules.doctree
Binary file not shown.
Binary file modified docs/build/doctrees/ragoon.doctree
Binary file not shown.
Binary file added docs/build/doctrees/tutorials.doctree
Binary file not shown.
2 changes: 1 addition & 1 deletion docs/build/html/.buildinfo
@@ -1,4 +1,4 @@
# Sphinx build info version 1
# This file hashes the configuration used when building these files. When it is not found, a full rebuild will be done.
config: d1abc10fa896a59268d874e3d6a15004
config: e33b700b0d6eb4fd0f4a11523347c226
tags: 645f666f9bcd5a90fca523b33c5a78b7
20 changes: 13 additions & 7 deletions docs/build/html/_modules/index.html
@@ -7,7 +7,7 @@
<head>
<meta charset="utf-8" />
<meta name="viewport" content="width=device-width, initial-scale=1.0" />
<title>Overview: module code &#8212; RAGoon 0.0.5 documentation</title>
<title>Overview: module code &#8212; RAGoon 0.0.4 documentation</title>



@@ -36,13 +36,12 @@
<link rel="preload" as="script" href="../_static/scripts/pydata-sphinx-theme.js?digest=dfe6caa3a7d634c4db9b" />
<script src="../_static/vendor/fontawesome/6.5.2/js/all.min.js?digest=dfe6caa3a7d634c4db9b"></script>

<script src="../_static/documentation_options.js?v=6424ca4d"></script>
<script src="../_static/documentation_options.js?v=dfd84850"></script>
<script src="../_static/doctools.js?v=9a2dae69"></script>
<script src="../_static/sphinx_highlight.js?v=dc90522c"></script>
<script src="../_static/theme_switcher.js?v=2728e04b"></script>
<script src="../_static/scripts/sphinx-book-theme.js?v=887ef09a"></script>
<script>DOCUMENTATION_OPTIONS.pagename = '_modules/index';</script>
<link rel="icon" href="../_static/logo_light.svg"/>
<link rel="icon" href="../_static/logo.svg"/>
<link rel="index" title="Index" href="../genindex.html" />
<link rel="search" title="Search" href="../search.html" />
<meta name="viewport" content="width=device-width, initial-scale=1"/>
@@ -138,8 +137,8 @@



<img src="../_static/logo_light.svg" class="logo__image only-light" alt="RAGoon 0.0.5 documentation - Home"/>
<script>document.write(`<img src="../_static/logo_light.svg" class="logo__image only-dark" alt="RAGoon 0.0.5 documentation - Home"/>`);</script>
<img src="../_static/logo.svg" class="logo__image only-light" alt="RAGoon 0.0.4 documentation - Home"/>
<script>document.write(`<img src="../_static/logo.svg" class="logo__image only-dark" alt="RAGoon 0.0.4 documentation - Home"/>`);</script>


</a></div>
@@ -156,10 +155,16 @@
</script></div>
<div class="sidebar-primary-item"><nav class="bd-links bd-docs-nav" aria-label="Main">
<div class="bd-toc-item navbar-nav active">
<p aria-level="2" class="caption" role="heading"><span class="caption-text">📖 Reference</span></p>
<p aria-level="2" class="caption" role="heading"><span class="caption-text">Getting Started</span></p>
<ul class="nav bd-sidenav">
<li class="toctree-l1"><a class="reference internal" href="../installation.html">🚀 Installation</a></li>
<li class="toctree-l1"><a class="reference internal" href="../tutorials.html">🖼️ Tutorials</a></li>
</ul>
<p aria-level="2" class="caption" role="heading"><span class="caption-text">📖 Reference</span></p>
<ul class="nav bd-sidenav">
<li class="toctree-l1 has-children"><a class="reference internal" href="../modules.html">ragoon</a><details><summary><span class="toctree-toggle" role="presentation"><i class="fa-solid fa-chevron-down"></i></span></summary><ul>
<li class="toctree-l2"><a class="reference internal" href="../api/ragoon.chunks.html">ragoon.chunks</a></li>
<li class="toctree-l2"><a class="reference internal" href="../api/ragoon.datasets.html">ragoon.datasets</a></li>
<li class="toctree-l2"><a class="reference internal" href="../api/ragoon.embeddings.html">ragoon.embeddings</a></li>
<li class="toctree-l2"><a class="reference internal" href="../api/ragoon.similarity_search.html">ragoon.similarity_search</a></li>
<li class="toctree-l2"><a class="reference internal" href="../api/ragoon.web_rag.html">ragoon.web_rag</a></li>
@@ -284,6 +289,7 @@ <h1></h1>

<h1>All modules for which code is available</h1>
<ul><li><a href="ragoon/chunks.html">ragoon.chunks</a></li>
<li><a href="ragoon/datasets.html">ragoon.datasets</a></li>
<li><a href="ragoon/embeddings.html">ragoon.embeddings</a></li>
<li><a href="ragoon/similarity_search.html">ragoon.similarity_search</a></li>
<li><a href="ragoon/web_rag.html">ragoon.web_rag</a></li>
26 changes: 17 additions & 9 deletions docs/build/html/_modules/ragoon/chunks.html
@@ -7,7 +7,7 @@
<head>
<meta charset="utf-8" />
<meta name="viewport" content="width=device-width, initial-scale=1.0" />
<title>ragoon.chunks &#8212; RAGoon 0.0.5 documentation</title>
<title>ragoon.chunks &#8212; RAGoon 0.0.4 documentation</title>



@@ -36,13 +36,12 @@
<link rel="preload" as="script" href="../../_static/scripts/pydata-sphinx-theme.js?digest=dfe6caa3a7d634c4db9b" />
<script src="../../_static/vendor/fontawesome/6.5.2/js/all.min.js?digest=dfe6caa3a7d634c4db9b"></script>

<script src="../../_static/documentation_options.js?v=6424ca4d"></script>
<script src="../../_static/documentation_options.js?v=dfd84850"></script>
<script src="../../_static/doctools.js?v=9a2dae69"></script>
<script src="../../_static/sphinx_highlight.js?v=dc90522c"></script>
<script src="../../_static/theme_switcher.js?v=2728e04b"></script>
<script src="../../_static/scripts/sphinx-book-theme.js?v=887ef09a"></script>
<script>DOCUMENTATION_OPTIONS.pagename = '_modules/ragoon/chunks';</script>
<link rel="icon" href="../../_static/logo_light.svg"/>
<link rel="icon" href="../../_static/logo.svg"/>
<link rel="index" title="Index" href="../../genindex.html" />
<link rel="search" title="Search" href="../../search.html" />
<meta name="viewport" content="width=device-width, initial-scale=1"/>
@@ -138,8 +137,8 @@



<img src="../../_static/logo_light.svg" class="logo__image only-light" alt="RAGoon 0.0.5 documentation - Home"/>
<script>document.write(`<img src="../../_static/logo_light.svg" class="logo__image only-dark" alt="RAGoon 0.0.5 documentation - Home"/>`);</script>
<img src="../../_static/logo.svg" class="logo__image only-light" alt="RAGoon 0.0.4 documentation - Home"/>
<script>document.write(`<img src="../../_static/logo.svg" class="logo__image only-dark" alt="RAGoon 0.0.4 documentation - Home"/>`);</script>


</a></div>
@@ -156,10 +155,16 @@
</script></div>
<div class="sidebar-primary-item"><nav class="bd-links bd-docs-nav" aria-label="Main">
<div class="bd-toc-item navbar-nav active">
<p aria-level="2" class="caption" role="heading"><span class="caption-text">📖 Reference</span></p>
<p aria-level="2" class="caption" role="heading"><span class="caption-text">Getting Started</span></p>
<ul class="nav bd-sidenav">
<li class="toctree-l1"><a class="reference internal" href="../../installation.html">🚀 Installation</a></li>
<li class="toctree-l1"><a class="reference internal" href="../../tutorials.html">🖼️ Tutorials</a></li>
</ul>
<p aria-level="2" class="caption" role="heading"><span class="caption-text">📖 Reference</span></p>
<ul class="nav bd-sidenav">
<li class="toctree-l1 has-children"><a class="reference internal" href="../../modules.html">ragoon</a><details><summary><span class="toctree-toggle" role="presentation"><i class="fa-solid fa-chevron-down"></i></span></summary><ul>
<li class="toctree-l2"><a class="reference internal" href="../../api/ragoon.chunks.html">ragoon.chunks</a></li>
<li class="toctree-l2"><a class="reference internal" href="../../api/ragoon.datasets.html">ragoon.datasets</a></li>
<li class="toctree-l2"><a class="reference internal" href="../../api/ragoon.embeddings.html">ragoon.embeddings</a></li>
<li class="toctree-l2"><a class="reference internal" href="../../api/ragoon.similarity_search.html">ragoon.similarity_search</a></li>
<li class="toctree-l2"><a class="reference internal" href="../../api/ragoon.web_rag.html">ragoon.web_rag</a></li>
@@ -293,15 +298,14 @@ <h1>Source code for ragoon.chunks</h1><div class="highlight"><pre>
<span class="c1"># See the License for the specific language governing permissions and</span>
<span class="c1"># limitations under the License.</span>

<span class="kn">import</span> <span class="nn">json</span>
<span class="kn">import</span> <span class="nn">re</span>
<span class="kn">import</span> <span class="nn">string</span>
<span class="kn">import</span> <span class="nn">uuid</span>

<span class="kn">from</span> <span class="nn">concurrent.futures</span> <span class="kn">import</span> <span class="p">(</span>
<span class="n">ThreadPoolExecutor</span><span class="p">,</span>
<span class="n">as_completed</span>
<span class="p">)</span>

<span class="kn">from</span> <span class="nn">dataclasses</span> <span class="kn">import</span> <span class="n">dataclass</span>
<span class="kn">from</span> <span class="nn">typing</span> <span class="kn">import</span> <span class="p">(</span>
<span class="n">IO</span><span class="p">,</span>
@@ -327,6 +331,10 @@ <h1>Source code for ragoon.chunks</h1><div class="highlight"><pre>
<span class="kn">from</span> <span class="nn">tqdm</span> <span class="kn">import</span> <span class="n">tqdm</span>
<span class="kn">from</span> <span class="nn">transformers</span> <span class="kn">import</span> <span class="n">AutoTokenizer</span>

<span class="kn">from</span> <span class="nn">ragoon._logger</span> <span class="kn">import</span> <span class="n">Logger</span>

<span class="n">logger</span> <span class="o">=</span> <span class="n">Logger</span><span class="p">()</span>


<div class="viewcode-block" id="ChunkMetadata">
<a class="viewcode-back" href="../../generated/ragoon.html#ragoon.chunks.ChunkMetadata">[docs]</a>