HnswDensevector SafeTensor Generator #2515

Open
wants to merge 37 commits into
base: master

Conversation

Panizghi
Contributor

@Panizghi Panizghi commented Jun 2, 2024

Linked issue: castorini/ura-projects#31 (comment)
@17Melissa will provide the workflow commands below :)

@17Melissa
Contributor

Setup for NFCorpus Indexing with Safetensors

To efficiently perform NFCorpus indexing using Safetensors, follow this setup workflow:

  1. Download and Unzip Collections
    • Begin by downloading the necessary collections and unzipping them. For instance:
      wget https://rgw.cs.uwaterloo.ca/pyserini/data/beir-v1.0.0-bge-base-en-v1.5.tar -P collections/
      tar xvf collections/beir-v1.0.0-bge-base-en-v1.5.tar -C collections/
  2. Prepare the Environment
    • Navigate to the Safetensors directory within the Anserini project:
      cd /anserini/src/main/python/safetensors
    • Create and activate a virtual environment:
      python3 -m venv venv
      source venv/bin/activate
    • Install the required Python packages inside the activated environment:
      pip install -r requirements.txt
  3. Convert JSON to Safetensors Format
    • Use the provided script to convert JSON files to Safetensors format
      python3 -m json_to_bin
    • The script creates the following files in the target directory and reports each path:
      • Saved vectors to ../../../../target/safetensors/vectors/part00_vectors.safetensors
      • Saved docids to ../../../../target/safetensors/docids/part00_docids.safetensors
      • Saved docid_to_idx mapping to ../../../../target/safetensors/docid_to_idx/part00_docid_to_idx.json
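
As a rough illustration of the three outputs listed above (vectors, docids, and the docid-to-index mapping), here is a minimal standard-library sketch of the parsing half of such a conversion. It is not the PR's json_to_bin script: the field names "docid" and "vector", the function name load_dense_vectors, and the sample docids are assumptions made for illustration.

```python
import json


def load_dense_vectors(lines):
    """Parse JSONL records into parallel docid/vector lists plus a docid -> row index map."""
    docids, vectors, docid_to_idx = [], [], {}
    for line in lines:
        line = line.strip()
        if not line:
            continue  # tolerate blank lines between records
        record = json.loads(line)
        docid = record["docid"]  # assumed field name
        vector = [float(x) for x in record["vector"]]  # assumed field name
        if vectors and len(vector) != len(vectors[0]):
            raise ValueError(f"dimension mismatch for {docid}")
        docid_to_idx[docid] = len(docids)  # row index of this docid in the vector matrix
        docids.append(docid)
        vectors.append(vector)
    return docids, vectors, docid_to_idx


# Illustrative records only; real input would come from a vectors.partNN.jsonl file.
sample = [
    '{"docid": "MED-10", "vector": [0.1, 0.2, 0.3]}',
    '{"docid": "MED-14", "vector": [0.4, 0.5, 0.6]}',
]
docids, vectors, docid_to_idx = load_dense_vectors(sample)
print(docid_to_idx)  # {'MED-10': 0, 'MED-14': 1}
```

Keeping the mapping next to the row-ordered vectors is what lets a reader go from a docid to its vector without scanning the whole file.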

Indexing Procedure

To build HNSWSafetensors indexes, use the following sample command:

bin/run.sh io.anserini.index.SafeTensorsIndexCollection \
  -collection JsonDenseVectorCollection \
  -input collections/beir-v1.0.0/bge-base-en-v1.5/nfcorpus  \
  -index indexes/beir-v1.0.0/bge-base-en-v1.5/nfcorpus/ \
  -generator HnswJsonWithSafeTensorsDenseVectorDocumentGenerator \
  -threads 9 -storePositions -storeDocvectors -storeRaw \
  -vectorsDirectory target/safetensors/vectors \
  -docidsDirectory target/safetensors/docids \
  -docidToIdxDirectory target/safetensors/docid_to_idx \
>& logs/log.beir-v1.0.0-bge-base-en-v1.5 &

Ensure all paths and parameters are adjusted according to your setup and directory structure.

@lintool
Member

lintool commented Jun 2, 2024

Can you make the safetensors collection go into collections/beir-v1.0.0/bge-base-en-v1.5.safetensors/, alongside the original? So all files should go into collections/beir-v1.0.0/bge-base-en-v1.5.safetensors/nfcorpus/.
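
The layout requested here can be derived mechanically from the original collection path. The following is a small sketch under that assumption; the helper name safetensors_dir is hypothetical and not part of Anserini.

```python
from pathlib import PurePosixPath


def safetensors_dir(json_corpus_dir: str) -> PurePosixPath:
    """Map <root>/<model>/<corpus> to <root>/<model>.safetensors/<corpus>."""
    src = PurePosixPath(json_corpus_dir)
    model_dir = src.parent
    # with_name is used instead of with_suffix because ".5" in "v1.5"
    # would otherwise be treated as the file suffix and replaced.
    return model_dir.with_name(model_dir.name + ".safetensors") / src.name


out = safetensors_dir("collections/beir-v1.0.0/bge-base-en-v1.5/nfcorpus")
print(out)  # collections/beir-v1.0.0/bge-base-en-v1.5.safetensors/nfcorpus
```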

We also shouldn't need a new indexer. The indexing command should be similar to https://github.com/castorini/anserini/blob/master/docs/regressions/regressions-beir-v1.0.0-nfcorpus-bge-base-en-v1.5-hnsw.md

e.g.,

bin/run.sh io.anserini.index.IndexHnswDenseVectors \
  -collection JsonDenseVectorCollection \
  -input /path/to/beir-v1.0.0-bge-base-en-v1.5 \
  -generator HnswJsonWithSafeTensorsDenseVectorDocumentGenerator \
  -index indexes/lucene-hnsw.beir-v1.0.0-nfcorpus-bge-base-en-v1.5/ \
  -threads 16 -M 16 -efC 100 -memoryBuffer 65536 -noMerge \
  >& logs/log.beir-v1.0.0-bge-base-en-v1.5 &

With the only exception being a different -generator.

@17Melissa
Contributor

Updated Workflow for Safetensors Conversion and Indexing Process

  1. Create Directory: Create the safetensors folder collections/beir-v1.0.0/bge-base-en-v1.5.safetensors/nfcorpus
  2. Run Conversion Script: Execute the python script json_to_bin.py from the root directory using the command:
    python src/main/python/safetensors/json_to_bin.py
  3. Execute Indexing Command: After the conversion script completes, run the indexing command below:
    bin/run.sh io.anserini.index.IndexHnswDenseVectors \
      -collection JsonDenseVectorCollection \
      -input collections/beir-v1.0.0/bge-base-en-v1.5/nfcorpus \
      -generator HnswJsonWithSafeTensorsDenseVectorDocumentGenerator \
      -index indexes/beir-v1.0.0/bge-base-en-v1.5/nfcorpus/ \
      -threads 16 -M 16 -efC 100 -memoryBuffer 65536 -noMerge \
      >& logs/log.beir-v1.0.0-nq.bge-base-en-v1.5 &
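
For readers unfamiliar with what the conversion step writes, the sketch below round-trips one float32 matrix through the safetensors byte layout as I understand it: an 8-byte little-endian header length, a JSON header describing each tensor, then the raw data. It uses only the standard library so it runs anywhere. The tensor name "vectors" and both function names are assumptions, and a real script would normally use the safetensors library rather than hand-rolled packing.

```python
import json
import struct


def pack_f32_tensor(name, rows):
    """Serialize one 2-D float32 tensor using the safetensors byte layout."""
    n_rows, n_cols = len(rows), len(rows[0])
    data = b"".join(struct.pack(f"<{n_cols}f", *row) for row in rows)
    header = {name: {"dtype": "F32", "shape": [n_rows, n_cols],
                     "data_offsets": [0, len(data)]}}
    header_bytes = json.dumps(header, separators=(",", ":")).encode("utf-8")
    # 8-byte little-endian header length, then the JSON header, then the raw data
    return struct.pack("<Q", len(header_bytes)) + header_bytes + data


def unpack_f32_tensor(blob, name):
    """Read the named float32 tensor back out as a list of rows."""
    (header_len,) = struct.unpack_from("<Q", blob, 0)
    header = json.loads(blob[8:8 + header_len])
    info = header[name]
    n_rows, n_cols = info["shape"]
    begin, end = info["data_offsets"]  # offsets are relative to the end of the header
    payload = blob[8 + header_len + begin:8 + header_len + end]
    flat = struct.unpack(f"<{n_rows * n_cols}f", payload)
    return [list(flat[i * n_cols:(i + 1) * n_cols]) for i in range(n_rows)]


blob = pack_f32_tensor("vectors", [[1.0, 0.5], [0.25, 2.0]])
print(unpack_f32_tensor(blob, "vectors"))  # [[1.0, 0.5], [0.25, 2.0]]
```

Because the header records shape and offsets up front, a reader can locate any tensor without parsing the vector data itself, which is the main practical difference from line-by-line JSON.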

@Panizghi
Contributor Author

Panizghi commented Jul 9, 2024

Updates

  • Removed hardcoded path.
  • Removed indexer arguments and updated the path hierarchy.
  • Internal mapping of the docid and vectors.
  • Updated argument for Python input and output.

Updated commands

Python

python src/main/python/safetensors/json_to_bin.py --input collections/beir-v1.0.0/bge-base-en-v1.5/nfcorpus/vectors.part00.jsonl --output collections/beir-v1.0.0/bge-base-en-v1.5.safetensors/nfcorpus
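
The --input and --output arguments in the command above could be wired up roughly as follows. This is a sketch of the command-line interface only, assuming argparse; the actual option handling in json_to_bin.py is not shown in this thread.

```python
import argparse


def build_parser() -> argparse.ArgumentParser:
    """Build a parser mirroring the two options used in the command above."""
    parser = argparse.ArgumentParser(
        description="Convert a JSONL dense-vector file to safetensors (sketch).")
    parser.add_argument("--input", required=True,
                        help="Path to the input .jsonl file")
    parser.add_argument("--output", required=True,
                        help="Directory for the safetensors output")
    return parser


# Parse an explicit argument list so the example runs without a real command line.
args = build_parser().parse_args([
    "--input", "collections/beir-v1.0.0/bge-base-en-v1.5/nfcorpus/vectors.part00.jsonl",
    "--output", "collections/beir-v1.0.0/bge-base-en-v1.5.safetensors/nfcorpus",
])
print(args.input, "->", args.output)
```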

Java

bin/run.sh io.anserini.index.IndexHnswDenseVectors -collection JsonDenseVectorCollection -input collections/beir-v1.0.0/bge-base-en-v1.5/nfcorpus -generator HnswJsonWithSafeTensorsDenseVectorDocumentGenerator -index indexes/beir-v1.0.0/bge-base-en-v1.5/nfcorpus/ -threads 16 -M 16 -efC 100 -memoryBuffer 65536 -noMerge >& logs/log.beir-v1.0.0-nq.bge-base-en-v1.112 &

@Panizghi Panizghi reopened this Jul 9, 2024
@lintool
Member

lintool commented Jul 9, 2024

Looking at this command:

bin/run.sh io.anserini.index.SafeTensorsIndexCollection \
  -collection JsonDenseVectorCollection \
  -input collections/beir-v1.0.0/bge-base-en-v1.5/nfcorpus  \
  -index indexes/beir-v1.0.0/bge-base-en-v1.5/nfcorpus/ \
  -generator HnswJsonWithSafeTensorsDenseVectorDocumentGenerator \
  -threads 9 -storePositions -storeDocvectors -storeRaw \
  -vectorsDirectory target/safetensors/vectors \
  -docidsDirectory target/safetensors/docids \
  -docidToIdxDirectory target/safetensors/docid_to_idx \
>& logs/log.beir-v1.0.0-bge-base-en-v1.5 &

What are these three options doing?

  -vectorsDirectory target/safetensors/vectors \
  -docidsDirectory target/safetensors/docids \
  -docidToIdxDirectory target/safetensors/docid_to_idx \

And why are these the same?

  -input collections/beir-v1.0.0/bge-base-en-v1.5/nfcorpus  \
  -index indexes/beir-v1.0.0/bge-base-en-v1.5/nfcorpus/ \

I would expect -index to specify the location of the index?

@Panizghi
Contributor Author

Panizghi commented Jul 9, 2024

I think you are looking at the older command; this is the updated one:

bin/run.sh io.anserini.index.IndexHnswDenseVectors \
  -collection JsonDenseVectorCollection \
  -input collections/beir-v1.0.0/bge-base-en-v1.5/nfcorpus \
  -generator HnswJsonWithSafeTensorsDenseVectorDocumentGenerator \
  -index indexes/beir-v1.0.0/bge-base-en-v1.5/nfcorpus/ \
  -threads 16 -M 16 -efC 100 -memoryBuffer 65536 -noMerge \
  >& logs/log.beir-v1.0.0-nq.bge-base-en-v1.112 &

@lintool
Member

lintool commented Jul 9, 2024

I think you are looking at the older command; this is the updated one:

bin/run.sh io.anserini.index.IndexHnswDenseVectors \
  -collection JsonDenseVectorCollection \
  -input collections/beir-v1.0.0/bge-base-en-v1.5/nfcorpus \
  -generator HnswJsonWithSafeTensorsDenseVectorDocumentGenerator \
  -index indexes/beir-v1.0.0/bge-base-en-v1.5/nfcorpus/ \
  -threads 16 -M 16 -efC 100 -memoryBuffer 65536 -noMerge \
  >& logs/log.beir-v1.0.0-nq.bge-base-en-v1.112 &

Ah, please update to keep up to date?

@Panizghi
Contributor Author

Panizghi commented Jul 9, 2024

My apologies, it got lost among all the commits :)
It is right here: #2515 (comment)

Python

python src/main/python/safetensors/json_to_bin.py \
  --input collections/beir-v1.0.0/bge-base-en-v1.5/nfcorpus/vectors.part00.jsonl \
  --output collections/beir-v1.0.0/bge-base-en-v1.5.safetensors/nfcorpus

Java

bin/run.sh io.anserini.index.IndexHnswDenseVectors \
  -collection JsonDenseVectorCollection \
  -input collections/beir-v1.0.0/bge-base-en-v1.5/nfcorpus \
  -generator HnswJsonWithSafeTensorsDenseVectorDocumentGenerator \
  -index indexes/beir-v1.0.0/bge-base-en-v1.5/nfcorpus/ \
  -threads 16 -M 16 -efC 100 -memoryBuffer 65536 -noMerge \
  >& logs/log.beir-v1.0.0-nq.bge-base-en-v1.5 &

@lintool
Member

lintool commented Jul 11, 2024

Sorry, I'm confused again:

bin/run.sh io.anserini.index.IndexHnswDenseVectors \
  -collection JsonDenseVectorCollection \
  -input collections/beir-v1.0.0/bge-base-en-v1.5/nfcorpus \
  -generator HnswJsonWithSafeTensorsDenseVectorDocumentGenerator \
  -index indexes/beir-v1.0.0/bge-base-en-v1.5/nfcorpus/ \
  -threads 16 -M 16 -efC 100 -memoryBuffer 65536 -noMerge \
  >& logs/log.beir-v1.0.0-nq.bge-base-en-v1.5 &

Why would -collection be JsonDenseVectorCollection now? Currently, DenseVectorDocumentGenerator reads from Json: https://github.com/castorini/anserini/blob/master/src/main/java/io/anserini/index/generator/DenseVectorDocumentGenerator.java

So we'd have something like SafeTensorsDenseVectorCollection that reads from SafeTensors?

@lintool
Member

lintool commented Jul 11, 2024

I'm not getting your logic, but I think you need to implement two classes:

  • SafeTensorsDenseVectorCollection
  • SafeTensorsDenseVectorDocumentGenerator

And your command would be something like -collection SafeTensorsDenseVectorCollection ... -generator SafeTensorsDenseVectorDocumentGenerator.

And you'd "wire everything together".
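
To make the division of labour between the two suggested classes concrete, here is a language-neutral sketch in Python. The real Anserini classes are Java and their interfaces are not shown in this thread, so every class, method, and field name below is a stand-in: the collection knows how to read and iterate the stored data, and the generator turns each source document into the fields to index.

```python
from dataclasses import dataclass
from typing import Dict, Iterator, List


@dataclass
class DenseVectorDoc:
    docid: str
    vector: List[float]


class SafeTensorsCollectionSketch:
    """Stand-in for a collection: owns the stored data and iterates over it."""

    def __init__(self, docids: List[str], vectors: List[List[float]]):
        if len(docids) != len(vectors):
            raise ValueError("docids and vectors must align row by row")
        self._docids = docids
        self._vectors = vectors

    def __iter__(self) -> Iterator[DenseVectorDoc]:
        for docid, vector in zip(self._docids, self._vectors):
            yield DenseVectorDoc(docid, vector)


class SafeTensorsGeneratorSketch:
    """Stand-in for a generator: maps one source document to indexable fields."""

    def create_document(self, doc: DenseVectorDoc) -> Dict[str, object]:
        return {"id": doc.docid, "vector": doc.vector}


# "Wiring everything together": the indexer only sees the two abstractions.
collection = SafeTensorsCollectionSketch(["MED-10", "MED-14"], [[0.1, 0.2], [0.3, 0.4]])
generator = SafeTensorsGeneratorSketch()
indexed = [generator.create_document(doc) for doc in collection]
print(indexed[0])  # {'id': 'MED-10', 'vector': [0.1, 0.2]}
```

With this split, the indexer itself needs no safetensors-specific options; swapping the -collection and -generator names is enough to change the input format.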

@Panizghi
Contributor Author

Updated command:

bin/run.sh io.anserini.index.IndexHnswDenseVectors \
  -collection SafeTensorsDenseVectorCollection \
  -input collections/beir-v1.0.0/bge-base-en-v1.5.safetensors/nfcorpus \
  -generator SafeTensorsDenseVectorDocumentGenerator \
  -index indexes/beir-v1.0.0/bge-base-en-v1.5/nfcorpus/ \
  -threads 16 -M 16 -efC 100 -memoryBuffer 65536 -noMerge \
  >& logs/log.beir-v1.0.0-nq.bge-base-en-v1.5 &
