diff --git a/CHANGELOG.md b/CHANGELOG.md index e8e5e6d7b..fd7182e65 100644 --- a/CHANGELOG.md +++ b/CHANGELOG.md @@ -24,4 +24,5 @@ The format is based on [Keep a Changelog](https://keepachangelog.com/en/1.0.0/), ### Infrastructure ### Documentation ### Maintenance +* Remove benchmarks folder from k-NN repo [#2127](https://github.com/opensearch-project/k-NN/pull/2127) ### Refactoring diff --git a/benchmarks/README.md b/benchmarks/README.md new file mode 100644 index 000000000..2e642d41b --- /dev/null +++ b/benchmarks/README.md @@ -0,0 +1,4 @@ +## Benchmark Folder Tools Deprecated +All benchmark workloads have been moved to [OpenSearch Benchmark Workloads](https://github.com/opensearch-project/opensearch-benchmark-workloads/tree/main/vectorsearch). Please use the OSB tool to run the benchmarks. + +If you are still interested in using the old tool, the benchmarks have been moved to the [old-benchmarks branch](https://github.com/opensearch-project/k-NN/tree/old-benchmarks/benchmarks). diff --git a/benchmarks/osb/README.md b/benchmarks/osb/README.md deleted file mode 100644 index 0d0b05f8d..000000000 --- a/benchmarks/osb/README.md +++ /dev/null @@ -1,478 +0,0 @@ -# IMPORTANT NOTE: No new features will be added to this tool. This tool is currently in maintenance mode. All new features will be added to the [vector search workload](https://github.com/opensearch-project/opensearch-benchmark-workloads/tree/main/vectorsearch) -# OpenSearch Benchmarks for k-NN - -## Overview - -This directory contains code and configurations to run k-NN benchmarking -workloads using OpenSearch Benchmarks. - -The [extensions](extensions) directory contains common code shared between -procedures. The [procedures](procedures) directory contains the individual -test procedures for this workload. - -## Getting Started - -### OpenSearch Benchmarks Background - -OpenSearch Benchmark is a framework for performance benchmarking an OpenSearch -cluster. For more details, check out their -[repo](https://github.com/opensearch-project/opensearch-benchmark/). - -Before getting into the benchmarks, it is helpful to know a few terms: -1. Workload - Top-level description of a benchmark suite. A workload will have a `workload.json` file that defines the different components of the tests -2. Test Procedures - A workload can have a schedule of operations that run the test. However, a workload can also have several test procedures that define their own schedule of operations. This is helpful for sharing code between tests -3. Operation - An action against the OpenSearch cluster -4. Parameter source - Producer of parameters for OpenSearch operations -5. Runners - Code that will actually execute the OpenSearch operations - -### Setup - -OpenSearch Benchmarks requires Python 3.8 or greater to be installed. One of -the easier ways to do this is through Conda, a package and environment -management system for Python. - -First, follow the -[installation instructions](https://docs.conda.io/projects/conda/en/latest/user-guide/install/index.html) -to install Conda on your system. - -Next, create a Python 3.8 environment: -``` -conda create -n knn-osb python=3.8 -``` - -After the environment is created, activate it: -``` -source activate knn-osb -``` - -Lastly, clone the k-NN repo and install all required Python packages: -``` -git clone https://github.com/opensearch-project/k-NN.git -cd k-NN/benchmarks/osb -pip install -r requirements.txt -``` - -After all of this completes, you should be ready to run your first benchmark!
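Before moving on, it can help to confirm that the machine you will run benchmarks from can actually reach your cluster, since the next section assumes it can. Below is a minimal sketch using the `opensearch-py` client pulled in by `requirements.txt`; the host, port, and security settings are placeholders that you will need to adjust for your own endpoint:

```python
from opensearchpy import OpenSearch

# Placeholder endpoint -- substitute your cluster's URL and PORT
client = OpenSearch(
    hosts=[{"host": "localhost", "port": 9200}],
    use_ssl=False,
)

# Quick reachability check: prints cluster status, node count, etc.
print(client.cluster.health())
```

If this call fails, sort out connectivity (network access, security plugin settings) before attempting a benchmark run.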
- -### Running a benchmark - -Before running a benchmark, make sure you have the endpoint of your cluster and - that the machine you are running the benchmarks from can access it. - Additionally, ensure that all data has been pulled to the client. - -Currently, we support two test procedures for the k-NN workload: train-test and -no-train-test. The train-test procedure includes steps to train a model in its -schedule, while no-train-test does not. Both test procedures will index a data set -of vectors into an OpenSearch index and then run a set of queries against them. - -Once you have decided which test procedure you want to use, open up -[params/train-params.json](params/train-params.json) or -[params/no-train-params.json](params/no-train-params.json) and -fill out the parameters. Notice that at the bottom of `no-train-params.json` there -are several parameters that relate to training. Ignore these. They need to be -defined for the workload but are not used. - -Once the parameters are set, set the URL and PORT of your cluster and run the -following command to execute the test procedure. - -``` -export URL= -export PORT= -export PARAMS_FILE= -export PROCEDURE={no-train-test | train-test} - -opensearch-benchmark execute_test \ - --target-hosts $URL:$PORT \ - --workload-path ./workload.json \ - --workload-params ${PARAMS_FILE} \ - --test-procedure=${PROCEDURE} \ - --pipeline benchmark-only -``` - -## Current Procedures - -### No Train Test - -The No Train Test procedure is used to test `knn_vector` indices that do not -use an algorithm that requires training. - -#### Workflow - -1. Delete old resources in the cluster if they are present -2. Create an OpenSearch index with `knn_vector` configured to use the HNSW algorithm -3. Wait for cluster to be green -4. Ingest data set into the cluster -5. Refresh the index -6. 
Run queries from data set against the cluster - -#### Parameters - -| Name | Description | -|-----------------------------------------|--------------------------------------------------------------------------| -| target_index_name | Name of index to add vectors to | -| target_field_name | Name of field to add vectors to | -| target_index_body | Path to target index definition | -| target_index_primary_shards | Target index primary shards | -| target_index_replica_shards | Target index replica shards | -| target_index_dimension | Dimension of target index | -| target_index_space_type | Target index space type | -| target_index_bulk_size | Target index bulk size | -| target_index_bulk_index_data_set_format | Format of vector data set | -| target_index_bulk_index_data_set_path | Path to vector data set | -| target_index_bulk_index_clients | Clients to be used for bulk ingestion (must be divisor of data set size) | -| target_index_max_num_segments | Number of segments to merge target index down to before beginning search | -| target_index_force_merge_timeout | Timeout for of force merge requests in seconds | -| hnsw_ef_search | HNSW ef search parameter | -| hnsw_ef_construction | HNSW ef construction parameter | -| hnsw_m | HNSW m parameter | -| query_k | The number of neighbors to return for the search | -| query_clients | Number of clients to use for running queries | -| query_data_set_format | Format of vector data set for queries | -| query_data_set_path | Path to vector data set for queries | - -#### Metrics - -The result metrics of this procedure will look like: -``` ------------------------------------------------------- - _______ __ _____ - / ____(_)___ ____ _/ / / ___/_________ ________ - / /_ / / __ \/ __ `/ / \__ \/ ___/ __ \/ ___/ _ \ - / __/ / / / / / /_/ / / ___/ / /__/ /_/ / / / __/ -/_/ /_/_/ /_/\__,_/_/ /____/\___/\____/_/ \___/ ------------------------------------------------------- - -| Metric | Task | Value | Unit | -|---------------------------------------------------------------:|------------------------:|------------:|-------:| -| Cumulative indexing time of primary shards | | 1.82885 | min | -| Min cumulative indexing time across primary shards | | 0.4121 | min | -| Median cumulative indexing time across primary shards | | 0.559617 | min | -| Max cumulative indexing time across primary shards | | 0.857133 | min | -| Cumulative indexing throttle time of primary shards | | 0 | min | -| Min cumulative indexing throttle time across primary shards | | 0 | min | -| Median cumulative indexing throttle time across primary shards | | 0 | min | -| Max cumulative indexing throttle time across primary shards | | 0 | min | -| Cumulative merge time of primary shards | | 5.89065 | min | -| Cumulative merge count of primary shards | | 3 | | -| Min cumulative merge time across primary shards | | 1.95945 | min | -| Median cumulative merge time across primary shards | | 1.96345 | min | -| Max cumulative merge time across primary shards | | 1.96775 | min | -| Cumulative merge throttle time of primary shards | | 0 | min | -| Min cumulative merge throttle time across primary shards | | 0 | min | -| Median cumulative merge throttle time across primary shards | | 0 | min | -| Max cumulative merge throttle time across primary shards | | 0 | min | -| Cumulative refresh time of primary shards | | 8.52517 | min | -| Cumulative refresh count of primary shards | | 29 | | -| Min cumulative refresh time across primary shards | | 2.64265 | min | -| Median cumulative refresh time across primary shards 
| | 2.93913 | min | -| Max cumulative refresh time across primary shards | | 2.94338 | min | -| Cumulative flush time of primary shards | | 0.00221667 | min | -| Cumulative flush count of primary shards | | 3 | | -| Min cumulative flush time across primary shards | | 0.000733333 | min | -| Median cumulative flush time across primary shards | | 0.000733333 | min | -| Max cumulative flush time across primary shards | | 0.00075 | min | -| Total Young Gen GC time | | 0.318 | s | -| Total Young Gen GC count | | 2 | | -| Total Old Gen GC time | | 0 | s | -| Total Old Gen GC count | | 0 | | -| Store size | | 1.43566 | GB | -| Translog size | | 1.53668e-07 | GB | -| Heap used for segments | | 0.00410843 | MB | -| Heap used for doc values | | 0.000286102 | MB | -| Heap used for terms | | 0.00121307 | MB | -| Heap used for norms | | 0 | MB | -| Heap used for points | | 0 | MB | -| Heap used for stored fields | | 0.00260925 | MB | -| Segment count | | 3 | | -| Min Throughput | custom-vector-bulk | 38005.8 | docs/s | -| Mean Throughput | custom-vector-bulk | 44827.9 | docs/s | -| Median Throughput | custom-vector-bulk | 40507.2 | docs/s | -| Max Throughput | custom-vector-bulk | 88967.8 | docs/s | -| 50th percentile latency | custom-vector-bulk | 29.5857 | ms | -| 90th percentile latency | custom-vector-bulk | 49.0719 | ms | -| 99th percentile latency | custom-vector-bulk | 72.6138 | ms | -| 99.9th percentile latency | custom-vector-bulk | 279.826 | ms | -| 100th percentile latency | custom-vector-bulk | 15688 | ms | -| 50th percentile service time | custom-vector-bulk | 29.5857 | ms | -| 90th percentile service time | custom-vector-bulk | 49.0719 | ms | -| 99th percentile service time | custom-vector-bulk | 72.6138 | ms | -| 99.9th percentile service time | custom-vector-bulk | 279.826 | ms | -| 100th percentile service time | custom-vector-bulk | 15688 | ms | -| error rate | custom-vector-bulk | 0 | % | -| Min Throughput | refresh-target-index | 0.01 | ops/s | -| Mean Throughput | refresh-target-index | 0.01 | ops/s | -| Median Throughput | refresh-target-index | 0.01 | ops/s | -| Max Throughput | refresh-target-index | 0.01 | ops/s | -| 100th percentile latency | refresh-target-index | 176610 | ms | -| 100th percentile service time | refresh-target-index | 176610 | ms | -| error rate | refresh-target-index | 0 | % | -| Min Throughput | knn-query-from-data-set | 444.17 | ops/s | -| Mean Throughput | knn-query-from-data-set | 601.68 | ops/s | -| Median Throughput | knn-query-from-data-set | 621.19 | ops/s | -| Max Throughput | knn-query-from-data-set | 631.23 | ops/s | -| 50th percentile latency | knn-query-from-data-set | 14.7612 | ms | -| 90th percentile latency | knn-query-from-data-set | 20.6954 | ms | -| 99th percentile latency | knn-query-from-data-set | 27.7499 | ms | -| 99.9th percentile latency | knn-query-from-data-set | 41.3506 | ms | -| 99.99th percentile latency | knn-query-from-data-set | 162.391 | ms | -| 100th percentile latency | knn-query-from-data-set | 162.756 | ms | -| 50th percentile service time | knn-query-from-data-set | 14.7612 | ms | -| 90th percentile service time | knn-query-from-data-set | 20.6954 | ms | -| 99th percentile service time | knn-query-from-data-set | 27.7499 | ms | -| 99.9th percentile service time | knn-query-from-data-set | 41.3506 | ms | -| 99.99th percentile service time | knn-query-from-data-set | 162.391 | ms | -| 100th percentile service time | knn-query-from-data-set | 162.756 | ms | -| error rate | knn-query-from-data-set | 0 | % | - - 
---------------------------------- -[INFO] SUCCESS (took 618 seconds) ---------------------------------- -``` - -### Train Test - -The Train Test procedure is used to test `knn_vector` indices that do use an -algorithm that requires training. - -#### Workflow - -1. Delete old resources in the cluster if they are present -2. Create an OpenSearch index with `knn_vector` configured to load with training data -3. Wait for cluster to be green -4. Ingest data set into the training index -5. Refresh the index -6. Train a model based on user provided input parameters -7. Create an OpenSearch index with `knn_vector` configured to use the model -8. Ingest vectors into the target index -9. Refresh the target index -10. Run queries from data set against the cluster - -#### Parameters - -| Name | Description | -|-----------------------------------------|--------------------------------------------------------------------------| -| target_index_name | Name of index to add vectors to | -| target_field_name | Name of field to add vectors to | -| target_index_body | Path to target index definition | -| target_index_primary_shards | Target index primary shards | -| target_index_replica_shards | Target index replica shards | -| target_index_dimension | Dimension of target index | -| target_index_space_type | Target index space type | -| target_index_bulk_size | Target index bulk size | -| target_index_bulk_index_data_set_format | Format of vector data set for ingestion | -| target_index_bulk_index_data_set_path | Path to vector data set for ingestion | -| target_index_bulk_index_clients | Clients to be used for bulk ingestion (must be divisor of data set size) | -| target_index_max_num_segments | Number of segments to merge target index down to before beginning search | -| target_index_force_merge_timeout | Timeout for of force merge requests in seconds | -| ivf_nlists | IVF nlist parameter | -| ivf_nprobes | IVF nprobe parameter | -| pq_code_size | PQ code_size parameter | -| pq_m | PQ m parameter | -| train_model_method | Method to be used for model (ivf or ivfpq) | -| train_model_id | Model ID | -| train_index_name | Name of index to put training data into | -| train_field_name | Name of field to put training data into | -| train_index_body | Path to train index definition | -| train_search_size | Search size to use when pulling training data | -| train_timeout | Timeout to wait for training to finish | -| train_index_primary_shards | Train index primary shards | -| train_index_replica_shards | Train index replica shards | -| train_index_bulk_size | Train index bulk size | -| train_index_data_set_format | Format of vector data set for training | -| train_index_data_set_path | Path to vector data set for training | -| train_index_num_vectors | Number of vectors to use from vector data set for training | -| train_index_bulk_index_clients | Clients to be used for bulk ingestion (must be divisor of data set size) | -| query_k | The number of neighbors to return for the search | -| query_clients | Number of clients to use for running queries | -| query_data_set_format | Format of vector data set for queries | -| query_data_set_path | Path to vector data set for queries | - -#### Metrics - -The result metrics of this procedure will look like: -``` ------------------------------------------------------- - _______ __ _____ - / ____(_)___ ____ _/ / / ___/_________ ________ - / /_ / / __ \/ __ `/ / \__ \/ ___/ __ \/ ___/ _ \ - / __/ / / / / / /_/ / / ___/ / /__/ /_/ / / / __/ -/_/ /_/_/ /_/\__,_/_/ 
/____/\___/\____/_/ \___/ ------------------------------------------------------- - -| Metric | Task | Value | Unit | -|---------------------------------------------------------------:|------------------------:|-----------:|-----------------:| -| Cumulative indexing time of primary shards | | 2.92382 | min | -| Min cumulative indexing time across primary shards | | 0.42245 | min | -| Median cumulative indexing time across primary shards | | 0.43395 | min | -| Max cumulative indexing time across primary shards | | 1.63347 | min | -| Cumulative indexing throttle time of primary shards | | 0 | min | -| Min cumulative indexing throttle time across primary shards | | 0 | min | -| Median cumulative indexing throttle time across primary shards | | 0 | min | -| Max cumulative indexing throttle time across primary shards | | 0 | min | -| Cumulative merge time of primary shards | | 1.36293 | min | -| Cumulative merge count of primary shards | | 20 | | -| Min cumulative merge time across primary shards | | 0.263283 | min | -| Median cumulative merge time across primary shards | | 0.291733 | min | -| Max cumulative merge time across primary shards | | 0.516183 | min | -| Cumulative merge throttle time of primary shards | | 0.701683 | min | -| Min cumulative merge throttle time across primary shards | | 0.163883 | min | -| Median cumulative merge throttle time across primary shards | | 0.175717 | min | -| Max cumulative merge throttle time across primary shards | | 0.186367 | min | -| Cumulative refresh time of primary shards | | 0.222217 | min | -| Cumulative refresh count of primary shards | | 67 | | -| Min cumulative refresh time across primary shards | | 0.03915 | min | -| Median cumulative refresh time across primary shards | | 0.039825 | min | -| Max cumulative refresh time across primary shards | | 0.103417 | min | -| Cumulative flush time of primary shards | | 0.0276833 | min | -| Cumulative flush count of primary shards | | 1 | | -| Min cumulative flush time across primary shards | | 0 | min | -| Median cumulative flush time across primary shards | | 0 | min | -| Max cumulative flush time across primary shards | | 0.0276833 | min | -| Total Young Gen GC time | | 0.074 | s | -| Total Young Gen GC count | | 8 | | -| Total Old Gen GC time | | 0 | s | -| Total Old Gen GC count | | 0 | | -| Store size | | 1.67839 | GB | -| Translog size | | 0.115145 | GB | -| Heap used for segments | | 0.0350914 | MB | -| Heap used for doc values | | 0.00771713 | MB | -| Heap used for terms | | 0.0101089 | MB | -| Heap used for norms | | 0 | MB | -| Heap used for points | | 0 | MB | -| Heap used for stored fields | | 0.0172653 | MB | -| Segment count | | 25 | | -| Min Throughput | delete-model | 25.45 | ops/s | -| Mean Throughput | delete-model | 25.45 | ops/s | -| Median Throughput | delete-model | 25.45 | ops/s | -| Max Throughput | delete-model | 25.45 | ops/s | -| 100th percentile latency | delete-model | 39.0409 | ms | -| 100th percentile service time | delete-model | 39.0409 | ms | -| error rate | delete-model | 0 | % | -| Min Throughput | train-vector-bulk | 49518.9 | docs/s | -| Mean Throughput | train-vector-bulk | 54418.8 | docs/s | -| Median Throughput | train-vector-bulk | 52984.2 | docs/s | -| Max Throughput | train-vector-bulk | 62118.3 | docs/s | -| 50th percentile latency | train-vector-bulk | 26.5293 | ms | -| 90th percentile latency | train-vector-bulk | 41.8212 | ms | -| 99th percentile latency | train-vector-bulk | 239.351 | ms | -| 99.9th percentile latency | train-vector-bulk | 348.507 | ms | 
-| 100th percentile latency | train-vector-bulk | 436.292 | ms | -| 50th percentile service time | train-vector-bulk | 26.5293 | ms | -| 90th percentile service time | train-vector-bulk | 41.8212 | ms | -| 99th percentile service time | train-vector-bulk | 239.351 | ms | -| 99.9th percentile service time | train-vector-bulk | 348.507 | ms | -| 100th percentile service time | train-vector-bulk | 436.292 | ms | -| error rate | train-vector-bulk | 0 | % | -| Min Throughput | refresh-train-index | 0.47 | ops/s | -| Mean Throughput | refresh-train-index | 0.47 | ops/s | -| Median Throughput | refresh-train-index | 0.47 | ops/s | -| Max Throughput | refresh-train-index | 0.47 | ops/s | -| 100th percentile latency | refresh-train-index | 2142.96 | ms | -| 100th percentile service time | refresh-train-index | 2142.96 | ms | -| error rate | refresh-train-index | 0 | % | -| Min Throughput | ivfpq-train-model | 0.01 | models_trained/s | -| Mean Throughput | ivfpq-train-model | 0.01 | models_trained/s | -| Median Throughput | ivfpq-train-model | 0.01 | models_trained/s | -| Max Throughput | ivfpq-train-model | 0.01 | models_trained/s | -| 100th percentile latency | ivfpq-train-model | 136563 | ms | -| 100th percentile service time | ivfpq-train-model | 136563 | ms | -| error rate | ivfpq-train-model | 0 | % | -| Min Throughput | custom-vector-bulk | 62384.8 | docs/s | -| Mean Throughput | custom-vector-bulk | 69035.2 | docs/s | -| Median Throughput | custom-vector-bulk | 68675.4 | docs/s | -| Max Throughput | custom-vector-bulk | 80713.4 | docs/s | -| 50th percentile latency | custom-vector-bulk | 18.7726 | ms | -| 90th percentile latency | custom-vector-bulk | 34.8881 | ms | -| 99th percentile latency | custom-vector-bulk | 150.435 | ms | -| 99.9th percentile latency | custom-vector-bulk | 296.862 | ms | -| 100th percentile latency | custom-vector-bulk | 344.394 | ms | -| 50th percentile service time | custom-vector-bulk | 18.7726 | ms | -| 90th percentile service time | custom-vector-bulk | 34.8881 | ms | -| 99th percentile service time | custom-vector-bulk | 150.435 | ms | -| 99.9th percentile service time | custom-vector-bulk | 296.862 | ms | -| 100th percentile service time | custom-vector-bulk | 344.394 | ms | -| error rate | custom-vector-bulk | 0 | % | -| Min Throughput | refresh-target-index | 28.32 | ops/s | -| Mean Throughput | refresh-target-index | 28.32 | ops/s | -| Median Throughput | refresh-target-index | 28.32 | ops/s | -| Max Throughput | refresh-target-index | 28.32 | ops/s | -| 100th percentile latency | refresh-target-index | 34.9811 | ms | -| 100th percentile service time | refresh-target-index | 34.9811 | ms | -| error rate | refresh-target-index | 0 | % | -| Min Throughput | knn-query-from-data-set | 0.9 | ops/s | -| Mean Throughput | knn-query-from-data-set | 453.84 | ops/s | -| Median Throughput | knn-query-from-data-set | 554.15 | ops/s | -| Max Throughput | knn-query-from-data-set | 681 | ops/s | -| 50th percentile latency | knn-query-from-data-set | 11.7174 | ms | -| 90th percentile latency | knn-query-from-data-set | 15.4445 | ms | -| 99th percentile latency | knn-query-from-data-set | 21.0682 | ms | -| 99.9th percentile latency | knn-query-from-data-set | 39.5414 | ms | -| 99.99th percentile latency | knn-query-from-data-set | 1116.33 | ms | -| 100th percentile latency | knn-query-from-data-set | 1116.66 | ms | -| 50th percentile service time | knn-query-from-data-set | 11.7174 | ms | -| 90th percentile service time | knn-query-from-data-set | 15.4445 | ms | -| 99th 
percentile service time | knn-query-from-data-set | 21.0682 | ms | -| 99.9th percentile service time | knn-query-from-data-set | 39.5414 | ms | -| 99.99th percentile service time | knn-query-from-data-set | 1116.33 | ms | -| 100th percentile service time | knn-query-from-data-set | 1116.66 | ms | -| error rate | knn-query-from-data-set | 0 | % | - - ---------------------------------- -[INFO] SUCCESS (took 281 seconds) ---------------------------------- -``` - -## Adding a procedure - -Adding additional benchmarks is very simple. First, place any custom parameter -sources or runners in the [extensions](extensions) directory so that other tests -can use them, and also update the [documentation](#custom-extensions) -accordingly. - -Next, create a new test procedure file and add the operations you want your test -to run. Lastly, be sure to update the documentation. - -## Custom Extensions - -OpenSearch Benchmarks is very extensible. To fit the plugin's needs, we add -custom parameter sources and custom runners. Parameter sources allow users to -supply custom parameters to an operation. Runners are what actually perform -the operations against OpenSearch. - -### Custom Parameter Sources - -Custom parameter sources are defined in [extensions/param_sources.py](extensions/param_sources.py). - -| Name | Description | Parameters | -|-------------------------|------------------------------------------------------------------------|------------------------------------------------------------------------------------------------------------------------------------------------------------------| -| bulk-from-data-set | Provides bulk payloads containing vectors from a data set for indexing | 1. data_set_format - (hdf5, bigann)<br>
2. data_set_path - path to data set
3. index - name of index for bulk ingestion
4. field - field to place vector in
5. bulk_size - vectors per bulk request
6. num_vectors - number of vectors to use from the data set. Defaults to the whole data set. | -| knn-query-from-data-set | Provides a query generated from a data set | 1. data_set_format - (hdf5, bigann)
2. data_set_path - path to data set
3. index - name of index to query against
4. field - field to query against<br>
5. k - number of results to return
6. dimension - size of vectors to produce
7. num_vectors - number of vectors to use from the data set. Defaults to the whole data set. | - - -### Custom Runners - -Custom runners are defined in [extensions/runners.py](extensions/runners.py). - -| Syntax | Description | Parameters | -|--------------------|-----------------------------------------------------|:-------------------------------------------------------------------------------------------------------------| -| custom-vector-bulk | Bulk index a set of vectors in an OpenSearch index. | 1. bulk-from-data-set | -| custom-refresh | Run refresh with retry capabilities. | 1. index - name of index to refresh
2. retries - number of times to retry the operation | -| train-model | Trains a model. | 1. body - model definition
2. timeout - time in seconds to wait for the model training to finish<br>
3. model_id - ID of model | -| delete-model | Deletes a model if it exists. | 1. model_id - ID of model | - -### Testing - -We have a set of unit tests for our extensions in -[tests](tests). To run all the tests, run the following -command: - -```commandline -python -m unittest discover ./tests -``` - -To run an individual test: -```commandline -python -m unittest tests.test_param_sources.VectorsFromDataSetParamSourceTestCase.test_partition_hdf5 -``` diff --git a/benchmarks/osb/__init__.py b/benchmarks/osb/__init__.py deleted file mode 100644 index e69de29bb..000000000 diff --git a/benchmarks/osb/extensions/__init__.py b/benchmarks/osb/extensions/__init__.py deleted file mode 100644 index e69de29bb..000000000 diff --git a/benchmarks/osb/extensions/data_set.py b/benchmarks/osb/extensions/data_set.py deleted file mode 100644 index 7e8058844..000000000 --- a/benchmarks/osb/extensions/data_set.py +++ /dev/null @@ -1,202 +0,0 @@ -# SPDX-License-Identifier: Apache-2.0 -# -# The OpenSearch Contributors require contributions made to -# this file be licensed under the Apache-2.0 license or a -# compatible open source license. - -import os -import numpy as np -from abc import ABC, ABCMeta, abstractmethod -from enum import Enum -from typing import cast -import h5py -import struct - - -class Context(Enum): - """DataSet context enum. Can be used to add additional context for how a - data-set should be interpreted. - """ - INDEX = 1 - QUERY = 2 - NEIGHBORS = 3 - - -class DataSet(ABC): - """DataSet interface. Used for reading data-sets from files. - - Methods: - read: Read a chunk of data from the data-set - seek: Get to position in the data-set - size: Gets the number of items in the data-set - reset: Resets internal state of data-set to beginning - """ - __metaclass__ = ABCMeta - - BEGINNING = 0 - - @abstractmethod - def read(self, chunk_size: int): - pass - - @abstractmethod - def seek(self, offset: int): - pass - - @abstractmethod - def size(self): - pass - - @abstractmethod - def reset(self): - pass - - -class HDF5DataSet(DataSet): - """ Data-set format corresponding to `ANN Benchmarks - `_ - """ - - FORMAT_NAME = "hdf5" - - def __init__(self, dataset_path: str, context: Context): - file = h5py.File(dataset_path) - self.data = cast(h5py.Dataset, file[self.parse_context(context)]) - self.current = self.BEGINNING - - def read(self, chunk_size: int): - if self.current >= self.size(): - return None - - end_offset = self.current + chunk_size - if end_offset > self.size(): - end_offset = self.size() - - v = cast(np.ndarray, self.data[self.current:end_offset]) - self.current = end_offset - return v - - def seek(self, offset: int): - - if offset < self.BEGINNING: - raise Exception("Offset must be greater than or equal to 0") - - if offset >= self.size(): - raise Exception("Offset must be less than the data set size") - - self.current = offset - - def size(self): - return self.data.len() - - def reset(self): - self.current = self.BEGINNING - - @staticmethod - def parse_context(context: Context) -> str: - if context == Context.NEIGHBORS: - return "neighbors" - - if context == Context.INDEX: - return "train" - - if context == Context.QUERY: - return "test" - - raise Exception("Unsupported context") - - -class BigANNVectorDataSet(DataSet): - """ Data-set format for vector data-sets for `Big ANN Benchmarks - `_ - """ - - DATA_SET_HEADER_LENGTH = 8 - U8BIN_EXTENSION = "u8bin" - FBIN_EXTENSION = "fbin" - FORMAT_NAME = "bigann" - - BYTES_PER_U8INT = 1 - BYTES_PER_FLOAT = 4 - - def __init__(self, dataset_path: str): 
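- # The Big ANN binary layout assumed here: an 8-byte header holding two
- # little-endian uint32 values (vector count, then dimension), followed by
- # num_points * dimension values whose width depends on the file extension
- # (.u8bin = 1 byte per value, .fbin = 4 bytes; see _get_data_size below).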
- self.file = open(dataset_path, 'rb') - self.file.seek(BigANNVectorDataSet.BEGINNING, os.SEEK_END) - num_bytes = self.file.tell() - self.file.seek(BigANNVectorDataSet.BEGINNING) - - if num_bytes < BigANNVectorDataSet.DATA_SET_HEADER_LENGTH: - raise Exception("File is invalid") - - self.num_points = int.from_bytes(self.file.read(4), "little") - self.dimension = int.from_bytes(self.file.read(4), "little") - self.bytes_per_num = self._get_data_size(dataset_path) - - if (num_bytes - BigANNVectorDataSet.DATA_SET_HEADER_LENGTH) != self.num_points * \ - self.dimension * self.bytes_per_num: - raise Exception("File is invalid") - - self.reader = self._value_reader(dataset_path) - self.current = BigANNVectorDataSet.BEGINNING - - def read(self, chunk_size: int): - if self.current >= self.size(): - return None - - end_offset = self.current + chunk_size - if end_offset > self.size(): - end_offset = self.size() - - v = np.asarray([self._read_vector() for _ in - range(end_offset - self.current)]) - self.current = end_offset - return v - - def seek(self, offset: int): - - if offset < self.BEGINNING: - raise Exception("Offset must be greater than or equal to 0") - - if offset >= self.size(): - raise Exception("Offset must be less than the data set size") - - bytes_offset = BigANNVectorDataSet.DATA_SET_HEADER_LENGTH + \ - self.dimension * self.bytes_per_num * offset - self.file.seek(bytes_offset) - self.current = offset - - def _read_vector(self): - return np.asarray([self.reader(self.file) for _ in - range(self.dimension)]) - - def size(self): - return self.num_points - - def reset(self): - self.file.seek(BigANNVectorDataSet.DATA_SET_HEADER_LENGTH) - self.current = BigANNVectorDataSet.BEGINNING - - def __del__(self): - self.file.close() - - @staticmethod - def _get_data_size(file_name): - ext = file_name.split('.')[-1] - if ext == BigANNVectorDataSet.U8BIN_EXTENSION: - return BigANNVectorDataSet.BYTES_PER_U8INT - - if ext == BigANNVectorDataSet.FBIN_EXTENSION: - return BigANNVectorDataSet.BYTES_PER_FLOAT - - raise Exception("Unknown extension") - - @staticmethod - def _value_reader(file_name): - ext = file_name.split('.')[-1] - if ext == BigANNVectorDataSet.U8BIN_EXTENSION: - return lambda file: float(int.from_bytes(file.read(BigANNVectorDataSet.BYTES_PER_U8INT), "little")) - - if ext == BigANNVectorDataSet.FBIN_EXTENSION: - return lambda file: struct.unpack('= self.num_vectors + self.offset: - raise StopIteration - - if self.vector_batch is None or len(self.vector_batch) == 0: - self.vector_batch = self._batch_read(self.data_set) - if self.vector_batch is None: - raise StopIteration - vector = self.vector_batch.pop(0) - self.current += 1 - self.percent_completed = self.current / self.total - - return self._build_query_body(self.index_name, self.field_name, self.k, - vector) - - def _batch_read(self, data_set: DataSet): - return list(data_set.read(self.VECTOR_READ_BATCH_SIZE)) - - def _build_query_body(self, index_name: str, field_name: str, k: int, - vector) -> dict: - """Builds a k-NN query that can be used to execute an approximate nearest - neighbor search against a k-NN plugin index - Args: - index_name: name of index to search - field_name: name of field to search - k: number of results to return - vector: vector used for query - Returns: - A dictionary containing the body used for search, a set of request - parameters to attach to the search and the name of the index. 
- """ - return { - "index": index_name, - "request-params": { - "_source": { - "exclude": [field_name] - } - }, - "body": { - "size": k, - "query": { - "knn": { - field_name: { - "vector": vector, - "k": k - } - } - } - } - } - - -class BulkVectorsFromDataSetParamSource(VectorsFromDataSetParamSource): - """ Create bulk index requests from a data set of vectors. - - Attributes: - bulk_size: number of vectors per request - retries: number of times to retry the request when it fails - """ - - DEFAULT_RETRIES = 10 - - def __init__(self, workload, params, **kwargs): - super().__init__(params, Context.INDEX) - self.bulk_size: int = parse_int_parameter("bulk_size", params) - self.retries: int = parse_int_parameter("retries", params, - self.DEFAULT_RETRIES) - - def params(self): - """ - Returns: A bulk index parameter with vectors from a data set. - """ - if self.current >= self.num_vectors + self.offset: - raise StopIteration - - def action(doc_id): - return {'index': {'_index': self.index_name, '_id': doc_id}} - - partition = self.data_set.read(self.bulk_size) - body = bulk_transform(partition, self.field_name, action, self.current) - size = len(body) // 2 - self.current += size - self.percent_completed = self.current / self.total - - return { - "body": body, - "retries": self.retries, - "size": size - } diff --git a/benchmarks/osb/extensions/registry.py b/benchmarks/osb/extensions/registry.py deleted file mode 100644 index 5ce17ab6f..000000000 --- a/benchmarks/osb/extensions/registry.py +++ /dev/null @@ -1,13 +0,0 @@ -# SPDX-License-Identifier: Apache-2.0 -# -# The OpenSearch Contributors require contributions made to -# this file be licensed under the Apache-2.0 license or a -# compatible open source license. - -from .param_sources import register as param_sources_register -from .runners import register as runners_register - - -def register(registry): - param_sources_register(registry) - runners_register(registry) diff --git a/benchmarks/osb/extensions/runners.py b/benchmarks/osb/extensions/runners.py deleted file mode 100644 index d048f80b0..000000000 --- a/benchmarks/osb/extensions/runners.py +++ /dev/null @@ -1,121 +0,0 @@ -# SPDX-License-Identifier: Apache-2.0 -# -# The OpenSearch Contributors require contributions made to -# this file be licensed under the Apache-2.0 license or a -# compatible open source license. -from opensearchpy.exceptions import ConnectionTimeout -from .util import parse_int_parameter, parse_string_parameter -import logging -import time - - -def register(registry): - registry.register_runner( - "custom-vector-bulk", BulkVectorsFromDataSetRunner(), async_runner=True - ) - registry.register_runner( - "custom-refresh", CustomRefreshRunner(), async_runner=True - ) - registry.register_runner( - "train-model", TrainModelRunner(), async_runner=True - ) - registry.register_runner( - "delete-model", DeleteModelRunner(), async_runner=True - ) - - -class BulkVectorsFromDataSetRunner: - - async def __call__(self, opensearch, params): - size = parse_int_parameter("size", params) - retries = parse_int_parameter("retries", params, 0) + 1 - - for _ in range(retries): - try: - await opensearch.bulk( - body=params["body"], - timeout='5m' - ) - - return size, "docs" - except ConnectionTimeout: - logging.getLogger(__name__)\ - .warning("Bulk vector ingestion timed out. 
Retrying") - - raise TimeoutError("Failed to submit bulk request in specified number " - "of retries: {}".format(retries)) - - def __repr__(self, *args, **kwargs): - return "custom-vector-bulk" - - -class CustomRefreshRunner: - - async def __call__(self, opensearch, params): - retries = parse_int_parameter("retries", params, 0) + 1 - - for _ in range(retries): - try: - await opensearch.indices.refresh( - index=parse_string_parameter("index", params) - ) - - return - except ConnectionTimeout: - logging.getLogger(__name__)\ - .warning("Custom refresh timed out. Retrying") - - raise TimeoutError("Failed to refresh the index in specified number " - "of retries: {}".format(retries)) - - def __repr__(self, *args, **kwargs): - return "custom-refresh" - - -class TrainModelRunner: - - async def __call__(self, opensearch, params): - # Train a model and wait for it training to complete - body = params["body"] - timeout = parse_int_parameter("timeout", params) - model_id = parse_string_parameter("model_id", params) - - method = "POST" - model_uri = "/_plugins/_knn/models/{}".format(model_id) - await opensearch.transport.perform_request(method, "{}/_train".format(model_uri), body=body) - - start_time = time.time() - while time.time() < start_time + timeout: - time.sleep(1) - model_response = await opensearch.transport.perform_request("GET", model_uri) - - if 'state' not in model_response.keys(): - continue - - if model_response['state'] == 'created': - #TODO: Return model size as well - return 1, "models_trained" - - if model_response['state'] == 'failed': - raise Exception("Failed to create model: {}".format(model_response)) - - raise Exception('Failed to create model: {} within timeout {} seconds' - .format(model_id, timeout)) - - def __repr__(self, *args, **kwargs): - return "train-model" - - -class DeleteModelRunner: - - async def __call__(self, opensearch, params): - # Delete model provided by model id - method = "DELETE" - model_id = parse_string_parameter("model_id", params) - uri = "/_plugins/_knn/models/{}".format(model_id) - - # Ignore if model doesnt exist - await opensearch.transport.perform_request(method, uri, params={"ignore": [400, 404]}) - - def __repr__(self, *args, **kwargs): - return "delete-model" diff --git a/benchmarks/osb/extensions/util.py b/benchmarks/osb/extensions/util.py deleted file mode 100644 index f7f6aab62..000000000 --- a/benchmarks/osb/extensions/util.py +++ /dev/null @@ -1,71 +0,0 @@ -# SPDX-License-Identifier: Apache-2.0 -# -# The OpenSearch Contributors require contributions made to -# this file be licensed under the Apache-2.0 license or a -# compatible open source license. - -import numpy as np -from typing import List -from typing import Dict -from typing import Any - - -def bulk_transform(partition: np.ndarray, field_name: str, action, - offset: int) -> List[Dict[str, Any]]: - """Partitions and transforms a list of vectors into OpenSearch's bulk - injection format. - Args: - offset: to start counting from - partition: An array of vectors to transform. - field_name: field name for action - action: Bulk API action. - Returns: - An array of transformed vectors in bulk format. 
- """ - actions = [] - _ = [ - actions.extend([action(i + offset), None]) - for i in range(len(partition)) - ] - actions[1::2] = [{field_name: vec} for vec in partition.tolist()] - return actions - - -def parse_string_parameter(key: str, params: dict, default: str = None) -> str: - if key not in params: - if default is not None: - return default - raise ConfigurationError( - "Value cannot be None for param {}".format(key) - ) - - if type(params[key]) is str: - return params[key] - - raise ConfigurationError("Value must be a string for param {}".format(key)) - - -def parse_int_parameter(key: str, params: dict, default: int = None) -> int: - if key not in params: - if default: - return default - raise ConfigurationError( - "Value cannot be None for param {}".format(key) - ) - - if type(params[key]) is int: - return params[key] - - raise ConfigurationError("Value must be a int for param {}".format(key)) - - -class ConfigurationError(Exception): - """Exception raised for errors configuration. - - Attributes: - message -- explanation of the error - """ - - def __init__(self, message: str): - self.message = f'{message}' - super().__init__(self.message) diff --git a/benchmarks/osb/indices/faiss-index.json b/benchmarks/osb/indices/faiss-index.json deleted file mode 100644 index 2db4d34d4..000000000 --- a/benchmarks/osb/indices/faiss-index.json +++ /dev/null @@ -1,27 +0,0 @@ -{ - "settings": { - "index": { - "knn": true, - "number_of_shards": {{ target_index_primary_shards }}, - "number_of_replicas": {{ target_index_replica_shards }} - } - }, - "mappings": { - "properties": { - "target_field": { - "type": "knn_vector", - "dimension": {{ target_index_dimension }}, - "method": { - "name": "hnsw", - "space_type": "{{ target_index_space_type }}", - "engine": "faiss", - "parameters": { - "ef_search": {{ hnsw_ef_search }}, - "ef_construction": {{ hnsw_ef_construction }}, - "m": {{ hnsw_m }} - } - } - } - } - } -} diff --git a/benchmarks/osb/indices/lucene-index.json b/benchmarks/osb/indices/lucene-index.json deleted file mode 100644 index 0a4ed868a..000000000 --- a/benchmarks/osb/indices/lucene-index.json +++ /dev/null @@ -1,26 +0,0 @@ -{ - "settings": { - "index": { - "knn": true, - "number_of_shards": {{ target_index_primary_shards }}, - "number_of_replicas": {{ target_index_replica_shards }} - } - }, - "mappings": { - "properties": { - "target_field": { - "type": "knn_vector", - "dimension": {{ target_index_dimension }}, - "method": { - "name": "hnsw", - "space_type": "{{ target_index_space_type }}", - "engine": "lucene", - "parameters": { - "ef_construction": {{ hnsw_ef_construction }}, - "m": {{ hnsw_m }} - } - } - } - } - } -} diff --git a/benchmarks/osb/indices/model-index.json b/benchmarks/osb/indices/model-index.json deleted file mode 100644 index 0e92c8903..000000000 --- a/benchmarks/osb/indices/model-index.json +++ /dev/null @@ -1,17 +0,0 @@ -{ - "settings": { - "index": { - "knn": true, - "number_of_shards": {{ target_index_primary_shards | default(1) }}, - "number_of_replicas": {{ target_index_replica_shards | default(0) }} - } - }, - "mappings": { - "properties": { - "{{ target_field_name }}": { - "type": "knn_vector", - "model_id": "{{ train_model_id }}" - } - } - } -} diff --git a/benchmarks/osb/indices/nmslib-index.json b/benchmarks/osb/indices/nmslib-index.json deleted file mode 100644 index 4ceb57977..000000000 --- a/benchmarks/osb/indices/nmslib-index.json +++ /dev/null @@ -1,27 +0,0 @@ -{ - "settings": { - "index": { - "knn": true, - "knn.algo_param.ef_search": {{ hnsw_ef_search }}, 
- "number_of_shards": {{ target_index_primary_shards }}, - "number_of_replicas": {{ target_index_replica_shards }} - } - }, - "mappings": { - "properties": { - "target_field": { - "type": "knn_vector", - "dimension": {{ target_index_dimension }}, - "method": { - "name": "hnsw", - "space_type": "{{ target_index_space_type }}", - "engine": "nmslib", - "parameters": { - "ef_construction": {{ hnsw_ef_construction }}, - "m": {{ hnsw_m }} - } - } - } - } - } -} diff --git a/benchmarks/osb/indices/train-index.json b/benchmarks/osb/indices/train-index.json deleted file mode 100644 index 82af8215e..000000000 --- a/benchmarks/osb/indices/train-index.json +++ /dev/null @@ -1,16 +0,0 @@ -{ - "settings": { - "index": { - "number_of_shards": {{ train_index_primary_shards }}, - "number_of_replicas": {{ train_index_replica_shards }} - } - }, - "mappings": { - "properties": { - "{{ train_field_name }}": { - "type": "knn_vector", - "dimension": {{ target_index_dimension }} - } - } - } -} diff --git a/benchmarks/osb/operations/default.json b/benchmarks/osb/operations/default.json deleted file mode 100644 index ee33166f0..000000000 --- a/benchmarks/osb/operations/default.json +++ /dev/null @@ -1,53 +0,0 @@ -[ - { - "name": "ivfpq-train-model", - "operation-type": "train-model", - "model_id": "{{ train_model_id }}", - "timeout": {{ train_timeout }}, - "body": { - "training_index": "{{ train_index_name }}", - "training_field": "{{ train_field_name }}", - "dimension": {{ target_index_dimension }}, - "search_size": {{ train_search_size }}, - "max_training_vector_count": {{ train_index_num_vectors }}, - "method": { - "name":"ivf", - "engine":"faiss", - "space_type": "{{ target_index_space_type }}", - "parameters":{ - "nlist": {{ ivf_nlists }}, - "nprobes": {{ ivf_nprobes }}, - "encoder":{ - "name":"pq", - "parameters":{ - "code_size": {{ pq_code_size }}, - "m": {{ pq_m }} - } - } - } - } - } - }, - { - "name": "ivf-train-model", - "operation-type": "train-model", - "model_id": "{{ train_model_id }}", - "timeout": {{ train_timeout | default(1000) }}, - "body": { - "training_index": "{{ train_index_name }}", - "training_field": "{{ train_field_name }}", - "search_size": {{ train_search_size }}, - "dimension": {{ target_index_dimension }}, - "max_training_vector_count": {{ train_index_num_vectors }}, - "method": { - "name":"ivf", - "engine":"faiss", - "space_type": "{{ target_index_space_type }}", - "parameters":{ - "nlist": {{ ivf_nlists }}, - "nprobes": {{ ivf_nprobes }} - } - } - } - } -] diff --git a/benchmarks/osb/params/no-train-params.json b/benchmarks/osb/params/no-train-params.json deleted file mode 100644 index 58e4197fd..000000000 --- a/benchmarks/osb/params/no-train-params.json +++ /dev/null @@ -1,40 +0,0 @@ -{ - "target_index_name": "target_index", - "target_field_name": "target_field", - "target_index_body": "indices/nmslib-index.json", - "target_index_primary_shards": 3, - "target_index_replica_shards": 1, - "target_index_dimension": 128, - "target_index_space_type": "l2", - "target_index_bulk_size": 200, - "target_index_bulk_index_data_set_format": "hdf5", - "target_index_bulk_index_data_set_path": "", - "target_index_bulk_index_clients": 10, - "target_index_max_num_segments": 10, - "target_index_force_merge_timeout": 45.0, - "hnsw_ef_search": 512, - "hnsw_ef_construction": 512, - "hnsw_m": 16, - - "query_k": 10, - "query_clients": 10, - "query_data_set_format": "hdf5", - "query_data_set_path": "", - - "ivf_nlists": 1, - "ivf_nprobes": 1, - "pq_code_size": 1, - "pq_m": 1, - "train_model_method": "", - 
"train_model_id": "", - "train_index_name": "", - "train_field_name": "", - "train_index_body": "", - "train_search_size": 1, - "train_timeout": 1, - "train_index_bulk_size": 1, - "train_index_data_set_format": "", - "train_index_data_set_path": "", - "train_index_num_vectors": 1, - "train_index_bulk_index_clients": 1 -} diff --git a/benchmarks/osb/params/train-params.json b/benchmarks/osb/params/train-params.json deleted file mode 100644 index f55ed4333..000000000 --- a/benchmarks/osb/params/train-params.json +++ /dev/null @@ -1,38 +0,0 @@ -{ - "target_index_name": "target_index", - "target_field_name": "target_field", - "target_index_body": "indices/model-index.json", - "target_index_primary_shards": 3, - "target_index_replica_shards": 1, - "target_index_dimension": 128, - "target_index_space_type": "l2", - "target_index_bulk_size": 200, - "target_index_bulk_index_data_set_format": "hdf5", - "target_index_bulk_index_data_set_path": "", - "target_index_bulk_index_clients": 10, - "target_index_max_num_segments": 10, - "target_index_force_merge_timeout": 45.0, - "ivf_nlists": 10, - "ivf_nprobes": 1, - "pq_code_size": 8, - "pq_m": 8, - "train_model_method": "ivfpq", - "train_model_id": "test-model", - "train_index_name": "train_index", - "train_field_name": "train_field", - "train_index_body": "indices/train-index.json", - "train_search_size": 500, - "train_timeout": 5000, - "train_index_primary_shards": 1, - "train_index_replica_shards": 0, - "train_index_bulk_size": 200, - "train_index_data_set_format": "hdf5", - "train_index_data_set_path": "", - "train_index_num_vectors": 1000000, - "train_index_bulk_index_clients": 10, - - "query_k": 10, - "query_clients": 10, - "query_data_set_format": "hdf5", - "query_data_set_path": "" -} diff --git a/benchmarks/osb/procedures/no-train-test.json b/benchmarks/osb/procedures/no-train-test.json deleted file mode 100644 index 01985b914..000000000 --- a/benchmarks/osb/procedures/no-train-test.json +++ /dev/null @@ -1,73 +0,0 @@ -{% import "benchmark.helpers" as benchmark with context %} -{ - "name": "no-train-test", - "default": true, - "schedule": [ - { - "operation": { - "name": "delete-target-index", - "operation-type": "delete-index", - "only-if-exists": true, - "index": "{{ target_index_name }}" - } - }, - { - "operation": { - "name": "create-target-index", - "operation-type": "create-index", - "index": "{{ target_index_name }}" - } - }, - { - "name": "wait-for-cluster-to-be-green", - "operation": "cluster-health", - "request-params": { - "wait_for_status": "green" - } - }, - { - "operation": { - "name": "custom-vector-bulk", - "operation-type": "custom-vector-bulk", - "param-source": "bulk-from-data-set", - "index": "{{ target_index_name }}", - "field": "{{ target_field_name }}", - "bulk_size": {{ target_index_bulk_size }}, - "data_set_format": "{{ target_index_bulk_index_data_set_format }}", - "data_set_path": "{{ target_index_bulk_index_data_set_path }}" - }, - "clients": {{ target_index_bulk_index_clients }} - }, - { - "operation": { - "name": "refresh-target-index", - "operation-type": "custom-refresh", - "index": "{{ target_index_name }}", - "retries": 100 - } - }, - { - "operation": { - "name": "force-merge", - "operation-type": "force-merge", - "request-timeout": {{ target_index_force_merge_timeout }}, - "index": "{{ target_index_name }}", - "mode": "polling", - "max-num-segments": {{ target_index_max_num_segments }} - } - }, - { - "operation": { - "name": "knn-query-from-data-set", - "operation-type": "search", - "index": "{{ 
target_index_name }}", - "param-source": "knn-query-from-data-set", - "k": {{ query_k }}, - "field": "{{ target_field_name }}", - "data_set_format": "{{ query_data_set_format }}", - "data_set_path": "{{ query_data_set_path }}" - }, - "clients": {{ query_clients }} - } - ] -} diff --git a/benchmarks/osb/procedures/train-test.json b/benchmarks/osb/procedures/train-test.json deleted file mode 100644 index ca26db0b0..000000000 --- a/benchmarks/osb/procedures/train-test.json +++ /dev/null @@ -1,127 +0,0 @@ -{% import "benchmark.helpers" as benchmark with context %} -{ - "name": "train-test", - "default": false, - "schedule": [ - { - "operation": { - "name": "delete-target-index", - "operation-type": "delete-index", - "only-if-exists": true, - "index": "{{ target_index_name }}" - } - }, - { - "operation": { - "name": "delete-train-index", - "operation-type": "delete-index", - "only-if-exists": true, - "index": "{{ train_index_name }}" - } - }, - { - "operation": { - "operation-type": "delete-model", - "name": "delete-model", - "model_id": "{{ train_model_id }}" - } - }, - { - "operation": { - "name": "create-train-index", - "operation-type": "create-index", - "index": "{{ train_index_name }}" - } - }, - { - "name": "wait-for-train-index-to-be-green", - "operation": "cluster-health", - "request-params": { - "wait_for_status": "green" - } - }, - { - "operation": { - "name": "train-vector-bulk", - "operation-type": "custom-vector-bulk", - "param-source": "bulk-from-data-set", - "index": "{{ train_index_name }}", - "field": "{{ train_field_name }}", - "bulk_size": {{ train_index_bulk_size }}, - "data_set_format": "{{ train_index_data_set_format }}", - "data_set_path": "{{ train_index_data_set_path }}", - "num_vectors": {{ train_index_num_vectors }} - }, - "clients": {{ train_index_bulk_index_clients }} - }, - { - "operation": { - "name": "refresh-train-index", - "operation-type": "custom-refresh", - "index": "{{ train_index_name }}", - "retries": 100 - } - }, - { - "operation": "{{ train_model_method }}-train-model" - }, - { - "operation": { - "name": "create-target-index", - "operation-type": "create-index", - "index": "{{ target_index_name }}" - } - }, - { - "name": "wait-for-target-index-to-be-green", - "operation": "cluster-health", - "request-params": { - "wait_for_status": "green" - } - }, - { - "operation": { - "name": "custom-vector-bulk", - "operation-type": "custom-vector-bulk", - "param-source": "bulk-from-data-set", - "index": "{{ target_index_name }}", - "field": "{{ target_field_name }}", - "bulk_size": {{ target_index_bulk_size }}, - "data_set_format": "{{ target_index_bulk_index_data_set_format }}", - "data_set_path": "{{ target_index_bulk_index_data_set_path }}" - }, - "clients": {{ target_index_bulk_index_clients }} - }, - { - "operation": { - "name": "refresh-target-index", - "operation-type": "custom-refresh", - "index": "{{ target_index_name }}", - "retries": 100 - } - }, - { - "operation": { - "name": "force-merge", - "operation-type": "force-merge", - "request-timeout": {{ target_index_force_merge_timeout }}, - "index": "{{ target_index_name }}", - "mode": "polling", - "max-num-segments": {{ target_index_max_num_segments }} - } - }, - { - "operation": { - "name": "knn-query-from-data-set", - "operation-type": "search", - "index": "{{ target_index_name }}", - "param-source": "knn-query-from-data-set", - "k": {{ query_k }}, - "field": "{{ target_field_name }}", - "data_set_format": "{{ query_data_set_format }}", - "data_set_path": "{{ query_data_set_path }}" - }, - "clients": {{ 
query_clients }} - } - ] -} diff --git a/benchmarks/osb/requirements.in b/benchmarks/osb/requirements.in deleted file mode 100644 index a9e12b5d3..000000000 --- a/benchmarks/osb/requirements.in +++ /dev/null @@ -1,4 +0,0 @@ -opensearch-py -numpy -h5py -opensearch-benchmark diff --git a/benchmarks/osb/requirements.txt b/benchmarks/osb/requirements.txt deleted file mode 100644 index a220ee44f..000000000 --- a/benchmarks/osb/requirements.txt +++ /dev/null @@ -1,96 +0,0 @@ -# -# This file is autogenerated by pip-compile with python 3.8 -# To update, run: -# -# pip-compile -# -aiohttp==3.9.4 - # via opensearch-py -aiosignal==1.2.0 - # via aiohttp -async-timeout==4.0.2 - # via aiohttp -attrs==21.4.0 - # via - # aiohttp - # jsonschema -cachetools==4.2.4 - # via google-auth -certifi==2023.7.22 - # via - # opensearch-benchmark - # opensearch-py -frozenlist==1.3.0 - # via - # aiohttp - # aiosignal -google-auth==1.22.1 - # via opensearch-benchmark -google-crc32c==1.3.0 - # via google-resumable-media -google-resumable-media==1.1.0 - # via opensearch-benchmark -h5py==3.6.0 - # via -r requirements.in -idna==3.7 - # via yarl -ijson==2.6.1 - # via opensearch-benchmark -importlib-metadata==4.11.3 - # via jsonschema -jinja2==3.1.3 - # via opensearch-benchmark -jsonschema==3.1.1 - # via opensearch-benchmark -markupsafe==2.0.1 - # via - # jinja2 - # opensearch-benchmark -multidict==6.0.2 - # via - # aiohttp - # yarl -numpy==1.24.2 - # via - # -r requirements.in - # h5py -opensearch-benchmark==0.0.2 - # via -r requirements.in -opensearch-py[async]==1.0.0 - # via - # -r requirements.in - # opensearch-benchmark -psutil==5.8.0 - # via opensearch-benchmark -py-cpuinfo==7.0.0 - # via opensearch-benchmark -pyasn1==0.4.8 - # via - # pyasn1-modules - # rsa -pyasn1-modules==0.2.8 - # via google-auth -pyrsistent==0.18.1 - # via jsonschema -rsa==4.8 - # via google-auth -six==1.16.0 - # via - # google-auth - # google-resumable-media - # jsonschema -tabulate==0.8.7 - # via opensearch-benchmark -thespian==3.10.1 - # via opensearch-benchmark -urllib3==1.26.18 - # via opensearch-py -yappi==1.2.3 - # via opensearch-benchmark -yarl==1.7.2 - # via aiohttp -zipp==3.7.0 - # via importlib-metadata - -# The following packages are considered to be unsafe in a requirements file: -# setuptools diff --git a/benchmarks/osb/tests/__init__.py b/benchmarks/osb/tests/__init__.py deleted file mode 100644 index e69de29bb..000000000 diff --git a/benchmarks/osb/tests/data_set_helper.py b/benchmarks/osb/tests/data_set_helper.py deleted file mode 100644 index 2b144da49..000000000 --- a/benchmarks/osb/tests/data_set_helper.py +++ /dev/null @@ -1,197 +0,0 @@ -# SPDX-License-Identifier: Apache-2.0 -# -# The OpenSearch Contributors require contributions made to -# this file be licensed under the Apache-2.0 license or a -# compatible open source license. - -from abc import ABC, abstractmethod - -import h5py -import numpy as np - -from osb.extensions.data_set import Context, HDF5DataSet, BigANNVectorDataSet - -""" Module containing utility classes and functions for working with data sets. - -Included are utilities that can be used to build data sets and write them to -paths. -""" - - -class DataSetBuildContext: - """ Data class capturing information needed to build a particular data set - - Attributes: - data_set_context: Indicator of what the data set is used for, - vectors: A 2D array containing vectors that are used to build data set. - path: string representing path where data set should be serialized to. 
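-        Example (hypothetical path): DataSetBuildContext(Context.INDEX,
-        create_random_2d_array(10, 128), "/tmp/data-set.hdf5") describes a
-        10 x 128 float32 data set destined for the "train" group of an HDF5 file.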
- """ - def __init__(self, data_set_context: Context, vectors: np.ndarray, path: str): - self.data_set_context: Context = data_set_context - self.vectors: np.ndarray = vectors #TODO: Validate shape - self.path: str = path - - def get_num_vectors(self) -> int: - return self.vectors.shape[0] - - def get_dimension(self) -> int: - return self.vectors.shape[1] - - def get_type(self) -> np.dtype: - return self.vectors.dtype - - -class DataSetBuilder(ABC): - """ Abstract builder used to create a build a collection of data sets - - Attributes: - data_set_build_contexts: list of data set build contexts that builder - will build. - """ - def __init__(self): - self.data_set_build_contexts = list() - - def add_data_set_build_context(self, data_set_build_context: DataSetBuildContext): - """ Adds a data set build context to list of contexts to be built. - - Args: - data_set_build_context: DataSetBuildContext to be added to list - - Returns: Updated DataSetBuilder - - """ - self._validate_data_set_context(data_set_build_context) - self.data_set_build_contexts.append(data_set_build_context) - return self - - def build(self): - """ Builds and serializes all data sets build contexts - - Returns: - - """ - [self._build_data_set(data_set_build_context) for data_set_build_context - in self.data_set_build_contexts] - - @abstractmethod - def _build_data_set(self, context: DataSetBuildContext): - """ Builds an individual data set - - Args: - context: DataSetBuildContext of data set to be built - - Returns: - - """ - pass - - @abstractmethod - def _validate_data_set_context(self, context: DataSetBuildContext): - """ Validates that data set context can be added to this builder - - Args: - context: DataSetBuildContext to be validated - - Returns: - - """ - pass - - -class HDF5Builder(DataSetBuilder): - - def __init__(self): - super(HDF5Builder, self).__init__() - self.data_set_meta_data = dict() - - def _validate_data_set_context(self, context: DataSetBuildContext): - if context.path not in self.data_set_meta_data.keys(): - self.data_set_meta_data[context.path] = { - context.data_set_context: context - } - return - - if context.data_set_context in \ - self.data_set_meta_data[context.path].keys(): - raise IllegalDataSetBuildContext("Path and context for data set " - "are already present in builder.") - - self.data_set_meta_data[context.path][context.data_set_context] = \ - context - - @staticmethod - def _validate_extension(context: DataSetBuildContext): - ext = context.path.split('.')[-1] - - if ext != HDF5DataSet.FORMAT_NAME: - raise IllegalDataSetBuildContext("Invalid file extension") - - def _build_data_set(self, context: DataSetBuildContext): - # For HDF5, because multiple data sets can be grouped in the same file, - # we will build data sets in memory and not write to disk until - # _flush_data_sets_to_disk is called - with h5py.File(context.path, 'a') as hf: - hf.create_dataset( - HDF5DataSet.parse_context(context.data_set_context), - data=context.vectors - ) - - -class BigANNBuilder(DataSetBuilder): - - def _validate_data_set_context(self, context: DataSetBuildContext): - self._validate_extension(context) - - # prevent the duplication of paths for data sets - data_set_paths = [c.path for c in self.data_set_build_contexts] - if any(data_set_paths.count(x) > 1 for x in data_set_paths): - raise IllegalDataSetBuildContext("Build context paths have to be " - "unique.") - - @staticmethod - def _validate_extension(context: DataSetBuildContext): - ext = context.path.split('.')[-1] - - if ext != 
BigANNVectorDataSet.U8BIN_EXTENSION and ext != \ - BigANNVectorDataSet.FBIN_EXTENSION: - raise IllegalDataSetBuildContext("Invalid file extension") - - if ext == BigANNVectorDataSet.U8BIN_EXTENSION and context.get_type() != \ - np.u8int: - raise IllegalDataSetBuildContext("Invalid data type for {} ext." - .format(BigANNVectorDataSet - .U8BIN_EXTENSION)) - - if ext == BigANNVectorDataSet.FBIN_EXTENSION and context.get_type() != \ - np.float32: - print(context.get_type()) - raise IllegalDataSetBuildContext("Invalid data type for {} ext." - .format(BigANNVectorDataSet - .FBIN_EXTENSION)) - - def _build_data_set(self, context: DataSetBuildContext): - num_vectors = context.get_num_vectors() - dimension = context.get_dimension() - - with open(context.path, 'wb') as f: - f.write(int.to_bytes(num_vectors, 4, "little")) - f.write(int.to_bytes(dimension, 4, "little")) - context.vectors.tofile(f) - - -def create_random_2d_array(num_vectors: int, dimension: int) -> np.ndarray: - rng = np.random.default_rng() - return rng.random(size=(num_vectors, dimension), dtype=np.float32) - - -class IllegalDataSetBuildContext(Exception): - """Exception raised when passed in DataSetBuildContext is illegal - - Attributes: - message -- explanation of the error - """ - - def __init__(self, message: str): - self.message = f'{message}' - super().__init__(self.message) - diff --git a/benchmarks/osb/tests/test_param_sources.py b/benchmarks/osb/tests/test_param_sources.py deleted file mode 100644 index cda730cee..000000000 --- a/benchmarks/osb/tests/test_param_sources.py +++ /dev/null @@ -1,353 +0,0 @@ -# SPDX-License-Identifier: Apache-2.0 -# -# The OpenSearch Contributors require contributions made to -# this file be licensed under the Apache-2.0 license or a -# compatible open source license. - -import os -import random -import shutil -import string -import sys -import tempfile -import unittest - -# Add parent directory to path -import numpy as np - -sys.path.append(os.path.abspath(os.path.join(os.getcwd(), os.pardir))) - -from osb.tests.data_set_helper import HDF5Builder, create_random_2d_array, \ - DataSetBuildContext, BigANNBuilder -from osb.extensions.data_set import Context, HDF5DataSet -from osb.extensions.param_sources import VectorsFromDataSetParamSource, \ - QueryVectorsFromDataSetParamSource, BulkVectorsFromDataSetParamSource -from osb.extensions.util import ConfigurationError - -DEFAULT_INDEX_NAME = "test-index" -DEFAULT_FIELD_NAME = "test-field" -DEFAULT_CONTEXT = Context.INDEX -DEFAULT_TYPE = HDF5DataSet.FORMAT_NAME -DEFAULT_NUM_VECTORS = 10 -DEFAULT_DIMENSION = 10 -DEFAULT_RANDOM_STRING_LENGTH = 8 - - -class VectorsFromDataSetParamSourceTestCase(unittest.TestCase): - - def setUp(self) -> None: - self.data_set_dir = tempfile.mkdtemp() - - # Create a data set we know to be valid for convenience - self.valid_data_set_path = _create_data_set( - DEFAULT_NUM_VECTORS, - DEFAULT_DIMENSION, - DEFAULT_TYPE, - DEFAULT_CONTEXT, - self.data_set_dir - ) - - def tearDown(self): - shutil.rmtree(self.data_set_dir) - - def test_missing_params(self): - empty_params = dict() - self.assertRaises( - ConfigurationError, - lambda: VectorsFromDataSetParamSourceTestCase. 
- TestVectorsFromDataSetParamSource(empty_params, DEFAULT_CONTEXT) - ) - - def test_invalid_data_set_format(self): - invalid_data_set_format = "invalid-data-set-format" - - test_param_source_params = { - "index": DEFAULT_INDEX_NAME, - "field": DEFAULT_FIELD_NAME, - "data_set_format": invalid_data_set_format, - "data_set_path": self.valid_data_set_path, - } - self.assertRaises( - ConfigurationError, - lambda: self.TestVectorsFromDataSetParamSource( - test_param_source_params, - DEFAULT_CONTEXT - ) - ) - - def test_invalid_data_set_path(self): - invalid_data_set_path = "invalid-data-set-path" - test_param_source_params = { - "index": DEFAULT_INDEX_NAME, - "field": DEFAULT_FIELD_NAME, - "data_set_format": HDF5DataSet.FORMAT_NAME, - "data_set_path": invalid_data_set_path, - } - self.assertRaises( - FileNotFoundError, - lambda: self.TestVectorsFromDataSetParamSource( - test_param_source_params, - DEFAULT_CONTEXT - ) - ) - - def test_partition_hdf5(self): - num_vectors = 100 - - hdf5_data_set_path = _create_data_set( - num_vectors, - DEFAULT_DIMENSION, - HDF5DataSet.FORMAT_NAME, - DEFAULT_CONTEXT, - self.data_set_dir - ) - - test_param_source_params = { - "index": DEFAULT_INDEX_NAME, - "field": DEFAULT_FIELD_NAME, - "data_set_format": HDF5DataSet.FORMAT_NAME, - "data_set_path": hdf5_data_set_path, - } - test_param_source = self.TestVectorsFromDataSetParamSource( - test_param_source_params, - DEFAULT_CONTEXT - ) - - num_partitions = 10 - vecs_per_partition = test_param_source.num_vectors // num_partitions - - self._test_partition( - test_param_source, - num_partitions, - vecs_per_partition - ) - - def test_partition_bigann(self): - num_vectors = 100 - float_extension = "fbin" - - bigann_data_set_path = _create_data_set( - num_vectors, - DEFAULT_DIMENSION, - float_extension, - DEFAULT_CONTEXT, - self.data_set_dir - ) - - test_param_source_params = { - "index": DEFAULT_INDEX_NAME, - "field": DEFAULT_FIELD_NAME, - "data_set_format": "bigann", - "data_set_path": bigann_data_set_path, - } - test_param_source = self.TestVectorsFromDataSetParamSource( - test_param_source_params, - DEFAULT_CONTEXT - ) - - num_partitions = 10 - vecs_per_partition = test_param_source.num_vectors // num_partitions - - self._test_partition( - test_param_source, - num_partitions, - vecs_per_partition - ) - - def _test_partition( - self, - test_param_source: VectorsFromDataSetParamSource, - num_partitions: int, - vec_per_partition: int - ): - for i in range(num_partitions): - test_param_source_i = test_param_source.partition(i, num_partitions) - self.assertEqual(test_param_source_i.num_vectors, vec_per_partition) - self.assertEqual(test_param_source_i.offset, i * vec_per_partition) - - class TestVectorsFromDataSetParamSource(VectorsFromDataSetParamSource): - """ - Empty implementation of ABC VectorsFromDataSetParamSource so that we can - test the concrete methods. 
- """ - - def params(self): - pass - - -class QueryVectorsFromDataSetParamSourceTestCase(unittest.TestCase): - - def setUp(self) -> None: - self.data_set_dir = tempfile.mkdtemp() - - def tearDown(self): - shutil.rmtree(self.data_set_dir) - - def test_params(self): - # Create a data set - k = 12 - data_set_path = _create_data_set( - DEFAULT_NUM_VECTORS, - DEFAULT_DIMENSION, - DEFAULT_TYPE, - Context.QUERY, - self.data_set_dir - ) - - # Create a QueryVectorsFromDataSetParamSource with relevant params - test_param_source_params = { - "index": DEFAULT_INDEX_NAME, - "field": DEFAULT_FIELD_NAME, - "data_set_format": DEFAULT_TYPE, - "data_set_path": data_set_path, - "k": k, - } - query_param_source = QueryVectorsFromDataSetParamSource( - None, test_param_source_params - ) - - # Check each - for i in range(DEFAULT_NUM_VECTORS): - self._check_params( - query_param_source.params(), - DEFAULT_INDEX_NAME, - DEFAULT_FIELD_NAME, - DEFAULT_DIMENSION, - k - ) - - # Assert last call creates stop iteration - self.assertRaises( - StopIteration, - lambda: query_param_source.params() - ) - - def _check_params( - self, - params: dict, - expected_index: str, - expected_field: str, - expected_dimension: int, - expected_k: int - ): - index_name = params.get("index") - self.assertEqual(expected_index, index_name) - body = params.get("body") - self.assertIsInstance(body, dict) - query = body.get("query") - self.assertIsInstance(query, dict) - query_knn = query.get("knn") - self.assertIsInstance(query_knn, dict) - field = query_knn.get(expected_field) - self.assertIsInstance(field, dict) - vector = field.get("vector") - self.assertIsInstance(vector, np.ndarray) - self.assertEqual(len(list(vector)), expected_dimension) - k = field.get("k") - self.assertEqual(k, expected_k) - - -class BulkVectorsFromDataSetParamSourceTestCase(unittest.TestCase): - - def setUp(self) -> None: - self.data_set_dir = tempfile.mkdtemp() - - def tearDown(self): - shutil.rmtree(self.data_set_dir) - - def test_params(self): - num_vectors = 49 - bulk_size = 10 - data_set_path = _create_data_set( - num_vectors, - DEFAULT_DIMENSION, - DEFAULT_TYPE, - Context.INDEX, - self.data_set_dir - ) - - test_param_source_params = { - "index": DEFAULT_INDEX_NAME, - "field": DEFAULT_FIELD_NAME, - "data_set_format": DEFAULT_TYPE, - "data_set_path": data_set_path, - "bulk_size": bulk_size - } - bulk_param_source = BulkVectorsFromDataSetParamSource( - None, test_param_source_params - ) - - # Check each payload returned - vectors_consumed = 0 - while vectors_consumed < num_vectors: - expected_num_vectors = min(num_vectors - vectors_consumed, bulk_size) - self._check_params( - bulk_param_source.params(), - DEFAULT_INDEX_NAME, - DEFAULT_FIELD_NAME, - DEFAULT_DIMENSION, - expected_num_vectors - ) - vectors_consumed += expected_num_vectors - - # Assert last call creates stop iteration - self.assertRaises( - StopIteration, - lambda: bulk_param_source.params() - ) - - def _check_params( - self, - params: dict, - expected_index: str, - expected_field: str, - expected_dimension: int, - expected_num_vectors_in_payload: int - ): - size = params.get("size") - self.assertEqual(size, expected_num_vectors_in_payload) - body = params.get("body") - self.assertIsInstance(body, list) - self.assertEqual(len(body) // 2, expected_num_vectors_in_payload) - - # Bulk payload has 2 parts: first one is the header and the second one - # is the body. 
The header will have the index name and the body will - # have the vector - for header, req_body in zip(*[iter(body)] * 2): - index = header.get("index") - self.assertIsInstance(index, dict) - index_name = index.get("_index") - self.assertEqual(index_name, expected_index) - - vector = req_body.get(expected_field) - self.assertIsInstance(vector, list) - self.assertEqual(len(vector), expected_dimension) - - -def _create_data_set( - num_vectors: int, - dimension: int, - extension: str, - data_set_context: Context, - data_set_dir -) -> str: - - file_name_base = ''.join(random.choice(string.ascii_letters) for _ in - range(DEFAULT_RANDOM_STRING_LENGTH)) - data_set_file_name = "{}.{}".format(file_name_base, extension) - data_set_path = os.path.join(data_set_dir, data_set_file_name) - context = DataSetBuildContext( - data_set_context, - create_random_2d_array(num_vectors, dimension), - data_set_path) - - if extension == HDF5DataSet.FORMAT_NAME: - HDF5Builder().add_data_set_build_context(context).build() - else: - BigANNBuilder().add_data_set_build_context(context).build() - - return data_set_path - - -if __name__ == '__main__': - unittest.main() diff --git a/benchmarks/osb/workload.json b/benchmarks/osb/workload.json deleted file mode 100644 index bd0d84195..000000000 --- a/benchmarks/osb/workload.json +++ /dev/null @@ -1,17 +0,0 @@ -{% import "benchmark.helpers" as benchmark with context %} -{ - "version": 2, - "description": "k-NN Plugin train workload", - "indices": [ - { - "name": "{{ target_index_name }}", - "body": "{{ target_index_body }}" - }, - { - "name": "{{ train_index_name }}", - "body": "{{ train_index_body }}" - } - ], - "operations": {{ benchmark.collect(parts="operations/*.json") }}, - "test_procedures": [{{ benchmark.collect(parts="procedures/*.json") }}] -} diff --git a/benchmarks/osb/workload.py b/benchmarks/osb/workload.py deleted file mode 100644 index 32e6ad02c..000000000 --- a/benchmarks/osb/workload.py +++ /dev/null @@ -1,18 +0,0 @@ -# SPDX-License-Identifier: Apache-2.0 -# -# The OpenSearch Contributors require contributions made to -# this file be licensed under the Apache-2.0 license or a -# compatible open source license. - -# This code needs to be included at the top of every workload.py file. -# OpenSearch Benchmarks is not able to find other helper files unless the path -# is updated. -import os -import sys -sys.path.append(os.path.abspath(os.getcwd())) - -from extensions.registry import register as custom_register - - -def register(registry): - custom_register(registry) diff --git a/benchmarks/perf-tool/.pylintrc b/benchmarks/perf-tool/.pylintrc deleted file mode 100644 index 15bf4ccc3..000000000 --- a/benchmarks/perf-tool/.pylintrc +++ /dev/null @@ -1,443 +0,0 @@ -# This Pylint rcfile contains a best-effort configuration to uphold the -# best-practices and style described in the Google Python style guide: -# https://google.github.io/styleguide/pyguide.html -# -# Its canonical open-source location is: -# https://google.github.io/styleguide/pylintrc - -[MASTER] - -fail-under=9.0 - -# Files or directories to be skipped. They should be base names, not paths. -ignore=third_party - -# Files or directories matching the regex patterns are skipped. The regex -# matches against base names, not paths. -ignore-patterns= - -# Pickle collected data for later comparisons. -persistent=no - -# List of plugins (as comma separated values of python modules names) to load, -# usually to register additional checkers. -load-plugins= - -# Use multiple processes to speed up Pylint. 
-jobs=4 - -# Allow loading of arbitrary C extensions. Extensions are imported into the -# active Python interpreter and may run arbitrary code. -unsafe-load-any-extension=no - - -[MESSAGES CONTROL] - -# Only show warnings with the listed confidence levels. Leave empty to show -# all. Valid levels: HIGH, INFERENCE, INFERENCE_FAILURE, UNDEFINED -confidence= - -# Enable the message, report, category or checker with the given id(s). You can -# either give multiple identifier separated by comma (,) or put this option -# multiple time (only on the command line, not in the configuration file where -# it should appear only once). See also the "--disable" option for examples. -#enable= - -# Disable the message, report, category or checker with the given id(s). You -# can either give multiple identifiers separated by comma (,) or put this -# option multiple times (only on the command line, not in the configuration -# file where it should appear only once).You can also use "--disable=all" to -# disable everything first and then reenable specific checks. For example, if -# you want to run only the similarities checker, you can use "--disable=all -# --enable=similarities". If you want to run only the classes checker, but have -# no Warning level messages displayed, use"--disable=all --enable=classes -# --disable=W" -disable=abstract-method, - apply-builtin, - arguments-differ, - attribute-defined-outside-init, - backtick, - bad-option-value, - basestring-builtin, - buffer-builtin, - c-extension-no-member, - consider-using-enumerate, - cmp-builtin, - cmp-method, - coerce-builtin, - coerce-method, - delslice-method, - div-method, - duplicate-code, - eq-without-hash, - execfile-builtin, - file-builtin, - filter-builtin-not-iterating, - fixme, - getslice-method, - global-statement, - hex-method, - idiv-method, - implicit-str-concat-in-sequence, - import-error, - import-self, - import-star-module-level, - inconsistent-return-statements, - input-builtin, - intern-builtin, - invalid-str-codec, - locally-disabled, - long-builtin, - long-suffix, - map-builtin-not-iterating, - misplaced-comparison-constant, - missing-function-docstring, - metaclass-assignment, - next-method-called, - next-method-defined, - no-absolute-import, - no-else-break, - no-else-continue, - no-else-raise, - no-else-return, - no-init, # added - no-member, - no-name-in-module, - no-self-use, - nonzero-method, - oct-method, - old-division, - old-ne-operator, - old-octal-literal, - old-raise-syntax, - parameter-unpacking, - print-statement, - raising-string, - range-builtin-not-iterating, - raw_input-builtin, - rdiv-method, - reduce-builtin, - relative-import, - reload-builtin, - round-builtin, - setslice-method, - signature-differs, - standarderror-builtin, - suppressed-message, - sys-max-int, - too-few-public-methods, - too-many-ancestors, - too-many-arguments, - too-many-boolean-expressions, - too-many-branches, - too-many-instance-attributes, - too-many-locals, - too-many-nested-blocks, - too-many-public-methods, - too-many-return-statements, - too-many-statements, - trailing-newlines, - unichr-builtin, - unicode-builtin, - unnecessary-pass, - unpacking-in-except, - useless-else-on-loop, - useless-object-inheritance, - useless-suppression, - using-cmp-argument, - wrong-import-order, - xrange-builtin, - zip-builtin-not-iterating, - - -[REPORTS] - -# Set the output format. Available formats are text, parseable, colorized, msvs -# (visual studio) and html. You can also give a reporter class, eg -# mypackage.mymodule.MyReporterClass. 
-output-format=text - -# Put messages in a separate file for each module / package specified on the -# command line instead of printing them on stdout. Reports (if any) will be -# written in a file name "pylint_global.[txt|html]". This option is deprecated -# and it will be removed in Pylint 2.0. -files-output=no - -# Tells whether to display a full report or only the messages -reports=no - -# Python expression which should return a note less than 10 (10 is the highest -# note). You have access to the variables errors warning, statement which -# respectively contain the number of errors / warnings messages and the total -# number of statements analyzed. This is used by the global evaluation report -# (RP0004). -evaluation=10.0 - ((float(5 * error + warning + refactor + convention) / statement) * 10) - -# Template used to display messages. This is a python new-style format string -# used to format the message information. See doc for all details -#msg-template= - - -[BASIC] - -# Good variable names which should always be accepted, separated by a comma -good-names=main,_ - -# Bad variable names which should always be refused, separated by a comma -bad-names= - -# Colon-delimited sets of names that determine each other's naming style when -# the name regexes allow several styles. -name-group= - -# Include a hint for the correct naming format with invalid-name -include-naming-hint=no - -# List of decorators that produce properties, such as abc.abstractproperty. Add -# to this list to register other decorators that produce valid properties. -property-classes=abc.abstractproperty,cached_property.cached_property,cached_property.threaded_cached_property,cached_property.cached_property_with_ttl,cached_property.threaded_cached_property_with_ttl - -# Regular expression matching correct function names -function-rgx=^(?:(?PsetUp|tearDown|setUpModule|tearDownModule)|(?P_?[A-Z][a-zA-Z0-9]*)|(?P_?[a-z][a-z0-9_]*))$ - -# Regular expression matching correct variable names -variable-rgx=^[a-z][a-z0-9_]*$ - -# Regular expression matching correct constant names -const-rgx=^(_?[A-Z][A-Z0-9_]*|__[a-z0-9_]+__|_?[a-z][a-z0-9_]*)$ - -# Regular expression matching correct attribute names -attr-rgx=^_{0,2}[a-z][a-z0-9_]*$ - -# Regular expression matching correct argument names -argument-rgx=^[a-z][a-z0-9_]*$ - -# Regular expression matching correct class attribute names -class-attribute-rgx=^(_?[A-Z][A-Z0-9_]*|__[a-z0-9_]+__|_?[a-z][a-z0-9_]*)$ - -# Regular expression matching correct inline iteration names -inlinevar-rgx=^[a-z][a-z0-9_]*$ - -# Regular expression matching correct class names -class-rgx=^_?[A-Z][a-zA-Z0-9]*$ - -# Regular expression matching correct module names -module-rgx=^(_?[a-z][a-z0-9_]*|__init__)$ - -# Regular expression matching correct method names -method-rgx=(?x)^(?:(?P_[a-z0-9_]+__|runTest|setUp|tearDown|setUpTestCase|tearDownTestCase|setupSelf|tearDownClass|setUpClass|(test|assert)_*[A-Z0-9][a-zA-Z0-9_]*|next)|(?P_{0,2}[A-Z][a-zA-Z0-9_]*)|(?P_{0,2}[a-z][a-z0-9_]*))$ - -# Regular expression which should only match function or class names that do -# not require a docstring. -no-docstring-rgx=(__.*__|main|test.*|.*test|.*Test)$ - -# Minimum line length for functions/classes that require docstrings, shorter -# ones are exempt. -docstring-min-length=10 - - -[TYPECHECK] - -# List of decorators that produce context managers, such as -# contextlib.contextmanager. Add to this list to register other decorators that -# produce valid context managers. 
-contextmanager-decorators=contextlib.contextmanager,contextlib2.contextmanager - -# Tells whether missing members accessed in mixin class should be ignored. A -# mixin class is detected if its name ends with "mixin" (case insensitive). -ignore-mixin-members=yes - -# List of module names for which member attributes should not be checked -# (useful for modules/projects where namespaces are manipulated during runtime -# and thus existing member attributes cannot be deduced by static analysis. It -# supports qualified module names, as well as Unix pattern matching. -ignored-modules= - -# List of class names for which member attributes should not be checked (useful -# for classes with dynamically set attributes). This supports the use of -# qualified names. -ignored-classes=optparse.Values,thread._local,_thread._local - -# List of members which are set dynamically and missed by pylint inference -# system, and so shouldn't trigger E1101 when accessed. Python regular -# expressions are accepted. -generated-members= - - -[FORMAT] - -# Maximum number of characters on a single line. -max-line-length=80 - -# TODO(https://github.com/PyCQA/pylint/issues/3352): Direct pylint to exempt -# lines made too long by directives to pytype. - -# Regexp for a line that is allowed to be longer than the limit. -ignore-long-lines=(?x)( - ^\s*(\#\ )??$| - ^\s*(from\s+\S+\s+)?import\s+.+$) - -# Allow the body of an if to be on the same line as the test if there is no -# else. -single-line-if-stmt=yes - -# List of optional constructs for which whitespace checking is disabled. `dict- -# separator` is used to allow tabulation in dicts, etc.: {1 : 1,\n222: 2}. -# `trailing-comma` allows a space between comma and closing bracket: (a, ). -# `empty-line` allows space-only lines. -no-space-check= - -# Maximum number of lines in a module -max-module-lines=99999 - -# String used as indentation unit. The internal Google style guide mandates 2 -# spaces. Google's externaly-published style guide says 4, consistent with -# PEP 8. Here, we use 2 spaces, for conformity with many open-sourced Google -# projects (like TensorFlow). -indent-string=' ' - -# Number of spaces of indent required inside a hanging or continued line. -indent-after-paren=4 - -# Expected format of line ending, e.g. empty (any line ending), LF or CRLF. -expected-line-ending-format= - - -[MISCELLANEOUS] - -# List of note tags to take in consideration, separated by a comma. -notes=TODO - - -[STRING] - -# This flag controls whether inconsistent-quotes generates a warning when the -# character used as a quote delimiter is used inconsistently within a module. -check-quote-consistency=yes - - -[VARIABLES] - -# Tells whether we should check for unused import in __init__ files. -init-import=no - -# A regular expression matching the name of dummy variables (i.e. expectedly -# not used). -dummy-variables-rgx=^\*{0,2}(_$|unused_|dummy_) - -# List of additional names supposed to be defined in builtins. Remember that -# you should avoid to define new builtins when possible. -additional-builtins= - -# List of strings which can identify a callback function by name. A callback -# name must start or end with one of those strings. -callbacks=cb_,_cb - -# List of qualified module names which can have objects that can redefine -# builtins. 
-redefining-builtins-modules=six,six.moves,past.builtins,future.builtins,functools - - -[LOGGING] - -# Logging modules to check that the string format arguments are in logging -# function parameter format -logging-modules=logging,absl.logging,tensorflow.io.logging - - -[SIMILARITIES] - -# Minimum lines number of a similarity. -min-similarity-lines=4 - -# Ignore comments when computing similarities. -ignore-comments=yes - -# Ignore docstrings when computing similarities. -ignore-docstrings=yes - -# Ignore imports when computing similarities. -ignore-imports=no - - -[SPELLING] - -# Spelling dictionary name. Available dictionaries: none. To make it working -# install python-enchant package. -spelling-dict= - -# List of comma separated words that should not be checked. -spelling-ignore-words= - -# A path to a file that contains private dictionary; one word per line. -spelling-private-dict-file= - -# Tells whether to store unknown words to indicated private dictionary in -# --spelling-private-dict-file option instead of raising a message. -spelling-store-unknown-words=no - - -[IMPORTS] - -# Deprecated modules which should not be used, separated by a comma -deprecated-modules=regsub, - TERMIOS, - Bastion, - rexec, - sets - -# Create a graph of every (i.e. internal and external) dependencies in the -# given file (report RP0402 must not be disabled) -import-graph= - -# Create a graph of external dependencies in the given file (report RP0402 must -# not be disabled) -ext-import-graph= - -# Create a graph of internal dependencies in the given file (report RP0402 must -# not be disabled) -int-import-graph= - -# Force import order to recognize a module as part of the standard -# compatibility libraries. -known-standard-library= - -# Force import order to recognize a module as part of a third party library. -known-third-party=enchant, absl - -# Analyse import fallback blocks. This can be used to support both Python 2 and -# 3 compatible code, which means that the block might have code that exists -# only in one or another interpreter, leading to false positives when analysed. -analyse-fallback-blocks=no - - -[CLASSES] - -# List of method names used to declare (i.e. assign) instance attributes. -defining-attr-methods=__init__, - __new__, - setUp - -# List of member names, which should be excluded from the protected access -# warning. -exclude-protected=_asdict, - _fields, - _replace, - _source, - _make - -# List of valid names for the first argument in a class method. -valid-classmethod-first-arg=cls, - class_ - -# List of valid names for the first argument in a metaclass class method. -valid-metaclass-classmethod-first-arg=mcs - - -[EXCEPTIONS] - -# Exceptions that will emit a warning when being caught. 
Defaults to -# "Exception" -overgeneral-exceptions=StandardError, - Exception, - BaseException diff --git a/benchmarks/perf-tool/.style.yapf b/benchmarks/perf-tool/.style.yapf deleted file mode 100644 index 39b663a7a..000000000 --- a/benchmarks/perf-tool/.style.yapf +++ /dev/null @@ -1,10 +0,0 @@ -[style] -COLUMN_LIMIT: 80 -DEDENT_CLOSING_BRACKETS: True -INDENT_DICTIONARY_VALUE: True -SPLIT_ALL_COMMA_SEPARATED_VALUES: True -SPLIT_ARGUMENTS_WHEN_COMMA_TERMINATED: True -SPLIT_BEFORE_CLOSING_BRACKET: True -SPLIT_BEFORE_EXPRESSION_AFTER_OPENING_PAREN: True -SPLIT_BEFORE_FIRST_ARGUMENT: True -SPLIT_BEFORE_NAMED_ASSIGNS: True diff --git a/benchmarks/perf-tool/README.md b/benchmarks/perf-tool/README.md deleted file mode 100644 index 36f76bcdb..000000000 --- a/benchmarks/perf-tool/README.md +++ /dev/null @@ -1,449 +0,0 @@ -# IMPORTANT NOTE: No new features will be added to this tool . This tool is currently in maintanence mode. All new features will be added to [vector search workload]( https://github.com/opensearch-project/opensearch-benchmark-workloads/tree/main/vectorsearch) - -# OpenSearch k-NN Benchmarking -- [Welcome!](#welcome) -- [Install Prerequisites](#install-prerequisites) -- [Usage](#usage) -- [Contributing](#contributing) - -## Welcome! - -This directory contains the code related to benchmarking the k-NN plugin. -Benchmarks can be run against any OpenSearch cluster with the k-NN plugin -installed. Benchmarks are highly configurable using the test configuration -file. - -## Install Prerequisites - -### Setup - -K-NN perf requires Python 3.8 or greater to be installed. One of -the easier ways to do this is through Conda, a package and environment -management system for Python. - -First, follow the -[installation instructions](https://docs.conda.io/projects/conda/en/latest/user-guide/install/index.html) -to install Conda on your system. - -Next, create a Python 3.8 environment: -``` -conda create -n knn-perf python=3.8 -``` - -After the environment is created, activate it: -``` -source activate knn-perf -``` - -Lastly, clone the k-NN repo and install all required python packages: -``` -git clone https://github.com/opensearch-project/k-NN.git -cd k-NN/benchmarks/perf-tool -pip install -r requirements.txt -``` - -After all of this completes, you should be ready to run your first performance benchmarks! - - -## Usage - -### Quick Start - -In order to run a benchmark, you must first create a test configuration yml -file. Checkout [this example](https://github.com/opensearch-project/k-NN/blob/main/benchmarks/perf-tool/sample-configs) file -for benchmarking *faiss*'s IVF method. This file contains the definition for -the benchmark that you want to run. At the top are -[test parameters](#test-parameters). These define high level settings of the -test, such as the endpoint of the OpenSearch cluster. - -Next, you define the actions that the test will perform. These actions are -referred to as steps. First, you can define "setup" steps. These are steps that -are run once at the beginning of the execution to configure the cluster how you -want it. These steps do not contribute to the final metrics. - -After that, you define the "steps". These are the steps that the test will be -collecting metrics on. Each step emits certain metrics. These are run -multiple times, depending on the test parameter "num_runs". At the end of the -execution of all of the runs, the metrics from each run are collected and -averaged. - -Lastly, you define the "cleanup" steps. 
The "cleanup" steps are executed after -each test run. For instance, if you are measuring index performance, you may -want to delete the index after each run. - -To run the test, execute the following command: -``` -python knn-perf-tool.py [--log LOGLEVEL] test config-path.yml output.json - ---log log level of tool, options are: info, debug, warning, error, critical -``` - -The output will be a json document containing the results. - -Additionally, you can get the difference between two test runs using the diff -command: -``` -python knn-perf-tool.py [--log LOGLEVEL] diff result1.json result2.json - ---log log level of tool, options are: info, debug, warning, error, critical -``` - -The output will be the delta between the two metrics. - -### Test Parameters - -| Parameter Name | Description | Default | -|----------------|------------------------------------------------------------------------------------|------------| -| endpoint | Endpoint OpenSearch cluster is running on | localhost | -| port | Port on which OpenSearch Cluster is running on | 9200 | -| test_name | Name of test | No default | -| test_id | String ID of test | No default | -| num_runs | Number of runs to execute steps | 1 | -| show_runs | Whether to output each run in addition to the total summary | false | -| setup | List of steps to run once before metric collection starts | [] | -| steps | List of steps that make up one test run. Metrics will be collected on these steps. | No default | -| cleanup | List of steps to run after each test run | [] | - -### Steps - -Included are the list of steps that are currently supported. Each step contains -a set of parameters that are passed in the test configuration file and a set -of metrics that the test produces. - -#### create_index - -Creates an OpenSearch index. - -##### Parameters -| Parameter Name | Description | Default | -| ----------- | ----------- | ----------- | -| index_name | Name of index to create | No default | -| index_spec | Path to index specification | No default | - -##### Metrics - -| Metric Name | Description | Unit | -| ----------- | ----------- | ----------- | -| took | Time to execute step end to end. | ms | - -#### disable_refresh - -Disables refresh for all indices in the cluster. - -##### Parameters - -| Parameter Name | Description | Default | -| ----------- | ----------- | ----------- | - -##### Metrics - -| Metric Name | Description | Unit | -| ----------- | ----------- | ----------- | -| took | Time to execute step end to end. | ms | - -#### refresh_index - -Refreshes an OpenSearch index. - -##### Parameters - -| Parameter Name | Description | Default | -| ----------- | ----------- | ----------- | -| index_name | Name of index to refresh | No default | - -##### Metrics - -| Metric Name | Description | Unit | -| ----------- | ----------- | ----------- | -| took | Time to execute step end to end. | ms | -| store_kb | Size of index after refresh completes | KB | - -#### force_merge - -Force merges an index to a specified number of segments. - -##### Parameters - -| Parameter Name | Description | Default | -| ----------- | ----------- | ----------- | -| index_name | Name of index to force merge | No default | -| max_num_segments | Number of segments to force merge to | No default | - -##### Metrics - -| Metric Name | Description | Unit | -| ----------- | ----------- | ----------- | -| took | Time to execute step end to end. | ms | - -#### train_model - -Trains a model. 
- -##### Parameters - -| Parameter Name | Description | Default | -| ----------- | ----------- | ----------- | -| model_id | Model id to set | Test | -| train_index | Index to pull training data from | No default | -| train_field | Field to pull training data from | No default | -| dimension | Dimension of model | No default | -| description | Description of model | No default | -| max_training_vector_count | Number of training vectors to used | No default | -| method_spec | Path to method specification | No default | - -##### Metrics - -| Metric Name | Description | Unit | -| ----------- | ----------- | ----------- | -| took | Time to execute step end to end | ms | - -#### delete_model - -Deletes a model from the cluster. - -##### Parameters - -| Parameter Name | Description | Default | -| ----------- | ----------- | ----------- | -| model_id | Model id to delete | Test | - -##### Metrics - -| Metric Name | Description | Unit | -| ----------- | ----------- | ----------- | -| took | Time to execute step end to end | ms | - -#### delete_index - -Deletes an index from the cluster. - -##### Parameters - -| Parameter Name | Description | Default | -| ----------- | ----------- | ----------- | -| index_name | Name of index to delete | No default | - -##### Metrics - -| Metric Name | Description | Unit | -| ----------- | ----------- | ----------- | -| took | Time to execute step end to end | ms | - -#### ingest - -Ingests a dataset of vectors into the cluster. - -##### Parameters - -| Parameter Name | Description | Default | -| ----------- | ----------- | ----------- | -| index_name | Name of index to ingest into | No default | -| field_name | Name of field to ingest into | No default | -| bulk_size | Documents per bulk request | 300 | -| dataset_format | Format the data-set is in. Currently hdf5 and bigann is supported. The hdf5 file must be organized in the same way that the ann-benchmarks organizes theirs. | 'hdf5' | -| dataset_path | Path to data-set | No default | -| doc_count | Number of documents to create from data-set | Size of the data-set | - -##### Metrics - -| Metric Name | Description | Unit | -| ----------- | ----------- | ----------- | -| took | Total time to ingest the dataset into the index.| ms | - -#### ingest_multi_field - -Ingests a dataset of multiple context types into the cluster. - -##### Parameters - -| Parameter Name | Description | Default | -| ----------- |-----------------------------------------------------------------------------------------------------------------------------------------------------------| ----------- | -| index_name | Name of index to ingest into | No default | -| field_name | Name of field to ingest into | No default | -| bulk_size | Documents per bulk request | 300 | -| dataset_path | Path to data-set | No default | -| doc_count | Number of documents to create from data-set | Size of the data-set | -| attributes_dataset_name | Name of dataset with additional attributes inside the main dataset | No default | -| attribute_spec | Definition of attributes, format is: [{ name: [name_val], type: [type_val]}] Order is important and must match order of attributes column in dataset file | No default | - -##### Metrics - -| Metric Name | Description | Unit | -| ----------- | ----------- | ----------- | -| took | Total time to ingest the dataset into the index.| ms | - -#### ingest_nested_field - -Ingests a dataset with nested field into the cluster. 
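
Documents are built by grouping consecutive vectors that share the same parent doc id into a single parent document with a nested vector field. A minimal sketch of that grouping is shown below; the field names and payload shape are placeholders meant to illustrate the idea, not the tool's exact bulk output.

```
# Hypothetical illustration: group vectors by parent_id into one bulk action
# per parent document, with the vectors stored under a nested field.
vectors = [
    {"parent_id": 1, "vector": [0.1, 0.2, 0.3]},
    {"parent_id": 1, "vector": [0.4, 0.5, 0.6]},
    {"parent_id": 2, "vector": [0.7, 0.8, 0.9]},
]

grouped = {}
for entry in vectors:
    grouped.setdefault(entry["parent_id"], []).append(
        {"nested_vector_field": entry["vector"]}    # placeholder field name
    )

bulk_body = []
for parent_id, nested_docs in grouped.items():
    bulk_body.append({"index": {"_index": "target-index", "_id": parent_id}})
    bulk_body.append({"nested_field": nested_docs})  # placeholder field name
```
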
- -##### Parameters - -| Parameter Name | Description | Default | -| ----------- |------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------| ----------- | -| index_name | Name of index to ingest into | No default | -| field_name | Name of field to ingest into | No default | -| dataset_path | Path to data-set | No default | -| attributes_dataset_name | Name of dataset with additional attributes inside the main dataset | No default | -| attribute_spec | Definition of attributes, format is: [{ name: [name_val], type: [type_val]}] Order is important and must match order of attributes column in dataset file. It should contains { name: 'parent_id', type: 'int'} | No default | - -##### Metrics - -| Metric Name | Description | Unit | -| ----------- | ----------- | ----------- | -| took | Total time to ingest the dataset into the index.| ms | - -#### query - -Runs a set of queries against an index. - -##### Parameters - -| Parameter Name | Description | Default | -| ----------- | ----------- | ----------- | -| k | Number of neighbors to return on search | 100 | -| r | r value in Recall@R | 1 | -| index_name | Name of index to search | No default | -| field_name | Name field to search | No default | -| calculate_recall | Whether to calculate recall values | False | -| dataset_format | Format the dataset is in. Currently hdf5 and bigann is supported. The hdf5 file must be organized in the same way that the ann-benchmarks organizes theirs. | 'hdf5' | -| dataset_path | Path to dataset | No default | -| neighbors_format | Format the neighbors dataset is in. Currently hdf5 and bigann is supported. The hdf5 file must be organized in the same way that the ann-benchmarks organizes theirs. | 'hdf5' | -| neighbors_path | Path to neighbors dataset | No default | -| query_count | Number of queries to create from data-set | Size of the data-set | - -##### Metrics - -| Metric Name | Description | Unit | -| ----------- |---------------------------------------------------------------------------------------------------------| ----------- | -| took | Took times returned per query aggregated as total, p50, p90, p99, p99.9 and p100 (when applicable) | ms | -| memory_kb | Native memory k-NN is using at the end of the query workload | KB | -| recall@R | ratio of top R results from the ground truth neighbors that are in the K results returned by the plugin | float 0.0-1.0 | -| recall@K | ratio of results returned that were ground truth nearest neighbors | float 0.0-1.0 | - -#### query_with_filter - -Runs a set of queries with filter against an index. - -##### Parameters - -| Parameter Name | Description | Default | -| ----------- |-------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------|----------------------| -| k | Number of neighbors to return on search | 100 | -| r | r value in Recall@R | 1 | -| index_name | Name of index to search | No default | -| field_name | Name field to search | No default | -| calculate_recall | Whether to calculate recall values | False | -| dataset_format | Format the dataset is in. Currently hdf5 and bigann is supported. The hdf5 file must be organized in the same way that the ann-benchmarks organizes theirs. 
| 'hdf5' | -| dataset_path | Path to dataset | No default | -| neighbors_format | Format the neighbors dataset is in. Currently hdf5 and bigann is supported. The hdf5 file must be organized in the same way that the ann-benchmarks organizes theirs. | 'hdf5' | -| neighbors_path | Path to neighbors dataset | No default | -| neighbors_dataset | Name of filter dataset inside the neighbors dataset | No default | -| filter_spec | Path to filter specification | No default | -| filter_type | Type of filter format, we do support following types:
FILTER inner filter format for approximate k-NN search
SCRIPT score scripting with exact k-NN search and pre-filtering
BOOL_POST_FILTER Bool query with post-filtering | SCRIPT | -| score_script_similarity | Similarity function that has been used to index dataset. Used for SCRIPT filter type and ignored for others | l2 | -| query_count | Number of queries to create from data-set | Size of the data-set | - -##### Metrics - -| Metric Name | Description | Unit | -| ----------- | ----------- | ----------- | -| took | Took times returned per query aggregated as total, p50, p90 and p99 (when applicable) | ms | -| memory_kb | Native memory k-NN is using at the end of the query workload | KB | -| recall@R | ratio of top R results from the ground truth neighbors that are in the K results returned by the plugin | float 0.0-1.0 | -| recall@K | ratio of results returned that were ground truth nearest neighbors | float 0.0-1.0 | - - -#### query_nested_field - -Runs a set of queries with nested field against an index. - -##### Parameters - -| Parameter Name | Description | Default | -| ----------- |-------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------|----------------------| -| k | Number of neighbors to return on search | 100 | -| r | r value in Recall@R | 1 | -| index_name | Name of index to search | No default | -| field_name | Name field to search | No default | -| calculate_recall | Whether to calculate recall values | False | -| dataset_format | Format the dataset is in. Currently hdf5 and bigann is supported. The hdf5 file must be organized in the same way that the ann-benchmarks organizes theirs. | 'hdf5' | -| dataset_path | Path to dataset | No default | -| neighbors_format | Format the neighbors dataset is in. Currently hdf5 and bigann is supported. The hdf5 file must be organized in the same way that the ann-benchmarks organizes theirs. | 'hdf5' | -| neighbors_path | Path to neighbors dataset | No default | -| neighbors_dataset | Name of filter dataset inside the neighbors dataset | No default | -| query_count | Number of queries to create from data-set | Size of the data-set | - -##### Metrics - -| Metric Name | Description | Unit | -| ----------- | ----------- | ----------- | -| took | Took times returned per query aggregated as total, p50, p90 and p99 (when applicable) | ms | -| memory_kb | Native memory k-NN is using at the end of the query workload | KB | -| recall@R | ratio of top R results from the ground truth neighbors that are in the K results returned by the plugin | float 0.0-1.0 | -| recall@K | ratio of results returned that were ground truth nearest neighbors | float 0.0-1.0 | - -#### get_stats - -Gets the index stats. 
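
Both counts can be read from the standard index segments API. The snippet below is a rough sketch of that lookup (client setup and index name are placeholders; the tool's own implementation may differ):

```
# Illustrative: sum committed/search segment counts from GET /<index>/_segments.
from opensearchpy import OpenSearch

client = OpenSearch(hosts=[{"host": "localhost", "port": 9200}])
response = client.indices.segments(index="target-index")

num_committed_segments = 0
num_search_segments = 0
for shard_copies in response["indices"]["target-index"]["shards"].values():
    for shard in shard_copies:
        num_committed_segments += shard["num_committed_segments"]
        num_search_segments += shard["num_search_segments"]
```
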
- -##### Parameters - -| Parameter Name | Description | Default | -| ----------- |-------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------|----------------------| -| index_name | Name of index to search | No default | - -##### Metrics - -| Metric Name | Description | Unit | -| ----------- |-------------------------------------------------|------------| -| num_of_committed_segments | Total number of commited segments in the index | integer >= 0 | -| num_of_search_segments | Total number of search segments in the index | integer >= 0 | - -### Data sets - -This benchmark tool uses pre-generated data sets to run indexing and query workload. For some benchmark types existing dataset need to be -extended. Filtering is an example of use case where such dataset extension is needed. - -It's possible to use script provided with this repo to generate dataset and run benchmark for filtering queries. -You need to have existing dataset with vector data. This dataset will be used to generate additional attribute data and set of ground truth neighbours document ids. - -To generate dataset with attributes based on vectors only dataset use following command pattern: - -```commandline -python add-filters-to-dataset.py True False -``` - -To generate neighbours dataset for different filters based on dataset with attributes use following command pattern: - -```commandline -python add-filters-to-dataset.py False True -``` - -After that new dataset(s) can be referred from testcase definition in `ingest_extended` and `query_with_filter` steps. - -To generate dataset with parent doc id based on vectors only dataset, use following command pattern: -```commandline -python add-parent-doc-id-to-dataset.py -``` -This will generate neighbours dataset as well. This new dataset(s) can be referred from testcase definition in `ingest_nested_field` and `query_nested_field` steps. - -## Contributing - -### Linting - -Use pylint to lint the code: -``` -pylint knn-perf-tool.py okpt/**/*.py okpt/**/**/*.py -``` - -### Formatting - -We use yapf and the google style to format our code. After installing yapf, you can format your code by running: - -``` -yapf --style google knn-perf-tool.py okpt/**/*.py okpt/**/**/*.py -``` - -### Updating requirements - -Add new requirements to "requirements.in" and run `pip-compile` diff --git a/benchmarks/perf-tool/add-filters-to-dataset.py b/benchmarks/perf-tool/add-filters-to-dataset.py deleted file mode 100644 index 0624f7323..000000000 --- a/benchmarks/perf-tool/add-filters-to-dataset.py +++ /dev/null @@ -1,200 +0,0 @@ -# SPDX-License-Identifier: Apache-2.0 -# -# The OpenSearch Contributors require contributions made to -# this file be licensed under the Apache-2.0 license or a -# compatible open source license. -""" -Script builds complex dataset with additional attributes from exiting dataset that has only vectors. -Additional attributes are predefined in the script: color, taste, age. Only HDF5 format of vector dataset is supported. - -Output dataset file will have additional dataset 'attributes' with multiple columns, each column corresponds to one attribute -from an attribute set, and value is generated at random, e.g.: - -0: green None 71 -1: green bitter 28 - -there is no explicit index reference in 'attributes' dataset, index of the row corresponds to a document id. 
-For instance, in example above two rows of fields mapped to documents with ids '0' and '1'. - -If 'generate_filters' flag is set script generates additional dataset of neighbours (ground truth) for each filter type. -Output is a new file with several datasets, each dataset corresponds to one filter. Datasets are named 'neighbour_filter_X' -where X is 1 based index of particular filter. -Each dataset has rows with array of integers, where integer corresponds to -a document id from original dataset with additional fields. Array ca have -1 values that are treated as null, this is because -subset of filtered documents is same of smaller than original set. - -For example, dataset file content may look like : - -neighbour_filter_1: [[ 2, 5, -1], - [ 3, 1, -1], - [ 2 5, 7]] -neighbour_filter_2: [[-1, -1, -1], - [ 5, 6, -1], - [ 4, 2, 1]] - -In this case we do have datasets for two filters, 3 query results for each. [2, 5, -1] indicates that for first query -if filter 1 is used most similar document is with id 2, next similar is 5, and the rest do not pass filter 1 criteria. - -Example of script usage: - - create new hdf5 file with attribute dataset - add-filters-to-dataset.py ~/dev/opensearch/k-NN/benchmarks/perf-tool/dataset/data.hdf5 ~/dev/opensearch/datasets/data-with-attr True False - - create new hdf5 file with filter datasets - add-filters-to-dataset.py ~/dev/opensearch/k-NN/benchmarks/perf-tool/dataset/data-with-attr.hdf5 ~/dev/opensearch/datasets/data-with-filters False True -""" - -import getopt -import os -import random -import sys - -import h5py - -from osb.extensions.data_set import HDF5DataSet - - -class _Dataset: - """Type of dataset container for data with additional attributes""" - DEFAULT_TYPE = HDF5DataSet.FORMAT_NAME - - def create_dataset(self, source_dataset_path, out_file_path, generate_attrs: bool, generate_filters: bool) -> None: - path_elements = os.path.split(os.path.abspath(source_dataset_path)) - data_set_dir = path_elements[0] - - # For HDF5, because multiple data sets can be grouped in the same file, - # we will build data sets in memory and not write to disk until - # _flush_data_sets_to_disk is called - # read existing dataset - data_hdf5 = os.path.join(os.path.dirname(os.path.realpath('/')), source_dataset_path) - - with h5py.File(data_hdf5, "r") as hf: - - if generate_attrs: - data_set_w_attr = self.create_dataset_file(out_file_path, self.DEFAULT_TYPE, data_set_dir) - - possible_colors = ['red', 'green', 'yellow', 'blue', None] - possible_tastes = ['sweet', 'salty', 'sour', 'bitter', None] - max_age = 100 - - for key in hf.keys(): - if key not in ['neighbors', 'test', 'train']: - continue - data_set_w_attr.create_dataset(key, data=hf[key][()]) - - attributes = [] - for i in range(len(hf['train'])): - attr = [random.choice(possible_colors), random.choice(possible_tastes), - random.randint(0, max_age + 1)] - attributes.append(attr) - - data_set_w_attr.create_dataset('attributes', (len(attributes), 3), 'S10', data=attributes) - - data_set_w_attr.flush() - data_set_w_attr.close() - - if generate_filters: - attributes = hf['attributes'][()] - expected_neighbors = hf['neighbors'][()] - - data_set_filters = self.create_dataset_file(out_file_path, self.DEFAULT_TYPE, data_set_dir) - - def filter1(attributes, vector_idx): - if attributes[vector_idx][0].decode() == 'red' and int(attributes[vector_idx][2].decode()) >= 20: - return True - else: - return False - - self.apply_filter(expected_neighbors, attributes, data_set_filters, 'neighbors_filter_1', filter1) - - # filter 2 
- color = blue or None and taste = 'salty' - def filter2(attributes, vector_idx): - if (attributes[vector_idx][0].decode() == 'blue' or attributes[vector_idx][ - 0].decode() == 'None') and attributes[vector_idx][1].decode() == 'salty': - return True - else: - return False - - self.apply_filter(expected_neighbors, attributes, data_set_filters, 'neighbors_filter_2', filter2) - - # filter 3 - color and taste are not None and age is between 20 and 80 - def filter3(attributes, vector_idx): - if attributes[vector_idx][0].decode() != 'None' and attributes[vector_idx][ - 1].decode() != 'None' and 20 <= \ - int(attributes[vector_idx][2].decode()) <= 80: - return True - else: - return False - - self.apply_filter(expected_neighbors, attributes, data_set_filters, 'neighbors_filter_3', filter3) - - # filter 4 - color green or blue and taste is bitter and age is between (30, 60) - def filter4(attributes, vector_idx): - if (attributes[vector_idx][0].decode() == 'green' or attributes[vector_idx][0].decode() == 'blue') \ - and (attributes[vector_idx][1].decode() == 'bitter') \ - and 30 <= int(attributes[vector_idx][2].decode()) <= 60: - return True - else: - return False - - self.apply_filter(expected_neighbors, attributes, data_set_filters, 'neighbors_filter_4', filter4) - - # filter 5 color is (green or blue or yellow) or taste = sweet or age is between (30, 70) - def filter5(attributes, vector_idx): - if attributes[vector_idx][0].decode() == 'green' or attributes[vector_idx][0].decode() == 'blue' \ - or attributes[vector_idx][0].decode() == 'yellow' \ - or attributes[vector_idx][1].decode() == 'sweet' \ - or 30 <= int(attributes[vector_idx][2].decode()) <= 70: - return True - else: - return False - - self.apply_filter(expected_neighbors, attributes, data_set_filters, 'neighbors_filter_5', filter5) - - data_set_filters.flush() - data_set_filters.close() - - def apply_filter(self, expected_neighbors, attributes, data_set_w_filtering, filter_name, filter_func): - neighbors_filter = [] - filtered_count = 0 - for expected_neighbors_row in expected_neighbors: - neighbors_filter_row = [-1] * len(expected_neighbors_row) - idx = 0 - for vector_idx in expected_neighbors_row: - if filter_func(attributes, vector_idx): - neighbors_filter_row[idx] = vector_idx - idx += 1 - filtered_count += 1 - neighbors_filter.append(neighbors_filter_row) - overall_count = len(expected_neighbors) * len(expected_neighbors[0]) - perc = float(filtered_count / overall_count) * 100 - print('ground truth size for {} is {}, percentage {}'.format(filter_name, filtered_count, perc)) - data_set_w_filtering.create_dataset(filter_name, data=neighbors_filter) - return expected_neighbors - - def create_dataset_file(self, file_name, extension, data_set_dir) -> h5py.File: - data_set_file_name = "{}.{}".format(file_name, extension) - data_set_path = os.path.join(data_set_dir, data_set_file_name) - - data_set_w_filtering = h5py.File(data_set_path, 'a') - - return data_set_w_filtering - - -def main(argv): - opts, args = getopt.getopt(argv, "") - in_file_path = args[0] - out_file_path = args[1] - generate_attr = str2bool(args[2]) - generate_filters = str2bool(args[3]) - - worker = _Dataset() - worker.create_dataset(in_file_path, out_file_path, generate_attr, generate_filters) - - -def str2bool(v): - return v.lower() in ("yes", "true", "t", "1") - - -if __name__ == "__main__": - main(sys.argv[1:]) diff --git a/benchmarks/perf-tool/add-parent-doc-id-to-dataset.py b/benchmarks/perf-tool/add-parent-doc-id-to-dataset.py deleted file mode 100644 index 
a4acafd03..000000000 --- a/benchmarks/perf-tool/add-parent-doc-id-to-dataset.py +++ /dev/null @@ -1,291 +0,0 @@ -# Copyright OpenSearch Contributors -# SPDX-License-Identifier: Apache-2.0 - -""" -Script builds a complex dataset with additional attributes from an existing dataset that has only vectors. -Additional attributes are predefined in the script: color, taste, age, and parent doc id. Only the HDF5 format of vector dataset is supported. - -The output dataset file will have an additional dataset 'attributes' with multiple columns; each column corresponds to one attribute -from an attribute set, and each value is generated at random, e.g.: - -0: green None 71 1 -1: green bitter 28 1 -2: green bitter 28 1 -3: green bitter 28 2 -... - -There is no explicit index reference in the 'attributes' dataset; the index of the row corresponds to a document id. -For instance, in the example above the first two rows of fields map to documents with ids '0' and '1'. - -The parent doc ids are assigned in non-decreasing order. - -If the 'generate_filters' flag is set, the script generates additional datasets of neighbours (ground truth). -Output is a new file with three datasets, each of which corresponds to a certain type of query. -Dataset 'neighbour_nested' is the ground truth for a query without filtering. -Dataset 'neighbour_relaxed' is the ground truth for a query with filtering of (30 <= age <= 70) or color in ["green", "blue", "yellow"] or taste in ["sweet"] -Dataset 'neighbour_restricted' is the ground truth for a query with filtering of (30 <= age <= 60) and color in ["green", "blue"] and taste in ["bitter"] - - -Each dataset has rows with an array of integers, where each integer corresponds to -a document id from the original dataset with additional fields. - -Example of script usage: - - create new hdf5 file with attribute dataset - add-parent-doc-id-to-dataset.py ~/dev/opensearch/k-NN/benchmarks/perf-tool/dataset/data.hdf5 ~/dev/opensearch/datasets/data-nested.hdf5 - -""" -import getopt -import multiprocessing -import random -import sys -from multiprocessing import Process -from typing import cast -import traceback - -import h5py -import numpy as np - - -class MyVector: - def __init__(self, vector, id, color=None, taste=None, age=None, parent_id=None): - self.vector = vector - self.id = id - self.age = age - self.color = color - self.taste = taste - self.parent_id = parent_id - - def apply_restricted_filter(self): - return (30 <= self.age <= 60) and self.color in ["green", "blue"] and self.taste in ["bitter"] - - def apply_relaxed_filter(self): - return (30 <= self.age <= 70) or self.color in ["green", "blue", "yellow"] or self.taste in ["sweet"] - - def __str__(self): - return f'Vector : {self.vector}, id : {self.id}, color: {self.color}, taste: {self.taste}, age: {self.age}, parent_id: {self.parent_id}\n' - - def __repr__(self): - return f'Vector : {self.vector}, id : {self.id}, color: {self.color}, taste: {self.taste}, age: {self.age}, parent_id: {self.parent_id}\n' - -class HDF5DataSet: - def __init__(self, file_path, key): - self.file_name = file_path - self.file = h5py.File(self.file_name) - self.key = key - self.data = cast(h5py.Dataset, self.file[key]) - self.metadata = None - self.metadata = cast(h5py.Dataset, self.file["attributes"]) if key == "train" else None - print(f'Keys in the file are {self.file.keys()}') - - def read(self, start, end=None): - if end is None: - end = self.data.len() - values = cast(np.ndarray, self.data[start:end]) - metadata = cast(list, self.metadata[start:end]) if self.metadata is not None else None - 
if metadata is not None: - print(metadata) - vectors = [] - i = 0 - for value in values: - if self.metadata is None: - vector = MyVector(value, i) - else: - # color, taste, age, and parent id - vector = MyVector(value, i, str(metadata[i][0].decode()), str(metadata[i][1].decode()), - int(metadata[i][2]), int(metadata[i][3])) - vectors.append(vector) - i = i + 1 - return vectors - - def read_neighbors(self, start, end): - return cast(np.ndarray, self.data[start:end]) - - def size(self): - return self.data.len() - - def close(self): - self.file.close() - -class _Dataset: - def run(self, source_path, target_path) -> None: - # Add attributes - print(f'Adding attributes started.') - with h5py.File(source_path, "r") as in_file: - out_file = h5py.File(target_path, "w") - possible_colors = ['red', 'green', 'yellow', 'blue', None] - possible_tastes = ['sweet', 'salty', 'sour', 'bitter', None] - max_age = 100 - min_field_size = 10 - max_field_size = 10 - - # Copy train and test data - for key in in_file.keys(): - if key not in ['test', 'train']: - continue - out_file.create_dataset(key, data=in_file[key][()]) - - # Generate attributes - attributes = [] - field_size = random.randint(min_field_size, max_field_size) - parent_id = 1 - field_count = 0 - for i in range(len(in_file['train'])): - attr = [random.choice(possible_colors), random.choice(possible_tastes), - random.randint(0, max_age + 1), parent_id] - attributes.append(attr) - field_count += 1 - if field_count >= field_size: - field_size = random.randint(min_field_size, max_field_size) - field_count = 0 - parent_id += 1 - out_file.create_dataset('attributes', (len(attributes), 4), 'S10', data=attributes) - - out_file.flush() - out_file.close() - - print(f'Adding attributes completed.') - - - # Calculate ground truth - print(f'Calculating ground truth started.') - cpus = multiprocessing.cpu_count() - total_clients = min(8, cpus) # 1 # 10 - hdf5Data_train = HDF5DataSet(target_path, "train") - train_vectors = hdf5Data_train.read(0, hdf5Data_train.size()) - hdf5Data_train.close() - print(f'Train vector size: {len(train_vectors)}') - - hdf5Data_test = HDF5DataSet(target_path, "test") - total_queries = hdf5Data_test.size() # 10000 - dis = [] * total_queries - - for i in range(total_queries): - dis.insert(i, []) - - queries_per_client = int(total_queries / total_clients + 0.5) - if queries_per_client == 0: - queries_per_client = total_queries - - processes = [] - test_vectors = hdf5Data_test.read(0, total_queries) - hdf5Data_test.close() - tasks_that_are_done = multiprocessing.Queue() - for client in range(total_clients): - start_index = int(client * queries_per_client) - if start_index + queries_per_client <= total_queries: - end_index = int(start_index + queries_per_client) - else: - end_index = total_queries - - print(f'Start Index: {start_index}, end Index: {end_index}') - print(f'client is : {client}') - p = Process(target=queryTask, args=( - train_vectors, test_vectors, start_index, end_index, client, total_queries, tasks_that_are_done)) - processes.append(p) - p.start() - if end_index >= total_queries: - print(f'Exiting end Index : {end_index} total_queries: {total_queries}') - break - - # wait for tasks to be completed - print('Waiting for all tasks to be completed') - j = 0 - # This is required because threads can hang if the data sent from the sub process increases by a certain limit - # https://stackoverflow.com/questions/21641887/python-multiprocessing-process-hangs-on-join-for-large-queue - while j < total_queries: - while not 
tasks_that_are_done.empty(): - calculatedDis = tasks_that_are_done.get() - i = 0 - for d in calculatedDis: - if d: - dis[i] = d - j = j + 1 - i = i + 1 - - for p in processes: - if p.is_alive(): - p.join() - else: - print("Process was not alive hence shutting down") - - data_set_file = h5py.File(target_path, "a") - for type in ['nested', 'relaxed', 'restricted']: - results = [] - for d in dis: - r = [] - for i in range(min(10000, len(d[type]))): - r.append(d[type][i]['id']) - results.append(r) - - - data_set_file.create_dataset("neighbour_" + type, (len(results), len(results[0])), data=results) - data_set_file.flush() - data_set_file.close() - -def calculateL2Distance(point1, point2): - return np.linalg.norm(point1 - point2) - - -def queryTask(train_vectors, test_vectors, startIndex, endIndex, process_number, total_queries, tasks_that_are_done): - print(f'Starting Process number : {process_number}') - all_distances = [] * total_queries - for i in range(total_queries): - all_distances.insert(i, {}) - try: - test_vectors = test_vectors[startIndex:endIndex] - i = startIndex - for test in test_vectors: - distances = [] - values = {} - for value in train_vectors: - values[value.id] = value - distances.append({ - "dis": calculateL2Distance(test.vector, value.vector), - "id": value.parent_id - }) - - distances.sort(key=lambda vector: vector['dis']) - seen_set_nested = set() - seen_set_restricted = set() - seen_set_relaxed = set() - nested = [] - restricted = [] - relaxed = [] - for sub_i in range(len(distances)): - id = distances[sub_i]['id'] - # Check if the number has been seen before - if len(nested) < 1000 and id not in seen_set_nested: - # If not seen before, mark it as seen - seen_set_nested.add(id) - nested.append(distances[sub_i]) - if len(restricted) < 1000 and id not in seen_set_restricted and values[id].apply_restricted_filter(): - seen_set_restricted.add(id) - restricted.append(distances[sub_i]) - if len(relaxed) < 1000 and id not in seen_set_relaxed and values[id].apply_relaxed_filter(): - seen_set_relaxed.add(id) - relaxed.append(distances[sub_i]) - - all_distances[i]['nested'] = nested - all_distances[i]['restricted'] = restricted - all_distances[i]['relaxed'] = relaxed - print(f"Process {process_number} queries completed: {i + 1 - startIndex}, queries left: {endIndex - i - 1}") - i = i + 1 - except: - print( - f"Got exception while running the thread: {process_number} with startIndex: {startIndex} endIndex: {endIndex} ") - traceback.print_exc() - tasks_that_are_done.put(all_distances) - print(f'Exiting Process number : {process_number}') - - -def main(argv): - opts, args = getopt.getopt(argv, "") - in_file_path = args[0] - out_file_path = args[1] - - worker = _Dataset() - worker.run(in_file_path, out_file_path) - -if __name__ == "__main__": - main(sys.argv[1:]) \ No newline at end of file diff --git a/benchmarks/perf-tool/dataset/data-nested.hdf5 b/benchmarks/perf-tool/dataset/data-nested.hdf5 deleted file mode 100644 index 4223d7281..000000000 Binary files a/benchmarks/perf-tool/dataset/data-nested.hdf5 and /dev/null differ diff --git a/benchmarks/perf-tool/dataset/data-with-attr-with-filters.hdf5 b/benchmarks/perf-tool/dataset/data-with-attr-with-filters.hdf5 deleted file mode 100644 index 01df75f83..000000000 Binary files a/benchmarks/perf-tool/dataset/data-with-attr-with-filters.hdf5 and /dev/null differ diff --git a/benchmarks/perf-tool/dataset/data-with-attr.hdf5 b/benchmarks/perf-tool/dataset/data-with-attr.hdf5 deleted file mode 100644 index 22873b06c..000000000 Binary 
files a/benchmarks/perf-tool/dataset/data-with-attr.hdf5 and /dev/null differ diff --git a/benchmarks/perf-tool/dataset/data.hdf5 b/benchmarks/perf-tool/dataset/data.hdf5 deleted file mode 100644 index c9268606d..000000000 Binary files a/benchmarks/perf-tool/dataset/data.hdf5 and /dev/null differ diff --git a/benchmarks/perf-tool/knn-perf-tool.py b/benchmarks/perf-tool/knn-perf-tool.py deleted file mode 100644 index 48eedc427..000000000 --- a/benchmarks/perf-tool/knn-perf-tool.py +++ /dev/null @@ -1,10 +0,0 @@ -# SPDX-License-Identifier: Apache-2.0 -# -# The OpenSearch Contributors require contributions made to -# this file be licensed under the Apache-2.0 license or a -# compatible open source license. -"""Script for user to run the testing tool.""" - -import okpt.main - -okpt.main.main() diff --git a/benchmarks/perf-tool/okpt/__init__.py b/benchmarks/perf-tool/okpt/__init__.py deleted file mode 100644 index c3bffc54c..000000000 --- a/benchmarks/perf-tool/okpt/__init__.py +++ /dev/null @@ -1,6 +0,0 @@ -# SPDX-License-Identifier: Apache-2.0 -# -# The OpenSearch Contributors require contributions made to -# this file be licensed under the Apache-2.0 license or a -# compatible open source license. - diff --git a/benchmarks/perf-tool/okpt/diff/diff.py b/benchmarks/perf-tool/okpt/diff/diff.py deleted file mode 100644 index 23f424ab9..000000000 --- a/benchmarks/perf-tool/okpt/diff/diff.py +++ /dev/null @@ -1,142 +0,0 @@ -# SPDX-License-Identifier: Apache-2.0 -# -# The OpenSearch Contributors require contributions made to -# this file be licensed under the Apache-2.0 license or a -# compatible open source license. - -"""Provides the Diff class.""" - -from enum import Enum -from typing import Any, Dict, Tuple - - -class InvalidTestResultsError(Exception): - """Exception raised when the test results are invalid. - - The results can be invalid if they have different fields, non-numeric - values, or if they don't follow the standard result format. - """ - def __init__(self, msg: str): - self.message = msg - super().__init__(self.message) - - -def _is_numeric(a) -> bool: - return isinstance(a, (int, float)) - - -class TestResultFields(str, Enum): - METADATA = 'metadata' - RESULTS = 'results' - TEST_PARAMETERS = 'test_parameters' - - -class TestResultNames(str, Enum): - BASE = 'base_result' - CHANGED = 'changed_result' - - -class Diff: - """Diff class for validating and diffing two test result files. - - Methods: - diff: Returns the diff between two test results. (changed - base) - """ - def __init__( - self, - base_result: Dict[str, - Any], - changed_result: Dict[str, - Any], - metadata: bool - ): - """Initializes test results and validate them.""" - self.base_result = base_result - self.changed_result = changed_result - self.metadata = metadata - - # make sure results have proper test result fields - is_valid, key, result = self._validate_keys() - if not is_valid: - raise InvalidTestResultsError( - f'{result} has a missing or invalid key `{key}`.' - ) - - self.base_results = self.base_result[TestResultFields.RESULTS] - self.changed_results = self.changed_result[TestResultFields.RESULTS] - - # make sure results have the same fields - is_valid, key, result = self._validate_structure() - if not is_valid: - raise InvalidTestResultsError( - f'key `{key}` is not present in {result}.' - ) - - # make sure results have numeric values - is_valid, key, result = self._validate_types() - if not is_valid: - raise InvalidTestResultsError( - f'key `{key}` in {result} points to a non-numeric value.' 
- ) - - def _validate_keys(self) -> Tuple[bool, str, str]: - """Ensure both test results have `metadata` and `results` keys.""" - check_keydict = lambda key, res: key in res and isinstance( - res[key], dict) - - # check if results have a `metadata` field and if `metadata` is a dict - if self.metadata: - if not check_keydict(TestResultFields.METADATA, self.base_result): - return (False, TestResultFields.METADATA, TestResultNames.BASE) - if not check_keydict(TestResultFields.METADATA, - self.changed_result): - return ( - False, - TestResultFields.METADATA, - TestResultNames.CHANGED - ) - # check if results have a `results` field and `results` is a dict - if not check_keydict(TestResultFields.RESULTS, self.base_result): - return (False, TestResultFields.RESULTS, TestResultNames.BASE) - if not check_keydict(TestResultFields.RESULTS, self.changed_result): - return (False, TestResultFields.RESULTS, TestResultNames.CHANGED) - return (True, '', '') - - def _validate_structure(self) -> Tuple[bool, str, str]: - """Ensure both test results have the same keys.""" - for k in self.base_results: - if not k in self.changed_results: - return (False, k, TestResultNames.CHANGED) - for k in self.changed_results: - if not k in self.base_results: - return (False, k, TestResultNames.BASE) - return (True, '', '') - - def _validate_types(self) -> Tuple[bool, str, str]: - """Ensure both test results have numeric values.""" - for k, v in self.base_results.items(): - if not _is_numeric(v): - return (False, k, TestResultNames.BASE) - for k, v in self.changed_results.items(): - if not _is_numeric(v): - return (False, k, TestResultNames.BASE) - return (True, '', '') - - def diff(self) -> Dict[str, Any]: - """Return the diff between the two test results. (changed - base)""" - results_diff = { - key: self.changed_results[key] - self.base_results[key] - for key in self.base_results - } - - # add metadata if specified - if self.metadata: - return { - f'{TestResultNames.BASE}_{TestResultFields.METADATA}': - self.base_result[TestResultFields.METADATA], - f'{TestResultNames.CHANGED}_{TestResultFields.METADATA}': - self.changed_result[TestResultFields.METADATA], - 'diff': - results_diff - } - return results_diff diff --git a/benchmarks/perf-tool/okpt/io/args.py b/benchmarks/perf-tool/okpt/io/args.py deleted file mode 100644 index f8c5d8809..000000000 --- a/benchmarks/perf-tool/okpt/io/args.py +++ /dev/null @@ -1,178 +0,0 @@ -# SPDX-License-Identifier: Apache-2.0 -# -# The OpenSearch Contributors require contributions made to -# this file be licensed under the Apache-2.0 license or a -# compatible open source license. - -"""Parses and defines command line arguments for the program. - -Defines the subcommands `test` and `diff` and the corresponding -files that are required by each command. - -Functions: - define_args(): Define the command line arguments. - get_args(): Returns a dictionary of the command line args. 
-""" - -import argparse -import sys -from dataclasses import dataclass -from io import TextIOWrapper -from typing import Union - -_read_type = argparse.FileType('r') -_write_type = argparse.FileType('w') - - -def _add_config(parser, name, **kwargs): - """"Add configuration file path argument.""" - opts = { - 'type': _read_type, - 'help': 'Path of configuration file.', - 'metavar': 'config_path', - **kwargs, - } - parser.add_argument(name, **opts) - - -def _add_result(parser, name, **kwargs): - """"Add results files paths argument.""" - opts = { - 'type': _read_type, - 'help': 'Path of one result file.', - 'metavar': 'result_path', - **kwargs, - } - parser.add_argument(name, **opts) - - -def _add_results(parser, name, **kwargs): - """"Add results files paths argument.""" - opts = { - 'nargs': '+', - 'type': _read_type, - 'help': 'Paths of result files.', - 'metavar': 'result_paths', - **kwargs, - } - parser.add_argument(name, **opts) - - -def _add_output(parser, name, **kwargs): - """"Add output file path argument.""" - opts = { - 'type': _write_type, - 'help': 'Path of output file.', - 'metavar': 'output_path', - **kwargs, - } - parser.add_argument(name, **opts) - - -def _add_metadata(parser, name, **kwargs): - opts = { - 'action': 'store_true', - **kwargs, - } - parser.add_argument(name, **opts) - - -def _add_test_cmd(subparsers): - test_parser = subparsers.add_parser('test') - _add_config(test_parser, 'config') - _add_output(test_parser, 'output') - - -def _add_diff_cmd(subparsers): - diff_parser = subparsers.add_parser('diff') - _add_metadata(diff_parser, '--metadata') - _add_result( - diff_parser, - 'base_result', - help='Base test result.', - metavar='base_result' - ) - _add_result( - diff_parser, - 'changed_result', - help='Changed test result.', - metavar='changed_result' - ) - _add_output(diff_parser, '--output', default=sys.stdout) - - -@dataclass -class TestArgs: - log: str - command: str - config: TextIOWrapper - output: TextIOWrapper - - -@dataclass -class DiffArgs: - log: str - command: str - metadata: bool - base_result: TextIOWrapper - changed_result: TextIOWrapper - output: TextIOWrapper - - -def get_args() -> Union[TestArgs, DiffArgs]: - """Define, parse and return command line args. - - Returns: - A dict containing the command line args. - """ - parser = argparse.ArgumentParser( - description= - 'Run performance tests against the OpenSearch plugin and various ANN ' - 'libaries.' - ) - - def define_args(): - """Define tool commands.""" - - # add log level arg - parser.add_argument( - '--log', - default='info', - type=str, - choices=['debug', - 'info', - 'warning', - 'error', - 'critical'], - help='Log level of the tool.' 
- ) - - subparsers = parser.add_subparsers( - title='commands', - dest='command', - help='sub-command help' - ) - subparsers.required = True - - # add subcommands - _add_test_cmd(subparsers) - _add_diff_cmd(subparsers) - - define_args() - args = parser.parse_args() - if args.command == 'test': - return TestArgs( - log=args.log, - command=args.command, - config=args.config, - output=args.output - ) - else: - return DiffArgs( - log=args.log, - command=args.command, - metadata=args.metadata, - base_result=args.base_result, - changed_result=args.changed_result, - output=args.output - ) diff --git a/benchmarks/perf-tool/okpt/io/config/parsers/base.py b/benchmarks/perf-tool/okpt/io/config/parsers/base.py deleted file mode 100644 index 795aab1b2..000000000 --- a/benchmarks/perf-tool/okpt/io/config/parsers/base.py +++ /dev/null @@ -1,67 +0,0 @@ -# SPDX-License-Identifier: Apache-2.0 -# -# The OpenSearch Contributors require contributions made to -# this file be licensed under the Apache-2.0 license or a -# compatible open source license. - -"""Base Parser class. - -Classes: - BaseParser: Base class for config parsers. - -Exceptions: - ConfigurationError: An error in the configuration syntax. -""" - -import os -from io import TextIOWrapper - -import cerberus - -from okpt.io.utils import reader - - -class ConfigurationError(Exception): - """Exception raised for errors in the tool configuration. - - Attributes: - message -- explanation of the error - """ - - def __init__(self, message: str): - self.message = f'{message}' - super().__init__(self.message) - - -def _get_validator_from_schema_name(schema_name: str): - """Get the corresponding Cerberus validator from a schema name.""" - curr_file_dir = os.path.dirname(os.path.abspath(__file__)) - schemas_dir = os.path.join(os.path.dirname(curr_file_dir), 'schemas') - schema_file_path = os.path.join(schemas_dir, f'{schema_name}.yml') - schema_obj = reader.parse_yaml_from_path(schema_file_path) - return cerberus.Validator(schema_obj) - - -class BaseParser: - """Base class for config parsers. - - Attributes: - validator: Cerberus validator for a particular schema - errors: Cerberus validation errors (if any are found during validation) - - Methods: - parse: Parse config. - """ - - def __init__(self, schema_name: str): - self.validator = _get_validator_from_schema_name(schema_name) - self.errors = '' - - def parse(self, file_obj: TextIOWrapper): - """Convert file object to dict, while validating against config schema.""" - config_obj = reader.parse_yaml(file_obj) - is_config_valid = self.validator.validate(config_obj) - if not is_config_valid: - raise ConfigurationError(self.validator.errors) - - return self.validator.document diff --git a/benchmarks/perf-tool/okpt/io/config/parsers/test.py b/benchmarks/perf-tool/okpt/io/config/parsers/test.py deleted file mode 100644 index c47e30ecc..000000000 --- a/benchmarks/perf-tool/okpt/io/config/parsers/test.py +++ /dev/null @@ -1,81 +0,0 @@ -# SPDX-License-Identifier: Apache-2.0 -# -# The OpenSearch Contributors require contributions made to -# this file be licensed under the Apache-2.0 license or a -# compatible open source license. - -"""Provides ToolParser. - -Classes: - ToolParser: Tool config parser. 
-""" -from dataclasses import dataclass -from io import TextIOWrapper -from typing import List - -from okpt.io.config.parsers import base -from okpt.test.steps.base import Step, StepConfig -from okpt.test.steps.factory import create_step - - -@dataclass -class TestConfig: - test_name: str - test_id: str - endpoint: str - port: int - timeout: int - num_runs: int - show_runs: bool - setup: List[Step] - steps: List[Step] - cleanup: List[Step] - - -class TestParser(base.BaseParser): - """Parser for Test config. - - Methods: - parse: Parse and validate the Test config. - """ - - def __init__(self): - super().__init__('test') - - def parse(self, file_obj: TextIOWrapper) -> TestConfig: - """See base class.""" - config_obj = super().parse(file_obj) - - implicit_step_config = dict() - if 'endpoint' in config_obj: - implicit_step_config['endpoint'] = config_obj['endpoint'] - - if 'port' in config_obj: - implicit_step_config['port'] = config_obj['port'] - - # Each step should have its own parse - take the config object and check if its valid - setup = [] - if 'setup' in config_obj: - setup = [create_step(StepConfig(step["name"], step, implicit_step_config)) for step in config_obj['setup']] - - steps = [create_step(StepConfig(step["name"], step, implicit_step_config)) for step in config_obj['steps']] - - cleanup = [] - if 'cleanup' in config_obj: - cleanup = [create_step(StepConfig(step["name"], step, implicit_step_config)) for step - in config_obj['cleanup']] - - test_config = TestConfig( - endpoint=config_obj['endpoint'], - port=config_obj['port'], - timeout=config_obj['timeout'], - test_name=config_obj['test_name'], - test_id=config_obj['test_id'], - num_runs=config_obj['num_runs'], - show_runs=config_obj['show_runs'], - setup=setup, - steps=steps, - cleanup=cleanup - ) - - return test_config diff --git a/benchmarks/perf-tool/okpt/io/config/parsers/util.py b/benchmarks/perf-tool/okpt/io/config/parsers/util.py deleted file mode 100644 index 454fec5a0..000000000 --- a/benchmarks/perf-tool/okpt/io/config/parsers/util.py +++ /dev/null @@ -1,116 +0,0 @@ -# SPDX-License-Identifier: Apache-2.0 -# -# The OpenSearch Contributors require contributions made to -# this file be licensed under the Apache-2.0 license or a -# compatible open source license. 
- -"""Utility functions for parsing""" - - -from okpt.io.config.parsers.base import ConfigurationError -from okpt.io.dataset import HDF5DataSet, BigANNNeighborDataSet, \ - BigANNVectorDataSet, DataSet, Context - - -def parse_dataset(dataset_format: str, dataset_path: str, - context: Context, custom_context=None) -> DataSet: - if dataset_format == 'hdf5': - return HDF5DataSet(dataset_path, context, custom_context) - - if dataset_format == 'bigann' and context == Context.NEIGHBORS: - return BigANNNeighborDataSet(dataset_path) - - if dataset_format == 'bigann': - return BigANNVectorDataSet(dataset_path) - - raise Exception("Unsupported data-set format") - - -def parse_string_param(key: str, first_map, second_map, default) -> str: - value = first_map.get(key) - if value is not None: - if type(value) is str: - return value - raise ConfigurationError("Invalid type for {}".format(key)) - - value = second_map.get(key) - if value is not None: - if type(value) is str: - return value - raise ConfigurationError("Invalid type for {}".format(key)) - - if default is None: - raise ConfigurationError("{} must be set".format(key)) - return default - - -def parse_int_param(key: str, first_map, second_map, default) -> int: - value = first_map.get(key) - if value is not None: - if type(value) is int: - return value - raise ConfigurationError("Invalid type for {}".format(key)) - - value = second_map.get(key) - if value is not None: - if type(value) is int: - return value - raise ConfigurationError("Invalid type for {}".format(key)) - - if default is None: - raise ConfigurationError("{} must be set".format(key)) - return default - - -def parse_bool_param(key: str, first_map, second_map, default) -> bool: - value = first_map.get(key) - if value is not None: - if type(value) is bool: - return value - raise ConfigurationError("Invalid type for {}".format(key)) - - value = second_map.get(key) - if value is not None: - if type(value) is bool: - return value - raise ConfigurationError("Invalid type for {}".format(key)) - - if default is None: - raise ConfigurationError("{} must be set".format(key)) - return default - - -def parse_dict_param(key: str, first_map, second_map, default) -> dict: - value = first_map.get(key) - if value is not None: - if type(value) is dict: - return value - raise ConfigurationError("Invalid type for {}".format(key)) - - value = second_map.get(key) - if value is not None: - if type(value) is dict: - return value - raise ConfigurationError("Invalid type for {}".format(key)) - - if default is None: - raise ConfigurationError("{} must be set".format(key)) - return default - - -def parse_list_param(key: str, first_map, second_map, default) -> list: - value = first_map.get(key) - if value is not None: - if type(value) is list: - return value - raise ConfigurationError("Invalid type for {}".format(key)) - - value = second_map.get(key) - if value is not None: - if type(value) is list: - return value - raise ConfigurationError("Invalid type for {}".format(key)) - - if default is None: - raise ConfigurationError("{} must be set".format(key)) - return default diff --git a/benchmarks/perf-tool/okpt/io/config/schemas/test.yml b/benchmarks/perf-tool/okpt/io/config/schemas/test.yml deleted file mode 100644 index 4d5c21a15..000000000 --- a/benchmarks/perf-tool/okpt/io/config/schemas/test.yml +++ /dev/null @@ -1,35 +0,0 @@ -# SPDX-License-Identifier: Apache-2.0 -# -# The OpenSearch Contributors require contributions made to -# this file be licensed under the Apache-2.0 license or a -# compatible open source 
license. - -# defined using the cerberus validation API -# https://docs.python-cerberus.org/en/stable/index.html -endpoint: - type: string - default: "localhost" -port: - type: integer - default: 9200 -timeout: - type: integer - default: 60 -test_name: - type: string -test_id: - type: string -num_runs: - type: integer - default: 1 - min: 1 - max: 10000 -show_runs: - type: boolean - default: false -setup: - type: list -steps: - type: list -cleanup: - type: list diff --git a/benchmarks/perf-tool/okpt/io/dataset.py b/benchmarks/perf-tool/okpt/io/dataset.py deleted file mode 100644 index 001563bab..000000000 --- a/benchmarks/perf-tool/okpt/io/dataset.py +++ /dev/null @@ -1,222 +0,0 @@ -# SPDX-License-Identifier: Apache-2.0 -# -# The OpenSearch Contributors require contributions made to -# this file be licensed under the Apache-2.0 license or a -# compatible open source license. - -"""Defines DataSet interface and implements particular formats - -A DataSet is the basic functionality that it can be read in chunks, or -read completely and reset to the start. - -Currently, we support HDF5 formats from ann-benchmarks and big-ann-benchmarks -datasets. - -Classes: - HDF5DataSet: Format used in ann-benchmarks - BigANNNeighborDataSet: Neighbor format for big-ann-benchmarks - BigANNVectorDataSet: Vector format for big-ann-benchmarks -""" -import os -from abc import ABC, ABCMeta, abstractmethod -from enum import Enum -from typing import cast -import h5py -import numpy as np - -import struct - - -class Context(Enum): - """DataSet context enum. Can be used to add additional context for how a - data-set should be interpreted. - """ - INDEX = 1 - QUERY = 2 - NEIGHBORS = 3 - CUSTOM = 4 - - -class DataSet(ABC): - """DataSet interface. Used for reading data-sets from files. 
- - Methods: - read: Read a chunk of data from the data-set - size: Gets the number of items in the data-set - reset: Resets internal state of data-set to beginning - """ - __metaclass__ = ABCMeta - - @abstractmethod - def read(self, chunk_size: int): - pass - - @abstractmethod - def size(self): - pass - - @abstractmethod - def reset(self): - pass - - -class HDF5DataSet(DataSet): - """ Data-set format corresponding to `ANN Benchmarks - `_ - """ - - def __init__(self, dataset_path: str, context: Context, custom_context=None): - file = h5py.File(dataset_path) - self.data = cast(h5py.Dataset, file[self._parse_context(context, custom_context)]) - self.current = 0 - - def read(self, chunk_size: int): - if self.current >= self.size(): - return None - - end_i = self.current + chunk_size - if end_i > self.size(): - end_i = self.size() - - v = cast(np.ndarray, self.data[self.current:end_i]) - self.current = end_i - return v - - def size(self): - return self.data.len() - - def reset(self): - self.current = 0 - - @staticmethod - def _parse_context(context: Context, custom_context=None) -> str: - if context == Context.NEIGHBORS: - return "neighbors" - - if context == Context.INDEX: - return "train" - - if context == Context.QUERY: - return "test" - - if context == Context.CUSTOM: - return custom_context - - raise Exception("Unsupported context") - - -class BigANNNeighborDataSet(DataSet): - """ Data-set format for neighbor data-sets for `Big ANN Benchmarks - `_""" - - def __init__(self, dataset_path: str): - self.file = open(dataset_path, 'rb') - self.file.seek(0, os.SEEK_END) - num_bytes = self.file.tell() - self.file.seek(0) - - if num_bytes < 8: - raise Exception("File is invalid") - - self.num_queries = int.from_bytes(self.file.read(4), "little") - self.k = int.from_bytes(self.file.read(4), "little") - - # According to the website, the number of bytes that will follow will - # be: num_queries X K x sizeof(uint32_t) bytes + num_queries X K x - # sizeof(float) - if (num_bytes - 8) != 2 * (self.num_queries * self.k * 4): - raise Exception("File is invalid") - - self.current = 0 - - def read(self, chunk_size: int): - if self.current >= self.size(): - return None - - end_i = self.current + chunk_size - if end_i > self.size(): - end_i = self.size() - - v = [[int.from_bytes(self.file.read(4), "little") for _ in - range(self.k)] for _ in range(end_i - self.current)] - - self.current = end_i - return v - - def size(self): - return self.num_queries - - def reset(self): - self.file.seek(8) - self.current = 0 - - -class BigANNVectorDataSet(DataSet): - """ Data-set format for vector data-sets for `Big ANN Benchmarks - `_ - """ - - def __init__(self, dataset_path: str): - self.file = open(dataset_path, 'rb') - self.file.seek(0, os.SEEK_END) - num_bytes = self.file.tell() - self.file.seek(0) - - if num_bytes < 8: - raise Exception("File is invalid") - - self.num_points = int.from_bytes(self.file.read(4), "little") - self.dimension = int.from_bytes(self.file.read(4), "little") - bytes_per_num = self._get_data_size(dataset_path) - - if (num_bytes - 8) != self.num_points * self.dimension * bytes_per_num: - raise Exception("File is invalid") - - self.reader = self._value_reader(dataset_path) - self.current = 0 - - def read(self, chunk_size: int): - if self.current >= self.size(): - return None - - end_i = self.current + chunk_size - if end_i > self.size(): - end_i = self.size() - - v = np.asarray([self._read_vector() for _ in - range(end_i - self.current)]) - self.current = end_i - return v - - def 
_read_vector(self): - return np.asarray([self.reader(self.file) for _ in - range(self.dimension)]) - - def size(self): - return self.num_points - - def reset(self): - self.file.seek(8) # Seek to 8 bytes to skip re-reading metadata - self.current = 0 - - @staticmethod - def _get_data_size(file_name): - ext = file_name.split('.')[-1] - if ext == "u8bin": - return 1 - - if ext == "fbin": - return 4 - - raise Exception("Unknown extension") - - @staticmethod - def _value_reader(file_name): - ext = file_name.split('.')[-1] - if ext == "u8bin": - return lambda file: float(int.from_bytes(file.read(1), "little")) - - if ext == "fbin": - return lambda file: struct.unpack(' TextIOWrapper: - """Given a file path, get a readable file object. - - Args: - file path - - Returns: - Writeable file object - """ - return open(path, 'r', encoding='UTF-8') - - -def parse_yaml(file: TextIOWrapper) -> Dict[str, Any]: - """Parses YAML file from file object. - - Args: - file: file object to parse - - Returns: - A dict representing the YAML file. - """ - return yaml.load(file, Loader=yaml.SafeLoader) - - -def parse_yaml_from_path(path: str) -> Dict[str, Any]: - """Parses YAML file from file path. - - Args: - path: file path to parse - - Returns: - A dict representing the YAML file. - """ - file = reader.get_file_obj(path) - return parse_yaml(file) - - -def parse_json(file: TextIOWrapper) -> Dict[str, Any]: - """Parses JSON file from file object. - - Args: - file: file object to parse - - Returns: - A dict representing the JSON file. - """ - return json.load(file) - - -def parse_json_from_path(path: str) -> Dict[str, Any]: - """Parses JSON file from file path. - - Args: - path: file path to parse - - Returns: - A dict representing the JSON file. - """ - file = reader.get_file_obj(path) - return json.load(file) diff --git a/benchmarks/perf-tool/okpt/io/utils/writer.py b/benchmarks/perf-tool/okpt/io/utils/writer.py deleted file mode 100644 index 1f14bfd94..000000000 --- a/benchmarks/perf-tool/okpt/io/utils/writer.py +++ /dev/null @@ -1,40 +0,0 @@ -# SPDX-License-Identifier: Apache-2.0 -# -# The OpenSearch Contributors require contributions made to -# this file be licensed under the Apache-2.0 license or a -# compatible open source license. -"""Provides functions for writing to file. - -Functions: - get_file_obj(): Get a writeable file object. - write_json(): Writes a python dictionary to a JSON file -""" - -import json -from io import TextIOWrapper -from typing import Any, Dict, TextIO, Union - - -def get_file_obj(path: str) -> TextIOWrapper: - """Get a writeable file object from a file path. - - Args: - file path - - Returns: - Writeable file object - """ - return open(path, 'w', encoding='UTF-8') - - -def write_json(data: Dict[str, Any], - file: Union[TextIOWrapper, TextIO], - pretty=False): - """Writes a dictionary to a JSON file. - - Args: - data: A dict to write to JSON. - file: Path of output file. - """ - indent = 2 if pretty else 0 - json.dump(data, file, indent=indent) diff --git a/benchmarks/perf-tool/okpt/main.py b/benchmarks/perf-tool/okpt/main.py deleted file mode 100644 index 3e6e022d4..000000000 --- a/benchmarks/perf-tool/okpt/main.py +++ /dev/null @@ -1,55 +0,0 @@ -# SPDX-License-Identifier: Apache-2.0 -# -# The OpenSearch Contributors require contributions made to -# this file be licensed under the Apache-2.0 license or a -# compatible open source license. 
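As a concrete illustration of the Big ANN vector layout handled by `BigANNVectorDataSet` above, the following standalone sketch (the file path is hypothetical; it assumes the same uint32 header followed by float32 values for the `fbin` extension) reads the header, validates the file size, and decodes one vector:

```python
import os
import struct

# Hypothetical path to a big-ann-benchmarks float32 vector file.
path = "data/base.fbin"

with open(path, "rb") as f:
    # First 8 bytes: number of points and dimension, little-endian uint32 each.
    num_points = int.from_bytes(f.read(4), "little")
    dimension = int.from_bytes(f.read(4), "little")
    bytes_per_num = 4  # 'fbin' stores float32; 'u8bin' would store 1 byte per value

    # Same sanity check as BigANNVectorDataSet: header + packed values.
    expected = 8 + num_points * dimension * bytes_per_num
    actual = os.fstat(f.fileno()).st_size
    if actual != expected:
        raise ValueError(f"unexpected file size: {actual} != {expected}")

    # Read the first vector one float32 at a time, as the _value_reader lambda does.
    first_vector = [struct.unpack('<f', f.read(4))[0] for _ in range(dimension)]
    print(num_points, dimension, first_vector[:4])
```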
- -""" Runner script that serves as the main controller of the testing tool.""" - -import logging -import sys -from typing import cast - -from okpt.diff import diff -from okpt.io import args -from okpt.io.config.parsers import test -from okpt.io.utils import reader, writer -from okpt.test import runner - - -def main(): - """Main function of entry module.""" - cli_args = args.get_args() - output = cli_args.output - if cli_args.log: - log_level = getattr(logging, cli_args.log.upper()) - logging.basicConfig(level=log_level) - - if cli_args.command == 'test': - cli_args = cast(args.TestArgs, cli_args) - - # parse config - parser = test.TestParser() - test_config = parser.parse(cli_args.config) - logging.info('Configs are valid.') - - # run tests - test_runner = runner.TestRunner(test_config=test_config) - test_result = test_runner.execute() - - # write test results - logging.debug( - f'Test Result:\n {writer.write_json(test_result, sys.stdout, pretty=True)}' - ) - writer.write_json(test_result, output, pretty=True) - elif cli_args.command == 'diff': - cli_args = cast(args.DiffArgs, cli_args) - - # parse test results - base_result = reader.parse_json(cli_args.base_result) - changed_result = reader.parse_json(cli_args.changed_result) - - # get diff - diff_result = diff.Diff(base_result, changed_result, - cli_args.metadata).diff() - writer.write_json(data=diff_result, file=output, pretty=True) diff --git a/benchmarks/perf-tool/okpt/test/__init__.py b/benchmarks/perf-tool/okpt/test/__init__.py deleted file mode 100644 index ff4fd04d1..000000000 --- a/benchmarks/perf-tool/okpt/test/__init__.py +++ /dev/null @@ -1,5 +0,0 @@ -# SPDX-License-Identifier: Apache-2.0 -# -# The OpenSearch Contributors require contributions made to -# this file be licensed under the Apache-2.0 license or a -# compatible open source license. diff --git a/benchmarks/perf-tool/okpt/test/profile.py b/benchmarks/perf-tool/okpt/test/profile.py deleted file mode 100644 index d96860f9a..000000000 --- a/benchmarks/perf-tool/okpt/test/profile.py +++ /dev/null @@ -1,86 +0,0 @@ -# SPDX-License-Identifier: Apache-2.0 -# -# The OpenSearch Contributors require contributions made to -# this file be licensed under the Apache-2.0 license or a -# compatible open source license. - -"""Provides decorators to profile functions. - -The decorators work by adding a `measureable` (time, memory, etc) field to a -dictionary returned by the wrapped function. So the wrapped functions must -return a dictionary in order to be profiled. -""" -import functools -import time -from typing import Callable - - -class TimerStoppedWithoutStartingError(Exception): - """Error raised when Timer is stopped without having been started.""" - - def __init__(self): - super().__init__() - self.message = 'Timer must call start() before calling end().' - - -class _Timer(): - """Timer class for timing. - - Methods: - start: Starts the timer. - end: Stops the timer and returns the time elapsed since start. - - Raises: - TimerStoppedWithoutStartingError: Timer must start before ending. - """ - - def __init__(self): - self.start_time = None - - def start(self): - """Starts the timer.""" - self.start_time = time.perf_counter() - - def end(self) -> float: - """Stops the timer. - - Returns: - The time elapsed in milliseconds. 
- """ - # ensure timer has started before ending - if self.start_time is None: - raise TimerStoppedWithoutStartingError() - - elapsed = (time.perf_counter() - self.start_time) * 1000 - self.start_time = None - return elapsed - - -def took(f: Callable): - """Profiles a functions execution time. - - Args: - f: Function to profile. - - Returns: - A function that wraps the passed in function and adds a time took field - to the return value. - """ - - @functools.wraps(f) - def wrapper(*args, **kwargs): - """Wrapper function.""" - timer = _Timer() - timer.start() - result = f(*args, **kwargs) - time_took = timer.end() - - # if result already has a `took` field, don't modify the result - if isinstance(result, dict) and 'took' in result: - return result - # `result` may not be a dictionary, so it may not be unpackable - elif isinstance(result, dict): - return {**result, 'took': time_took} - return {'took': time_took} - - return wrapper diff --git a/benchmarks/perf-tool/okpt/test/runner.py b/benchmarks/perf-tool/okpt/test/runner.py deleted file mode 100644 index 150154691..000000000 --- a/benchmarks/perf-tool/okpt/test/runner.py +++ /dev/null @@ -1,107 +0,0 @@ -# SPDX-License-Identifier: Apache-2.0 -# -# The OpenSearch Contributors require contributions made to -# this file be licensed under the Apache-2.0 license or a -# compatible open source license. - -"""Provides a test runner class.""" -import logging -import platform -import sys -from datetime import datetime -from typing import Any, Dict, List - -import psutil - -from okpt.io.config.parsers import test -from okpt.test.test import Test, get_avg - - -def _aggregate_runs(runs: List[Dict[str, Any]]): - """Aggregates and averages a list of test results. - - Args: - results: A list of test results. - num_runs: Number of times the tests were ran. - - Returns: - A dictionary containing the averages of the test results. - """ - aggregate: Dict[str, Any] = {} - for run in runs: - for key, value in run.items(): - if key in aggregate: - aggregate[key].append(value) - else: - aggregate[key] = [value] - - aggregate = {key: get_avg(value) for key, value in aggregate.items()} - return aggregate - - -class TestRunner: - """Test runner class for running tests and aggregating the results. - - Methods: - execute: Run the tests and aggregate the results. - """ - - def __init__(self, test_config: test.TestConfig): - """"Initializes test state.""" - self.test_config = test_config - self.test = Test(test_config) - - def _get_metadata(self): - """"Retrieves the test metadata.""" - svmem = psutil.virtual_memory() - return { - 'test_name': - self.test_config.test_name, - 'test_id': - self.test_config.test_id, - 'date': - datetime.now().strftime('%m/%d/%Y %H:%M:%S'), - 'python_version': - sys.version, - 'os_version': - platform.platform(), - 'processor': - platform.processor() + ', ' + - str(psutil.cpu_count(logical=True)) + ' cores', - 'memory': - str(svmem.used) + ' (used) / ' + str(svmem.available) + - ' (available) / ' + str(svmem.total) + ' (total)', - } - - def execute(self) -> Dict[str, Any]: - """Runs the tests and aggregates the results. - - Returns: - A dictionary containing the aggregate of test results. 
- """ - logging.info('Setting up tests.') - self.test.setup() - logging.info('Beginning to run tests.') - runs = [] - for i in range(self.test_config.num_runs): - logging.info( - f'Running test {i + 1} of {self.test_config.num_runs}' - ) - runs.append(self.test.execute()) - - logging.info('Finished running tests.') - aggregate = _aggregate_runs(runs) - - # add metadata to test results - test_result = { - 'metadata': - self._get_metadata(), - 'results': - aggregate - } - - # include info about all test runs if specified in config - if self.test_config.show_runs: - test_result['runs'] = runs - - return test_result diff --git a/benchmarks/perf-tool/okpt/test/steps/base.py b/benchmarks/perf-tool/okpt/test/steps/base.py deleted file mode 100644 index 829980421..000000000 --- a/benchmarks/perf-tool/okpt/test/steps/base.py +++ /dev/null @@ -1,60 +0,0 @@ -# SPDX-License-Identifier: Apache-2.0 -# -# The OpenSearch Contributors require contributions made to -# this file be licensed under the Apache-2.0 license or a -# compatible open source license. -"""Provides base Step interface.""" - -from dataclasses import dataclass -from typing import Any, Dict, List - -from okpt.test import profile - - -@dataclass -class StepConfig: - step_name: str - config: Dict[str, object] - implicit_config: Dict[str, object] - - -class Step: - """Test step interface. - - Attributes: - label: Name of the step. - - Methods: - execute: Run the step and return a step response with the label and - corresponding measures. - """ - - label = 'base_step' - - def __init__(self, step_config: StepConfig): - self.step_config = step_config - - def _action(self): - """Step logic/behavior to be executed and profiled.""" - pass - - def _get_measures(self) -> List[str]: - """Gets the measures for a particular test""" - pass - - def execute(self) -> List[Dict[str, Any]]: - """Execute step logic while profiling various measures. - - Returns: - Dict containing step label and various step measures. - """ - action = self._action - - # profile the action with measure decorators - add if necessary - action = getattr(profile, 'took')(action) - - result = action() - if isinstance(result, dict): - return [{'label': self.label, **result}] - - raise ValueError('Invalid return by a step') diff --git a/benchmarks/perf-tool/okpt/test/steps/factory.py b/benchmarks/perf-tool/okpt/test/steps/factory.py deleted file mode 100644 index 2033f2672..000000000 --- a/benchmarks/perf-tool/okpt/test/steps/factory.py +++ /dev/null @@ -1,50 +0,0 @@ -# SPDX-License-Identifier: Apache-2.0 -# -# The OpenSearch Contributors require contributions made to -# this file be licensed under the Apache-2.0 license or a -# compatible open source license. 
-"""Factory for creating steps.""" - -from okpt.io.config.parsers.base import ConfigurationError -from okpt.test.steps.base import Step, StepConfig - -from okpt.test.steps.steps import CreateIndexStep, DisableRefreshStep, RefreshIndexStep, DeleteIndexStep, \ - TrainModelStep, DeleteModelStep, ForceMergeStep, ClearCacheStep, IngestStep, IngestMultiFieldStep, \ - IngestNestedFieldStep, QueryStep, QueryWithFilterStep, QueryNestedFieldStep, GetStatsStep, WarmupStep - - -def create_step(step_config: StepConfig) -> Step: - if step_config.step_name == CreateIndexStep.label: - return CreateIndexStep(step_config) - elif step_config.step_name == DisableRefreshStep.label: - return DisableRefreshStep(step_config) - elif step_config.step_name == RefreshIndexStep.label: - return RefreshIndexStep(step_config) - elif step_config.step_name == TrainModelStep.label: - return TrainModelStep(step_config) - elif step_config.step_name == DeleteModelStep.label: - return DeleteModelStep(step_config) - elif step_config.step_name == DeleteIndexStep.label: - return DeleteIndexStep(step_config) - elif step_config.step_name == IngestStep.label: - return IngestStep(step_config) - elif step_config.step_name == IngestMultiFieldStep.label: - return IngestMultiFieldStep(step_config) - elif step_config.step_name == IngestNestedFieldStep.label: - return IngestNestedFieldStep(step_config) - elif step_config.step_name == QueryStep.label: - return QueryStep(step_config) - elif step_config.step_name == QueryWithFilterStep.label: - return QueryWithFilterStep(step_config) - elif step_config.step_name == QueryNestedFieldStep.label: - return QueryNestedFieldStep(step_config) - elif step_config.step_name == ForceMergeStep.label: - return ForceMergeStep(step_config) - elif step_config.step_name == ClearCacheStep.label: - return ClearCacheStep(step_config) - elif step_config.step_name == GetStatsStep.label: - return GetStatsStep(step_config) - elif step_config.step_name == WarmupStep.label: - return WarmupStep(step_config) - - raise ConfigurationError(f'Invalid step {step_config.step_name}') diff --git a/benchmarks/perf-tool/okpt/test/steps/steps.py b/benchmarks/perf-tool/okpt/test/steps/steps.py deleted file mode 100644 index 99b2728dc..000000000 --- a/benchmarks/perf-tool/okpt/test/steps/steps.py +++ /dev/null @@ -1,987 +0,0 @@ -# SPDX-License-Identifier: Apache-2.0 -# -# The OpenSearch Contributors require contributions made to -# this file be licensed under the Apache-2.0 license or a -# compatible open source license. -"""Provides steps for OpenSearch tests. - -Some OpenSearch operations return a `took` field in the response body, -so the profiling decorators aren't needed for some functions. 
-""" -import json -from abc import abstractmethod -from typing import Any, Dict, List - -import numpy as np -import requests -import time - -from opensearchpy import OpenSearch, RequestsHttpConnection - -from okpt.io.config.parsers.base import ConfigurationError -from okpt.io.config.parsers.util import parse_string_param, parse_int_param, parse_dataset, parse_bool_param, \ - parse_list_param -from okpt.io.dataset import Context -from okpt.io.utils.reader import parse_json_from_path -from okpt.test.steps import base -from okpt.test.steps.base import StepConfig - - -class OpenSearchStep(base.Step): - """See base class.""" - - def __init__(self, step_config: StepConfig): - super().__init__(step_config) - self.endpoint = parse_string_param('endpoint', step_config.config, - step_config.implicit_config, - 'localhost') - default_port = 9200 if self.endpoint == 'localhost' else 80 - self.port = parse_int_param('port', step_config.config, - step_config.implicit_config, default_port) - self.timeout = parse_int_param('timeout', step_config.config, {}, 60) - self.opensearch = get_opensearch_client(str(self.endpoint), - int(self.port), int(self.timeout)) - - -class CreateIndexStep(OpenSearchStep): - """See base class.""" - - label = 'create_index' - - def __init__(self, step_config: StepConfig): - super().__init__(step_config) - self.index_name = parse_string_param('index_name', step_config.config, - {}, None) - index_spec = parse_string_param('index_spec', step_config.config, {}, - None) - self.body = parse_json_from_path(index_spec) - if self.body is None: - raise ConfigurationError('Index body must be passed in') - - def _action(self): - """Creates an OpenSearch index, applying the index settings/mappings. - - Returns: - An OpenSearch index creation response body. - """ - self.opensearch.indices.create(index=self.index_name, body=self.body) - return {} - - def _get_measures(self) -> List[str]: - return ['took'] - - -class DisableRefreshStep(OpenSearchStep): - """See base class.""" - - label = 'disable_refresh' - - def _action(self): - """Disables the refresh interval for an OpenSearch index. - - Returns: - An OpenSearch index settings update response body. 
- """ - self.opensearch.indices.put_settings( - body={'index': { - 'refresh_interval': -1 - }}) - - return {} - - def _get_measures(self) -> List[str]: - return ['took'] - - -class RefreshIndexStep(OpenSearchStep): - """See base class.""" - - label = 'refresh_index' - - def __init__(self, step_config: StepConfig): - super().__init__(step_config) - self.index_name = parse_string_param('index_name', step_config.config, - {}, None) - - def _action(self): - while True: - try: - self.opensearch.indices.refresh(index=self.index_name) - return {'store_kb': get_index_size_in_kb(self.opensearch, - self.index_name)} - except: - pass - - def _get_measures(self) -> List[str]: - return ['took', 'store_kb'] - - -class ForceMergeStep(OpenSearchStep): - """See base class.""" - - label = 'force_merge' - - def __init__(self, step_config: StepConfig): - super().__init__(step_config) - self.index_name = parse_string_param('index_name', step_config.config, - {}, None) - self.max_num_segments = parse_int_param('max_num_segments', - step_config.config, {}, None) - - def _action(self): - while True: - try: - self.opensearch.indices.forcemerge( - index=self.index_name, - max_num_segments=self.max_num_segments) - return {} - except: - pass - - def _get_measures(self) -> List[str]: - return ['took'] - -class ClearCacheStep(OpenSearchStep): - """See base class.""" - - label = 'clear_cache' - - def __init__(self, step_config: StepConfig): - super().__init__(step_config) - self.index_name = parse_string_param('index_name', step_config.config, - {}, None) - - def _action(self): - while True: - try: - self.opensearch.indices.clear_cache( - index=self.index_name) - return {} - except: - pass - - def _get_measures(self) -> List[str]: - return ['took'] - - -class WarmupStep(OpenSearchStep): - """See base class.""" - - label = 'warmup_operation' - - def __init__(self, step_config: StepConfig): - super().__init__(step_config) - self.index_name = parse_string_param('index_name', step_config.config, {}, - None) - - def _action(self): - """Performs warmup operation on an index.""" - warmup_operation(self.endpoint, self.port, self.index_name) - return {} - - def _get_measures(self) -> List[str]: - return ['took'] - - -class TrainModelStep(OpenSearchStep): - """See base class.""" - - label = 'train_model' - - def __init__(self, step_config: StepConfig): - super().__init__(step_config) - - self.model_id = parse_string_param('model_id', step_config.config, {}, - 'Test') - self.train_index_name = parse_string_param('train_index', - step_config.config, {}, None) - self.train_index_field = parse_string_param('train_field', - step_config.config, {}, - None) - self.dimension = parse_int_param('dimension', step_config.config, {}, - None) - self.description = parse_string_param('description', step_config.config, - {}, 'Default') - self.max_training_vector_count = parse_int_param( - 'max_training_vector_count', step_config.config, {}, 10000000000000) - - method_spec = parse_string_param('method_spec', step_config.config, {}, - None) - self.method = parse_json_from_path(method_spec) - if self.method is None: - raise ConfigurationError('method must be passed in') - - def _action(self): - """Train a model for an index. 
- - Returns: - The trained model - """ - - # Build body - body = { - 'training_index': self.train_index_name, - 'training_field': self.train_index_field, - 'description': self.description, - 'dimension': self.dimension, - 'method': self.method, - 'max_training_vector_count': self.max_training_vector_count - } - - # So, we trained the model. Now we need to wait until we have to wait - # until the model is created. Poll every - # 1/10 second - requests.post('http://' + self.endpoint + ':' + str(self.port) + - '/_plugins/_knn/models/' + str(self.model_id) + '/_train', - json.dumps(body), - headers={'content-type': 'application/json'}) - - sleep_time = 0.1 - timeout = 100000 - i = 0 - while i < timeout: - time.sleep(sleep_time) - model_response = get_model(self.endpoint, self.port, self.model_id) - if 'state' in model_response.keys() and model_response['state'] == \ - 'created': - return {} - i += 1 - - raise TimeoutError('Failed to create model') - - def _get_measures(self) -> List[str]: - return ['took'] - - -class DeleteModelStep(OpenSearchStep): - """See base class.""" - - label = 'delete_model' - - def __init__(self, step_config: StepConfig): - super().__init__(step_config) - - self.model_id = parse_string_param('model_id', step_config.config, {}, - 'Test') - - def _action(self): - """Train a model for an index. - - Returns: - The trained model - """ - delete_model(self.endpoint, self.port, self.model_id) - return {} - - def _get_measures(self) -> List[str]: - return ['took'] - - -class DeleteIndexStep(OpenSearchStep): - """See base class.""" - - label = 'delete_index' - - def __init__(self, step_config: StepConfig): - super().__init__(step_config) - - self.index_name = parse_string_param('index_name', step_config.config, - {}, None) - - def _action(self): - """Delete the index - - Returns: - An empty dict - """ - delete_index(self.opensearch, self.index_name) - return {} - - def _get_measures(self) -> List[str]: - return ['took'] - - -class BaseIngestStep(OpenSearchStep): - """See base class.""" - def __init__(self, step_config: StepConfig): - super().__init__(step_config) - self.index_name = parse_string_param('index_name', step_config.config, - {}, None) - self.field_name = parse_string_param('field_name', step_config.config, - {}, None) - self.bulk_size = parse_int_param('bulk_size', step_config.config, {}, - 300) - self.implicit_config = step_config.implicit_config - dataset_format = parse_string_param('dataset_format', - step_config.config, {}, 'hdf5') - dataset_path = parse_string_param('dataset_path', step_config.config, - {}, None) - self.dataset = parse_dataset(dataset_format, dataset_path, - Context.INDEX) - - self.input_doc_count = parse_int_param('doc_count', step_config.config, {}, - self.dataset.size()) - self.doc_count = min(self.input_doc_count, self.dataset.size()) - - def _action(self): - - def action(doc_id): - return {'index': {'_index': self.index_name, '_id': doc_id}} - - # Maintain minimal state outside of this loop. 
For large data sets, too - # much state may cause out of memory failure - for i in range(0, self.doc_count, self.bulk_size): - partition = self.dataset.read(self.bulk_size) - self._handle_data_bulk(partition, action, i) - self.dataset.reset() - - return {} - - def _get_measures(self) -> List[str]: - return ['took'] - - @abstractmethod - def _handle_data_bulk(self, partition, action, i): - pass - - -class IngestStep(BaseIngestStep): - """See base class.""" - - label = 'ingest' - - def _handle_data_bulk(self, partition, action, i): - if partition is None: - return - body = bulk_transform(partition, self.field_name, action, i) - bulk_index(self.opensearch, self.index_name, body) - - -class IngestMultiFieldStep(BaseIngestStep): - """See base class.""" - - label = 'ingest_multi_field' - - def __init__(self, step_config: StepConfig): - super().__init__(step_config) - - dataset_path = parse_string_param('dataset_path', step_config.config, - {}, None) - - self.attributes_dataset_name = parse_string_param('attributes_dataset_name', - step_config.config, {}, None) - - self.attributes_dataset = parse_dataset('hdf5', dataset_path, - Context.CUSTOM, self.attributes_dataset_name) - - self.attribute_spec = parse_list_param('attribute_spec', - step_config.config, {}, []) - - self.partition_attr = self.attributes_dataset.read(self.doc_count) - self.action_buffer = None - - def _handle_data_bulk(self, partition, action, i): - if partition is None: - return - body = self.bulk_transform_with_attributes(partition, self.partition_attr, self.field_name, - action, i, self.attribute_spec) - bulk_index(self.opensearch, self.index_name, body) - - def bulk_transform_with_attributes(self, partition: np.ndarray, partition_attr, field_name: str, - action, offset: int, attributes_def) -> List[Dict[str, Any]]: - """Partitions and transforms a list of vectors into OpenSearch's bulk - injection format. - Args: - partition: An array of vectors to transform. - partition_attr: dictionary of additional data to transform - field_name: field name for action - action: Bulk API action. - offset: to start counting from - attributes_def: definition of additional doc fields - Returns: - An array of transformed vectors in bulk format. 
- """ - actions = [] - _ = [ - actions.extend([action(i + offset), None]) - for i in range(len(partition)) - ] - idx = 1 - part_list = partition.tolist() - for i in range(len(partition)): - actions[idx] = {field_name: part_list[i]} - attr_idx = i + offset - attr_def_idx = 0 - for attribute in attributes_def: - attr_def_name = attribute['name'] - attr_def_type = attribute['type'] - - if attr_def_type == 'str': - val = partition_attr[attr_idx][attr_def_idx].decode() - if val != 'None': - actions[idx][attr_def_name] = val - elif attr_def_type == 'int': - val = int(partition_attr[attr_idx][attr_def_idx].decode()) - actions[idx][attr_def_name] = val - attr_def_idx += 1 - idx += 2 - - return actions - - -class IngestNestedFieldStep(BaseIngestStep): - """See base class.""" - - label = 'ingest_nested_field' - - def __init__(self, step_config: StepConfig): - super().__init__(step_config) - - dataset_path = parse_string_param('dataset_path', step_config.config, - {}, None) - - self.attributes_dataset_name = parse_string_param('attributes_dataset_name', - step_config.config, {}, None) - - self.attributes_dataset = parse_dataset('hdf5', dataset_path, - Context.CUSTOM, self.attributes_dataset_name) - - self.attribute_spec = parse_list_param('attribute_spec', - step_config.config, {}, []) - - self.partition_attr = self.attributes_dataset.read(self.doc_count) - - if self.dataset.size() != self.doc_count: - raise ValueError("custom doc_count is not supported for nested field") - self.action_buffer = None - self.action_parent_id = None - self.count = 0 - - def _handle_data_bulk(self, partition, action, i): - if partition is None: - return - body = self.bulk_transform_with_nested(partition, self.partition_attr, self.field_name, - action, i, self.attribute_spec) - if len(body) > 0: - bulk_index(self.opensearch, self.index_name, body) - - def bulk_transform_with_nested(self, partition: np.ndarray, partition_attr, field_name: str, - action, offset: int, attributes_def) -> List[Dict[str, Any]]: - """Partitions and transforms a list of vectors into OpenSearch's bulk - injection format. - Args: - partition: An array of vectors to transform. - partition_attr: dictionary of additional data to transform - field_name: field name for action - action: Bulk API action. - offset: to start counting from - attributes_def: definition of additional doc fields - Returns: - An array of transformed vectors in bulk format. - """ - # offset is index of start row. We need number of parent doc - 1. - # The number of parent document can be calculated by using partition_attr data. - # We need to keep the last parent doc aside so that additional data can be added later. 
- parent_id_idx = next((index for (index, d) in enumerate(attributes_def) if d.get('name') == 'parent_id'), None) - if parent_id_idx is None: - raise ValueError("parent_id should be provided as attribute spec") - if attributes_def[parent_id_idx]['type'] != 'int': - raise ValueError("parent_id should be int type") - - first_index = offset - last_index = offset + len(partition) - 1 - num_of_actions = int(partition_attr[last_index][parent_id_idx].decode()) - int(partition_attr[first_index][parent_id_idx].decode()) - if self.action_buffer is None: - self.action_buffer = {"nested_field": []} - self.action_parent_id = int(partition_attr[first_index][parent_id_idx].decode()) - - actions = [] - _ = [ - actions.extend([action(i + self.action_parent_id), None]) - for i in range(num_of_actions) - ] - - idx = 1 - part_list = partition.tolist() - for i in range(len(partition)): - self.count += 1 - nested = {field_name: part_list[i]} - attr_idx = i + offset - attr_def_idx = 0 - current_parent_id = None - for attribute in attributes_def: - attr_def_name = attribute['name'] - attr_def_type = attribute['type'] - if attr_def_name == "parent_id": - current_parent_id = int(partition_attr[attr_idx][attr_def_idx].decode()) - attr_def_idx += 1 - continue - - if attr_def_type == 'str': - val = partition_attr[attr_idx][attr_def_idx].decode() - if val != 'None': - nested[attr_def_name] = val - elif attr_def_type == 'int': - val = int(partition_attr[attr_idx][attr_def_idx].decode()) - nested[attr_def_name] = val - attr_def_idx += 1 - - if self.action_parent_id == current_parent_id: - self.action_buffer["nested_field"].append(nested) - else: - actions.extend([action(self.action_parent_id), self.action_buffer]) - self.action_buffer = {"nested_field": []} - self.action_buffer["nested_field"].append(nested) - self.action_parent_id = current_parent_id - idx += 2 - - if self.count == self.doc_count: - actions.extend([action(self.action_parent_id), self.action_buffer]) - - return actions - - -class BaseQueryStep(OpenSearchStep): - """See base class.""" - - def __init__(self, step_config: StepConfig): - super().__init__(step_config) - self.k = parse_int_param('k', step_config.config, {}, 100) - self.r = parse_int_param('r', step_config.config, {}, 1) - self.index_name = parse_string_param('index_name', step_config.config, - {}, None) - self.field_name = parse_string_param('field_name', step_config.config, - {}, None) - self.calculate_recall = parse_bool_param('calculate_recall', - step_config.config, {}, False) - dataset_format = parse_string_param('dataset_format', - step_config.config, {}, 'hdf5') - dataset_path = parse_string_param('dataset_path', - step_config.config, {}, None) - self.dataset = parse_dataset(dataset_format, dataset_path, - Context.QUERY) - - input_query_count = parse_int_param('query_count', - step_config.config, {}, - self.dataset.size()) - self.query_count = min(input_query_count, self.dataset.size()) - - self.neighbors_format = parse_string_param('neighbors_format', - step_config.config, {}, 'hdf5') - self.neighbors_path = parse_string_param('neighbors_path', - step_config.config, {}, None) - - def _action(self): - - results = {} - query_responses = [] - for _ in range(self.query_count): - query = self.dataset.read(1) - if query is None: - break - query_responses.append( - query_index(self.opensearch, self.index_name, - self.get_body(query[0]) , self.get_exclude_fields())) - - results['took'] = [ - float(query_response['took']) for query_response in query_responses - ] - results['client_time'] = [ - 
float(query_response['client_time']) for query_response in query_responses - ] - results['memory_kb'] = get_cache_size_in_kb(self.endpoint, self.port) - - if self.calculate_recall: - ids = [[int(hit['_id']) - for hit in query_response['hits']['hits']] - for query_response in query_responses] - results['recall@K'] = recall_at_r(ids, self.neighbors, - self.k, self.k, self.query_count) - self.neighbors.reset() - results[f'recall@{str(self.r)}'] = recall_at_r( - ids, self.neighbors, self.r, self.k, self.query_count) - self.neighbors.reset() - - self.dataset.reset() - - return results - - def _get_measures(self) -> List[str]: - measures = ['took', 'memory_kb', 'client_time'] - - if self.calculate_recall: - measures.extend(['recall@K', f'recall@{str(self.r)}']) - - return measures - - @abstractmethod - def get_body(self, vec): - pass - - def get_exclude_fields(self): - return [self.field_name] - -class QueryStep(BaseQueryStep): - """See base class.""" - - label = 'query' - - def __init__(self, step_config: StepConfig): - super().__init__(step_config) - self.neighbors = parse_dataset(self.neighbors_format, self.neighbors_path, - Context.NEIGHBORS) - self.implicit_config = step_config.implicit_config - - def get_body(self, vec): - return { - 'size': self.k, - 'query': { - 'knn': { - self.field_name: { - 'vector': vec, - 'k': self.k - } - } - } - } - - -class QueryWithFilterStep(BaseQueryStep): - """See base class.""" - - label = 'query_with_filter' - - def __init__(self, step_config: StepConfig): - super().__init__(step_config) - - neighbors_dataset = parse_string_param('neighbors_dataset', - step_config.config, {}, None) - - self.neighbors = parse_dataset(self.neighbors_format, self.neighbors_path, - Context.CUSTOM, neighbors_dataset) - - self.filter_type = parse_string_param('filter_type', step_config.config, {}, 'SCRIPT') - self.filter_spec = parse_string_param('filter_spec', step_config.config, {}, None) - self.score_script_similarity = parse_string_param('score_script_similarity', step_config.config, {}, 'l2') - - self.implicit_config = step_config.implicit_config - - def get_body(self, vec): - filter_json = json.load(open(self.filter_spec)) - if self.filter_type == 'FILTER': - return { - 'size': self.k, - 'query': { - 'knn': { - self.field_name: { - 'vector': vec, - 'k': self.k, - 'filter': filter_json - } - } - } - } - elif self.filter_type == 'SCRIPT': - return { - 'size': self.k, - 'query': { - 'script_score': { - 'query': { - 'bool': { - 'filter': filter_json - } - }, - 'script': { - 'source': 'knn_score', - 'lang': 'knn', - 'params': { - 'field': self.field_name, - 'query_value': vec, - 'space_type': self.score_script_similarity - } - } - } - } - } - elif self.filter_type == 'BOOL_POST_FILTER': - return { - 'size': self.k, - 'query': { - 'bool': { - 'filter': filter_json, - 'must': [ - { - 'knn': { - self.field_name: { - 'vector': vec, - 'k': self.k - } - } - } - ] - } - } - } - else: - raise ConfigurationError('Not supported filter type {}'.format(self.filter_type)) - -class QueryNestedFieldStep(BaseQueryStep): - """See base class.""" - - label = 'query_nested_field' - - def __init__(self, step_config: StepConfig): - super().__init__(step_config) - - neighbors_dataset = parse_string_param('neighbors_dataset', - step_config.config, {}, None) - - self.neighbors = parse_dataset(self.neighbors_format, self.neighbors_path, - Context.CUSTOM, neighbors_dataset) - - self.implicit_config = step_config.implicit_config - - def get_body(self, vec): - return { - 'size': self.k, - 'query': { - 
'nested': { - 'path': 'nested_field', - 'query': { - 'knn': { - 'nested_field.' + self.field_name: { - 'vector': vec, - 'k': self.k - } - } - } - } - } - } - -class GetStatsStep(OpenSearchStep): - """See base class.""" - - label = 'get_stats' - - def __init__(self, step_config: StepConfig): - super().__init__(step_config) - - self.index_name = parse_string_param('index_name', step_config.config, - {}, None) - - def _action(self): - """Get stats for cluster/index etc. - - Returns: - Stats with following info: - - number of committed and search segments in the index - """ - results = {} - segment_stats = get_segment_stats(self.opensearch, self.index_name) - shards = segment_stats["indices"][self.index_name]["shards"] - num_of_committed_segments = 0 - num_of_search_segments = 0; - for shard_key in shards.keys(): - for segment in shards[shard_key]: - num_of_committed_segments += segment["num_committed_segments"] - num_of_search_segments += segment["num_search_segments"] - - results['committed_segments'] = num_of_committed_segments - results['search_segments'] = num_of_search_segments - return results - - def _get_measures(self) -> List[str]: - return ['committed_segments', 'search_segments'] - -# Helper functions - (AKA not steps) -def bulk_transform(partition: np.ndarray, field_name: str, action, - offset: int) -> List[Dict[str, Any]]: - """Partitions and transforms a list of vectors into OpenSearch's bulk - injection format. - Args: - offset: to start counting from - partition: An array of vectors to transform. - field_name: field name for action - action: Bulk API action. - Returns: - An array of transformed vectors in bulk format. - """ - actions = [] - _ = [ - actions.extend([action(i + offset), None]) - for i in range(len(partition)) - ] - actions[1::2] = [{field_name: vec} for vec in partition.tolist()] - return actions - - -def delete_index(opensearch: OpenSearch, index_name: str): - """Deletes an OpenSearch index. - - Args: - opensearch: An OpenSearch client. - index_name: Name of the OpenSearch index to be deleted. - """ - opensearch.indices.delete(index=index_name, ignore=[400, 404]) - - -def get_model(endpoint, port, model_id): - """ - Retrieve a model from an OpenSearch cluster - Args: - endpoint: Endpoint OpenSearch is running on - port: Port OpenSearch is running on - model_id: ID of model to be deleted - Returns: - Get model response - """ - response = requests.get('http://' + endpoint + ':' + str(port) + - '/_plugins/_knn/models/' + model_id, - headers={'content-type': 'application/json'}) - return response.json() - - -def delete_model(endpoint, port, model_id): - """ - Deletes a model from OpenSearch cluster - Args: - endpoint: Endpoint OpenSearch is running on - port: Port OpenSearch is running on - model_id: ID of model to be deleted - Returns: - Deleted model response - """ - response = requests.delete('http://' + endpoint + ':' + str(port) + - '/_plugins/_knn/models/' + model_id, - headers={'content-type': 'application/json'}) - return response.json() - - -def warmup_operation(endpoint, port, index): - """ - Performs warmup operation on index to load native library files - of that index to reduce query latencies. - Args: - endpoint: Endpoint OpenSearch is running on - port: Port OpenSearch is running on - index: index name - Returns: - number of shards the plugin succeeded and failed to warm up. 
- """ - response = requests.get('http://' + endpoint + ':' + str(port) + - '/_plugins/_knn/warmup/' + index, - headers={'content-type': 'application/json'}) - return response.json() - - -def get_opensearch_client(endpoint: str, port: int, timeout=60): - """ - Get an opensearch client from an endpoint and port - Args: - endpoint: Endpoint OpenSearch is running on - port: Port OpenSearch is running on - timeout: timeout for OpenSearch client, default value 60 - Returns: - OpenSearch client - - """ - # TODO: fix for security in the future - return OpenSearch( - hosts=[{ - 'host': endpoint, - 'port': port - }], - use_ssl=False, - verify_certs=False, - connection_class=RequestsHttpConnection, - timeout=timeout, - ) - - -def recall_at_r(results, neighbor_dataset, r, k, query_count): - """ - Calculates the recall@R for a set of queries against a ground truth nearest - neighbor set - Args: - results: 2D list containing ids of results returned by OpenSearch. - results[i][j] i refers to query, j refers to - result in the query - neighbor_dataset: 2D dataset containing ids of the true nearest - neighbors for a set of queries - r: number of top results to check if they are in the ground truth k-NN - set. - k: k value for the query - query_count: number of queries - Returns: - Recall at R - """ - correct = 0.0 - total_num_of_results = 0 - for query in range(query_count): - true_neighbors = neighbor_dataset.read(1) - if true_neighbors is None: - break - true_neighbors_set = set(true_neighbors[0][:k]) - true_neighbors_set.discard(-1) - min_r = min(r, len(true_neighbors_set)) - total_num_of_results += min_r - for j in range(min_r): - if results[query][j] in true_neighbors_set: - correct += 1.0 - - return correct / total_num_of_results - - -def get_index_size_in_kb(opensearch, index_name): - """ - Gets the size of an index in kilobytes - Args: - opensearch: opensearch client - index_name: name of index to look up - Returns: - size of index in kilobytes - """ - return int( - opensearch.indices.stats(index_name, metric='store')['indices'] - [index_name]['total']['store']['size_in_bytes']) / 1024 - - -def get_cache_size_in_kb(endpoint, port): - """ - Gets the size of the k-NN cache in kilobytes - Args: - endpoint: endpoint of OpenSearch cluster - port: port of endpoint OpenSearch is running on - Returns: - size of cache in kilobytes - """ - response = requests.get('http://' + endpoint + ':' + str(port) + - '/_plugins/_knn/stats', - headers={'content-type': 'application/json'}) - stats = response.json() - - keys = stats['nodes'].keys() - - total_used = 0 - for key in keys: - total_used += int(stats['nodes'][key]['graph_memory_usage']) - return total_used - - -def query_index(opensearch: OpenSearch, index_name: str, body: dict, - excluded_fields: list): - start_time = round(time.time()*1000) - queryResponse = opensearch.search(index=index_name, - body=body, - _source_excludes=excluded_fields) - end_time = round(time.time() * 1000) - queryResponse['client_time'] = end_time - start_time - return queryResponse - - -def bulk_index(opensearch: OpenSearch, index_name: str, body: List): - return opensearch.bulk(index=index_name, body=body) - -def get_segment_stats(opensearch: OpenSearch, index_name: str): - return opensearch.indices.segments(index=index_name) diff --git a/benchmarks/perf-tool/okpt/test/test.py b/benchmarks/perf-tool/okpt/test/test.py deleted file mode 100644 index c947545ad..000000000 --- a/benchmarks/perf-tool/okpt/test/test.py +++ /dev/null @@ -1,188 +0,0 @@ -# SPDX-License-Identifier: Apache-2.0 -# 
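For readers skimming this removal, the alternating action/document layout that the deleted `bulk_transform` helper produced for the Bulk API can be pictured with a small, self-contained sketch. This is an illustrative re-implementation, not the removed function itself; the index name, field name, vectors and ids are placeholder values.

```python
from typing import Any, Dict, List

def bulk_transform(vectors: List[List[float]], index_name: str, field_name: str,
                   offset: int) -> List[Dict[str, Any]]:
    """Interleave an index action with its document for each vector (Bulk API layout)."""
    body: List[Dict[str, Any]] = []
    for i, vec in enumerate(vectors):
        body.append({'index': {'_index': index_name, '_id': i + offset}})  # action metadata
        body.append({field_name: vec})                                     # document source
    return body

# Two 2-d placeholder vectors ingested starting at doc id 100:
print(bulk_transform([[0.1, 0.2], [0.3, 0.4]], 'target_index', 'target_field', offset=100))
```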
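Similarly, the recall@R computation performed by the deleted `recall_at_r` helper boils down to the fraction of returned ids that fall inside the ground-truth neighbor sets. A minimal sketch, with invented ids purely for illustration and without the dataset-reading details of the original:

```python
from typing import List

def recall_at_r(results: List[List[int]], ground_truth: List[List[int]], r: int) -> float:
    """Share of the top-r returned ids that appear in the true nearest-neighbor sets."""
    correct, total = 0, 0
    for returned, truth in zip(results, ground_truth):
        truth_set = set(truth)
        top_r = min(r, len(truth_set))
        total += top_r
        correct += sum(1 for doc_id in returned[:top_r] if doc_id in truth_set)
    return correct / total if total else -1.0

# Two queries, r=2: the first finds both true neighbors, the second finds one of two.
print(recall_at_r([[1, 2], [9, 4]], [[1, 2, 3], [3, 4, 5]], r=2))  # 0.75
```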
-# The OpenSearch Contributors require contributions made to -# this file be licensed under the Apache-2.0 license or a -# compatible open source license. - -"""Provides a base Test class.""" -from math import floor -from typing import Any, Dict, List - -from okpt.io.config.parsers.test import TestConfig -from okpt.test.steps.base import Step - - -def get_avg(values: List[Any]): - """Get average value of a list. - - Args: - values: A list of values. - - Returns: - The average value in the list. - """ - valid_total = len(values) - running_sum = 0.0 - - for value in values: - if value == -1: - valid_total -= 1 - continue - running_sum += value - - if valid_total == 0: - return -1 - return running_sum / valid_total - - -def _pxx(values: List[Any], p: float): - """Calculates the pXX statistics for a given list. - - Args: - values: List of values. - p: Percentile (between 0 and 1). - - Returns: - The corresponding pXX metric. - """ - lowest_percentile = 1 / len(values) - highest_percentile = (len(values) - 1) / len(values) - - # return -1 if p is out of range or if the list doesn't have enough elements - # to support the specified percentile - if p < 0 or p > 1: - return -1.0 - elif p < lowest_percentile or p > highest_percentile: - if p == 1.0 and len(values) > 1: - return float(values[len(values) - 1]) - return -1.0 - else: - return float(values[floor(len(values) * p)]) - - -def _aggregate_steps(step_results: List[Dict[str, Any]], - measure_labels=None): - """Aggregates the steps for a given Test. - - The aggregation process extracts the measures from each step and calculates - the total time spent performing each step measure, including the - percentile metrics, if possible. - - The aggregation process also extracts the test measures by simply summing - up the respective step measures. - - A step measure is formatted as `{step_name}_{measure_name}`, for example, - {bulk_index}_{took} or {query_index}_{memory}. The braces are not included - in the actual key string. - - Percentile/Total step measures are give as - `{step_name}_{measure_name}_{percentile|total}`. - - Test measures are just step measure sums so they just given as - `test_{measure_name}`. - - Args: - steps: List of test steps to be aggregated. - measures: List of step metrics to account for. - - Returns: - A complete test result. 
- """ - if measure_labels is None: - measure_labels = ['took'] - test_measures = { - f'test_{measure_label}': 0 - for measure_label in measure_labels - } - step_measures: Dict[str, Any] = {} - - # iterate over all test steps - for step in step_results: - step_label = step['label'] - - step_measure_labels = list(step.keys()) - step_measure_labels.remove('label') - - # iterate over all measures in each test step - for measure_label in step_measure_labels: - - step_measure = step[measure_label] - step_measure_label = f'{measure_label}' if step_label == 'get_stats' else f'{step_label}_{measure_label}' - - # Add cumulative test measures from steps to test measures - if measure_label in measure_labels: - test_measures[f'test_{measure_label}'] += sum(step_measure) if \ - isinstance(step_measure, list) else step_measure - - if step_measure_label in step_measures: - _ = step_measures[step_measure_label].extend(step_measure) \ - if isinstance(step_measure, list) else \ - step_measures[step_measure_label].append(step_measure) - else: - step_measures[step_measure_label] = step_measure if \ - isinstance(step_measure, list) else [step_measure] - - aggregate = {**test_measures} - # calculate the totals and percentile statistics for each step measure - # where relevant - for step_measure_label, step_measure in step_measures.items(): - step_measure.sort() - - aggregate[step_measure_label + '_total'] = float(sum(step_measure)) - - p50 = _pxx(step_measure, 0.50) - if p50 != -1: - aggregate[step_measure_label + '_p50'] = p50 - p90 = _pxx(step_measure, 0.90) - if p90 != -1: - aggregate[step_measure_label + '_p90'] = p90 - p99 = _pxx(step_measure, 0.99) - if p99 != -1: - aggregate[step_measure_label + '_p99'] = p99 - p99_9 = _pxx(step_measure, 0.999) - if p99_9 != -1: - aggregate[step_measure_label + '_p99.9'] = p99_9 - p100 = _pxx(step_measure, 1.00) - if p100 != -1: - aggregate[step_measure_label + '_p100'] = p100 - - return aggregate - - -class Test: - """A base Test class, representing a collection of steps to profiled and - aggregated. - - Methods: - setup: Performs test setup. Usually for steps not intended to be - profiled. - run_steps: Runs the test steps, aggregating the results into the - `step_results` instance field. - cleanup: Perform test cleanup. Useful for clearing the state of a - persistent process like OpenSearch. Cleanup steps are executed after - each run. - execute: Runs steps, cleans up, and aggregates the test result. - """ - def __init__(self, test_config: TestConfig): - """Initializes the test state. 
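The measure-naming convention documented above (`{step_name}_{measure_name}` keys with `_total` and `_pXX` suffixes) is easiest to see with a small, simplified sketch; the step label and latency values are invented, and the percentile handling is cruder than the deleted `_pxx` helper (no out-of-range checks).

```python
from math import floor
from typing import Dict, List

def pxx(sorted_values: List[float], p: float) -> float:
    """Value at percentile p (0 <= p < 1) of an already-sorted list."""
    return float(sorted_values[floor(len(sorted_values) * p)])

def aggregate_step(step_label: str, took_ms: List[float]) -> Dict[str, float]:
    """Roll one step's raw latencies into total and percentile measures."""
    values = sorted(took_ms)
    return {
        f'{step_label}_took_total': sum(values),
        f'{step_label}_took_p50': pxx(values, 0.50),
        f'{step_label}_took_p90': pxx(values, 0.90),
        f'{step_label}_took_p99': pxx(values, 0.99),
    }

# A hypothetical 'query' step that recorded four latencies:
print(aggregate_step('query', [12.0, 15.0, 11.0, 30.0]))
# {'query_took_total': 68.0, 'query_took_p50': 15.0, 'query_took_p90': 30.0, 'query_took_p99': 30.0}
```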
- """ - self.test_config = test_config - self.setup_steps: List[Step] = test_config.setup - self.test_steps: List[Step] = test_config.steps - self.cleanup_steps: List[Step] = test_config.cleanup - - def setup(self): - _ = [step.execute() for step in self.setup_steps] - - def _run_steps(self): - step_results = [] - _ = [step_results.extend(step.execute()) for step in self.test_steps] - return step_results - - def _cleanup(self): - _ = [step.execute() for step in self.cleanup_steps] - - def execute(self): - results = self._run_steps() - self._cleanup() - return _aggregate_steps(results) diff --git a/benchmarks/perf-tool/release-configs/faiss-hnsw/filtering/relaxed-filter/index.json b/benchmarks/perf-tool/release-configs/faiss-hnsw/filtering/relaxed-filter/index.json deleted file mode 100644 index 7e8ddda8e..000000000 --- a/benchmarks/perf-tool/release-configs/faiss-hnsw/filtering/relaxed-filter/index.json +++ /dev/null @@ -1,27 +0,0 @@ -{ - "settings": { - "index": { - "knn": true, - "number_of_shards": 24, - "number_of_replicas": 1, - "knn.algo_param.ef_search": 100 - } - }, - "mappings": { - "properties": { - "target_field": { - "type": "knn_vector", - "dimension": 128, - "method": { - "name": "hnsw", - "space_type": "l2", - "engine": "faiss", - "parameters": { - "ef_construction": 256, - "m": 16 - } - } - } - } - } -} diff --git a/benchmarks/perf-tool/release-configs/faiss-hnsw/filtering/relaxed-filter/relaxed-filter-spec.json b/benchmarks/perf-tool/release-configs/faiss-hnsw/filtering/relaxed-filter/relaxed-filter-spec.json deleted file mode 100644 index 3e04d12c4..000000000 --- a/benchmarks/perf-tool/release-configs/faiss-hnsw/filtering/relaxed-filter/relaxed-filter-spec.json +++ /dev/null @@ -1,42 +0,0 @@ -{ - "bool": - { - "should": - [ - { - "range": - { - "age": - { - "gte": 30, - "lte": 70 - } - } - }, - { - "term": - { - "color": "green" - } - }, - { - "term": - { - "color": "blue" - } - }, - { - "term": - { - "color": "yellow" - } - }, - { - "term": - { - "taste": "sweet" - } - } - ] - } -} diff --git a/benchmarks/perf-tool/release-configs/faiss-hnsw/filtering/relaxed-filter/relaxed-filter-test.yml b/benchmarks/perf-tool/release-configs/faiss-hnsw/filtering/relaxed-filter/relaxed-filter-test.yml deleted file mode 100644 index ba8850e1d..000000000 --- a/benchmarks/perf-tool/release-configs/faiss-hnsw/filtering/relaxed-filter/relaxed-filter-test.yml +++ /dev/null @@ -1,40 +0,0 @@ -endpoint: [ENDPOINT] -port: [PORT] -test_name: "Faiss HNSW Relaxed Filter Test" -test_id: "Faiss HNSW Relaxed Filter Test" -num_runs: 3 -show_runs: false -steps: - - name: delete_index - index_name: target_index - - name: create_index - index_name: target_index - index_spec: release-configs/faiss-hnsw/filtering/relaxed-filter/index.json - - name: ingest_multi_field - index_name: target_index - field_name: target_field - bulk_size: 500 - dataset_format: hdf5 - dataset_path: dataset/sift-128-euclidean-with-attr.hdf5 - attributes_dataset_name: attributes - attribute_spec: [ { name: 'color', type: 'str' }, { name: 'taste', type: 'str' }, { name: 'age', type: 'int' } ] - - name: refresh_index - index_name: target_index - - name: force_merge - index_name: target_index - max_num_segments: 1 - - name: warmup_operation - index_name: target_index - - name: query_with_filter - k: 100 - r: 1 - calculate_recall: true - index_name: target_index - field_name: target_field - dataset_format: hdf5 - dataset_path: dataset/sift-128-euclidean-with-attr.hdf5 - neighbors_format: hdf5 - neighbors_path: 
dataset/sift-128-euclidean-with-relaxed-filters.hdf5 - neighbors_dataset: neighbors_filter_5 - filter_spec: release-configs/faiss-hnsw/filtering/relaxed-filter/relaxed-filter-spec.json - filter_type: FILTER diff --git a/benchmarks/perf-tool/release-configs/faiss-hnsw/filtering/restrictive-filter/index.json b/benchmarks/perf-tool/release-configs/faiss-hnsw/filtering/restrictive-filter/index.json deleted file mode 100644 index 7e8ddda8e..000000000 --- a/benchmarks/perf-tool/release-configs/faiss-hnsw/filtering/restrictive-filter/index.json +++ /dev/null @@ -1,27 +0,0 @@ -{ - "settings": { - "index": { - "knn": true, - "number_of_shards": 24, - "number_of_replicas": 1, - "knn.algo_param.ef_search": 100 - } - }, - "mappings": { - "properties": { - "target_field": { - "type": "knn_vector", - "dimension": 128, - "method": { - "name": "hnsw", - "space_type": "l2", - "engine": "faiss", - "parameters": { - "ef_construction": 256, - "m": 16 - } - } - } - } - } -} diff --git a/benchmarks/perf-tool/release-configs/faiss-hnsw/filtering/restrictive-filter/restrictive-filter-spec.json b/benchmarks/perf-tool/release-configs/faiss-hnsw/filtering/restrictive-filter/restrictive-filter-spec.json deleted file mode 100644 index 9e6356f1c..000000000 --- a/benchmarks/perf-tool/release-configs/faiss-hnsw/filtering/restrictive-filter/restrictive-filter-spec.json +++ /dev/null @@ -1,44 +0,0 @@ -{ - "bool": - { - "must": - [ - { - "range": - { - "age": - { - "gte": 30, - "lte": 60 - } - } - }, - { - "term": - { - "taste": "bitter" - } - }, - { - "bool": - { - "should": - [ - { - "term": - { - "color": "blue" - } - }, - { - "term": - { - "color": "green" - } - } - ] - } - } - ] - } -} \ No newline at end of file diff --git a/benchmarks/perf-tool/release-configs/faiss-hnsw/filtering/restrictive-filter/restrictive-filter-test.yml b/benchmarks/perf-tool/release-configs/faiss-hnsw/filtering/restrictive-filter/restrictive-filter-test.yml deleted file mode 100644 index 94f4073c7..000000000 --- a/benchmarks/perf-tool/release-configs/faiss-hnsw/filtering/restrictive-filter/restrictive-filter-test.yml +++ /dev/null @@ -1,40 +0,0 @@ -endpoint: [ENDPOINT] -port: [PORT] -test_name: "Faiss HNSW Restrictive Filter Test" -test_id: "Faiss HNSW Restrictive Filter Test" -num_runs: 3 -show_runs: false -steps: - - name: delete_index - index_name: target_index - - name: create_index - index_name: target_index - index_spec: release-configs/faiss-hnsw/filtering/restrictive-filter/index.json - - name: ingest_multi_field - index_name: target_index - field_name: target_field - bulk_size: 500 - dataset_format: hdf5 - dataset_path: dataset/sift-128-euclidean-with-attr.hdf5 - attributes_dataset_name: attributes - attribute_spec: [ { name: 'color', type: 'str' }, { name: 'taste', type: 'str' }, { name: 'age', type: 'int' } ] - - name: refresh_index - index_name: target_index - - name: force_merge - index_name: target_index - max_num_segments: 1 - - name: warmup_operation - index_name: target_index - - name: query_with_filter - k: 100 - r: 1 - calculate_recall: true - index_name: target_index - field_name: target_field - dataset_format: hdf5 - dataset_path: dataset/sift-128-euclidean-with-attr.hdf5 - neighbors_format: hdf5 - neighbors_path: dataset/sift-128-euclidean-with-restrictive-filters.hdf5 - neighbors_dataset: neighbors_filter_4 - filter_spec: release-configs/faiss-hnsw/filtering/restrictive-filter/restrictive-filter-spec.json - filter_type: FILTER diff --git a/benchmarks/perf-tool/release-configs/faiss-hnsw/index.json 
b/benchmarks/perf-tool/release-configs/faiss-hnsw/index.json deleted file mode 100644 index 7e8ddda8e..000000000 --- a/benchmarks/perf-tool/release-configs/faiss-hnsw/index.json +++ /dev/null @@ -1,27 +0,0 @@ -{ - "settings": { - "index": { - "knn": true, - "number_of_shards": 24, - "number_of_replicas": 1, - "knn.algo_param.ef_search": 100 - } - }, - "mappings": { - "properties": { - "target_field": { - "type": "knn_vector", - "dimension": 128, - "method": { - "name": "hnsw", - "space_type": "l2", - "engine": "faiss", - "parameters": { - "ef_construction": 256, - "m": 16 - } - } - } - } - } -} diff --git a/benchmarks/perf-tool/release-configs/faiss-hnsw/nested/simple/index.json b/benchmarks/perf-tool/release-configs/faiss-hnsw/nested/simple/index.json deleted file mode 100644 index 338ceb1f4..000000000 --- a/benchmarks/perf-tool/release-configs/faiss-hnsw/nested/simple/index.json +++ /dev/null @@ -1,35 +0,0 @@ -{ - "settings": { - "index": { - "knn": true, - "number_of_shards": 24, - "number_of_replicas": 1, - "knn.algo_param.ef_search": 100 - } - }, - "mappings": { - "_source": { - "excludes": ["nested_field"] - }, - "properties": { - "nested_field": { - "type": "nested", - "properties": { - "target_field": { - "type": "knn_vector", - "dimension": 128, - "method": { - "name": "hnsw", - "space_type": "l2", - "engine": "faiss", - "parameters": { - "ef_construction": 256, - "m": 16 - } - } - } - } - } - } - } -} diff --git a/benchmarks/perf-tool/release-configs/faiss-hnsw/nested/simple/simple-nested-test.yml b/benchmarks/perf-tool/release-configs/faiss-hnsw/nested/simple/simple-nested-test.yml deleted file mode 100644 index 151b2014d..000000000 --- a/benchmarks/perf-tool/release-configs/faiss-hnsw/nested/simple/simple-nested-test.yml +++ /dev/null @@ -1,37 +0,0 @@ -endpoint: [ENDPOINT] -port: [PORT] -test_name: "Faiss HNSW Nested Field Test" -test_id: "Faiss HNSW Nested Field Test" -num_runs: 3 -show_runs: false -steps: - - name: delete_index - index_name: target_index - - name: create_index - index_name: target_index - index_spec: release-configs/faiss-hnsw/nested/simple/index.json - - name: ingest_nested_field - index_name: target_index - field_name: target_field - dataset_format: hdf5 - dataset_path: dataset/sift-128-euclidean-nested.hdf5 - attributes_dataset_name: attributes - attribute_spec: [ { name: 'color', type: 'str' }, { name: 'taste', type: 'str' }, { name: 'age', type: 'int' }, { name: 'parent_id', type: 'int'} ] - - name: refresh_index - index_name: target_index - - name: force_merge - index_name: target_index - max_num_segments: 1 - - name: warmup_operation - index_name: target_index - - name: query_nested_field - k: 100 - r: 1 - calculate_recall: true - index_name: target_index - field_name: target_field - dataset_format: hdf5 - dataset_path: dataset/sift-128-euclidean-nested.hdf5 - neighbors_format: hdf5 - neighbors_path: dataset/sift-128-euclidean-nested.hdf5 - neighbors_dataset: neighbour_nested \ No newline at end of file diff --git a/benchmarks/perf-tool/release-configs/faiss-hnsw/test.yml b/benchmarks/perf-tool/release-configs/faiss-hnsw/test.yml deleted file mode 100644 index c4740acf5..000000000 --- a/benchmarks/perf-tool/release-configs/faiss-hnsw/test.yml +++ /dev/null @@ -1,35 +0,0 @@ -endpoint: [ENDPOINT] -port: [PORT] -test_name: "Faiss HNSW Test" -test_id: "Faiss HNSW Test" -num_runs: 3 -show_runs: false -steps: - - name: delete_index - index_name: target_index - - name: create_index - index_name: target_index - index_spec: release-configs/faiss-hnsw/index.json 
- - name: ingest - index_name: target_index - field_name: target_field - bulk_size: 500 - dataset_format: hdf5 - dataset_path: dataset/sift-128-euclidean.hdf5 - - name: refresh_index - index_name: target_index - - name: force_merge - index_name: target_index - max_num_segments: 1 - - name: warmup_operation - index_name: target_index - - name: query - k: 100 - r: 1 - calculate_recall: true - index_name: target_index - field_name: target_field - dataset_format: hdf5 - dataset_path: dataset/sift-128-euclidean.hdf5 - neighbors_format: hdf5 - neighbors_path: dataset/sift-128-euclidean.hdf5 diff --git a/benchmarks/perf-tool/release-configs/faiss-hnswpq/index.json b/benchmarks/perf-tool/release-configs/faiss-hnswpq/index.json deleted file mode 100644 index 479703412..000000000 --- a/benchmarks/perf-tool/release-configs/faiss-hnswpq/index.json +++ /dev/null @@ -1,17 +0,0 @@ -{ - "settings": { - "index": { - "knn": true, - "number_of_shards": 24, - "number_of_replicas": 1 - } - }, - "mappings": { - "properties": { - "target_field": { - "type": "knn_vector", - "model_id": "test-model" - } - } - } -} diff --git a/benchmarks/perf-tool/release-configs/faiss-hnswpq/method-spec.json b/benchmarks/perf-tool/release-configs/faiss-hnswpq/method-spec.json deleted file mode 100644 index 2d67bf2df..000000000 --- a/benchmarks/perf-tool/release-configs/faiss-hnswpq/method-spec.json +++ /dev/null @@ -1,15 +0,0 @@ -{ - "name":"hnsw", - "engine":"faiss", - "space_type": "l2", - "parameters":{ - "ef_construction": 256, - "m": 16, - "encoder": { - "name": "pq", - "parameters": { - "m": 16 - } - } - } -} diff --git a/benchmarks/perf-tool/release-configs/faiss-hnswpq/test.yml b/benchmarks/perf-tool/release-configs/faiss-hnswpq/test.yml deleted file mode 100644 index f573ede9c..000000000 --- a/benchmarks/perf-tool/release-configs/faiss-hnswpq/test.yml +++ /dev/null @@ -1,59 +0,0 @@ -endpoint: [ENDPOINT] -port: [PORT] -test_name: "Faiss HNSW PQ Test" -test_id: "Faiss HNSW PQ Test" -num_runs: 3 -show_runs: false -setup: - - name: delete_index - index_name: train_index - - name: create_index - index_name: train_index - index_spec: release-configs/faiss-hnswpq/train-index-spec.json - - name: ingest - index_name: train_index - field_name: train_field - bulk_size: 500 - dataset_format: hdf5 - dataset_path: dataset/sift-128-euclidean.hdf5 - doc_count: 50000 - - name: refresh_index - index_name: train_index -steps: - - name: delete_model - model_id: test-model - - name: delete_index - index_name: target_index - - name: train_model - model_id: test-model - train_index: train_index - train_field: train_field - dimension: 128 - method_spec: release-configs/faiss-hnswpq/method-spec.json - max_training_vector_count: 50000 - - name: create_index - index_name: target_index - index_spec: release-configs/faiss-hnswpq/index.json - - name: ingest - index_name: target_index - field_name: target_field - bulk_size: 500 - dataset_format: hdf5 - dataset_path: dataset/sift-128-euclidean.hdf5 - - name: refresh_index - index_name: target_index - - name: force_merge - index_name: target_index - max_num_segments: 1 - - name: warmup_operation - index_name: target_index - - name: query - k: 100 - r: 1 - calculate_recall: true - index_name: target_index - field_name: target_field - dataset_format: hdf5 - dataset_path: dataset/sift-128-euclidean.hdf5 - neighbors_format: hdf5 - neighbors_path: dataset/sift-128-euclidean.hdf5 diff --git a/benchmarks/perf-tool/release-configs/faiss-hnswpq/train-index-spec.json 
b/benchmarks/perf-tool/release-configs/faiss-hnswpq/train-index-spec.json deleted file mode 100644 index 804a5707e..000000000 --- a/benchmarks/perf-tool/release-configs/faiss-hnswpq/train-index-spec.json +++ /dev/null @@ -1,16 +0,0 @@ -{ - "settings": { - "index": { - "number_of_shards": 24, - "number_of_replicas": 0 - } - }, - "mappings": { - "properties": { - "train_field": { - "type": "knn_vector", - "dimension": 128 - } - } - } -} diff --git a/benchmarks/perf-tool/release-configs/faiss-ivf/filtering/relaxed-filter/index.json b/benchmarks/perf-tool/release-configs/faiss-ivf/filtering/relaxed-filter/index.json deleted file mode 100644 index ade7fa377..000000000 --- a/benchmarks/perf-tool/release-configs/faiss-ivf/filtering/relaxed-filter/index.json +++ /dev/null @@ -1,17 +0,0 @@ -{ - "settings": { - "index": { - "knn": true, - "number_of_shards": 24, - "number_of_replicas": 1 - } - }, - "mappings": { - "properties": { - "target_field": { - "type": "knn_vector", - "model_id": "test-model" - } - } - } -} diff --git a/benchmarks/perf-tool/release-configs/faiss-ivf/filtering/relaxed-filter/method-spec.json b/benchmarks/perf-tool/release-configs/faiss-ivf/filtering/relaxed-filter/method-spec.json deleted file mode 100644 index 51ae89877..000000000 --- a/benchmarks/perf-tool/release-configs/faiss-ivf/filtering/relaxed-filter/method-spec.json +++ /dev/null @@ -1,9 +0,0 @@ -{ - "name":"ivf", - "engine":"faiss", - "space_type": "l2", - "parameters":{ - "nlist": 128, - "nprobes": 8 - } -} diff --git a/benchmarks/perf-tool/release-configs/faiss-ivf/filtering/relaxed-filter/relaxed-filter-spec.json b/benchmarks/perf-tool/release-configs/faiss-ivf/filtering/relaxed-filter/relaxed-filter-spec.json deleted file mode 100644 index 3e04d12c4..000000000 --- a/benchmarks/perf-tool/release-configs/faiss-ivf/filtering/relaxed-filter/relaxed-filter-spec.json +++ /dev/null @@ -1,42 +0,0 @@ -{ - "bool": - { - "should": - [ - { - "range": - { - "age": - { - "gte": 30, - "lte": 70 - } - } - }, - { - "term": - { - "color": "green" - } - }, - { - "term": - { - "color": "blue" - } - }, - { - "term": - { - "color": "yellow" - } - }, - { - "term": - { - "taste": "sweet" - } - } - ] - } -} diff --git a/benchmarks/perf-tool/release-configs/faiss-ivf/filtering/relaxed-filter/relaxed-filter-test.yml b/benchmarks/perf-tool/release-configs/faiss-ivf/filtering/relaxed-filter/relaxed-filter-test.yml deleted file mode 100644 index adb25a04d..000000000 --- a/benchmarks/perf-tool/release-configs/faiss-ivf/filtering/relaxed-filter/relaxed-filter-test.yml +++ /dev/null @@ -1,64 +0,0 @@ -endpoint: [ENDPOINT] -port: [PORT] -test_name: "Faiss IVF Relaxed Filter Test" -test_id: "Faiss IVF Relaxed Filter Test" -num_runs: 3 -show_runs: false -setup: - - name: delete_index - index_name: train_index - - name: create_index - index_name: train_index - index_spec: release-configs/faiss-ivf/filtering/relaxed-filter/train-index-spec.json - - name: ingest - index_name: train_index - field_name: train_field - bulk_size: 500 - dataset_format: hdf5 - dataset_path: dataset/sift-128-euclidean.hdf5 - doc_count: 50000 - - name: refresh_index - index_name: train_index -steps: - - name: delete_model - model_id: test-model - - name: delete_index - index_name: target_index - - name: train_model - model_id: test-model - train_index: train_index - train_field: train_field - dimension: 128 - method_spec: release-configs/faiss-ivf/filtering/relaxed-filter/method-spec.json - max_training_vector_count: 50000 - - name: create_index - index_name: target_index - 
index_spec: release-configs/faiss-ivf/filtering/relaxed-filter/index.json - - name: ingest_multi_field - index_name: target_index - field_name: target_field - bulk_size: 500 - dataset_format: hdf5 - dataset_path: dataset/sift-128-euclidean-with-attr.hdf5 - attributes_dataset_name: attributes - attribute_spec: [ { name: 'color', type: 'str' }, { name: 'taste', type: 'str' }, { name: 'age', type: 'int' } ] - - name: refresh_index - index_name: target_index - - name: force_merge - index_name: target_index - max_num_segments: 1 - - name: warmup_operation - index_name: target_index - - name: query_with_filter - k: 100 - r: 1 - calculate_recall: true - index_name: target_index - field_name: target_field - dataset_format: hdf5 - dataset_path: dataset/sift-128-euclidean-with-attr.hdf5 - neighbors_format: hdf5 - neighbors_path: dataset/sift-128-euclidean-with-relaxed-filters.hdf5 - neighbors_dataset: neighbors_filter_5 - filter_spec: release-configs/faiss-ivf/filtering/relaxed-filter/relaxed-filter-spec.json - filter_type: FILTER diff --git a/benchmarks/perf-tool/release-configs/faiss-ivf/filtering/relaxed-filter/train-index-spec.json b/benchmarks/perf-tool/release-configs/faiss-ivf/filtering/relaxed-filter/train-index-spec.json deleted file mode 100644 index 137fac9d8..000000000 --- a/benchmarks/perf-tool/release-configs/faiss-ivf/filtering/relaxed-filter/train-index-spec.json +++ /dev/null @@ -1,16 +0,0 @@ -{ - "settings": { - "index": { - "number_of_shards": 24, - "number_of_replicas": 1 - } - }, - "mappings": { - "properties": { - "train_field": { - "type": "knn_vector", - "dimension": 128 - } - } - } -} diff --git a/benchmarks/perf-tool/release-configs/faiss-ivf/filtering/restrictive-filter/index.json b/benchmarks/perf-tool/release-configs/faiss-ivf/filtering/restrictive-filter/index.json deleted file mode 100644 index ade7fa377..000000000 --- a/benchmarks/perf-tool/release-configs/faiss-ivf/filtering/restrictive-filter/index.json +++ /dev/null @@ -1,17 +0,0 @@ -{ - "settings": { - "index": { - "knn": true, - "number_of_shards": 24, - "number_of_replicas": 1 - } - }, - "mappings": { - "properties": { - "target_field": { - "type": "knn_vector", - "model_id": "test-model" - } - } - } -} diff --git a/benchmarks/perf-tool/release-configs/faiss-ivf/filtering/restrictive-filter/method-spec.json b/benchmarks/perf-tool/release-configs/faiss-ivf/filtering/restrictive-filter/method-spec.json deleted file mode 100644 index 51ae89877..000000000 --- a/benchmarks/perf-tool/release-configs/faiss-ivf/filtering/restrictive-filter/method-spec.json +++ /dev/null @@ -1,9 +0,0 @@ -{ - "name":"ivf", - "engine":"faiss", - "space_type": "l2", - "parameters":{ - "nlist": 128, - "nprobes": 8 - } -} diff --git a/benchmarks/perf-tool/release-configs/faiss-ivf/filtering/restrictive-filter/restrictive-filter-spec.json b/benchmarks/perf-tool/release-configs/faiss-ivf/filtering/restrictive-filter/restrictive-filter-spec.json deleted file mode 100644 index 9e6356f1c..000000000 --- a/benchmarks/perf-tool/release-configs/faiss-ivf/filtering/restrictive-filter/restrictive-filter-spec.json +++ /dev/null @@ -1,44 +0,0 @@ -{ - "bool": - { - "must": - [ - { - "range": - { - "age": - { - "gte": 30, - "lte": 60 - } - } - }, - { - "term": - { - "taste": "bitter" - } - }, - { - "bool": - { - "should": - [ - { - "term": - { - "color": "blue" - } - }, - { - "term": - { - "color": "green" - } - } - ] - } - } - ] - } -} \ No newline at end of file diff --git 
a/benchmarks/perf-tool/release-configs/faiss-ivf/filtering/restrictive-filter/restrictive-filter-test.yml b/benchmarks/perf-tool/release-configs/faiss-ivf/filtering/restrictive-filter/restrictive-filter-test.yml deleted file mode 100644 index bad047eab..000000000 --- a/benchmarks/perf-tool/release-configs/faiss-ivf/filtering/restrictive-filter/restrictive-filter-test.yml +++ /dev/null @@ -1,64 +0,0 @@ -endpoint: [ENDPOINT] -port: [PORT] -test_name: "Faiss IVF restrictive Filter Test" -test_id: "Faiss IVF restrictive Filter Test" -num_runs: 3 -show_runs: false -setup: - - name: delete_index - index_name: train_index - - name: create_index - index_name: train_index - index_spec: release-configs/faiss-ivf/filtering/restrictive-filter/train-index-spec.json - - name: ingest - index_name: train_index - field_name: train_field - bulk_size: 500 - dataset_format: hdf5 - dataset_path: dataset/sift-128-euclidean.hdf5 - doc_count: 50000 - - name: refresh_index - index_name: train_index -steps: - - name: delete_model - model_id: test-model - - name: delete_index - index_name: target_index - - name: train_model - model_id: test-model - train_index: train_index - train_field: train_field - dimension: 128 - method_spec: release-configs/faiss-ivf/filtering/restrictive-filter/method-spec.json - max_training_vector_count: 50000 - - name: create_index - index_name: target_index - index_spec: release-configs/faiss-ivf/filtering/restrictive-filter/index.json - - name: ingest_multi_field - index_name: target_index - field_name: target_field - bulk_size: 500 - dataset_format: hdf5 - dataset_path: dataset/sift-128-euclidean-with-attr.hdf5 - attributes_dataset_name: attributes - attribute_spec: [ { name: 'color', type: 'str' }, { name: 'taste', type: 'str' }, { name: 'age', type: 'int' } ] - - name: refresh_index - index_name: target_index - - name: force_merge - index_name: target_index - max_num_segments: 1 - - name: warmup_operation - index_name: target_index - - name: query_with_filter - k: 100 - r: 1 - calculate_recall: true - index_name: target_index - field_name: target_field - dataset_format: hdf5 - dataset_path: dataset/sift-128-euclidean-with-attr.hdf5 - neighbors_format: hdf5 - neighbors_path: dataset/sift-128-euclidean-with-restrictive-filters.hdf5 - neighbors_dataset: neighbors_filter_4 - filter_spec: release-configs/faiss-ivf/filtering/restrictive-filter/restrictive-filter-spec.json - filter_type: FILTER diff --git a/benchmarks/perf-tool/release-configs/faiss-ivf/filtering/restrictive-filter/train-index-spec.json b/benchmarks/perf-tool/release-configs/faiss-ivf/filtering/restrictive-filter/train-index-spec.json deleted file mode 100644 index 804a5707e..000000000 --- a/benchmarks/perf-tool/release-configs/faiss-ivf/filtering/restrictive-filter/train-index-spec.json +++ /dev/null @@ -1,16 +0,0 @@ -{ - "settings": { - "index": { - "number_of_shards": 24, - "number_of_replicas": 0 - } - }, - "mappings": { - "properties": { - "train_field": { - "type": "knn_vector", - "dimension": 128 - } - } - } -} diff --git a/benchmarks/perf-tool/release-configs/faiss-ivf/index.json b/benchmarks/perf-tool/release-configs/faiss-ivf/index.json deleted file mode 100644 index 479703412..000000000 --- a/benchmarks/perf-tool/release-configs/faiss-ivf/index.json +++ /dev/null @@ -1,17 +0,0 @@ -{ - "settings": { - "index": { - "knn": true, - "number_of_shards": 24, - "number_of_replicas": 1 - } - }, - "mappings": { - "properties": { - "target_field": { - "type": "knn_vector", - "model_id": "test-model" - } - } - } -} diff --git 
a/benchmarks/perf-tool/release-configs/faiss-ivf/method-spec.json b/benchmarks/perf-tool/release-configs/faiss-ivf/method-spec.json deleted file mode 100644 index 51ae89877..000000000 --- a/benchmarks/perf-tool/release-configs/faiss-ivf/method-spec.json +++ /dev/null @@ -1,9 +0,0 @@ -{ - "name":"ivf", - "engine":"faiss", - "space_type": "l2", - "parameters":{ - "nlist": 128, - "nprobes": 8 - } -} diff --git a/benchmarks/perf-tool/release-configs/faiss-ivf/test.yml b/benchmarks/perf-tool/release-configs/faiss-ivf/test.yml deleted file mode 100644 index 367c42594..000000000 --- a/benchmarks/perf-tool/release-configs/faiss-ivf/test.yml +++ /dev/null @@ -1,59 +0,0 @@ -endpoint: [ENDPOINT] -port: [PORT] -test_name: "Faiss IVF" -test_id: "Faiss IVF" -num_runs: 3 -show_runs: false -setup: - - name: delete_index - index_name: train_index - - name: create_index - index_name: train_index - index_spec: release-configs/faiss-ivf/train-index-spec.json - - name: ingest - index_name: train_index - field_name: train_field - bulk_size: 500 - dataset_format: hdf5 - dataset_path: dataset/sift-128-euclidean.hdf5 - doc_count: 50000 - - name: refresh_index - index_name: train_index -steps: - - name: delete_model - model_id: test-model - - name: delete_index - index_name: target_index - - name: train_model - model_id: test-model - train_index: train_index - train_field: train_field - dimension: 128 - method_spec: release-configs/faiss-ivf/method-spec.json - max_training_vector_count: 50000 - - name: create_index - index_name: target_index - index_spec: release-configs/faiss-ivf/index.json - - name: ingest - index_name: target_index - field_name: target_field - bulk_size: 500 - dataset_format: hdf5 - dataset_path: dataset/sift-128-euclidean.hdf5 - - name: refresh_index - index_name: target_index - - name: force_merge - index_name: target_index - max_num_segments: 1 - - name: warmup_operation - index_name: target_index - - name: query - k: 100 - r: 1 - calculate_recall: true - index_name: target_index - field_name: target_field - dataset_format: hdf5 - dataset_path: dataset/sift-128-euclidean.hdf5 - neighbors_format: hdf5 - neighbors_path: dataset/sift-128-euclidean.hdf5 diff --git a/benchmarks/perf-tool/release-configs/faiss-ivf/train-index-spec.json b/benchmarks/perf-tool/release-configs/faiss-ivf/train-index-spec.json deleted file mode 100644 index 804a5707e..000000000 --- a/benchmarks/perf-tool/release-configs/faiss-ivf/train-index-spec.json +++ /dev/null @@ -1,16 +0,0 @@ -{ - "settings": { - "index": { - "number_of_shards": 24, - "number_of_replicas": 0 - } - }, - "mappings": { - "properties": { - "train_field": { - "type": "knn_vector", - "dimension": 128 - } - } - } -} diff --git a/benchmarks/perf-tool/release-configs/faiss-ivfpq/index.json b/benchmarks/perf-tool/release-configs/faiss-ivfpq/index.json deleted file mode 100644 index 479703412..000000000 --- a/benchmarks/perf-tool/release-configs/faiss-ivfpq/index.json +++ /dev/null @@ -1,17 +0,0 @@ -{ - "settings": { - "index": { - "knn": true, - "number_of_shards": 24, - "number_of_replicas": 1 - } - }, - "mappings": { - "properties": { - "target_field": { - "type": "knn_vector", - "model_id": "test-model" - } - } - } -} diff --git a/benchmarks/perf-tool/release-configs/faiss-ivfpq/method-spec.json b/benchmarks/perf-tool/release-configs/faiss-ivfpq/method-spec.json deleted file mode 100644 index 204b0a653..000000000 --- a/benchmarks/perf-tool/release-configs/faiss-ivfpq/method-spec.json +++ /dev/null @@ -1,16 +0,0 @@ -{ - "name":"ivf", - "engine":"faiss", - 
"space_type": "l2", - "parameters":{ - "nlist": 128, - "nprobes": 8, - "encoder": { - "name": "pq", - "parameters": { - "m": 16, - "code_size": 8 - } - } - } -} diff --git a/benchmarks/perf-tool/release-configs/faiss-ivfpq/test.yml b/benchmarks/perf-tool/release-configs/faiss-ivfpq/test.yml deleted file mode 100644 index c3f63348b..000000000 --- a/benchmarks/perf-tool/release-configs/faiss-ivfpq/test.yml +++ /dev/null @@ -1,59 +0,0 @@ -endpoint: [ENDPOINT] -port: [PORT] -test_name: "Faiss IVF PQ Test" -test_id: "Faiss IVF PQ Test" -num_runs: 3 -show_runs: false -setup: - - name: delete_index - index_name: train_index - - name: create_index - index_name: train_index - index_spec: release-configs/faiss-ivfpq/train-index-spec.json - - name: ingest - index_name: train_index - field_name: train_field - bulk_size: 500 - dataset_format: hdf5 - dataset_path: dataset/sift-128-euclidean.hdf5 - doc_count: 50000 - - name: refresh_index - index_name: train_index -steps: - - name: delete_model - model_id: test-model - - name: delete_index - index_name: target_index - - name: train_model - model_id: test-model - train_index: train_index - train_field: train_field - dimension: 128 - method_spec: release-configs/faiss-ivfpq/method-spec.json - max_training_vector_count: 50000 - - name: create_index - index_name: target_index - index_spec: release-configs/faiss-ivfpq/index.json - - name: ingest - index_name: target_index - field_name: target_field - bulk_size: 500 - dataset_format: hdf5 - dataset_path: dataset/sift-128-euclidean.hdf5 - - name: refresh_index - index_name: target_index - - name: force_merge - index_name: target_index - max_num_segments: 1 - - name: warmup_operation - index_name: target_index - - name: query - k: 100 - r: 1 - calculate_recall: true - index_name: target_index - field_name: target_field - dataset_format: hdf5 - dataset_path: dataset/sift-128-euclidean.hdf5 - neighbors_format: hdf5 - neighbors_path: dataset/sift-128-euclidean.hdf5 diff --git a/benchmarks/perf-tool/release-configs/faiss-ivfpq/train-index-spec.json b/benchmarks/perf-tool/release-configs/faiss-ivfpq/train-index-spec.json deleted file mode 100644 index 804a5707e..000000000 --- a/benchmarks/perf-tool/release-configs/faiss-ivfpq/train-index-spec.json +++ /dev/null @@ -1,16 +0,0 @@ -{ - "settings": { - "index": { - "number_of_shards": 24, - "number_of_replicas": 0 - } - }, - "mappings": { - "properties": { - "train_field": { - "type": "knn_vector", - "dimension": 128 - } - } - } -} diff --git a/benchmarks/perf-tool/release-configs/lucene-hnsw/filtering/relaxed-filter/index.json b/benchmarks/perf-tool/release-configs/lucene-hnsw/filtering/relaxed-filter/index.json deleted file mode 100644 index 7a9ff2890..000000000 --- a/benchmarks/perf-tool/release-configs/lucene-hnsw/filtering/relaxed-filter/index.json +++ /dev/null @@ -1,26 +0,0 @@ -{ - "settings": { - "index": { - "knn": true, - "number_of_shards": 24, - "number_of_replicas": 1 - } - }, - "mappings": { - "properties": { - "target_field": { - "type": "knn_vector", - "dimension": 128, - "method": { - "name": "hnsw", - "space_type": "l2", - "engine": "lucene", - "parameters": { - "ef_construction": 256, - "m": 16 - } - } - } - } - } -} diff --git a/benchmarks/perf-tool/release-configs/lucene-hnsw/filtering/relaxed-filter/relaxed-filter-spec.json b/benchmarks/perf-tool/release-configs/lucene-hnsw/filtering/relaxed-filter/relaxed-filter-spec.json deleted file mode 100644 index 3e04d12c4..000000000 --- 
a/benchmarks/perf-tool/release-configs/lucene-hnsw/filtering/relaxed-filter/relaxed-filter-spec.json
+++ /dev/null
@@ -1,42 +0,0 @@
-{
-  "bool":
-  {
-    "should":
-    [
-      {
-        "range":
-        {
-          "age":
-          {
-            "gte": 30,
-            "lte": 70
-          }
-        }
-      },
-      {
-        "term":
-        {
-          "color": "green"
-        }
-      },
-      {
-        "term":
-        {
-          "color": "blue"
-        }
-      },
-      {
-        "term":
-        {
-          "color": "yellow"
-        }
-      },
-      {
-        "term":
-        {
-          "taste": "sweet"
-        }
-      }
-    ]
-  }
-}
diff --git a/benchmarks/perf-tool/release-configs/lucene-hnsw/filtering/relaxed-filter/relaxed-filter-test.yml b/benchmarks/perf-tool/release-configs/lucene-hnsw/filtering/relaxed-filter/relaxed-filter-test.yml
deleted file mode 100644
index 3bbb99a0f..000000000
--- a/benchmarks/perf-tool/release-configs/lucene-hnsw/filtering/relaxed-filter/relaxed-filter-test.yml
+++ /dev/null
@@ -1,38 +0,0 @@
-endpoint: [ENDPOINT]
-port: [PORT]
-test_name: "Lucene HNSW Relaxed Filter Test"
-test_id: "Lucene HNSW Relaxed Filter Test"
-num_runs: 3
-show_runs: false
-steps:
-  - name: delete_index
-    index_name: target_index
-  - name: create_index
-    index_name: target_index
-    index_spec: release-configs/lucene-hnsw/filtering/relaxed-filter/index.json
-  - name: ingest_multi_field
-    index_name: target_index
-    field_name: target_field
-    bulk_size: 500
-    dataset_format: hdf5
-    dataset_path: dataset/sift-128-euclidean-with-attr.hdf5
-    attributes_dataset_name: attributes
-    attribute_spec: [ { name: 'color', type: 'str' }, { name: 'taste', type: 'str' }, { name: 'age', type: 'int' } ]
-  - name: refresh_index
-    index_name: target_index
-  - name: force_merge
-    index_name: target_index
-    max_num_segments: 1
-  - name: query_with_filter
-    k: 100
-    r: 1
-    calculate_recall: true
-    index_name: target_index
-    field_name: target_field
-    dataset_format: hdf5
-    dataset_path: dataset/sift-128-euclidean-with-attr.hdf5
-    neighbors_format: hdf5
-    neighbors_path: dataset/sift-128-euclidean-with-relaxed-filters.hdf5
-    neighbors_dataset: neighbors_filter_5
-    filter_spec: release-configs/lucene-hnsw/filtering/relaxed-filter/relaxed-filter-spec.json
-    filter_type: FILTER
diff --git a/benchmarks/perf-tool/release-configs/lucene-hnsw/filtering/restrictive-filter/index.json b/benchmarks/perf-tool/release-configs/lucene-hnsw/filtering/restrictive-filter/index.json
deleted file mode 100644
index 7a9ff2890..000000000
--- a/benchmarks/perf-tool/release-configs/lucene-hnsw/filtering/restrictive-filter/index.json
+++ /dev/null
@@ -1,26 +0,0 @@
-{
-  "settings": {
-    "index": {
-      "knn": true,
-      "number_of_shards": 24,
-      "number_of_replicas": 1
-    }
-  },
-  "mappings": {
-    "properties": {
-      "target_field": {
-        "type": "knn_vector",
-        "dimension": 128,
-        "method": {
-          "name": "hnsw",
-          "space_type": "l2",
-          "engine": "lucene",
-          "parameters": {
-            "ef_construction": 256,
-            "m": 16
-          }
-        }
-      }
-    }
-  }
-}
diff --git a/benchmarks/perf-tool/release-configs/lucene-hnsw/filtering/restrictive-filter/restrictive-filter-spec.json b/benchmarks/perf-tool/release-configs/lucene-hnsw/filtering/restrictive-filter/restrictive-filter-spec.json
deleted file mode 100644
index 9e6356f1c..000000000
--- a/benchmarks/perf-tool/release-configs/lucene-hnsw/filtering/restrictive-filter/restrictive-filter-spec.json
+++ /dev/null
@@ -1,44 +0,0 @@
-{
-  "bool":
-  {
-    "must":
-    [
-      {
-        "range":
-        {
-          "age":
-          {
-            "gte": 30,
-            "lte": 60
-          }
-        }
-      },
-      {
-        "term":
-        {
-          "taste": "bitter"
-        }
-      },
-      {
-        "bool":
-        {
-          "should":
-          [
-            {
-              "term":
-              {
-                "color": "blue"
-              }
-            },
-            {
-              "term":
-              {
-                "color": "green"
-              }
-            }
-          ]
-        }
-      }
-    ]
-  }
-}
\ No newline at end of file
diff --git a/benchmarks/perf-tool/release-configs/lucene-hnsw/filtering/restrictive-filter/restrictive-filter-test.yml b/benchmarks/perf-tool/release-configs/lucene-hnsw/filtering/restrictive-filter/restrictive-filter-test.yml
deleted file mode 100644
index aa4c5193f..000000000
--- a/benchmarks/perf-tool/release-configs/lucene-hnsw/filtering/restrictive-filter/restrictive-filter-test.yml
+++ /dev/null
@@ -1,38 +0,0 @@
-endpoint: [ENDPOINT]
-port: [PORT]
-test_name: "Lucene HNSW Restrictive Filter Test"
-test_id: "Lucene HNSW Restrictive Filter Test"
-num_runs: 3
-show_runs: false
-steps:
-  - name: delete_index
-    index_name: target_index
-  - name: create_index
-    index_name: target_index
-    index_spec: release-configs/lucene-hnsw/filtering/restrictive-filter/index.json
-  - name: ingest_multi_field
-    index_name: target_index
-    field_name: target_field
-    bulk_size: 500
-    dataset_format: hdf5
-    dataset_path: dataset/sift-128-euclidean-with-attr.hdf5
-    attributes_dataset_name: attributes
-    attribute_spec: [ { name: 'color', type: 'str' }, { name: 'taste', type: 'str' }, { name: 'age', type: 'int' } ]
-  - name: refresh_index
-    index_name: target_index
-  - name: force_merge
-    index_name: target_index
-    max_num_segments: 1
-  - name: query_with_filter
-    k: 100
-    r: 1
-    calculate_recall: true
-    index_name: target_index
-    field_name: target_field
-    dataset_format: hdf5
-    dataset_path: dataset/sift-128-euclidean-with-attr.hdf5
-    neighbors_format: hdf5
-    neighbors_path: dataset/sift-128-euclidean-with-restrictive-filters.hdf5
-    neighbors_dataset: neighbors_filter_4
-    filter_spec: release-configs/lucene-hnsw/filtering/restrictive-filter/restrictive-filter-spec.json
-    filter_type: FILTER
diff --git a/benchmarks/perf-tool/release-configs/lucene-hnsw/index.json b/benchmarks/perf-tool/release-configs/lucene-hnsw/index.json
deleted file mode 100644
index 7a9ff2890..000000000
--- a/benchmarks/perf-tool/release-configs/lucene-hnsw/index.json
+++ /dev/null
@@ -1,26 +0,0 @@
-{
-  "settings": {
-    "index": {
-      "knn": true,
-      "number_of_shards": 24,
-      "number_of_replicas": 1
-    }
-  },
-  "mappings": {
-    "properties": {
-      "target_field": {
-        "type": "knn_vector",
-        "dimension": 128,
-        "method": {
-          "name": "hnsw",
-          "space_type": "l2",
-          "engine": "lucene",
-          "parameters": {
-            "ef_construction": 256,
-            "m": 16
-          }
-        }
-      }
-    }
-  }
-}
diff --git a/benchmarks/perf-tool/release-configs/lucene-hnsw/nested/simple/index.json b/benchmarks/perf-tool/release-configs/lucene-hnsw/nested/simple/index.json
deleted file mode 100644
index b41b51c77..000000000
--- a/benchmarks/perf-tool/release-configs/lucene-hnsw/nested/simple/index.json
+++ /dev/null
@@ -1,34 +0,0 @@
-{
-  "settings": {
-    "index": {
-      "knn": true,
-      "number_of_shards": 24,
-      "number_of_replicas": 1
-    }
-  },
-  "mappings": {
-    "_source": {
-      "excludes": ["nested_field"]
-    },
-    "properties": {
-      "nested_field": {
-        "type": "nested",
-        "properties": {
-          "target_field": {
-            "type": "knn_vector",
-            "dimension": 128,
-            "method": {
-              "name": "hnsw",
-              "space_type": "l2",
-              "engine": "lucene",
-              "parameters": {
-                "ef_construction": 256,
-                "m": 16
-              }
-            }
-          }
-        }
-      }
-    }
-  }
-}
diff --git a/benchmarks/perf-tool/release-configs/lucene-hnsw/nested/simple/simple-nested-test.yml b/benchmarks/perf-tool/release-configs/lucene-hnsw/nested/simple/simple-nested-test.yml
deleted file mode 100644
index be825487a..000000000
--- a/benchmarks/perf-tool/release-configs/lucene-hnsw/nested/simple/simple-nested-test.yml
+++ /dev/null
@@ -1,37 +0,0 @@
-endpoint: [ENDPOINT]
-port: [PORT]
-test_name: "Lucene HNSW Nested Field Test"
-test_id: "Lucene HNSW Nested Field Test"
-num_runs: 3
-show_runs: false
-steps:
-  - name: delete_index
-    index_name: target_index
-  - name: create_index
-    index_name: target_index
-    index_spec: release-configs/lucene-hnsw/nested/simple/index.json
-  - name: ingest_nested_field
-    index_name: target_index
-    field_name: target_field
-    dataset_format: hdf5
-    dataset_path: dataset/sift-128-euclidean-nested.hdf5
-    attributes_dataset_name: attributes
-    attribute_spec: [ { name: 'color', type: 'str' }, { name: 'taste', type: 'str' }, { name: 'age', type: 'int' }, { name: 'parent_id', type: 'int'} ]
-  - name: refresh_index
-    index_name: target_index
-  - name: force_merge
-    index_name: target_index
-    max_num_segments: 1
-  - name: warmup_operation
-    index_name: target_index
-  - name: query_nested_field
-    k: 100
-    r: 1
-    calculate_recall: true
-    index_name: target_index
-    field_name: target_field
-    dataset_format: hdf5
-    dataset_path: dataset/sift-128-euclidean-nested.hdf5
-    neighbors_format: hdf5
-    neighbors_path: dataset/sift-128-euclidean-nested.hdf5
-    neighbors_dataset: neighbour_nested
\ No newline at end of file
diff --git a/benchmarks/perf-tool/release-configs/lucene-hnsw/test.yml b/benchmarks/perf-tool/release-configs/lucene-hnsw/test.yml
deleted file mode 100644
index b253ee08e..000000000
--- a/benchmarks/perf-tool/release-configs/lucene-hnsw/test.yml
+++ /dev/null
@@ -1,33 +0,0 @@
-endpoint: [ENDPOINT]
-port: [PORT]
-test_name: "Lucene HNSW"
-test_id: "Lucene HNSW"
-num_runs: 3
-show_runs: false
-steps:
-  - name: delete_index
-    index_name: target_index
-  - name: create_index
-    index_name: target_index
-    index_spec: release-configs/lucene-hnsw/index.json
-  - name: ingest
-    index_name: target_index
-    field_name: target_field
-    bulk_size: 500
-    dataset_format: hdf5
-    dataset_path: dataset/sift-128-euclidean.hdf5
-  - name: refresh_index
-    index_name: target_index
-  - name: force_merge
-    index_name: target_index
-    max_num_segments: 1
-  - name: query
-    k: 100
-    r: 1
-    calculate_recall: true
-    index_name: target_index
-    field_name: target_field
-    dataset_format: hdf5
-    dataset_path: dataset/sift-128-euclidean.hdf5
-    neighbors_format: hdf5
-    neighbors_path: dataset/sift-128-euclidean.hdf5
diff --git a/benchmarks/perf-tool/release-configs/nmslib-hnsw/index.json b/benchmarks/perf-tool/release-configs/nmslib-hnsw/index.json
deleted file mode 100644
index eb714c5c8..000000000
--- a/benchmarks/perf-tool/release-configs/nmslib-hnsw/index.json
+++ /dev/null
@@ -1,27 +0,0 @@
-{
-  "settings": {
-    "index": {
-      "knn": true,
-      "number_of_shards": 24,
-      "number_of_replicas": 1,
-      "knn.algo_param.ef_search": 100
-    }
-  },
-  "mappings": {
-    "properties": {
-      "target_field": {
-        "type": "knn_vector",
-        "dimension": 128,
-        "method": {
-          "name": "hnsw",
-          "space_type": "l2",
-          "engine": "nmslib",
-          "parameters": {
-            "ef_construction": 256,
-            "m": 16
-          }
-        }
-      }
-    }
-  }
-}
diff --git a/benchmarks/perf-tool/release-configs/nmslib-hnsw/test.yml b/benchmarks/perf-tool/release-configs/nmslib-hnsw/test.yml
deleted file mode 100644
index 94ad9b131..000000000
--- a/benchmarks/perf-tool/release-configs/nmslib-hnsw/test.yml
+++ /dev/null
@@ -1,35 +0,0 @@
-endpoint: [ENDPOINT]
-port: [PORT]
-test_name: "Nmslib HNSW Test"
-test_id: "Nmslib HNSW Test"
-num_runs: 3
-show_runs: false
-steps:
-  - name: delete_index
-    index_name: target_index
-  - name: create_index
-    index_name: target_index
-    index_spec: release-configs/nmslib-hnsw/index.json
-  - name: ingest
-    index_name: target_index
-    field_name: target_field
-    bulk_size: 500
-    dataset_format: hdf5
-    dataset_path: dataset/sift-128-euclidean.hdf5
-  - name: refresh_index
-    index_name: target_index
-  - name: force_merge
-    index_name: target_index
-    max_num_segments: 1
-  - name: warmup_operation
-    index_name: target_index
-  - name: query
-    k: 100
-    r: 1
-    calculate_recall: true
-    index_name: target_index
-    field_name: target_field
-    dataset_format: hdf5
-    dataset_path: dataset/sift-128-euclidean.hdf5
-    neighbors_format: hdf5
-    neighbors_path: dataset/sift-128-euclidean.hdf5
diff --git a/benchmarks/perf-tool/release-configs/run_all_tests.sh b/benchmarks/perf-tool/release-configs/run_all_tests.sh
deleted file mode 100755
index e65d5b5c4..000000000
--- a/benchmarks/perf-tool/release-configs/run_all_tests.sh
+++ /dev/null
@@ -1,102 +0,0 @@
-#!/bin/bash
-set -e
-
-# Description:
-# Run a performance test for release
-# Dataset should be available in perf-tool/dataset before running this script
-#
-# Example:
-# ./run-test.sh --endpoint localhost
-#
-# Usage:
-# ./run-test.sh \
-# --endpoint
-# --port 80 \
-# --num-runs 3 \
-# --outputs ~/outputs
-
-while [ "$1" != "" ]; do
-    case $1 in
-        -url | --endpoint ) shift
-                            ENDPOINT=$1
-                            ;;
-        -p | --port )       shift
-                            PORT=$1
-                            ;;
-        -n | --num-runs )   shift
-                            NUM_RUNS=$1
-                            ;;
-        -o | --outputs )    shift
-                            OUTPUTS=$1
-                            ;;
-        * )                 echo "Unknown parameter"
-                            echo $1
-                            exit 1
-                            ;;
-    esac
-    shift
-done
-
-if [ ! -n "$ENDPOINT" ]; then
-    echo "--endpoint should be specified"
-    exit
-fi
-
-if [ ! -n "$PORT" ]; then
-    PORT=80
-    echo "--port is not specified. Using default values $PORT"
-fi
-
-if [ ! -n "$NUM_RUNS" ]; then
-    NUM_RUNS=3
-    echo "--num-runs is not specified. Using default values $NUM_RUNS"
-fi
-
-if [ ! -n "$OUTPUTS" ]; then
-    OUTPUTS="$HOME/outputs"
-    echo "--outputs is not specified. Using default values $OUTPUTS"
-fi
-
-
-curl -X PUT "http://$ENDPOINT:$PORT/_cluster/settings?pretty" -H 'Content-Type: application/json' -d'
-{
-  "persistent" : {
-    "knn.algo_param.index_thread_qty" : 4
-  }
-}
-'
-
-TESTS="./release-configs/faiss-hnsw/filtering/relaxed-filter/relaxed-filter-test.yml
-./release-configs/faiss-hnsw/filtering/restrictive-filter/restrictive-filter-test.yml
-./release-configs/faiss-hnsw/nested/simple/simple-nested-test.yml
-./release-configs/faiss-hnsw/test.yml
-./release-configs/faiss-hnswpq/test.yml
-./release-configs/faiss-ivf/filtering/relaxed-filter/relaxed-filter-test.yml
-./release-configs/faiss-ivf/filtering/restrictive-filter/restrictive-filter-test.yml
-./release-configs/faiss-ivf/test.yml
-./release-configs/faiss-ivfpq/test.yml
-./release-configs/lucene-hnsw/filtering/relaxed-filter/relaxed-filter-test.yml
-./release-configs/lucene-hnsw/filtering/restrictive-filter/restrictive-filter-test.yml
-./release-configs/lucene-hnsw/nested/simple/simple-nested-test.yml
-./release-configs/lucene-hnsw/test.yml
-./release-configs/nmslib-hnsw/test.yml"
-
-if [ ! -d $OUTPUTS ]
-then
-    mkdir $OUTPUTS
-fi
-
-for TEST in $TESTS
-do
-    ORG_FILE=$TEST
-    NEW_FILE="$ORG_FILE.tmp"
-    OUT_FILE=$(grep test_id $ORG_FILE | cut -d':' -f2 | sed -r 's/^ "|"$//g' | sed 's/ /_/g')
-    echo "cp $ORG_FILE $NEW_FILE"
-    cp $ORG_FILE $NEW_FILE
-    sed -i "/^endpoint:/c\endpoint: $ENDPOINT" $NEW_FILE
-    sed -i "/^port:/c\port: $PORT" $NEW_FILE
-    sed -i "/^num_runs:/c\num_runs: $NUM_RUNS" $NEW_FILE
-    python3 knn-perf-tool.py test $NEW_FILE $OUTPUTS/$OUT_FILE
-    #Sleep for 1 min to cool down cpu from the previous run
-    sleep 60
-done
diff --git a/benchmarks/perf-tool/requirements.in b/benchmarks/perf-tool/requirements.in
deleted file mode 100644
index fd3555aab..000000000
--- a/benchmarks/perf-tool/requirements.in
+++ /dev/null
@@ -1,7 +0,0 @@
-Cerberus
-opensearch-py
-PyYAML
-numpy
-h5py
-requests
-psutil
diff --git a/benchmarks/perf-tool/requirements.txt b/benchmarks/perf-tool/requirements.txt
deleted file mode 100644
index fdfe205f8..000000000
--- a/benchmarks/perf-tool/requirements.txt
+++ /dev/null
@@ -1,37 +0,0 @@
-#
-# This file is autogenerated by pip-compile with python 3.9
-# To update, run:
-#
-#    pip-compile
-#
-cerberus==1.3.4
-    # via -r requirements.in
-certifi==2024.7.4
-    # via
-    #   opensearch-py
-    #   requests
-charset-normalizer==2.0.4
-    # via requests
-h5py==3.3.0
-    # via -r requirements.in
-idna==3.7
-    # via requests
-numpy==1.24.2
-    # via
-    #   -r requirements.in
-    #   h5py
-opensearch-py==1.0.0
-    # via -r requirements.in
-psutil==5.8.0
-    # via -r requirements.in
-pyyaml==5.4.1
-    # via -r requirements.in
-requests==2.32.0
-    # via -r requirements.in
-urllib3==1.26.18
-    # via
-    #   opensearch-py
-    #   requests
-
-# The following packages are considered to be unsafe in a requirements file:
-# setuptools
diff --git a/benchmarks/perf-tool/sample-configs/faiss-sift-ivf/index-spec.json b/benchmarks/perf-tool/sample-configs/faiss-sift-ivf/index-spec.json
deleted file mode 100644
index 5542ef387..000000000
--- a/benchmarks/perf-tool/sample-configs/faiss-sift-ivf/index-spec.json
+++ /dev/null
@@ -1,17 +0,0 @@
-{
-  "settings": {
-    "index": {
-      "knn": true,
-      "number_of_shards": 3,
-      "number_of_replicas": 0
-    }
-  },
-  "mappings": {
-    "properties": {
-      "target_field": {
-        "type": "knn_vector",
-        "model_id": "test-model"
-      }
-    }
-  }
-}
diff --git a/benchmarks/perf-tool/sample-configs/faiss-sift-ivf/method-spec.json b/benchmarks/perf-tool/sample-configs/faiss-sift-ivf/method-spec.json
deleted file mode 100644
index 1aa7f809f..000000000
--- a/benchmarks/perf-tool/sample-configs/faiss-sift-ivf/method-spec.json
+++ /dev/null
@@ -1,8 +0,0 @@
-{
-  "name":"ivf",
-  "engine":"faiss",
-  "parameters":{
-    "nlist":16,
-    "nprobes": 4
-  }
-}
diff --git a/benchmarks/perf-tool/sample-configs/faiss-sift-ivf/test.yml b/benchmarks/perf-tool/sample-configs/faiss-sift-ivf/test.yml
deleted file mode 100644
index 027ba8683..000000000
--- a/benchmarks/perf-tool/sample-configs/faiss-sift-ivf/test.yml
+++ /dev/null
@@ -1,62 +0,0 @@
-endpoint: localhost
-test_name: faiss_sift_ivf
-test_id: "Test workflow for faiss ivf"
-num_runs: 3
-show_runs: true
-setup:
-  - name: delete_model
-    model_id: test-model
-  - name: delete_index
-    index_name: target_index
-  - name: delete_index
-    index_name: train_index
-  - name: create_index
-    index_name: train_index
-    index_spec: sample-configs/faiss-sift-ivf/train-index-spec.json
-  - name: ingest
-    index_name: train_index
-    field_name: train_field
-    bulk_size: 500
-    dataset_format: hdf5
-    dataset_path: ../dataset/sift-128-euclidean.hdf5
-  - name: refresh_index
-    index_name: train_index
-steps:
-  - name: train_model
-    model_id: test-model
-    train_index: train_index
-    train_field: train_field
-    dimension: 128
-    method_spec: sample-configs/faiss-sift-ivf/method-spec.json
-    max_training_vector_count: 1000000000
-  - name: create_index
-    index_name: target_index
-    index_spec: sample-configs/faiss-sift-ivf/index-spec.json
-  - name: ingest
-    index_name: target_index
-    field_name: target_field
-    bulk_size: 500
-    dataset_format: hdf5
-    dataset_path: ../dataset/sift-128-euclidean.hdf5
-  - name: refresh_index
-    index_name: target_index
-  - name: force_merge
-    index_name: target_index
-    max_num_segments: 10
-  - name: warmup_operation
-    index_name: target_index
-  - name: query
-    k: 100
-    r: 1
-    calculate_recall: true
-    index_name: target_index
-    field_name: target_field
-    dataset_format: hdf5
-    dataset_path: ../dataset/sift-128-euclidean.hdf5
-    neighbors_format: hdf5
-    neighbors_path: ../dataset/sift-128-euclidean.hdf5
-cleanup:
-  - name: delete_model
-    model_id: test-model
-  - name: delete_index
-    index_name: target_index
diff --git a/benchmarks/perf-tool/sample-configs/faiss-sift-ivf/train-index-spec.json b/benchmarks/perf-tool/sample-configs/faiss-sift-ivf/train-index-spec.json
deleted file mode 100644
index 00a418e4f..000000000
--- a/benchmarks/perf-tool/sample-configs/faiss-sift-ivf/train-index-spec.json
+++ /dev/null
@@ -1,16 +0,0 @@
-{
-  "settings": {
-    "index": {
-      "number_of_shards": 3,
-      "number_of_replicas": 0
-    }
-  },
-  "mappings": {
-    "properties": {
-      "train_field": {
-        "type": "knn_vector",
-        "dimension": 128
-      }
-    }
-  }
-}
diff --git a/benchmarks/perf-tool/sample-configs/filter-spec/filter-1-spec.json b/benchmarks/perf-tool/sample-configs/filter-spec/filter-1-spec.json
deleted file mode 100644
index f529de4fe..000000000
--- a/benchmarks/perf-tool/sample-configs/filter-spec/filter-1-spec.json
+++ /dev/null
@@ -1,24 +0,0 @@
-{
-  "bool":
-  {
-    "must":
-    [
-      {
-        "range":
-        {
-          "age":
-          {
-            "gte": 20,
-            "lte": 100
-          }
-        }
-      },
-      {
-        "term":
-        {
-          "color": "red"
-        }
-      }
-    ]
-  }
-}
\ No newline at end of file
diff --git a/benchmarks/perf-tool/sample-configs/filter-spec/filter-2-spec.json b/benchmarks/perf-tool/sample-configs/filter-spec/filter-2-spec.json
deleted file mode 100644
index 9d4514e62..000000000
--- a/benchmarks/perf-tool/sample-configs/filter-spec/filter-2-spec.json
+++ /dev/null
@@ -1,40 +0,0 @@
-{
-  "bool":
-  {
-    "must":
-    [
-      {
-        "term":
-        {
-          "taste": "salty"
-        }
-      },
-      {
-        "bool":
-        {
-          "should":
-          [
-            {
-              "bool":
-              {
-                "must_not":
-                {
-                  "exists":
-                  {
-                    "field": "color"
-                  }
-                }
-              }
-            },
-            {
-              "term":
-              {
-                "color": "blue"
-              }
-            }
-          ]
-        }
-      }
-    ]
-  }
-}
\ No newline at end of file
diff --git a/benchmarks/perf-tool/sample-configs/filter-spec/filter-3-spec.json b/benchmarks/perf-tool/sample-configs/filter-spec/filter-3-spec.json
deleted file mode 100644
index d69f8768e..000000000
--- a/benchmarks/perf-tool/sample-configs/filter-spec/filter-3-spec.json
+++ /dev/null
@@ -1,30 +0,0 @@
-{
-  "bool":
-  {
-    "must":
-    [
-      {
-        "range":
-        {
-          "age":
-          {
-            "gte": 20,
-            "lte": 80
-          }
-        }
-      },
-      {
-        "exists":
-        {
-          "field": "color"
-        }
-      },
-      {
-        "exists":
-        {
-          "field": "taste"
-        }
-      }
-    ]
-  }
-}
\ No newline at end of file
diff --git a/benchmarks/perf-tool/sample-configs/filter-spec/filter-4-spec.json b/benchmarks/perf-tool/sample-configs/filter-spec/filter-4-spec.json
deleted file mode 100644
index 822d63b37..000000000
--- a/benchmarks/perf-tool/sample-configs/filter-spec/filter-4-spec.json
+++ /dev/null
@@ -1,44 +0,0 @@
-{
-  "bool":
-  {
-    "must":
-    [
-      {
-        "range":
-        {
-          "age":
-          {
-            "gte": 30,
-            "lte": 60
-          }
-        }
-      },
-      {
-        "term":
-        {
-          "taste": "bitter"
-        }
-      },
-      {
-        "bool":
-        {
-          "should":
-          [
-            {
-              "term":
-              {
-                "color": "blue"
-              }
-            },
-            {
-              "term":
-              {
-                "color": "green"
-              }
-            }
-          ]
-        }
-      }
-    ]
-  }
-}
diff --git a/benchmarks/perf-tool/sample-configs/filter-spec/filter-5-spec.json b/benchmarks/perf-tool/sample-configs/filter-spec/filter-5-spec.json
deleted file mode 100644
index 3e04d12c4..000000000
--- a/benchmarks/perf-tool/sample-configs/filter-spec/filter-5-spec.json
+++ /dev/null
@@ -1,42 +0,0 @@
-{
-  "bool":
-  {
-    "should":
-    [
-      {
-        "range":
-        {
-          "age":
-          {
-            "gte": 30,
-            "lte": 70
-          }
-        }
-      },
-      {
-        "term":
-        {
-          "color": "green"
-        }
-      },
-      {
-        "term":
-        {
-          "color": "blue"
-        }
-      },
-      {
-        "term":
-        {
-          "color": "yellow"
-        }
-      },
-      {
-        "term":
-        {
-          "taste": "sweet"
-        }
-      }
-    ]
-  }
-}
diff --git a/benchmarks/perf-tool/sample-configs/lucene-sift-hnsw-filter/index-spec.json b/benchmarks/perf-tool/sample-configs/lucene-sift-hnsw-filter/index-spec.json
deleted file mode 100644
index 83ea79b15..000000000
--- a/benchmarks/perf-tool/sample-configs/lucene-sift-hnsw-filter/index-spec.json
+++ /dev/null
@@ -1,27 +0,0 @@
-{
-  "settings": {
-    "index": {
-      "knn": true,
-      "refresh_interval": "10s",
-      "number_of_shards": 30,
-      "number_of_replicas": 0
-    }
-  },
-  "mappings": {
-    "properties": {
-      "target_field": {
-        "type": "knn_vector",
-        "dimension": 128,
-        "method": {
-          "name": "hnsw",
-          "space_type": "l2",
-          "engine": "lucene",
-          "parameters": {
-            "ef_construction": 100,
-            "m": 16
-          }
-        }
-      }
-    }
-  }
-}
diff --git a/benchmarks/perf-tool/sample-configs/lucene-sift-hnsw-filter/test.yml b/benchmarks/perf-tool/sample-configs/lucene-sift-hnsw-filter/test.yml
deleted file mode 100644
index aa2ee6389..000000000
--- a/benchmarks/perf-tool/sample-configs/lucene-sift-hnsw-filter/test.yml
+++ /dev/null
@@ -1,41 +0,0 @@
-endpoint: localhost
-test_name: lucene_sift_hnsw
-test_id: "Test workflow for lucene hnsw"
-num_runs: 1
-show_runs: false
-setup:
-  - name: delete_index
-    index_name: target_index
-steps:
-  - name: create_index
-    index_name: target_index
-    index_spec: sample-configs/lucene-sift-hnsw-filter/index-spec.json
-  - name: ingest_multi_field
-    index_name: target_index
-    field_name: target_field
-    bulk_size: 500
-    dataset_format: hdf5
-    dataset_path: ../dataset/sift-128-euclidean-with-attr.hdf5
-    attributes_dataset_name: attributes
-    attribute_spec: [ { name: 'color', type: 'str' }, { name: 'taste', type: 'str' }, { name: 'age', type: 'int' } ]
-  - name: refresh_index
-    index_name: target_index
-  - name: force_merge
-    index_name: target_index
-    max_num_segments: 10
-  - name: query_with_filter
-    k: 10
-    r: 1
-    calculate_recall: true
-    index_name: target_index
-    field_name: target_field
-    dataset_format: hdf5
-    dataset_path: ../dataset/sift-128-euclidean-with-attr.hdf5
-    neighbors_format: hdf5
-    neighbors_path: ../dataset/sift-128-euclidean-with-attr-with-filters.hdf5
-    neighbors_dataset: neighbors_filter_1
-    filter_spec: sample-configs/filter-spec/filter-1-spec.json
-    query_count: 100
-cleanup:
-  - name: delete_index
-    index_name: target_index
\ No newline at end of file
diff --git a/benchmarks/perf-tool/sample-configs/nmslib-sift-hnsw/index-spec.json b/benchmarks/perf-tool/sample-configs/nmslib-sift-hnsw/index-spec.json
deleted file mode 100644
index 75abe7baa..000000000
--- a/benchmarks/perf-tool/sample-configs/nmslib-sift-hnsw/index-spec.json
+++ /dev/null
@@ -1,28 +0,0 @@
-{
-  "settings": {
-    "index": {
-      "knn": true,
-      "knn.algo_param.ef_search": 512,
-      "refresh_interval": "10s",
-      "number_of_shards": 1,
-      "number_of_replicas": 0
-    }
-  },
-  "mappings": {
-    "properties": {
-      "target_field": {
-        "type": "knn_vector",
-        "dimension": 128,
-        "method": {
-          "name": "hnsw",
-          "space_type": "l2",
-          "engine": "nmslib",
-          "parameters": {
-            "ef_construction": 512,
-            "m": 16
-          }
-        }
-      }
-    }
-  }
-}
diff --git a/benchmarks/perf-tool/sample-configs/nmslib-sift-hnsw/test.yml b/benchmarks/perf-tool/sample-configs/nmslib-sift-hnsw/test.yml
deleted file mode 100644
index 6d96bf80c..000000000
--- a/benchmarks/perf-tool/sample-configs/nmslib-sift-hnsw/test.yml
+++ /dev/null
@@ -1,38 +0,0 @@
-endpoint: localhost
-test_name: nmslib_sift_hnsw
-test_id: "Test workflow for nmslib hnsw"
-num_runs: 2
-show_runs: false
-setup:
-  - name: delete_index
-    index_name: target_index
-steps:
-  - name: create_index
-    index_name: target_index
-    index_spec: sample-configs/nmslib-sift-hnsw/index-spec.json
-  - name: ingest
-    index_name: target_index
-    field_name: target_field
-    bulk_size: 500
-    dataset_format: hdf5
-    dataset_path: ../dataset/sift-128-euclidean.hdf5
-  - name: refresh_index
-    index_name: target_index
-  - name: force_merge
-    index_name: target_index
-    max_num_segments: 10
-  - name: warmup_operation
-    index_name: target_index
-  - name: query
-    k: 100
-    r: 1
-    calculate_recall: true
-    index_name: target_index
-    field_name: target_field
-    dataset_format: hdf5
-    dataset_path: ../dataset/sift-128-euclidean.hdf5
-    neighbors_format: hdf5
-    neighbors_path: ../dataset/sift-128-euclidean.hdf5
-cleanup:
-  - name: delete_index
-    index_name: target_index