Commit
This commit does not belong to any branch on this repository, and may belong to a fork outside of the repository.
Upgrade version to 0.5.1 for bug fix release (#312)
* Update sparsifying_bert_using_recipes.md (#299)
  Fix wrong link to tutorial images
* BERT pruning tutorial clean up (#300)
* Disable save ckpt for BERT tutorial command (#301)
* Add output for eval in tutorial (#302)
* Rewrite readme for hugging face transformers integration (#303)
* Rewrite readme for hugging face transformers integration
* Update integrations/huggingface-transformers/README.md
  Co-authored-by: Jeannie Finks <74554921+jeanniefinks@users.noreply.github.com>
* Update integrations/huggingface-transformers/README.md
  Co-authored-by: Jeannie Finks <74554921+jeanniefinks@users.noreply.github.com>
* Update integrations/huggingface-transformers/README.md
  Co-authored-by: Jeannie Finks <74554921+jeanniefinks@users.noreply.github.com>
* Update integrations/huggingface-transformers/README.md
  Co-authored-by: Jeannie Finks <74554921+jeanniefinks@users.noreply.github.com>
* Update integrations/huggingface-transformers/README.md
  Co-authored-by: Jeannie Finks <74554921+jeanniefinks@users.noreply.github.com>
* Update integrations/huggingface-transformers/README.md
  Co-authored-by: Jeannie Finks <74554921+jeanniefinks@users.noreply.github.com>
* Update integrations/huggingface-transformers/README.md
  Co-authored-by: Jeannie Finks <74554921+jeanniefinks@users.noreply.github.com>
* update from review
  Co-authored-by: Jeannie Finks <74554921+jeanniefinks@users.noreply.github.com>
* Passage retrieval compression (#297)
* adding IR elastic stuff
* adding data download and modified es dense ranking
* adding Doc2query
* adding DPR code
* updating doc2quyery code
* adding msmarco eval scri[t
* making dataset HF compatible
* making dataset HF compatible
* running doc2query t5
* model running
* working on integrating
* done with yaml recipe for all prunable layers
* fixing config spacing for pruning yaml
* work on dataset making
* updaed thedownload data script and model training
* running doc2query but missing the work for pruning
* fixing issues in pruning
* moving around DPR
* added optimal lobotomizing project
* adding to readme for baseline
* new structures
* cleaning up structure and pushing baseline numbers
* moving sparse_ml_utils.py to src
  Co-authored-by: Mark Kurtz <mark@neuralmagic.com>
* Update example commands for hugging face integration (#306)
* fix: correct minor typo (#307)
* Phased pruning (#311)
* Update example commands for hugging face integration
* Phased pruning implementation
* Update for quality
* Upgrade version to 0.5.1 for bug fix release

Co-authored-by: Tuan Nguyen <tuan@neuralmagic.com>
Co-authored-by: Jeannie Finks <74554921+jeanniefinks@users.noreply.github.com>
Co-authored-by: spacemanidol <dcampos3@illinois.edu>
Co-authored-by: Rahul Tuli <rahul@neuralmagic.com>
1 parent 8502988, commit 69b4927
Showing 76 changed files with 12,875 additions and 406 deletions.
Original file line number | Diff line number | Diff line change |
---|---|---|
@@ -0,0 +1,48 @@ | ||
# Compressing DPR | ||
Author: @spacemanidol | ||
|
||
Methods | ||
1. Varying models | ||
2. Structured Pruning | ||
3. Unstructured Pruning | ||
4. Dimensionality Reduction | ||
## Usage | ||
batch_size: 4 | ||
dev_batch_size: 16 | ||
adam_eps: 1e-8 | ||
adam_betas: (0.9, 0.999) | ||
max_grad_norm: 2.0 | ||
log_batch_step: 1 | ||
train_rolling_loss_step: 100 | ||
weight_decay: 0.0 | ||
learning_rate: 2e-5 | ||
# Linear warmup over warmup_steps. | ||
warmup_steps: 1237 | ||
|
||
# Number of update steps to accumulate before performing a backward/update pass. | ||
gradient_accumulation_steps: 1 | ||
|
||
# Total number of training epochs to perform. | ||
num_train_epochs: 40 | ||
eval_per_epoch: 1 | ||
hard_negatives: 1 | ||
other_negatives: 0 | ||
val_av_rank_hard_neg: 30 | ||
val_av_rank_other_neg: 30 | ||
val_av_rank_bsz: 128 | ||
val_av_rank_max_qs: 10000 | ||
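The warmup and accumulation settings listed above can be illustrated with a little arithmetic. This is a minimal sketch, assuming the rate ramps linearly from 0 to `learning_rate` over `warmup_steps` updates and is then held constant; the listing does not say what happens after warmup, and `warmup_lr` and `effective_batch_size` are hypothetical helpers, not part of DPR.

```python
# Values copied from the settings listed above.
LEARNING_RATE = 2e-5
WARMUP_STEPS = 1237
BATCH_SIZE = 4
GRADIENT_ACCUMULATION_STEPS = 1


def warmup_lr(step, base_lr=LEARNING_RATE, warmup_steps=WARMUP_STEPS):
    """Hypothetical helper: linear warmup, then a constant rate (assumption)."""
    if step < 0:
        raise ValueError("step must be non-negative")
    if warmup_steps <= 0:
        return base_lr
    # Fraction of the warmup completed, capped at 1.0 once warmup is over.
    return base_lr * min(1.0, step / warmup_steps)


def effective_batch_size(batch_size=BATCH_SIZE,
                         accumulation_steps=GRADIENT_ACCUMULATION_STEPS):
    """Examples contributing to one optimizer update on a single device."""
    return batch_size * accumulation_steps
```

With the values above, every optimizer update sees `effective_batch_size()` examples per device, and the rate reaches its configured value at update 1237.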
|
||
https://www.dropbox.com/s/lvvpsx0cjk4vemv/collection.tar.gz?dl=1 | ||
https://www.dropbox.com/s/hq6xjhswiz60siu/queries.dev.small.tsv?dl=1 | ||
https://www.dropbox.com/s/khsplt2fhqwjs0v/qrels.dev.small.tsv?dl=1 | ||
https://www.dropbox.com/s/uzkvv4gpj3a596a/predicted_queries_topk_sampling.zip?dl=1 | ||
https://www.dropbox.com/s/nc1drdkjpxxsngg/run.dev.small.tsv?dl=1 | ||
## Results | ||
|
||
| Top-k passages | Original DPR NQ model | New DPR model | | ||
| ------------- |:-------------:| -----:| | ||
| 1 | 45.87 | 52.47 | | ||
| 5 | 68.14 | 72.24 | | ||
| 20 | 79.97 | 81.33 | | ||
| 100 | 85.87 | 87.29 | | ||
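The per-row difference between the two columns can be computed directly from the table. This is a small sketch that only does the subtraction; the table does not name the metric, so reading the values as percentages is an assumption, and `absolute_gains` is a hypothetical helper.

```python
# (top-k, original DPR NQ model, new DPR model), copied from the table above.
RESULTS = [
    (1, 45.87, 52.47),
    (5, 68.14, 72.24),
    (20, 79.97, 81.33),
    (100, 85.87, 87.29),
]


def absolute_gains(rows):
    """Map each top-k value to (new - original), rounded to two decimals."""
    return {k: round(new - old, 2) for k, old, new in rows}


gains = absolute_gains(RESULTS)
for k, gain in gains.items():
    print(f"top-{k}: +{gain:.2f}")
```

The gap is largest at top-1 and narrows as k grows.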
### requirements.txt |
Original file line number | Diff line number | Diff line change |
---|---|---|
@@ -0,0 +1,65 @@ | ||
## Hydra | ||
|
||
[Hydra](https://github.com/facebookresearch/hydra) is an open-source Python | ||
framework that simplifies the development of research and other complex | ||
applications. The key feature is the ability to dynamically create a | ||
hierarchical configuration by composition and override it through config files | ||
and the command line. | ||
|
||
## DPR configuration | ||
All DPR tool configuration parameters are now split across different config groups, and you can either modify them in the config files or override them from the command line. | ||
|
||
Each tool's (train_dense_encoder.py, generate_dense_embeddings.py, dense_retriever.py and train_reader.py) main method now has a Hydra @hydra.main decorator that names its configuration file in the conf/ dir. | ||
For example, dense_retriever.py takes all its parameters from the conf/dense_retriever.yaml file. | ||
Every tool's configuration file refers to other configuration files via the "defaults:" parameter. | ||
Each such reference is called a [configuration group](https://hydra.cc/docs/tutorials/structured_config/config_groups) in Hydra. | ||
|
||
Let's take a look at dense_retriever.py's configuration: | ||
|
||
|
||
```yaml | ||
|
||
defaults: | ||
- encoder: hf_bert | ||
- datasets: retriever_default | ||
- ctx_sources: default_sources | ||
|
||
indexers: | ||
  flat: | ||
    _target_: dpr.indexer.faiss_indexers.DenseFlatIndexer | ||
|
||
  hnsw: | ||
    _target_: dpr.indexer.faiss_indexers.DenseHNSWFlatIndexer | ||
|
||
  hnsw_sq: | ||
    _target_: dpr.indexer.faiss_indexers.DenseHNSWSQIndexer | ||
|
||
... | ||
qa_dataset: | ||
... | ||
ctx_datatsets: | ||
... | ||
indexer: flat | ||
... | ||
|
||
``` | ||
|
||
" - encoder: " - a configuration group that contains all parameters to instantiate the encoder. The actual parameters are located in conf/encoder/hf_bert.yaml file. | ||
If you want to override some of them, you can either: | ||
- Modify that config file | ||
- Create a new config group file under the conf/encoder/ folder and select it with the encoder={your file name} command line argument | ||
- Override a specific parameter from the command line. For example: encoder.sequence_length=300 | ||
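The third option, a dotted `key=value` override, can be pictured as a write into a nested mapping. The following is a minimal standard-library sketch of that idea, not Hydra's implementation: `apply_overrides` and `parse_value` are hypothetical helpers, the starting value 256 is a placeholder (the real value lives in conf/encoder/hf_bert.yaml, which is not shown here), and selecting a whole group file with `encoder={your file name}` is out of scope.

```python
from copy import deepcopy


def parse_value(text):
    """Interpret an override value as int or float when possible, else keep the string."""
    for cast in (int, float):
        try:
            return cast(text)
        except ValueError:
            pass
    return text


def apply_overrides(config, overrides):
    """Return a copy of config with `dotted.key=value` overrides applied."""
    result = deepcopy(config)
    for item in overrides:
        key, _, raw = item.partition("=")
        *parents, leaf = key.split(".")
        node = result
        for name in parents:
            # Walk (or create) the intermediate mappings named by the dotted path.
            node = node.setdefault(name, {})
        node[leaf] = parse_value(raw)
    return result


# Placeholder starting values; only the override strings come from the text above.
base = {"encoder": {"sequence_length": 256}, "indexer": "flat"}
merged = apply_overrides(base, ["encoder.sequence_length=300", "indexer=hnsw"])
```

The original mapping is left untouched, so the file defaults and the command-line view of the configuration can be compared side by side.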
|
||
" - datasets:" - a configuration group that contains a list of all possible sources of queries for evaluation. One can find them in conf/datasets/retriever_default.yaml file. | ||
To use one of them during evaluation, specify it with the qa_dataset parameter. For example, if you want to run the retriever on the NQ test set, set qa_dataset=nq_test as a command line parameter. | ||
|
||
It is now much easier to use custom datasets, without the need to convert them to the DPR format. Just define your own class that inherits from QASrc and provides the relevant __getitem__(), __len__() and load_data() methods. | ||
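A custom source along these lines might look like the sketch below. It is an illustration under stated assumptions: `QASrcStandIn` is a stand-in for DPR's `QASrc` (whose real constructor and attributes are not shown here), and `TsvQuestions` with its `question<TAB>answer1|answer2` row format is hypothetical.

```python
import csv
import io


class QASrcStandIn:
    """Stand-in for DPR's QASrc base class; the real interface may differ."""

    def __init__(self):
        self.data = None

    def __getitem__(self, index):
        return self.data[index]

    def __len__(self):
        return len(self.data)


class TsvQuestions(QASrcStandIn):
    """Hypothetical source reading `question<TAB>answer1|answer2` rows."""

    def __init__(self, stream):
        super().__init__()
        self.stream = stream

    def load_data(self):
        # Each row becomes a (question, list of acceptable answers) pair.
        rows = csv.reader(self.stream, delimiter="\t")
        self.data = [(question, answers.split("|")) for question, answers in rows]


src = TsvQuestions(io.StringIO("who wrote hamlet\tShakespeare|William Shakespeare\n"))
src.load_data()
```

In a real setup the class would be referenced from a `_target_:` entry in a datasets config group, in the same way as the built-in sources.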
|
||
" - ctx_sources: " - a configuration group that contains a list of all possible passage sources. One can find them in conf/ctx_sources/default_sources.yaml file. | ||
Specify a list of passage dataset names as the ctx_datatsets parameter. For example, if you want to use DPR's old Wikipedia passages, set ctx_datatsets=[dpr_wiki]. | ||
Please note that this parameter is a list, so you can effectively concatenate different passage sources into one. To use multiple sources at once, you also need to provide the relevant embeddings files in the encoded_ctx_files parameter, which is also a list. | ||
|
||
|
||
"indexers:" - a parameter map that defines the available indexes. The actual index is selected by the indexer parameter, which is 'flat' by default, but you can use lossy index types by setting indexer=hnsw or indexer=hnsw_sq on the command line. | ||
|
||
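The pairing of an `indexers:` map with an `indexer` selector can be sketched with the standard library alone. Hydra itself provides `hydra.utils.instantiate` for building objects from `_target_` entries; the code below is only an illustration of the idea, with hypothetical helpers `resolve_target` and `build_selected`, and with standard-library classes standing in for the DPR indexers so that it runs anywhere.

```python
import importlib


def resolve_target(dotted_path):
    """Import `package.module.Name` and return the object called Name."""
    module_name, _, attr = dotted_path.rpartition(".")
    return getattr(importlib.import_module(module_name), attr)


def build_selected(config, group_key, selector_key):
    """Instantiate the entry of config[group_key] named by config[selector_key]."""
    entry = config[group_key][config[selector_key]]
    # Every key except _target_ is passed to the constructor as a keyword argument.
    kwargs = {k: v for k, v in entry.items() if k != "_target_"}
    return resolve_target(entry["_target_"])(**kwargs)


# Stand-in targets (assumption): real entries point at dpr.indexer.faiss_indexers classes.
config = {
    "indexers": {
        "flat": {"_target_": "collections.OrderedDict"},
        "hnsw": {"_target_": "collections.Counter"},
    },
    "indexer": "flat",
}
index = build_selected(config, "indexers", "indexer")
```

Changing only the `indexer` value switches the class that gets built, which is why a command-line override such as indexer=hnsw is enough to swap index types.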
Please refer to the comments in the configuration files for details on every parameter. |
47 changes: 47 additions & 0 deletions
research/information_retrieval/DPR/conf/biencoder_train_cfg.yaml
Original file line number | Diff line number | Diff line change |
---|---|---|
@@ -0,0 +1,47 @@ | ||
|
||
# configuration groups | ||
defaults: | ||
- encoder: hf_bert | ||
- train: biencoder_default | ||
- datasets: encoder_train_default | ||
|
||
train_datasets: | ||
dev_datasets: | ||
output_dir: | ||
train_sampling_rates: | ||
loss_scale_factors: | ||
|
||
# Whether to lower case the input text. Set True for uncased models, False for the cased ones. | ||
do_lower_case: True | ||
|
||
fix_ctx_encoder: False | ||
val_av_rank_start_epoch: 30 | ||
seed: 12345 | ||
checkpoint_file_name: dpr_biencoder | ||
|
||
# A trained bi-encoder checkpoint file to initialize the model | ||
model_file: | ||
|
||
# TODO: move to a conf group | ||
# local_rank for distributed training on gpus | ||
local_rank: -1 | ||
global_loss_buf_sz: 592000 | ||
device: | ||
distributed_world_size: | ||
distributed_port: | ||
no_cuda: False | ||
n_gpu: | ||
fp16: True | ||
|
||
# For fp16: Apex AMP optimization level selected in ['O0', 'O1', 'O2', and 'O3']. | ||
# See details at https://nvidia.github.io/apex/amp.html | ||
fp16_opt_level: O1 | ||
|
||
# tokens which won't be split by the tokenizer | ||
special_tokens: | ||
|
||
ignore_checkpoint_offset: False | ||
ignore_checkpoint_optimizer: False | ||
|
||
# set to >1 to enable multiple query encoders | ||
multi_q_encoder: False |
6 changes: 6 additions & 0 deletions
research/information_retrieval/DPR/conf/ctx_sources/default_sources.yaml
Original file line number | Diff line number | Diff line change |
---|---|---|
@@ -0,0 +1,6 @@ | ||
# @package _group_ | ||
|
||
dpr_wiki: | ||
  _target_: dpr.data.retriever_data.CsvCtxSrc | ||
  file: data.wikipedia_split.psgs_w100 | ||
  id_prefix: 'wiki:' |
46 changes: 46 additions & 0 deletions
research/information_retrieval/DPR/conf/datasets/encoder_train_default.yaml
Original file line number | Diff line number | Diff line change |
---|---|---|
@@ -0,0 +1,46 @@ | ||
# @package _group_ | ||
|
||
nq_train: | ||
  _target_: dpr.data.biencoder_data.JsonQADataset | ||
  file: data.retriever.nq-train | ||
|
||
nq_train_hn1: | ||
  _target_: dpr.data.biencoder_data.JsonQADataset | ||
  file: data.retriever.nq-adv-hn-train | ||
|
||
nq_dev: | ||
  _target_: dpr.data.biencoder_data.JsonQADataset | ||
  file: data.retriever.nq-dev | ||
|
||
trivia_train: | ||
  _target_: dpr.data.biencoder_data.JsonQADataset | ||
  file: data.retriever.trivia-train | ||
|
||
trivia_dev: | ||
  _target_: dpr.data.biencoder_data.JsonQADataset | ||
  file: data.retriever.trivia-dev | ||
|
||
squad1_train: | ||
  _target_: dpr.data.biencoder_data.JsonQADataset | ||
  file: data.retriever.squad1-train | ||
|
||
squad1_dev: | ||
  _target_: dpr.data.biencoder_data.JsonQADataset | ||
  file: data.retriever.squad1-dev | ||
|
||
webq_train: | ||
  _target_: dpr.data.biencoder_data.JsonQADataset | ||
  file: data.retriever.webq-train | ||
|
||
webq_dev: | ||
  _target_: dpr.data.biencoder_data.JsonQADataset | ||
  file: data.retriever.webq-dev | ||
|
||
curatedtrec_train: | ||
  _target_: dpr.data.biencoder_data.JsonQADataset | ||
  file: data.retriever.curatedtrec-train | ||
|
||
curatedtrec_dev: | ||
  _target_: dpr.data.biencoder_data.JsonQADataset | ||
  file: data.retriever.curatedtrec-dev | ||
|
33 changes: 33 additions & 0 deletions
research/information_retrieval/DPR/conf/datasets/retriever_default.yaml
Original file line number | Diff line number | Diff line change |
---|---|---|
@@ -0,0 +1,33 @@ | ||
# @package _group_ | ||
|
||
nq_test: | ||
  _target_: dpr.data.retriever_data.CsvQASrc | ||
  file: data.retriever.qas.nq-test | ||
|
||
nq_train: | ||
  _target_: dpr.data.retriever_data.CsvQASrc | ||
  file: data.retriever.qas.nq-train | ||
|
||
nq_dev: | ||
  _target_: dpr.data.retriever_data.CsvQASrc | ||
  file: data.retriever.qas.nq-dev | ||
|
||
trivia_test: | ||
  _target_: dpr.data.retriever_data.CsvQASrc | ||
  file: data.retriever.qas.trivia-test | ||
|
||
trivia_train: | ||
  _target_: dpr.data.retriever_data.CsvQASrc | ||
  file: data.retriever.qas.trivia-train | ||
|
||
trivia_dev: | ||
  _target_: dpr.data.retriever_data.CsvQASrc | ||
  file: data.retriever.qas.trivia-dev | ||
|
||
webq_test: | ||
  _target_: dpr.data.retriever_data.CsvQASrc | ||
  file: data.retriever.qas.webq-test | ||
|
||
curatedtrec_test: | ||
  _target_: dpr.data.retriever_data.CsvQASrc | ||
  file: data.retriever.qas.curatedtrec-test |
71 changes: 71 additions & 0 deletions
research/information_retrieval/DPR/conf/dense_retriever.yaml
Original file line number | Diff line number | Diff line change |
---|---|---|
@@ -0,0 +1,71 @@ | ||
defaults: | ||
- encoder: hf_bert # defines encoder initialization parameters | ||
- datasets: retriever_default # contains a list of all possible sources of queries for evaluation. Specific set is selected by qa_dataset parameter | ||
- ctx_sources: default_sources # contains a list of all possible passage sources. Specific passages sources selected by ctx_datatsets parameter | ||
|
||
indexers: | ||
  flat: | ||
    _target_: dpr.indexer.faiss_indexers.DenseFlatIndexer | ||
|
||
  hnsw: | ||
    _target_: dpr.indexer.faiss_indexers.DenseHNSWFlatIndexer | ||
|
||
  hnsw_sq: | ||
    _target_: dpr.indexer.faiss_indexers.DenseHNSWSQIndexer | ||
|
||
# the name of the queries dataset from the 'datasets' config group | ||
qa_dataset: | ||
|
||
# a list of names of the passages datasets from the 'ctx_sources' config group | ||
ctx_datatsets: | ||
|
||
#Glob paths to encoded passages (from generate_dense_embeddings tool) | ||
encoded_ctx_files: [] | ||
|
||
out_file: | ||
# "regex" or "string" | ||
match: string | ||
n_docs: 100 | ||
validation_workers: 16 | ||
|
||
# Batch size to generate query embeddings | ||
batch_size: 128 | ||
|
||
# Whether to lower case the input text. Set True for uncased models, False for the cased ones. | ||
do_lower_case: True | ||
|
||
# The attribute name of encoder to use for queries. Options for the BiEncoder model: question_model, ctx_model | ||
# question_model is used if this param is empty | ||
encoder_path: | ||
|
||
# path to the FAISS index location - it is only needed if you want to serialize faiss index to files or read from them | ||
# (instead of using encoded_ctx_files) | ||
# it should point to either directory or a common index files prefix name | ||
# if there is no index at the specific location, the index will be created from encoded_ctx_files | ||
index_path: | ||
|
||
kilt_out_file: | ||
|
||
# A trained bi-encoder checkpoint file to initialize the model | ||
model_file: | ||
|
||
validate_as_tables: False | ||
rpc_retriever_cfg_file: | ||
indexer: flat | ||
|
||
# tokens which won't be split by the tokenizer | ||
special_tokens: | ||
|
||
# TODO: move to a conf group | ||
# local_rank for distributed training on gpus | ||
local_rank: -1 | ||
global_loss_buf_sz: 150000 | ||
device: | ||
distributed_world_size: | ||
no_cuda: False | ||
n_gpu: | ||
fp16: False | ||
|
||
# For fp16: Apex AMP optimization level selected in ['O0', 'O1', 'O2', and 'O3']. | ||
# See details at https://nvidia.github.io/apex/amp.html | ||
fp16_opt_level: O1 |
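The comments on `encoded_ctx_files` and `index_path` in this file describe a small decision: reuse a serialized index when one exists at `index_path`, otherwise build it from the encoded passage files. The sketch below restates that decision with the standard library only. It is an interpretation of the comments, not the retriever's code; `expand_ctx_files`, `prefix_exists` and `index_source` are hypothetical helpers.

```python
import glob


def expand_ctx_files(patterns):
    """Expand each glob pattern into a sorted list of matching paths."""
    files = []
    for pattern in patterns:
        files.extend(sorted(glob.glob(pattern)))
    return files


def prefix_exists(index_path):
    """True if index_path is an existing path or a common prefix of existing files."""
    return bool(glob.glob(glob.escape(index_path) + "*"))


def index_source(index_path, encoded_ctx_files, index_exists=prefix_exists):
    """Decide where the index comes from, following the config comments (assumption)."""
    if index_path and index_exists(index_path):
        return "load_index"
    if encoded_ctx_files:
        return "build_from_embeddings"
    raise ValueError("need an existing index_path or at least one encoded_ctx_files entry")
```

The `index_exists` argument is injectable so the decision can be exercised without touching the file system.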