
JGLUE Evaluation Scripts


Requirements

  • Python (a version compatible with this project's pyproject.toml)
  • Poetry
  • A wandb account (used for logging and sweeps)

Getting started

  • Create a virtual environment and install dependencies.

    $ poetry env use /path/to/python
    $ poetry install
  • Log in to wandb.

    $ wandb login
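  • Optionally, sanity-check the environment (a minimal check, assuming torch is installed as a dependency, since torch.compile is used for training below).

    $ poetry run python -c "import torch; print(torch.__version__)"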

Training and evaluation

You can train and test a model with the following command:

# For training and evaluating MARC-ja
poetry run python src/train.py -cn marc_ja devices=[0,1] max_batches_per_device=16

Here are commonly used options (a combined example follows this list):

  • -cn: Task name. Choose from marc_ja, jcola, jsts, jnli, jsquad, and jcqa.
  • devices: GPUs to use.
  • max_batches_per_device: Maximum number of batches to process per device (default: 4).
  • compile: JIT-compile the model with torch.compile for faster training (default: false).
  • model: Pre-trained model name. See the YAML config files under configs/model for the available models.
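
For example, a run that overrides several of these options at once might look like the following (model=deberta_base is an assumption here; check the config names under configs/model):

# Hypothetical combined example: JNLI on two GPUs, a larger per-device batch count,
# torch.compile enabled, and an explicit model config
poetry run python src/train.py -cn jnli devices=[0,1] max_batches_per_device=16 compile=true model=deberta_base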

To evaluate on the out-of-domain split of the JCoLA dataset, specify datamodule/valid=jcola_ood (or datamodule/valid=jcola_ood_annotated). For more options, see the YAML config files under configs.
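
For example, evaluating JCoLA on its out-of-domain split could be invoked as follows (a sketch based on the options above):

# Hypothetical example: JCoLA with the out-of-domain validation split
poetry run python src/train.py -cn jcola devices=[0] datamodule/valid=jcola_ood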

Debugging

poetry run python src/train.py -cn marc_ja.debug

You can specify trainer=cpu.debug to run on the CPU:

poetry run python src/train.py -cn marc_ja.debug trainer=cpu.debug

If you are on a machine with GPUs, you can specify the GPUs to use with the devices option:

poetry run python src/train.py -cn marc_ja.debug devices=[0]

Tuning hyper-parameters

$ wandb sweep <(sed 's/MODEL_NAME/deberta_base/' sweeps/jcola.yaml)
wandb: Creating sweep from: /dev/fd/xx
wandb: Created sweep with ID: xxxxxxxx
wandb: View sweep at: https://wandb.ai/<wandb-user>/JGLUE-evaluation-scripts/sweeps/xxxxxxxx
wandb: Run sweep agent with: wandb agent <wandb-user>/JGLUE-evaluation-scripts/xxxxxxxx
$ DEVICES=0,1 MAX_BATCHES_PER_DEVICE=16 COMPILE=true wandb agent <wandb-user>/JGLUE-evaluation-scripts/xxxxxxxx
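
The <(...) construct is bash process substitution: it feeds the sweep template, with MODEL_NAME replaced, to wandb sweep as a temporary file. A minimal equivalent without process substitution (the temporary file path here is arbitrary):

$ sed 's/MODEL_NAME/deberta_base/' sweeps/jcola.yaml > /tmp/jcola_deberta_base.yaml
$ wandb sweep /tmp/jcola_deberta_base.yaml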

Results

We fine-tuned the following models and evaluated them on the dev set of JGLUE. Following the JGLUE paper, we tuned the learning rate and the number of training epochs for each model and task.

| Model | MARC-ja/acc | JCoLA/acc | JSTS/pearson | JSTS/spearman | JNLI/acc | JSQuAD/EM | JSQuAD/F1 | JComQA/acc |
| --- | --- | --- | --- | --- | --- | --- | --- | --- |
| Waseda RoBERTa base | 0.965 | 0.867 | 0.913 | 0.876 | 0.905 | 0.853 | 0.916 | 0.853 |
| Waseda RoBERTa large (seq512) | 0.969 | 0.849 | 0.925 | 0.890 | 0.928 | 0.910 | 0.955 | 0.900 |
| LUKE Japanese base* | 0.965 | - | 0.916 | 0.877 | 0.912 | - | - | 0.842 |
| LUKE Japanese large* | 0.965 | - | 0.932 | 0.902 | 0.927 | - | - | 0.893 |
| DeBERTaV2 base | 0.970 | 0.879 | 0.922 | 0.886 | 0.922 | 0.899 | 0.951 | 0.873 |
| DeBERTaV2 large | 0.968 | 0.882 | 0.925 | 0.892 | 0.924 | 0.912 | 0.959 | 0.890 |
| DeBERTaV3 base | 0.960 | 0.878 | 0.927 | 0.891 | 0.927 | 0.896 | 0.947 | 0.875 |

*The scores of LUKE are from the official repository.

Tuned hyper-parameters

  • Learning rate: {2e-05, 3e-05, 5e-05}

| Model | MARC-ja/acc | JCoLA/acc | JSTS/pearson | JSTS/spearman | JNLI/acc | JSQuAD/EM | JSQuAD/F1 | JComQA/acc |
| --- | --- | --- | --- | --- | --- | --- | --- | --- |
| Waseda RoBERTa base | 3e-05 | 3e-05 | 2e-05 | 2e-05 | 3e-05 | 3e-05 | 3e-05 | 5e-05 |
| Waseda RoBERTa large (seq512) | 2e-05 | 2e-05 | 3e-05 | 3e-05 | 2e-05 | 2e-05 | 2e-05 | 3e-05 |
| DeBERTaV2 base | 2e-05 | 3e-05 | 5e-05 | 5e-05 | 3e-05 | 2e-05 | 2e-05 | 5e-05 |
| DeBERTaV2 large | 5e-05 | 2e-05 | 5e-05 | 5e-05 | 2e-05 | 2e-05 | 2e-05 | 3e-05 |
| DeBERTaV3 base | 5e-05 | 2e-05 | 3e-05 | 3e-05 | 2e-05 | 5e-05 | 5e-05 | 2e-05 |
  • Training epochs: {3, 4}

| Model | MARC-ja/acc | JCoLA/acc | JSTS/pearson | JSTS/spearman | JNLI/acc | JSQuAD/EM | JSQuAD/F1 | JComQA/acc |
| --- | --- | --- | --- | --- | --- | --- | --- | --- |
| Waseda RoBERTa base | 4 | 3 | 4 | 4 | 3 | 4 | 4 | 3 |
| Waseda RoBERTa large (seq512) | 4 | 4 | 4 | 4 | 3 | 3 | 3 | 3 |
| DeBERTaV2 base | 3 | 4 | 3 | 3 | 3 | 4 | 4 | 4 |
| DeBERTaV2 large | 3 | 3 | 4 | 4 | 3 | 4 | 4 | 3 |
| DeBERTaV3 base | 4 | 4 | 4 | 4 | 4 | 4 | 4 | 4 |

Huggingface hub links

Author

Nobuhiro Ueda (ueda at nlp.ist.i.kyoto-u.ac.jp)

Reference

  • Kentaro Kurihara, Daisuke Kawahara, and Tomohide Shibata. 2022. JGLUE: Japanese General Language Understanding Evaluation. In Proceedings of the Thirteenth Language Resources and Evaluation Conference (LREC 2022).
