
Some descriptions in README not clear #1

Open
withchencheng opened this issue Nov 12, 2022 · 3 comments

@withchencheng

Hi Yue, I don't understand some of the descriptions in this project's README.

1. In https://github.com/yueyu1030/actune#training, the instructions say:

   > Take AG News dataset as an example, run_agnews_finetune.sh is used for running the experiment of standard active learning approaches, and run_agnews_finetune.sh is used for running active self-training experiments as unlabeled data is also used during fine-tuning.

   You use the same file name `run_agnews_finetune.sh` twice. Which one is your paper's method?

2. In https://github.com/yueyu1030/actune#hyperparameter-tuning, what does `pool` stand for?

Thank you!

@yueyu1030
Owner

Thanks for reaching out.

For question #1, `run_agnews.sh` runs our main method (active self-training). We will update the README to avoid confusion.

For question #2, `pool` is the size of the unlabeled data used in self-training. In self-training we usually do not use all of the unlabeled data, because many pseudo-labels may be noisy. A common solution is to first select a subset of examples with low uncertainty (`pool` is the size of that subset) and fine-tune the pretrained language model only on this subset, together with the labeled data. Hope these explanations help.
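
As a rough sketch of this selection step (the function below is illustrative, not code from this repo; it assumes predictive entropy as the uncertainty score):

```python
import torch

def select_low_uncertainty_subset(logits: torch.Tensor, pool_size: int):
    """Pick the `pool_size` unlabeled examples the model is most confident
    about, returning their indices and hard pseudo-labels.

    logits: (num_unlabeled, num_classes) predictions from the current model.
    """
    probs = torch.softmax(logits, dim=-1)
    # Predictive entropy as the uncertainty score (an assumed choice;
    # max-probability or margin-based scores would also work).
    entropy = -(probs * torch.log(probs + 1e-12)).sum(dim=-1)
    # Keep the pool_size lowest-entropy examples.
    selected = torch.argsort(entropy)[:pool_size]
    pseudo_labels = probs[selected].argmax(dim=-1)
    return selected, pseudo_labels
```

The selected pseudo-labeled examples would then be mixed with the labeled set for fine-tuning.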

Best,
Yue

@linhlt-it-ee

So how do you configure this number for different datasets? What value should it take if I test your method with TREC?

@yueyu1030
Owner

Overall, we tune this parameter based on performance on the validation set.
If there is no validation set, we recommend gradually (linearly) increasing the number of unlabeled examples to around 50% of the size of the unlabeled pool.
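
A minimal sketch of that schedule (illustrative only; the function name, the per-round linear growth, and the 50% cap are assumptions based on the recommendation above):

```python
def unlabeled_budget(round_idx: int, num_rounds: int, pool_size: int) -> int:
    """Linearly grow the number of pseudo-labeled examples used per round,
    reaching roughly 50% of the unlabeled pool by the final round."""
    max_budget = pool_size // 2  # cap at about half the unlabeled pool
    return max_budget * (round_idx + 1) // num_rounds

# Example: 10 rounds over a 20,000-example pool gives budgets of
# 1000, 2000, ..., 10000 examples in rounds 0 through 9.
```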
