How to prepare two sequences as input for bert-multitask-learning? #33

rudra0713 opened this issue Nov 20, 2019 · 7 comments

Hi, I have a dataset that involves two sequences, and the task is to classify the sequence pair. I am not sure how to prepare the input in this case. So far, I have been working with only one sequence, where I used the following format:

["Everyone", "should", "be", "happy", "."]
How do I extend this to two sequences? Do I have to insert a [SEP] token myself?
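
For reference, the single-sequence setup described above corresponds roughly to a preprocessing function like the one below (a minimal sketch: the problem name toy_cls and the exact import path of preprocessing_fn are assumptions, not taken from this thread):

from bert_multitask_learning import preprocessing_fn  # import path assumed; may differ by version

@preprocessing_fn
def toy_cls(params, mode):
    # One pre-tokenized sentence per example, with one label per example.
    inputs = [["Everyone", "should", "be", "happy", "."]]
    labels = ['true']
    return inputs, labels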

JayYip commented Nov 25, 2019

Sorry, I misread your question. You can prepare something like:

@preprocessing_fn
def proc_fn(params, mode):
    return [{'a': ["Everyone", "should", "be", "happy", "."], 'b': ["you're", "right"]}], ['true']
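
Expanded into a slightly fuller sketch (the problem name seq_pair_cls, the second example, the labels, and the import path of preprocessing_fn are all illustrative assumptions): each example is a dict whose 'a' and 'b' keys hold the two pre-tokenized sequences, and the labels are returned as a parallel list. You shouldn't need to insert [SEP] yourself; [CLS] and both [SEP] tokens are added when the inputs are tokenized.

from bert_multitask_learning import preprocessing_fn  # import path assumed

@preprocessing_fn
def seq_pair_cls(params, mode):
    # Each example is a dict with the two pre-tokenized sequences under 'a' and 'b';
    # the matching labels are returned as a second, parallel list.
    examples = [
        {'a': ['Everyone', 'should', 'be', 'happy', '.'],
         'b': ["you're", 'right']},
        {'a': ['marriage', 'came', 'from', 'religion', '.'],
         'b': ['Everyone', 'should', 'be', 'happy', '.']},
    ]
    labels = ['true', 'false']
    return examples, labels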

rudra0713 commented

I prepared two sequences following your format. Here's an example:
{'a': ['Everyone', 'should', 'not', 'be', 'happy', '.'], 'b': ["you're", 'right']}
After printing tokens in the add_special_tokens_with_seqs function in utils.py, I got this:
tokens -> ['[CLS]', 'a', 'b', '[SEP]']
I was expecting 'a' and 'b' to be replaced by the original sequences. Is this okay?
For a single-sequence task, when I printed tokens, I got the desired output:
tokens -> ['[CLS]', 'marriage', 'came', 'from', 'religion', '.', '[SEP]']

JayYip commented Jan 6, 2020

Maybe it's a bug. Could you confirm that the example argument of create_single_problem_single_instance is a tuple like the one below?

({'a': ['Everyone', 'should', 'not', 'be', 'happy', '.'], 'b': ["you're", 'right']}, 'some label')
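
A quick, purely illustrative sanity check of that shape (using the values above):

example = ({'a': ['Everyone', 'should', 'not', 'be', 'happy', '.'],
            'b': ["you're", 'right']}, 'some label')
features, label = example  # unpack the (features, label) tuple
assert isinstance(features, dict) and set(features) == {'a', 'b'}
print(features['a'], features['b'], label)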

rudra0713 commented Jan 7, 2020

After adding this print, this is what I found:

If the mode in the preprocessing function is 'train' or 'eval', the output of example aligns with what you mentioned:

example (from the create_single_problem_single_instance function) -> ({'a': ['we', 'Should', 'be', 'optimistic', 'about', 'the', 'future', '.'], 'b': ['Anything', 'that', 'improves', 'rush', 'hour', 'traffic', 'ca', "n't", 'be', 'all', 'that', 'bad', '.']}, 0)
tokens (from the add_special_tokens_with_seqs function) -> ['[CLS]', 'we', 'should', 'be', 'op', '##timi', '##stic', 'about', 'the', 'future', '.', '[SEP]', 'anything', 'that', 'improve', '##s', 'rus', '##h', 'hour', 'traffic', 'ca', 'n', "##'", '##t', 'be', 'all', 'that', 'bad', '.', '[SEP]']

But when the mode is 'infer', right before printing the accuracies of the particular task, there is no print of 'example', and tokens become like this ->
tokens -> ['[CLS]', 'a', 'b', '[SEP]']

Also, for the same dataset and same split, previously I got 76% accuracy with the BERT model, but with the multitask setting, for that same task alone, I am getting only 48.71% accuracy.

JayYip commented Jan 10, 2020

But when the mode is 'infer', right before printing the accuracies of the particular task, there is no print of 'example', and tokens become like this ->
tokens -> ['[CLS]', 'a', 'b', '[SEP]']

This is a bug. I'll fix it later.

Also, for the same dataset and same split, previously I got 76% accuracy with the BERT model, but with the multitask setting, for that same task alone, I am getting only 48.71% accuracy.

That's weird. Maybe it's caused by another bug. Could you provide more info?

JayYip closed this as completed Jan 10, 2020
JayYip commented Jan 10, 2020

Sorry, accidentally closed. Reopening now.

JayYip reopened this Jan 10, 2020