added tokenizer as arg for pt.text.sliding #387

mihirs16 · 2023-03-31T09:06:52Z

For the purposes of obtaining passages of a given length, tokenisation is performed using the tokenizer object passed as an argument. The tokenizer object must have a .tokenize(str) -> list[str] method. By default, tokenisation is performed by splitting on one or more whitespace characters, i.e. using the Python regular expression re.compile(r'\s+').

Example

import pyterrier as pt
from transformers import AutoTokenizer

# assumes `index` is an existing PyTerrier index with "docno" and "body" metadata
tok = AutoTokenizer.from_pretrained("bert-base-uncased")
pipe = (pt.BatchRetrieve(index, wmodel="DPH", metadata=["docno", "body"])
        >> pt.text.sliding(length=128, stride=64, prepend_attr=None, tokenizer=tok)
        >> pt.text.scorer(wmodel="DPH")
        >> pt.text.max_passage())
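
For reference, here is a small sketch of the default (no tokenizer) behaviour described above, using only the standard library. It assumes nothing beyond the regular expression quoted in the description; the sample strings are made up for illustration.

```python
import re

# Default behaviour described above: split on one or more whitespace characters.
splitter = re.compile(r"\s+")

tokens = splitter.split("passage  retrieval\twith sliding\nwindows")
print(tokens)  # ['passage', 'retrieval', 'with', 'sliding', 'windows']

# Leading whitespace produces an empty first token, which counts towards
# the passage length unless the text is stripped beforehand.
leading = splitter.split("  padded text")
print(leading)  # ['', 'padded', 'text']
```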

mihirs16 · 2023-03-31T09:07:28Z

Still to discuss: the possibility and implementation of sentence segmentation (as part of #217).

cmacdonald · 2023-04-02T08:33:30Z

Thanks @mihirs16 - we're attending ECIR this week, so reviewing of the PR may be delayed.

cmacdonald

Can you add some new unit tests too? You should assert that your requirements on the token lengths are met.
You may need to add transformers to requirements-test.txt (but ideally, I'd like NOT to add such a heavy dependency), so perhaps you could mock up another tokenizer class.
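
A rough sketch of what such a dependency-free test could look like. MockTokenizer here is a hypothetical test double and sliding_windows is a hypothetical helper standing in for the windowing logic; neither is taken from pyterrier/text.py.

```python
import unittest

class MockTokenizer:
    """Hypothetical test double: one token per pair of characters, no external dependencies."""
    def tokenize(self, text):
        return [text[i:i + 2] for i in range(0, len(text), 2)]

def sliding_windows(tokens, length, stride):
    """Hypothetical helper: windows of at most `length` tokens, advancing by `stride`."""
    return [tokens[i:i + length] for i in range(0, len(tokens), stride)]

class TestSlidingLengths(unittest.TestCase):
    def test_windows_respect_length(self):
        toks = MockTokenizer().tokenize("abcdefghijklmnopqrstuvwxyz")
        windows = sliding_windows(toks, length=4, stride=2)
        # Every passage must contain at most `length` tokens.
        self.assertTrue(all(len(w) <= 4 for w in windows))
        self.assertEqual(windows[0], ["ab", "cd", "ef", "gh"])
```

Saved to a file, this can be run with python -m unittest; a real test would exercise pt.text.sliding itself rather than the helper.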

mihirs16 · 2023-04-25T13:38:37Z

hey! would love further feedback (if any) about this

cmacdonald · 2023-04-26T11:18:25Z

Hi @mihirs16

Thanks for the ping. Most of the PR looks good. I think that my concern is whether

row[self.text_attr] = ' '.join(toks)

is now appropriate for recovering the passage segment.

Perhaps if the tokenizer is specified, it should call the .decode() method to recover the passage? (NB: Not sure .decode() is the correct method!)

I think also we could have a separate test case that runs IF transformers is installed, and is skipped otherwise:

def test_sliding_tokenize_HGF(self):
  try:
    from transformers import AutoTokenizer
  except ImportError:
    self.skipTest("transformers not installed")
  ...
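
An alternative sketch of the same idea using a decorator, so the availability check happens once at import time. The class name and the HAS_TRANSFORMERS flag are hypothetical, and the test body is a placeholder for the real assertions.

```python
import importlib.util
import unittest

# True only if the optional dependency can be found; nothing is imported yet.
HAS_TRANSFORMERS = importlib.util.find_spec("transformers") is not None

class TestSlidingHGF(unittest.TestCase):
    @unittest.skipUnless(HAS_TRANSFORMERS, "transformers not installed")
    def test_sliding_tokenize_HGF(self):
        # Placeholder: the real test would load a tokenizer and check passage lengths.
        self.assertTrue(HAS_TRANSFORMERS)
```

Either form reports the test as skipped rather than failed when transformers is missing.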

mihirs16 · 2023-04-26T22:15:48Z

Sorry for the long post ahead 😅 but most of it is just python shell output.

Thanks for the ping. Most of the PR looks good. I think that my concern is whether
row[self.text_attr] = ' '.join(toks)
is now appropriate for recovering the passage segment.

Perhaps if the tokenizer is specified, it should call the .decode() method to recover the passage? (NB: Not sure .decode() is the correct method!)

@cmacdonald that's a very good point, here's a comparison between the first time a passage is tokenized, and then de-tokenized using ' '.join(toks), and then tokenized again to see if it reproduces the same result... which it does not.

>>> tokenizer = AutoTokenizer.from_pretrained("bert-base-uncased")
>>> tokenized = tokenizer.tokenize("Perhaps if the tokenizer is specified, it should call the .decode() method to recover the passage?")
>>> tokenized, len(tokenized)
(['perhaps', 'if', 'the', 'token', '##izer', 'is', 'specified', ',', 'it', 'should', 'call', 'the', '.', 'deco', '##de', '(', ')', 'method', 'to', 'recover', 'the', 'passage', '?'], 23)
>>> untokenized = ' '.join(tokenized)
>>> untokenized
'perhaps if the token ##izer is specified , it should call the . deco ##de ( ) method to recover the passage ?'
>>> tokenized_again = tokenizer.tokenize(untokenized)
>>> tokenized_again, len(tokenized_again)
(['perhaps', 'if', 'the', 'token', '#', '#', 'i', '##zer', 'is', 'specified', ',', 'it', 'should', 'call', 'the', '.', 'deco', '#', '#', 'de', '(', ')', 'method', 'to', 'recover', 'the', 'passage', '?'], 28)

As far as I understand, .decode() is meant for the case where .encode() was used to tokenize and encode the passage into token ids. Here's an example using a Hugging Face tokenizer:

>>> encoded = tokenizer.encode("Perhaps if the tokenizer is specified, it should call the .decode() method to recover the passage?")
>>> encoded, len(encoded)
([101, 3383, 2065, 1996, 19204, 17629, 2003, 9675, 1010, 2009, 2323, 2655, 1996, 1012, 21933, 3207, 1006, 1007, 4118, 2000, 8980, 1996, 6019, 1029, 102], 25)
>>> decoded = tokenizer.decode(encoded)
>>> decoded
'[CLS] perhaps if the tokenizer is specified, it should call the. decode ( ) method to recover the passage? [SEP]'

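Otherwise, if we use the .tokenize() method, there are other utilities such as .convert_tokens_to_string() that rebuild a string from the tokens. The result is normalised (lowercased, with altered spacing around punctuation) rather than identical to the input, but it re-tokenizes to the same tokens. Here's an example: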

>>> tokenizer = AutoTokenizer.from_pretrained("bert-base-uncased")
>>> tokenized = tokenizer.tokenize("Perhaps if the tokenizer is specified, it should call the .decode() method to recover the passage?")
>>> tokenized, len(tokenized)
(['perhaps', 'if', 'the', 'token', '##izer', 'is', 'specified', ',', 'it', 'should', 'call', 'the', '.', 'deco', '##de', '(', ')', 'method', 'to', 'recover', 'the', 'passage', '?'], 23)
>>> untokenized = tokenizer.convert_tokens_to_string(tokenized)
>>> untokenized
'perhaps if the tokenizer is specified, it should call the. decode ( ) method to recover the passage?'
>>> tokenized_again = tokenizer.tokenize(untokenized)
>>> tokenized_again, len(tokenized_again)
(['perhaps', 'if', 'the', 'token', '##izer', 'is', 'specified', ',', 'it', 'should', 'call', 'the', '.', 'deco', '##de', '(', ')', 'method', 'to', 'recover', 'the', 'passage', '?'], 23)

In the light of this, I suppose the Passager class should have a detokenize to handle this in the same way as tokenize?

(using .convert_tokens_to_string() and making it a mandatory attribute, just like tokenize() for the tokenizer passed to SlidingWindowPassager).
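
The round-trip problem can also be reproduced without downloading a model. The sketch below uses a made-up ToySubwordTokenizer (a hypothetical class that splits words into pieces of at most four characters and marks continuations with "##"); it only imitates the WordPiece convention and is not how bert-base-uncased actually tokenizes.

```python
class ToySubwordTokenizer:
    """Hypothetical stand-in for a WordPiece-style tokenizer (not a real library class)."""
    def __init__(self, max_piece=4):
        self.max_piece = max_piece

    def tokenize(self, text):
        tokens = []
        for word in text.split():
            pieces = [word[i:i + self.max_piece]
                      for i in range(0, len(word), self.max_piece)]
            tokens.append(pieces[0])
            # Continuation pieces carry the "##" marker.
            tokens.extend("##" + p for p in pieces[1:])
        return tokens

    def convert_tokens_to_string(self, tokens):
        # Glue continuation pieces back onto the preceding piece.
        return " ".join(tokens).replace(" ##", "")

tok = ToySubwordTokenizer()
text = "the tokenizer recovers passages"
toks = tok.tokenize(text)

naive = " ".join(toks)                       # keeps the "##" markers as text
proper = tok.convert_tokens_to_string(toks)  # removes them again

print(len(toks), len(tok.tokenize(naive)), len(tok.tokenize(proper)))  # 8 11 8
```

With the naive join, the markers are re-tokenized as ordinary characters, so the token count grows and a passage can exceed the requested length.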

cmacdonald · 2023-04-27T12:00:14Z

I think the type of members of toks could be string or int.

So I think we can roughly do:

toks = tok.tokenize(text)
reconstructed = tok.convert_tokens_to_string(toks)

mihirs16 · 2023-04-28T16:49:35Z

I took the liberty of setting the tokenize and detokenize functions in the constructor itself, and also added a separate unit test to handle the case where transformers is installed (but haven't added it to requirements-test.txt).
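
A minimal sketch of that constructor-level choice, so no branching is needed inside the per-document loop. PassagerSketch and first_passage are hypothetical names for illustration; this is not the merged SlidingWindowPassager code.

```python
import re

class PassagerSketch:
    """Hypothetical illustration: pick tokenize/detokenize once, at construction."""
    def __init__(self, tokenizer=None):
        if tokenizer is None:
            # Default: whitespace splitting, rebuilt with single spaces.
            self.tokenize = re.compile(r"\s+").split
            self.detokenize = " ".join
        else:
            # Custom tokenizer: must provide both methods.
            self.tokenize = tokenizer.tokenize
            self.detokenize = tokenizer.convert_tokens_to_string

    def first_passage(self, text, length):
        # Rebuild the text of the first `length` tokens.
        return self.detokenize(self.tokenize(text)[:length])

print(PassagerSketch().first_passage("a b c d", 2))  # a b
```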

mihirs16 · 2023-05-21T21:34:41Z

@cmacdonald would love some feedback on this!

cmacdonald

Sorry for the delay on this. I have left a few comments, but it's nearly there :-)

cmacdonald · 2023-10-10T16:44:14Z

Thanks for this @mihirs16!

added tokenizer as arg for pt.text.sliding

a05046e

cmacdonald requested changes Apr 6, 2023

fix: if statements removed from critical loops | add: unittest with a custom MockTokenizer

3c7f888

mihirs16 requested a review from cmacdonald April 7, 2023 23:17

fix: custom tokenizer uses .convert_tokens_to_string() instead of ' '.join by default | test: hgf transformers

22491a8

Update requirements-test.txt to include transformers for testing

43942fb

cmacdonald reviewed Oct 10, 2023

fix: doc fix and removed callable check

aed6964

mihirs16 requested a review from cmacdonald October 10, 2023 12:44

update to documentation

61d7685

cmacdonald merged commit f0d1f18 into terrier-org:master Oct 10, 2023
12 of 13 checks passed
