fix: SwTokenizer getstate #136

Closed
wants to merge 5 commits into from

Conversation

Bing-su (Contributor) commented Aug 1, 2023

fixes: #135

https://docs.python.org/ko/3.11/library/pickle.html?highlight=pickle#pickling-class-instances

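At first it seemed that Python 3.11 had resolved this problem by defining a default behavior for when `__getstate__` is not defined. However, the same error occurs on Python 3.11 as well.

On Python 3.10 and below, this fix is still needed.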

bab2min (Owner) commented Aug 7, 2023

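Hello @Bing-su,
Thank you for the contribution. Looking at the PR, though, only `__getstate__` is implemented. In that case `pickle.dump` may succeed, but `pickle.load` looks like it will not work. Also, the current `__getstate__` saves only the Python-level attributes, so we need to check whether the result is a meaningful pickle dump. Could you verify that a SwTokenizer loaded back from a pickled file works correctly?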

Bing-su (Contributor, Author) commented Aug 7, 2023

https://docs.python.org/ko/3.11/library/pickle.html?highlight=pickle#object.__setstate__
If `__getstate__` returns a dict, it seems there is no need to define `__setstate__`.
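
The documented default can be checked in isolation. The following is a minimal sketch, using a hypothetical `Point` class that is not part of kiwipiepy: when `__getstate__` returns a dict and no `__setstate__` is defined, unpickling copies that dict into the new instance's `__dict__` without calling `__init__`.

```python
import pickle


class Point:
    """Hypothetical class used only to illustrate pickle's default behaviour."""

    init_calls = 0

    def __init__(self, x, y):
        Point.init_calls += 1
        self.x = x
        self.y = y
        self.cache = {"product": x * y}

    def __getstate__(self):
        # Return a dict without the cache; no __setstate__ is defined.
        state = self.__dict__.copy()
        del state["cache"]
        return state


original = Point(2, 3)
restored = pickle.loads(pickle.dumps(original))

print(restored.x, restored.y)      # attributes come back from the dict
print(hasattr(restored, "cache"))  # the dropped key is simply absent
print(Point.init_calls)            # counts __init__ calls so far
```

This only shows that the Python-level attributes round-trip. It says nothing about state held outside `__dict__`, such as an object owned by a C++ extension.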

I added code to test/test_transformers_addon.py that pickles and unpickles the tokenizer and then runs the tests again.

(kiwi)
kiwipiepy on pickle via △ v3.27.0 via 🐍 v3.10.12 via 🅒 kiwi took 2s
❯ python .\test\test_transformers_addon.py
None of PyTorch, TensorFlow >= 2.0, or Flax have been found. Models won't be available and only tokenizers, configuration and file/data utilities can be used.
(kiwi)
kiwipiepy on pickle via △ v3.27.0 via 🐍 v3.10.12 via 🅒 kiwi took 3s

I also ran a test that pickles the tokenizer with several pickle libraries and then compares the results.

import pickle
from itertools import permutations

import cloudpickle
import dill
import kiwipiepy.transformers_addon  # imported so AutoTokenizer can load the Kiwi tokenizer
from transformers import AutoTokenizer

repo = "kiwi-farm/roberta-base-32k"
orig = AutoTokenizer.from_pretrained(repo)

# Dump the same tokenizer with each pickle library.
with open("pk1.pkl", "wb") as f:
    pickle.dump(orig, f)

with open("pk2.pkl", "wb") as f:
    dill.dump(orig, f)

with open("pk3.pkl", "wb") as f:
    cloudpickle.dump(orig, f)

# Load each file back.
with open("pk1.pkl", "rb") as f:
    upk1 = pickle.load(f)

with open("pk2.pkl", "rb") as f:
    upk2 = dill.load(f)

with open("pk3.pkl", "rb") as f:
    upk3 = cloudpickle.load(f)

# Compare every ordered pair of tokenizers attribute by attribute.
for tk1, tk2 in permutations([orig, upk1, upk2, upk3], 2):
    for k in tk1.__dict__:
        if k != "_tokenizer":
            assert getattr(tk1, k) == getattr(tk2, k)
        else:
            # The wrapped tokenizer is compared by its Python-level attributes only.
            assert vars(getattr(tk1, k)) == vars(getattr(tk2, k))

print("ok!")
ok!

bab2min (Owner) commented Aug 7, 2023

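@Bing-su If you only inspect the properties, it can look as if everything works, but I expect an error once the code path that calls the internal object implemented in C++ is involved. I thought it would be better for the test to call methods such as tokenizer.tokenize, so I added such a function to test_transformers_addon.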
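
The gap between comparing attributes and exercising the object can be reproduced in pure Python. This is a rough sketch with hypothetical `NativeHandle` and `Wrapper` classes, which are assumptions for illustration and not kiwipiepy's real types. Here the missing handle surfaces as an `AttributeError`; with a real C++ object the failure can be much less graceful.

```python
import pickle


class NativeHandle:
    """Stand-in for a C++-backed object; hypothetical, refuses to be pickled."""

    def __reduce__(self):
        raise TypeError("native handle cannot be pickled")

    def split(self, text):
        return text.split()


class Wrapper:
    """Hypothetical tokenizer wrapper: plain attributes plus a native handle."""

    def __init__(self, lowercase):
        self.lowercase = lowercase
        self._handle = NativeHandle()

    def __getstate__(self):
        # Only the Python-level attributes are saved; the handle is left out.
        return {"lowercase": self.lowercase}

    def tokenize(self, text):
        if self.lowercase:
            text = text.lower()
        return self._handle.split(text)


orig = Wrapper(lowercase=True)
restored = pickle.loads(pickle.dumps(orig))

# An attribute-level comparison passes...
assert restored.lowercase == orig.lowercase

# ...but calling a method exposes the missing handle.
try:
    restored.tokenize("Hello World")
    behaves = True
except AttributeError:
    behaves = False

print(orig.tokenize("Hello World"))
print(behaves)
```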

As expected, a segmentation fault occurs where kiwi is used after unpickling. Either the C++ side needs to support pickling and unpickling the Kiwi object directly, or the SwTokenizer object needs to restore kiwi properly when it is unpickled.
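
The second option, rebuilding the native part on unpickle, might look roughly like the sketch below. `NativeEngine`, `RestorableTokenizer`, and the `model_path` argument are all hypothetical stand-ins chosen for illustration; how Kiwi itself would be reconstructed is not covered by this thread.

```python
import pickle


class NativeEngine:
    """Stand-in for a C++-backed object; hypothetical, refuses to be pickled."""

    def __init__(self, model_path):
        self.model_path = model_path

    def __reduce__(self):
        raise TypeError("native engine cannot be pickled")

    def split(self, text):
        return text.split()


class RestorableTokenizer:
    """Hypothetical wrapper that rebuilds its engine when unpickled."""

    def __init__(self, model_path):
        self.model_path = model_path
        self._engine = NativeEngine(model_path)

    def __getstate__(self):
        # Save only what is needed to construct the engine again.
        state = self.__dict__.copy()
        del state["_engine"]
        return state

    def __setstate__(self, state):
        self.__dict__.update(state)
        # Recreate the native part from the saved configuration.
        self._engine = NativeEngine(self.model_path)

    def tokenize(self, text):
        return self._engine.split(text)


orig = RestorableTokenizer("model.bin")  # placeholder path, never opened
restored = pickle.loads(pickle.dumps(orig))

print(restored.tokenize("a b c"))
print(restored._engine is not orig._engine)
```

The design choice here is to pickle the configuration rather than the engine, which only works if the engine can be fully reconstructed from that configuration.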

Fatal Python error: Segmentation fault

Current thread 0x00007f5822561700 (most recent call first):
  File "/__w/kiwipiepy/kiwipiepy/kiwipiepy/sw_tokenizer.py", line 416 in kiwi
  File "/__w/kiwipiepy/kiwipiepy/kiwipiepy/sw_tokenizer.py", line 263 in encode
  File "/__w/kiwipiepy/kiwipiepy/kiwipiepy/transformers_addon.py", line 303 in _make_encoded
  File "/__w/kiwipiepy/kiwipiepy/kiwipiepy/transformers_addon.py", line 264 in _encode_plus
  File "/opt/python/cp37-cp37m/lib/python3.7/site-packages/transformers/tokenization_utils_base.py", line 2512 in encode_plus
  File "/opt/python/cp37-cp37m/lib/python3.7/site-packages/transformers/tokenization_utils_base.py", line 2176 in encode
  File "/__w/kiwipiepy/kiwipiepy/kiwipiepy/transformers_addon.py", line 469 in tokenize
  File "/__w/kiwipiepy/kiwipiepy/test/test_transformers_addon.py", line 99 in test_pickle
  File "/opt/python/cp37-cp37m/lib/python3.7/site-packages/_pytest/python.py", line 194 in pytest_pyfunc_call
  File "/opt/python/cp37-cp37m/lib/python3.7/site-packages/pluggy/_callers.py", line 80 in _multicall
  File "/opt/python/cp37-cp37m/lib/python3.7/site-packages/pluggy/_manager.py", line 112 in _hookexec
  File "/opt/python/cp37-cp37m/lib/python3.7/site-packages/pluggy/_hooks.py", line 433 in __call__
  File "/opt/python/cp37-cp37m/lib/python3.7/site-packages/_pytest/python.py", line 1788 in runtest
  File "/opt/python/cp37-cp37m/lib/python3.7/site-packages/_pytest/runner.py", line 169 in pytest_runtest_call
  File "/opt/python/cp37-cp37m/lib/python3.7/site-packages/pluggy/_callers.py", line 80 in _multicall
  File "/opt/python/cp37-cp37m/lib/python3.7/site-packages/pluggy/_manager.py", line 112 in _hookexec
  File "/opt/python/cp37-cp37m/lib/python3.7/site-packages/pluggy/_hooks.py", line 433 in __call__
  File "/opt/python/cp37-cp37m/lib/python3.7/site-packages/_pytest/runner.py", line 262 in <lambda>
  File "/opt/python/cp37-cp37m/lib/python3.7/site-packages/_pytest/runner.py", line 341 in from_call
  File "/opt/python/cp37-cp37m/lib/python3.7/site-packages/_pytest/runner.py", line 262 in call_runtest_hook
  File "/opt/python/cp37-cp37m/lib/python3.7/site-packages/_pytest/runner.py", line 222 in call_and_report
  File "/opt/python/cp37-cp37m/lib/python3.7/site-packages/_pytest/runner.py", line 133 in runtestprotocol
  File "/opt/python/cp37-cp37m/lib/python3.7/site-packages/_pytest/runner.py", line 114 in pytest_runtest_protocol
  File "/opt/python/cp37-cp37m/lib/python3.7/site-packages/pluggy/_callers.py", line 80 in _multicall
  File "/opt/python/cp37-cp37m/lib/python3.7/site-packages/pluggy/_manager.py", line 112 in _hookexec
  File "/opt/python/cp37-cp37m/lib/python3.7/site-packages/pluggy/_hooks.py", line 433 in __call__
  File "/opt/python/cp37-cp37m/lib/python3.7/site-packages/_pytest/main.py", line 349 in pytest_runtestloop
  File "/opt/python/cp37-cp37m/lib/python3.7/site-packages/pluggy/_callers.py", line 80 in _multicall
  File "/opt/python/cp37-cp37m/lib/python3.7/site-packages/pluggy/_manager.py", line 112 in _hookexec
  File "/opt/python/cp37-cp37m/lib/python3.7/site-packages/pluggy/_hooks.py", line 433 in __call__
  File "/opt/python/cp37-cp37m/lib/python3.7/site-packages/_pytest/main.py", line 324 in _main
  File "/opt/python/cp37-cp37m/lib/python3.7/site-packages/_pytest/main.py", line 270 in wrap_session
  File "/opt/python/cp37-cp37m/lib/python3.7/site-packages/_pytest/main.py", line 317 in pytest_cmdline_main
  File "/opt/python/cp37-cp37m/lib/python3.7/site-packages/pluggy/_callers.py", line 80 in _multicall
  File "/opt/python/cp37-cp37m/lib/python3.7/site-packages/pluggy/_manager.py", line 112 in _hookexec
  File "/opt/python/cp37-cp37m/lib/python3.7/site-packages/pluggy/_hooks.py", line 433 in __call__
  File "/opt/python/cp37-cp37m/lib/python3.7/site-packages/_pytest/config/__init__.py", line 167 in main
  File "/opt/python/cp37-cp37m/lib/python3.7/site-packages/_pytest/config/__init__.py", line 189 in console_main
  File "/opt/python/cp37-cp37m/lib/python3.7/site-packages/pytest/__main__.py", line 5 in <module>
  File "/opt/python/cp37-cp37m/lib/python3.7/runpy.py", line 85 in _run_code
  File "/opt/python/cp37-cp37m/lib/python3.7/runpy.py", line 193 in _run_module_as_main
/__w/_temp/ff2c73ac-ace6-4011-964a-9897075bfa1d.sh: line 1:   927 Segmentation fault      (core dumped) /opt/python/cp37-cp37m/bin/python -m pytest --verbose test/test_transformers_addon.py
test/test_transformers_addon.py::test_pickle

Bing-su (Contributor, Author) commented Aug 8, 2023

You are right. I will do more testing and come back. Thank you.

@Bing-su Bing-su closed this Aug 9, 2023
@Bing-su Bing-su deleted the pickle branch August 11, 2023 11:27
Successfully merging this pull request may close these issues.

SwTokenizer cannot be pickled
2 participants