Mersenne prime hashing fix. #200

Open. Wants to merge 1 commit into base: main

@Apsod Apsod commented May 28, 2024

This pull request aims to fix issue #198 by implementing a NumPy-compatible function that does not overflow. It separates the low and high bits of the multiplicands and reduces the result with the Mersenne prime modulus bit trick.
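
For readers who want to see the idea in code, here is a minimal sketch of the low/high-bit split combined with Mersenne folding. It is not the code from this PR: the names `fold`, `mulmod` and `affine_hash` are hypothetical, the prime is assumed to be 2**61 - 1, and all operands are assumed to be already reduced below that prime.

```python
import numpy as np

# Assumption: the modulus is the Mersenne prime 2**61 - 1.
P = np.uint64((1 << 61) - 1)
MASK32 = np.uint64((1 << 32) - 1)
MASK29 = np.uint64((1 << 29) - 1)
S3, S29, S32, S61 = (np.uint64(s) for s in (3, 29, 32, 61))


def fold(x):
    """Reduce any uint64 value to [0, P) using 2**61 == 1 (mod P)."""
    x = (x & P) + (x >> S61)  # at most P + 7, so one subtraction is enough
    return x - P * (x >= P).astype(np.uint64)


def mulmod(a, b):
    """(a * b) mod P for operands below P, without leaving uint64."""
    a = np.asarray(a, dtype=np.uint64)
    b = np.asarray(b, dtype=np.uint64)
    a_hi, a_lo = a >> S32, a & MASK32  # a_hi < 2**29, a_lo < 2**32
    b_hi, b_lo = b >> S32, b & MASK32

    # a_hi * b_hi carries a factor 2**64, and 2**64 == 8 (mod P).
    high = (a_hi * b_hi) << S3  # < 2**61

    # The cross terms carry a factor 2**32. Writing
    # mid = m_hi * 2**29 + m_lo gives
    # mid * 2**32 == m_hi + m_lo * 2**32 (mod P).
    mid = a_hi * b_lo + a_lo * b_hi  # < 2**62
    mid = (mid >> S29) + ((mid & MASK29) << S32)  # < 2**61 + 2**33

    low = fold(a_lo * b_lo)  # the product fits in 64 bits; result < P

    return fold(high + mid + low)  # the sum stays below 2**63


def affine_hash(x, a, b):
    """(a * x + b) mod P, assuming a, b and x are all below P."""
    b = np.asarray(b, dtype=np.uint64)
    return fold(mulmod(a, x) + b)  # the sum stays below 2**62
```

Every intermediate value in this sketch stays below 2**64, so the arithmetic never wraps around, and the results can be checked against Python's arbitrary-precision integers.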

pytest -sv ./tests/ passed without warnings for dedup; see the warnings summary below:

===================================================================================== warnings summary ======================================================================================
tests/executor/test_local.py: 85 warnings
tests/test_io.py: 3 warnings
  /home/user/miniconda3/envs/datatrove/lib/python3.12/site-packages/botocore/auth.py:419: DeprecationWarning: datetime.datetime.utcnow() is deprecated and scheduled for removal in a future version. Use timezone-aware objects to represent datetimes in UTC: datetime.datetime.now(datetime.UTC).
    datetime_now = datetime.datetime.utcnow()

tests/executor/test_local.py::TestLocalExecutor::test_executor
tests/executor/test_local.py::TestLocalExecutor::test_executor
tests/executor/test_local.py::TestLocalExecutor::test_executor
tests/executor/test_local.py::TestLocalExecutor::test_executor
tests/executor/test_local.py::TestLocalExecutor::test_executor
tests/executor/test_local.py::TestLocalExecutor::test_executor
  /home/user/miniconda3/envs/datatrove/lib/python3.12/site-packages/multiprocess/popen_fork.py:66: DeprecationWarning: This process (pid=144783) is multi-threaded, use of fork() may lead to deadlocks in the child.
    self.pid = os.fork()

tests/pipeline/test_filters.py::TestFilters::test_url
  /home/user/miniconda3/envs/datatrove/lib/python3.12/tarfile.py:2221: DeprecationWarning: Python 3.14 will, by default, filter extracted tar archives and reject files or modify their metadata. Use the filter argument to control this behavior.
    warnings.warn(

tests/pipeline/test_ngrams_decont.py::TestNGramDecont::test_label_only
  /home/user/miniconda3/envs/datatrove/lib/python3.12/site-packages/huggingface_hub/file_download.py:1132: FutureWarning: `resume_download` is deprecated and will be removed in version 1.0.0. Downloads always resume when possible. If you want to force a new download, use `force_download=True`.
    warnings.warn(

tests/pipeline/test_word_tokenizers.py::TestWordTokenizers::test_sent_tokenizers
  /home/user/miniconda3/envs/datatrove/lib/python3.12/site-packages/jieba/_compat.py:18: DeprecationWarning: pkg_resources is deprecated as an API. See https://setuptools.pypa.io/en/latest/pkg_resources.html
    import pkg_resources

tests/pipeline/test_word_tokenizers.py::TestWordTokenizers::test_sent_tokenizers
  /home/user/miniconda3/envs/datatrove/lib/python3.12/site-packages/pkg_resources/__init__.py:2832: DeprecationWarning: Deprecated call to `pkg_resources.declare_namespace('google')`.
  Implementing implicit namespace packages (as specified in PEP 420) is preferred to `pkg_resources.declare_namespace`. See https://setuptools.pypa.io/en/latest/references/keywords.html#keyword-namespace-packages
    declare_namespace(pkg)

tests/pipeline/test_word_tokenizers.py::TestWordTokenizers::test_sent_tokenizers
  /home/user/miniconda3/envs/datatrove/lib/python3.12/site-packages/pkg_resources/__init__.py:2832: DeprecationWarning: Deprecated call to `pkg_resources.declare_namespace('sphinxcontrib')`.
  Implementing implicit namespace packages (as specified in PEP 420) is preferred to `pkg_resources.declare_namespace`. See https://setuptools.pypa.io/en/latest/references/keywords.html#keyword-namespace-packages
    declare_namespace(pkg)

tests/pipeline/test_word_tokenizers.py::TestWordTokenizers::test_sent_tokenizers
  /home/user/miniconda3/envs/datatrove/lib/python3.12/site-packages/google/protobuf/internal/well_known_types.py:91: DeprecationWarning: datetime.datetime.utcfromtimestamp() is deprecated and scheduled for removal in a future version. Use timezone-aware objects to represent datetimes in UTC: datetime.datetime.fromtimestamp(timestamp, datetime.UTC).
    _EPOCH_DATETIME_NAIVE = datetime.datetime.utcfromtimestamp(0)

tests/test_io.py::TestIO::test_safely_create_file_locking
tests/test_io.py::TestIO::test_safely_create_file_locking
tests/test_io.py::TestIO::test_safely_create_file_locking
  /home/user/miniconda3/envs/datatrove/lib/python3.12/multiprocessing/popen_fork.py:66: DeprecationWarning: This process (pid=144783) is multi-threaded, use of fork() may lead to deadlocks in the child.
    self.pid = os.fork()

-- Docs: https://docs.pytest.org/en/stable/how-to/capture-warnings.html
================================================================== 50 passed, 1 skipped, 103 warnings in 207.21s (0:03:27) ==================================================================