package torchtext #17129
Merged
14 commits
4178274 add torchtext (h-vetinari)
8ca903a add test dependencies & initialise required cache (h-vetinari)
5cc0b4e bump to 0.13.1 (h-vetinari)
b3b2060 Squashed & adapted changes from #21740 (giswqs)
4e1146f rebase patch & use libsentencepiece (h-vetinari)
f174c91 patch some CMake variables; always use ninja (h-vetinari)
301d013 fix third party stuff (h-vetinari)
36555d7 fix TORCH_INSTALL_PREFIX (h-vetinari)
8b7d69b more patches (h-vetinari)
117f982 avoid submodule issues on osx (h-vetinari)
a3a3a8e use C++17 to match abseil ABI in libsentencepiece (h-vetinari)
2a86fa8 remove obsolete caching; removed by upstream PR 1587 (h-vetinari)
8d185da back to 0.13.1 for pytorch 1.12 compatibility (h-vetinari)
e96816b add missing test dep (h-vetinari)
@@ -0,0 +1,21 @@
MIT License

Copyright (c) 2017 Victor Zhong

Permission is hereby granted, free of charge, to any person obtaining a copy
of this software and associated documentation files (the "Software"), to deal
in the Software without restriction, including without limitation the rights
to use, copy, modify, merge, publish, distribute, sublicense, and/or sell
copies of the Software, and to permit persons to whom the Software is
furnished to do so, subject to the following conditions:

The above copyright notice and this permission notice shall be included in all
copies or substantial portions of the Software.

THE SOFTWARE IS PROVIDED "AS IS", WITHOUT WARRANTY OF ANY KIND, EXPRESS OR
IMPLIED, INCLUDING BUT NOT LIMITED TO THE WARRANTIES OF MERCHANTABILITY,
FITNESS FOR A PARTICULAR PURPOSE AND NONINFRINGEMENT. IN NO EVENT SHALL THE
AUTHORS OR COPYRIGHT HOLDERS BE LIABLE FOR ANY CLAIM, DAMAGES OR OTHER
LIABILITY, WHETHER IN AN ACTION OF CONTRACT, TORT OR OTHERWISE, ARISING FROM,
OUT OF OR IN CONNECTION WITH THE SOFTWARE OR THE USE OR OTHER DEALINGS IN THE
SOFTWARE.
@@ -0,0 +1,40 @@
# upstream did not publish any tags, and a performance-critical
# fix appears after the last bump (for 0.0.3). This is what e.g.
# torchtext uses upstream (installing through git), so we add a ".1"
{% set version = "0.0.3.1" %}
{% set commit = "f1998b72a941d1e5f9578a66dc1c20b01913caab" %}

package:
  name: revtok
  version: {{ version }}

source:
  url: https://github.com/jekbradbury/revtok/archive/{{ commit }}.tar.gz
  sha256: a7447fefb44fbe46140bfc337c6ec756b869c37f737fd18eaec1293d15865b8f

build:
  number: 0
  noarch: python
  script: {{ PYTHON }} -m pip install . --no-deps -vv

requirements:
  host:
    - python >=3.6
    - pip
  run:
    - python >=3.6
    - tqdm
test:
  imports:
    - revtok

about:
  home: https://github.com/jekbradbury/revtok
  license: MIT
  license_file: LICENSE
  summary: Reversible tokenization in Python.
  dev_url: https://github.com/jekbradbury/revtok

extra:
  recipe-maintainers:
    - h-vetinari
@@ -0,0 +1,29 @@
BSD 3-Clause License

Copyright (c) James Bradbury and Soumith Chintala 2016,
All rights reserved.

Redistribution and use in source and binary forms, with or without
modification, are permitted provided that the following conditions are met:

* Redistributions of source code must retain the above copyright notice, this
  list of conditions and the following disclaimer.

* Redistributions in binary form must reproduce the above copyright notice,
  this list of conditions and the following disclaimer in the documentation
  and/or other materials provided with the distribution.

* Neither the name of the copyright holder nor the names of its
  contributors may be used to endorse or promote products derived from
  this software without specific prior written permission.

THIS SOFTWARE IS PROVIDED BY THE COPYRIGHT HOLDERS AND CONTRIBUTORS "AS IS"
AND ANY EXPRESS OR IMPLIED WARRANTIES, INCLUDING, BUT NOT LIMITED TO, THE
IMPLIED WARRANTIES OF MERCHANTABILITY AND FITNESS FOR A PARTICULAR PURPOSE ARE
DISCLAIMED. IN NO EVENT SHALL THE COPYRIGHT HOLDER OR CONTRIBUTORS BE LIABLE
FOR ANY DIRECT, INDIRECT, INCIDENTAL, SPECIAL, EXEMPLARY, OR CONSEQUENTIAL
DAMAGES (INCLUDING, BUT NOT LIMITED TO, PROCUREMENT OF SUBSTITUTE GOODS OR
SERVICES; LOSS OF USE, DATA, OR PROFITS; OR BUSINESS INTERRUPTION) HOWEVER
CAUSED AND ON ANY THEORY OF LIABILITY, WHETHER IN CONTRACT, STRICT LIABILITY,
OR TORT (INCLUDING NEGLIGENCE OR OTHERWISE) ARISING IN ANY WAY OUT OF THE USE
OF THIS SOFTWARE, EVEN IF ADVISED OF THE POSSIBILITY OF SUCH DAMAGE.
@@ -0,0 +1,89 @@
{% set version = "0.13.1" %}
{% set spacy_model = "en_core_web_sm" %}

package:
  name: torchtext
  version: {{ version }}

source:
  url: https://github.com/pytorch/text/archive/refs/tags/v{{ version }}.tar.gz
  sha256: 1f7de1fd8c6303ea13ef2aed48a6df67df5f92d1c4a6918253be7decd93753be
  patches:
    # don't build pytorch/text/tree/main/third_party; pull in from conda-forge
    - patches/0001-do-not-build-third_party-libs.patch
    # make sure we install from $PREFIX and into $SP_DIR/torchtext
    - patches/0002-fix-some-CMake-arguments-for-our-infrastructure.patch
    # remove spurious run requirements from installation requirements
    - patches/0003-remove-unnecessary-installation-requirements.patch
    - patches/0004-load-library-from-correct-place.patch
    - patches/0005-must-use-C-17-to-match-abseil.patch

build:
  number: 0
  # no pytorch on windows in conda-forge, see
  # https://github.com/conda-forge/pytorch-cpu-feedstock/issues/32
  skip: true # [win]
  script: {{ PYTHON }} -m pip install . --no-deps -vv
  rpaths:
    - lib/
    - {{ SP_DIR }}/torch/lib
    - {{ SP_DIR }}/torchtext/lib

requirements:
  build:
    - {{ compiler('cxx') }}
    - sysroot_linux-64 2.17 # [linux64]
    - cmake
    - ninja
    - pkg-config
  host:
    - python
    - pip
    - numpy
    - pytorch
    # from pytorch/text/tree/main/third_party
    - double-conversion
    - libsentencepiece
    - libutf8proc
    - re2
  run:
    - python
    - nltk
    - requests
    - revtok ==0.0.3.1
    - sacremoses
    - spacy
    - tqdm

test:
  requires:
    - pip
    - pytest
    - expecttest
    - parameterized
    - spacy-model-{{ spacy_model }}
    - torchdata
  source_files:
    - test/
  imports:
    - torchtext
    - torchtext.datasets
    - torchtext.data
    - torchtext.nn
    - torchtext.vocab
  commands:
    - pip check
    # then run test suite
    - pytest test/ -v

about:
  home: https://pytorch.org/text
  license: BSD-3-Clause
  license_file: LICENSE
  summary: Data loaders and abstractions for text and NLP
  dev_url: https://github.com/pytorch/text

extra:
  recipe-maintainers:
    - h-vetinari
    - giswqs
53 changes: 53 additions & 0 deletions
recipes/torchtext/patches/0001-do-not-build-third_party-libs.patch
@@ -0,0 +1,53 @@
From 187c7dc6a0a1a7ee2d87f4be7aae5a7ed1ff5a30 Mon Sep 17 00:00:00 2001
From: "H. Vetinari" <h.vetinari@gmx.com>
Date: Wed, 1 Dec 2021 21:07:42 +1100
Subject: [PATCH 1/5] do not build third_party libs

---
 CMakeLists.txt                | 5 ++++-
 torchtext/csrc/CMakeLists.txt | 8 --------
 2 files changed, 4 insertions(+), 9 deletions(-)

diff --git a/CMakeLists.txt b/CMakeLists.txt
index 1ead15d4..fe9f9636 100644
--- a/CMakeLists.txt
+++ b/CMakeLists.txt
@@ -63,5 +63,8 @@ endif()
 # TORCH_CXX_FLAGS contains the same -D_GLIBCXX_USE_CXX11_ABI value as PyTorch
 set(CMAKE_CXX_FLAGS "${CMAKE_CXX_FLAGS} -Wall ${TORCH_CXX_FLAGS}")

-add_subdirectory(third_party)
+find_package(re2 REQUIRED)
+find_package(double-conversion REQUIRED)
+find_package(sentencepiece REQUIRED)
+
 add_subdirectory(torchtext/csrc)
diff --git a/torchtext/csrc/CMakeLists.txt b/torchtext/csrc/CMakeLists.txt
index 037f814d..658b9034 100644
--- a/torchtext/csrc/CMakeLists.txt
+++ b/torchtext/csrc/CMakeLists.txt
@@ -24,10 +24,6 @@ set(
 set(
   LIBTORCHTEXT_INCLUDE_DIRS
   ${PROJECT_SOURCE_DIR}
-  ${PROJECT_SOURCE_DIR}/third_party/sentencepiece/src
-  $<TARGET_PROPERTY:re2,INCLUDE_DIRECTORIES>
-  $<TARGET_PROPERTY:double-conversion,INCLUDE_DIRECTORIES>
-  $<TARGET_PROPERTY:utf8proc,INCLUDE_DIRECTORIES>
   ${TORCH_INSTALL_PREFIX}/include
   ${TORCH_INSTALL_PREFIX}/include/torch/csrc/api/include
   )
@@ -123,10 +119,6 @@ if (BUILD_TORCHTEXT_PYTHON_EXTENSION)
   set(
     EXTENSION_INCLUDE_DIRS
     ${PROJECT_SOURCE_DIR}
-    ${PROJECT_SOURCE_DIR}/third_party/sentencepiece/src
-    $<TARGET_PROPERTY:re2,INCLUDE_DIRECTORIES>
-    $<TARGET_PROPERTY:double-conversion,INCLUDE_DIRECTORIES>
-    $<TARGET_PROPERTY:utf8proc,INCLUDE_DIRECTORIES>
     ${TORCH_INSTALL_PREFIX}/include
     ${TORCH_INSTALL_PREFIX}/include/torch/csrc/api/include
     )
--
2.38.1.windows.1

43 changes: 43 additions & 0 deletions
recipes/torchtext/patches/0002-fix-some-CMake-arguments-for-our-infrastructure.patch
@@ -0,0 +1,43 @@
From 630e2ae4ee1e921cc4c8c264a12790ea7e21abf0 Mon Sep 17 00:00:00 2001
From: "H. Vetinari" <h.vetinari@gmx.com>
Date: Wed, 18 Jan 2023 16:56:57 +1100
Subject: [PATCH 2/5] fix some CMake arguments for our infrastructure

---
 build_tools/setup_helpers/extension.py | 10 ++++------
 1 file changed, 4 insertions(+), 6 deletions(-)

diff --git a/build_tools/setup_helpers/extension.py b/build_tools/setup_helpers/extension.py
index 1f7236e4..71261577 100644
--- a/build_tools/setup_helpers/extension.py
+++ b/build_tools/setup_helpers/extension.py
@@ -60,12 +60,13 @@ class CMakeBuild(build_ext):
         cfg = "Debug" if self.debug else "Release"

         cmake_args = [
+            "-GNinja",
             f"-DCMAKE_BUILD_TYPE={cfg}",
-            f"-DCMAKE_PREFIX_PATH={torch.utils.cmake_prefix_path}",
-            f"-DCMAKE_INSTALL_PREFIX={extdir}",
+            f"-DCMAKE_PREFIX_PATH={os.environ['PREFIX']}",
+            f"-DCMAKE_INSTALL_PREFIX={os.environ['SP_DIR'] + '/torchtext'}",
             "-DCMAKE_VERBOSE_MAKEFILE=ON",
             f"-DPython_INCLUDE_DIR={distutils.sysconfig.get_python_inc()}",
-            f"-DTORCH_INSTALL_PREFIX:STRING={os.path.dirname(torch.__file__)}",
+            f"-DTORCH_INSTALL_PREFIX:STRING={os.environ['SP_DIR'] + '/torch'}",
             "-DBUILD_TORCHTEXT_PYTHON_EXTENSION:BOOL=ON",
             "-DRE2_BUILD_TESTING:BOOL=OFF",
             "-DBUILD_TESTING:BOOL=OFF",
@@ -75,9 +76,6 @@ class CMakeBuild(build_ext):
         ]
         build_args = ["--target", "install"]

-        # Default to Ninja
-        if "CMAKE_GENERATOR" not in os.environ or platform.system() == "Windows":
-            cmake_args += ["-GNinja"]
         if platform.system() == "Windows":
             import sys

--
2.38.1.windows.1

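To make the effect of the patch above concrete, here is a minimal sketch of how the changed arguments would resolve once conda-build has set `PREFIX` and `SP_DIR`. The helper name `conda_cmake_args` and the example paths are hypothetical and not part of the PR; only the flag names and the two environment variables are taken from the patch.

```python
import os


def conda_cmake_args(env=None, debug=False):
    """Return the arguments changed by the patch (illustrative helper, not upstream code)."""
    env = os.environ if env is None else env
    prefix = env["PREFIX"]  # host prefix set by conda-build
    sp_dir = env["SP_DIR"]  # site-packages directory set by conda-build
    cfg = "Debug" if debug else "Release"
    return [
        "-GNinja",  # always passed now, independent of CMAKE_GENERATOR
        f"-DCMAKE_BUILD_TYPE={cfg}",
        f"-DCMAKE_PREFIX_PATH={prefix}",
        f"-DCMAKE_INSTALL_PREFIX={sp_dir}/torchtext",
        f"-DTORCH_INSTALL_PREFIX:STRING={sp_dir}/torch",
    ]


# The paths below are made up purely for illustration.
example_env = {
    "PREFIX": "/opt/conda/envs/host",
    "SP_DIR": "/opt/conda/envs/host/lib/python3.10/site-packages",
}
args = conda_cmake_args(example_env)
```

Passing the environment as a dictionary keeps the sketch runnable outside a conda-build session, where `PREFIX` and `SP_DIR` would normally be unset.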
25 changes: 25 additions & 0 deletions
recipes/torchtext/patches/0003-remove-unnecessary-installation-requirements.patch
@@ -0,0 +1,25 @@
From 55627e85bccfccc9ad9344556e52f6b9a178034c Mon Sep 17 00:00:00 2001
From: "H. Vetinari" <h.vetinari@gmx.com>
Date: Wed, 18 Jan 2023 22:06:33 +1100
Subject: [PATCH 3/5] remove unnecessary installation requirements

---
 setup.py | 2 +-
 1 file changed, 1 insertion(+), 1 deletion(-)

diff --git a/setup.py b/setup.py
index 080415f7..d3ca73d2 100644
--- a/setup.py
+++ b/setup.py
@@ -86,7 +86,7 @@ setup_info = dict(
     description="Text utilities and datasets for PyTorch",
     long_description=read("README.rst"),
     license="BSD",
-    install_requires=["tqdm", "requests", pytorch_package_dep, "numpy"],
+    install_requires=[pytorch_package_dep, "numpy"],
     python_requires=">=3.7",
     classifiers=[
         "Programming Language :: Python :: 3.7",
--
2.38.1.windows.1

25 changes: 25 additions & 0 deletions
recipes/torchtext/patches/0004-load-library-from-correct-place.patch
@@ -0,0 +1,25 @@
From 973b89da3291944637807f9908bc50bd0f02f772 Mon Sep 17 00:00:00 2001
From: "H. Vetinari" <h.vetinari@gmx.com>
Date: Wed, 18 Jan 2023 22:25:03 +1100
Subject: [PATCH 4/5] load library from correct place

---
 torchtext/_extension.py | 2 +-
 1 file changed, 1 insertion(+), 1 deletion(-)

diff --git a/torchtext/_extension.py b/torchtext/_extension.py
index b6dbb07b..e6205be1 100644
--- a/torchtext/_extension.py
+++ b/torchtext/_extension.py
@@ -4,7 +4,7 @@ from pathlib import Path
 import torch
 from torchtext._internal import module_utils as _mod_utils

-_LIB_DIR = Path(__file__).parent / "lib"
+_LIB_DIR = Path(os.environ["SP_DIR"]) / "torch" / "lib"


 def _get_lib_path(lib: str):
--
2.38.1.windows.1

25 changes: 25 additions & 0 deletions
recipes/torchtext/patches/0005-must-use-C-17-to-match-abseil.patch
@@ -0,0 +1,25 @@
From e4712395289276ec81a17a3367e42ab77eb7b8cd Mon Sep 17 00:00:00 2001
From: "H. Vetinari" <h.vetinari@gmx.com>
Date: Fri, 20 Jan 2023 17:16:20 +1100
Subject: [PATCH 5/5] must use C++17 to match abseil

---
 CMakeLists.txt | 2 +-
 1 file changed, 1 insertion(+), 1 deletion(-)

diff --git a/CMakeLists.txt b/CMakeLists.txt
index fe9f9636..ab29a6da 100644
--- a/CMakeLists.txt
+++ b/CMakeLists.txt
@@ -27,7 +27,7 @@ if(env_cxx_standard GREATER -1)
     "PyTorch requires -std=c++14. Please remove -std=c++ settings in your environment.")
 endif()

-set(CMAKE_CXX_STANDARD 14)
+set(CMAKE_CXX_STANDARD 17)
 set(CMAKE_C_STANDARD 11)

 set(CMAKE_EXPORT_COMPILE_COMMANDS ON)
--
2.38.1.windows.1

The reason for this is https://github.com/pytorch/text/blob/v0.11.0-rc3/requirements.txt#L11, and not wanting to leave out a potentially very substantial performance improvement from jekbradbury/revtok#4 (the PR notes a 100-fold improvement in some cases by interning strings), which is in fact the only change after 0.0.3.
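As a rough illustration of what interning does (this is not revtok's code, and the token is invented for the example), `sys.intern` returns one shared object for equal strings, so repeated tokens can be stored once and compared by identity:

```python
import sys

# Build two equal strings at runtime so they start out as separately created values.
parts = ["tok", "en"]
a = "".join(parts)
b = "".join(parts)

# Interning maps equal strings to a single canonical object.
a_interned = sys.intern(a)
b_interned = sys.intern(b)

same_value = a == b
same_object_after_interning = a_interned is b_interned
```

The sketch only shows the identity guarantee; it makes no claim about the size of the speed-up, which depends on the workload.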
It might be better to patch, just to reference the PyPI version, but OK.
You mean use 0.0.3 and carry a patch for jekbradbury/revtok#4? I don't mind either way. That repo looks pretty dead otherwise (last commit 4.5 years ago), and I don't expect anyone to start using micro versions even if it does get revived, which is why I chose it.
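The choice of `0.0.3.1` relies on a fourth version component sorting after 0.0.3 and before any later 0.0.4. A minimal sketch of that ordering follows; the helper names are hypothetical, and it handles only plain numeric releases, not the full version schemes implemented by pip or conda:

```python
def release_tuple(version: str) -> tuple:
    """Split a purely numeric dotted version into a tuple of ints."""
    return tuple(int(part) for part in version.split("."))


def pad(release: tuple, length: int) -> tuple:
    """Pad with zeros so that e.g. 0.0.3 and 0.0.3.0 compare equal."""
    return release + (0,) * (length - len(release))


def compare(a: str, b: str) -> int:
    """Return -1, 0 or 1 depending on whether a sorts before, with, or after b."""
    ra, rb = release_tuple(a), release_tuple(b)
    n = max(len(ra), len(rb))
    ra, rb = pad(ra, n), pad(rb, n)
    return (ra > rb) - (ra < rb)


ordered = sorted(["0.0.4", "0.0.3.1", "0.0.3"], key=lambda v: pad(release_tuple(v), 4))
```

Under these assumptions the exact pin `revtok ==0.0.3.1` in the torchtext recipe selects the commit-based build and would not match a plain 0.0.3 or a later 0.0.4.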