Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Unify PyUncode_Count and unicode_count #97982

Closed
sobolevn opened this issue Oct 6, 2022 · 2 comments
Closed

Unify PyUncode_Count and unicode_count #97982

sobolevn opened this issue Oct 6, 2022 · 2 comments
Assignees
Labels
interpreter-core (Objects, Python, Grammar, and Parser dirs) type-feature A feature request or enhancement

Comments

@sobolevn
Copy link
Member

sobolevn commented Oct 6, 2022

Feature or enhancement

Right now PyUnicode_Count from

Py_ssize_t
PyUnicode_Count(PyObject *str,
PyObject *substr,
Py_ssize_t start,
Py_ssize_t end)
{
Py_ssize_t result;
int kind1, kind2;
const void *buf1 = NULL, *buf2 = NULL;
Py_ssize_t len1, len2;
if (ensure_unicode(str) < 0 || ensure_unicode(substr) < 0)
return -1;
kind1 = PyUnicode_KIND(str);
kind2 = PyUnicode_KIND(substr);
if (kind1 < kind2)
return 0;
len1 = PyUnicode_GET_LENGTH(str);
len2 = PyUnicode_GET_LENGTH(substr);
ADJUST_INDICES(start, end, len1);
if (end - start < len2)
return 0;
buf1 = PyUnicode_DATA(str);
buf2 = PyUnicode_DATA(substr);
if (kind2 != kind1) {
buf2 = unicode_askind(kind2, buf2, len2, kind1);
if (!buf2)
goto onError;
}
switch (kind1) {
case PyUnicode_1BYTE_KIND:
if (PyUnicode_IS_ASCII(str) && PyUnicode_IS_ASCII(substr))
result = asciilib_count(
((const Py_UCS1*)buf1) + start, end - start,
buf2, len2, PY_SSIZE_T_MAX
);
else
result = ucs1lib_count(
((const Py_UCS1*)buf1) + start, end - start,
buf2, len2, PY_SSIZE_T_MAX
);
break;
case PyUnicode_2BYTE_KIND:
result = ucs2lib_count(
((const Py_UCS2*)buf1) + start, end - start,
buf2, len2, PY_SSIZE_T_MAX
);
break;
case PyUnicode_4BYTE_KIND:
result = ucs4lib_count(
((const Py_UCS4*)buf1) + start, end - start,
buf2, len2, PY_SSIZE_T_MAX
);
break;
default:
Py_UNREACHABLE();
}
assert((kind2 != kind1) == (buf2 != PyUnicode_DATA(substr)));
if (kind2 != kind1)
PyMem_Free((void *)buf2);
return result;
onError:
assert((kind2 != kind1) == (buf2 != PyUnicode_DATA(substr)));
if (kind2 != kind1)
PyMem_Free((void *)buf2);
return -1;
}
and unicode_count from

cpython/Objects/unicodeobject.c

Lines 10854 to 10916 in cbdeda8

static PyObject *
unicode_count(PyObject *self, PyObject *args)
{
PyObject *substring = NULL; /* initialize to fix a compiler warning */
Py_ssize_t start = 0;
Py_ssize_t end = PY_SSIZE_T_MAX;
PyObject *result;
int kind1, kind2;
const void *buf1, *buf2;
Py_ssize_t len1, len2, iresult;
if (!parse_args_finds_unicode("count", args, &substring, &start, &end))
return NULL;
kind1 = PyUnicode_KIND(self);
kind2 = PyUnicode_KIND(substring);
if (kind1 < kind2)
return PyLong_FromLong(0);
len1 = PyUnicode_GET_LENGTH(self);
len2 = PyUnicode_GET_LENGTH(substring);
ADJUST_INDICES(start, end, len1);
if (end - start < len2)
return PyLong_FromLong(0);
buf1 = PyUnicode_DATA(self);
buf2 = PyUnicode_DATA(substring);
if (kind2 != kind1) {
buf2 = unicode_askind(kind2, buf2, len2, kind1);
if (!buf2)
return NULL;
}
switch (kind1) {
case PyUnicode_1BYTE_KIND:
iresult = ucs1lib_count(
((const Py_UCS1*)buf1) + start, end - start,
buf2, len2, PY_SSIZE_T_MAX
);
break;
case PyUnicode_2BYTE_KIND:
iresult = ucs2lib_count(
((const Py_UCS2*)buf1) + start, end - start,
buf2, len2, PY_SSIZE_T_MAX
);
break;
case PyUnicode_4BYTE_KIND:
iresult = ucs4lib_count(
((const Py_UCS4*)buf1) + start, end - start,
buf2, len2, PY_SSIZE_T_MAX
);
break;
default:
Py_UNREACHABLE();
}
result = PyLong_FromSsize_t(iresult);
assert((kind2 == kind1) == (buf2 == PyUnicode_DATA(substring)));
if (kind2 != kind1)
PyMem_Free((void *)buf2);
return result;
}
share a lot of code.

They can be unified, because the do the same thing.

Pitch

Citing @encukou:

Apparently unicode_count missed an optimization in 2011, otherwise they're equivalent (except arg parsing & converting the return value). Merging them could add the optimization to unicode_count.
If you want to work on that, note that there's also anylib_count that duplicates the main switch.

Previous discussion

Link: #96929

PR in the works.

@sobolevn sobolevn added type-feature A feature request or enhancement interpreter-core (Objects, Python, Grammar, and Parser dirs) labels Oct 6, 2022
@sobolevn sobolevn self-assigned this Oct 6, 2022
sobolevn added a commit to sobolevn/cpython that referenced this issue Oct 7, 2022
vstinner added a commit that referenced this issue Oct 11, 2022
asciilib_count() is the same than ucs1lib_count(): the code is not
specialized for ASCII strings, so it's not worth it to have a
separated function. Remove asciilib_count() function.
@vstinner
Copy link
Member

I remove asciilib_count() function in commit df3a6d9.

vstinner pushed a commit that referenced this issue Oct 12, 2022
Add unicode_count_impl() to factorize PyUnicode_Count()
and unicode_count() code.
@vstinner
Copy link
Member

Fixed by ccab67b

carljm added a commit to carljm/cpython that referenced this issue Oct 14, 2022
* main: (38 commits)
  pythongh-98251: Allow venv to pass along PYTHON* variables to pip and ensurepip when they do not impact path resolution (pythonGH-98259)
  Bpo-41246: IOCP Proactor avoid callback code duplication (python#21399)
  bpo-46364: Use sockets for stdin of asyncio only on AIX (python#30596)
  pythongh-98178: syslog() is not thread-safe on macOS (python#98213)
  Mark all targets in `Doc/Makefile` as `PHONY` (pythonGH-98189)
  pythongh-97982: Factorize PyUnicode_Count() and unicode_count() code (python#98025)
  pythongh-96265: Formatting changes for faq/general (python#98129)
  tutorial: remove "with single quotes" (python#98204)
  pythongh-97669: Remove Tools/scripts/startuptime.py (python#98214)
  signalmodule.c uses _PyErr_WriteUnraisableMsg() (python#98217)
  pythongh-97669: Fix test_tools reference leak (python#98216)
  pythongh-97669: Create Tools/patchcheck/ directory (python#98186)
  pythongh-65046: Link to logging cookbook from asyncio docs (python#98207)
  Formatting fixes in contextlib docs (python#98111)
  pythongh-95276: Add callable entry to the glossary (python#95738)
  pythongh-96130: Rephrase use of "typecheck" verb for clarity (python#98144)
  Fix some incorrect indentation around the main switch (python#98177)
  pythongh-98172: Fix formatting in `except*` docs (python#98173)
  pythongh-97982: Remove asciilib_count() (python#98164)
  pythongh-95756: Free and NULL-out code caches when needed (pythonGH-98181)
  ...
carljm added a commit to carljm/cpython that referenced this issue Oct 14, 2022
* main: (37 commits)
  pythongh-98251: Allow venv to pass along PYTHON* variables to pip and ensurepip when they do not impact path resolution (pythonGH-98259)
  Bpo-41246: IOCP Proactor avoid callback code duplication (python#21399)
  bpo-46364: Use sockets for stdin of asyncio only on AIX (python#30596)
  pythongh-98178: syslog() is not thread-safe on macOS (python#98213)
  Mark all targets in `Doc/Makefile` as `PHONY` (pythonGH-98189)
  pythongh-97982: Factorize PyUnicode_Count() and unicode_count() code (python#98025)
  pythongh-96265: Formatting changes for faq/general (python#98129)
  tutorial: remove "with single quotes" (python#98204)
  pythongh-97669: Remove Tools/scripts/startuptime.py (python#98214)
  signalmodule.c uses _PyErr_WriteUnraisableMsg() (python#98217)
  pythongh-97669: Fix test_tools reference leak (python#98216)
  pythongh-97669: Create Tools/patchcheck/ directory (python#98186)
  pythongh-65046: Link to logging cookbook from asyncio docs (python#98207)
  Formatting fixes in contextlib docs (python#98111)
  pythongh-95276: Add callable entry to the glossary (python#95738)
  pythongh-96130: Rephrase use of "typecheck" verb for clarity (python#98144)
  Fix some incorrect indentation around the main switch (python#98177)
  pythongh-98172: Fix formatting in `except*` docs (python#98173)
  pythongh-97982: Remove asciilib_count() (python#98164)
  pythongh-95756: Free and NULL-out code caches when needed (pythonGH-98181)
  ...
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
interpreter-core (Objects, Python, Grammar, and Parser dirs) type-feature A feature request or enhancement
Projects
None yet
Development

No branches or pull requests

2 participants