Improvements in regular expression doc #114357

Open
wants to merge 23 commits into
base: main
817b3f3
Doc: Fix the array.fromfile method doc
adorilson Sep 1, 2020
6b53456
gh-106320: Remove private _PyInterpreterState functions (#106335)
vstinner Jul 2, 2023
1b4d152
[Doc] Divide RE Syntax in subsections
adorilson Jan 20, 2024
6ad009c
[DOC] Add crasis surrounding some RE-matched words
adorilson Jan 20, 2024
94f765f
[DOC] Make clearer what will be matched with a RE
adorilson Jan 20, 2024
292672b
Doc: minor change
adorilson Dec 30, 2023
65b4278
Merge branch 'python:main' into re_improvements
adorilson Feb 3, 2024
fe7389a
Merge branch 'python:main' into re_improvements
adorilson Feb 4, 2024
8394cd3
Merge branch 'python:main' into re_improvements
adorilson Feb 5, 2024
e2023e0
Doc: Put PatternError's attributes inside a table instead of regular …
adorilson Feb 5, 2024
cdaa9ae
Doc: Fix PatternError's attributes
adorilson Feb 5, 2024
bb98dad
Doc: fix lint issue
adorilson Feb 5, 2024
22ffed7
Merge branch 'main' into re_improvements
adorilson Feb 25, 2024
6a1e74e
Merge branch 'python:main' into re_improvements
adorilson Sep 25, 2024
6b357af
Doc: Add extension notation header
adorilson Sep 25, 2024
8f7356d
Doc: Add some more backticks
adorilson Sep 25, 2024
6ed5109
Merge branch 'python:main' into re_improvements
adorilson Sep 26, 2024
9c17aa8
Doc: Fix malformed hyperlink target
adorilson Sep 26, 2024
acb2e38
Merge branch 'main' into re_improvements
adorilson Sep 26, 2024
4d3b8dd
Merge branch 'python:main' into re_improvements
adorilson Oct 1, 2024
643070c
Merge branch 'main' into re_improvements
adorilson Oct 3, 2024
17baf98
Docs: add a 'also' for $ special character and RE examples reference …
adorilson Oct 3, 2024
4e12f7c
Docs: add some RE raw string notation references
adorilson Oct 3, 2024
101 changes: 71 additions & 30 deletions Doc/library/re.rst
Expand Up @@ -33,7 +33,8 @@ usage of the backslash in string literals now generate a :exc:`SyntaxWarning`
and in the future this will become a :exc:`SyntaxError`. This behaviour
will happen even if it is a valid escape sequence for a regular expression.

The solution is to use Python's raw string notation for regular expression
The solution is to use Python's :ref:`raw string notation <raw-string-notation>`
for regular expression
patterns; backslashes are not handled in any special way in a string literal
prefixed with ``'r'``. So ``r"\n"`` is a two-character string containing
``'\'`` and ``'n'``, while ``"\n"`` is a one-character string containing a
Expand Down Expand Up @@ -83,6 +84,12 @@ characters, so ``last`` matches the string ``'last'``. (In the rest of this
section, we'll write RE's in ``this special style``, usually without quotes, and
strings to be matched ``'in single quotes'``.)


.. _re-special-characters:

Special characters
^^^^^^^^^^^^^^^^^^

Some characters, like ``'|'`` or ``'('``, are special. Special
characters either stand for classes of ordinary characters, or affect
how the regular expressions around them are interpreted.
Expand All @@ -93,7 +100,6 @@ directly nested. This avoids ambiguity with the non-greedy modifier suffix
repetition to an inner repetition, parentheses may be used. For example,
the expression ``(?:a{6})*`` matches any multiple of six ``'a'`` characters.
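
A quick check of the nesting rule described in this hunk. This is an illustrative sketch only and not part of the PR; the test strings are made up.

```python
import re

# Repetition operators cannot be nested directly, so the inner repetition
# is wrapped in a non-capturing group before the outer one is applied.
six_as = re.compile(r"(?:a{6})*")

twelve = six_as.fullmatch("a" * 12)   # two groups of six
seven = six_as.fullmatch("a" * 7)     # seven is not a multiple of six
empty = six_as.fullmatch("")          # zero repetitions also match

# The directly nested form is expected to be rejected at compile time.
try:
    re.compile(r"a{6}*")
    nested_ok = True
except re.error:
    nested_ok = False
```

`fullmatch` is used here so that the whole string has to consist of groups of six; `search` or `match` would also succeed on the seven-character string by matching only its first six characters.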


The special characters are:

.. index:: single: . (dot); in regular expressions
Expand All @@ -114,31 +120,33 @@ The special characters are:
``$``
Matches the end of the string or just before the newline at the end of the
string, and in :const:`MULTILINE` mode also matches before a newline. ``foo``
matches both 'foo' and 'foobar', while the regular expression ``foo$`` matches
only 'foo'. More interestingly, searching for ``foo.$`` in ``'foo1\nfoo2\n'``
matches 'foo2' normally, but 'foo1' in :const:`MULTILINE` mode; searching for
matches both ``'foo'`` and ``'foobar'``, while the regular expression ``foo$``
matches only ``'foo'``. More interestingly, searching for ``foo.$`` in
``'foo1\nfoo2\n'`` matches ``'foo2'`` normally, but also ``'foo1'`` in
:const:`MULTILINE` mode; searching for
a single ``$`` in ``'foo\n'`` will find two (empty) matches: one just before
the newline, and one at the end of the string.
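
The ``$`` behaviour described above can be reproduced with a short sketch. The variable names are illustrative and not part of the PR.

```python
import re

text = "foo1\nfoo2\n"

# Without MULTILINE, $ matches only at the end of the string (or just
# before a trailing newline), so only 'foo2' is found.
default_matches = re.findall(r"foo.$", text)

# With MULTILINE, $ also matches before every newline.
multiline_matches = re.findall(r"foo.$", text, flags=re.MULTILINE)

# A bare $ in 'foo\n' yields two empty matches: one just before the
# newline and one at the very end of the string.
dollar_positions = [m.start() for m in re.finditer(r"$", "foo\n")]
```

Note that `re.search` with `MULTILINE` would return only the first hit, `'foo1'`; `findall` is used here to show every position where the pattern matches.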

.. index:: single: * (asterisk); in regular expressions

``*``
Causes the resulting RE to match 0 or more repetitions of the preceding RE, as
many repetitions as are possible. ``ab*`` will match 'a', 'ab', or 'a' followed
by any number of 'b's.
many repetitions as are possible. ``ab*`` will match ``'a'``, ``'ab'``, or
``'a'`` followed by any number of ``'b'``\ s.

.. index:: single: + (plus); in regular expressions

``+``
Causes the resulting RE to match 1 or more repetitions of the preceding RE.
``ab+`` will match 'a' followed by any non-zero number of 'b's; it will not
match just 'a'.
``ab+`` will match ``'a'`` followed by any non-zero number of ``'b'``\ s; it
will not match just ``'a'``.

.. index:: single: ? (question mark); in regular expressions

``?``
Causes the resulting RE to match 0 or 1 repetitions of the preceding RE.
``ab?`` will match either 'a' or 'ab'.
``ab?`` will match either ``'a'`` or ``'ab'``.
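
For reviewers who want to confirm the three quantifier descriptions side by side, here is a minimal sketch. It is illustrative only; the `results` dictionary is a made-up helper, not part of the docs.

```python
import re

# Map (pattern, candidate string) to whether the whole string matches.
cases = [
    (r"ab*", "a"),     # * allows zero 'b's
    (r"ab*", "abbb"),  # ... or any number of them
    (r"ab+", "a"),     # + needs at least one 'b'
    (r"ab+", "abb"),
    (r"ab?", "ab"),    # ? allows zero or one 'b'
    (r"ab?", "abb"),   # ... but not two
]

results = {
    (pattern, s): re.fullmatch(pattern, s) is not None
    for pattern, s in cases
}

for (pattern, s), matched in results.items():
    print(f"{pattern!r:8} {s!r:8} {matched}")
```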

.. index::
single: *?; in regular expressions
Expand Down Expand Up @@ -224,7 +232,8 @@ The special characters are:
``'*'``, ``'?'``, and so forth), or signals a special sequence; special
sequences are discussed below.

If you're not using a raw string to express the pattern, remember that Python
If you're not using a :ref:`raw string <raw-string-notation>` to express the
pattern, remember that Python
also uses the backslash as an escape sequence in string literals; if the escape
sequence isn't recognized by Python's parser, the backslash and subsequent
character are included in the resulting string. However, if Python would
Expand Down Expand Up @@ -315,6 +324,12 @@ The special characters are:
special sequence, described below. To match the literals ``'('`` or ``')'``,
use ``\(`` or ``\)``, or enclose them inside a character class: ``[(]``, ``[)]``.
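
Both ways of matching literal parentheses can be compared with a small sketch. The sample text `call(arg)` is made up for illustration.

```python
import re

text = "call(arg)"

# Escape the parentheses with backslashes...
escaped = re.search(r"\((\w+)\)", text)

# ...or put each one inside a character class.
in_class = re.search(r"[(](\w+)[)]", text)

# In both patterns the unescaped pair in the middle is an ordinary
# capturing group, so group(1) holds the text between the literals.
print(escaped.group(1), in_class.group(1))
```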


.. _re_extension_notation:

Extension notation
^^^^^^^^^^^^^^^^^^

.. index:: single: (?; in regular expressions

``(?...)``
Expand Down Expand Up @@ -514,6 +529,9 @@ The special characters are:

.. _re-special-sequences:

Special sequences
^^^^^^^^^^^^^^^^^

The special sequences consist of ``'\'`` and a character from the list below.
If the ordinary character is not an ASCII digit or an ASCII letter, then the
resulting RE will match the second character. For example, ``\$`` matches the
Expand Down Expand Up @@ -586,7 +604,7 @@ character ``'$'``.
(that is, any character in Unicode character category `[Nd]`__).
This includes ``[0-9]``, and also many other digit characters.

Matches ``[0-9]`` if the :py:const:`~re.ASCII` flag is used.
Matches only ``[0-9]`` if the :py:const:`~re.ASCII` flag is used.

__ https://www.unicode.org/versions/Unicode15.0.0/ch04.pdf#G134153

Expand All @@ -600,7 +618,7 @@ character ``'$'``.
Matches any character which is not a decimal digit.
This is the opposite of ``\d``.

Matches ``[^0-9]`` if the :py:const:`~re.ASCII` flag is used.
Matches only ``[^0-9]`` if the :py:const:`~re.ASCII` flag is used.

.. index:: single: \s; in regular expressions

Expand All @@ -610,7 +628,7 @@ character ``'$'``.
This includes ``[ \t\n\r\f\v]``, and also many other characters, for example the
non-breaking spaces mandated by typography rules in many languages.

Matches ``[ \t\n\r\f\v]`` if the :py:const:`~re.ASCII` flag is used.
Matches only ``[ \t\n\r\f\v]`` if the :py:const:`~re.ASCII` flag is used.

For 8-bit (bytes) patterns:
Matches characters considered whitespace in the ASCII character set;
Expand All @@ -622,7 +640,7 @@ character ``'$'``.
Matches any character which is not a whitespace character. This is
the opposite of ``\s``.

Matches ``[^ \t\n\r\f\v]`` if the :py:const:`~re.ASCII` flag is used.
Matches only ``[^ \t\n\r\f\v]`` if the :py:const:`~re.ASCII` flag is used.
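
The effect of the "only" wording for the :py:const:`~re.ASCII` flag can be seen in a short sketch. The sample characters are chosen for illustration and are not taken from the PR.

```python
import re

# ARABIC-INDIC DIGITS ONE, TWO, THREE: decimal digits, but not ASCII.
arabic_indic = "\u0661\u0662\u0663"

unicode_digits = re.fullmatch(r"\d+", arabic_indic)
ascii_digits = re.fullmatch(r"\d+", arabic_indic, flags=re.ASCII)

# NO-BREAK SPACE: whitespace in Unicode, but outside [ \t\n\r\f\v].
nbsp_text = "a\u00a0b"

unicode_spaces = re.findall(r"\s", nbsp_text)
ascii_spaces = re.findall(r"\s", nbsp_text, flags=re.ASCII)
```

With the flag, ``\d`` and ``\s`` are restricted to the ASCII ranges quoted in the text, so the non-ASCII digit string and the no-break space are no longer matched.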

.. index:: single: \w; in regular expressions

Expand All @@ -633,7 +651,7 @@ character ``'$'``.
(as defined by :py:meth:`str.isalnum`),
as well as the underscore (``_``).

Matches ``[a-zA-Z0-9_]`` if the :py:const:`~re.ASCII` flag is used.
Matches only ``[a-zA-Z0-9_]`` if the :py:const:`~re.ASCII` flag is used.

For 8-bit (bytes) patterns:
Matches characters considered alphanumeric in the ASCII character set;
Expand All @@ -649,7 +667,7 @@ character ``'$'``.
By default, matches non-underscore (``_``) characters
for which :py:meth:`str.isalnum` returns ``False``.

Matches ``[^a-zA-Z0-9_]`` if the :py:const:`~re.ASCII` flag is used.
Matches only ``[^a-zA-Z0-9_]`` if the :py:const:`~re.ASCII` flag is used.

If the :py:const:`~re.LOCALE` flag is used,
matches characters which are neither alphanumeric in the current locale
Expand All @@ -660,6 +678,12 @@ character ``'$'``.
``\Z``
Matches only at the end of the string.


.. _re-escape-sequences:

Escape sequences
^^^^^^^^^^^^^^^^^

.. index::
single: \a; in regular expressions
single: \b; in regular expressions
Expand Down Expand Up @@ -1191,25 +1215,26 @@ Exceptions
error if a string contains no match for a pattern. The ``PatternError`` instance has
the following additional attributes:

.. attribute:: msg

The unformatted error message.

.. attribute:: pattern
.. list-table::
:header-rows: 1

The regular expression pattern.
* - Attribute
- Meaning

.. attribute:: pos
* - .. attribute:: msg
- The unformatted error message.

The index in *pattern* where compilation failed (may be ``None``).
* - .. attribute:: pattern
- The regular expression pattern.

.. attribute:: lineno
* - .. attribute:: pos
- The index in *pattern* where compilation failed (may be ``None``).

The line corresponding to *pos* (may be ``None``).
* - .. attribute:: lineno
- The line corresponding to *pos* (may be ``None``).

.. attribute:: colno

The column corresponding to *pos* (may be ``None``).
* - .. attribute:: colno
- The column corresponding to *pos* (may be ``None``).

.. versionchanged:: 3.5
Added additional attributes.
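
A minimal sketch of reading the attributes listed in the table. The pattern is a made-up example with an unclosed group; `re.error` is caught because it works on every Python version (from 3.13 it is an alias of ``PatternError``).

```python
import re

bad_pattern = r"a(b"  # hypothetical pattern: the group is never closed

try:
    re.compile(bad_pattern)
    caught = None
except re.error as exc:
    caught = exc

if caught is not None:
    # msg is the unformatted message; pos, lineno and colno locate the
    # failure inside the pattern (each may be None for some errors).
    print(caught.msg)
    print(caught.pattern, caught.pos, caught.lineno, caught.colno)
```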
Expand Down Expand Up @@ -1578,6 +1603,8 @@ Regular Expression Examples
---------------------------


.. _checking-for-a-pair:

Checking for a Pair
^^^^^^^^^^^^^^^^^^^

Expand Down Expand Up @@ -1632,6 +1659,8 @@ To find out what card the pair consists of, one could use the
'a'


.. _simulating-scanf:

Simulating scanf()
^^^^^^^^^^^^^^^^^^

Expand Down Expand Up @@ -1719,6 +1748,8 @@ beginning with ``'^'`` will match at the beginning of each line. ::
<re.Match object; span=(4, 5), match='X'>


.. _making-a-phonebook:

Making a Phonebook
^^^^^^^^^^^^^^^^^^

Expand Down Expand Up @@ -1780,6 +1811,8 @@ house number from the street name:
['Heather', 'Albrecht', '548.326.4584', '919', 'Park Place']]


.. _text-munging:

Text Munging
^^^^^^^^^^^^

Expand All @@ -1800,6 +1833,8 @@ in each word of a sentence except for the first and last characters::
'Pofsroser Aodlambelk, plasee reoprt yuor asnebces potlmrpy.'


.. _finding-all-adverbs:

Finding all Adverbs
^^^^^^^^^^^^^^^^^^^

Expand All @@ -1813,6 +1848,8 @@ the following manner::
['carefully', 'quickly']


.. _finding-all-adverbs-and-their-positions:

Finding all Adverbs and their Positions
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^

Expand All @@ -1829,6 +1866,8 @@ to find all of the adverbs *and their positions* in some text, they would use
40-47: quickly


.. _raw-string-notation:

Raw String Notation
^^^^^^^^^^^^^^^^^^^

Expand All @@ -1853,6 +1892,8 @@ functionally identical::
<re.Match object; span=(0, 1), match='\\'>


.. _writing-a-tokenizer:

Writing a Tokenizer
^^^^^^^^^^^^^^^^^^^

Expand Down