bump to v2.2.0
v2.2.0

- Introducing the concept of "Extensibility": All patterns that
  correspond to classes within "pregex.meta" have numerous assertions
  imposed upon them, that are essential in order for them to be able
  to match what they are supposed to. These assertions are mostly word
  boundaries that are placed at the start and at the end of the pattern,
  but there might be other types of assertions as well. Helpful as they
  are for matching, these assertions might complicate things when it comes
  to the pattern being used as a building block of a larger pattern. For
  that reason, an "is_extensible" parameter has been added to the
  constructor of every meta class. As a general rule of thumb, you should
  only set "is_extensible" to "True" if you wish to use a pattern as part
  of a larger one. For matching purposes, have "is_extensible" take its
  default value of "False".

- Class "pregex.core.pre.Pregex" now contains a set of
  "{get, iterate}_named_captures" methods through which one has access
  to any named captured groups stored within dictionaries.

- Parameter "pattern" of class "pregex.core.pre.Pregex" constructor
  now defaults to the empty string, thus replacing "pregex.core.pre.Empty"
  which has now been removed.

- All classes within "pregex.core.operators" can now receive one or even
  no arguments without throwing a "NotEnoughArgumentsException" exception.
  This makes it easier to do stuff like "pre = op.Either(*patterns)"
  without having to check whether list "patterns" contains more than
  one pattern.

- Applying the alternation operator between a pattern "P" and the
  empty string pattern now results in pattern "P" itself.

- Wrapping the empty pattern in either a capturing or a non-capturing
  group now results in the empty pattern itself.

- Classes "pregex.core.assertions.{__PositiveLookaround, __NegativeLookaround}"
  have been removed and replaced by a single class "pregex.core.assertions.__Lookaround".

- Classes "pregex.core.assertions.{FollowedBy, PrecededBy, EnclosedBy}" are now
  able to receive more than one assertion pattern, just like their negative
  counterparts.

- Class "pregex.meta.essentials.Date" now receives date formats in a list
  instead of as arbitrary arguments.

- Corrected mistake where method "pregex.core.pre.Pregex.not_enclosed_by"
  could receive multiple arguments.

- Updated documentation and README.

- Modified some existing tests and added some more.
manoss96 committed Sep 19, 2022
1 parent 061d536 commit 573af3f
Showing 24 changed files with 2,148 additions and 957 deletions.
224 changes: 218 additions & 6 deletions README.md
@@ -28,7 +28,7 @@ pip install pregex


<!-- Usage example -->
## Usage example
## Usage Example

In PRegEx, everything is a Programmable Regular Expression, or "Pregex" for short. This makes it easy for simple Pregex instances to be combined into more complex ones! Within the code snippet below, we construct a Pregex instance that will match any URL that ends with either ".com" or ".org" as well as any IP address for which a 4-digit port number is specified. Furthermore, in the case of a URL, we would like for its domain name to be separately captured as well.

@@ -55,7 +55,7 @@ tld = '.' + Either('com', 'org')

ip_octet = AnyDigit().at_least_at_most(n=1, m=3)

port_number = 4 * AnyDigit()
port_number = (AnyDigit() - '0') + 3 * AnyDigit()

# Combine sub-patterns together.
pre: Pregex = \
@@ -73,7 +73,7 @@ regex = pre.get_pattern()

This is the pattern that we just built. Yikes!
```
(?:https?:\/\/)?(?:(?:www\.)?([A-Za-z\d][A-Za-z\d\-.]{1,61}[A-Za-z\d])\.(?:com|org)|(?:\d{1,3}\.){3}\d{1,3}:\d{4})
(?:https?:\/\/)?(?:(?:www\.)?([A-Za-z\d][A-Za-z\d\-.]{1,61}[A-Za-z\d])\.(?:com|org)|(?:\d{1,3}\.){3}\d{1,3}:[1-9]\d{3})
```
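
As a quick sanity check of the updated pattern, the following sketch uses only Python's standard `re` module rather than PRegEx itself. The sample strings are made up for illustration.

```python
import re

# The updated pattern, copied verbatim from the block above.
regex = re.compile(
    r"(?:https?:\/\/)?(?:(?:www\.)?([A-Za-z\d][A-Za-z\d\-.]{1,61}[A-Za-z\d])\.(?:com|org)"
    r"|(?:\d{1,3}\.){3}\d{1,3}:[1-9]\d{3})"
)

# Made-up sample strings, used only for illustration.
url_match = regex.fullmatch("https://www.example.com")
ip_match = regex.fullmatch("192.168.1.1:8080")
leading_zero = regex.fullmatch("192.168.1.1:0080")

print(url_match.group(1))    # the separately captured domain name
print(ip_match.group(0))     # the whole "IP:port" string
print(leading_zero is None)  # a port can no longer start with "0"
```

The last check reflects the change from `\d{4}` to `[1-9]\d{3}`: the first digit of the port number must now be nonzero.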

Besides having access to its underlying pattern, we can use a Pregex instance to find matches within a piece of text. Consider for example the following string:
@@ -114,17 +114,229 @@ from pregex.core.classes import AnyDigit
from pregex.core.operators import Either
from pregex.meta.essentials import HttpUrl, IPv4

port_number = 4 * AnyDigit()
port_number = (AnyDigit() - '0') + 3 * AnyDigit()

pre: Pregex = Either(
HttpUrl(capture_domain=True),
IPv4() + ":" + port_number
HttpUrl(capture_domain=True, is_extensible=True),
IPv4(is_extensible=True) + ":" + port_number
)
```

By using classes found within the *pregex.meta* subpackage, we were able to
construct more or less the same pattern as before only much more easily!
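
The `is_extensible=True` arguments matter here. According to the v2.2.0 release notes, meta patterns carry assertions (mostly word boundaries) at their edges, and these can get in the way once the pattern is embedded in a larger one. The sketch below illustrates the general effect with plain `re`. The patterns `guarded` and `bare` are hypothetical stand-ins, not what PRegEx actually generates.

```python
import re

# Hypothetical stand-ins, not actual PRegEx output.
guarded = r"\b\d+\b"  # a number guarded by word boundaries
bare = r"\d+"         # the same number without boundary assertions

# On its own, the guarded pattern works as intended.
standalone = re.search(guarded, "port 8080 open")

# Prefixing a letter breaks the guarded version: there is no word
# boundary between "v" and "1", so the leading \b can never succeed.
embedded_guarded = re.search("v" + guarded, "release v10")
embedded_bare = re.search("v" + bare, "release v10")

print(standalone.group())        # 8080
print(embedded_guarded is None)  # True
print(embedded_bare.group())     # v10
```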

## Solving Wordle with PRegEx

We are now going to see another example that better exhibits the *programmable* nature of PRegEx.
More specifically, we will be creating a Wordle solver function that, given all currently known
information as well as access to a 5-letter word dictionary, utilizes PRegEx in order to return
a list of candidate words to choose from as a possible solution to the problem.

### Formulating what is known

First things first, we must think of a way to represent what is known so far regarding the
word that we're trying to guess. This information can be encapsulated into three distinct
sets of letters:

1. **Green letters**: Letters that are included in the word, whose position within it is known.
2. **Yellow letters**: Letters that are included in the word, and while their exact position is
unknown, there are one or more positions that we can rule out.
3. **Gray letters**: Letters that are not included in the word.

Green letters can be represented by using a dictionary that maps integers (positions) to strings (letters).
For example, ``{4 : 'T'}`` indicates that the word we are looking for contains the letter ``T`` in its
fourth position. Yellow letters can also be represented as a dictionary with integer keys, whose values
however are going to be lists of strings instead of regular strings, as a position might have been ruled
out for more than a single letter. For example, ``{1 : ['A', 'R'], 3 : ['P']}`` indicates that even though
the word contains letters ``A``, ``R`` and ``P``, it cannot start with either an ``A`` or an ``R``, nor
can it have the letter ``P`` in its third position. Finally, gray letters can simply be
stored in a list.

In order to have a concrete example to work with, we will be assuming that our current
information about the problem is expressed by the following three data structures:

```python
green: dict[int, str] = {4 : 'T'}
yellow: dict[int, list[str]] = {1 : ['A', 'R'], 3 : ['P']}
gray: list[str] = ['C', 'D', 'L', 'M', 'N', 'Q', 'U']
```

### Initializing a Pregex class instance

Having come up with a way of programmatically formulating the problem, the first step towards
actually solving it would be to create a ``Pregex`` class instance:
```python
from pregex.core.pre import Pregex
wordle = Pregex()
```

Since we aren't providing a ``pattern`` parameter to the class's constructor, it automatically
defaults to the empty string ``''``. Thus, through this instance we now have access to all methods
of the ``Pregex`` class, though we are not really able to match anything with it yet.

### Yellow letter assertions

Before we go on to dictate what the valid letters for each position within the word
are, we are first going to deal with yellow letters, that is, letters which we know are
included in the word that we are looking for, though their position is still uncertain.
Since we know for a fact that the sought-after word contains these letters, we have to
somehow make sure that any candidate word includes them as well. This can easily be
done by using what is known in RegEx lingo as a *positive lookahead assertion*,
represented in PRegEx by the less intimidating *FollowedBy*! Assertions are used in
order to *assert* something about a pattern without really having to *match* any additional
characters. A positive lookahead assertion, in particular, dictates that the pattern to which
it is applied must be followed by some other pattern in order for the former to constitute
a valid match.
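
To make the "assert without matching" idea concrete, here is a minimal sketch using Python's standard `re` module, where a positive lookahead is written as `(?=...)`. The strings are arbitrary examples.

```python
import re

# "foo" must be followed by "bar", but "bar" is not part of the match.
lookahead = re.compile(r"foo(?=bar)")

hit = lookahead.search("foobar")
miss = lookahead.search("foobaz")

print(hit.group())   # foo
print(hit.end())     # 3
print(miss is None)  # True
```

The match ends at index 3, so the asserted text `bar` is left unconsumed and remains available to whatever pattern comes next.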

In PRegEx, one is able to create a ``Pregex`` instance out of applying a positive
lookahead assertion to some pattern ``p1`` by doing the following:

```python
from pregex.core.assertions import FollowedBy

pre = FollowedBy(p1, p2)
```

where both ``p1`` and ``p2`` are either strings or ``Pregex`` instances. Furthermore, in the
case that ``p1`` already is a ``Pregex`` class instance, one can achieve the same result with:

```python
pre = p1.followed_by(p2)
```

Having initialized ``wordle`` as a ``Pregex`` instance, we can simply do
``wordle.followed_by(some_pattern)`` so as to indicate that any potential match
with ``wordle`` must be followed by ``some_pattern``. Recall that ``wordle`` merely
represents the empty string, so we are not really matching anything at this point.
Applying an assertion to the empty string pattern is just a neat little trick that
one can use in order to validate something about their pattern before they even begin
to build it.

Now it's just a matter of figuring out what the value of ``some_pattern`` is.
Surely we can't just do ``wordle = wordle.followed_by(letter)``, as this results
in ``letter`` always having to be at the beginning of the word. Here is what
we can do, however: It follows from the rules of Wordle that all words must be composed of five
letters, any of which is potentially a yellow letter. Thus, every yellow letter is certain
to be preceded by up to four other letters, but no more than that. Therefore, we need a
pattern that represents just that, namely *four letters at most*. By applying quantifier
``at_most(n=4)`` to an instance of ``AnyUppercaseLetter()``, we are able to create such
a pattern. Add a yellow letter to its right and we have our ``some_pattern``. Since there
may be more than one yellow letter, we make sure that we iterate over them all one by one so
as to enforce a separate assertion for each:

```python
from pregex.core.classes import AnyUppercaseLetter

yellow_letters_list: list[str] = [l for letter_list in yellow.values() for l in letter_list]

at_most_four_letters = AnyUppercaseLetter().at_most(n=4)

for letter in yellow_letters_list:
    wordle = wordle.followed_by(at_most_four_letters + letter)
```

By executing the above code snippet we get a ``Pregex`` instance which
represents the following RegEx pattern:

```
(?=[A-Z]{,4}A)(?=[A-Z]{,4}R)(?=[A-Z]{,4}P)
```
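
The effect of these three lookaheads can be checked with the standard `re` module alone. In this sketch, `has_all_yellow` is a hypothetical helper name and the test words are arbitrary.

```python
import re

# The lookahead-only pattern shown above.
yellow_check = re.compile(r"(?=[A-Z]{,4}A)(?=[A-Z]{,4}R)(?=[A-Z]{,4}P)")

def has_all_yellow(word: str) -> bool:
    """Return True if A, R and P each occur within the first five letters."""
    return yellow_check.match(word) is not None

print(has_all_yellow("PARTY"))  # True
print(has_all_yellow("TASTY"))  # False, since there is no R and no P
print(repr(yellow_check.match("PARTY").group()))  # '', nothing is consumed
```

Note that a successful match is the empty string: the assertions only validate what lies ahead, which is why the actual letters still have to be matched by the rest of the pattern.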

### Building valid character classes

After we have made sure that our pattern will reject any words that do not contain
all the yellow letters, we can finally start building the part of the pattern that
will handle the actual matching. This can easily be achieved by performing five iterations,
one for each letter of the word, where at each iteration ``i`` we construct a new character
class, that is then appended to our pattern based on the following logic:

* If the letter that corresponds to the word's i-th position is known, then
make it so that the pattern only matches that letter at that position.

* If the letter that corresponds to the word's i-th position is not known,
then make it so that the pattern matches any letter except for gray letters,
as well as any yellow letters that may have been ruled out for that position.

The following code snippet does just that:

```python
from pregex.core.classes import AnyFrom

for i in range(1, 6):
    if i in green:
        wordle += green[i]
    else:
        invalid_chars_at_pos_i = list(gray)
        if i in yellow:
            invalid_chars_at_pos_i += yellow[i]
        wordle += AnyUppercaseLetter() - AnyFrom(*invalid_chars_at_pos_i)
```

After executing the above code, ``wordle`` will contain the following
RegEx pattern:

```
(?=[A-Z]{,4}A)(?=[A-Z]{,4}R)(?=[A-Z]{,4}P)[S-TBV-ZE-KO-P][A-BR-TV-ZE-KO-P][A-BR-TV-ZE-KO]T[A-BR-TV-ZE-KO-P]
```
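
As a rough check that does not require PRegEx, the pattern above can be fed directly to the standard `re` module. The word list here is a small made-up sample, not a real Wordle dictionary.

```python
import re

# The complete pattern shown above, split in two for readability.
wordle_regex = re.compile(
    r"(?=[A-Z]{,4}A)(?=[A-Z]{,4}R)(?=[A-Z]{,4}P)"
    r"[S-TBV-ZE-KO-P][A-BR-TV-ZE-KO-P][A-BR-TV-ZE-KO]T[A-BR-TV-ZE-KO-P]"
)

# Made-up sample words, used only for illustration.
sample_words = ["PARTY", "TASTY", "APART", "PASTY"]
candidates = [w for w in sample_words if wordle_regex.fullmatch(w)]

print(candidates)  # ['PARTY']
```

`TASTY` and `PASTY` are rejected by the lookaheads because they lack an `R`, while `APART` passes the lookaheads but fails on its first letter, since `A` has been ruled out for position one.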

### Matching from a dictionary

Having built our pattern, the only thing left to do is to actually use it to
match candidate words. Provided that we have access to a text file containing
all possible Wordle words, we are able to invoke our ``Pregex`` instance's
``get_matches`` method in order to scan said text file for any potential matches.

```python
words = wordle.get_matches('word_dictionary.txt', is_path=True)
```
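
For readers who want to see what such a call amounts to, here is a rough standard-library sketch. It assumes that `is_path=True` tells `get_matches` to treat its argument as a file path and to search the file's contents. The helper `find_in_file`, the simplified pattern and the temporary file are all hypothetical and are not part of PRegEx.

```python
import re
import tempfile
from pathlib import Path

def find_in_file(regex: str, path: str) -> list[str]:
    """Hypothetical helper: return all non-overlapping matches in a text file."""
    text = Path(path).read_text(encoding="utf-8")
    return [m.group(0) for m in re.finditer(regex, text)]

# Demo with a throwaway file standing in for 'word_dictionary.txt'.
with tempfile.TemporaryDirectory() as tmp:
    dictionary = Path(tmp) / "word_dictionary.txt"
    dictionary.write_text("PARTY\nTASTY\nAPART\n", encoding="utf-8")
    # Simplified pattern: any five uppercase letters with "T" in fourth position.
    words = find_in_file(r"[A-Z]{3}T[A-Z]", str(dictionary))

print(words)  # ['PARTY', 'TASTY']
```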

### Putting it all together

Finally, we combine together everything we discussed into a single function that
spews out a list of words which satisfy all necessary conditions so that they
constitute possible solutions to the problem.

```python
def wordle_solver(green: dict[int, str], yellow: dict[int, list[str]], gray: list[str]) -> list[str]:

    from pregex.core.pre import Pregex
    from pregex.core.classes import AnyUppercaseLetter, AnyFrom

    # Initialize pattern as the empty string pattern.
    wordle = Pregex()

    # This part ensures that yellow letters
    # will appear at least once within the word.
    yellow_letters_list = [l for letter_list in yellow.values() for l in letter_list]
    at_most_four_letters = AnyUppercaseLetter().at_most(n=4)
    for letter in yellow_letters_list:
        wordle = wordle.followed_by(at_most_four_letters + letter)

    # This part actually dictates the set of valid letters
    # for each position within the word.
    for i in range(1, 6):
        if i in green:
            wordle += green[i]
        else:
            invalid_chars_at_pos_i = list(gray)
            if i in yellow:
                invalid_chars_at_pos_i += yellow[i]
            wordle += AnyUppercaseLetter() - AnyFrom(*invalid_chars_at_pos_i)

    # Match candidate words from dictionary and return them in a list.
    return wordle.get_matches('word_dictionary.txt', is_path=True)
```

By invoking the above function we get the following list of words:

```python
word_candidates = wordle_solver(green, yellow, gray)

print(word_candidates) # This prints ['PARTY']
```

Looks like there is only one candidate word, which means that we
can consider our problem solved!

You can learn more about PRegEx by visiting the [PRegEx Documentation Page][docs-url].


2 changes: 1 addition & 1 deletion docs/requirements.txt
@@ -1 +1 @@
pregex==2.1.1
pregex==2.2.0
2 changes: 1 addition & 1 deletion docs/source/conf.py
@@ -22,7 +22,7 @@
author = 'Manos Stoumpos'

# The full version, including alpha/beta/rc tags
release = '2.1.1'
release = '2.2.0'


# -- General configuration ---------------------------------------------------
2 changes: 1 addition & 1 deletion docs/source/documentation/best-practices.rst
@@ -22,7 +22,7 @@ including the following statements at the top of their Python script:
* Module :py:mod:`pregex.core.operators` is imported as ``op``
* Module :py:mod:`pregex.core.quantifiers` is imported as ``qu``
* Module :py:mod:`pregex.core.tokens` is imported as ``tk``
* Classes :class:`pregex.core.pre.Pregex` and :class:`pregex.core.pre.Empty` are imported as is.
* Class :class:`pregex.core.pre.Pregex` is imported as is.

Take a look at the example below to better understand how this works:

96 changes: 94 additions & 2 deletions docs/source/documentation/covering-the-basics.rst
@@ -160,6 +160,98 @@ that one also makes use of the multiplication operator ``*`` whenever
possible.


The "empty string" pattern
================================

Invoking the ``Pregex`` class's constructor without supplying a
value for parameter ``pattern`` causes said parameter to take its default
value, that is, the empty string ``''``. This is a good starting point
from which to begin constructing your pattern:

.. code-block:: python

   from pregex.core.pre import Pregex

   # Initialize your pattern as the empty string pattern.
   pre = Pregex()

   # Start building your pattern...
   for subpattern in subpatterns:
       if '!' in subpattern.get_pattern():
           pre = pre.concat(subpattern + '?')
       else:
           pre = pre.concat(subpattern + '!')

On top of that, any ``Pregex`` instance whose underlying pattern
is the empty string pattern has the following properties:

1. Applying a quantifier to the empty string pattern results in itself:

.. code-block:: python

   from pregex.core.pre import Pregex
   from pregex.core.quantifiers import OneOrMore

   pre = OneOrMore(Pregex())
   pre.print_pattern() # This prints ''

2. Creating a group out of the empty string pattern results in itself:

.. code-block:: python

   from pregex.core.pre import Pregex
   from pregex.core.groups import Group

   pre = Group(Pregex())
   pre.print_pattern() # This prints ''

3. Applying the alternation operation between the empty string
   pattern and an ordinary pattern results in the latter:

.. code-block:: python

   from pregex.core.pre import Pregex
   from pregex.core.operators import Either

   pre = Either(Pregex(), 'a')
   pre.print_pattern() # This prints 'a'

4. Applying a positive lookahead assertion based on the empty
   string pattern to any pattern results in that pattern:

.. code-block:: python

   from pregex.core.pre import Pregex
   from pregex.core.assertions import FollowedBy

   pre = FollowedBy('a', Pregex())
   pre.print_pattern() # This prints 'a'

The above properties make it easy to write concise code
like the following, without compromising your pattern:

.. code-block:: python

   from pregex.core.pre import Pregex
   from pregex.core.groups import Capture
   from pregex.core.operators import Either
   from pregex.core.quantifiers import OneOrMore

   pre = Either(
       'a',
       'b' if i > 5 else Pregex(),
       OneOrMore('c' if i > 10 else Pregex())
   ) + Capture('d' if i > 15 else Pregex())

This is the underlying pattern of instance ``pre`` when
executing the above code snippet for various values of ``i``:

* For ``i`` equal to ``1`` the resulting pattern is ``a``
* For ``i`` equal to ``6`` the resulting pattern is ``a|b``
* For ``i`` equal to ``11`` the resulting pattern is ``a|b|c+``
* For ``i`` equal to ``16`` the resulting pattern is ``(?:a|b|c+)(d)``


Pattern chaining
==================
Apart from PRegEx's standard pattern-building API which involves
@@ -186,9 +278,9 @@ to construct more complex patterns. This technique is called

.. code-block:: python
from pregex.core.pre import Empty
from pregex.core.pre import Pregex
pre = Empty() \
pre = Pregex() \
.concat('a') \
.either('b') \
.one_or_more() \