Tutorial
Here's an oprex to match CSS color (hashtag-syntax only, e.g. `#ff0000`). It compiles to `#(?:[\dA-Fa-f]{6}|[\dA-Fa-f]{3})`.
/hash/hexes/
    hash = '#'
    hexes = <<|
              |6 of hexdigit
              |3 of hexdigit

        hexdigit: digit A..F a..f
The first line:
/hash/hexes/
Specifies that we want a regex that matches `hash`-then-`hexes`. We then define what `hash` and `hexes` are using an indented sub-block:
--
hash = '#'
This defines `hash` as a string literal. It should match `#` literally.
--
hexes = <<|
This defines `hexes` as an alternation. The `<<|` starts an alternation.
--
|6 of hexdigit
This is the first alternative in our alternation: match `hexdigit` six times.
The `of` keyword is oprex's operator for quantification/repetition.
All `|` in an alternation must vertically align.
--
|3 of hexdigit
This is the second alternative in our alternation: match `hexdigit` three times, for matching e.g. `#f00`.
--
Then we have a blank line. In oprex, a blank line means end-of-alternation/lookaround, so here we close our alternation block.
--
The definition of `hexes` above refers to something called `hexdigit`, so now we need to define it. Again, to define something we use an indented sub-block, this time with deeper indentation since the definition of `hexes` is already indented:
hexdigit: digit A..F a..f
This defines `hexdigit` as a character class. Character classes are defined using a colon `:`. After the colon, we list the character class's members, separated by spaces:
- `digit` is a built-in character class (it compiles to regex's `\d`). Character classes can include other character classes.
- `A..F` and `a..f` are character ranges.
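To see the compiled regex at work, here is a minimal Python sketch that feeds the output regex quoted at the top of this tutorial to the standard `re` module. The helper name `is_css_color` and the `SWAPPED` variant are illustrative assumptions, not part of oprex; the swapped variant only shows why listing the six-digit alternative first matters when the match is not anchored at the end.

```python
import re

# The regex the tutorial says the oprex compiles to.
CSS_COLOR = re.compile(r"#(?:[\dA-Fa-f]{6}|[\dA-Fa-f]{3})")

# Hypothetical variant with the two alternatives swapped, for comparison only.
SWAPPED = re.compile(r"#(?:[\dA-Fa-f]{3}|[\dA-Fa-f]{6})")


def is_css_color(text: str) -> bool:
    """Return True when the whole string is a 3- or 6-digit hex color."""
    return CSS_COLOR.fullmatch(text) is not None


print(is_css_color("#ff0000"))              # True
print(is_css_color("#f00"))                 # True
print(is_css_color("#ff00"))                # False
print(CSS_COLOR.match("#ff0000").group())   # #ff0000
print(SWAPPED.match("#ff0000").group())     # #ff0
```

With the three-digit alternative first, an unanchored match stops after `#ff0`, which is why the oprex lists `6 of hexdigit` before `3 of hexdigit`.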
--
The following oprex matches stuff-inside-brackets, where the brackets can be one of `()`, `{}`, `<>`, or `[]`.
- Sample matches: `<html>`, `{x < 0}`, `[1]`, and `(see footnote [1])`
- Will not match e.g. `f(g(x))`
- Will only partially match `{{citation needed}}`, as `{{citation needed}`
- To also match those, we can combine this example with the balanced parentheses example. But to keep it short and clear, we'll go with the above restrictions.
The output regex will be:
(?>(?P<paren>\()|(?P<curly>\{)|(?P<angle><)|(?P<square>\[))
(?:[^(){}<>\[\]]|(?s:.)(?<!(?>(?(paren)\)|(?(curly)\}|(?(angle)>|(?(square)\]|(?!))))))))*+
(?>(?(paren)\)|(?(curly)\}|(?(angle)>|(?(square)\]|(?!))))))
Here's the oprex:
-- Line 0, things after "--" are comments
/open/contents?/close/ -- Line 1
open = @| -- 2
|paren -- 3
|curly -- 4
|angle -- 5
|square -- 6
-- 7
[paren]: ( -- 8
[curly]: { -- 9
[angle]: < -- 10
[square]: [ -- 11
-- 12
close = @| -- 13
|[paren] ? ')' -- 14
|[curly] ? '}' -- 15
|[angle] ? '>' -- 16
|[square] ? ']' -- 17
|FAIL! -- 18
-- 19
contents = @1.. of <<| -- 20
|not: ( ) { } < > [ ] -- 21
|not_close -- 22
-- 23
not_close = <@> -- 24
|any| -- 25
<!close| -- 26
-- 27
As you can see, comments in oprex start with `--` (a la SQL), not with `#` like in Python. The reason for that is demonstrated by Line 21. Also, while the first and last lines must be blank, comments-only lines are counted as blank.
--
Line 1: `contents?` means that it is optional. It behaves just like `?` in regex.
Line 2: `@|` starts an alternation, just like `<<|` before (see the previous CSS-color example). The difference is that `@` means atomic while `<<` means allow backtracking. So, alternations started using `@|` will be atomic while ones started using `<<|` may backtrack. (If you don't know about regex atomic vs. backtracking, here's an excellent reference).
Line 7: Blank line (comments-only lines are treated as blank). It marks the end-of-alternation.
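The atomic-versus-backtracking distinction described for Line 2 can be sketched in plain Python. This is not what oprex emits: to stay runnable on Python versions older than 3.11 (where `re` gained native `(?>...)` atomic groups), the atomic behaviour is emulated here with the well-known lookahead-plus-backreference idiom. The pattern names are illustrative assumptions.

```python
import re

# Backtracking alternation, comparable in spirit to oprex's <<| form:
# if "a" leads to a dead end, the engine may go back and try "ab".
BACKTRACKING = re.compile(r"(?:a|ab)c")

# Atomic behaviour, comparable in spirit to oprex's @| form, emulated with
# a lookahead plus backreference: once the lookahead has settled on "a",
# the engine cannot go back and pick "ab" instead.
ATOMIC_LIKE = re.compile(r"(?=(a|ab))\1c")

print(BACKTRACKING.fullmatch("abc") is not None)  # True
print(ATOMIC_LIKE.fullmatch("abc") is not None)   # False
print(ATOMIC_LIKE.fullmatch("ac") is not None)    # True
```

The atomic form rejects `abc` because the first alternative that succeeds is locked in, which is the trade-off to keep in mind when choosing between `@|` and `<<|`.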
--
Lines 8-11: `[paren]`, `[curly]`, `[angle]`, and `[square]` define named capture groups. (Capturing groups are a regex basic, so we will not cover them here in an oprex tutorial.) They need to be defined as capture groups because we will refer back to them later on lines 14-17.
--
close = @| -- 13
|[paren] ? ')' -- 14
|[curly] ? '}' -- 15
|[angle] ? '>' -- 16
|[square] ? ']' -- 17
|FAIL! -- 18
Line 13: `@|` start-of-alternation, atomic.
Lines 14-17 are conditionals, e.g. `[paren] ? ')'` means: match literal `)` only if capture group `paren` is defined. In the definition of `open` (lines 2-7), only one of paren/curly/angle/square will be defined because they are in an alternation. So this means that `close` must match `)` if `open` is `(`, `}` if `open` is `{`, etc.
Line 18: `FAIL!` compiles to regex `(?!)`. It always fails. So if none of the conditions is met, the match should just fail.
Line 19: Blank line, end-of-alternation.
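The conditional logic of `close` can be tried out with Python's `re`, which supports `(?(name)yes|no)` conditionals. The sketch below is a simplified stand-in, not the oprex output: the `open` and `close` parts follow the output regex shown earlier (minus the atomic groups), while the contents part is an assumed simplification that excludes every bracket character, so a string such as `{x < 0}` is rejected here even though the full regex accepts it. The names `BRACKETED` and `is_bracketed` are illustrative.

```python
import re

BRACKETED = re.compile(
    # open: exactly one of the four named groups gets defined
    r"(?:(?P<paren>\()|(?P<curly>\{)|(?P<angle><)|(?P<square>\[))"
    # contents (simplified assumption): anything that is not a bracket
    r"[^(){}<>\[\]]*"
    # close: pick the closer matching whichever group was defined, else fail
    r"(?(paren)\)|(?(curly)\}|(?(angle)>|(?(square)\]|(?!)))))"
)


def is_bracketed(text: str) -> bool:
    """Return True when text is one bracket pair around bracket-free contents."""
    return BRACKETED.fullmatch(text) is not None


print(is_bracketed("(abc)"))   # True
print(is_bracketed("<html>"))  # True
print(is_bracketed("[1]"))     # True
print(is_bracketed("(abc]"))   # False
print(is_bracketed("{x>"))     # False
```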
--
Line 20: `@1.. of <<|`
- `of` is the quantification operator.
- `@1..` is the quantifier. Again, `@` means atomic, i.e. make it a possessive quantifier. `1..` means one-or-more. `@1..` compiles to regex `++`.
- Things after `of` are what should be repeated. So here we are quantifying an alternation.
- `<<|` starts an alternation (one that allows backtracking, as previously explained).
Line 21: `not: ( ) { } < > [ ]`
- `not:` is oprex's operator for negated character classes.
- `( ) { } < > [ ]` are the character-class members, separated by spaces.
- Line 21 also demonstrates why comments are started with `--`: a `#` there would be interpreted as a character-class member.
Line 23: Blank line, end-of-alternation.
--
Lines 24-26:
not_close = <@> -- 24
|any| -- 25
<!close| -- 26
- `<@>` starts a lookaround block. The `@` reminds you that lookarounds are atomic.
- `any` is an oprex built-in variable. It matches any character, like regex's `.` with DOTALL turned on.
- `<!close|` is a negative lookbehind. `<` means lookbehind. `!` negates. `close` is already defined on Line 13.
- Like in an alternation, all corresponding `|` in a lookaround must vertically align.
Line 27: Blank line, end-of-lookaround. Like an alternation, a lookaround block is closed with a blank line.
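The "any character, then a negative lookbehind" shape of `not_close` can be illustrated with a reduced Python pattern. This sketch assumes a fixed closer `)` instead of the conditional `close` from the example, because Python's `re` only accepts fixed-width lookbehinds; the name `NOT_CLOSE_PAREN` is hypothetical.

```python
import re

# (?s:.) is "any character, newlines included", like oprex's `any`.
# (?<!\)) then rejects the match if the character just consumed was ")".
NOT_CLOSE_PAREN = re.compile(r"(?s:.)(?<!\))")

print(NOT_CLOSE_PAREN.findall("a)\nb"))  # ['a', '\n', 'b']
```

Every character except the closing parenthesis is matched, including the newline.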
The Tutorial should explain most of the syntax used in Examples. The rest is covered here:
The IPv4 Address, Date, and Time examples use the String-of-Digits Range Literal feature to match number-strings between two specified values (and reject non-number strings, as well as numbers whose value is outside the range):
- `byte = '0'..'255'` (leading zero not allowed, e.g. won't match `007`), `hh = '1'..'12'`
- `mm = '01'..'12'` (leading zero mandatory for single digits, e.g. will match `02` but not `2`), `dd = '01'..'31'`, `mm = ss = '00'..'59'`
- `HH = 'o0'..'23'` (the `o` means optional leading zero, e.g. will match both `2` and `02`)
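To make the leading-zero rules above concrete, here is a hand-written Python sketch of regexes that accept the same strings as `'0'..'255'` and `'01'..'12'`. These patterns are assumptions written for illustration; the regex oprex actually generates for a range literal may look different.

```python
import re

# Hand-written stand-in for byte = '0'..'255' (no leading zeros allowed).
BYTE = re.compile(r"25[0-5]|2[0-4][0-9]|1[0-9][0-9]|[1-9][0-9]|[0-9]")

# Hand-written stand-in for mm = '01'..'12' (leading zero mandatory).
MONTH = re.compile(r"0[1-9]|1[0-2]")


def in_range(pattern: re.Pattern, text: str) -> bool:
    """Return True when the whole string is accepted by the range pattern."""
    return pattern.fullmatch(text) is not None


print(in_range(BYTE, "0"), in_range(BYTE, "255"))     # True True
print(in_range(BYTE, "256"), in_range(BYTE, "007"))   # False False
print(in_range(MONTH, "02"), in_range(MONTH, "12"))   # True True
print(in_range(MONTH, "2"), in_range(MONTH, "13"))    # False False
```

Writing such alternations by hand is error-prone, which is the motivation for letting a range literal generate them.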
In the Date example:
/yyyy/separator/mm/=separator/dd/
And the Quoted String example:
/opening_quote/contents/=opening_quote/
The `=separator` and `=opening_quote` parts are Backreferences.
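The effect of `=separator` corresponds to a named backreference in regex terms. Below is a minimal Python sketch of a date pattern in that spirit; the separator characters (`-`, `/`, `.`) and the digit sub-patterns are assumptions chosen for illustration and are not taken from the Date example itself.

```python
import re

DATE = re.compile(
    r"(?P<yyyy>[0-9]{4})"
    r"(?P<separator>[-/.])"         # assumed set of separators
    r"(?P<mm>0[1-9]|1[0-2])"
    r"(?P=separator)"               # must repeat whatever separator matched
    r"(?P<dd>0[1-9]|[12][0-9]|3[01])"
)

print(DATE.fullmatch("2024-01-31") is not None)  # True
print(DATE.fullmatch("2024/01/31") is not None)  # True
print(DATE.fullmatch("2024-01/31") is not None)  # False
```

The mixed-separator string is rejected because the backreference only accepts the exact text captured by the first separator.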
`(ignorecase)` in Time and `(unicode)` in Password Checks are examples of Flags usage.
The `__` in the BEGIN-something-END example:
/begin/__/end/
And in Password Checks:
has_number = /__?/digit/
has_min_2_symbols = 2 of /__?/non-alnum/
are sample uses of the Match-Until Operator.
In Comma-Separated Values:
more_values = /comma/value/more_values?/
The `more_values` refers to itself. This is an example of Recursion. Other examples can be seen in E-mail Address:
subdomain = /hostname/dot/subdomain?/
Balanced Parentheses:
balanced_parens = /open/contents?/close/
    open: (
    close: )
    contents = @1.. of <<|
                         |non_parens
                         |balanced_parens
And Palindrome:
palindrome = <<|
               |/letter/palindrome/=letter/
               |/letter/=letter/
               |letter
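Python's standard `re` module has no recursion support, so the self-reference in `balanced_parens` cannot be pasted into it directly. As a rough illustration of what the recursive definition means, here is a small hand-written recursive matcher. The function `match_balanced` is hypothetical and mirrors the grammar (open, optional contents made of non-parens or nested groups, close); it is not how oprex or any regex engine implements recursion.

```python
from typing import Optional


def match_balanced(text: str, pos: int = 0) -> Optional[int]:
    """Return the index just past a balanced group starting at pos, else None."""
    if pos >= len(text) or text[pos] != "(":        # open
        return None
    pos += 1
    while pos < len(text) and text[pos] != ")":     # contents? (may be empty)
        if text[pos] == "(":
            inner = match_balanced(text, pos)       # the recursive reference
            if inner is None:
                return None
            pos = inner
        else:
            pos += 1                                # a non-paren character
    if pos < len(text) and text[pos] == ")":        # close
        return pos + 1
    return None


print(match_balanced("(a(b)c)"))     # 7
print(match_balanced("f(g(x))", 1))  # 7
print(match_balanced("(()"))         # None
```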
In Quoted String:
*) quote: ' "
And Comma-Separated Values:
*) comma: ,
The `*)` in the definitions of `quote` and `comma` marks each variable as a Global Variable, which makes it accessible in later, different scopes.
In Comma-Separated Values, Balanced Parentheses, and Palindrome:
//value/more_values?//
/non_parens?/balanced_parens/non_parens?/.
./palindrome/.
`./`, `/.`, and `//` are Anchors.
- IPv4 Address
- BEGIN-something-END
- Date
- Time
- Blood Type
- Quoted String
- Comma-Separated Values
- Password Checks
- Balanced Parentheses
- Number-string Range Literal
- Backreference
- Flags
- Match-anything-until
- Recursion
- Global Variables
- Anchors
- Built-in Character Classes
- Built-in Expressions
- Special Built-ins: `WOB`, `wordchar`, `non-linechar`