Ron Panduwana edited this page Sep 27, 2015 · 7 revisions

Here's an oprex that matches a CSS color (hash syntax only, e.g. #ff0000). It compiles to #(?:[\dA-Fa-f]{6}|[\dA-Fa-f]{3}).

/hash/hexes/
    hash = '#'
    hexes = <<|
              |6 of hexdigit
              |3 of hexdigit

        hexdigit: digit A..F a..f    

The first line:

/hash/hexes/

This specifies that we want a regex that matches hash followed by hexes. We then define what hash and hexes are in an indented sub-block:

--

    hash = '#'

This defines hash as a string literal. It should match # literally.

--

    hexes = <<|

This defines hexes as an alternation. The <<| starts an alternation.

--

              |6 of hexdigit

This is the first alternative in our alternation: match hexdigit six times.

The of keyword is oprex's operator for doing quantification/repetition.

All | in an alternation must vertically align.

--

              |3 of hexdigit

This is the second alternative in our alternation: match hexdigit three times. It handles the short form, e.g. #f00.

-- Then we have a blank line.

 

In oprex, a blank line marks the end of an alternation or lookaround block, so here it closes our alternation.

-- The definition of hexes above refers to something called hexdigit, so now we need to define it. Again, to define something we use an indented sub-block, this time with deeper indentation since the definition of hexes is already indented:

        hexdigit: digit A..F a..f

This defines hexdigit as a character class. A character class is defined with a colon: after the colon we list the class's members, separated by spaces. Here the members are digit and the ranges A..F and a..f.
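To see what the compiled regex accepts, here is a minimal Python sketch that feeds the output regex quoted at the top of this page to the standard `re` module. The names `CSS_COLOR` and `match_color` are made up for this illustration and are not part of oprex.

```python
import re

# The regex quoted above as the compiled output of the CSS-color oprex.
CSS_COLOR = re.compile(r"#(?:[\dA-Fa-f]{6}|[\dA-Fa-f]{3})")

def match_color(text):
    """Return the color matched at the start of text, or None (hypothetical helper)."""
    m = CSS_COLOR.match(text)
    return m.group(0) if m else None

# The six-digit alternative is listed first, so it is tried before the
# three-digit one; a four-digit input falls back to the three-digit form.
print(match_color("#ff0000"))
print(match_color("#f00"))
print(match_color("#ff00"))
```

Because the pattern is not anchored, the last call matches only the first three hex digits of `#ff00`.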

--

Another example: stuff-inside-brackets

The following oprex matches stuff-inside-brackets, where the brackets can be one of (), {}, <>, or [].

  • Sample matches: <html> {x < 0} [1] and (see footnote [1])
  • Will not match e.g. f(g(x))
  • Will only partially match {{citation needed}} as {{citation needed}
  • To also match those, we can combine this example with the balanced parentheses example. But to keep it short and clear, we'll go with the above restrictions.

The output regex will be:

(?>(?P<paren>\()|(?P<curly>\{)|(?P<angle><)|(?P<square>\[))
(?:[^(){}<>\[\]]|(?s:.)(?<!(?>(?(paren)\)|(?(curly)\}|(?(angle)>|(?(square)\]|(?!))))))))*+
(?>(?(paren)\)|(?(curly)\}|(?(angle)>|(?(square)\]|(?!))))))

Here's the oprex:

                                           -- Line 0, things after "--" are comments
/open/contents?/close/                     -- Line 1
    open = @|                                  --  2
            |paren                             --  3
            |curly                             --  4
            |angle                             --  5
            |square                            --  6
                                               --  7
        [paren]: (                             --  8
        [curly]: {                             --  9
        [angle]: <                             -- 10
        [square]: [                            -- 11
                                               -- 12
    close = @|                                 -- 13
             |[paren] ? ')'                    -- 14
             |[curly] ? '}'                    -- 15
             |[angle] ? '>'                    -- 16
             |[square] ? ']'                   -- 17
             |FAIL!                            -- 18
                                               -- 19
    contents = @1.. of <<|                     -- 20
                         |not: ( ) { } < > [ ] -- 21
                         |not_close            -- 22
                                               -- 23
        not_close = <@>                        -- 24
               |any|                           -- 25
            <!close|                           -- 26
                                               -- 27

As you can see, comments in oprex start with -- (a la SQL), not with # as in Python. The reason is demonstrated by Line 21. Also, while the first and last lines must be blank, comments-only lines count as blank.

-- Line 1: the ? in contents? makes contents optional. It behaves just like ? in regex.

Line 2: @| starts an alternation, just like <<| before (see the previous CSS-color example). The difference is @ means atomic while << means allow backtracking. So, alternations started using @| will be atomic while ones started using <<| may backtrack. (If you don't know about regex atomic vs. backtracking, here's an excellent reference).

Line 7: Blank line (comments-only lines are treated as blank). It marks the end-of-alternation.

-- Lines 8-11: [paren] [curly] [angle] [square] define named capture groups. (Capture groups are a regex basic, so we will not cover them in this oprex tutorial.) They need to be defined as capture groups because we refer back to them later, on lines 14-17.

--

    close = @|                                 -- 13
             |[paren] ? ')'                    -- 14
             |[curly] ? '}'                    -- 15
             |[angle] ? '>'                    -- 16
             |[square] ? ']'                   -- 17
             |FAIL!                            -- 18

Line 13: @| start-of-alternation, atomic.

Lines 14-17 are conditionals, e.g. [paren] ? ')' means: match a literal ) only if capture group paren is defined. In the definition of open (lines 2-7), only one of paren, curly, angle, and square will be defined because they are in an alternation. So close must match ) if open is (, } if open is {, etc.

Line 18: FAIL! compiles to regex (?!). It always fails. So if none of the conditions is met, the match simply fails.

Line 19: Blank line, end-of-alternation.
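The conditional-close idea from lines 13-18 can be tried out in Python, whose `re` module supports the same `(?(name)yes|no)` conditional syntax that appears in the output regex. The sketch below is a deliberately simplified pattern written by hand for illustration: it handles only two bracket kinds, uses no atomic groups, and is not what oprex generates. `BRACKETED` and `is_bracketed` are hypothetical names.

```python
import re

# Simplified, hand-written illustration (not oprex output).
BRACKETED = re.compile(
    r"(?:(?P<paren>\()|(?P<curly>\{))"   # open: exactly one named group gets defined
    r"[^(){}]*"                          # contents: anything except the brackets
    r"(?(paren)\)|(?(curly)\}|(?!)))"    # close: the partner of open, otherwise fail
)

def is_bracketed(text):
    """True if the whole text is one bracketed chunk with a matching closer."""
    return BRACKETED.fullmatch(text) is not None

print(is_bracketed("(abc)"))
print(is_bracketed("{abc}"))
print(is_bracketed("(abc}"))
```

The trailing `(?!)` plays the role of FAIL!: it is reached only when neither group is defined, and it can never match.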

-- Line 20: @1.. of <<|

  • of is the quantification operator.
  • @1.. is the quantifier. Again, @ means atomic, i.e. make it a possessive quantifier. 1.. means one-or-more. @1.. compiles to regex ++.
  • Things after of are what-should-be-repeated. So here we are quantifying an alternation.
  • <<| starts an alternation (one that allows backtracking, as previously explained).

Line 21: not: ( ) { } < > [ ]

  • not: is oprex's operator for a negated character class.
  • ( ) { } < > [ ] are the character-class members, separated by spaces.
  • Line 21 also demonstrates why comments start with -- rather than #: a # there would be interpreted as a character-class member.

Line 23: Blank line, end-of-alternation.

-- Lines 24-26:

        not_close = <@>                        -- 24
               |any|                           -- 25
            <!close|                           -- 26

  • <@> starts a lookaround block. The @ reminds you that lookarounds are atomic.
  • any is an oprex-built-in variable. It matches any character -- like regex's . with DOTALL turned on.
  • <!close| is a negative lookbehind. < means lookbehind. ! negates. close is already defined on Line 13.
  • Like in alternation, all corresponding | in a lookaround must vertically align.

Line 27: Blank line, end-of-lookaround. Like an alternation, a lookaround block is closed with a blank line.
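The any-then-negative-lookbehind shape of not_close can be seen in isolation with a small Python sketch. It fixes the closer to a literal ) instead of the conditional close used above, so it is an illustration of the technique only, not the regex oprex emits. `NOT_CLOSE_PAREN` is a made-up name.

```python
import re

# Consume any one character (DOTALL scoped to the dot), then look behind
# to check that the character just consumed was not a closing parenthesis.
NOT_CLOSE_PAREN = re.compile(r"(?s:.)(?<!\))")

print(NOT_CLOSE_PAREN.findall("a)\nb"))
```

Every character except the `)` is returned, including the newline, which shows that the dot behaves as with DOTALL turned on.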

Examples explained

The Tutorial should explain most of the syntax used in Examples. The rest are covered here:

1. Match a range of number-strings

The IPv4 Address, Date, and Time examples use the String-of-Digits Range Literal feature to match number-strings between two specified values. Strings that are not numbers, and numbers whose value falls outside the range, are rejected:

  • byte = '0'..'255' (leading-zero not allowed, e.g. won't match 007)
  • hh = '1'..'12'
  • mm = '01'..'12' (leading-zero mandatory for single-digits, e.g. will match 02 but not 2)
  • dd = '01'..'31'
  • mm = ss = '00'..'59'
  • HH = 'o0'..'23' (the o means optional leading-zero, e.g. will match both 2 and 02)
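To get a feel for what a range literal such as byte = '0'..'255' saves you from writing, here is a hand-written Python regex with the same stated behavior (0 through 255, no leading zeros). This is an assumption-laden sketch: the regex oprex actually generates for the literal may look different, and `BYTE` and `is_byte` are names invented for this example.

```python
import re

# Hand-written equivalent of '0'..'255' with leading zeros disallowed.
# Not oprex output; shown only to illustrate the intended matches.
BYTE = re.compile(r"25[0-5]|2[0-4][0-9]|1[0-9]{2}|[1-9][0-9]|[0-9]")

def is_byte(text):
    """True if text is a decimal number from 0 to 255 without leading zeros."""
    return BYTE.fullmatch(text) is not None

print(is_byte("255"))
print(is_byte("256"))
print(is_byte("007"))
```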

2. Backreference

In the Date example:

    /yyyy/separator/mm/=separator/dd/

And the Quoted String example:

    /opening_quote/contents/=opening_quote/

The =separator and =opening_quote parts are backreferences: each must match the same text that separator or opening_quote matched earlier.
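The same idea in plain Python regex uses a named group and `(?P=name)`. The sketch below mirrors the shape of the Date example; the digit counts and the choice of - and / as separators are assumptions made for the illustration, since the definitions of yyyy, separator, mm, and dd are not shown on this page.

```python
import re

# Named-group backreference: the second separator must equal the first.
# Field widths and the separator set are assumed, not taken from oprex.
DATE = re.compile(
    r"(?P<yyyy>\d{4})(?P<separator>[-/])(?P<mm>\d{2})(?P=separator)(?P<dd>\d{2})"
)

def has_consistent_separator(text):
    """True if text looks like yyyy?mm?dd with the same separator twice."""
    return DATE.fullmatch(text) is not None

print(has_consistent_separator("2015-09-27"))
print(has_consistent_separator("2015-09/27"))
```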

3. Flags

(ignorecase) in Time and (unicode) in Password Checks are examples of Flags usage.

4. Match-anything-until

The __ in the BEGIN-something-END example:

    /begin/__/end/

And in Password Checks:

    has_number = /__?/digit/
    has_min_2_symbols = 2 of /__?/non-alnum/

Are sample uses of the Match-Until Operator.
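In plain regex, "match anything until" is commonly approximated with a lazy quantifier. The Python sketch below does that for the three snippets above. It rests on several assumptions: that __ needs at least one character while __? allows none, that the dot should also cover newlines, that begin and end are the literals BEGIN and END, and that non-alnum means non-ASCII-alphanumeric. The regex oprex really emits for __ may be quite different.

```python
import re

# Lazy-quantifier approximations (assumed semantics, not oprex output).
BEGIN_END = re.compile(r"BEGIN.+?END", re.DOTALL)
HAS_NUMBER = re.compile(r".*?\d", re.DOTALL)
HAS_MIN_2_SYMBOLS = re.compile(r"(?:.*?[^A-Za-z0-9]){2}", re.DOTALL)

def check_password(password):
    """True if password has a digit and at least two non-alphanumerics."""
    return bool(HAS_NUMBER.match(password)) and bool(HAS_MIN_2_SYMBOLS.match(password))

# The lazy quantifier stops at the first END rather than the last one.
print(BEGIN_END.search("BEGIN a END b END").group(0))
print(check_password("a1!b?"))
print(check_password("a1b!"))
```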

5. Recursion

In Comma-Separated Values:

    more_values = /comma/value/more_values?/

The more_values refers to itself. This is an example of Recursion. Other examples can be seen in E-mail Address:

    subdomain = /hostname/dot/subdomain?/

Balanced Parentheses:

    balanced_parens = /open/contents?/close/
        open: (
        close: )
        contents = @1.. of <<|
                             |non_parens
                             |balanced_parens

And Palindrome:

    palindrome = <<|
                          |/letter/palindrome/=letter/
                          |/letter/=letter/
                          |letter
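Python's built-in `re` module has no recursion support, so the sketch below mirrors the Balanced Parentheses definition with a recursive function instead of a regex. It assumes non_parens means any character other than ( and ), which this page does not state. `match_balanced` is a hypothetical helper, not something oprex provides.

```python
def match_balanced(text, pos=0):
    """Return the index just past a balanced (...) group starting at pos, or None."""
    # open: must start with "("
    if pos >= len(text) or text[pos] != "(":
        return None
    pos += 1
    # contents?: zero or more of non_parens | balanced_parens
    while pos < len(text) and text[pos] != ")":
        if text[pos] == "(":
            pos = match_balanced(text, pos)  # the definition refers to itself
            if pos is None:
                return None
        else:
            pos += 1
    # close: must end with ")"
    if pos < len(text):
        return pos + 1
    return None

print(match_balanced("(a(b)c)"))
print(match_balanced("(a(b)c"))
```

The recursive call sits exactly where balanced_parens appears inside contents, which is the self-reference the oprex definition expresses.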

6. Global Variables

In Quoted String:

*)      quote: ' "

And Comma-Separated Values:

*)      comma: ,

The *) in the definitions of quote and comma marks each variable as a Global Variable, which makes it accessible in later, different scopes.

7. Anchors

In Comma-Separated Values, Balanced Parentheses, and Palindrome:

    //value/more_values?//

    /non_parens?/balanced_parens/non_parens?/.

    ./palindrome/.

./, /., and // are Anchors.
