Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Add a small grammar guide #802

Merged
merged 1 commit into from
Apr 11, 2024
Merged
Changes from all commits
Commits
File filter

Filter by extension

Filter by extension

Conversations
Failed to load comments.
Loading
Jump to
Jump to file
Failed to load files.
Loading
Diff view
Diff view
86 changes: 86 additions & 0 deletions docs/reference/cfg.md
Original file line number Diff line number Diff line change
Expand Up @@ -61,3 +61,89 @@ The following grammars are currently available:
- JSON grammar via `outlines.grammars.json`

If you would like more grammars to be added to the repository, please open an [issue](https://github.com/outlines-dev/outlines/issues) or a [pull request](https://github.com/outlines-dev/outlines/pulls).


## Grammar guide

A grammar is a list of rules and terminals that define a *language*:

- Terminals define the vocabulary of the language; they may be a string, regular expression or combination of these and other terminals.
- Rules define the structure of that language; they are a list of terminals and rules.

Outlines uses the [Lark library](https://github.com/lark-parser/lark) to make Large Language Models generate text in a language of a grammar, it thus uses grammars defined in a format that Lark understands, based on the [EBNF syntax](https://en.wikipedia.org/wiki/Extended_Backus%E2%80%93Naur_form). Read the [Lark documentation](https://lark-parser.readthedocs.io/en/stable/grammar.html) for more details on grammar, the following is a small primer that should help get your started.

In the following we will define a [LOGO-like toy language](https://github.com/lark-parser/lark/blob/master/examples/turtle_dsl.py) for python's [turtle](https://docs.python.org/3/library/turtle.html) library.

### Terminals

A turtle can take 4 different `MOVEMENT` move instructions: forward (`f`), backward (`b`), turn right (`r`) and turn left (`l`). It can take `NUMBER` number of steps in each direction, and draw lines in a specified `COLOR`. These define the vocabulary of our language:

```ebnf
MOVEMENT: "f"|"b"|"r"|"l"
COLOR: LETTER+

%import common.LETTER
%import common.INT -> NUMBER
%import common.WS
%ignore WS
```

The lines that start with `%` are called "directive". They allow to import pre-defined terminals and rules, such as `LETTER` and `NUMBER`. `LETTER+` is a regular expressions, and indicates that a `COLOR` is made of at least one `LETTER`. The last two lines specify that we will ignore white spaces (`WS`) in the grammar.

### Rules

We now need to define our rules, by decomposing instructions we can send to the turtle via our python program. At each line of the program, we can either choose a direction and execute a given number of steps, change the color used to draw the pattern. We can also choose to start filling, make a series of moves, and stop filling. We can also choose to repeat a series of move.

We can easily write the first two rules:

```ebnf
instruction: MOVEMENT NUMBER -> movement
| "c" COLOR [COLOR] -> change_color
```

where `movement` and `change_color` represent aliases for the rules. A whitespace implied concatenating the elements, and `|` choosing either of the elements. The `fill` and `repeat` rules are slightly more complex, since they apply to a code block, which is made of instructions. We thus define a new `code_block` rule that refers to `instruction` and finish implementing our rules:

```ebnf
instruction: MOVEMENT NUMBER -> movement
| "c" COLOR [COLOR] -> change_color
| "fill" code_block -> fill
| "repeat" NUMBER code_block -> repeat

code_block: "{" instruction "}"
```

We can now write the full grammar:

```ebnf
start: instruction+

instruction: MOVEMENT NUMBER -> movement
| "c" COLOR [COLOR] -> change_color
| "fill" code_block -> fill
| "repeat" NUMBER code_block -> repeat

code_block: "{" instruction+ "}"

MOVEMENT: "f"|"b"|"l"|"r"
COLOR: LETTER+

%import common.LETTER
%import common.INT -> NUMBER
%import common.WS
%ignore WS
```

Notice the `start` rule, which defines the starting point of the grammar, i.e. the rule with which a program must start. This full grammars allows us to parse programs such as:

```python
c red yellow
fill { repeat 36 {
f200 l170
}}
```

The result of the parse, the parse tree, can then easily be translated into a Python program that uses the `turtle` library to draw a pattern.

### Next steps

This section provides a very brief overview of grammars and their possibilities. Check out the [Lark documentation](https://lark-parser.readthedocs.io/en/stable/index.html) for more thorough explanations and more examples.
Loading