diff --git a/README.rst b/README.rst
index 9f374c4d6..be1d5d94a 100644
--- a/README.rst
+++ b/README.rst
@@ -33,7 +33,7 @@ A Concrete Syntax Tree (CST) parser and serializer library for Python
 
 .. intro-start
 
-LibCST parses Python 3.0, 3.1, 3.3, 3.5, 3.6, 3.7 or 3.8 source code as a CST tree that keeps
+LibCST parses Python 3.0 -> 3.11 source code as a CST tree that keeps
 all formatting details (comments, whitespaces, parentheses, etc). It's useful for
 building automated refactoring (codemod) applications and linters.
 
@@ -129,6 +129,11 @@ packaging tools. We recommend installing the latest stable release from
 
     pip install libcst
 
+For parsing, LibCST ships with a native extension, so releases are distributed as binary
+wheels as well as the source code. If a binary wheel is not available for your system
+(Linux/Windows x86/x64 and Mac x64/arm are covered), you'll need a recent
+`Rust toolchain `_ for installing.
+
 Further Reading
 ---------------
 - `Static Analysis at Scale: An Instagram Story. `_
@@ -137,7 +142,9 @@ Further Reading
 
 Development
 -----------
-Start by setting up and activating a virtualenv:
+You'll need a recent `Rust toolchain `_ for developing.
+
+Then, start by setting up and activating a virtualenv:
 
 .. code-block:: shell
 
diff --git a/native/libcst/README.md b/native/libcst/README.md
index f33563b2e..42eb2f6c4 100644
--- a/native/libcst/README.md
+++ b/native/libcst/README.md
@@ -1,66 +1,106 @@
-# libcst_native
-
-A very experimental native extension to speed up LibCST. This does not currently provide
-much performance benefit and is therefore not recommended for general use.
-
-The extension is written in Rust using [PyO3](https://pyo3.rs/).
-
-This installs as a separate python package that LibCST looks for and will import if it's
-available.
-
-
-## Using with LibCST
-
-[Set up a rust development environment](https://www.rust-lang.org/tools/install). Using
-`rustup` is recommended, but not necessary. Rust 1.45.0+ should work.
-
-Follow the instructions for setting up a virtualenv in the top-level README, then:
-
-```
-cd libcst_native
-maturin develop # install libcst_native to the virtualenv
-cd .. # cd back into the main project
-python -m unittest
-```
-
-This will run the python test suite. Nothing special is required to use `libcst_native`,
-since `libcst` will automatically use the native extension when it's installed.
-
-When benchmarking this code, make sure to run `maturin develop` with the `--release`
-flag to enable compiler optimizations.
-
-You can disable the native extension by uninstalling the package from your virtualenv:
-
-```
-pip uninstall libcst_native
-```
-
-
-## Rust Tests
+# libcst/native
+
+A native extension to enable parsing of new Python grammar in LibCST.
+
+The extension is written in Rust, and exposed to Python using [PyO3](https://pyo3.rs/).
+This is packaged together with libcst, and can be imported from `libcst.native`. When
+the `LIBCST_PARSER_TYPE` environment variable is set to `native`, the LibCST APIs use
+this module for all parsing.
+
+Later on, the parser library might be packaged separately as
+[a Rust crate](https://crates.io). Pull requests towards this are much appreciated.
+
+## Goals
+
+1. Adopt the CPython grammar definition as closely as possible to reduce maintenance
+   burden. This means using a PEG parser.
+2. Feature-parity with the pure-python LibCST parser: the API should be easy to use from
+   Python, support parsing with a target version, bytes and strings as inputs, etc.
+3. [future] Performance. The aspirational goal is to be within 2x CPython performance,
+   which would enable LibCST to be used in interactive use cases (think IDEs).
+4. [future] Error recovery. The parser should be able to handle partially complete
+   documents, returning a CST for the syntactically correct parts, and a list of errors
+   found.
+
+## Structure
+
+The extension is organized into two rust crates: `libcst_derive` contains some macros to
+facilitate various features of CST nodes, and `libcst` contains the `parser` itself
+(including the Python grammar), a `tokenizer` implementation by @bgw, and a very basic
+representation of CST `nodes`. Parsing is done by
+1. **tokenizing** the input utf-8 string (bytes are not supported at the Rust layer,
+   they are converted to utf-8 strings by the python wrapper)
+2. running the **PEG parser** on the tokenized input, which also captures certain anchor
+   tokens in the resulting syntax tree
+3. using the anchor tokens to **inflate** the syntax tree into a proper CST
+
+These steps are wrapped into a high-level `parse_module` API
+[here](https://github.com/Instagram/LibCST/blob/main/native/libcst/src/lib.rs#L43),
+along with `parse_statement` and `parse_expression` functions which all just accept the
+input string and an optional encoding.
+
+These Rust functions are exposed to Python
+[here](https://github.com/Instagram/LibCST/blob/main/native/libcst/src/py.rs) using the
+excellent [PyO3](https://pyo3.rs/) library, plus an `IntoPy` trait which is mostly
+implemented via a macro in `libcst_derive`.
+
+
+## Hacking
+
+### Grammar
+
+The grammar is mostly a straightforward translation from the [CPython
+grammar](https://github.com/python/cpython/blob/main/Grammar/python.gram), with some
+exceptions:
+
+* The output of grammar rules are deflated CST nodes that capture the AST plus
+  additional anchor token references used for whitespace parsing later on.
+* Rules in the grammar must be strongly typed, as enforced by the Rust compiler. The
+  CPython grammar rules are a bit more loosely-typed in comparison.
+* Some features in the CPython peg parser are not supported by rust-peg: keywords,
+  mutually recursive rules, special `invalid_` rules, the `~` operator, terminating the
+  parser early.
+
+The PEG parser is run on a `Vec` of `Token`s, and tries its best to avoid allocating any
+strings, working only with references. As such, the output nodes don't own any strings,
+but refer to slices of the original input (hence the `'a` lifetime parameter on almost
+all nodes).
+
+### Whitespace parsing
+
+The `Inflate` trait is responsible for taking a "deflated", skeleton CST node, and
+parsing out the relevant whitespace from the anchor tokens to produce an "inflated"
+(normal) CST node. In addition to the deflated node, inflation requires a whitespace
+config object which contains global information required for certain aspects of
+whitespace parsing, like the default indentation.
+
+Inflation consumes the deflated node, while mutating the tokens referenced by it. This
+is important to make sure whitespace is only ever assigned to at most one CST node. The
+`Inflate` trait implementation needs to ensure that all whitespace is assigned to a CST
+node; this is generally verified using roundtrip tests (i.e. parsing code and then
+generating it back to then assert the original and generated are byte-by-byte equal).
+
+The general convention is that the top-most possible node owns a certain piece of
+whitespace, which should be straightforward to achieve in a top-down parser like
+`Inflate`. In cases where whitespace is shared between sibling nodes, usually the
+leftmost node owns the whitespace except in the case of trailing commas and closing
+parentheses, where the latter owns the whitespace (for backwards compatibility with the
+pure python parser). See the implementation of `inflate_element` for how this is done.
+
+### Tests
 
 In addition to running the python test suite, you can run some tests written in rust
 with
 
 ```
-cargo test --no-default-features
+cd native
+cargo test
 ```
 
-The `--no-default-features` flag needed to work around an incompatibility between tests
-and pyo3's `extension-module` feature.
+These include unit and roundtrip tests.
+Additionally, some benchmarks can be run on x86-based architectures using `cargo bench`.
 
 
-## Code Formatting
+### Code Formatting
 
 Use `cargo fmt` to format your code.
-
-
-## Release
-
-This isn't currently supported, so there's no releases available, but the end-goal would
-be to publish this on PyPI.
-
-Because this is a native extension, it must be re-built for each platform/architecture.
-The per-platform build could be automated using a CI system, [like github
-actions][gh-actions].
-
-[gh-actions]: https://github.com/PyO3/maturin/blob/master/.github/workflows/release.yml
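
Review note on the "Whitespace parsing" section added in `native/libcst/README.md`: the ownership rule (inflation mutates the anchor tokens so each piece of whitespace ends up on at most one node, checked by a roundtrip) may be easier to follow with a toy model. The sketch below is plain Python and is not LibCST code: `Token`, `DeflatedWord`, `Word`, `tokenize`, `inflate` and `roundtrip` are all hypothetical names chosen for illustration, and the "parser" step is reduced to wrapping each token in a deflated node.

```python
import re
from dataclasses import dataclass
from typing import List, Tuple


@dataclass
class Token:
    text: str
    whitespace_before: str  # whitespace not yet claimed by any node


@dataclass
class DeflatedWord:
    tok: Token  # anchor token reference kept by the (toy) parser


@dataclass
class Word:
    value: str
    whitespace_before: str  # whitespace now owned by this node


def tokenize(source: str) -> Tuple[List[Token], str]:
    """Split into non-space runs, remembering the whitespace before each run."""
    tokens = [Token(m.group(2), m.group(1)) for m in re.finditer(r"(\s*)(\S+)", source)]
    consumed = sum(len(t.whitespace_before) + len(t.text) for t in tokens)
    return tokens, source[consumed:]  # second item: trailing whitespace


def inflate(node: DeflatedWord) -> Word:
    """Move the whitespace from the anchor token onto the inflated node."""
    tok = node.tok
    word = Word(tok.text, tok.whitespace_before)
    tok.whitespace_before = ""  # claimed: a second inflate cannot take it again
    return word


def roundtrip(source: str) -> str:
    tokens, trailing = tokenize(source)
    deflated = [DeflatedWord(t) for t in tokens]  # stand-in for the PEG parser
    words = [inflate(d) for d in deflated]
    return "".join(w.whitespace_before + w.value for w in words) + trailing
```

Under these assumptions, `roundtrip(src) == src` plays the role of the byte-for-byte roundtrip tests mentioned in the README, and inflating the same anchor token a second time yields a node with empty whitespace, which is the "at most one CST node" property.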