Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Support Doc Comments and General Cleanup of Scanner #195

Merged
merged 2 commits into from
Feb 16, 2024

Conversation

ProfDoof
Copy link
Contributor

Hi all!

I was looking at some previous PRs for doc comments like #168 and thought I would take a crack at it addressing some of the issues raised by @the-mikedavis in the other PR to fix #147. I scope creeped myself while I was doing it though and also separated each of the processing steps in the scanner.c file out into their own functions to make it easier to read the file. I was struggling quite a bit to read it myself. I also went through and did some cleanup of functions based on what I read in the documentation for tree-sitter.

If I need to split this into 2 separate PRs, let me know, and I will address that as able.

Quick question though, is there a good way for me to benchmark my changes against the original solution? I want to make sure that the extra processing cost we would be incurring is worth it.

Changes Made

Grammar

  • Added an error sentinel value to the externals array. See the comments in the scanner.c for explanation.
  • Changed from handling the entirety of block comment parsing in the external scanner to just the inner block/outer block signifier scanning and the block comment ending scanning
  • Split the line comment into 3 cases.
    1. The line starts with 4 or more slashes which makes what looks like a doc comment actually a regular comment
    2. The line starts with 2 slashes and either one more slash or at least 1 bang which makes the comment a regular comment
    3. The line starts with only 2 slashes and everything after that is content
  • Added hidden rules for line doc comment, line inner doc comment, and line outer doc comment to be used in aliases when scanning a line comment
  • Added hidden rules for block doc comment, and external scanner handling for block inner and outer doc comment detection.
  • Aliased the line doc comment and block doc comment to a generic doc comment node
  • Aliased the inner line and block comment nodes to a generic inner doc comment node
  • Aliased the outer line and block comment nodes to a generic outer doc comment node

Scanner

  • Switched everywhere where the code was using lexer->lookahead == 0 to instead use eof(lexer) which calls the eof function on the lexer. This is recommended in the tree-sitter documentation I read.
  • Added a skip function similar to the advance function and switched parts of the code from manually calling advance on the lexer to using either the advance function or the skip function.
  • Added a markEnd function for the same reason as above.
  • Converted the block comment processing from a flag boolean to using an enum based state machine. It made it easier for me to read and catch edge cases.
  • In the block comment processing, I now first check whether there's a signifier of one of the doc comment possibilities and if there is I return the correct one. If there isn't, then I just pass it to the block comment processing part of the process function and completely parse the block comment and then return it with the end marked correctly.
  • In block comment processing, previously, the behavior if an EOF was found was to return false. I changed that to return true. This is because for the purposes of tree sitting, if we found an EOF that means we are dealing with an unterminated block comment and the user likely wants that highlighted. If we were the rust parser obviously we would probably choose a different behavior but for our purposes here that behavior seems more reasonable to me. However, I'm more than willing to revert that change. I will need to just change a line or two.
  • Started using markEnd during the block comment processing and moved the block comment processing to the beginning of the scanner function to ensure that we don't skip over whitespace by accident.
  • Added inlines to the lexer caller functions. Not sure that's actually worth it but it's there now. I can remove it if desired.

Test

  • Added tests for all the cases that I saw in the previous PRs. If there's any that I missed that you think I need to double check, feel free to include them and I will attempt to correct them at my earliest convenience.

@ProfDoof
Copy link
Contributor Author

I have fixed the lint problems locally but now I'm hunting down the parser fuzzing error.

@ProfDoof
Copy link
Contributor Author

The fuzzing error seen in the failure on here was caused by the shift in the install process for tree-sitter-cli from installing the api.h and parser.h files to only installing the api.h file. This means that you need to use the relative path to import the parser.h created by tree-sitter generate instead. That needs to be fixed in the vim repo as well it looks like.

However, there was also a recent update to the tree-sitter-fuzz-action which will likely cause an error as well. I will go make an issue on there and link it here.

@ProfDoof
Copy link
Contributor Author

I looked deeper and make install is the one that no longer installs parser.h but cargo and npm still do. So, it's not an issue right now but may become one later.

@ProfDoof
Copy link
Contributor Author

ProfDoof commented Oct 4, 2023

Hey, just wanted to check what the expected timeline on looking at this is.

@maplant
Copy link

maplant commented Oct 23, 2023

Would love to have this added. Not parsing doc comments is very strange

@ProfDoof
Copy link
Contributor Author

ProfDoof commented Dec 8, 2023

Hi! I wanted to check back in on this as it's been sitting in my backlog for a couple of month now. @amaanq is this something that you would need to look at?

@TrueDoctor
Copy link

For me this seems to only highlight the first two characters of the doc comment, #168 does not have the same problem

@ProfDoof
Copy link
Contributor Author

ProfDoof commented Jan 5, 2024

For me this seems to only highlight the first two characters of the doc comment, #168 does not have the same problem

Would you mind posting the code you were using to test it? I am willing to fix it, I just need to know what it is. @TrueDoctor

@TrueDoctor
Copy link

Would you mind posting the code you were using to test it? I am willing to fix it, I just need to know what it is. @TrueDoctor

I hooked it up to the helix editor by changing the url of the rust grammer to point to your pr and then adding
(line_doc_comment) @comment.block.documentation
to the file runtime/queries/rust/highlights.scm and then starting the editor (sometimes you have to manually ask helix to recompile the grammer by doing hx --grammar fetch and hx --grammar build)

@ProfDoof
Copy link
Contributor Author

ProfDoof commented Jan 13, 2024

@TrueDoctor

Oh, I see the issue I think. line_doc_comment isn't a valid selection for the grammar I crafted. Instead, you want to add

(line_comment doc: (_)) @comment.block.documentation
(line_comment) @comment.line
(block_comment doc: (_)) @comment.block.documentation
(block_comment) @comment.block

or

(line_comment (doc_comment)) @comment.block.documentation
(line_comment) @comment.line
(block_comment (doc_comment)) @comment.block.documentation
(block_comment) @comment.block

Or something like that. I'm not an expert on crafting queries yet. Once I have time again (don't do a PhD, you'll never have your own life) I can try to drum up some better versions.

Oh, if you want to differentiate between inner and outer doc comments you can as well. You would do that like this I think:

(line_comment (doc_comment outer: (_))) @comment.block.documentation
(line_comment (doc_comment inner: (_))) @comment.block.documentation
(line_comment) @comment.line
(block_comment (doc_comment outer: (_))) @comment.block.documentation
(block_comment (doc_comment inner: (_))) @comment.block.documentation
(block_comment) @comment.block

or

(line_comment (doc_comment (outer_doc_comment))) @comment.block.documentation.outer
(line_comment (doc_comment (inner_doc_comment))) @comment.block.documentation.inner
(line_comment) @comment.line
(block_comment (doc_comment (outer_doc_comment))) @comment.block.documentation.outer
(block_comment (doc_comment (inner_doc_comment))) @comment.block.documentation.inner
(block_comment) @comment.block

@TrueDoctor
Copy link

Oh, sorry I think made a small error, I think I actually used the (doc_comment) @comment.block.documentation and just copy pasted the line from my current config without checking. Otherwise the editor would complain about missing nodes iirc

@ProfDoof
Copy link
Contributor Author

@TrueDoctor

The doc_comment does not contain the contents of the node IIRC. Instead, doc_comment contains just the characters needed to determine if I'm dealing with an inner or outer comment (theoretically, it's been a while since I wrote this code). The rest of the contents are contained within the line_comment or block_comment. There might be a different way to consider it though. I'm welcome to suggestions.

@amaanq amaanq force-pushed the support_doc_comments branch 2 times, most recently from 1a77348 to 5390f15 Compare February 16, 2024 09:18
@amaanq
Copy link
Member

amaanq commented Feb 16, 2024

I'm sorry for taking so long to get to this, I had a bit of a hiatus November-January and then was just busy afterwards and forgot about this

This looks really good FYI, so thanks a ton @ProfDoof! There's a few things I changed around but otherwise I think we can get this in, most notably:

  • rename block comment end to '*/', the external scanner can recognize and parse literal tokens just fine, and it's better for those looking to leverage the start/end of a comments in their queries
  • mark all non-exported functions static to avoid potential collisions when being compiled with other parsers/c libraries
  • hide the second '//' if it's a regular comment via a regex
  • move generated file changes to their own commit

Might've forgotten something, but that's the meat of it. Thanks!

@clason
Copy link

clason commented Feb 16, 2024

Would be helpful to add queries for this (makes it easier for downstream to include).

@amaanq
Copy link
Member

amaanq commented Feb 16, 2024

not much really needed tbh, there's still queries for comments but this enables querying for doc comments (which is not in the list of upstream's standard captures nvm it is) to be done much more easily

There's @comment.documentation highlight queries now

@amaanq amaanq merged commit 6b7ccc9 into tree-sitter:master Feb 16, 2024
5 checks passed
@TrueDoctor
Copy link

🥳

This was referenced Feb 16, 2024
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

Separate regular comments from Doc Strings
5 participants