Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Parse the text to detect doc comments and inlines properly in spin() #1605

Merged
merged 7 commits into from
Oct 9, 2018

Conversation

yutannihilation
Copy link
Sponsor Collaborator

@yutannihilation yutannihilation commented Sep 15, 2018

Problem

(originally reported at rstudio/rmarkdown#1444)

Currently, spin() detects doc comments by checking if the "^#+'[ ]?" matches on a line-by-line basis.

knitr/R/spin.R

Line 71 in 01407a7

r = rle(grepl(doc, x) | i) # inline expressions are treated as doc instead of code

Accordingly, it mistakenly detects string literals that contain #' as doc comments in cases like this:

code <- "
#' hello, world!"

The same thing occurs on inline in the following case:

code <- "
{{ Sys.Date() }}
"

In addition to string literals, symbols can also be multi-line:

`
{{very_strange_var_name}}
` <- 1

Fix

To identify the true comments and double-brackets correctly, we need to parse the whole code first and ignore string literals and symbols that are over multiple lines. To me, @yihui's suggestion of utils::getParsedData() at rstudio/rmarkdown#1444 (comment) seems simpler and faster (benchmark: #1605 (comment)), so I implemented as such.

The idea is simple; since multi-line STR_CONST or SYMBOL are tokenized as below, if the beginning of a line is contained by them, there are no col1 = 1 token in the line. So, we can filter the parsed data by col1 == 1 to see what lines are "complete".

codes <-
  c("{{ Sys.Date() }}",   # true inline
    '"',
    "{{ Sys.Date() }}\"", # false
    "`",
    "{{ Sys.Date() }}",   # false
    "`",
    "#' Hello world!",    # true doc comment
    '"',
    "#' Hello world!\"",  # false
    "`",
    "#' Hello world!",    # false
    "`")

d <- getParseData(parse(text = codes))
options(width = 400)
d
#>    line1 col1 line2 col2 id parent                token terminal                   text
#> 16     1    1     1   16 16      0                 expr    FALSE                       
#> 1      1    1     1    1  1     16                  '{'     TRUE                      {
#> 12     1    2     1   15 12     16                 expr    FALSE                       
#> 2      1    2     1    2  2     12                  '{'     TRUE                      {
#> 8      1    4     1   13  8     12                 expr    FALSE                       
#> 3      1    4     1   11  3      5 SYMBOL_FUNCTION_CALL     TRUE               Sys.Date
#> 5      1    4     1   11  5      8                 expr    FALSE                       
#> 4      1   12     1   12  4      8                  '('     TRUE                      (
#> 6      1   13     1   13  6      8                  ')'     TRUE                      )
#> 9      1   15     1   15  9     12                  '}'     TRUE                      }
#> 13     1   16     1   16 13     16                  '}'     TRUE                      }
#> 20     2    1     3   17 20     22            STR_CONST     TRUE   "\n{{ Sys.Date() }}"
#> 22     2    1     3   17 22      0                 expr    FALSE                       
#> 25     4    1     6    1 25     27               SYMBOL     TRUE `\n{{ Sys.Date() }}\n`
#> 27     4    1     6    1 27      0                 expr    FALSE                       
#> 30     7    1     7   15 30    -35              COMMENT     TRUE        #' Hello world!
#> 33     8    1     9   16 33     35            STR_CONST     TRUE    "\n#' Hello world!"
#> 35     8    1     9   16 35      0                 expr    FALSE                       
#> 38    10    1    12    1 38     40               SYMBOL     TRUE  `\n#' Hello world!\n`
#> 40    10    1    12    1 40      0                 expr    FALSE

complete_lines <- unique(d[d$col1 == 1, ]$line1)
complete_lines
#> [1]  1  2  4  7  8 10

# doc
grep("^#+'[ ]?", codes[complete_lines], value = TRUE)
#> [1] "#' Hello world!"

# inline
grep("^[{][{](.+)[}][}][ ]*$", codes[complete_lines], value = TRUE)
#> [1] "{{ Sys.Date() }}"

Created on 2018-09-16 by the reprex package (v0.2.0).

@yutannihilation
Copy link
Sponsor Collaborator Author

yutannihilation commented Sep 15, 2018

Benchmark here. In summary,

  1. using utils::getParsedData() is slower than using try_parse() when the text is of a few lines
  2. using utils::getParsedData() is faster than using try_parse() when the text is of hundreds of lines.
  3. using utils::getParsedData() constantly takes a few milleseconds, whereas, the more the number of lines is, using try_parse() takes more time. Especially, the latter is slow when the code contains string literals that is over many lines.

So, I think the overhead of using utils::getParsedData() is acceptable and better than the case of try_parse().

# define functions -------------------------

grepl_doc_comment = function(doc, x) {
  d = getParseData(parse(text = x))
  doc_lines = d[d$col1 == 1 & d$token == "COMMENT" & grepl(doc, d$text), ]$line1
  seq_along(x) %in% doc_lines
}

try_parse = function(code, silent = TRUE) {
  !inherits(
    try(parse(text = code, keep.source = FALSE), silent = silent), 'try-error'
  )
}

grepl_doc_comment2 = function(doc, x) {
  pos = 1
  is_complete = logical(length(x))
  for (i in seq_along(x)) {
    if (!try_parse(x[seq(pos, i)])) {
      next()
    }
    if (i == pos) {
      is_complete[i] = TRUE
    }
    pos = i + 1
  }
  is_complete & grepl(doc, x)
}


txt1 = c("#' test", "code <- \"", "#' test\"")

# confirm results --------------------------

grepl("^#+'[ ]?", txt1) # wrong
#> [1]  TRUE FALSE  TRUE
grepl_doc_comment("^#+'[ ]?", txt1) # ok
#> [1]  TRUE FALSE FALSE
grepl_doc_comment2("^#+'[ ]?", txt1) # ok
#> [1]  TRUE FALSE FALSE

# benchmark --------------------------------

bench::mark(
  grepl("^#+'[ ]?", txt1),
  grepl_doc_comment("^#+'[ ]?", txt1),
  grepl_doc_comment2("^#+'[ ]?", txt1),
  check = FALSE
)
#> # A tibble: 3 x 10
#>   expression     min     mean  median    max `itr/sec` mem_alloc  n_gc
#>   <chr>      <bch:t> <bch:tm> <bch:t> <bch:>     <dbl> <bch:byt> <dbl>
#> 1 "grepl(\"~   5.1us  11.92us   7.5us 1.25ms    83906.        0B     0
#> 2 "grepl_do~ 710.6us   1.05ms 835.5us 4.77ms      952.    72.6KB     4
#> 3 "grepl_do~ 424.1us 656.52us   654us 2.24ms     1523.        0B     2
#> # ... with 2 more variables: n_itr <int>, total_time <bch:tm>

txt2 = rep(txt1, times = 100)
bench::mark(
  grepl("^#+'[ ]?", txt2),
  grepl_doc_comment("^#+'[ ]?", txt2),
  grepl_doc_comment2("^#+'[ ]?", txt2),
  check = FALSE
)
#> # A tibble: 3 x 10
#>   expression     min    mean  median     max `itr/sec` mem_alloc  n_gc
#>   <chr>      <bch:t> <bch:t> <bch:t> <bch:t>     <dbl> <bch:byt> <dbl>
#> 1 "grepl(\"~  47.2us 71.22us  51.4us  1.61ms   14042.     1.22KB     1
#> 2 "grepl_do~  1.28ms  2.02ms  1.69ms  4.59ms     496.   289.19KB     3
#> 3 "grepl_do~ 43.86ms 53.94ms 53.05ms 71.37ms      18.5    3.66KB     2
#> # ... with 2 more variables: n_itr <int>, total_time <bch:tm>

txt3 = rep(c("code <- \"", "", "", "", "", "", "", "", "", "#' test\""), times = 30)
bench::mark(
  grepl("^#+'[ ]?", txt3),
  grepl_doc_comment("^#+'[ ]?", txt3),
  grepl_doc_comment2("^#+'[ ]?", txt3),
  check = FALSE
)
#> # A tibble: 3 x 10
#>   expression     min     mean   median      max `itr/sec` mem_alloc  n_gc
#>   <chr>      <bch:t> <bch:tm> <bch:tm> <bch:tm>     <dbl> <bch:byt> <dbl>
#> 1 "grepl(\"~  36.8us  53.86us   39.7us   2.73ms  18567.      1.22KB     0
#> 2 "grepl_do~   844us   1.25ms   1.01ms   3.63ms    798.     75.94KB     4
#> 3 "grepl_do~ 115.4ms 131.84ms 133.41ms 146.72ms      7.58    3.66KB     2
#> # ... with 2 more variables: n_itr <int>, total_time <bch:tm>

@yutannihilation yutannihilation changed the title [WIP] Parse the text to detect comments properly in spin() [WIP] Parse the text to detect docs, inlines, and comments properly in spin() Sep 16, 2018
@yutannihilation
Copy link
Sponsor Collaborator Author

Hmm, comment is a bit harder since it is allowed to be syntactically invalid, which means we cannot parse them... (# /* is no problem, but /* is)

/*
print("This is comment
*/
")

@yutannihilation
Copy link
Sponsor Collaborator Author

yutannihilation commented Sep 16, 2018

Given that /* and */ are outside of R's syntax, they should be considered as a kind of macro, not a syntactical parts. So, it doesn't matter whether the comment notation is in a string literal or not. Now I feel it is natural that the above code is treated as this:

")

@yutannihilation yutannihilation changed the title [WIP] Parse the text to detect docs, inlines, and comments properly in spin() [WIP] Parse the text to detect doc comments and inlines properly in spin() Sep 16, 2018
@yutannihilation yutannihilation changed the title [WIP] Parse the text to detect doc comments and inlines properly in spin() Parse the text to detect doc comments and inlines properly in spin() Sep 16, 2018
@yutannihilation
Copy link
Sponsor Collaborator Author

Ready for review now.

@yihui
Copy link
Owner

yihui commented Sep 26, 2018

Sounds great! Thanks! I'll review it either this Friday or a week later.

@yutannihilation
Copy link
Sponsor Collaborator Author

Thanks, no hurry at all :)

@yutannihilation
Copy link
Sponsor Collaborator Author

Sorry, this sentence is not true, although the fix may stay the same...

So, we can filter the parsed data by col1 == 1 to see what lines are "complete".

A "complete" line can have spaces at its beginnings, which means there's no col1 == 1. But, it's OK to ignore such lines, since doc, inline, and comment all (effectively) consider only lines without spaces at their beginnings.

It seems I should either

  1. rename is_complete variable to some appropriate name or
  2. change the implementation to check whether the line is NOT the multiline string or symbol.

Let me think a bit.

@yutannihilation
Copy link
Sponsor Collaborator Author

Sorry for deciding late. I chose this because 2. seems to make the implementation a bit complecated.

  1. rename is_complete variable to some appropriate name

@yihui yihui added this to the v1.21 milestone Oct 9, 2018
Copy link
Owner

@yihui yihui left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This is great! Thank you very much!

@yihui yihui merged commit 45eee67 into yihui:master Oct 9, 2018
yihui added a commit that referenced this pull request Oct 9, 2018
@yutannihilation
Copy link
Sponsor Collaborator Author

Thanks for merging!

@yutannihilation yutannihilation deleted the fix-spin-doc-comment branch October 9, 2018 23:04
@yihui
Copy link
Owner

yihui commented Oct 11, 2018

I feel you understand the knitr source code well enough and is also careful enough, so I just granted the push access to this repo to you, in case I'm too busy and you want to make a critical change (or merge an important PR). I have wished to give other people push access for long, but never did it, and you are the first official collaborator of this repo :)

@yutannihilation
Copy link
Sponsor Collaborator Author

Wow, thanks! I don't think I fully deserve such a great honor, but I'd love to help you when you are too busy. Honestly, I only understand the source codes partially, so I'll need to learn a lot... For the time being, I'll do my best not to make a mess before doing actual contribution. 👍

@yihui
Copy link
Owner

yihui commented Oct 12, 2018

Partially understanding the source code is enough. I don't fully understand the source code of all packages I maintain 😉

yihui pushed a commit to SiriChandanaGoruganti/knitr that referenced this pull request Dec 9, 2018
yihui added a commit to SiriChandanaGoruganti/knitr that referenced this pull request Dec 9, 2018
@Hemken
Copy link
Contributor

Hemken commented Mar 29, 2019

This had the unfortunate side-effect that I can no longer use spin() for languages other than R !

I've borrowed a bunch of your older code, and added a 'language' parameter to make this backward compatible for non-R languages. Thanks!

@yihui
Copy link
Owner

yihui commented Apr 1, 2019

@Hemken This sounds like a bug to me. Do you mind filing a new issue? Thanks!

@github-actions github-actions bot locked as resolved and limited conversation to collaborators Nov 10, 2020
Sign up for free to subscribe to this conversation on GitHub. Already have an account? Sign in.
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

None yet

3 participants