Parse the text to detect doc comments and inlines properly in spin() #1605

yutannihilation · 2018-09-15T07:37:59Z

Problem

(originally reported at rstudio/rmarkdown#1444)

Currently, spin() detects doc comments by checking if the "^#+'[ ]?" matches on a line-by-line basis.

Line 71 in 01407a7

    
           r = rle(grepl(doc, x) | i)  # inline expressions are treated as doc instead of code

Accordingly, it mistakenly detects string literals that contain #' as doc comments in cases like this:

code <- "
#' hello, world!"

The same thing occurs on inline in the following case:

code <- "
{{ Sys.Date() }}
"

In addition to string literals, symbols can also be multi-line:

`
{{very_strange_var_name}}
` <- 1

Fix

To identify the true comments and double-brackets correctly, we need to parse the whole code first and ignore string literals and symbols that are over multiple lines. To me, @yihui's suggestion of utils::getParsedData() at rstudio/rmarkdown#1444 (comment) seems simpler and faster (benchmark: #1605 (comment)), so I implemented as such.

The idea is simple; since multi-line STR_CONST or SYMBOL are tokenized as below, if the beginning of a line is contained by them, there are no col1 = 1 token in the line. So, we can filter the parsed data by col1 == 1 to see what lines are "complete".

codes <-
  c("{{ Sys.Date() }}",   # true inline
    '"',
    "{{ Sys.Date() }}\"", # false
    "`",
    "{{ Sys.Date() }}",   # false
    "`",
    "#' Hello world!",    # true doc comment
    '"',
    "#' Hello world!\"",  # false
    "`",
    "#' Hello world!",    # false
    "`")

d <- getParseData(parse(text = codes))
options(width = 400)
d
#>    line1 col1 line2 col2 id parent                token terminal                   text
#> 16     1    1     1   16 16      0                 expr    FALSE                       
#> 1      1    1     1    1  1     16                  '{'     TRUE                      {
#> 12     1    2     1   15 12     16                 expr    FALSE                       
#> 2      1    2     1    2  2     12                  '{'     TRUE                      {
#> 8      1    4     1   13  8     12                 expr    FALSE                       
#> 3      1    4     1   11  3      5 SYMBOL_FUNCTION_CALL     TRUE               Sys.Date
#> 5      1    4     1   11  5      8                 expr    FALSE                       
#> 4      1   12     1   12  4      8                  '('     TRUE                      (
#> 6      1   13     1   13  6      8                  ')'     TRUE                      )
#> 9      1   15     1   15  9     12                  '}'     TRUE                      }
#> 13     1   16     1   16 13     16                  '}'     TRUE                      }
#> 20     2    1     3   17 20     22            STR_CONST     TRUE   "\n{{ Sys.Date() }}"
#> 22     2    1     3   17 22      0                 expr    FALSE                       
#> 25     4    1     6    1 25     27               SYMBOL     TRUE `\n{{ Sys.Date() }}\n`
#> 27     4    1     6    1 27      0                 expr    FALSE                       
#> 30     7    1     7   15 30    -35              COMMENT     TRUE        #' Hello world!
#> 33     8    1     9   16 33     35            STR_CONST     TRUE    "\n#' Hello world!"
#> 35     8    1     9   16 35      0                 expr    FALSE                       
#> 38    10    1    12    1 38     40               SYMBOL     TRUE  `\n#' Hello world!\n`
#> 40    10    1    12    1 40      0                 expr    FALSE

complete_lines <- unique(d[d$col1 == 1, ]$line1)
complete_lines
#> [1]  1  2  4  7  8 10

# doc
grep("^#+'[ ]?", codes[complete_lines], value = TRUE)
#> [1] "#' Hello world!"

# inline
grep("^[{][{](.+)[}][}][ ]*$", codes[complete_lines], value = TRUE)
#> [1] "{{ Sys.Date() }}"

Created on 2018-09-16 by the reprex package (v0.2.0).

yutannihilation · 2018-09-15T13:42:25Z

Benchmark here. In summary,

using utils::getParsedData() is slower than using try_parse() when the text is of a few lines
using utils::getParsedData() is faster than using try_parse() when the text is of hundreds of lines.
using utils::getParsedData() constantly takes a few milleseconds, whereas, the more the number of lines is, using try_parse() takes more time. Especially, the latter is slow when the code contains string literals that is over many lines.

So, I think the overhead of using utils::getParsedData() is acceptable and better than the case of try_parse().

# define functions -------------------------

grepl_doc_comment = function(doc, x) {
  d = getParseData(parse(text = x))
  doc_lines = d[d$col1 == 1 & d$token == "COMMENT" & grepl(doc, d$text), ]$line1
  seq_along(x) %in% doc_lines
}

try_parse = function(code, silent = TRUE) {
  !inherits(
    try(parse(text = code, keep.source = FALSE), silent = silent), 'try-error'
  )
}

grepl_doc_comment2 = function(doc, x) {
  pos = 1
  is_complete = logical(length(x))
  for (i in seq_along(x)) {
    if (!try_parse(x[seq(pos, i)])) {
      next()
    }
    if (i == pos) {
      is_complete[i] = TRUE
    }
    pos = i + 1
  }
  is_complete & grepl(doc, x)
}


txt1 = c("#' test", "code <- \"", "#' test\"")

# confirm results --------------------------

grepl("^#+'[ ]?", txt1) # wrong
#> [1]  TRUE FALSE  TRUE
grepl_doc_comment("^#+'[ ]?", txt1) # ok
#> [1]  TRUE FALSE FALSE
grepl_doc_comment2("^#+'[ ]?", txt1) # ok
#> [1]  TRUE FALSE FALSE

# benchmark --------------------------------

bench::mark(
  grepl("^#+'[ ]?", txt1),
  grepl_doc_comment("^#+'[ ]?", txt1),
  grepl_doc_comment2("^#+'[ ]?", txt1),
  check = FALSE
)
#> # A tibble: 3 x 10
#>   expression     min     mean  median    max `itr/sec` mem_alloc  n_gc
#>   <chr>      <bch:t> <bch:tm> <bch:t> <bch:>     <dbl> <bch:byt> <dbl>
#> 1 "grepl(\"~   5.1us  11.92us   7.5us 1.25ms    83906.        0B     0
#> 2 "grepl_do~ 710.6us   1.05ms 835.5us 4.77ms      952.    72.6KB     4
#> 3 "grepl_do~ 424.1us 656.52us   654us 2.24ms     1523.        0B     2
#> # ... with 2 more variables: n_itr <int>, total_time <bch:tm>

txt2 = rep(txt1, times = 100)
bench::mark(
  grepl("^#+'[ ]?", txt2),
  grepl_doc_comment("^#+'[ ]?", txt2),
  grepl_doc_comment2("^#+'[ ]?", txt2),
  check = FALSE
)
#> # A tibble: 3 x 10
#>   expression     min    mean  median     max `itr/sec` mem_alloc  n_gc
#>   <chr>      <bch:t> <bch:t> <bch:t> <bch:t>     <dbl> <bch:byt> <dbl>
#> 1 "grepl(\"~  47.2us 71.22us  51.4us  1.61ms   14042.     1.22KB     1
#> 2 "grepl_do~  1.28ms  2.02ms  1.69ms  4.59ms     496.   289.19KB     3
#> 3 "grepl_do~ 43.86ms 53.94ms 53.05ms 71.37ms      18.5    3.66KB     2
#> # ... with 2 more variables: n_itr <int>, total_time <bch:tm>

txt3 = rep(c("code <- \"", "", "", "", "", "", "", "", "", "#' test\""), times = 30)
bench::mark(
  grepl("^#+'[ ]?", txt3),
  grepl_doc_comment("^#+'[ ]?", txt3),
  grepl_doc_comment2("^#+'[ ]?", txt3),
  check = FALSE
)
#> # A tibble: 3 x 10
#>   expression     min     mean   median      max `itr/sec` mem_alloc  n_gc
#>   <chr>      <bch:t> <bch:tm> <bch:tm> <bch:tm>     <dbl> <bch:byt> <dbl>
#> 1 "grepl(\"~  36.8us  53.86us   39.7us   2.73ms  18567.      1.22KB     0
#> 2 "grepl_do~   844us   1.25ms   1.01ms   3.63ms    798.     75.94KB     4
#> 3 "grepl_do~ 115.4ms 131.84ms 133.41ms 146.72ms      7.58    3.66KB     2
#> # ... with 2 more variables: n_itr <int>, total_time <bch:tm>

yutannihilation · 2018-09-16T00:22:50Z

Hmm, comment is a bit harder since it is allowed to be syntactically invalid, which means we cannot parse them... (# /* is no problem, but /* is)

/*
print("This is comment
*/
")

yutannihilation · 2018-09-16T00:37:15Z

Given that /* and */ are outside of R's syntax, they should be considered as a kind of macro, not a syntactical parts. So, it doesn't matter whether the comment notation is in a string literal or not. Now I feel it is natural that the above code is treated as this:

")

yutannihilation · 2018-09-16T00:49:37Z

Ready for review now.

yihui · 2018-09-26T15:33:02Z

Sounds great! Thanks! I'll review it either this Friday or a week later.

yutannihilation · 2018-09-26T17:16:11Z

Thanks, no hurry at all :)

yutannihilation · 2018-09-27T04:44:58Z

Sorry, this sentence is not true, although the fix may stay the same...

So, we can filter the parsed data by col1 == 1 to see what lines are "complete".

A "complete" line can have spaces at its beginnings, which means there's no col1 == 1. But, it's OK to ignore such lines, since doc, inline, and comment all (effectively) consider only lines without spaces at their beginnings.

It seems I should either

rename is_complete variable to some appropriate name or
change the implementation to check whether the line is NOT the multiline string or symbol.

Let me think a bit.

yutannihilation · 2018-10-07T05:17:35Z

Sorry for deciding late. I chose this because 2. seems to make the implementation a bit complecated.

rename is_complete variable to some appropriate name

… FALSE)

yihui

This is great! Thank you very much!

yutannihilation · 2018-10-09T23:04:19Z

Thanks for merging!

yihui · 2018-10-11T19:44:42Z

I feel you understand the knitr source code well enough and is also careful enough, so I just granted the push access to this repo to you, in case I'm too busy and you want to make a critical change (or merge an important PR). I have wished to give other people push access for long, but never did it, and you are the first official collaborator of this repo :)

yutannihilation · 2018-10-12T01:08:33Z

Wow, thanks! I don't think I fully deserve such a great honor, but I'd love to help you when you are too busy. Honestly, I only understand the source codes partially, so I'll need to learn a lot... For the time being, I'll do my best not to make a mess before doing actual contribution. 👍

yihui · 2018-10-12T02:58:25Z

Partially understanding the source code is enough. I don't fully understand the source code of all packages I maintain 😉

…ihui#1605) Fixes rstudio/rmarkdown#1444

Hemken · 2019-03-29T20:05:23Z

This had the unfortunate side-effect that I can no longer use spin() for languages other than R !

I've borrowed a bunch of your older code, and added a 'language' parameter to make this backward compatible for non-R languages. Thanks!

yihui · 2019-04-01T19:17:44Z

@Hemken This sounds like a bug to me. Do you mind filing a new issue? Thanks!

Check if doc comments are not in string literals

bb60592

yutannihilation added 3 commits September 15, 2018 23:37

Add a failing test for spin() inline

675532b

Fix outdated description about inline argument of spin()

4bb3dab

Fix spin() inline

c51de62

yutannihilation changed the title ~~[WIP] Parse the text to detect comments properly in spin()~~ [WIP] Parse the text to detect docs, inlines, and comments properly in spin() Sep 16, 2018

yutannihilation changed the title ~~[WIP] Parse the text to detect docs, inlines, and comments properly in spin()~~ [WIP] Parse the text to detect doc comments and inlines properly in spin() Sep 16, 2018

Add more tests

fc4282c

yutannihilation changed the title ~~[WIP] Parse the text to detect doc comments and inlines properly in spin()~~ Parse the text to detect doc comments and inlines properly in spin() Sep 16, 2018

This was referenced Sep 25, 2018

Use proper number of backticks to spin() Rmd #1611

Merged

Specific case when render() can't parse source code rstudio/rmarkdown#1444

Closed

Rename confusing variable in spin.R

d65af60

yihui added this to the v1.21 milestone Oct 9, 2018

use parse(keep.source = TRUE) in case users set options(keep.source =…

ad7c6d3

… FALSE)

yihui approved these changes Oct 9, 2018

View reviewed changes

yihui merged commit 45eee67 into yihui:master Oct 9, 2018

yihui added a commit that referenced this pull request Oct 9, 2018

a news item for #1605 and #1611

3abb642

yutannihilation deleted the fix-spin-doc-comment branch October 9, 2018 23:04

yihui pushed a commit to SiriChandanaGoruganti/knitr that referenced this pull request Dec 9, 2018

Parse the text to detect doc comments and inlines properly in spin() (y…

9c68816

…ihui#1605) Fixes rstudio/rmarkdown#1444

yihui added a commit to SiriChandanaGoruganti/knitr that referenced this pull request Dec 9, 2018

a news item for yihui#1605 and yihui#1611

c04ef9b

yihui mentioned this pull request Nov 22, 2019

Use knitr::spin() with R and Python code in one document #1773

Open

3 tasks

Hemken mentioned this pull request Nov 25, 2019

Using spin() with langages other than R #1779

Closed

3 tasks

github-actions bot locked as resolved and limited conversation to collaborators Nov 10, 2020

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Parse the text to detect doc comments and inlines properly in spin() #1605

Parse the text to detect doc comments and inlines properly in spin() #1605

yutannihilation commented Sep 15, 2018 •

edited

Loading

yutannihilation commented Sep 15, 2018 •

edited

Loading

yutannihilation commented Sep 16, 2018

yutannihilation commented Sep 16, 2018 •

edited

Loading

yutannihilation commented Sep 16, 2018

yihui commented Sep 26, 2018

yutannihilation commented Sep 26, 2018

yutannihilation commented Sep 27, 2018

yutannihilation commented Oct 7, 2018

yihui left a comment

yutannihilation commented Oct 9, 2018

yihui commented Oct 11, 2018

yutannihilation commented Oct 12, 2018

yihui commented Oct 12, 2018

Hemken commented Mar 29, 2019

yihui commented Apr 1, 2019

Parse the text to detect doc comments and inlines properly in spin() #1605

Parse the text to detect doc comments and inlines properly in spin() #1605

Conversation

yutannihilation commented Sep 15, 2018 • edited Loading

Problem

Fix

yutannihilation commented Sep 15, 2018 • edited Loading

yutannihilation commented Sep 16, 2018

yutannihilation commented Sep 16, 2018 • edited Loading

yutannihilation commented Sep 16, 2018

yihui commented Sep 26, 2018

yutannihilation commented Sep 26, 2018

yutannihilation commented Sep 27, 2018

yutannihilation commented Oct 7, 2018

yihui left a comment

Choose a reason for hiding this comment

yutannihilation commented Oct 9, 2018

yihui commented Oct 11, 2018

yutannihilation commented Oct 12, 2018

yihui commented Oct 12, 2018

Hemken commented Mar 29, 2019

yihui commented Apr 1, 2019

yutannihilation commented Sep 15, 2018 •

edited

Loading

yutannihilation commented Sep 15, 2018 •

edited

Loading

yutannihilation commented Sep 16, 2018 •

edited

Loading