Allow escaping newlines in double-quoted strings to remove the EOL char(s) #1656

Open
krader1961 opened this issue Feb 7, 2023 · 15 comments

Comments

@krader1961
Contributor

There was a question in the chat channels about the Elvish equivalent for a POSIX here-doc illustrated by this example:

cat <<EOF
line 1
line 2
EOF

AFAIK, the only straightforward equivalent is this:

echo 'line 1
line 2' | cat

Or this:

echo ^
'line 1
line 2' | cat

If double-quoted strings allowed backslash to precede the EOL char(s) then this would be valid:

print "\
line 1
line 2
" | cat

Alternatively, a triple-quoted string would be aesthetically pleasing and arguably easier to use:

echo '''
line 1
line 2
''' | cat

Here the first and last EOL char(s) are elided by the triple-quoting mechanism. This, however, is not backwards compatible given the existing rule for interpreting single-quotes inside a single-quoted string. There is a similar problem for double-quoted strings. But, if we're willing to special-case triple quotes at the start and end of a line (when the sequence is otherwise a token with no adjacent chars other than EOL), then this is unlikely to break any existing code. It also allows us to introduce an additional filtering rule that removes leading whitespace common to all lines. For example:

elvish> echo '''
                 line 1
                   line 2
                 line 3
             ''' | cat
line 1
  line 2
line 3

Eliding EOL char(s) by prefacing them with a backslash in a double-quoted string is backward compatible and should be straightforward to implement. I can't see any downsides to supporting it and believe it should be implemented. I am ambivalent about the triple-quoted string idea but include it here to see what others think about the idea and whether it sparks any other proposals for making it easier, and clearer, to implement the equivalent of a POSIX here-doc in Elvish. I am also opposed to introducing the POSIX here-doc (cat <<WORD\n...\nWORD) syntax (even if its quirks are avoided) because I don't think it adds enough value relative to echo | cat (or print | cat) to justify the feature.

See also issue #1261.

@krader1961
Contributor Author

Note that Elvish does not support interpolating variables in double-quoted strings the way POSIX shells do; e.g., echo "a $var b". See https://elv.sh/ref/language.html#double-quoted-string. I think this is reasonable behavior for Elvish. However, it would be nice if there were an explicit mechanism for such interpolation, such as a str:template command, which would be easier to use if triple-quoted strings as described above were supported. Note that I am not suggesting exposing the Go text/template package, since it is clearly an inappropriate API for Elvish. But a templating mechanism limited to variable substitution, based on the existing builtin:printf command, would be extremely useful.

@hanche
Contributor

hanche commented Feb 8, 2023

Some random thoughts:

Isn't there a (much) greater need for removing the initial newline than the final one? I imagine one often would want the final newline in practice.

I think any syntax that clashes with present syntax, as the python style would, is a really bad idea. Instead, use some syntax that is not allowed today. For example, the line continuation character ^ is currently not allowed other than at the end of a line (or inside a quoted string of course), so we could declare that ^' at the end of a line starts a (possibly) multiline string from which the initial newline (and the final one, if so desired) is to be dropped. Similarly with ^", which would start a string with backslash escapes enabled.

This might make some sense because ^ is already used to suppress newlines, though in a slightly different way.

@hanche
Contributor

hanche commented Feb 8, 2023

A stray thought hit me as I was going to bed. Here is a solution in today's elvish:

print '
line 1
line 2
'[1..] | dog

Or you could use [1..-1] if you want to get rid of the final newline as well.

In light of this, removal of initial common white space is a stronger argument for new syntax.

(The dog is there to avert “useless use of cat” complaints.)

@krader1961
Contributor Author

krader1961 commented Feb 9, 2023

Isn't there a (much) greater need for removing the initial newline than the final one? I imagine one often would want the final newline in practice.

There are a couple of reasons. First, symmetry. Second, the echo builtin adds a newline. If you don't want a trailing newline simply use print (assuming the triple-quoted string elides the trailing newline). If the trailing newline in a triple-quoted string is retained then it becomes impossible to use print to output the string without a trailing newline. Here, I am, of course, assuming the trailing triple-quote must be at the start of a line; or at least would normally be placed there by convention.

I think any syntax that clashes with present syntax, as the python style would, is a really bad idea.

Okay, but your proposal to special-case a ^ as the last character on a line is also a backward incompatible change. It is also inconsistent with the current backslash sequences for double-quoted strings. Which of these is preferable:

var x = """
abc\t\
def
"""

var y = """
abc\t^
def
"""

And if we do decide to use ^ to elide the newline then we need to introduce \^ to allow a literal ^ inside a triple-double-quoted string.

@hanche
Contributor

hanche commented Feb 9, 2023

Very quick follow-up, as I am short on time:

I think you misread my proposal – eh, stray thought. I did not intend to suggest giving ^ any special significance inside a quoted string. I wanted to put it before:

echo ^'
First line
Second line'

That is currently illegal syntax, so this is not a breaking change.

And about the final newline: My thinking was, the echo command adds a newline, so better not have one at the end of the string. But elvish has print, which should obviate the need for that. Further, if you think of newlines as line terminators instead of line separators, the normal expectation would be for a multiline string to end with a newline. But that is a minor point.

@krader1961
Contributor Author

I think you misread my proposal – eh, stray thought. I did not intend to suggest giving ^ any special significance inside a quoted string. I wanted to put it before:

Ugh! No. Not least because it elides only the first newline and provides no mechanism for eliding newlines inside triple double-quoted strings, which matters for readability (keeping the literal's lines a reasonable length). For example:

var x = """
line 1\
23
line 456
"""

And about the final newline: My thinking was,...

I am confused if we are in agreement or not vis-a-vis eliding any final newline in a triple-quoted string. You wrote "the normal expectation would be for a multiline string to end with a newline." Which implies you expect a multiline, triple-quoted, string to always end in a newline. Which makes it impossible to use print to output the string without a trailing newline. My proposal is to elide the last newline precisely to make it possible to output the string without a trailing newline. Eliding the first newline is based on the assumption the syntax supports this:

var x = '''
line 1
'''

If you really want a leading newline in that case you simply add an empty line:

var x = '''

line 1
'''

@hanche
Contributor

hanche commented Feb 12, 2023

I am responding a bit out of order …

I am confused if we are in agreement or not vis-a-vis eliding any final newline in a triple-quoted string.

That is probably because I haven't made up my mind about it, so I am not advocating very strongly for it. In fact, I haven't made up my mind about any aspect of my proposal, except that I think it's worth investigating rather than dismissing out of hand.

So, with that in mind, let's look at the question of eliding the final newline. With my proposed syntax, just put the terminating quote at the end of the last line instead of at the beginning of the next line.

Perhaps we are talking past each other here? Perhaps we have different reasons for wanting the feature in the first place? After all, the good old single quote strings can be multiline strings already. There are two problems with them that I can see:

  1. We want to see the string in the source code as it is, most importantly including horizontal alignment if any. The closest we can get to that with today's syntax is to start the string with ^ at the end of the line, followed by ' at the start of the next line. But that still leaves the alignment one character width off.
  2. For ease of reading the source code, we'd like to remove initial common whitespace, so we can indent our string with the rest of the source.

And that's it, is it not? From this perspective, the final newline might be less important, though I concede that always putting the ending quote(s) on a separate line will help to make the string stand out better in the source code, and that is a good thing. And if we do that, then yes, eliding the final newline is a good idea. (Although the trick of adding [..-1] after the final quote might make that argument less compelling.)

Whether we go for ''' or ^', I really think we should also have a corresponding """ or ^". The idea being the same as the difference between ordinary single quoted and double quoted strings. I suggest almost the exact same difference for multiline strings, except for double quoted strings, also remove initial common whitespace, and allow a single backslash at the end of a line to elide the newline (and the following common whitespace).

Sorry for being long-winded, but since this is not so much technical as human oriented, I couldn't make it any shorter.

@krader1961
Contributor Author

I've decided that there should not be new literal string syntax. Instead, there should be a new dedent command that takes a string and removes common leading whitespace from each line, as well as any leading and trailing newline. I've already embedded the code to do this to make writing unit tests easier. See #1698. It would be trivial to expose that functionality as a new dedent command. Yes, a dedent command is less efficient than an equivalent literal string syntax but the added cost should be irrelevant (or nearly so) in every use case I can imagine. There may be pathological use cases I haven't imagined where the cost of a dedent command is unacceptable but it is doubtful they justify a new literal string syntax. And a new command to explicitly remove a leading and trailing newline, as well as common whitespace prefix, is arguably easier to document, less intrusive to the core Elvish language, and possibly useful for processing strings that are not string literals.
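
For illustration only, the behavior described above can be sketched in Python on top of the standard textwrap.dedent function. The function name dedent and the choice to drop at most one leading and one trailing newline are assumptions made for this sketch, not a specification of the Elvish command.

```python
import textwrap


def dedent(text: str) -> str:
    """Hypothetical stand-in for the proposed command, not an Elvish API."""
    # textwrap.dedent removes the whitespace prefix common to all non-blank
    # lines and reduces whitespace-only lines to bare newlines.
    text = textwrap.dedent(text)
    # Assumption: drop at most one leading and one trailing newline.
    if text.startswith("\n"):
        text = text[1:]
    if text.endswith("\n"):
        text = text[:-1]
    return text


literal = """
    line 1
      line 2
    line 3
"""
print(dedent(literal))
# line 1
#   line 2
# line 3
```

Because the sketch operates on an ordinary string value, it works equally well on text held in a variable, which is the "not string literals" case mentioned above.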

@xiaq
Member

xiaq commented May 9, 2023

The problem with a dedent command is that it's not possible for the parser to check the syntax. This means that a dedent command either has to always accept the input (which limits the design choice) or throw exceptions (which is not ideal). The testutil.Dedent utility I wrote panics on invalid inputs, which is fine for a test utility but not for other purposes.

Re the syntax, triple quotes would be an obvious choice, but - to recap the conversation a bit - ''' is valid and useful today (it starts a single-quoted string that contains ' as the first character). """ is less problematic: it is valid but not useful. So we can use """ but not '''. I don't like the asymmetry.

But there are other choices. Elvish currently reserves `, so it could be used for a new string literal syntax. Another possible choice is <<< and >>>.

@xiaq
Member

xiaq commented May 9, 2023

Re the original feature request in the title: I don't see a use case listed other than removing the first newline in a multi-line literal, but multi-line literals are better served by something that also removes initial whitespaces. Unless there are more use cases I'm inclined to reject this feature.

@krader1961
Contributor Author

krader1961 commented May 10, 2023

I don't see a use case listed other than removing the first newline in a multi-line literal, but multi-line literals are better served by something that also removes initial whitespaces.

The rationale was to remove the initial newline and elide leading whitespace like the POSIX "heredoc" mechanism does. I see in hindsight my example didn't make that clear. Furthermore, the POSIX "heredoc" mechanism only removes leading tabs, not spaces or a mixture of the two, which I consider an unnecessary and unwanted limitation.

The problem with a dedent command is that it's not possible for the parser to check the syntax.

I don't grok that comment (but then I haven't yet read your testutil.Dedent implementation). A dedent command simply takes a string and outputs the same string with an initial newline removed and any common whitespace prefix removed from each line. I don't see what syntax it needs to check for and panic if it isn't valid. If the user fails to provide the intended common whitespace prefix on each line that is their problem and the dedent command should simply produce the best output it can given the definition of its behavior.

@krader1961
Contributor Author

I looked at your testutil.Dedent function, @xiaq, and don't understand why you felt it necessary to impose the restriction that the first line contain the reference whitespace prefix. I appreciate that it makes the implementation much simpler. But it seems unnecessarily restrictive. Consider a sequence of lines where the first line (and possibly other lines) are empty. Such lines should be ignored when computing the minimum common whitespace prefix to be removed.

That your implementation panics if any subsequent line does not have the same prefix explains your comment that "the problem with a dedent command is that it's not possible for the parser to check the syntax." However, even if the parser were to validate that a literal string argument to a dedent command did not violate those requirements (thus producing a compile time rather than run time error) that would also necessarily restrict the implementation of dedent to an argument that is a literal string. Excluding the use of a variable whose value is a string is not desirable. As I wrote in my previous comment such strict validation might be useful but it should not be the default behavior. I maintain that the default behavior should be to find the minimum common prefix (ignoring empty lines) and remove that from all lines. That behavior is more expensive but also more useful and less surprising. If someone needs the speed of your simpler implementation it should require an explicit option to invoke it.

I googled a bit and this ECMAScript "String Dedent proposal" description best encapsulates the behavior I expect from an Elvish dedent command:

Common leading indentation is defined as the longest sequence of whitespace characters that match exactly on every line that contains non-whitespace. Empty lines and lines which contain only whitespace do not affect common leading indentation.

When removing the common leading indentation, lines which contain only whitespace will have all whitespace removed.
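
As a rough illustration of the rule quoted above, here is a minimal Python sketch. The name dedent_common is hypothetical, and the sketch covers only the two quoted sentences; it does not attempt to reproduce the rest of the ECMAScript proposal.

```python
import os


def dedent_common(text: str) -> str:
    """Remove the longest whitespace prefix shared by all non-blank lines."""
    lines = text.split("\n")
    # Only lines that contain non-whitespace take part in the computation.
    content = [ln for ln in lines if ln.strip()]
    # Leading whitespace of each such line.
    prefixes = [ln[: len(ln) - len(ln.lstrip())] for ln in content]
    # commonprefix compares character by character, so a tab never matches
    # a space even if they look the same width in an editor.
    common = os.path.commonprefix(prefixes) if prefixes else ""
    # Whitespace-only lines lose all whitespace; others lose the prefix.
    return "\n".join(ln[len(common):] if ln.strip() else "" for ln in lines)


print(repr(dedent_common("    a\n      b\n   \n    c")))  # 'a\n  b\n\nc'
```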

It is possible, even likely, the implementation I included in my pull-request did not conform to that behavior. But I strongly believe the behavior described above is what we want, at least as the default behavior, and that the behavior of your testutil.Dedent is not optimal and not what should be exposed as an Elvish command.

@krader1961
Contributor Author

Consider a sequence of lines where the first line (and possibly other lines) are empty. Such lines should be ignored when computing the minimum common whitespace prefix to be removed.

Note that by "first line" I mean the first line after ignoring a leading newline.

@xiaq
Member

xiaq commented May 14, 2023

The reason testutil.Dedent doesn't remove the common indentation is not so much about simplicity of implementation, but because this algorithm handles mixes of tabs and spaces poorly. Consider the following fragment of Go code:

	... dedent(`
		foo
	        bar
		`)

This looks like two lines with the same indentation, containing foo and bar, but the first line is indented with two tabs, while the second is indented with a tab and 8 spaces. Among the languages that adopt the algorithm of removing the common indentation, there are two different strategies to identify what the common indentation is:

  • Identify the minimum number of leading whitespace characters, and remove that number of characters from each line. Using this algorithm, the common prefix is "2 whitespaces", and the return value is "foo\n       bar\n" (seven spaces before bar). Java's text block uses this.

  • Identify the longest common prefix that consists of whitespace characters, and remove it from each line. Using this algorithm, the common prefix is "\t", and the return value is "\tfoo\n        bar\n" (eight spaces before bar). The ECMAScript's String.dedent proposal uses this.

Both are not ideal. Granted, one could configure their editor to show markers for spaces and tabs to make this problem easier to spot, but even with such configurations, when the text consists of many lines, using the wrong indentation in one of those lines will mess up the entire literal.
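
To make the difference concrete, here is a small Python sketch of the two strategies applied to the fragment above. The function names are hypothetical, and the sketch deliberately ignores the finer details of the real Java and ECMAScript rules; it only contrasts "count characters" with "match the prefix exactly".

```python
import os

# Line 1 is indented with two tabs; line 2 with one tab plus eight spaces.
SAMPLE = "\t\tfoo\n\t        bar\n"


def leading_ws(line: str) -> str:
    """Return the run of whitespace at the start of a line."""
    return line[: len(line) - len(line.lstrip())]


def dedent_by_count(text: str) -> str:
    """Strategy 1: strip the minimum number of leading whitespace characters."""
    lines = text.split("\n")
    counts = [len(leading_ws(ln)) for ln in lines if ln.strip()]
    n = min(counts, default=0)
    return "\n".join(ln[n:] if ln.strip() else "" for ln in lines)


def dedent_by_prefix(text: str) -> str:
    """Strategy 2: strip the longest whitespace prefix that matches exactly."""
    lines = text.split("\n")
    prefixes = [leading_ws(ln) for ln in lines if ln.strip()]
    common = os.path.commonprefix(prefixes) if prefixes else ""
    return "\n".join(ln[len(common):] if ln.strip() else "" for ln in lines)


print(repr(dedent_by_count(SAMPLE)))   # 'foo\n       bar\n'
print(repr(dedent_by_prefix(SAMPLE)))  # '\tfoo\n        bar\n'
```

Neither result is the "foo\nbar\n" that the source text visually suggests, which is the point of the example.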

The solution is to use a fixed line to establish the indentation, and require that all other lines start with that indentation. My testutil.Dedent implementation uses the first line, and it would panic in the example above because the second line doesn't start with the same indentation as the first line. However, this has the disadvantage that the first line in the desired content must be unindented.
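
A minimal Python sketch of the first-line approach follows. It is not the testutil.Dedent source: the name dedent_first_line is hypothetical, a ValueError stands in for the Go panic, and the handling of blank lines is an assumption.

```python
def dedent_first_line(text: str) -> str:
    """Use the first line's indentation as the reference for every line."""
    lines = text.split("\n")
    first = lines[0]
    indent = first[: len(first) - len(first.lstrip())]
    out = []
    for number, line in enumerate(lines, start=1):
        if not line.strip():
            # Assumption: blank lines are accepted and emitted empty.
            out.append("")
        elif line.startswith(indent):
            out.append(line[len(indent):])
        else:
            # A line that lacks the reference indentation is an error
            # instead of silently changing the result for every line.
            raise ValueError(
                f"line {number} does not start with the reference indentation"
            )
    return "\n".join(out)


print(repr(dedent_first_line("  a\n    b\n  c")))  # 'a\n  b\nc'
```

With the tab-and-spaces fragment above, the second line does not start with two tabs, so the sketch raises instead of producing a surprising result.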

But there is a solution to that too: use the last line to establish the reference indentation. This is what Swift's multiline string literal uses. On first look it seems to have just transferred the requirement to the last line, but Swift solves this by requiring the closing delimiter to appear on its own and eliding the trailing newline. Swift's doc has a few examples that should make this clear.

Swift's syntax also solves another problem: it allows writing strings where each line is indented by the same amount. In fact, Swift's syntax is the only one that doesn't place any restriction on what the text content can be.
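
The closing-delimiter rule can be sketched in Python as follows. This is an illustration of the idea only, not Swift's actual implementation: the name dedent_by_closing is hypothetical, and body is assumed to be the raw text between the newline after the opening delimiter and the closing delimiter itself.

```python
def dedent_by_closing(body: str) -> str:
    """Dedent using the whitespace in front of the closing delimiter."""
    lines = body.split("\n")
    # The last piece is whatever precedes the closing delimiter on its line.
    indent = lines.pop()
    if indent.strip():
        raise ValueError("closing delimiter must be on a line of its own")
    out = []
    for number, line in enumerate(lines, start=1):
        if line.startswith(indent):
            out.append(line[len(indent):])
        elif not line.strip():
            # Assumption: whitespace-only lines may be shorter than the indent.
            out.append("")
        else:
            raise ValueError(
                f"line {number} does not start with the closing indentation"
            )
    # Joining without a final newline elides the newline that precedes
    # the closing delimiter.
    return "\n".join(out)


print(repr(dedent_by_closing("  foo\n    bar\n  ")))  # 'foo\n  bar'
print(repr(dedent_by_closing("    a\n    b\n  ")))    # '  a\n  b'
```

The second call shows the property mentioned above: every line of the result is indented, which the first-line and common-indentation strategies cannot produce. A result that ends with a newline is written by leaving an empty line before the closing delimiter.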

| Strategy | Handles mix of tabs and spaces | Admits all input | Impossible output |
| --- | --- | --- | --- |
| Remove common indentation | No | Yes | Text with common indentation on each line |
| Remove first line's indentation (testutil.Dedent) | Yes | No | Text with the first line indented |
| Remove closing delimiter's indentation + elide trailing newline (Swift) | Yes | No | None |

Now, removing common indentation does have the advantage that it admits all input, and is a sensible choice when it has to be implemented as a function rather than new syntax. But for a new syntax it'd be a shame to not follow Swift's lead. It's very well thought out and avoids the problems of other syntax choices. The only drawback I see is that the eliding of the trailing newline is slightly unusual, but it's not that hard to get used to.


Another reason for wanting a new string literal syntax is to allow writing text that itself contains a lot of quotes. This is more important than eliding indentation (which is purely cosmetic).

I've already rejected triple single quotes (because it does something useful today) and triple double quotes (because I don't like the asymmetry if you can do """ but not '''), and I'd like there to be two variants of multi-line literals, one supporting \ escape sequences (like double-quoted strings) and one not supporting them (like single-quoted strings). I'm leaning towards using <<<" / ">>> for the former and <<<' / '>>> for the latter, with more pairs of <> also supported. Examples:

var x = <<<"
  foo\x00
    bar
  ">>>
# equivalent to:
var x = "foo\x00
  bar"

var y = <<<'
  foo\x00
    bar
  '>>>
# equivalent to:
var y = 'foo\x00
  bar'

var z = <<<<'
  var x = <<<'
    foo
    '>>>
  '>>>>
# equivalent to:
var z = "var x = <<<'
  foo
  '>>>"

@krader1961
Contributor Author

But there is a solution to that too: use the last line to establish the reference indentation. This is what Swift's multiline string literal uses.

I agree that is the best approach for dedenting string literals. However, that precludes supporting dedenting arbitrary strings; e.g., via variable interpolation to a hypothetical dedent command. As you noted in your reply:

Now, removing common indentation does have the advantage that it admits all input, and is a sensible choice when it has to be implemented as a function rather than new syntax.

This is a case where my experience with other languages (e.g., POSIX 1003.1 and Python) unduly affected my thinking. I can't recall a single instance where I have used, or seen, the Python textwrap.dedent function used with anything other than a string literal. Which means I don't really have a good argument for introducing an Elvish dedent command rather than a new string literal syntax. The latter is clearly preferable; especially if the Swift language heuristic is employed.
