Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Add specific unicode string literal type (and make ascii the default) #5167

Closed
chriseth opened this issue Oct 8, 2018 · 14 comments · Fixed by #9412
Closed

Add specific unicode string literal type (and make ascii the default) #5167

chriseth opened this issue Oct 8, 2018 · 14 comments · Fixed by #9412
Assignees
Labels
breaking change ⚠️ language design :rage4: Any changes to the language, e.g. new features

Comments

@chriseth
Copy link
Contributor

chriseth commented Oct 8, 2018

From the zeppelin audit:

Strings in Solidity are not only used for displaying information: for example, it is very common to have them be a key of a mapping. Because UTF-8 allows for multiple invisible characters (e.g. ZERO WIDTH SPACE), and for characters that look almost like common characters (e.g. GREEK QUESTION MARK), this usage can be extremely problematic, and lead to underhanded backdoors, exploits, etc. OpenZeppelin’s main access-control contracts are affected by this, as are multiple other string-based implementations.
Consider adding a non-UTF-8 string type to prevent these situations from arising in the first place.

@maraoz
Copy link

maraoz commented Oct 24, 2018

See OpenZeppelin/openzeppelin-contracts#1090 (comment) to learn how OpenZeppelin was affected by this

@chriseth
Copy link
Contributor Author

chriseth commented Nov 7, 2018

An obvious way to introduce them is via a prefix (like in hex"0101") - the main question would be whether the non-prefixed string should allow utf8 or only ascii.

@chriseth chriseth added the language design :rage4: Any changes to the language, e.g. new features label Nov 7, 2018
@axic
Copy link
Member

axic commented Nov 7, 2018

If we do not want a breaking change: ascii"abcd"

If we are happy to do a breaking change, then the current strings would need to be prefixed with unicode or utf8: unicode"this is a string...".

@chriseth
Copy link
Contributor Author

chriseth commented Nov 7, 2018

Note that things like "\u1234" should still be allowed in the "non-utf8-strings".

@axic
Copy link
Member

axic commented Nov 14, 2018

Note that things like "\u1234" should still be allowed in the "non-utf8-strings".

Why?

@chriseth
Copy link
Contributor Author

Because the idea is that the source representation does not have any "weird" characters, but the internal representation can be anything.

@maraoz would you agree?

@maraoz
Copy link

maraoz commented Nov 29, 2018

@chriseth agreed!

@leonardoalt
Copy link
Member

Should this go to the backlog?

@chriseth
Copy link
Contributor Author

Preliminary vote: make ascii strings the default and require a prefix for unicode strings

@chriseth chriseth changed the title Add a non-utf8 string type Add specific unicode string literal type (and make ascii the default) Jan 15, 2020
@axic
Copy link
Member

axic commented Jan 15, 2020

Decision on meeting:

  • Change string literal to not allow anything but ASCII printable characters and escape codes.
  • Introduce unicode prefix for string literals, which also allows Unicode characters.

@axic
Copy link
Member

axic commented Jul 8, 2020

Change string literal to not allow anything but ASCII printable characters and escape codes.

Does an ascii string allow unicode escape?

string a = "\u1234"; // is this valid?
string b = unicode"\u1234"; // is this valid?

@chriseth
Copy link
Contributor Author

chriseth commented Jul 8, 2020

Yes, we said unicode escapes in default strings are fine, but not unicode characters.

@axic
Copy link
Member

axic commented Jul 14, 2020

While implementing I had realised a few things: it is quite a large change allowing escapes in non-unicode literals, because the scanner just turns the escape into codepoints.

First thought implementation should have no effect on the design, but this I think is a useful consideration:

  1. Hex string literals can contain any kind of data (ascii, unicode, etc.)
  2. Regular string literals (i.e. "hello world") should only contain ASCII characters, and cannot contain unicode escapes (disabling the escape in the scanner)
  3. Unicode string literals (i.e. unicode"⚠️" or unicode"\u00a0") can contain ASCII, Unicode or Unicode escapes

Assigning any literal to a string type should check for UTF-8 encoding (this is something we have now).

@axic
Copy link
Member

axic commented Jul 28, 2020

The rules described in #5167 (comment) were implemented.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
breaking change ⚠️ language design :rage4: Any changes to the language, e.g. new features
Projects
None yet
Development

Successfully merging a pull request may close this issue.

4 participants