regexp: confusing behavior on invalid utf-8 sequences #11185

Closed
dvyukov opened this issue Jun 12, 2015 · 2 comments

dvyukov commented Jun 12, 2015

The following program:

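package main

import "regexp"

func main() {
    // "\xd1" is a lone UTF-8 lead byte, "\xd1\x84" is a valid two-byte
    // sequence (U+0444), and "\xd1\xd1" is two lone lead bytes.
    re := regexp.MustCompile(".")
    println(re.MatchString("\xd1"))
    println(re.MatchString("\xd1\x84"))
    println(re.MatchString("\xd1\xd1"))
    // Same three inputs against a two-character pattern.
    re = regexp.MustCompile("..")
    println(re.MatchString("\xd1"))
    println(re.MatchString("\xd1\x84"))
    println(re.MatchString("\xd1\xd1"))
}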

prints:

true
true
true
false
false
true

While the following C++ program:

#include <stdio.h>
#include <re2/re2.h>

int main() {
    RE2 re1(".");
    printf("%d\n", RE2::PartialMatch("\xd1", re1));
    printf("%d\n", RE2::PartialMatch("\xd1\x84", re1));
    printf("%d\n", RE2::PartialMatch("\xd1\xd1", re1));
    RE2 re2(".");
    printf("%d\n", RE2::PartialMatch("\xd1", re2));
    printf("%d\n", RE2::PartialMatch("\xd1\x84", re2));
    printf("%d\n", RE2::PartialMatch("\xd1\xd1", re2));
}

prints:

0
1
0
0
1
0

This raises two questions:

  1. Why does the behavior differ between regexp and re2? (re2 seems to be more consistent.)
  2. Why does "\xd1\xd1" match both "." and ".."? I could understand it matching one or the other, but not both. Is it one character or two?

go version devel +b0532a9 Mon Jun 8 05:13:15 2015 +0000 linux/amd64

dvyukov commented Jun 12, 2015

Here are more cases where regexp and re2 disagree on invalid UTF-8:

re=".$" str="\xb1\x98" regexp=true re2=false
panic: regexp and re2 disagree on regexp match

re=".*(..b)." str="(.a|.b\xdb|" regexp=true re2=false
panic: regexp and re2 disagree on regexp match

re="\\Q\xb4\\Q" regexp=<nil> re2=false
panic: regexp and re2 disagree on regexp validity

re="\\QT\x82\\E\\QT\\E" str="c^|^\\QTt\\c" regexp=<nil> re2=false
panic: regexp and re2 disagree on regexp validity

re="^((?:.*)+?(?:.*)+?)$" str="\xff\xbf\x80\x80$^^.^^^^((?.^^^" regexp=true re2=false
panic: regexp and re2 disagree on regexp match

re="\\Q\x8a-" str="o\\Q" regexp=<nil> re2=false
panic: regexp and re2 disagree on regexp validity

re="." str="\xd6" regexp=true re2=false
panic: regexp and re2 disagree on regexp match

re="[^-9]+z" str="\xbfz)^(?:" regexp=true re2=false
panic: regexp and re2 disagree on regexp match
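
Every case listed above contains invalid UTF-8 in either the pattern or the input. A differential test that only wants to compare the two engines on well-formed text could filter such cases out first. The following is a minimal sketch under that assumption; `wellFormed` is a hypothetical helper, not something taken from the test harness that produced the output above.

```go
package main

import (
	"fmt"
	"unicode/utf8"
)

// wellFormed reports whether both the pattern and the input are valid UTF-8.
// It is a hypothetical filter for a differential test, not part of regexp.
func wellFormed(pattern, input string) bool {
	return utf8.ValidString(pattern) && utf8.ValidString(input)
}

func main() {
	// A few (pattern, input) pairs from the list above, plus one valid pair.
	cases := [][2]string{
		{".$", "\xb1\x98"},
		{"\\Q\xb4\\Q", ""},
		{".", "\xd6"},
		{"[^-9]+z", "\xbfz)^(?:"},
		{".", "\xd1\x84"},
	}
	for _, c := range cases {
		fmt.Printf("re=%q str=%q wellFormed=%v\n", c[0], c[1], wellFormed(c[0], c[1]))
	}
}
```

Only the last pair passes the filter; the others would be skipped rather than reported as disagreements.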

ianlancetaylor added this to the Go1.6 milestone Jun 12, 2015

rsc commented Oct 14, 2015

In Go, "." matches a single malformed UTF-8 sequence; in RE2 it does not. This is mainly due to the implementation details of each, but I wouldn't change either now.
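
The Go side of this can be observed by decoding the inputs with unicode/utf8 and listing the byte spans that "." matches. This is an illustrative sketch only: `dotSpans` is a hypothetical helper, and the example relies on the documented rule that each byte of an invalid UTF-8 sequence is treated as if it encoded utf8.RuneError (U+FFFD).

```go
package main

import (
	"fmt"
	"regexp"
	"unicode/utf8"
)

// dotSpans returns the [start, end) byte ranges of successive "." matches in s.
// Hypothetical helper for illustration; not part of the regexp package.
func dotSpans(s string) [][]int {
	return regexp.MustCompile(".").FindAllStringIndex(s, -1)
}

func main() {
	for _, s := range []string{"\xd1", "\xd1\x84", "\xd1\xd1"} {
		// Decode the first rune the same way the utf8 package does.
		r, size := utf8.DecodeRuneInString(s)
		fmt.Printf("% x: valid=%v runes=%d first=%U size=%d spans=%v\n",
			s, utf8.ValidString(s), utf8.RuneCountInString(s), r, size, dotSpans(s))
	}
}
```

For "\xd1\xd1" the helper reports two one-byte spans, so Go treats that input as two characters, each a malformed sequence of length one.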

As for the second question, "xx" matches against both "." and ".." too.
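
The reason is that MatchString is unanchored: it reports whether the string contains any match, so "." succeeds on any non-empty input that has a character other than newline. Anchoring the pattern with ^ and $ separates the one-character and two-character cases. A small sketch (the `matches` helper is hypothetical and exists only for this example):

```go
package main

import (
	"fmt"
	"regexp"
)

// matches reports whether pattern matches anywhere in s.
// Hypothetical convenience wrapper around regexp.MatchString semantics.
func matches(pattern, s string) bool {
	return regexp.MustCompile(pattern).MatchString(s)
}

func main() {
	inputs := []string{"xx", "\xd1", "\xd1\x84", "\xd1\xd1"}
	patterns := []string{".", "..", "^.$", "^..$"}
	for _, s := range inputs {
		for _, p := range patterns {
			fmt.Printf("re=%-6q str=%-10q match=%v\n", p, s, matches(p, s))
		}
	}
}
```

With the anchored patterns, "\xd1\xd1" matches only "^..$" and "\xd1\x84" matches only "^.$", which answers the "one character or two" question for Go.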

rsc closed this as completed Oct 14, 2015
golang locked and limited conversation to collaborators Oct 17, 2016