Implemented limited detection of unmatchable token types. #593

Merged (4 commits, Nov 9, 2017)
60 changes: 60 additions & 0 deletions docs/resolving_lexer_errors.md
@@ -2,6 +2,7 @@

* [No LINE_BREAKS Error.](#LINE_BREAKS)
* [Unexpected RegExp Anchor Error.](#ANCHORS)
* [Token Can Never Be Matched.](#UNREACHABLE)


### <a name="LINE_BREAKS"></a> No LINE_BREAKS Error.
@@ -97,5 +98,64 @@ const semVer = createToken({
})
```



### <a name="UNREACHABLE"></a> Token can never be matched.

This error means that a Token type can never be successfully matched, because a
**previous** Token type in the lexer definition will **always** be matched instead.
This happens because Chevrotain's default behavior is to attempt to match
tokens **in the order** in which they appear in the lexer definition.

For example:

```javascript
const ForKeyword = createToken({
name: "ForKeyword",
pattern: /for/
})

const Identifier = createToken({
name: "Identifier",
pattern: /[a-zA-Z]+/
})

// Will throw Token <ForKeyword> can never be matched...
// Because the input "for" is also a valid identifier
// and matching an identifier will be attempted first.
const myLexer = new chevrotain.Lexer([Identifier, ForKeyword])
```

* Note that this validation is limited to simple patterns such as keywords.
Detecting the more general case, in which any pattern is a strict subset of a preceding pattern,
would require much more in-depth RegExp analysis capabilities.
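For instance, the plain-JavaScript sketch below (an illustration of the subset test, not Chevrotain's actual implementation) shows why only meta-character-free patterns are checked: the test simply executes the preceding pattern against the keyword's text, so a shadowed keyword whose own pattern contains meta characters would go undetected.

```javascript
// Sketch of the subset test: a token is unreachable when a preceding
// pattern already matches that token's text starting at offset 0.
const identifierPattern = /[a-zA-Z]+/

function isShadowed(precedingPattern, keywordText) {
    const match = precedingPattern.exec(keywordText)
    return match !== null && match.index === 0
}

isShadowed(identifierPattern, "for") // true -> "for" can never be matched

// A keyword pattern containing meta characters, e.g. /for(each)?/, is
// skipped by the validation even though it can be shadowed just the same.
```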

To resolve this, simply re-arrange the order of Token types in the lexer
definition so that the more specific Token types are listed first.

```javascript
// Identifier is now listed as the last Token type.
const myLexer = new chevrotain.Lexer([ForKeyword, Identifier])
```

Note that the solution provided above creates a new problem:
any identifier **starting with** "for" will now be lexed as **two separate** tokens,
a ForKeyword followed by an Identifier. For example:

```javascript
const myLexer = new chevrotain.Lexer([ForKeyword, Identifier])

// [
// {image:"for"}
// {image:"ward"}
// ]
const tokensResult = myLexer.tokenize("forward")
```

To resolve this second problem, see how to prefer the **longest match**,
as demonstrated in the [keywords vs identifiers example][keywords_idents].
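The idea behind preferring the longest match can be sketched in plain JavaScript (this is only an illustration of the principle, not Chevrotain's actual mechanism): attempt both alternatives at the current offset and keep whichever one consumed more characters.

```javascript
// Minimal longest-match sketch: the keyword wins only when the
// identifier alternative cannot consume more characters than it.
function tokenizeOne(text) {
    const keywordMatch = /^for/.exec(text)
    const identMatch = /^[a-zA-Z]+/.exec(text)
    if (
        keywordMatch &&
        (!identMatch || keywordMatch[0].length >= identMatch[0].length)
    ) {
        return { type: "ForKeyword", image: keywordMatch[0] }
    }
    return identMatch ? { type: "Identifier", image: identMatch[0] } : null
}

tokenizeOne("for") // { type: "ForKeyword", image: "for" }
tokenizeOne("forward") // { type: "Identifier", image: "forward" }
```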


[position_tracking]: http://sap.github.io/chevrotain/documentation/0_34_0/interfaces/_chevrotain_d_.ilexerconfig.html#positiontracking
[line_terminator_docs]: http://sap.github.io/chevrotain/documentation/0_34_0/interfaces/_chevrotain_d_.ilexerconfig.html#lineTerminatorsPattern
[keywords_idents]: https://github.com/SAP/Chevrotain/blob/master/examples/lexer/keywords_vs_identifiers/keywords_vs_identifiers.js
14 changes: 8 additions & 6 deletions examples/grammars/css/css.js
@@ -96,11 +96,6 @@
name: "Func",
pattern: MAKE_PATTERN("{{ident}}\\(")
})
// Ident must be before Minus
var Ident = createToken({
name: "Ident",
pattern: MAKE_PATTERN("{{ident}}")
})

var Cdo = createToken({ name: "Cdo", pattern: /<!--/ })
// Cdc must be before Minus
@@ -121,7 +116,6 @@
var Equals = createToken({ name: "Equals", pattern: /=/ })
var Star = createToken({ name: "Star", pattern: /\*/ })
var Plus = createToken({ name: "Plus", pattern: /\+/ })
var Minus = createToken({ name: "Minus", pattern: /-/ })
var GreaterThan = createToken({ name: "GreaterThan", pattern: />/ })
var Slash = createToken({ name: "Slash", pattern: /\// })

@@ -257,6 +251,14 @@
pattern: MAKE_PATTERN("{{num}}")
})

// Ident must be before Minus
var Ident = createToken({
name: "Ident",
pattern: MAKE_PATTERN("{{ident}}")
})

var Minus = createToken({ name: "Minus", pattern: /-/ })

var CssLexer = new Lexer(cssTokens)

// ----------------- parser -----------------
96 changes: 95 additions & 1 deletion src/scan/lexer.ts
@@ -27,7 +27,8 @@ import {
map,
reduce,
reject,
mapValues
mapValues,
cloneArr
} from "../utils/utils"
import { flatten } from "../utils/utils"

@@ -252,6 +253,8 @@ export function validatePatterns(
findModesThatDoNotExist(validTokenClasses, validModesNames)
)

errors = errors.concat(findUnreachablePatterns(validTokenClasses))

return errors
}

@@ -537,6 +540,97 @@ export function findModesThatDoNotExist(
return errors
}

export function findUnreachablePatterns(
tokenClasses: TokenConstructor[]
): ILexerDefinitionError[] {
const errors = []

const canBeTested = reduce(
tokenClasses,
(result, tokClass, idx) => {
const pattern = tokClass.PATTERN

if (pattern === Lexer.NA) {
return result
}

// a more comprehensive validation for all forms of regExps would require
// deeper regExp analysis capabilities
if (isString(pattern)) {
result.push({ str: pattern, idx, tokenType: tokClass })
} else if (isRegExp(pattern) && noMetaChar(pattern)) {
result.push({ str: pattern.source, idx, tokenType: tokClass })
}
return result
},
[]
)

forEach(tokenClasses, (tokClass, testIdx) => {
forEach(canBeTested, ({ str, idx, tokenType }) => {
if (testIdx < idx && testTokenClass(str, tokClass.PATTERN)) {
let msg =
`Token: ->${tokenName(
tokenType
)}<- can never be matched.\n` +
`Because it appears AFTER the token ->${tokenName(
tokClass
)}<-` +
` in the lexer's definition.\n` +
`See https://github.com/SAP/chevrotain/blob/master/docs/resolving_lexer_errors.md#UNREACHABLE`
errors.push({
message: msg,
type: LexerDefinitionErrorType.UNREACHABLE_PATTERN,
tokenClasses: [tokClass, tokenType]
})
}
})
})

return errors
}

function testTokenClass(str: string, pattern: any): boolean {
if (isRegExp(pattern)) {
const regExpArray = pattern.exec(str)
return regExpArray !== null && regExpArray.index === 0
} else if (isFunction(pattern)) {
// maintain the API of custom patterns
return pattern(str, 0, [], {})
} else if (has(pattern, "exec")) {
// maintain the API of custom patterns
return pattern.exec(str, 0, [], {})
} else if (typeof pattern === "string") {
return pattern === str
} else {
/* istanbul ignore next */
throw Error("non exhaustive match")
}
}

function noMetaChar(regExp: RegExp): boolean {
//https://developer.mozilla.org/en-US/docs/Web/JavaScript/Reference/Global_Objects/RegExp
const metaChars = [
".",
"\\",
"[",
"]",
"|",
"^",
"$",
"(",
")",
"?",
"*",
"+",
"{"
]
return (
find(metaChars, char => regExp.source.indexOf(char) !== -1) ===
undefined
)
}

export function addStartOfInput(pattern: RegExp): RegExp {
let flags = pattern.ignoreCase ? "i" : ""
// always wrapping in a none capturing group preceded by '^' to make sure matching can only work on start of input.
3 changes: 2 additions & 1 deletion src/scan/lexer_public.ts
@@ -68,7 +68,8 @@ export enum LexerDefinitionErrorType {
LEXER_DEFINITION_CANNOT_CONTAIN_UNDEFINED,
SOI_ANCHOR_FOUND,
EMPTY_MATCH_PATTERN,
NO_LINE_BREAKS_FLAGS
NO_LINE_BREAKS_FLAGS,
UNREACHABLE_PATTERN
}

export interface ILexerDefinitionError {
25 changes: 25 additions & 0 deletions test/scan/lexer_spec.ts
@@ -25,6 +25,7 @@ import {
findInvalidPatterns,
findMissingPatterns,
findStartOfInputAnchor,
findUnreachablePatterns,
findUnsupportedFlags,
SUPPORT_STICKY
} from "../../src/scan/lexer"
@@ -326,6 +327,30 @@ function defineLexerSpecs(
expect(errors[0].message).to.contain("InvalidToken")
})

it("will detect unreachable patterns", () => {
const ClassKeyword = createToken({
name: "ClassKeyword",
pattern: /class/
})

const Identifier = createToken({
name: "Identifier",
pattern: /\w+/
})

let tokenClasses = [Identifier, ClassKeyword]
let errors = findUnreachablePatterns(tokenClasses)
expect(errors.length).to.equal(1)
expect(errors[0].tokenClasses).to.deep.equal([
Identifier,
ClassKeyword
])
expect(errors[0].type).to.equal(
LexerDefinitionErrorType.UNREACHABLE_PATTERN
)
expect(errors[0].message).to.contain("can never be matched")
})

it("won't detect negation as using unsupported start of input anchor", () => {
let negationPattern = createToken({
name: "negationPattern",