Question about handling Hollerith format fields #4

Open
apthorpe opened this issue Jul 11, 2021 · 4 comments

@apthorpe
Contributor

fsource lex does not recognize the token `1H1` in the line `FORMAT (1H1)`

This is expected behavior, since fsource is advertised as handling F77 and later, and Hollerith fields were already considered obsolete by F77.

I am working on a project to detect and evaluate problematic constructs in legacy code such as Hollerith fields, alternate return points, ENTRY statements, character data stored in non-character variables, etc. My question is where I should focus my efforts to extend lexer.py to recognize these difficult and obsolete constructs. After a quick scan of lexer.py, it looks like I should modify `formattok` in `get_lexer_regex()` - does this seem reasonable?

@apthorpe
Contributor Author

A problem with Hollerith fields is that regular expressions can't match them in a single pass. They take the form `([1-9][0-9]*)H(.{\1})`, with the wrinkle that `.{\1}` isn't legal regex syntax: the repeat count inside `{...}` must be a literal number, not a backreference, so the length captured by the first group can't drive the match of the second.
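To make that concrete, here is a minimal standalone sketch of the two-pass idea in Python (illustrative only, not fsource code): one regex pass captures the length prefix, then plain slicing consumes exactly that many characters.

    import re

    _HOLLERITH_PREFIX = re.compile(r"([1-9][0-9]*)[Hh]")

    def match_hollerith(text, pos=0):
        """Return (end, contents) of a Hollerith field at text[pos:], or None."""
        m = _HOLLERITH_PREFIX.match(text, pos)
        if m is None:
            return None
        length = int(m.group(1))    # pass 1: the count that \1 cannot express
        start = m.end()
        end = start + length        # pass 2: plain slicing, not regex
        if end > len(text):
            return None             # field runs past the end of the line
        return end, text[start:end]

    print(match_hollerith("1H1"))      # (3, '1')
    print(match_hollerith("5Hhello"))  # (7, 'hello')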

Hollerith fields are only legal as constant arguments in subroutine calls, in DATA statements, and in format strings. One thought is to attempt to recover from the IndexError exception in `RegexLexer::line_tokens()`, try to detect a Hollerith field with a character-by-character scan or multiple regex passes, and fall through to raising the LexerError exception if a Hollerith field isn't detected. I'm not sure how to avoid interrupting the flow of results out of the generator; maybe it's better to attempt recovery outside `line_tokens()`.

@mwallerb
Owner

mwallerb commented Jul 13, 2021

@apthorpe:

> attempt to recover from the IndexError exception in `RegexLexer::line_tokens()`, try to detect a Hollerith field with a character-by-character scan or multiple regex passes, and fall through to raising the LexerError exception if a Hollerith field isn't detected

This seems like a very good strategy! I think it should not interfere with the generator, and a character scan should suffice. This area is very performance-sensitive, but your approach should not incur a huge performance hit, since the fallback only runs inside the exception handler. Go for it!

The only thing: it would also be good to add an `allow_hollerith=False` argument to `RegexLexer.__init__()` and check for that.

Edited to add: Come to think of it, could you maybe derive a `LegacyLexer` or similar from `RegexLexer` and add it there? I am trying to keep the lexers and expression parsers fairly generic and add the Fortran "quirks" in subclasses.
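A minimal sketch of what such a subclass could look like (the class name `LegacyLexer` and the `allow_hollerith` flag come from this thread; the constructor signature and method body are assumptions):

    class LegacyLexer(RegexLexer):
        """RegexLexer plus fallbacks for pre-F77 quirks such as Hollerith fields."""

        def __init__(self, *args, allow_hollerith=True, **kwargs):
            super().__init__(*args, **kwargs)
            self.allow_hollerith = allow_hollerith

        def line_tokens(self, line, lineno=None):
            # Override here: run the generic tokenizer and recover from its
            # failures with a Hollerith scan only when the flag is set.
            ...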

@apthorpe
Contributor Author

I have a working solution, and there'd be no problem toggling it on or off with `allow_hollerith`. I'll send more info once I have the repo cloned, but basically: I wrap the `for match in self._finditer(line):` loop in a `while` loop, do a Hollerith character scan if the IndexError exception is thrown, then set the line to the remainder of the unscanned text and cycle the `while` loop so the rest of the line gets parsed. If the `for` loop completes successfully, I exit the `while` loop; if the exception is thrown but a Hollerith string isn't found, it raises the LexerError exception. It's a small amount of code and should be reasonably clear. Either way, it's a small change to `RegexLexer` or a single method override to create `LegacyLexer`.

@apthorpe
Contributor Author

Here's my current modification:

    def line_tokens(self, line, lineno=None):
        """Tokenizes text using the groups in the regex specified

        Iterates through all matches of the regex on `text`, returning the
        highest matching category with the associated token (group text).
        """
        unscanned_line = line
        offset = 0      # column of unscanned_line[0] within the original line
        scanmore = True
        while scanmore:
            # Assume this for-loop will consume all of unscanned_line
            scanmore = False
            c2 = 0      # end of the last successful match in unscanned_line
            try:
                for match in self._finditer(unscanned_line):
                    cat = match.lastindex
                    yield lineno, offset + match.start(cat), cat, match.group(cat)
                    c2 = match.end(cat)
            except IndexError:
                # Match failed; try to match a Hollerith string starting at
                # the failure point.
                # Note: Hollerith strings cannot be recognized by a simple
                # or single regex
                mhl = re.match(r"(\d+)H", unscanned_line[c2:])
                if mhl:
                    hstr_len = int(mhl.group(1))
                    hstr_start = mhl.end(0)
                    hs1 = c2 + hstr_start
                    hs2 = hs1 + hstr_len

                    # Strings are code 9
                    # Q: Should string extend from c2:hs2 vs hs1:hs2? (include nH prefix?)
                    # TODO: Trap len(unscanned_line) < hs2
                    yield lineno, offset + hs1, 9, unscanned_line[hs1:hs2]

                    # Grab the remainder of the line to continue scanning
                    offset += hs2
                    unscanned_line = unscanned_line[hs2:]

                    # Continue scanning unless the Hollerith string consumed
                    # the entire remainder of the line (nothing left to scan)
                    scanmore = len(unscanned_line) > 0
                else:
                    raise LexerError(None, lineno, offset + match.start(),
                                     offset + match.end(), line, "invalid token")
