Question about handling Hollerith format fields #4

Open
apthorpe opened this issue Jul 11, 2021 · 4 comments

@apthorpe
Contributor

fsource lex does not recognize the token `1H1` in the line `FORMAT (1H1)`

This is expected behavior, since fsource is advertised as handling F77 and later, and Hollerith fields were already considered obsolete by F77.

I am working on a project to detect and evaluate problematic constructs in legacy code such as Hollerith fields, alternate return points, ENTRY statements, character data stored in non-character variables, etc. My question is where I should focus my efforts to extend lexer.py to recognize these difficult and obsolete constructs. After a quick scan of lexer.py, it looks like I should modify `formattok` in `get_lexer_regex()` - does this seem reasonable?

@apthorpe
Contributor Author

A problem with Hollerith fields is that regular expressions can't match them in a single pass. They take the form `([1-9][0-9]*)H(.{\1})`, with the wrinkle that `.{\1}` isn't legal regex syntax: the repeat count inside `{...}` must be a literal number, not a backreference, so the length captured by the first group can't drive the match of the second.
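To make that concrete, here is a minimal standalone sketch of the two-pass idea in Python (illustrative only, not fsource code): one regex pass captures the length prefix, then plain slicing consumes exactly that many characters.

    import re

    _HOLLERITH_PREFIX = re.compile(r"([1-9][0-9]*)[Hh]")

    def match_hollerith(text, pos=0):
        """Return (end, contents) of a Hollerith field at text[pos:], or None."""
        m = _HOLLERITH_PREFIX.match(text, pos)
        if m is None:
            return None
        length = int(m.group(1))    # pass 1: the count that \1 cannot express
        start = m.end()
        end = start + length        # pass 2: plain slicing, not regex
        if end > len(text):
            return None             # field runs past the end of the line
        return end, text[start:end]

    print(match_hollerith("1H1"))      # (3, '1')
    print(match_hollerith("5Hhello"))  # (7, 'hello')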

Hollerith fields are only legal as constant arguments in subroutine calls, in DATA statements, and in format strings. One thought is to attempt to recover from the IndexError exception in `RegexLexer::line_tokens()`, try to detect a Hollerith field with a character-by-character scan or multiple regex passes, and fall through to raising the LexerError exception if a Hollerith field isn't detected. I'm not sure how to avoid interrupting the flow of results out of the generator; maybe it's better to attempt recovery outside `line_tokens()`.

@mwallerb
Owner

mwallerb commented Jul 13, 2021

@apthorpe:

> attempt to recover from the IndexError exception in `RegexLexer::line_tokens()`, try to detect a Hollerith field with a character-by-character scan or multiple regex passes, and fall through to raising the LexerError exception if a Hollerith field isn't detected

This seems like a very good strategy! I think it should not interfere with the generator, and a character scan should suffice. This area is very performance-sensitive, but your approach should not incur a huge performance hit, since the fallback only runs inside the exception handler. Go for it!

The only thing: it would also be good to add an `allow_hollerith=False` argument to `RegexLexer.__init__()` and check for that.

Edited to add: Come to think of it, could you maybe derive a `LegacyLexer` or similar from `RegexLexer` and add it there? I am trying to keep the lexers and expression parsers fairly generic and add the Fortran "quirks" in subclasses.
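A minimal sketch of what such a subclass could look like (the class name `LegacyLexer` and the `allow_hollerith` flag come from this thread; the constructor signature and method body are assumptions):

    class LegacyLexer(RegexLexer):
        """RegexLexer plus fallbacks for pre-F77 quirks such as Hollerith fields."""

        def __init__(self, *args, allow_hollerith=True, **kwargs):
            super().__init__(*args, **kwargs)
            self.allow_hollerith = allow_hollerith

        def line_tokens(self, line, lineno=None):
            # Override here: run the generic tokenizer and recover from its
            # failures with a Hollerith scan only when the flag is set.
            ...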

@apthorpe
Contributor Author

I have a working solution, and there'd be no problem toggling it on or off with `allow_hollerith`. I'll send more info once I have the repo cloned, but basically: I wrap the `for match in self._finditer(line):` loop in a `while` loop, do a Hollerith character scan if the IndexError exception is thrown, then set the line to the remainder of the unscanned text and cycle the `while` loop so the rest of the line gets parsed. If the `for` loop completes successfully, I exit the `while` loop; if the exception is thrown but a Hollerith string isn't found, it raises the LexerError exception. It's a small amount of code and should be reasonably clear. Either way, it's a small change to `RegexLexer` or a single method override to create `LegacyLexer`.

@apthorpe
Contributor Author

Here's my current modification:

    def line_tokens(self, line, lineno=None):
        """Tokenizes text using the groups in the regex specified

        Iterates through all matches of the regex on `text`, returning the
        highest matching category with the associated token (group text).
        """
        unscanned_line = line
        offset = 0      # column of unscanned_line[0] within the original line
        scanmore = True
        while scanmore:
            # Assume this for-loop will consume all of unscanned_line
            scanmore = False
            c2 = 0      # end of the last successful match in unscanned_line
            try:
                for match in self._finditer(unscanned_line):
                    cat = match.lastindex
                    yield lineno, offset + match.start(cat), cat, match.group(cat)
                    c2 = match.end(cat)
            except IndexError:
                # Match failed; try to match a Hollerith string starting at
                # the failure point.
                # Note: Hollerith strings cannot be recognized by a simple
                # or single regex
                mhl = re.match(r"(\d+)H", unscanned_line[c2:])
                if mhl:
                    hstr_len = int(mhl.group(1))
                    hstr_start = mhl.end(0)
                    hs1 = c2 + hstr_start
                    hs2 = hs1 + hstr_len

                    # Strings are code 9
                    # Q: Should string extend from c2:hs2 vs hs1:hs2? (include nH prefix?)
                    # TODO: Trap len(unscanned_line) < hs2
                    yield lineno, offset + hs1, 9, unscanned_line[hs1:hs2]

                    # Grab the remainder of the line to continue scanning
                    offset += hs2
                    unscanned_line = unscanned_line[hs2:]

                    # Continue scanning unless the Hollerith string consumed
                    # the entire remainder of the line (nothing left to scan)
                    scanmore = len(unscanned_line) > 0
                else:
                    raise LexerError(None, lineno, offset + match.start(),
                                     offset + match.end(), line, "invalid token")
