Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

draft: what printf-like error message might look like #1426

Conversation

flavorjones
Copy link
Contributor

Looking for feedback on this "spike" implementation of printf-style formatted error messages, first described in part in #941.

Some positive things:

  • the example error message demonstrates how we can use positional printf specifiers to allow for translations that might change the order of the arguments
  • it seems safe to conclude that we can compress our list of error messages by quite a lot by using format specifiers in place of token literals

Some questionable things I'd like feedback on:

  • this implementation always allocates a string, even though many of our error messages do not contain format specifiers
    • and so we will need to be extra careful to escape bare % characters in the error messages
  • most of our string pointers within YARP are not null-terminated, so we will need to be careful to specify string field size with a positional precision specifier (e.g., %1$.*2$s)
  • to be portable, I'm using a local implementation of GNU's vasprintf which makes me feel icky
  • because we will need to pass varargs to functions like yarp.c:expect(), we'll need a variation of yp_diagnostic_list_append (yp_diagnostic_list_vappend?)
    • and because we will need to pass varargs to functions like parse_starred_expression we may need more variations on functions (e.g.,. parse_expression/parse_expression_va?)
  • I'm pretty sure we can't take advantage of __attribute__((format(printf))) because the format strings aren't constants, but if someone can teach me that I'm wrong I'd love to understand this better.

A linter can probably help us manage the format specifier checks, if we determine we need it.

@flavorjones flavorjones marked this pull request as draft September 7, 2023 22:24
@flavorjones flavorjones force-pushed the flavorjones-spike-printf-error-messages branch from f23c9c7 to 8c0d650 Compare September 8, 2023 13:06
@flavorjones
Copy link
Contributor Author

It seems likely that Windows doesn't support positional format specifiers (which I think we would need for robust translation support). See the failing test and the Microsoft docs.

@kddnewton
Copy link
Collaborator

Sorry I haven't gotten a chance to look at this this week, but I will first thing next week!

@kddnewton
Copy link
Collaborator

kddnewton commented Sep 11, 2023

Okay so I've had some time to look at this for a while now and I'd like to suggest a couple of things and offer some thoughts.

First of all, I really love what you've done here. I think being able to put extra information into error messages is going to make this a much more user-friendly experience. It's clear you were thinking about it and I appreciate it!

Now that I've seen this work, I think we shouldn't go with printf-style formatting, mostly for the reasons you mentioned. But also, it seems like if we do go with printf-style formatting, it's going to "infect" lots of the call stack into having to accept all of the various parameters necessary for the error message, which seems not great. For these reasons I'd like to go in a different direction that doesn't involve printf-style formatting and also doesn't involve stdargs.

To solve these problems, I'd like to suggest a particular approach. That being said, I'm not looking for this to be 100% prescriptive. So if you like this direction, let's do it, otherwise let's talk about (i.e., please don't read this as a demand).

I think we shouldn't attempt to build the error messages in C at all. Instead, since we know the number of substitutions at compile-time, so we could template stuff out like we do with the node structs. Then, instead of passing around yp_diagnostic_id_t that refer to pending errors in the case of failed parsing, we could pass around pointers to stack-allocated structs, and then only reify them in the case that they need to get added to the list of errors. What that would look like would be something like:

// The base struct for all of the diagnostic messages.
typedef struct {
    yp_diagnostic_id_t type;
} yp_diagnostic_message_t;

// Template out diagnostic messages that have substitutions, otherwise just use base.
typedef struct {
    yp_diagnostic_message_t base;
    const uint8_t *encoding;
    size_t encoding_length;
} yp_error_invalid_encoding_magic_comment_t;

// Update diagnostic structure to have an embedded message.
typedef struct {
    yp_list_node_t node;
    const uint8_t *start;
    const uint8_t *end;
    yp_diagnostic_message_t *message;
} yp_diagnostic_t;

// Update the append method to accept a pointer to a message.
bool
yp_diagnostic_list_append(yp_list_t *list, const uint8_t *start, const uint8_t *end, yp_diagnostic_message_t *message) {
    yp_diagnostic_t *diagnostic = (yp_diagnostic_t *) malloc(sizeof(yp_diagnostic_t));
    if (diagnostic == NULL) return false;

    diagnostic->start = start;
    diagnostic->end = end;

    switch (message->type) {
        case YP_ERR_INVALID_ENCODING_MAGIC_COMMENT:
            diagnostic->message = malloc(sizeof(yp_error_invalid_encoding_magic_comment_t));
            diagnostic->message = *((yp_error_invalid_encoding_magic_comment_t *) message);
            break;
        default:
            diagnostic->message = malloc(sizeof(yp_diagnostic_message_t));
            diagnostic->message = *message;
            break;
    }

    yp_list_append(list, (yp_list_node_t *) diagnostic);
    return true;
}

As an optimization on top of this that we could do in another pass, since we'd be passing around pointers, we could do a trick for the errors that do not contain substitutions and tag the pointer in those cases. So for example we would do:

bool
yp_diagnostic_list_append(yp_list_t *list, const uint8_t *start, const uint8_t *end, yp_diagnostic_message_t *message) {
    yp_diagnostic_t *diagnostic = (yp_diagnostic_t *) malloc(sizeof(yp_diagnostic_t));
    if (diagnostic == NULL) return false;

    diagnostic->start = start;
    diagnostic->end = end;

    if (((uintptr_t) message) & 1) {
        diagnostic->message = message;
    } else {
        switch (message->type) {
            case YP_ERR_INVALID_ENCODING_MAGIC_COMMENT:
                diagnostic->message = malloc(sizeof(yp_error_invalid_encoding_magic_comment_t));
                diagnostic->message = *((yp_error_invalid_encoding_magic_comment_t *) message);
                break;
            default:
                assert(false && "Invalid error message");
                abort();
        }
    }

    yp_list_append(list, (yp_list_node_t *) diagnostic);
    return true;
}

The benefit being that all of the way up the call stack with expect() and such, you could pass expect(.., .., (YP_ERR << 1) | 1) instead of expect(.., .., & ((yp_diagnostic_message_t) { .type = YP_ERR })), and also that we wouldn't need to allocate for the majority of use cases.

Then when we're translating into Ruby, we would also template out subclasses of SyntaxError. For example:

module YARP
  class InvalidEncodingCommentError < SyntaxError
    attr_reader :encoding

    def initialize(encoding)
      @encoding = encoding
      super("Unknown or invalid encoding `#{encoding}` in the magic comment")
    end
  end
end

When we're serializing, we would serialize the type and then each of the substitutions. When deserializing, a similar process would happen where we could build the errors from the serialized output.

The reason I like this approach is that it leaves the string manipulation/potential for translation entirely out of C, so we don't have to mess with varargs or printf. I also like that it means we don't have to allocate a message every time.

I recognize that this is quite a bigger change than you had originally intended with this PR, so I'm sorry about that. Let me know what you think.

@flavorjones
Copy link
Contributor Author

@kddnewton I agree the printf-style formatting is not the right choice.

I like what you've sketched out here, and I'm going to spike it out to see how it feels.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

2 participants