Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

src: introducing to_unicode #11

Closed

Conversation

miguelteixeiraa
Copy link
Contributor

Ref: #4

Sorry for the delay folks! Still need more tests + add some comments, but how does it look like until now?

std::string to_unicode(const std::string_view& input);
} // namespace ada::idna

#endif // ADA_IDNA_TO_UNICODE_H
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

It will complain here that the file does end with an empty line. :-)

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

We should have a clang-format rule for this :)

src/to_unicode.cpp Outdated Show resolved Hide resolved
@lemire
Copy link
Member

lemire commented Feb 11, 2023

@miguelteixeiraa Can you check the CI errors?

I think your PR looks good. It might even be correct (minus the build errors). I am not 100% sure what to_unicode is supposed to do. Maybe you have checked the specification? I haven't.

@lemire
Copy link
Member

lemire commented Feb 11, 2023

You have a number of files that do not end with an empty line. I do not care personally but that is a big deal to some.

(Very old editors that nobody use anymore required that text files ended with an empty line. It seems to have since become a cult with the dogma that an empty final line is 'good'.)

@miguelteixeiraa
Copy link
Contributor Author

@miguelteixeiraa Can you check the CI errors?

I think your PR looks good. It might even be correct (minus the build errors). I am not 100% sure what to_unicode is supposed to do. Maybe you have checked the specification? I haven't.

Yes, I've checked:
https://www.rfc-editor.org/rfc/inline-errata/rfc3490.html
and also https://www.rfc-editor.org/rfc/inline-errata/rfc3492.html

Based on: https://www.rfc-editor.org/rfc/inline-errata/rfc3490.html
The step 2 would be
2. Perform the steps specified in [NAMEPREP] and fail if there is an
error. (If step 3 of ToASCII is also performed here, it will not
affect the overall behavior of ToUnicode, but it is not
necessary.) The AllowUnassigned flag is used in [NAMEPREP].

So the NAMEPREP thing is missing (which involves things like normalization, and checks for the label validity (sizes...etc etc ))

I'm doing 1, 3, 4, 5, 6, 7 (through the tests), 8

  1. If all code points in the sequence are in the ASCII range (0..7F)
    then skip to step 3.

  2. Perform the steps specified in [NAMEPREP] and fail if there is an
    error. (If step 3 of ToASCII is also performed here, it will not
    affect the overall behavior of ToUnicode, but it is not
    necessary.) The AllowUnassigned flag is used in [NAMEPREP].

  3. Verify that the sequence begins with the ACE prefix, and save a
    copy of the sequence.

  4. Remove the ACE prefix.

  5. Decode the sequence using the decoding algorithm in [PUNYCODE] and
    fail if there is an error. Save a copy of the result of this
    step.

  6. Apply ToASCII.

  7. Verify that the result of step 6 matches the saved copy from step
    3, using a case-insensitive ASCII comparison.

  8. Return the saved copy from step 5.

This is the algo to be applied to each label. It is basically the reverse operation that is being done for to_ascii.

@miguelteixeiraa
Copy link
Contributor Author

miguelteixeiraa commented Feb 13, 2023

@lemire I have a question about our utf8_punycode_alternating.txt
Is it correct? I thought that valid punycodes starts with xn-- (thats probably why this PR mess up all the other tests haha)

@miguelteixeiraa
Copy link
Contributor Author

Anyways, I know how to fix it

@lemire
Copy link
Member

lemire commented Feb 13, 2023

@miguelteixeiraa No punycode by itself does not start by xn--. See https://en.wikipedia.org/wiki/Punycode or https://datatracker.ietf.org/doc/html/rfc3492

The prefix xn-- is used within a domain to indicate the presence of a punycode label.

src/punycode.cpp Outdated
@@ -22,6 +23,40 @@ static constexpr char digit_to_char(int32_t digit) {
return digit < 26 ? char(digit + 97) : char(digit + 22);
}

static constexpr bool begins_with(std::string_view view,
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

We now have multiple definitions of begins_with and is_ascii. The final PR should clean that up and have just one definition per function.

@lemire
Copy link
Member

lemire commented Feb 13, 2023

Anyways, I know how to fix it

I read more carefully your code and it looks pretty good.

@lemire
Copy link
Member

lemire commented Feb 14, 2023

to_unicode.cpp:20:32: error: 'find' is not a member of 'std'

Make sure to add...

#include <algorithm>

at the top of to_unicode.cpp.

@miguelteixeiraa
Copy link
Contributor Author

I just mess up everything and I will reopen it.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

3 participants