src: introducing to_unicode #11

miguelteixeiraa · 2023-02-11T17:38:11Z

Ref: #4

Sorry for the delay, folks! It still needs more tests and some comments, but how does it look so far?

lemire · 2023-02-11T18:23:19Z

include/ada/idna/to_unicode.h

+std::string to_unicode(const std::string_view& input);
+}  // namespace ada::idna
+
+#endif  // ADA_IDNA_TO_UNICODE_H


It will complain here that the file does not end with an empty line. :-)

We should have a clang-format rule for this :)

src/to_unicode.cpp

lemire · 2023-02-11T18:45:57Z

@miguelteixeiraa Can you check the CI errors?

I think your PR looks good. It might even be correct (minus the build errors). I am not 100% sure what to_unicode is supposed to do. Maybe you have checked the specification? I haven't.

lemire · 2023-02-11T18:51:37Z

You have a number of files that do not end with an empty line. I do not care personally but that is a big deal to some.

(Very old editors that nobody uses anymore required that text files end with an empty line. It seems to have since become a cult with the dogma that an empty final line is 'good'.)

miguelteixeiraa · 2023-02-13T16:06:13Z

> @miguelteixeiraa Can you check the CI errors?
>
> I think your PR looks good. It might even be correct (minus the build errors). I am not 100% sure what to_unicode is supposed to do. Maybe you have checked the specification? I haven't.

Yes, I've checked:
https://www.rfc-editor.org/rfc/inline-errata/rfc3490.html
and also https://www.rfc-editor.org/rfc/inline-errata/rfc3492.html

Based on https://www.rfc-editor.org/rfc/inline-errata/rfc3490.html, step 2 would be:
2. Perform the steps specified in [NAMEPREP] and fail if there is an
error. (If step 3 of ToASCII is also performed here, it will not
affect the overall behavior of ToUnicode, but it is not
necessary.) The AllowUnassigned flag is used in [NAMEPREP].

So the NAMEPREP step is missing (it involves things like normalization and label validity checks, such as sizes).

I'm doing steps 1, 3, 4, 5, 6, 7 (through the tests), and 8:

1. If all code points in the sequence are in the ASCII range (0..7F)
then skip to step 3.
2. Perform the steps specified in [NAMEPREP] and fail if there is an
error. (If step 3 of ToASCII is also performed here, it will not
affect the overall behavior of ToUnicode, but it is not
necessary.) The AllowUnassigned flag is used in [NAMEPREP].
3. Verify that the sequence begins with the ACE prefix, and save a
copy of the sequence.
4. Remove the ACE prefix.
5. Decode the sequence using the decoding algorithm in [PUNYCODE] and
fail if there is an error. Save a copy of the result of this
step.
6. Apply ToASCII.
7. Verify that the result of step 6 matches the saved copy from step
3, using a case-insensitive ASCII comparison.
8. Return the saved copy from step 5.

This is the algorithm to be applied to each label. It is basically the reverse of what to_ascii does.
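For readers following the steps, here is a minimal sketch of the per-label flow. It is not the code from this PR. The names `to_unicode_label`, `PunycodeDecoder`, and `ToAsciiFn` are hypothetical, the Punycode decoder and ToASCII are passed in as callbacks rather than implemented, and step 2 (NAMEPREP) is left out, so non-ASCII input is simply rejected. A `std::nullopt` result means "failed, the caller keeps the original label".

```cpp
#include <cstddef>
#include <functional>
#include <optional>
#include <string>
#include <string_view>

// Hypothetical callback types: the caller supplies a Punycode decoder and a
// ToASCII implementation; both report failure with std::nullopt.
using PunycodeDecoder =
    std::function<std::optional<std::u32string>(std::string_view)>;
using ToAsciiFn =
    std::function<std::optional<std::string>(const std::u32string&)>;

// Lowercases a single ASCII letter, leaving every other byte unchanged.
constexpr char ascii_lower(char c) {
  return (c >= 'A' && c <= 'Z') ? static_cast<char>(c + 32) : c;
}

// Case-insensitive ASCII comparison, as required by step 7.
inline bool ascii_iequals(std::string_view a, std::string_view b) {
  if (a.size() != b.size()) return false;
  for (std::size_t i = 0; i < a.size(); ++i) {
    if (ascii_lower(a[i]) != ascii_lower(b[i])) return false;
  }
  return true;
}

// Sketch of ToUnicode for one label. Assumption: NAMEPREP (step 2) is not
// implemented, so any non-ASCII input is treated as a failure.
inline std::optional<std::u32string> to_unicode_label(
    std::string_view label, const PunycodeDecoder& decode,
    const ToAsciiFn& to_ascii) {
  // Step 1: only all-ASCII input may skip step 2.
  for (char c : label) {
    if (static_cast<unsigned char>(c) > 0x7F) return std::nullopt;
  }
  // Step 3: verify the ACE prefix and save a copy of the sequence.
  if (label.size() < 4 || !ascii_iequals(label.substr(0, 4), "xn--")) {
    return std::nullopt;
  }
  const std::string saved_ace(label);
  // Step 4: remove the ACE prefix.
  const std::string_view payload = label.substr(4);
  // Step 5: decode with Punycode; the result is the copy returned in step 8.
  std::optional<std::u32string> decoded = decode(payload);
  if (!decoded) return std::nullopt;
  // Step 6: apply ToASCII to the decoded label.
  const std::optional<std::string> reencoded = to_ascii(*decoded);
  // Step 7: the round trip must reproduce the saved copy from step 3.
  if (!reencoded || !ascii_iequals(*reencoded, saved_ace)) return std::nullopt;
  // Step 8: return the saved copy from step 5.
  return decoded;
}
```

Passing the decoder and ToASCII as callbacks keeps the sketch independent of how those two pieces are implemented; a real implementation would call its own functions directly.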

miguelteixeiraa · 2023-02-13T16:30:36Z

@lemire I have a question about our utf8_punycode_alternating.txt
Is it correct? I thought that valid punycode starts with xn-- (that's probably why this PR messes up all the other tests haha)

miguelteixeiraa · 2023-02-13T17:30:31Z

Anyways, I know how to fix it

lemire · 2023-02-13T19:10:47Z

@miguelteixeiraa No, punycode by itself does not start with xn--. See https://en.wikipedia.org/wiki/Punycode or https://datatracker.ietf.org/doc/html/rfc3492

The prefix xn-- is used within a domain to indicate the presence of a punycode label.
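To make the distinction concrete, here is a small sketch that separates the ACE prefix from the Punycode payload. The helper names `has_ace_prefix` and `punycode_payload` are hypothetical and do not come from this PR. The prefix is matched case-insensitively, since RFC 3490 allows any capitalization of it.

```cpp
#include <string_view>

// Hypothetical helper: reports whether a domain label carries the ACE
// prefix "xn--" (any capitalization).
constexpr bool has_ace_prefix(std::string_view label) {
  return label.size() >= 4 && (label[0] == 'x' || label[0] == 'X') &&
         (label[1] == 'n' || label[1] == 'N') && label[2] == '-' &&
         label[3] == '-';
}

// Hypothetical helper: returns the raw Punycode payload of an ACE label,
// or the label unchanged if it has no prefix.
constexpr std::string_view punycode_payload(std::string_view label) {
  return has_ace_prefix(label) ? label.substr(4) : label;
}

// "mnchen-3ya" is the Punycode string; "xn--mnchen-3ya" is the domain label.
static_assert(punycode_payload("xn--mnchen-3ya") == "mnchen-3ya");
static_assert(punycode_payload("example") == "example");
```

A test file holding raw Punycode strings would therefore contain entries like `mnchen-3ya`, without the prefix.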

lemire · 2023-02-13T21:41:34Z

src/punycode.cpp

@@ -22,6 +23,40 @@ static constexpr char digit_to_char(int32_t digit) {
  return digit < 26 ? char(digit + 97) : char(digit + 22);
 }

+static constexpr bool begins_with(std::string_view view,


We now have multiple definitions of begins_with and is_ascii. The final PR should clean that up and have just one definition per function.
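One way to get a single definition is a shared header that both translation units include. The sketch below is an assumption about what such a header could look like, not the contents of the actual file: the diff above truncates the `begins_with` signature, so the second parameter and both function bodies are guesses. Only the `ada::idna` namespace is taken from the header shown earlier in this thread.

```cpp
#include <string_view>

namespace ada::idna {

// True if `view` starts with `prefix`. The `prefix` parameter is assumed;
// the original signature is cut off in the diff.
constexpr bool begins_with(std::string_view view, std::string_view prefix) {
  return view.size() >= prefix.size() &&
         view.substr(0, prefix.size()) == prefix;
}

// True if every byte is in the ASCII range 0..0x7F.
constexpr bool is_ascii(std::string_view view) {
  for (char c : view) {
    if (static_cast<unsigned char>(c) > 0x7F) return false;
  }
  return true;
}

}  // namespace ada::idna
```

Because `constexpr` functions are implicitly inline, defining them in a header does not cause duplicate-symbol errors when several .cpp files include it.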

lemire · 2023-02-13T21:42:43Z

> Anyways, I know how to fix it

I read your code more carefully and it looks pretty good.

lemire · 2023-02-14T02:08:06Z

to_unicode.cpp:20:32: error: 'find' is not a member of 'std'

Make sure to add...

#include <algorithm>

at the top of to_unicode.cpp.
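For reference, `std::find` is declared in `<algorithm>`, and some standard library implementations do not pull it in through other headers, which is why the code may build on one compiler and fail on another. The snippet below is only an illustration of the include together with a typical use; `first_label_length` is a hypothetical helper, not the code at line 20 of to_unicode.cpp.

```cpp
#include <algorithm>  // std::find
#include <cstddef>
#include <string_view>

// Hypothetical helper: number of characters before the first '.', or the
// whole length if the input contains no dot.
inline std::size_t first_label_length(std::string_view domain) {
  const auto dot = std::find(domain.begin(), domain.end(), '.');
  return static_cast<std::size_t>(dot - domain.begin());
}
```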

miguelteixeiraa · 2023-02-18T01:34:34Z

I just messed everything up, so I will reopen it.

lemire reviewed Feb 11, 2023


lemire reviewed Feb 13, 2023


src: introducing utils.h

b67fb7e

miguelteixeiraa added 2 commits February 15, 2023 21:06

src: add missing algorithm lib in to_unicode.cpp

68e1058

src: WIP make the punycode test to pass

fc378ca

miguelteixeiraa closed this Feb 18, 2023

miguelteixeiraa force-pushed the to_unicode_issue-4 branch from 146d689 to fc378ca February 18, 2023 01:05

miguelteixeiraa mentioned this pull request Feb 19, 2023

To unicode #4 - clean PR #17

Merged
