Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Implement til::au16 and til::u16a conversion functions & make first use in WriteConsoleAImpl #4493

Closed
wants to merge 32 commits into from

Conversation

german-one
Copy link
Contributor

@german-one german-one commented Feb 6, 2020

Summary of the Pull Request

Implement conversion functions that complement the existing til::u8u16 and til::u16u8 for the use along with other codepages than UTF-8. A state class will take care of partials at the boundaries of DBCS-encoded text.

References

Proposed in #4422 (comment)

PR Checklist

  • Closes #xxx
  • CLA signed. If not, go over here and sign the CLA
  • Tests added/passed
  • Requires documentation to be updated
  • I've discussed this with core contributors already. If not checked, I'm ready to accept this work might be rejected in favor of a different grand plan. Issue number where discussion took place: #xxx

Detailed Description of the Pull Request / Additional comments

These functions have the potential to supersede ConvertToW and ConvertToA by adding partials handling.
NOTE: The astate class supports only a subset of the possible codepages (see the remarks in the code).

In order to see how it works I updated WriteConsoleAImpl accordingly.
I plan to search the code base for more points to apply these functions to, and commit those updates in an upcoming PR.

Validation Steps Performed

  • long DBCS strings split into short chunks with an odd number of bytes, and outputted to the console using WriteConsoleA
  • unit test added

@DHowett-MSFT
Copy link
Contributor

Initial thought: I'm not sure I love that it complicates u8state; I don't think that all u8-only consumers should be forced to have a codepage check and branch because conhost has some A-codepage legacy. I dunno; @miniksa?

@miniksa
Copy link
Member

miniksa commented Feb 6, 2020

Initial thought: I'm not sure I love that it complicates u8state; I don't think that all u8-only consumers should be forced to have a codepage check and branch because conhost has some A-codepage legacy. I dunno; @miniksa Michael Niksa FTE?

I have not read the code yet, but my immediate reaction is "I'd rather have u8state stay u8state and let all legacy codepages have astate so we can walk away from the A theoretically in Terminal."

@german-one
Copy link
Contributor Author

@miniksa @DHowett-MSFT
They share a lot of code that would be redundant otherwise, And if you look at WriteConsoleAImpl, see you can just call til::au16 with any codepage id incl. UTF-8. The astate is always the same and partials are discarded automatically if the codepage changes.

If you still don't like it i will of course divide them into two.

@miniksa
Copy link
Member

miniksa commented Feb 6, 2020

@miniksa @DHowett-MSFT
They share a lot of code that would be redundant otherwise, And if you look at WriteConsoleAImpl, see you can just call til::au16 with any codepage id incl. UTF-8. The astate is always the same and partials are discarded automatically if the codepage changes.

If you still don't like it i will of course divide them into two.

Nah, I'll reconsider when I get to reading this. I have to fix the tests for the project first. Sorry, I've been busy doing that instead of reading PRs.

@german-one
Copy link
Contributor Author

Haha, I was surprised anyway to get immediate reactions from you guys. I'm not in a hurry 🙂

@german-one
Copy link
Contributor Author

@DHowett-MSFT I tried to make it easier for the compiler to remove the additional branching for u8u16 calls. But maybe this is only wishful thinking.

@miniksa
Copy link
Member

miniksa commented Feb 7, 2020

Haha, I was surprised anyway to get immediate reactions from you guys. I'm not in a hurry 🙂

Yeah, I don't know what got into me. Haha. I have you on my list, but I'm trying to track down the CI test failures now as top priority. So my full review will take a bit longer. Thanks for your patience.

@german-one
Copy link
Contributor Author

german-one commented Feb 10, 2020

This is annoying now. I have any kind of const and non-const types in my test code, used til::at as rvalue and lvalue, turned all Microsoft analysis rules on, and it passed the analysis without any warning. But eventually it fails here.
Are we really not able to avoid additional suppression of warnings if we want to access an array element?

@miniksa
Copy link
Member

miniksa commented Feb 11, 2020

This is annoying now. I have any kind of const and non-const types in my test code, used til::at as rvalue and lvalue, turned all Microsoft analysis rules on, and it passed the analysis without any warning. But eventually it fails here.
Are we really not able to avoid additional suppression of warnings if we want to access an array element?

We're likely to approve a few #pragma warning(suppress:)s in library code because it keeps us from making mistakes elsewhere.

@german-one
Copy link
Contributor Author

We're likely to approve a few #pragma warning(suppress:)s in library code because it keeps us from making mistakes elsewhere.

@miniksa Actually that was not the point. I still think my approach to update til::at was right.
The current implementation limits it to class types that have a size() method. IMO it should work for most to all types that support the subscript operator, such like random-access iterators, C-arrays, and pointers. We should accept signed index types because iterators and pointers support negative offsets. And we should avoid to cast the passed index to ptrdiff_t or size_t because the expected type size might be smaller. E.g. winrt::hstring::size_type is always uin32_t.
However, I'm disenchanted because I spent a few hours with testing to avoid another rejected commit. There have been enough of them in this PR for the same reason already. And I still don't understand why it shouted out that I used pointer arithmetic at the points I called at.

Nevermind. This is all rather off-topic now 🙂

@miniksa
Copy link
Member

miniksa commented Feb 11, 2020

We're likely to approve a few #pragma warning(suppress:)s in library code because it keeps us from making mistakes elsewhere.

@miniksa Actually that was not the point. I still think my approach to update til::at was right.
The current implementation limits it to class types that have a size() method. IMO it should work for most to all types that support the subscript operator, such like random-access iterators, C-arrays, and pointers. We should accept signed index types because iterators and pointers support negative offsets. And we should avoid to cast the passed index to ptrdiff_t or size_t because the expected type size might be smaller. E.g. winrt::hstring::size_type is always uin32_t.
However, I'm disenchanted because I spent a few hours with testing to avoid another rejected commit. There have been enough of them in this PR for the same reason already. And I still don't understand why it shouted out that I used pointer arithmetic at the points I called at.

Nevermind. This is all rather off-topic now 🙂

It's fine. Thanks for your effort either way.

Just keep in mind that our analysis system is our best effort of what we think we need to do for static analysis. You might prove it wrong in ways that we haven't thought about yet!

@german-one
Copy link
Contributor Author

german-one commented Feb 12, 2020

You might prove it wrong in ways that we haven't thought about yet!

I can only proof my local VS wrong which didn't report any warning in the analysis. But of course it's my fault. I should have seen that additional overloads for arrays are necessary to avoid the array decay. I don't want to tinker with that any longer. At least not in this PR. Maybe I start another one if the time is right ...

@german-one
Copy link
Contributor Author

The list of codepages without partials handling is a little shorter now, which also means that the support is better than what we have today.

@german-one
Copy link
Contributor Author

Oh and CP54936 might be of interest to meet the GB 18030 standard.

@german-one
Copy link
Contributor Author

I have to find out what's going wrong in #4673. Probably I'll commit the bug fix here.

@german-one
Copy link
Contributor Author

german-one commented Mar 7, 2020

@miniksa I'll most likely file an issue that the console API functions need an overhaul. And I'll certainly jump on this. However, whenever I investigate the code I stumble over the same thing. It's the read parameter of WriteConsoleAImpl (also in other functions, respectively), which is related to the lpNumberOfCharsWritten parameter in the actual WriteConsoleA function I guess. There is a semantic difference between 'read' and 'written' and I still don't understand what the actual definition of the value in read is. Could be:

  • the bytes consumed, which is the same number of bytes that we received
  • the bytes used for the output, which is the number of bytes we received, plus the bytes taken from the cache, minus the bytes captured in the cache

But how does this go together with lpNumberOfCharsWritten? And what is meant here?

  • the number of glyphs written
  • the number of character cells occupied for the output

Can you shed some light on that?

@jsoref
Copy link
Contributor

jsoref commented Mar 25, 2020

Most of these are from master (you can look at master to see what's pending). It doesn't like au which is I think something you added.

@jsoref
Copy link
Contributor

jsoref commented Mar 25, 2020

I made #5124 to get master to a green check mark, any items that you see in the output in #4493 (comment) which aren't in jsoref@e80b602 are things that you'll need to add to a file.

That appears to be just:

  • au
  • GSM

That's reasonable as both appear to be present in your PR. Going forward, the spell checker should automatically annotate your commits instead of just saying "I found stuff".

I'd personally add both items to the main whitelist.txt file (it would like to be sorted using sort -u -f, but it doesn't really care too much).

@german-one
Copy link
Contributor Author

Great, thanks!

@jsoref
Copy link
Contributor

jsoref commented Mar 25, 2020

(And that's merged, so now it really should just be those two.)

@german-one
Copy link
Contributor Author

Hmm, seems that I still should have added all annotated words? Sorry for my ignorance.

@jsoref
Copy link
Contributor

jsoref commented Mar 25, 2020

You'd need to merge master again. You're missing e80b602.

(Or you can have faith that when this is merged, it'd be merged in w/ master which would have that commit and you'd be fine.)

@jsoref
Copy link
Contributor

jsoref commented Mar 25, 2020

The bot is judging your branch on its own merits, not considering that it might be merged into master. In the normal case, you would be branching off a green check (passing) master and thus any new items will be your own responsibility. Unfortunately your branch started off before the spell checker was merged, and between when the spell checker was written and when it was merged additional items were added. You then merged in master and got the spell checker which was still annoyed at master at that point in time. I then wrote and had merged into master a fix to make master green. And you added things which cleared your contribution. Merging master at this point should thus give you green. (As would being merged into master, assuming everyone is happy w/ your commit.)

Sorry for the bumpy ride. It's hard getting things merged into a fast moving project. But this should be the end of that adventure. Anyone who rebases on top of master now will be past it / as would anyone who merges to master now.

@german-one
Copy link
Contributor Author

Even if the misspellings are not marked to be resolved automatically by the bot (as I've seen in other PRs) that should be okay now since it obviously didn't find anything in the recent commits.

@jsoref
Copy link
Contributor

jsoref commented Mar 27, 2020

Interesting... yeah, I could have the bot go back and look for previous comments and resolve them (or at least mark them as obsolete). I'll add that to my imaginary backlog.

In testing, I've actually been manually doing that on some of my test PRs, so, yeah... that's definitely worth doing.

(Right now, if it doesn't find any errors, it just won't leave a comment, and you'll get your green check mark. -- Here, you can see the check mark in german-one@ab4e0d7)

@german-one
Copy link
Contributor Author

Way too stale now 😀
This PR doesn't reference an issue which indicates that partials handling in a couple of codepages has never been a real problem in the past.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

4 participants