Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Some multibyte chars, e.g. 'α', cannot be read by ReadFile() under codepage 932 (Japanese). #7589

Closed
tyan0 opened this issue Sep 10, 2020 · 12 comments · Fixed by #14745
Closed
Assignees
Labels
Area-Input Related to input processing (key presses, mouse, etc.) In-PR This issue has a related PR Issue-Bug It either shouldn't be doing this or needs an investigation. Needs-Tag-Fix Doesn't match tag requirements Priority-2 A description (P2) Product-Terminal The new Windows Terminal.
Milestone

Comments

@tyan0
Copy link

tyan0 commented Sep 10, 2020

Environment

Windows build number: Microsoft Windows [Version 10.0.18363.1082]
Windows Terminal version (if applicable): 1.2.2381.0

Steps to reproduce

  1. Compile the code below.
/* test-prog.c */
#include <windows.h>
#include <stdio.h>

int main()
{
        unsigned char buf[256];
        DWORD len;
        int i;
        ReadFile(GetStdHandle(STD_INPUT_HANDLE), buf, sizeof(buf), &len, NULL);
        for (i=0; i<len; i++) printf("%02x ", buf[i]);
        printf("\n");
        return 0;
}
  1. Open Windows Terminal with cmd.exe or powershell.exe.
  2. Run chcp 932.
  3. Run the test-prog above.
  4. Enter 'α' + [enter key].

Expected behavior

If the test-prog above is executed in command prompt, the multibyte char 'α' can be read correctly as follows.

C:\Test>test-prog
α
83 bf 0d 0a

Actual behavior

If the test-prog above is executed in Windows Terminal, the multibyte char 'α' cannot be read correctly as follows.

C:\Test>test-prog
α
00 0d 0a

This also happens with getchar(), getwchar(), gets(), ReadConsoleA(), etc.
As far as I tested, only ReadConsoleW() can read 'α' correctly.

Most of Japanese multibyte chars can be read correctly with the code above as follows

C:\Test>test-prog
あ
82 a0 0d 0a

even in Windows Terminal.

Supplement

Similar happens under CP936 (Simplified Chinese) and CP949 (Korean).

@ghost ghost added Needs-Triage It's a new issue that the core contributor team needs to triage at the next triage meeting Needs-Tag-Fix Doesn't match tag requirements labels Sep 10, 2020
@zadjii-msft
Copy link
Member

I'm pretty sure that's just how codepages work, though @miniksa can correct me if I'm wrong. I think if you want to use ReadFile (NOT ReadFileW), you'll probably want to stick to UTF-8 (codepage 65001).

Did this ever work in previous versions of Windows?

@zadjii-msft zadjii-msft added Area-Output Related to output processing (inserting text into buffer, retrieving buffer text, etc.) Issue-Question For questions or discussion Needs-Author-Feedback The original author of the issue/PR needs to come back and respond to something Product-Conhost For issues in the Console codebase labels Oct 9, 2020
@ghost ghost removed the Needs-Tag-Fix Doesn't match tag requirements label Oct 9, 2020
@DHowett
Copy link
Member

DHowett commented Oct 9, 2020

So it looks like this actually does work outside of WT (!)

@zadjii-msft
Copy link
Member

Well that's unfortunate. I wonder if it's a bad MB2WC, or maybe bad input events getting written via win32 input mode. Hmmmmm.

@zadjii-msft zadjii-msft added Issue-Bug It either shouldn't be doing this or needs an investigation. Priority-2 A description (P2) Product-Terminal The new Windows Terminal. Area-Input Related to input processing (key presses, mouse, etc.) and removed Issue-Question For questions or discussion Product-Conhost For issues in the Console codebase Area-Output Related to output processing (inserting text into buffer, retrieving buffer text, etc.) Needs-Author-Feedback The original author of the issue/PR needs to come back and respond to something Needs-Triage It's a new issue that the core contributor team needs to triage at the next triage meeting labels Oct 9, 2020
@zadjii-msft zadjii-msft added this to the Terminal v2.0 milestone Oct 9, 2020
@tyan0
Copy link
Author

tyan0 commented Oct 9, 2020

I think this is ConPTY issue because this can be reproducible also in WSL outside Windows Terminal.

Did this ever work in previous versions of Windows?

What do you mean by previous versions? Such as Windows 7 or Windows 8.1? Or Windows 10 1903 or earlier?
Since Windows Terminal supports only in Windows 1903 or later, This cannot be reproducible in Windows 7 and 8.1.

This issue can be reproduced with ConPTY in Windows 10 1809.

@eryksun
Copy link

eryksun commented Oct 10, 2020

I think if you want to use ReadFile (NOT ReadFileW), you'll probably want to stick to UTF-8 (codepage 65001).

In which version(s) of Windows and Windows Terminal, or with which forms of input, should one expect UTF-8 to work correctly as the input codepage? It's not working with typed and pasted input with conhost.exe in Windows 10 2004, or with openconsole.exe in Windows Terminal Preview 1.4.2652.0. Is there a separate code path for East-Asian locales that use IME input, and does it work correctly in that case?

For example, in Python's REPL shell, under Windows Terminal, the following uses ctypes (libffi) to directly call ReadConsoleA and ReadFile:

>>> import ctypes
>>> kernel32 = ctypes.WinDLL('kernel32', use_last_error=True)
>>> kernel32.SetConsoleCP(65001)
1
>>> kernel32.GetConsoleCP()
65001
>>> h = kernel32.GetStdHandle(-10)
>>> buf = (ctypes.c_char * 10)()
>>> pn = ctypes.pointer(ctypes.c_ulong())

>>> kernel32.ReadConsoleA(h, buf, 10, pn, None)
ab_α_ba
1
>>> buf[:9]
b'ab_\x00_ba\r\n'

>>> kernel32.ReadFile(h, buf, 10, pn, None)
cd_α_dc
1
>>> buf[:9]
b'cd_\x00_dc\r\n'

Note that in both cases "α" gets replaced with a null byte.

Setting the output codepage to UTF-8 started working correctly with WriteFile and WriteConsoleA in Windows 8, due to the required rewrite when the old LPC-based 'files' were replaced by real files provided by the ConDrv device. With the old implementation in Windows 7 and earlier, writing UTF-8 returned the number of decoded wide characters written instead of the number of bytes written. The latter caused buffered writers to repeat a write multiple times, leading to the display of garbage text after every write with non-ASCII characters.

Setting the input codepage to UTF-8 has never worked correctly with ReadFile and ReadConsoleA in a Western locale in any version of Windows or the console host that I've used. It's limited to reading 7-bit ASCII (i.e. ordinals 0-127). Each non-ASCII character gets replaced by a null character. Since chcp.com sets both the input and output codepages, the advice to use chcp 65001, which gets repeatedly online ad nauseum, is generally misguided. It's a bad choice for people who need to read non-English input (e.g. a Spanish locale), either typed or pasted into the console. It's fine in Windows 8+ if just the output codepage is set via SetConsoleOutputCP(CP_UTF8).

@miniksa
Copy link
Member

miniksa commented Oct 16, 2020

Part 1:

This is strictly between 437 and 932 per the original filer.

I have this spreadsheet from the investigation of how it works in classic conhost since that's what services these API calls:

image

It looks like there is a regression in having the input codepage as 932 and the output codepage as 437... that's a scenario that returns null in the new conhost but not in the legacy conhostv1.dll. So I need to get that fixed.

Part 2:

For Windows Terminal, which uses ConPTY, which tries to use UTF-8/65001 for everything...

@DHowett noticed this

bool InteractDispatch::WriteString(const std::wstring_view string)
{
if (string.empty())
{
return true;
}
unsigned int codepage = 0;
bool success = _pConApi->GetConsoleOutputCP(codepage);

Where the interactive dispatcher (input for virtual terminal) is using the OUTPUT codepage to determine what to encode with. That's clearly not right either.

Summary so far:

  1. I have a test/investigation table for conhost, but no line of code to point to as broken
  2. I have a line of code broken in conpty (for terminal), but don't have a specific test/investigation for it yet.

I will continue.

@miniksa miniksa self-assigned this Oct 16, 2020
@miniksa
Copy link
Member

miniksa commented Oct 19, 2020

Part 1:

The function where this converts the α into 0x00 instead of the two byte sequence 0x83 0xbf is at

return LOG_IF_WIN32_BOOL_FALSE(WideCharToMultiByte(uiCodePage, 0, pwchSource, cchSource, pchTarget, cchTarget, nullptr, nullptr));
.

The α is given to WideCharToMultiByte as 1 byte and asked to convert to codepage 932 with only 1 byte of available output space. Since it obviously needs 2 in that code page, it's converted to 0x00.

  1. It should probably be getting converted to the appropriate replacement character for the codepage (which I believe we can lookup) instead of to 0x00.
  2. It should be given 2 bytes of space, not 1, because that's how much it needs.

I'm going to explore item 2 further.

Part 1, Item 2

The stack where this occurs is

>	OpenConsole.exe!ConvertToOem(const unsigned int uiCodePage, const wchar_t * const pwchSource, const unsigned int cchSource, char * const pchTarget, const unsigned int cchTarget) Line 272	C++
 	OpenConsole.exe!TranslateUnicodeToOem(const wchar_t * pwchUnicode, const unsigned long cchUnicode, char * pchAnsi, const unsigned long cbAnsi, std::unique_ptr<IInputEvent,std::default_delete<IInputEvent>> & partialEvent) Line 208	C++
 	OpenConsole.exe!COOKED_READ_DATA::_handlePostCharInputLoop(const bool isUnicode, unsigned __int64 & numBytes, unsigned long & controlKeyState) Line 1176	C++
 	OpenConsole.exe!COOKED_READ_DATA::Read(const bool isUnicode, unsigned __int64 & numBytes, unsigned long & controlKeyState) Line 451	C++
 	OpenConsole.exe!COOKED_READ_DATA::Notify(const WaitTerminationReason TerminationReason, const bool fIsUnicode, long * const pReplyStatus, unsigned __int64 * const pNumBytes, unsigned long * const pControlKeyState, void * const __formal) Line 411	C++
 	OpenConsole.exe!ConsoleWaitBlock::Notify(const WaitTerminationReason TerminationReason) Line 152	C++
 	OpenConsole.exe!ConsoleWaitQueue::_NotifyBlock(ConsoleWaitBlock * pWaitBlock, const WaitTerminationReason TerminationReason) Line 117	C++
 	OpenConsole.exe!ConsoleWaitQueue::NotifyWaiters(const bool fNotifyAll, const WaitTerminationReason TerminationReason) Line 90	C++
 	OpenConsole.exe!ConsoleWaitQueue::NotifyWaiters(const bool fNotifyAll) Line 65	C++
 	OpenConsole.exe!InputBuffer::WakeUpReadersWaitingForData() Line 165	C++
 	OpenConsole.exe!InputBuffer::Write(std::deque<std::unique_ptr<IInputEvent,std::default_delete<IInputEvent>>,std::allocator<std::unique_ptr<IInputEvent,std::default_delete<IInputEvent>>>> & inEvents) Line 581	C++
 	OpenConsole.exe!InputBuffer::Write(std::unique_ptr<IInputEvent,std::default_delete<IInputEvent>> inEvent) Line 539	C++
 	OpenConsole.exe!HandleGenericKeyEvent(KeyEvent keyEvent, const bool generateBreak) Line 165	C++
 	OpenConsole.exe!HandleKeyEvent(HWND__ * const hWnd, const unsigned int Message, const unsigned __int64 wParam, const __int64 lParam, int * pfUnlockConsole) Line 467	C++
 	OpenConsole.exe!Microsoft::Console::Interactivity::Win32::Window::ConsoleWindowProc(HWND__ * hWnd, unsigned int Message, unsigned __int64 wParam, __int64 lParam) Line 523	C++
 	OpenConsole.exe!Microsoft::Console::Interactivity::Win32::Window::s_ConsoleWindowProc(HWND__ * hWnd, unsigned int Message, unsigned __int64 wParam, __int64 lParam) Line 58	C++
 	user32.dll!UserCallWinProcCheckWow(_ACTIVATION_CONTEXT * pActCtx, __int64(*)(tagWND *, unsigned int, unsigned __int64, __int64) pfn, HWND__ * hwnd, _WM_VALUE msg, unsigned __int64 wParam, __int64 lParam, void * fEnableLiteHooks, int) Line 280	C++
 	user32.dll!DispatchMessageWorker(tagMSG * pmsg, int fAnsi) Line 3157	C++
 	OpenConsole.exe!ConsoleInputThreadProcWin32(void * __formal) Line 1082	C++
 	kernel32.dll!BaseThreadInitThunk(unsigned long RunProcessInit, long(*)(void *) StartAddress, void * Argument) Line 70	C
 	ntdll.dll!RtlUserThreadStart(long(*)(void *) StartAddress, void * Argument) Line 1152	C

One frame up Code reference:

terminal/src/host/dbcs.cpp

Lines 157 to 208 in 09471c3

_Ret_range_(0, cbAnsi)
ULONG TranslateUnicodeToOem(_In_reads_(cchUnicode) PCWCHAR pwchUnicode,
const ULONG cchUnicode,
_Out_writes_bytes_(cbAnsi) PCHAR pchAnsi,
const ULONG cbAnsi,
_Out_ std::unique_ptr<IInputEvent>& partialEvent)
{
const CONSOLE_INFORMATION& gci = ServiceLocator::LocateGlobals().getConsoleInformation();
PWCHAR const TmpUni = new (std::nothrow) WCHAR[cchUnicode];
if (TmpUni == nullptr)
{
return 0;
}
memcpy(TmpUni, pwchUnicode, cchUnicode * sizeof(WCHAR));
BYTE AsciiDbcs[2];
AsciiDbcs[1] = 0;
ULONG i, j;
for (i = 0, j = 0; i < cchUnicode && j < cbAnsi; i++, j++)
{
if (IsGlyphFullWidth(TmpUni[i]))
{
ULONG const NumBytes = sizeof(AsciiDbcs);
ConvertToOem(gci.CP, &TmpUni[i], 1, (LPSTR)&AsciiDbcs[0], NumBytes);
if (IsDBCSLeadByteConsole(AsciiDbcs[0], &gci.CPInfo))
{
if (j < cbAnsi - 1)
{ // -1 is safe DBCS in buffer
pchAnsi[j] = AsciiDbcs[0];
j++;
pchAnsi[j] = AsciiDbcs[1];
AsciiDbcs[1] = 0;
}
else
{
pchAnsi[j] = AsciiDbcs[0];
break;
}
}
else
{
pchAnsi[j] = AsciiDbcs[0];
AsciiDbcs[1] = 0;
}
}
else
{
ConvertToOem(gci.CP, &TmpUni[i], 1, &pchAnsi[j], 1);
}
}

The length of space given to ConvertToOem is 1 when IsGlyphFullWidth returns false on the given character α. Otherwise, two bytes are given as the length of the array BYTE AsciiDbcs[2];

I momentarily drilled into IsGlyphFullWidth to check what it's doing before realizing that the full-width or half-widthness of a glyph really has nothing to do with how many bytes it's going to be consuming when returned to the user. But it does lead me down one potential revelation: When full width is checked and we don't have an answer for it in the table, we ask the font. The font chosen in conhost varies between choosing Output Codepage 437 and Output Codepage 932.

The questions then are:
A. Is there a divergence here between output CP 437 and output CP 932 in respect to what IsGlyphFullWidth reports?
B. Why is this using the glyph width to determine whether it fits? (Shouldn't it just use whatever is necessary?)
C. How is this different from conhostv1.dll because it worked there? (Was it checking width? Did it treat fonts differently?)

@miniksa
Copy link
Member

miniksa commented Oct 20, 2020

Part 1, Item 2, Questions

A. Is there a divergence here between output CP 437 and output CP 932 in respect to what IsGlyphFullWidth reports?

No. That's not the issue. It's the font that's the issue. There's two circumstances I've now observed in conhostv1.dll, but keep in mind that it tends to choose slightly different fonts than the most recent one because we've unlocked the ability to switch codepages early in the conhostv2 journey (as the locking to a specific one, the OEMCP one from startup for DBCS codepages, was a relic from when we sold a "Japanese Edition" of Windows or a "Chinese Edition" of Windows.)

If the font says that the α is wide, it gets translated into two bytes and fits and we see the 0x83 0xbf answer.
If the font says that the α is narrow, it literally returns one byte of uninitialized memory! The code didn't check the return code from the conversion and just walks on as if 1 byte was converted!

B. Why is this using the glyph width to determine whether it fits? (Shouldn't it just use whatever is necessary?)

As far as I can tell, this is the long standing mix about that someone did long ago when implementing this. The terminology of "full width" and "double byte" and the number of CHARs used to store something are all sort of mixed up as the same thing even though they can vary dramatically. For instance, a narrow character can be 2 bytes in a DBCS codepage.

C. How is this different from conhostv1.dll because it worked there? (Was it checking width? Did it treat fonts differently?)

It's different because it was using RtlUnicodeToOemN when the system startup OEMCP matched the currently chosen input code page. For conhostv2.dll, the current conhost.exe, and openconsole.exe, we run everything through the WideCharToMultiByte function instead, deduplicating the code path (in theory) AND we tend to end up with a different font that is saying this is narrow because of the broader unlocked choice. Finally, the v2 one actually checks error return codes and tries to compensate for them where v1 just returns either uninitialized memory or just so happens to have enough memory for the conversion half by fluke due to the conflation of the "wide" and "2 byte" terminology.

Conclusion

Due to the total mix around of DBCS and wide/narrow and 1/2 bytes in conhostv1 combined with the fact that certain fonts returned uninitialized memory, the v1 variant is not a model for anyone to follow as sanity.

I believe that the fix here is to make the v2 one not necessarily check the "width" of the character as a measure of how many bytes it will take and instead just do the conversion filling as much buffer as it has access to over returning a \0 as failure (and stow the leftover bytes for the next call, per the read model of many other ReadConsole* family functions.)

@miniksa
Copy link
Member

miniksa commented Oct 21, 2020

Okay, so I have a proposed fix for the translation part in ConvertUnicodeToOem that:

  1. Doesn't try to count the width of characters in bytes or the widthness of a byte. It just converts up to the available space given.
  2. Uses the default character if the conversion fails instead of leaving behind a null byte (or the even worse v1 policy of an uninitialized byte.)

It's here: c1ca8f3

But now there's another problem, the parent calling it isn't giving enough buffer space. And that isn't because the user buffer provided in the API call was necessarily too small, it's because the parent function _handlePostCharInputLoop is pre-guessing how many bytes it will take by checking the WIDTH of every character and reserving two codepage bytes for a wide and one for a narrow.

[[nodiscard]] NTSTATUS COOKED_READ_DATA::_handlePostCharInputLoop(const bool isUnicode, size_t& numBytes, ULONG& controlKeyState) noexcept
{
DWORD LineCount = 1;
if (_echoInput)
{
// Figure out where real string ends (at carriage return or end of buffer).
PWCHAR StringPtr = _backupLimit;
size_t StringLength = _bytesRead;
bool FoundCR = false;
for (size_t i = 0; i < (_bytesRead / sizeof(WCHAR)); i++)
{
if (*StringPtr++ == UNICODE_CARRIAGERETURN)
{
StringLength = i * sizeof(WCHAR);
FoundCR = true;
break;
}
}
if (FoundCR)
{
if (_commandHistory)
{
// add to command line recall list if we have a history list.
CONSOLE_INFORMATION& gci = ServiceLocator::LocateGlobals().getConsoleInformation();
LOG_IF_FAILED(_commandHistory->Add({ _backupLimit, StringLength / sizeof(wchar_t) },
WI_IsFlagSet(gci.Flags, CONSOLE_HISTORY_NODUP)));
}
// check for alias
ProcessAliases(LineCount);
}
}
bool fAddDbcsLead = false;
size_t NumBytes = 0;
// at this point, a->NumBytes contains the number of bytes in
// the UNICODE string read. UserBufferSize contains the converted
// size of the app's buffer.
if (_bytesRead > _userBufferSize || LineCount > 1)
{
if (LineCount > 1)
{
PWSTR Tmp;
if (!isUnicode)
{
if (_pInputBuffer->IsReadPartialByteSequenceAvailable())
{
fAddDbcsLead = true;
std::unique_ptr<IInputEvent> event = GetInputBuffer()->FetchReadPartialByteSequence(false);
const KeyEvent* const pKeyEvent = static_cast<const KeyEvent* const>(event.get());
*_userBuffer = static_cast<char>(pKeyEvent->GetCharData());
_userBuffer++;
_userBufferSize -= sizeof(wchar_t);
}
NumBytes = 0;
for (Tmp = _backupLimit;
*Tmp != UNICODE_LINEFEED && _userBufferSize / sizeof(WCHAR) > NumBytes;
Tmp++)
{
NumBytes += IsGlyphFullWidth(*Tmp) ? 2 : 1;
}
}
// clang-format off
#pragma prefast(suppress: __WARNING_BUFFER_OVERFLOW, "LineCount > 1 means there's a UNICODE_LINEFEED")
// clang-format on
for (Tmp = _backupLimit; *Tmp != UNICODE_LINEFEED; Tmp++)
{
FAIL_FAST_IF(!(Tmp < (_backupLimit + _bytesRead)));
}
numBytes = (ULONG)(Tmp - _backupLimit + 1) * sizeof(*Tmp);
}
else
{
if (!isUnicode)
{
PWSTR Tmp;
if (_pInputBuffer->IsReadPartialByteSequenceAvailable())
{
fAddDbcsLead = true;
std::unique_ptr<IInputEvent> event = GetInputBuffer()->FetchReadPartialByteSequence(false);
const KeyEvent* const pKeyEvent = static_cast<const KeyEvent* const>(event.get());
*_userBuffer = static_cast<char>(pKeyEvent->GetCharData());
_userBuffer++;
_userBufferSize -= sizeof(wchar_t);
}
NumBytes = 0;
size_t NumToWrite = _bytesRead;
for (Tmp = _backupLimit;
NumToWrite && _userBufferSize / sizeof(WCHAR) > NumBytes;
Tmp++, NumToWrite -= sizeof(WCHAR))
{
NumBytes += IsGlyphFullWidth(*Tmp) ? 2 : 1;
}
}
numBytes = _userBufferSize;
}
__analysis_assume(numBytes <= _userBufferSize);
memmove(_userBuffer, _backupLimit, numBytes);
INPUT_READ_HANDLE_DATA* const pInputReadHandleData = GetInputReadHandleData();
const std::wstring_view pending{ _backupLimit + (numBytes / sizeof(wchar_t)), (_bytesRead - numBytes) / sizeof(wchar_t) };
if (LineCount > 1)
{
pInputReadHandleData->SaveMultilinePendingInput(pending);
}
else
{
pInputReadHandleData->SavePendingInput(pending);
}
}
else
{
if (!isUnicode)
{
PWSTR Tmp;
if (_pInputBuffer->IsReadPartialByteSequenceAvailable())
{
fAddDbcsLead = true;
std::unique_ptr<IInputEvent> event = GetInputBuffer()->FetchReadPartialByteSequence(false);
const KeyEvent* const pKeyEvent = static_cast<const KeyEvent* const>(event.get());
*_userBuffer = static_cast<char>(pKeyEvent->GetCharData());
_userBuffer++;
_userBufferSize -= sizeof(wchar_t);
if (_userBufferSize == 0)
{
numBytes = 1;
return STATUS_SUCCESS;
}
}
NumBytes = 0;
size_t NumToWrite = _bytesRead;
for (Tmp = _backupLimit;
NumToWrite && _userBufferSize / sizeof(WCHAR) > NumBytes;
Tmp++, NumToWrite -= sizeof(WCHAR))
{
NumBytes += IsGlyphFullWidth(*Tmp) ? 2 : 1;
}
}
numBytes = _bytesRead;
if (numBytes > _userBufferSize)
{
return STATUS_BUFFER_OVERFLOW;
}
memmove(_userBuffer, _backupLimit, numBytes);
}
controlKeyState = _controlKeyState;
if (!isUnicode)
{
// if ansi, translate string.
std::unique_ptr<char[]> tempBuffer;
try
{
tempBuffer = std::make_unique<char[]>(NumBytes);
}
catch (...)
{
return STATUS_NO_MEMORY;
}
std::unique_ptr<IInputEvent> partialEvent;
numBytes = TranslateUnicodeToOem(_userBuffer,
gsl::narrow<ULONG>(numBytes / sizeof(wchar_t)),
tempBuffer.get(),
gsl::narrow<ULONG>(NumBytes),
partialEvent);
if (partialEvent.get())
{
GetInputBuffer()->StoreReadPartialByteSequence(std::move(partialEvent));
}
if (numBytes > _userBufferSize)
{
return STATUS_BUFFER_OVERFLOW;
}
memmove(_userBuffer, tempBuffer.get(), numBytes);
if (fAddDbcsLead)
{
numBytes++;
}
}
return STATUS_SUCCESS;
}

That sort of behavior is standard practice for COOKED READ.... that is guessing byte count based on character width which can be dependent on the font chosen which can be dependent on the code page selected. Gross.

So my plan is to try to make a targeted fix in _handlePostCharInputLoop for this for now, but we're going to need to book all up COOKED READ rework soonish because very similar issues are impacting every other issue on the tracker with "cooked read", "utf8 and read*a APIs", and "emoji in cmd.exe" like this one #1503.

Also.... ConvertUnicodeToOem might just be completely unnecessary as it sounds an awful lot like some of our other ConvertToA like functions and/or the til::u16a and friends from #4493 that I might have an opportunity to resurrect (@german-one, fyi).

@miniksa
Copy link
Member

miniksa commented Oct 23, 2020

@german-one
Copy link
Contributor

german-one commented Oct 24, 2020

(@german-one, fyi)

Feel free to do whatever is necessary to make any good out of the code. Get back to me if you'd like me to work on it any further. Note: There are still codepages where the handling of partials is not supported. They are listed in the method description of operator(). That's either because I didn't find the spec or because it's quite tricky (like for UTF-7 which is a mix of ASCII and base64).

@miniksa
Copy link
Member

miniksa commented Oct 27, 2020

(@german-one, fyi)

Feel free to do whatever is necessary to make any good out of the code. Get back to me if you'd like me to work on it any further. Note: There are still codepages where the handling of partials is not supported. They are listed in the method description of operator(). That's either because I didn't find the spec or because it's quite tricky (like for UTF-7 which is a mix of ASCII and base64).

That's fine. Thank you!

@zadjii-msft zadjii-msft modified the milestones: Terminal v2.0, 22H2 Jan 4, 2022
@miniksa miniksa removed their assignment Mar 31, 2022
@lhecker lhecker self-assigned this Jan 23, 2023
@ghost ghost added the In-PR This issue has a related PR label Jan 31, 2023
DHowett pushed a commit that referenced this issue Feb 28, 2023
The overarching intention of this PR is to improve our Unicode support.
Most
of our APIs still don't support anything beyond UCS-2 and DBCS
sequences.
This commit doesn't fix the UTF-16 support (by supporting surrogate
pairs),
but it does improve support for UTF-8 by allowing longer `char`
sequences.

It does so by removing `TranslateUnicodeToOem` which seems to have had
an almost
viral effect on code quality wherever it was used. It made the
assumption that
_all_ narrow glyphs encode to 1 `char` and most wide glyphs to 2
`char`s.
It also didn't bother to check whether `WideCharToMultiByte` failed or
returned
a different amount of `char`s. So up until now it was easily possible to
read
uninitialized stack memory from conhost. Any code that used this
function was
forced to do the same "measurement" of narrow/wide glyphs, because _of
course_
it didn't had any way to indicate to the caller how much memory it needs
to
store the result. Instead all callers were forced to sorta replicate how
it
worked to calculate the required storage ahead of time.
Unsurprisingly, none of the callers used the same algorithm...

Without it the code is much leaner and easier to understand now. The
best
example is `COOKED_READ_DATA::_handlePostCharInputLoop` which used to
contain 3
blocks of _almost_ identical code, but with ever so subtle differences.
After
reading the old code for hours I still don't know if they were relevant
or not.
It used to be 200 lines of code lacking any documentation and it's now
50 lines
with descriptive function names. I hope this doesn't break anything, but
to
be honest I can't imagine anyone having relied on this mess in the first
place.

I needed some helpers to handle byte slices (`std::span<char>`), which
is why
a new `til/bytes.h` header was added. Initially I wrote a `buf_writer`
class
but felt like such a wrapper around a slice/span was annoying to use.
As such I've opted for freestanding functions which take slices as
mutable
references and "advance" them (offset the start) whenever they're read
from
or written to. I'm not particularly happy with the design but they do
the job.

Related to #8000
Fixes #4551
Fixes #7589
Fixes #8663

## Validation Steps Performed
* Unit and feature tests ✅
* Far Manager ✅
* Fixes test cases in #4551, #7589 and #8663
@microsoft-github-policy-service microsoft-github-policy-service bot added the Needs-Tag-Fix Doesn't match tag requirements label Feb 28, 2023
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
Area-Input Related to input processing (key presses, mouse, etc.) In-PR This issue has a related PR Issue-Bug It either shouldn't be doing this or needs an investigation. Needs-Tag-Fix Doesn't match tag requirements Priority-2 A description (P2) Product-Terminal The new Windows Terminal.
Projects
None yet
Development

Successfully merging a pull request may close this issue.

7 participants