Add optional UTF-8 Display/File character support. #73

Open
taviso opened this issue Jun 3, 2022 · 7 comments

@taviso
Owner

taviso commented Jun 3, 2022

Lotus 1-2-3 predates UTF-8 and uses LMBCS (Lotus Multi-Byte Character Set) internally, which is sort of a precursor to Unicode.

I see no reason we couldn't add a UTF-8 option for the file/display charset, for better i18n support. 1-2-3 already supports character set translation; we just have to teach it how, and figure out the CBD (character bundle) format. I already know the BDLREC format from my lotusdrv project: it's basically a TLV (tag, length, value) encoding system.
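
For readers unfamiliar with TLV, here is a minimal sketch of how a tag, length, value stream can be walked. The field widths (a 2-byte little-endian tag followed by a 2-byte little-endian length) and the names `parse_tlv` and `build_tlv` are assumptions made for illustration; the thread does not give the actual BDLREC or CBD layout.

```python
import struct


def build_tlv(tag: int, value: bytes) -> bytes:
    """Encode one record as tag, length, value (hypothetical field widths)."""
    return struct.pack("<HH", tag, len(value)) + value


def parse_tlv(data: bytes):
    """Yield (tag, value) pairs from a TLV byte stream.

    Assumed layout, not taken from BDLREC: 2-byte little-endian tag,
    2-byte little-endian length, then `length` bytes of value.
    """
    offset = 0
    while offset < len(data):
        if offset + 4 > len(data):
            raise ValueError("truncated TLV header")
        tag, length = struct.unpack_from("<HH", data, offset)
        offset += 4
        if offset + length > len(data):
            raise ValueError("truncated TLV value")
        yield tag, data[offset:offset + length]
        offset += length


if __name__ == "__main__":
    blob = build_tlv(1, b"abc") + build_tlv(2, b"")
    for tag, value in parse_tlv(blob):
        print(tag, value)
```

The appeal of this kind of format is that a reader can skip records whose tag it does not understand, because the length field says how far to jump.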

@sjuswede

I sometimes work with Chinese and Japanese text, and if this could get working, I'd be extremely happy.

Right now not even Swedish characters like åäö work for me in rxvt-unicode or XTerm when I try to enter them. When I import a CSV file containing them (in UTF-8), they predictably get stripped out.

@taviso
Owner Author

taviso commented Jun 16, 2022

Yeah, not great right now, I can't even use £ lol. I looked at the code a bit today; I think I can make a few improvements easily, though some might be harder!

Internally, Lotus uses LMBCS, which is actually pretty impressive foresight considering Unicode hadn't been invented yet and everyone else was using code pages. This is good, because internally it can tell the difference between å, ä and a.

You can see it knows about å, and calls it a ring:

https://archive.org/details/lotus-1-2-3-release-3.1-reference/Lotus%201-2-3%20Release%203.1%20-%20Reference/page/n637/mode/2up

It stores these characters correctly but doesn't know how to display them, so right now it uses a "fallback" ASCII character translation table (å => a, £ => L, and so on). That actually seems pretty easy to solve: I'll just add an LMBCS => UTF-8 table, then pass the translated character to waddch() instead.
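
To make the difference between the fallback table and a UTF-8 table concrete, here is a rough sketch. The byte codes are placeholders borrowed from code page 850, and the ä => a and ö => o fallbacks are guesses in the spirit of the two examples above; none of these values are taken from the real 1-2-3 tables. The function `render_cell` is hypothetical and only produces bytes; it does not touch curses.

```python
# Placeholder codes only (borrowed from code page 850): an assumption,
# not the actual LMBCS values used by 1-2-3.
LMBCS_TO_UNICODE = {
    0x84: "ä",
    0x86: "å",
    0x94: "ö",
    0x9C: "£",
}

# Transliteration in the style the thread describes (å => a, £ => L).
# The ä and ö entries are guesses added for illustration.
ASCII_FALLBACK = {
    0x84: "a",
    0x86: "a",
    0x94: "o",
    0x9C: "L",
}


def render_cell(codes: bytes, utf8: bool) -> bytes:
    """Translate stored codes to terminal bytes, with or without UTF-8."""
    out = []
    for code in codes:
        if code < 0x80:
            out.append(chr(code))              # plain ASCII passes through
        elif utf8 and code in LMBCS_TO_UNICODE:
            out.append(LMBCS_TO_UNICODE[code])  # proposed UTF-8 path
        else:
            out.append(ASCII_FALLBACK.get(code, "?"))  # fallback path
    return "".join(out).encode("utf-8")


if __name__ == "__main__":
    cell = bytes([0x86, 0x84, 0x61, 0x20, 0x9C])
    print(render_cell(cell, utf8=False))
    print(render_cell(cell, utf8=True).decode("utf-8"))
```

With the fallback path the three letters å, ä and a all render as `a`, which is exactly the loss of information the UTF-8 table is meant to avoid.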

I'll give it a shot this weekend.

@taviso
Owner Author

taviso commented Jun 16, 2022

I think display and keyboard input might be easy, but the question is what to do with /File Import: always assume UTF-8? I guess we could have an environment variable like $LOTUS_IMPORT_CHARSET or whatever.

@sjuswede

An environment variable would of course be great for legacy files. I would default to UTF-8, since that is the standard on Linux today, and it's a lot of work to configure a normal distro to use anything else. But there are a lot of legacy files out there, and many systems that still spit out very strange formats. Don't ask me how I know.
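
A small sketch of how the import charset could be chosen, assuming the variable name `LOTUS_IMPORT_CHARSET` floated above and a UTF-8 default. The helper names `import_charset` and `decode_import`, and the choice to fall back to UTF-8 on an unknown charset name, are assumptions for illustration only.

```python
import codecs
import os


def import_charset(environ=os.environ) -> str:
    """Return the charset to use for /File Import, defaulting to UTF-8."""
    name = environ.get("LOTUS_IMPORT_CHARSET", "UTF-8")
    try:
        # codecs.lookup normalises aliases and rejects unknown names.
        return codecs.lookup(name).name
    except LookupError:
        return "utf-8"  # assumed policy: ignore an unrecognised setting


def decode_import(raw: bytes, environ=os.environ) -> str:
    """Decode imported file bytes; undecodable bytes become U+FFFD."""
    return raw.decode(import_charset(environ), errors="replace")


if __name__ == "__main__":
    legacy = "åäö".encode("latin-1")
    print(decode_import(legacy, {"LOTUS_IMPORT_CHARSET": "latin-1"}))
```

Replacing undecodable bytes rather than dropping them keeps a visible marker in the sheet, so a wrong charset setting is easier to notice than silently stripped characters.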

taviso added this to the 1.0.0 milestone Jun 18, 2022
@taviso
Owner Author

taviso commented Jun 20, 2022

Okay, I think I've got a plan. I have an easy temporary improvement, and a plan for a harder complete solution.

I can change the keymap code to translate UTF-8 input into any of the supported LMBCS characters. There are no collisions (I checked), so this will be super easy; I can do this in a day or two.

This is easy but not a complete solution (there's no CJK, for a start), but it is better than nothing: most of the Latin extended characters are covered (so I'll get £, you'll get all the Swedish characters, and things like éßçñ are all there). There is no €, but there is ¤, so it seems pretty safe to just steal that for € for now? I don't know.
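
The input side of this plan might look roughly like the sketch below. The codes are placeholders borrowed from code page 850 rather than the real LMBCS values, and `translate_input`, `has_collisions` and the alias table are hypothetical names. Dropping unmapped characters is also just one possible policy.

```python
# Placeholder codes (borrowed from code page 850): an assumption,
# not the table the keymap code would actually use.
UNICODE_TO_LMBCS = {
    "å": 0x86,
    "ä": 0x84,
    "ö": 0x94,
    "£": 0x9C,
    "¤": 0xCF,
}

# One-way alias: there is no euro sign, so borrow the currency sign.
INPUT_ALIASES = {"€": "¤"}


def has_collisions(table: dict) -> bool:
    """True if two different characters map to the same code."""
    return len(set(table.values())) != len(table)


def translate_input(utf8_bytes: bytes) -> bytes:
    """Translate typed UTF-8 into single-byte codes where a mapping exists."""
    out = bytearray()
    for ch in utf8_bytes.decode("utf-8"):
        ch = INPUT_ALIASES.get(ch, ch)
        if ord(ch) < 0x80:
            out.append(ord(ch))             # ASCII passes through unchanged
        elif ch in UNICODE_TO_LMBCS:
            out.append(UNICODE_TO_LMBCS[ch])
        # Anything else (CJK, for example) has no mapping and is dropped.
    return bytes(out)


if __name__ == "__main__":
    print(has_collisions(UNICODE_TO_LMBCS))
    print(translate_input("5 €".encode("utf-8")).hex())
```

Note that the € alias only works in one direction: once stored, the character is indistinguishable from a typed ¤, which is the trade-off behind the "I don't know" above.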

The complete solution will be adding LMBCS<->UTF-8 charset support, but this is a much bigger job.

taviso added a commit that referenced this issue Jun 21, 2022
This is the first step in improving i18n support. If any
UTF-8 sequences have LMBCS encodings, translate them on
input.

These characters are stored as LMBCS internally, and you
can differentiate them with @code, but they are not
displayed correctly (they are transliterated to ASCII, see
the 1-2-3R3.1 manual, Appendix 2).

The next part of this change will be displaying them
as UTF-8.
@krackout

krackout commented Jan 3, 2023

> The complete solution will be adding lmbcs<->UTF-8 charset support, but this is a much bigger job.

@taviso If I can help in any way, I'm willing, especially regarding Greek. Getting me involved in the programming may be a waste of time, but I could more easily help with the conversion tables, I suppose.

@taviso
Owner Author

taviso commented Jan 4, 2023

Thank you! I'm slowly working on this, it will work eventually! 😆
