Unable to render non-English letters supported by the font #2

Open
fatso83 opened this issue Jun 25, 2018 · 4 comments

fatso83 commented Jun 25, 2018

I can't get OrsonPDF to display non-English letters contained in the text when not rendering the text as vector graphics. I only get '???' where it should say "ØÆÅ". I have checked that the font supports these letters using Font#canDisplay(), and the font is of course installed on the system (verified by checking that Font#getFamily() does not return the "Dialog" fallback).

    Font f = new Font("DejaVu Serif", Font.PLAIN, 16);
    graphics2D.setFont(f);
    System.out.println("family: " + f.getFamily() + " " + f.getName());
    System.out.println("Can display Ø: " + f.canDisplay('Ø'));

    // vector works
    graphics2D.setRenderingHint(PDFHints.KEY_DRAW_STRING_TYPE, PDFHints.VALUE_DRAW_STRING_TYPE_VECTOR);
    graphics2D.drawString("VECTOR: æøå ØÅÆ", 0, 20);

    // normal text doesn't
    graphics2D.setRenderingHint(PDFHints.KEY_DRAW_STRING_TYPE, PDFHints.VALUE_DRAW_STRING_TYPE_STANDARD);
    graphics2D.drawString("TEXT: æøå ØÅÆ", 300, 20);
@mhschmieder (Contributor)
I am hoping to do some work on encoding issues tonight, so I will keep this in mind as it may be related. Much of the code is hard-wired to 7-bit US-ASCII for some reason, which is an unnecessary restriction: PDF supports UTF-8 and UTF-16, UTF-8 (my preference) covers the same Unicode range as UTF-16, and UTF-8 produces byte-for-byte identical files when the content sticks to 7-bit US-ASCII. I will check whether this affects font selection and extended font mappings, and whether the encoding is applied before glyph conversion and text vectorization. I am hoping to solve these character-set limitations inside the library in a way that still leaves the downstream client in control of the encoding.
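To illustrate the byte-for-byte claim (and, incidentally, where the '???' in the original report comes from), here is a minimal standalone check using only the JDK, no OrsonPDF involved:

```java
import java.nio.charset.StandardCharsets;
import java.util.Arrays;

public class EncodingCheck {
    public static void main(String[] args) {
        // For pure 7-bit ASCII content, US-ASCII and UTF-8 encode to identical bytes.
        String ascii = "Hello PDF";
        byte[] asUsAscii = ascii.getBytes(StandardCharsets.US_ASCII);
        byte[] asUtf8 = ascii.getBytes(StandardCharsets.UTF_8);
        System.out.println(Arrays.equals(asUsAscii, asUtf8)); // true

        // Non-ASCII characters are silently replaced with '?' under US-ASCII,
        // but round-trip cleanly through UTF-8.
        String nordic = "ØÆÅ";
        System.out.println(new String(nordic.getBytes(StandardCharsets.US_ASCII),
                StandardCharsets.US_ASCII)); // ???
        System.out.println(new String(nordic.getBytes(StandardCharsets.UTF_8),
                StandardCharsets.UTF_8)); // ØÆÅ
    }
}
```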

mhschmieder commented Jul 7, 2020

I am now setting the rendering hint to get vectored text, because my clients want as close a match to the on-screen GUI look as possible (and most modern applications recognize common fonts and can back-convert vectored text to selectable text, as long as the original font was a common one). Still, I looked again at the OrsonPDF source code to see whether it would be safe to switch the encoding to UTF-8 in the two toBytes() functions (one in PDFUtils, the other in PDFDocument).

After looking at where those functions are called, it seems perfectly safe to make this change, and I don't feel it even requires "additional" functions that take the charset as an argument and are called by the older functions with US-ASCII as the charset value for backward compatibility. As I stated above, if the content is all US-ASCII anyway, the resulting file will be 100% identical when the String is converted to a byte array using UTF-8 encoding.
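For reference, the charset-parameter variant I'm describing (and arguing is probably unnecessary) would look roughly like this. The class and method names below are an illustrative sketch, not the actual OrsonPDF internals:

```java
import java.nio.charset.Charset;
import java.nio.charset.StandardCharsets;

public class PDFUtilsSketch {
    // Existing entry point kept intact: callers that relied on the old
    // US-ASCII behavior are completely unchanged.
    public static byte[] toBytes(String s) {
        return toBytes(s, StandardCharsets.US_ASCII);
    }

    // New overload lets text content opt in to UTF-8 (or any charset),
    // leaving the downstream client in control of the encoding.
    public static byte[] toBytes(String s, Charset charset) {
        return s.getBytes(charset);
    }

    public static void main(String[] args) {
        // Each of Ø, Æ, Å encodes to 2 bytes in UTF-8.
        System.out.println(toBytes("ØÆÅ", StandardCharsets.UTF_8).length); // 6
    }
}
```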

On the other hand, the Dictionary class in OrsonPDF also uses PDFUtils.toBytes() to convert the String of PDF syntax that describes the Dictionary. So US-ASCII may be required for that particular encoding, as the Dictionary entries come first in the PDF output and are likely only valid if limited to US-ASCII.

As this issue got no comments for two years, I'm not sure whether I should just go ahead and make the changes in a pull request that includes this explanation. If no comments are made here soon, I will probably do that. After all, the development team can always reject the change and say why.

Of course, I will discover immediately if this change causes a font-mapping issue, before I even commit any code changes, as every PDF I output has at least the degree symbol in it, which isn't in the US-ASCII character set. I may have to try several available font mappings to verify, though, as I thought Helvetica supported some of the more common non-US characters like degrees, copyright, etc.
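A quick standalone way to spot-check a font's coverage of these characters (the canDisplay() results depend on which physical font the name maps to on the running system, so this is a diagnostic sketch rather than a guaranteed outcome):

```java
import java.awt.Font;

public class FontCoverageCheck {
    public static void main(String[] args) {
        // "Helvetica" here is just the name being spot-checked; on systems
        // without it, Java falls back to a logical font, so check getFamily()
        // as well to know what you are actually testing.
        Font f = new Font("Helvetica", Font.PLAIN, 12);
        System.out.println("resolved family: " + f.getFamily());
        for (char c : new char[] {'\u00B0' /* degree */, '\u00A9' /* copyright */, 'Ø'}) {
            // All three code points are above 127, i.e. outside 7-bit US-ASCII.
            System.out.println(c + " (U+" + Integer.toHexString(c).toUpperCase()
                    + ", ascii=" + (c < 128) + ") displayable: " + f.canDisplay(c));
        }
    }
}
```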

jfree (Owner) commented Jul 7, 2020

Font support is definitely an area where OrsonPDF has limitations. I'll be happy to look at pull requests that extend what's possible with the API.

mhschmieder commented Jul 7, 2020

I just spent about two hours trying to understand every code path that might depend on the functions that force US-ASCII encoding. I'm not quite ready to risk a wholesale switch to UTF-8 versus making it an option in enhanced versions of the byte-array-conversion functions that would only affect text content. I should have time tomorrow to give that a trial run as a safe approach that doesn't touch the library's internals, and then see whether the built-in PDF fonts choke on non-US-ASCII characters such as the degree symbol and the copyright sign in the core text content of the PDF document.

If it does turn out to be an issue of font support for UTF-8, then I'll come up with some creative plan for setting the document to map to fonts that support expanded character sets. But if it comes to that, Issue #6 would have to be addressed first, and as I mentioned there, I am hoping to have time to work on it later this week.
