OSM: encoding not correct #8289

Closed
pathmapper opened this issue Aug 30, 2023 · 5 comments

Comments

@pathmapper
Contributor

pathmapper commented Aug 30, 2023

Expected behavior and actual behavior.

OGR should use UTF-8 as the encoding for OSM data, but it appears to use something else, so some characters are not correctly encoded, e.g. ö. Example:
Astrid-Lindgren-Förderschule becomes
Astrid-Lindgren-F├Ârderschule.

Prolog:
[image]

Steps to reproduce the problem.

ogrinfo test.osm multipolygons (sample data: test.zip)
-> some values for name are not correctly encoded, e.g.

[image]

Operating system

Windows 10 Version 1809

GDAL version and provenance

3.6.4 and 3.7.1 from OSGeo4W

Likely related issues

qgis/QGIS#48328
3liz/QuickOSM#369

@jratike80
Collaborator

I bet that this is not a bug but what you see comes from the locale settings of your computer.

See this:

ogrinfo liechtenstein-latest.osm.pbf  -sql "select * from points where osm_id='32011360'"
name (String) = Vaduz Quäderle

Then check your Windows codepage (my locale is Finnish) and change the codepage into UTF-8 (65001)

chcp
Active code page: 850

chcp 65001
Active code page: 65001

Try again

ogrinfo liechtenstein-latest.osm.pbf  -sql "select * from points where osm_id='32011360'"
name (String) = Vaduz Quäderle
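The effect can be reproduced without GDAL. The following is a minimal Python sketch, assuming the program writes UTF-8 bytes and the console interprets those bytes using its active code page. The helper name `as_seen_on_console` is made up for illustration and is not part of GDAL or Windows.

```python
def as_seen_on_console(text: str, console_codepage: str) -> str:
    """Return what a console using console_codepage would show for UTF-8 output."""
    # The program emits UTF-8 bytes regardless of the console setting.
    utf8_bytes = text.encode("utf-8")
    # The console decodes those bytes with its own code page.
    return utf8_bytes.decode(console_codepage, errors="replace")


name = "Astrid-Lindgren-Förderschule"
print(as_seen_on_console(name, "cp850"))  # Astrid-Lindgren-F├Ârderschule
print(as_seen_on_console(name, "utf-8"))  # Astrid-Lindgren-Förderschule
```

The two bytes that UTF-8 uses for ö (0xC3 0xB6) are shown as two separate characters under code page 850, which matches the garbled name in the report.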

@pathmapper
Contributor Author

> change the codepage into UTF-8 (65001)

Thanks, works for me, too.

> not a bug but what you see comes from the locale settings of your computer

The XML Prolog in the file has

<?xml version="1.0" encoding="UTF-8"?>

so shouldn't UTF-8 be used regardless of which system encoding is set?

@pathmapper
Contributor Author

When the sample file is drag'n'dropped in QGIS, the Data source encoding shows windows-1252:

[image]

I assumed this comes from ogr but maybe not...

@jratike80
Collaborator

> so shouldn't UTF-8 be used regardless of which system encoding is set?

When it comes to printing text on a console, the operating system never reads the XML header; GDAL does. If the console is set to use the Latin 1 character set, it expects to receive Latin 1 encoded data. When GDAL sends out UTF-8 there is a conflict: the UTF-8 strings are shown as if they were Latin 1 strings. That can be avoided in two ways: set the console to use UTF-8, or convert the output of GDAL from UTF-8 into Latin 1, for example with iconv (https://en.wikipedia.org/wiki/Iconv).
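Both halves of this explanation can be sketched in Python with only the standard library: a parser that honours the encoding declared in the XML prolog, and a separate transcoding step for the console. The sample document and the function name `for_console` are illustrative assumptions, not taken from the sample data, and the transcoding step only stands in for what iconv would do on the command line.

```python
import xml.etree.ElementTree as ET

# Hypothetical OSM-like snippet, stored as UTF-8 bytes as its prolog declares.
doc = (
    '<?xml version="1.0" encoding="UTF-8"?>'
    '<osm><node><tag k="name" v="Vaduz Quäderle"/></node></osm>'
).encode("utf-8")

# The XML parser reads the prolog and decodes the bytes correctly.
name = ET.fromstring(doc).find("./node/tag").get("v")


def for_console(text: str, console_encoding: str) -> bytes:
    """Encode text for a console; unrepresentable characters become '?'."""
    return text.encode(console_encoding, errors="replace")


# Bytes a Latin 1 console would render correctly.
latin1_bytes = for_console(name, "latin-1")
```

Reading the file is therefore unaffected by the console setting; only the final step, turning the decoded string back into bytes for display, has to match what the console expects.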

There may be a bug in QGIS but I could not reproduce qgis/QGIS#48328 with QGIS 3.32.2.

@pathmapper
Contributor Author

Thanks, this seems to be related to the system encoding in use and how QGIS handles it, not a GDAL issue.

This issue was closed.