import-wikidata should prefer name statements over labels #437

Open
1ec5 opened this issue Oct 2, 2023 · 3 comments

1ec5 commented Oct 2, 2023

import-wikidata fetches the label of each linked Wikidata item in each available language:

query = f"""\
SELECT ?id ?label WHERE {{
VALUES ?id {{ {' '.join(batch)} }}
?id rdfs:label ?label.
}}"""

This is suboptimal because Wikidata labels exist primarily to label items on the Wikidata site itself. Even though a label usually corresponds to a concept’s common name, it is sometimes modified to be recognizable on the site. (The closest analogy in OSM would be the name of a route relation that a mapper has optimized for display in the osm.org sidebar or JOSM’s relation list.)

A better alternative is the name (P2561) property. When an item has statements for this property, the query should prefer those statements. If there’s no statement for a given language, it should fall back to the label in that language.

If there are multiple name statements in a given language, the query should prefer the one with preferred rank, or the one without an end time (P582). Better yet, it should prefer the statement whose object has role (P3831) qualifier is set to map label (Q104642575). For example, this will avoid adding an extra “D.C.” disambiguator to Washington, D.C. (which is correct in most written mediums, just not maps).
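The preference order described above can be sketched in Python as a post-processing step. This is only an illustration, not code from import-wikidata: the `NameStatement` class, its fields, and `best_name` are hypothetical, and two points are assumptions: that the map label role outranks preferred rank, which in turn outranks the absence of an end time, and that deprecated-rank statements are skipped. The sample values are illustrative and were not fetched from Wikidata.

```python
from dataclasses import dataclass
from typing import Optional

MAP_LABEL = "Q104642575"  # "map label", the P3831 value mentioned above


@dataclass
class NameStatement:
    """Hypothetical, simplified view of one name (P2561) statement."""
    text: str
    lang: str
    rank: str = "normal"        # "preferred", "normal", or "deprecated"
    role: Optional[str] = None  # value of the P3831 qualifier, if any
    has_end_time: bool = False  # True if a P582 qualifier is present


def best_name(statements, labels, lang):
    """Return the best name for `lang`, falling back to the label."""
    # Assumption: deprecated-rank statements are never used.
    candidates = [s for s in statements
                  if s.lang == lang and s.rank != "deprecated"]
    if not candidates:
        return labels.get(lang)  # None if there is no label either
    # Assumed priority: map-label role, then preferred rank, then no end time.
    # False sorts before True, so each test is phrased as "lacks the trait".
    best = min(candidates, key=lambda s: (s.role != MAP_LABEL,
                                          s.rank != "preferred",
                                          s.has_end_time))
    return best.text


# Illustrative data only.
statements = [
    NameStatement("Washington, D.C.", "en"),
    NameStatement("Washington", "en", role=MAP_LABEL),
]
labels = {"en": "Washington, D.C.", "fr": "Washington"}

print(best_name(statements, labels, "en"))  # name statement qualified as map label
print(best_name(statements, labels, "fr"))  # no French name statement, so the label
```

Doing the ranking in Python rather than in SPARQL keeps the query simple, at the cost of fetching every candidate statement for each item.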

/ref osm-americana/openstreetmap-americana#592 (comment)

1ec5 commented Oct 2, 2023

Here’s a modified query that pulls in name statements qualified as map labels. It needs a little extra work to pull in any name statement, even those not qualified as map labels:

SELECT ?id ?label ?name ?bestName (LANG(?label) AS ?lang) WHERE {
  VALUES ?id { wd:Q61 }
  ?id rdfs:label ?label.
  OPTIONAL {
    ?id p:P2561 [ps:P2561 ?name; pq:P3831 wd:Q104642575].
    FILTER(LANG(?name) = LANG(?label))
  }
  BIND(COALESCE(?name, ?label) AS ?bestName)
}
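To slot this into the script’s existing batching, the same f-string pattern as the current query could be reused. The following is a minimal sketch: `build_query` is a hypothetical helper name, and it assumes `batch` is a list of prefixed IDs such as `wd:Q61`, which is what the existing `VALUES` clause appears to expect. The SPARQL itself is unchanged from the query above; only the braces are doubled so they survive f-string formatting.

```python
def build_query(batch):
    """Build the map-label query for a batch of prefixed item IDs.

    `batch` is assumed to be a list of strings like "wd:Q61".
    """
    # Literal braces are doubled ({{ }}) because this is an f-string;
    # only the VALUES clause is interpolated.
    return f"""\
SELECT ?id ?label ?name ?bestName (LANG(?label) AS ?lang) WHERE {{
  VALUES ?id {{ {' '.join(batch)} }}
  ?id rdfs:label ?label.
  OPTIONAL {{
    ?id p:P2561 [ps:P2561 ?name; pq:P3831 wd:Q104642575].
    FILTER(LANG(?name) = LANG(?label))
  }}
  BIND(COALESCE(?name, ?label) AS ?bestName)
}}"""


# Illustrative batch of two items.
print(build_query(["wd:Q61", "wd:Q90"]))
```

Like the query it wraps, this still only picks up name statements qualified as map labels, and it returns one row per label, so a language that has a name statement but no label would be missed.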

@Danysan1

Beyond name (P2561), other properties could be used, for example official name (P1448) and native label (P1705).

I created a SPARQL query to manually check some values for the name property:

SELECT ?label (GROUP_CONCAT(?name;separator='; ') AS ?names) ?description
WHERE {
  ?x wdt:P625 []; # Has a coordinate location => Likely a geographic feature
     p:P2561 ?nameStatement;
     rdfs:label ?label FILTER(LANG(?label)="en").
  MINUS { ?nameStatement pq:P582|pq:P585 [] } # Skip names qualified with an end time or point in time (likely historical)
  ?nameStatement ps:P2561 ?name FILTER(LANG(?name)="en" && LCASE(?label) != LCASE(?name)). # The name is different than the label
  OPTIONAL { ?x schema:description ?description FILTER(LANG(?description)="en"). }
}
GROUP BY ?label ?description 
LIMIT 20
OFFSET 0

(link to WDQS)

The results are not great: in some cases the value of the name property is better than the label, but in others the label is better.
[Two screenshots of the query results, dated 2023-10-17]

So IMO this is not usable for getting the map label.

I also created a query to check names with the map label role qualifier:

SELECT ?x (LANG(?label) AS ?lang) ?label (GROUP_CONCAT(?name;separator='; ') AS ?names) ?description
WHERE {
  ?x wdt:P625 []; # Has a coordinate location => Likely a geographic feature
     p:P2561 ?nameStatement;
     rdfs:label ?label.
  MINUS { ?nameStatement pq:P582|pq:P585 [] } # Skip names qualified with an end time or point in time (likely historical)
  ?nameStatement ps:P2561 ?name FILTER(LANG(?label) = LANG(?name) && LCASE(?label) != LCASE(?name)). # The name is different than the label
  ?nameStatement pq:P3831 wd:Q104642575. # The name is a map label
  OPTIONAL { ?x schema:description ?description FILTER(LANG(?label) = LANG(?description)). }
}
GROUP BY ?x ?label ?description 

(link to WDQS)

Currently there are only two names with this qualifier, both on the same entity:
[Screenshot of the query results]

So IMO this is theoretically a good idea, but in practice it would not have a big impact on the map.

1ec5 commented Oct 18, 2023

Thanks for taking a look. Yes, I agree that looking at the name-related statements would affect only a limited number of features at this time. I personally added those statements to the Washington, D.C., item. 😉 That said, the Washington, D.C., example demonstrates that the name-related statements could be a powerful escape hatch in cases where Wikidata’s labels are correct but unsuitable for the map label use case. As with many things, the data is in a messy state in large part because it isn’t being exposed anywhere.
