Implement support for outputting unicode point offsets instead of UTF-8 byte offsets (was: find_all_matches shifting offsets?) #15
Comments
Yes, very good point. Analiticcl returns UTF-8 byte offsets (this is mentioned
explicitly in the README too), whereas Python string slices use Unicode code
points, so there is indeed a mismatch. I should probably implement an option
that makes analiticcl return Unicode code points, which would probably make
more sense as the default in at least the Python binding.
From a data-representation perspective, using Unicode code points would be
the most elegant option too. It does come at a slight performance penalty,
which is why I'm not using it internally.
@brambg: This is relevant for our Golden Agents to Web Annotation export
too since, if I'm not mistaken, Web Annotations represent offsets with
Unicode code points as well (and rightly so).
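For illustration, the conversion from byte offsets to code point offsets can also be done on the Python side. The following is a minimal sketch, not part of analiticcl: `byte_to_codepoint_offsets` is a hypothetical helper, and it assumes the offsets fall on character boundaries of the UTF-8 encoding of the same string that was searched.

```python
from typing import Tuple


def byte_to_codepoint_offsets(text: str, begin: int, end: int) -> Tuple[int, int]:
    """Convert UTF-8 byte offsets into Python string (code point) offsets.

    Hypothetical helper, not part of analiticcl. Assumes begin and end fall on
    character boundaries of text.encode("utf-8").
    """
    data = text.encode("utf-8")
    # Decoding the byte prefix and counting its characters yields the
    # code point offset that corresponds to a byte offset.
    return len(data[:begin].decode("utf-8")), len(data[:end].decode("utf-8"))


# Short stand-in for the text from the issue, containing one „ character.
sample = "ver„ staen, getreden."
byte_begin = sample.encode("utf-8").find("getreden".encode("utf-8"))
begin, end = byte_to_codepoint_offsets(sample, byte_begin, byte_begin + len("getreden"))
print(sample[begin:end])
```

Encoding the whole text on every call is wasteful for many matches; in that case it would be cheaper to encode once and reuse the bytes.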
I implemented support for this now, to be released in the upcoming 0.4.0 release. You'll need to explicitly enable it though, using the |
Implemented and released.
In Python I have a VariantModel called xmodel, with
lexicon (fn): schutte_1672_names.txt
text:
Ontfangen een Missive van het Collegie ter Admiraliteijt opde Maze, geschreven tot Rotterdam, den 28en. deses, houdende ingevolge ende tot voldoeninge van haer Ho:Mo: resolutie vanden 17en. daer te vooren der selver consideratien ende advis op de reqte. van Lijsbeth Andries Huijsvrouw ende Maertie Jans, moeder van Jan Jansz van Delff, versoeckende dat de voorn Jan Jansz, uijt het spin„ huijs der Stadt Rotterdam souden mogen werden ontslagen, mits hem in dienst van desen Staet te water off te Lande begevende, en is bij die occasie mede gelesen de nadere requeste vande voorsr. Lijsbeth Andries en Maertie Jansz, noghmaels de voorsr ontslaginge versoeckende: Waerop gedelibereert sijnde, Is goetgevonden ende ver„ staen, dat int voorsr versoeck niet en can werden getreden.
xmodel.find_all_matches(text, SearchParameters(max_edit_distance=3))
result:
which has a shifting offset, presumably because of the „ character (Unicode '\u201e'). For example, the last reported input ('getreden') gives an offset of 'offset': {'begin': 772, 'end': 780}, but Python's text.find('getreden') reports 768, and text[772:780] is 'eden.'
Should this be considered a mismatch between the Python and the analiticcl text representation, or are there shifting offsets?
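The difference of 4 between 772 and 768 is consistent with the two „ characters in the text: each is one code point but three bytes in UTF-8, so each adds two to any byte offset after it. A small sketch of that arithmetic, using a shortened stand-in for the text above (the `snippet` string is an illustration, not the original input):

```python
quote = "\u201e"  # the „ character from the text above

# One code point, three UTF-8 bytes: every „ shifts later byte offsets by 2.
extra_per_char = len(quote.encode("utf-8")) - len(quote)

# Shortened stand-in that keeps both „ characters before 'getreden'.
snippet = "uijt het spin„ huijs ... ende ver„ staen, ... getreden."

char_offset = snippet.find("getreden")                    # what Python reports
byte_offset = snippet.encode("utf-8").find(b"getreden")  # UTF-8 byte position

shift = byte_offset - char_offset
print(extra_per_char, shift)
```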