Implement support for outputting unicode point offsets instead of UTF-8 byte offsets (was: find_all_matches shifting offsets?) #15

HoekR · 2022-02-22T10:21:47Z

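In Python I have a VariantModel called xmodel:

import os
from collections import defaultdict
from analiticcl import VariantModel, Weights, SearchParameters

result = defaultdict(list)
resoluties = resolutions_1672
# bdir and fn (the lexicon file) are paths defined elsewhere
xmodel = VariantModel(os.path.join(bdir, "examples/simple.alphabet.tsv"), Weights(), debug=False)
xmodel.read_lexicon(fn)
xmodel.build()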

text:
Ontfangen een Missive van het Collegie ter Admiraliteijt opde Maze, geschreven tot Rotterdam, den 28en. deses, houdende ingevolge ende tot voldoeninge van haer Ho:Mo: resolutie vanden 17en. daer te vooren der selver consideratien ende advis op de reqte. van Lijsbeth Andries Huijsvrouw ende Maertie Jans, moeder van Jan Jansz van Delff, versoeckende dat de voorn Jan Jansz, uijt het spin„ huijs der Stadt Rotterdam souden mogen werden ontslagen, mits hem in dienst van desen Staet te water off te Lande begevende, en is bij die occasie mede gelesen de nadere requeste vande voorsr. Lijsbeth Andries en Maertie Jansz, noghmaels de voorsr ontslaginge versoeckende: Waerop gedelibereert sijnde, Is goetgevonden ende ver„ staen, dat int voorsr versoeck niet en can werden getreden.

xmodel.find_all_matches(text, SearchParameters(max_edit_distance=3))

result:


[{'input': 'Ontfangen', 'offset': {'begin': 0, 'end': 9}, 'variants': []},
 {'input': 'een', 'offset': {'begin': 10, 'end': 13}, 'variants': []},
 {'input': 'Missive', 'offset': {'begin': 14, 'end': 21}, 'variants': []},
 {'input': 'van', 'offset': {'begin': 22, 'end': 25}, 'variants': []},
 {'input': 'het', 'offset': {'begin': 26, 'end': 29}, 'variants': []},
 {'input': 'Collegie', 'offset': {'begin': 30, 'end': 38}, 'variants': []},
 {'input': 'ter', 'offset': {'begin': 39, 'end': 42}, 'variants': []},
 {'input': 'Admiraliteijt',
  'offset': {'begin': 43, 'end': 56},
  'variants': []},
 {'input': 'opde', 'offset': {'begin': 57, 'end': 61}, 'variants': []},
 {'input': 'Maze', 'offset': {'begin': 62, 'end': 66}, 'variants': []},
 {'input': 'geschreven', 'offset': {'begin': 68, 'end': 78}, 'variants': []},
 {'input': 'tot', 'offset': {'begin': 79, 'end': 82}, 'variants': []},
 {'input': 'Rotterdam', 'offset': {'begin': 83, 'end': 92}, 'variants': []},
 {'input': 'den', 'offset': {'begin': 94, 'end': 97}, 'variants': []},
 {'input': 'en', 'offset': {'begin': 100, 'end': 102}, 'variants': []},
 {'input': 'deses', 'offset': {'begin': 104, 'end': 109}, 'variants': []},
 {'input': 'houdende', 'offset': {'begin': 111, 'end': 119}, 'variants': []},
 {'input': 'ingevolge', 'offset': {'begin': 120, 'end': 129}, 'variants': []},
 {'input': 'ende', 'offset': {'begin': 130, 'end': 134}, 'variants': []},
 {'input': 'tot', 'offset': {'begin': 135, 'end': 138}, 'variants': []},
 {'input': 'voldoeninge',
  'offset': {'begin': 139, 'end': 150},
  'variants': []},
 {'input': 'van', 'offset': {'begin': 151, 'end': 154}, 'variants': []},
 {'input': 'haer', 'offset': {'begin': 155, 'end': 159}, 'variants': []},
 {'input': 'Ho', 'offset': {'begin': 160, 'end': 162}, 'variants': []},
 {'input': 'Mo', 'offset': {'begin': 163, 'end': 165}, 'variants': []},
 {'input': 'resolutie', 'offset': {'begin': 167, 'end': 176}, 'variants': []},
 {'input': 'vanden', 'offset': {'begin': 177, 'end': 183}, 'variants': []},
 {'input': 'en', 'offset': {'begin': 186, 'end': 188}, 'variants': []},
 {'input': 'daer', 'offset': {'begin': 190, 'end': 194}, 'variants': []},
 {'input': 'te', 'offset': {'begin': 195, 'end': 197}, 'variants': []},
 {'input': 'vooren', 'offset': {'begin': 198, 'end': 204}, 'variants': []},
 {'input': 'der', 'offset': {'begin': 205, 'end': 208}, 'variants': []},
 {'input': 'selver', 'offset': {'begin': 209, 'end': 215}, 'variants': []},
 {'input': 'consideratien',
  'offset': {'begin': 216, 'end': 229},
  'variants': []},
 {'input': 'ende', 'offset': {'begin': 230, 'end': 234}, 'variants': []},
 {'input': 'advis', 'offset': {'begin': 235, 'end': 240}, 'variants': []},
 {'input': 'op', 'offset': {'begin': 241, 'end': 243}, 'variants': []},
 {'input': 'de', 'offset': {'begin': 244, 'end': 246}, 'variants': []},
 {'input': 'reqte', 'offset': {'begin': 247, 'end': 252}, 'variants': []},
 {'input': 'van', 'offset': {'begin': 254, 'end': 257}, 'variants': []},
 {'input': 'Lijsbeth', 'offset': {'begin': 258, 'end': 266}, 'variants': []},
 {'input': 'Andries', 'offset': {'begin': 267, 'end': 274}, 'variants': []},
 {'input': 'Huijsvrouw', 'offset': {'begin': 275, 'end': 285}, 'variants': []},
 {'input': 'ende', 'offset': {'begin': 286, 'end': 290}, 'variants': []},
 {'input': 'Maertie', 'offset': {'begin': 291, 'end': 298}, 'variants': []},
 {'input': 'Jans', 'offset': {'begin': 299, 'end': 303}, 'variants': []},
 {'input': 'moeder', 'offset': {'begin': 305, 'end': 311}, 'variants': []},
 {'input': 'van', 'offset': {'begin': 312, 'end': 315}, 'variants': []},
 {'input': 'Jan', 'offset': {'begin': 316, 'end': 319}, 'variants': []},
 {'input': 'Jansz', 'offset': {'begin': 320, 'end': 325}, 'variants': []},
 {'input': 'van', 'offset': {'begin': 326, 'end': 329}, 'variants': []},
 {'input': 'Delff', 'offset': {'begin': 330, 'end': 335}, 'variants': []},
 {'input': 'versoeckende',
  'offset': {'begin': 337, 'end': 349},
  'variants': []},
 {'input': 'dat', 'offset': {'begin': 350, 'end': 353}, 'variants': []},
 {'input': 'de', 'offset': {'begin': 354, 'end': 356}, 'variants': []},
 {'input': 'voorn', 'offset': {'begin': 357, 'end': 362}, 'variants': []},
 {'input': 'Jan', 'offset': {'begin': 363, 'end': 366}, 'variants': []},
 {'input': 'Jansz', 'offset': {'begin': 367, 'end': 372}, 'variants': []},
 {'input': 'uijt', 'offset': {'begin': 374, 'end': 378}, 'variants': []},
 {'input': 'het', 'offset': {'begin': 379, 'end': 382}, 'variants': []},
 {'input': 'spin', 'offset': {'begin': 383, 'end': 387}, 'variants': []},
 {'input': 'huijs', 'offset': {'begin': 391, 'end': 396}, 'variants': []},
 {'input': 'der', 'offset': {'begin': 397, 'end': 400}, 'variants': []},
 {'input': 'Stadt', 'offset': {'begin': 401, 'end': 406}, 'variants': []},
 {'input': 'Rotterdam', 'offset': {'begin': 407, 'end': 416}, 'variants': []},
 {'input': 'souden', 'offset': {'begin': 417, 'end': 423}, 'variants': []},
 {'input': 'mogen', 'offset': {'begin': 424, 'end': 429}, 'variants': []},
 {'input': 'werden', 'offset': {'begin': 430, 'end': 436}, 'variants': []},
 {'input': 'ontslagen', 'offset': {'begin': 437, 'end': 446}, 'variants': []},
 {'input': 'mits', 'offset': {'begin': 448, 'end': 452}, 'variants': []},
 {'input': 'hem', 'offset': {'begin': 453, 'end': 456}, 'variants': []},
 {'input': 'in', 'offset': {'begin': 457, 'end': 459}, 'variants': []},
 {'input': 'dienst', 'offset': {'begin': 460, 'end': 466}, 'variants': []},
 {'input': 'van', 'offset': {'begin': 467, 'end': 470}, 'variants': []},
 {'input': 'desen', 'offset': {'begin': 471, 'end': 476}, 'variants': []},
 {'input': 'Staet', 'offset': {'begin': 477, 'end': 482}, 'variants': []},
 {'input': 'te', 'offset': {'begin': 483, 'end': 485}, 'variants': []},
 {'input': 'water', 'offset': {'begin': 486, 'end': 491}, 'variants': []},
 {'input': 'off', 'offset': {'begin': 492, 'end': 495}, 'variants': []},
 {'input': 'te', 'offset': {'begin': 496, 'end': 498}, 'variants': []},
 {'input': 'Lande', 'offset': {'begin': 499, 'end': 504}, 'variants': []},
 {'input': 'begevende', 'offset': {'begin': 505, 'end': 514}, 'variants': []},
 {'input': 'en', 'offset': {'begin': 516, 'end': 518}, 'variants': []},
 {'input': 'is', 'offset': {'begin': 519, 'end': 521}, 'variants': []},
 {'input': 'bij', 'offset': {'begin': 522, 'end': 525}, 'variants': []},
 {'input': 'die', 'offset': {'begin': 526, 'end': 529}, 'variants': []},
 {'input': 'occasie', 'offset': {'begin': 530, 'end': 537}, 'variants': []},
 {'input': 'mede', 'offset': {'begin': 538, 'end': 542}, 'variants': []},
 {'input': 'gelesen', 'offset': {'begin': 543, 'end': 550}, 'variants': []},
 {'input': 'de', 'offset': {'begin': 551, 'end': 553}, 'variants': []},
 {'input': 'nadere', 'offset': {'begin': 554, 'end': 560}, 'variants': []},
 {'input': 'requeste', 'offset': {'begin': 561, 'end': 569}, 'variants': []},
 {'input': 'vande', 'offset': {'begin': 570, 'end': 575}, 'variants': []},
 {'input': 'voorsr', 'offset': {'begin': 576, 'end': 582}, 'variants': []},
 {'input': 'Lijsbeth', 'offset': {'begin': 584, 'end': 592}, 'variants': []},
 {'input': 'Andries', 'offset': {'begin': 593, 'end': 600}, 'variants': []},
 {'input': 'en', 'offset': {'begin': 601, 'end': 603}, 'variants': []},
 {'input': 'Maertie', 'offset': {'begin': 604, 'end': 611}, 'variants': []},
 {'input': 'Jansz', 'offset': {'begin': 612, 'end': 617}, 'variants': []},
 {'input': 'noghmaels', 'offset': {'begin': 619, 'end': 628}, 'variants': []},
 {'input': 'de', 'offset': {'begin': 629, 'end': 631}, 'variants': []},
 {'input': 'voorsr', 'offset': {'begin': 632, 'end': 638}, 'variants': []},
 {'input': 'ontslaginge',
  'offset': {'begin': 639, 'end': 650},
  'variants': []},
 {'input': 'versoeckende',
  'offset': {'begin': 651, 'end': 663},
  'variants': []},
 {'input': 'Waerop', 'offset': {'begin': 665, 'end': 671}, 'variants': []},
 {'input': 'gedelibereert',
  'offset': {'begin': 672, 'end': 685},
  'variants': []},
 {'input': 'sijnde', 'offset': {'begin': 686, 'end': 692}, 'variants': []},
 {'input': 'Is', 'offset': {'begin': 694, 'end': 696}, 'variants': []},
 {'input': 'goetgevonden',
  'offset': {'begin': 697, 'end': 709},
  'variants': []},
 {'input': 'ende', 'offset': {'begin': 710, 'end': 714}, 'variants': []},
 {'input': 'ver', 'offset': {'begin': 715, 'end': 718}, 'variants': []},
 {'input': 'staen', 'offset': {'begin': 722, 'end': 727}, 'variants': []},
 {'input': 'dat', 'offset': {'begin': 729, 'end': 732}, 'variants': []},
 {'input': 'int', 'offset': {'begin': 733, 'end': 736}, 'variants': []},
 {'input': 'voorsr', 'offset': {'begin': 737, 'end': 743}, 'variants': []},
 {'input': 'versoeck', 'offset': {'begin': 744, 'end': 752}, 'variants': []},
 {'input': 'niet', 'offset': {'begin': 753, 'end': 757}, 'variants': []},
 {'input': 'en', 'offset': {'begin': 758, 'end': 760}, 'variants': []},
 {'input': 'can', 'offset': {'begin': 761, 'end': 764}, 'variants': []},
 {'input': 'werden', 'offset': {'begin': 765, 'end': 771}, 'variants': []},
 {'input': 'getreden', 'offset': {'begin': 772, 'end': 780}, 'variants': []}]

The reported offsets shift, presumably because of the „ character (Unicode '\u201e'). For example, the last reported input ('getreden') has 'offset': {'begin': 772, 'end': 780}, but in Python
text.find('getreden') returns 768

text[772:780] is 'eden.'

Should this be considered a mismatch between the Python and the analiticcl text representation, or are the offsets shifting?
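The shift can be reproduced without analiticcl. The following is a minimal sketch on a shortened sample string (not the full text above): it compares the position of a word counted in Unicode code points, as Python's str does, with its position counted in UTF-8 bytes. The variable names are illustrative only.

```python
# Shortened sample containing the same '„' (U+201E) character as the text above.
sample = "uijt het spin„ huijs der Stadt"
word = "huijs"

# Offset in Unicode code points: this is what str.find and str slicing use.
char_begin = sample.find(word)

# Offset in UTF-8 bytes: search in the encoded form instead.
byte_begin = sample.encode("utf-8").find(word.encode("utf-8"))

print(char_begin, byte_begin)              # 15 17
print(len("„"), len("„".encode("utf-8")))  # 1 3
```

Each „ occupies one code point but three UTF-8 bytes, so every occurrence before a word pushes its byte offset two positions ahead of its code point offset. With two occurrences in the full text, that accounts for the difference of 4 between 768 and 772.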

proycon · 2022-02-22T10:46:42Z

Should this be considered a mismatch between the python and the analiticcl text representation or are there shifting offsets?

Yes, very good point. Analiticcl returns UTF-8 byte offsets (this is mentioned explicitly in the README too), whereas Python string slices use Unicode code points, so there is indeed a mismatch. I should probably implement an option that makes analiticcl return Unicode code point offsets; that would probably also make more sense as the default, at least in the Python binding. From a data-representation perspective, code point offsets would be the most elegant option too. They do come at a slight performance penalty, which is why I'm not using them internally.

@brambg: This is relevant for our Golden Agents to Web Annotation export too since, if I'm not mistaken, web annotations represent offsets in Unicode code points as well (and rightly so).
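For versions without such an option, the byte offsets can be converted on the Python side. The following is a minimal sketch, assuming the result structure shown in the report above (dicts with 'input', 'offset' and 'variants' keys). The helper names byte_to_char_offset and convert_matches are hypothetical and not part of analiticcl.

```python
def byte_to_char_offset(text: str, byte_offset: int) -> int:
    """Convert a UTF-8 byte offset into `text` to a Unicode code point offset."""
    encoded = text.encode("utf-8")
    # Decoding the prefix counts the code points that precede the byte offset.
    # A byte offset that falls inside a multi-byte character raises UnicodeDecodeError.
    return len(encoded[:byte_offset].decode("utf-8"))


def convert_matches(text: str, matches: list) -> list:
    """Return copies of the match dicts with offsets expressed in code points."""
    converted = []
    for match in matches:
        begin = byte_to_char_offset(text, match["offset"]["begin"])
        end = byte_to_char_offset(text, match["offset"]["end"])
        converted.append({**match, "offset": {"begin": begin, "end": end}})
    return converted


# Hand-made example in the shape of the reported output (byte offsets).
text = "ende ver„ staen, dat"
matches = [{"input": "staen", "offset": {"begin": 12, "end": 17}, "variants": []}]

fixed = convert_matches(text, matches)
offset = fixed[0]["offset"]
print(offset, text[offset["begin"]:offset["end"]])
```

This re-encodes the text for every offset, which is fine for short passages; for long documents a single pass that builds a byte-to-code-point lookup table would be cheaper.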

proycon · 2022-05-06T12:50:58Z

I have now implemented support for this; it will be part of the upcoming 0.4.0 release. You'll need to enable it explicitly though, using the --unicode-offsets parameter (or unicodeoffsets=True from Python, as a keyword argument to SearchParameters).

proycon · 2022-05-16T09:30:41Z

Implemented and released

proycon changed the title from "find_all_matches shifting offsets?" to "Implement support for outputting unicode point offsets instead of UTF-8 byte offsets (was: find_all_matches shifting offsets?)" Feb 22, 2022

proycon self-assigned this Feb 22, 2022

proycon added the "enhancement" (New feature or request) label Feb 22, 2022

proycon added this to the v0.4.0 milestone May 4, 2022

proycon added the "ToDo" (staged to be worked on soon) label May 4, 2022

proycon added a commit that referenced this issue May 6, 2022

Implemented support for (optionally) outputting unicode offsets instead of UTF-8 byte offsets #15

8285153

proycon added the "ready" (implemented and ready for release but not released yet) label and removed the "ToDo" (staged to be worked on soon) label May 6, 2022

proycon added a commit that referenced this issue May 10, 2022

fix in outputting unicode offset #15

d346196

proycon added a commit that referenced this issue May 10, 2022

fix in outputting unicode offset #15 (for real now I hope)

79ba61b

proycon closed this as completed May 16, 2022
