Implement support for outputting unicode point offsets instead of UTF-8 byte offsets (was: find_all_matches shifting offsets?) #15

Closed
HoekR opened this issue Feb 22, 2022 · 4 comments
Labels: enhancement (New feature or request), ready (implemented and ready for release but not released yet)


HoekR commented Feb 22, 2022

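In Python I have a VariantModel called xmodel, built with the lexicon file (fn) schutte_1672_names.txt:

import os
from collections import defaultdict

from analiticcl import VariantModel, Weights, SearchParameters

fn = "schutte_1672_names.txt"  # lexicon file
result = defaultdict(list)
resoluties = resolutions_1672  # defined elsewhere (not shown)
# bdir is defined elsewhere (not shown)
xmodel = VariantModel(os.path.join(bdir, "examples/simple.alphabet.tsv"), Weights(), debug=False)
xmodel.read_lexicon(fn)
xmodel.build()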

text:
Ontfangen een Missive van het Collegie ter Admiraliteijt opde Maze, geschreven tot Rotterdam, den 28en. deses, houdende ingevolge ende tot voldoeninge van haer Ho:Mo: resolutie vanden 17en. daer te vooren der selver consideratien ende advis op de reqte. van Lijsbeth Andries Huijsvrouw ende Maertie Jans, moeder van Jan Jansz van Delff, versoeckende dat de voorn Jan Jansz, uijt het spin„ huijs der Stadt Rotterdam souden mogen werden ontslagen, mits hem in dienst van desen Staet te water off te Lande begevende, en is bij die occasie mede gelesen de nadere requeste vande voorsr. Lijsbeth Andries en Maertie Jansz, noghmaels de voorsr ontslaginge versoeckende: Waerop gedelibereert sijnde, Is goetgevonden ende ver„ staen, dat int voorsr versoeck niet en can werden getreden.

xmodel.find_all_matches(text, SearchParameters(max_edit_distance=3))

result:


[{'input': 'Ontfangen', 'offset': {'begin': 0, 'end': 9}, 'variants': []},
 {'input': 'een', 'offset': {'begin': 10, 'end': 13}, 'variants': []},
 {'input': 'Missive', 'offset': {'begin': 14, 'end': 21}, 'variants': []},
 {'input': 'van', 'offset': {'begin': 22, 'end': 25}, 'variants': []},
 {'input': 'het', 'offset': {'begin': 26, 'end': 29}, 'variants': []},
 {'input': 'Collegie', 'offset': {'begin': 30, 'end': 38}, 'variants': []},
 {'input': 'ter', 'offset': {'begin': 39, 'end': 42}, 'variants': []},
 {'input': 'Admiraliteijt',
  'offset': {'begin': 43, 'end': 56},
  'variants': []},
 {'input': 'opde', 'offset': {'begin': 57, 'end': 61}, 'variants': []},
 {'input': 'Maze', 'offset': {'begin': 62, 'end': 66}, 'variants': []},
 {'input': 'geschreven', 'offset': {'begin': 68, 'end': 78}, 'variants': []},
 {'input': 'tot', 'offset': {'begin': 79, 'end': 82}, 'variants': []},
 {'input': 'Rotterdam', 'offset': {'begin': 83, 'end': 92}, 'variants': []},
 {'input': 'den', 'offset': {'begin': 94, 'end': 97}, 'variants': []},
 {'input': 'en', 'offset': {'begin': 100, 'end': 102}, 'variants': []},
 {'input': 'deses', 'offset': {'begin': 104, 'end': 109}, 'variants': []},
 {'input': 'houdende', 'offset': {'begin': 111, 'end': 119}, 'variants': []},
 {'input': 'ingevolge', 'offset': {'begin': 120, 'end': 129}, 'variants': []},
 {'input': 'ende', 'offset': {'begin': 130, 'end': 134}, 'variants': []},
 {'input': 'tot', 'offset': {'begin': 135, 'end': 138}, 'variants': []},
 {'input': 'voldoeninge',
  'offset': {'begin': 139, 'end': 150},
  'variants': []},
 {'input': 'van', 'offset': {'begin': 151, 'end': 154}, 'variants': []},
 {'input': 'haer', 'offset': {'begin': 155, 'end': 159}, 'variants': []},
 {'input': 'Ho', 'offset': {'begin': 160, 'end': 162}, 'variants': []},
 {'input': 'Mo', 'offset': {'begin': 163, 'end': 165}, 'variants': []},
 {'input': 'resolutie', 'offset': {'begin': 167, 'end': 176}, 'variants': []},
 {'input': 'vanden', 'offset': {'begin': 177, 'end': 183}, 'variants': []},
 {'input': 'en', 'offset': {'begin': 186, 'end': 188}, 'variants': []},
 {'input': 'daer', 'offset': {'begin': 190, 'end': 194}, 'variants': []},
 {'input': 'te', 'offset': {'begin': 195, 'end': 197}, 'variants': []},
 {'input': 'vooren', 'offset': {'begin': 198, 'end': 204}, 'variants': []},
 {'input': 'der', 'offset': {'begin': 205, 'end': 208}, 'variants': []},
 {'input': 'selver', 'offset': {'begin': 209, 'end': 215}, 'variants': []},
 {'input': 'consideratien',
  'offset': {'begin': 216, 'end': 229},
  'variants': []},
 {'input': 'ende', 'offset': {'begin': 230, 'end': 234}, 'variants': []},
 {'input': 'advis', 'offset': {'begin': 235, 'end': 240}, 'variants': []},
 {'input': 'op', 'offset': {'begin': 241, 'end': 243}, 'variants': []},
 {'input': 'de', 'offset': {'begin': 244, 'end': 246}, 'variants': []},
 {'input': 'reqte', 'offset': {'begin': 247, 'end': 252}, 'variants': []},
 {'input': 'van', 'offset': {'begin': 254, 'end': 257}, 'variants': []},
 {'input': 'Lijsbeth', 'offset': {'begin': 258, 'end': 266}, 'variants': []},
 {'input': 'Andries', 'offset': {'begin': 267, 'end': 274}, 'variants': []},
 {'input': 'Huijsvrouw', 'offset': {'begin': 275, 'end': 285}, 'variants': []},
 {'input': 'ende', 'offset': {'begin': 286, 'end': 290}, 'variants': []},
 {'input': 'Maertie', 'offset': {'begin': 291, 'end': 298}, 'variants': []},
 {'input': 'Jans', 'offset': {'begin': 299, 'end': 303}, 'variants': []},
 {'input': 'moeder', 'offset': {'begin': 305, 'end': 311}, 'variants': []},
 {'input': 'van', 'offset': {'begin': 312, 'end': 315}, 'variants': []},
 {'input': 'Jan', 'offset': {'begin': 316, 'end': 319}, 'variants': []},
 {'input': 'Jansz', 'offset': {'begin': 320, 'end': 325}, 'variants': []},
 {'input': 'van', 'offset': {'begin': 326, 'end': 329}, 'variants': []},
 {'input': 'Delff', 'offset': {'begin': 330, 'end': 335}, 'variants': []},
 {'input': 'versoeckende',
  'offset': {'begin': 337, 'end': 349},
  'variants': []},
 {'input': 'dat', 'offset': {'begin': 350, 'end': 353}, 'variants': []},
 {'input': 'de', 'offset': {'begin': 354, 'end': 356}, 'variants': []},
 {'input': 'voorn', 'offset': {'begin': 357, 'end': 362}, 'variants': []},
 {'input': 'Jan', 'offset': {'begin': 363, 'end': 366}, 'variants': []},
 {'input': 'Jansz', 'offset': {'begin': 367, 'end': 372}, 'variants': []},
 {'input': 'uijt', 'offset': {'begin': 374, 'end': 378}, 'variants': []},
 {'input': 'het', 'offset': {'begin': 379, 'end': 382}, 'variants': []},
 {'input': 'spin', 'offset': {'begin': 383, 'end': 387}, 'variants': []},
 {'input': 'huijs', 'offset': {'begin': 391, 'end': 396}, 'variants': []},
 {'input': 'der', 'offset': {'begin': 397, 'end': 400}, 'variants': []},
 {'input': 'Stadt', 'offset': {'begin': 401, 'end': 406}, 'variants': []},
 {'input': 'Rotterdam', 'offset': {'begin': 407, 'end': 416}, 'variants': []},
 {'input': 'souden', 'offset': {'begin': 417, 'end': 423}, 'variants': []},
 {'input': 'mogen', 'offset': {'begin': 424, 'end': 429}, 'variants': []},
 {'input': 'werden', 'offset': {'begin': 430, 'end': 436}, 'variants': []},
 {'input': 'ontslagen', 'offset': {'begin': 437, 'end': 446}, 'variants': []},
 {'input': 'mits', 'offset': {'begin': 448, 'end': 452}, 'variants': []},
 {'input': 'hem', 'offset': {'begin': 453, 'end': 456}, 'variants': []},
 {'input': 'in', 'offset': {'begin': 457, 'end': 459}, 'variants': []},
 {'input': 'dienst', 'offset': {'begin': 460, 'end': 466}, 'variants': []},
 {'input': 'van', 'offset': {'begin': 467, 'end': 470}, 'variants': []},
 {'input': 'desen', 'offset': {'begin': 471, 'end': 476}, 'variants': []},
 {'input': 'Staet', 'offset': {'begin': 477, 'end': 482}, 'variants': []},
 {'input': 'te', 'offset': {'begin': 483, 'end': 485}, 'variants': []},
 {'input': 'water', 'offset': {'begin': 486, 'end': 491}, 'variants': []},
 {'input': 'off', 'offset': {'begin': 492, 'end': 495}, 'variants': []},
 {'input': 'te', 'offset': {'begin': 496, 'end': 498}, 'variants': []},
 {'input': 'Lande', 'offset': {'begin': 499, 'end': 504}, 'variants': []},
 {'input': 'begevende', 'offset': {'begin': 505, 'end': 514}, 'variants': []},
 {'input': 'en', 'offset': {'begin': 516, 'end': 518}, 'variants': []},
 {'input': 'is', 'offset': {'begin': 519, 'end': 521}, 'variants': []},
 {'input': 'bij', 'offset': {'begin': 522, 'end': 525}, 'variants': []},
 {'input': 'die', 'offset': {'begin': 526, 'end': 529}, 'variants': []},
 {'input': 'occasie', 'offset': {'begin': 530, 'end': 537}, 'variants': []},
 {'input': 'mede', 'offset': {'begin': 538, 'end': 542}, 'variants': []},
 {'input': 'gelesen', 'offset': {'begin': 543, 'end': 550}, 'variants': []},
 {'input': 'de', 'offset': {'begin': 551, 'end': 553}, 'variants': []},
 {'input': 'nadere', 'offset': {'begin': 554, 'end': 560}, 'variants': []},
 {'input': 'requeste', 'offset': {'begin': 561, 'end': 569}, 'variants': []},
 {'input': 'vande', 'offset': {'begin': 570, 'end': 575}, 'variants': []},
 {'input': 'voorsr', 'offset': {'begin': 576, 'end': 582}, 'variants': []},
 {'input': 'Lijsbeth', 'offset': {'begin': 584, 'end': 592}, 'variants': []},
 {'input': 'Andries', 'offset': {'begin': 593, 'end': 600}, 'variants': []},
 {'input': 'en', 'offset': {'begin': 601, 'end': 603}, 'variants': []},
 {'input': 'Maertie', 'offset': {'begin': 604, 'end': 611}, 'variants': []},
 {'input': 'Jansz', 'offset': {'begin': 612, 'end': 617}, 'variants': []},
 {'input': 'noghmaels', 'offset': {'begin': 619, 'end': 628}, 'variants': []},
 {'input': 'de', 'offset': {'begin': 629, 'end': 631}, 'variants': []},
 {'input': 'voorsr', 'offset': {'begin': 632, 'end': 638}, 'variants': []},
 {'input': 'ontslaginge',
  'offset': {'begin': 639, 'end': 650},
  'variants': []},
 {'input': 'versoeckende',
  'offset': {'begin': 651, 'end': 663},
  'variants': []},
 {'input': 'Waerop', 'offset': {'begin': 665, 'end': 671}, 'variants': []},
 {'input': 'gedelibereert',
  'offset': {'begin': 672, 'end': 685},
  'variants': []},
 {'input': 'sijnde', 'offset': {'begin': 686, 'end': 692}, 'variants': []},
 {'input': 'Is', 'offset': {'begin': 694, 'end': 696}, 'variants': []},
 {'input': 'goetgevonden',
  'offset': {'begin': 697, 'end': 709},
  'variants': []},
 {'input': 'ende', 'offset': {'begin': 710, 'end': 714}, 'variants': []},
 {'input': 'ver', 'offset': {'begin': 715, 'end': 718}, 'variants': []},
 {'input': 'staen', 'offset': {'begin': 722, 'end': 727}, 'variants': []},
 {'input': 'dat', 'offset': {'begin': 729, 'end': 732}, 'variants': []},
 {'input': 'int', 'offset': {'begin': 733, 'end': 736}, 'variants': []},
 {'input': 'voorsr', 'offset': {'begin': 737, 'end': 743}, 'variants': []},
 {'input': 'versoeck', 'offset': {'begin': 744, 'end': 752}, 'variants': []},
 {'input': 'niet', 'offset': {'begin': 753, 'end': 757}, 'variants': []},
 {'input': 'en', 'offset': {'begin': 758, 'end': 760}, 'variants': []},
 {'input': 'can', 'offset': {'begin': 761, 'end': 764}, 'variants': []},
 {'input': 'werden', 'offset': {'begin': 765, 'end': 771}, 'variants': []},
 {'input': 'getreden', 'offset': {'begin': 772, 'end': 780}, 'variants': []}]

The offsets shift, presumably because of the „ character (Unicode '\u201e'). For example, the last reported input ('getreden') gets 'offset': {'begin': 772, 'end': 780}, but in Python text.find('getreden') returns 768, and text[772:780] is 'eden.'.

Should this be considered a mismatch between the Python and the analiticcl text representations, or are the offsets shifting?
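
The 4-position gap is consistent with the two „ characters in the text, each of which is one code point but three bytes in UTF-8. A minimal sketch of the effect on a short excerpt, using only the Python standard library (the variable names are illustrative and not part of analiticcl):

```python
# Short excerpt from the text above; "„" is U+201E.
snippet = "uijt het spin„ huijs der Stadt"
word = "huijs"

# Offset counted in Unicode code points (what Python str indexing uses).
char_begin = snippet.find(word)

# Offset counted in UTF-8 bytes (what the reported offsets appear to use).
byte_begin = snippet.encode("utf-8").find(word.encode("utf-8"))

# "„" is a single code point but takes three bytes in UTF-8,
# so every offset after it is shifted by two.
print(char_begin, byte_begin)                     # 15 17
print(len("„"), len("„".encode("utf-8")))         # 1 3
```

With two such characters before 'getreden', the shift adds up to 4, which matches 772 versus 768.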

proycon (Owner) commented Feb 22, 2022

> Should this be considered a mismatch between the python and the analiticcl text representation or are there shifting offsets?

Yes, very good point. Analiticcl returns UTF-8 byte offsets (this is mentioned explicitly in the README too), whereas Python string slices use Unicode code points, so there is indeed a mismatch. I should probably implement an option that makes analiticcl return Unicode code point offsets, which would probably make more sense as the default in at least the Python binding. From a data-representation perspective, Unicode code points would be the most elegant option too. It does come at a slight performance penalty, which is why I'm not using it internally.

@brambg: This is relevant for our Golden Agents to Web Annotation export too; if I'm not mistaken, Web Annotations represent offsets as Unicode code points as well (and rightly so).
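
Until such an option exists, byte offsets can be converted on the Python side. The following is only a sketch under assumptions: the helper name byte_to_codepoint_offsets is hypothetical (it is not part of analiticcl), and the match dictionaries simply mirror the shape of the output shown in the report.

```python
def byte_to_codepoint_offsets(text, matches):
    """Return copies of `matches` with UTF-8 byte offsets mapped to code point offsets."""
    # Map the byte position at which each code point starts to its index.
    starts = {}
    byte_pos = 0
    for index, char in enumerate(text):
        starts[byte_pos] = index
        byte_pos += len(char.encode("utf-8"))
    starts[byte_pos] = len(text)  # position just past the end of the text

    converted = []
    for match in matches:
        offset = match["offset"]
        converted.append({
            **match,
            "offset": {"begin": starts[offset["begin"]], "end": starts[offset["end"]]},
        })
    return converted


# Excerpt from the reported text; the offsets below are byte offsets.
text = "ver„ staen, dat"
byte_matches = [{"input": "staen", "offset": {"begin": 7, "end": 12}, "variants": []}]

fixed = byte_to_codepoint_offsets(text, byte_matches)
begin, end = fixed[0]["offset"]["begin"], fixed[0]["offset"]["end"]
print(begin, end, text[begin:end])  # 5 10 staen
```

The lookup table is built once per text, so converting many matches costs a single pass over the string. A byte offset that falls inside a multi-byte sequence raises a KeyError, which is a useful signal that the offsets do not belong to this text.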

@proycon proycon changed the title find_all_matches shifting offsets? Implement support for outputting unicode point offsets instead of UTF-8 byte offsets (was: find_all_matches shifting offsets?) Feb 22, 2022
@proycon proycon self-assigned this Feb 22, 2022
@proycon proycon added the enhancement New feature or request label Feb 22, 2022
proycon (Owner) commented Feb 22, 2022 via email

@proycon proycon added this to the v0.4.0 milestone May 4, 2022
@proycon proycon added the ToDo staged to be worked on soon label May 4, 2022
proycon added a commit that referenced this issue May 6, 2022
proycon (Owner) commented May 6, 2022

I have now implemented support for this, to be released in the upcoming 0.4.0 release. You'll need to enable it explicitly, though, using the --unicode-offsets parameter (or unicodeoffsets=True as a keyword argument to SearchParameters from Python).
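
One way to check which kind of offsets a result uses is to slice the text with each reported offset and compare the slice with the reported input. This is a sketch in plain Python, not part of analiticcl: the helper name find_misaligned is hypothetical, and the sample dictionaries are hand-written in the shape of the output shown earlier in this thread rather than produced by the library.

```python
def find_misaligned(text, matches):
    """Return the matches whose offsets do not slice back to their 'input'."""
    return [
        m for m in matches
        if text[m["offset"]["begin"]:m["offset"]["end"]] != m["input"]
    ]


text = "spin„ huijs"

# Hand-written byte offsets: "„" takes three bytes, so "huijs" starts at 8.
byte_based = [
    {"input": "spin", "offset": {"begin": 0, "end": 4}, "variants": []},
    {"input": "huijs", "offset": {"begin": 8, "end": 13}, "variants": []},
]

# Hand-written code point offsets: "„" counts as one, so "huijs" starts at 6.
codepoint_based = [
    {"input": "spin", "offset": {"begin": 0, "end": 4}, "variants": []},
    {"input": "huijs", "offset": {"begin": 6, "end": 11}, "variants": []},
]

print(len(find_misaligned(text, byte_based)))       # 1
print(len(find_misaligned(text, codepoint_based)))  # 0
```

An empty result suggests that the offsets can be used directly for Python string slicing.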

@proycon proycon added ready ready; implemented and ready for release but not released yet and removed ToDo staged to be worked on soon labels May 6, 2022
proycon added a commit that referenced this issue May 10, 2022
proycon (Owner) commented May 16, 2022

Implemented and released

@proycon proycon closed this as completed May 16, 2022