Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

simplify deduplication logic #1607

Merged
merged 1 commit into from
Mar 4, 2022
Merged

simplify deduplication logic #1607

merged 1 commit into from
Mar 4, 2022

Conversation

missinglink
Copy link
Member

@missinglink missinglink commented Feb 28, 2022

this PR greatly simplifies the deduplication logic.

The main difference is that previously we were focussing on keeping the superior rows and now we're focussing on removing the inferior rows, a subtle but important distinction!

In particular the new logic makes one positive functional change:

in the case where A dedupes B and B dedupes C, prior to this PR the C record may or may not remain in the results, depending on the order, due to B being present in the 'skip list'.

the unit test included demonstrates this ^ as it fails prior to this PR

@missinglink missinglink force-pushed the simplified_dedupe branch 5 times, most recently from ca0c541 to 94517ed Compare March 2, 2022 10:20
@missinglink missinglink marked this pull request as ready for review March 2, 2022 10:20
@missinglink
Copy link
Member Author

This PR improves the deduplication of https://pelias.github.io/compare/#/v1/autocomplete?text=berlin

@missinglink missinglink force-pushed the simplified_dedupe branch 3 times, most recently from 93ec12d to 5bc912d Compare March 2, 2022 11:33
@missinglink
Copy link
Member Author

Good-to-go! I think this is a positive change which should help reduce some of the noisier deduplication cases that have been plaguing us for some time now 🎉

@missinglink
Copy link
Member Author

housekeeping: the dedupe-without-skiplist branch can probably be deleted after merging this.

@orangejulius
Copy link
Member

Cool, I really love the simplicity of this logic. Even if it's technically more computationally expensive (though I would imagine the effect is not measurable), it's much much easier to understand.

I do think the case you mentioned with transitive deduplication has prevented a lot of our deduplication checks from performing as we expect.

My guess is this is good to merge but let me try it out with some of the other recent changes I've been experimenting with, like #1609 and #1606.

@orangejulius
Copy link
Member

orangejulius commented Mar 2, 2022

Okay, after some fairly complex testing I can confirm this PR improves deduplication, or more precisely it allows some of the newer deduplication strategies we are experimenting with to be more effective.

Here's a summary of some results from testing with pelias/acceptance-tests#557

branch Passing test count
master 2
A: simplified dedupe (this PR) 2
B: dedupe-placetype-in-name (#1371) 3
C: concordance-dedupe (#1606) 3
A + B 3
A + C 3
B + C 4
A + B + C 5

The interesting thing to note is that while this PR by itself does not change the test results, nor does it change the results when combined with either of our two other deduplication PRs, when all three PRs are combined, there is an improvement over not using this PR.

So that seems to confirm to me the idea that transitive deduplication is important.

One last thing to test is that this PR (as well as a combination of this and the other PRs) does not introduce regressions. That info will be coming soon, then I think we are good to go.

orangejulius added a commit to pelias/acceptance-tests that referenced this pull request Mar 3, 2022
This test case came out of evaluating
pelias/api#1607. It passes with the current API
master branch, fails with pelias/api#1607, but
passes with pelias/api#1612
@orangejulius orangejulius merged commit 560fff8 into master Mar 4, 2022
@orangejulius orangejulius deleted the simplified_dedupe branch March 4, 2022 16:42
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

2 participants