
Near-duplicate detection and address deduping #294

Merged: albarrentine merged 93 commits into master on Dec 31, 2017
Conversation

albarrentine (Contributor) commented Dec 31, 2017

This PR adds three important groups of functions to libpostal's C API to support the lieu address/venue deduping project. The APIs are somewhat low-level at this point, but should still be useful in a wide range of geo applications, particularly for batch geocoding large data sets. This is the realization of some of the earlier work on address expansion.

Near-dupe hashing

Near-dupe hashing builds on the expand_address functionality to allow hashing a parsed address into strings suitable for direct comparison and automatic clustering. The hash keys are used to group similar records together prior to pairwise deduping so that we don't need to compare every record to every other record (i.e. N² comparisons). Instead, if we have a function that can generate the same hash key for records that are possible dupes (like "100 Main" and "100 E Main St"), while also being highly selective, we can ensure that most duplicates will be captured for further comparison downstream, and that dissimilar records can be safely considered non-dupes.

In a MapReduce context, near-dupe hashes can be used as keys to ensure that possible dupes will be grouped together on the same shard for pairwise checks, and in a search/database context, they can be used as an index for quick lookups of candidate dupes before running more thorough comparisons with the few records that match the hash. This is the first step in the deduping process, identifying candidate dupes, and can be thought of as the blocking function in record linkage (here, a highly selective one) or as a form of locality-sensitive hashing from the near-duplicate detection literature. Libpostal's near-dupe hashes use a combination of several new features of the library, sketched and then enumerated below.
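As a usage illustration, here's a rough sketch of generating hash keys for a parsed address through the C API. The function and option names are those added in this PR, but treat the exact signatures and struct fields shown here as assumptions rather than reference documentation:

```c
#include <stdio.h>
#include <stdbool.h>
#include <libpostal/libpostal.h>

int main(void) {
    if (!libpostal_setup() || !libpostal_setup_language_classifier()) return 1;

    /* A parsed/labeled address, e.g. the output of libpostal_parse_address */
    char *labels[] = {"house_number", "road", "city", "state"};
    char *values[] = {"100", "E Main St", "Brooklyn", "NY"};

    libpostal_near_dupe_hash_options_t options = libpostal_get_near_dupe_hash_default_options();
    options.with_latlon = true;     /* use geohash tiles as the geo qualifier */
    options.latitude = 40.68;
    options.longitude = -73.97;

    size_t num_hashes = 0;
    char **hashes = libpostal_near_dupe_hashes(4, labels, values, options, &num_hashes);

    /* Each hash is a blocking key: records sharing a key become candidate dupes */
    for (size_t i = 0; i < num_hashes; i++) {
        printf("%s\n", hashes[i]);
    }

    libpostal_expansion_array_destroy(hashes, num_hashes);
    libpostal_teardown_language_classifier();
    libpostal_teardown();
    return 0;
}
```

The hash keys themselves combine the following new features of the library: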

  1. Address root expansions: removes tokens that are ignorable such as "Ave", "Pl", "Road", etc. in street names so that something like "West 125th St" can potentially match "W 125". This also allows for exact comparison of apartment numbers, where "Apt 2" and "# 2" mean the same thing. Every address component uses certain dictionaries in libpostal to determine what is ignorable or not, and although the method is rule-based and deterministic, it can also identify the correct root tokens in many complex cases like "Avenue Rd", "Avenue E", "E St SE", "E Ctr St", etc. While many of the test cases used so far are for English, libpostal's dictionary structure also allows it to work relatively well around the world, e.g. matching Spanish street names where "Calle" might be included in a government data set but is rarely used colloquially or in self-reported addresses.

  2. Phonetic matching for names: the near-dupe hashes for venue/place/company names written in Latin script include a modified version of the double metaphone algorithm, which can be useful for comparing misspelled human names, as well as comparing machine transliterations against human ones in languages where names might be written in multiple scripts in different data sets, e.g. Arabic or Japanese.

  3. Geo qualifiers: for address data sets with lat/lons, geohash tiles (with a precision of 6 characters by default) and their 8 neighbors (to avoid missing matches across tile boundaries) are used to narrow down the comparisons to addresses/places in a similar location; a minimal sketch of the tiling scheme follows this list. If there's no lat/lon and the data are known to be from a single country, the postal code or the city name can optionally be used as the geo qualifier. Future improvements include disambiguating toponyms and mapping them to IDs in a hierarchy, such that multiple names for cities, etc. can resolve to one or more IDs, so that e.g. an NYC address that uses a neighborhood name in place of the city, like "Harlem, NY", could match "New York, NY" by traversing the hierarchy and outputting the city's ID instead.
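To make the geo qualifier concrete, here is a minimal, self-contained sketch of precision-6 geohash encoding, with the 8 neighbors obtained by re-encoding offset coordinates. This illustrates the tiling idea only; it is not libpostal's internal implementation, and the simple offset trick ignores edge cases at the poles and the antimeridian:

```c
#include <stdio.h>

static const char *GEOHASH_BASE32 = "0123456789bcdefghjkmnpqrstuvwxyz";

/* Encode lat/lon to a geohash of the given precision (in characters) by
   interleaving longitude and latitude bisection bits, 5 bits per character. */
static void geohash_encode(double lat, double lon, size_t precision, char *out) {
    double lat_lo = -90.0, lat_hi = 90.0, lon_lo = -180.0, lon_hi = 180.0;
    int even = 1, bit = 0, idx = 0;
    size_t len = 0;
    while (len < precision) {
        if (even) {
            double mid = (lon_lo + lon_hi) / 2.0;
            if (lon >= mid) { idx = (idx << 1) | 1; lon_lo = mid; }
            else { idx <<= 1; lon_hi = mid; }
        } else {
            double mid = (lat_lo + lat_hi) / 2.0;
            if (lat >= mid) { idx = (idx << 1) | 1; lat_lo = mid; }
            else { idx <<= 1; lat_hi = mid; }
        }
        even = !even;
        if (++bit == 5) { out[len++] = GEOHASH_BASE32[idx]; bit = 0; idx = 0; }
    }
    out[len] = '\0';
}

int main(void) {
    /* Precision 6 = 30 bits: 15 for longitude, 15 for latitude */
    double lat = 40.6892, lon = -74.0445;
    double dlat = 180.0 / 32768.0, dlon = 360.0 / 32768.0; /* cell size at precision 6 */
    char hash[8];

    geohash_encode(lat, lon, 6, hash);
    printf("center:   %s\n", hash);

    /* The 8 neighbors: re-encode points offset by one cell in each direction */
    for (int i = -1; i <= 1; i++) {
        for (int j = -1; j <= 1; j++) {
            if (i == 0 && j == 0) continue;
            geohash_encode(lat + i * dlat, lon + j * dlon, 6, hash);
            printf("neighbor: %s\n", hash);
        }
    }
    return 0;
}
```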

Component-wise deduping

Once we have potential candidate dupe pairs, we provide per-component methods for comparing address/name pairs and determining whether they're duplicates. Each relevant address component has its own function, with specific logic for each, including which libpostal dictionaries to use and whether a root expansion match counts as an exact duplicate or not. For instance, in a secondary unit, "# 2", "Apt 2", and "Apt # 2" can be considered an exact match in English, whereas we wouldn't want to make that kind of assumption for street names, e.g. "Park Ave" and "Park Pl". In the latter case, we can still classify the street names as needing to be reviewed by a human.

The duplicate functions return one of the following values:

  • LIBPOSTAL_NULL_DUPLICATE_STATUS
  • LIBPOSTAL_NON_DUPLICATE
  • LIBPOSTAL_POSSIBLE_DUPLICATE_NEEDS_REVIEW
  • LIBPOSTAL_LIKELY_DUPLICATE
  • LIBPOSTAL_EXACT_DUPLICATE

The likely and exact classifications can be considered duplicates and merged automatically, whereas the needs_review response is for flagging possible duplicates.
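A minimal sketch of the component-wise API, using the function names added in this PR (`libpostal_get_default_duplicate_options`, `libpostal_is_unit_duplicate`, `libpostal_is_street_duplicate`); the exact signatures are assumptions, and the expected statuses in the comments follow the examples above:

```c
#include <stdio.h>
#include <libpostal/libpostal.h>

int main(void) {
    if (!libpostal_setup()) return 1;

    libpostal_duplicate_options_t options = libpostal_get_default_duplicate_options();

    /* Root expansion match on a secondary unit: "# 2" vs. "Apt 2" (exact in English) */
    libpostal_duplicate_status_t unit_status =
        libpostal_is_unit_duplicate("# 2", "Apt 2", options);

    /* Similar street names that are not the same street get flagged for review */
    libpostal_duplicate_status_t street_status =
        libpostal_is_street_duplicate("Park Ave", "Park Pl", options);

    if (unit_status == LIBPOSTAL_LIKELY_DUPLICATE || unit_status == LIBPOSTAL_EXACT_DUPLICATE) {
        printf("unit: safe to merge automatically\n");
    }
    if (street_status == LIBPOSTAL_POSSIBLE_DUPLICATE_NEEDS_REVIEW) {
        printf("street: route to human review\n");
    }

    libpostal_teardown();
    return 0;
}
```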

Having special functions for each component can also be useful down the line e.g. for deduping with compound house numbers/ranges (though this is not implemented yet).

Since identifying the correct language is crucial to effective matching, and individual components like house_number and unit may not provide any useful information about the language, we also provide a function that returns the language(s) for an entire parsed/labeled address using all of its textual components. The returned language codes can be reused for subsequent calls.
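A sketch of that flow, assuming the `libpostal_place_languages` entry point from this PR: classify the language once over all textual components, then reuse the codes in the per-component duplicate options. The Spanish inputs (and the "C/" abbreviation for "Calle") are illustrative only, and the options fields shown are assumptions:

```c
#include <stdio.h>
#include <libpostal/libpostal.h>

int main(void) {
    if (!libpostal_setup() || !libpostal_setup_language_classifier()) return 1;

    char *labels[] = {"house_number", "road", "city"};
    char *values[] = {"123", "Calle Ocho", "Miami"};

    /* Classify the language(s) once, using all textual components */
    size_t num_languages = 0;
    char **languages = libpostal_place_languages(3, labels, values, &num_languages);

    /* Reuse the result for every component-wise duplicate check */
    libpostal_duplicate_options_t options = libpostal_get_default_duplicate_options();
    options.num_languages = num_languages;
    options.languages = languages;

    libpostal_duplicate_status_t status =
        libpostal_is_street_duplicate("Calle Ocho", "C/ Ocho", options);
    printf("street status: %d\n", (int)status);

    libpostal_expansion_array_destroy(languages, num_languages);
    libpostal_teardown_language_classifier();
    libpostal_teardown();
    return 0;
}
```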

Fuzzy deduping for names

For venue/street names, we also want to be able to handle inexact name matches, minor spelling differences, words out of order (seen often with human names, which can sometimes be listed as Last, First Middle), and removing tokens that may not be ignorable in terms of libpostal's dictionaries but are very common overall, or very common in a particular geography.

In this release, we implement a custom version of the Soft-TFIDF method, which blends a local similarity function (usually Jaro-Winkler in the literature, though we use a hybrid method) with global corpus statistics (TFIDF weights or similar, supplied by the user in our case; see the lieu project for constructing the relevant TFIDF and/or Geo-TFIDF scores from a given data set).

Here's how it works:

  1. For strings s1 and s2, each token in s1 is aligned with its most similar token in s2 in terms of a user-specified local similarity metric, provided that it meets a specified similarity threshold. This allows for small spelling mistakes in the individual words and also makes the method invariant to word order.
  2. Given a vector of L2-normalized TFIDF scores for each string, the final similarity is the sum, over each token t1 in s1 and its closest match t2 in s2 (where local_sim >= theta), of local_sim * tf_idf[t1] * tf_idf[t2]. Using TFIDF means that rare words are given more weight in the similarity metric than very common words like "Salon" or "Barbershop." It's also possible to give all words equal weight using a uniform distribution (give each token a weight of 1 / # of tokens).
  3. Assuming the chosen scores add up to 1, which L2-normalized TFIDF scores roughly will (the sum may be slightly > 1), the sum of token similarity scores gives a total similarity score for the string that's between 0 and 1, and there are user-specified thresholds for when to consider the records as various classes of dupes. The default threshold is 0.9 for likely dupes and 0.7 for needs_review, but these may be changed depending on tolerance for false positives. (A schematic sketch of this computation follows the list.)
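Here is a schematic, self-contained re-implementation of steps 1-3 (not libpostal's internal code). The toy `local_sim` below is a stand-in for the hybrid Jaro-Winkler metric described in the next section, and the TFIDF weights are hard-coded where a real caller would supply corpus statistics:

```c
#include <stdio.h>
#include <string.h>

/* Toy local similarity: 1.0 for an exact match, else the shared-prefix ratio.
   Stands in for the Jaro-Winkler/hybrid metric described below. */
static double local_sim(const char *a, const char *b) {
    if (strcmp(a, b) == 0) return 1.0;
    size_t la = strlen(a), lb = strlen(b);
    size_t shorter = la < lb ? la : lb, longer = la > lb ? la : lb;
    size_t prefix = 0;
    while (prefix < shorter && a[prefix] == b[prefix]) prefix++;
    return longer > 0 ? (double)prefix / (double)longer : 0.0;
}

/* Soft-TFIDF: align each token in s1 with its most similar token in s2;
   if that similarity clears theta, add local_sim * weight1 * weight2.
   Weights are L2-normalized TFIDF scores supplied by the caller. */
static double soft_tfidf_similarity(char **t1, double *w1, size_t n1,
                                    char **t2, double *w2, size_t n2,
                                    double theta) {
    double total = 0.0;
    for (size_t i = 0; i < n1; i++) {
        double best = 0.0;
        size_t best_j = 0;
        for (size_t j = 0; j < n2; j++) {
            double sim = local_sim(t1[i], t2[j]);
            if (sim > best) { best = sim; best_j = j; }
        }
        if (best >= theta) {
            total += best * w1[i] * w2[best_j];
        }
    }
    return total;
}

int main(void) {
    char *a[] = {"main", "street", "barbershop"};
    double wa[] = {0.37, 0.12, 0.92};  /* L2-normalized TFIDF: rare words weigh more */
    char *b[] = {"barbershop", "man", "street"};  /* misspelling + word order change */
    double wb[] = {0.92, 0.37, 0.12};

    double sim = soft_tfidf_similarity(a, wa, 3, b, wb, 3, 0.9);
    printf("similarity: %.3f\n", sim);  /* compare against the 0.9 / 0.7 thresholds */
    return 0;
}
```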

Note: for the lieu project we use a linear combination of TFIDF and a geo-specific TFIDF score in which the IDF index is computed for a specific, roughly city-sized geohash tile, with smaller tiles lumped in with their larger neighbors. The geo-specific scores mean that something like "San Francisco Department of Building Inspection" and "Department of Building Inspection" can match, because the words "San Francisco" are very common in that region. This approach was inspired by some of the research in https://research.fb.com/publications/deduplicating-a-places-database/.

Unique to this implementation, we use a number of different local similarity metrics to qualify a given string for inclusion in the final similarity score:

  1. Jaro-Winkler similarity: this is a string similarity metric developed for comparing names in the U.S. Census. It detects small spelling differences in words based on the number of matches and transpositions relative to the lengths of the two strings. The Winkler variant gives a more favorable score to words that share a common prefix. This is the local similarity metric used in most of the Soft-TFIDF literature, and we use the commonly-cited value of 0.9 for the inclusion threshold, which works reasonably well in practice. (A minimal implementation sketch follows this list.) Note: all of our string similarity methods use unicode characters rather than bytes in their calculations.
  2. Damerau-Levenshtein distance: the traditional edit distance metric, where a transposition of two characters counts as a single edit. If a pair of strings does not meet the Jaro-Winkler threshold, but is within a maximum edit distance of 1 (it could be that the first characters were transposed) and both strings have a minimum length of 4 (many short strings are within edit distance 1 of each other, so we don't want to generate too many false positives), the pair still qualifies for inclusion. Note: since distances and similarities are not on the same scale, we use the Damerau-Levenshtein only as a qualifying threshold, and use the Jaro-Winkler similarity value (even though it did not meet the threshold) for the qualifying pair in the final similarity calculation.
  3. Sequence alignment with affine gap penalty and edit operation subtotals: a new, efficient method for sequence alignment and abbreviation detection. This builds on the Smith-Waterman-Gotoh algorithm with affine gap penalties, which was originally used for aligning DNA sequences but works well for other types of text. When we see a rare abbreviation that's not in the libpostal dictionaries, say "Service" and "Svc", the alignment would be "S--v-c-": we match "S", open a gap, extend that gap for two characters, then match "v", open another gap, extend it one character, match "c", open a gap, and extend it one more character at the end. The original Smith-Waterman requires O(mn) time and space to compute this alignment (where m and n are the lengths of the two strings). Gotoh's improvement still needs O(mn) time but only O(m) space (where m is the length of the longer string); however, it does not store the sequence of operations, only a single cost in which each type of edit pays a particular penalty (the affine gap penalty is the idea that opening a gap should cost more than extending one). The problem with a single combined score is that it's not always clear what to make of it. The new method we use in libpostal stores and returns a breakdown of the counts and specific types of edits it makes (matches, mismatches, gap opens, gap extensions, and transpositions) rather than rolling them up into a single cost, and without needing to return or compute the full alignment as in Needleman-Wunsch or Hirschberg's variant. Using this method, we know that for "Service" and "Svc" the number of matches is equal to the length of the shorter string, regardless of how many gaps were opened, so "Svc" can be considered a possible abbreviation for "Service". When we find one of these possible abbreviations and none of the other thresholds are met (which can easily happen with abbreviations), both tokens qualify for inclusion in the final similarity, again using their Jaro-Winkler similarity as the weight in the final calculation.
  4. Acronym alignments: especially prevalent in universities, museums, government agencies, etc. We provide a language-based stopword-aware acronym alignment method which can match "Museum of Modern Art" to "moma" (no capitalization needed), "University of California Berkeley" to "UC Berkeley", etc. If tokens in the shorter string are an acronym for tokens in the longer string, all of the above are included in the similarity score with a 1.0 local similarity (so those tokens' TFIDF scores will be counted as evidence for a match, not against it).
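As promised above, a minimal Jaro-Winkler sketch. For brevity it operates on bytes, whereas libpostal's string similarity methods, as noted, work on unicode code points:

```c
#include <stdio.h>
#include <stdlib.h>
#include <string.h>

/* Jaro similarity: matches within a window, half the out-of-order matches
   count as transpositions. Byte-level simplification for illustration. */
static double jaro(const char *s1, const char *s2) {
    size_t len1 = strlen(s1), len2 = strlen(s2);
    if (len1 == 0 && len2 == 0) return 1.0;
    if (len1 == 0 || len2 == 0) return 0.0;

    size_t max_len = len1 > len2 ? len1 : len2;
    size_t window = max_len > 2 ? max_len / 2 - 1 : 0;

    char *flags1 = calloc(len1, 1), *flags2 = calloc(len2, 1);
    double matches = 0.0, transpositions = 0.0;

    for (size_t i = 0; i < len1; i++) {
        size_t lo = i > window ? i - window : 0;
        size_t hi = i + window + 1 < len2 ? i + window + 1 : len2;
        for (size_t j = lo; j < hi; j++) {
            if (!flags2[j] && s1[i] == s2[j]) {
                flags1[i] = flags2[j] = 1;
                matches += 1.0;
                break;
            }
        }
    }

    if (matches > 0.0) {
        size_t k = 0;
        for (size_t i = 0; i < len1; i++) {
            if (!flags1[i]) continue;
            while (!flags2[k]) k++;
            if (s1[i] != s2[k]) transpositions += 1.0;
            k++;
        }
        transpositions /= 2.0;
    }
    free(flags1); free(flags2);

    if (matches == 0.0) return 0.0;
    return (matches / len1 + matches / len2 + (matches - transpositions) / matches) / 3.0;
}

/* Winkler modification: boost pairs sharing a common prefix (up to 4 chars) */
static double jaro_winkler(const char *s1, const char *s2) {
    double j = jaro(s1, s2);
    size_t prefix = 0;
    while (prefix < 4 && s1[prefix] && s1[prefix] == s2[prefix]) prefix++;
    return j + prefix * 0.1 * (1.0 - j);
}

int main(void) {
    printf("%.4f\n", jaro_winkler("michael", "micheal"));  /* small transposition */
    printf("%.4f\n", jaro_winkler("park", "main"));        /* dissimilar */
    return 0;
}
```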

The above assumes non-ideographic strings. In Chinese, Japanese, Korean, etc. we currently use the Jaccard similarity of the set of individual ideograms instead. In future versions it might be useful to weight the Jaccard similarity by TFIDF scores as well, and if we ever add a statistical word segmentation model for CJK languages, the word boundaries from that model could be used instead of ideograms.
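A minimal sketch of that ideographic fallback: decode UTF-8 into code points, then take the Jaccard similarity of the two code-point sets. This is an illustration, not libpostal's implementation, and it assumes well-formed UTF-8 input:

```c
#include <stdio.h>
#include <stdint.h>

/* Decode one UTF-8 sequence into a code point; returns bytes consumed.
   No validation: a sketch only. */
static size_t utf8_decode(const unsigned char *s, uint32_t *cp) {
    if (s[0] < 0x80) { *cp = s[0]; return 1; }
    if ((s[0] & 0xE0) == 0xC0) { *cp = ((s[0] & 0x1F) << 6) | (s[1] & 0x3F); return 2; }
    if ((s[0] & 0xF0) == 0xE0) {
        *cp = ((s[0] & 0x0F) << 12) | ((s[1] & 0x3F) << 6) | (s[2] & 0x3F);
        return 3;
    }
    *cp = ((s[0] & 0x07) << 18) | ((s[1] & 0x3F) << 12) | ((s[2] & 0x3F) << 6) | (s[3] & 0x3F);
    return 4;
}

/* Collect the unique code points of a string (up to max) */
static size_t unique_codepoints(const char *s, uint32_t *out, size_t max) {
    size_t n = 0;
    const unsigned char *p = (const unsigned char *)s;
    while (*p && n < max) {
        uint32_t cp;
        p += utf8_decode(p, &cp);
        int seen = 0;
        for (size_t i = 0; i < n; i++) if (out[i] == cp) { seen = 1; break; }
        if (!seen) out[n++] = cp;
    }
    return n;
}

/* Jaccard similarity of the two ideogram sets: |A ∩ B| / |A ∪ B| */
static double jaccard_ideograms(const char *s1, const char *s2) {
    uint32_t a[256], b[256];
    size_t na = unique_codepoints(s1, a, 256), nb = unique_codepoints(s2, b, 256);
    size_t intersection = 0;
    for (size_t i = 0; i < na; i++)
        for (size_t j = 0; j < nb; j++)
            if (a[i] == b[j]) { intersection++; break; }
    size_t uni = na + nb - intersection;
    return uni > 0 ? (double)intersection / (double)uni : 0.0;
}

int main(void) {
    /* "Tokyo Tower" vs. "Tokyo": 2 shared ideograms out of 5 total -> 0.4 */
    printf("%.3f\n", jaccard_ideograms("東京タワー", "東京"));
    return 0;
}
```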

The fuzzy duplicate methods are currently implemented for venue names and street names, which seemed to make the most sense. The output for these methods is a struct containing the dupe classification as well as the similarity value itself.

…aro-Winkler distances. Both operate on unicode points internally for lengths, etc. instead of byte strings and the Levenshtein distance uses only one array instead of needing to store the full matrix of transitions.
…ges if there are contiguous rules with no right context rules (example: something that wouldn't make sense like VL in Latin)
…at the new longer phrase ends at a word boundary (space, hyphen, end of string, etc.)
…d of specifying the number of arguments, should be more maintainable
…e a partial Roman numeral to get added for the MI portion of "Michael"
…iod where there's an expansion at the prefix/suffix (for #218 and #216 (comment)). Helps in cases like "St.Michaels" or "Jln.Utara" without needing to specify concatenated prefix phrases for every possibility
…ew use case (i.e. returns "is this substring in the trie?" regardless of whether it's stored under the special prefixes/suffixes namespaces)
…th-Waterman-Gotoh with affine gap penalties. Like Smith-Waterman, it performs a local alignment, and like the cost-only version of Gotoh's improvement, it needs O(mn) time and O(m) space (where m is the length of the longer string). However, this version of the algorithm stores and returns a breakdown of the number and specific types of edits it makes (matches, mismatches, gap opens, gap extensions, and transpositions) rather than rolling them up into a single cost, and without needing to return/compute the full alignment as in Needleman-Wunsch or Hirschberg's variant
…ell if any two words are an abbreviation. The loose variant requires that the alignment covers all characters in the shortest string, which matches things like Services vs. Svc, whereas the strict variant requires that either the shorter string is a prefix of the longer one (Inc and Incorporated) or that the two strings share both a prefix and a suffix (Dept and Department). Both variants require that the strings share at least the first letter in common.
… phrase matches in Soft-TFIDF. Acronym alignments will give higher similarity to NYU vs. "New York University" whereas phrase matches would match known phrases that share the same canonical like "Cty Rd" vs. "C.R." vs. "County Road" within the Soft-TFIDF similarity calculation.
… creates the string sets internally for convenience
…ifying languages consistently from components (may need to make several calls using the same languages, and don't necessarily want the language classifier to be run on house numbers when we already know the languages from e.g. the street name; this provides a simple window into the language classifier focused on the entire address/record)
… most of the work on this branch. Includes simple phrase-aware exact deduping methods, with per-component variations as to whether e.g. a root expansion match counts as an exact duplicate or not (in a secondary unit, "No. 2" and "Apt 2" can be considered an exact match in English, whereas we wouldn't want to make that kind of assumption for streets, e.g. "Park Ave" and "Park Pl"). The API is fairly low-level at present, and may require a few calls. Notably, we leave the TFIDF scores or other weighting schemes to the client. Since each component gets its own dupe classification, it leaves the door open for more specific checks around e.g. compound house numbers/ranges in the future.
…e hashing at 50 unique tokens, fixing memory leaks, checking for valid geo components and returning NULL if one of the required fields isn't present
…POSTAL_ADDRESS_ANY component in each function call so it can be removed as needed.
albarrentine (Author) commented:
This breaks the Windows build temporarily as the Appveyor config is missing a few steps to construct the necessary address dictionary/numex/transliteration files when certain files change in the commit range. Because some of the new tests in this PR depend on re-building those files, Appveyor is trying to run them while still relying on the pre-built versions.

Merging and re-running Appveyor once the new files are pushed.

albarrentine merged commit 8a917d8 into master on Dec 31, 2017
mkaranta (Contributor) commented Jan 2, 2018

This functionality (and implementation) mirrors much of my and my co-workers' work with similarity. It's good to know other people share the same ideas, and probably based their work on the same research.

I'd like to compare these new APIs against our internal ones, with a preference towards switching over to libpostal's methods for international addresses. I'm kind of pigeonholed into using Java, and wouldn't mind working on extending jpostal to cover the new API. Is someone already doing that? Is the C API stable enough for that?

Our main use case is on-demand processing of small volumes of addresses (1-100) rather than processing large data sets, so I'm not sure how useful the lieu code is. I'm going to learn from it anyway, and might implement something similar in Java.

Maurice-Betzel (Contributor) commented:
I would be interested in these bindings as well for the JavaCPP integration I am creating. This lib makes JNI a whole lot easier.

albarrentine (Author) commented:
Hey @mkaranta, happy 2018. I'd imagine there are similarities; we probably read the same handful of (awesome) papers in the record linkage/healthcare literature. The libpostal implementations have their own little nuances and use all the international goodness, and I think at least one of the methods, the subtotaling affine gap for detecting abbreviations, is entirely new. The near-dupe hashing function also diverges a bit from the literature; it comes from document deduplication and can be thought of as a cheap clustering algorithm like e.g. MinHash.

So far these methods have performed well on venue/place data sets (qualitatively, we didn't have much ground truth to work with), including the 20M venues from Who's On First/SimpleGeo. I'm also excited to use it for some of my work around voting rights in the US.

The C API itself should be stable (I might make a few changes to the implementation over the next few days, but that should not affect the API), so definitely feel free to implement on the jpostal side. I've only implemented the Python bindings for the moment, for use in lieu. Parts of the lieu project are still being tested/pushed, so it may be in a partially-broken state for a few days, but keep an eye out for a README update when it's all ready to use. There's a command-line version of lieu as well, which can work for smaller data sets, as well as a reference implementation (again, pushing soon) of a server that uses Elasticsearch as an index and checks new documents against it.

albarrentine (Author) commented:
@Maurice-Betzel sounds good. If either of you wants to take a crack at it, the new APIs use similar constructs to the existing ones. Happy to accept pull requests for jpostal.

mkaranta (Contributor) commented Jan 3, 2018

@albarrentine I'm learning a lot of this for the first time. The papers are fascinating and, most of the time, when I mention them to my coworkers, their response is "yeah I knew that". This is the first time I'm digging in to the libpostal code and the theory backing it. It's an elegant introduction to applied statistical NLP.

At work, we have a lot of similarity test data to throw at this once it's integrated. I'll get a timeline for jpostal based on whether I can work on it on the clock, and make an issue on that repo to track the work.

Timeline update:
We have a lot of higher priority stuff at work so it'll be an effort on my personal time, which should hopefully exist starting next week.
