
Near-duplicate detection and address deduping #294

Merged: albarrentine merged 93 commits into master on Dec 31, 2017
Conversation

albarrentine (Contributor) commented Dec 31, 2017

This PR adds three important groups of functions to libpostal's C API to support the lieu address/venue deduping project. The APIs are somewhat low-level at this point, but should still be useful in a wide range of geo applications, particularly for batch geocoding large data sets. This is the realization of some of the earlier work on address expansion.

Near-dupe hashing

Near-dupe hashing builds on the expand_address functionality to allow hashing a parsed address into strings suitable for direct comparison and automatic clustering. The hash keys are used to group similar records together prior to pairwise deduping so that we don't need to compare every record to every other record (i.e. N² comparisons). Instead, if we have a function that can generate the same hash key for records that are possible dupes (like "100 Main" and "100 E Main St"), while also being highly selective, we can ensure that most duplicates will be captured for further comparison downstream, and that dissimilar records can be safely considered non-dupes.

In a MapReduce context, near-dupe hashes can be used as keys to ensure that possible dupes will be grouped together on the same shard for pairwise checks, and in a search/database context, they can be used as an index for quick lookups of candidate dupes before running more thorough comparisons with the few records that match the hash. This is the first step in the deduping process, identifying candidate dupes, and can be thought of as the blocking function in record linkage (here, a highly selective one) or as a form of locality-sensitive hashing from the near-duplicate detection literature. Libpostal's near-dupe hashes use a combination of several new features of the library, sketched and then enumerated below.
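As a usage illustration, here's a rough sketch of generating hash keys for a parsed address through the C API. The function and option names are those added in this PR, but treat the exact signatures and struct fields shown here as assumptions rather than reference documentation:

```c
#include <stdio.h>
#include <stdbool.h>
#include <libpostal/libpostal.h>

int main(void) {
    if (!libpostal_setup() || !libpostal_setup_language_classifier()) return 1;

    /* A parsed/labeled address, e.g. the output of libpostal_parse_address */
    char *labels[] = {"house_number", "road", "city", "state"};
    char *values[] = {"100", "E Main St", "Brooklyn", "NY"};

    libpostal_near_dupe_hash_options_t options = libpostal_get_near_dupe_hash_default_options();
    options.with_latlon = true;     /* use geohash tiles as the geo qualifier */
    options.latitude = 40.68;
    options.longitude = -73.97;

    size_t num_hashes = 0;
    char **hashes = libpostal_near_dupe_hashes(4, labels, values, options, &num_hashes);

    /* Each hash is a blocking key: records sharing a key become candidate dupes */
    for (size_t i = 0; i < num_hashes; i++) {
        printf("%s\n", hashes[i]);
    }

    libpostal_expansion_array_destroy(hashes, num_hashes);
    libpostal_teardown_language_classifier();
    libpostal_teardown();
    return 0;
}
```

The hash keys themselves combine the following new features of the library: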

  1. Address root expansions: removes tokens that are ignorable such as "Ave", "Pl", "Road", etc. in street names so that something like "West 125th St" can potentially match "W 125". This also allows for exact comparison of apartment numbers, where "Apt 2" and "# 2" mean the same thing. Every address component uses certain dictionaries in libpostal to determine what is ignorable or not, and although the method is rule-based and deterministic, it can also identify the correct root tokens in many complex cases like "Avenue Rd", "Avenue E", "E St SE", "E Ctr St", etc. While many of the test cases used so far are for English, libpostal's dictionary structure also allows it to work relatively well around the world, e.g. matching Spanish street names where "Calle" might be included in a government data set but is rarely used colloquially or in self-reported addresses.

  2. Phonetic matching for names: the near-dupe hashes for venue/place/company names written in Latin script include a modified version of the double metaphone algorithm, which can be useful for comparing misspelled human names, as well as comparing machine transliterations against human ones in languages where names might be written in multiple scripts in different data sets, e.g. Arabic or Japanese.

  3. Geo qualifiers: for address data sets with lat/lons, geohash tiles (with a precision of 6 characters by default) and their 8 neighbors (to avoid missing matches across tile boundaries) are used to narrow down the comparisons to addresses/places in a similar location; a minimal sketch of the tiling scheme follows this list. If there's no lat/lon and the data are known to be from a single country, the postal code or the city name can optionally be used as the geo qualifier. Future improvements include disambiguating toponyms and mapping them to IDs in a hierarchy, such that multiple names for cities, etc. can resolve to one or more IDs, so that e.g. an NYC address that uses a neighborhood name in place of the city, like "Harlem, NY", could match "New York, NY" by traversing the hierarchy and outputting the city's ID instead.
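To make the geo qualifier concrete, here is a minimal, self-contained sketch of precision-6 geohash encoding, with the 8 neighbors obtained by re-encoding offset coordinates. This illustrates the tiling idea only; it is not libpostal's internal implementation, and the simple offset trick ignores edge cases at the poles and the antimeridian:

```c
#include <stdio.h>

static const char *GEOHASH_BASE32 = "0123456789bcdefghjkmnpqrstuvwxyz";

/* Encode lat/lon to a geohash of the given precision (in characters) by
   interleaving longitude and latitude bisection bits, 5 bits per character. */
static void geohash_encode(double lat, double lon, size_t precision, char *out) {
    double lat_lo = -90.0, lat_hi = 90.0, lon_lo = -180.0, lon_hi = 180.0;
    int even = 1, bit = 0, idx = 0;
    size_t len = 0;
    while (len < precision) {
        if (even) {
            double mid = (lon_lo + lon_hi) / 2.0;
            if (lon >= mid) { idx = (idx << 1) | 1; lon_lo = mid; }
            else { idx <<= 1; lon_hi = mid; }
        } else {
            double mid = (lat_lo + lat_hi) / 2.0;
            if (lat >= mid) { idx = (idx << 1) | 1; lat_lo = mid; }
            else { idx <<= 1; lat_hi = mid; }
        }
        even = !even;
        if (++bit == 5) { out[len++] = GEOHASH_BASE32[idx]; bit = 0; idx = 0; }
    }
    out[len] = '\0';
}

int main(void) {
    /* Precision 6 = 30 bits: 15 for longitude, 15 for latitude */
    double lat = 40.6892, lon = -74.0445;
    double dlat = 180.0 / 32768.0, dlon = 360.0 / 32768.0; /* cell size at precision 6 */
    char hash[8];

    geohash_encode(lat, lon, 6, hash);
    printf("center:   %s\n", hash);

    /* The 8 neighbors: re-encode points offset by one cell in each direction */
    for (int i = -1; i <= 1; i++) {
        for (int j = -1; j <= 1; j++) {
            if (i == 0 && j == 0) continue;
            geohash_encode(lat + i * dlat, lon + j * dlon, 6, hash);
            printf("neighbor: %s\n", hash);
        }
    }
    return 0;
}
```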

Component-wise deduping

Once we have potential candidate dupe pairs, we provide per-component methods for comparing address/name pairs and determining whether they're duplicates. Each relevant address component has its own function, with specific logic for each, including which libpostal dictionaries to use and whether a root expansion match counts as an exact duplicate or not. For instance, in a secondary unit, "# 2", "Apt 2", and "Apt # 2" can be considered an exact match in English, whereas we wouldn't want to make that kind of assumption for street names, e.g. "Park Ave" and "Park Pl". In the latter case, we can still classify the street names as needing to be reviewed by a human.

The duplicate functions return one of the following values:

  • LIBPOSTAL_NULL_DUPLICATE_STATUS
  • LIBPOSTAL_NON_DUPLICATE
  • LIBPOSTAL_POSSIBLE_DUPLICATE_NEEDS_REVIEW
  • LIBPOSTAL_LIKELY_DUPLICATE
  • LIBPOSTAL_EXACT_DUPLICATE

The likely and exact classifications can be considered duplicates and merged automatically, whereas the needs_review response is for flagging possible duplicates.
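A minimal sketch of the component-wise API, using the function names added in this PR (`libpostal_get_default_duplicate_options`, `libpostal_is_unit_duplicate`, `libpostal_is_street_duplicate`); the exact signatures are assumptions, and the expected statuses in the comments follow the examples above:

```c
#include <stdio.h>
#include <libpostal/libpostal.h>

int main(void) {
    if (!libpostal_setup()) return 1;

    libpostal_duplicate_options_t options = libpostal_get_default_duplicate_options();

    /* Root expansion match on a secondary unit: "# 2" vs. "Apt 2" (exact in English) */
    libpostal_duplicate_status_t unit_status =
        libpostal_is_unit_duplicate("# 2", "Apt 2", options);

    /* Similar street names that are not the same street get flagged for review */
    libpostal_duplicate_status_t street_status =
        libpostal_is_street_duplicate("Park Ave", "Park Pl", options);

    if (unit_status == LIBPOSTAL_LIKELY_DUPLICATE || unit_status == LIBPOSTAL_EXACT_DUPLICATE) {
        printf("unit: safe to merge automatically\n");
    }
    if (street_status == LIBPOSTAL_POSSIBLE_DUPLICATE_NEEDS_REVIEW) {
        printf("street: route to human review\n");
    }

    libpostal_teardown();
    return 0;
}
```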

Having special functions for each component can also be useful down the line e.g. for deduping with compound house numbers/ranges (though this is not implemented yet).

Since identifying the correct language is crucial to effective matching, and individual components like house_number and unit may not provide any useful information about the language, we also provide a function that returns the language(s) for an entire parsed/labeled address using all of its textual components. The returned language codes can be reused for subsequent calls.
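A sketch of that flow, assuming the `libpostal_place_languages` entry point from this PR: classify the language once over all textual components, then reuse the codes in the per-component duplicate options. The Spanish inputs (and the "C/" abbreviation for "Calle") are illustrative only, and the options fields shown are assumptions:

```c
#include <stdio.h>
#include <libpostal/libpostal.h>

int main(void) {
    if (!libpostal_setup() || !libpostal_setup_language_classifier()) return 1;

    char *labels[] = {"house_number", "road", "city"};
    char *values[] = {"123", "Calle Ocho", "Miami"};

    /* Classify the language(s) once, using all textual components */
    size_t num_languages = 0;
    char **languages = libpostal_place_languages(3, labels, values, &num_languages);

    /* Reuse the result for every component-wise duplicate check */
    libpostal_duplicate_options_t options = libpostal_get_default_duplicate_options();
    options.num_languages = num_languages;
    options.languages = languages;

    libpostal_duplicate_status_t status =
        libpostal_is_street_duplicate("Calle Ocho", "C/ Ocho", options);
    printf("street status: %d\n", (int)status);

    libpostal_expansion_array_destroy(languages, num_languages);
    libpostal_teardown_language_classifier();
    libpostal_teardown();
    return 0;
}
```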

Fuzzy deduping for names

For venue/street names, we also want to be able to handle inexact name matches, minor spelling differences, words out of order (seen often with human names, which can sometimes be listed as Last, First Middle), and removing tokens that may not be ignorable in terms of libpostal's dictionaries but are very common overall, or very common in a particular geography.

In this release, we implement a custom version of the Soft-TFIDF method, which blends a local similarity function (usually Jaro-Winkler in the literature, though we use a hybrid method) with global corpus statistics (TFIDF weights or similar, supplied by the user in our case; see the lieu project for constructing the relevant TFIDF and/or Geo-TFIDF scores from a given data set).

Here's how it works:

  1. For strings s1 and s2, each token in s1 is aligned with its most similar token in s2 in terms of a user-specified local similarity metric, provided that it meets a specified similarity threshold. This allows for small spelling mistakes in the individual words and also makes the method invariant to word order.
  2. Given a vector of L2-normalized TFIDF scores for each string, the final similarity is the sum, over each token t1 in s1 and its closest match t2 in s2 (where local_sim >= theta), of local_sim * tf_idf[t1] * tf_idf[t2]. Using TFIDF means that rare words are given more weight in the similarity metric than very common words like "Salon" or "Barbershop." It's also possible to give all words equal weight using a uniform distribution (give each token a weight of 1 / # of tokens).
  3. Assuming the chosen scores add up to 1, which L2-normalized TFIDF scores roughly will (the sum may be slightly > 1), the sum of token similarity scores gives a total similarity score for the string that's between 0 and 1, and there are user-specified thresholds for when to consider the records as various classes of dupes. The default threshold is 0.9 for likely dupes and 0.7 for needs_review, but these may be changed depending on tolerance for false positives. (A schematic sketch of this computation follows the list.)
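Here is a schematic, self-contained re-implementation of steps 1-3 (not libpostal's internal code). The toy `local_sim` below is a stand-in for the hybrid Jaro-Winkler metric described in the next section, and the TFIDF weights are hard-coded where a real caller would supply corpus statistics:

```c
#include <stdio.h>
#include <string.h>

/* Toy local similarity: 1.0 for an exact match, else the shared-prefix ratio.
   Stands in for the Jaro-Winkler/hybrid metric described below. */
static double local_sim(const char *a, const char *b) {
    if (strcmp(a, b) == 0) return 1.0;
    size_t la = strlen(a), lb = strlen(b);
    size_t shorter = la < lb ? la : lb, longer = la > lb ? la : lb;
    size_t prefix = 0;
    while (prefix < shorter && a[prefix] == b[prefix]) prefix++;
    return longer > 0 ? (double)prefix / (double)longer : 0.0;
}

/* Soft-TFIDF: align each token in s1 with its most similar token in s2;
   if that similarity clears theta, add local_sim * weight1 * weight2.
   Weights are L2-normalized TFIDF scores supplied by the caller. */
static double soft_tfidf_similarity(char **t1, double *w1, size_t n1,
                                    char **t2, double *w2, size_t n2,
                                    double theta) {
    double total = 0.0;
    for (size_t i = 0; i < n1; i++) {
        double best = 0.0;
        size_t best_j = 0;
        for (size_t j = 0; j < n2; j++) {
            double sim = local_sim(t1[i], t2[j]);
            if (sim > best) { best = sim; best_j = j; }
        }
        if (best >= theta) {
            total += best * w1[i] * w2[best_j];
        }
    }
    return total;
}

int main(void) {
    char *a[] = {"main", "street", "barbershop"};
    double wa[] = {0.37, 0.12, 0.92};  /* L2-normalized TFIDF: rare words weigh more */
    char *b[] = {"barbershop", "man", "street"};  /* misspelling + word order change */
    double wb[] = {0.92, 0.37, 0.12};

    double sim = soft_tfidf_similarity(a, wa, 3, b, wb, 3, 0.9);
    printf("similarity: %.3f\n", sim);  /* compare against the 0.9 / 0.7 thresholds */
    return 0;
}
```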

Note: for the lieu project we use a linear combination of TFIDF and a geo-specific TFIDF score in which the IDF index is computed for a specific, roughly city-sized geohash tile, with smaller tiles lumped in with their larger neighbors. The geo-specific scores mean that something like "San Francisco Department of Building Inspection" and "Department of Building Inspection" can match, because the words "San Francisco" are very common in that region. This approach was inspired by some of the research in https://research.fb.com/publications/deduplicating-a-places-database/.

Unique to this implementation, we use a number of different local similarity metrics to qualify a given string for inclusion in the final similarity score:

  1. Jaro-Winkler similarity: this is a string similarity metric developed for comparing names in the U.S. Census. It detects small spelling differences in words based on the number of matches and transpositions relative to the lengths of the two strings. The Winkler variant gives a more favorable score to words that share a common prefix. This is the local similarity metric used in most of the Soft-TFIDF literature, and we use the commonly-cited value of 0.9 for the inclusion threshold, which works reasonably well in practice. (A minimal implementation sketch follows this list.) Note: all of our string similarity methods use unicode characters rather than bytes in their calculations.
  2. Damerau-Levenshtein distance: the traditional edit distance metric, where a transposition of two characters counts as a single edit. If a pair of strings does not meet the Jaro-Winkler threshold, but is within a maximum edit distance of 1 (it could be that the first characters were transposed) and both strings have a minimum length of 4 (many short strings are within edit distance 1 of each other, so we don't want to generate too many false positives), the pair still qualifies for inclusion. Note: since distances and similarities are not on the same scale, we use the Damerau-Levenshtein only as a qualifying threshold, and use the Jaro-Winkler similarity value (even though it did not meet the threshold) for the qualifying pair in the final similarity calculation.
  3. Sequence alignment with affine gap penalty and edit operation subtotals: a new, efficient method for sequence alignment and abbreviation detection. This builds on the Smith-Waterman-Gotoh algorithm with affine gap penalties, which was originally used for aligning DNA sequences but works well for other types of text. When we see a rare abbreviation that's not in the libpostal dictionaries, say "Service" and "Svc", the alignment would be "S--v-c-": we match "S", open a gap, extend that gap for two characters, then match "v", open another gap, extend it one character, match "c", open a gap, and extend it one more character at the end. The original Smith-Waterman requires O(mn) time and space to compute this alignment (where m and n are the lengths of the two strings). Gotoh's improvement still needs O(mn) time but only O(m) space (where m is the length of the longer string); however, it does not store the sequence of operations, only a single cost in which each type of edit pays a particular penalty (the affine gap penalty is the idea that opening a gap should cost more than extending one). The problem with a single combined score is that it's not always clear what to make of it. The new method we use in libpostal stores and returns a breakdown of the counts and specific types of edits it makes (matches, mismatches, gap opens, gap extensions, and transpositions) rather than rolling them up into a single cost, and without needing to return or compute the full alignment as in Needleman-Wunsch or Hirschberg's variant. Using this method, we know that for "Service" and "Svc" the number of matches is equal to the length of the shorter string, regardless of how many gaps were opened, so "Svc" can be considered a possible abbreviation for "Service". When we find one of these possible abbreviations and none of the other thresholds are met (which can easily happen with abbreviations), both tokens qualify for inclusion in the final similarity, again using their Jaro-Winkler similarity as the weight in the final calculation.
  4. Acronym alignments: especially prevalent in universities, museums, government agencies, etc. We provide a language-based stopword-aware acronym alignment method which can match "Museum of Modern Art" to "moma" (no capitalization needed), "University of California Berkeley" to "UC Berkeley", etc. If tokens in the shorter string are an acronym for tokens in the longer string, all of the above are included in the similarity score with a 1.0 local similarity (so those tokens' TFIDF scores will be counted as evidence for a match, not against it).
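As promised above, a minimal Jaro-Winkler sketch. For brevity it operates on bytes, whereas libpostal's string similarity methods, as noted, work on unicode code points:

```c
#include <stdio.h>
#include <stdlib.h>
#include <string.h>

/* Jaro similarity: matches within a window, half the out-of-order matches
   count as transpositions. Byte-level simplification for illustration. */
static double jaro(const char *s1, const char *s2) {
    size_t len1 = strlen(s1), len2 = strlen(s2);
    if (len1 == 0 && len2 == 0) return 1.0;
    if (len1 == 0 || len2 == 0) return 0.0;

    size_t max_len = len1 > len2 ? len1 : len2;
    size_t window = max_len > 2 ? max_len / 2 - 1 : 0;

    char *flags1 = calloc(len1, 1), *flags2 = calloc(len2, 1);
    double matches = 0.0, transpositions = 0.0;

    for (size_t i = 0; i < len1; i++) {
        size_t lo = i > window ? i - window : 0;
        size_t hi = i + window + 1 < len2 ? i + window + 1 : len2;
        for (size_t j = lo; j < hi; j++) {
            if (!flags2[j] && s1[i] == s2[j]) {
                flags1[i] = flags2[j] = 1;
                matches += 1.0;
                break;
            }
        }
    }

    if (matches > 0.0) {
        size_t k = 0;
        for (size_t i = 0; i < len1; i++) {
            if (!flags1[i]) continue;
            while (!flags2[k]) k++;
            if (s1[i] != s2[k]) transpositions += 1.0;
            k++;
        }
        transpositions /= 2.0;
    }
    free(flags1); free(flags2);

    if (matches == 0.0) return 0.0;
    return (matches / len1 + matches / len2 + (matches - transpositions) / matches) / 3.0;
}

/* Winkler modification: boost pairs sharing a common prefix (up to 4 chars) */
static double jaro_winkler(const char *s1, const char *s2) {
    double j = jaro(s1, s2);
    size_t prefix = 0;
    while (prefix < 4 && s1[prefix] && s1[prefix] == s2[prefix]) prefix++;
    return j + prefix * 0.1 * (1.0 - j);
}

int main(void) {
    printf("%.4f\n", jaro_winkler("michael", "micheal"));  /* small transposition */
    printf("%.4f\n", jaro_winkler("park", "main"));        /* dissimilar */
    return 0;
}
```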

The above assumes non-ideographic strings. In Chinese, Japanese, Korean, etc. we currently use the Jaccard similarity of the set of individual ideograms instead. In future versions it might be useful to weight the Jaccard similarity by TFIDF scores as well, and if we ever add a statistical word segmentation model for CJK languages, the word boundaries from that model could be used instead of ideograms.
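A minimal sketch of that ideographic fallback: decode UTF-8 into code points, then take the Jaccard similarity of the two code-point sets. This is an illustration, not libpostal's implementation, and it assumes well-formed UTF-8 input:

```c
#include <stdio.h>
#include <stdint.h>

/* Decode one UTF-8 sequence into a code point; returns bytes consumed.
   No validation: a sketch only. */
static size_t utf8_decode(const unsigned char *s, uint32_t *cp) {
    if (s[0] < 0x80) { *cp = s[0]; return 1; }
    if ((s[0] & 0xE0) == 0xC0) { *cp = ((s[0] & 0x1F) << 6) | (s[1] & 0x3F); return 2; }
    if ((s[0] & 0xF0) == 0xE0) {
        *cp = ((s[0] & 0x0F) << 12) | ((s[1] & 0x3F) << 6) | (s[2] & 0x3F);
        return 3;
    }
    *cp = ((s[0] & 0x07) << 18) | ((s[1] & 0x3F) << 12) | ((s[2] & 0x3F) << 6) | (s[3] & 0x3F);
    return 4;
}

/* Collect the unique code points of a string (up to max) */
static size_t unique_codepoints(const char *s, uint32_t *out, size_t max) {
    size_t n = 0;
    const unsigned char *p = (const unsigned char *)s;
    while (*p && n < max) {
        uint32_t cp;
        p += utf8_decode(p, &cp);
        int seen = 0;
        for (size_t i = 0; i < n; i++) if (out[i] == cp) { seen = 1; break; }
        if (!seen) out[n++] = cp;
    }
    return n;
}

/* Jaccard similarity of the two ideogram sets: |A ∩ B| / |A ∪ B| */
static double jaccard_ideograms(const char *s1, const char *s2) {
    uint32_t a[256], b[256];
    size_t na = unique_codepoints(s1, a, 256), nb = unique_codepoints(s2, b, 256);
    size_t intersection = 0;
    for (size_t i = 0; i < na; i++)
        for (size_t j = 0; j < nb; j++)
            if (a[i] == b[j]) { intersection++; break; }
    size_t uni = na + nb - intersection;
    return uni > 0 ? (double)intersection / (double)uni : 0.0;
}

int main(void) {
    /* "Tokyo Tower" vs. "Tokyo": 2 shared ideograms out of 5 total -> 0.4 */
    printf("%.3f\n", jaccard_ideograms("東京タワー", "東京"));
    return 0;
}
```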

The fuzzy duplicate methods are currently implemented for venue names and street names, which seemed to make the most sense. The output for these methods is a struct containing the dupe classification as well as the similarity value itself.

…aro-Winkler distances. Both operate on unicode points internally for lengths, etc. instead of byte strings and the Levenshtein distance uses only one array instead of needing to store the full matrix of transitions.
…ges if there are contiguous rules with no right context rules (example: something that wouldn't make sense like VL in Latin)
…at the new longer phrase ends at a word boundary (space, hyphen, end of string, etc.)
…d of specifying the number of arguments, should be more maintainable
…e a partial Roman numeral to get added for the MI portion of "Michael"
…iod where there's an expansion at the prefix/suffix (for #218 and #216 (comment)). Helps in cases like "St.Michaels" or "Jln.Utara" without needing to specify concatenated prefix phrases for every possibility
…ew use case (i.e. returns "is this substring in the trie?" regardless of whether it's stored under the special prefixes/suffixes namespaces)
…th-Waterman-Gotoh with affine gap penalties. Like Smith-Waterman, it performs a local alignment, and like the cost-only version of Gotoh's improvement, it needs O(mn) time and O(m) space (where m is the length of the longer string). However, this version of the algorithm stores and returns a breakdown of the number and specific types of edits it makes (matches, mismatches, gap opens, gap extensions, and transpositions) rather than rolling them up into a single cost, and without needing to return/compute the full alignment as in Needleman-Wunsch or Hirschberg's variant
…ell if any two words are an abbreviation. The loose variant requires that the alignment covers all characters in the shortest string, which matches things like Services vs. Svc, whereas the strict variant requires that either the shorter string is a prefix of the longer one (Inc and Incorporated) or that the two strings share both a prefix and a suffix (Dept and Department). Both variants require that the strings share at least the first letter in common.
… phrase matches in Soft-TFIDF. Acronym alignments will give higher similarity to NYU vs. "New York University" whereas phrase matches would match known phrases that share the same canonical like "Cty Rd" vs. "C.R." vs. "County Road" within the Soft-TFIDF similarity calculation.
… creates the string sets internally for convenience
…ifying languages consistently from components (may need to make several calls using the same languages, and don't necessarily want the language classifier to be run on house numbers when we already know the languages from e.g. the street name; this provides a simple window into the language classifier focused on the entire address/record)
… most of the work on this branch. Includes simple phrase-aware exact deduping methods, with per-component variations as to whether e.g. a root expansion match counts as an exact duplicate or not (in a secondary unit, "No. 2" and "Apt 2" can be considered an exact match in English, whereas we wouldn't want to make that kind of assumption for streets, e.g. "Park Ave" and "Park Pl"). The API is fairly low-level at present, and may require a few calls. Notably, we leave the TFIDF scores or other weighting schemes to the client. Since each component gets its own dupe classification, it leaves the door open for more specific checks around e.g. compound house numbers/ranges in the future.
…e hashing at 50 unique tokens, fixing memory leaks, checking for valid geo components and returning NULL if one of the required fields isn't present
…POSTAL_ADDRESS_ANY component in each function call so it can be removed as needed.
albarrentine (Author) commented:
This breaks the Windows build temporarily as the Appveyor config is missing a few steps to construct the necessary address dictionary/numex/transliteration files when certain files change in the commit range. Because some of the new tests in this PR depend on re-building those files, Appveyor is trying to run them while still relying on the pre-built versions.

Merging and re-running Appveyor once the new files are pushed.

albarrentine merged commit 8a917d8 into master on Dec 31, 2017
mkaranta (Contributor) commented Jan 2, 2018

This functionality (and implementation) mirrors much of my and my co-workers' work with similarity. It's good to know other people share the same ideas, and probably based their work on the same research.

I'd like to compare these new APIs against our internal ones, with a preference towards switching over to libpostal's methods for international addresses. I'm kind of pigeonholed into using Java, and wouldn't mind working on extending jpostal to cover the new API. Is someone already doing that? Is the C API stable enough for that?

Our main use case is on-demand processing of small volumes of addresses (1-100) rather than processing large data sets, so I'm not sure how useful the lieu code is. I'm going to learn from it anyway, and might implement something similar in Java.

Maurice-Betzel (Contributor) commented:
I would be interested in these bindings as well for the JavaCPP integration I am creating. This lib makes JNI a whole lot easier.

albarrentine (Author) commented:
Hey @mkaranta, happy 2018. I'd imagine there are similarities; we probably read the same handful of (awesome) papers in the record linkage/healthcare literature. The libpostal implementations have their own little nuances and use all the international goodness, and I think at least one of the methods, the subtotaling affine gap for detecting abbreviations, is entirely new. The near-dupe hashing function also diverges a bit from the literature; it comes from document deduplication and can be thought of as a cheap clustering algorithm like e.g. MinHash.

So far these methods have performed well on venue/place data sets (qualitatively, we didn't have much ground truth to work with), including the 20M venues from Who's On First/SimpleGeo. I'm also excited to use it for some of my work around voting rights in the US.

The C API itself should be stable (I might make a few changes to the implementation over the next few days, but that should not affect the API), so definitely feel free to implement on the jpostal side. I've only implemented the Python bindings for the moment, for use in lieu. Parts of the lieu project are still being tested/pushed, so it may be in a partially-broken state for a few days, but keep an eye out for a README update when it's all ready to use. There's a command-line version of lieu as well, which can work for smaller data sets, as well as a reference implementation (again, pushing soon) of a server that uses Elasticsearch as an index and checks new documents against it.

albarrentine (Author) commented:
@Maurice-Betzel sounds good. If either of you wants to take a crack at it, the new APIs use similar constructs to the existing ones. Happy to accept pull requests for jpostal.

mkaranta (Contributor) commented Jan 3, 2018

@albarrentine I'm learning a lot of this for the first time. The papers are fascinating and, most of the time, when I mention them to my coworkers, their response is "yeah I knew that". This is the first time I'm digging in to the libpostal code and the theory backing it. It's an elegant introduction to applied statistical NLP.

At work, we have a lot of similarity test data to throw at this once it's integrated. I'll get a timeline for jpostal based on whether I can work on it on the clock, and make an issue on that repo to track the work.

Timeline update:
We have a lot of higher priority stuff at work so it'll be an effort on my personal time, which should hopefully exist starting next week.
