Near-duplicate detection and address deduping #294
…-8 string, a few bug fixes to string_utils
…aro-Winkler distances. Both operate on unicode points internally for lengths, etc. instead of byte strings and the Levenshtein distance uses only one array instead of needing to store the full matrix of transitions.
…ges if there are contiguous rules with no right context rules (example: something that wouldn't make sense like VL in Latin)
…at the new longer phrase ends at a word boundary (space, hyphen, end of string, etc.)
…s like IXe in French
…d of specifying the number of arguments, should be more maintainable
…e a partial Roman numeral to get added for the MI portion of "Michael"
…riod character in a string
…iod where there's an expansion at the prefix/suffix (for #218 and #216 (comment)). Helps in cases like "St.Michaels" or "Jln.Utara" without needing to specify concatenated prefix phrases for every possibility
…ew use case (i.e. returns "is this substring in the trie?" regardless of whether it's stored under the special prefixes/suffixes namespaces)
…o-Winkler distances
…uality of unicode char arrays
…th-Waterman-Gotoh with affine gap penalties. Like Smith-Waterman, it performs a local alignment, and like the cost-only version of Gotoh's improvement, it needs O(mn) time and O(m) space (where m is the length of the longer string). However, this version of the algorithm stores and returns a breakdown of the number and specific types of edits it makes (matches, mismatches, gap opens, gap extensions, and transpositions) rather than rolling them up into a single cost, and without needing to return/compute the full alignment as in Needleman-Wunsch or Hirschberg's variant
…ell if any two words are an abbreviation. The loose variant requires that the alignment covers all characters in the shortest string, which matches things like Services vs. Svc, whereas the strict variant requires that either the shorter string is a prefix of the longer one (Inc and Incorporated) or that the two strings share both a prefix and a suffix (Dept and Department). Both variants require that the strings share at least the first letter in common.
… everything const char *
…ffine gap implementation (mixed indices)
… phrase matches in Soft-TFIDF. Acronym alignments will give higher similarity to NYU vs. "New York University" whereas phrase matches would match known phrases that share the same canonical like "Cty Rd" vs. "C.R." vs. "County Road" within the Soft-TFIDF similarity calculation.
…cronym token alignments
… creates the string sets internally for convenience
…ifying languages consistently from components (we may need to make several calls using the same languages, and don't necessarily want the language classifier to be run on house numbers when we already know the languages from e.g. the street name) - this provides a simple window into the language classifier focused on the entire address/record
… most of the work on this branch. Includes simple phrase-aware exact deduping methods, with per-component variations as to whether e.g. a root expansion match counts as an exact duplicate or not (in a secondary unit, "No. 2" and "Apt 2" can be considered an exact match in English whereas we wouldn't want to make that kind of assumption for street e.g. "Park Ave" and "Park Pl"). The API is fairly low-level at present, and may require a few calls. Notably, we leave the TFIDF scores or other weighting schemes to the client. Since each component gets its own dupe classification, it leaves the door open for doing more specific checks around e.g. compound house numbers/ranges in the future.
…naming convention
…e hashing at 50 unique tokens, fixing memory leaks, checking for valid geo components and returning NULL if one of the required fields isn't present
… converting them to char arrays
…POSTAL_ADDRESS_ANY component in each function call so it can be removed as needed.
…n accordance with Winkler's paper
This breaks the Windows build temporarily as the Appveyor config is missing a few steps to construct the necessary address dictionary/numex/transliteration files when certain files change in the commit range. Because some of the new tests in this PR depend on re-building those files, Appveyor is trying to run them while still relying on the pre-built versions. Merging and re-running Appveyor once the new files are pushed.
This functionality (and implementation) mirrors much of my (and my co-workers') work on similarity. It's good to know other people share the same ideas; we probably based our work on the same research. I'd like to compare these new APIs against our internal ones, with a preference for switching over to libpostal's methods for international addresses. I'm kind of pigeonholed into using Java, and wouldn't mind working on extending jpostal to cover the new API. Is someone already doing that? Is the C API stable enough for that? Our main use case is on-demand processing of small volumes of addresses (1-100) rather than processing large data sets, so I'm not sure how useful the lieu code is. Going to learn from it anyway; I might implement something similar in Java.
I would be interested in these bindings as well for the JavaCPP integration I am creating. This lib makes JNI a whole lot easier.
Hey @mkaranta, happy 2018. I'd imagine there are similarities; we probably read the same handful of (awesome) papers in the record linkage/healthcare literature. The libpostal implementations have their own little nuances and use all the international goodness, and I think at least one of the methods, the subtotaling affine gap for detecting abbreviations, is entirely new. The near-dupe hashing function also diverges a bit from the literature; it comes from document deduplication and can be thought of as a cheap clustering algorithm like e.g. MinHash. So far these methods have performed well on venue/place data sets (qualitatively; we didn't have much ground truth to work with), including the 20M venues from Who's On First/SimpleGeo. I'm also excited to use it for some of my work around voting rights in the US. The C API itself should be stable (I might make a few changes to the implementation over the next few days, but that should not affect the API), so definitely feel free to implement on the jpostal side. I've only implemented the Python bindings for the moment, for use in lieu. Parts of the lieu project are still being tested/pushed, so it may be in a partially-broken state for a few days, but keep an eye out for a README update when it's all ready to use. There's a command-line version of lieu as well, which can work for smaller data sets, as well as a reference implementation (again, pushing soon) of a server that uses Elasticsearch as an index and checks new documents against it.
@Maurice-Betzel sounds good. If either of you wants to take a crack at it, the new APIs use similar constructs to the existing ones. Happy to accept pull requests for jpostal.
@albarrentine I'm learning a lot of this for the first time. The papers are fascinating and, most of the time, when I mention them to my coworkers, their response is "yeah, I knew that". This is the first time I'm digging into the libpostal code and the theory backing it. It's an elegant introduction to applied statistical NLP. At work, we have a lot of similarity test data to throw at this once it's integrated. I'll get a timeline for jpostal based on whether I can work on it on the clock, and make an issue on that repo to track the work. Timeline update:
This PR adds three important groups of functions to libpostal's C API to support the lieu address/venue deduping project. The APIs are somewhat low-level at this point, but should still be useful in a wide range of geo applications, particularly for batch geocoding large data sets. This is the realization of some of the earlier work on address expansion.
Near-dupe hashing
Near-dupe hashing builds on the expand_address functionality to allow hashing a parsed address into strings suitable for direct comparison and automatic clustering. The hash keys are used to group similar records together prior to pairwise deduping so that we don't need to compare every record to every other record (i.e. N² comparisons). Instead, if we have a function that can generate the same hash key for records that are possible dupes (like "100 Main" and "100 E Main St"), while also being highly selective, we can ensure that most duplicates will be captured for further comparison downstream, and that dissimilar records can be safely considered non-dupes. In a MapReduce context, near-dupe hashes can be used as keys to ensure that possible dupes will be grouped together on the same shard for pairwise checks, and in a search/database context, they can be used as an index for quick lookups of candidate dupes before running more thorough comparisons with the few records that match the hash. This is the first step in the deduping process, identifying candidate dupes, and can be thought of as a (highly selective) blocking function in record linkage, or as a form of locality-sensitive hashing in the near-duplicate detection literature. Libpostal's near-dupe hashes use a combination of several new features of the library:
Address root expansions: removes tokens that are ignorable such as "Ave", "Pl", "Road", etc. in street names so that something like "West 125th St" can potentially match "W 125". This also allows for exact comparison of apartment numbers where "Apt 2" and "# 2" mean the same thing. Every address component uses certain dictionaries in libpostal to determine what is ignorable or not, and although the method is rule-based and deterministic, it can also identify the correct root tokens in many complex cases like "Avenue Rd", "Avenue E", "E St SE", "E Ctr St", etc. While many of the test cases used so far are for English, libpostal's dictionary structure also allows it to work relatively well around the world, e.g. matching Spanish street names where "Calle" might be included in a government data set but is rarely used colloquially or in self-reported addresses.
Phonetic matching for names: the near-dupe hashes for venue/place/company names written in Latin script include a modified version of the double metaphone algorithm, which can be useful for comparing misspelled human names, as well as comparing machine transliterations against human ones in languages where names might be written in multiple scripts in different data sets, e.g. Arabic or Japanese.
Geo qualifiers: for address data sets with lat/lons, geohash tiles (with a precision of 6 characters by default) and their 8 neighbors (to avoid faultlines) are used to narrow down the comparisons to addresses/places in a similar location. If there's no lat/lon, and the data are known to be from a single country, the postal code or the city name can optionally be used as the geo qualifier. Future improvements include disambiguating toponyms and mapping them to IDs in a hierarchy, such that multiple names for cities, etc. can resolve to one or more IDs, and e.g. an NYC address that uses a neighborhood name in place of the city e.g. "Harlem, NY" could match "New York, NY" by traversing the hierarchy and outputting the city's ID instead.
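To make the geo qualifier concrete, here is a small illustrative sketch (not libpostal's implementation) that computes a precision-6 geohash for a lat/lon and derives the 8 neighboring tiles by re-encoding the point shifted by one cell width/height in each direction. Pole and antimeridian edge cases are ignored.

```python
BASE32 = "0123456789bcdefghjkmnpqrstuvwxyz"

def geohash_encode(lat, lon, precision=6):
    """Standard geohash encoding: interleave longitude/latitude bits,
    emit 5 bits per base-32 character."""
    lat_lo, lat_hi = -90.0, 90.0
    lon_lo, lon_hi = -180.0, 180.0
    bits, ch, even, out = 0, 0, True, []
    while len(out) < precision:
        if even:  # longitude bit
            mid = (lon_lo + lon_hi) / 2
            ch = ch * 2 + (lon >= mid)
            lon_lo, lon_hi = (mid, lon_hi) if lon >= mid else (lon_lo, mid)
        else:     # latitude bit
            mid = (lat_lo + lat_hi) / 2
            ch = ch * 2 + (lat >= mid)
            lat_lo, lat_hi = (mid, lat_hi) if lat >= mid else (lat_lo, mid)
        even = not even
        bits += 1
        if bits == 5:
            out.append(BASE32[ch])
            bits, ch = 0, 0
    return "".join(out)

def tile_with_neighbors(lat, lon, precision=6):
    """The tile containing the point plus its 8 neighbors, found by
    re-encoding the point offset by one cell in each direction.
    Cells at a fixed precision form a uniform lat/lon grid."""
    total = precision * 5
    lon_bits, lat_bits = (total + 1) // 2, total // 2
    dlon, dlat = 360.0 / (1 << lon_bits), 180.0 / (1 << lat_bits)
    return {geohash_encode(lat + i * dlat, lon + j * dlon, precision)
            for i in (-1, 0, 1) for j in (-1, 0, 1)}

print(geohash_encode(57.64911, 10.40744))            # 'u4pruy'
print(len(tile_with_neighbors(57.64911, 10.40744)))  # 9
```

Two records whose tile sets intersect fall into the same candidate group; everything else is never compared.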
Component-wise deduping
Once we have potential candidate dupe pairs, we provide per-component methods for comparing address/name pairs and determining if they're duplicates. Each relevant address component has its own function, with certain logic for each, including which libpostal dictionaries to use, and whether a root expansion match counts as an exact duplicate or not. For instance, in a secondary unit, "# 2", "Apt 2", and "Apt # 2" can be considered an exact match in English, whereas we wouldn't want to make that kind of assumption for street names, e.g. "Park Ave" and "Park Pl". In the latter case, we can still classify the street names as needing to be reviewed by a human.
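As a rough illustration of the per-component logic, here is a toy sketch with made-up dictionaries and status names; libpostal's real dictionaries are per-language and far more complete, and this is not its actual API.

```python
# Toy ignorable-token dictionaries, for illustration only.
IGNORABLE = {
    "unit": {"apt", "apartment", "no", "#", "unit"},
    "street": {"st", "street", "ave", "avenue", "rd", "road", "pl", "place"},
}

def root_tokens(text, component):
    """Strip tokens that are ignorable for this component."""
    tokens = text.lower().replace(".", "").split()
    roots = [t for t in tokens if t not in IGNORABLE[component]]
    return roots or tokens  # don't let the root collapse to nothing

def is_dupe_unit(a, b):
    # For secondary units, a root-expansion match counts as exact.
    if root_tokens(a, "unit") == root_tokens(b, "unit"):
        return "exact_duplicate"
    return "non_duplicate"

def is_dupe_street(a, b):
    # For street names, a root match alone only flags the pair for review.
    if a.lower() == b.lower():
        return "exact_duplicate"
    if root_tokens(a, "street") == root_tokens(b, "street"):
        return "possible_duplicate_needs_review"
    return "non_duplicate"

print(is_dupe_unit("Apt 2", "# 2"))           # exact_duplicate
print(is_dupe_street("Park Ave", "Park Pl"))  # possible_duplicate_needs_review
```

The point of the split is visible in the two policies: the same root match that is decisive for a unit is only advisory for a street name.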
The duplicate functions return one of the following values: exact_duplicate, likely_duplicate, possible_duplicate_needs_review, or non_duplicate.
The likely and exact classifications can be considered duplicates and merged automatically, whereas the needs_review response is for flagging possible duplicates.
Having special functions for each component can also be useful down the line e.g. for deduping with compound house numbers/ranges (though this is not implemented yet).
Since identifying the correct language is crucial to effective matching, and individual components like house_number and unit may not provide any useful information about the language, we also provide a function that returns the language(s) for an entire parsed/labeled address using all of its textual components. The returned language codes can be reused for subsequent calls.
Fuzzy deduping for names
For venue/street names, we also want to be able to handle inexact name matches, minor spelling differences, words out of order (we see this often with human names, which can sometimes be listed as "Last, First Middle"), and removal of tokens that may not be ignorable in terms of libpostal's dictionaries but are very common overall, or very common in a particular geography.
In this release, we implement a custom version of the Soft-TFIDF method, which blends a local similarity function (usually Jaro-Winkler in the literature, though we use a hybrid method), with global corpus statistics (TFIDF weights or similar, supplied by the user in our case, see the lieu project for constructing the relevant TFIDF and/or Geo-TFIDF scores from a given data set).
Here's how it works: each token in one string is paired with its most similar token in the other string under the local similarity function, and every pair that qualifies contributes the product of its similarity and the two tokens' TFIDF weights to the overall score.
Note: for the lieu project we use a linear combination of TFIDF and a geo-specific TFIDF score where the IDF index is computed for a specific, roughly city-sized geohash tile, where smaller tiles are lumped in with their larger neighbors. The geo-specific scores mean that something like "San Francisco Department of Building Inspection" and "Department of Building Inspection" can match because the words "San Francisco" are very common in the region. This approach was inspired by some of the research in https://research.fb.com/publications/deduplicating-a-places-database/.
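A bare-bones sketch of the Soft-TFIDF idea, using plain Jaro-Winkler as the local similarity and a flat default weight of 1.0 for unlisted tokens; libpostal's actual version uses the hybrid local metrics described below and user-supplied corpus statistics.

```python
import math

def jaro_winkler(s1, s2, p=0.1):
    """Textbook Jaro-Winkler similarity."""
    if s1 == s2:
        return 1.0
    len1, len2 = len(s1), len(s2)
    if not len1 or not len2:
        return 0.0
    window = max(len1, len2) // 2 - 1
    match1, match2 = [False] * len1, [False] * len2
    matches = 0
    for i, c in enumerate(s1):
        for j in range(max(0, i - window), min(len2, i + window + 1)):
            if not match2[j] and s2[j] == c:
                match1[i] = match2[j] = True
                matches += 1
                break
    if not matches:
        return 0.0
    transpositions, k = 0, 0
    for i in range(len1):
        if match1[i]:
            while not match2[k]:
                k += 1
            transpositions += s1[i] != s2[k]
            k += 1
    t = transpositions // 2
    jaro = (matches / len1 + matches / len2 + (matches - t) / matches) / 3
    prefix = 0
    for a, b in zip(s1, s2):
        if a != b or prefix == 4:
            break
        prefix += 1
    return jaro + prefix * p * (1 - jaro)

def soft_tfidf(tokens1, tokens2, weights, theta=0.9):
    """Pair each token with its closest counterpart; qualifying pairs
    (local similarity >= theta) contribute sim * w1 * w2, with the
    weight vectors normalized to unit length."""
    def unit(tokens):
        norm = math.sqrt(sum(weights.get(t, 1.0) ** 2 for t in tokens))
        return {t: weights.get(t, 1.0) / norm for t in tokens}
    w1, w2 = unit(tokens1), unit(tokens2)
    score = 0.0
    for t1 in tokens1:
        sim, best = max((jaro_winkler(t1, t2), t2) for t2 in tokens2)
        if sim >= theta:
            score += sim * w1[t1] * w2[best]
    return score

tokens_a = ["department", "of", "building", "inspection"]
weights = {"of": 0.1}  # downweight a very common word
print(round(soft_tfidf(tokens_a, tokens_a, weights), 3))   # 1.0
print(round(jaro_winkler("department", "dept"), 2))        # 0.86
```

Note that plain Jaro-Winkler scores "department" vs. "dept" below a typical 0.9 threshold, which is exactly the kind of gap the additional local metrics below (e.g. the affine-gap abbreviation check) are there to close.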
Unique to this implementation, we use a number of different local similarity metrics to qualify a given string for inclusion in the final similarity score: Jaro-Winkler similarity above a threshold, possible abbreviations detected with the affine gap alignment, acronym alignments (so "NYU" can match "New York University"), and known phrase matches that share the same canonical form ("Cty Rd" vs. "C.R." vs. "County Road").
The above assumes non-ideographic strings. In Chinese, Japanese, Korean, etc. we currently use the Jaccard similarity of the set of individual ideograms instead. In future versions it might be useful to weight the Jaccard similarity by TFIDF scores as well, and if we ever add a statistical word segmentation model for CJK languages, the word boundaries from that model could be used instead of ideograms.
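The ideographic comparison can be sketched as a set Jaccard over individual characters (the TFIDF weighting suggested above is left out of this sketch):

```python
def ideogram_jaccard(name1, name2):
    """Jaccard similarity over the sets of individual characters,
    a stand-in for the CJK comparison described above."""
    s1, s2 = set(name1), set(name2)
    if not s1 and not s2:
        return 1.0
    return len(s1 & s2) / len(s1 | s2)

# "東京都庁" (Tokyo Metropolitan Government) vs. "東京都" (Tokyo Metropolis):
print(ideogram_jaccard("東京都庁", "東京都"))  # 0.75
```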
The fuzzy duplicate methods are currently implemented for venue names and street names, which seemed to make the most sense. The output for these methods is a struct containing the dupe classification as well as the similarity value itself.
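In Python terms, the result of a fuzzy dupe check might look like the following; the field and status names here are illustrative, not the actual C struct definition.

```python
from dataclasses import dataclass

@dataclass
class FuzzyDuplicateResult:
    # Hypothetical field names mirroring the struct described above.
    status: str        # e.g. "likely_duplicate" or "possible_duplicate_needs_review"
    similarity: float  # the similarity score in [0, 1]

result = FuzzyDuplicateResult(status="likely_duplicate", similarity=0.94)
print(result.status, result.similarity)  # likely_duplicate 0.94
```

Returning the similarity alongside the classification lets a client apply its own thresholds, e.g. routing mid-range scores to human review.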