This repo implements the disambiguation methodology outlined in "How Unique and Traceable are Usernames?", which the paper uses to link users across platforms. While the paper focuses on usernames, I've typically used the method as an additional feature in record linkage tasks -- for example, linking campaign contributions to employment data.
>>> from NameProbability import NameMatcher
>>> name_list_src = '#LOCATION OF NAME LIST FILE' # or use sample_names.csv in data directory
>>> # a custom name list should be a text file with one person's name per row
>>> # this has only been tested with "first last" and "last, first" name formats
>>> nameprob = NameMatcher(name_list_location=name_list_src, last_comma_first=True)
>>> nameprob.probSamePerson('john smith', 'john r smith')
0.008288431595531668
>>> nameprob.probSamePerson('zubin jelveh', 'zubin r jelveh')
0.999999999999234634
python setup.py install
To compute P(u_1 | u_2) -- the probability that person A uses name one given that person A uses name two -- we need the probability of each edit operation that takes us from u_1 to u_2. The current implementation estimates these probabilities empirically by taking a sample of 50,000 names and counting how often each type of edit operation occurs. There is room for improvement here.
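As a rough illustration of the counting idea, and not the code this repo actually runs, the sketch below tallies edit operations over pairs of names and turns the counts into relative frequencies. It makes several assumptions: the function names `edit_operations` and `edit_operation_probabilities` are hypothetical, the opcodes from Python's standard `difflib.SequenceMatcher` stand in for whatever edit-operation definition the implementation uses, and the input is a list of name pairs already believed to belong to the same person.

```python
from collections import Counter
from difflib import SequenceMatcher


def edit_operations(source, target):
    """Yield (operation, removed_text, added_text) for each difference.

    Hypothetical helper: uses difflib opcodes ("replace", "delete",
    "insert") as a stand-in for the paper's edit operations.
    """
    matcher = SequenceMatcher(a=source, b=target, autojunk=False)
    for tag, i1, i2, j1, j2 in matcher.get_opcodes():
        if tag != "equal":
            yield (tag, source[i1:i2], target[j1:j2])


def edit_operation_probabilities(name_pairs):
    """Estimate the relative frequency of each edit operation.

    name_pairs is assumed to be an iterable of (name_a, name_b) tuples
    for names thought to refer to the same person.
    """
    counts = Counter()
    for source, target in name_pairs:
        counts.update(edit_operations(source, target))
    total = sum(counts.values())
    if total == 0:
        return {}
    return {operation: n / total for operation, n in counts.items()}


# Toy input; a real estimate would use a large sample of names.
pairs = [("abc", "abd"), ("abc", "abd"), ("ab", "abc"), ("abc", "ab")]
probabilities = edit_operation_probabilities(pairs)
```

With frequencies like these, one way to score a candidate transformation from one name to another would be to multiply the estimated probabilities of its edit operations, treating the operations as independent. That independence is itself a simplifying assumption.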