Feature Request: Street suffixes #13

Open
shardsofblue opened this issue Oct 24, 2019 · 8 comments
Comments

@shardsofblue

Would it be reasonable to create something similar to the bus_suffix function, but for common street suffixes, such as "avenue," "Ave.", "ave", "street", "st.", "St.", and so on? Avenues, Streets, Boulevards, etc. would have to remain distinct from one another, but it would be helpful for the package to be insensitive to common variations within each type.

@ChrisMuir
Owner

Hi @shardsofblue, thanks, this is an interesting idea! I should have some time soon to think about this and work on it. I'll keep you posted, thanks again!

@ChrisMuir
Owner

Hi @shardsofblue, I messed around with this some today and pushed initial commits to the address-dev branch.

Here's a quick demo:

x <- c(
  "John Smith Nulla St. Mankato Mississippi 96522", 
  "John Smith Nulla street Mankato Mississippi 96522", 
  "John Smith Nulla Rd. Mankato Mississippi 96522",
  "John Smith Nulla Road Mankato Mississippi 96522"
)

refinr::key_collision_merge(x)
#> [1] "John Smith Nulla St. Mankato Mississippi 96522" "John Smith Nulla St. Mankato Mississippi 96522"
#> [3] "John Smith Nulla Rd. Mankato Mississippi 96522" "John Smith Nulla Rd. Mankato Mississippi 96522"

The original code would not edit any of these strings. For now, the updated code handles the suffixes avenue, street, road, and boulevard.
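To make the idea concrete, here is a minimal base-R sketch of suffix normalization. It is not the code on the address-dev branch; `suffix_map` and `normalize_suffix()` are hypothetical names made up for illustration, and the lookup table only covers the four suffix types mentioned above.

```r
# Hypothetical lookup table (an assumption, not refinr's internal data):
# each known variant maps to one canonical token per suffix type.
suffix_map <- c(
  street = "st", st = "st",
  avenue = "ave", ave = "ave",
  road = "rd", rd = "rd",
  boulevard = "blvd", blvd = "blvd"
)

# Hypothetical helper: lowercase, drop punctuation, then swap any token
# found in the lookup table for its canonical form.
normalize_suffix <- function(x) {
  x <- tolower(trimws(x))
  x <- gsub("[[:punct:]]", "", x)
  tokens <- strsplit(x, "[[:space:]]+")
  vapply(tokens, function(tok) {
    hit <- tok %in% names(suffix_map)
    tok[hit] <- unname(suffix_map[tok[hit]])
    paste(tok, collapse = " ")
  }, character(1))
}

normalized <- normalize_suffix(
  c("Nulla St.", "Nulla street", "Nulla Rd.", "Nulla Road")
)
```

In this sketch the two street variants end up with the same normalized string, as do the two road variants, while streets and roads stay distinct. One caveat of matching on any token: "St" can also stand for "Saint", so a real implementation would likely need more care than this.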

If you have a chance to try it out, I'd love to get feedback. Let me know what you think. Thanks!

@shardsofblue
Author

That's looking good! Thanks for looking into adding this feature! I'd suggest also adding Highway/Hwy, Court/Ct, Lane/Ln, Circle/Cir, Parkway/Pkwy.

I tried it with your demo and got the same results, but when I tried it on a test data frame from my own data, I am not seeing results. Am I using it incorrectly?

library(tibble)  # provides tribble()

z <- tribble(
  ~Institution, ~Address, ~Institution_Type,
  "American Rescue Workers", "11 West Clement Street", "Emergency Shelter",
  "Baltimore Rescue Mission", "4 North Central Avenue", "Emergency Shelter",
  "Baltimore Outreach Services", "701 South Charles St. ", "Emergency Shelter",
  "Helping Up Mission", "1029 East Baltimore Ave", "Emergency Shelter",
  "Karis Home", "1228 East Baltimore St", "Emergency Shelter",
  "Loving Arms", "3313 Oakfield Ave", "Emergency Shelter",
  "MCVET-Veterans", "301 North High Rd.", "Emergency Shelter",
  "Project PLASE - Men", " 201 East North Road", "Emergency Shelter",
  "Project PLASE - Women", "139 East North road", "Emergency Shelter",
  "Salvation Army/Booth House", "1114 North Calvert rd", "Emergency Shelter",
  "Sarah’s Hope Shelter", "1114 Mount street", "Emergency Shelter"
  )

(refinr::key_collision_merge(z$Address))

@ChrisMuir
Owner

Hello, thanks for testing and for the feedback. Great suggestion to add Highway/Hwy, Court/Ct, Lane/Ln, Circle/Cir, Parkway/Pkwy; I will push that edit to this branch today.

With the example you gave, none of the address strings will be grouped together and merged, because none of them are similar enough. For example, " 201 East North Road" and "139 East North road" are very similar, but the differing street numbers mean they won't be treated as candidates for grouping/merging.
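A rough way to see why: key collision clustering only groups strings that reduce to the same key. The sketch below is a simplified fingerprint written in base R for illustration; `fingerprint()` is a hypothetical helper, and the steps refinr actually uses to build its keys may differ.

```r
# Hypothetical, simplified fingerprint key (an assumption, not refinr's code):
# lowercase, strip punctuation, split into tokens, de-duplicate, sort, rejoin.
fingerprint <- function(x) {
  x <- tolower(trimws(x))
  x <- gsub("[[:punct:]]", "", x)
  tokens <- strsplit(x, "[[:space:]]+")
  vapply(
    tokens,
    function(tok) paste(sort(unique(tok)), collapse = " "),
    character(1)
  )
}

# Street numbers are part of the key, so these two keys differ.
keys_with_numbers <- fingerprint(c(" 201 East North Road", "139 East North road"))

# Without the numbers, the two strings collapse to the same key.
keys_without_numbers <- fingerprint(c("East North Road", "East North road"))
```

Under this sketch, two strings are merge candidates only when their keys are identical, which is why a different house number is enough to keep two otherwise matching addresses apart.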

@shardsofblue
Author

Ah of course, when you put it that way it makes perfect sense. I was not using it as intended. I was expecting it to turn all variations of road into Rd., which is not at all what it is meant for. Nevertheless, this should be very helpful to streamline the address cleaning process. Perhaps as part of my workflow I will split the street numbers, run refinr on the street name column, and then recombine them.
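The split, clean, and recombine workflow described above could look roughly like this. It is only a sketch: it assumes every house number is a leading run of digits, the variable names are made up for illustration, and the only refinr function it calls is `key_collision_merge()`. If refinr is not installed, the street names are passed through unchanged.

```r
addresses <- c(" 201 East North Road", "139 East North road", "1114 North Calvert rd")
addresses <- trimws(addresses)

# Assumption: a house number is a leading run of digits followed by whitespace.
has_number <- grepl("^[0-9]+[[:space:]]+", addresses)
street_number <- ifelse(has_number, sub("[[:space:]]+.*$", "", addresses), NA_character_)
street_name <- ifelse(has_number, sub("^[0-9]+[[:space:]]+", "", addresses), addresses)

# Cluster and merge the street names only, so differing house numbers
# no longer keep similar street names apart.
if (requireNamespace("refinr", quietly = TRUE)) {
  street_name_clean <- refinr::key_collision_merge(street_name)
} else {
  street_name_clean <- street_name  # fallback: leave names as they are
}

# Recombine, leaving addresses without a house number as just the street name.
cleaned <- ifelse(
  has_number,
  paste(street_number, street_name_clean),
  street_name_clean
)
```

Whether the merged street names come out as "Road", "road", or "rd" depends on which variant refinr picks for each cluster, so a final pass to standardize capitalization may still be needed.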

@ChrisMuir
Owner

Yep you got it, everything you said is correct 👍

I updated the branch to cover more cases. Here's an example:

x <- c(
  "John Smith Nulla St., Mankato Mississippi 96522", 
  "John Smith Nulla street Mankato Mississippi 96522", 
  "John Smith Nulla Rd. Mankato Mississippi 96522",
  "John Smith Nulla Road, Mankato Mississippi 96522",
  "John Smith Nulla BLVD. Mankato Mississippi 96522",
  "John Smith Nulla Boulevard Mankato Mississippi 96522",
  "John Smith Nulla hwy., Mankato Mississippi 96522",
  "John Smith Nulla HWY Mankato Mississippi 96522",
  "John Smith Nulla highway Mankato Mississippi 96522",
  "John Smith Nulla highWay, Mankato Mississippi 96522",
  "John Smith Nulla circle, Mankato Mississippi 96522",
  "John Smith Nulla cir. Mankato Mississippi 96522",
  "John Smith Nulla ct Mankato Mississippi 96522",
  "John Smith Nulla couRt Mankato Mississippi 96522",
  "John Smith Nulla ln Mankato Mississippi 96522",
  "John Smith Nulla lane, Mankato Mississippi 96522",
  "John Smith Nulla pkwy Mankato Mississippi 96522",
  "John Smith Nulla parkway Mankato Mississippi 96522"
)

refinr::key_collision_merge(x)
#>  [1] "John Smith Nulla St., Mankato Mississippi 96522"    "John Smith Nulla St., Mankato Mississippi 96522"    
#>  [3] "John Smith Nulla Rd. Mankato Mississippi 96522"     "John Smith Nulla Rd. Mankato Mississippi 96522"    
#>  [5] "John Smith Nulla BLVD. Mankato Mississippi 96522"   "John Smith Nulla BLVD. Mankato Mississippi 96522"  
#>  [7] "John Smith Nulla HWY Mankato Mississippi 96522"     "John Smith Nulla HWY Mankato Mississippi 96522"    
#>  [9] "John Smith Nulla HWY Mankato Mississippi 96522"     "John Smith Nulla HWY Mankato Mississippi 96522"    
#> [11] "John Smith Nulla cir. Mankato Mississippi 96522"    "John Smith Nulla cir. Mankato Mississippi 96522"   
#> [13] "John Smith Nulla couRt Mankato Mississippi 96522"   "John Smith Nulla couRt Mankato Mississippi 96522"  
#> [15] "John Smith Nulla lane, Mankato Mississippi 96522"   "John Smith Nulla lane, Mankato Mississippi 96522"   
#> [17] "John Smith Nulla parkway Mankato Mississippi 96522" "John Smith Nulla parkway Mankato Mississippi 96522"

I will clean the edits up some, merge to master, and at some point soon I will send the edits to CRAN. Thanks again for the great idea!

Also, this is super random, but I just started getting involved with a non-profit based in Richmond, VA that's focused on criminal justice reform. The founder has a big need for VA court case data; we talked about web scraping options, but I noticed that you have done a few different analytical deep dives that used data from http://virginiacourtdata.org/ . I brought the resource to the founder's attention, and they weren't aware of it, so I might try to procure the de-anonymized data from the website on behalf of the non-profit. If I do that, I for sure plan on reading through your data prep documentation, but would you mind if I also reached out to you with questions about the data cleaning and processing steps that you used?

Thanks!

@shardsofblue
Author

shardsofblue commented Nov 8, 2019

Thanks for implementing this! I'm sure it will significantly speed up the cleaning process for addresses. I'll pass notice of this update on to the Slack I'm in for data journos.

As for your request, the VA court data was some of the first data processing I ever worked on, so my work there is messy at best. But you're quite welcome to contact me if I can be of help. I published my prep process notes here. GitHub's notification system seems somewhat unreliable, so feel free to use my email: rready at umd dot edu.

@ChrisMuir
Owner

Awesome, glad the feature will be helpful to you, and thanks again for opening an issue about it. At some point this weekend I will update the README / Vignette / Docs, then merge to master.

Cool, thank you so much for being willing to help, I appreciate it!
