optimization and simplification suggestions #31

Closed
petri opened this issue Feb 7, 2017 · 1 comment

petri commented Feb 7, 2017

switch to function-based API

  • instantiating a class for each cleaned name makes no sense: it is overly complex and adds unnecessary work, especially now that most of the setup code lives outside the class
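To make the contrast concrete, here is a minimal sketch of the two call styles. The class name `NameCleaner`, the `TERMS` tuple, and the exact signatures are hypothetical placeholders for illustration, not the library's actual API.

```python
# Hypothetical term list; the real project derives its terms from data files.
TERMS = ("ltd", "inc", "oy")


class NameCleaner:
    """Class-based style: one instance is created per name."""

    def __init__(self, business_name):
        # Normalize whitespace by splitting and rejoining.
        self.business_name = " ".join(business_name.split())

    def clean_name(self):
        for term in TERMS:
            if self.business_name.lower().endswith(" " + term):
                return self.business_name[: -len(term) - 1]
        return self.business_name


def clean_name(business_name, terms=TERMS):
    """Function-based style: no per-name object, terms passed in explicitly."""
    name = " ".join(business_name.split())
    for term in terms:
        if name.lower().endswith(" " + term):
            return name[: -len(term) - 1]
    return name
```

Both produce the same result, but the function needs no object per name: `clean_name("Acme  Ltd")` versus `NameCleaner("Acme  Ltd").clean_name()`.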

switch to working on whitespace-separated name parts rather than full strings

In effect, for a suffix we would check business_name.split()[-1] == term rather than business_name.endswith(' ' + term). The splitting would of course be done just once, at the beginning.

  • at the moment, the class already splits and rejoins the name to get rid of extra whitespace
  • at the moment, the code already looks for a prefix/suffix that's padded by a single space, so in effect the result is the same
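The equivalence claimed above can be sketched as follows for single-part terms. The function names are hypothetical and only illustrate the two checks side by side.

```python
def has_suffix_by_string(business_name, term):
    # Current approach: normalize whitespace, then compare against the padded term.
    normalized = " ".join(business_name.split())
    return normalized.endswith(" " + term)


def has_suffix_by_parts(name_parts, term):
    # Proposed approach: compare the last whitespace-separated part directly.
    # Requiring more than one part mirrors the leading space in the string
    # check, so a name consisting only of the term does not match.
    return len(name_parts) > 1 and name_parts[-1] == term


name = "Acme   Widgets  Ltd"
parts = name.split()  # split once, up front
```

With this setup, `has_suffix_by_string(name, "Ltd")` and `has_suffix_by_parts(parts, "Ltd")` agree, and the second one touches only the last part of the name.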

If we can handle the fact that some legal terms are "multi-part" (whitespace-separated), this would simplify the code and make it run faster. For example, for a suffix we would only have to look at the last whitespace-separated name part, and for a prefix only the first. There are other cases, too.

We would not have to presort the data, either.
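One possible way to handle multi-part terms without presorting is to store every term as a tuple of parts in a set and try the longest candidate suffix first. This is a sketch under that assumption; `prepare_terms` and `strip_suffix` are hypothetical names, and lowercasing for comparison is an added assumption.

```python
def prepare_terms(raw_terms):
    """Turn term strings into a set of token tuples, e.g. ('pty', 'ltd')."""
    return {tuple(term.lower().split()) for term in raw_terms}


def strip_suffix(name_parts, terms):
    """Remove the longest matching suffix term from a list of name parts."""
    max_len = max((len(t) for t in terms), default=0)
    # Try the longest candidate first so that 'pty ltd' wins over 'ltd'.
    # Stop at len(name_parts) - 1 so at least one part of the name survives.
    for size in range(min(max_len, len(name_parts) - 1), 0, -1):
        candidate = tuple(part.lower() for part in name_parts[-size:])
        if candidate in terms:
            return name_parts[:-size]
    return name_parts
```

Because membership is a set lookup on the trailing parts, the order of the term data does not matter; only the number of parts in the longest term bounds the loop.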

don't use both legal and countrywise suffixes in clean_name

  • there are a lot of duplicates between the two, so it should be enough to use just one of them (preferably the countrywise data, since that would make it easy to drop individual countries)
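As a rough illustration of why the countrywise data alone could suffice, the sketch below collects each term once and lets whole countries be left out. The `TERMS_BY_COUNTRY` sample and the `unique_terms` helper are hypothetical and stand in for the project's real data and code.

```python
# Hypothetical sample; the real data set is much larger.
TERMS_BY_COUNTRY = {
    "Finland": ["oy", "oyj", "ab"],
    "Sweden": ["ab", "hb"],
    "United Kingdom": ["ltd", "plc"],
}


def unique_terms(terms_by_country, exclude=()):
    """Collect each term once, skipping any excluded countries."""
    return {
        term
        for country, terms in terms_by_country.items()
        if country not in exclude
        for term in terms
    }
```

A term shared by several countries (here "ab") appears only once in the result, and it remains available as long as at least one non-excluded country lists it.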

petri commented Apr 26, 2020

As of 2.0, the following optimizations are in place:

  • function-based API added
  • term search works on split names and terms rather than directly on strings; see the optimization2 branch for code that compares the effect of this (3x speedup)
  • the term preparation code generates unique terms

These are pretty much what this request was asking for, so closing.

petri closed this as completed Apr 26, 2020