
Added tests and sub-domain validation #20

Open · wants to merge 1 commit into base: master

Conversation

@byarmis commented Oct 11, 2019

Description


Added tests for utils/urls.py and validation for sub-domains (www.foobar.google.com is considered part of the google.com domain, while www.google.foobar.com is not).

Closes (Issues)

Closes #16

Additional Information

Added nose==1.3.7 to dev-requirements.txt.

@vitorfhc (Owner) left a comment

Thanks for your contribution! I hope we can discuss the requested changes further!

Comment on lines +7 to +10
    if string.startswith(to_remove):
        return string[len(to_remove):]
    else:
        return string
Owner:

It would be interesting to do the removal repeatedly:
    while string.startswith(to_remove):
Inputs like https://https://google.com would then be handled correctly.
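The repeated removal suggested above could be sketched as follows; the PR's snippet doesn't show the function's name, so `remove_prefix` is a hypothetical stand-in:

```python
def remove_prefix(string, to_remove):
    # Strip the prefix repeatedly, so doubled leaders such as
    # "https://https://google.com" are fully removed rather than
    # leaving one copy behind.
    while string.startswith(to_remove):
        string = string[len(to_remove):]
    return string
```

With this loop, `remove_prefix('https://https://google.com', 'https://')` yields `'google.com'`, and a string without the prefix passes through unchanged.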

Author:

That's not a valid URL to begin with; what do you think the function should do?

I'm thinking validate_url('an invalid url that ends with google.com', 'google.com') should return False.

Owner:

It is not a valid URL, but since the user is the one providing the input, we have to consider all such cases. If they pass https://https://bla.com and we remove only one of the https:// prefixes, we are left with https://bla.com, which breaks our logic; I think it breaks where the URL is split using / as the delimiter.
I'm not sure whether the right decision is to accept invalid inputs and normalize them (HTML does this a lot) or to return an error to the user. A regex following the RFC specification of how URLs should look would be interesting.
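A lightweight plausibility check along the lines discussed above could reject malformed inputs before the domain logic runs. This is only a sketch, assuming a conservative hostname pattern rather than a full RFC 3986 validator, and `is_plausible_url` is a hypothetical name:

```python
import re
from urllib.parse import urlparse

# Conservative hostname pattern: dot-separated labels of letters,
# digits, and interior hyphens. Deliberately stricter than RFC 3986.
HOST_RE = re.compile(
    r'^[a-z0-9]([a-z0-9-]*[a-z0-9])?(\.[a-z0-9]([a-z0-9-]*[a-z0-9])?)+$',
    re.IGNORECASE,
)

def is_plausible_url(url):
    # urlparse puts everything between "://" and the next "/" into
    # netloc, so "https://https://bla.com" gets netloc "https:", which
    # the pattern rejects because of the colon.
    parsed = urlparse(url)
    return parsed.scheme in ('http', 'https', 'ftp') and bool(HOST_RE.match(parsed.netloc))
```

Under this check, `https://google.com` is accepted, while both `https://https://bla.com` and a scheme-less string like `an invalid url that ends with google.com` are rejected.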

return True
return False

for leader in ('https://', 'http://', 'www', 'ftp://'):
Owner:

I think www. (with the dot) would be the right choice instead of www. It is not causing any errors because the split on line 28 fixes it, but it would be better not to propagate an error that has to be fixed on another line.
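Combining the dotted `www.` leader with the repeated-removal idea from earlier in the thread might look like the sketch below; `strip_leaders` is a hypothetical name, not the PR's actual function:

```python
LEADERS = ('https://', 'http://', 'www.', 'ftp://')

def strip_leaders(url):
    # Keep stripping leaders until none match, so "https://www.google.com"
    # loses both prefixes. Using "www." (with the dot) means a host like
    # "wwwexample.com" is correctly left alone.
    changed = True
    while changed:
        changed = False
        for leader in LEADERS:
            if url.startswith(leader):
                url = url[len(leader):]
                changed = True
    return url
```

For example, `strip_leaders('https://www.google.com')` reduces to `'google.com'` in two passes, while `'wwwexample.com'` is untouched.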

Comment on lines +30 to +31
    if expected_domain not in url:
        return False
Owner:

This is code duplication from lines 18-19; both checks could be deleted. They make a weak check that is covered correctly later in the same function.

Comment on lines +33 to +49
for loc, url_piece in enumerate(url[::-1]):
    if loc == 0:
        continue
    elif loc == 1:
        if url_piece == expected_domain:
            return True
        if len(url_piece) > 3:
            return False

    elif loc == 2:
        if url_piece == expected_domain:
            return True
        if len(url_piece) > 3:
            return False

    else:
        return False
Owner:

The if loc == 0: is not good practice. I see the intention of ignoring the top-level domain (.com, .net, etc.), but if that were really needed, a smarter choice would be to iterate over list[1:], since you always skip the first element.
An even better approach is to remove all top-level domains from the string: top_level_domains = ['.com', '.net', '.edu', ... ]. It could be done the same way we remove the leaders.
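Stripping TLDs "the same way we remove the leaders" might be sketched like this; the TLD list here is illustrative, not exhaustive, and both function names are hypothetical:

```python
# Illustrative subset only; a real implementation would need the
# full public-suffix list or, as discussed below, just '.com'.
TOP_LEVEL_DOMAINS = ('.com', '.net', '.edu', '.org')

def remove_suffix(string, to_remove):
    # Mirror of the leader removal, but working from the end.
    while string.endswith(to_remove):
        string = string[:-len(to_remove)]
    return string

def strip_tld(host):
    for tld in TOP_LEVEL_DOMAINS:
        host = remove_suffix(host, tld)
    return host
```

For example, `strip_tld('www.google.com')` gives `'www.google'`, after which the remaining labels can be compared against the expected domain directly.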

Author:

Should I instead add the full list of possible TLDs, since there could be quite a few?

Owner:

Good idea, but it looks like we could get a lot of false positives. Since we are only using codeforces as a scrape target, maybe we could consider only the .com TLD.

Author:

Oh! Ok, haha that'll greatly simplify things :)

@vitorfhc added the tests label Oct 14, 2019
Successfully merging this pull request may close these issues:
[Improvement] URL validation is not considering subdomains