Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Fix tests with Python 3.11.4+ #213

Merged
merged 3 commits into from
Aug 2, 2023
Merged

Fix tests with Python 3.11.4+ #213

merged 3 commits into from
Aug 2, 2023

Conversation

shadchin
Copy link
Contributor

Fix #212

@codecov
Copy link

codecov bot commented Jul 3, 2023

Codecov Report

Merging #213 (8abe8ee) into master (5393dbe) will not change coverage.
The diff coverage is n/a.

❗ Current head 8abe8ee differs from pull request most recent head 6a5235a. Consider uploading reports for the commit 6a5235a to get more accurate results

@@           Coverage Diff           @@
##           master     #213   +/-   ##
=======================================
  Coverage   96.18%   96.18%           
=======================================
  Files           9        9           
  Lines         498      498           
  Branches       94       94           
=======================================
  Hits          479      479           
  Misses          9        9           
  Partials       10       10           

@wRAR
Copy link
Member

wRAR commented Jul 3, 2023

Can you please run black?

@shadchin
Copy link
Contributor Author

shadchin commented Jul 3, 2023

Done

else:
KNOWN_SAFE_URL_STRING_URL_ISSUES.add(
f"https://{USERNAME_TO_ENCODE}:{PASSWORD_TO_ENCODE}@example.com"
)
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

So, to make sure I understand this correctly:

  1. Both safe_url_string("http://[2a01:5cc0:1:2:3:4]") and safe_url_string('https://"%;<=>@[]^`{|}\x7f:"%;<=>@[]^`{|}\x7f:@example.com') work on older Python versions but raise a ValueError on 3.11.4.
  2. test_safe_url_string_url is a test that takes input and output of safe_url_string() from SAFE_URL_URL_CASES and expects that output unless the input is in KNOWN_SAFE_URL_STRING_URL_ISSUES.
  3. Hence, KNOWN_SAFE_URL_STRING_URL_ISSUES is basically a list of cases that should xfail.
  4. SAFE_URL_URL_CASES says that "http://[2a01:5cc0:1:2:3:4]" should give ValueError and this is true only on the new Python, so this change removes xfail for it on the new Python.
  5. SAFE_URL_URL_CASES says that safe_url_string('https://"%;<=>@[]^`{|}\x7f:"%;<=>@[]^`{|}\x7f:@example.com') should be encoded correctly and that works only on older Python, so this change adds xfail for it on the new Python.

With this in mind I feel like we should instead change the USERNAME_TO_ENCODE and/or PASSWORD_TO_ENCODE so that it works on all Python versions instead. It's probably enough to remove [] from there?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Yes, if remove [] from USERNAME_TO_ENCODE, then test pass

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Cool, can you please make this change?

Copy link
Member

@wRAR wRAR left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thanks!

@wRAR wRAR requested a review from Gallaecio July 13, 2023 15:04
@Gallaecio
Copy link
Member

I have been delaying reviewing this change until I could put some time aside to properly look into it, as it felt to me like something was not right.

After a proper review, I think this change is actually perfect as it is, so thank you very much @shadchin.

But I feel the need to clarify, for future reference, what’s happening here since Python 3.11.4, which is 2 things:

  1. Python now fails as expected with invalid IPv6 domains. This is good news for us.
  2. Python now fails on user and password fields containing brackets ([]). This is bad news for us.

To elaborate on 2:

On Python 3.11.3 and earlier, safe_url_string was able to fix https://user[]:password[]@example.com as a browser would:

>>> safe_url_string("https://user[]:password[]@example.com")
'https://user%5B%5D:password%5B%5D@example.com'

Since 3.11.4, that’s not possible, the URL is unparsable altogether by the Python standard library:

>>> safe_url_string("https://user[]:password[]@example.com")
Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
  File "/home/adrian/temporal/w3lib-213/venv/lib/python3.11/site-packages/w3lib/url.py", line 142, in safe_url_string
    parts = urlsplit(_strip(decoded))
            ^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/home/adrian/.pyenv/versions/3.11.4/lib/python3.11/urllib/parse.py", line 500, in urlsplit
    _check_bracketed_host(bracketed_host)
  File "/home/adrian/.pyenv/versions/3.11.4/lib/python3.11/urllib/parse.py", line 446, in _check_bracketed_host
    ip = ipaddress.ip_address(hostname) # Throws Value Error if not IPv6 or IPv4
         ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/home/adrian/.pyenv/versions/3.11.4/lib/python3.11/ipaddress.py", line 54, in ip_address
    raise ValueError(f'{address!r} does not appear to be an IPv4 or IPv6 address')
ValueError: '' does not appear to be an IPv4 or IPv6 address

This is OK for Python, but not for us, that strive to fix bad URLs where possible, and here it should be (and in fact used to be) possible.

I dream to some day have our own URL parsing implementation, with a good performance but a more browser-like behavior, so that we have control over things like this 😓 .

@Gallaecio Gallaecio merged commit 5357628 into scrapy:master Aug 2, 2023
12 checks passed
@shadchin shadchin deleted the patch-1 branch August 2, 2023 21:43
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

test_safe_url_string_url regressed on 3.11.4
3 participants