Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

urlparse incorrectly retrieves IPv4 and regular name hosts from inside of brackets #103848

Closed
JohnJamesUtley opened this issue Apr 25, 2023 · 6 comments
Assignees
Labels
stdlib Python modules in the Lib dir type-bug An unexpected behavior, bug, or error

Comments

@JohnJamesUtley
Copy link
Contributor

JohnJamesUtley commented Apr 25, 2023

Background

RFC 3986 defines a host as follows

host = IP-literal / IPv4address / reg-name

Where

IP-literal = "[" ( IPv6address / IPvFuture  ) "]"
reg-name = *( unreserved / pct-encoded / sub-delims )
IPv4address = dec-octet "." dec-octet "." dec-octet "." dec-octet

WhatWG says that "A valid host string must be a valid domain string, a valid IPv4-address string, or: U+005B ([), followed by a valid IPv6-address string, followed by U+005D (])."

The Bug

This is code from Lib/urllib/parse.py:196-208 used for retrieving the hostname from the netloc

    @property
    def _hostinfo(self):
        netloc = self.netloc
        _, _, hostinfo = netloc.rpartition('@')
        _, have_open_br, bracketed = hostinfo.partition('[')
        if have_open_br:
            hostname, _, port = bracketed.partition(']')
            _, _, port = port.partition(':')
        else:
            hostname, _, port = hostinfo.partition(':')
        if not port:
            port = None
        return hostname, port

It will incorrectly retrieve IPv4 addresses and regular name hosts from inside brackets. This is in violation of both specifications.

Minimally reproducible example:

from urllib.parse import urlsplit

parsedURL = urlsplit('scheme://user@[regname]/Path')
print(parsedURL.hostname) # Prints 'regname'

Your environment

  • CPython versions tested on:
  • Operating system and architecture:
    • Arch Linux x86_64

Linked PRs

@JohnJamesUtley JohnJamesUtley added the type-bug An unexpected behavior, bug, or error label Apr 25, 2023
@arhadthedev arhadthedev added the stdlib Python modules in the Lib dir label Apr 26, 2023
@gpshead gpshead self-assigned this May 9, 2023
gpshead added a commit that referenced this issue May 10, 2023
…it are of IPv6 or IPvFuture format (#103849)

* Adds checks to ensure that bracketed hosts found by urlsplit are of IPv6 or IPvFuture format

---------

Co-authored-by: Gregory P. Smith <greg@krypto.org>
miss-islington pushed a commit to miss-islington/cpython that referenced this issue May 10, 2023
…urlsplit are of IPv6 or IPvFuture format (pythonGH-103849)

* Adds checks to ensure that bracketed hosts found by urlsplit are of IPv6 or IPvFuture format

---------

(cherry picked from commit 29f348e)

Co-authored-by: JohnJamesUtley <81572567+JohnJamesUtley@users.noreply.github.com>
Co-authored-by: Gregory P. Smith <greg@krypto.org>
gpshead pushed a commit to gpshead/cpython that referenced this issue May 10, 2023
…und by urlsplit are of IPv6 or IPvFuture format (pythonGH-103849)

* Adds checks to ensure that bracketed hosts found by urlsplit are of IPv6 or IPvFuture format

---------

Co-authored-by: Gregory P. Smith <greg@krypto.org>
(cherry picked from commit 29f348e)

Co-authored-by: JohnJamesUtley <81572567+JohnJamesUtley@users.noreply.github.com>
gpshead added a commit that referenced this issue May 10, 2023
… urlsplit are of IPv6 or IPvFuture format (GH-103849) (#104349)

gh-103848: Adds checks to ensure that bracketed hosts found by urlsplit are of IPv6 or IPvFuture format (GH-103849)

* Adds checks to ensure that bracketed hosts found by urlsplit are of IPv6 or IPvFuture format

---------

(cherry picked from commit 29f348e)

Co-authored-by: JohnJamesUtley <81572567+JohnJamesUtley@users.noreply.github.com>
Co-authored-by: Gregory P. Smith <greg@krypto.org>
carljm added a commit to carljm/cpython that referenced this issue May 10, 2023
* main:
  pythonGH-102181: Improve specialization stats for SEND (pythonGH-102182)
  pythongh-103000: Optimise `dataclasses.asdict` for the common case (python#104364)
  pythongh-103538: Remove unused TK_AQUA code (pythonGH-103539)
  pythonGH-87695: Fix OSError from `pathlib.Path.glob()` (pythonGH-104292)
  pythongh-104263: Rely on Py_NAN and introduce Py_INFINITY (pythonGH-104202)
  pythongh-104010: Separate and improve docs for `typing.get_origin` and `typing.get_args` (python#104013)
  pythongh-101819: Adapt _io._BufferedIOBase_Type methods to Argument Clinic (python#104355)
  pythongh-103960: Dark mode: invert image brightness (python#103983)
  pythongh-104252: Immortalize Py_EMPTY_KEYS (pythongh-104253)
  pythongh-101819: Clean up _io windows console io after pythongh-104197 (python#104354)
  pythongh-101819: Harden _io init (python#104352)
  pythongh-103247: clear the module cache in a test in test_importlib/extensions/test_loader.py (pythonGH-104226)
  pythongh-103848: Adds checks to ensure that bracketed hosts found by urlsplit are of IPv6 or IPvFuture format (python#103849)
  pythongh-74895: adjust tests to work on Solaris (python#104326)
  pythongh-101819: Refactor _io in preparation for module isolation (python#104334)
  pythongh-90953: Don't use deprecated AST nodes in clinic.py (python#104322)
  pythongh-102327: Extend docs for "url" and "headers" parameters to HTTPConnection.request()
  pythongh-104328: Fix typo in ``typing.Generic`` multiple inheritance error message (python#104335)
@mjpieters
Copy link
Contributor

As implemented, the URL scheme://foobar.[::1]/ is considered correct, but results in the hostname value of ::1. The hostname portion of the netloc is invalid however, as it doesn't start with [. Shouldn't the parser at least validate that the [ bracket is the first character or directly following a @?

@MarcoGlauser
Copy link

MarcoGlauser commented Jun 12, 2023

This change caused an issue on our system. We have brackets in a password. Since 3.11.4, urlparse tries to parse an ip address if it finds brackets. The brackets don't even have to be in the correct order.

Running this code:

from urllib.parse import urlparse
urlparse("https://user:some]password[@host.com")

In 3.11.3 it works as expected:

ParseResult(scheme='https', netloc='user:some]password[@host.com', path='', params='', query='', fragment='')

In 3.11.4 it throws an exception:

Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
  File "/usr/local/lib/python3.11/urllib/parse.py", line 395, in urlparse
    splitresult = urlsplit(url, scheme, allow_fragments)
                  ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/usr/local/lib/python3.11/urllib/parse.py", line 500, in urlsplit
    _check_bracketed_host(bracketed_host)
  File "/usr/local/lib/python3.11/urllib/parse.py", line 446, in _check_bracketed_host
    ip = ipaddress.ip_address(hostname) # Throws Value Error if not IPv6 or IPv4
         ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/usr/local/lib/python3.11/ipaddress.py", line 54, in ip_address
    raise ValueError(f'{address!r} does not appear to be an IPv4 or IPv6 address')
ValueError: '@host.com' does not appear to be an IPv4 or IPv6 address

@mjpieters
Copy link
Contributor

mjpieters commented Jun 12, 2023

This change caused an issue on our system. We have brackets in a password. Since 3.11.4, urlparse tries to parse an ip address if it finds brackets. The brackets don't even have to be in the correct order.

Running this code:

from urllib.parse import urlparse
urlparse("https://user:some]password[@host.com")

In 3.11.3 it works as expected:

ParseResult(scheme='https', netloc='user:some]password[@host.com', path='', params='', query='', fragment='')

That this worked before was a bug. urlparse() was violating the various and miriad URL standards before; [ and ] are reserved characters with meaning in the netloc portion of a URL (the part between the scheme:// and /path) and must be URL encoded. See the WhatWG living URL standard percent-encoding section; for the userinfo part (username + password), you need to encode:

  • U+002F (/), U+003A (:), U+003B (;), U+003D (=), U+0040 (@), U+005B ([) to U+005E (^), inclusive, and U+007C (|).
  • U+003F (?), U+0060 (`), U+007B ({), and U+007D (}) (path percent-encode set).
  • U+0020 SPACE, U+0022 ("), U+0023 (#), U+003C (<), and U+003E (>) (query percent-encode set).
  • U+0000 NULL to U+001F INFORMATION SEPARATOR ONE, inclusive, and all codepoints past U+007E (~) (C0 control percent-encode set)

So this works:

>>> from urllib.parse import urlparse
>>> urlparse("https://user:some%5Dpassword%5B@host.com")
ParseResult(scheme='https', netloc='user:some%5Dpassword%5B@host.com', path='', params='', query='', fragment='')
>>> _.password
'some%5Dpassword%5B'

The password is still URL encoded, use urllib.parse.unquote() for that.

Further references:

@MarcoGlauser
Copy link

Thank you for the clarification! This makes a lot of sense :)

@Jc2k
Copy link

Jc2k commented Jun 28, 2023

I'm also hitting "ValueError: 'xxxxx' does not appear to be an IPv4 or IPv6 address" on code that was working.

In my case I have no control over the input data, it's background noise from the internet that i'm parsing and analyzing. Some of the crap in my dataset is very much not standards compliant. None the less, as it was working and now isn't i figured i should 👍 before i start vendoring old versions of the stdlib into my project 😆

openstack-mirroring pushed a commit to openstack/taskflow that referenced this issue Aug 9, 2023
Python 3.11.4 [1] includes a fix for gh-103848 [2] and
urllib.parse.urlsplit will now validate that bracketed IP addresses are
valid IPv6 address. Fix this.

[1] https://docs.python.org/release/3.11.4/whatsnew/changelog.html#python-3-11-4
[2] python/cpython#103848

Change-Id: Ibd3d24e07f0c5670224b3e186b329c207666a2ab
Signed-off-by: Stephen Finucane <stephenfin@redhat.com>
openstack-mirroring pushed a commit to openstack/openstack that referenced this issue Aug 9, 2023
* Update taskflow from branch 'master'
  to 171580c4d355a12d42faa6102ad4e5ecd779b864
  - tests: Use valid IPv6 address
    
    Python 3.11.4 [1] includes a fix for gh-103848 [2] and
    urllib.parse.urlsplit will now validate that bracketed IP addresses are
    valid IPv6 address. Fix this.
    
    [1] https://docs.python.org/release/3.11.4/whatsnew/changelog.html#python-3-11-4
    [2] python/cpython#103848
    
    Change-Id: Ibd3d24e07f0c5670224b3e186b329c207666a2ab
    Signed-off-by: Stephen Finucane <stephenfin@redhat.com>
@gpshead
Copy link
Member

gpshead commented Oct 3, 2023

The original issue was fixed in May... it sounds like there may be some new issues that unfortunately arose? People with new parsing problems, please file new issues to track those.

@gpshead gpshead closed this as completed Oct 3, 2023
j-pepito added a commit to VerticalRelevance/control-broker that referenced this issue Oct 30, 2023
…changes made in Python 3.11 urllib.parse.urlsplit() which is implicitly called by urllib.parse.urljoin(): python/cpython#103848

CDK guarantee's a static URL is returned from the default Api stage: https://github.com/aws/aws-cdk/blob/v2.103.1/packages/@aws-cdk/aws-apigatewayv2-alpha/lib/http/stage.ts#L188-L192

So, since we're only appending some desired path to the base of the HttpApi URL, we extrapolate and format the scheme + netloc, concatenate the path, then return the result.
amercader added a commit to ckan/ckan that referenced this issue Apr 9, 2024
`urlparse` got more strict with the values it considers valid so this
"URL" used in our tests is no longer correctly parsed:

'sh+eme://[net:loc]:12345/a/path?a=b&c=d'

This should help turn some of the versions green here:

https://github.com/ckan/ckan-python-monitor

For reference:
python/cpython#103848
smotornyuk pushed a commit to ckan/ckan that referenced this issue Apr 17, 2024
`urlparse` got more strict with the values it considers valid so this
"URL" used in our tests is no longer correctly parsed:

'sh+eme://[net:loc]:12345/a/path?a=b&c=d'

This should help turn some of the versions green here:

https://github.com/ckan/ckan-python-monitor

For reference:
python/cpython#103848
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
stdlib Python modules in the Lib dir type-bug An unexpected behavior, bug, or error
Projects
None yet
Development

No branches or pull requests

6 participants