Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

urlparse does not flag hostname *containing* [ or ] as incorrect #105704

Open
mjpieters opened this issue Jun 12, 2023 · 8 comments
Open

urlparse does not flag hostname *containing* [ or ] as incorrect #105704

mjpieters opened this issue Jun 12, 2023 · 8 comments
Assignees
Labels
stdlib Python modules in the Lib dir type-bug An unexpected behavior, bug, or error

Comments

@mjpieters
Copy link
Contributor

mjpieters commented Jun 12, 2023

#103848 updated the URL parsing algorithm to handle IPv6 and IPvFuture addresses when parsing URLs.

However, the algorithm is incomplete. [ and ] are only permitted in the hostname portion if they are the first and last characters and only if they then contain an IPv6 or IPvFuture address. The current implementation ignores everything before the first [ and everything after the first ] found in the netloc portion.

The WhatWG URL standard states that [ and ] are forbidden characters in a hostname, and the host parser only looks for IPv6 or IPvFuture if the [ and ] characters are the first and last characters of the section, respectively.

The current implementation thus accepts such bizarre hostnames as:

  • http://prefix.[v1.example]/
  • http://[v1.example].postfix/

but then only reports the portion between the brackets as the hostname:

>>> urlparse('http://prefix.[v1.example]/').hostname
'v1.example'
>>> urlparse('http://[v1.example].postfix/').hostname
'v1.example'

The .netloc attribute, in both cases, contains the whole string.

Both URLs should have been rejected instead.

Your environment

  • CPython versions tested on: 3.12.0b1
  • Operating system and architecture: Darwin M1

Linked PRs

@mjpieters mjpieters added the type-bug An unexpected behavior, bug, or error label Jun 12, 2023
@arhadthedev arhadthedev added the stdlib Python modules in the Lib dir label Jun 12, 2023
@arhadthedev
Copy link
Member

@orsenthil (as an urllib module expert)

@csreddy98
Copy link

csreddy98 commented Jun 12, 2023

I think this should return a valid object only if the hostname starts with a [ and ends with ]. With the current logic any string with [ and ] inside the hostname is being considered as a valid IPv6 or IPvFutire hostnames. I believe we must verify the start and end characters of a hostname as mentioned here for IPv6 and optionally IPvFuture.

@orsenthil orsenthil self-assigned this Jun 13, 2023
@lscottod
Copy link

lscottod commented Jun 29, 2023

Just to add to this thread,

I'm using django, django-environ and postgres. This issue is quite significant since I, unfortunately, happen to have brackets inside the postgres passwords in several prod/staging servers. This password is given inside a URL to django-environ that parses it (using urllib) then sends it back to django.

As far as Python 3.11.3, everything was going well, but since 3.11.4 it's all broken now.

It is related to what is stated above. urlsplit now spots '[' and ']' inside the netloc and what's inside theses brackets (a fragment of the password) is wrongfully considered a hostname. It then throws an exception since it tries to convert it to an ip address.

The lines that seem to be of importance :
in urllib/parse.py -- urlsplit()

if '[' in netloc and ']' in netloc:
      bracketed_host = netloc.partition('[')[2].partition(']')[0]
      _check_bracketed_host(bracketed_host)

It feels like an important breaking change/regression that doesn't seem documented

@pschoen-itsc
Copy link

Just to add to this thread,

I'm using django, django-environ and postgres. This issue is quite significant since I, unfortunately, happen to have brackets inside the postgres passwords in several prod/staging servers. This password is given inside a URL to django-environ that parses it (using urllib) then sends it back to django.

As far as Python 3.11.3, everything was going well, but since 3.11.4 it's all broken now.

It is related to what is stated above. urlsplit now spots '[' and ']' inside the netloc and what's inside theses brackets (a fragment of the password) is wrongfully considered a hostname. It then throws an exception since it tries to convert it to an ip address.

The lines that seem to be of importance : in urllib/parse.py -- urlsplit()

if '[' in netloc and ']' in netloc:
      bracketed_host = netloc.partition('[')[2].partition(']')[0]
      _check_bracketed_host(bracketed_host)

It feels like an important breaking change/regression that doesn't seem documented

I experienced the same problem. Just to be sure I also checked the URL spec referenced in the code and it states that username and password can be an ASCII string, so this is clearly a bug.

@bcail
Copy link

bcail commented Oct 23, 2023

@orsenthil Is it OK if I open a PR for this issue for you to look at?

@orsenthil
Copy link
Member

@bcail , yes please.

@bcail
Copy link

bcail commented Oct 24, 2023

@orsenthil I opened a PR.

@bcail
Copy link

bcail commented Nov 29, 2023

Hi @orsenthil. Are you going to be able to take a look at my PR? If not, in a week or so I can post on Discourse and see if someone else can take a look. Thanks.

bcail added a commit to Crossway/cpython that referenced this issue Jan 25, 2024
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
stdlib Python modules in the Lib dir type-bug An unexpected behavior, bug, or error
Projects
None yet
Development

No branches or pull requests

7 participants