read_csv/pl.Dataframe.rename broken?! #2004

Closed
Arengard opened this issue Dec 6, 2021 · 4 comments

@Arengard

Arengard commented Dec 6, 2021

Are you using Python or Rust?

Python

What version of polars are you using?

0.10.27

What operating system are you using polars on?

windows 10

Describe your bug.

The newest version (0.10.27) cannot read or rename CSV data that has columns with the same name but different values.

import polars as pl

# reading dataframes
df = pl.DataFrame(
    {"a": [1, 2, 3], "b": [6, 7, 8], "c": ["a", "b", "c"]}
)
df

# the first column is missing
af = pl.DataFrame(
    {"a": [1, 2, 3], "a": [6, 7, 8], "c": ["a", "b", "c"]}
)
af
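
One thing worth noting about the snippet above: the repeated `"a"` key is collapsed by Python itself before the data reaches `pl.DataFrame`, because a dict literal keeps only the last value for a repeated key. A minimal pure-Python check (no polars involved) illustrates this:

```python
# A dict literal with a repeated key keeps only the last value for that key.
data = {"a": [1, 2, 3], "a": [6, 7, 8], "c": ["a", "b", "c"]}

column_names = list(data)
print(column_names)  # ['a', 'c']
print(data["a"])     # [6, 7, 8]
```

So the missing first column in that example looks like plain Python dict behavior rather than something specific to this polars release.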


# renaming raises a RuntimeError

df = pl.DataFrame(
    {"a": [1, 2, 3], "b": [6, 7, 8], "c": ["a", "b", "c"]}
)
df

# RuntimeError: Any(SchemaMisMatch("duplicate column names found"))
af = (
    df
    .clone()
    .rename({'b':'a'})
)
af



expected behavior?

import pandas as pd

df = pd.DataFrame(
    {"a": [1, 2, 3], "b": [6, 7, 8], "c": ["a", "b", "c"]}
)
df

# pandas accepts the rename and returns a frame with two columns named "a"
af = (
    df
    .copy(deep=True)
    .rename(columns={'b': 'a'})
)
af
@ritchie46
Member

Yes, this is correct. A polars DataFrame may only have unique column names. Before polars 0.10.27 this was not checked when renaming, which was a bug. As the error message shows, your rename leads to duplicate column names.
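
To illustrate the invariant, here is a small pure-Python sketch of the kind of check a rename has to pass. `renamed_columns` is a hypothetical helper written for this example; it is not part of the polars API, and the actual check inside polars may look quite different.

```python
from collections import Counter


def renamed_columns(columns, mapping):
    """Return the column names after a rename, rejecting duplicate results.

    Hypothetical helper for illustration only; not polars code.
    """
    new_names = [mapping.get(name, name) for name in columns]
    duplicates = sorted(
        name for name, count in Counter(new_names).items() if count > 1
    )
    if duplicates:
        raise ValueError(f"duplicate column names found: {duplicates}")
    return new_names


columns = ["a", "b", "c"]

# Renaming to a name that is still free is fine.
print(renamed_columns(columns, {"b": "d"}))  # ['a', 'd', 'c']

# Renaming "b" to "a" would yield two "a" columns, so it is rejected.
try:
    renamed_columns(columns, {"b": "a"})
except ValueError as exc:
    print(exc)  # duplicate column names found: ['a']
```

Picking a target name that does not collide with an existing column (here `"d"`) is the simplest way to stay within the invariant.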

@Arengard
Author

Arengard commented Dec 6, 2021

woah... what? so we can't read in data (csv, parquet) that have duplicate column names?

@ritchie46
Member

No, you could not create a DataFrame from them. Polars would throw an error saying that the unique columns invariant is broken.

Just as you cannot create a database table with duplicate columns.
For the IO readers we could implement something that generates unique column names if duplicates are found.
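
A rough sketch of what such header deduplication could look like, in plain Python with the standard `csv` module. The `make_unique` name and the `_1`, `_2` suffix scheme are assumptions chosen for this example; this is not a description of how the polars readers behave.

```python
import csv
import io


def make_unique(names):
    """Append numeric suffixes so that every name in the result is distinct.

    Hypothetical helper; the suffix scheme is an assumption for illustration.
    """
    seen = set()
    counts = {}
    result = []
    for name in names:
        candidate = name
        n = counts.get(name, 0)
        # Keep bumping the suffix until the candidate is not taken yet.
        while candidate in seen:
            n += 1
            candidate = f"{name}_{n}"
        counts[name] = n
        seen.add(candidate)
        result.append(candidate)
    return result


# A CSV with a duplicated "a" column in its header row.
raw = "a,a,c\n1,6,a\n2,7,b\n3,8,c\n"
reader = csv.reader(io.StringIO(raw))
header = make_unique(next(reader))
rows = list(reader)

print(header)  # ['a', 'a_1', 'c']
```

The deduplicated header and the remaining rows could then be handed to whatever constructor is in use, since every column name is now unique.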

Just out of curiosity: why do you want to rename columns to the same name?

@MarcoGorelli
Collaborator

> Polars would throw an error saying that the unique columns invariant is broken.

This is great. Sorry to chime in, I just wanted to point out that duplicate columns have led to some very tricky bugs in pandas (e.g. here), so for what it's worth I'm hoping that polars will continue to disallow them.

Arengard closed this as completed Dec 7, 2021