Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

ENH: Add validation checks for non-unique merge keys #35373

Closed

Conversation

LewisDavies
Copy link

@LewisDavies LewisDavies commented Jul 22, 2020

After implementing the suggestion in #27430 I realized it should also apply to 1:m and m:1 merges. When any type of many merge is specified, an error will be raised if the many side is actually unique.

@LewisDavies LewisDavies changed the title ENH: Add many-to-many merge validation check ENH: Add validation checks for non-unique merge keys Jul 22, 2020
@simonjayhawkins
Copy link
Member

Thanks @LewisDavies for the PR. The current behaviour is documented so this is an breaking API change. Would first need a deprecation.

from https://pandas.pydata.org/pandas-docs/version/1.1.0/reference/api/pandas.merge.html

validate : str, optional
If specified, checks if merge is of specified type.

“one_to_one” or “1:1”: check if merge keys are unique in both left and right datasets.

“one_to_many” or “1:m”: check if merge keys are unique in left dataset.

“many_to_one” or “m:1”: check if merge keys are unique in right dataset.

“many_to_many” or “m:m”: allowed, but does not result in checks.

FWIW i'm -1 on changing this.

@simonjayhawkins simonjayhawkins added Error Reporting Incorrect or improved errors from pandas Reshaping Concat, Merge/Join, Stack/Unstack, Explode labels Jul 22, 2020
@LewisDavies
Copy link
Author

Happy to contribute! I didn't realize the current behaviour was intentional so I understand not wanting to implement this. That said, it feels odd that if a user specifies m:m they won't get an error if both sides have unique keys.

With the benefit of hindsight the names could be clearer, especially for one-to-many merges. Adding 1: & :1 options could make this explicit but could just add confusion so I'm not sure what's best. What do you think @simonjayhawkins?

@simonjayhawkins
Copy link
Member

Happy to contribute! I didn't realize the current behaviour was intentional so I understand not wanting to implement this.

Thanks @LewisDavies. That's just my opinion. others may disagree so we'll leave this open for a while.

With the benefit of hindsight the names could be clearer, especially for one-to-many merges. Adding 1: & :1 options could make this explicit but could just add confusion so I'm not sure what's best. What do you think @simonjayhawkins?

another alternative could be to have, say, a check_unique keyword parameter that accepts, say, left, right and both. I think something that that may also make the intention of the functionality clearer.

There is functionality currently being added to pandas, to disallow allow duplicates and propagate that property in further operations, see #28394. It could be that in the future the validate parameter of merge becomes redundant.

Copy link
Contributor

@jreback jreback left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

so we would likely break the world with erroring on this now, try making this a warning, and then see what tests you need to catch the warning.

@simonjayhawkins
Copy link
Member

@LewisDavies are you able to address @jreback comment #35373 (review)

@dsaxton dsaxton added the Stale label Sep 15, 2020
@dsaxton
Copy link
Member

dsaxton commented Sep 26, 2020

Closing as stale, @LewisDavies let us know if you'd still like to work on this

@dsaxton dsaxton closed this Sep 26, 2020
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
Error Reporting Incorrect or improved errors from pandas Reshaping Concat, Merge/Join, Stack/Unstack, Explode Stale
Projects
None yet
Development

Successfully merging this pull request may close these issues.

validate ='m:m' in dataframe.merge should have a meaningful purpose
4 participants