Support multilevel groupby #69

Open
wants to merge 4 commits into
base: main

Conversation

jtleider
Contributor

@jtleider jtleider commented Aug 5, 2018

Hi,

This code closes #18, adding support for multilevel groupby. It also fixes a bug where, in some cases, descriptives for categorical and continuous variables were shown in separate columns when a groupby variable of dtype category was used.

Best,
Julien

@tompollard tompollard requested a review from jraffa August 6, 2018 17:37
@tompollard
Owner

Excellent, thanks again Julien. This is something that I've been putting off for a while! @jraffa, if possible, please could you take a look at this change from a user perspective?

Two things in particular that we need to think about are (1) if/how p-values should be reported for multilevel grouping and (2) how n (%) should be reported for categorical variables.

@jraffa
Collaborator

jraffa commented Aug 10, 2018

Couple of comments:

  1. Percentages: Within a (row) variable, the column percentages add up to 100%. This is fine, but may not be the desired result. I wonder whether an option to compute percentages by row, or by row within the first tier of the column variable, is a good idea or complicates things too much. I usually think about what the denominator is. When setting groupby = ['death','MechVent']:

a. Columnwise: denominator for first column for ICU variable is 110+50+205+103=468 (as in the table header.)
b. Rowwise: For CCU: 110+27+11+14=162
c. Rowwise within death=0: 110+27 = 137

Columnwise is probably a good default. Should probably be explained somewhere in the docs.
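To make the three denominators concrete, here is a minimal, dependency-free sketch using only the counts quoted above. It is not the package's implementation. The assignment of each count to a specific (death, MechVent) column is an assumption based on the order in which the numbers are listed, and the names `first_column_counts`, `ccu_row_counts` and `percent` are hypothetical.

```python
# Counts quoted in the comment above, for groupby = ['death', 'MechVent'].
# First column (assumed to be death=0, MechVent=0): one count per ICU level,
# with CCU listed first.
first_column_counts = [110, 50, 205, 103]

# CCU row: one count per (death, MechVent) column.
# Assumption: the counts are listed in column order (0,0), (0,1), (1,0), (1,1).
ccu_row_counts = {(0, 0): 110, (0, 1): 27, (1, 0): 11, (1, 1): 14}


def percent(count, denominator):
    """Return count as a percentage of denominator, rounded to one decimal."""
    return round(100 * count / denominator, 1)


cell = ccu_row_counts[(0, 0)]  # CCU patients with death=0, MechVent=0

# a. Columnwise: denominator is the column total (468).
columnwise = percent(cell, sum(first_column_counts))

# b. Rowwise: denominator is the CCU row total across all columns (162).
rowwise = percent(cell, sum(ccu_row_counts.values()))

# c. Rowwise within death=0: denominator is the CCU total for death=0 (137).
rowwise_within_death0 = percent(
    cell, sum(n for (death, _), n in ccu_row_counts.items() if death == 0)
)

print(columnwise, rowwise, rowwise_within_death0)  # 23.5 67.9 80.3
```

The same cell count of 110 reads as 23.5%, 67.9% or 80.3% depending on the denominator, which is why the default should be documented.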

  2. Hypothesis testing: The present way of doing the testing seems to take the column levels (n and m levels) and make n*m groups. So setting groupby = ['death','MechVent'] results in a comparison via, e.g., one-way ANOVA with 4 levels (0.0, 0.1, 1.0, 1.1). This seems to be OK behaviour. In theory two-way or multi-way ANOVA is possible, but it results in two or more p-values (with no interaction). Instead of multi-way ANOVA, I think it's more likely that someone would want to compare the values within a level of the first tier of a column, e.g., compare, among those who died, the mean SysABP: 122.51 (35.68) vs. 110.24 (39.40) for those with vent and no vent, resulting in separate p-values for death=0 and death=1. So I would have these two potential methods:

a. If factor one has n levels and factor two has m levels: by default, treat the n*m crosses of the levels as groups and do an (n*m - 1)-degree-of-freedom test (as currently done).
b. Alternatively, stratify into n groups and do the testing within each group on the m levels of factor two.

I think type b. is probably more intuitive to someone who hasn't read the docs. But I could see the argument on the other side as well.
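The difference between methods a. and b. can be sketched with a plain one-way ANOVA F statistic. This is an illustration only, not the package's code: the `records` values are made up, `one_way_anova` is a hypothetical helper, and p-values are omitted because computing them needs an F-distribution CDF (for example from SciPy).

```python
from collections import defaultdict


def one_way_anova(groups):
    """Return (F, df_between, df_within) for a list of samples."""
    k = len(groups)
    n_total = sum(len(g) for g in groups)
    grand_mean = sum(sum(g) for g in groups) / n_total
    means = [sum(g) / len(g) for g in groups]
    ss_between = sum(len(g) * (m - grand_mean) ** 2 for g, m in zip(groups, means))
    ss_within = sum((x - m) ** 2 for g, m in zip(groups, means) for x in g)
    df_between = k - 1
    df_within = n_total - k
    f_stat = (ss_between / df_between) / (ss_within / df_within)
    return f_stat, df_between, df_within


# Hypothetical records: (death, MechVent, SysABP). Values are illustrative only.
records = [
    (0, 0, 120.0), (0, 0, 131.0), (0, 0, 118.0),
    (0, 1, 109.0), (0, 1, 101.0), (0, 1, 115.0),
    (1, 0, 125.0), (1, 0, 140.0), (1, 0, 119.0),
    (1, 1, 98.0), (1, 1, 112.0), (1, 1, 105.0),
]

# Collect the SysABP values for each (death, MechVent) cell.
cells = defaultdict(list)
for death, vent, value in records:
    cells[(death, vent)].append(value)

# a. Pooled: treat the n*m crossed cells as groups of a single one-way test.
pooled = one_way_anova([cells[key] for key in sorted(cells)])

# b. Stratified: within each level of death, compare the MechVent levels.
stratified = {}
for death in sorted({d for d, _ in cells}):
    groups = [cells[key] for key in sorted(cells) if key[0] == death]
    stratified[death] = one_way_anova(groups)

print("pooled df:", pooled[1])  # n*m - 1 = 3
for death, (f_stat, df_b, df_w) in stratified.items():
    print("death =", death, "df:", df_b)  # m - 1 = 1 per stratum
```

Method a. yields one test with n*m - 1 = 3 between-group degrees of freedom, while method b. yields one test per level of death, each with m - 1 = 1 degree of freedom, so the table would need one p-value per first-tier level.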

Let me know if I have confused you.
