Effective sample size per row, not total sample size in report_sample #306

Closed
Lakens opened this issue Dec 11, 2022 · 9 comments · Fixed by #307

Lakens commented Dec 11, 2022

Instead of adding a total N on top of the table, add a column at the end (after total, or as part of total) reporting the effective n for each row. If there is missing data, it is good to see how many observations underlie the means in the table.

Author

Lakens commented Dec 12, 2022

One solution (that does not yet have your nice layout) is to summarize as follows:

df.sum <- df %>%
  select(var1, var2) %>%
  summarise_all(funs(min = min(., na.rm = TRUE),
                     median = median(., na.rm = TRUE),
                     max = max(., na.rm = TRUE),
                     mean = mean(., na.rm = TRUE),
                     sd = sd(., na.rm = TRUE),
                     n = sum(!is.na(.))))

@strengejacke strengejacke added the enhancement 💥 Implemented features can be improved or revised label Dec 12, 2022
@rempsyc rempsyc self-assigned this Dec 12, 2022
Sponsor Member

rempsyc commented Dec 12, 2022

Thanks for the suggestion @Lakens. On it!

Sponsor Member

rempsyc commented Dec 12, 2022

Reprex of example above:

library(dplyr, warn.conflicts = FALSE)
df.sum <- airquality %>%
  select(Ozone, Solar.R) %>%
  summarise_all(funs(min = min(., na.rm = TRUE),
                     median = median(., na.rm = TRUE),
                     max = max(., na.rm = TRUE),
                     mean = mean(., na.rm = TRUE),
                     sd = sd(., na.rm = TRUE),
                     n = sum(!is.na(.))))
#> Warning: `funs()` was deprecated in dplyr 0.8.0.
#> ℹ Please use a list of either functions or lambdas:
#> 
#> # Simple named list: list(mean = mean, median = median)
#> 
#> # Auto named with `tibble::lst()`: tibble::lst(mean, median)
#> 
#> # Using lambdas list(~ mean(., trim = .2), ~ median(., na.rm = TRUE))
df.sum
#>   Ozone_min Solar.R_min Ozone_median Solar.R_median Ozone_max Solar.R_max
#> 1         1           7         31.5            205       168         334
#>   Ozone_mean Solar.R_mean Ozone_sd Solar.R_sd Ozone_n Solar.R_n
#> 1   42.12931     185.9315 32.98788   90.05842     116       146

Created on 2022-12-12 with reprex v2.0.2
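
The deprecation warning above can be avoided with `across()`, which replaces `summarise_all(funs(...))` in current dplyr. The following is a minimal sketch of the same summary, not code from the thread; the name `df_sum` is chosen here for illustration. Note that `across()` orders the output by variable (all Ozone columns, then all Solar.R columns) rather than by statistic.

```r
library(dplyr, warn.conflicts = FALSE)

# Same statistics as the reprex above, written with across() and lambdas
df_sum <- airquality %>%
  select(Ozone, Solar.R) %>%
  summarise(across(
    everything(),
    list(
      min    = ~ min(.x, na.rm = TRUE),
      median = ~ median(.x, na.rm = TRUE),
      max    = ~ max(.x, na.rm = TRUE),
      mean   = ~ mean(.x, na.rm = TRUE),
      sd     = ~ sd(.x, na.rm = TRUE),
      n      = ~ sum(!is.na(.x))  # number of non-missing observations
    )
  ))

# Columns are named "<variable>_<statistic>", e.g. Ozone_n and Solar.R_n
df_sum[c("Ozone_n", "Solar.R_n")]
```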

Sponsor Member

rempsyc commented Dec 12, 2022

Were you thinking of something like this @Lakens?

devtools::load_all("D:/github/forks/report")
#> ℹ Loading report

report_sample(airquality, effective_n = TRUE)
#> # Descriptive Statistics
#> 
#> Variable          |        Summary | Effective n
#> ------------------------------------------------
#> Mean Ozone (SD)   |  42.13 (32.99) |         116
#> Mean Solar.R (SD) | 185.93 (90.06) |         146
#> Mean Wind (SD)    |    9.96 (3.52) |         153
#> Mean Temp (SD)    |   77.88 (9.47) |         153
#> Mean Month (SD)   |    6.99 (1.42) |         153
#> Mean Day (SD)     |   15.80 (8.86) |         153

Created on 2022-12-12 with reprex v2.0.2
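
For reference, the n column in the table above is the count of non-missing observations per variable. A base R sketch of that count (independent of how report computes it internally; the object names are illustrative):

```r
# Non-missing observations per variable in the built-in airquality data
n_per_variable <- colSums(!is.na(airquality))

# Number of missing observations per variable
n_missing <- nrow(airquality) - n_per_variable

data.frame(n = n_per_variable, missing = n_missing)
```

Only Ozone and Solar.R contain missing values, which is why their n is below the total of 153.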

It's a bit more challenging when using groups since groups won't have the same n for the same rows. This is current behaviour:

library(report)

report_sample(airquality, group_by = "Month")
#> # Descriptive Statistics
#> 
#> Variable          |        5 (n=31) |       6 (n=30) |       7 (n=31) |       8 (n=31) |       9 (n=30) |  Total (n=153)
#> ------------------------------------------------------------------------------------------------------------------------
#> Mean Ozone (SD)   |   23.62 (22.22) |  29.44 (18.21) |  59.12 (31.64) |  59.96 (39.68) |  31.45 (24.14) |  42.13 (32.99)
#> Mean Solar.R (SD) | 181.30 (115.08) | 190.17 (92.88) | 216.48 (80.57) | 171.86 (76.83) | 167.43 (79.12) | 185.93 (90.06)
#> Mean Wind (SD)    |    11.62 (3.53) |   10.27 (3.77) |    8.94 (3.04) |    8.79 (3.23) |   10.18 (3.46) |    9.96 (3.52)
#> Mean Temp (SD)    |    65.55 (6.85) |   79.10 (6.60) |   83.90 (4.32) |   83.97 (6.59) |   76.90 (8.36) |   77.88 (9.47)
#> Mean Day (SD)     |    16.00 (9.09) |   15.50 (8.80) |   16.00 (9.09) |   16.00 (9.09) |   15.50 (8.80) |   15.80 (8.86)

Created on 2022-12-12 with reprex v2.0.2

One possibility would be to double the number of columns by adding an effective n column for each group. Another possibility would be to include that info as a third value in each cell. What do you think would be best?
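
To see why grouping complicates things, the per-group counts can be inspected directly. This is a base R sketch, not the package's implementation; `vars` and `n_by_group` are names chosen here for illustration.

```r
# All variables except the grouping variable
vars <- setdiff(names(airquality), "Month")

# One column per month, one row per variable:
# each cell is the number of non-missing observations in that group
n_by_group <- sapply(
  split(airquality[vars], airquality$Month),
  function(d) colSums(!is.na(d))
)

# Append the overall count per variable as a Total column
n_by_group <- cbind(n_by_group, Total = colSums(!is.na(airquality[vars])))
n_by_group
```

Every cell can differ from its column's group size, so a single n in the column header cannot convey this information.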

Author

Lakens commented Dec 12, 2022

Hi, the first table is perfect (I would just call it n, not 'Effective n'). For the second table, adding it as a third value to each cell seems the best approach. If you make it optional (even opt-in), it would not interfere too much with the table if there are no missing values. In my data, there is attrition, so showing that later questions have lower n is important. Thanks for picking this up so quickly! Love the functions!

Sponsor Member

rempsyc commented Dec 12, 2022

Ok what about this?

devtools::load_all("D:/github/forks/report")
#> ℹ Loading report

report_sample(airquality)
#> # Descriptive Statistics
#> 
#> Variable          |        Summary
#> ----------------------------------
#> Mean Ozone (SD)   |  42.13 (32.99)
#> Mean Solar.R (SD) | 185.93 (90.06)
#> Mean Wind (SD)    |    9.96 (3.52)
#> Mean Temp (SD)    |   77.88 (9.47)
#> Mean Month (SD)   |    6.99 (1.42)
#> Mean Day (SD)     |   15.80 (8.86)

report_sample(airquality, n = TRUE)
#> # Descriptive Statistics
#> 
#> Variable             |             Summary
#> ------------------------------------------
#> Mean Ozone (SD, n)   |  42.13 (32.99, 116)
#> Mean Solar.R (SD, n) | 185.93 (90.06, 146)
#> Mean Wind (SD, n)    |    9.96 (3.52, 153)
#> Mean Temp (SD, n)    |   77.88 (9.47, 153)
#> Mean Month (SD, n)   |    6.99 (1.42, 153)
#> Mean Day (SD, n)     |   15.80 (8.86, 153)

report_sample(airquality, group_by = "Month")
#> # Descriptive Statistics
#> 
#> Variable          |        5 (n=31) |       6 (n=30) |       7 (n=31) |       8 (n=31) |       9 (n=30) |  Total (n=153)
#> ------------------------------------------------------------------------------------------------------------------------
#> Mean Ozone (SD)   |   23.62 (22.22) |  29.44 (18.21) |  59.12 (31.64) |  59.96 (39.68) |  31.45 (24.14) |  42.13 (32.99)
#> Mean Solar.R (SD) | 181.30 (115.08) | 190.17 (92.88) | 216.48 (80.57) | 171.86 (76.83) | 167.43 (79.12) | 185.93 (90.06)
#> Mean Wind (SD)    |    11.62 (3.53) |   10.27 (3.77) |    8.94 (3.04) |    8.79 (3.23) |   10.18 (3.46) |    9.96 (3.52)
#> Mean Temp (SD)    |    65.55 (6.85) |   79.10 (6.60) |   83.90 (4.32) |   83.97 (6.59) |   76.90 (8.36) |   77.88 (9.47)
#> Mean Day (SD)     |    16.00 (9.09) |   15.50 (8.80) |   16.00 (9.09) |   16.00 (9.09) |   15.50 (8.80) |   15.80 (8.86)

report_sample(airquality, group_by = "Month", n = TRUE)
#> # Descriptive Statistics
#> 
#> Variable             |            5 (n=31) |           6 (n=30) |           7 (n=31) |           8 (n=31) |           9 (n=30) |       Total (n=153)
#> ----------------------------------------------------------------------------------------------------------------------------------------------------
#> Mean Ozone (SD, n)   |   23.62 (22.22, 26) |   29.44 (18.21, 9) |  59.12 (31.64, 26) |  59.96 (39.68, 26) |  31.45 (24.14, 29) |  42.13 (32.99, 116)
#> Mean Solar.R (SD, n) | 181.30 (115.08, 27) | 190.17 (92.88, 30) | 216.48 (80.57, 31) | 171.86 (76.83, 28) | 167.43 (79.12, 30) | 185.93 (90.06, 146)
#> Mean Wind (SD, n)    |    11.62 (3.53, 31) |   10.27 (3.77, 30) |    8.94 (3.04, 31) |    8.79 (3.23, 31) |   10.18 (3.46, 30) |    9.96 (3.52, 153)
#> Mean Temp (SD, n)    |    65.55 (6.85, 31) |   79.10 (6.60, 30) |   83.90 (4.32, 31) |   83.97 (6.59, 31) |   76.90 (8.36, 30) |   77.88 (9.47, 153)
#> Mean Day (SD, n)     |    16.00 (9.09, 31) |   15.50 (8.80, 30) |   16.00 (9.09, 31) |   16.00 (9.09, 31) |   15.50 (8.80, 30) |   15.80 (8.86, 153)

I also noticed that there is a legacy total argument, but setting it to TRUE or FALSE does not seem to change anything: the Total column is always there when grouping, and never there when not grouping (because the summary is then already the total). The reason seems to be that when the sample size was added to the column names, the “Total” column was renamed and so was no longer removed correctly. I have corrected this in this version.

report_sample(airquality, group_by = "Month", total = FALSE)
#> # Descriptive Statistics
#> 
#> Variable          |        5 (n=31) |       6 (n=30) |       7 (n=31) |       8 (n=31) | 9 (n=30) (n=153)
#> ---------------------------------------------------------------------------------------------------------
#> Mean Ozone (SD)   |   23.62 (22.22) |  29.44 (18.21) |  59.12 (31.64) |  59.96 (39.68) |    31.45 (24.14)
#> Mean Solar.R (SD) | 181.30 (115.08) | 190.17 (92.88) | 216.48 (80.57) | 171.86 (76.83) |   167.43 (79.12)
#> Mean Wind (SD)    |    11.62 (3.53) |   10.27 (3.77) |    8.94 (3.04) |    8.79 (3.23) |     10.18 (3.46)
#> Mean Temp (SD)    |    65.55 (6.85) |   79.10 (6.60) |   83.90 (4.32) |   83.97 (6.59) |     76.90 (8.36)
#> Mean Day (SD)     |    16.00 (9.09) |   15.50 (8.80) |   16.00 (9.09) |   16.00 (9.09) |     15.50 (8.80)

Created on 2022-12-12 with reprex v2.0.2
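
The failure mode described above (a column that can no longer be dropped by name once its header has been renamed) can be illustrated with a small sketch. This is a hypothetical stand-in table, not the package's internal code; `tab`, `tab_exact`, and `tab_prefix` are names invented for the example.

```r
# Hypothetical stand-in for a grouped summary table
tab <- data.frame(
  Variable   = "Mean Wind (SD)",
  `5 (n=31)` = "11.62 (3.53)",
  Total      = "9.96 (3.52)",
  check.names = FALSE
)

# The sample size is appended to the header, renaming the column
names(tab)[names(tab) == "Total"] <- "Total (n=153)"

# Dropping by exact name now silently removes nothing
tab_exact <- tab[, names(tab) != "Total", drop = FALSE]
ncol(tab_exact)

# Matching on the prefix still finds the renamed column
tab_prefix <- tab[, !startsWith(names(tab), "Total"), drop = FALSE]
ncol(tab_prefix)
```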

Author

Lakens commented Dec 12, 2022

Lovely! This is exactly the behavior I would think people find useful!
The (22.22, 26) is a nice idea. (22.22), n=26 might be clearer but makes tables wider. (22.22), 26 would actually also be fine, I guess? And maybe most intuitive (difficult to know without user testing).
The total = FALSE still shows the "(n=153)" in the top row - I would assume that is also not needed if total = FALSE? It is fine if it is still there though - useful, takes up little space.
Thanks again for the responsiveness - amazing :)

Sponsor Member

rempsyc commented Dec 12, 2022

(22.22), 26 was my first thought, and then I changed it to put the n inside the parentheses. I’ve changed it back. How do you like it?

devtools::load_all("D:/github/forks/report")
#> ℹ Loading report

report_sample(airquality, n = TRUE)
#> # Descriptive Statistics
#> 
#> Variable             |             Summary
#> ------------------------------------------
#> Mean Ozone (SD), n   |  42.13 (32.99), 116
#> Mean Solar.R (SD), n | 185.93 (90.06), 146
#> Mean Wind (SD), n    |    9.96 (3.52), 153
#> Mean Temp (SD), n    |   77.88 (9.47), 153
#> Mean Month (SD), n   |    6.99 (1.42), 153
#> Mean Day (SD), n     |   15.80 (8.86), 153

report_sample(airquality, group_by = "Month", n = TRUE)
#> # Descriptive Statistics
#> 
#> Variable             |            5 (n=31) |           6 (n=30) |           7 (n=31) |           8 (n=31) |           9 (n=30) |       Total (n=153)
#> ----------------------------------------------------------------------------------------------------------------------------------------------------
#> Mean Ozone (SD), n   |   23.62 (22.22), 26 |   29.44 (18.21), 9 |  59.12 (31.64), 26 |  59.96 (39.68), 26 |  31.45 (24.14), 29 |  42.13 (32.99), 116
#> Mean Solar.R (SD), n | 181.30 (115.08), 27 | 190.17 (92.88), 30 | 216.48 (80.57), 31 | 171.86 (76.83), 28 | 167.43 (79.12), 30 | 185.93 (90.06), 146
#> Mean Wind (SD), n    |    11.62 (3.53), 31 |   10.27 (3.77), 30 |    8.94 (3.04), 31 |    8.79 (3.23), 31 |   10.18 (3.46), 30 |    9.96 (3.52), 153
#> Mean Temp (SD), n    |    65.55 (6.85), 31 |   79.10 (6.60), 30 |   83.90 (4.32), 31 |   83.97 (6.59), 31 |   76.90 (8.36), 30 |   77.88 (9.47), 153
#> Mean Day (SD), n     |    16.00 (9.09), 31 |   15.50 (8.80), 30 |   16.00 (9.09), 31 |   16.00 (9.09), 31 |   15.50 (8.80), 30 |   15.80 (8.86), 153

Besides, the total argument was always meant to refer to the last Total column when using grouped data, not to the n of individual columns (the documentation defines that parameter simply as “Add a Total column.”). But I agree with you that there is little harm in keeping it there either way. If you are satisfied with this, I will submit it as a formal PR.

Created on 2022-12-12 with reprex v2.0.2
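
The "mean (SD), n" cell layout agreed on above can be reproduced with `sprintf()`. This is a minimal base R sketch under the assumption of two-decimal rounding, not the formatting code used by report; `format_cell` is a hypothetical helper.

```r
# Hypothetical helper: format one numeric variable as "mean (SD), n"
format_cell <- function(x) {
  n <- sum(!is.na(x))  # integer count of non-missing values
  sprintf("%.2f (%.2f), %d", mean(x, na.rm = TRUE), sd(x, na.rm = TRUE), n)
}

cells <- vapply(airquality[c("Ozone", "Solar.R")], format_cell, character(1))
cells
```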

Author

Lakens commented Dec 12, 2022

This looks perfect to me! Amazingly fast response - impressive. And I am confident this will be useful for many. Love the work you are doing on easystats!

IndrajeetPatil pushed a commit that referenced this issue Dec 13, 2022
* report_sample: add effective n (closes #306)

* Add snapshot tests
IndrajeetPatil pushed a commit that referenced this issue Dec 21, 2022
* report_sample: add effective n (closes #306)

* remove pipe in vignette
IndrajeetPatil added a commit that referenced this issue Jan 10, 2023
* report_sample: add effective n (closes #306)

* Addresses #309 part 1: add type and rules to chi2 objects

* Add tests + styler

* remove duplicate author in DESCRIPTION

* Harmonize snapshot testing with OS platform variant.

* styler

* Run tests only on Windows

closes #312

* Use devel effectsize

* run only once a week [skip ci]

* Rerun snapshot tests on Windows with latest version of effectsize

* change snapshots variant = .Platform$OS.type to 'windows', styler, lints

Co-authored-by: Indrajeet Patil <patilindrajeet.science@gmail.com>
This issue was closed.