Effective sample size per row, not total sample size in report_sample #306

Lakens · 2022-12-11T19:19:51Z

Instead of adding a total N on top of the table, add a column at the end (after total, or as part of total) reporting the effective n for each row. If there is missing data, it is good to see how many observations underlie the means in the table.

Lakens · 2022-12-12T07:50:39Z

One solution (that does not yet have your nice layout) is to summarize as follows:

df.sum <- df %>%
  select(var1, var2) %>%
  summarise_all(funs(min = min(., na.rm = TRUE),
                     median = median(., na.rm = TRUE),
                     max = max(., na.rm = TRUE),
                     mean = mean(., na.rm = TRUE),
                     sd = sd(., na.rm = TRUE),
                     n = sum(!is.na(.))))

rempsyc · 2022-12-12T17:17:56Z

Thanks for the suggestion @Lakens. On it!

rempsyc · 2022-12-12T18:22:24Z

Reprex of example above:

library(dplyr, warn.conflicts = FALSE)
df.sum <- airquality %>%
  select(Ozone, Solar.R) %>%
  summarise_all(funs(min = min(., na.rm = TRUE),
                     median = median(., na.rm = TRUE),
                     max = max(., na.rm = TRUE),
                     mean = mean(., na.rm = TRUE),
                     sd = sd(., na.rm = TRUE),
                     n = sum(!is.na(.))))
#> Warning: `funs()` was deprecated in dplyr 0.8.0.
#> ℹ Please use a list of either functions or lambdas:
#> 
#> # Simple named list: list(mean = mean, median = median)
#> 
#> # Auto named with `tibble::lst()`: tibble::lst(mean, median)
#> 
#> # Using lambdas list(~ mean(., trim = .2), ~ median(., na.rm = TRUE))
df.sum
#>   Ozone_min Solar.R_min Ozone_median Solar.R_median Ozone_max Solar.R_max
#> 1         1           7         31.5            205       168         334
#>   Ozone_mean Solar.R_mean Ozone_sd Solar.R_sd Ozone_n Solar.R_n
#> 1   42.12931     185.9315 32.98788   90.05842     116       146

Created on 2022-12-12 with reprex v2.0.2
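The deprecation warning in the reprex above points to lists of functions or lambdas as the replacement for `funs()`. A minimal sketch of the same summary written with `across()`, assuming dplyr 1.0.0 or later is installed; note that the result columns come out grouped by variable (all `Ozone_*` first) rather than by statistic:

```r
library(dplyr, warn.conflicts = FALSE)

# Same six statistics as above, expressed as a named list of lambdas
df.sum <- airquality %>%
  select(Ozone, Solar.R) %>%
  summarise(across(
    everything(),
    list(
      min    = ~ min(.x, na.rm = TRUE),
      median = ~ median(.x, na.rm = TRUE),
      max    = ~ max(.x, na.rm = TRUE),
      mean   = ~ mean(.x, na.rm = TRUE),
      sd     = ~ sd(.x, na.rm = TRUE),
      n      = ~ sum(!is.na(.x))  # effective n: non-missing observations
    )
  ))

df.sum
```

The `n` lambda is the part that matters for this issue: it counts the observations that actually contribute to each mean.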

rempsyc · 2022-12-12T18:28:12Z

Were you thinking of something like this @Lakens?

devtools::load_all("D:/github/forks/report")
#> ℹ Loading report

report_sample(airquality, effective_n = TRUE)
#> # Descriptive Statistics
#> 
#> Variable          |        Summary | Effective n
#> ------------------------------------------------
#> Mean Ozone (SD)   |  42.13 (32.99) |         116
#> Mean Solar.R (SD) | 185.93 (90.06) |         146
#> Mean Wind (SD)    |    9.96 (3.52) |         153
#> Mean Temp (SD)    |   77.88 (9.47) |         153
#> Mean Month (SD)   |    6.99 (1.42) |         153
#> Mean Day (SD)     |   15.80 (8.86) |         153

Created on 2022-12-12 with reprex v2.0.2

It's a bit more challenging when using groups, since groups won't have the same n for the same rows. This is the current behaviour:

library(report)

report_sample(airquality, group_by = "Month")
#> # Descriptive Statistics
#> 
#> Variable          |        5 (n=31) |       6 (n=30) |       7 (n=31) |       8 (n=31) |       9 (n=30) |  Total (n=153)
#> ------------------------------------------------------------------------------------------------------------------------
#> Mean Ozone (SD)   |   23.62 (22.22) |  29.44 (18.21) |  59.12 (31.64) |  59.96 (39.68) |  31.45 (24.14) |  42.13 (32.99)
#> Mean Solar.R (SD) | 181.30 (115.08) | 190.17 (92.88) | 216.48 (80.57) | 171.86 (76.83) | 167.43 (79.12) | 185.93 (90.06)
#> Mean Wind (SD)    |    11.62 (3.53) |   10.27 (3.77) |    8.94 (3.04) |    8.79 (3.23) |   10.18 (3.46) |    9.96 (3.52)
#> Mean Temp (SD)    |    65.55 (6.85) |   79.10 (6.60) |   83.90 (4.32) |   83.97 (6.59) |   76.90 (8.36) |   77.88 (9.47)
#> Mean Day (SD)     |    16.00 (9.09) |   15.50 (8.80) |   16.00 (9.09) |   16.00 (9.09) |   15.50 (8.80) |   15.80 (8.86)

Created on 2022-12-12 with reprex v2.0.2

One possibility would be to double the number of columns by adding an effective n column for each group. Another possibility would be to include that info as a third value in each cell. What do you think would be best?
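Either layout needs the same underlying counts: the number of non-missing values per variable within each group. A minimal base R sketch of how those could be computed; this is an illustration, not the package's implementation, and `vars` and `effective_n` are names made up for the example:

```r
# Variables to summarise, excluding the grouping column
vars <- setdiff(names(airquality), "Month")

# One column per Month, one row per variable, each cell = non-missing count
effective_n <- sapply(
  split(airquality[vars], airquality$Month),
  function(d) colSums(!is.na(d))
)

effective_n
```

Rows of `effective_n` are variables and columns are the groups, so the values can be appended to the matching cells of the grouped table or placed in extra columns next to them.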

Lakens · 2022-12-12T18:45:58Z

Hi, the first table is perfect (I would just call it n, not 'Effective n'). For the second table, adding it as a third value in each cell seems the best approach. If you make it optional (even opt-in), it would not interfere too much with the table when there are no missing values. In my data there is attrition, so showing that later questions have a lower n is important. Thanks for picking this up so quickly! Love the functions!

rempsyc · 2022-12-12T19:47:12Z

Ok what about this?

devtools::load_all("D:/github/forks/report")
#> ℹ Loading report

report_sample(airquality)
#> # Descriptive Statistics
#> 
#> Variable          |        Summary
#> ----------------------------------
#> Mean Ozone (SD)   |  42.13 (32.99)
#> Mean Solar.R (SD) | 185.93 (90.06)
#> Mean Wind (SD)    |    9.96 (3.52)
#> Mean Temp (SD)    |   77.88 (9.47)
#> Mean Month (SD)   |    6.99 (1.42)
#> Mean Day (SD)     |   15.80 (8.86)

report_sample(airquality, n = TRUE)
#> # Descriptive Statistics
#> 
#> Variable             |             Summary
#> ------------------------------------------
#> Mean Ozone (SD, n)   |  42.13 (32.99, 116)
#> Mean Solar.R (SD, n) | 185.93 (90.06, 146)
#> Mean Wind (SD, n)    |    9.96 (3.52, 153)
#> Mean Temp (SD, n)    |   77.88 (9.47, 153)
#> Mean Month (SD, n)   |    6.99 (1.42, 153)
#> Mean Day (SD, n)     |   15.80 (8.86, 153)

report_sample(airquality, group_by = "Month")
#> # Descriptive Statistics
#> 
#> Variable          |        5 (n=31) |       6 (n=30) |       7 (n=31) |       8 (n=31) |       9 (n=30) |  Total (n=153)
#> ------------------------------------------------------------------------------------------------------------------------
#> Mean Ozone (SD)   |   23.62 (22.22) |  29.44 (18.21) |  59.12 (31.64) |  59.96 (39.68) |  31.45 (24.14) |  42.13 (32.99)
#> Mean Solar.R (SD) | 181.30 (115.08) | 190.17 (92.88) | 216.48 (80.57) | 171.86 (76.83) | 167.43 (79.12) | 185.93 (90.06)
#> Mean Wind (SD)    |    11.62 (3.53) |   10.27 (3.77) |    8.94 (3.04) |    8.79 (3.23) |   10.18 (3.46) |    9.96 (3.52)
#> Mean Temp (SD)    |    65.55 (6.85) |   79.10 (6.60) |   83.90 (4.32) |   83.97 (6.59) |   76.90 (8.36) |   77.88 (9.47)
#> Mean Day (SD)     |    16.00 (9.09) |   15.50 (8.80) |   16.00 (9.09) |   16.00 (9.09) |   15.50 (8.80) |   15.80 (8.86)

report_sample(airquality, group_by = "Month", n = TRUE)
#> # Descriptive Statistics
#> 
#> Variable             |            5 (n=31) |           6 (n=30) |           7 (n=31) |           8 (n=31) |           9 (n=30) |       Total (n=153)
#> ----------------------------------------------------------------------------------------------------------------------------------------------------
#> Mean Ozone (SD, n)   |   23.62 (22.22, 26) |   29.44 (18.21, 9) |  59.12 (31.64, 26) |  59.96 (39.68, 26) |  31.45 (24.14, 29) |  42.13 (32.99, 116)
#> Mean Solar.R (SD, n) | 181.30 (115.08, 27) | 190.17 (92.88, 30) | 216.48 (80.57, 31) | 171.86 (76.83, 28) | 167.43 (79.12, 30) | 185.93 (90.06, 146)
#> Mean Wind (SD, n)    |    11.62 (3.53, 31) |   10.27 (3.77, 30) |    8.94 (3.04, 31) |    8.79 (3.23, 31) |   10.18 (3.46, 30) |    9.96 (3.52, 153)
#> Mean Temp (SD, n)    |    65.55 (6.85, 31) |   79.10 (6.60, 30) |   83.90 (4.32, 31) |   83.97 (6.59, 31) |   76.90 (8.36, 30) |   77.88 (9.47, 153)
#> Mean Day (SD, n)     |    16.00 (9.09, 31) |   15.50 (8.80, 30) |   16.00 (9.09, 31) |   16.00 (9.09, 31) |   15.50 (8.80, 30) |   15.80 (8.86, 153)

I also realized that there is a legacy total argument, but setting it to TRUE or FALSE does not seem to change anything: the Total column is always there when grouping and never there when not grouping (because then the table already provides the total). The reason seems to be that when the sample size was added to the column names, the “Total” column was renamed and so was no longer removed correctly. I have corrected this in this version.

report_sample(airquality, group_by = "Month", total = FALSE)
#> # Descriptive Statistics
#> 
#> Variable          |        5 (n=31) |       6 (n=30) |       7 (n=31) |       8 (n=31) | 9 (n=30) (n=153)
#> ---------------------------------------------------------------------------------------------------------
#> Mean Ozone (SD)   |   23.62 (22.22) |  29.44 (18.21) |  59.12 (31.64) |  59.96 (39.68) |    31.45 (24.14)
#> Mean Solar.R (SD) | 181.30 (115.08) | 190.17 (92.88) | 216.48 (80.57) | 171.86 (76.83) |   167.43 (79.12)
#> Mean Wind (SD)    |    11.62 (3.53) |   10.27 (3.77) |    8.94 (3.04) |    8.79 (3.23) |     10.18 (3.46)
#> Mean Temp (SD)    |    65.55 (6.85) |   79.10 (6.60) |   83.90 (4.32) |   83.97 (6.59) |     76.90 (8.36)
#> Mean Day (SD)     |    16.00 (9.09) |   15.50 (8.80) |   16.00 (9.09) |   16.00 (9.09) |     15.50 (8.80)

Created on 2022-12-12 with reprex v2.0.2
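The failure mode described for the total argument (a column that is renamed before it is removed by name) can be sketched in base R. The data frame below is a hypothetical stand-in for the summary table and is not report's internal code:

```r
# Hypothetical stand-in for the summary table; not report's internal code
tab <- data.frame(
  Variable = "Mean Ozone (SD)",
  `9 (n=30)` = "31.45 (24.14)",
  `Total (n=153)` = "42.13 (32.99)",
  check.names = FALSE
)

# Dropping by exact name silently keeps the renamed column
by_name <- tab[, setdiff(names(tab), "Total"), drop = FALSE]

# Matching on the prefix still finds it after the n suffix is appended
by_prefix <- tab[, !startsWith(names(tab), "Total"), drop = FALSE]

names(by_name)
names(by_prefix)
```

Removing the column before the sample size is appended to the header would be another way to avoid the mismatch.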

Lakens · 2022-12-12T20:18:59Z

Lovely! This is exactly the behavior I would think people find useful!
The (22.22, 26) is a nice idea. (22.22), n=26 might be clearer but makes tables wider. (22.22), 26 would actually also be fine, I guess? And maybe most intuitive (difficult to know without user testing).
The total = FALSE still shows the "(n=153)" in the top row - I would assume that is also not needed if total = FALSE? It is fine if it is still there though - useful, takes up little space.
Thanks again for the responsiveness - amazing :)
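To compare the three cell formats discussed here side by side, a small base R sketch using the Ozone column of airquality; the vector name `cells` and its element names are made up for illustration:

```r
x <- airquality$Ozone
m <- mean(x, na.rm = TRUE)
s <- sd(x, na.rm = TRUE)
n <- sum(!is.na(x))  # integer count of non-missing values

# The three candidate layouts for one table cell
cells <- c(
  inside_parens = sprintf("%.2f (%.2f, %d)", m, s, n),
  labelled      = sprintf("%.2f (%.2f), n=%d", m, s, n),
  trailing      = sprintf("%.2f (%.2f), %d", m, s, n)
)

cells
```

The labelled variant costs two extra characters per cell, which adds up quickly in a grouped table with many columns.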

rempsyc · 2022-12-12T20:28:50Z

(22.22), 26 was my first thought, and then I changed it to the parentheses. I’ve changed it back; how do you like it?

devtools::load_all("D:/github/forks/report")
#> ℹ Loading report

report_sample(airquality, n = TRUE)
#> # Descriptive Statistics
#> 
#> Variable             |             Summary
#> ------------------------------------------
#> Mean Ozone (SD), n   |  42.13 (32.99), 116
#> Mean Solar.R (SD), n | 185.93 (90.06), 146
#> Mean Wind (SD), n    |    9.96 (3.52), 153
#> Mean Temp (SD), n    |   77.88 (9.47), 153
#> Mean Month (SD), n   |    6.99 (1.42), 153
#> Mean Day (SD), n     |   15.80 (8.86), 153

report_sample(airquality, group_by = "Month", n = TRUE)
#> # Descriptive Statistics
#> 
#> Variable             |            5 (n=31) |           6 (n=30) |           7 (n=31) |           8 (n=31) |           9 (n=30) |       Total (n=153)
#> ----------------------------------------------------------------------------------------------------------------------------------------------------
#> Mean Ozone (SD), n   |   23.62 (22.22), 26 |   29.44 (18.21), 9 |  59.12 (31.64), 26 |  59.96 (39.68), 26 |  31.45 (24.14), 29 |  42.13 (32.99), 116
#> Mean Solar.R (SD), n | 181.30 (115.08), 27 | 190.17 (92.88), 30 | 216.48 (80.57), 31 | 171.86 (76.83), 28 | 167.43 (79.12), 30 | 185.93 (90.06), 146
#> Mean Wind (SD), n    |    11.62 (3.53), 31 |   10.27 (3.77), 30 |    8.94 (3.04), 31 |    8.79 (3.23), 31 |   10.18 (3.46), 30 |    9.96 (3.52), 153
#> Mean Temp (SD), n    |    65.55 (6.85), 31 |   79.10 (6.60), 30 |   83.90 (4.32), 31 |   83.97 (6.59), 31 |   76.90 (8.36), 30 |   77.88 (9.47), 153
#> Mean Day (SD), n     |    16.00 (9.09), 31 |   15.50 (8.80), 30 |   16.00 (9.09), 31 |   16.00 (9.09), 31 |   15.50 (8.80), 30 |   15.80 (8.86), 153

Besides, the total argument was always meant to refer to the last Total column when using grouped data, not to the n of individual columns (the documentation defines that parameter simply as “Add a Total column.”). But I agree with you that there is little harm in keeping it there either way. If you are satisfied with this, I will submit it as a formal PR.

Created on 2022-12-12 with reprex v2.0.2

Lakens · 2022-12-12T20:40:14Z

This looks perfect to me! Amazingly fast response, impressive. And I am confident this will be useful for many. Love the work you are doing on easystats!

strengejacke added the enhancement 💥 label (Implemented features can be improved or revised) Dec 12, 2022

rempsyc self-assigned this Dec 12, 2022

rempsyc mentioned this issue Dec 12, 2022

report_sample: add effective n (closes #306) #307

Merged

IndrajeetPatil closed this as completed in #307 Dec 13, 2022

IndrajeetPatil pushed a commit that referenced this issue Dec 13, 2022

report_sample: add effective n (closes #306) (#307)

f661ec6

* report_sample: add effective n (closes #306) * Add snapshot tests

IndrajeetPatil pushed a commit that referenced this issue Dec 21, 2022

Remove pipe in vignette to fix failing GHA (#316)

8cce4e8

* report_sample: add effective n (closes #306) * remove pipe in vignette

This issue was closed.
