Complex Filtering #73

Open
mattkumar opened this issue Jul 8, 2022 · 1 comment

mattkumar commented Jul 8, 2022

Hi guys,

I'm attempting to use Tplyr to compute a group_count layer that I'm not sure how to specify. To give some background, I've simulated a partial adae table below that has USUBJID, ARM and AETOXGRN. AETOXGRN is a toxicity grading used frequently within Oncology and ranges from 1 to 5.

What I'm interested in counting is each subject's worst (i.e. highest) toxicity grade. I'm interested in distinct counts, so for example, if subject X had two AEs, one graded AETOXGRN = 1 and another graded AETOXGRN = 4, I'd like this subject to be counted in the "4" category only.

I can achieve this in dplyr, and also achieve this in Tplyr with some up-front filtering. However, I'm wondering if I can specify something like this directly in Tplyr.

Here is some code for my exploration.

library(dplyr)
library(Tplyr)

adae <- tibble::tribble(
  ~USUBJID,        ~ARM, ~AETOXGRN,
  1L, "Treatment",        3L,
  1L, "Treatment",        1L,
  1L, "Treatment",        2L,
  1L, "Treatment",        3L,
  1L, "Treatment",        1L,
  2L,   "Placebo",        3L,
  2L,   "Placebo",        3L,
  2L,   "Placebo",        4L,
  2L,   "Placebo",        5L,
  2L,   "Placebo",        4L,
  2L,   "Placebo",        2L,
  3L, "Treatment",        1L,
  4L,   "Placebo",        1L,
  5L, "Treatment",        1L,
  5L, "Treatment",        1L,
  5L, "Treatment",        5L,
  5L, "Treatment",        3L,
  5L, "Treatment",        2L,
  5L, "Treatment",        4L,
  5L, "Treatment",        1L
)

# Using dplyr: keep each subject's highest-grade row, then count by arm and grade
adae %>%
  group_by(USUBJID) %>%
  arrange(desc(AETOXGRN)) %>%
  slice(1) %>%
  ungroup() %>%
  count(ARM, AETOXGRN)

# dplyr output
# A tibble: 5 x 3
# ARM        AETOXGRN     n
# <chr>         <int> <int>
# Placebo           1     1
# Placebo           5     1
# Treatment         1     1
# Treatment         3     1
# Treatment         5     1

# Using Tplyr
t <- tplyr_table(adae, ARM) %>%
  add_layer(
    group_count(AETOXGRN, where = AETOXGRN == max(AETOXGRN)) %>%
      set_distinct_by(USUBJID)
  )

t %>% build()

# Tplyr output
# A tibble: 1 x 5
# row_label1 var1_Placebo var1_Treatment ord_layer_index ord_layer_1
# <chr>      <chr>        <chr>                    <int>       <dbl>
# 5          1 (100.0%)   1 (100.0%)                   1           5

I can see that Tplyr outputs only the result for grade 5, the overall max(AETOXGRN), which is what I'd expect if the filter is evaluated across the whole data set. So it seems my filter is acting at the data set level rather than per USUBJID. Is there a good way to specify a where filter of this nature, or have I missed other options in Tplyr?
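To illustrate what I mean, here is a minimal sketch of the two behaviours in plain dplyr. It assumes the where condition is evaluated on ungrouped data, which is my reading of the output above rather than something I've confirmed in Tplyr's source. The toy data and the names dataset_level and subject_level are made up for the example.

```r
library(dplyr)

# Hypothetical three-subject extract with the same columns as adae
toy <- tibble::tribble(
  ~USUBJID,        ~ARM, ~AETOXGRN,
  1L, "Treatment",        3L,
  1L, "Treatment",        1L,
  2L,   "Placebo",        5L,
  2L,   "Placebo",        2L,
  3L, "Treatment",        1L
)

# Ungrouped: max() is taken over every row, so only the grade 5 record survives
dataset_level <- toy %>%
  filter(AETOXGRN == max(AETOXGRN))

# Grouped: max() is taken within each USUBJID, so each subject keeps its worst grade
subject_level <- toy %>%
  group_by(USUBJID) %>%
  filter(AETOXGRN == max(AETOXGRN)) %>%
  ungroup()
```

The second form is the per-subject behaviour I'm after inside the layer.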

Curious to hear any thoughts!

Thanks!
Matt

mstackhouse (Contributor) commented

@mattkumar thanks for submitting this!

Currently this isn't possible, because we don't have a clean way to pass groups down to the point where the where filter conditions are applied. We didn't plan for that, so it would take a good bit of thought to do it elegantly. I'm almost thinking it would be safer to pre-derive a flag and filter on that flag, which is how ADaM datasets would typically set things up, because grouping and ungrouping here is a bit tricky.
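A rough sketch of the flag approach, under some assumptions: WORSTFL is a hypothetical flag name (not an existing ADaM variable in this data), the small dataset is illustrative only, and ties for the highest grade are broken by taking the first such record per subject.

```r
library(dplyr)
library(Tplyr)

# Illustrative data with the same columns as the adae example (hypothetical values)
adae_small <- tibble::tribble(
  ~USUBJID,        ~ARM, ~AETOXGRN,
  1L, "Treatment",        3L,
  1L, "Treatment",        1L,
  2L,   "Placebo",        5L,
  2L,   "Placebo",        2L,
  3L, "Treatment",        1L
)

# Pre-derive the flag: "Y" on exactly one record per subject,
# the first record carrying that subject's highest AETOXGRN.
# WORSTFL is an assumed name chosen for this sketch.
adae_flagged <- adae_small %>%
  group_by(USUBJID) %>%
  mutate(WORSTFL = if_else(row_number() == which.max(AETOXGRN), "Y", "N")) %>%
  ungroup()

# The layer now filters on a plain column, so no grouping is needed inside where
worst_grade_table <- tplyr_table(adae_flagged, ARM) %>%
  add_layer(
    group_count(AETOXGRN, where = WORSTFL == "Y") %>%
      set_distinct_by(USUBJID)
  )

worst_grade_table %>% build()
```

Because the grouping happens once, up front, the where condition stays a simple row-level test, which sidesteps the grouping and ungrouping problem described above.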

What do you think?
