Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Incrorrect counts with reports having the drug of interest and the event of interest on different rows #6

Open
ThomasSoeiro opened this issue May 21, 2024 · 5 comments

Comments

@ThomasSoeiro
Copy link

ThomasSoeiro commented May 21, 2024

Hello,

I think that there is an issue in how {pvda} deals with reports having the drug of interest and the event of interest on different rows.

For example, I am interested in the ROR of Drug_1/Event_1 in the following data:

report_id drug event
1 Drug_1 Event_1
1 Drug_2 Event_2
2 Drug_1 Event_3
3 Drug_3 Event_1
4 Drug_2 Event_2
5 Drug_1 Event_2
5 Drug_4 Event_1

If we exclude report_id 5 (i.e. the report having the drug of interest and the event of interest on different rows), pvda::da() returns correct counts:

  • a (i.e. number of reports with the drug, with the event) = 1 (i.e. report_id 1)
  • b (i.e. number of reports with the drug, without the event) = 1 (i.e. report_id 2)
  • c (i.e. number of reports without the drug, with the event) = 1 (i.e. report_id 3)
  • d (i.e. number of reports without the drug, without the event) = 1 (i.e. report_id 4)

But if we include report_id 5, pvda::da() returns incorrect counts (and inconsistent with above):

  • a = 1
  • b = 2
  • c = 2
  • d = 0

To fix this issue, I think that we first need to decide how to classify reports having the drug of interest and the event of interest on different rows.

Reproducible example:

(df1 <- data.frame(
  report_id = c(1, 1, 2, 3, 4, 5, 5),
  drug = c("Drug_1", "Drug_2", "Drug_1", "Drug_3", "Drug_2", "Drug_1", "Drug_4"),
  event = c("Event_1", "Event_2", "Event_3", "Event_1", "Event_2", "Event_2", "Event_1")
))
#   report_id   drug   event
# 1         1 Drug_1 Event_1
# 2         1 Drug_2 Event_2
# 3         2 Drug_1 Event_3
# 4         3 Drug_3 Event_1
# 5         4 Drug_2 Event_2
# 6         5 Drug_1 Event_2
# 7         5 Drug_4 Event_1

# correct counts if we exclude report_id 5
df2 <- subset(df1, report_id != 5)
da2 <- pvda::da(df2, rule_of_N = 0)
subset(da2$da_df, drug == "Drug_1" & event == "Event_1", select = -c(ic2.5:ror97.5))
# # A tibble: 1 × 12
#   drug   event     obs exp_rrr n_drug n_event n_tot n_event_prr n_tot_prr     b     c     d
#   <chr>  <chr>   <int>   <dbl>  <int>   <int> <int>       <int>     <int> <int> <int> <int>
# 1 Drug_1 Event_1     1       1      2       2     4           1         2     1     1     1

# incorrect counts if we include report_id 5
da1 <- pvda::da(df1, rule_of_N = 0)
subset(da1$da_df, drug == "Drug_1" & event == "Event_1", select = -c(ic2.5:ror97.5))
# # A tibble: 1 × 12
#   drug   event     obs exp_rrr n_drug n_event n_tot n_event_prr n_tot_prr     b     c     d
#   <chr>  <chr>   <int>   <dbl>  <int>   <int> <int>       <int>     <int> <int> <int> <int>
# 1 Drug_1 Event_1     1     1.8      3       3     5           2         2     2     2     0
@OskarGauffin
Copy link
Owner

Dear Thomas,

Many thanks for your input to the package. As this is a recently published package with limited use so far, this input is most helpful.

There is indeed something incorrect here. The d-count should obviously not be lowered by including additional reports. If you think there are other inconsistencies apart from the d-count, please clarify.

I will fix this bug, include a unit-test based on your example and submit a corrected version to CRAN. Once that is done, I will close this issue.

@ThomasSoeiro
Copy link
Author

Thanks for the feedback.
I did not find other inconsistencies so far (but I will report if I do).

How do you plan to classify such reports (i.e. reports having the drug of interest and the event of interest on different rows) in the 2-by-2 contingency table (i.e. a, b, c, or d)?

  Reports with the event Reports without the event
Reports with the drug a b
Reports without the drug c d

@OskarGauffin
Copy link
Owner

OskarGauffin commented May 23, 2024

Dear Thomas,

I guess that apart from the d-count decreasing, you also found counting the report_id = 5 in both cells b and c inconsistent and unexpected behaviour. As the underlying table summary is called contingency table, one could expect cell counts to be mutually exclusive, that is - I understand how this could be confusing.

If one considers the problem from a data entry perspective, my take on this would be that such reports could be seen as two separate reports (and some would recommend such information to be reported as two separate reports). Two different concerns for two different exposures are communicated.

Potential complications from counting this way is the confusion around what the overall sample size is (a+b+c+d or the total number of unique reportIDs). It does also, with magnitude dependent on the prevalence of this type of report, drive the point estimate downwards (as both b and c are in the denominator of ROR := ad/bc) and decrease confidence intervals width (through the terms 1/b + 1/c). But as long as these reports are considered a slightly malformed entry, I would argue that this solution is acceptable.

If one does not agree that reports as report_id = 5 could be thought of as two different reports, one could consider other ad-hoc solutions, such as instead of 1 adding 0.5 to b and c, or decide that one of the two cells b or c be preferred. But in this package, I will proceed as described above, and include such reports as +1 for both b and c-cells. To clarify this, I will also incorporate this question and information in the vignette.

As I understood from our email correspondence, you wanted to compare a crude ROR (from pvda) to the output from a logistic regression (fitted through glm). I’m not sure if this splitting of reports causes discrepancies for you, but if it does, an adjustment of the input data (e.g. assign two different reportIDs to such reports) could perhaps be helpful.

@ThomasSoeiro
Copy link
Author

Thanks you for your detailed answer.

I guess that apart from the d-count decreasing, you also found counting the report_id = 5 in both cells b and c inconsistent and unexpected behaviour. As the underlying table summary is called contingency table, one could expect cell counts to be mutually exclusive, that is - I understand how this could be confusing.

Yes, exactly, it confused me at first but now I agree it is needed (see below).

If one considers the problem from a data entry perspective, my take on this would be that such reports could be seen as two separate reports (and some would recommend such information to be reported as two separate reports). Two different concerns for two different exposures are communicated.

In the meantime, I also discussed this issue with Dr Charles Khouri (an expert in disproportionality analysis and one of the leader of the READUS-PV working group). He also suggested to consider drug/event pairs individually rather than aggregate by ICSR in such cases. Alternatively, he suggested to use the Information Component (which uses different data and is thus not impacted by this issue).

Potential complications from counting this way is the confusion around what the overall sample size is (a+b+c+d or the total number of unique reportIDs). It does also, with magnitude dependent on the prevalence of this type of report, drive the point estimate downwards (as both b and c are in the denominator of ROR := ad/bc) and decrease confidence intervals width (through the terms 1/b + 1/c). But as long as these reports are considered a slightly malformed entry, I would argue that this solution is acceptable.

If one does not agree that reports as report_id = 5 could be thought of as two different reports, one could consider other ad-hoc solutions, such as instead of 1 adding 0.5 to b and c, or decide that one of the two cells b or c be preferred. But in this package, I will proceed as described above, and include such reports as +1 for both b and c-cells. To clarify this, I will also incorporate this question and information in the vignette.

I agree first option seems more appropriate. In addition to the adding details in the vignette, I would suggest to also raise a warning, e.g. something like (untested, but I guess you get the idea):

if ((a + b + c + d) > length(unique(report_id))
  warning("Some reports have been considered as separate reports. See the vignette for more information.")

As I understood from our email correspondence, you wanted to compare a crude ROR (from pvda) to the output from a logistic regression (fitted through glm). I’m not sure if this splitting of reports causes discrepancies for you, but if it does, an adjustment of the input data (e.g. assign two different reportIDs to such reports) could perhaps be helpful.

Thank for the suggestion. I will do that.

@OskarGauffin
Copy link
Owner

OskarGauffin commented May 23, 2024

Happy to hear that.

Thats an interesting idea. I think it might be more of a note than a warning. In my experience these reports exist to some frequency, so perhaps such a note would be triggered a lot, unless you had a very small dataset. I'll consider it once I've checked how common it seems.

I also want to point out that the IC, defined as

log2((O+0.5)/(E+0.5))

where O is the observed count, and E is the expected count,

does count these reports in some sense "twice" as well, if you define the expected count based on the same contingency table, as done in (2) in (https://journals.sagepub.com/doi/epub/10.1177/0962280211403604). Reports as nr 5 are included twice in the total sample size (a+b+c+d)-count.

However, if one defines the E using a n_tot based on unique reportIDs, such reports only contribute once to the total sample size, i.e.

Expected count E = n_drug * n_reac/n_tot

where n_drug is defined as the number of reports for the drug of interest, n_reac is the number of reports with the reaction of interest and n_tot the total number of reports in the analysis. I prefer this latter definition.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

No branches or pull requests

2 participants