Incrorrect counts with reports having the drug of interest and the event of interest on different rows #6

ThomasSoeiro · 2024-05-21T21:35:34Z

Hello,

I think that there is an issue in how {pvda} deals with reports having the drug of interest and the event of interest on different rows.

For example, I am interested in the ROR of Drug_1/Event_1 in the following data:

report_id	drug	event
1	Drug_1	Event_1
1	Drug_2	Event_2
2	Drug_1	Event_3
3	Drug_3	Event_1
4	Drug_2	Event_2
5	Drug_1	Event_2
5	Drug_4	Event_1

If we exclude report_id 5 (i.e. the report having the drug of interest and the event of interest on different rows), pvda::da() returns correct counts:

a (i.e. number of reports with the drug, with the event) = 1 (i.e. report_id 1)
b (i.e. number of reports with the drug, without the event) = 1 (i.e. report_id 2)
c (i.e. number of reports without the drug, with the event) = 1 (i.e. report_id 3)
d (i.e. number of reports without the drug, without the event) = 1 (i.e. report_id 4)

But if we include report_id 5, pvda::da() returns incorrect counts (and inconsistent with above):

a = 1
b = 2
c = 2
d = 0

To fix this issue, I think that we first need to decide how to classify reports having the drug of interest and the event of interest on different rows.

Reproducible example:

(df1 <- data.frame(
  report_id = c(1, 1, 2, 3, 4, 5, 5),
  drug = c("Drug_1", "Drug_2", "Drug_1", "Drug_3", "Drug_2", "Drug_1", "Drug_4"),
  event = c("Event_1", "Event_2", "Event_3", "Event_1", "Event_2", "Event_2", "Event_1")
))
#   report_id   drug   event
# 1         1 Drug_1 Event_1
# 2         1 Drug_2 Event_2
# 3         2 Drug_1 Event_3
# 4         3 Drug_3 Event_1
# 5         4 Drug_2 Event_2
# 6         5 Drug_1 Event_2
# 7         5 Drug_4 Event_1

# correct counts if we exclude report_id 5
df2 <- subset(df1, report_id != 5)
da2 <- pvda::da(df2, rule_of_N = 0)
subset(da2$da_df, drug == "Drug_1" & event == "Event_1", select = -c(ic2.5:ror97.5))
# # A tibble: 1 × 12
#   drug   event     obs exp_rrr n_drug n_event n_tot n_event_prr n_tot_prr     b     c     d
#   <chr>  <chr>   <int>   <dbl>  <int>   <int> <int>       <int>     <int> <int> <int> <int>
# 1 Drug_1 Event_1     1       1      2       2     4           1         2     1     1     1

# incorrect counts if we include report_id 5
da1 <- pvda::da(df1, rule_of_N = 0)
subset(da1$da_df, drug == "Drug_1" & event == "Event_1", select = -c(ic2.5:ror97.5))
# # A tibble: 1 × 12
#   drug   event     obs exp_rrr n_drug n_event n_tot n_event_prr n_tot_prr     b     c     d
#   <chr>  <chr>   <int>   <dbl>  <int>   <int> <int>       <int>     <int> <int> <int> <int>
# 1 Drug_1 Event_1     1     1.8      3       3     5           2         2     2     2     0

The text was updated successfully, but these errors were encountered:

OskarGauffin · 2024-05-22T08:27:23Z

Dear Thomas,

Many thanks for your input to the package. As this is a recently published package with limited use so far, this input is most helpful.

There is indeed something incorrect here. The d-count should obviously not be lowered by including additional reports. If you think there are other inconsistencies apart from the d-count, please clarify.

I will fix this bug, include a unit-test based on your example and submit a corrected version to CRAN. Once that is done, I will close this issue.

ThomasSoeiro · 2024-05-22T09:29:23Z

Thanks for the feedback.
I did not find other inconsistencies so far (but I will report if I do).

How do you plan to classify such reports (i.e. reports having the drug of interest and the event of interest on different rows) in the 2-by-2 contingency table (i.e. a, b, c, or d)?

	Reports with the event	Reports without the event
Reports with the drug	a	b
Reports without the drug	c	d

OskarGauffin · 2024-05-23T13:16:01Z

Dear Thomas,

I guess that apart from the d-count decreasing, you also found counting the report_id = 5 in both cells b and c inconsistent and unexpected behaviour. As the underlying table summary is called contingency table, one could expect cell counts to be mutually exclusive, that is - I understand how this could be confusing.

If one considers the problem from a data entry perspective, my take on this would be that such reports could be seen as two separate reports (and some would recommend such information to be reported as two separate reports). Two different concerns for two different exposures are communicated.

Potential complications from counting this way is the confusion around what the overall sample size is (a+b+c+d or the total number of unique reportIDs). It does also, with magnitude dependent on the prevalence of this type of report, drive the point estimate downwards (as both b and c are in the denominator of ROR := ad/bc) and decrease confidence intervals width (through the terms 1/b + 1/c). But as long as these reports are considered a slightly malformed entry, I would argue that this solution is acceptable.

If one does not agree that reports as report_id = 5 could be thought of as two different reports, one could consider other ad-hoc solutions, such as instead of 1 adding 0.5 to b and c, or decide that one of the two cells b or c be preferred. But in this package, I will proceed as described above, and include such reports as +1 for both b and c-cells. To clarify this, I will also incorporate this question and information in the vignette.

As I understood from our email correspondence, you wanted to compare a crude ROR (from pvda) to the output from a logistic regression (fitted through glm). I’m not sure if this splitting of reports causes discrepancies for you, but if it does, an adjustment of the input data (e.g. assign two different reportIDs to such reports) could perhaps be helpful.

ThomasSoeiro · 2024-05-23T14:10:43Z

Thanks you for your detailed answer.

I guess that apart from the d-count decreasing, you also found counting the report_id = 5 in both cells b and c inconsistent and unexpected behaviour. As the underlying table summary is called contingency table, one could expect cell counts to be mutually exclusive, that is - I understand how this could be confusing.

Yes, exactly, it confused me at first but now I agree it is needed (see below).

If one considers the problem from a data entry perspective, my take on this would be that such reports could be seen as two separate reports (and some would recommend such information to be reported as two separate reports). Two different concerns for two different exposures are communicated.

In the meantime, I also discussed this issue with Dr Charles Khouri (an expert in disproportionality analysis and one of the leader of the READUS-PV working group). He also suggested to consider drug/event pairs individually rather than aggregate by ICSR in such cases. Alternatively, he suggested to use the Information Component (which uses different data and is thus not impacted by this issue).

Potential complications from counting this way is the confusion around what the overall sample size is (a+b+c+d or the total number of unique reportIDs). It does also, with magnitude dependent on the prevalence of this type of report, drive the point estimate downwards (as both b and c are in the denominator of ROR := ad/bc) and decrease confidence intervals width (through the terms 1/b + 1/c). But as long as these reports are considered a slightly malformed entry, I would argue that this solution is acceptable.

If one does not agree that reports as report_id = 5 could be thought of as two different reports, one could consider other ad-hoc solutions, such as instead of 1 adding 0.5 to b and c, or decide that one of the two cells b or c be preferred. But in this package, I will proceed as described above, and include such reports as +1 for both b and c-cells. To clarify this, I will also incorporate this question and information in the vignette.

I agree first option seems more appropriate. In addition to the adding details in the vignette, I would suggest to also raise a warning, e.g. something like (untested, but I guess you get the idea):

if ((a + b + c + d) > length(unique(report_id))
  warning("Some reports have been considered as separate reports. See the vignette for more information.")

As I understood from our email correspondence, you wanted to compare a crude ROR (from pvda) to the output from a logistic regression (fitted through glm). I’m not sure if this splitting of reports causes discrepancies for you, but if it does, an adjustment of the input data (e.g. assign two different reportIDs to such reports) could perhaps be helpful.

Thank for the suggestion. I will do that.

OskarGauffin · 2024-05-23T17:32:33Z

Happy to hear that.

Thats an interesting idea. I think it might be more of a note than a warning. In my experience these reports exist to some frequency, so perhaps such a note would be triggered a lot, unless you had a very small dataset. I'll consider it once I've checked how common it seems.

I also want to point out that the IC, defined as

log2((O+0.5)/(E+0.5))

where O is the observed count, and E is the expected count,

does count these reports in some sense "twice" as well, if you define the expected count based on the same contingency table, as done in (2) in (https://journals.sagepub.com/doi/epub/10.1177/0962280211403604). Reports as nr 5 are included twice in the total sample size (a+b+c+d)-count.

However, if one defines the E using a n_tot based on unique reportIDs, such reports only contribute once to the total sample size, i.e.

Expected count E = n_drug * n_reac/n_tot

where n_drug is defined as the number of reports for the drug of interest, n_reac is the number of reports with the reaction of interest and n_tot the total number of reports in the analysis. I prefer this latter definition.

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Incrorrect counts with reports having the drug of interest and the event of interest on different rows #6

Incrorrect counts with reports having the drug of interest and the event of interest on different rows #6

ThomasSoeiro commented May 21, 2024 •

edited

Loading

OskarGauffin commented May 22, 2024

ThomasSoeiro commented May 22, 2024

OskarGauffin commented May 23, 2024 •

edited

Loading

ThomasSoeiro commented May 23, 2024

OskarGauffin commented May 23, 2024 •

edited

Loading

Incrorrect counts with reports having the drug of interest and the event of interest on different rows #6

Incrorrect counts with reports having the drug of interest and the event of interest on different rows #6

Comments

ThomasSoeiro commented May 21, 2024 • edited Loading

OskarGauffin commented May 22, 2024

ThomasSoeiro commented May 22, 2024

OskarGauffin commented May 23, 2024 • edited Loading

ThomasSoeiro commented May 23, 2024

OskarGauffin commented May 23, 2024 • edited Loading

ThomasSoeiro commented May 21, 2024 •

edited

Loading

OskarGauffin commented May 23, 2024 •

edited

Loading

OskarGauffin commented May 23, 2024 •

edited

Loading