Incorrect counts with reports having the drug of interest and the event of interest on different rows #6
Comments
Dear Thomas, Many thanks for your input to the package. As this is a recently published package with limited use so far, this input is most helpful. There is indeed something incorrect here. The d-count should obviously not be lowered by including additional reports. If you think there are other inconsistencies apart from the d-count, please clarify. I will fix this bug, include a unit test based on your example and submit a corrected version to CRAN. Once that is done, I will close this issue.
Thanks for the feedback. How do you plan to classify such reports (i.e. reports having the drug of interest and the event of interest on different rows) in the 2-by-2 contingency table (i.e. a, b, c, or d)?
Dear Thomas, I guess that apart from the d-count decreasing, you also found counting report_id = 5 in both cells b and c inconsistent and unexpected behaviour. As the underlying summary is called a contingency table, one could expect cell counts to be mutually exclusive, so I understand how this could be confusing.

If one considers the problem from a data entry perspective, my take on this would be that such reports could be seen as two separate reports (and some would recommend such information be reported as two separate reports): two different concerns for two different exposures are communicated. A potential complication of counting this way is confusion around what the overall sample size is (a + b + c + d, or the total number of unique reportIDs). It does also, with magnitude dependent on the prevalence of this type of report, drive the point estimate downwards (as both b and c are in the denominator of ROR := ad/bc) and decrease the confidence interval width (through the terms 1/b + 1/c). But as long as these reports are considered a slightly malformed entry, I would argue that this solution is acceptable.

If one does not agree that reports such as report_id = 5 could be thought of as two different reports, one could consider other ad-hoc solutions, such as adding 0.5 instead of 1 to b and c, or deciding that one of the two cells b or c be preferred. But in this package, I will proceed as described above, and include such reports as +1 for both the b and c cells. To clarify this, I will also incorporate this question and information in the vignette.

As I understood from our email correspondence, you wanted to compare a crude ROR (from pvda) to the output from a logistic regression (fitted through glm). I'm not sure whether this splitting of reports causes discrepancies for you, but if it does, an adjustment of the input data (e.g. assigning two different reportIDs to such reports) could perhaps be helpful.
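To make the counting rule described above concrete, here is a small sketch in Python (pvda itself is an R package; the row data below is made up for illustration and is not the example from this issue) showing how a report with the drug of interest and the event of interest on different rows contributes +1 to both b and c, and how that pushes the ROR downwards:

```python
# Hypothetical row-level report data: (report_id, drug, event).
# Report 5 lists the drug of interest and the event of interest
# on two different rows, like the case discussed in this issue.
rows = [
    (1, "Drug_1", "Event_1"),
    (2, "Drug_1", "Event_2"),
    (3, "Drug_2", "Event_1"),
    (4, "Drug_2", "Event_2"),
    (5, "Drug_1", "Event_2"),
    (5, "Drug_2", "Event_1"),
]

drug, event = "Drug_1", "Event_1"

# Treat every row as its own pseudo-report, so report 5 adds
# +1 to cell b and +1 to cell c (the behaviour described above).
a = sum(1 for _, d, e in rows if d == drug and e == event)
b = sum(1 for _, d, e in rows if d == drug and e != event)
c = sum(1 for _, d, e in rows if d != drug and e == event)
d_ = sum(1 for _, d, e in rows if d != drug and e != event)

ror = (a * d_) / (b * c)  # ROR := ad / bc
# Here a=1, b=2, c=2, d=1, so ROR = 0.25.
# Note a+b+c+d = 6 rows, but only 5 unique report IDs.
print(a, b, c, d_, ror)
```

Because report 5 lands in both b and c, the denominator bc grows while a and d are untouched, which is the downward pull on the point estimate described above.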
Thank you for your detailed answer.
Yes, exactly, it confused me at first but now I agree it is needed (see below).
In the meantime, I also discussed this issue with Dr Charles Khouri (an expert in disproportionality analysis and one of the leaders of the READUS-PV working group). He also suggested considering drug/event pairs individually rather than aggregating by ICSR in such cases. Alternatively, he suggested using the Information Component (which uses different data and is thus not impacted by this issue).
I agree the first option seems more appropriate. In addition to adding details in the vignette, I would suggest also raising a warning, e.g. something like (untested, but I guess you get the idea):
if ((a + b + c + d) > length(unique(report_id))) {
  warning("Some reports have been considered as separate reports. See the vignette for more information.")
}
Thanks for the suggestion. I will do that.
Happy to hear that. That's an interesting idea. I think it might be more of a note than a warning. In my experience these reports occur with some frequency, so such a note might be triggered a lot, unless you had a very small dataset. I'll consider it once I've checked how common it seems.

I also want to point out that the IC, defined as log2((O + 0.5) / (E + 0.5)) where O is the observed count and E is the expected count, in some sense also counts these reports "twice", if you define the expected count based on the same contingency table, as done in equation (2) of https://journals.sagepub.com/doi/epub/10.1177/0962280211403604. Reports such as nr 5 are included twice in the total sample size (the a + b + c + d count). However, if one defines E using an n_tot based on unique reportIDs, such reports only contribute once to the total sample size, i.e. the expected count E = n_drug * n_reac / n_tot, where n_drug is the number of reports for the drug of interest, n_reac is the number of reports with the reaction of interest, and n_tot is the total number of reports in the analysis. I prefer this latter definition.
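The two definitions of E can be compared with a short sketch (Python for illustration; the counts a=1, b=2, c=2, d=1 and the unique-report total of 5 are assumed values, not output from pvda):

```python
import math

# Hypothetical 2-by-2 counts; O is the observed count a.
a, b, c, d = 1, 2, 2, 1
O = a
n_drug = a + b   # reports with the drug of interest
n_reac = a + c   # reports with the reaction of interest

def ic(O, E):
    """IC as defined in the comment: log2((O + 0.5) / (E + 0.5))."""
    return math.log2((O + 0.5) / (E + 0.5))

# Variant 1: n_tot taken from the contingency table itself; a report
# like nr 5 is counted twice in the total (a+b+c+d = 6).
E_table = n_drug * n_reac / (a + b + c + d)

# Variant 2: n_tot from unique reportIDs (assumed to be 5 here),
# so such a report contributes only once to the total sample size.
n_unique_reports = 5
E_unique = n_drug * n_reac / n_unique_reports

print(ic(O, E_table), ic(O, E_unique))
```

With these assumed counts E is 1.5 under the table-based total and 1.8 under the unique-report total, so the unique-report definition yields a slightly lower IC for the same observed count.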
Hello,
I think that there is an issue in how {pvda} deals with reports having the drug of interest and the event of interest on different rows. For example, I am interested in the ROR of Drug_1/Event_1 in the following data:

If we exclude report_id 5 (i.e. the report having the drug of interest and the event of interest on different rows), pvda::da() returns correct counts. But if we include report_id 5, pvda::da() returns incorrect counts (and inconsistent with the above). To fix this issue, I think that we first need to decide how to classify reports having the drug of interest and the event of interest on different rows.
Reproducible example: