Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Imputation functionality #21

Closed
ericbair-sciome opened this issue Nov 12, 2023 · 11 comments · Fixed by #67
Closed

Imputation functionality #21

ericbair-sciome opened this issue Nov 12, 2023 · 11 comments · Fixed by #67
Assignees
Labels
enhancement New feature or request version 1.0 Features that should be included in the initial PrestoGP release

Comments

@ericbair-sciome
Copy link
Collaborator

Functions to impute data (due to limit of detection or otherwise) need to be added to the package.

@ericbair-sciome ericbair-sciome added the enhancement New feature or request label Nov 12, 2023
@ericbair-sciome ericbair-sciome self-assigned this Nov 12, 2023
@ericbair-sciome ericbair-sciome added the version 1.0 Features that should be included in the initial PrestoGP release label Nov 12, 2023
sciome-bot pushed a commit that referenced this issue Dec 28, 2023
Squashed commit of the following:

commit d7794fd
Merge: 03c3589 2e94403
Author: sciome-bot <software.tools@sciome.com>
Date:   Thu Dec 28 15:10:33 2023 -0500

    Merge branch 'master' of ssh://sciome-bot/stat/prestogp

commit 03c3589
Merge: d2a2e3a 961880a
Author: sciome-bot <software.tools@sciome.com>
Date:   Thu Dec 28 15:08:24 2023 -0500

    Merge branch 'to-git' of ssh://sciome-bot/stat/prestogp

commit 2e94403
Merge: 710520c 8dac095
Author: Shail Choksi <shail.choksi@sciome.com>
Date:   Thu Dec 28 15:07:40 2023 -0500

    Pull request #20: R CMD check fixes

    Merge in STAT/prestogp from build-workflow to master

    * commit '8dac09511058405d630121ca13c589781c932bf4':
      R CMD check fixes
      Add new files to Collate section in DESCRIPTION file

commit 8dac095
Author: sciome-bot <software.tools@sciome.com>
Date:   Thu Dec 28 15:06:38 2023 -0500

    R CMD check fixes

commit f037a0b
Author: sciome-bot <software.tools@sciome.com>
Date:   Thu Dec 28 14:22:52 2023 -0500

    Add new files to Collate section in DESCRIPTION file

commit 710520c
Merge: d2a2e3a 11771e4
Author: Shail Choksi <shail.choksi@sciome.com>
Date:   Thu Dec 28 14:09:40 2023 -0500

    Pull request #18: Added missing imports and ran auto-formatter in vscode for R and C++

    Merge in STAT/prestogp from build-workflow to master

    * commit '11771e4ccbd6311aa35334c5ab4b7e4299a8db56':
      Added missing imports
      Ran auto-formatter/linter for R and C++ in vscode. Added some missing imports

commit 11771e4
Author: sciome-bot <software.tools@sciome.com>
Date:   Thu Dec 28 14:07:28 2023 -0500

    Added missing imports

commit 73aed75
Merge: 7c0bbfe d2a2e3a
Author: sciome-bot <software.tools@sciome.com>
Date:   Thu Dec 28 13:58:37 2023 -0500

    Merge branch 'master' of ssh://sciome-bot/stat/prestogp into build-workflow

commit d2a2e3a
Merge: bea8382 85517e7
Author: sciome-bot <software.tools@sciome.com>
Date:   Thu Dec 28 13:57:14 2023 -0500

    Merge branch 'master' of ssh://sciome-bot/stat/prestogp

commit bea8382
Merge: 17f5284 1e4361d
Author: sciome-bot <software.tools@sciome.com>
Date:   Thu Dec 28 13:55:26 2023 -0500

    Merge branch 'main-sciome' of sciome-bot-git:Spatiotemporal-Exposures-and-Toxicology/PrestoGP

commit 7c0bbfe
Author: sciome-bot <software.tools@sciome.com>
Date:   Thu Dec 28 13:52:27 2023 -0500

    Ran auto-formatter/linter for R and C++ in vscode. Added some missing imports

commit 85517e7
Merge: 17f5284 2c9c2ed
Author: Shail Choksi <shail.choksi@sciome.com>
Date:   Thu Dec 28 13:12:00 2023 -0500

    Pull request #16: Add UBSAN/ASAN sanitizers

    Merge in STAT/prestogp from build-workflow to master

    * commit '2c9c2ede3d40e6e5af24b38a6acfc2dfb2994975':
      Add USAN/ASAN pipeline
      Pull request #14: Additional Testing
      Additional Testing
      Add UBSAN/ASAN sanitizers

commit 2c9c2ed
Merge: 6af43f4 56cdbf2
Author: sciome-bot <software.tools@sciome.com>
Date:   Thu Dec 28 13:11:14 2023 -0500

    Merge branch 'to-git' of ssh://sciome-bot/stat/prestogp into build-workflow

commit 6af43f4
Merge: e530819 17f5284
Author: sciome-bot <software.tools@sciome.com>
Date:   Thu Dec 28 13:07:50 2023 -0500

    Merge branch 'master' of ssh://sciome-bot/stat/prestogp into build-workflow

commit e530819
Author: sciome-bot <software.tools@sciome.com>
Date:   Thu Dec 28 13:06:21 2023 -0500

    Add USAN/ASAN pipeline

commit 17f5284
Merge: ea78ffa 895100d
Author: Eric Bair <eric.bair@sciome.com>
Date:   Fri Dec 1 10:49:09 2023 -0500

    Pull request #12: Testing

    Merge in STAT/prestogp from testing to master

    * commit '895100db78d89c5082ddfcf1411dad69bdf1b6c5':
      Fixed another sparseNN bug and updated tests
      Fixed a bug in SparseNN
      Added more likelihood and maxmin ordering tests

commit 895100d
Merge: 7f3ca3c fb060ff
Author: Eric Bair <eric.bair@sciome.com>
Date:   Wed Nov 29 17:02:10 2023 -0500

    Merge branch 'to-git' of http://192.168.167.103:7990/bitbucket/scm/stat/prestogp into testing

    Conflicts:
    	DESCRIPTION
    	man/prestogp_fit-PrestoGPModel-method.Rd
    	tests/testthat/test-Log_Likelihood.R

    Fixed merge conflicts.

commit 7f3ca3c
Author: Eric Bair <eric.bair@sciome.com>
Date:   Wed Nov 29 16:43:40 2023 -0500

    Fixed another sparseNN bug and updated tests

commit 1ca9586
Author: Eric Bair <eric.bair@sciome.com>
Date:   Tue Nov 28 18:40:36 2023 -0500

    Fixed a bug in SparseNN

commit 80a9818
Author: Eric Bair <eric.bair@sciome.com>
Date:   Tue Nov 21 17:49:19 2023 -0500

    Added more likelihood and maxmin ordering tests

... and 8 more commits
@kyle-messier
Copy link
Collaborator

@brian-bk22 @ericbair-sciome The overall functionality was added with PR #56 . For the pesticide work, we need to allow for variable LOD. Currently it appears that each outcome can only have 1 LOD:
Error in check_input(model, Y, X, locs, center.y, impute.y, lod) : Each lod must have length 1

Also, we don't have to implement this now if it is not trivial, but I think a more standard or straight forward approach
to implement an LOD would be to have a single value for the Y and then an indicator for whether it is observed or a limit of detection.

@ericbair-sciome
Copy link
Collaborator Author

I had been meaning to ask you about that. I thought you had said something about that in one of our meetings, but I wasn't sure. I'll go ahead and change this. It should be an easy fix.

By the way, I have finished my more detailed testing of the imputation algorithm. It seems to work very well for MAR missingness, but there is some bias in the LOD case. (And it gets steadily worse as the proportion of missing data increases.) I'm hoping to send a new version to Shail tonight. (Everything is done other than documentation at this point plus the aforementioned change, which should be easy.) My plan was to fix the show/accessor methods after that (since that is important for model interpretation) and then double back to see if we can figure out a way to improve the LOD imputation.

@kyle-messier
Copy link
Collaborator

@ericbair-sciome Thanks for the quick reply and fix. That is good to hear it will be an easy fix. For the LOD imputation effectiveness, also good to hear it is working for random case. It is expected that it will get worst as the proportion of missingness increases. One thing I mentioned in an email was the idea of multiple imputation. I'm developing the pesticide analysis through the targets package, which makes mapping over parameters easy and reproducible. We'll see how the analysis turns out, but we could discuss the idea of fitting k different PrestoGP models where each imputation is allowed to vary and average over those.

As an example, here is the current working version of the visualization of the targets pipeline. I have pre-processing, exploratory analysis, testing on a vanilla glmnet model, sub-sample dataset for testing PrestoGP, which is where it is currently failing:

image

@kyle-messier
Copy link
Collaborator

@ericbair-sciome Also, question related to the scaling input in prestogp_fit. Does it need to be a list for each outcome? Or would something like c(1,1,2) be sufficient for the multivariate. I think the latter is fine. I can't imagine a scenario where you would allow each outcome to vary in terms of a spatial vs spatiotemporal model.

@ericbair-sciome
Copy link
Collaborator Author

No, there is one scaling input for all outcomes for basically the reason you just said. :)

@ericbair-sciome
Copy link
Collaborator Author

@brian-bk22 @ericbair-sciome The overall functionality was added with PR #56 . For the pesticide work, we need to allow for variable LOD. Currently it appears that each outcome can only have 1 LOD: Error in check_input(model, Y, X, locs, center.y, impute.y, lod) : Each lod must have length 1

Also, we don't have to implement this now if it is not trivial, but I think a more standard or straight forward approach to implement an LOD would be to have a single value for the Y and then an indicator for whether it is observed or a limit of detection.

As an FYI, in the latest version, each outcome can have a separate LOD. Let me know if the syntax is unclear and I will try to improve the documentation.

I'm going to keep this issue open for now. While imputation is implemented in the current release, it seems to be biased when the percentage of missing data due to LOD is very high. We are working on some alternative approaches that seem to work better. If all goes well, we should have an improved imputation algorithm implemented in the next few weeks.

@ericbair-sciome ericbair-sciome linked a pull request Jul 30, 2024 that will close this issue
@ericbair-sciome
Copy link
Collaborator Author

This took much longer than I hoped, but the imputation functionality is significantly improved in the new version. There are still issues that need to be worked out. In particular, I think the new version is going to be prohibitively slow when the number of missing values is high. But I think I will close this issue and create some new issues for the new unresolved problems.

@kyle-messier
Copy link
Collaborator

@ericbair-sciome Thanks and no worries. I'm also taking longer on getting the pesticide data processed for analysis. I'll check out the new version ASAP. One question in regards to computational time. Does the new imputation method allow the user to control the number of iterations or the tolerance in re-estimating $\beta$ ?

@ericbair-sciome
Copy link
Collaborator Author

At the moment, the number of iterations and tolerance are hard coded. I should probably give users the option of changing that. I'll put that on the list to do.

@kyle-messier
Copy link
Collaborator

@ericbair-sciome Great - thank you. Whatever it is set to right now can be an easy default, but control would be good if someone wants to sacrifice some precision for speed. A print() or verbose option would be good to output each iterations value to monitor how long things are taking to converge.

@ericbair-sciome
Copy link
Collaborator Author

Good suggestions. I'll put that on the list.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
enhancement New feature or request version 1.0 Features that should be included in the initial PrestoGP release
Projects
None yet
Development

Successfully merging a pull request may close this issue.

3 participants