About em problem #98

Open
ROOKLO opened this issue Sep 20, 2020 · 3 comments

ROOKLO commented Sep 20, 2020

I randomly introduced NaN values into my data set so that I could compare the performance of the EM imputation methods in SPSS and impyute.
These are the results I got:
spss_em*
MSE_spss: 22.177916455492653
r_spss: 0.721709731654166
impyute_em
MSE_impyute: 289.1830722478248
r_impyute: 0.002467765572835078
The EM from impyute does not seem to work very well, and I do not know why.
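
For anyone who wants to reproduce this kind of comparison, here is a minimal sketch of the scoring step. It assumes that MSE and r are computed only over the entries that were masked out, which the post does not actually state; `score_imputation` and its arguments are hypothetical names, not part of impyute or SPSS.

```python
import numpy as np

def score_imputation(truth, imputed, mask):
    """Score an imputation on the masked entries only.

    truth   : complete array before NaNs were introduced
    imputed : array returned by the imputation method
    mask    : boolean array, True where a value was replaced by NaN
    """
    t = np.asarray(truth, dtype=float)[mask]
    p = np.asarray(imputed, dtype=float)[mask]
    mse = float(np.mean((t - p) ** 2))    # mean squared error on masked cells
    r = float(np.corrcoef(t, p)[0, 1])    # Pearson correlation on masked cells
    return mse, r
```

Scoring only the masked cells keeps the untouched observed values from inflating r.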

@BaoxueLi

I am not familiar with the details of the SPSS EM implementation, but I read the source code of the em function in impyute. The implementation is very simple: it repeatedly resamples from a Gaussian distribution defined by the mean and variance of the current column until the difference from the previous fill value is very small. This method may not be effective on data with more complex characteristics.
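
To make that description concrete, here is a rough sketch of a per-column resampling scheme of the kind described. It is a paraphrase of the description above, not a copy of impyute's source; the function name `column_resample_impute`, the relative stopping rule, and the `eps`, `max_iter` and `seed` parameters are assumptions chosen for illustration. It also assumes every column has at least one observed value.

```python
import numpy as np

def column_resample_impute(data, eps=0.1, max_iter=100, seed=0):
    """Fill each NaN by resampling from a per-column Gaussian (illustrative only)."""
    rng = np.random.default_rng(seed)
    original = np.array(data, dtype=float)
    out = original.copy()
    rows, cols = np.nonzero(np.isnan(original))
    for i, j in zip(rows, cols):
        # Mean and standard deviation of the observed values in this column only.
        column = original[:, j]
        observed = column[~np.isnan(column)]
        mu, sd = observed.mean(), observed.std()
        previous = rng.normal(mu, sd)
        for _ in range(max_iter):
            current = rng.normal(mu, sd)
            # Stop when two consecutive draws are close in relative terms.
            if abs(current - previous) <= eps * max(abs(previous), 1e-12):
                break
            previous = current
        out[i, j] = current
    return out
```

Note that each fill value is drawn without looking at the other columns of the same row, so in a scheme like this the filled values carry no information about the row they belong to. That would be consistent with a correlation close to zero between imputed and true values.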

ROOKLO commented Jan 22, 2021

> I am not familiar with the details of the SPSS EM implementation, but I read the source code of the em function in impyute. The implementation is very simple: it repeatedly resamples from a Gaussian distribution defined by the mean and variance of the current column until the difference from the previous fill value is very small. This method may not be effective on data with more complex characteristics.

Maybe the data is not normally distributed, or is not missing at random. In that case the normal distribution defined by the mean and standard deviation of the observed data in each column (feature) cannot represent the data's true distribution, and bias is introduced in the first iteration.

mkrtl commented Feb 25, 2021

I also do not think the implementation here in impyute is correct, as it does not use any covariance structure and just uses the mean and standard deviation of the current column. Murphy's "Machine Learning: A Probabilistic Perspective", section 11.6, shows how to use the EM algorithm to derive the sufficient statistics in the normal case. Does the algorithm actually converge for any delta?
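
For comparison, here is a minimal sketch of EM for a multivariate normal with missing entries, which does use the covariance between columns. It assumes the rows are jointly Gaussian and the values are missing at random. The name `em_mvn_impute` and its parameters are hypothetical; this is not impyute's or SPSS's implementation.

```python
import numpy as np

def em_mvn_impute(X, n_iter=100, tol=1e-6):
    """EM for a multivariate normal with NaNs; returns (filled X, mean, covariance)."""
    X = np.asarray(X, dtype=float)
    n, d = X.shape
    miss = np.isnan(X)
    # Initialise from column means and the covariance of the mean-filled data.
    mu = np.nanmean(X, axis=0)
    Xf = np.where(miss, mu, X)
    sigma = np.cov(Xf, rowvar=False, bias=True) + 1e-6 * np.eye(d)
    for _ in range(n_iter):
        S = np.zeros((d, d))  # sum of conditional covariances of the missing parts
        # E-step: conditional mean and covariance of missing given observed.
        for i in range(n):
            m = miss[i]
            if not m.any():
                continue
            o = ~m
            if not o.any():
                # Nothing observed in this row: fall back to the marginal.
                Xf[i] = mu
                S += sigma
                continue
            S_oo = sigma[np.ix_(o, o)]
            S_mo = sigma[np.ix_(m, o)]
            K = np.linalg.solve(S_oo, S_mo.T).T  # S_mo @ inv(S_oo)
            Xf[i, m] = mu[m] + K @ (X[i, o] - mu[o])
            S[np.ix_(m, m)] += sigma[np.ix_(m, m)] - K @ S_mo.T
        # M-step: update mean and covariance from expected sufficient statistics.
        mu_new = Xf.mean(axis=0)
        diff = Xf - mu_new
        sigma_new = (diff.T @ diff + S) / n
        converged = (np.max(np.abs(mu_new - mu)) < tol
                     and np.max(np.abs(sigma_new - sigma)) < tol)
        mu, sigma = mu_new, sigma_new
        if converged:
            break
    return Xf, mu, sigma
```

Here each missing value is filled with its conditional expectation given the observed values in the same row, so correlated columns inform each other. The iteration stops when the parameter estimates change by less than `tol`, rather than when two random draws happen to be close.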
