Skimming when all values are NA #666

Open
elinw opened this issue Jul 1, 2021 · 2 comments

Comments

@elinw
Collaborator

elinw commented Jul 1, 2021

Recently I came across a situation where all of the values of some variables were NA. In this case skimr produces the following output:

> df <- data.frame("x" = 1:10, "y" = NA)
> skimr::skim(df)

── Data Summary ────────────────────────
                           Values
Name                       df    
Number of rows             10    
Number of columns          2     
_______________________          
Column type frequency:           
  logical                  1     
  numeric                  1     
________________________         
Group variables            None  

── Variable type: logical ───────────────────────────────────
  skim_variable n_missing complete_rate  mean count
1 y                    10             0   NaN ": " 

── Variable type: numeric ───────────────────────────────────
  skim_variable n_missing complete_rate  mean    sd    p0
1 x                     0             1   5.5  3.03     1
    p25   p50   p75  p100 hist 
1  3.25   5.5  7.75    10 ▇▇▇▇▇

> df <- data.frame("x" = 1:10, "y" = NA_integer_)
> skimr::skim(df)
── Data Summary ────────────────────────
                           Values
Name                       df    
Number of rows             10    
Number of columns          2     
_______________________          
Column type frequency:           
  numeric                  2     
________________________         
Group variables            None  

── Variable type: numeric ───────────────────────────────────
  skim_variable n_missing complete_rate  mean    sd    p0
1 x                     0             1   5.5  3.03     1
2 y                    10             0 NaN   NA       NA
    p25   p50   p75  p100 hist   
1  3.25   5.5  7.75    10 "▇▇▇▇▇"
2 NA     NA   NA       NA " "    
> 

I think the base columns are okay (n_missing, complete_rate), but we probably should not compute the other statistics.
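For context, the NaN and NA values in the output above are consistent with how base R summary functions behave once every value has been dropped as missing. The snippet below is only a base R illustration; it assumes the summaries are computed with `na.rm = TRUE` and does not reproduce skimr's internals.

```r
# A bare NA gives a logical column; NA_integer_ gives an integer column.
y_lgl <- rep(NA, 10)
y_int <- rep(NA_integer_, 10)
class(y_lgl)  # "logical"
class(y_int)  # "integer"

# With na.rm = TRUE nothing is left, so each statistic sees empty input.
mean(y_lgl, na.rm = TRUE)      # NaN
mean(y_int, na.rm = TRUE)      # NaN
sd(y_int, na.rm = TRUE)        # NA
quantile(y_int, na.rm = TRUE)  # NA for every quantile
```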
@michaelquinn32 thoughts?

@elinw
Collaborator Author

elinw commented Jul 1, 2021

I guess we could push the count to 0 so that it works like the numeric NA case (NA_integer_ above).

@michaelquinn32
Collaborator

I think the issue is primarily how we handle NAs in some of the summary stats that we include: count and hist. We could probably add some simple checks for whether all the data is NA and, if so, have those stats return NA_character_ too. How does that sound?
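A minimal sketch of that kind of check, written in base R only. Both `na_if_all_missing()` and `count_string()` are hypothetical names made up for illustration; they are not skimr functions, and how such a guard would be wired into skimr's own summaries is not shown here.

```r
# Hypothetical helper (not part of skimr): wrap a summary function so that it
# returns a typed NA when every value in a non-empty vector is missing.
na_if_all_missing <- function(f, na_value = NA_character_) {
  function(x) {
    if (length(x) > 0 && all(is.na(x))) {
      return(na_value)
    }
    f(x)
  }
}

# Hypothetical stand-in for a character-valued summary such as a count string.
count_string <- function(x) {
  tab <- table(x)
  paste0(names(tab), ": ", tab, collapse = ", ")
}

safe_count <- na_if_all_missing(count_string)
safe_count(c(TRUE, FALSE, TRUE))  # "FALSE: 1, TRUE: 2"
safe_count(rep(NA, 10))           # NA
```

Wrapping the function, rather than editing each summary, keeps the all-NA check in one place, and the `na_value` argument lets a numeric summary return `NA_real_` instead.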
