
segfault when tabulating data incorrectly guessed as type "logical" #385

Closed
nealrichardson opened this issue Sep 10, 2017 · 5 comments · Fixed by #398

Comments

@nealrichardson

After reading the data in this file, I get a segmentation fault whenever I try to aggregate most columns in the resulting tbl_df:

> library(readxl)
> df <- read_xlsx("City Survey MASTER Data 1996-2017.xlsx", sheet=2)
> table(df$parkvis)

 *** caught segfault ***
address 0x211ebd268, cause 'memory not mapped'

Traceback:
 1: unique.default(x, nmax = nmax)
 2: unique(x, nmax = nmax)
 3: factor(a, exclude = exclude)
 4: table(df$parkvis)

Possible actions:
1: abort (with core dump, if enabled)
2: normal R exit
3: exit R without saving workspace
4: exit R saving workspace
Selection: 1
R is aborting now ...
Segmentation fault: 11

The problem is probably related to the fact that, because there are so many missing values at the top of the dataset, this column (like a majority of the others in the dataset) was guessed as type "logical":

> class(df$parkvis)
[1] "logical"
> head(df$parkvis)
[1] NA NA NA NA NA NA
> tail(df$parkvis)
[1] TRUE TRUE TRUE TRUE TRUE TRUE

It turns out that if you specify the col_types correctly as "numeric", the data are perfectly readable without segfaulting:

> df <- read_xlsx("City Survey MASTER Data 1996-2017.xlsx", sheet=2, col_types="numeric")
> table(df$parkvis, useNA="always")

    1     2     3     4     5     6  <NA> 
 2027  2863  4931  6400  9325   828 11598 
> sessionInfo()
R Under development (unstable) (2017-09-07 r73219)
Platform: x86_64-apple-darwin15.6.0 (64-bit)
Running under: macOS Sierra 10.12.6

Matrix products: default
BLAS: /Library/Frameworks/R.framework/Versions/3.5/Resources/lib/libRblas.0.dylib
LAPACK: /Library/Frameworks/R.framework/Versions/3.5/Resources/lib/libRlapack.dylib

locale:
[1] en_US.UTF-8/en_US.UTF-8/en_US.UTF-8/C/en_US.UTF-8/en_US.UTF-8

attached base packages:
[1] stats     graphics  grDevices utils     datasets  methods   base     

other attached packages:
[1] readxl_1.0.0

loaded via a namespace (and not attached):
[1] compiler_3.5.0   tools_3.5.0      tibble_1.3.4     Rcpp_0.12.12    
[5] cellranger_1.1.0 rlang_0.1.2.9000

jennybc commented Sep 10, 2017

Can you provide "City Survey MASTER Data 1996-2017.xlsx"?

@nealrichardson

Sure, it was linked in the issue. You can download it from http://sfgov.org/citysurvey/sites/default/files/City%20Survey%20MASTER%20Data%201996-2017.xlsx


jennybc commented Sep 10, 2017

Oh sorry, I didn't notice the blue text 😬


jennybc commented Sep 11, 2017

I can reproduce it. It's very weird, because the data frame appears normal in most respects. Thanks for this specimen.

Two things I noticed in a quick examination:

  • dplyr::count() seems able to tabulate df$parkvis without segfaulting.
  • Instead of specifying col_types for all the variables, another workaround is to increase guess_max to something high enough to consult all the data.
library(readxl)
library(tidyverse)
df <- read_xlsx("~/Downloads/City Survey MASTER Data 1996-2017.xlsx", sheet = 2)
df %>% 
  count(parkvis)
#> # A tibble: 7 x 2
#>   parkvis     n
#>     <lgl> <int>
#> 1    TRUE  2027
#> 2    TRUE  2863
#> 3    TRUE  4931
#> 4    TRUE  6400
#> 5    TRUE  9325
#> 6    TRUE   828
#> 7      NA 11598

df2 <- read_xlsx("~/Downloads/City Survey MASTER Data 1996-2017.xlsx", sheet = 2, guess_max = 40000)
table(df2$parkvis)
#> 
#>    1    2    3    4    5    6 
#> 2027 2863 4931 6400 9325  828

@nealrichardson

Yeah, I can get to the right parsing of the data a few ways, and the version that segfaults isn't useful, so this doesn't block the analysis I was trying to do. But I figured you'd want to know about a segfault :)

Odd that dplyr::count actually tabulates the data correctly: even though all of the values print as TRUE, the distinct underlying values are evidently still in there somewhere.

base::table and stats::xtabs both trigger the segfault when they call unique (unique.default, to be precise). Indeed, calling unique(df$parkvis) directly also segfaults.

jennybc pushed a commit that referenced this issue Oct 20, 2017
Fixes #385

Weirdly, the test passes even without this fix when run with `testthat::test()`.  To get the test to fail, check out the previous commit, install, and do

```r
library(readxl)
library(testthat)
source("./tests/testthat/helper.R")
source("./tests/testthat/test-col-types.R")
```

from within the project directory.  It will segfault.
jennybc pushed a commit that referenced this issue Oct 21, 2017
* Transform int to 0/1 before coercion to logical

Fixes #385

Weirdly, the test passes even without this fix when run with `testthat::test()`.  To get the test to fail, check out the previous commit, install, and do

```r
library(readxl)
library(testthat)
source("./tests/testthat/helper.R")
source("./tests/testthat/test-col-types.R")
```

from within the project directory.  It will segfault.

* Add bugfix of #385 to NEWS
@lock lock bot locked and limited conversation to collaborators Oct 10, 2019