
segfault when tabulating data incorrectly guessed as type "logical" #385

Closed
nealrichardson opened this issue Sep 10, 2017 · 5 comments · Fixed by #398

Comments

@nealrichardson

After reading the data in this file, I get a segmentation fault whenever I try to aggregate most columns in the resulting tbl_df:

> library(readxl)
> df <- read_xlsx("City Survey MASTER Data 1996-2017.xlsx", sheet=2)
> table(df$parkvis)

 *** caught segfault ***
address 0x211ebd268, cause 'memory not mapped'

Traceback:
 1: unique.default(x, nmax = nmax)
 2: unique(x, nmax = nmax)
 3: factor(a, exclude = exclude)
 4: table(df$parkvis)

Possible actions:
1: abort (with core dump, if enabled)
2: normal R exit
3: exit R without saving workspace
4: exit R saving workspace
Selection: 1
R is aborting now ...
Segmentation fault: 11

The problem is probably related to the fact that, because there are so many missing values at the top of the dataset, this column (like a majority of the others in the dataset) was guessed as type "logical":

> class(df$parkvis)
[1] "logical"
> head(df$parkvis)
[1] NA NA NA NA NA NA
> tail(df$parkvis)
[1] TRUE TRUE TRUE TRUE TRUE TRUE

It turns out that if you specify the col_types correctly as "numeric", the data are perfectly readable without segfaulting:

> df <- read_xlsx("City Survey MASTER Data 1996-2017.xlsx", sheet=2, col_types="numeric")
> table(df$parkvis, useNA="always")

    1     2     3     4     5     6  <NA> 
 2027  2863  4931  6400  9325   828 11598 
> sessionInfo()
R Under development (unstable) (2017-09-07 r73219)
Platform: x86_64-apple-darwin15.6.0 (64-bit)
Running under: macOS Sierra 10.12.6

Matrix products: default
BLAS: /Library/Frameworks/R.framework/Versions/3.5/Resources/lib/libRblas.0.dylib
LAPACK: /Library/Frameworks/R.framework/Versions/3.5/Resources/lib/libRlapack.dylib

locale:
[1] en_US.UTF-8/en_US.UTF-8/en_US.UTF-8/C/en_US.UTF-8/en_US.UTF-8

attached base packages:
[1] stats     graphics  grDevices utils     datasets  methods   base     

other attached packages:
[1] readxl_1.0.0

loaded via a namespace (and not attached):
[1] compiler_3.5.0   tools_3.5.0      tibble_1.3.4     Rcpp_0.12.12    
[5] cellranger_1.1.0 rlang_0.1.2.9000

jennybc commented Sep 10, 2017

Can you provide "City Survey MASTER Data 1996-2017.xlsx"?

@nealrichardson

Sure, it was linked in the issue. You can download it from http://sfgov.org/citysurvey/sites/default/files/City%20Survey%20MASTER%20Data%201996-2017.xlsx


jennybc commented Sep 10, 2017

Oh sorry, I didn't notice the blue text 😬


jennybc commented Sep 11, 2017

I can reproduce it. It's very weird, because the data frame appears normal in most respects. Thanks for this specimen.

Two things I noticed in a quick examination:

  • dplyr::count() seems able to tabulate df$parkvis without segfaulting.
  • Instead of specifying col_types for all the variables, another workaround is to increase guess_max to something high enough to consult all the data.
library(readxl)
library(tidyverse)
df <- read_xlsx("~/Downloads/City Survey MASTER Data 1996-2017.xlsx", sheet = 2)
df %>% 
  count(parkvis)
#> # A tibble: 7 x 2
#>   parkvis     n
#>     <lgl> <int>
#> 1    TRUE  2027
#> 2    TRUE  2863
#> 3    TRUE  4931
#> 4    TRUE  6400
#> 5    TRUE  9325
#> 6    TRUE   828
#> 7      NA 11598

df2 <- read_xlsx("~/Downloads/City Survey MASTER Data 1996-2017.xlsx", sheet = 2, guess_max = 40000)
table(df2$parkvis)
#> 
#>    1    2    3    4    5    6 
#> 2027 2863 4931 6400 9325  828

@nealrichardson

Yeah, I can get to the right parsing of the data a few ways, and the version that segfaults isn't useful, so this doesn't block the analysis I was trying to do. But I figured you'd want to know about a segfault :)

Odd that dplyr::count actually tabulates the data correctly: even though all of the values print as TRUE, the distinct underlying values are evidently still in there somewhere.

base::table and stats::xtabs both trigger the segfault when they call unique (unique.default, to be precise). Indeed, calling unique(df$parkvis) directly also segfaults.

jennybc pushed a commit that referenced this issue Oct 20, 2017
Fixes #385

Weirdly, the test passes even without this fix when run with `testthat::test()`.  To get the test to fail, check out the previous commit, install, and do

```r
library(readxl)
library(testthat)
source("./tests/testthat/helper.R")
source("./tests/testthat/test-col-types.R")
```

from within the project directory.  It will segfault.
jennybc pushed a commit that referenced this issue Oct 21, 2017
* Transform int to 0/1 before coercion to logical

Fixes #385

Weirdly, the test passes even without this fix when run with `testthat::test()`.  To get the test to fail, check out the previous commit, install, and do

```r
library(readxl)
library(testthat)
source("./tests/testthat/helper.R")
source("./tests/testthat/test-col-types.R")
```

from within the project directory.  It will segfault.

* Add bugfix of #385 to NEWS
@lock lock bot locked and limited conversation to collaborators Oct 10, 2019