inconsistent logical created with read_xlsx #414

mplatzer · 2017-12-18T13:16:08Z

when reading in the attached demo.xlsx which has the following content:

colA	colB
100
101
102	1
103	4 or more
104	2

and I try to read it with read_excel but haven't specified guess_max large enough, I get a tibble with an inconsistent logical vector. the colB is a logical, but when checking for unique values, I get two different TRUE values reported.

> tb <- read_excel('~/Downloads/demo.xlsx', guess_max = 2)
Warning message:
In read_fun(path = path, sheet = sheet, limits = limits, shim = shim,  :
  Expecting logical in B5 / R5C2: got '4 or more'
> tb[['colB']]
[1]   NA   NA TRUE   NA TRUE
> unique(tb[['colB']])
[1]   NA TRUE TRUE
> uniqueN(tb[['colB']])
[1] 3
> dput(tb[['colB']])
c(NA, NA, TRUE, NA, TRUE)

The text was updated successfully, but these errors were encountered:

cderv · 2017-12-18T13:56:11Z

I think readxl behaviour is correct. You can find information on how it works on readxl.tidyverse.org.

guess_max determines how many cells in each column are used to make a guess of the column type. You provide a guess_max too small for read_excel to correctly guess the column type. Without other information, it provides its best guess. 2 empty row = NA = logical column.

Here a reprex to illustrate how it works

library(readxl)

# download the file (thanks for uploading it in GH)
tmp_file <- tempfile(fileext = ".xlsx")
download.file("https://github.com/tidyverse/readxl/files/1567997/demo.xlsx", destfile = tmp_file, mode = "wb")

# guess with enough data
read_excel(tmp_file)
#> # A tibble: 5 x 2
#>    colA      colB
#>   <dbl>     <chr>
#> 1   100      <NA>
#> 2   101      <NA>
#> 3   102         1
#> 4   103 4 or more
#> 5   104         2

# Two first rows in colB are empty so NA is guessed and it is typed logical. So it is correct, but not what you want
read_excel(tmp_file, guess_max = 2)
#> Warning in read_fun(path = path, sheet = sheet, limits = limits, shim =
#> shim, : Expecting logical in B5 / R5C2: got '4 or more'
#> # A tibble: 5 x 2
#>    colA  colB
#>   <dbl> <lgl>
#> 1   100    NA
#> 2   101    NA
#> 3   102  TRUE
#> 4   103    NA
#> 5   104  TRUE

# if you take 3 first rows, it will be numeric and fourth row will be NA
read_excel(tmp_file, guess_max = 3)
#> Warning in read_fun(path = path, sheet = sheet, limits = limits, shim =
#> shim, : Expecting numeric in B5 / R5C2: got '4 or more'
#> # A tibble: 5 x 2
#>    colA  colB
#>   <dbl> <dbl>
#> 1   100    NA
#> 2   101    NA
#> 3   102     1
#> 4   103    NA
#> 5   104     2

# if you take 4 first rows, it will be text 
read_excel(tmp_file, guess_max = 4)
#> # A tibble: 5 x 2
#>    colA      colB
#>   <dbl>     <chr>
#> 1   100      <NA>
#> 2   101      <NA>
#> 3   102         1
#> 4   103 4 or more
#> 5   104         2

# If you know the type of a particular empty column, you set col_types, even with guess guess_max
read_excel(tmp_file, col_types = c("guess", "text"), guess_max = 2)
#> # A tibble: 5 x 2
#>    colA      colB
#>   <dbl>     <chr>
#> 1   100      <NA>
#> 2   101      <NA>
#> 3   102         1
#> 4   103 4 or more
#> 5   104         2

# or don't guess
read_excel(tmp_file, col_types = c("numeric", "text"))
#> # A tibble: 5 x 2
#>    colA      colB
#>   <dbl>     <chr>
#> 1   100      <NA>
#> 2   101      <NA>
#> 3   102         1
#> 4   103 4 or more
#> 5   104         2

# delete temp file
unlink(tmp_file)

Why did you want to put a small guess_max in the first place?
If the default (1000) is not enough, you should increase it or sets manually the column type.

Hopes it helps!

mplatzer · 2017-12-18T14:11:40Z

The actual use case is that I am reading various Excels with 100k+ rows and hundreds of columns (typical wide survey data export), where I don't know the content beforehand. For some columns the first 1000 rows are empty, e.g because that variable has only been added at a later surveying wave.

My fix is to set guess_max to the maximum number of rows that a standard Excel can have (1,048,576). This still performs fast, making me wonder, what is the benefit of even having guess_max in the first place.

So, yes, I agree, that guess_max works correctly. But my point was, that the resulting logical vector is of a kind that I have never encountered before. How come that a logical vector is created that has several distinctive TRUE values!? This shouldn't be possible, right?

nacnudus · 2017-12-18T14:17:47Z

The issue with multiple distinctive TRUE values is the same as #385 and has been fixed, but is not yet on CRAN.

To benefit from the fix now, you could install the development version.

# install.packages("devtools")
devtools::install_github("tidyverse/readxl")

batpigandme · 2017-12-18T14:18:01Z

Edit: Well, looks like @nacnudus beat me to it!

My fix is to set guess_max to the maximum number of rows that a standard Excel can have (1,048,576). This still performs fast, making me wonder, what is the benefit of even having guess_max in the first place.

Depending on the nature of the data, a guess_max of 1million+ could slow things down. My read on it is that it's one of those things where it works for most scenarios, and it's easy enough to override by setting it manually.

It's hard to look into this without that actual .xls file.

How to provide a readxl reprex

We're in a much better position to address your issue if you can provide a reprex (reproducible example). Provide as much of this as you can:

An actual xls or xlsx file. Pick one:
- Your personal xls or xlsx: try to strip it down to the minimal size and complexity to demonstrate your point. And, obviously, remove any sensitive data.
- A publicly available xls or xlsx: provide URL and the code you used to download.
A small bit of R code that uses readxl on the provided xls or xlsx file and demonstrates your point.
- Consider using the reprex package to prepare this. In addition to nice formatting, this ensures your reprex is self-contained.
Any details about your environment that seem clearly relevant, such as operating system.
reprex(..., si = TRUE)
will append a standard summary, folded neatly away, at the bottom of your reprex.

How to provide your own xls/xlsx file? In order of preference:

Attach the file directly to your issue. Instructions are always at the bottom of the issue or comment box. .xlsx is a supported file type. You'll need to zip or gzip .xls so it appears as .zip or .gz.
Share via DropBox or Google Drive and provide the link in your issue.
Explain you absolutely cannot provide a relevant file via github.com and offer to provide privately.
Don't share a file and realize you're hoping for, e.g., a bug fix with no concrete example to go on.

cderv · 2017-12-18T14:22:19Z

@mplatzer Oh sorry, I did not understand the real issue at first. 😟
Seemed fixed in dev version. Good to know.

mplatzer · 2017-12-18T15:33:57Z

great, thanks. I can confirm that the issue is resolved in latest master.

mplatzer closed this as completed Dec 18, 2017

raviolli77 mentioned this issue Jun 26, 2018

Fixing cleaning script R-Meetup-SB/hackathon-201806#3

Merged

lock bot locked and limited conversation to collaborators Oct 10, 2019

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

inconsistent logical created with read_xlsx #414

inconsistent logical created with read_xlsx #414

mplatzer commented Dec 18, 2017

cderv commented Dec 18, 2017

mplatzer commented Dec 18, 2017

nacnudus commented Dec 18, 2017

batpigandme commented Dec 18, 2017 •

edited

Loading

cderv commented Dec 18, 2017

mplatzer commented Dec 18, 2017

inconsistent logical created with read_xlsx #414

inconsistent logical created with read_xlsx #414

Comments

mplatzer commented Dec 18, 2017

cderv commented Dec 18, 2017

mplatzer commented Dec 18, 2017

nacnudus commented Dec 18, 2017

batpigandme commented Dec 18, 2017 • edited Loading

How to provide a readxl reprex

cderv commented Dec 18, 2017

mplatzer commented Dec 18, 2017

batpigandme commented Dec 18, 2017 •

edited

Loading