Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

inconsistent logical created with read_xlsx #414

Closed
mplatzer opened this issue Dec 18, 2017 · 6 comments
Closed

inconsistent logical created with read_xlsx #414

mplatzer opened this issue Dec 18, 2017 · 6 comments

Comments

@mplatzer
Copy link

when reading in the attached demo.xlsx which has the following content:

colA colB
100  
101  
102 1
103 4 or more
104 2

and I try to read it with read_excel but haven't specified guess_max large enough, I get a tibble with an inconsistent logical vector. the colB is a logical, but when checking for unique values, I get two different TRUE values reported.

> tb <- read_excel('~/Downloads/demo.xlsx', guess_max = 2)
Warning message:
In read_fun(path = path, sheet = sheet, limits = limits, shim = shim,  :
  Expecting logical in B5 / R5C2: got '4 or more'
> tb[['colB']]
[1]   NA   NA TRUE   NA TRUE
> unique(tb[['colB']])
[1]   NA TRUE TRUE
> uniqueN(tb[['colB']])
[1] 3
> dput(tb[['colB']])
c(NA, NA, TRUE, NA, TRUE)
@cderv
Copy link

cderv commented Dec 18, 2017

I think readxl behaviour is correct. You can find information on how it works on readxl.tidyverse.org.

guess_max determines how many cells in each column are used to make a guess of the column type. You provide a guess_max too small for read_excel to correctly guess the column type. Without other information, it provides its best guess. 2 empty row = NA = logical column.

Here a reprex to illustrate how it works

library(readxl)

# download the file (thanks for uploading it in GH)
tmp_file <- tempfile(fileext = ".xlsx")
download.file("https://github.com/tidyverse/readxl/files/1567997/demo.xlsx", destfile = tmp_file, mode = "wb")

# guess with enough data
read_excel(tmp_file)
#> # A tibble: 5 x 2
#>    colA      colB
#>   <dbl>     <chr>
#> 1   100      <NA>
#> 2   101      <NA>
#> 3   102         1
#> 4   103 4 or more
#> 5   104         2

# Two first rows in colB are empty so NA is guessed and it is typed logical. So it is correct, but not what you want
read_excel(tmp_file, guess_max = 2)
#> Warning in read_fun(path = path, sheet = sheet, limits = limits, shim =
#> shim, : Expecting logical in B5 / R5C2: got '4 or more'
#> # A tibble: 5 x 2
#>    colA  colB
#>   <dbl> <lgl>
#> 1   100    NA
#> 2   101    NA
#> 3   102  TRUE
#> 4   103    NA
#> 5   104  TRUE

# if you take 3 first rows, it will be numeric and fourth row will be NA
read_excel(tmp_file, guess_max = 3)
#> Warning in read_fun(path = path, sheet = sheet, limits = limits, shim =
#> shim, : Expecting numeric in B5 / R5C2: got '4 or more'
#> # A tibble: 5 x 2
#>    colA  colB
#>   <dbl> <dbl>
#> 1   100    NA
#> 2   101    NA
#> 3   102     1
#> 4   103    NA
#> 5   104     2

# if you take 4 first rows, it will be text 
read_excel(tmp_file, guess_max = 4)
#> # A tibble: 5 x 2
#>    colA      colB
#>   <dbl>     <chr>
#> 1   100      <NA>
#> 2   101      <NA>
#> 3   102         1
#> 4   103 4 or more
#> 5   104         2

# If you know the type of a particular empty column, you set col_types, even with guess guess_max
read_excel(tmp_file, col_types = c("guess", "text"), guess_max = 2)
#> # A tibble: 5 x 2
#>    colA      colB
#>   <dbl>     <chr>
#> 1   100      <NA>
#> 2   101      <NA>
#> 3   102         1
#> 4   103 4 or more
#> 5   104         2

# or don't guess
read_excel(tmp_file, col_types = c("numeric", "text"))
#> # A tibble: 5 x 2
#>    colA      colB
#>   <dbl>     <chr>
#> 1   100      <NA>
#> 2   101      <NA>
#> 3   102         1
#> 4   103 4 or more
#> 5   104         2

# delete temp file
unlink(tmp_file)

Why did you want to put a small guess_max in the first place?
If the default (1000) is not enough, you should increase it or sets manually the column type.

Hopes it helps!

@mplatzer
Copy link
Author

The actual use case is that I am reading various Excels with 100k+ rows and hundreds of columns (typical wide survey data export), where I don't know the content beforehand. For some columns the first 1000 rows are empty, e.g because that variable has only been added at a later surveying wave.

My fix is to set guess_max to the maximum number of rows that a standard Excel can have (1,048,576). This still performs fast, making me wonder, what is the benefit of even having guess_max in the first place.

So, yes, I agree, that guess_max works correctly. But my point was, that the resulting logical vector is of a kind that I have never encountered before. How come that a logical vector is created that has several distinctive TRUE values!? This shouldn't be possible, right?

@nacnudus
Copy link
Contributor

The issue with multiple distinctive TRUE values is the same as #385 and has been fixed, but is not yet on CRAN.

To benefit from the fix now, you could install the development version.

# install.packages("devtools")
devtools::install_github("tidyverse/readxl")

@batpigandme
Copy link
Contributor

batpigandme commented Dec 18, 2017

Edit: Well, looks like @nacnudus beat me to it!

My fix is to set guess_max to the maximum number of rows that a standard Excel can have (1,048,576). This still performs fast, making me wonder, what is the benefit of even having guess_max in the first place.

Depending on the nature of the data, a guess_max of 1million+ could slow things down. My read on it is that it's one of those things where it works for most scenarios, and it's easy enough to override by setting it manually.

It's hard to look into this without that actual .xls file.

How to provide a readxl reprex

We're in a much better position to address your issue if you can provide a reprex (reproducible example). Provide as much of this as you can:

  • An actual xls or xlsx file. Pick one:
    • Your personal xls or xlsx: try to strip it down to the minimal size and complexity to demonstrate your point. And, obviously, remove any sensitive data.
    • A publicly available xls or xlsx: provide URL and the code you used to download.
  • A small bit of R code that uses readxl on the provided xls or xlsx file and demonstrates your point.
    • Consider using the reprex package to prepare this. In addition to nice formatting, this ensures your reprex is self-contained.
  • Any details about your environment that seem clearly relevant, such as operating system.
    reprex(..., si = TRUE)
    will append a standard summary, folded neatly away, at the bottom of your reprex.

How to provide your own xls/xlsx file? In order of preference:

  1. Attach the file directly to your issue. Instructions are always at the bottom of the issue or comment box. .xlsx is a supported file type. You'll need to zip or gzip .xls so it appears as .zip or .gz.
  2. Share via DropBox or Google Drive and provide the link in your issue.
  3. Explain you absolutely cannot provide a relevant file via github.com and offer to provide privately.
  4. Don't share a file and realize you're hoping for, e.g., a bug fix with no concrete example to go on.

@cderv
Copy link

cderv commented Dec 18, 2017

@mplatzer Oh sorry, I did not understand the real issue at first. 😟
Seemed fixed in dev version. Good to know.

@mplatzer
Copy link
Author

great, thanks. I can confirm that the issue is resolved in latest master.

@lock lock bot locked and limited conversation to collaborators Oct 10, 2019
Sign up for free to subscribe to this conversation on GitHub. Already have an account? Sign in.
Labels
None yet
Projects
None yet
Development

No branches or pull requests

4 participants