inconsistent logical created with read_xlsx #414
I think
Here is a reprex to illustrate how it works:
library(readxl)
# download the file (thanks for uploading it in GH)
tmp_file <- tempfile(fileext = ".xlsx")
download.file("https://github.com/tidyverse/readxl/files/1567997/demo.xlsx", destfile = tmp_file, mode = "wb")
# guess with enough data
read_excel(tmp_file)
#> # A tibble: 5 x 2
#> colA colB
#> <dbl> <chr>
#> 1 100 <NA>
#> 2 101 <NA>
#> 3 102 1
#> 4 103 4 or more
#> 5 104 2
# The first two rows in colB are empty, so with guess_max = 2 only NA is seen and the column is typed logical. That is correct, but not what you want
read_excel(tmp_file, guess_max = 2)
#> Warning in read_fun(path = path, sheet = sheet, limits = limits, shim =
#> shim, : Expecting logical in B5 / R5C2: got '4 or more'
#> # A tibble: 5 x 2
#> colA colB
#> <dbl> <lgl>
#> 1 100 NA
#> 2 101 NA
#> 3 102 TRUE
#> 4 103 NA
#> 5 104 TRUE
# if you take the first 3 rows, colB will be numeric and the fourth row will be NA
read_excel(tmp_file, guess_max = 3)
#> Warning in read_fun(path = path, sheet = sheet, limits = limits, shim =
#> shim, : Expecting numeric in B5 / R5C2: got '4 or more'
#> # A tibble: 5 x 2
#> colA colB
#> <dbl> <dbl>
#> 1 100 NA
#> 2 101 NA
#> 3 102 1
#> 4 103 NA
#> 5 104 2
# if you take the first 4 rows, colB will be text
read_excel(tmp_file, guess_max = 4)
#> # A tibble: 5 x 2
#> colA colB
#> <dbl> <chr>
#> 1 100 <NA>
#> 2 101 <NA>
#> 3 102 1
#> 4 103 4 or more
#> 5 104 2
# If you know the type of a particular empty column, you can set col_types, even together with guess_max
read_excel(tmp_file, col_types = c("guess", "text"), guess_max = 2)
#> # A tibble: 5 x 2
#> colA colB
#> <dbl> <chr>
#> 1 100 <NA>
#> 2 101 <NA>
#> 3 102 1
#> 4 103 4 or more
#> 5 104 2
# or don't guess
read_excel(tmp_file, col_types = c("numeric", "text"))
#> # A tibble: 5 x 2
#> colA colB
#> <dbl> <chr>
#> 1 100 <NA>
#> 2 101 <NA>
#> 3 102 1
#> 4 103 4 or more
#> 5 104 2
# delete temp file
unlink(tmp_file)
Why did you want to put a small
Hope it helps! |
The actual use case is that I am reading various Excel files with 100k+ rows and hundreds of columns (typical wide survey data export), where I don't know the content beforehand. For some columns the first 1000 rows are empty, e.g. because that variable was only added in a later survey wave. My fix is to set So, yes, I agree, that |
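For wide exports like this, where a column can be empty for the first thousand rows or more, one way to sidestep type guessing is to read every column as text and convert afterwards. This is only a minimal sketch, assuming base R's `type.convert()` is an acceptable stand-in for readxl's own guessing; `read_excel_via_text()` and `convert_text_columns()` are hypothetical helper names, not part of readxl.

```r
# Hypothetical helper: convert each character column using all of its values,
# so leading empty cells cannot decide the column type.
convert_text_columns <- function(df) {
  df[] <- lapply(df, function(col) utils::type.convert(col, as.is = TRUE))
  df
}

# Hypothetical helper: read every column as text, then convert.
# A single col_types value is recycled across all columns by read_excel().
read_excel_via_text <- function(path, ...) {
  raw <- readxl::read_excel(path, col_types = "text", ...)
  convert_text_columns(raw)
}
```

Called as `read_excel_via_text(tmp_file)`, this looks at every row of every column, at the cost of holding everything as character first. Columns that need special treatment, such as dates, would still need their own handling.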
The issue with multiple distinctive To benefit from the fix now, you could install the development version.
# install.packages("devtools")
devtools::install_github("tidyverse/readxl") |
Edit: Well, looks like @nacnudus beat me to it!
Depending on the nature of the data, a guess_max of 1 million+ could slow things down. My read on it is that it's one of those things where it works for most scenarios, and it's easy enough to override by setting it manually. It's hard to look into this without that actual
How to provide a readxl reprex
We're in a much better position to address your issue if you can provide a reprex (reproducible example). Provide as much of this as you can:
How to provide your own xls/xlsx file? In order of preference:
|
@mplatzer Oh sorry, I did not understand the real issue at first. 😟 |
great, thanks. I can confirm that the issue is resolved in latest |
When reading in the attached demo.xlsx, which has the following content:
and I try to read it with read_excel without specifying a large enough guess_max, I get a tibble with an inconsistent logical vector. colB is a logical, but when checking for unique values, I get two different TRUE values reported.
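The check described here can be sketched in a few lines. A well-formed logical vector has at most three distinct values (TRUE, FALSE, NA), and TRUE should never show up twice in `unique()`. The helper name `is_consistent_logical()` is hypothetical and only meant for illustration:

```r
# Hypothetical helper: TRUE if a logical vector reports each of
# TRUE / FALSE / NA at most once among its unique values.
is_consistent_logical <- function(x) {
  stopifnot(is.logical(x))
  u <- unique(x)
  # More than one TRUE (or FALSE) among the unique values signals
  # the inconsistency described in this report.
  sum(u %in% TRUE) <= 1L && sum(u %in% FALSE) <= 1L && length(u) <= 3L
}
```

On the tibble from this report, that would be `is_consistent_logical(df$colB)`, with `df` being the result of the `read_excel()` call.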