Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Reads blank rows after end of data #203

Closed
tcquinn opened this issue Sep 17, 2016 · 7 comments
Closed

Reads blank rows after end of data #203

tcquinn opened this issue Sep 17, 2016 · 7 comments
Labels
bug an unexpected problem or unintended behavior

Comments

@tcquinn
Copy link

tcquinn commented Sep 17, 2016

With some Excel files, readxl seems to read a large number of blank rows after the end of the visible data. For example, when reading this file from the New York City Department of Finance (with skip=4), readxl reads 25,141 rows, but 19,990 of them are blank. I'm using the CRAN version (0.1.1). (Sorry if it's verboten to report issues with the CRAN version here; I posted on Stack Overflow but didn't get any useful responses)

@smckenna
Copy link

smckenna commented Nov 2, 2016

I have seen same behavior and it's a killer! Deleting rows by selecting the cells and hitting the delete key does not seem to delete the content. It disappears, but readxl finds it somehow! I did find that if you delete rows by selecting rows using their row numbers, right-clicking, and doing delete, that seems to really delete the rows.

@jamesui
Copy link

jamesui commented Nov 18, 2016

Dealing with the same issue, though it is easy to fix by using tidyr::drop.na(df) after using read_excel.

@jennybc
Copy link
Member

jennybc commented Nov 18, 2016

@smckenna @jamesui Do you see this with xls also (as the example @tcquinn links to) or with xlsx? If xlsx, can you share or link an example, to complete the set?

@jennybc jennybc mentioned this issue Jan 7, 2017
@jennybc jennybc added bug an unexpected problem or unintended behavior blanks labels Jan 7, 2017
@jennybc
Copy link
Member

jennybc commented Jan 7, 2017

library(readxl)
#download.file("https://www1.nyc.gov/assets/finance/downloads/pdf/rolling_sales/annualized-sales/2012/2012_bronx.xls", destfile = "2012_bronx.xls")

df <- read_excel("2012_bronx.xls")
#> DEFINEDNAME: 21 00 00 01 0b 00 00 00 01 00 00 00 00 00 00 0d 3b 00 00 04 00 04 00 00 00 14 00 
#> and so on and so forth
df
#> # A tibble: 25,145 × 21
#>    `BRONX ANNUALIZE SALE FOR YEAR 2012.  All Sales From  Jan 1, 2012 - Dec 31, 2012.`
#>                                                                                 <chr>
#> 1                      Sales File as of 03/31/2013  Coop Sales Files as of 03/31/2013
#> 2                            Neighborhood Name and Descriptive Data is as of 04/19/13
#> 3                 Building Class Category is based on Building Class at Time of Sale.
#> 4                                                                           BOROUGH\n
#> 5                                                                            2.000000
#> 6                                                                            2.000000
#> 7                                                                            2.000000
#> 8                                                                            2.000000
#> 9                                                                            2.000000
#> 10                                                                           2.000000
#> # ... with 25,135 more rows, and 20 more variables: `NA` <chr>,
#> #   `NA` <chr>, `NA` <chr>, `NA` <chr>, `NA` <chr>, `` <chr>, `` <chr>,
#> #   `` <chr>, `` <chr>, `` <chr>, `` <chr>, `` <chr>, `` <chr>, `` <chr>,
#> #   `` <chr>, `` <chr>, `` <chr>, `` <chr>, `` <chr>, `` <chr>
dim(df)
#> [1] 25145    21

df2 <- tidyr::drop_na(df)
dim(df2)
#> [1] 5152   21

as.data.frame(tail(df))
#>   BRONX ANNUALIZE SALE FOR YEAR 2012.  All Sales From  Jan 1, 2012 - Dec 31, 2012.
#> 1                                                                             <NA>
#> 2                                                                             <NA>
#> 3                                                                             <NA>
#> 4                                                                             <NA>
#> 5                                                                             <NA>
#> 6                                                                             <NA>
#>     NA   NA   NA   NA   NA                                             
#> 1 <NA> <NA> <NA> <NA> <NA> <NA> <NA> <NA> <NA> <NA> <NA> <NA> <NA> <NA>
#> 2 <NA> <NA> <NA> <NA> <NA> <NA> <NA> <NA> <NA> <NA> <NA> <NA> <NA> <NA>
#> 3 <NA> <NA> <NA> <NA> <NA> <NA> <NA> <NA> <NA> <NA> <NA> <NA> <NA> <NA>
#> 4 <NA> <NA> <NA> <NA> <NA> <NA> <NA> <NA> <NA> <NA> <NA> <NA> <NA> <NA>
#> 5 <NA> <NA> <NA> <NA> <NA> <NA> <NA> <NA> <NA> <NA> <NA> <NA> <NA> <NA>
#> 6 <NA> <NA> <NA> <NA> <NA> <NA> <NA> <NA> <NA> <NA> <NA> <NA> <NA> <NA>
#>                                
#> 1 <NA> <NA> <NA> <NA> <NA> <NA>
#> 2 <NA> <NA> <NA> <NA> <NA> <NA>
#> 3 <NA> <NA> <NA> <NA> <NA> <NA>
#> 4 <NA> <NA> <NA> <NA> <NA> <NA>
#> 5 <NA> <NA> <NA> <NA> <NA> <NA>
#> 6 <NA> <NA> <NA> <NA> <NA> <NA>

@jennybc
Copy link
Member

jennybc commented Jan 7, 2017

I've "moved" @burchill's comment to #224, because it's about a distinct problem.

@jennybc
Copy link
Member

jennybc commented Feb 5, 2017

This comes from taking the declared dimensions literally. I can already fix for xlsx, by replacingcacheDimensions() with computeDimensions() here:

cacheDimension();

Verified with xlsx saved from 2012_bronx.xls sheet from above.

@jennybc jennybc closed this as completed in c6e0f8f Feb 6, 2017
@zbrainsoft

This comment has been minimized.

@lock lock bot locked and limited conversation to collaborators Oct 19, 2019
Sign up for free to subscribe to this conversation on GitHub. Already have an account? Sign in.
Labels
bug an unexpected problem or unintended behavior
Projects
None yet
Development

No branches or pull requests

5 participants