Reads blank rows after end of data #203

tcquinn · 2016-09-17T15:52:42Z

With some Excel files, readxl seems to read a large number of blank rows after the end of the visible data. For example, when reading this file from the New York City Department of Finance (with skip=4), readxl reads 25,141 rows, but 19,990 of them are blank. I'm using the CRAN version (0.1.1). (Sorry if it's verboten to report issues with the CRAN version here; I posted on Stack Overflow but didn't get any useful responses)

The text was updated successfully, but these errors were encountered:

smckenna · 2016-11-02T00:17:06Z

I have seen same behavior and it's a killer! Deleting rows by selecting the cells and hitting the delete key does not seem to delete the content. It disappears, but readxl finds it somehow! I did find that if you delete rows by selecting rows using their row numbers, right-clicking, and doing delete, that seems to really delete the rows.

jamesui · 2016-11-18T00:23:09Z

Dealing with the same issue, though it is easy to fix by using tidyr::drop.na(df) after using read_excel.

jennybc · 2016-11-18T12:18:18Z

@smckenna @jamesui Do you see this with xls also (as the example @tcquinn links to) or with xlsx? If xlsx, can you share or link an example, to complete the set?

jennybc · 2017-01-07T23:01:17Z

library(readxl)
#download.file("https://www1.nyc.gov/assets/finance/downloads/pdf/rolling_sales/annualized-sales/2012/2012_bronx.xls", destfile = "2012_bronx.xls")

df <- read_excel("2012_bronx.xls")
#> DEFINEDNAME: 21 00 00 01 0b 00 00 00 01 00 00 00 00 00 00 0d 3b 00 00 04 00 04 00 00 00 14 00 
#> and so on and so forth
df
#> # A tibble: 25,145 × 21
#>    `BRONX ANNUALIZE SALE FOR YEAR 2012.  All Sales From  Jan 1, 2012 - Dec 31, 2012.`
#>                                                                                 <chr>
#> 1                      Sales File as of 03/31/2013  Coop Sales Files as of 03/31/2013
#> 2                            Neighborhood Name and Descriptive Data is as of 04/19/13
#> 3                 Building Class Category is based on Building Class at Time of Sale.
#> 4                                                                           BOROUGH\n
#> 5                                                                            2.000000
#> 6                                                                            2.000000
#> 7                                                                            2.000000
#> 8                                                                            2.000000
#> 9                                                                            2.000000
#> 10                                                                           2.000000
#> # ... with 25,135 more rows, and 20 more variables: `NA` <chr>,
#> #   `NA` <chr>, `NA` <chr>, `NA` <chr>, `NA` <chr>, `` <chr>, `` <chr>,
#> #   `` <chr>, `` <chr>, `` <chr>, `` <chr>, `` <chr>, `` <chr>, `` <chr>,
#> #   `` <chr>, `` <chr>, `` <chr>, `` <chr>, `` <chr>, `` <chr>
dim(df)
#> [1] 25145    21

df2 <- tidyr::drop_na(df)
dim(df2)
#> [1] 5152   21

as.data.frame(tail(df))
#>   BRONX ANNUALIZE SALE FOR YEAR 2012.  All Sales From  Jan 1, 2012 - Dec 31, 2012.
#> 1                                                                             <NA>
#> 2                                                                             <NA>
#> 3                                                                             <NA>
#> 4                                                                             <NA>
#> 5                                                                             <NA>
#> 6                                                                             <NA>
#>     NA   NA   NA   NA   NA                                             
#> 1 <NA> <NA> <NA> <NA> <NA> <NA> <NA> <NA> <NA> <NA> <NA> <NA> <NA> <NA>
#> 2 <NA> <NA> <NA> <NA> <NA> <NA> <NA> <NA> <NA> <NA> <NA> <NA> <NA> <NA>
#> 3 <NA> <NA> <NA> <NA> <NA> <NA> <NA> <NA> <NA> <NA> <NA> <NA> <NA> <NA>
#> 4 <NA> <NA> <NA> <NA> <NA> <NA> <NA> <NA> <NA> <NA> <NA> <NA> <NA> <NA>
#> 5 <NA> <NA> <NA> <NA> <NA> <NA> <NA> <NA> <NA> <NA> <NA> <NA> <NA> <NA>
#> 6 <NA> <NA> <NA> <NA> <NA> <NA> <NA> <NA> <NA> <NA> <NA> <NA> <NA> <NA>
#>                                
#> 1 <NA> <NA> <NA> <NA> <NA> <NA>
#> 2 <NA> <NA> <NA> <NA> <NA> <NA>
#> 3 <NA> <NA> <NA> <NA> <NA> <NA>
#> 4 <NA> <NA> <NA> <NA> <NA> <NA>
#> 5 <NA> <NA> <NA> <NA> <NA> <NA>
#> 6 <NA> <NA> <NA> <NA> <NA> <NA>

jennybc · 2017-01-07T23:20:22Z

I've "moved" @burchill's comment to #224, because it's about a distinct problem.

jennybc · 2017-02-05T00:02:09Z

This comes from taking the declared dimensions literally. I can already fix for xlsx, by replacingcacheDimensions() with computeDimensions() here:

readxl/src/XlsxWorkSheet.h

Line 56 in 3338f92

cacheDimension();

Verified with xlsx saved from 2012_bronx.xls sheet from above.

jennybc mentioned this issue Jan 7, 2017

Skip empty rows #22

Closed

jennybc added bug an unexpected problem or unintended behavior blanks labels Jan 7, 2017

jennybc mentioned this issue Jan 7, 2017

n blank rows in xlsx lead to n trailing NA rows in data frame #224

Closed

jennybc mentioned this issue Jan 26, 2017

remove empty columns ? #236

Closed

jennybc closed this as completed in c6e0f8f Feb 6, 2017

This comment has been minimized.

Sign in to view

lock bot locked and limited conversation to collaborators Oct 19, 2019

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Reads blank rows after end of data #203

Reads blank rows after end of data #203

tcquinn commented Sep 17, 2016 •

edited

Loading

smckenna commented Nov 2, 2016

jamesui commented Nov 18, 2016 •

edited by jennybc

Loading

jennybc commented Nov 18, 2016

jennybc commented Jan 7, 2017

jennybc commented Jan 7, 2017

jennybc commented Feb 5, 2017 •

edited

Loading

This comment has been minimized.

Reads blank rows after end of data #203

Reads blank rows after end of data #203

Comments

tcquinn commented Sep 17, 2016 • edited Loading

smckenna commented Nov 2, 2016

jamesui commented Nov 18, 2016 • edited by jennybc Loading

jennybc commented Nov 18, 2016

jennybc commented Jan 7, 2017

jennybc commented Jan 7, 2017

jennybc commented Feb 5, 2017 • edited Loading

This comment has been minimized.

tcquinn commented Sep 17, 2016 •

edited

Loading

jamesui commented Nov 18, 2016 •

edited by jennybc

Loading

jennybc commented Feb 5, 2017 •

edited

Loading