Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Collapse multiple rows of headers (feature request) #486

Closed
brianwdavis opened this issue Jun 8, 2018 · 8 comments
Closed

Collapse multiple rows of headers (feature request) #486

brianwdavis opened this issue Jun 8, 2018 · 8 comments

Comments

@brianwdavis
Copy link

brianwdavis commented Jun 8, 2018

I frequently receive Excel files that have "complex" headers, where there are multiple rows of cells encoding a single column name. Usually there's included units, labels, or additional metadata, but rarely can the extra rows be thrown away and only the first kept. Here's an example:

image

  • In the workflow below, the first attempt erroneously puts the metadata in the first row of the tibble. This incorrectly coerces the parser to guess that 4 columns are character, not numeric. Additionally, we lose the name of the 6th column, converted to X__1, and the second treatment column is coerced to Trt__1.
  • The second attempt correctly parses the column types, but we lose the names on the first 5 columns, and keep only the metadata. Since there are multiple columns with the unit "grams", that's not enough to reconstruct the original headers.
  • The third attempt is a flavor of the first, manually removing the metadata row and forcing a reparse. Again, this loses one of the column names, and confuses the two treatment columns.
  • The fourth attempt is my current workflow: I read in the top few rows that I know contain both names and metadata, collapse them with paste, and unlist them to get a vector of column names. Then I read in the data part of the sheet separately and add the names.

My desire: an additional argument to read_excel that's like row_collapse = 1:2, telling it to do that.


library(readxl)                                                             
library(dplyr)                                                              
#> 
#> Attaching package: 'dplyr'
#> The following objects are masked from 'package:stats':
#> 
#>     filter, lag
#> The following objects are masked from 'package:base':
#> 
#>     intersect, setdiff, setequal, union
                                                                            
file_path <- "https://github.com/brianwdavis/public/raw/master/example.xlsx"
                                                                            
download.file(file_path, "example.xlsx", mode = "wb")                       
                                                                            
read_excel("example.xlsx")                                                  
#> # A tibble: 7 x 6
#>   Trt      Trt__1   `Dry wt` `Fresh wt` Temp  X__1     
#>   <chr>    <chr>    <chr>    <chr>      <chr> <chr>    
#> 1 Factor 1 Factor 2 grams    grams      C     replicate
#> 2 a        d        3        5          20    1        
#> 3 a        d        5        10         22    2        
#> 4 a        e        8        15         21    3        
#> 5 b        e        4        9          30    1        
#> 6 b        f        6        11         31    2        
#> 7 b        f        2        4          32    3
read_excel("example.xlsx", skip = 1)                                        
#> # A tibble: 6 x 6
#>   `Factor 1` `Factor 2` grams grams__1     C replicate
#>   <chr>      <chr>      <dbl>    <dbl> <dbl>     <dbl>
#> 1 a          d              3        5    20         1
#> 2 a          d              5       10    22         2
#> 3 a          e              8       15    21         3
#> 4 b          e              4        9    30         1
#> 5 b          f              6       11    31         2
#> 6 b          f              2        4    32         3
                                                                            
                                                                            
                                                                            
read_excel("example.xlsx") %>%                                              
slice(-1) %>%                                                               
mutate_all(readr::parse_guess)                                              
#> # A tibble: 6 x 6
#>   Trt   Trt__1 `Dry wt` `Fresh wt`  Temp  X__1
#>   <chr> <chr>     <int>      <int> <int> <int>
#> 1 a     d             3          5    20     1
#> 2 a     d             5         10    22     2
#> 3 a     e             8         15    21     3
#> 4 b     e             4          9    30     1
#> 5 b     f             6         11    31     2
#> 6 b     f             2          4    32     3
                                                                            
                                                                            
nms <-                                                                      
read_excel("example.xlsx", range = cell_rows(1:2), col_names = F) %>%       
summarise_all(funs(paste(na.omit(.), collapse = "_"))) %>%                  
unlist()                                                                    
nms                                                                         
#>             X__1             X__2             X__3             X__4 
#>   "Trt_Factor 1"   "Trt_Factor 2"   "Dry wt_grams" "Fresh wt_grams" 
#>             X__5             X__6 
#>         "Temp_C"      "replicate"
                                                                            
read_excel("example.xlsx", skip = 2) %>%                                    
magrittr::set_colnames(nms)                                                 
#> # A tibble: 5 x 6
#>   `Trt_Factor 1` `Trt_Factor 2` `Dry wt_grams` `Fresh wt_grams` Temp_C
#>   <chr>          <chr>                   <dbl>            <dbl>  <dbl>
#> 1 a              d                           5               10     22
#> 2 a              e                           8               15     21
#> 3 b              e                           4                9     30
#> 4 b              f                           6               11     31
#> 5 b              f                           2                4     32
#> # ... with 1 more variable: replicate <dbl>
@brianwdavis
Copy link
Author

brianwdavis commented Jun 18, 2018

Having searched again, I saw that I missed #379 (closed) somehow, which starts out dealing with this exact sort of issue and then veers off into how to process pivot tables/wide data.

@jennybc
Copy link
Member

jennybc commented Jun 18, 2018

readxl could perhaps gain a convenience function to deal with such sheets, that would be some version of what you are doing with your two reads: one of header rows and another of data.

So, it would be a thin wrapper around 2 calls to read_excel(). I don't think it's worth it to deal with any of this on C/C++ side. It would be a bit more efficient to do the first read, make colnames, then provide these colnames to the second read (vs. applying them after the fact).

Main question is what should the interface be re: describing how to make column names from multiple header rows.

cc @richfitz, who I think had a similar question recently

@brianwdavis
Copy link
Author

brianwdavis commented Jun 18, 2018

It is indeed true that two calls to read_excel is about 2x as fast as one call to read_excel and then readr::parse_guess on each column (which surprised me! the C/C++ side must be very fast).

My stab would be something like below. In all the examples I've come across in the wild, it would be fine to have a single integer argument row_collapse that counts the number of rows to concatenate. I'm curious what other use cases people have where that wouldn't fit though.


read_excel_multiline <- function(filename, row_collapse = 1, ...) {
  nms <- read_excel(filename, range = cell_rows(seq_len(row_collapse)), col_names = F)  
  nms <- lapply(nms, na.omit)
  nms <- lapply(nms, paste, collapse = "_")

  read_excel(filename, skip = row_collapse, col_names = unlist(nms), ...)                                     
}

read_excel_multiline("example.xlsx", row_collapse = 2)

> # A tibble: 6 x 6
>   `Trt_Factor 1` `Trt_Factor 2` `Dry wt_grams` `Fresh wt_grams` Temp_C replicate
>   <chr>          <chr>                   <dbl>            <dbl>  <dbl>     <dbl>
> 1 a              d                           3                5     20         1
> 2 a              d                           5               10     22         2
> 3 a              e                           8               15     21         3
> 4 b              e                           4                9     30         1
> 5 b              f                           6               11     31         2
> 6 b              f                           2                4     32         3

@jennybc
Copy link
Member

jennybc commented Jul 10, 2018

@brianwdavis would you subscribe/watch #492? Once that mini vignette exists and is merged, perhaps you could add a second worked example, based on the above? While we ruminate on whether to add something to readxl itself and, if so, what, it's nice to have clear worked examples available for users right now.

@jennybc
Copy link
Member

jennybc commented Dec 14, 2018

I still think building this into readxl itself is a non-starter. I still welcome worked examples in the articles of the pkgdown so that people who share this problem can customize some general patterns for tackling this.

@jennybc jennybc closed this as completed Dec 14, 2018
@nacnudus
Copy link
Contributor

Could this be reconsidered in light of the new tidyr functions for multiple-value columns, which make this situation a very attractive one to get into?

@jennybc
Copy link
Member

jennybc commented Mar 21, 2019

Yeah @nacnudus I agree that current developments in tidyr are very interesting re: reading spreadsheets, with complicated headers. I don't think the tidyr work is mature enough to open a specific issue on readxl at this point, but it will be interesting to see if we can make things converge in a fruitful way.

@cbeleites
Copy link

I'd like to bring up this issue again. I encounter it sufficiently often to chime in that a wrapper would be useful (I put mine in one of my private packages, but I'd be happy to share/contribute it).

read_excel_multiheader <- function(data.range, header.range, header.fun = first, ...){
  header <- read_excel(..., range = header.range, col_names = FALSE)
  header <- apply (header, 2, header.fun)

  table <- read_excel(..., range = data.range, col_names = FALSE)
  colnames (table) <- header

  table
}

I may add that I'm not facing arbitrary untidy files here: I mostly work with measurement data, so physical quantities which frequently have units that do matter.
With that I frequently encounter Excel sheets with the column name in one row and the unit in the next. Which IMHO is about as tidy as concatenating the unit to the column name. But while the units belong to the column, but they are also not names, so the point can be made that they should not share the same cell. (IMHO the really tidy solution would be data types with units, so columns could have units in a defined and tidy way.)

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

No branches or pull requests

4 participants