Collapse multiple rows of headers (feature request) #486

brianwdavis · 2018-06-08T19:07:15Z

I frequently receive Excel files that have "complex" headers, where there are multiple rows of cells encoding a single column name. Usually there's included units, labels, or additional metadata, but rarely can the extra rows be thrown away and only the first kept. Here's an example:

In the workflow below, the first attempt erroneously puts the metadata in the first row of the tibble. This incorrectly coerces the parser to guess that 4 columns are character, not numeric. Additionally, we lose the name of the 6th column, converted to X__1, and the second treatment column is coerced to Trt__1.
The second attempt correctly parses the column types, but we lose the names on the first 5 columns, and keep only the metadata. Since there are multiple columns with the unit "grams", that's not enough to reconstruct the original headers.
The third attempt is a flavor of the first, manually removing the metadata row and forcing a reparse. Again, this loses one of the column names, and confuses the two treatment columns.
The fourth attempt is my current workflow: I read in the top few rows that I know contain both names and metadata, collapse them with paste, and unlist them to get a vector of column names. Then I read in the data part of the sheet separately and add the names.

My desire: an additional argument to read_excel that's like row_collapse = 1:2, telling it to do that.

library(readxl)                                                             
library(dplyr)                                                              
#> 
#> Attaching package: 'dplyr'
#> The following objects are masked from 'package:stats':
#> 
#>     filter, lag
#> The following objects are masked from 'package:base':
#> 
#>     intersect, setdiff, setequal, union
                                                                            
file_path <- "https://github.com/brianwdavis/public/raw/master/example.xlsx"
                                                                            
download.file(file_path, "example.xlsx", mode = "wb")                       
                                                                            
read_excel("example.xlsx")                                                  
#> # A tibble: 7 x 6
#>   Trt      Trt__1   `Dry wt` `Fresh wt` Temp  X__1     
#>   <chr>    <chr>    <chr>    <chr>      <chr> <chr>    
#> 1 Factor 1 Factor 2 grams    grams      C     replicate
#> 2 a        d        3        5          20    1        
#> 3 a        d        5        10         22    2        
#> 4 a        e        8        15         21    3        
#> 5 b        e        4        9          30    1        
#> 6 b        f        6        11         31    2        
#> 7 b        f        2        4          32    3
read_excel("example.xlsx", skip = 1)                                        
#> # A tibble: 6 x 6
#>   `Factor 1` `Factor 2` grams grams__1     C replicate
#>   <chr>      <chr>      <dbl>    <dbl> <dbl>     <dbl>
#> 1 a          d              3        5    20         1
#> 2 a          d              5       10    22         2
#> 3 a          e              8       15    21         3
#> 4 b          e              4        9    30         1
#> 5 b          f              6       11    31         2
#> 6 b          f              2        4    32         3
                                                                            
                                                                            
                                                                            
read_excel("example.xlsx") %>%                                              
slice(-1) %>%                                                               
mutate_all(readr::parse_guess)                                              
#> # A tibble: 6 x 6
#>   Trt   Trt__1 `Dry wt` `Fresh wt`  Temp  X__1
#>   <chr> <chr>     <int>      <int> <int> <int>
#> 1 a     d             3          5    20     1
#> 2 a     d             5         10    22     2
#> 3 a     e             8         15    21     3
#> 4 b     e             4          9    30     1
#> 5 b     f             6         11    31     2
#> 6 b     f             2          4    32     3
                                                                            
                                                                            
nms <-                                                                      
read_excel("example.xlsx", range = cell_rows(1:2), col_names = F) %>%       
summarise_all(funs(paste(na.omit(.), collapse = "_"))) %>%                  
unlist()                                                                    
nms                                                                         
#>             X__1             X__2             X__3             X__4 
#>   "Trt_Factor 1"   "Trt_Factor 2"   "Dry wt_grams" "Fresh wt_grams" 
#>             X__5             X__6 
#>         "Temp_C"      "replicate"
                                                                            
read_excel("example.xlsx", skip = 2) %>%                                    
magrittr::set_colnames(nms)                                                 
#> # A tibble: 5 x 6
#>   `Trt_Factor 1` `Trt_Factor 2` `Dry wt_grams` `Fresh wt_grams` Temp_C
#>   <chr>          <chr>                   <dbl>            <dbl>  <dbl>
#> 1 a              d                           5               10     22
#> 2 a              e                           8               15     21
#> 3 b              e                           4                9     30
#> 4 b              f                           6               11     31
#> 5 b              f                           2                4     32
#> # ... with 1 more variable: replicate <dbl>

The text was updated successfully, but these errors were encountered:

brianwdavis · 2018-06-18T21:46:28Z

Having searched again, I saw that I missed #379 (closed) somehow, which starts out dealing with this exact sort of issue and then veers off into how to process pivot tables/wide data.

jennybc · 2018-06-18T22:10:28Z

readxl could perhaps gain a convenience function to deal with such sheets, that would be some version of what you are doing with your two reads: one of header rows and another of data.

So, it would be a thin wrapper around 2 calls to read_excel(). I don't think it's worth it to deal with any of this on C/C++ side. It would be a bit more efficient to do the first read, make colnames, then provide these colnames to the second read (vs. applying them after the fact).

Main question is what should the interface be re: describing how to make column names from multiple header rows.

cc @richfitz, who I think had a similar question recently

brianwdavis · 2018-06-18T23:15:52Z

It is indeed true that two calls to read_excel is about 2x as fast as one call to read_excel and then readr::parse_guess on each column (which surprised me! the C/C++ side must be very fast).

My stab would be something like below. In all the examples I've come across in the wild, it would be fine to have a single integer argument row_collapse that counts the number of rows to concatenate. I'm curious what other use cases people have where that wouldn't fit though.

read_excel_multiline <- function(filename, row_collapse = 1, ...) {
  nms <- read_excel(filename, range = cell_rows(seq_len(row_collapse)), col_names = F)  
  nms <- lapply(nms, na.omit)
  nms <- lapply(nms, paste, collapse = "_")

  read_excel(filename, skip = row_collapse, col_names = unlist(nms), ...)                                     
}

read_excel_multiline("example.xlsx", row_collapse = 2)

> # A tibble: 6 x 6
>   `Trt_Factor 1` `Trt_Factor 2` `Dry wt_grams` `Fresh wt_grams` Temp_C replicate
>   <chr>          <chr>                   <dbl>            <dbl>  <dbl>     <dbl>
> 1 a              d                           3                5     20         1
> 2 a              d                           5               10     22         2
> 3 a              e                           8               15     21         3
> 4 b              e                           4                9     30         1
> 5 b              f                           6               11     31         2
> 6 b              f                           2                4     32         3

jennybc · 2018-07-10T04:01:16Z

@brianwdavis would you subscribe/watch #492? Once that mini vignette exists and is merged, perhaps you could add a second worked example, based on the above? While we ruminate on whether to add something to readxl itself and, if so, what, it's nice to have clear worked examples available for users right now.

jennybc · 2018-12-14T01:11:30Z

I still think building this into readxl itself is a non-starter. I still welcome worked examples in the articles of the pkgdown so that people who share this problem can customize some general patterns for tackling this.

nacnudus · 2019-03-19T14:45:53Z

Could this be reconsidered in light of the new tidyr functions for multiple-value columns, which make this situation a very attractive one to get into?

jennybc · 2019-03-21T04:51:12Z

Yeah @nacnudus I agree that current developments in tidyr are very interesting re: reading spreadsheets, with complicated headers. I don't think the tidyr work is mature enough to open a specific issue on readxl at this point, but it will be interesting to see if we can make things converge in a fruitful way.

cbeleites · 2019-12-20T19:32:57Z

I'd like to bring up this issue again. I encounter it sufficiently often to chime in that a wrapper would be useful (I put mine in one of my private packages, but I'd be happy to share/contribute it).

read_excel_multiheader <- function(data.range, header.range, header.fun = first, ...){
  header <- read_excel(..., range = header.range, col_names = FALSE)
  header <- apply (header, 2, header.fun)

  table <- read_excel(..., range = data.range, col_names = FALSE)
  colnames (table) <- header

  table
}

I may add that I'm not facing arbitrary untidy files here: I mostly work with measurement data, so physical quantities which frequently have units that do matter.
With that I frequently encounter Excel sheets with the column name in one row and the unit in the next. Which IMHO is about as tidy as concatenating the unit to the column name. But while the units belong to the column, but they are also not names, so the point can be made that they should not share the same cell. (IMHO the really tidy solution would be data types with units, so columns could have units in a defined and tidy way.)

brianwdavis mentioned this issue Jul 11, 2018

Multiple header rows mini-vignette? #492

Closed

jennybc closed this as completed Dec 14, 2018

avallecam mentioned this issue Aug 13, 2021

feature request: read_excel_multiline avallecam/epihelper#1

Closed

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Collapse multiple rows of headers (feature request) #486

Collapse multiple rows of headers (feature request) #486

brianwdavis commented Jun 8, 2018 •

edited

Loading

brianwdavis commented Jun 18, 2018 •

edited

Loading

jennybc commented Jun 18, 2018

brianwdavis commented Jun 18, 2018 •

edited

Loading

jennybc commented Jul 10, 2018

jennybc commented Dec 14, 2018

nacnudus commented Mar 19, 2019

jennybc commented Mar 21, 2019

cbeleites commented Dec 20, 2019

Collapse multiple rows of headers (feature request) #486

Collapse multiple rows of headers (feature request) #486

Comments

brianwdavis commented Jun 8, 2018 • edited Loading

brianwdavis commented Jun 18, 2018 • edited Loading

jennybc commented Jun 18, 2018

brianwdavis commented Jun 18, 2018 • edited Loading

jennybc commented Jul 10, 2018

jennybc commented Dec 14, 2018

nacnudus commented Mar 19, 2019

jennybc commented Mar 21, 2019

cbeleites commented Dec 20, 2019

brianwdavis commented Jun 8, 2018 •

edited

Loading

brianwdavis commented Jun 18, 2018 •

edited

Loading

brianwdavis commented Jun 18, 2018 •

edited

Loading