-
Notifications
You must be signed in to change notification settings - Fork 195
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Collapse multiple rows of headers (feature request) #486
Comments
Having searched again, I saw that I missed #379 (closed) somehow, which starts out dealing with this exact sort of issue and then veers off into how to process pivot tables/wide data. |
readxl could perhaps gain a convenience function to deal with such sheets, that would be some version of what you are doing with your two reads: one of header rows and another of data. So, it would be a thin wrapper around 2 calls to Main question is what should the interface be re: describing how to make column names from multiple header rows. cc @richfitz, who I think had a similar question recently |
It is indeed true that two calls to My stab would be something like below. In all the examples I've come across in the wild, it would be fine to have a single integer argument read_excel_multiline <- function(filename, row_collapse = 1, ...) {
nms <- read_excel(filename, range = cell_rows(seq_len(row_collapse)), col_names = F)
nms <- lapply(nms, na.omit)
nms <- lapply(nms, paste, collapse = "_")
read_excel(filename, skip = row_collapse, col_names = unlist(nms), ...)
}
read_excel_multiline("example.xlsx", row_collapse = 2)
> # A tibble: 6 x 6
> `Trt_Factor 1` `Trt_Factor 2` `Dry wt_grams` `Fresh wt_grams` Temp_C replicate
> <chr> <chr> <dbl> <dbl> <dbl> <dbl>
> 1 a d 3 5 20 1
> 2 a d 5 10 22 2
> 3 a e 8 15 21 3
> 4 b e 4 9 30 1
> 5 b f 6 11 31 2
> 6 b f 2 4 32 3
|
@brianwdavis would you subscribe/watch #492? Once that mini vignette exists and is merged, perhaps you could add a second worked example, based on the above? While we ruminate on whether to add something to readxl itself and, if so, what, it's nice to have clear worked examples available for users right now. |
I still think building this into readxl itself is a non-starter. I still welcome worked examples in the articles of the pkgdown so that people who share this problem can customize some general patterns for tackling this. |
Could this be reconsidered in light of the new tidyr functions for multiple-value columns, which make this situation a very attractive one to get into? |
Yeah @nacnudus I agree that current developments in tidyr are very interesting re: reading spreadsheets, with complicated headers. I don't think the tidyr work is mature enough to open a specific issue on readxl at this point, but it will be interesting to see if we can make things converge in a fruitful way. |
I'd like to bring up this issue again. I encounter it sufficiently often to chime in that a wrapper would be useful (I put mine in one of my private packages, but I'd be happy to share/contribute it).
I may add that I'm not facing arbitrary untidy files here: I mostly work with measurement data, so physical quantities which frequently have units that do matter. |
I frequently receive Excel files that have "complex" headers, where there are multiple rows of cells encoding a single column name. Usually there's included units, labels, or additional metadata, but rarely can the extra rows be thrown away and only the first kept. Here's an example:
X__1
, and the second treatment column is coerced toTrt__1
.paste
, and unlist them to get a vector of column names. Then I read in the data part of the sheet separately and add the names.My desire: an additional argument to
read_excel
that's likerow_collapse = 1:2
, telling it to do that.The text was updated successfully, but these errors were encountered: