Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Optional anchor and ncol #8

Closed
smbache opened this issue Mar 19, 2015 · 40 comments
Closed

Optional anchor and ncol #8

smbache opened this issue Mar 19, 2015 · 40 comments
Labels
feature a feature request or enhancement

Comments

@smbache
Copy link
Member

smbache commented Mar 19, 2015

It would be nice to be able to specify where the rectangular data part is located within a sheet (some people like to have "non-data" related things in their sheets too). One way could be to specify the top-left anchor cell, say "D5", and optionally the number of columns to use.

@hadley
Copy link
Member

hadley commented Mar 19, 2015

Shouldn't be too hard to translate (e.g.) B5:X17 to right number of rows and cols to skip/read

@hadley
Copy link
Member

hadley commented Mar 19, 2015

@zentree would you prefer a specification like this?

@jkeirstead
Copy link

I'd second this, both the functionality of being able to pull out a rectangular selection and the ability to specify that selection with Excel-like references. I'm currently using XLConnect's readWorksheet(..., startCol=, endCol=, startRow=, endRow=) syntax but it's a bit long-winded.

@smbache
Copy link
Member Author

smbache commented Mar 19, 2015

Specifying the number of rows should not be a must, as you typically don't know the number of rows,
so "A1:Z32" is typically not something you know within R, but I guess as long e.g. "B:Z" would work it's fine.

@gshotwell
Copy link

I would go with the "A1:Z31" syntax. Typically an excel user will be referencing the data cells anyway in the sheet so they'll be thinking in those terms. You might also want to have it skip the "$" notation in excel, so that "A$1:$B$32" returns the same cell range as A1:B32, just because the user might be copying the reference from an excel sheet. .

@zentree
Copy link

zentree commented Mar 19, 2015

Rectangular specification would be a nice way to deal with number of rows too. From a user point of view I have to deal with two cases:

  1. I'm processing a large number of files—large enough so I can't check every one of them—which would have a default format. These are often machine generated.
  2. I'm reading a small number of files created by hand, where I need to extract a variable rectangular array.

I see a notation like A1:Z32 dealing mostly with case 2.

@DavoOZ
Copy link

DavoOZ commented Mar 20, 2015

Excellent new package; so is readr.
I can see readxl replacing my old workhorse XLConnect very soon (no more hassles with java updates). Also much easier for new users.

I'd use two extra options a lot:

  1. header=TRUE/FALSE
  2. startrow=n

@jennybc
Copy link
Member

jennybc commented Mar 22, 2015

I'm doing exactly this over in googlesheets, which consumes (and writes) data from (to) Google Spreadsheets.

For the targeted data consumption discussed here, the Sheets API forces you to use a cell-by-cell approach (the incredibly slow "cell feed"). In this case, my function lets user specify the data rectangle via min row, max row, min col, max col. There are convenience wrappers to get one or more rows, one or more columns, or a region specified like B3:G17 or R3C2:R17C7 (the other standard positioning notation). Then that output can be processed by one of two functions for reshaping and/or transformation, where there is an argument for header = TRUE/FALSE.(Hmmm... I should probably write even more wrappers to package that sequence of actions.)

https://github.com/jennybc/googlesheets/blob/master/R/consume-data.R

https://github.com/jennybc/googlesheets/blob/master/README.md#convenience-wrappers-and-post-processing-the-data

I'm really interested in this thread because it would be great to keep the two interfaces as similar as possible. I even volunteer to help!

@hadley
Copy link
Member

hadley commented Mar 23, 2015

@jennybc I'd love to incorporate your cell specification code in readxl.

I think bundling all the options into one argument plus helper functions would be ideal. Something like this:

read_excel(..., range = "D12:F15")
read_excel(..., range = rc("R1C12:R6C15"))
read_excel(..., range = excel_range(c(1, 6), c(1, 15))
read_excel(..., range = excel_range(c(2, NA), c(1, NA))

@jennybc
Copy link
Member

jennybc commented Mar 23, 2015

OK I will isolate and upgrade cell specification over in googlesheets and get back to you. The helper functions are nice idea.

  • Would range = + rc() and excel_range() be the only way to restrict cell consumption or … would you allow cell specification via, e.g., row = 3 or col = "B:D"? Or perhaps range = rows(3) and range = cols("B:D") or range = cols(B:D)? FWIW I am not prepared to deal with any requests for non-contiguous rows or cells, so my inclination is to disallow that.
  • I accept a single cell anywhere I accept a range, i.e. range = "D14" is just as valid as range = "D14:G20". I assume that's OK.
  • Ultimately I need my cell limits as a named list, to pass as a query. So that's where all roads must lead re: cell specification in googlesheets. You probably have more control/responsibility here in readxl.

@wdkrnls
Copy link

wdkrnls commented Mar 24, 2015

The "B7:Z18" syntax sounds great! Just please make sure it can handle more than 26 columns.

@jennybc
Copy link
Member

jennybc commented Mar 24, 2015

@wdkrnls What is the max number of columns in Excel? In Google sheets, it's only 300. It is helpful to know this upper bound for the function that translates columns in "ABDC" positioning notation to actual column numbers.

@smbache
Copy link
Member Author

smbache commented Mar 24, 2015

According to this source excel 2010 is limited to 16384 columns.

https://support.office.com/en-nz/article/Excel-specifications-and-limits-1672b34d-7043-467e-8e27-269d656771c3

@wdkrnls
Copy link

wdkrnls commented Mar 26, 2015

My excel sheets regularly have hundreds of columns by the time I get them. I don't have exact statistics, but my data come from arrays of micro reactors hooked up to online chromatography and have between 500-1000 columns covering concentrations of myriads of detectable chemicals.

@jennybc
Copy link
Member

jennybc commented Mar 30, 2015

@hadley readxl currently has no dependencies other than Rcpp. Would you want that to be true of the range specifying stuff as well, i.e. it's part of the package philosophy?

@hadley
Copy link
Member

hadley commented Mar 30, 2015

I'd prefer to keep dependencies minimal, but it seems like this code needs to be shared between readxl and spreadr

@jennybc
Copy link
Member

jennybc commented Mar 30, 2015

I'll try to do it that way then. And look wistfully at stringr and %>%.

@jennybc
Copy link
Member

jennybc commented Apr 2, 2015

@hadley

In a branch, I isolated the the cell specification functions and refactored them to have no dependencies:

https://github.com/jennybc/googlesheets/blob/cell-range-specification/R/cell-specification.R

It's basically 3 pairs of functions:

  • letter_to_num() <--> num_to_letter() for converting column IDs, e.g., ABD <--> 732
  • A1_to_RC() <--> RC_to_A1() for converting between positioning notation, e.g, AB10 <--> R10C28
  • convert_range_to_limit_list() <--> convert_limit_list_to_range() for, e.g., A1:C4 <--> list(min-row= 1,min-col= 1,max-row= 4,max-col= 3)

I wanted to check back in before going further. Hopefully this get things rolling, to settle on user-facing bits. I don't export any of the above.

Tests are here:

https://github.com/jennybc/googlesheets/blob/cell-range-specification/tests/testthat/test-cell-specification.R

@wdkrnls
Copy link

wdkrnls commented Apr 10, 2015

Would I be able to use excel_range to select all columns to the right?

e.g. excel_range(c(15, NA), c(NA, NA))

This feature would be very useful for reading and cleaning the output of one of the data sources I have.

@jennybc
Copy link
Member

jennybc commented Apr 10, 2015

@wdkrnls Yes that's my interpretation of the proposal and what the NAs would mean. However, I would interpret your example as requesting to read from row 15 on, not column. I'd assume this:

excel_range(c(min-row, max-row), c(min-col, max-col))

where NA for a max means no limit and NA for a min is equivalent to specifying the value as 1.

@wdkrnls
Copy link

wdkrnls commented Apr 11, 2015

Fair enough, I was imagining specifying the corners of the box as c(column, row) pairs analogous to the excel syntax, but your interpretation sounds great to me.

@jennybc
Copy link
Member

jennybc commented Apr 13, 2015

@hadley Let me specify where I need your guidance.

Ways to specify cells. From most Excel-ish and least programmatically useful to least Excel-ish and most programmatically useful:

  • A1:E4
  • R1C1: R4C5
  • named vector or list, e.g. c(min_row = 1, max_row = 4, min_col = 1, max_col = 5)

Internally, my current manipulations always move things down that list above. But the way I read your proposed helper functions:

read_excel(..., range = "D12:F15")
read_excel(..., range = rc("R1C12:R6C15"))
read_excel(..., range = excel_range(c(1, 6), c(1, 15))
read_excel(..., range = excel_range(c(2, NA), c(1, NA))

… they seem to imply transformations up the list, i.e. towards the A1:E4 style of cell specification. Which feels weird because internally we'll just reverse that and go all the way to named vector or list.

Should I just get over it and proceed anyway? I realize there is no concern about speed or anything.

Is it overkill to have an S3 class for a cell range that holds the specifications in all the formats, so it's always easy to print or use the one best for the task at hand?

Do you approve of row or column-based helpers:

read_excel(..., range = rows(1:5))
read_excel(..., range = rows(2:*))
read_excel(..., range = columns(3:9))
read_excel(..., range = columns("A:G"))

@hadley
Copy link
Member

hadley commented Apr 13, 2015

Altogether I think it would look something like this:

excel_range <- function(rows, col = cols) {
  stopifnot(is.numeric(rows), length(rows) == 2)
  stopifnot(is.numeric(cols), length(cols) == 2)

  structure(list(rows = rows, cols = cols), class = "excel_range")
}

rc <- function(x) {
  ...
  excel_range(rows, cols)
}

as.excel_range <- function(x) UseMethod("as.excel_range")
as.excel_range.excel_range <- function(x) x
as.excel_range.character <- function(x) rc(x)

and then read_excel would call as.excel_range() on its input. Does that make sense?

Row and column based helpers would be fine (although I'd worry a little bit about giving them such short names - generally better to preserve short names for widely applicable tools)

@jennybc
Copy link
Member

jennybc commented Apr 13, 2015

Perfect. I will push it forward and report back.

jennybc pushed a commit to jennybc/googlesheets that referenced this issue Apr 17, 2015
@jennybc
Copy link
Member

jennybc commented Apr 17, 2015

I wrote the parts to process cell specification by the user.

This is still in a branch for me, but there I am using these functions myself. FWIW all is well re: tests and travis.

All the cell specification stuff that could be in common is in this file and there are no package dependencies:

https://github.com/jennybc/googlesheets/blob/cell-range-specification/R/cell-specification.R

Tests are here:

https://github.com/jennybc/googlesheets/blob/cell-range-specification/tests/testthat/test-cell-specification.R

I renamed the class from excel_range to cell_limits, to make less Excel-specific and more descriptive (?). I also found it easier to ditch the rc() helper and just detect whether the range is in A1 or R1C1 notation.

Haven't added any row or column helpers yet.

This should put you in a position to modify read_excel() to accept calls like this now:

read_excel(..., range = "D12:F15")
read_excel(..., range = "R1C12:R6C15")
read_excel(..., range = cell_limits(c(1, 6), c(1, 15))
read_excel(..., range = cell_limits(c(2, NA), c(1, NA))

@wdkrnls BTW your original interpretation of the format of the limits is correct. It will be row min + max in one vector and col min + max in the other. sorry I misremembered this … I stand by my original response

@jennybc
Copy link
Member

jennybc commented Apr 17, 2015

I should probably implement @gshotwell's suggestion to ignore $s. Yes?

@hadley
Copy link
Member

hadley commented Apr 17, 2015

Looks good. I agree you should just ignore $.

Next step is to figure out where to put this? Maybe we should have a small cell tools package?

@jennybc
Copy link
Member

jennybc commented Apr 17, 2015

I can put this in a little package. Call it … sheetcells, cellranges, ???

I'll ignore $ and experiment w/ row and column helpers. If it's to be a package for general cell helpers, I'll also pull in a function to take an anchor cell + some input (1d or 2d) and return the range of cells that would be affected by an edit. Which brings us full circle, i.e. back to @smbache's original post in this issue!

@hadley
Copy link
Member

hadley commented Apr 20, 2015

Cell ranges sounds good to me

@wdkrnls
Copy link

wdkrnls commented Apr 20, 2015

Really excited to try this out!

@jennybc
Copy link
Member

jennybc commented Apr 21, 2015

I've put this stuff in a package, cellranger. I'll open an issue there, @hadley, with a few questions and comments.

@smbache
Copy link
Member Author

smbache commented Apr 21, 2015

You forgot to remove the last 'e' ;) it's convention!

@jennybc
Copy link
Member

jennybc commented Apr 22, 2015

The helper package is on CRAN now:

http://cran.r-project.org/web/packages/cellranger/index.html

@kcandrews
Copy link

Thanks so much for this! I'm really looking forward to never typing xlcFreeMemory() again.

@eibanez
Copy link

eibanez commented May 1, 2015

Could this be extended to support the following?

read_excel(..., range = "Sheet1!D12:F15")

@kcandrews
Copy link

What's the advantage of that versus the already existing sheet= argument to read_excel?

@eibanez
Copy link

eibanez commented May 1, 2015

It mirrors the representation that Excel uses for ranges, e.g., for named ranges (#79). Not a big deal, but wanted to throw it out there.

@jennybc
Copy link
Member

jennybc commented May 1, 2015

It's reasonable to think about expanding the notion of a cell_limits object to include the hosting (work)sheet or tab, in addition to the row and column limits. Both Excel and Google Sheets share that same high-level structure and we already must specify worksheet for all reads and writes.

I'm not exactly eager to rework the handling of worksheets … but it's worth considering.

@bbolker
Copy link

bbolker commented Jan 18, 2017

Bump: is there a way to access this functionality yet? Sorry if it's been implemented and I missed it, but I claim it's not obvious ...

My use case is just to be able to select a limited number of rows. The spreadsheet has header rows interspersed with data (sigh): if I follow the advice to read the whole file (possibly skipping initial rows) and discard the stuff I don't need, then I have the hassle of converting columns back from character to numeric ...

@jennybc
Copy link
Member

jennybc commented Jan 18, 2017

@bbolker Not yet, but this package is now a main focus of mine. Once I'm done with basic triage of the issues/pull requests, this feature will be a high priority. Closing the longest-running issue will be a real pleasure.

@jennybc jennybc added the feature a feature request or enhancement label Jan 31, 2017
@jennybc jennybc closed this as completed in b80ac11 Apr 4, 2017
@lock lock bot locked and limited conversation to collaborators Oct 10, 2019
Sign up for free to subscribe to this conversation on GitHub. Already have an account? Sign in.
Labels
feature a feature request or enhancement
Projects
None yet
Development

No branches or pull requests