Optional anchor and ncol #8

smbache · 2015-03-19T07:42:08Z

It would be nice to be able to specify where the rectangular data part is located within a sheet (some people like to have "non-data" related things in their sheets too). One way could be to specify the top-left anchor cell, say "D5", and optionally the number of columns to use.

hadley · 2015-03-19T11:25:18Z

It shouldn't be too hard to translate, e.g., B5:X17 into the right number of rows and columns to skip and read.
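As a rough illustration of that translation (a minimal sketch, not code from readxl; the function name and the returned field names are hypothetical), an A1-style range can be split into its corner cells and turned into skip/read counts:

```r
# Hypothetical helper: turn an A1-style range such as "B5:X17" into
# the number of rows/columns to skip and to read.
a1_to_skip_read_sketch <- function(range) {
  pattern <- "^([A-Za-z]+)([0-9]+):([A-Za-z]+)([0-9]+)$"
  m <- regmatches(range, regexec(pattern, range))[[1]]
  if (length(m) != 5) stop("not an A1-style range: ", range)

  # Column letters are a bijective base-26 number (A = 1, ..., Z = 26, AA = 27).
  col_num <- function(s) {
    d <- match(strsplit(toupper(s), "")[[1]], LETTERS)
    sum(d * 26^(rev(seq_along(d)) - 1))
  }

  rows <- as.numeric(m[c(3, 5)])
  cols <- c(col_num(m[2]), col_num(m[4]))

  list(
    skip_rows = rows[1] - 1,
    n_rows    = rows[2] - rows[1] + 1,
    skip_cols = cols[1] - 1,
    n_cols    = cols[2] - cols[1] + 1
  )
}

a1_to_skip_read_sketch("B5:X17")
```

For "B5:X17" this gives 4 rows to skip and 13 to read, and 1 column to skip and 23 to read.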

hadley · 2015-03-19T11:25:39Z

@zentree would you prefer a specification like this?

jkeirstead · 2015-03-19T12:46:30Z

I'd second this, both the functionality of being able to pull out a rectangular selection and the ability to specify that selection with Excel-like references. I'm currently using XLConnect's readWorksheet(..., startCol=, endCol=, startRow=, endRow=) syntax but it's a bit long-winded.

smbache · 2015-03-19T12:51:29Z

Specifying the number of rows should not be required, as you typically don't know it in advance, so "A1:Z32" is typically not something you know within R. But I guess as long as, e.g., "B:Z" would work, it's fine.

gshotwell · 2015-03-19T13:19:02Z

I would go with the "A1:Z31" syntax. Typically an Excel user will be referencing the data cells in the sheet anyway, so they'll be thinking in those terms. You might also want to have it ignore the "$" notation in Excel, so that "A$1:$B$32" returns the same cell range as "A1:B32", since the user might be copying the reference from an Excel sheet.
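A minimal sketch of that suggestion, assuming the range arrives as a plain character string (the function name is hypothetical): drop the absolute-reference markers before any further parsing.

```r
# Hypothetical helper: remove Excel's "$" absolute-reference markers.
# fixed = TRUE treats "$" literally rather than as a regex end anchor.
strip_dollar_sketch <- function(range) {
  gsub("$", "", range, fixed = TRUE)
}

strip_dollar_sketch("A$1:$B$32")
```

With this, "A$1:$B$32" and "A1:B32" reduce to the same string, so the rest of the parsing only has to handle the plain form.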

zentree · 2015-03-19T20:00:50Z

Rectangular specification would be a nice way to deal with number of rows too. From a user point of view I have to deal with two cases:

1. I'm processing a large number of files (large enough that I can't check every one of them), which would have a default format. These are often machine generated.
2. I'm reading a small number of files created by hand, where I need to extract a variable rectangular array.

I see a notation like A1:Z32 dealing mostly with case 2.

DavoOZ · 2015-03-20T01:33:25Z

Excellent new package; so is readr.
I can see readxl replacing my old workhorse XLConnect very soon (no more hassles with java updates). Also much easier for new users.

I'd use two extra options a lot:

header=TRUE/FALSE
startrow=n

jennybc · 2015-03-22T19:29:32Z

I'm doing exactly this over in googlesheets, which consumes (and writes) data from (to) Google Spreadsheets.

For the targeted data consumption discussed here, the Sheets API forces you to use a cell-by-cell approach (the incredibly slow "cell feed"). In this case, my function lets the user specify the data rectangle via min row, max row, min col, max col. There are convenience wrappers to get one or more rows, one or more columns, or a region specified like B3:G17 or R3C2:R17C7 (the other standard positioning notation). That output can then be processed by one of two functions for reshaping and/or transformation, where there is an argument for header = TRUE/FALSE. (Hmmm... I should probably write even more wrappers to package that sequence of actions.)

https://github.com/jennybc/googlesheets/blob/master/R/consume-data.R

https://github.com/jennybc/googlesheets/blob/master/README.md#convenience-wrappers-and-post-processing-the-data

I'm really interested in this thread because it would be great to keep the two interfaces as similar as possible. I even volunteer to help!

hadley · 2015-03-23T11:29:20Z

@jennybc I'd love to incorporate your cell specification code in readxl.

I think bundling all the options into one argument plus helper functions would be ideal. Something like this:

read_excel(..., range = "D12:F15")
read_excel(..., range = rc("R1C12:R6C15"))
read_excel(..., range = excel_range(c(1, 6), c(1, 15)))
read_excel(..., range = excel_range(c(2, NA), c(1, NA)))

jennybc · 2015-03-23T14:37:57Z

OK, I will isolate and upgrade cell specification over in googlesheets and get back to you. The helper functions are a nice idea.

Would range = + rc() and excel_range() be the only way to restrict cell consumption or … would you allow cell specification via, e.g., row = 3 or col = "B:D"? Or perhaps range = rows(3) and range = cols("B:D") or range = cols(B:D)? FWIW I am not prepared to deal with any requests for non-contiguous rows or cells, so my inclination is to disallow that.
I accept a single cell anywhere I accept a range, i.e. range = "D14" is just as valid as range = "D14:G20". I assume that's OK.
Ultimately I need my cell limits as a named list, to pass as a query. So that's where all roads must lead re: cell specification in googlesheets. You probably have more control/responsibility here in readxl.

wdkrnls · 2015-03-24T14:16:50Z

The "B7:Z18" syntax sounds great! Just please make sure it can handle more than 26 columns.

jennybc · 2015-03-24T17:13:47Z

@wdkrnls What is the max number of columns in Excel? In Google sheets, it's only 300. It is helpful to know this upper bound for the function that translates columns in "ABDC" positioning notation to actual column numbers.

smbache · 2015-03-24T17:19:19Z

According to this source, Excel 2010 is limited to 16,384 columns.

https://support.office.com/en-nz/article/Excel-specifications-and-limits-1672b34d-7043-467e-8e27-269d656771c3

wdkrnls · 2015-03-26T02:46:27Z

My Excel sheets regularly have hundreds of columns by the time I get them. I don't have exact statistics, but my data come from arrays of micro reactors hooked up to online chromatography and have between 500 and 1000 columns covering concentrations of myriads of detectable chemicals.

jennybc · 2015-03-30T17:09:12Z

@hadley readxl currently has no dependencies other than Rcpp. Would you want that to be true of the range specifying stuff as well, i.e. it's part of the package philosophy?

hadley · 2015-03-30T18:18:35Z

I'd prefer to keep dependencies minimal, but it seems like this code needs to be shared between readxl and spreadr

jennybc · 2015-03-30T18:21:54Z

I'll try to do it that way then. And look wistfully at stringr and %>%.

jennybc · 2015-04-02T05:56:51Z

@hadley

In a branch, I isolated the cell specification functions and refactored them to have no dependencies:

https://github.com/jennybc/googlesheets/blob/cell-range-specification/R/cell-specification.R

It's basically 3 pairs of functions:

letter_to_num() <--> num_to_letter() for converting column IDs, e.g., ABD <--> 732
A1_to_RC() <--> RC_to_A1() for converting between positioning notation, e.g., AB10 <--> R10C28
convert_range_to_limit_list() <--> convert_limit_list_to_range() for, e.g., A1:C4 <--> list("min-row" = 1, "min-col" = 1, "max-row" = 4, "max-col" = 3)

I wanted to check back in before going further. Hopefully this gets things rolling, to settle on user-facing bits. I don't export any of the above.

Tests are here:

https://github.com/jennybc/googlesheets/blob/cell-range-specification/tests/testthat/test-cell-specification.R
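For readers who don't want to open the branch, the first pair of conversions might look roughly like this. This is an independent sketch in base R, not the code from the linked file; the `_sketch` names are hypothetical. Column IDs are treated as bijective base-26 numbers, which also covers sheets far wider than 26 columns.

```r
# Hypothetical: column letters -> column number (A = 1, Z = 26, AA = 27, ...).
letter_to_num_sketch <- function(x) {
  digits <- match(strsplit(toupper(x), "")[[1]], LETTERS)
  stopifnot(!anyNA(digits))
  sum(digits * 26^(rev(seq_along(digits)) - 1))
}

# Hypothetical: column number -> column letters.
num_to_letter_sketch <- function(n) {
  stopifnot(is.numeric(n), length(n) == 1, n >= 1)
  out <- character()
  while (n > 0) {
    rem <- (n - 1) %% 26          # 0-based position of the last letter
    out <- c(LETTERS[rem + 1], out)
    n <- (n - 1) %/% 26
  }
  paste(out, collapse = "")
}

letter_to_num_sketch("ABD")   # 732
num_to_letter_sketch(16384)   # "XFD", the last column in Excel 2010
```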

wdkrnls · 2015-04-10T22:44:18Z

Would I be able to use excel_range to select all columns to the right?

e.g. excel_range(c(15, NA), c(NA, NA))

This feature would be very useful for reading and cleaning the output of one of the data sources I have.

jennybc · 2015-04-10T23:10:39Z

@wdkrnls Yes that's my interpretation of the proposal and what the NAs would mean. However, I would interpret your example as requesting to read from row 15 on, not column. I'd assume this:

excel_range(c(min-row, max-row), c(min-col, max-col))

where NA for a max means no limit and NA for a min is equivalent to specifying the value as 1.
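One way to picture that NA convention (a sketch only; the function and its `extent` argument are hypothetical, and it assumes the sheet's actual extent is known by the time the limits are resolved):

```r
# Hypothetical: resolve a c(min, max) pair that may contain NA.
# NA min -> start at 1; NA max -> no limit, i.e. run to the sheet's extent.
resolve_limits_sketch <- function(limits, extent) {
  stopifnot(length(limits) == 2, length(extent) == 1)
  lo <- if (is.na(limits[1])) 1 else limits[1]
  hi <- if (is.na(limits[2])) extent else min(limits[2], extent)
  c(lo, hi)
}

# Rows from 15 onward in a sheet with 100 populated rows:
resolve_limits_sketch(c(15, NA), extent = 100)
# All columns in a sheet with 20 populated columns:
resolve_limits_sketch(c(NA, NA), extent = 20)
```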

wdkrnls · 2015-04-11T13:55:34Z

Fair enough, I was imagining specifying the corners of the box as c(column, row) pairs analogous to the excel syntax, but your interpretation sounds great to me.

jennybc · 2015-04-13T18:03:54Z

@hadley Let me specify where I need your guidance.

Ways to specify cells. From most Excel-ish and least programmatically useful to least Excel-ish and most programmatically useful:

A1:E4
R1C1:R4C5
named vector or list, e.g. c(min_row = 1, max_row = 4, min_col = 1, max_col = 5)

Internally, my current manipulations always move things down that list above. But the way I read your proposed helper functions:

read_excel(..., range = "D12:F15")
read_excel(..., range = rc("R1C12:R6C15"))
read_excel(..., range = excel_range(c(1, 6), c(1, 15)))
read_excel(..., range = excel_range(c(2, NA), c(1, NA)))

… they seem to imply transformations up the list, i.e. towards the A1:E4 style of cell specification. Which feels weird because internally we'll just reverse that and go all the way to named vector or list.

Should I just get over it and proceed anyway? I realize there is no concern about speed or anything.

Is it overkill to have an S3 class for a cell range that holds the specifications in all the formats, so it's always easy to print or use the one best for the task at hand?

Do you approve of row or column-based helpers:

read_excel(..., range = rows(1:5))
read_excel(..., range = rows(2:*))
read_excel(..., range = columns(3:9))
read_excel(..., range = columns("A:G"))

hadley · 2015-04-13T18:36:32Z

Altogether I think it would look something like this:

excel_range <- function(rows, cols) {
  stopifnot(is.numeric(rows), length(rows) == 2)
  stopifnot(is.numeric(cols), length(cols) == 2)

  structure(list(rows = rows, cols = cols), class = "excel_range")
}

rc <- function(x) {
  ...
  excel_range(rows, cols)
}

as.excel_range <- function(x) UseMethod("as.excel_range")
as.excel_range.excel_range <- function(x) x
as.excel_range.character <- function(x) rc(x)

and then read_excel would call as.excel_range() on its input. Does that make sense?

Row and column based helpers would be fine (although I'd worry a little bit about giving them such short names - generally better to preserve short names for widely applicable tools)
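To make the dispatch idea concrete, here is a self-contained sketch of the same design. All names carry a `_sketch` suffix to mark them as hypothetical, and the body of the R1C1 parser is an assumption, since the original leaves it elided.

```r
# Constructor: validates and tags the row/column limits.
excel_range_sketch <- function(rows, cols) {
  stopifnot(is.numeric(rows), length(rows) == 2)
  stopifnot(is.numeric(cols), length(cols) == 2)
  structure(list(rows = rows, cols = cols), class = "excel_range_sketch")
}

# Assumed parser for strings like "R1C12:R6C15" (two corner cells).
rc_sketch <- function(x) {
  pattern <- "^R([0-9]+)C([0-9]+):R([0-9]+)C([0-9]+)$"
  m <- regmatches(x, regexec(pattern, x))[[1]]
  if (length(m) != 5) stop("not an R1C1 range: ", x)
  v <- as.numeric(m[-1])
  excel_range_sketch(rows = v[c(1, 3)], cols = v[c(2, 4)])
}

# S3 generic: objects pass through unchanged, strings are parsed.
as_range_sketch <- function(x) UseMethod("as_range_sketch")
as_range_sketch.excel_range_sketch <- function(x) x
as_range_sketch.character <- function(x) rc_sketch(x)

r <- as_range_sketch("R1C12:R6C15")
r$rows  # 1 6
r$cols  # 12 15
```

The point of the generic is that a reader function only ever needs to call the coercion once on whatever the user supplied.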

jennybc · 2015-04-13T18:45:45Z

Perfect. I will push it forward and report back.


jennybc · 2015-04-17T00:34:27Z

I wrote the parts to process cell specification by the user.

This is still in a branch for me, but there I am using these functions myself. FWIW all is well re: tests and travis.

All the cell specification stuff that could be in common is in this file and there are no package dependencies:

https://github.com/jennybc/googlesheets/blob/cell-range-specification/R/cell-specification.R

Tests are here:

https://github.com/jennybc/googlesheets/blob/cell-range-specification/tests/testthat/test-cell-specification.R

I renamed the class from excel_range to cell_limits, to make less Excel-specific and more descriptive (?). I also found it easier to ditch the rc() helper and just detect whether the range is in A1 or R1C1 notation.

Haven't added any row or column helpers yet.

This should put you in a position to modify read_excel() to accept calls like this now:

read_excel(..., range = "D12:F15")
read_excel(..., range = "R1C12:R6C15")
read_excel(..., range = cell_limits(c(1, 6), c(1, 15)))
read_excel(..., range = cell_limits(c(2, NA), c(1, NA)))

~~@wdkrnls BTW your original interpretation of the format of the limits is correct. It will be row min + max in one vector and col min + max in the other.~~ sorry I misremembered this … I stand by my original response
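Detecting the notation instead of requiring an rc() wrapper could be sketched as below. This is an illustration with a hypothetical function name, not the code in the branch; it accepts a single cell or a two-cell range and does not handle "$" markers.

```r
# Hypothetical: classify a cell or range string as "R1C1" or "A1" notation.
detect_notation_sketch <- function(x) {
  x <- toupper(x)
  cell_rc <- "R[0-9]+C[0-9]+"
  cell_a1 <- "[A-Z]+[0-9]+"
  # Check R1C1 first: "R1C12" is letters-digits-letters-digits, so it can
  # never match the A1 pattern, while "RC1" (column RC, row 1) only matches A1.
  if (grepl(paste0("^", cell_rc, "(:", cell_rc, ")?$"), x)) {
    "R1C1"
  } else if (grepl(paste0("^", cell_a1, "(:", cell_a1, ")?$"), x)) {
    "A1"
  } else {
    NA_character_
  }
}

detect_notation_sketch("D12:F15")
detect_notation_sketch("R1C12:R6C15")
```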

jennybc · 2015-04-17T00:39:42Z

I should probably implement @gshotwell's suggestion to ignore $s. Yes?

hadley · 2015-04-17T10:55:37Z

Looks good. I agree you should just ignore $.

Next step is to figure out where to put this? Maybe we should have a small cell tools package?

jennybc · 2015-04-17T21:57:07Z

I can put this in a little package. Call it … sheetcells, cellranges, ???

I'll ignore $ and experiment w/ row and column helpers. If it's to be a package for general cell helpers, I'll also pull in a function to take an anchor cell + some input (1d or 2d) and return the range of cells that would be affected by an edit. Which brings us full circle, i.e. back to @smbache's original post in this issue!

hadley · 2015-04-20T12:10:39Z

Cell ranges sounds good to me

wdkrnls · 2015-04-20T23:58:16Z

Really excited to try this out!

jennybc · 2015-04-21T04:55:40Z

I've put this stuff in a package, cellranger. I'll open an issue there, @hadley, with a few questions and comments.

smbache · 2015-04-21T05:04:11Z

You forgot to remove the last 'e' ;) it's convention!

jennybc · 2015-04-22T20:19:53Z

The helper package is on CRAN now:

http://cran.r-project.org/web/packages/cellranger/index.html

kcandrews · 2015-04-30T17:49:20Z

Thanks so much for this! I'm really looking forward to never typing xlcFreeMemory() again.

eibanez · 2015-05-01T03:19:37Z

Could this be extended to support the following?

read_excel(..., range = "Sheet1!D12:F15")

kcandrews · 2015-05-01T17:46:52Z

What's the advantage of that versus the already existing sheet= argument to read_excel?

eibanez · 2015-05-01T18:57:42Z

It mirrors the representation that Excel uses for ranges, e.g., for named ranges (#79). Not a big deal, but wanted to throw it out there.

jennybc · 2015-05-01T20:41:29Z

It's reasonable to think about expanding the notion of a cell_limits object to include the hosting (work)sheet or tab, in addition to the row and column limits. Both Excel and Google Sheets share that same high-level structure and we already must specify worksheet for all reads and writes.

I'm not exactly eager to rework the handling of worksheets … but it's worth considering.
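For illustration only, separating a sheet-qualified reference into its two parts could look like this. The function name is hypothetical, and the sketch assumes the Excel convention that the sheet name precedes the last "!" and may be wrapped in single quotes.

```r
# Hypothetical: split "Sheet1!D12:F15" into sheet name and cell range.
split_sheet_range_sketch <- function(x) {
  pos <- max(gregexpr("!", x, fixed = TRUE)[[1]])  # -1 when there is no "!"
  if (pos == -1) {
    return(list(sheet = NULL, range = x))
  }
  sheet <- substr(x, 1, pos - 1)
  sheet <- gsub("^'|'$", "", sheet)                # drop surrounding quotes
  list(sheet = sheet, range = substr(x, pos + 1, nchar(x)))
}

split_sheet_range_sketch("Sheet1!D12:F15")
split_sheet_range_sketch("D12:F15")
```

The range part could then go through the usual cell-limits parsing, with the sheet part handled like the existing sheet argument.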

bbolker · 2017-01-18T00:27:48Z

Bump: is there a way to access this functionality yet? Sorry if it's been implemented and I missed it, but I claim it's not obvious ...

My use case is just to be able to select a limited number of rows. The spreadsheet has header rows interspersed with data (sigh): if I follow the advice to read the whole file (possibly skipping initial rows) and discard the stuff I don't need, then I have the hassle of converting columns back from character to numeric ...

jennybc · 2017-01-18T04:21:46Z

@bbolker Not yet, but this package is now a main focus of mine. Once I'm done with basic triage of the issues/pull requests, this feature will be a high priority. Closing the longest-running issue will be a real pleasure.

jennybc mentioned this issue Mar 22, 2015

[feature request] read in region of sheet #30

Closed

jennybc mentioned this issue Mar 23, 2015

Controlling output of read_excel() #33

Closed

jennybc mentioned this issue Apr 12, 2015

how to read in specific rows and columns only? #64

Closed

eibanez mentioned this issue Apr 16, 2015

Read named ranges #79

Closed

jennybc pushed a commit to jennybc/googlesheets that referenced this issue Apr 17, 2015

moving towards cell specification interface discussed over in readxl

171f7de

tidyverse/readxl#8 [covr]

jennybc mentioned this issue Apr 17, 2015

Add an option to limit the number of rows to read #7

Closed

jennybc mentioned this issue Apr 22, 2015

Feature request: Specify which columns to extract from a sheet. #90

Closed

jennybc mentioned this issue Aug 31, 2015

Any plans re: formatted numbers or formulas? #123

Closed

jennybc mentioned this issue Oct 12, 2015

Feature Request: drop or select columns #133

Closed

tklebel mentioned this issue Nov 10, 2015

Feature Request: nrows option #147

Closed

bhive01 mentioned this issue Feb 26, 2016

should the col_types= parameter accept vector recycling? #127

Closed

This was referenced Jan 21, 2017

Problem in loading the xlsx sheet link #67

Closed

Read all worksheets in a single request? jennybc/googlesheets#289

Closed

jennybc added the feature a feature request or enhancement label Jan 31, 2017

jennybc mentioned this issue Feb 5, 2017

Empty columns: to drop or not to drop? #157

Closed

jennybc mentioned this issue Apr 3, 2017

Target arbitrary and open rectangles; fixes #8, fixes #313 #314

Merged

jennybc closed this as completed in b80ac11 Apr 4, 2017

lock bot locked and limited conversation to collaborators Oct 10, 2019
