Don't load empty cells; use empirical worksheet dimensions #248

jennybc · 2017-02-06T07:13:10Z

There can be cells that look empty to the human eye, but that are styled, e.g. someone applied a custom number format to the cell at some point. I propose we do not load those anymore. There's nothing to learn from their XML.
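To make the idea concrete, here is a minimal sketch of the filtering step. It is not readxl's actual code: `RawCell`, its `has_children` flag, and `load_cells()` are hypothetical stand-ins for a parsed cell node and the loading loop.

```cpp
#include <cassert>
#include <string>
#include <vector>

// Hypothetical stand-in for a parsed cell element; not readxl's XlsxCell.
struct RawCell {
  int row;
  int col;
  bool has_children;  // assumed flag: true if the cell node has any child node
  std::string value;
};

// Keep only cells that carry content; skip cells that only carry styling.
std::vector<RawCell> load_cells(const std::vector<RawCell>& parsed) {
  std::vector<RawCell> loaded;
  for (const RawCell& cell : parsed) {
    assert(cell.row >= 0 && cell.col >= 0);  // positions are 0-based here
    if (!cell.has_children) {
      continue;  // style-only cell: nothing to learn from its XML
    }
    loaded.push_back(cell);
  }
  return loaded;
}
```

With this shape, a styled-but-empty cell never reaches the cell list, so it cannot influence anything computed from that list later.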

If such a cell still falls into the target rectangle, it will simply be NA in the output (OK, this won't be totally true until we stop dropping unnamed, empty columns, #157).

If such a cell falls outside the target rectangle, this PR prevents the creation of trailing row(s) or column(s) consisting entirely of NA.

In addition to rows/columns of NA, this has been a puzzling source of errors about column names and types not being compatible.

To fully fix this, we also can't trust a worksheet's declared dimensions, because these styled-but-empty cells "count" toward them. Therefore I propose we always compute the dimensions ourselves, based on the cells we actually load.
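A rough sketch of what "compute dimension ourselves" could look like, assuming the loaded cells are available as a list of positions. `CellPos`, `Extent`, and `compute_extent()` are hypothetical names for illustration, not the types used in this PR.

```cpp
#include <algorithm>
#include <cassert>
#include <vector>

// Hypothetical 0-based position of a loaded cell; not readxl's own type.
struct CellPos {
  int row;
  int col;
};

// Bounding rectangle of the loaded cells; `empty` is true if there are none.
struct Extent {
  int min_row;
  int max_row;
  int min_col;
  int max_col;
  bool empty;
};

// Derive the worksheet extent from loaded cells only, ignoring whatever
// dimension the file itself declares.
Extent compute_extent(const std::vector<CellPos>& cells) {
  Extent e = Extent{0, -1, 0, -1, true};
  for (const CellPos& c : cells) {
    assert(c.row >= 0 && c.col >= 0);
    if (e.empty) {
      e = Extent{c.row, c.row, c.col, c.col, false};
      continue;
    }
    e.min_row = std::min(e.min_row, c.row);
    e.max_row = std::max(e.max_row, c.row);
    e.min_col = std::min(e.min_col, c.col);
    e.max_col = std::max(e.max_col, c.col);
  }
  return e;
}
```

Because style-only cells are never loaded, they cannot stretch this rectangle, which is what prevents the trailing all-NA rows and columns.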

I haven't dealt with xls yet. I can leave this open and work on that. Or we could resolve this for xlsx and I'd open an issue to fix it for xls.

This branch builds off #247. I'm assuming it will be merged first.

hadley

LGTM. Just a few tiny niggles

hadley · 2017-02-06T17:01:50Z

src/XlsxWorkSheet.h

-        XlsxCell xcell(cell, i, j);
-        cells_.push_back(xcell);
+        // don't load a cell with no child nodes, e.g. it only has style
+        rapidxml::xml_node<>* first_child = cell->first_node(0);


I think the check here probably eliminates a check somewhere else.

There's not an obvious one, but I'll keep my eyes peeled.

Styled empty cells were a major source of "cell lacks the t attribute" here, which is why they always lead to numeric NA columns. But it's not clear they are the only source? When I get to column types, I'll try to settle this definitively.

readxl/src/XlsxCell.h

Lines 122 to 124 in a78440d

rapidxml::xml_attribute<>* t = cell_->first_attribute("t");
if (t == NULL || strncmp(t->value(), "n", 5) == 0) {

Similar situation with this check for a cell having a v node. But I'm not sure if these cells are the only source of this either?

readxl/src/XlsxCell.h

Lines 102 to 104 in a78440d

rapidxml::xml_node<>* v = cell_->first_node("v");
if (v == NULL)
  return NA_STRING;
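As a simplified model of how those two checks combine for a style-only cell (a sketch only; `ToyCell`, `looks_numeric()`, and `is_na()` are hypothetical and leave out everything else the real type logic does):

```cpp
#include <cassert>
#include <cstring>

// Hypothetical flattened view of a cell; not readxl's XlsxCell.
struct ToyCell {
  const char* t;  // value of the "t" attribute, or NULL when absent
  const char* v;  // text of the "v" child node, or NULL when absent
};

// Mirrors the first check quoted above: no "t" attribute, or t == "n",
// is treated as numeric.
bool looks_numeric(const ToyCell& cell) {
  return cell.t == NULL || std::strncmp(cell.t, "n", 5) == 0;
}

// Mirrors the second check quoted above: no "v" node means a missing value.
bool is_na(const ToyCell& cell) {
  return cell.v == NULL;
}
```

A style-only cell has neither `t` nor `v`, so under this model it is both "numeric" and NA, which matches the numeric NA columns described above.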

hadley · 2017-02-06T17:02:18Z

tests/testthat/test-empty.R

+  ## in a trailing empty column
+  ## in some trailing rows
+  out <- read_excel(test_sheet("style-only-cells.xlsx"))
+  expect_equal(


I think code like this is slightly cleaner if you define the tibble outside the test

hadley · 2017-02-06T17:02:45Z

tests/testthat/test-empty.R

+test_that("user-supplied column names play nicely with empty columns", {
+  skip("waiting for dust to settle re: treatment of empty columns")
+  ## do stuff like this:
+  out <- read_excel(test_sheet("style-only-cells.xlsx"),


House style is

read_excel(
  test_sheet("style-only-cells.xlsx"),
  col_names = LETTERS[1:4]
)

jennybc requested a review from hadley February 6, 2017 07:14

hadley reviewed Feb 6, 2017


jennybc added 5 commits February 6, 2017 12:35

Don't load cells that have no child nodes; fixes #162

a4a03cd

Compute worksheet dimension based on loaded cells; fixes #203

1cdc486

Work on NEWS

5fa28a3

Tests re: empty, styled cells

fc5d8ee

Improve style of test code

933a241

jennybc merged commit c6e0f8f into tidyverse:master Feb 6, 2017

jennybc deleted the format-only-cells branch February 9, 2017 00:39
