Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Don't load empty cells; use empirical worksheet dimensions #248

Merged
merged 5 commits into from
Feb 6, 2017
Merged

Don't load empty cells; use empirical worksheet dimensions #248

merged 5 commits into from
Feb 6, 2017

Conversation

jennybc
Copy link
Member

@jennybc jennybc commented Feb 6, 2017

There can be cells that look empty to the human eye, but that are styled, e.g. someone applied a custom number format to the cell at some point. I propose we do not load those anymore. There's nothing to learn from their XML.

If such a cell still falls into the target rectangle, it will simply be NA in the output (ok, won't be totally true until we stop dropping unnamed, empty columns #157).

If such a cell falls outside the target rectangle, this PR prevents the creation of trailing row(s) or column(s) consisting entirely of NA.

In addition to rows/columns of NA, this has been a puzzling source of errors about column names and types not being compatible.

To fully fix this, you also can't trust a worksheet's declared dimensions, because these cells "count". Therefore I propose we always compute dimension ourselves.

I haven't dealt with xls yet. I can leave this open and work on that. Or we could resolve this for xlsx and I'd open an issue to fix it for xls.

This branch builds off #247. I'm assuming it will be merged first.

@jennybc jennybc requested a review from hadley February 6, 2017 07:14
Copy link
Member

@hadley hadley left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

LGTM. Just a few tiny niggles

XlsxCell xcell(cell, i, j);
cells_.push_back(xcell);
// don't load a cell with no child nodes, e.g. it only has style
rapidxml::xml_node<>* first_child = cell->first_node(0);
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think the check here probably eliminates a check somewhere else.

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

There's not an obvious one, but I'll keep my eyes peeled.

Styled empty cells were a major source of "cell lacks the t attribute" here, which is why they always lead to numeric NA columns. But it's not clear they are the only source? When I get to column types, I'll try to settle this definitively.

readxl/src/XlsxCell.h

Lines 122 to 124 in a78440d

rapidxml::xml_attribute<>* t = cell_->first_attribute("t");
if (t == NULL || strncmp(t->value(), "n", 5) == 0) {

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Similar situation with this check for a cell having a v node. But I'm not sure if these cells are the only source of this either?

readxl/src/XlsxCell.h

Lines 102 to 104 in a78440d

rapidxml::xml_node<>* v = cell_->first_node("v");
if (v == NULL)
return NA_STRING;

## in a trailing empty column
## in some trailing rows
out <- read_excel(test_sheet("style-only-cells.xlsx"))
expect_equal(
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think code like this is slightly cleaner if you define the tibble outside the test

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

done

test_that("user-supplied column names play nicely with empty columns", {
skip("waiting for dust to settle re: treatment of empty columns")
## do stuff like this:
out <- read_excel(test_sheet("style-only-cells.xlsx"),
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

House style is

read_excel(
  test_sheet("style-only-cells.xlsx"),
  col_names = LETTERS[1:4]
)

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

done

@jennybc jennybc merged commit c6e0f8f into tidyverse:master Feb 6, 2017
@jennybc jennybc deleted the format-only-cells branch February 9, 2017 00:39
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

2 participants