-
Notifications
You must be signed in to change notification settings - Fork 195
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Don't load empty cells; use empirical worksheet dimensions #248
Conversation
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
LGTM. Just a few tiny niggles
XlsxCell xcell(cell, i, j); | ||
cells_.push_back(xcell); | ||
// don't load a cell with no child nodes, e.g. it only has style | ||
rapidxml::xml_node<>* first_child = cell->first_node(0); |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I think the check here probably eliminates a check somewhere else.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
There's not an obvious one, but I'll keep my eyes peeled.
Styled empty cells were a major source of "cell lacks the t attribute" here, which is why they always lead to numeric NA
columns. But it's not clear they are the only source? When I get to column types, I'll try to settle this definitively.
Lines 122 to 124 in a78440d
rapidxml::xml_attribute<>* t = cell_->first_attribute("t"); | |
if (t == NULL || strncmp(t->value(), "n", 5) == 0) { |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Similar situation with this check for a cell having a v
node. But I'm not sure if these cells are the only source of this either?
Lines 102 to 104 in a78440d
rapidxml::xml_node<>* v = cell_->first_node("v"); | |
if (v == NULL) | |
return NA_STRING; |
tests/testthat/test-empty.R
Outdated
## in a trailing empty column | ||
## in some trailing rows | ||
out <- read_excel(test_sheet("style-only-cells.xlsx")) | ||
expect_equal( |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I think code like this is slightly cleaner if you define the tibble outside the test
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
done
tests/testthat/test-empty.R
Outdated
test_that("user-supplied column names play nicely with empty columns", { | ||
skip("waiting for dust to settle re: treatment of empty columns") | ||
## do stuff like this: | ||
out <- read_excel(test_sheet("style-only-cells.xlsx"), |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
House style is
read_excel(
test_sheet("style-only-cells.xlsx"),
col_names = LETTERS[1:4]
)
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
done
There can be cells that look empty to the human eye, but that are styled, e.g. someone applied a custom number format to the cell at some point. I propose we do not load those anymore. There's nothing to learn from their XML.
If such a cell still falls into the target rectangle, it will simply be
NA
in the output (ok, won't be totally true until we stop dropping unnamed, empty columns #157).If such a cell falls outside the target rectangle, this PR prevents the creation of trailing row(s) or column(s) consisting entirely of
NA
.In addition to rows/columns of
NA
, this has been a puzzling source of errors about column names and types not being compatible.To fully fix this, you also can't trust a worksheet's declared dimensions, because these cells "count". Therefore I propose we always compute dimension ourselves.
I haven't dealt with xls yet. I can leave this open and work on that. Or we could resolve this for xlsx and I'd open an issue to fix it for xls.
This branch builds off #247. I'm assuming it will be merged first.