Refactor file format guessing #457

Merged
merged 1 commit into from
Apr 17, 2018
2 changes: 2 additions & 0 deletions NAMESPACE
@@ -6,6 +6,8 @@ export(cell_limits)
export(cell_rows)
export(excel_format)
export(excel_sheets)
export(format_from_ext)
export(format_from_signature)
export(read_excel)
export(read_xls)
export(read_xlsx)
66 changes: 49 additions & 17 deletions R/excel-format.R
@@ -1,14 +1,26 @@
#' Determine file format
#'
#' Determine if files are xlsx or xls. First the file extension is consulted. If
#' that is unsuccessful and `guess = TRUE` and the file exists, the format is
#' guessed from the [file
#' signature](https://en.wikipedia.org/wiki/List_of_file_signatures) or "magic
#' number".
#' @description Determine if files are xls or xlsx (or from the xlsx family).
#'
#' @description `excel_format(guess = TRUE)` is used by `read_excel()` to
#' determine format. It draws on logic from two lower level functions:
#' * `format_from_ext()` attempts to determine format from the file extension.
#' * `format_from_signature()` consults the [file
#' signature](https://en.wikipedia.org/wiki/List_of_file_signatures) or "magic
#' number".
#'
#' @description File extensions associated with xlsx vs. xls:
#' * xlsx: `.xlsx`, `.xlsm`, `.xltx`, `.xltm`
#' * xls: `.xls`
#'
#' @description File signatures (in hexadecimal) for xlsx vs xls:
#' * xlsx: First 4 bytes are `50 4B 03 04`
#' * xls: First 8 bytes are `D0 CF 11 E0 A1 B1 1A E1`
#'
#' @inheritParams read_excel
#' @param guess Logical. Whether to guess format based on the file itself, if
#' the extension is neither `"xlsx"` nor `"xls"`.
#' @param guess Logical. If the file extension is absent or not recognized, this
#' controls whether we attempt to guess format based on the file signature or
#' "magic number".
#'
#' @return Character vector with values `"xlsx"`, `"xls"`, or `NA`.
#' @export
@@ -24,22 +36,34 @@
#' )
#' excel_format(files)
excel_format <- function(path, guess = TRUE) {
ext <- tolower(tools::file_ext(path))

formats <- c(xls = "xls", xlsx = "xlsx", xlsm = "xlsx")
format <- unname(formats[ext])

if (!guess || !anyNA(format)) {
format <- format_from_ext(path)
if (!isTRUE(guess)) {
return(format)
}

guess_me <- is.na(format) & file.exists(path)
format[guess_me] <- guess_format(path[guess_me])
format[guess_me] <- format_from_signature(path[guess_me])
format
}

guess_format <- function(x) {
signature <- lapply(x, first_8_bytes)
#' @rdname excel_format
#' @export
format_from_ext <- function(path) {
ext <- tolower(tools::file_ext(path))

formats <- c(
xls = "xls",
xlsx = "xlsx",
xlsm = "xlsx",
xltx = "xlsx",
xltm = "xlsx"
)
unname(formats[ext])
}

#' @rdname excel_format
#' @export
format_from_signature <- function(path) {
signature <- lapply(path, first_8_bytes)
vapply(signature, sig_to_fmt, "xlsx?")
}

@@ -62,3 +86,11 @@ sig_to_fmt <- function(x) {
NA_character_
}
}

check_format <- function(path) {
format <- excel_format(path)
if (is.na(format)) {
stop("Can't establish that the input is either xls or xlsx.", call. = FALSE)
}
format
}
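
The two-stage detection documented in this file (extension first, file signature second) can be sketched in base R so that it runs without readxl installed. This is an illustration under stated assumptions, not the package's implementation: the `*_sketch` function names are hypothetical, and only the extension table and the byte signatures are taken from the roxygen comments in this diff.

```r
# Illustrative sketch only: the *_sketch helpers are hypothetical names,
# not functions exported by readxl.

# Stage 1: map the (case-insensitive) file extension to a format.
# Unrecognized or missing extensions yield NA.
ext_to_format_sketch <- function(path) {
  ext <- tolower(tools::file_ext(path))
  formats <- c(
    xls = "xls",
    xlsx = "xlsx", xlsm = "xlsx", xltx = "xlsx", xltm = "xlsx"
  )
  unname(formats[ext])
}

# Stage 2: compare the leading bytes of each file with the known signatures.
signature_to_format_sketch <- function(path) {
  xlsx_sig <- as.raw(c(0x50, 0x4B, 0x03, 0x04))
  xls_sig <- as.raw(c(0xD0, 0xCF, 0x11, 0xE0, 0xA1, 0xB1, 0x1A, 0xE1))
  one <- function(p) {
    bytes <- readBin(p, what = "raw", n = 8)
    if (length(bytes) >= 4 && identical(bytes[1:4], xlsx_sig)) {
      "xlsx"
    } else if (identical(bytes, xls_sig)) {
      "xls"
    } else {
      NA_character_
    }
  }
  vapply(path, one, character(1), USE.NAMES = FALSE)
}

# Combine both stages: only paths whose extension was not recognized and
# that exist on disk are inspected byte by byte.
detect_format_sketch <- function(path, guess = TRUE) {
  format <- ext_to_format_sketch(path)
  if (!isTRUE(guess)) {
    return(format)
  }
  guess_me <- is.na(format) & file.exists(path)
  format[guess_me] <- signature_to_format_sketch(path[guess_me])
  format
}

# Usage: an extensionless temporary file that starts with the xlsx signature.
tmp <- tempfile()
writeBin(as.raw(c(0x50, 0x4B, 0x03, 0x04, 0x00, 0x00)), tmp)
detect_format_sketch(c("budget.xltx", tmp))
detect_format_sketch(tmp, guess = FALSE)
```

With `guess = TRUE` the temporary file should be reported as xlsx because its first four bytes match; with `guess = FALSE` it stays `NA`, since it has no extension. Note that `50 4B 03 04` is the general zip signature, so a match suggests an xlsx-family file but does not prove it.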
1 change: 1 addition & 0 deletions R/excel-sheets.R
@@ -10,6 +10,7 @@
#' path <- readxl_example("datasets.xls")
#' lapply(excel_sheets(path), read_excel, path = path)
excel_sheets <- function(path) {
path <- check_file(path)
format <- check_format(path)

switch(format,
35 changes: 8 additions & 27 deletions R/read_excel.R
@@ -4,7 +4,7 @@ NULL

#' Read xls and xlsx files
#'
#' @param path Path to the xls/xlsx file
#' @param path Path to the xls/xlsx file.
#' @param sheet Sheet to read. Either a string (the name of a sheet), or an
#' integer (the position of the sheet). Ignored if the sheet is specified via
#' `range`. If neither argument specifies the sheet, defaults to the first
@@ -85,6 +85,7 @@ read_excel <- function(path, sheet = NULL, range = NULL,
col_names = TRUE, col_types = NULL,
na = "", trim_ws = TRUE, skip = 0, n_max = Inf,
guess_max = min(1000, n_max)) {
path <- check_file(path)
format <- check_format(path)
read_excel_(
path = path, sheet = sheet, range = range,
@@ -95,15 +96,17 @@ read_excel <- function(path, sheet = NULL, range = NULL,
)
}

#' `read_excel()` tries to determine format from the file extension and the file
#' itself, in that order. Use `read_xls()` and `read_xlsx()` directly to
#' eliminate the guessing.
#' `read_excel()` calls [excel_format()] to determine if `path` is xls or xlsx,
#' based on the file extension and the file itself, in that order. Use
#' `read_xls()` and `read_xlsx()` directly if you know better and want to
#' prevent such guessing.
#' @rdname read_excel
#' @export
read_xls <- function(path, sheet = NULL, range = NULL,
col_names = TRUE, col_types = NULL,
na = "", trim_ws = TRUE, skip = 0, n_max = Inf,
guess_max = min(1000, n_max)) {
path <- check_file(path)
read_excel_(
path = path, sheet = sheet, range = range,
col_names = col_names, col_types = col_types,
Expand All @@ -118,6 +121,7 @@ read_xlsx <- function(path, sheet = NULL, range = NULL,
col_names = TRUE, col_types = NULL,
na = "", trim_ws = TRUE, skip = 0, n_max = Inf,
guess_max = min(1000, n_max)) {
path <- check_file(path)
read_excel_(
path = path, sheet = sheet, range = range,
col_names = col_names, col_types = col_types,
Expand All @@ -130,7 +134,6 @@ read_excel_ <- function(path, sheet = NULL, range = NULL,
col_names = TRUE, col_types = NULL,
na = "", trim_ws = TRUE, skip = 0, n_max = Inf,
guess_max = min(1000, n_max), format) {
path <- check_file(path)
if (format == "xls") {
sheets_fun <- xls_sheets
read_fun <- read_xls_
@@ -162,28 +165,6 @@ read_excel_ <- function(path, sheet = NULL, range = NULL,

# Helper functions -------------------------------------------------------------

check_format <- function(path) {
path <- check_file(path)
format <- excel_format(path)
if (is.na(format)) {
ext <- tolower(tools::file_ext(path))
if (nzchar(ext)) {
stop(
"Extension is neither 'xlsx' nor 'xls': ",
sQuote(ext),
call. = FALSE
)
} else {
stop(
"File has no extension and doesn't seem to be xlsx or xls: ",
sQuote(path),
call. = FALSE
)
}
}
format
}

## return a zero-indexed sheet number
standardise_sheet <- function(sheet, range, sheet_names) {
range_sheet <- cellranger::as.cell_limits(range)[["sheet"]]
1 change: 1 addition & 0 deletions _pkgdown.yml
@@ -46,6 +46,7 @@ reference:
Functions to learn properties of xls and xlsx files.
contents:
- excel_sheets
- excel_format
- title: "Describe a target rectangle"
desc: >
Flexible specification of cell rectangles.
12 changes: 6 additions & 6 deletions docs/articles/articles/readxl-workflows.html

Some generated files are not rendered by default. Learn more about how customized files appear on GitHub.

2 changes: 1 addition & 1 deletion docs/news/index.html

34 changes: 25 additions & 9 deletions docs/reference/excel_format.html

2 changes: 1 addition & 1 deletion docs/reference/excel_sheets.html

6 changes: 6 additions & 0 deletions docs/reference/index.html

13 changes: 7 additions & 6 deletions docs/reference/read_excel.html

4 changes: 2 additions & 2 deletions docs/reference/readxl-package.html
