# Reading and Writing Data Files
```{r echo=FALSE}
source("libs/Common.R")
```
```{r echo = FALSE}
pkg_ver(c("readxl", "Hmisc"))
```
## Reading data files into R
Data files can be loaded from the R session's working directory, from a directory structure *relative* to the working directory using the single dot `.` or double dot `..` syntax, or (for some file types) directly from a website. The following sections will expose you to a mixture of data file environments. For a refresher on directory structures, review [Understanding directory structures](The_R_environment.html#understanding-directory-structures).
### Reading from a comma delimited (CSV) file
A popular data file format (and one that has withstood the test of time) is the text file format where columns are separated by a *tab*, *space* or *comma*. In the following example, R reads a comma delimited file called *ACS.csv* into a data object called `dat`.
```{r eval=FALSE, tidy=FALSE, warning=FALSE}
dat <- read.csv("ACS.csv", header=TRUE)
```
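The file name passed to `read.csv()` is resolved relative to the working directory, so the `.` and `..` notation can be used to reach files elsewhere in the directory tree. The following is a minimal sketch that builds a throwaway folder structure in a temporary location; the folder name `project` and the file name `example.csv` are made up for illustration.

```{r eval=FALSE}
# Build a throwaway project folder containing a Data subfolder (hypothetical layout)
proj <- file.path(tempdir(), "project")
dir.create(file.path(proj, "Data"), recursive = TRUE, showWarnings = FALSE)
write.csv(data.frame(x = 1:3), file.path(proj, "Data", "example.csv"),
          row.names = FALSE)

# Make the Data subfolder the working directory, remembering the previous one
old.wd <- setwd(file.path(proj, "Data"))

dat1 <- read.csv("./example.csv")        # single dot: the working directory itself
dat2 <- read.csv("../Data/example.csv")  # double dot: up one level, then back into Data

setwd(old.wd)  # restore the original working directory
```

Both calls read the same file, so `dat1` and `dat2` hold identical contents.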
If the CSV file resides on a website, you can load the file directly from that site as follows:
```{r eval=FALSE, tidy=FALSE, warning=FALSE}
dat <- read.csv("http://mgimond.github.io/ES218/Data/ACS.csv", header=TRUE)
```
Note that not all data file formats can be readily loaded directly from a website in a "read" function without additional lines of code. Examples are given in the next two sub-sections.
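To read other text formats that use different delimiters, invoke the command `read.table()` and define the type of delimiter using the `sep=` parameter. For example, to read a tab delimited data file called *ACS.txt*, run the command `read.table("ACS.txt", sep="\t", header=TRUE)`. Unlike `read.csv()`, `read.table()` does not assume that the first line holds column names, hence the explicit `header=TRUE`.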
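As a minimal sketch of the `read.table()` approach, the following chunk writes a small tab delimited file to a temporary location and reads it back. The file contents and the column names `County` and `Income` are made up for illustration.

```{r eval=FALSE}
# Create a small tab delimited file (hypothetical data, for illustration only)
tmp <- tempfile(fileext = ".txt")
writeLines(c("County\tIncome", "Kennebec\t45000", "Waldo\t41000"), tmp)

# sep = "\t" sets the delimiter; header = TRUE treats the first line as column names
dat.tab <- read.table(tmp, sep = "\t", header = TRUE)
str(dat.tab)
```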
Note that if a number or a string is used as a placeholder for missing values in the data file, you can flag it with the `na.strings =` parameter in the `read.csv` function. For example, if the word `"missing"` was used in the CSV file to denote a missing value, the function call would be modified as follows:
```{r eval=FALSE, tidy=FALSE, warning=FALSE}
dat <- read.csv("ACS.csv", na.strings = "missing")
```
If more than one value is used as a placeholder for a missing value, you will need to combine the values using the `c()` function. For example, if in addition to the word `"missing"` the value of `-9999` was used to designate missing values, you would modify the above chunk of code as follows:
```{r eval=FALSE, tidy=FALSE, warning=FALSE}
dat <- read.csv("ACS.csv", na.strings = c("missing", "-9999") )
```
Note how the number is wrapped in double quotes. Also, note that the `na.strings` parameter is applied to *all* columns in the dataframe. So if the word `"missing"` or the number `-9999` is a valid value in some of the columns, you should not use this option. Instead, you would need to selectively replace the missing values after the dataset is loaded. You will learn how to replace values in a dataframe in subsequent chapters.
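The effect of `na.strings` can be checked on a small made-up file. In this sketch, the file contents and column names are hypothetical and serve only to illustrate the behavior.

```{r eval=FALSE}
# Hypothetical CSV with two different missing value placeholders
tmp <- tempfile(fileext = ".csv")
writeLines(c("County,Income", "A,45000", "B,missing", "C,-9999"), tmp)

# Without na.strings, the word "missing" forces Income to be read as text
dat.raw <- read.csv(tmp)
class(dat.raw$Income)

# With na.strings, both placeholders become NA and Income is read as a number
dat.na <- read.csv(tmp, na.strings = c("missing", "-9999"))
class(dat.na$Income)
sum(is.na(dat.na$Income))  # counts the NA values in the column
```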
### Reading from an R data file
R has its own data file format--it's usually saved using the *.rds* extension. To read an R data file, invoke the `readRDS()` function.
```{r eval=FALSE, tidy=FALSE, warning=FALSE}
dat <- readRDS("ACS.rds")
```
As with a CSV file, you can load an *.rds* file straight from a website. However, you must first run the file through a *decompressor* before attempting to load it via `readRDS`. A built-in decompressor function called `gzcon` can be used for this purpose.
```{r eval=FALSE, tidy=FALSE, warning=FALSE}
dat <- readRDS(gzcon(url("http://mgimond.github.io/ES218/Data/ACS.rds")))
```
The .rds file format is usually smaller than its text file counterpart and will therefore take up less storage space. The .rds file will also preserve data types and classes such as factors and dates, eliminating the need to redefine data types after loading the file.
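The difference in type preservation can be sketched by writing the same small dataframe to both formats and reading it back. The dataframe `df` and its columns are made up for this illustration; `saveRDS()` and `write.csv()` are covered later in this chapter.

```{r eval=FALSE}
# Hypothetical dataframe with a factor column and a Date column
df <- data.frame(site = factor(c("a", "b", "a")),
                 day  = as.Date(c("2014-01-01", "2014-01-02", "2014-01-03")))

rds.file <- tempfile(fileext = ".rds")
csv.file <- tempfile(fileext = ".csv")
saveRDS(df, rds.file)
write.csv(df, csv.file, row.names = FALSE)

# The .rds round trip returns the factor and Date classes unchanged
sapply(readRDS(rds.file), class)

# The CSV round trip loses both classes; the columns come back as plain text
sapply(read.csv(csv.file), class)
```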
### Reading from an Excel file
A package that does a good job of importing Excel files is `readxl`. It recognizes most column formats defined by Excel, including date formats. However, only one sheet can be loaded at a time. So, if multiple Excel sheets are to be worked on, each sheet will need to be loaded into a separate dataframe object.
If you don't have the `readxl` package installed, install the package as you would any other package via RStudio's interface or in R using the following command:
```{r eval=FALSE}
install.packages("readxl")
```
In this example, we will load an Excel data sheet called `Discharge` which tabulates daily river water discharge. The sample file, `Discharge_2004_2014.xlsx`, can be downloaded [here](http://mgimond.github.io/ES218/Data/Discharge_2004_2014.xlsx).
```{r eval=FALSE}
library(readxl)
xl <- read_excel("Discharge_2004_2014.xlsx", sheet = "Discharge")
```
```{r echo=FALSE}
library(readxl)
xl <- read_excel("./Data/Discharge_2004_2014.xlsx", sheet = "Discharge")
```
An advantage to using this package for loading Excel files is its ability to preserve data types--including date formatted columns! In the above example, the Excel file has a column called `Date` which stores the month/day/year data as a date object. We can check that the loaded `xl` object recognizes the `Date` column as a `date` data type:
```{r}
str(xl)
```
The `Date` column is defined as a `POSIXct` data type; this is the computer's way of storing dates as the number of seconds since some internal reference date. We therefore do not need to convert the date column, as we would if it had been loaded from a CSV file, where it would most likely be read as a character or factor data type. A more in-depth discussion on date objects and their manipulation in R is covered in the [next chapter](date_objects.html).
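To make the "number of seconds" idea concrete, here is a small sketch using base R only. In R, the reference date for `POSIXct` is midnight on January 1, 1970 (UTC); the timestamp below is an arbitrary example and is not taken from the discharge dataset.

```{r eval=FALSE}
# An arbitrary date-time, exactly one day after the reference date
d <- as.POSIXct("1970-01-02 00:00:00", tz = "UTC")

# Converting to numeric reveals the underlying count of seconds (one day's worth)
as.numeric(d)

# The same number of seconds can be converted back to a date-time
as.POSIXct(86400, origin = "1970-01-01", tz = "UTC")
```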
Excel files can be loaded directly from the web using the following code chunk:
```{r eval=FALSE}
web.file <- "http://mgimond.github.io/ES218/Data/Discharge_2004_2014.xlsx"
tmp <- tempfile(fileext=".xlsx")
download.file(web.file,destfile=tmp, mode="wb")
xl <- read_excel(tmp, sheet = "Discharge")
```
R cannot read the Excel file straight from the web into memory; it first needs to download the file to a temporary location before it can open it. However, that temporary file may not be available in a later session, so you will probably need to download and load the data again if you launch a new R session.
### Importing data from proprietary data file formats
It's usually recommended that a data file be stored as a CSV or tab delimited file format if compatibility across software platforms is desired. However, you might find yourself in a situation where you have no option but to import data stored in a proprietary format. This requires the use (and installation) of a package called `Hmisc`. The package will convert the following file formats: SAS (XPT format), SPSS (SAV format) and Stata (dta format). You can install the package on your computer as follows:
```{r eval=FALSE}
install.packages("Hmisc")
```
In this example, a SAS file of blood pressure from the [CDC](http://www.cdc.gov/nchs/nhanes.htm) will be loaded into an object called `dat` (file documentation can be found [here](https://wwwn.cdc.gov/Nchs/Nhanes/2017-2018/BPXO_J.htm)). You can download the file [here](http://mgimond.github.io/ES218/Data/BPX_J.XPT).
```{r eval=FALSE}
library(Hmisc)
dat <- sasxport.get("BPX_J.XPT")  # match the downloaded file name's case
```
Likewise, to import an SPSS file, use the `spss.get()` function; and to import a Stata file, use the `stata.get()` function.
## How to save R objects to data files
### Export to a CSV file
To export a data object called `dat.sub` as a comma delimited file, run the following:
```{r eval=FALSE, tidy=TRUE, warning=FALSE}
write.csv(dat.sub, "ACS_sub.csv")
```
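By default, `write.csv()` also writes the dataframe's row names as an unnamed first column, which shows up as a column named `X` when the file is read back in. A minimal sketch, with a made-up `dat.sub` standing in for your own subset:

```{r eval=FALSE}
# Hypothetical stand-in for the dat.sub object
dat.sub <- data.frame(County = c("A", "B"), Income = c(45000, 41000))

f1 <- tempfile(fileext = ".csv")
f2 <- tempfile(fileext = ".csv")

write.csv(dat.sub, f1)                     # default: row names are written
write.csv(dat.sub, f2, row.names = FALSE)  # row names are omitted

names(read.csv(f1))  # includes an extra leading column, X
names(read.csv(f2))  # only County and Income
```

Setting `row.names = FALSE` is usually the safer choice unless the row names carry information you want to keep.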
### Export to a .rds file
To export a data object called `dat.sub` to an R native *.rds* file format, run the following:
```{r eval=FALSE, tidy=TRUE, warning=FALSE}
saveRDS(dat.sub, "ACS_sub.rds")
```
## Saving an R session
You can save an entire R session (which includes *all* data objects) using the `save` function.
To save *all* objects, set the `list=` parameter to `ls()`:
```{r eval=FALSE, tidy=TRUE, warning=FALSE}
save(list=ls(), file = "ACS_all.Rdata")
```
To save only two R session objects--`dat` and `dat.sub`--to a file, pass the object names as a character vector to the `list=` parameter:
```{r eval=FALSE, tidy=TRUE, warning=FALSE}
save(list=c("dat", "dat.sub"), file = "ACS_subset.Rdata")
```
## Loading an R session
To load a previously saved R session, type:
```{r eval=FALSE, tidy=TRUE, warning=FALSE}
load("ACS_all.Rdata")
```
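A round trip through `save()` and `load()` can be sketched as follows. The objects `dat` and `dat.sub` here are small made-up stand-ins for your own data. Note that `load()` restores the objects under their original names, so it can overwrite objects of the same name already in your session.

```{r eval=FALSE}
# Hypothetical stand-ins for the session objects
dat     <- data.frame(x = 1:5)
dat.sub <- dat[1:2, , drop = FALSE]

# Save both objects by name to a temporary file
f <- tempfile(fileext = ".Rdata")
save(list = c("dat", "dat.sub"), file = f)

# Remove the objects, then restore them from the file
rm(dat, dat.sub)
restored <- load(f)  # load() returns the names of the restored objects
restored
exists("dat.sub")
```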