---
title: |
| A tutorial on k-means clustering
| *(with more R lessons along the way)*
author: "A. Paxton"
output:
html_document:
keep_md: yes
---
This R Markdown document presents a tutorial on implementing k-means
clustering in R. It is intended to complement the first half of the UConn
Social-Ecological-Environmental (SEE) Lab's introduction to machine learning
and k-means clustering, led by Shu Jiang.

I'll also be using this as an opportunity to share a few R programming tips
along the way. It may be especially helpful for those who are not yet
familiar with the Tidyverse, a useful collection of R packages for data
wrangling and visualization.
With many, many thanks to Bradley Boehmke for the
[K-means Cluster Analysis tutorial](https://uc-r.github.io/kmeans_clustering)
on the University of Cincinnati's Business Analytics R Programming Guide,
on which this tutorial is modeled.
***
# Preliminaries
First, let's get ready for our analyses. We do this by clearing our workspace
and loading in the libraries we'll need. It's good to get in the habit of
clearing your workspace from the start so that you don't accidentally have
clashes with unneeded variables or dataframes that could affect your results.
If you get an error here, it may be that you don't have the library installed
yet. You can install a library by typing `install.packages("xxx")` into your
RStudio console, where `xxx` is replaced by the specific library you want.
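For example, you can install everything this tutorial uses in one call (you
only need to do this once per machine, so this chunk is set not to run):

```{r install-packages, eval=FALSE}
# install this tutorial's packages, if you don't already have them
install.packages(c("tidyverse", "viridis", "cluster", "factoextra"))
```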
```{r preliminaries, warning=FALSE}
# clear the workspace
rm(list=ls())
# set seed for reproducibility
set.seed(4102020)
# read in libraries we'll need
library(tidyverse)
library(ggplot2)
library(viridis)
library(cluster)
library(factoextra)
```
***
# Data preparation
***
## Load the data
Next, we'll load our data. For this first tutorial, we'll use a toy dataset
that's already included in R: `USArrests`. It contains arrest rates (per
100,000 residents) for murder, assault, and rape in each of the 50 U.S.
states in 1973, along with the percent of each state's population living in
urban areas (`UrbanPop`).
```{r load-data}
# read in the dataset
arrest_df = USArrests
```
We gave our variable an informative name---here, `arrest_df`. A common
convention is to use `df` as an abbreviation for "dataframe," or a collection
of variables. It's good to get in the habit of giving your variables more
informative names than a single letter (e.g., `x`, `a`) so that you can
remember what's in the variable.
## Remove missing values
Then, we'll need to clear any missing values.
```{r clear-missing}
# clean out missing values, the tidyverse way
arrest_df = arrest_df %>%
  drop_na() %>%
  # convert state names from rowname to column
  rownames_to_column(var = "State")
```
Note that we're using a new symbol here: `%>%`. This is called a "pipe" in the
Tidyverse. The Tidyverse is controversial: to some, it's a way of streamlining
sometimes-complex code to make it more human-readable; to others, it's a
headache-inducing abomination. I admit that I fall into the former camp, but
I recognize that many folks---especially those who are fonder of Python than
R---hold the opposing opinion. However, even if you don't personally prefer
this style, it's useful to know how to read and interact with the Tidyverse,
since it's relatively common within the cognitive science and psychology
community.

The pipe allows you to pass the result of one function to the next function
in the same chunk of code. It's like an assembly line, passing the dataframe
that you name in the first line (above, `arrest_df`) to the function on the
next line (above, `drop_na()`). You can continue passing the results to
additional functions by continuing to pipe.
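To make the pipe concrete, here's a small sketch of the same operation
written first with nested function calls and then with pipes:

```{r pipe-demo}
# without the pipe: functions nest inside one another, read inside-out
head(drop_na(USArrests))

# with the pipe: the same steps, read top to bottom
USArrests %>%
  drop_na() %>%
  head()
```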
You might wonder why we're putting each function on a new line. Another good
habit for programmers---especially programmers who use git for code-sharing or
version control---is to keep your lines to 80 characters or fewer. If you take
a look at the `.Rmd` of this tutorial, you'll see that I'm breaking the lines
of text at about 80 characters, too. This not only makes your code a bit more
readable (compared to really long lines of code) but also makes it easier to
see the changes that have happened from version to version in git.
## Data visualization
Always, always, always plot your data. Let's see what we've got here.
```{r plot-assault, fig.height=5, fig.width=3}
# plot the assault rate by state
ggplot(data = arrest_df,
       aes(x = Assault, y = State, color = UrbanPop)) +
  geom_point() +
  theme(legend.position = "bottom")
```
```{r plot-murder, fig.height=5, fig.width=3}
# plot the murder rate by state
ggplot(data = arrest_df,
       aes(x = Murder, y = State, color = UrbanPop)) +
  geom_point() +
  theme(legend.position = "bottom")
```
```{r plot-rape, fig.height=5, fig.width=3}
# plot the rape rate by state
ggplot(data = arrest_df,
       aes(x = Rape, y = State, color = UrbanPop)) +
  geom_point() +
  theme(legend.position = "bottom")
```
## Standardize the data
Standardizing your data allows you to make comparisons across variables or
examine them with similar tools. This matters especially for k-means: because
the algorithm relies on Euclidean distances, variables measured on larger
scales would otherwise dominate the clustering. So, before we start, we want
to make sure that everything is on a similar scale.
```{r standardize}
# standardize each numeric variable
arrest_df = arrest_df %>%
  mutate_if(is.numeric, scale)
```
The `mutate_if` syntax is just one example of how we can apply more complex
operations across a dataframe using the Tidyverse. In this case, we're asking
R to scale every variable that is numeric. If we had a dataframe of only
numeric variables, we could simply call `scale(df)` to scale everything.
However, because we have one character variable (`State`), `scale(arrest_df)`
would fail. And rather than using four separate lines of code to scale each
numeric variable (e.g., `arrest_df$Murder = scale(arrest_df$Murder)`), we're
able to identify and scale only the numeric variables in a single line.
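(As an aside: in dplyr 1.0.0 and later, the `mutate_if` family is superseded
by `across()`. A minimal sketch of the equivalent call, in case you see this
idiom in the wild---not evaluated here, since our data are already scaled:)

```{r scale-with-across, eval=FALSE}
# the same scaling step, written with the newer across() idiom
arrest_df %>%
  mutate(across(where(is.numeric), scale))
```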
## Convert `State` variable back to rowname
It'll be helpful for us to have `State` converted back to a rowname, rather
than a column name. Let's go ahead and use some tidy code to do that for us.
```{r var-to-rowname}
# put State variable back to rowname
arrest_df = arrest_df %>%
  column_to_rownames("State")
# let's make sure it works
head(arrest_df)
```
The function `head()` is a helpful way to view the first few rows of your
dataframe. It gives you just a snippet of the whole dataframe, but you can look
at the entire dataframe by clicking the name of the dataframe in the
"Environment" pane in RStudio.
***
# Data analysis
***
## Calculate and visualize the distance matrix
Now that our data are prepared, we can calculate our distance matrix. We'll
use the `factoextra` library's `get_dist()` function, which calculates a
Euclidean distance matrix for the variables in the dataframe.
```{r calculate-distance-matrix, fig.height=4, fig.width=5}
# calculate the distance matrix
arrest_distance_matrix = get_dist(arrest_df)
# visualize the distance matrix
fviz_dist(arrest_distance_matrix,
          gradient = list(low = viridis(3)[1],
                          mid = viridis(3)[2],
                          high = viridis(3)[3]))
```
We've plotted the distance matrix with the colorblind-friendly `viridis`
palette to help us visualize it. Here, darker blue indicates that two states
are more similar, and brighter yellow indicates that two states are less
similar. (And, if you're not an R user, the lovely color schemes of viridis
are also available in other programming languages, like Python.)
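If you'd like to see the numbers behind the colors, you can coerce the
distance object to a matrix and peek at one corner of it:

```{r peek-distances}
# peek at the raw pairwise distances among the first few states
round(as.matrix(arrest_distance_matrix)[1:5, 1:5], 2)
```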
## Identify number of clusters
Before we start clustering, we need to figure out how many clusters we want
to use. While there are a number of ways to do this, the majority of methods
are focused on identifying how compact or tightly clustered together the
clusters tend to be.
One popular way to identify the number of clusters is the "elbow method."
In this method, we look for the number of clusters beyond which adding more
clusters does little to further reduce the variation within each cluster
(or, more formally, the total within-cluster sum of squares).

We do this by running k-means clustering multiple times, each time increasing
the number of clusters. We then calculate the total *w*ithin-cluster *s*um of
*s*quares (`wss`) for each run of the algorithm. We plot the `wss` value for
each number of clusters and look for the number of clusters at which we see a
bend (or "elbow") in the plot. That number will be our chosen number of
clusters, according to this method.
```{r elbow-method, fig.width=6, fig.height=3}
# apply the elbow method to figure out number of clusters
fviz_nbclust(arrest_df, kmeans, method = "wss")
```
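Under the hood, `fviz_nbclust()` is doing something like the sketch below
(the range of 1 to 10 clusters here is just an illustrative choice):

```{r elbow-by-hand, fig.width=6, fig.height=3}
# compute the total within-cluster sum of squares for k = 1 through 10
wss_values = map_dbl(1:10, function(k) {
  kmeans(arrest_df, centers = k, nstart = 15)$tot.withinss
})

# plot wss against the number of clusters and look for the bend
plot(1:10, wss_values, type = "b",
     xlab = "Number of clusters (k)",
     ylab = "Total within-cluster sum of squares")
```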
We'll go ahead and choose 4 clusters, given that the bend in the `wss` curve
falls around 4 and adding a fifth or sixth cluster doesn't buy us much.
However, the elbow method isn't our only option. We could choose another
method, like the "silhouette method," which measures the average silhouette
width of the clusters (roughly, how well each observation fits within its
assigned cluster). This method can be easier to interpret because we're
looking for a much more dramatic peak in the silhouette plot (compared to the
potentially more subtle bend of the elbow method).
```{r silhouette-method, fig.width=6, fig.height=3}
# apply the silhouette method
fviz_nbclust(arrest_df, kmeans, method = "silhouette")
```
Had we used the silhouette method, we would have chosen only 2 clusters
rather than the 4 we chose from the elbow method. Looking back at the elbow
plot, we can see that there is also a slight change in angle at `k=2`, perhaps
suggesting that `k=2` could be a defensible choice from that method, too.
## Implement clustering
Now that we've identified the number of clusters (`k=4`), we can go ahead
and run our clustering algorithm to see how our states separate by arrests. We
can do this with the built-in `kmeans()` function.
```{r k-means}
# run our clustering algorithm
# (nstart = 15 tries 15 random starting configurations and keeps the best)
arrest_clusters = kmeans(arrest_df, centers = 4, nstart = 15)
# see what we get!
arrest_clusters
```
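The object returned by `kmeans()` is a list; its `cluster` element holds each
state's assignment. As a minimal sketch, we can attach those labels back onto
our (scaled) data---the name `arrest_labeled` here is just illustrative:

```{r label-clusters}
# attach each state's cluster assignment back to the scaled data
arrest_labeled = arrest_df %>%
  rownames_to_column("State") %>%
  mutate(cluster = arrest_clusters$cluster)

# check the first few rows
head(arrest_labeled)
```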
And, as always, we plot our data. Because our data live in a four-dimensional
space, we can use `fviz_cluster()` from the `factoextra` package to visualize
our clusters in a two-dimensional space. This useful plotting tool creates
that space by placing each datapoint according to the first two principal
components (from a principal components analysis [PCA]).
```{r plot-clusters, fig.height=5, fig.width=6}
# let's visualize our clusters on those first two principal components
fviz_cluster(arrest_clusters, data = arrest_df)
```
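If you're curious what `fviz_cluster()` is doing behind the scenes, here's a
rough sketch of the same projection using base R's `prcomp()` (the axes may
be flipped relative to the plot above, since the sign of a principal
component is arbitrary):

```{r pca-by-hand, fig.height=5, fig.width=6}
# project the scaled data onto the first two principal components
arrest_pca = prcomp(arrest_df)
pca_df = data.frame(arrest_pca$x[, 1:2],
                    cluster = factor(arrest_clusters$cluster))

# plot the states in PCA space, colored by cluster assignment
ggplot(pca_df, aes(x = PC1, y = PC2, color = cluster)) +
  geom_point()
```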
## Exploratory analysis: Alternative number of clusters
Although our analysis above used the four clusters identified by the elbow
method, what would it look like if we chose to use only the two clusters
identified by the silhouette method (and hinted at by the elbow method)?
```{r run-k2-clustering}
# run our 2-cluster clustering algorithm
arrest_clusters_k2 = kmeans(arrest_df, centers = 2, nstart = 15)
# see what we get!
arrest_clusters_k2
```
```{r plot-k2-clusters, fig.height=5, fig.width=6}
# what do our k=2 results look like?
fviz_cluster(arrest_clusters_k2, data = arrest_df)
```
Looking at the cluster means in the k-means output, it looks as though we're
mostly dividing the states into low-arrest and high-arrest groups. This
demonstrates the effectiveness of the method, but it may not be as
insightful as the `k=4` clustering results (above).
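To see that division in the original units, here's a quick sketch comparing
the cluster means on the unscaled measurements (this assumes the row order of
`USArrests` is unchanged from when we loaded it, which is true here):

```{r k2-cluster-means}
# compare mean arrest rates across the two clusters, in original units
USArrests %>%
  mutate(cluster = arrest_clusters_k2$cluster) %>%
  group_by(cluster) %>%
  summarise(across(everything(), mean))
```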
***
# Additional exercises
Now that you've tackled clustering with a non-psychology dataset, check out
some of the open datasets provided by the
[Open-Source Psychometrics Project](https://openpsychometrics.org/_rawdata/).
Try your hand at clustering with them!