-
Notifications
You must be signed in to change notification settings - Fork 3
/
README.Rmd
205 lines (135 loc) · 5.26 KB
/
README.Rmd
1
2
3
4
5
6
7
8
9
10
11
12
13
14
15
16
17
18
19
20
21
22
23
24
25
26
27
28
29
30
31
32
33
34
35
36
37
38
39
40
41
42
43
44
45
46
47
48
49
50
51
52
53
54
55
56
57
58
59
60
61
62
63
64
65
66
67
68
69
70
71
72
73
74
75
76
77
78
79
80
81
82
83
84
85
86
87
88
89
90
91
92
93
94
95
96
97
98
99
100
101
102
103
104
105
106
107
108
109
110
111
112
113
114
115
116
117
118
119
120
121
122
123
124
125
126
127
128
129
130
131
132
133
134
135
136
137
138
139
140
141
142
143
144
145
146
147
148
149
150
151
152
153
154
155
156
157
158
159
160
161
162
163
164
165
166
167
168
169
170
171
172
173
174
175
176
177
178
179
180
181
182
183
184
185
186
187
188
189
190
191
192
193
194
195
196
197
198
199
200
201
202
203
204
---
title: "Harry Potter Data"
output:
md_document:
variant: markdown_github
---
<!-- README.md is generated from README.Rmd. Please edit that file -->
```{r include = FALSE}
knitr::opts_chunk$set(
collapse = TRUE,
comment = "#>"
)
library(tidyverse)
library(tidytext)
```
# 1) About
This repository contains various data files that can be used to perform a
text analysis of [Harry Potter](https://en.wikipedia.org/wiki/Harry_Potter)
books, written by Joanne Kathleen Rowling:
1) Harry Potter and the Philosopher's Stone
2) Harry Potter and the Chamber of Secrets
3) Harry Potter and the Prisoner of Azkaban
4) Harry Potter and the Goblet of Fire
5) Harry Potter and the Order of the Phoenix
6) Harry Potter and the Half-Blood Prince
7) Harry Potter and the Deathly Hallows
To perform the text analysis, we recommend using _tidyverse_ tools (see
packages below) and getting inspiration from the book
[Text Mining with R: A Tidy Approach](https://www.tidytextmining.com/index.html)
(by Silge & Robinson):
```r
library(tidyverse)
library(tidytext)
```
-----
# 2) Content
The content of this repo is divided in __three directories__, each one containing
different types of files.
- [csv-data-file/](csv-data-file) contains the text of all Harry Potter books
in a single CSV file.
- [rda-data-files/](rda-data-files) contains the seven Harry Potter books
stored in R-Data (binary) files---one file per book.
- [sentiment-lexicons/](sentiment-lexicons) contains a handful of sentiment
lexicons, also stored in R-Data (binary) files---one file per lexicon.
-----
## 2.1) Harry Potter CSV file
The data of all the books is available in `csv` format---in a single file:
`harry_potter_books.csv`.
Assuming that this file is in your working directory, you can import it---via
tidyverse's `readr()`---as follows:
```r
# requires package tidyverse
hp_books = read_csv("harry_potter_books.csv", col_types = "ccc")
```
```{r echo = FALSE}
hp_books = read_csv("csv-data-file/harry_potter_books.csv", col_types = "ccc")
```
This data set is fairly simple---in terms of its structure---although the
text content is far from being tidy. The dataset has `r nrow(hp_books)` rows
and `r ncol(hp_books)` columns:
1) `text`: text content
2) `book`: title of associated book
3) `chapter`: associated chapter number
-----
## 2.2) Harry Potter R-Data Files
The data of each book is also available in its own R-Data `rda` file
(see [rda-data-files/](rda-data-files)):
- `"philosophers_stone.rda"`
- `"chamber_of_secrets.rda"`
- `"prisoner_of_azkaban.rda"`
- `"goblet_of_fire.rda"`
- `"order_of_the_phoenix.rda"`
- `"half_blood_prince.rda"`
- `"deathly_hallows.rda"`
These files come from the R package `"harrypotter"` by Bradley Boehmke
[https://github.com/bradleyboehmke/harrypotter](https://github.com/bradleyboehmke/harrypotter)
To import these files use the `load()` function. For example, consider the
first book "Harry Potter and the Philosopher's Stone"; here's how to `load()`
it in R:
```r
# assuming that the rda file is in your working directory
load("philosophers_stone.rda")
```
```{r echo = FALSE}
load("rda-data-files/philosophers_stone.rda")
```
Assuming that `"philosophers_stone.rda"` has been loaded, the text of this
book is available in the homonym character vector `philosophers_stone`
```{r}
# text is in a character vector
# (with as many elements as chapters in the book)
length(philosophers_stone)
```
The number of elements in `philosophers_stone` corresponds
to the number of chapters in this book: 17 chapters.
You may want to use these files to perform bigram analysis (or other type of
n-gram analysis).
-----
## 2.3) Sentiment Lexicons
In addition to the Harry Potter text, you can also find data for a handful of
sentiment lexicons from the R package `"textdata"` (by Hvitfeldt and Silge):
- `"bing"`: Bing Liu's General purpose English sentiment lexicon that
categorizes words in a binary fashion, either positive or negative
- `"afinn"`: AFINN is a lexicon of English words rated for valence with an
integer between minus five (negative) and plus five (positive). The words have
been manually labeled by Finn Årup Nielsen in 2009-2011.
- `"nrc"`: General purpose English sentiment/emotion lexicon. This lexicon
labels words with six possible sentiments or emotions: "negative", "positive",
"anger", "anticipation", "disgust", "fear", "joy", "sadness", "surprise", or
"trust". The annotations were manually done through Amazon's Mechanical Turk.
- `"loughran"`: English sentiment lexicon created for use with financial
documents. This lexicon labels words with six possible sentiments important in
financial contexts: "negative", "positive", "litigious", "uncertainty",
"constraining", or "superfluous".
These lexicons come in `rda` data files (see
[sentiment-lexicons/](sentiment-lexicons)):
- `bing.rda`
- `afinn.rda`
- `bing.rda`
- `loughran.rda`
To import them in R, use the `load()` function. For example, here's how to
import the Bing lexicon:
```r
# assuming that the rda files are in your working directory
load("bing.rda")
```
```{r echo = FALSE}
# assuming that the rda file is in your working directory
load("sentiment-lexicons/bing.rda")
```
Assuming that you've loaded the file `"bing.rda"`, the associated lexicon is
available in the homonym tibble `bing`
```{r}
bing
```