-
Notifications
You must be signed in to change notification settings - Fork 2
/
04-S07-pkg-name-length.Rmd
146 lines (118 loc) · 7 KB
/
04-S07-pkg-name-length.Rmd
1
2
3
4
5
6
7
8
9
10
11
12
13
14
15
16
17
18
19
20
21
22
23
24
25
26
27
28
29
30
31
32
33
34
35
36
37
38
39
40
41
42
43
44
45
46
47
48
49
50
51
52
53
54
55
56
57
58
59
60
61
62
63
64
65
66
67
68
69
70
71
72
73
74
75
76
77
78
79
80
81
82
83
84
85
86
87
88
89
90
91
92
93
94
95
96
97
98
99
100
101
102
103
104
105
106
107
108
109
110
111
112
113
114
115
116
117
118
119
120
121
122
123
124
125
126
127
128
129
130
131
132
133
134
135
136
137
138
139
140
141
142
143
144
145
## Compare the package name lengths with download counts
Here, we still focused on all of R-packages on CRAN, and made comparison between their name lengths and total downloads over the most recent 6 month period.
```{block, type="discovery", echo = TRUE}
**Finding 1**: The name lengths of R-packages have no significant correlation with total downloads, over the most recent 6 month period.
```
```{r namelength1}
cran_namelth_download <- cran_names_download %>%
mutate(length_name = nchar(package)) %>%
rowwise()%>%
mutate(length_name = as.numeric(length_name)) %>%
arrange(desc(total))
```
We could see from Figure \@ref(fig:namelength-plot) that, the influence on download volume resulted from name length is not obvious. But we could still observe that the name lengths of most of R-packages are centered before 10 characters long. The names , with more than 6,000,000 downloads, are between 5 and 9 characters long. The most downloaded one is R-package `r cran_namelth_download$package[1]` whose name length is `r cran_namelth_download$length_name[1]`.
(ref:namelength-plot) The names of R-packages with more than 6,000,000 downloads are between 5 and 9 characters long.
```{r namelength-plot,fig.cap="(ref:namelength-plot)"}
cran_namelth_download %>%
group_by(length_name) %>%
ggplot(aes(x = length_name, y = total)) +
geom_point()+
geom_smooth(se = F) +
scale_y_sqrt() +
scale_x_log10() +
labs( y = 'Total download counts',
x = 'The length of package names',
title = "Name lengths against the download counts",
subtitle = "for all R-packages on CRAN") +
#theme_minimal() +
theme(panel.grid.major = element_blank()) +
theme(
axis.text=element_text(size=10),
axis.title=element_text(size=12,face="bold"),
plot.title = element_text(h = 0.5),
plot.subtitle = element_text(h = 0.5))
```
```{r namelength}
`total average name length` <- as.numeric(mean(cran_namelth_download$length_name))
`pkg_trending_total` <- as.numeric(length(cran_namelth_download$package))
cran_namelth_download1 <- cran_namelth_download %>%
group_by(package) %>%
dplyr::filter(length_name < `total average name length`) %>%
ungroup() %>%
dplyr::summarise(`number of short names` = length(package)) %>%
mutate(`percentage of short names` = (`number of short names`/pkg_trending_total)*100)
compare_all_namelth <- cbind(cran_namelth_download1,`total average name length`)
```
```{block, type="discovery", echo = TRUE}
**Finding 2**: The average name length of R-packages is about 7.8 characters long, and over half of the R-packages tend to have shorter names, which may make it more easier to be remembered by users.
```
Table \@ref(tab:pct-lngname) shows that the average name length of all the R-packages is `r compare_all_namelth[[3]]`. And over half of the CRAN R-packages are more likely with name lengths below average. And R-packages with shorter names can be easier to get relatively higher downloads. That may because shorter named packages are easier for users to remember.
```{r pct-lngname}
library(kableExtra)
compare_all_namelth %>%
kable(caption = "Percentage of packages whose name lengths are below average") %>%
kable_styling(bootstrap_options = c("hover", "striped"))
```
After finding that there is no obvious relationship between the name lengths and the download volume, a new question came up : Can the name lengths of the R-packages be linked to the time of initial release date?
```{block, type="discovery", echo = TRUE}
**Finding 4**: For task view R-packages, the name lengths increase with the initial release dates, especially for Bayesian packages.
```
We may have this kind of experience in life : for example, detective novels, the later they are released, the less names can be chosen, because many names have been occupied by the books published earlier, with the same theme. Therefore, those later published books often have to lengthen their names to distinguish themselves from the existing books. Coincidentally, we guessed the naming of R-packages from the same topic would also be correlated to the initial release time. So, we looked back to the CRAN task view R-packages[@crantaskviews], for conducting comparison among R-packages from the same topic. Figure \@ref(fig:namelth-taskview) shows the name lengths of CRAN task view R-packages against the initial release dates. It is obvious that the name lengths of task view R-package tend to increase with the initial release dates, especially for `Bayesian` R-packages.
(ref:namelth-taskview) The name lengths of task view R-packages slightly increase with the initial release dates.
```{r namelth-taskview,fig.height = 10,fig.cap="(ref:namelth-taskview)"}
taskview_downloads %>%
group_by(package) %>%
summarise(total = sum(count),
taskview = unique(taskview)) %>%
left_join(initial_release, by = "package") %>%
mutate(length_name = nchar(package))%>%
ggplot(aes(x = initial, y = length_name)) +
geom_point(colour = "grey")+
geom_smooth(aes(colour = taskview),se = F) +
facet_wrap(~taskview, ncol = 1) +
scale_y_log10() +
labs( y = 'Name length',
x = 'The initial release date',
title = "Name lengths against the initail release dates",
subtitle = "for three types of R-packages from CRAN task view") +
#theme_minimal() +
theme(panel.grid.major = element_blank()) +
theme(
axis.text=element_text(size=10),
axis.title=element_text(size=12,face="bold"),
plot.title = element_text(h = 0.5),
plot.subtitle = element_text(h = 0.5))
```
```{block, type="discovery", echo = TRUE}
**Finding 5**: For all of R-packages on CRAN, the average name length tends to generally increase with the initial release date.
```
Although we'd better explore this question among R-packages within the same topic, we also had a view on the annual change in the average name length, for all of R-packages on CRAN.
Figure \@ref(fig:avgnamelth-year) shows the average name length for all of R-packages on CRAN, released in each year. It is obvious that the name lengths of those R-packages generally increase year by year.
(ref:avgnamelth-year) The average name length for all of R-packages on CRAN released in each year, tends to rise along the time.
```{r avgnamelth-year,fig.cap="(ref:avgnamelth-year)"}
pkg_updates_all %>%
mutate(release = ymd(release)) %>%
na.omit() %>%
group_by(package) %>%
arrange(release, .by_group = TRUE) %>% #arrange within group
top_n(-1, release) %>%
ungroup()%>%
mutate(length_name = nchar(package)) %>%
mutate(Year = year(release)) %>%
group_by(Year) %>%
summarise(avg_namelth = mean(length_name)) %>%
ggplot(aes(x = Year,y = avg_namelth)) +
geom_point(colour = "grey")+
geom_smooth(se = F) +
labs( y = 'Average name length',
x = 'The initial release date',
title = "Average name length of R-packages",
subtitle = "released in each year") +
theme_minimal() +
theme(panel.grid.major = element_blank()) +
theme(
axis.text=element_text(size=10),
axis.title=element_text(size=12,face="bold"),
plot.title = element_text(h = 0.5),
plot.subtitle = element_text(h = 0.5))
```