-
Notifications
You must be signed in to change notification settings - Fork 2
/
04-S03-release-date.Rmd
115 lines (93 loc) · 6.19 KB
/
04-S03-release-date.Rmd
1
2
3
4
5
6
7
8
9
10
11
12
13
14
15
16
17
18
19
20
21
22
23
24
25
26
27
28
29
30
31
32
33
34
35
36
37
38
39
40
41
42
43
44
45
46
47
48
49
50
51
52
53
54
55
56
57
58
59
60
61
62
63
64
65
66
67
68
69
70
71
72
73
74
75
76
77
78
79
80
81
82
83
84
85
86
87
88
89
90
91
92
93
94
95
96
97
98
99
100
101
102
103
104
105
106
107
108
109
110
111
112
113
114
115
## Compare last year's downloads with the initial release date
```{block, type="discovery", echo = TRUE}
**Finding**: R-packages that are initially released earlier on CRAN tend to have higher download counts in the past year. That is perhaps because, in earlier times, there were fewer R-packages in the same category, then users had 'no choice' but to use them. Due to that, those R-packages would accumulate user base, which makes it more possible to attract new users.
```
In our common cognition, it may be assumed that the earlier an R-package is released, the more people can get to know it, and thus the more downloads it can have. However, R-packages related to different topics cannot be directly compared, because download counts of R-packages in one topic can be higher than that in another. Therefore, in order to test this conjecture as clearly as possible, we selected three domain R-packages through CRAN task view[@crantaskviews], calculated their respective downloads in the previous one year, and extracted their earliest release dates for comparison. Those three topics are :
- R-packages for Time Series Analysis
The first topic is Time Series Analysis. Time Series Analysis is a statistical technique that deals with time series data, or trend analysis. Time series data means that data is in a series of particular time periods or intervals[@timeseries].
- Bayesian R-packages for general model fitting
The second topic is Bayesian Inference. Bayesian statistics is a mathematical procedure that applies probabilities to statistical problems. It provides people the tools to update their beliefs in the evidence of new data[@bayesian].
- Econometrics R-packages
The last topic is related to econometrics. Econometrics is the use of statistical methods using quantitative data to develop theories or test existing hypotheses in economics or finance, which relies on techniques such as regression models and null hypothesis testing[@econometrics].
```{r vars-release-date, cache = TRUE,eval=FALSE}
# function used to get the release dates from CRAN
pkg_url <- "https://cran.r-project.org/web/packages/{pkg}/index.html"
pkg_archive <- "https://cran.r-project.org/src/contrib/Archive/{pkg}/"
releases <- function(pkg) {
archive_dates <- tryCatch({
read_html(glue(pkg_archive)) %>%
html_table() %>%
.[[1]] %>%
pull(`Last modified`) %>%
ymd_hm() %>%
na.omit() %>%
as.Date()
}, error = function(e) {
NULL
})
last_update <- read_html(glue(pkg_url)) %>%
html_table() %>%
.[[1]] %>%
filter(X1=="Published:") %>%
pull(X2) %>%
as.Date()
data.frame(package = pkg,
release = as.character(c(archive_dates, last_update)))
}
taskviews <- c("Econometrics", "TimeSeries", "Bayesian")
pkg_releases <- map_dfr(taskviews, ~{
pkgs <- ctv:::.get_pkgs_from_ctv_or_repos(.x,
repos = "http://cran.rstudio.com/")[[1]]
map_dfr(pkgs, function(x) releases(x)) %>%
mutate(taskview = .x)
})
```
```{r vars-taskview-downloads, cache = TRUE,eval=FALSE}
#get total downloads for three types of Taskview packages
td_start <- Sys.Date() - 365 - 2 # one year
td_end <- Sys.Date() - 2
taskview_downloads <- map_dfr(taskviews, ~{
pkgs <- ctv:::.get_pkgs_from_ctv_or_repos(.x,
repos = "http://cran.rstudio.com/")[[1]]
cran_downloads(pkgs, from = td_start, to = td_end) %>%
mutate(taskview = .x)
})
```
```{r loadtaskview}
#save(taskview_downloads, file=here::here("data/taskview_downloads.rda"))
#save(pkg_releases, file=here::here("data/pkg_releases.rda"))
load("data/taskview_downloads.rda")
load("data/pkg_releases.rda")
```
Figure \@ref(fig:release-downloads) displays the scatterplot of the past year's download counts and the earliest release dates, for `Time Series Analysis`, `Econometrics` and `Bayesian` R-packages. It can be seen that, generally, as the earliest release dates get later and later, the numbers of download logs become lower and lower. For `Time Series Analysis` R-packages, they are mainly released between 2012 and 2019. For `Bayesian` R-packages, most of the R-packages are born from 2007 to 2012. And most `Econometrics` are centered between 2013 and 2016.
(ref:release-downloads) The download counts decrease with the initial release dates.
```{r release-downloads, fig.cap="(ref:release-downloads)",fig.height = 12}
initial_release <- pkg_releases %>%
mutate(releases = as.Date(release)) %>%
group_by(package) %>%
summarise(initial = min(releases)) %>%
ungroup()
taskview_downloads %>%
group_by(package) %>%
summarise(total = sum(count),
taskview = unique(taskview)) %>%
left_join(initial_release, by = "package") %>%
ggplot(aes(initial, total,colour = taskview)) +
geom_point() +
geom_smooth() +
facet_wrap(~taskview, ncol = 1) +
scale_y_log10() +
labs(x = "Initial release dates",
y = "Download counts",
title = "Initial release dates VS Total download counts",
subtitle = "for Time Series Analysis, Bayesian and Econometrics R-packages") +
theme(panel.grid.major = element_blank()) +
theme(
axis.text=element_text(size=10),
axis.title=element_text(size=12,face="bold"),
plot.title = element_text(h = 0.5),
plot.subtitle = element_text(h = 0.5)
)
```
In conclusion, it is not surprising to find that the earlier the R-package is released, the more downloads it could have, which is reflected in all of three topics above. That is probably because the R-packages released earlier can be better-known. When they are released early, there may be a relatively small number of R-packages in the same topic, under non-serious competition. As a result, the R-packages coming later can easily be covered up, since people may generally tend to use well-known, mature and habitual packages.
That is to say, earlier R-packages are more conducive to the cultivation of user habits. After all, habits are influenced by the length of time. For example, if the teacher is a senior user of an R-package, they may recommend that R-package to their students when teaching, especially when they obtain a satisfying user experience.