# individual poke
To compare against the simplest null model, plot the CCDF of the real data on log-log axes.
## First, count without any preprocessing
```{r}
df_poke_count_raw <- df_gene |>
filter(is_pokemon) |>
group_by(card_name) |>
summarise(n = n()) |>
arrange(desc(n))
```
### ECDF
[負うた子に教えられて浅瀬を渡る](https://qiita.com/xerroxcopy/items/b79635ef3dbcc29644c6)
```{r}
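# pokemon name. the only data type that is comparable to the random-draw null model
# note: `1 - ecdf(n)(n - .01)` evaluates the ECDF just below each integer count,
# so the `ecdf` column actually holds the CCDF, Pr(X >= n)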
df_count_pokemon <- df_gene |>
filter(is_pokemon) |>
group_by(pokemon_name) |>
summarise(n = n()) |>
arrange(desc(n)) |>
mutate(ecdf = 1 - ecdf(n)(n - .01)) |>
select(-pokemon_name) |>
distinct() |>
mutate(condition = "pokémon name") |>
arrange(desc(n))
# card name.
df_count_full_name <- df_gene |>
filter(is_pokemon) |>
group_by(card_name) |>
summarise(n = n()) |>
arrange(desc(n)) |>
mutate(ecdf = 1 - ecdf(n)(n - .01)) |>
select(-card_name) |>
distinct() |>
mutate(condition = "card name")
# card name + type.
# distinguish pokemons with the same name, but with different types
df_count_full_name_type <- df_gene |>
filter(is_pokemon) |>
mutate(name_type = paste(card_name, card_type2)) |>
group_by(name_type) |>
summarise(n = n()) |>
arrange(desc(n)) |>
mutate(ecdf = 1 - ecdf(n)(n - .01)) |>
select(-name_type) |>
distinct() |>
mutate(condition = "card name + type")
# advanced:
# full name, but remove decorated ("deco-tsuki") cards
# e.g. for Pikachu, count only the cards named exactly "Pikachu".
df_count_full_name_remove_decorated <- df_gene |>
filter(is_pokemon) |>
filter(card_name == pokemon_name) |>
group_by(card_name) |> # same as grouping by pokemon_name since card_name == pokemon_name
summarise(n = n()) |>
arrange(desc(n)) |>
mutate(ecdf = 1 - ecdf(n)(n - .01)) |>
select(-card_name) |>
distinct() |>
mutate(condition = "card name, remove decorated")
```
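Note that `ecdf()` returns the fraction with X <= n, so `1 - ecdf(n)(n)` is the fraction with X > n, not the fraction with X >= n.
For this analysis, at `n = 1` the fraction of Pokémon cards with one or more copies has to be 1. Instead, at n = 1 it computes the fraction with more than one copy, something like 1 - .32. Likewise, at n = 58 (Pikachu), the fraction with X >= 58 should be 1 / 2425 (2425 is the number of unique Pokémon names, `df$card_name |> unique() |> length()`), but it comes out as 0, which means it is computing the fraction with X > 58. Evaluating the ECDF at `n - .01`, as the chunk above does, works around this because the counts are integers.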
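
The off-by-one described above can be reproduced on a toy vector. This is a minimal sketch using made-up copy counts; `x`, `F_hat`, and the `ccdf_*` names are illustrative and not part of the analysis.

```{r}
# Toy copy counts (hypothetical data, not taken from df_gene)
x <- c(1, 1, 1, 2, 2, 3, 5)

F_hat <- ecdf(x)       # F_hat(v) is the fraction of values <= v
v <- sort(unique(x))   # distinct counts at which to evaluate

ccdf_strict <- 1 - F_hat(v)         # fraction with X > v; 0 at the maximum
ccdf_offset <- 1 - F_hat(v - 0.01)  # fraction with X >= v for integer counts
ccdf_direct <- sapply(v, function(k) mean(x >= k))  # direct count as a cross-check

data.frame(n = v, ccdf_strict, ccdf_offset, ccdf_direct)
```

The offset version agrees with the direct count only because the counts are integers; for non-integer data, `mean(x >= k)` would be the safer choice.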
## plot CCDF
Complementary cumulative distribution function (CCDF), Pr(X >= x). This is the quantity stored in the `ecdf` column.
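
For reference, the empirical version of this quantity can be written as follows. The symbols are my own notation, not the source's: $N$ is the number of distinct names under a given counting scheme and $X_i$ is the number of copies of name $i$.

$$
\widehat{\Pr}(X \ge x) = \frac{1}{N} \sum_{i=1}^{N} \mathbf{1}[X_i \ge x]
$$

Under this definition the value at $x = 1$ is 1, and the value at the largest count is $1/N$ when only one name reaches it, which matches the 1 / 2425 expected for Pikachu.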
### full name, type
```{r}
p_count_full_name_type <- df_count_full_name_type |>
ggplot(aes(x = n, y = ecdf)) +
geom_point() +
scale_x_continuous(trans = "log10", breaks = 10^(0:10), name = expression(italic("x"))) +
scale_y_continuous(trans = "log10", breaks = 10^(0:-10), name = expression(italic("Pr") ( X>= x) )) +
labs(title = "card name + type") +
theme_pokemon +
theme(
aspect.ratio = 1
)
p_count_full_name_type
```
### full name
```{r}
p_count_full_name <- df_count_full_name |>
ggplot(aes(x = n, y = ecdf)) +
geom_point() +
scale_x_continuous(trans = "log10", breaks = 10^(0:10), name = expression(italic("x"))) +
scale_y_continuous(trans = "log10", breaks = 10^(0:-10), name = expression(italic("Pr") ( X>= x) )) +
labs(title = "card name") +
theme_pokemon +
theme(
aspect.ratio = 1
)
p_count_full_name
```
### full name, remove decorated
Decorated cards, such as "Surfing Pikachu VMAX bababa", are omitted. Only Pikachu cards with the exact name "Pikachu" on the card are counted.
```{r}
df_count_full_name_remove_decorated |>
ggplot(aes(x = n, y = ecdf)) +
geom_point() +
scale_x_continuous(trans = "log10", breaks = 10^(0:10), name = expression(italic("x"))) +
scale_y_continuous(trans = "log10", breaks = 10^(0:-10), name = expression(italic("Pr") ( X>= x) )) +
labs(title = "only the cards with plain pokemon name") +
theme_pokemon +
theme(
aspect.ratio = 1
)
```
...Not much of a difference!!
### poke name
Counts by Pokémon name. This is the only data type that is comparable to the random-draw null model.
```{r}
df_count_pokemon |>
ggplot(aes(x = n, y = ecdf)) +
geom_point() +
scale_x_continuous(trans = "log10", breaks = 10^(0:10), name = expression(italic("x"))) +
scale_y_continuous(trans = "log10", breaks = 10^(0:-10), name = expression(italic("Pr") ( X>= x) )) +
labs(title = "pokemon name") +
theme_pokemon +
theme(
aspect.ratio = 1
)
```
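
Each count table carries a `condition` column, which suggests overlaying the curves in a single plot. A minimal sketch of that idea follows, assuming `dplyr` and `ggplot2` are installed. The two `toy_*` tables and their numbers are hypothetical stand-ins for the `df_count_*` tables, and `df_count_all` / `p_count_all` are names introduced here for illustration.

```{r}
library(dplyr)
library(ggplot2)

# Hypothetical stand-ins with the same columns as the df_count_* tables
# (n, ecdf, condition); the values are made up for illustration only.
toy_pokemon <- data.frame(
  n = c(1, 2, 5), ecdf = c(1, 0.4, 0.1), condition = "pokémon name"
)
toy_card <- data.frame(
  n = c(1, 3, 8), ecdf = c(1, 0.3, 0.05), condition = "card name"
)

# Stack the tables so that `condition` can be mapped to colour
df_count_all <- bind_rows(toy_pokemon, toy_card)

p_count_all <- df_count_all |>
  ggplot(aes(x = n, y = ecdf, colour = condition)) +
  geom_point() +
  scale_x_continuous(trans = "log10") +
  scale_y_continuous(trans = "log10") +
  theme(aspect.ratio = 1)
p_count_all
```

In the real analysis the stand-ins would be replaced by the four `df_count_*` tables, and `theme_pokemon` could be added as in the chunks above.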