02-data-cards.Rmd

# card data

## read raw

using `series_names_spreadsheet` vector (which does not necessarily match the series names provided by Bulbapedia), read the excels.
```{r}
raw_xlsx <- 
  map(
    series_names_spreadsheet,
    ~ read_xlsx(
      excel_path,
      sheet = .x,
      skip = 3,
      col_types = "text"
    )
  ) # 30 sec
```

## convert to df

`raw_xlsx` is still a list. Convert it to df: 

```{r}
df_series |> select(!ends_with("_ja")) #156 series
df_series_en <- df_series |> filter(!is.na(series)) |> select(!ends_with("_ja")) |> distinct() # 143 series
cards_df <- 
  map2(
    .x = raw_xlsx,
    .y = c_series_spreadsheet_renamed,
    ~ mutate(.data = .x, series = rep(.y, nrow(.x)))
  ) |> 
  map(
    ~ mutate(
      .data = .x,
      set_no = `Set #` |> as.character() # some are num, some are char
    )
  ) |> 
  bind_rows() |> 
  rename(card_type = Type, card_name = Name) |> 
  filter(!is.na(card_type)) |> # remove empty rows
  select(set_no, card_name, card_type, series) |> 
  left_join(
    df_series_en,
    by = c("series")
  )
cards_df |> 
  filter(
    series == "DP Black Star Promos"
  ) # no duplicates
cards_df # 14565 
```


## fix typos in card_type

"Fightning", "Coloress" etc.


```{r}
# check typos

pokemon_df_summarised <- cards_df |> 
  group_by(card_type) |> 
  summarise(n = n()) |> 
  arrange(n |> desc())
cards_df$card_type |> unique()
```

```{r}
cards_df_typo <- cards_df |> 
  mutate(
    card_type = recode(card_type, 
      Fightning = "Fighting",
      FIGHting = "Fighting",
      Coloress = "Colorless",
      Colorelss = "Colorless",
      Normal = "Colorless",
      Electric = "Lightning",
      Lightnijg = "Lightning",
      Lighting = "Lightning"
    ),
    card_name = 
      str_replace(
        card_name, "\\s+", " "
      ) |> # Garados* δ EX Holon Phantoms 102/110 includes two spaces :(
      str_replace("Dartix", "Dartrix") |> 
      str_replace("\\sForm\\s", " Forme ") |> 
      str_replace("Melmetal\\sV", "MelmetalV") |> 
      str_replace("Exeggutor\\sV", "ExeggutorV") |> 
      str_replace("Hatternee", "Hatterene") |> 
      str_replace("Primal KyogreEK", "Primal KyogreEX") |> 
      str_replace("StaraptorFCLV.X", "StaraptorFBLV.X") |>
      str_replace("Sirfetch’d", "Sirfetch'd") |> 
      str_replace("Castform\\sRain\\s", "Castform Rainy\\s"),
    card_type = if_else(card_name == "Morty", "Supporter", card_type), # one of the two Morty incorrectly classified as "Psychic"
  )
```


```{r}
c_types_non_pokemon <-  c("Trainer", "Energy", "Supporter", "Item", "Stadium", "Tool", "TM")
pokemon_type_ranking <- cards_df_typo |>
  filter(!card_type %in% c_types_non_pokemon) |> 
  group_by(card_type) |> 
  summarise(n = n()) |> 
  arrange(desc(n))
c_major_types <- pokemon_type_ranking |> 
  filter(n > 100) |> # omit type+type pokemons
  select(card_type) |> 
  pull()
#check
c_major_types
```


```{r}
c_mixed_types <- pokemon_type_ranking |> 
  filter(n <= 100) |> # select type+type pokemons
  select(card_type) |> 
  pull()
#check
c_mixed_types
```

## color the types

[kawaii](https://pokepalettes.com/)
[palettetown and pokepal](https://github.com/timcdlucas/palettetown) not so much helpful here
[r base color sucks](https://bookdown.org/hneth/ds4psy/D-3-apx-colors-basics.html)
[manually pick colour](https://codepen.io/HunorMarton/details/eWvewo)


```{r}

df_type_colour <- tribble(
  ~ card_type2, ~ rgb, ~ colour,
      "non-Pokémon", "#f8f8f8", "grey",
      "Water", "#4AA9DC", "lightblue",
      "Grass", "#9BDE71", "green",
      "Psychic", "#7D66A3", "purple",
      "Colorless", "#D1D1D1", "lightgrey",
      "Fighting", "#FCAC26", "orange",
      "Fire", "#F5592D", "red",
      "Lightning", "#F9E000", "yellow",
      "Darkness", "#37464C", "darkgrey",
      "Metal", "#707C79", "silver",
      "Dragon", "#ABAD00", "gold",
      "Fairy", "#F2499A", "pink",
      "mixed", "#000000", "black",
)
df_type_colour %>%
  ggplot(aes(x = 1, y = 1:nrow(.), colour = I(rgb), label = card_type2)) +
  geom_text(fontface = "bold")+
  theme_pokemon
c_named_colour <- df_type_colour |> 
  # column_to_rownames(var = "card_type") |> 
  select(card_type2, rgb) |> 
  deframe() # https://stackoverflow.com/questions/19265172/converting-two-columns-of-a-data-frame-to-a-named-vector
```


## repair set numbers that are converted to Dates

regex: [Rdrr](https://rdrr.io/cran/stringi/man/about_search_regex.html)

```{r}
cards_df_date <- cards_df_typo|> 
  mutate(
    card_type2 = case_when(
      card_type %in% c_mixed_types ~ "mixed",
      TRUE ~ card_type),
    is_pokemon = !card_type %in% c_types_non_pokemon & card_name != "Buried Fossil"
    # tricky edge case: Buried Fossil is not a pokemon but 
    # Colorless item that can be used like a pokemon.
  ) |> 
  mutate(
    set = 
      case_when(
        str_detect(
          set_no,
          pattern = "\\d{4}-\\d{2}-\\d{2}"
        ) ~
        # use {} to use dot operator more than once,
        # and use %>% instead of base |> to do so. base pipe doesn't support it
        ymd(set_no) %>%
          {str_c(month(.), "/", day(.))}, # 11086 failed to parse
        str_detect(
          set_no,
          pattern = "\\d{5}.0"
        ) ~
        as.character(set_no) |> 
          as.numeric() |>
          as.Date(origin = "1899/12/30") %>%
          {str_c(month(.), "/", day(.))}, # NAs introduced by coersion
        TRUE ~ set_no
      )
  ) |> 
  select(set, everything(), -set_no)
cards_df_date
```
Ignore the warnings (`Warning:  11193 failed to parse.Warning: NAs introduced by coercion`). Idk why it warns me that.

## card number, promo or not

[regex stringr](https://stringr.tidyverse.org/articles/regular-expressions)
[cheat sheet](https://evoldyn.gitlab.io/evomics-2018/ref-sheets/R_strings.pdf) look arounds
```{r}
# https://stackoverflow.com/questions/6109882/regex-match-all-characters-between-two-strings
# str_view_all("BH233/BH244", "(?<=[:alpha:]{0,8})\\d+(?=/)")
# str_view_all("50a/147", "(?<=[:alpha:]{0,8})\\d+(?=[:lower:]{0,1}/)")

cards_df_card_no <- cards_df_date |> 
  mutate(
    card_number = 
      case_when(
        str_detect(set, ".?/.?") ~
          str_extract(
            set, 
            "(?<=[:alpha:]{0,3})\\d+(?=[:lower:]{0,1}/)"
          ) |> 
          as.integer(),
        str_detect(set, "\\d+") ~
          str_extract(set, "\\d+") |> as.integer(),
        str_detect(set, "\\d+$") ~
          str_extract(set, "\\d+$") |> as.integer(),
        TRUE ~ NA_integer_
      ),
    cards_total_official = 
      if_else(
        str_detect(set, ".?/.?"),
        str_extract(
          set,
          "(?<=/[:alpha:]{0,3})\\d+"
        ) |> 
        as.integer(),
        NA_integer_
      ),
    is_secret_card =  (card_number > cards_total_official), # !is.na(card_total) &&
    release_date = as_date(release_date) # POSIXct to Date
  )
cards_df_card_no |> colnames()
cards_df_card_no2 <- 
  cards_df_card_no |> 
  select(
    card_name, 
    series, 
    release_date, 
    set, 
    card_number, 
    cards_total_official, 
    starts_with("is_"),  
    everything()
  ) 
cards_df_card_no2
```
## count of data points

Add a row to df that counts up the number of actual data points of pokemon cards available in the dataset [#4](https://github.com/xerroxcopy/pokemon-tcg/issues/4)
```{r}
summarise_df_count_cards <- 
  cards_df_card_no2 |> 
  filter(is_pokemon) |> # count only pokemons
  group_by(series) |> 
  summarise(pokemon_data_count = n())
summarise_df_count_cards
df_series2 <- df_series |> 
  left_join(summarise_df_count_cards, by = "series") |> 
  select(series_class:cards_total, pokemon_data_count, everything())
```

## name the final df df_card

```{r}
df_card <- cards_df_card_no2
```

wrangle the `df` further in `pokemon-genes.Rmd` next. `df` to `df_genes`.
## columns 
```{r}
# df |> colnames() |> paste0(collapse = "|\n")
# df$series_class|> unique()
# df |> 
#   filter(series == "Pokémon GO")
df_card |> filter(release_date == as.Date("2009-12-01"))
df_card |> dim() # 14565 x 13
```

|column|meaning|
|---|---|
|`card_name`|name from V3.25, fixed typo.|
|`series_gen`|generation of the series according to V3.25|
|`series`|name of the series. the names are aligned to that on bulbapedia.|
|`release_date`|release date of the series according to bulbapedia (except `Trainer Kit`s, which are based on V3.25). date, not POSIXct. |
|`set`|set_no in V3.25. `chr`|
|`card_number`|card number of `set`, `int`. `12` if `GH12/GH124`, `123` if `AB123`.|
|`cards_total_official`|official total number of cards, excluding secret cards. `int`. Available only if the card number is stylized as a fraction, e.g., `GH12/GH124`.|
|`is_pokemon`|`TRUE` if pokemon. `Buried Fossil` is `FALSE` though it has a `card_type` of `Colorless`.|
|`is_secret_card`|`TRUE` if `card_number` of that card exceeds `cards_total_official`.|
|`card_type`|type of the card, `df$card_type |> unique()` `Lightning`, `Tool`, `Metal/Fighting`, etc. Some typos are fixed.|
|`series_class`|from `df_series`, `Black Star Promo`, `Main Expansion`, `Special Expansions`, `Trainer Kit`, `Pop Series`, and `McDonalds Collection`. classified based on bulbapedia.|
|`cards_total`|the total count of the cards, according to Bulbapedia. Doesn't necessarily match the number of cards listed in `df`, since the spreadsheet may be incomplete.|
|`series_abb`|abbreviation on Bulbapedia.|
|`meta_is_bulba_only`|metadata: `TRUE` if the series is available on Bulbapedia. in `df` this is all `FALSE`.|
|`meta_is_v325_only`|metadata: `TRUE` if the series is available on spreadsheet but not on the series list of Bulbapedia. `TRUE`: `Trainer Kit`s, `FALSE` everything else.|
|`colour`||
|`card_type2`|simplified `card_type`s.|