This repository contains the scripts used to create the WikiSpeciesHabitats dataset. The dataset contains habitat and species-occurrence pairs, together with a textual description of each species taken from its Wikipedia page. Its range is limited to Switzerland because of the coverage of the habitat maps.
It is obtained by merging information from three sources:
- The Habitat Map of Switzerland v1
- Global Biodiversity Information Facility (GBIF) occurrence data
- Wikipedia dump
First, install the required packages in your environment by running
pip install -r requirements.txt
Then create the following directories: ./raw_data, ./raw_data/studyArea, ./processed_data, ./WikiSpeciesHabitats/, ./WikiSpeciesHabitats/species/, ./wikipedia_species/
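These directories can also be created in one go; a minimal sketch using only the Python standard library:

```python
from pathlib import Path

# Directories required by the pipeline, relative to the working directory
DIRS = [
    "raw_data/studyArea",
    "processed_data",
    "WikiSpeciesHabitats/species",
    "wikipedia_species",
]

for d in DIRS:
    # parents=True also creates ./raw_data and ./WikiSpeciesHabitats;
    # exist_ok=True makes the script safe to re-run
    Path(d).mkdir(parents=True, exist_ok=True)
```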
- Define a study area and save it in shapefile format under the name studyArea.shp
- You can also create a .geojson version of the file, as it can be used to extract data
- Save your files in the ./raw_data/studyArea/ directory
- Download the cantonal habitat maps for each canton you are interested in (in this case, only VD and VS have been used)
- Each download should be named habitatmap_xx_yyyymmdd, with xx being the two-letter code for the canton
- Rename each downloaded folder to habitatmap_xx
- Place your downloaded folders in ./raw_data/
- Select the filters you want (country, administrative area, etc.) and download the occurrence data in .csv format. You should choose the "simple" option.
- Rename your download gbif_raw.csv and place it in the ./raw_data/ directory
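Note that despite the .csv extension, GBIF "simple" exports are tab-separated. A hedged sketch of loading the occurrence rows (the column names shown are assumptions based on the GBIF simple format; check them against your own download):

```python
import csv
import io

def read_occurrences(fileobj, delimiter="\t"):
    """Yield occurrence rows as dicts from a GBIF 'simple' export."""
    yield from csv.DictReader(fileobj, delimiter=delimiter)

# Tiny synthetic example with a subset of typical columns
# (hypothetical values, not real records); in practice you would pass
# open("raw_data/gbif_raw.csv", newline="", encoding="utf-8")
sample = (
    "species\tspeciesKey\tdecimalLatitude\tdecimalLongitude\n"
    "Turdus merula\t1234567\t46.52\t6.63\n"
)
rows = list(read_occurrences(io.StringIO(sample)))
```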
- Download a Wikipedia dump (this post might help you)
- Put all the dump files in the ./wikipedia_dump/ directory
- Make sure that the ./wikipedia_species/ directory exists; if not, create it
- Run the wikipedia dump parsing script
python parse_wikipedia.py
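The parser produces one JSON file per species key (presumably into ./wikipedia_species/, since that directory must exist beforehand). A sketch of reading one such file; the record schema shown here is an assumption, the actual fields are whatever parse_wikipedia.py emits:

```python
import json
from pathlib import Path
from tempfile import TemporaryDirectory

# Hypothetical miniature of a per-species file such as 100124.json;
# the real schema is defined by parse_wikipedia.py
with TemporaryDirectory() as tmp:
    record = {"title": "Turdus merula", "text": "The common blackbird is ..."}
    path = Path(tmp) / "100124.json"
    path.write_text(json.dumps(record), encoding="utf-8")

    loaded = json.loads(path.read_text(encoding="utf-8"))
```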
At this stage, your working directory (here named WSH) should look like this:
WSH
├── processed_data/
├── raw_data/
│   ├── habitatmap_vd/
│   │   ├── HabitatMap_VD.gdb/
│   │   └── ...
│   ├── habitatmap_vs/
│   │   ├── HabitatMap_VS.gdb/
│   │   └── ...
│   ├── studyArea/
│   │   ├── studyArea.shp
│   │   └── ...
│   └── gbif_raw.csv
├── wikipedia_dump/
│   └── ...
├── WikiSpeciesHabitats/
│   └── species/
│       ├── 100124.json
│       └── ...
├── create_dataset.py
├── grid.py
├── parse_wikipedia.py
└── requirements.txt
Then, you can run the following command to create the dataset:
python create_dataset.py --STEP all
You might want to edit the create_dataset.py file to configure which cantons you are using.
Once all the steps have been executed, the following files should have been created:
- .json files named in the format specieskey.json, containing the Wikipedia page content for each species. These files are in the WikiSpeciesHabitats/species/ directory
- WikiSpeciesHabitats/habitatsData.json, which contains information for each habitat type
- WikiSpeciesHabitats/speciesData.json, which contains taxonomy and information for each species
- WikiSpeciesHabitats/speciesInZones.json, which contains the set of observed species for each zone, as well as the corresponding habitat type
- A number of other files, useful for computing statistics about the dataset, in the processed_data/ directory.
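As an example of consuming the result, a hedged sketch assuming speciesInZones.json maps zone identifiers to a habitat type plus a list of species keys (verify against the schema actually written by create_dataset.py):

```python
import json
import io

# Hypothetical miniature of speciesInZones.json; in practice you would
# pass open("WikiSpeciesHabitats/speciesInZones.json", encoding="utf-8")
sample = io.StringIO(json.dumps({
    "zone_1": {"habitat": "6.1.1", "species": [100124, 212407]},
    "zone_2": {"habitat": "4.2.1", "species": [100124]},
}))

zones = json.load(sample)
# Count the observed species per zone
counts = {z: len(info["species"]) for z, info in zones.items()}
```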