This repository contains the scripts used to create the WikiSpeciesHabitats dataset. The dataset contains habitat and species-occurrence pairs, together with a textual description of each species taken from its Wikipedia page. Its range is limited to Switzerland because of the coverage of the habitat maps.
It is obtained by merging information from three sources:
- The Habitat Map of Switzerland v1
- Global Biodiversity Information Facility (GBIF) occurrence data
- Wikipedia dump
First, install the required packages in your environment by running
pip install -r requirements.txt
Then create the following directories: ./raw_data, ./raw_data/studyArea, ./processed_data, ./WikiSpeciesHabitats/, ./WikiSpeciesHabitats/species/, ./wikipedia_species/
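These directories can also be created in one go; a minimal sketch using only the Python standard library:

```python
from pathlib import Path

# Directories required by the pipeline, relative to the working directory
DIRS = [
    "raw_data/studyArea",
    "processed_data",
    "WikiSpeciesHabitats/species",
    "wikipedia_species",
]

for d in DIRS:
    # parents=True also creates ./raw_data and ./WikiSpeciesHabitats;
    # exist_ok=True makes the script safe to re-run
    Path(d).mkdir(parents=True, exist_ok=True)
```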
- Define a study area and save it in shapefile format under the name studyArea.shp
- You can also create a .geojson version of the file, as it can be used to extract data
- Save your files in the ./raw_data/studyArea/ directory
- Download the cantonal habitat maps for each canton you are interested in (in this case, only VD and VS have been used)
- Each download should be named habitatmap_xx_yyyymmdd, with xx being the two-letter code for the canton
- Rename each downloaded folder to habitatmap_xx
- Place your downloaded folders in ./raw_data/
- Select the filters you want (country, administrative area, etc.) and download the occurrence data in .csv format. You should choose the "simple" option.
- Rename your download gbif_raw.csv and place it in the ./raw_data/ directory
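Note that despite the .csv extension, GBIF "simple" exports are tab-separated. A hedged sketch of loading the occurrence rows (the column names shown are assumptions based on the GBIF simple format; check them against your own download):

```python
import csv
import io

def read_occurrences(fileobj, delimiter="\t"):
    """Yield occurrence rows as dicts from a GBIF 'simple' export."""
    yield from csv.DictReader(fileobj, delimiter=delimiter)

# Tiny synthetic example with a subset of typical columns
# (hypothetical values, not real records); in practice you would pass
# open("raw_data/gbif_raw.csv", newline="", encoding="utf-8")
sample = (
    "species\tspeciesKey\tdecimalLatitude\tdecimalLongitude\n"
    "Turdus merula\t1234567\t46.52\t6.63\n"
)
rows = list(read_occurrences(io.StringIO(sample)))
```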
- Download a Wikipedia dump (this post might help you)
- Put all the dump files in the ./wikipedia_dump/ directory
- Make sure that the ./wikipedia_species/ directory exists; if not, create it
- Run the wikipedia dump parsing script
python parse_wikipedia.py
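The parser produces one JSON file per species key (presumably into ./wikipedia_species/, since that directory must exist beforehand). A sketch of reading one such file; the record schema shown here is an assumption, the actual fields are whatever parse_wikipedia.py emits:

```python
import json
from pathlib import Path
from tempfile import TemporaryDirectory

# Hypothetical miniature of a per-species file such as 100124.json;
# the real schema is defined by parse_wikipedia.py
with TemporaryDirectory() as tmp:
    record = {"title": "Turdus merula", "text": "The common blackbird is ..."}
    path = Path(tmp) / "100124.json"
    path.write_text(json.dumps(record), encoding="utf-8")

    loaded = json.loads(path.read_text(encoding="utf-8"))
```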
At this stage, your working directory (here named WSH) should look like this:
WSH
├── processed_data/
├── raw_data/
│   ├── habitatmap_vd/
│   │   ├── HabitatMap_VD.gdb/
│   │   └── ...
│   ├── habitatmap_vs/
│   │   ├── HabitatMap_VS.gdb/
│   │   └── ...
│   ├── studyArea/
│   │   ├── studyArea.shp
│   │   └── ...
│   └── gbif_raw.csv
├── wikipedia_dump/
│   └── ...
├── WikiSpeciesHabitats/
│   └── species/
│       ├── 100124.json
│       └── ...
├── create_dataset.py
├── grid.py
├── parse_wikipedia.py
└── requirements.txt
Then, you can run the following command to create the dataset:
python create_dataset.py --STEP all
You might want to edit the create_dataset.py file to configure which cantons you are using.
Once all the steps have been executed, the following files should have been created:
- .json files named in the format specieskey.json, containing the Wikipedia page content for each species. These files are in the WikiSpeciesHabitats/species/ directory
- WikiSpeciesHabitats/habitatsData.json, which contains information for each habitat type
- WikiSpeciesHabitats/speciesData.json, which contains taxonomy and information for each species
- WikiSpeciesHabitats/speciesInZones.json, which contains the set of observed species for each zone, as well as the corresponding habitat type
- A number of other files, useful for computing statistics about the dataset, in the processed_data/ directory.
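As an example of consuming the result, a hedged sketch assuming speciesInZones.json maps zone identifiers to a habitat type plus a list of species keys (verify against the schema actually written by create_dataset.py):

```python
import json
import io

# Hypothetical miniature of speciesInZones.json; in practice you would
# pass open("WikiSpeciesHabitats/speciesInZones.json", encoding="utf-8")
sample = io.StringIO(json.dumps({
    "zone_1": {"habitat": "6.1.1", "species": [100124, 212407]},
    "zone_2": {"habitat": "4.2.1", "species": [100124]},
}))

zones = json.load(sample)
# Count the observed species per zone
counts = {z: len(info["species"]) for z, info in zones.items()}
```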