edit contribution for pre-commit cue
chenyangkang committed Jan 10, 2024
1 parent 2d51ce9 commit 2d7519b
Showing 7 changed files with 230 additions and 399 deletions.
94 changes: 94 additions & 0 deletions docs/A_brief_introduction/A_brief_introduction.md
@@ -0,0 +1,94 @@
# A brief introduction to `stemflow`

**stemflow** is a Python toolkit for the Adaptive Spatio-Temporal Exploratory Model (AdaSTEM \[[1](https://ojs.aaai.org/index.php/AAAI/article/view/8484), [2](https://esajournals.onlinelibrary.wiley.com/doi/full/10.1002/eap.2056)\]). A typical use case is daily abundance estimation using [eBird](https://ebird.org/home) citizen science data (survey data).

**stemflow** adopts the ["split-apply-combine"](https://vita.had.co.nz/papers/plyr.pdf) philosophy. It

1. Splits the input data using the Quadtree algorithm.
1. Trains a model on each spatiotemporal split (called a stixel) separately.
1. Aggregates the stixels into ensembles to make predictions.


The framework leverages the "adjacency" information of surroundings in space and time to model and predict the values of target spatiotemporal points. It ameliorates the **long-distance/long-range prediction problem** [[3](https://esajournals.onlinelibrary.wiley.com/doi/abs/10.1890/09-1340.1)] and has a good spatiotemporal smoothing effect.


Technically, stemflow is positioned as a user-friendly Python package for general-purpose modeling of large spatio-temporal datasets. A scikit-learn style object-oriented modeling pipeline enables concise model construction with compact parameterization at the user end, while the rest of the modeling procedure is carried out under the hood. Once the fitting method is called, the model class recursively splits the input training data into smaller spatio-temporal grids (called stixels) using the QuadTree algorithm. For each stixel, a base model is trained using only the data that fall into that stixel. Stixels are then aggregated to constitute an ensemble. In the prediction phase, stemflow queries stixels for the input data according to their spatial and temporal index, then predicts with the corresponding base model. Finally, prediction results are aggregated across ensembles to generate robust estimates (see [Fink et al., 2013](https://ojs.aaai.org/index.php/AAAI/article/view/8484) and the stemflow documentation for details).
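The recursive spatial splitting described above can be sketched in plain Python. This is a minimal illustration under stated assumptions, not stemflow's implementation: the function `quadtree_split`, its parameter names, and the exact stopping rule (split when forced by the upper edge threshold, or when the halved edge stays above the lower threshold and every quadrant keeps enough points) are hypothetical, and the temporal dimension is ignored.

```python
import random


def quadtree_split(points, bounds, upper=25.0, lower=5.0, min_points=50):
    """Recursively split a cell into four quadrants.

    points: list of (x, y) tuples; bounds: (xmin, ymin, xmax, ymax).
    Returns a list of (bounds, points) leaf cells.
    """
    xmin, ymin, xmax, ymax = bounds
    edge = max(xmax - xmin, ymax - ymin)
    xmid, ymid = (xmin + xmax) / 2, (ymin + ymax) / 2

    quadrants = {
        (0, 0): (xmin, ymin, xmid, ymid),
        (1, 0): (xmid, ymin, xmax, ymid),
        (0, 1): (xmin, ymid, xmid, ymax),
        (1, 1): (xmid, ymid, xmax, ymax),
    }
    # Assign every point of the parent cell to exactly one quadrant.
    members = {key: [] for key in quadrants}
    for x, y in points:
        members[(int(x >= xmid), int(y >= ymid))].append((x, y))

    forced = edge > upper  # cell too large: must split
    allowed = edge / 2 >= lower and all(
        len(m) >= min_points for m in members.values()
    )  # optional split: children stay large enough and well populated
    if not (forced or allowed):
        return [(bounds, points)]

    leaves = []
    for key, child_bounds in quadrants.items():
        leaves.extend(
            quadtree_split(members[key], child_bounds, upper, lower, min_points)
        )
    return leaves


random.seed(0)
pts = [(random.uniform(0, 100), random.uniform(0, 100)) for _ in range(2000)]
leaves = quadtree_split(pts, (0, 0, 100, 100))
print(len(leaves), "leaf cells")
```

Each leaf cell plays the role of a stixel's spatial footprint: a base model would be trained only on the points it contains.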

## Choosing the model framework

In the [demo](https://chenyangkang.github.io/stemflow/Examples/01.AdaSTEM_demo.html), we use a two-step `hurdle` model as the "base model" (see more information about the `hurdle` model [here](https://chenyangkang.github.io/stemflow/Tips/Tips_for_different_tasks.html)), with `XGBClassifier` for binary occurrence modeling and `XGBRegressor` for abundance modeling. If the task is to predict abundance, there are two ways to leverage the hurdle model.

1. First, **hurdle in AdaSTEM**: one can use hurdle model in each AdaSTEM (regressor) stixel;
1. Second, **AdaSTEM in hurdle**: one can use `AdaSTEMClassifier` as the classifier of the hurdle model, and `AdaSTEMRegressor` as the regressor of the hurdle model.

In the first case, the classifier and regressor "talk" to each other in each separate stixel (hereafter, "hurdle in Ada"). In the second case, the classifiers and regressors form two "unions" separately, and these two unions only "talk" to each other at the final combination, instead of in each stixel (hereafter, "Ada in hurdle"). In [Johnston et al. (2015)](https://esajournals.onlinelibrary.wiley.com/doi/full/10.1890/14-1826.1) the first method was used. See section [[Hurdle in AdaSTEM or AdaSTEM in hurdle?]](https://chenyangkang.github.io/stemflow/Examples/05.Hurdle_in_ada_or_ada_in_hurdle.html) for further comparisons.
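To make the two-step logic concrete, here is a minimal sketch of a generic hurdle model in plain Python. It is not stemflow's `Hurdle` class: `ToyHurdle`, `ThresholdClassifier`, and `MeanRegressor` are hypothetical stand-ins, chosen only so the example runs without dependencies.

```python
class ThresholdClassifier:
    """Toy classifier: predicts presence when the first feature reaches a learned cut-off."""

    def fit(self, X, y):
        # Cut-off is the lowest feature value at which presence was observed.
        self.cut = min(x[0] for x, label in zip(X, y) if label == 1)
        return self

    def predict(self, X):
        return [1 if x[0] >= self.cut else 0 for x in X]


class MeanRegressor:
    """Toy regressor: always predicts the mean of the training targets."""

    def fit(self, X, y):
        self.mean = sum(y) / len(y)
        return self

    def predict(self, X):
        return [self.mean for _ in X]


class ToyHurdle:
    """Step 1: model occurrence (0/1). Step 2: model abundance on positive rows only."""

    def __init__(self, classifier, regressor):
        self.classifier = classifier
        self.regressor = regressor

    def fit(self, X, y):
        self.classifier.fit(X, [1 if v > 0 else 0 for v in y])
        positive = [(x, v) for x, v in zip(X, y) if v > 0]
        self.regressor.fit([x for x, _ in positive], [v for _, v in positive])
        return self

    def predict(self, X):
        occurrence = self.classifier.predict(X)
        abundance = self.regressor.predict(X)
        # Predicted absences are set to zero; presences take the regressor's value.
        return [a if o == 1 else 0.0 for o, a in zip(occurrence, abundance)]


X = [[0.1], [0.2], [0.8], [0.9]]
y = [0, 0, 4, 6]
hurdle = ToyHurdle(ThresholdClassifier(), MeanRegressor()).fit(X, y)
result = hurdle.predict([[0.05], [0.85]])
print(result)  # [0.0, 5.0]
```

In "hurdle in Ada", one such classifier/regressor pair would be fitted inside every stixel. In "Ada in hurdle", the classifier and the regressor would each be a full AdaSTEM model, combined only once at the end.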

## Choosing the grid size
Users can define the size of the stixels (spatiotemporal grids) in terms of space and time. Larger stixels promote generalizability but lose precision at fine resolution; smaller stixels may predict better within their exact area but extrapolate less well to points outside the stixel. See section [Optimizing Stixel Size](https://chenyangkang.github.io/stemflow/Examples/07.Optimizing_Stixel_Size.html) for a discussion of how to select gridding parameters.

## A simple demo
In the demo, we first split the training data using temporal sliding windows with a size of 50 days of year (DOY) and a step of 20 DOY (`temporal_start=1`, `temporal_end=366`, `temporal_step=20`, `temporal_bin_interval=50`). For each temporal slice, a spatial gridding is applied: a stixel is forced to split into four smaller pieces if its edge is longer than 25 units (measured in longitude and latitude; `grid_len_lon_upper_threshold=25`, `grid_len_lat_upper_threshold=25`), and splitting stops if it would make the edge shorter than 5 units (`grid_len_lon_lower_threshold=5`, `grid_len_lat_lower_threshold=5`) or leave a stixel with fewer than 50 checklists (`points_lower_threshold=50`). Model fitting is run on 1 core (`njobs=1`).

This process is executed 10 times (`ensemble_fold=10`), each time with random jitter and random rotation of the gridding, generating 10 ensembles. In the prediction phase, only spatiotemporal points with more than 7 (`min_ensemble_required=7`) usable ensembles are predicted (otherwise, they are set to `np.nan`).

That is:

```py
from stemflow.model.AdaSTEM import AdaSTEM, AdaSTEMClassifier, AdaSTEMRegressor
from stemflow.model.Hurdle import Hurdle
from xgboost import XGBClassifier, XGBRegressor

## "hurdle in Ada"
model = AdaSTEMRegressor(
    base_model=Hurdle(
        classifier=XGBClassifier(tree_method='hist', random_state=42, verbosity=0, n_jobs=1),
        regressor=XGBRegressor(tree_method='hist', random_state=42, verbosity=0, n_jobs=1)
    ),                                # hurdle model for zero-inflated problems (e.g., counts)
    save_gridding_plot=True,
    ensemble_fold=10,                 # data are modeled 10 times, each time with jitter and rotation in the Quadtree algorithm
    min_ensemble_required=7,          # only points covered by more than 7 ensembles will be predicted
    grid_len_lon_upper_threshold=25,  # force splitting if the longitudinal edge of a grid exceeds 25
    grid_len_lon_lower_threshold=5,   # stop splitting if the longitudinal edge of a grid would fall below 5
    grid_len_lat_upper_threshold=25,  # same as above, but latitudinal
    grid_len_lat_lower_threshold=5,
    temporal_start=1,                 # the next 4 params define the temporal sliding window
    temporal_end=366,
    temporal_step=20,
    temporal_bin_interval=50,
    points_lower_threshold=50,        # only stixels with more than 50 samples are trained
    Spatio1='longitude',              # the next three params define the names of the
    Spatio2='latitude',               # spatial and temporal coordinates in the dataframe
    Temporal1='DOY',
    use_temporal_to_train=True,       # in each stixel, whether 'DOY' should be a predictor
    njobs=1
)
```
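The temporal sliding window defined by `temporal_start`, `temporal_end`, `temporal_step`, and `temporal_bin_interval` can be sketched as follows. This is an illustration only: the helper names are hypothetical, the window edges are assumed to be half-open, and stemflow's own handling of jitter and of windows that extend past `temporal_end` is not reproduced.

```python
def sliding_windows(start=1, end=366, step=20, interval=50):
    """Return (left, right) day-of-year bounds of each temporal window."""
    return [(left, left + interval) for left in range(start, end, step)]


def windows_covering(doy, windows):
    """Windows whose half-open range [left, right) contains the given day."""
    return [w for w in windows if w[0] <= doy < w[1]]


windows = sliding_windows()
print(len(windows))                   # 19
print(windows[0], windows[-1])        # (1, 51) (361, 411)
print(windows_covering(45, windows))  # [(1, 51), (21, 71), (41, 91)]
```

Because the window size (50) is larger than the step (20), consecutive windows overlap, so a given day is covered by two or three temporal slices within a single ensemble.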


Fitting and prediction methods follow the style of sklearn `BaseEstimator` class:

```py
import numpy as np

## fit
model = model.fit(X_train.reset_index(drop=True), y_train)

## predict
pred = model.predict(X_test)
pred = np.where(pred < 0, 0, pred)  # abundance cannot be negative
eval_metrics = AdaSTEM.eval_STEM_res('hurdle', y_test, pred)  # evaluate the clipped predictions
print(eval_metrics)
```

Here, `pred` is the mean of the predicted values across ensembles.
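The averaging across ensembles, together with the `min_ensemble_required` rule, can be sketched in plain Python. The function `aggregate_ensembles` is hypothetical, and treating the threshold as inclusive (at least `min_ensemble_required` usable ensembles) is an assumption; the description above words it as "more than 7".

```python
import math


def aggregate_ensembles(predictions, min_ensemble_required=7):
    """Average per-ensemble predictions for each point.

    predictions: one list per point, holding one value per ensemble,
    with None where no stixel of that ensemble covered the point.
    """
    results = []
    for per_point in predictions:
        usable = [p for p in per_point if p is not None]
        if len(usable) >= min_ensemble_required:
            results.append(sum(usable) / len(usable))
        else:
            results.append(math.nan)  # too few ensembles: no prediction
    return results


well_covered = [1.0] * 10                # usable in all 10 ensembles
poorly_covered = [2.0] * 6 + [None] * 4  # usable in only 6 ensembles
aggregated = aggregate_ensembles([well_covered, poorly_covered])
print(aggregated)  # [1.0, nan]
```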

See [AdaSTEM demo](https://chenyangkang.github.io/stemflow/Examples/01.AdaSTEM_demo.html) for further functionality.

-----
References:

1. [Fink, D., Damoulas, T., & Dave, J. (2013, June). Adaptive Spatio-Temporal Exploratory Models: Hemisphere-wide species distributions from massively crowdsourced eBird data. In Proceedings of the AAAI Conference on Artificial Intelligence (Vol. 27, No. 1, pp. 1284-1290).](https://ojs.aaai.org/index.php/AAAI/article/view/8484)

1. [Fink, D., Auer, T., Johnston, A., Ruiz‐Gutierrez, V., Hochachka, W. M., & Kelling, S. (2020). Modeling avian full annual cycle distribution and population trends with citizen science data. Ecological Applications, 30(3), e02056.](https://esajournals.onlinelibrary.wiley.com/doi/full/10.1002/eap.2056)

1. [Fink, D., Hochachka, W. M., Zuckerberg, B., Winkler, D. W., Shaby, B., Munson, M. A., ... & Kelling, S. (2010). Spatiotemporal exploratory models for broad‐scale survey data. Ecological Applications, 20(8), 2131-2147.](https://esajournals.onlinelibrary.wiley.com/doi/abs/10.1890/09-1340.1)

1. [Johnston, A., Fink, D., Reynolds, M. D., Hochachka, W. M., Sullivan, B. L., Bruns, N. E., ... & Kelling, S. (2015). Abundance models improve spatial and temporal prioritization of conservation resources. Ecological Applications, 25(7), 1749-1756.](https://esajournals.onlinelibrary.wiley.com/doi/full/10.1890/14-1826.1)
11 changes: 11 additions & 0 deletions docs/CONTRIBUTING.md
@@ -103,6 +103,17 @@ Then you can render the docs locally with:
mkdocs serve
```

## Commit

After you finish editing, commit with a message that summarizes the changes.

```
git commit -m 'what I have changed'
```

You may find that pre-commit has trimmed or reformatted your files. In that case, add the changed files again (`git add`) and commit again to save the changes.


---

## Submit a pull request
65 changes: 62 additions & 3 deletions docs/Tips/Tips_for_data_types.md
@@ -4,6 +4,7 @@ In the both the [mini test](https://chenyangkang.github.io/stemflow/Examples/00.

Here, we present more tips and examples on how to play with these indexing systems.

------
## Flexible coordinate systems

`stemflow` supports all types of spatial coordinate reference systems (CRS) and temporal indexing (for example, week, month, year, or decade). `stemflow` currently supports only tabular point data. You should transform your data to the desired CRS before feeding them to `stemflow`.
@@ -70,6 +71,7 @@ pred = np.where(pred<0, 0, pred)
eval_metrics = AdaSTEM.eval_STEM_res('classification',y_test, pred_mean)
```

------
## Spatial-only modeling

By playing some tricks, you can also do a `spatial-only` modeling, without splitting the data into temporal blocks:
@@ -86,8 +88,8 @@ model = AdaSTEMClassifier(
grid_len_lat_lower_threshold=1e3,
temporal_start=1,
temporal_end=52,
-temporal_step=2,
-temporal_bin_interval=4,
+temporal_step=1000, # step and interval far exceed the
+temporal_bin_interval=1000, # temporal extent of the data
points_lower_threshold=50,
Spatio1='proj_lng',
Spatio2='proj_lat',
@@ -97,10 +99,67 @@ model = AdaSTEMClassifier(
)
```

Setting `temporal_step` and `temporal_bin_interval` far larger than the temporal extent of your data (1000 compared with 52) yields only `one` temporal window during splitting. Consequently, your model becomes a spatial model. This can be beneficial if temporal heterogeneity is not of interest, or if there are not enough data to investigate it.
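A quick way to see why this collapses to a single window is to count the window start positions. The helper below is a hypothetical sketch that assumes windows start at `temporal_start` and advance by `temporal_step` until `temporal_end`; it is not stemflow's internal code.

```python
def temporal_windows(start, end, step, interval):
    """Return (left, right) bounds of each temporal window (half-open)."""
    return [(left, left + interval) for left in range(start, end, step)]


# A step of 2 weeks with a 4-week interval gives many overlapping windows.
many = temporal_windows(start=1, end=52, step=2, interval=4)
# A step and interval of 1000 give one window that spans all 52 weeks.
single = temporal_windows(start=1, end=52, step=1000, interval=1000)

print(len(many))  # 26
print(single)     # [(1, 1001)]
```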

------
## Continuous and categorical features

Essentially, `stemflow` is a framework for spatiotemporal indexing during modeling. It serves as a container that helps the `base model` do a better job and prevents the long-distance modeling/prediction problem in space and time. Therefore, any feature you would use in ordinary tabular data modeling can be used here: both continuous and categorical features can be inputs, depending on your feature engineering.

For categorical features, we recommend [one-hot encoding](https://en.wikipedia.org/wiki/One-hot) if the number of categories is not too large.
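As a dependency-free illustration of one-hot encoding, the sketch below turns one categorical column into one 0/1 column per category. The function `one_hot` and the `is_` column prefix are hypothetical; with a pandas dataframe, `pandas.get_dummies` serves the same purpose.

```python
def one_hot(values):
    """Encode a list of category labels as dicts of 0/1 indicator columns."""
    categories = sorted(set(values))
    return [{f"is_{c}": int(v == c) for c in categories} for v in values]


land_cover = ["forest", "urban", "forest"]
encoded = one_hot(land_cover)
print(encoded[0])  # {'is_forest': 1, 'is_urban': 0}
print(encoded[1])  # {'is_forest': 0, 'is_urban': 1}
```

The number of new columns equals the number of categories, which is why this encoding is best suited to features with few distinct values.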

## Missing values
Tree-based models (e.g., decision trees, boosted trees, random forests) are robust to missing values, so you can fill missing values with an artificial value like `-1`. For other methods, there are different ways to fill missing values, each with [pros and cons](https://towardsdatascience.com/7-ways-to-handle-missing-values-in-machine-learning-1a6326adf79e).

------
## Static and dynamic features

### Concepts and examples
Static features are summaries of status, while dynamic features vary across almost every record.

For example, for a task modeling the abundance of a bird each day in 2020:

Static features:

- Land cover of year 2020:
    - percentage cover of forest
    - patch density of urban area

- Climate of year 2020:
    - BIO1: Annual Mean Temperature
    - BIO2: Mean Diurnal Range (Mean of monthly (max temp - min temp))
    - BIO19: Precipitation of Coldest Quarter

- Normalized difference vegetation index (NDVI):
    - NDVI_max: Highest NDVI of the annual cycle
    - NDVI_std: Variation of NDVI in the annual cycle


Dynamic features:

- Weather of each checklist (record):
    - the temperature of the hour (that we observed this bird)
    - the total precipitation of the hour (that we observed this bird)
    - V component of wind of the hour (that we observed this bird)

- Normalized difference vegetation index (NDVI):
    - The absolute NDVI of the day (that we observed this bird)


### Use of static and dynamic features

Although all features except `DOY` in our [mini test](https://chenyangkang.github.io/stemflow/Examples/00.Mini_test.html) and [AdaSTEM Demo](https://chenyangkang.github.io/stemflow/Examples/01.AdaSTEM_demo.html) are static features, the model fully supports dynamic feature input.

Note that the choice between static and dynamic features depends on several aspects:

1. **Model assumption**: Does the target value vary in response to static summaries, or does it respond quickly to real-time changes?
1. **Scale of interest**: Are you interested in the overall smoothed pattern or in the zig-zag chaotic pattern?
1. **Caution for overfitting**: `stemflow` splits data into smaller spatiotemporal grids, which may induce local overfitting to some extent. When using dynamic features, you should be additionally cautious about overfitting at the temporal scale.
1. **Anchor the prediction set**: Make sure your prediction set contains the same dynamic variables that were used to train the model. This may cause additional computational challenges.

In our demonstrations, we use static features for several reasons:

1. Static features are used as "geographical configuration". In other words, we are interested in **how birds choose different types of land according to the season**. These static features are highly summarized and represent biogeographic properties well.
1. We are interested in the large-scale seasonal pattern of bird migration, and are not interested in transient variation like hourly weather.
1. Keeping `DOY` as the only dynamic feature (temporal variable) reduces the work of compiling a prediction set. Instead of building a real-time one, we only need to change DOY (by adding one each time) and feed it to `stemflow`. It also reduces memory/IO use.
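The last point can be sketched as follows: with only static features, a prediction set for the whole year is the same grid repeated for each day, with just the `DOY` value changed. The helper `build_prediction_rows` and the `forest_cover` column are hypothetical; `longitude`, `latitude`, and `DOY` follow the column names used in the demo.

```python
def build_prediction_rows(static_rows, doys):
    """Combine every static grid cell with every day of year."""
    return [{**row, "DOY": doy} for doy in doys for row in static_rows]


# Two grid cells with static features only (values are made up for illustration).
static_rows = [
    {"longitude": -120.0, "latitude": 35.0, "forest_cover": 0.4},
    {"longitude": -119.0, "latitude": 35.0, "forest_cover": 0.1},
]

prediction_rows = build_prediction_rows(static_rows, range(1, 367))
print(len(prediction_rows))  # 732
print(prediction_rows[0])    # first grid cell on DOY 1
```

For large grids, generating and predicting one day at a time instead of materializing the whole year keeps memory use low.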

We recommend that users think carefully before choosing features, considering the questions above and the availability of computational resources.