Merge pull request #15 from chenyangkang/chenyangkang-JOSS-review
complete "Tips for different tasks"
chenyangkang committed Jan 10, 2024
2 parents 349c444 + 0f83154 commit a9b4d87
Showing 9 changed files with 242 additions and 11 deletions.
Binary file modified docs/.DS_Store
16 changes: 13 additions & 3 deletions docs/A_brief_introduction/A_brief_introduction.md
@@ -14,17 +14,27 @@ The framework leverages the "adjacency" information of surroundings in space and

Technically, stemflow is positioned as a user-friendly Python package that meets the need for general-purpose modeling of large spatio-temporal datasets. Its scikit-learn-style object-oriented pipeline enables concise model construction with compact parameterization at the user end, while the rest of the modeling procedure is carried out under the hood. Once the fitting method is called, the model class recursively splits the input training data into smaller spatio-temporal grids (called stixels) using the QuadTree algorithm. For each stixel, a base model is trained using only the data that falls into that stixel. Stixels are then aggregated to constitute an ensemble. In the prediction phase, stemflow queries the stixels matching the input data according to their spatial and temporal indices, and the corresponding base models make predictions. Finally, predictions are aggregated across ensembles to generate robust estimates (see [Fink et al., 2013](https://ojs.aaai.org/index.php/AAAI/article/view/8484) and the stemflow documentation for details).
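
The query-and-aggregate step can be sketched with a toy example (an illustration of the idea only, not stemflow's internals; the coarse binning rule and the lookup tables below are made up for demonstration):

```python
import numpy as np

def ensemble_predict(x, ensembles):
    """Toy version of the aggregation step: each ensemble maps a point to a
    stixel, that stixel's 'base model' yields a value, and values are averaged."""
    preds = []
    for stixel_values in ensembles:
        key = int(x // 10)  # hypothetical stixel index from coarse spatial binning
        preds.append(stixel_values.get(key, np.nan))  # NaN if no stixel covers x
    return np.nanmean(preds)

# two ensembles, each a lookup from stixel index to that stixel's prediction
ensembles = [{0: 1.0, 1: 3.0}, {0: 2.0, 1: 5.0}]
print(ensemble_predict(5, ensembles))  # point falls in stixel 0 of both ensembles
```

Averaging across ensembles (and skipping ensembles that do not cover a point) is what makes the final estimate robust to any single gridding.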

## Know your goal

`stemflow` supports different types of tabular data modeling tasks, including

- Binary classification
- Regression
- Hurdle regression (first classify then regress on the positive part) for zero-inflated data

If you are not familiar with these tasks and concepts, see [Tips for different tasks](https://chenyangkang.github.io/stemflow/Tips/Tips_for_different_tasks.html).

## Choosing the model framework

In the [demo](https://chenyangkang.github.io/stemflow/Examples/01.AdaSTEM_demo.html), we use a two-step hurdle model as the "base model" (see more information about the hurdle model [here](https://chenyangkang.github.io/stemflow/Tips/Tips_for_different_tasks.html)), with `XGBClassifier` for binary occurrence modeling and `XGBRegressor` for abundance modeling. If the task is to predict abundance, there are two ways to leverage the hurdle model.

1. First, **hurdle in AdaSTEM**: one can use a hurdle model in each AdaSTEM (regressor) stixel;
1. Second, **AdaSTEM in hurdle**: one can use `AdaSTEMClassifier` as the classifier of the hurdle model, and `AdaSTEMRegressor` as the regressor of the hurdle model.

In the first case, the classifier and regressor "talk" to each other in each separate stixel (hereafter, "hurdle in Ada"); In the second case, the classifiers and regressors form two "unions" separately, and these two unions only "talk" to each other at the final combination, instead of in each stixel (hereafter, "Ada in hurdle"). In [Johnston (2015)](https://esajournals.onlinelibrary.wiley.com/doi/full/10.1890/14-1826.1) the first method was used. See section [[Hurdle in AdaSTEM or AdaSTEM in hurdle?]](https://chenyangkang.github.io/stemflow/Examples/05.Hurdle_in_ada_or_ada_in_hurdle.html) for further comparisons.
In the first case, the classifier and regressor "talk" to each other in each separate stixel (hereafter, "hurdle in Ada"); in the second case, the classifiers and regressors form two "unions" separately, and these two unions only "talk" to each other at the final combination, rather than in each stixel (hereafter, "Ada in hurdle"). The first method was used in [Johnston (2015)](https://esajournals.onlinelibrary.wiley.com/doi/full/10.1890/14-1826.1). See section "[Hurdle in AdaSTEM or AdaSTEM in hurdle?](https://chenyangkang.github.io/stemflow/Examples/05.Hurdle_in_ada_or_ada_in_hurdle.html)" for further comparisons.

## Choosing the grid size
User can define the size of the stixels (spatial temporal grids) in terms of space and time. Larger stixel promotes generalizability but loses precision in fine resolution; Smaller stixel may have better predictability in the exact area but reduced ability of extrapolation for points outside the stixel. See section [Optimizing Stixel Size](https://chenyangkang.github.io/stemflow/Examples/07.Optimizing_Stixel_Size.html) for discussion about selection gridding parameters.
Users can define the size of the stixels (spatio-temporal grids) in terms of space and time. Larger stixels promote generalizability but lose precision at fine resolutions; smaller stixels may predict better within their own area but have a reduced ability to extrapolate to points outside the stixel. See section [Optimizing Stixel Size](https://chenyangkang.github.io/stemflow/Examples/07.Optimizing_Stixel_Size.html) for a discussion of selecting gridding parameters.

## A simple demo
In the demo, we first split the training data using temporal sliding windows with a size of 50 days of year (DOY) and a step of 20 DOY (`temporal_start = 1`, `temporal_end=366`, `temporal_step=20`, `temporal_bin_interval=50`). For each temporal slice, a spatial gridding is applied: we force a stixel to be split into four smaller pieces if its edge is larger than 25 units (measured in longitude and latitude; `grid_len_lon_upper_threshold=25`, `grid_len_lat_upper_threshold=25`), and stop splitting before the edge length would fall below 5 units (`grid_len_lon_lower_threshold=5`, `grid_len_lat_lower_threshold=5`) or a stixel would contain fewer than 50 checklists (`points_lower_threshold=50`). Model fitting is run using 1 core (`njobs=1`).
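
The sliding windows described above can be enumerated directly (a rough sketch of how such windows tile the DOY axis; stemflow's actual splitting adds a random jitter to the window starts, which is ignored here):

```python
temporal_start, temporal_end = 1, 366
temporal_step, temporal_bin_interval = 20, 50

# each window spans `temporal_bin_interval` DOY and starts `temporal_step` after the last
windows = [(start, start + temporal_bin_interval)
           for start in range(temporal_start, temporal_end, temporal_step)]
print(windows[:3])   # [(1, 51), (21, 71), (41, 91)]
print(len(windows))  # 19
```

Because the window size (50) is larger than the step (20), consecutive windows overlap, so each DOY is covered by several temporal slices.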
@@ -80,7 +90,7 @@ print(eval_metrics)

Here `pred` is the mean of the predicted values across ensembles.

See [AdaSTEM demo](https://chenyangkang.github.io/stemflow/Examples/01.AdaSTEM_demo.html) for further functionality.
See [AdaSTEM demo](https://chenyangkang.github.io/stemflow/Examples/01.AdaSTEM_demo.html) for further functionality and demonstration.

-----
References:
Binary file added docs/Hurdle_workflow.png
Binary file modified docs/Presentation1.pptx
37 changes: 36 additions & 1 deletion docs/Tips/Tips_for_data_types.md
@@ -71,6 +71,8 @@ pred = np.where(pred<0, 0, pred)
eval_metrics = AdaSTEM.eval_STEM_res('classification',y_test, pred_mean)
```

Note that the Quadtree algorithm is limited to 6 digits of precision for efficiency, so transform your coordinates if they exceed that threshold. For example, x=0.0000001 and y=0.0000012 would be problematic; consider rescaling them to x=100 and y=1200.
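
A minimal sketch of such a transform (any consistent rescaling works, as long as the same transform is applied to the training and prediction data; the factor `1e9` below is just chosen to match the example values):

```python
import numpy as np

coords = np.array([[0.0000001, 0.0000012]])  # beyond 6 digits of precision
scale = 1e9                                  # hypothetical rescaling factor
coords_t = coords * scale                    # roughly [[100., 1200.]]
print(coords_t)
```

Remember to apply the inverse transform if you need to map predictions back to the original coordinate system.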

------
## Spatial-only modeling

@@ -101,6 +103,39 @@ model = AdaSTEMClassifier(

Setting `temporal_step` and `temporal_bin_interval` to values that largely outweigh the temporal scale of your data (1000 compared with 52) will yield only `one` temporal window during splitting. Consequently, your model becomes a purely spatial model. This can be beneficial if temporal heterogeneity is not of interest, or if there is not enough data to investigate it.
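
The effect can be seen with simple arithmetic (an illustration of the idea, not stemflow's exact windowing code): a step far larger than the data's temporal span leaves room for only one window start.

```python
temporal_start, temporal_end, temporal_step = 1, 52, 1000  # Week axis spans 1..52

window_starts = list(range(temporal_start, temporal_end, temporal_step))
print(window_starts)  # [1] -- a single window, i.e. an effectively spatial-only model
```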

-------

## Fixing the grid size of the Quadtree algorithm

With a few tricks, we can fix the grid size/edge length:

```python
model = AdaSTEMClassifier(
    base_model=XGBClassifier(tree_method='hist', random_state=42, verbosity=0, n_jobs=1),
    save_gridding_plot=True,
    ensemble_fold=10,
    min_ensemble_required=7,
    grid_len_lon_upper_threshold=1000,  # setting upper == lower pins the edge length
    grid_len_lon_lower_threshold=1000,
    grid_len_lat_upper_threshold=1000,
    grid_len_lat_lower_threshold=1000,
    temporal_start=1,
    temporal_end=52,
    temporal_step=2,
    temporal_bin_interval=4,
    points_lower_threshold=0,           # never stop splitting because of data volume
    stixel_training_size_threshold=50,  # but skip training stixels with <50 samples
    Spatio1='proj_lng',
    Spatio2='proj_lat',
    Temporal1='Week',
    use_temporal_to_train=True,
    njobs=1
)
```

Quadtree will keep splitting until it reaches an edge length lower than 1000 meters. Data volume won't hamper this process because the splitting threshold is set to 0 (`points_lower_threshold=0`). Stixels with a sample size of less than 50 still won't be trained (`stixel_training_size_threshold=50`). However, we cannot guarantee the exact grid length; it should be somewhere between 500 m and 1000 m, since Quadtree halves the edge at each bifurcated split.
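
The halving arithmetic behind that range can be sketched as follows (a toy stopping rule for illustration; stemflow's Quadtree also considers the other thresholds, and the 12 km starting edge is made up):

```python
edge, lower_threshold = 12000.0, 1000.0  # metres

# each bifurcated split halves the edge; stop once it drops below the threshold
while edge >= lower_threshold:
    edge /= 2
print(edge)  # 750.0 -- always lands in [lower_threshold / 2, lower_threshold)
```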


------
## Continuous and categorical features

@@ -160,6 +195,6 @@ Likewise, we use static features for several reasons:

1. In our demonstration, static features are used as the "geographical configuration". In other words, we are interested in **how birds choose different types of land according to the season**. These static features are highly summarized and are good representations of biogeographic properties.
1. We are interested in large-scale seasonal patterns of bird migration, and are not interested in transient variation like hourly weather.
1. Keep only `DOY` as dynamic features (temporal variables) reduce the work in compiling a prediction set. Instead of making a realtime one, now we only need to change DOY (by adding one each time) and feed it to `stemflow`. It also reduces memory/IO use.
1. Keeping only `DOY` as dynamic features (temporal variables) reduces the work in compiling a prediction set. Instead of making a realtime one, now we only need to change DOY (by adding one each time) and feed it to `stemflow`. It also reduces memory/IO use.

We recommend that users think carefully before choosing appropriate features, considering the questions above and the availability of computational resources.
180 changes: 177 additions & 3 deletions docs/Tips/Tips_for_different_tasks.md
@@ -1,9 +1,183 @@
# Tips for different tasks

## Regression and classification
`stemflow` supports different types of tabular data modeling tasks, including

- Binary classification
- Regression
- Hurdle regression (first classify then regress on the positive part) for zero-inflated data


## Classification and regression

To create a classification or regression model, you can simply use the corresponding model classes.

### Binary classification

#### Create a model

For binary classification, you can pass a scikit-learn `BaseEstimator`-style classifier (in this case `XGBClassifier`) to the AdaSTEM classifier wrapper (`AdaSTEMClassifier`):

```python
from stemflow.model.AdaSTEM import AdaSTEM, AdaSTEMClassifier
from xgboost import XGBClassifier

model = AdaSTEMClassifier(
base_model=XGBClassifier(tree_method='hist',random_state=42, verbosity = 0,n_jobs=1),
save_gridding_plot = True,
ensemble_fold=10,
min_ensemble_required=7,
grid_len_lon_upper_threshold=1e5,
grid_len_lon_lower_threshold=1e3,
grid_len_lat_upper_threshold=1e5,
grid_len_lat_lower_threshold=1e3,
temporal_start=1,
temporal_end=366,
temporal_step=20,
temporal_bin_interval=50,
points_lower_threshold=50,
Spatio1='proj_lng',
Spatio2='proj_lat',
Temporal1='DOY',
use_temporal_to_train=True,
njobs=1
)
```

`stemflow` automatically calculates binary class weights in each stixel (using `sklearn.utils.class_weight.compute_class_weight` with `class_weight` set to `'balanced'`) to handle imbalanced data, and passes them to the base model during training.
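
The `'balanced'` heuristic weights each class inversely to its frequency. A quick numpy sketch of the formula scikit-learn uses (`n_samples / (n_classes * bincount(y))`), on a hypothetical stixel with 90 absences and 10 presences:

```python
import numpy as np

y = np.array([0] * 90 + [1] * 10)  # hypothetical imbalanced stixel labels
classes, counts = np.unique(y, return_counts=True)
weights = len(y) / (len(classes) * counts)  # the 'balanced' formula
print(dict(zip(classes.tolist(), weights.tolist())))  # minority class weighted up
```

Here the rare positive class receives a weight of 5.0 versus roughly 0.56 for the zeros, so the base model's loss is not dominated by absences.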

#### Fitting and prediction

```py
## fit
model = model.fit(X_train.reset_index(drop=True), y_train)

## predict
pred = model.predict(X_test)
```

Alternatively, you can predict the probability:

```py
## get class probability
pred_proba = model.predict_proba(X_test)
```

Or return the prediction variation across the ensembles:

```py
## also return the prediction std
pred_proba_mean, pred_proba_std = model.predict_proba(X_test, return_std=True)
```


#### Evaluation

Correspondingly, you would use a set of metrics for the classification problem:

```py
# Evaluation
eval_metrics = AdaSTEM.eval_STEM_res('classification', y_test, pred)
```
This returns a suite of metrics, including AUC, precision, recall, etc.



### Regression

For a regression problem, you can pass a scikit-learn `BaseEstimator`-style regressor (in this case `XGBRegressor`) to the AdaSTEM regressor wrapper (`AdaSTEMRegressor`):

```python
from stemflow.model.AdaSTEM import AdaSTEM, AdaSTEMRegressor
from xgboost import XGBRegressor

model = AdaSTEMRegressor(
base_model=XGBRegressor(tree_method='hist',random_state=42, verbosity = 0,n_jobs=1),
save_gridding_plot = True,
ensemble_fold=10,
min_ensemble_required=7,
grid_len_lon_upper_threshold=1e5,
grid_len_lon_lower_threshold=1e3,
grid_len_lat_upper_threshold=1e5,
grid_len_lat_lower_threshold=1e3,
temporal_start=1,
temporal_end=366,
temporal_step=20,
temporal_bin_interval=50,
points_lower_threshold=50,
Spatio1='proj_lng',
Spatio2='proj_lat',
Temporal1='DOY',
use_temporal_to_train=True,
njobs=1
)
```
Correspondingly, you would use a set of metrics for the regression problem:

```py
import numpy as np

## fit
model = model.fit(X_train.reset_index(drop=True), y_train)

## predict
pred = model.predict(X_test)
pred = np.where(pred<0, 0, pred)  # clip negative predictions to zero

# Evaluation
eval_metrics = AdaSTEM.eval_STEM_res('regression', y_test, pred)
```

Likewise, you can also return the variation of the predictions by setting `return_std=True` in the `predict` method. `predict_proba` is not available for regression.


## Hurdle

A [Hurdle model](https://en.wikipedia.org/wiki/Hurdle_model#:~:text=A%20hurdle%20model%20is%20a,of%20the%20non%2Dzero%20values.) is different from a regression or classification model – it combines them.

The hurdle model is designed to solve [zero-inflation problems](https://en.wikipedia.org/wiki/Zero-inflated_model), which are commonly seen in **count data** – for example, the count of individual birds of a species in each checklist. Too many zeros in the target data will bias the loss function and induce an imbalanced likelihood, weakening model performance. Similar methods for zero-inflation problems include the Zero-Inflated Poisson model (ZIP) and the Zero-Inflated Negative Binomial model (ZINB).
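
To see what zero inflation looks like, one can simulate hypothetical count data in which most checklists are structural zeros (the species is absent) and only the occupied ones follow a count distribution (all numbers below are made up for illustration):

```python
import numpy as np

rng = np.random.default_rng(42)
n = 10_000
occupied = rng.random(n) > 0.7                     # ~30% of checklists have the species
counts = np.where(occupied, rng.poisson(5, n), 0)  # counts only where occupied

# far more zeros than a plain Poisson(5) would produce (~0.7% zeros):
print(round((counts == 0).mean(), 2))
```

A single regression on `counts` would be dominated by the zeros, which is exactly the situation the hurdle model addresses.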

### Hurdle model workflow
The hurdle model, as its name indicates, uses a two-step modeling method to solve zero-inflation problems:

![Hurdle model workflow](../Hurdle_workflow.png)

As shown in this figure, the zero-inflated input data are first used to train a classifier, with the labels binarized. Then the samples with positive labels are used to train a regressor. Together, these two compose a hurdle model. When new data come in, the classifier and regressor make predictions independently, and the results are combined/masked to yield the final prediction (only samples classified as positive are assigned the continuous prediction from the regressor).
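
The final combine/mask step amounts to gating the regressor's output with the classifier's output. A toy numpy sketch (the prediction values here are made up for illustration):

```python
import numpy as np

clf_pred = np.array([1, 0, 1, 0])          # binary occurrence predictions (classifier)
reg_pred = np.array([3.2, 1.5, 0.7, 2.0])  # abundance predictions (regressor)

# only samples classified as positive keep the regressor's continuous prediction
final = np.where(clf_pred > 0, reg_pred, 0.0)
print(final)  # [3.2 0.  0.7 0. ]
```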

### Hurdle model common practice
Combined with the AdaSTEM framework, the hurdle model is usually used as a base model:

```python
model = AdaSTEMRegressor(
base_model=Hurdle(
classifier=XGBClassifier(...),
regressor=XGBRegressor(...)
),
...
)
```

This is common practice in most papers I've seen, including references [1-3].

However, the AdaSTEM models could also be used as the regressor or classifier if needed:

```python
model_Ada_in_Hurdle = Hurdle_for_AdaSTEM(
classifier=AdaSTEMClassifier(
base_model=XGBClassifier(...),
...),
regressor=AdaSTEMRegressor(
base_model=XGBRegressor(...),
...)
)

```

The latter is a more conservative framework and has a higher `precision` score, but lower recall and overall performance. For more details, see "[Hurdle in AdaSTEM or AdaSTEM in hurdle?](https://chenyangkang.github.io/stemflow/Examples/05.Hurdle_in_ada_or_ada_in_hurdle.html)".



-----
References:

1. [Fink, D., Damoulas, T., & Dave, J. (2013, June). Adaptive Spatio-Temporal Exploratory Models: Hemisphere-wide species distributions from massively crowdsourced eBird data. In Proceedings of the AAAI Conference on Artificial Intelligence (Vol. 27, No. 1, pp. 1284-1290).](https://ojs.aaai.org/index.php/AAAI/article/view/8484)

1. [Fink, D., Auer, T., Johnston, A., Ruiz‐Gutierrez, V., Hochachka, W. M., & Kelling, S. (2020). Modeling avian full annual cycle distribution and population trends with citizen science data. Ecological Applications, 30(3), e02056.](https://esajournals.onlinelibrary.wiley.com/doi/full/10.1002/eap.2056)

1. [Johnston, A., Fink, D., Reynolds, M. D., Hochachka, W. M., Sullivan, B. L., Bruns, N. E., ... & Kelling, S. (2015). Abundance models improve spatial and temporal prioritization of conservation resources. Ecological Applications, 25(7), 1749-1756.](https://esajournals.onlinelibrary.wiley.com/doi/full/10.1890/14-1826.1)
4 changes: 2 additions & 2 deletions docs/index.md
@@ -45,7 +45,7 @@ conda install -c conda-forge stemflow

**stemflow** adopts ["split-apply-combine"](https://vita.had.co.nz/papers/plyr.pdf) philosophy. It

1. Splits input data using Quadtree algorithm.
1. Splits input data using [Quadtree algorithm](https://en.wikipedia.org/wiki/Quadtree#:~:text=A%20quadtree%20is%20a%20tree,into%20four%20quadrants%20or%20regions.).
1. Trains each spatiotemporal split (called stixel) separately.
1. Aggregates the ensemble to make prediction.

@@ -83,7 +83,7 @@ For details and tips see [Tips for data types](https://chenyangkang.github.io/st

### Supported tasks

:white_check_mark: Classification task<br>
:white_check_mark: Binary classification task<br>
:white_check_mark: Regression task<br>
:white_check_mark: Hurdle task (two step regression – classify then regress the non-zero part)<br>

Binary file added docs/~$Presentation1.pptx
16 changes: 14 additions & 2 deletions stemflow/model/AdaSTEM.py
@@ -63,6 +63,7 @@ def __init__(
grid_len_lat_upper_threshold: Union[float, int] = 25,
grid_len_lat_lower_threshold: Union[float, int] = 5,
points_lower_threshold: int = 50,
stixel_training_size_threshold: int = None,
temporal_start: Union[float, int] = 1,
temporal_end: Union[float, int] = 366,
temporal_step: Union[float, int] = 20,
@@ -107,8 +108,11 @@ def __init__(
grid_len_lat_lower_threshold:
stop dividing if the grid latitude **will** be below the threshold. Defaults to 5.
points_lower_threshold:
Do not further split the grid if splitting would result in fewer samples than this threshold.
Overridden by the grid_len_*_upper_threshold parameters. Defaults to 50.
stixel_training_size_threshold:
Do not train the model if the available data records for this stixel are fewer than this threshold,
and directly set the value to np.nan. Defaults to 25.
and directly set the value to np.nan. Defaults to 50.
temporal_start:
start of the temporal sequence. Defaults to 1.
temporal_end:
@@ -237,7 +241,11 @@ def __init__(
self.temporal_bin_start_jitter = temporal_bin_start_jitter

#
self.stixel_training_size_threshold = points_lower_threshold
if stixel_training_size_threshold is None:
self.stixel_training_size_threshold = points_lower_threshold
else:
self.stixel_training_size_threshold = stixel_training_size_threshold

self.save_gridding_plot = save_gridding_plot
self.save_tmp = save_tmp
self.save_dir = save_dir
@@ -1104,6 +1112,7 @@ def __init__(
grid_len_lat_upper_threshold=25,
grid_len_lat_lower_threshold=5,
points_lower_threshold=50,
stixel_training_size_threshold=None,
temporal_start=1,
temporal_end=366,
temporal_step=20,
@@ -1160,6 +1169,7 @@ def __init__(
grid_len_lat_upper_threshold,
grid_len_lat_lower_threshold,
points_lower_threshold,
stixel_training_size_threshold,
temporal_start,
temporal_end,
temporal_step,
@@ -1268,6 +1278,7 @@ def __init__(
grid_len_lat_upper_threshold=25,
grid_len_lat_lower_threshold=5,
points_lower_threshold=50,
stixel_training_size_threshold=None,
temporal_start=1,
temporal_end=366,
temporal_step=20,
@@ -1324,6 +1335,7 @@ def __init__(
grid_len_lat_upper_threshold,
grid_len_lat_lower_threshold,
points_lower_threshold,
stixel_training_size_threshold,
temporal_start,
temporal_end,
temporal_step,
