Commit

test
chenyangkang committed Jan 10, 2024
1 parent f6e7a81 commit 196e7cf
Showing 15 changed files with 164 additions and 18 deletions.
Binary file modified docs/.DS_Store
Binary file not shown.
2 changes: 1 addition & 1 deletion docs/Examples/06.Base_model_choices.ipynb
@@ -4,7 +4,7 @@
"cell_type": "markdown",
"metadata": {},
"source": [
"# Learning curve analysis\n",
"# Base model choices\n",
"\n",
"Yangkang Chen<br>\n",
"Sep 24, 2023"
Binary file not shown.
Binary file not shown.
Binary file not shown.
Binary file removed docs/Examples/stemflow_mini_test/data_plot.pdf
Binary file not shown.
Binary file removed docs/Examples/stemflow_mini_test/error_plot.pdf
Binary file not shown.
File renamed without changes.
106 changes: 106 additions & 0 deletions docs/Tips/Tips_for_data_types.md
@@ -0,0 +1,106 @@
# Tips for data types

In both the [mini test](https://chenyangkang.github.io/stemflow/Examples/00.Mini_test.html) and the [AdaSTEM Demo](https://chenyangkang.github.io/stemflow/Examples/01.AdaSTEM_demo.html), we use bird observation data to demonstrate the functionality of AdaSTEM. Spatiotemporal coordinates are encoded the same way in both cases, with `longitude` and `latitude` as the spatial indexes and `DOY` (day of year) as the temporal index.

Here, we present more tips and examples on how to play with these indexing systems.

## Flexible coordinate systems

`stemflow` supports all types of spatial coordinate reference systems (CRS) and temporal indexing (for example, week, month, year, or decade). `stemflow` currently supports only tabular point data. You should transform your data to the desired CRS before feeding it to `stemflow`.

For example, transforming CRS:

```python
import pyproj

# Define the source and destination coordinate systems
source_crs = pyproj.CRS.from_epsg(4326) # WGS 84 (geographic; passed as lng, lat because always_xy=True is set below)
target_crs = pyproj.CRS.from_string("ESRI:54017") # World Behrmann equal area projection (x, y)

# Create a transformer object
transformer = pyproj.Transformer.from_crs(source_crs, target_crs, always_xy=True)

# Project
data['proj_lng'], data['proj_lat'] = transformer.transform(data['lng'].values, data['lat'].values)
```

Now the projected spatial coordinates for each record are stored in `data['proj_lng']` and `data['proj_lat']`.
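
The model configuration in this section uses a `Week` column as the temporal index. If your records carry a calendar date instead, such a column can be derived with pandas. The sketch below is only an illustration: the `date` column name and the sample rows are assumptions, and the week index is a simple 7-day bin of the day of year rather than an ISO week.

```python
import pandas as pd

# Hypothetical observation table; the `date` column name is an assumption.
data = pd.DataFrame({
    "date": ["2023-01-03", "2023-06-15", "2023-12-28"],
    "lng": [116.4, -0.13, 151.2],
    "lat": [39.9, 51.5, -33.9],
})

dates = pd.to_datetime(data["date"])

# Day of year (1-366) and a simple week index derived from it (days 1-7 -> week 1).
data["DOY"] = dates.dt.dayofyear
data["Week"] = (data["DOY"] - 1) // 7 + 1

print(data[["date", "DOY", "Week"]])
```

With this definition, days 365 and 366 fall into week 53, so you may want to merge them into week 52 or adjust the temporal range of the model accordingly.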

We can then feed this data to `stemflow`:

```python

from stemflow.model.AdaSTEM import AdaSTEMClassifier
from xgboost import XGBClassifier

model = AdaSTEMClassifier(
    base_model=XGBClassifier(tree_method='hist', random_state=42, verbosity=0, n_jobs=1),
    save_gridding_plot=True,
    ensemble_fold=10, # data are modeled 10 times, each time with jitter and rotation in the Quadtree algorithm
    min_ensemble_required=7, # only points covered by > 7 stixels will be predicted
    grid_len_lon_upper_threshold=1e5, # force splitting if the longitudinal edge of a grid exceeds 1e5 meters
    grid_len_lon_lower_threshold=1e3, # stop splitting if the longitudinal edge of a grid falls below 1e3 meters
    grid_len_lat_upper_threshold=1e5, # same as the two above, but for the latitudinal edge
    grid_len_lat_lower_threshold=1e3,
    temporal_start=1, # the next 4 params define the temporal sliding window
    temporal_end=52,
    temporal_step=2,
    temporal_bin_interval=4,
    points_lower_threshold=50, # only stixels with more than 50 samples are trained
    Spatio1='proj_lng', # use the columns 'proj_lng' and 'proj_lat' as spatial indexes
    Spatio2='proj_lat',
    Temporal1='Week',
    use_temporal_to_train=True, # whether 'Week' should be a predictor in each stixel
    njobs=1
)
```

Here, we use a temporal bin of 4 weeks and a step of 2 weeks, running from week 1 to week 52. For spatial indexing, we constrain the grid size to between `1 km (1e3 m)` and `100 km (1e5 m)`. Since `ESRI:54017` is an equal-area projection with coordinates in meters, the thresholds are given in meters.
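
To get a feel for what the four temporal parameters imply, the windows can be enumerated by hand. The helper below is a hypothetical, simplified sketch and not `stemflow`'s internal implementation, which may differ in details such as window boundaries. It assumes each window starts one step after the previous one and spans one bin interval.

```python
def sliding_windows(start, end, step, bin_interval):
    """Enumerate (window_start, window_end) pairs for a fixed-step sliding window.

    Hypothetical helper for illustration only; each window is taken to
    cover the half-open range [window_start, window_end).
    """
    windows = []
    left = start
    while left <= end:
        windows.append((left, left + bin_interval))
        left += step
    return windows


# Same values as temporal_start, temporal_end, temporal_step, temporal_bin_interval.
windows = sliding_windows(start=1, end=52, step=2, bin_interval=4)

print(len(windows))   # 26
print(windows[:3])    # [(1, 5), (3, 7), (5, 9)]
```

Because the step (2 weeks) is half the bin width (4 weeks), consecutive windows overlap, so in this simplified picture most weeks are covered by two windows.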


Then we could fit the model:

```py
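import numpy as np
from stemflow.model.AdaSTEM import AdaSTEM

## fit
model = model.fit(data.drop('target', axis=1), data[['target']])

## predict (X_test and y_test are a held-out split with the same columns as the training data)
pred = model.predict(X_test)
pred = np.where(pred<0, 0, pred)
eval_metrics = AdaSTEM.eval_STEM_res('classification', y_test, pred)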
```

## Spatial-only modeling

With a small trick, you can also do `spatial-only` modeling, without splitting the data into temporal blocks:

```python
model = AdaSTEMClassifier(
    base_model=XGBClassifier(tree_method='hist', random_state=42, verbosity=0, n_jobs=1),
    save_gridding_plot=True,
    ensemble_fold=10,
    min_ensemble_required=7,
    grid_len_lon_upper_threshold=1e5,
    grid_len_lon_lower_threshold=1e3,
    grid_len_lat_upper_threshold=1e5,
    grid_len_lat_lower_threshold=1e3,
    temporal_start=1,
    temporal_end=52,
    temporal_step=2,
    temporal_bin_interval=4,
    points_lower_threshold=50,
    Spatio1='proj_lng',
    Spatio2='proj_lat',
    Temporal1='Week',
    use_temporal_to_train=True,
    njobs=1
)
```





## Continuous and categorical features

## Static and dynamic features
9 changes: 9 additions & 0 deletions docs/Tips/Tips_for_different_tasks.md
@@ -0,0 +1,9 @@
# Tips for different tasks

## Regression and classification

TODO

## Hurdle

TODO
62 changes: 45 additions & 17 deletions docs/index.md
@@ -53,6 +53,34 @@ In the demo, we first split the training data using temporal sliding windows wit
This process is executed 10 times (`ensemble_fold = 10`), each time with random jitter and random rotation of the gridding, generating 10 ensembles. In the prediction phase, only spatiotemporal points covered by more than 7 usable ensembles (`min_ensemble_required = 7`) are predicted; all other points are set to `np.nan`.


## Model and data :slot_machine:

### Supported data types

:white_check_mark: All spatial indexing (CRS)<br>
:white_check_mark: All temporal indexing<br>
:white_check_mark: Both continuous and categorical features (prefer [one-hot encoding](https://en.wikipedia.org/wiki/One-hot))<br>
:white_check_mark: Both static (e.g., yearly mean temperature) and dynamic features (e.g., daily temperature)

For details and tips see [Tips for data types](https://chenyangkang.github.io/stemflow/Tips/Tips_for_data_types.html)
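
As a rough illustration of the one-hot encoding recommended above for categorical features, a minimal pandas sketch follows. The column names and values are made up for the example.

```python
import pandas as pd

# Hypothetical feature table; `land_cover` is an illustrative categorical column.
X = pd.DataFrame({
    "elevation": [12.0, 340.5, 87.2],
    "land_cover": ["forest", "urban", "forest"],
})

# Replace the categorical column with one 0/1 indicator column per category.
X_encoded = pd.get_dummies(X, columns=["land_cover"], dtype=int)

print(X_encoded.columns.tolist())
# ['elevation', 'land_cover_forest', 'land_cover_urban']
```

The encoded table contains only numeric columns, which is the form most sklearn-style base models expect.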

---

### Supported tasks

:white_check_mark: Classification task<br>
:white_check_mark: Regression task<br>
:white_check_mark: Hurdle task (two-step regression: classify, then regress the non-zero part)<br>

For details and tips see [Tips for different tasks](https://chenyangkang.github.io/stemflow/Tips/Tips_for_different_tasks.html)

---

### Supported base models

:white_check_mark: sklearn-style `BaseEstimator` classes ([you can make your own base model](https://scikit-learn.org/stable/developers/develop.html)); see the example [here](https://chenyangkang.github.io/stemflow/Examples/06.Base_model_choices.html).


## Usage :star:

Use Hurdle model as the base model of AdaSTEMRegressor:
@@ -67,23 +95,23 @@ model = AdaSTEMRegressor(
base_model=Hurdle(
classifier=XGBClassifier(tree_method='hist',random_state=42, verbosity = 0, n_jobs=1),
regressor=XGBRegressor(tree_method='hist',random_state=42, verbosity = 0, n_jobs=1)
),
), # hurdle model for zero-inflated problems (e.g., counts)
save_gridding_plot = True,
ensemble_fold=10,
min_ensemble_required=7,
grid_len_lon_upper_threshold=25,
grid_len_lon_lower_threshold=5,
grid_len_lat_upper_threshold=25,
grid_len_lat_lower_threshold=5,
temporal_start = 1,
temporal_end =366,
ensemble_fold=10, # data are modeled 10 times, each time with jitter and rotation in Quadtree algo
min_ensemble_required=7, # Only points covered by > 7 stixels will be predicted
grid_len_lon_upper_threshold=25, # force splitting if the longitudinal edge of grid exceeds 25
grid_len_lon_lower_threshold=5, # stop splitting if the longitudinal edge of grid falls below 5
grid_len_lat_upper_threshold=25, # similar to the previous one, but latitudinal
grid_len_lat_lower_threshold=5,
temporal_start=1, # The next 4 params define the temporal sliding window
temporal_end=366,
temporal_step=20,
temporal_bin_interval = 50,
points_lower_threshold=50,
Spatio1='longitude',
Spatio2 = 'latitude',
Temporal1 = 'DOY',
use_temporal_to_train=True,
temporal_bin_interval=50,
points_lower_threshold=50, # Only stixels with more than 50 samples are trained
Spatio1='longitude', # The next three params define the name of
Spatio2='latitude', # spatial coordinates shown in the dataframe
Temporal1='DOY',
use_temporal_to_train=True, # In each stixel, whether 'DOY' should be a predictor
njobs=1
)
```
@@ -93,7 +121,7 @@ Fitting and prediction methods follow the style of sklearn `BaseEstimator` class

```py
## fit
model.fit(X_train.reset_index(drop=True), y_train)
model = model.fit(X_train.reset_index(drop=True), y_train)

## predict
pred = model.predict(X_test)
@@ -135,7 +163,7 @@ See section [Prediction and Visualization](https://chenyangkang.github.io/stemfl

We welcome pull requests. Contributors should follow [contributor guidelines](https://github.com/chenyangkang/stemflow/blob/main/docs/CONTRIBUTING.md).

Application level cooperation is also welcomed. We recognized that stemflow may consume large computational resources especially as data volume boost in the future. We always welcome research collaboration of any kind. Contact me at chenyangkang24@outlook.com
Application-level cooperation is also welcome. We recognize that stemflow may consume substantial computational resources, especially as data volumes grow in the future. We always welcome research collaboration of all kinds. Contact me at chenyangkang24@outlook.com


-----
3 changes: 3 additions & 0 deletions mkdocs.yml
@@ -81,6 +81,9 @@ nav:
- Examples/05.Hurdle_in_ada_or_ada_in_hurdle.ipynb
- Examples/06.Base_model_choices.ipynb
- Examples/07.Optimizing_Stixel_Size.ipynb
- Tips:
- 'Tips for data types': Tips/Tips_for_data_types.md
- 'Tips for different tasks': Tips/Tips_for_different_tasks.md
- API Documentation:
- stemflow.model:
- 'AdaSTEM': API_Documentation/stemflow.model.AdaSTEM.md
Binary file removed tests/.coverage
Binary file not shown.
Binary file removed tests/stemflow_mini_test/data_plot.pdf
Binary file not shown.
Binary file removed tests/stemflow_mini_test/mini_data.pkl
Binary file not shown.
