# Evaluation Metric
The purpose of this page is to identify the goal and an appropriate evaluation metric for the crop-mask project.
It is roughly based on Andrew Ng's course on Structuring ML Projects.
## Goal
Produce high-quality crop masks using a fast and cost-efficient method.
## Output
A map for a region of interest showing the likelihood of crop presence on each pixel.

A good crop mask is one where:
- most crop and non-crop pixels are correctly identified [OPTIMIZING]
- there are no visible artifacts (clouds, stripes from satellite tracks) [SATISFICING]
## Use cases
A crop mask can be used to:
- Respond to agricultural disasters (know where to look)
- Concentrate downstream models (e.g., crop type classifiers, yield estimators) on crop areas
- Create food security assessments including production, planted area, change in planted area, etc.
- Forecast vegetation and land use trends
## Measuring crop mask quality
We can approximate the quality of a crop mask by:
- Obtaining ground truth data for a sample of pixels in the region of interest
- Generating predictions for the sample of pixels
- Comparing the predicted pixels against the ground truth pixels to generate a metric, as sketched in code below.
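A minimal sketch of these three steps in Python, using scikit-learn's standard metric functions. The arrays here are hypothetical placeholders; in practice the labels would come from ground truth collection and the predictions from the model's per-pixel crop likelihoods.

```python
import numpy as np
from sklearn.metrics import accuracy_score, precision_score, recall_score, f1_score

# 1 = crop, 0 = non-crop, for a labeled sample of pixels in the region of interest
ground_truth = np.array([1, 0, 0, 1, 0, 0, 0, 1, 0, 0])

# Model predictions for the same pixels (hypothetical values; in practice,
# per-pixel crop likelihoods thresholded at e.g. 0.5)
predictions = np.array([1, 0, 1, 0, 0, 0, 0, 1, 0, 0])

print("Accuracy: ", accuracy_score(ground_truth, predictions))
print("Precision:", precision_score(ground_truth, predictions))
print("Recall:   ", recall_score(ground_truth, predictions))
print("F1 Score: ", f1_score(ground_truth, predictions))
```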
In most countries, cropland covers less than 20 percent of the land area, and in many it covers less than 10 percent (Source). Any representative pixel sample will therefore be heavily imbalanced toward non-crop, which matters when choosing a metric.
## Prediction scenarios
We'll consider several prediction scenarios to decide on an appropriate metric. For this example, we'll use a representative sample dataset consisting of 100 data points, with 15 points classified as crop and 85 points classified as non-crop.
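For reference, writing TP, FP, FN, and TN for true/false positives and negatives (with crop as the positive class), the metrics used in each scenario are:

$$
\text{Accuracy} = \frac{TP + TN}{TP + FP + FN + TN} \qquad
\text{Precision} = \frac{TP}{TP + FP} \qquad
\text{Recall} = \frac{TP}{TP + FN} \qquad
F_1 = \frac{2 \cdot \text{Precision} \cdot \text{Recall}}{\text{Precision} + \text{Recall}}
$$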
### All non-crop predictions

| Prediction \ Actual | Crop | Non-crop |
|---|---|---|
| Crop | 0 | 0 |
| Non-crop | 15 | 85 |

- Accuracy: 85%
- Precision: undefined (no pixels are predicted crop)
- Recall: 0%
- F1 Score: 0%
A model that predicts all non-crop would not be useful, yet its accuracy is a seemingly strong 85%. Therefore, accuracy alone is not a good measure.
### Only crop if super sure

| Prediction \ Actual | Crop | Non-crop |
|---|---|---|
| Crop | 5 | 0 |
| Non-crop | 10 | 85 |

- Accuracy: 90%
- Precision: 100%
- Recall: 33%
- F1 Score: 50%
A model that predicts crop only when it is super sure is better than the first one, but it still misses two-thirds of the crop. Therefore, precision alone is not a good measure.
### Over-predict crop

| Prediction \ Actual | Crop | Non-crop |
|---|---|---|
| Crop | 15 | 40 |
| Non-crop | 0 | 45 |

- Accuracy: 60%
- Precision: 27%
- Recall: 100%
- F1 Score: 43%
A model that over-predicts crop finds all the crop but mislabels much of the non-crop, so it is also not great. Therefore, recall alone is not a good measure.
### Balanced errors

| Prediction \ Actual | Crop | Non-crop |
|---|---|---|
| Crop | 11 | 7 |
| Non-crop | 4 | 78 |

- Accuracy: 89%
- Precision: 61%
- Recall: 73%
- F1 Score: 67%
A model that predicts most crop correctly and most non-crop correctly is a decent model.
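As a sanity check, the metrics in all four tables above can be recomputed directly from the confusion-matrix counts. The snippet below is a standalone sketch, not part of the project code:

```python
import math

# Confusion-matrix counts for each scenario (crop = positive class)
scenarios = {
    "All non-crop predictions": dict(tp=0, fp=0, fn=15, tn=85),
    "Only crop if super sure": dict(tp=5, fp=0, fn=10, tn=85),
    "Over-predict crop": dict(tp=15, fp=40, fn=0, tn=45),
    "Balanced errors": dict(tp=11, fp=7, fn=4, tn=78),
}

for name, c in scenarios.items():
    total = c["tp"] + c["fp"] + c["fn"] + c["tn"]
    accuracy = (c["tp"] + c["tn"]) / total
    # Precision is undefined (0/0) when the model never predicts crop
    predicted_crop = c["tp"] + c["fp"]
    precision = c["tp"] / predicted_crop if predicted_crop else float("nan")
    recall = c["tp"] / (c["tp"] + c["fn"])
    # Follow the convention of reporting F1 = 0 when precision is undefined
    if math.isnan(precision) or precision + recall == 0:
        f1 = 0.0
    else:
        f1 = 2 * precision * recall / (precision + recall)
    print(f"{name}: accuracy={accuracy:.0%}, precision={precision:.0%}, "
          f"recall={recall:.0%}, f1={f1:.0%}")
```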
Metric Conclusion: Given a representative dataset, the F1-score is the only metric that is consistent with our understanding of what a good crop mask is: it is the only one of the four that ranks the balanced-errors model highest and the all-non-crop model lowest.
Interpreting the F1-score (scale is not scientifically rigorous):
- 0.0 <= F1 < 0.6 is probably not a useful map
- 0.6 <= F1 < 0.7 may be a useful map
- 0.7 <= F1 < 0.8 is a pretty good map
- 0.8 <= F1 < 0.9 is a good map
- 0.9 <= F1 < 1.0 is a great map
Given the above, the goal can be defined precisely: generate crop masks using a model only if that model achieves an F1-score of at least 0.7.
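A small, hypothetical sketch of how the scale and the 0.7 threshold above could be applied as a release gate; the function name and threshold constant are illustrative, not part of the project code:

```python
def describe_f1(f1: float) -> str:
    """Map an F1-score to the qualitative scale above (a rough heuristic)."""
    if f1 < 0.6:
        return "probably not a useful map"
    if f1 < 0.7:
        return "may be a useful map"
    if f1 < 0.8:
        return "a pretty good map"
    if f1 < 0.9:
        return "a good map"
    return "a great map"

MIN_F1 = 0.7  # threshold from the goal statement

f1 = 0.74  # example value; in practice, measured on a held-out test set
print(f"F1 = {f1}: {describe_f1(f1)}")
if f1 >= MIN_F1:
    print("Model meets the goal; generate the crop mask.")
else:
    print("Model does not meet the goal; do not generate a mask yet.")
```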
In [1b. Goal Setup] Dataset Splitting, we'll discuss the data needed so that the measured F1-score reflects real-world performance.