
Evaluation Metric


The purpose of this page is to identify the goal and an appropriate evaluation metric for the crop-mask project.

It is roughly based on Andrew Ng's course, Structuring Machine Learning Projects.

Identifying the goal

What is the goal of this project?

Produce high-quality crop masks using a fast and cost-efficient method.

What is a crop mask?

A map of a region of interest showing the likelihood of crop presence at each pixel.

What is a high-quality crop mask?

A crop mask where:

  • most crop and non-crop pixels are correctly identified [OPTIMIZING]
  • there are no visible artifacts (clouds, stripes from satellite tracks) [SATISFICING]

Why are high-quality crop masks needed?

  • Respond to agricultural disasters (know where to look)
  • Concentrate downstream models (e.g., crop type classifiers, yield estimators) on crop areas
  • Create food security assessments including production, planted area, change in planted area, etc.
  • Forecast vegetation and land use trends

How do we know we are doing well?

How do we measure the quality of a crop mask?

We can approximate the quality of a crop mask by:

  1. Obtaining ground truth data for a sample of pixels in the region of interest
  2. Generating predictions for the sample of pixels
  3. Comparing the predictions against the ground truth labels to produce a metric (a minimal sketch of this procedure follows the list).
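
A minimal illustration of these three steps, assuming the ground truth and predictions for the sampled pixels are held in 1-D arrays (the array names, the 0.5 threshold, and the use of scikit-learn are illustrative assumptions, not the project's actual evaluation pipeline):

```python
import numpy as np
from sklearn.metrics import f1_score

# 1. Ground truth for a sample of pixels (1 = crop, 0 = non-crop)
y_true = np.array([1, 0, 0, 1, 0, 0, 0, 1, 0, 0])

# 2. Predictions for the same pixels (probabilities thresholded at 0.5)
y_prob = np.array([0.8, 0.1, 0.4, 0.6, 0.2, 0.3, 0.7, 0.9, 0.1, 0.2])
y_pred = (y_prob >= 0.5).astype(int)

# 3. Compare predictions against ground truth to produce a metric
print(f"F1: {f1_score(y_true, y_pred):.2f}")
```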

What metric should be used to measure the quality of a crop mask?

In most countries, cropland covers less than 20 percent of the land area, and in many it covers less than 10 percent. Source

We'll consider several prediction scenarios to decide on an appropriate metric. For these examples, we'll use a representative sample dataset of 100 data points: 15 labeled crop and 85 labeled non-crop. (The metrics in each scenario can be reproduced with the sketch after the list.)

  1. All non-crop predictions

    Prediction \ Actual   Crop   Non-crop
    Crop                     0          0
    Non-crop                15         85

    • Accuracy: 85%
    • Precision: undefined (no crop predicted)
    • Recall: 0%
    • F1 Score: 0%

    A model that predicts all non-crop would not be useful, but accuracy says otherwise. Therefore, accuracy alone is not a good measure.

  2. Only crop if super sure

    Prediction \ Actual   Crop   Non-crop
    Crop                     5          0
    Non-crop                10         85

    • Accuracy: 90%
    • Precision: 100%
    • Recall: 33%
    • F1 Score: 50%

    A model that predicts crop only if it is super sure is better than the first one but still not great. Therefore, precision alone is not a good measure.

  3. Over predict crop

    Prediction \ Actual   Crop   Non-crop
    Crop                    15         40
    Non-crop                 0         45

    • Accuracy: 60%
    • Precision: 27%
    • Recall: 100%
    • F1 Score: 43%

    A model that over-predicts crop is also not great. Therefore, recall alone is not a good measure.

  4. Balanced errors

    Prediction \ Actual   Crop   Non-crop
    Crop                    11          7
    Non-crop                 4         78

    • Accuracy: 89%
    • Precision: 61%
    • Recall: 73%
    • F1 Score: 67%

    A model that predicts most crop correctly and most non-crop correctly is a decent model.
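
The metrics above can be reproduced directly from the confusion matrices. A small sketch (using scikit-learn here is an illustrative assumption, not project code) that expands each matrix into per-pixel labels and computes all four metrics:

```python
import numpy as np
from sklearn.metrics import accuracy_score, precision_score, recall_score, f1_score

# (TP, FP, FN, TN) for each scenario, using 15 crop / 85 non-crop points
scenarios = {
    "All non-crop predictions": (0, 0, 15, 85),
    "Only crop if super sure": (5, 0, 10, 85),
    "Over predict crop": (15, 40, 0, 45),
    "Balanced errors": (11, 7, 4, 78),
}

for name, (tp, fp, fn, tn) in scenarios.items():
    # Expand the confusion matrix into per-pixel labels (1 = crop, 0 = non-crop)
    y_true = np.array([1] * tp + [0] * fp + [1] * fn + [0] * tn)
    y_pred = np.array([1] * tp + [1] * fp + [0] * fn + [0] * tn)

    acc = accuracy_score(y_true, y_pred)
    # zero_division=0 handles scenario 1, where no crop is predicted (0/0 precision)
    prec = precision_score(y_true, y_pred, zero_division=0)
    rec = recall_score(y_true, y_pred)
    f1 = f1_score(y_true, y_pred, zero_division=0)
    print(f"{name}: accuracy={acc:.2f} precision={prec:.2f} recall={rec:.2f} f1={f1:.2f}")
```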

Metric Conclusion: Given a representative dataset, the F1-score is the only one of these metrics whose ranking of the scenarios matches our understanding of what a good crop mask is.

Interpreting the F1-score (this scale is not scientifically rigorous; a small helper applying it is sketched after the list):

  • 0.0 <= F1 < 0.6 is probably not a useful map
  • 0.6 <= F1 < 0.7 may be a useful map
  • 0.7 <= F1 < 0.8 is a pretty good map
  • 0.8 <= F1 < 0.9 is a good map
  • 0.9 <= F1 <= 1.0 is a great map
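
A tiny helper applying this scale (the function name and structure are hypothetical, shown only to make the bins explicit):

```python
def interpret_f1(f1: float) -> str:
    """Map an F1-score to the (non-rigorous) qualitative scale above."""
    if f1 < 0.6:
        return "probably not a useful map"
    elif f1 < 0.7:
        return "may be a useful map"
    elif f1 < 0.8:
        return "a pretty good map"
    elif f1 < 0.9:
        return "a good map"
    return "a great map"


print(interpret_f1(0.74))  # -> "a pretty good map"
```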

Evaluation Metric

Given the above, the goal can be defined precisely as: Generate crop masks using a model, provided that the model achieves an F1-score of at least 0.7.

In [1b. Goal Setup] Dataset Splitting, we'll discuss the data needed so that the measured F1-score reflects real-world performance.