
3. Adding labeled data


The following instructions explain how to add labeled data to crop-mask. Adding labeled data involves transforming raw crop/non-crop labels into a machine learning (ML)-ready Earth observation dataset, which can then be used for model training or evaluation.

Prerequisite: Getting Started

Instructions

1. Setup

Inside a local crop-mask directory run the following commands to setup your project for adding data:

conda activate landcover-mapping                # Activates python environment
gcloud auth application-default login           # Logs into Google Cloud (where data is stored with dvc)
git checkout master                             # Switches to the master branch
git pull                                        # Pulls the latest code (including .dvc versioning files)
dvc pull                                        # Pulls the latest data (using .dvc versioning files)
git checkout -b 'Add-data-<YOUR-DATASET-NAME>'  # Creates new branch where your data will be added

2. Add raw crop/non-crop labels to project

Drag and drop your raw crop/non-crop label file (csv/shp/zip/geojson/txt) into the data/raw directory.

Then, run the following commands to add the data to the project:

dvc commit data/raw  # Updates the data/raw versioning file (data/raw.dvc)
dvc push             # Pushes the latest data/raw folder to Google Cloud

3. Write code to standardize labels

Write a LabeledDataset class in datasets.py with a load_labels function that converts raw labels to a standard format.

Example: LabeledDataset for TogoCrop2019
class TogoCrop2019(LabeledDataset):
    def load_labels(self) -> pd.DataFrame:
        # Read in raw label file
        df = pd.read_csv(PROJECT_ROOT / DataPaths.RAW_LABELS / "Togo_2019.csv")

        # Rename coordinate columns to be used for getting Earth observation data
        df.rename(columns={"latitude": LAT, "longitude": LON}, inplace=True)

        # Set start and end date for Earth observation data
        df[START], df[END] = date(2019, 1, 1), date(2020, 12, 31)

        # Set consistent label column
        df[label_col] = df["crop"].astype(float)

        # Split labels into train, validation, and test sets
        df[SUBSET] = train_val_test_split(index=df.index, val=0.2, test=0.2)

        # Set country column for later analysis
        df[COUNTRY] = "Togo"

        return df

datasets: List[LabeledDataset] = [TogoCrop2019(), ...]
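The example above reads a CSV. If your raw labels are a shapefile or GeoJSON (both accepted in step 2), the reading portion of load_labels could use geopandas instead. A minimal sketch, reusing the constants from the example above and a hypothetical file name:

import geopandas as gpd
import pandas as pd

# Illustrative only: read vector labels (shp/geojson) instead of a CSV
gdf = gpd.read_file(PROJECT_ROOT / DataPaths.RAW_LABELS / "my_labels.geojson")  # hypothetical file name

# Use the point coordinates (or polygon centroids) as the label locations
gdf[LON] = gdf.geometry.centroid.x
gdf[LAT] = gdf.geometry.centroid.y
df = pd.DataFrame(gdf.drop(columns="geometry"))
# ...then set START/END, label_col, SUBSET, and COUNTRY exactly as in the CSV example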

Check your new dataset's load_labels function

openmapflow verify TogoCrop2019

If the above command does not pass all checks, amend your load_labels() function based on the failed checks. If it passes all checks, add your new class to the datasets list in datasets.py: datasets = [..., TogoCrop2019()].
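Before or after running the verify command, it can also help to eyeball the standardized DataFrame locally. A minimal sketch, assuming the class and column constants from the example above are importable from datasets.py (the import path is an assumption; adjust it to the project layout):

# Quick local inspection of load_labels() output
from datasets import TogoCrop2019, LAT, LON, START, END, SUBSET, COUNTRY, label_col

df = TogoCrop2019().load_labels()
print(df[[LAT, LON, START, END, label_col, SUBSET, COUNTRY]].head())
print(df[SUBSET].value_counts(normalize=True))  # expect roughly 0.6 / 0.2 / 0.2 for train/val/test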

4. Push new code to Github

git add .
git commit -m 'Created new dataset'
git push

5. Open a Pull Request

Open a Pull Request from your branch (Add-data-<YOUR-DATASET-NAME>) to master. The Pull Request should include:

  • an update to the datasets.py file (made by you) and
  • an update to the data/raw.dvc file (made by running the commands in step 2).

If you followed all the above steps, opening the Pull Request will automatically trigger the Data Pipeline. This will check the validity of the new dataset and kick off Earth Engine export tasks to fetch the Earth observation data for each coordinate. Depending on the number of points in the dataset this can take significant time (e.g., a few hours for hundreds of points, more than a day for thousands of points).

Since the dataset will not be immediately available, the datasets-test and models-test checks will fail.

As the Earth Engine export tasks complete in the background, the Data Pipeline action will report the current status of the data as an update to the report.txt file.

6. Checking progress of the dataset

As the Earth Engine export tasks complete in the background, the Data Pipeline action should be manually re-run every day until report.txt shows a complete dataset. This can be done by clicking on the Data Pipeline run associated with your branch and clicking the "Re-run jobs" button.
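If you have Earth Engine access configured locally and the export tasks run under an account you can see, you can also poll the task list directly from Python instead of re-running the action just to check progress. A minimal sketch, assuming the earthengine-api package is installed and authenticated:

import ee

ee.Initialize()
tasks = ee.data.getTaskList()  # metadata for the export tasks visible to your account
pending = [t for t in tasks if t["state"] in ("READY", "RUNNING")]
print(f"{len(pending)} export tasks still pending or running")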

Once report.txt shows a complete dataset, push an empty commit to rerun all tests:

git commit --allow-empty -m "Trigger Build"
git push

7. Merging the Pull Request

Once all conflicts are resolved and all tests pass, the Pull Request can be merged.

Common Issues

Changing dataset code after data pipeline has already run

Related to: 6. Checking progress of the dataset

If the dataset code is updated (e.g., the dataset split is changed in datasets.py), the data/datasets.dvc file needs to be reset with the following commands:

git pull
git checkout origin/master -- data/datasets.dvc
git add .
git commit -m 'Reset datasets.dvc'
git push

DVC Merge Conflicts

Related to: 7. Merging the Pull Request

If other team members have added data and merged their changes into the project, a merge conflict may arise and the Pull Request will show a warning asking you to resolve conflicts in data/datasets.dvc. Here's the easiest way I have found to resolve them:

# Setup git merge driver (only needs to be done once)
git config merge.dvc.name 'DVC merge driver'
git config merge.dvc.driver 'dvc git-hook merge-driver --ancestor %O --our %A --their %B'

# Get the latest data in the master branch
git checkout master
git pull
dvc pull

# Get the latest data in your branch
git checkout 'Add-data-<YOUR-DATASET-NAME>'
git pull
dvc pull

# Merge master into your branch (dvc should auto resolve the conflicts)
git merge master
dvc push 
git push

Updating raw dataset files

If data/datasets/<dataset-name>.csv already exists, updating its source raw files will not regenerate it. To regenerate the dataset from the new raw files, do the following:

# Delete the dataset locally
rm data/datasets/<dataset-name>.csv

# Commit the updated raw files and deletion of the dataset file to dvc
dvc commit data/raw
dvc commit data/datasets
dvc push

Then commit and push the updated .dvc versioning files to GitHub (as in step 4). This will retrigger the pipeline and generate an updated dataset.