Autolabelling: Setup a data collection pipeline (e.g. what happens when new data comes in?) #64

Open
mrdbourke opened this issue Jan 31, 2023 · 4 comments

Comments

@mrdbourke
Owner

mrdbourke commented Jan 31, 2023

Data collection pipeline should be reactive to data coming into a bucket, for example:

Images get added to bucket -> autolabelling pipeline runs on unlabelled images -> label cleaning happens -> model training pipeline runs once all images are labelled -> evaluation pipeline runs -> deployment happens
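A rough sketch of how the flow above could be driven by the state of the bucket. Everything here is hypothetical and for illustration only: `ImageRecord`, `next_stage` and the stage names are made up, and treating "cleaned" as a per-image flag that must be set before training is an assumption, not something decided in this issue.

```python
from dataclasses import dataclass
from typing import List, Optional


@dataclass
class ImageRecord:
    """Hypothetical record for one image in the bucket."""
    name: str
    label: Optional[str] = None  # None means the image is still unlabelled
    cleaned: bool = False        # True once the label has passed cleaning


def next_stage(records: List[ImageRecord]) -> str:
    """Pick the next pipeline stage from the current bucket contents."""
    if not records:
        return "wait_for_images"
    # Any unlabelled image sends the pipeline back to autolabelling.
    if any(r.label is None for r in records):
        return "autolabel"
    # Assumption: every label must be cleaned before training starts.
    if any(not r.cleaned for r in records):
        return "clean_labels"
    return "train"


# Stages after training run in a fixed order (evaluation, then deployment).
POST_TRAINING_STAGES = ["evaluate", "deploy"]
```

A storage event (e.g. a new object landing in the bucket) could call `next_stage` on the refreshed records and trigger whichever stage it returns.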

See:

@mrdbourke mrdbourke changed the title Setup a data collection pipeline (e.g. what happens when new data comes in?) Autolabelling: Setup a data collection pipeline (e.g. what happens when new data comes in?) Feb 6, 2023
@mrdbourke
Owner Author

Potential autolabelling pipeline:

  • raw images downloaded (e.g. filtered images from a large dataset such as LAION-COCO)
  • several rounds of zero-shot classification are run to further filter images
    • "edible_food" vs "other" (only keep images which contain edible food)
    • "contains_logo" vs "other" (remove images with logos/text)
    • "apple" vs "banana" ... (label images with their appropriate class name)

Could use the pipeline above with multiple variants of CLIP-style models for redundancy.
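A minimal sketch of the filtering rounds plus redundancy across model variants, assuming image and class-text embeddings have already been computed by each model. The helper names (`zero_shot`, `majority_vote`, `autolabel`), the strict-majority rule and the toy 2D vectors are all hypothetical; real CLIP-style embeddings would come from the models themselves.

```python
import math
from collections import Counter


def cosine(a, b):
    """Cosine similarity between two vectors (0.0 if either is all zeros)."""
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return dot / norm if norm else 0.0


def zero_shot(image_emb, class_embs):
    """Return the class name whose text embedding is closest to the image."""
    return max(class_embs, key=lambda name: cosine(image_emb, class_embs[name]))


def majority_vote(votes):
    """Return the label with a strict majority of votes, else None."""
    label, count = Counter(votes).most_common(1)[0]
    return label if count > len(votes) / 2 else None


def autolabel(image_embs, filter_rounds, label_round):
    """Run filter rounds, then the labelling round, voting across models.

    image_embs:    one image embedding per model variant
    filter_rounds: list of (label_to_keep, [class_embs per model variant])
    label_round:   [class_embs per model variant] for the final class names
    Returns the agreed class name, or None if the image is filtered out.
    """
    for keep_label, per_model in filter_rounds:
        votes = [zero_shot(e, c) for e, c in zip(image_embs, per_model)]
        if majority_vote(votes) != keep_label:
            return None
    votes = [zero_shot(e, c) for e, c in zip(image_embs, label_round)]
    return majority_vote(votes)


# Toy example with two "models" sharing the same made-up 2D space.
food = {"edible_food": [1.0, 0.0], "other": [0.0, 1.0]}
logo = {"contains_logo": [0.0, 1.0], "other": [1.0, 0.0]}
fruit = {"apple": [1.0, 0.2], "banana": [0.2, 1.0]}
rounds = [("edible_food", [food, food]), ("other", [logo, logo])]
result = autolabel([[1.0, 0.1], [0.9, 0.2]], rounds, [fruit, fruit])
```

Requiring a strict majority means an image is dropped when the models disagree, which trades recall for cleaner labels; a looser rule (e.g. any single model agreeing) would be an equally valid choice.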

@mrdbourke
Owner Author

mrdbourke commented Feb 6, 2023

See openclip for zero-shot classification: https://github.com/mlfoundations/open_clip

Also see clip-retrieval for just embedding/searching a large existing dataset for images specific to a certain task: https://github.com/rom1504/clip-retrieval

Can download a large number of images from web links using: https://github.com/rom1504/img2dataset

@mrdbourke
Owner Author

Much better to compute image embeddings + class embeddings up front.

Then reuse over time where necessary.

This could be set up via:

  • image gets given UUID
  • image embedding gets calculated
  • if the image UUID has an existing embedding, use that (can force to compute new if necessary)
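The steps above could look something like the sketch below. `EmbeddingStore` and its methods are hypothetical names, the in-memory dicts stand in for whatever storage ends up being used, and `embed_fn` is a placeholder for a real image encoder.

```python
import uuid
from typing import Any, Callable, Dict, List


class EmbeddingStore:
    """Hypothetical cache of image embeddings keyed by image UUID."""

    def __init__(self, embed_fn: Callable[[Any], List[float]]):
        self.embed_fn = embed_fn                       # placeholder encoder
        self.images: Dict[str, Any] = {}               # UUID -> raw image
        self.embeddings: Dict[str, List[float]] = {}   # UUID -> embedding

    def add_image(self, image: Any) -> str:
        """Give the image a UUID and remember it."""
        image_id = str(uuid.uuid4())
        self.images[image_id] = image
        return image_id

    def get_embedding(self, image_id: str, force: bool = False) -> List[float]:
        """Reuse the stored embedding unless it is missing or force=True."""
        if force or image_id not in self.embeddings:
            self.embeddings[image_id] = self.embed_fn(self.images[image_id])
        return self.embeddings[image_id]


# Toy usage: count how often the (fake) encoder actually runs.
calls = []

def fake_embed(image):
    calls.append(image)
    return [float(len(image))]

store = EmbeddingStore(fake_embed)
image_id = store.add_image("pretend-image-bytes")
first = store.get_embedding(image_id)               # computed
second = store.get_embedding(image_id)              # reused from the cache
third = store.get_embedding(image_id, force=True)   # recomputed on request
```

Class (text) embeddings could be cached the same way, keyed by the class name instead of a UUID.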

@mrdbourke
Owner Author

See this resource for autolabelling object detection: https://github.com/facebookresearch/CutLER
