Use the following guide to reproduce my results and run the Jupyter notebooks.
Clone this repo and cd into the project.
$ git clone https://github.com/WillieMaddox/Airbus_SDC_dup.git
$ cd Airbus_SDC_dup
I use virtualenv along with virtualenvwrapper to set up an isolated Python environment:
$ mkvirtualenv --python=python3.6 Airbus_SDC_dup
You should now see your command prompt prefixed with (Airbus_SDC_dup), indicating you are in the virtualenv.
If using conda, you can instead try using the make script to create your environment.
$ make create_environment
I do not use conda, so I haven't had a chance to verify that this works.
From the root of the project, install all requirements.
(Airbus_SDC_dup) $ pip install -r requirements.txt
or
$ make requirements
The dataset for this project is hosted on Kaggle: Airbus Ship Detection Challenge. You'll need to sign in with your Kaggle username. If you don't have an account, it's free to sign up.
You can extract the dataset wherever you like. I extracted it to data/raw/train_768:
├── Makefile           <- Makefile with commands like `make data` or `make train`
├── README.md          <- This README.
├── data
│   ├── raw            <- Data dump from the Airbus_SDC Kaggle competition goes in here.
│   │   ├── train_768  <- The images from train_v2.zip go in here.
│   │   │   ├── 00003e153.jpg
│   │   │   ├── 0001124c7.jpg
│   │   │   ├── ...
│   │   │   └── ffffe97f3.jpg
│   │   └── train_ship_segmentations_v2.csv  <- The run length encoded ship labels.
│   ├── ...
├── ...
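Before running the preprocessing scripts, it can help to confirm that the raw data is where the layout above expects it. The following is a minimal sketch, not part of this repo: the function name `check_raw_layout` and its messages are hypothetical, and it assumes you extracted the data to data/raw as I did.

```python
from pathlib import Path


def check_raw_layout(raw_dir="data/raw"):
    """Return a list of problems found in the raw data layout (empty if OK)."""
    raw = Path(raw_dir)
    problems = []

    # The images from train_v2.zip are expected in data/raw/train_768.
    images = raw / "train_768"
    if not images.is_dir():
        problems.append(f"missing directory: {images}")
    elif not any(images.glob("*.jpg")):
        problems.append(f"no .jpg files in: {images}")

    # The run length encoded ship labels sit next to the image folder.
    labels = raw / "train_ship_segmentations_v2.csv"
    if not labels.is_file():
        problems.append(f"missing file: {labels}")

    return problems


if __name__ == "__main__":
    for problem in check_raw_layout():
        print(problem)
```

An empty list means both the image folder and the label file were found.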
Once the raw dataset has been downloaded and extracted, run the image preprocessing scripts.
First generate the 256 x 256 image tiles. You can do this by running a script,
$ make data
or by running the provided Jupyter notebook.
Note: Make sure you have at least 32 GB free to store the tiles. The data will be saved to data/processed/train_256.
It took me about 5 minutes to process all tiles using threading (over 30 minutes without threading).
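To illustrate what the tiling step produces, here is a rough sketch of how a 768 x 768 image maps onto nine 256 x 256 tiles, and how such work can be spread over threads. This is not the code behind `make data`. The helper names, the row-by-row tile order, and the worker count are assumptions; only the `_0` to `_8` file name suffixes come from this project's output.

```python
from concurrent.futures import ThreadPoolExecutor


def tile_boxes(image_size=768, tile_size=256):
    """Return (left, upper, right, lower) boxes covering the image, row by row."""
    n = image_size // tile_size
    return [
        (col * tile_size, row * tile_size,
         (col + 1) * tile_size, (row + 1) * tile_size)
        for row in range(n)
        for col in range(n)
    ]


def tile_names(image_name):
    """Map '00003e153.jpg' to ['00003e153_0.jpg', ..., '00003e153_8.jpg']."""
    stem, ext = image_name.rsplit(".", 1)
    return [f"{stem}_{i}.{ext}" for i in range(len(tile_boxes()))]


def plan_tiles(image_names, max_workers=8):
    """Build the tile file names for many images using a thread pool."""
    with ThreadPoolExecutor(max_workers=max_workers) as pool:
        # pool.map preserves input order, so zip pairs each image with its tiles.
        return dict(zip(image_names, pool.map(tile_names, image_names)))
```

The boxes use the (left, upper, right, lower) convention, so they could be passed to something like Pillow's `Image.crop` to cut the actual tiles. In a real run the per-image function would read, crop, and write files, which is the part that benefits from threading.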
Next generate the image feature metadata:
$ make features
Note: Make sure you have at least 1 GB free to store the interim data. The data will be saved to data/interim.
With threading, it takes approx 2.5 hrs to run on my dev system; 24 hours without threading. YMMV.
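Since both preprocessing steps need free disk space (32 GB for the tiles, 1 GB for the interim data), you may want to check before starting a long run. A minimal sketch using only the standard library follows; the function name `has_free_space` is hypothetical and not part of this repo.

```python
import shutil


def has_free_space(path, required_gb):
    """Return True if the filesystem holding `path` has enough free space."""
    free_bytes = shutil.disk_usage(path).free
    return free_bytes >= required_gb * 1024 ** 3


if __name__ == "__main__":
    # `data` must already exist; both output folders live beneath it.
    print("room for tiles:", has_free_space("data", 32))
    print("room for interim data:", has_free_space("data", 1))
```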
The newly generated files will be placed into the interim and processed directories. Once complete, your directory structure should look like the following:
├── ...
├── data
│   ├── raw
│   ├── interim
│   │   ├── image_bmh.pkl
│   │   ├── image_cmh.pkl
│   │   ├── image_sol.pkl
│   │   ├── image_md5.pkl
│   │   ├── image_shp.pkl
│   │   ├── matches_bmh_0.9.csv
│   │   ├── overlap_bmh.pkl
│   │   ├── overlap_cmh.pkl
│   │   ├── overlap_enp.pkl
│   │   ├── overlap_pix.pkl
│   │   ├── overlap_px0.pkl
│   │   └── overlap_shp.pkl
│   ├── processed
│   │   └── train_256
│   │       ├── 00003e153_0.jpg
│   │       ├── 00003e153_1.jpg
│   │       ├── 00003e153_2.jpg
│   │       ├── 00003e153_3.jpg
│   │       ├── 00003e153_4.jpg
│   │       ├── 00003e153_5.jpg
│   │       ├── 00003e153_6.jpg
│   │       ├── 00003e153_7.jpg
│   │       ├── 00003e153_8.jpg
│   │       ├── 0001124c7_0.jpg
│   │       ├── ...
│   │       └── ffffe97f3_8.jpg
│   ├── ...
├── ...
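To confirm that `make features` finished, you can compare data/interim against the file list above. This is a small sketch, not part of this repo: the function name `missing_interim_files` is hypothetical, and the expected file names are copied from the listing.

```python
from pathlib import Path

# File names copied from the interim directory listing in this README.
EXPECTED_INTERIM = [
    "image_bmh.pkl",
    "image_cmh.pkl",
    "image_sol.pkl",
    "image_md5.pkl",
    "image_shp.pkl",
    "matches_bmh_0.9.csv",
    "overlap_bmh.pkl",
    "overlap_cmh.pkl",
    "overlap_enp.pkl",
    "overlap_pix.pkl",
    "overlap_px0.pkl",
    "overlap_shp.pkl",
]


def missing_interim_files(interim_dir="data/interim"):
    """Return the expected interim files that are not present on disk."""
    root = Path(interim_dir)
    return [name for name in EXPECTED_INTERIM if not (root / name).is_file()]


if __name__ == "__main__":
    missing = missing_interim_files()
    print("all interim files present" if not missing else f"missing: {missing}")
```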