
Setup and Install

Use the following guide to reproduce my results and run the Jupyter notebooks.

Clone this repo

Clone this repo and cd into the project.

$ git clone https://github.com/WillieMaddox/Airbus_SDC_dup.git
$ cd Airbus_SDC_dup

Set up the environment

I use virtualenv along with virtualenvwrapper to set up an isolated Python environment:

$ mkvirtualenv --python=python3.6 Airbus_SDC_dup

You should now see your command prompt prefixed with (Airbus_SDC_dup), indicating you are in the virtualenv.
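If you come back later in a new shell, you can re-activate the environment with virtualenvwrapper's workon command and leave it with deactivate:

$ workon Airbus_SDC_dup
(Airbus_SDC_dup) $ deactivate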

If you use conda, you can instead try the make script to create your environment.

$ make create_environment

I do not use conda, so I haven't had a chance to verify that this works.
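If that target doesn't work for you, a rough conda equivalent of the virtualenv step above would be the following (untested on my end; the environment name simply mirrors the one used earlier):

$ conda create --name Airbus_SDC_dup python=3.6
$ conda activate Airbus_SDC_dup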

Install requirements

From the root of the project, install all requirements.

(Airbus_SDC_dup) $ pip install -r requirements.txt

or

$ make requirements

Download the data

The dataset for this project is hosted on Kaggle as the Airbus Ship Detection Challenge. You'll need to sign in with your Kaggle account. If you don't have an account, it's free to sign up.

You can extract the dataset to wherever you like. I extracted it to data/raw/train_768.
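If you'd rather fetch the data from the command line and have the Kaggle API token configured, something like the following should work (the competition slug and file names are assumptions based on the competition's data page, and Kaggle may deliver individual files zipped):

$ kaggle competitions download -c airbus-ship-detection -f train_v2.zip -p data/raw
$ kaggle competitions download -c airbus-ship-detection -f train_ship_segmentations_v2.csv -p data/raw
$ unzip data/raw/train_v2.zip -d data/raw/train_768

The relevant part of the project tree then looks like this: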

├── Makefile           <- Makefile with commands like `make data` or `make train`
├── README.md          <- This README.
├── data
│   ├── raw            <- Data dump from the Airbus_SDC Kaggle competition goes in here.
│       ├── train_768  <- The images from train_v2.zip go in here.
│           ├── 00003e153.jpg
│           ├── 0001124c7.jpg
│           ├── ...
│           └── ffffe97f3.jpg
│       └── train_ship_segmentations_v2.csv <- The run length encoded ship labels.
│   ├── ...
├── ...

Preprocess tiles and interim data

Once the raw dataset has been downloaded and extracted, run the image preprocessing scripts.

First, generate the 256 x 256 image tiles. You can do this by running a script,

$ make data

or by running the provided Jupyter notebook. Note: Make sure you have at least 32 GB of free disk space to store the tiles. The data will be saved to data/processed/train_256. It took me about 5 minutes to process all tiles using threading (over 30 minutes without threading).
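For reference, each 768 x 768 image is split into nine 256 x 256 tiles, which is where the _0 through _8 suffixes in the tile file names come from. A minimal sketch of that scheme is below (illustrative only; the project's actual tiling code lives in the script/notebook above, the tile_image function is hypothetical, and the row-major suffix ordering is an assumption):

from pathlib import Path
from PIL import Image

def tile_image(src, dst_dir, tile=256):
    # Crop a 768 x 768 image into a 3 x 3 grid of 256 x 256 tiles,
    # saved as <image_id>_0.jpg through <image_id>_8.jpg.
    img = Image.open(src)
    stem = Path(src).stem
    n = img.width // tile  # 3 tiles per side for a 768-pixel image
    for idx in range(n * n):
        row, col = divmod(idx, n)
        box = (col * tile, row * tile, (col + 1) * tile, (row + 1) * tile)
        img.crop(box).save(Path(dst_dir) / f"{stem}_{idx}.jpg")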

Next generate the image feature metadata:

$ make features

Note: Make sure you have at least 1 GB free to store the interim data. The data will be saved to data/interim. With threading, it takes approximately 2.5 hours to run on my dev system; 24 hours without threading. YMMV.

The newly generated files will be placed into the interim and processed directories. Once complete, your directory structure should look like the following:

├── ...
├── data
│   ├── raw
│   ├── interim
│       ├── image_bmh.pkl
│       ├── image_cmh.pkl
│       ├── image_sol.pkl
│       ├── image_md5.pkl
│       ├── image_shp.pkl
│       ├── matches_bmh_0.9.csv
│       ├── overlap_bmh.pkl
│       ├── overlap_cmh.pkl
│       ├── overlap_enp.pkl
│       ├── overlap_pix.pkl
│       ├── overlap_px0.pkl
│       └── overlap_shp.pkl
│   ├── processed
│       └── train_256
│           ├── 00003e153_0.jpg
│           ├── 00003e153_1.jpg
│           ├── 00003e153_2.jpg
│           ├── 00003e153_3.jpg
│           ├── 00003e153_4.jpg
│           ├── 00003e153_5.jpg
│           ├── 00003e153_6.jpg
│           ├── 00003e153_7.jpg
│           ├── 00003e153_8.jpg
│           ├── 0001124c7_0.jpg
│           ├── ...
│           └── ffffe97f3_8.jpg
│   ├── ...
├── ...
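As a quick sanity check that the feature step completed, you can open one of the interim files in Python (assuming they are plain pickle files; their internal layout isn't documented here):

import pickle

# Load one of the generated feature files and report its top-level type.
with open('data/interim/image_md5.pkl', 'rb') as f:
    image_md5 = pickle.load(f)
print(type(image_md5))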