shelter_animal_predictions

Introduction and Data Setup

Co-authors include Alla Hale and John Pette. For a Kaggle dataset research project, we chose to focus on the Shelter Animal Outcomes competition: https://www.kaggle.com/c/shelter-animal-outcomes#description. We used this dataset and project to motivate the research question: “What is the outcome for a shelter animal, based on its breed, color, sex, and age?” If we saw patterns, we could then provide recommendations to shelters about animals that seemed to have a higher likelihood of some outcomes over others. The animal features within this dataset are Name, Date of Outcome, Breed, Color, Sex, Spayed/Neutered vs. Intact, and Age. We use these features to predict which outcome is most likely for each animal: Return to Owner, Death, Euthanasia, Transfer, or Adoption.

To transform our dataset and make it ready for model evaluation, we performed exploratory data analysis and created dummy variables for our categorical features (breed, color, sex), so that every feature except age is binary. Because age is continuous, we normalized it to a common unit of weeks. Our features had to be consistent between the training and test datasets for prediction purposes, so we applied the same transformations and binarization to the test dataset.
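A minimal sketch of this preparation step, assuming pandas and column names similar to the Kaggle CSVs ("AgeuponOutcome", "Breed", "Color", "SexuponOutcome"); the exact columns and unit conversions in our notebooks may differ:

```python
import pandas as pd

def age_to_weeks(age_str):
    """Convert strings like '2 years' or '3 weeks' into an age in weeks."""
    if pd.isna(age_str):
        return None  # missing ages would still need to be imputed before modeling
    value, unit = age_str.split()
    factor = {"year": 52.0, "month": 4.345, "week": 1.0, "day": 1.0 / 7.0}
    return float(value) * factor[unit.rstrip("s")]

categorical = ["Breed", "Color", "SexuponOutcome"]

train = pd.read_csv("train.csv")
X_train = pd.get_dummies(train[categorical])              # binarize the categorical features
X_train["AgeInWeeks"] = train["AgeuponOutcome"].apply(age_to_weeks)

# The test set must share exactly the same columns; reindex() drops dummy
# columns unseen in training and fills missing ones with zeros.
test = pd.read_csv("test.csv")
X_test = pd.get_dummies(test[categorical]).reindex(
    columns=X_train.columns.drop("AgeInWeeks"), fill_value=0
)
X_test["AgeInWeeks"] = test["AgeuponOutcome"].apply(age_to_weeks)
```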

Model Creation

We evaluated all of our models with the weighted f1-score to understand prediction performance. The weighted f1-score averages the per-class f1-scores, weighting each class by its support (its number of true instances) rather than weighting every class equally. The f1-score itself combines precision (the ratio of true positives to true plus false positives) and recall (the ratio of true positives to true positives plus false negatives), so it accounts for both false positives and false negatives. This scoring approach works well when classes are imbalanced, which was the case with our dataset.
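As a small illustration (with toy labels, not our data), scikit-learn computes this metric directly:

```python
from sklearn.metrics import f1_score

y_true = ["Adoption", "Adoption", "Transfer", "Euthanasia", "Adoption", "Transfer"]
y_pred = ["Adoption", "Transfer",  "Transfer", "Adoption",   "Adoption", "Transfer"]

# average="weighted" computes the f1-score per class, then averages those
# scores weighted by each class's support (number of true instances).
print(f1_score(y_true, y_pred, average="weighted"))
```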

To decide which algorithm to use, we tested several approaches. Our first step was to create a function, using the StratifiedKFold class within scikit-learn, to split our training dataset into training and development folds. This function randomly splits the data a specified number of times while preserving the class proportions in each fold (helpful for imbalanced classes). We optimized the weighted f1-score using this function.
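A sketch of that helper, assuming scikit-learn and that X and y are NumPy arrays holding the binarized features and outcome labels; the fold count here is illustrative:

```python
import numpy as np
from sklearn.model_selection import StratifiedKFold
from sklearn.metrics import f1_score

def cv_weighted_f1(model, X, y, n_splits=5, seed=0):
    """Return the mean weighted f1-score across stratified train/dev splits."""
    skf = StratifiedKFold(n_splits=n_splits, shuffle=True, random_state=seed)
    scores = []
    for train_idx, dev_idx in skf.split(X, y):
        model.fit(X[train_idx], y[train_idx])
        preds = model.predict(X[dev_idx])
        scores.append(f1_score(y[dev_idx], preds, average="weighted"))
    return np.mean(scores)
```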

We started by creating a baseline model that predicted all outcomes as the majority class. This model was not a good one, but it was the baseline against which we measured our performance and accuracy improvements. To test our algorithms, we evaluated five separate predictions to check whether the weighted f1-score changed each time with logistic regression, decision trees, multinomial naive Bayes, and random forests. Since we were working with a sparse dataset, we also used XGBoost, which builds on the decision tree approach with gradient-boosted decision trees: gradient boosting constructs new trees that predict the residuals of prior trees and adds their outputs together to make the final prediction. We also used the RandomOverSampler class (from imbalanced-learn) to oversample the classes with few outcomes, and principal component analysis to reduce dimensionality and help our models run faster.
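One way to wire these pieces together, assuming scikit-learn plus the imbalanced-learn package and reusing the cv_weighted_f1 helper sketched above; the step parameters (such as the number of PCA components) are placeholders rather than our tuned values:

```python
from sklearn.decomposition import PCA
from sklearn.ensemble import RandomForestClassifier
from imblearn.over_sampling import RandomOverSampler
from imblearn.pipeline import Pipeline

pipeline = Pipeline([
    # Oversample the rare outcome classes (e.g. Death) during fitting only.
    ("oversample", RandomOverSampler(random_state=0)),
    # Project the sparse one-hot features down to fewer components for speed.
    ("pca", PCA(n_components=50)),
    ("model", RandomForestClassifier(n_estimators=100, random_state=0)),
])

print(cv_weighted_f1(pipeline, X, y))
```

With imbalanced-learn's Pipeline, the oversampling is applied only when fitting, so the development folds stay untouched.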

Optimizing our Best Algorithm

Our random forest classifier had the best weighted f1-score, so we optimized this algorithm further. We performed a grid search on our random forest classifier to identify its best parameters, fine-tuning max_depth, min_samples_split, and criterion to improve the model's ability to generalize.
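A sketch of that search, assuming scikit-learn's GridSearchCV and reusing X and y from the earlier sketches; the parameter ranges shown are illustrative, not the exact grid we ran:

```python
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import GridSearchCV

param_grid = {
    "max_depth": [10, 20, 40, None],
    "min_samples_split": [2, 4, 8],
    "criterion": ["gini", "entropy"],
}

search = GridSearchCV(
    RandomForestClassifier(n_estimators=100, random_state=0),
    param_grid,
    scoring="f1_weighted",   # optimize the same metric we use elsewhere
    cv=5,
)
search.fit(X, y)
print(search.best_params_, search.best_score_)
```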

We found that the best parameters were a max_depth of 40, a min_samples_split of 4, and the gini criterion. The max_depth is the maximum depth of each decision tree, i.e. the maximum number of levels in each tree. The min_samples_split is the minimum number of observations a node must contain before it can be split. We had to balance min_samples_split against max_depth, since both control how complex each tree is allowed to grow. The criterion measures the quality of each node split within a tree; gini was chosen over entropy. Gini impurity measures how often a randomly chosen observation in a node would be mislabeled if it were labeled randomly according to the node's class distribution.
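For reference, the Gini impurity of a node whose classes occur with proportions p_k is:

```latex
G = \sum_{k} p_k (1 - p_k) = 1 - \sum_{k} p_k^{2}
```

A pure node (all observations in one class) has G = 0; splits are chosen to reduce this impurity.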

Error Analysis

Our goal was to examine confusion matrices to understand which features were present when observations in our training folds were mislabeled as other classes. We evaluated the most common types of errors, although they were difficult to interpret due to the sparsity of our dataset. Very common breeds, such as domestic shorthairs, tended to be mispredicted as “transfer” or “died” when the actual label was frequently euthanasia; this was especially common for cats. Because the domestic shorthair breed appears for both cats and dogs, our model was getting confused about the prediction class. We briefly looked at the effect of adding interaction features to our models (e.g. domestic shorthair × animal type), but saw no significant improvement.
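A sketch of how such a confusion matrix can be built with scikit-learn, where y_dev and preds would come from one of the stratified development folds above and the label names are illustrative:

```python
import pandas as pd
from sklearn.metrics import confusion_matrix

labels = ["Adoption", "Died", "Euthanasia", "Return_to_owner", "Transfer"]
cm = pd.DataFrame(
    confusion_matrix(y_dev, preds, labels=labels),
    index=[f"true: {l}" for l in labels],
    columns=[f"pred: {l}" for l in labels],
)
print(cm)  # off-diagonal cells show which outcomes are being confused
```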

Prediction on the Test Data

We decided to test our performance by evaluating the log loss of our classifiers. Log loss measures how well a classifier's predicted probabilities match the true labels, penalizing confident wrong predictions; it is also the measure used in the Kaggle competition. Our goal was to minimize the log loss, or at least decrease it substantially relative to our dummy classifier.
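A sketch of that evaluation with scikit-learn, assuming model is any of the classifiers above already fitted on the training folds and X_dev, y_dev are a held-out split; log loss needs predicted class probabilities rather than hard labels:

```python
from sklearn.metrics import log_loss

probs = model.predict_proba(X_dev)                    # shape: (n_samples, n_classes)
print(log_loss(y_dev, probs, labels=model.classes_))  # lower is better
```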

The log loss of our dummy classifier, which predicted that every shelter animal falls in the majority class of adoption, was 20.62, showing a high degree of inaccuracy in its predictions. Our random forest classifier with the best parameters had a log loss of 1.62, a large improvement over the dummy classifier. We also evaluated the log loss of each of our main classifiers: logistic regression and XGBoost produced slightly lower log loss than the random forest classifier. However, we had to weigh the combination of log loss and weighted f1-score when thinking about the predictive power of an algorithm.
